...

  • Kacper explained more of the Whole Tale design:
    • There will be a website where the user can enter a DOI; the DOI will resolve to a remote repository (e.g., DataONE). Ingest will happen only at that point (on demand)
    • When the data can't be moved, compute near the data will be used if available
    • Need to support composing multiple datasets – e.g., DarkSky + some smaller dataset. In this case, the smaller dataset will be moved to the site with the large dataset.
  • Might look into Agave project for capabilities API (long-term)
  • Specific comments about SC16 demo:
    • folderId requirement in volman can be removed – just hardcode the mountpoint. So can the userId requirement.
    • tmpnb can be used to simply create a temporary notebook – not tied to a user/folder
    • The folderId is useful when the user wants access to a subset/subdirectory
    • tmpnb is nothing new, so on its own this isn't much of a demo.
    • DarkSky data is NFS-mountable read-only
    • Regarding the Norman dataset, Girder does support Swift for ingest, but this needs to be tested.
    • Girder supports OAuth, if useful
  • There is now a presentation on November 16th
  • Next steps – for the demo, we will use the "Federated" model above, but long-term there's still much to discuss
    • Data transfer (MHD)
    • Write the registry API and UI for a proof-of-concept (see the resolver sketch after this list)
    • Swift problem: ingest into Girder directly, or find out how best to mount Swift into the container
    • Create VM near data at SDSC with Girder stack
    • Example notebooks for 3 datasets
  • Discussion of big-data publishing stack (spitballing)
    • Girder+tmpnb is now an option we can recommend to SCs to make these big datasets available. Install these services, and you can make an otherwise inaccessible dataset accessible, with minimal analysis support.
    • This isn't the only stack – it's one of many options, but it works for the large physics simulation data.
    • If they install this stack, they could (potentially, with much more thought) be Whole-Tale compatible.
  • Discussion of "Data DNS" service (spitballing)
    • This came up during an earlier whiteboard session. The resolver can be a sort of data DNS service – given an identifier, resolve to one or more locations.
    • This would be different from the RDA PID concept: not an authoritative registry, just a way of saying "I have a copy of this data available here already" for larger datasets
    • Sites could possibly publish capabilities – I have this data and can launch a Docker container (e.g., Jupyter); I have this data in Hadoop/HDFS; I can support MPI jobs, etc.
    • For now, the identifier is anything that uniquely identifies the dataset (PID, DOI, Handle, URN, URL)
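
A minimal sketch to make the resolver idea above (and the "registry API and UI" item under next steps) more concrete: given an identifier, return the locations that claim to hold a copy, along with the capabilities each site advertises. This is illustrative only – Flask, the route name, the record fields, and the example entry are assumptions, not an agreed-upon design.

```python
"""Toy 'data DNS' resolver sketch (illustrative assumptions throughout)."""
from flask import Flask, jsonify

app = Flask(__name__)

# In-memory registry: identifier -> known copies, each advertising what the
# hosting site can do ("I have this data and can launch Jupyter in Docker").
# A real service would let sites register and withdraw their own entries.
REGISTRY = {
    "doi:10.xxxx/example-large-sim": [
        {"site": "sdsc", "access": "nfs-readonly",
         "capabilities": ["docker", "jupyter"]},
    ],
}

@app.route("/resolve/<path:identifier>")
def resolve(identifier):
    """Given a PID/DOI/Handle/URN/URL, return any known locations (possibly none)."""
    return jsonify({"identifier": identifier,
                    "locations": REGISTRY.get(identifier, [])})

def plan(locations):
    """Toy version of the decision a frontend might make with the resolver's
    answer: if a site already holds the data and can run a notebook, compute
    near the data; otherwise fall back to on-demand ingest from the
    authoritative repository (e.g., DataONE)."""
    for loc in locations:
        if "jupyter" in loc.get("capabilities", []):
            return {"action": "launch-near-data", "site": loc["site"]}
    return {"action": "ingest-on-demand"}

if __name__ == "__main__":
    app.run(port=8080)
```

The plan() helper is only there to show where the "compute near the data vs. ingest on demand" choice described earlier could hook into the resolver's answer; it is not part of any existing component.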

Storyboard for Demo Presentation

...