All, can you help me prepare for questions?  (Can ignore until Monday of course, and please respond in the wiki if that's easier.)
 
Q: What turn-key packages need to be installed?
  • Craig: What's in the demo/prototype is the ytHub stack: Girder + ytHub plugin; tmpnb proxy; yt volman and notebook.  In the long term, I'd hope we wouldn't require one particular stack but would instead support multiple different stacks (for example, the Clowder + ToolManager stack used by TERRA-REF), though we don't yet know what those will be.


Q: What other analysis tools could sit beside the datasets (aside from Jupyter)?
  • Craig: Anything from the Jupyter suite, which now includes JupyterLab and Datascience/R notebooks; RStudio; cloud-based IDEs such as those in Workbench (Cloud9, PyCharm, etc.). Pretty much anything that can be run in a Docker container.  For the longer term, we've also discussed the concept of "head" containers that have all of the tooling needed to launch jobs in specialized cluster/compute environments (Spark/Hadoop, PBS); of course, this would require that the hosting site have those services available.
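As a sketch of why "anything that runs in a Docker container" is the key property: a tmpnb-style launcher only needs an image name and a data mount, so swapping Jupyter for RStudio is a one-line change. The image names, paths, and port below are invented examples, not part of the actual ytHub configuration:

```python
def launch_command(image, dataset_path, port):
    """Build the docker invocation a tmpnb-style proxy might issue.

    The tool is fully determined by the image; the dataset is mounted
    read-only so every tool sees the same data at /data.
    """
    return [
        "docker", "run", "--rm", "-d",
        "-p", f"{port}:8888",                # expose the tool's web UI
        "-v", f"{dataset_path}:/data:ro",    # dataset mounted read-only
        image,
    ]

# Same launcher, different tools -- only the image name changes.
jupyter = launch_command("jupyter/datascience-notebook", "/srv/datasets/enzo", 8901)
rstudio = launch_command("rocker/rstudio", "/srv/datasets/enzo", 8902)
print(" ".join(jupyter))
```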

Q: What do you think the process will be like for registering a dataset?
  • Craig: While manual registration will be possible, we also offer an API. I can also imagine something tied into Girder (or similar) that allows the user or data manager to register.
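For illustration, a registration call might carry a payload like the one below. The endpoint, field names, and values are assumptions for the sake of the sketch, not an actual DataDNS API:

```python
import json

# Hypothetical registration payload -- all fields are illustrative.
payload = {
    "name": "enzo_tiny_cosmology",
    "size_bytes": 2 * 1024**4,          # ~2 TB; large data is the focus
    "access_url": "https://data.example.org/enzo_tiny_cosmology",
    "formats": ["HDF5"],
    "services": ["jupyter"],            # analysis tools hosted beside the data
}

# A client (or a Girder hook) would POST this to the service, e.g.:
#   requests.post("https://datadns.example.org/api/v1/datasets",
#                 json=payload, headers={"Authorization": "Bearer <token>"})
print(json.dumps(payload, indent=2))
```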

Q: Is DataDNS for large datasets only?  Would there be a point to using it with small datasets too?
  • Craig:  
    • Large datasets only – No, but at the moment that seems to be what differentiates it from other services. 
    • Small datasets – Maybe, but I think there's more of a need to help researchers producing these types of large datasets publish, preserve and provide access to enable reuse.
 
Q: Isn’t this already in place?
A: Other products have many of the pieces (with the rest on their roadmaps), but none have all of them. (??)
  • Craig: 
    • Yes, there are products that already allow you to import/register/publish datasets with associated analysis tools, and sometimes run more complex analysis:
      • ytHub's Girder+Jupyter tmpnb
      • Clowder + Tool Server (Jupyter, RStudio) + Extractor architecture
      • Dataverse (repository) + TwoRavens (Analysis) + RServe (Analysis)
    • As far as I know there are no centralized discovery services that allow you register/find these types of services, particularly for very large datasets.
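To make the DNS analogy concrete, here is a toy sketch of such a discovery service: a lookup from dataset names to hosting sites and the tools that sit beside them. The records are invented examples, not real registry entries:

```python
# Toy "DataDNS" registry: name -> location, analogous to DNS for hosts.
# Hosts and dataset names below are illustrative only.
registry = {
    "enzo_tiny_cosmology": {
        "host": "hub.example.org",
        "services": ["jupyter"],
    },
    "terra-ref-season-4": {
        "host": "clowder.example.edu",
        "services": ["jupyter", "rstudio"],
    },
}

def resolve(dataset_name):
    """Return where a dataset lives and what tools it offers, or None."""
    return registry.get(dataset_name)

print(resolve("enzo_tiny_cosmology"))
```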
     
 
Anything you’d ask if you were in the audience?
  • Is it a repository? Does it support long-term preservation?
  • This is a nice prototype, but what is the roadmap? What's the long-term vision for this service?
  • What about real compute/reprocessing?