Goals

NDS Share(d) datasets: present online the below datasets so that users can find and obtain them. Highlight DOI's of both paper and dataset.
Provide analysis tools along with each dataset (ideally deployable data-side to avoid having to copy the data).

Presentation Logistics

Booth demos: NCSA, SDSC, ...
Draft list of specific individuals to invite to booths
Create a flyer (address how we approached the problem? discuss tech options)
SCinet late breaking news talk

Technology

Globus Publish (https://github.com/globus/globus-publish-dspace)
- Modified DSpace with Globus Authentication, Groups, and Transfer
- Example instance: Materials Data Facility
- Requires a Globus endpoint on each resource (not available for Swift stores, e.g. SDSC Cloud)
- Can skin a collection (currently doesn't support skinning instance)
- Analysis currently not supported
yt Hub
- Built on Girder with additional plugins to support Jupyter, ownCloud, ...
- Example skinned instance: Galaxy Cluster Merger Catalog
- Would need allocation on BlueWaters to run tools there. May need to move Moesta dataset elsewhere...
Resolution Service
- Given DOI → get URL to data, get loction, get machine, get local path on machine
- Given notebook, location, path → run notebook on data at resource
- Allows independence from repo technologies
- Allow repos to provide location information as metatdata, if not available attempt to resolve (e.g. from URL, index)
- Repos that don't have local computation options would need to move data
- Only requirement from repos is that data can be accessed via a URL
- Identify at least one notebook for demonstration
- Build as service with python library interface that can be shown in Jupyter
- Create an alternative bookmarklet client that can be shown on any of repo
  - click on link get list of resources to run a selected notebook on
- Discussed as a need within TERRA effort
- Leverage work with Girder in Whole Tale:
  - https://girder.wholetale.org/#collection/57fc1a1986ed1d000173b463
  - API

Datasets

1. MHD Turbulence in Core-Collapse Supernovae
Authors: Philipp Moesta (pmoesta@berkeley.edu), Christian Ott (cott@tapir.caltech.edu)
Paper URL: http://www.nature.com/nature/journal/v528/n7582/full/nature15755.html
Paper DOI: dx.doi.org/10.1038/nature15755
Data URL: https://go-bluewaters.ncsa.illinois.edu/globus-app/transfer?origin_id=8fc2bb2a-9712-11e5-9991-22000b96db58&origin_path=%2F
Data DOI: ??
Size: 90 TB
Code & Tools: Einstein Toolkit, see this page for list of available vis tools for this format

The dataset is a series of snapshots in time from 4 ultra-high resolution 3D magnetohydrodynamic simulations of rapidly rotating stellar core-collapse. The 3D domain for all simulations is in quadrant symmetry with dimensions 0 < x,y < 66.5km, -66.5km < z < 66.5km. It covers the newly born neutron star and it's shear layer with a uniform resolution. The simulations were performed at 4 different resolutions [500m,200m,100m,50m]. There are a total of 350 snapshots over the simulated time of 10ms with 10 variables capturing the state of the magnetofluid. For the highest resolution simulation, a single 3D output variable for a single time is ~26GB in size. The entire dataset is ~90TB in size. The highest resolution simulation used 60 million CPU hours on BlueWaters. The dataset may be used to analyze the turbulent state of the fluid and perform analysis going beyond the published results in Nature doi:10.1038/nature15755.

2. Probing the Ultraviolet Luminosity Function of the Earliest Galaxies with the Renaissance Simulations
Authors: Brian O'Shea (oshea@msu.edu), John Wise, Hao Xu, Michael Norman
Paper URL: http://iopscience.iop.org/article/10.1088/2041-8205/807/1/L12/meta;jsessionid=40CF566DDA56AD74A99FE108F573F445.c1.iopscience.cld.iop.org
Paper DOI: dx.doi.org/10.1088/2041-8205/807/1/L12
Data URL:

Data DOI: ??
Size: 89 TB
Code & Tools: Enzo

In this paper, we present the first results from the Renaissance Simulations, a suite of extremely high-resolution and physics-rich AMR calculations of high-redshift galaxy formation performed on the Blue Waters supercomputer. These simulations contain hundreds of well-resolved galaxies at z ~ 25–8, and make several novel, testable predictions. Most critically, we show that the ultraviolet luminosity function of our simulated galaxies is consistent with observations of high-z galaxy populations at the bright end of the luminosity function (M1600 ⩽ -17), but at lower luminosities is essentially flat rather than rising steeply, as has been inferred by Schechter function fits to high-z observations, and has a clearly defined lower limit in UV luminosity. This behavior of the luminosity function is due to two factors: (i) the strong dependence of the star formation rate (SFR) on halo virial mass in our simulated galaxy population, with lower-mass halos having systematically lower SFRs and thus lower UV luminosities; and (ii) the fact that halos with virial masses below ~2 x 10^8 M do not universally contain stars, with the fraction of halos containing stars dropping to zero at ~7 x 10^6 M . Finally, we show that the brightest of our simulated galaxies may be visible to current and future ultra-deep space-based surveys, particularly if lensed regions are chosen for observation.

3. Dark Sky Simulation
Authors: Michael Warren, Alexandar Friedland, Daniel Holz, Samuel Skillman, Paul Sutter, Matthew Turk (mjturk@illinois.edu), Risa Wechsler
Paper URL: https://zenodo.org/record/10777#.V_VvKtwcK1M, https://arxiv.org/abs/1407.2600
Paper DOI: http://dx.doi.org/10.5281/zenodo.10777
Data URL:

- https://girder.hub.yt/api/v1/collection/578501e0c2a5f40001cec1d6/download (https://girder.hub.yt/#collection/578501e0c2a5f40001cec1d6)
- http://darksky.slac.stanford.edu/about.html

Data DOI: ??
Size: 31 TB
Code & Tools: https://bitbucket.org/darkskysims/darksky_tour/

The cosmological N-body simulation designed to provide a quantitative and accessible model of the evolution of the large-scale Universe.

4. ...

Design Notes

Planning discussion 1 (NDSC6)

Photo of whiteboard from NDSC6

On the left is the repository landing page for a dataset (Globus, SEAD, Dataverse) with a button/link to the "Job Submission" UI
Job Submission UI is basically the Tool manager or Jupyter tmpnb
At the top (faintly) is a registry that resolves a dataset URL to it's location with mountable path
- (There was some confusion whether this was the dataset URL or dataset DOI or other PID, but now it sounds like URL – see example below)
On the right are the datasets at their locations (SDSC, NCSA)
The user can launch a container (e.g., Jupyter) that mounts the datasets readonly and runs on a docker-enabled host at each site.
Todo list on the right:
- Data access at SDSC (we need a docker-enabled host that can mount the Norman dataset)
- Auth – how are we auth'ing users?
- Container orchestration – how are we launching/managing containers at each site
- Analysis?
- BW → Condo : Copy the MHD dataset from Blue Waters to storage condo at NCSA
- Dataset metadata (Kenton)
- Resolution (registry) (Kyle)

Planning Discussion 2

Notes from discussion (Craig W, David R, Mike L) based on above whiteboard diagram:

sc16-box-diagram

What we have:

"Tool Manager" demonstrated at NDSC6
- Angular UI over a very simple Python/Flask REST API.
- The REST API allows you to get a list of supported tools, post/put/delete instances of running tools.
- Fronted with a basic NGINX proxy that routes traffic to the running container based on ID (e.g., http://tooserver/containerId/)
- Data is retrieved via HTTP get using repository-specific APIs. Only Clowder and Dataverse are supported
- Docker containers are managed via system calls (docker executable)
WholeTale/ytHub/tmpnb:
- The yt.hub team has extended Jupyter tmpnb to support volume mounts. They've created fuse mounts for Girder.
PRAGMA PID service
- Demonstrated at NDSC6, appears to allow attaching arbitrary metadata to registered PID.
Analysis
- For the Dark Sky dataset, we can use the notebook demonstrated by the yt team.

What we need to do:

Copy MHD dataset to storage condo
Docker-enabled hosts with access to each dataset (e.g., NFS) at SDSC, possibly in the yt DXL project, and in the NDS Labs project for MHD
Decide whether to use/extend existing Tool Manager, yt/tmpnb or Jupyter tmpnp (or something else)
Define strategy for managing containers at each site
- Simple: "ssh docker run -v" or use the Docker API
- Harder: Use Kubernetes or Docker Swarm for container orchestration. For example, launch a jupyter container on a node with label "sdsc"
Implement the resolution/registry
- Ability to register a data URL with some associated metadata.
- Metadata would include site (SDSC, NCSA) and volume mount information for the dataset.
- The PRAGMA PID service looks possible at first glance, but may be too complex for what we're trying to do. It requires handle.net integration.
Implement bookmarklet: There was discussion of providing some bookmarklet javascript to link a data DOI/PID to the "tool manager" service
Authentication:
- TBD – how do we control who gets access, or is it open to the public?
- In the case of Clowder/Dataverse, all API requests include an API key
Analysis:
- Need to get notebooks/code to demonstrate how to work with the MHD and Norman data.

Example case for resolution (not a real dataset for SC16)

A dataset has a Globus Publish landing page https://publish.globus.org/jspui/handle/ITEM/113
This dataset has the URL
- https://www.globus.org/app/transfer?origin_id=82f1b5c6-6e9b-11e5-ba47-22000b92c6ec&origin_path=/unpublished/publication_113/
This would map to Nebula:
- /scratch/mdf/publication_113

Component Options

We will need to select one from each of the following categories.

All combinations are possible, although some combinations will likely be easier to accomplish than others.

"Repository" - User Frontend
1. ~~user installs bookmarklet~~ this may be restricted in modern browsers... more research is necessary
  - Pros
    - browser-agnostic
  - Cons
    - probably lots of learning involved here
    - user must seek out and install this
    - injecting arbitrary JavaScript into pages does not feel very secure, and has since been replaced by modern browser extensions
      - see https://hypothes.is/blog/farewell-to-bookmarklets/
      - see https://medium.com/making-instapaper/bookmarklets-are-dead-d470d4bbb626#.co504ji62
2. user installs browser extension
  - Pros
    - more secure than bookmarklets... I guess?
  - Cons
    - probably lots of learning involved here
    - user must seek out and install this
    - browser-specific (we would need to develop and maintain one for each browser)
3. developer(s) add a link to repo UI which leads to the existing ToolManager UI landing page, as in the NDSC6 demo
  - Pros
    - user does not need to install anything special on their local machine to launch tools
  - Cons
    - repo UI developers who want to integrate with us need to add one line to their source to integrate with us
      - Dataverse, Clowder, Globus Publish, etc
"Resolver" - API endpoint to resolve DOIs to tmpnb proxy URLs
1. Serve a JSON file from disk? (this is more or less how the existing ToolManager works)
  - Pros
    - Easy to set up and modify as we need to
  - Cons
    - Likely not a long-term solution, but simple enough to accomplish in the short-term
2. Girder?
  - Pros
    - Well-documented, extensible API, with existing notions of file, resource, and user management
  - Cons
    - likely overkill for this system, as we don't need any of the file management capabilities for resolving
3. etcd?
  - Pros
    - familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's etcd.go
  - Cons
    - it might be quite a bit of work to build up a new API around etcd
4. PRAGMA PID service?
  - Pros
    - sounds remarkably similar to what we're trying to accomplish here
    - supports a wide variety of different handle types (and repos?)
  - Cons
    - may be too complex to accomplish in the short term
    - unfamiliar code base / languages
"Agent" - launches containers alongside the data on a Docker-enabled host
1. existing ToolManager?
  - Pros
    - already parameterized to launch multiple tools (jupyter and rstudio)
  - Cons
    - no notion of "user" or authentication
2. Girder/tmpnb?
  - Pros
    - notebooks automatically time out after a given period
  - Cons
    - can only launch single image type, currently (only jupyter)
3. Kubernetes / Docker Swarm?
  - Pros
    - familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's kube.go
    - orchestration keeps containers alive if possible when anything goes wrong
  - Cons
    - may be too complex to accomplish in the short term
4. docker -H?
  - Pros
    - zero setup necessary, just need Docker installed and the port open
  - Cons
    - HIGHLY insecure
"Data" - large datasets need to be mountable on a Docker-enabled host
1. NFS?
2. GFS?
3. other options?

Federation options

Centralized
1. New sites register with central API server as they come online (i.e. POST to /metadata)
  1. POSTed metadata should include all urls, DOIs, and other necessary info
2. Central API server (Resolver) receives all requests, resolves DOIs to sites that have registered, and delegates jobs to the Agent
Decentralized
1. New sites register with each other (is this a broadcast? handshake? how to handle synchronization?)
2. Any API server receives request and can resolve and delegate to the appropriate Agent

Synchronization options

Sites push their status to the API
- Assumption: failures are retried after a reasonable period
- Pros
  - Updates happen in real-time (no delay except network latency)
- Cons
  - Congestion if many sites come online at precisely the same second
API polls for each site's status
- Assumption: failures are silent, and retried on the next poll interval
- Pros
  - ???
- Cons
  - Time delay between polls means we could be desynchronized
  - Not scalable - this is either one thread per site, or one giant thread looping through all sites

Storyboard for Demo Presentation

Here is my PPT version of my napkin sketch for the SC demo. Also context on where the demo product fits in the story. Comments, please!

nds_sc16_demo.pptx

Space shortcuts

Page tree

Goals

Presentation Logistics

Technology

Datasets

Design Notes

Planning discussion 1 (NDSC6)

Planning Discussion 2

Component Options

Federation options

Synchronization options

Storyboard for Demo Presentation

Space shortcuts

Page tree

SC16 Demo Planning

Goals

Presentation Logistics

Technology

Datasets

Design Notes

Planning discussion 1 (NDSC6)

Planning Discussion 2

Component Options

Federation options

Synchronization options

Storyboard for Demo Presentation