Goals

NDS Share(d) datasets: present online the below datasets so that users can find and obtain them. Highlight DOI's of both paper and dataset.
Provide analysis tools along with each dataset (ideally deployable data-side to avoid having to copy the data).

Presentation Logistics

Booth demos: NCSA, SDSC, ...
Draft list of specific individuals to invite to booths
Create a flyer (address how we approached the problem? discuss tech options)
SCinet late breaking news talk

Technology

Globus Publish (https://github.com/globus/globus-publish-dspace)
- Modified DSpace with Globus Authentication, Groups, and Transfer
- Example instance: Materials Data Facility
- Requires a Globus endpoint on each resource (not available for Swift stores, e.g. SDSC Cloud)
- Can skin a collection (currently doesn't support skinning instance)
- Analysis currently not supported
yt Hub
- Built on Girder with additional plugins to support Jupyter, ownCloud, ...
- Example skinned instance: Galaxy Cluster Merger Catalog
- Would need allocation on BlueWaters to run tools there. May need to move Moesta dataset elsewhere...
Resolution Service
- Given DOI → get URL to data, get loction, get machine, get local path on machine
- Given notebook, location, path → run notebook on data at resource
- Allows independence from repo technologies
- Allow repos to provide location information as metatdata, if not available attempt to resolve (e.g. from URL, index)
- Repos that don't have local computation options would need to move data
- Only requirement from repos is that data can be accessed via a URL
- Identify at least one notebook for demonstration
- Build as service with python library interface that can be shown in Jupyter
- Create an alternative bookmarklet client that can be shown on any of repo
  - click on link get list of resources to run a selected notebook on
- Discussed as a need within TERRA effort
- Leverage work with Girder in Whole Tale:
  - https://girder.wholetale.org/#collection/57fc1a1986ed1d000173b463
  - API

Datasets

1. MHD Turbulence in Core-Collapse Supernovae
Authors: Philipp Moesta (pmoesta@berkeley.edu), Christian Ott (cott@tapir.caltech.edu)
Paper URL: http://www.nature.com/nature/journal/v528/n7582/full/nature15755.html
Paper DOI: dx.doi.org/10.1038/nature15755
Data URL: https://go-bluewaters.ncsa.illinois.edu/globus-app/transfer?origin_id=8fc2bb2a-9712-11e5-9991-22000b96db58&origin_path=%2F
Data DOI: ??
Size: 90 TB
Code & Tools: Einstein Toolkit, see this page for list of available vis tools for this format

The dataset is a series of snapshots in time from 4 ultra-high resolution 3D magnetohydrodynamic simulations of rapidly rotating stellar core-collapse. The 3D domain for all simulations is in quadrant symmetry with dimensions 0 < x,y < 66.5km, -66.5km < z < 66.5km. It covers the newly born neutron star and it's shear layer with a uniform resolution. The simulations were performed at 4 different resolutions [500m,200m,100m,50m]. There are a total of 350 snapshots over the simulated time of 10ms with 10 variables capturing the state of the magnetofluid. For the highest resolution simulation, a single 3D output variable for a single time is ~26GB in size. The entire dataset is ~90TB in size. The highest resolution simulation used 60 million CPU hours on BlueWaters. The dataset may be used to analyze the turbulent state of the fluid and perform analysis going beyond the published results in Nature doi:10.1038/nature15755.

2. Probing the Ultraviolet Luminosity Function of the Earliest Galaxies with the Renaissance Simulations
Authors: Brian O'Shea (oshea@msu.edu), John Wise, Hao Xu, Michael Norman
Paper URL: http://iopscience.iop.org/article/10.1088/2041-8205/807/1/L12/meta;jsessionid=40CF566DDA56AD74A99FE108F573F445.c1.iopscience.cld.iop.org
Paper DOI: dx.doi.org/10.1088/2041-8205/807/1/L12
Data URL:

Data DOI: ??
Size: 89 TB
Code & Tools: Enzo

In this paper, we present the first results from the Renaissance Simulations, a suite of extremely high-resolution and physics-rich AMR calculations of high-redshift galaxy formation performed on the Blue Waters supercomputer. These simulations contain hundreds of well-resolved galaxies at z ~ 25–8, and make several novel, testable predictions. Most critically, we show that the ultraviolet luminosity function of our simulated galaxies is consistent with observations of high-z galaxy populations at the bright end of the luminosity function (M1600 ⩽ -17), but at lower luminosities is essentially flat rather than rising steeply, as has been inferred by Schechter function fits to high-z observations, and has a clearly defined lower limit in UV luminosity. This behavior of the luminosity function is due to two factors: (i) the strong dependence of the star formation rate (SFR) on halo virial mass in our simulated galaxy population, with lower-mass halos having systematically lower SFRs and thus lower UV luminosities; and (ii) the fact that halos with virial masses below ~2 x 10^8 M do not universally contain stars, with the fraction of halos containing stars dropping to zero at ~7 x 10^6 M . Finally, we show that the brightest of our simulated galaxies may be visible to current and future ultra-deep space-based surveys, particularly if lensed regions are chosen for observation.

3. Dark Sky Simulation
Authors: Michael Warren, Alexandar Friedland, Daniel Holz, Samuel Skillman, Paul Sutter, Matthew Turk (mjturk@illinois.edu), Risa Wechsler
Paper URL: https://zenodo.org/record/10777#.V_VvKtwcK1M, https://arxiv.org/abs/1407.2600
Paper DOI: http://dx.doi.org/10.5281/zenodo.10777
Data URL:

- https://girder.hub.yt/api/v1/collection/578501e0c2a5f40001cec1d6/download (https://girder.hub.yt/#collection/578501e0c2a5f40001cec1d6)
- http://darksky.slac.stanford.edu/about.html

Data DOI: ??
Size: 31 TB
Code & Tools: https://bitbucket.org/darkskysims/darksky_tour/

The cosmological N-body simulation designed to provide a quantitative and accessible model of the evolution of the large-scale Universe.

4. ...

Design Notes

Planning discussion 1 (NDSC6)

Photo of whiteboard from NDSC6

On the left is the repository landing page for a dataset (Globus, SEAD, Dataverse) with a button/link to the "Job Submission" UI
Job Submission UI is basically the Tool manager or Jupyter tmpnb
At the top (faintly) is a registry that resolves a dataset URL to it's location with mountable path
- (There was some confusion whether this was the dataset URL or dataset DOI or other PID, but now it sounds like URL – see example below)
On the right are the datasets at their locations (SDSC, NCSA)
The user can launch a container (e.g., Jupyter) that mounts the datasets readonly and runs on a docker-enabled host at each site.
Todo list on the right:
- Data access at SDSC (we need a docker-enabled host that can mount the Norman dataset)
- Auth – how are we auth'ing users?
- Container orchestration – how are we launching/managing containers at each site
- Analysis?
- BW → Condo : Copy the MHD dataset from Blue Waters to storage condo at NCSA
- Dataset metadata (Kenton)
- Resolution (registry) (Kyle)

Planning Discussion 2

Notes from discussion (Craig W, David R, Mike L) based on above whiteboard diagram:

sc16-box-diagram

What we have:

"Tool Manager" demonstrated at NDSC6
- Angular UI over a very simple Python/Flask REST API.
- The REST API allows you to get a list of supported tools, post/put/delete instances of running tools.
- Fronted with a basic NGINX proxy that routes traffic to the running container based on ID (e.g., http://tooserver/containerId/)
- Data is retrieved via HTTP get using repository-specific APIs. Only Clowder and Dataverse are supported
- Docker containers are managed via system calls (docker executable)
WholeTale/ytHub/tmpnb:
- The yt.hub team has extended Jupyter tmpnb to support volume mounts. They've created fuse mounts for Girder.
PRAGMA PID service
- Demonstrated at NDSC6, appears to allow attaching arbitrary metadata to registered PID.
Analysis
- For the Dark Sky dataset, we can use the notebook demonstrated by the yt team.

What we need to do:

Copy MHD dataset to storage condo
Docker-enabled hosts with access to each dataset (e.g., NFS) at SDSC, possibly in the yt DXL project, and in the NDS Labs project for MHD
Decide whether to use/extend existing Tool Manager, yt/tmpnb or Jupyter tmpnp (or something else)
Define strategy for managing containers at each site
- Simple: "ssh docker run -v" or use the Docker API
- Harder: Use Kubernetes or Docker Swarm for container orchestration. For example, launch a jupyter container on a node with label "sdsc"
Implement the resolution/registry
- Ability to register a data URL with some associated metadata.
- Metadata would include site (SDSC, NCSA) and volume mount information for the dataset.
- The PRAGMA PID service looks possible at first glance, but may be too complex for what we're trying to do. It requires handle.net integration.
Implement bookmarklet: There was discussion of providing some bookmarklet javascript to link a data DOI/PID to the "tool manager" service
Authentication:
- TBD – how do we control who gets access, or is it open to the public?
- In the case of Clowder/Dataverse, all API requests include an API key
Analysis:
- Need to get notebooks/code to demonstrate how to work with the MHD and Norman data.

Example case for resolution (not a real dataset for SC16)

A dataset has a Globus Publish landing page https://publish.globus.org/jspui/handle/ITEM/113
This dataset has the URL
- https://www.globus.org/app/transfer?origin_id=82f1b5c6-6e9b-11e5-ba47-22000b92c6ec&origin_path=/unpublished/publication_113/
This would map to Nebula:
- /scratch/mdf/publication_113

Component Options

We will need to select one from each of the following categories.

All combinations are possible, although some combinations will likely be easier to accomplish than others.

"Repository" - User Frontend

1. ~~user installs bookmarklet~~ this may be restricted in modern browsers... more research is necessary
  - Pros
    - browser-agnostic
  - Cons
    - probably lots of learning involved here
    - user must seek out and install this
    - no notion of authentication (see below)
    - more complex scenarios may not work in modern browsers
      - see https://hypothes.is/blog/farewell-to-bookmarklets/
      - see https://medium.com/making-instapaper/bookmarklets-are-dead-d470d4bbb626#.co504ji62
2. user installs browser extension
  - Pros
    - more secure than bookmarklets... I guess?
  - Cons
    - probably lots of learning involved here
    - user must seek out and install this
    - no notion of authentication (see below)
    - browser-specific (we would need to develop and maintain one for each browser)
3. developer(s) add a link to repo UI which leads to the existing ToolManager UI landing page, as in the NDSC6 demo
  - Pros
    - user does not need to install anything special on their local machine to launch tools
    - most repos inherently have a notion of "user" whose username and/or email we can use to identify tools launched by this user
  - Cons
    - repo UI developers who want to integrate with us need to add one line to their source to integrate with us
      - Dataverse, Clowder, Globus Publish, etc

"Resolver" - API endpoint to resolve DOIs to tmpnb proxy URLs

Open question: federation (see below) - is this centralized or decentralized?

1. existing ToolManager - this will very simply serve a JSON file from disk
  - Pros
    - Easy to set up and modify as we need to
  - Cons
    - Likely not a long-term solution, but simple enough to accomplish in the short-term
2. Girder?
  - Pros
    - Well-documented, extensible API, with existing notions of file, resource, and user management
  - Cons
    - likely overkill for this system, as we don't need any of the file management capabilities for resolving
    - language barriers in modifying Girder - python + javascript (raw? nodejs?)
3. etcd?
  - Pros
    - Familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's etcd.go
  - Cons
    - it might be quite a bit of work to build up a new API around etcd
4. PRAGMA PID service?
  - Pros
    - sounds remarkably similar to what we're trying to accomplish here
    - supports a wide variety of different handle types (and repos?)
  - Cons
    - may be too complex to accomplish in the short term
    - unfamiliar code base / languages

"Agent" - launches containers alongside the data on a Docker-enabled host

1. existing ToolManager?
  - Pros
    - already parameterized to launch multiple tools (jupyter and rstudio)
  - Cons
    - no notion of "user" or authentication
2. Girder/tmpnb?
  - Pros
    - notebooks automatically time out after a given period
    - inherited notion of "user"
  - Cons
    - can only launch single image type, currently (only jupyter)
    - inherited notion of "user" may present an interesting auth problem - how do we share these accounts between sites?
3. Kubernetes / Docker Swarm?
  - Pros
    - familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's kube.go
    - orchestration keeps containers alive if possible when anything goes wrong
  - Cons
    - may be too complex to accomplish in the short term
4. docker -H?
  - Pros
    - zero setup necessary, just need Docker installed and the port open
  - Cons
    - HIGHLY insecure - would require some form of authentication (see below)

"Data" - large datasets need to be mountable on a Docker-enabled host

1. NFS?
2. GFS?
3. other options?

Federation options

Centralized
1. New sites register with central API server as they come online (i.e. POST to /metadata)
  1. POSTed metadata should include all urls, DOIs, and other necessary info
2. Central API server (Resolver) receives all requests, resolves DOIs to sites that have registered, and delegates jobs to the Agent
- Pros
  - synchronization / authentication (see below) may be slightly easier to solve
- Cons
  - single point of failure
Decentralized
1. New sites register with each other
2. Any API server receives request and can resolve and delegate to the appropriate Agent
- Pros
  - no single point of failure
- Cons
  - synchronization (see below) is still an open question (is this an open broadcast? handshake? do we keep a record of nearest neighbors?)
  - authentication (see below) and sharing credentials between sites becomes a more complex problem

Synchronization options

Sites push their status to the API
- Assumption: failures are retried after a reasonable period
- Pros
  - Updates happen in real-time (no delay except network latency)
- Cons
  - Congestion if many sites come online at precisely the same second
  - More work for whatever we choose as the scheduler / orchestration system - a site missing a scheduled push means we may need to pull it out of rotation
API polls for each site's status
- Assumption: failures are silent, and retried on the next poll interval
- Pros
  - We will know explicitly when sites' are no longer available for launching tools
- Cons
  - Time delay between polls means we could have stale data
  - Threading nightmare - this is either one thread short-lived per site, or one giant thread looping through all sites

Authentication options

Auth may inhibit the "one-touch" behavior desired for these tools - you will always need to at least choose a tool and possibly enter credentials when launching a tool

Build some kind of quasi-auth scheme (similar to ndslabs) on top of the existing ToolManager
Inherit Girder's auth scheme and solve the problem of sharing these "users" between sites
Create a "guest" user at each site and use that to launch tools from remote sources
- NOTE: tmpnb only allows one notebook per user (per folder?), so anyone launching remotely would be sharing a notebook
- this is undesirable, as ideally each request would launch a separate instance
- lingering question: how do we get you back to the notebook if you lose the link? how do we know which notebook is yours?

Inclinations: SC16 Demo

transfer (if necessary) each dataset to existing cloud architecture - in progress?
discover mount points for each large dataset within existing cloud architecture - in progress?
spin up a Docker-enabled host and mount nearby datasets (NFS, direct mount, etc.) - in progress?
using docker-compose, bring up provided girder-dev on each Docker host - pending
extend existing ToolManager to receive site metadata - in progress
modify girder-dev to POST site metadata on startup - in progress
extend existing ToolManager to delegate tmpnb jobs to remote instances of Girder
wrap existing ToolManager in a simple auth mechanism
- could we possibly import existing users from Girder using their API? probably not, due to security
- we could callback to Girder when sites push their metadata (assuming this can be done as Girder comes online)
run a centralized ToolManager instance on Nebula for the purposes of the demo
modify existing ToolManager UI to list off collections in connected Girder instance
- Add a "Launch Notebook" button next to each dataset where no notebook is running
- Add a "Stop Notebook" button next to each dataset where a notebook has been launched

Using the above we would be able to show:

Searching for data on compute-enabled systems (albeit in a list of only 3 datasets registered in the system), possibly linking back to the original data source
Launch a Jupyter notebook next to each remote dataset without explicitly navigating to where that data is stored (i.e. the Girder UI)
How to bring this same stack up next to your data to make it searchable in our UI (we could even demonstrate this live, if it goes smoothly enough)

Inclinations: As a Long-Term service

leverage existing Labs/Kubernetes API for authentication and container orchestration / access across remote sites
- etcd.go / kube.go can likely take care of talking to the necessary APIs for us, maybe needing some slight modification
- possibly extend Labs apiserver to include the functionality of delegating jobs to tmpnb and/or ToolManager agents?
- this leaves an open questions: single geodistributed kubernetes cluster? or one kubernetes cluster per site, federated across all sites ("ubernetes")?

Storyboard for Demo Presentation

Here is my PPT version of my napkin sketch for the SC demo. Also context on where the demo product fits in the story. Comments, please!

nds_sc16_demo.pptx

Space shortcuts

Page tree

Goals

Presentation Logistics

Technology

Datasets

Design Notes

Planning discussion 1 (NDSC6)

Planning Discussion 2

Component Options

Federation options

Synchronization options

Authentication options

Inclinations: SC16 Demo

Inclinations: As a Long-Term service

Storyboard for Demo Presentation

Space shortcuts

Page tree

SC16 Demo Planning

Goals

Presentation Logistics

Technology

Datasets

Design Notes

Planning discussion 1 (NDSC6)

Planning Discussion 2

Component Options

Federation options

Synchronization options

Authentication options

Inclinations: SC16 Demo

Inclinations: As a Long-Term service

Storyboard for Demo Presentation