http://sc16.supercomputing.org/ - Wednesday, Nov. 16th, 3:00pm MST

Goals

  1. NDS Share(d) datasets: present the datasets below online so that users can find and obtain them.  Highlight the DOIs of both the paper and the dataset.
  2. Provide analysis tools along with each dataset (ideally deployable data-side to avoid having to copy the data).

Presentation Logistics

Technology

  1. Globus Publish (https://github.com/globus/globus-publish-dspace)
  2. yt Hub
  3. Resolution Service

Datasets

1. MHD Turbulence in Core-Collapse Supernovae
Authors: Philipp Moesta (pmoesta@berkeley.edu), Christian Ott (cott@tapir.caltech.edu)

Paper Citation:   Mösta, P., Ott, C. D., Radice, D., Roberts, L. F., Schnetter, E., & Haas, R. (2015). A large-scale dynamo and magnetoturbulence in rapidly rotating core-collapse supernovae. Nature, 528(7582), 376–379. http://dx.doi.org/10.1038/nature15755
Paper URL: http://www.nature.com/nature/journal/v528/n7582/full/nature15755.html
Paper DOI: http://dx.doi.org/10.1038/nature15755
Data Citation: ??
Data URL: https://go-bluewaters.ncsa.illinois.edu/globus-app/transfer?origin_id=8fc2bb2a-9712-11e5-9991-22000b96db58&origin_path=%2F
Data DOI: ??
Size: 90 TB
Code & Tools: Einstein Toolkit; see this page for a list of available visualization tools for this format
Jupyter Notebook: ??

The dataset is a series of snapshots in time from 4 ultra-high-resolution 3D magnetohydrodynamic simulations of rapidly rotating stellar core-collapse. The 3D domain for all simulations is in quadrant symmetry with dimensions 0 < x,y < 66.5 km, -66.5 km < z < 66.5 km. It covers the newly born neutron star and its shear layer with a uniform resolution. The simulations were performed at 4 different resolutions (500 m, 200 m, 100 m, and 50 m). There are a total of 350 snapshots over the simulated time of 10 ms, with 10 variables capturing the state of the magnetofluid. For the highest-resolution simulation, a single 3D output variable for a single time is ~26 GB in size; the entire dataset is ~90 TB. The highest-resolution simulation used 60 million CPU hours on Blue Waters. The dataset may be used to analyze the turbulent state of the fluid and to perform analyses going beyond the published results (Nature, doi:10.1038/nature15755).
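
As a sketch of the data-side analysis we want to enable, assuming yt (see "yt Hub" under Technology) can read these snapshots; the file name here is a placeholder, not the actual output layout:

    # Hypothetical sketch: open one snapshot and inspect the magnetofluid state.
    # Whether yt reads this Einstein Toolkit output directly is an assumption;
    # the snapshot path is a placeholder.
    import yt

    ds = yt.load("snapshot_0100.h5")          # one of the ~350 snapshots
    print(ds.field_list)                      # the 10 state variables
    ad = ds.all_data()
    print(ad.quantities.extrema("density"))   # e.g., density range in the domain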

2. Probing the Ultraviolet Luminosity Function of the Earliest Galaxies with the Renaissance Simulations 
Authors: Brian O'Shea (oshea@msu.edu), John Wise, Hao Xu, Michael Norman

Paper Citation: O'Shea, B. W., Wise, J. H., Xu, H., & Norman, M. L. (2015). Probing the Ultraviolet Luminosity Function of the Earliest Galaxies with the Renaissance Simulations. The Astrophysical Journal Letters, 807(1), L12.
Paper URL: http://iopscience.iop.org/article/10.1088/2041-8205/807/1/L12/meta
Paper DOI: http://dx.doi.org/10.1088/2041-8205/807/1/L12
Data Citation: ??
Data URL:
Data DOI: ??
Size: 89 TB
Code & Tools: Enzo
Jupyter Notebook: http://yt-project.org/docs/dev/cookbook/cosmological_analysis.html

In this paper, we present the first results from the Renaissance Simulations, a suite of extremely high-resolution and physics-rich AMR calculations of high-redshift galaxy formation performed on the Blue Waters supercomputer. These simulations contain hundreds of well-resolved galaxies at z ~ 25–8, and make several novel, testable predictions. Most critically, we show that the ultraviolet luminosity function of our simulated galaxies is consistent with observations of high-z galaxy populations at the bright end of the luminosity function (M_1600 ≲ -17), but at lower luminosities is essentially flat rather than rising steeply, as has been inferred by Schechter function fits to high-z observations, and has a clearly defined lower limit in UV luminosity. This behavior of the luminosity function is due to two factors: (i) the strong dependence of the star formation rate (SFR) on halo virial mass in our simulated galaxy population, with lower-mass halos having systematically lower SFRs and thus lower UV luminosities; and (ii) the fact that halos with virial masses below ~2 × 10^8 M_☉ do not universally contain stars, with the fraction of halos containing stars dropping to zero at ~7 × 10^6 M_☉. Finally, we show that the brightest of our simulated galaxies may be visible to current and future ultra-deep space-based surveys, particularly if lensed regions are chosen for observation.
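
A minimal sketch of the kind of analysis in the linked yt cookbook notebook, using yt's Enzo frontend (the output path is a placeholder):

    # Load one Enzo AMR output and make a density projection, as in the
    # cosmological analysis cookbook linked above. The path is a placeholder.
    import yt

    ds = yt.load("RD0042/RedshiftOutput0042")
    p = yt.ProjectionPlot(ds, "z", "density", width=(1, "Mpc"))
    p.save("density_projection.png")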

3. Dark Sky Simulation
Authors: Michael Warren, Alexander Friedland, Daniel Holz, Samuel Skillman, Paul Sutter, Matthew Turk (mjturk@illinois.edu), Risa Wechsler

Paper Citation: Skillman, S. W., Warren, M. S., Turk, M. J., Wechsler, R. H., Holz, D. E., & Sutter, P. M. (2014). Dark Sky Simulations: Early Data Release. arXiv:1407.2600.
Paper URL: https://zenodo.org/record/10777, https://arxiv.org/abs/1407.2600
Paper DOI: http://dx.doi.org/10.5281/zenodo.10777
Data Citation: Warren, M. S., Friedland, A., Holz, D. E., Skillman, S. W., Sutter, P. M., Turk, M. J., & Wechsler, R. H. (2014). Dark Sky Simulations Collaboration. Zenodo. https://doi.org/10.5281/zenodo.10777
Data URL:
Data DOI: https://doi.org/10.5281/zenodo.10777 (Although this is classified as a report in Zenodo, the authors intended this to be the DOI for the dataset)
Size: 31 TB
Code & Tools: https://bitbucket.org/darkskysims/darksky_tour/
Jupyter Notebook: https://girder.hub.yt/#user/570bd8fc2f2b14000176822c/folder/5820b9c09ea95c00014c71a1

A cosmological N-body simulation designed to provide a quantitative and accessible model of the evolution of the large-scale Universe.
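
The darksky_tour repository above shows how to reach the data with yt; a minimal sketch (the remote SDF URL is illustrative and may not match the released file layout):

    # Sketch: open a Dark Sky SDF snapshot over HTTP with yt, without copying
    # the 31 TB locally. The URL is illustrative.
    import yt

    ds = yt.load("http://darksky.slac.stanford.edu/simulations/ds14_a/ds14_a_1.0000")
    print(ds.domain_width)   # box size of the simulation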

4. ... 
 

Design Notes

Planning discussion 1 (NDSC6)

Photo of whiteboard from NDSC6

Planning Discussion 2

Notes from discussion (Craig W, David R, Mike L) based on above whiteboard diagram:

Components that we already have available:

What we need to do:

Example case for resolution (not a real dataset for SC16)

Planning Discussion 3

Notes from third set of discussions (Craig, David, Mike – with Slack guidance from Kacper).

Component Options

Looking at the above diagram, we see four categories of services, detailed in the subsections below: the "Repository" (user frontend), the "Resolver" (identifier resolution API), the "Agent" (container launcher), and the "Data" (mountable dataset storage).

All combinations are possible, although some will likely be easier to accomplish than others.  

Open question: We've discussed the idea of federated services versus a centralized "resolver" service. See the "Federation Options" section for details.

"Repository" - User Frontend

Options:

    1. user installs a bookmarklet; this may be restricted in modern browsers... more research is necessary
    2. user installs browser extension
      • Pros
        • likely more secure than bookmarklets, though we have not verified this
      • Cons
        • probably lots of learning involved here
        • user must seek out and install this
        • no notion of authentication (see below)
        • browser-specific (we would need to develop and maintain one for each browser)
    3. developer(s) add a link to the repo UI which leads to the existing ToolManager UI landing page, as in the NDSC6 demo (a sketch of such a link follows this list)
      • Pros
        • user does not need to install anything special on their local machine to launch tools
        • most repos inherently have a notion of "user" whose username and/or email we can use to identify tools launched by this user
      • Cons
        • repo UI developers who want to integrate with us need to add one line to their source
          • Dataverse, Clowder, Globus Publish, etc
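
For option 3, the "one line" could be a templated link into the ToolManager landing page. A hypothetical sketch of constructing it (the host name and query parameters are assumptions):

    # Hypothetical launch link a repo UI would embed (option 3). The ToolManager
    # host and parameter names are assumptions for illustration.
    from urllib.parse import urlencode

    def launch_url(dataset_doi, user_email):
        params = urlencode({"dataset": dataset_doi, "user": user_email})
        return "https://toolmgr.example.org/landing?" + params

    print(launch_url("10.5281/zenodo.10777", "user@example.org"))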

 

"Resolver" - API endpoint to resolve identifiers (e.g., DOI, URN, URL) to notebook URLs

Options:

    1. Extend the existing NCSA ToolManager to add a /lookup endpoint (a minimal sketch follows this list)
      • Pros
        • Easy to set up and modify as we need to
      • Cons
        • Likely not a long-term solution, but simple enough to accomplish in the short-term
    2. Girder + yt: add the identifier to the metadata and use the mongo_search function to resolve
      • Pros
        • Well-documented, extensible API, with existing notions of file, resource, and user management
      • Cons
        • likely overkill for this system, as we don't need any of the file management capabilities for resolving
        • language barriers in modifying Girder - python + javascript (raw? nodejs?)
    3. Build REST API over etcd
      • Pros
        • Familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's etcd.go
      • Cons
        • it might be quite a bit of work to build up a new API around etcd
    4. PRAGMA PID service?
      • Pros
        • sounds remarkably similar to what we're trying to accomplish here
        • supports a wide variety of different handle types (and repos?)
      • Cons
        • may be too complex to accomplish in the short term
        • unfamiliar code base / languages
        • Has specific notion of a PID that may be too restrictive.
        • Won't support multiple location resolution?
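
To make option 1 concrete, a minimal sketch of the /lookup endpoint (Flask is used here only for brevity; the in-memory dict stands in for whichever backing store we choose, e.g. etcd or Mongo):

    # Minimal /lookup sketch: resolve an identifier (DOI/URN/URL) to the
    # locations where a notebook can be launched against that dataset.
    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    # Placeholder registry; a real deployment would back this with etcd/Mongo.
    REGISTRY = {
        "10.5281/zenodo.10777": ["https://site-a.example.org/launch/darksky"],
    }

    @app.route("/lookup")
    def lookup():
        identifier = request.args.get("id")
        locations = REGISTRY.get(identifier)
        if locations is None:
            abort(404)
        return jsonify({"id": identifier, "locations": locations})

    if __name__ == "__main__":
        app.run(port=8080)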

 

"Agent" - launches containers alongside the data on a Docker-enabled host

Options:

    1. Use the existing ToolManager?
      • Pros
        • already parameterized to launch multiple tools (jupyter and rstudio)
      • Cons
        • no notion of "user" or authentication
    2. Girder/tmpnb?
      • Pros
        • notebooks automatically time out after a given period
        • inherited notion of "user"
      • Cons
        • can currently only launch a single image type (Jupyter)
        • inherited notion of "user" may present an interesting auth problem - how do we share these accounts between sites?
        • Girderisms: need to pass around "folderIds" or resolve dataset identifiers to folders.
    3. Kubernetes / Docker Swarm?
      • Pros
        • familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's kube.go
        • orchestration keeps containers alive if possible when anything goes wrong
      • Cons
        • may be too complex to accomplish in the short term
    4. docker -H? (see the launch sketch after this list)
      • Pros
        • zero setup necessary, just need Docker installed and the port open
      • Cons
        • HIGHLY insecure - would require some form of authentication (see below)
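
Whichever option we pick, the agent ultimately does something like the following sketch, using the Docker SDK for Python (the host, image, and mount path are placeholders; option 4 corresponds to pointing base_url at a remote host, which is exactly why it needs authentication):

    # Sketch: launch a Jupyter container next to the data on a Docker host.
    # base_url, the image, and the /data path are assumptions for illustration.
    import docker

    client = docker.DockerClient(base_url="tcp://data-host.example.org:2375")
    container = client.containers.run(
        "jupyter/scipy-notebook",      # tool image (placeholder)
        detach=True,
        ports={"8888/tcp": 8888},      # expose the notebook server
        volumes={"/data/darksky": {"bind": "/home/jovyan/data", "mode": "ro"}},
    )
    print(container.id)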

 

"Data" - large datasets need to be mountable on a Docker-enabled host

    1. NFS?
    2. GFS?
    3. S3?
    4. other options?

Federation options

  1. Centralized
    1. All datasets are registered with a central service that is responsible for resolving identifiers to locations and launching notebooks at those locations.
  2. Federated
    1. Each site has its own local stack but registers with a federation server for id → location resolution (a registration sketch follows this list)
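
In the federated case, per-site registration could be a single POST to the federation server; a sketch assuming a hypothetical /register endpoint and payload shape:

    # Sketch: a site registers its locally hosted dataset identifiers with the
    # federation resolver. Endpoint and payload shape are assumptions.
    import requests

    payload = {
        "site": "https://site-a.example.org",
        "datasets": ["10.5281/zenodo.10777"],
    }
    requests.post("https://resolver.example.org/register", json=payload, timeout=10)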

 

Additional notes for this diagram:

Synchronization options

  1. Sites push their status to the resolver API
  2. Resolver service polls for each site's status

Authentication options

Auth may inhibit the "one-touch" behavior desired for these tools: the user will always need to at least choose a tool, and possibly enter credentials, when launching one.

  1. Build some kind of quasi-auth scheme (similar to ndslabs) on top of the existing ToolManager (a token sketch follows this list)
  2. Inherit Girder's auth scheme and solve the problem of sharing these "users" between sites
  3. Create a "guest" user at each site and use that to launch tools from remote sources
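
Option 1's quasi-auth could be a signed, expiring launch token rather than a full login; a stdlib-only sketch (the shared secret and token format are assumptions, and distributing the secret between sites is the open problem noted in option 2):

    # Sketch of a quasi-auth launch token: an HMAC-signed user + expiry string
    # that the agent can verify without a user database.
    import hashlib, hmac, time

    SECRET = b"shared-site-secret"  # placeholder; must be distributed securely

    def make_token(user, ttl=300):
        expires = str(int(time.time()) + ttl)
        sig = hmac.new(SECRET, (user + "|" + expires).encode(), hashlib.sha256).hexdigest()
        return user + "|" + expires + "|" + sig

    def check_token(token):
        user, expires, sig = token.rsplit("|", 2)
        good = hmac.new(SECRET, (user + "|" + expires).encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, good) and time.time() < int(expires)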

Inclinations: SC16 Demo

Using the above we would be able to show:

Inclinations: As a Long-Term service

Notes from 10/28 meeting

Present: Mike, David, Kenton, Kandace, Kacper, Craig

Storyboard for Demo Presentation

Here is my PPT version of my napkin sketch for the SC demo, along with context on where the demo product fits in the story.  Comments, please!

nds_sc16_demo.pptx

Presentation v1

Please give me your feedback!  Graphics executed to the edge of my abilities and patience.

nds_sc16_demo_111216.pptx