Documenting a few discussions we've had this week trying to contextualize the SC16 demo:

What does it mean to support publishing data?

  • A publishing workflow for the user
  • A unique identifier (e.g., a DOI) to facilitate identification and citation (a minting sketch follows this list)
  • A "landing page" (e.g., entry in a repository/catalog) describing the dataset
  • Sufficient metadata to facilitate selection for re-use
  • Ability for other users to access the data for re-use, possibly requiring authentication/authorization
  • Data management plan support
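
One way to make the DOI and metadata bullets above concrete: the hedged sketch below creates a draft DOI through DataCite's public REST API, using the requests package. The repository credentials, prefix, landing-page URL, and all metadata values are hypothetical placeholders, not a real account.

    import requests

    # All credentials and metadata below are hypothetical placeholders.
    DATACITE_API = "https://api.datacite.org/dois"
    REPO_ID, REPO_PASSWORD = "EXAMPLE.REPO", "changeme"

    payload = {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": "10.99999",  # placeholder prefix; omitting "event" keeps the DOI in draft state
                "titles": [{"title": "Example large-scale simulation dataset"}],
                "creators": [{"name": "Doe, Jane"}],
                "publisher": "Example Supercomputing Center",
                "publicationYear": 2016,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": "https://repository.example.edu/landing/dataset-42",  # hypothetical landing page
            },
        }
    }

    resp = requests.post(
        DATACITE_API,
        json=payload,
        auth=(REPO_ID, REPO_PASSWORD),
        headers={"Content-Type": "application/vnd.api+json"},
    )
    resp.raise_for_status()
    print("Minted draft DOI:", resp.json()["data"]["id"])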

What does it mean to be a data repository?

  • Commitment to ongoing stewardship and preservation of the data to facilitate future discovery, access, and re-use
  • Commitment to preservation/archiving standards, such as OAIS (the Open Archival Information System reference model)

Why is "big data" different?

  • Traditional community and institutional repositories can't handle the data volume, and traditional preservation methods won't work.
  • Presents new challenges in re-use and access
    • For example, users can't easily download or copy the data to local facilities for analysis
    • Access and re-use in these environments is often complex and domain-specific (i.e., may require specialized applications and techniques).
  • Presents new challenges in preservation
    • LOCKSS-style replication ("Lots of Copies Keep Stuff Safe") might not be feasible at this scale, but keeping only a single local copy is problematic in the event of a disaster.

Example: SC16 demo case

  • Supercomputing and campus research computing centers facilitate computational research that generates data too large for traditional repositories and often too expensive to reproduce. For example, a large simulation on the Blue Waters system might be reproducible in principle, but re-running it today would be cost prohibitive. CERN faces a similar situation.
  • SC centers are in a unique position (and have a need) to provide a local repository that conforms to community standards, supporting discovery/access/re-use/preservation without necessarily moving the data to a new location.
  • The NDSC can recommend a set of models and toolkits to enable SCs and research computing centers to become data repositories. For example:
    • Tools to preserve and ensure the integrity of large datasets, conforming to archive/repository community standards (an integrity-checking sketch follows this list)
    • Tools and techniques to enable discovery, access, and re-use of these datasets. For example:
      • Access to HPC environments
      • The ability to host dataset-specific applications/interfaces, for example Python notebooks
    • Possibly a preservation network, ensuring that large datasets are available at more than one location
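
As a concrete (and hedged) sketch of the integrity tooling mentioned above: compute a fixity checksum for every file in a dataset directory and record it in a BagIt-style manifest that later audits can verify. The dataset path is a hypothetical placeholder; BagIt is the Library of Congress packaging convention this imitates.

    import hashlib
    from pathlib import Path

    def sha256sum(path, chunk=1024 * 1024):
        """Stream a file in 1 MiB chunks so arbitrarily large files fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def write_manifest(dataset_dir, manifest="manifest-sha256.txt"):
        """Record a checksum for every file, BagIt-style, for later fixity audits."""
        root = Path(dataset_dir)
        with open(root / manifest, "w") as out:
            for path in sorted(root.rglob("*")):
                if path.is_file() and path.name != manifest:
                    out.write(f"{sha256sum(path)}  {path.relative_to(root)}\n")

    # Hypothetical dataset location on a center's storage system.
    write_manifest("/projects/sim_run_042")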

Example: CERN

Example: TERRA-REF

  • The ARPA-E TERRA-REF project will produce hundreds of terabytes of data and a specialized infrastructure for re-use. However, it's not clear how the project will address long-term preservation of this data. Many other "big data" projects would likely benefit from guidance on addressing long-term access and preservation.

Example: Galaxy Cluster Merger Catalog (GCMC)

http://gcmc.hub.yt/

  • This is a website dedicated to a set of simulations
  • It is a browsable catalog with links to downloadable files and summary pages with generated visualizations
  • http://gcmc.hub.yt/sloshing/R5_b500/0000.html
    • It is also possible to launch Jupyter notebooks (a yt-based sketch follows this list)
    • Notebooks are general purpose (not a specific notebook for the dataset)
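
For contrast with the general-purpose notebooks above, a dataset-specific notebook for simulation output like GCMC's would typically be only a few lines of yt, the open-source analysis library behind hub.yt. This is a hedged sketch: the file path is a hypothetical placeholder, not a real GCMC download.

    import yt

    # Hypothetical local path to one downloaded simulation output.
    ds = yt.load("sloshing/R5_b500/data_0000")

    # List the available fields, then render the kind of density slice
    # shown on the GCMC summary pages.
    print(ds.field_list)
    slc = yt.SlicePlot(ds, "z", ("gas", "density"))
    slc.save("density_slice.png")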

Example: Renaissance simulations

Example: Dark Sky simulations

http://darksky.slac.stanford.edu/edr.html

  • Data release: 
  • In yt.hub
    • DarkSky data is now available via yt.hub, which enables remote analysis (a loading sketch follows this list)
  • coLaboratory plugin
    • "A Chrome App for running a port of IPython to PNaCl, in a notebook environment with Google Drive Integration."
  • darksky_catalog
  • thingking
    • "page-caching system capable of injecting raw binary data from the WWW into locally running Python sessions"

Example: Blue Waters Data Sharing Service

http://www.ncsa.illinois.edu/news/story/new_capabilities_make_data_on_blue_waters_sharable_and_movable

  • DSS Pilot ended in 2016?
  • "At the 2014 Blue Waters Symposium and in other conversations, the science and engineering partners identified the ability to share their data related to massive supercomputing projects with the broader scientific community as an important service for the future."
  • Two classes of sharing based on the needs of the partners and data:
    • Active data sharing for projects with current allocations on Blue Waters
    • Community sharing plan for data produced by prior projects. 
  • Prototype DSS for Blue Waters
    • The service will allow supercomputer users to share their research data with colleagues who do not have access to the supercomputer. 
  • Users from each group can share data using:
    • Globus Online's sharing capabilities (a Globus sharing sketch follows this list)
    • A web service interface
  • Projects (PIs) can submit a service request, which is really just a means for us to help the teams better prepare their data for distribution.
  • Teams can obtain a Digital Object Identifier (DOI) for the data set.
  • The shared data also counts toward the science team’s storage limit on the Blue Waters sub-storage systems.
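
Finally, a hedged sketch of the Globus sharing path mentioned in the list above, written against the globus-sdk Python package: creating an ACL rule that grants one collaborator read access to one dataset directory on a shared endpoint. The token, endpoint UUID, identity UUID, and path are hypothetical placeholders.

    import globus_sdk

    # Hypothetical credentials and IDs; obtain real ones from globus.org.
    TRANSFER_TOKEN = "REPLACE_WITH_TRANSFER_TOKEN"
    SHARED_ENDPOINT_ID = "REPLACE_WITH_SHARED_ENDPOINT_UUID"

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
    )

    # Grant a collaborator (who has no Blue Waters account) read-only
    # access to a single dataset directory on the shared endpoint.
    rule = {
        "DATA_TYPE": "access",
        "principal_type": "identity",
        "principal": "REPLACE_WITH_COLLABORATOR_IDENTITY_UUID",
        "path": "/datasets/simulation_run_042/",  # hypothetical path
        "permissions": "r",
    }
    result = tc.add_endpoint_acl_rule(SHARED_ENDPOINT_ID, rule)
    print("Created access rule:", result["access_id"])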
