Documenting a few discussions we've had this week trying to contextualize the SC16 demo:
What does it mean to support publishing data?
- A publishing workflow for the user
- A unique identifier (e.g., DOI) to facilitate identification and citation
- A "landing page" (e.g., entry in a repository/catalog) describing the dataset
- Sufficient metadata to facilitate selection for re-use
- Ability for other users to access the data for re-use, possibly requiring authentication/authorization
- Data management plan support
What does it mean to be a data repository?
- Commitment to ongoing stewardship and preservation of the data to facilitate future discovery, access, and re-use:
- See the Trustworthy Repositories Audit & Certification Checklist
- Includes everything from governance, organization, financing, staffing, technology
- Commitment to preservation/archiving standards, such as OAIS
Why is "big data" different?
- Traditional community and institutional repositories can't handle the data volume, and traditional preservation methods won't work
- Presents new challenges in re-use and access
- For example, users can't easily download or copy the data to local facilities for analysis
- Access and re-use in these environments is often complex and domain-specific (i.e., may require specialized applications and techniques).
- Presents new challenges in preservation
- LOCKSS-style replication might not work at these volumes, but keeping the only copy at a single local site is problematic in the event of a disaster.
Example: SC16 demo case:
- Supercomputing and campus research computing centers facilitate computational research that generates data that is too large for traditional repositories and often too expensive to reproduce. For example, a large simulation on the Blue Waters system might be reproducible in principle, but at this time reproducing the simulation is cost prohibitive. This is similar to the situation faced by CERN.
- SC centers are in a unique position (and have a need) to provide a local repository that conforms to community standards, supporting discovery/access/re-use/preservation without necessarily moving the data to a new location.
- The NDSC can recommend a set of models and toolkits to enable SCs and research computing centers to become data repositories. For example:
- Tools to preserve and ensure the integrity of large datasets, conforming to archive/repository community standards
- Tools and techniques to enable discovery, access, and re-use of these datasets. For example:
- Access to HPC environments
- The ability to host dataset-specific applications/interfaces; for example – Python notebooks
- Possibly a preservation network, ensuring that large datasets are available at more than one location
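As a concrete sketch of the first recommendation above (integrity tools), chunked fixity checksums scale to files far too large to hash in memory, and a checksum manifest supports later audits. The 64 MiB chunk size and the two-column manifest format below are illustrative assumptions (similar in spirit to a BagIt payload manifest), not a community standard:

```python
import hashlib

def fixity_checksum(path, algorithm="sha256", chunk_size=64 * 1024 * 1024):
    """Compute a checksum of a (possibly very large) file in fixed-size
    chunks, so memory use stays constant regardless of file size."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(paths, manifest_path):
    """Record checksum/filename pairs for each file in the dataset,
    enabling periodic fixity audits of the archived copies."""
    with open(manifest_path, "w") as out:
        for path in paths:
            out.write(f"{fixity_checksum(path)}  {path}\n")
```

Re-running the checksums against the stored manifest (here and at any replica sites) is the basic audit step that repository certification checklists expect.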
Example: CERN
- CMS released a 300 TB open dataset via their Open Data Portal
- They've started work on the CERN Analysis Preservation Framework as well as long-term data preservation strategies – including discussion of ISO-16363 certification (trustworthy repositories)
Example: TERRA-REF
- The ARPA-E TERRA-REF project will produce hundreds of TB of data and a specialized infrastructure for re-use. However, it's not clear how they will address the preservation of this data over the long term. There are likely many "big data" applications that would benefit from guidance on how to address the problem of long-term access and preservation.
Example: Galaxy Cluster Merger Catalog (GCMC)
- This is a website dedicated to a set of simulations
- It is a browsable catalog with links to downloadable files and summary pages with generated visualizations
- http://gcmc.hub.yt/sloshing/R5_b500/0000.html
- It is also possible to launch Jupyter notebooks
- Notebooks are general purpose (not a specific notebook for the dataset)
Example: Renaissance simulations
- http://www.ncsa.illinois.edu/news/story/blue_waters_simulations_suggest_there_are_fewer_faint_galaxies_than_expecte
- "Because these simulations are so costly to generate, the team moved the entire output of the Renaissance Simulations to SDSC Cloud—some 100 terabytes of data, or the equivalent of about 150,000 audio compact discs." "A data access portal is being set up so that others can investigate their properties in more detail."
Example: Dark Sky simulations
http://darksky.slac.stanford.edu/edr.html
- Data release:
- The raw data is available simply via HTTP with some instructions for use
- Data publication: http://arxiv.org/abs/1407.2600
- Data release page: http://darksky.slac.stanford.edu/data_release/
- "Just beware of clicking on those 34 TB files!"
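Plain-HTTP access like this pairs naturally with HTTP range requests, which let a user read a small slice of a multi-terabyte file without downloading it (this is essentially what tools like thingking build on). A minimal sketch; the server must honor the `Range` header, and any URL used would be one of the actual data-release files:

```python
import urllib.request

def range_header(start, length):
    """Build the HTTP Range header value for bytes [start, start+length)."""
    return f"bytes={start}-{start + length - 1}"

def read_byte_range(url, start, length):
    """Fetch only a slice of a remote file via an HTTP Range request,
    avoiding a full download of a multi-terabyte dataset."""
    req = urllib.request.Request(url, headers={"Range": range_header(start, length)})
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:  # 206 Partial Content: the range was honored
            raise RuntimeError("server ignored the Range header")
        return resp.read()
```

For example, reading the first kilobyte of a snapshot file would be `read_byte_range(url, 0, 1024)`, regardless of whether the file behind `url` is 34 TB.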
- In yt.hub
- DarkSky data is now available via yt.hub, which enables analysis
- coLaboratory plugin
- "A Chrome App for running a port of IPython to PNaCl, in a notebook environment with Google Drive Integration."
- darksky_catalog
- https://bitbucket.org/darkskysims/darksky_catalog
- "This is a python package designed to make access to the Dark Sky Simulations simple and straightforward."
- thingking
- "page-caching system capable of injecting raw binary data from the WWW into locally running Python sessions"
Example: Blue Waters Data Sharing Service
- DSS Pilot ended in 2016?
- "At the 2014 Blue Waters Symposium and in other conversations, the science and engineering partners identified the ability to share their data related to massive supercomputing projects with the broader scientific community as an important service for the future."
- Two classes of sharing based on the needs of the partners and data:
- Active data sharing for projects with current allocations on Blue Waters
- Community sharing plan for data produced by prior projects.
- Prototype DSS for Blue Waters
- The service will allow supercomputer users to share their research data with colleagues who do not have access to the supercomputer.
- Users from each group can share data using:
- Globus Online's sharing capabilities
- A web service interface
- Projects (PIs) can submit a service request, which is really just a means for us to help the teams better prepare their data for distribution
- Obtain a Digital Object Identifier (DOI) for the data set.
- The shared data also counts toward the science team’s storage limit on the Blue Waters sub-storage systems.
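Minting a DOI in practice means registering a small metadata record with an agency such as DataCite alongside the landing-page URL. A sketch of the kind of record involved, with all field values hypothetical (the real schema is DataCite's, and the fields shown are only the commonly required ones):

```python
# Hypothetical DataCite-style metadata record for a shared Blue Waters
# dataset; a registration service would pair this with the DOI and the
# landing-page URL. All values below are placeholders.
dataset_metadata = {
    "creators": [{"name": "Example Project Team"}],
    "title": "Example Blue Waters simulation output",
    "publisher": "Blue Waters Data Sharing Service",
    "publicationYear": 2016,
    "resourceType": "Dataset",
}
```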