Documenting a few discussions we've had this week trying to contextualize the SC16 demo:
What does it mean to support publishing data?
- A publishing workflow for the user
- A unique identifier (e.g., DOI) to facilitate identification and citation
- A "landing page" (e.g., entry in a repository/catalog) describing the dataset
- Sufficient metadata to facilitate selection for re-use
- Ability for other users to access the data for re-use, possibly requiring authentication/authorization
- Data management plan support
What does it mean to be a data repository?
- Commitment to ongoing stewardship and preservation of the data to facilitate future discovery, access, and re-use:
- See the Trustworthy Repositories Audit & Certification Checklist
- Includes everything from governance, organization, financing, staffing, technology
- Commitment to preservation/archiving standards, such as OAIS
Why is "big data" different?
- Traditional community and institutional repositories can't handle the data volume, and traditional preservation methods won't work
- Presents new challenges in re-use and access
- For example, users can't easily download or copy the data to local facilities for analysis
- Access and re-use in these environments is often complex and domain-specific (i.e., may require specialized applications and techniques).
- Presents new challenges in preservation
- LOCKSS-style replication might not work at these volumes, but keeping the only copy at a single local site is problematic in the event of a disaster.
Example: SC16 demo case:
- Supercomputing and campus research computing centers facilitate computational research that generates data that is too large for traditional repositories and often too expensive to reproduce. For example, a large simulation on the Blue Waters system might be reproducible in principle, but at this time reproducing the simulation is cost prohibitive. This is similar to the situation faced by CERN.
- SC centers are in a unique position (and have a need) to provide a local repository that conforms to community standards, supporting discovery/access/re-use/preservation without necessarily moving the data to a new location.
- The NDSC can recommend a set of models and toolkits to enable SCs and research computing centers to become data repositories. For example:
- Tools to preserve and ensure the integrity of large datasets, conforming to archive/repository community standards
- Tools and techniques to enable discovery, access, and re-use of these datasets. For example:
- Access to HPC environments
- The ability to host dataset-specific applications/interfaces; for example – Python notebooks
- Possibly a preservation network, ensuring that large datasets are available at more than one location
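As a concrete sketch of the first recommendation above (integrity tools), chunked fixity checksums scale to files far too large to hash in memory, and a checksum manifest supports later audits. The 64 MiB chunk size and the two-column manifest format below are illustrative assumptions (similar in spirit to a BagIt payload manifest), not a community standard:

```python
import hashlib

def fixity_checksum(path, algorithm="sha256", chunk_size=64 * 1024 * 1024):
    """Compute a checksum of a (possibly very large) file in fixed-size
    chunks, so memory use stays constant regardless of file size."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(paths, manifest_path):
    """Record checksum/filename pairs for each file in the dataset,
    enabling periodic fixity audits of the archived copies."""
    with open(manifest_path, "w") as out:
        for path in paths:
            out.write(f"{fixity_checksum(path)}  {path}\n")
```

Re-running the checksums against the stored manifest (here and at any replica sites) is the basic audit step that repository certification checklists expect.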
Example: CERN
- CMS released a 300 TB open dataset via their Open Data Portal
- They've started work on the CERN Analysis Preservation Framework as well as long-term data preservation strategies – including discussion of ISO-16363 certification (trustworthy repositories)
Example: TERRA-REF
- The ARPA-E TERRA-REF project will produce hundreds of TB of data and a specialized infrastructure for re-use. However, it's not clear how they will address the preservation of this data over the long term. There are likely many "big data" applications that would benefit from guidance on how to address the problem of long-term access and preservation.
Example: Galaxy Cluster Merger Catalog (GCMC)
- This is a website dedicated to a set of simulations
- It is a browsable catalog with links to downloadable files and summary pages with generated visualizations
- http://gcmc.hub.yt/sloshing/R5_b500/0000.html
- It is also possible to launch Jupyter notebooks
- Notebooks are general purpose (not a specific notebook for the dataset)
Example: Renaissance simulations
- http://www.ncsa.illinois.edu/news/story/blue_waters_simulations_suggest_there_are_fewer_faint_galaxies_than_expecte
- "Because these simulations are so costly to generate, the team moved the entire output of the Renaissance Simulations to SDSC Cloud—some 100 terabytes of data, or the equivalent of about 150,000 audio compact discs." "A data access portal is being set up so that others can investigate their properties in more detail."
Example: Dark Sky simulations
http://darksky.slac.stanford.edu/edr.html
- Data release:
- The raw data is available simply via HTTP with some instructions for use
- Data publication: http://arxiv.org/abs/1407.2600
- Data release page: http://darksky.slac.stanford.edu/data_release/
- "Just beware of clicking on those 34 TB files!"
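Plain-HTTP access like this pairs naturally with HTTP range requests, which let a user read a small slice of a multi-terabyte file without downloading it (this is essentially what tools like thingking build on). A minimal sketch; the server must honor the `Range` header, and any URL used would be one of the actual data-release files:

```python
import urllib.request

def range_header(start, length):
    """Build the HTTP Range header value for bytes [start, start+length)."""
    return f"bytes={start}-{start + length - 1}"

def read_byte_range(url, start, length):
    """Fetch only a slice of a remote file via an HTTP Range request,
    avoiding a full download of a multi-terabyte dataset."""
    req = urllib.request.Request(url, headers={"Range": range_header(start, length)})
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:  # 206 Partial Content: the range was honored
            raise RuntimeError("server ignored the Range header")
        return resp.read()
```

For example, reading the first kilobyte of a snapshot file would be `read_byte_range(url, 0, 1024)`, regardless of whether the file behind `url` is 34 TB.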
- In yt.hub
- DarkSky data is now available via yt.hub, which enables analysis
- coLaboratory plugin
- "A Chrome App for running a port of IPython to PNaCl, in a notebook environment with Google Drive Integration."
- darksky_catalog
- https://bitbucket.org/darkskysims/darksky_catalog
- "This is a python package designed to make access to the Dark Sky Simulations simple and straightforward."
- thingking
- "page-caching system capable of injecting raw binary data from the WWW into locally running Python sessions"
Example: Blue Waters Data Sharing Service
- DSS Pilot ended in 2016?
- "At the 2014 Blue Waters Symposium and in other conversations, the science and engineering partners identified the ability to share their data related to massive supercomputing projects with the broader scientific community as an important service for the future."
- Two classes of sharing based on the needs of the partners and data:
- Active data sharing for projects with current allocations on Blue Waters
- Community sharing plan for data produced by prior projects.
- Prototype DSS for Blue Waters
- The service will allow supercomputer users to share their research data with colleagues who do not have access to the supercomputer.
- Users from each group can share data using:
- Globus Online's sharing capabilities
- A web service interface
- Projects (PIs) can submit a service request, which is really just a means for us to help the teams better prepare their data for distribution
- Obtain a Digital Object Identifier (DOI) for the data set.
- The shared data also counts toward the science team’s storage limit on the Blue Waters sub-storage systems.
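Minting a DOI in practice means registering a small metadata record with an agency such as DataCite alongside the landing-page URL. A sketch of the kind of record involved, with all field values hypothetical (the real schema is DataCite's, and the fields shown are only the commonly required ones):

```python
# Hypothetical DataCite-style metadata record for a shared Blue Waters
# dataset; a registration service would pair this with the DOI and the
# landing-page URL. All values below are placeholders.
dataset_metadata = {
    "creators": [{"name": "Example Project Team"}],
    "title": "Example Blue Waters simulation output",
    "publisher": "Blue Waters Data Sharing Service",
    "publicationYear": 2016,
    "resourceType": "Dataset",
}
```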