The SEAD model for publishing involves three components (more information is available on other wiki pages and in SEAD publications/presentations):

  • Secure, active Project Spaces in which groups work during their research projects, building collections of data files, annotations, and metadata (through a web interface or via a researcher's software writing directly to the project space via the SEAD API),
  • Publishing Services that cache, analyze, and manage the submission of publication requests from researchers to repositories partnering with SEAD. These services use information about repositories and their policies (submitted by the repository), about people and their affiliations (harvested from sources such as ORCID), and about the contents of publication requests (submitted by researchers from their project spaces or from third-party infrastructure) to 'match' requests with compatible repositories. Matching is based on a configurable set of rules that may involve limits on total size, maximum file size, file types, affiliations of authors, and other metadata and statistics (see the sketch following this list). In SEAD's 2.0 version, researchers can review the results of this assessment, decide whether to adjust their publication requests (adding required metadata, removing unacceptable files, etc.), and then submit to a repository of their choice. SEAD's publishing services then make such requests available for repositories to discover and retrieve, and track the status of the repository's processing of a request through to final completion and the assignment of a persistent identifier (e.g. a DOI).
  • Repositories partnering with SEAD to acquire and preserve data publications that meet their institutional interests. Repositories may range from institutional repositories with rich services (e.g. based on DSpace or Fedora4) to repositories with fewer features but lower costs and/or higher scalability.
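
A rough sketch of how such rule-based matching could work (hypothetical code, not SEAD's actual rule engine; the class names and profile fields are invented for illustration):

// Hypothetical sketch of repository/request matching; all names are invented.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class RepositoryMatcher {

    // Illustrative repository policy, as a repository might submit it.
    record RepositoryProfile(String name, long maxTotalBytes, long maxFileBytes,
                             Set<String> acceptedMimeTypes) {}

    // Illustrative summary of a publication request from a Project Space.
    record PublicationRequest(long totalBytes, long largestFileBytes,
                              Set<String> mimeTypes) {}

    // Returns the repositories whose policies the request satisfies.
    static List<RepositoryProfile> match(PublicationRequest req,
                                         List<RepositoryProfile> repos) {
        List<RepositoryProfile> compatible = new ArrayList<>();
        for (RepositoryProfile repo : repos) {
            boolean sizeOk = req.totalBytes() <= repo.maxTotalBytes()
                    && req.largestFileBytes() <= repo.maxFileBytes();
            boolean typesOk = repo.acceptedMimeTypes().containsAll(req.mimeTypes());
            if (sizeOk && typesOk) {
                compatible.add(repo);
            }
        }
        return compatible;
    }
}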

...

  • SEAD provides a RESTful API (see the SEAD 2.0 Publication API Walkthrough) that repositories can use to interact with SEAD and retrieve data publication requests. Repositories working with SEAD have demonstrated connections via this API to institutional repositories. SEAD has also worked to develop software that can efficiently and scalably package data publications for storage and retrieval on large file systems and cloud storage. All of this software is available for use and/or extension and modification by repositories interested in connecting to SEAD.
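
A minimal sketch of what polling this API from a repository might look like (the endpoint URL and response handling are assumptions for illustration; see the Publication API Walkthrough for the actual resource paths):

// Minimal sketch of a repository polling SEAD's publishing services for
// pending publication requests. The endpoint URL and JSON layout are
// assumptions, not SEAD's documented paths.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PublicationRequestPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical endpoint listing requests matched to this repository.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://sead.example.org/api/researchobjects"))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // A real repository connector would parse the JSON list here and then
        // retrieve each request's ORE map and data files in turn.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}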

The SEAD Reference Publisher

The SEAD Reference Publisher is a lightweight web application that can publish and provide access to data publications originating in SEAD. It consists of two interacting components that manage 1) the SEAD publication process, from retrieval of a request from SEAD's publishing services through packaging and storage to the submission of a persistent identifier back to SEAD, and 2) access to the stored data through a landing page associated with the persistent identifier minted for the data. The reference publisher serves several purposes:

  • It supports publication of the largest packages SEAD has handled to date (135K files, >160GB, individual files of 40+GB) and allows testing of SEAD's Spaces and publishing services at this scale.
  • It demonstrates good (minimalist) handling of SEAD publication requests in terms of minting DOIs with discovery metadata, packaging submissions in a standards-based format, and supporting ongoing access to the data and metadata.
  • It provides integrity checks of the metadata submitted with the request, verifying the output of publication sources such as SEAD's v1.5 and 2.0 spaces, and verifying that the contents of the final data package are consistent with the file lengths, hash values, total file counts and sizes, and hierarchical structure defined in the metadata (see the sketch following this list).
  • It provides the core publication/packaging capability used within the Indiana University SEAD Cloud repository, which offers SEAD users a place to publish data (current policies limit submissions to a maximum size of 1 GB).
  • It provides an out-of-the-box capability that can be directly deployed or used as a starting point by institutions wishing to partner with SEAD (initial deployments are being pursued at the University of Michigan in partnership with UM's Advanced Research Computing (ARC) organization and as part of a pilot effort with the National Data Service).
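
A minimal sketch of the integrity check mentioned above, verifying one entry of the final zip package against the length and SHA-1 hash recorded in the request metadata (class and method names are illustrative, not the SRP's actual code):

// Sketch: verify a zip entry against the expected length and SHA-1 hash.
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class PackageVerifier {

    // Returns true if the zip entry matches the expected length and SHA-1.
    static boolean verifyEntry(ZipFile zip, String entryName,
                               long expectedLength, String expectedSha1)
            throws Exception {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null || entry.getSize() != expectedLength) {
            return false;
        }
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        try (InputStream in = zip.getInputStream(entry)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sha1.update(buffer, 0, read);
            }
        }
        return HexFormat.of().formatHex(sha1.digest()).equalsIgnoreCase(expectedSha1);
    }
}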

...

The SRP RESTful web services and landing page were redesigned to support incremental, asynchronous loading of content. (The original design, which loaded the full JSON metadata file from the server, caused browser memory issues for files larger than ~25 MB.) Further performance improvements are being explored.
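
As an illustrative sketch only (not the SRP's actual code; the path and JSON shape are assumptions), an endpoint that returns one folder level at a time, so the browser never loads the full metadata file, might look like:

// Sketch of the incremental-loading idea: the server returns only the
// children of the folder the viewer has expanded. Path and JSON shape
// are assumptions for illustration.
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class IncrementalListingServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // e.g. GET /listing?folder=/data/site1 returns only that folder's entries
        server.createContext("/listing", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // "folder=..."
            // A real implementation would look up the folder in an index built
            // from the package metadata and serialize only its direct children.
            byte[] body = ("{\"folder\":\"" + query + "\",\"children\":[]}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}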

To support the largest publications, the publication script redirected Java to use the large 2TB permanent storage space for temporary files (since the 169GB package size exceeded the 50GB of system storage).

The Java Heap size also had to be increased to 4GB (2GB was not enough). Heap size does not have to be increased to support serving landing pages.
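
Concretely, the relevant JVM settings are the java.io.tmpdir property and the -Xmx heap flag; a sketch of the invocation (the jar name and tmp path are illustrative):

# Point temporary files at the large /sead_data volume and raise the heap
# to 4GB; the jar name below is illustrative.
java -Xmx4g -Djava.io.tmpdir=/sead_data/tmp -jar sead-reference-publisher.jar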

Publishing Performance:

On the UM-ARC-TS VM, retrieving metadata from Indiana University and content from a Project Space on SEAD's cluster at NCSA, we observed the following:

For a publication with 11625 files, 22GB:

  • <10 minutes to generate the request and package metadata within the originating Space (i.e. pulling metadata via queries from SEAD's 1.5 spaces database (RDF middleware over MySQL))
  • ~40 minutes for SEAD C3PR at Indiana to sequentially test all file URLs (probably avoidable/optimizable)
  • ~30 minutes to retrieve and process the full package at ARC-TS (from retrieval of the request to completion of the creation and storage of the ORE/BagIt/zip file)
  • ~20 minutes to sequentially test the SHA-1 hashes of all files (a full scan of the created package)

Based on the publication of the 135K-file/169GB package, processing at ARC is roughly linear with total size (~250 minutes for 113GB versus ~50 minutes for 20GB, i.e. roughly 2.2-2.5 minutes per GB).

Landing Page Performance:

Initial loading, downloads, and incremental updates of the folder hierarchy as viewers browse the table of contents appear to have reasonable performance: <1 s to start transfers, <1 s to show a folder with ~10 entries, slowing to a few seconds for folders with hundreds to 1K+ entries. The primary bottleneck is loading the initial information for the top level of the table of contents, which, for historical reasons, involves a synchronous transfer and multiple RESTful calls (one per top-level entry). Converting this to a single asynchronous ajax call should make performance reasonable even for the largest collections we have to date.

Discussion:

Overall, serving content appears to be lightweight: minimal CPU load and minimal memory. The load is clearly higher than serving static web pages, but the current VM should easily handle many simultaneous users.

Publishing and packaging is CPU and memory intensive. The largest package (135K files) has a ~158MB metadata file, and we currently download and parse that as one unit, which required giving Java a 4GB heap size. Once that file is parsed, we set up 135K HTTP transfers (one Java object in memory per entry). On the 2 cores, 8 threads looked ~optimal, and the CPU load reported by 'top' stayed at 150-200%. To handle the largest packages, I moved the /tmp storage for the parallel retrieval (8 files, one per thread storing the compressed results streaming in) onto the final storage volume (/sead_data). The bandwidth from NCSA to disk during the write phase was <=20MB/s (which includes the time for copying the 8 temp files into one final zip on the same file system, etc.). Since the write is one-time and we're primarily looking at smaller packages, this is probably fine, but it might be worth some discussion to see where the bottleneck(s) are and whether there are easy ones to remove, either at UM or NCSA.
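
A sketch of this parallel-retrieval pattern, a fixed pool of 8 workers streaming content to temporary files on the large volume (URLs and paths are illustrative; the SRP keeps one spool file per thread, while this sketch simplifies to one file per transfer):

// Sketch of parallel retrieval with a fixed pool of 8 worker threads.
// URLs, paths, and class names are illustrative assumptions.
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelRetriever {
    public static void main(String[] args) throws Exception {
        List<String> fileUrls = List.of(/* ~135K data file URLs from the request */);
        Path tmpDir = Path.of("/sead_data/tmp"); // temp files on the big volume
        ExecutorService pool = Executors.newFixedThreadPool(8); // ~optimal on 2 cores

        for (String url : fileUrls) {
            pool.submit(() -> {
                Path target = tmpDir.resolve(Integer.toHexString(url.hashCode()));
                try (InputStream in = URI.create(url).toURL().openStream()) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                } catch (Exception e) {
                    System.err.println("Failed: " + url + " " + e);
                }
                return null;
            });
        }
        pool.shutdown(); // a real run would await termination, then merge into the zip
    }
}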

Further work: 

Some minimal additional work to remove the synchronous RESTful calls in the landing page should be all that's needed to provide reasonable performance on the existing VM. Beyond that, there are a number of areas in management/monitoring where we may want to adjust the machine configuration and/or create some tools:

The publication script could be automated (e.g. as a cron job), and we could leverage the policy-enforcement additions made at Indiana to process only compliant requests, to alert an operator (e.g. via email) when a non-compliant request is submitted, and/or to allow out-of-band communication with users regarding publications.
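
For example (the script path, log location, and schedule are illustrative assumptions), an hourly crontab entry could look like:

# Illustrative crontab entry: poll for and process new publication requests
# hourly; the script path and log location are assumptions.
0 * * * * /opt/sead/bin/publish.sh >> /var/log/sead/publish.log 2>&1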

Test DOIs expire in two weeks, so it is useful to remove test publications older than 2 weeks, which requires removing the zip file (and the two associated cache files). This could be manual or on a timer. If we decide to support test and 'real' publications in the same instance, the cleanup script would need to read the DOI information in the metadata to ensure that only data with test DOIs are deleted.
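
A sketch of such a cleanup (the directory layout, cache-file suffixes, and the test-DOI check are illustrative assumptions):

// Sketch: delete publication zips (and their two cache files) older than
// 14 days. Layout, suffixes, and the DOI check are assumptions.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TestPublicationCleaner {
    public static void main(String[] args) throws IOException {
        Path pubDir = Path.of("/sead_data/publications"); // assumed layout
        Instant cutoff = Instant.now().minus(14, ChronoUnit.DAYS);

        try (DirectoryStream<Path> zips = Files.newDirectoryStream(pubDir, "*.zip")) {
            for (Path zip : zips) {
                boolean old = Files.getLastModifiedTime(zip).toInstant().isBefore(cutoff);
                if (old && isTestDoi(zip)) {
                    Files.deleteIfExists(zip);
                    // Hypothetical names for the two associated cache files.
                    Files.deleteIfExists(Path.of(zip + ".index"));
                    Files.deleteIfExists(Path.of(zip + ".info"));
                }
            }
        }
    }

    static boolean isTestDoi(Path zip) {
        // Placeholder: a real version would read the DOI from the stored
        // package metadata and check for a test prefix before deleting.
        return true;
    }
}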

The SRP supports Google Analytics and should soon provide information about both views of the landing page and any subsequent downloads of individual files or the zipped publication. We may want to configure this to send information to SEAD or ARC.

If we want an overall home landing page and/or search capability across publications, we should be able to adapt code developed at Indiana. (Conversely, some of the performance improvements made in testing at UM may be usable at Indiana).