...

The SRP does not currently provide a home page listing published data sets, nor any search interface. The IU SEAD Cloud has demonstrated one way this may be accomplished (using an alternate landing-page mechanism while sharing most of the publication library with the SRP). The SRP also does not track download statistics, but the means to optionally report such information to Google Analytics is being added.

Deployment

Initial Notes (Not organized):

Here are a few DOIs that resolve to data published using the SEAD Reference Publisher on a virtual machine at the University of Michigan's ARC-TS. They range from a few files to very many. If you look at the larger ones, you'll see a performance issue that is being resolved: there is a synchronous first step to load the top-level files and folders in the download table, which can take from 30 seconds to over a minute. If you wait for that, you'll see that subsequent steps to open a folder and see its contents work much better, taking only a few seconds even when there are hundreds of items in a single folder. Downloads of any given file from within the publication are also very responsive, despite the fact that only a zip of the entire package exists in the file system.

These examples are from the National Center for Earth Surface Dynamics and were published from the NCED Space (SEAD runs on a cluster at NCSA; see https://nced.ncsa.illinois.edu/acr/#discovery, which loads slowly):

 

3 files, 1.77 MB → 557 KB zipped: http://dx.doi.org/10.5072/FK22N54B43

11625 files, 22 GB → 20 GB zipped: http://dx.doi.org/10.5072/FK2NG4NK6S

 

At the University of Michigan, the SRP is being tested on a virtual machine provided by Advanced Research Computing. The machine runs CentOS 7 with 2 cores, 4 GB RAM, a 50 GB system disk, and 2 TB of data publication storage. The machine is firewalled to allow public access on port 80. Yum was used to install Tomcat, Nginx, unzip, and audit2allow. Nginx was configured to forward port 80 requests to Tomcat on port 8080, and audit2allow was used to generate the SELinux policy that allows Nginx to manage and forward port 80 traffic. The SRP was uploaded as a single war file that was unzipped and deployed to Tomcat. It was configured to use 8 threads, to mint "test" DOIs (which only work for two weeks but require no EZID credentials), to store data on the large 2 TB partition, and to use SEAD's test publication services. Log4j was configured to provide a daily rotating log for the SRP. A simple publication script was created that runs as "tomcat" and uses the class files and libraries within the webapp to publish data. A profile for the UM-ARC-TS repository has been added to SEAD's test publication services (http://seadva-test.d2i.indiana.edu/sead-c3pr/api/repositories/UM-ARC-TS).
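The reverse-proxy piece of this setup might look like the following Nginx server block. This is an illustrative sketch, not the VM's actual configuration file, and the SELinux boolean shown in the comment is one common alternative to a module built with audit2allow:

```nginx
# /etc/nginx/conf.d/srp.conf -- illustrative sketch only.
# On CentOS 7 with SELinux enforcing, the proxy connection also needs policy,
# e.g.: setsebool -P httpd_can_network_connect 1
# (or a custom module generated with audit2allow, as described above).
server {
    listen 80;
    server_name _;

    location / {
        # Forward public port-80 requests to Tomcat on port 8080
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```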

Performance

This SRP instance has now been used to publish data packages ranging from 3 files / 1.77 MB up to one including 135K files / 169 GB total.

The following examples are from the National Center for Earth Surface Dynamics and were published from the NCED Space (SEAD runs on a cluster at NCSA; see the Published Data page at https://nced.ncsa.illinois.edu/acr/#discovery). Since they use test DOIs, these links will expire in ~2 weeks, but we anticipate publishing further tests and/or long-term publications that would be listed on the NCED Space Published Data page.

 

3 files, 1.77 MB → 557 KB zipped: http://dx.doi.org/10.5072/FK22N54B43

29431 files, 31.2 GB → 30.8 GB zipped: http://dx.doi.org/10.5072/FK2930TN6Q

135K files, 169 GB → 113 GB zipped: http://dx.doi.org/10.5072/FK2XK89H7D

Publication Processing Time:

3 files, 1.77 MB → 557 KB zipped: http://dx.doi.org/10.5072/FK22N54B43

11625 files, 22 GB → 20 GB zipped: http://dx.doi.org/10.5072/FK2NG4NK6S

29431 files, 31.2 GB → 30.8 GB zipped: http://dx.doi.org/10.5072/FK2930TN6Q

135K files, 169 GB → 113 GB zipped: http://dx.doi.org/10.5072/FK2XK89H7D

 

Several issues have been addressed to successfully publish the larger collections: 

SEAD Publication services required a larger Java heap size to successfully serve the largest metadata files (which were >150 MB).

The SEAD Publication services' check of data file URLs (which verifies that they are accessible) was modified to handle 0-byte files and intermittent failures (we observed a few timeouts when retrieving files, which we believe were caused by heavy load on the originating Project Space unrelated to the publication request), and to improve performance.

The SRP implemented a URL retry to address the possibility of intermittent GET timeouts. (After implementing it, we saw no need for retries in publishing >160K files.)
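The SRP's actual retry is implemented in its Java code; the following shell sketch just illustrates the pattern (the function name, attempt count, and backoff are assumptions): re-issue a failed GET a bounded number of times with a short backoff.

```shell
#!/bin/sh
# Illustrative sketch of the URL-retry pattern described above.

# retry MAX CMD...: run CMD until it succeeds or MAX attempts are exhausted.
retry() {
    max=$1; shift
    n=0
    until "$@"; do
        n=$((n + 1))
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $max attempts" >&2
            return 1
        fi
        sleep "$n"   # simple linear backoff: 1s, 2s, ...
    done
}

# Example: fetch one data file URL with up to 3 attempts.
# retry 3 curl --fail --silent --max-time 60 -o local.dat "$FILE_URL"
```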

The SRP RESTful web services and landing page were redesigned to support incremental, asynchronous loading of content. (The original design, which loaded the full JSON metadata file from the server, caused browser memory issues for files larger than ~25 MB.) Further performance improvements are being explored.
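The incremental-loading idea can be illustrated as follows. Here `fetch_children` is a hypothetical stand-in for the asynchronous per-folder HTTP request (e.g. a curl call to a per-folder endpoint); the names and the toy data are assumptions, not the SRP's real API:

```shell
#!/bin/sh
# Hypothetical illustration of incremental loading: instead of fetching one
# huge JSON metadata document, the page requests only the direct children of
# a folder when the user opens it.

fetch_children() {
    # Stand-in for the real asynchronous HTTP call; toy data only.
    case $1 in
        root)    echo "folderA folderB readme.txt" ;;
        folderA) echo "data1.csv data2.csv" ;;
        *)       echo "" ;;
    esac
}

# open_folder: what happens when a user expands one node -- a single small
# request for that folder's contents, independent of total package size.
open_folder() {
    for child in $(fetch_children "$1"); do
        echo "$1/$child"
    done
}
```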

Publishing Performance:

On the UM-ARC-TS VM, retrieving metadata from Indiana University and content from a Project Space on SEAD's cluster at NCSA, we observed the following:

 

For a publication with 11625 files, 22GB:

<10 minutes to generate the request and package metadata, pulling from within the originating Space (i.e., retrieving metadata via queries against SEAD's 1.5 Spaces database, an RDF middleware layer over MySQL)

~40 minutes for SEAD C3PR at Indiana to sequentially test all file URLs (probably avoidable/optimizable)

~30 minutes to retrieve and process the full package @ ARC-TS (from retrieval of the request to completion of the creation and storage of the ORE/BagIt/zip file)

~20 minutes to sequentially test the SHA1 hashes of all files (a full scan of the created ORE/BagIt/zip file)

 

Processing @ ARC-TS is roughly linear with total size (~250 minutes for 113 GB, versus ~50 minutes for 20 GB).
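The sequential SHA1 check mentioned in the timing list can be sketched as a scan against the package's BagIt manifest. This is a minimal sketch assuming the standard BagIt layout, where manifest-sha1.txt lists "SHA1  path" pairs; the SRP's internal verification code may differ:

```shell
#!/bin/sh
# Minimal sketch of post-publication fixity checking, assuming a standard
# BagIt layout (manifest-sha1.txt in the bag root listing "SHA1  path" pairs).

# verify_bag DIR: recompute the SHA1 of every file listed in the bag's
# manifest and compare against the recorded value; nonzero exit on mismatch.
verify_bag() {
    # sha1sum -c reads "HASH  path" lines, which is exactly the BagIt
    # manifest format, and recomputes each file's digest.
    ( cd "$1" && sha1sum -c manifest-sha1.txt )
}
```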

...