
The SRP RESTful web services and landing page were redesigned to support incremental, asynchronous loading of content. (The original design, which loaded the full JSON metadata file from the server, caused browser memory issues for metadata files larger than ~25 MB.) Further performance improvements are being explored.
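
As an illustration of the incremental approach, a JAX-RS-style endpoint can return just the children of one folder per request, so the browser never has to hold the full multi-MB metadata document. This is a minimal sketch only; the paths, parameter names, and the stub lookup are assumptions, not the actual SRP API.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical JAX-RS resource: one small JSON response per folder, fetched on
// demand as the viewer expands the table of contents, instead of shipping the
// full (potentially >25 MB) metadata document to the browser in one response.
@Path("/researchobjects/{id}")
public class FolderResource {

    @GET
    @Path("/children/{folderId}")
    @Produces(MediaType.APPLICATION_JSON)
    public List<Map<String, Object>> getChildren(@PathParam("id") String id,
                                                 @PathParam("folderId") String folderId) {
        // Stand-in for a lookup against however the SRP indexes the package
        // metadata server-side (e.g. a parsed ORE map or a small database);
        // each entry would carry just the name, type, and id of one child.
        List<Map<String, Object>> children = new ArrayList<>();
        children.add(Map.of("id", folderId + "/example", "name", "example.txt", "isFolder", false));
        return children;
    }
}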


Size requirements:


To support the largest publications, the publication script redirected Java to use the large 2TB permanent storage space for temporary files (since the 169GB package size exceeded the 50GB of system storage).

The Java heap size also had to be increased to 4GB (2GB was not enough). The heap size does not need to be increased to support serving landing pages.
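
A tiny sketch of how those two settings could be confirmed at startup. The -Xmx and -Djava.io.tmpdir flags in the comment are standard JVM options; the jar name and temp-directory path are placeholders, not the actual deployment values.

// Launch example (flags are standard JVM options; names are placeholders):
//   java -Xmx4g -Djava.io.tmpdir=/sead_data/tmp -jar srp.jar
public class JvmConfigCheck {
    public static void main(String[] args) {
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        String tmpDir = System.getProperty("java.io.tmpdir");
        System.out.println("Max heap: " + maxHeapMb + " MB (the largest packages needed ~4GB)");
        System.out.println("Temp dir: " + tmpDir + " (should sit on the 2TB permanent volume)");
    }
}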

Publishing Performance:

On the UM-ARC-TS VM, retrieving metadata from Indiana University and content from a Project Space on SEAD's cluster at NCSA, we observed the following:

 

For a publication with 11,625 files (22GB):

  • <10 minutes to generate the request and package metadata within the originating Space (i.e., pulling metadata via queries from SEAD's 1.5 Spaces database, RDF middleware over MySQL)
  • ~40 minutes for SEAD C3PR at Indiana to sequentially test all file URLs (probably avoidable/optimizable)
  • ~30 minutes to retrieve and process the full package at ARC-TS (from retrieval of the request to completion of the creation and storage of the ORE/BagIt/zip file)
  • ~20 minutes to sequentially test the SHA1 hashes of all files (a full scan of the created zip; see the sketch below)

 

Based on the 135K-file/169GB publication, processing at ARC-TS appears roughly linear in total size (~250 minutes for 113GB versus ~50 minutes for 20GB).
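
A rough sketch of the final SHA1 verification step listed above: stream every entry of the created package once and compare its digest against the expected value. The expected hashes are passed in as a map here; in the real pipeline they would come from the ORE/BagIt manifest, and this class name is illustrative.

import java.io.InputStream;
import java.security.MessageDigest;
import java.util.Enumeration;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Scans the created zip sequentially, computing SHA1 per entry and reporting
// any mismatch against the expected hash supplied by the caller.
public class Sha1Verifier {

    public static void verify(String zipPath, Map<String, String> expectedSha1ByPath) throws Exception {
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !expectedSha1ByPath.containsKey(entry.getName())) {
                    continue;
                }
                MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
                try (InputStream in = zip.getInputStream(entry)) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        sha1.update(buf, 0, n);
                    }
                }
                String actual = toHex(sha1.digest());
                if (!actual.equalsIgnoreCase(expectedSha1ByPath.get(entry.getName()))) {
                    System.err.println("SHA1 mismatch: " + entry.getName());
                }
            }
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}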

Landing Page Performance:

Initial loading, downloads, and incremental updates of the folder hierarchy as viewers browse the table of contents all appear to have reasonable performance: <1 second to start transfers, <1 second to show a folder with ~10 entries, slowing to a few seconds for folders with hundreds to 1K+ entries. The primary bottleneck is loading the initial information for the top level of the table of contents, which, for historical reasons, involves a synchronous transfer and multiple RESTful calls (one per top-level entry). Converting this to a single asynchronous Ajax call should make performance reasonable even for the largest collections we have to date.
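
On the server side, the proposed fix amounts to one aggregate resource that returns all top-level entries in a single JSON response, which the landing page can then request with one asynchronous call instead of one synchronous call per entry. This is a sketch under assumed names, not the existing SRP endpoint.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical aggregate endpoint: one round trip returns the whole top level
// of the table of contents, rather than N separate per-entry calls.
@Path("/researchobjects/{id}/toplevel")
public class TopLevelResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<Map<String, Object>> getTopLevel(@PathParam("id") String id) {
        List<Map<String, Object>> entries = new ArrayList<>();
        // Stand-in loop: the real service would iterate over the package's
        // top-level aggregation and emit name/type/size for each entry.
        for (String entryId : List.of("data", "docs")) {
            entries.add(Map.of("id", entryId, "name", entryId, "isFolder", true));
        }
        return entries;
    }
}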

Discussion:

Overall, serving content appears to be lightweight – minimal CPU load and minimal memory. It is clearly heavier than serving static web pages, but the current VM should easily handle many simultaneous users.

Publishing and packaging is CPU and memory intensive. The largest package (135K files) has a ~158MB metadata file, and we currently download and parse that as one unit, which required giving Java a 4GB heap size. Once that file is parsed, we set up 135K HTTP transfers (one Java object in memory per entry). On the 2 cores, 8 threads looked ~optimal, and the CPU load reported by 'top' stayed at 150-200%. To handle the largest packages, I moved the /tmp storage for the parallel retrieval (8 files, one per thread, storing the compressed results as they stream in) onto the final storage (/sead_data). The bandwidth from NCSA to disk during the write phase was <=20MB/s (which includes the time for copying the 8 tmp files into one final zip on the same file system, etc.). Since the write is one-time and we're primarily looking at smaller packages, this is probably fine, but it might be worth some discussion to see where the bottleneck(s) are and whether there are easy ones to remove, either at UM or NCSA.
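
A stripped-down sketch of that retrieval stage: a fixed pool of 8 threads pulls content over HTTP, each download is spooled to a temporary file on the large volume rather than the small system /tmp, and a single pass then writes the final zip. The real code streams compressed results into one file per thread, so the merge step here is simplified; class and path names are illustrative.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative packager: 8 worker threads (the count that looked optimal on the
// 2-core VM) download content in parallel, spooling to temp files placed on the
// large volume, then one writer adds the spooled files to the final zip.
public class ParallelPackager {

    private static final int THREADS = 8;

    public static void packageFiles(Map<String, String> urlByEntryName,
                                    Path spoolDir, Path finalZip) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        try {
            List<String> names = new ArrayList<>(urlByEntryName.keySet());
            List<Future<Path>> downloads = new ArrayList<>();
            for (String name : names) {
                String url = urlByEntryName.get(name);
                downloads.add(pool.submit(() -> {
                    // Spool each download to the large volume, not system /tmp.
                    Path tmp = Files.createTempFile(spoolDir, "srp-", ".part");
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    client.send(request, HttpResponse.BodyHandlers.ofFile(tmp));
                    return tmp;
                }));
            }
            try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(finalZip))) {
                for (int i = 0; i < names.size(); i++) {
                    Path tmp = downloads.get(i).get(); // waits for that download to finish
                    zip.putNextEntry(new ZipEntry(names.get(i)));
                    Files.copy(tmp, zip);
                    zip.closeEntry();
                    Files.delete(tmp);
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}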

Further work: 

Some minimal additional work to remove the synchronous RESTful calls in the landing page should be all that's needed to provide reasonable performance on the existing VM. Beyond that, there are a number of areas in management/monitoring where we may want to adjust the machine configuration and/or create some tools:

The publication script could be automated (e.g., as a cron job), and we could leverage the policy-enforcement additions made at Indiana to only process compliant requests, to alert an operator (e.g., via email) when a non-compliant request is submitted, and/or to allow out-of-band communication with users regarding publications.
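
If we go the in-process timer route rather than cron, a minimal sketch could look like the following; processPendingRequests and the 15-minute interval are placeholders for the C3PR/policy integration and whatever cadence we settle on.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative scheduler: periodically poll for new publication requests and
// flag non-compliant ones for an operator; the actual policy check and email
// notification would come from the policy-enforcement additions at Indiana.
public class PublicationPoller {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(PublicationPoller::processPendingRequests, 0, 15, TimeUnit.MINUTES);
    }

    private static void processPendingRequests() {
        // Placeholder: fetch pending requests from C3PR, run the compliant ones
        // through the packaging pipeline, and alert an operator (e.g. by email)
        // about any request that fails the policy check.
        System.out.println("Checking for pending publication requests...");
    }
}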

Test DOIs expire after two weeks, so it is useful to remove test publications older than two weeks, which requires removal of the zip file (and the two associated cache files). This could be done manually or on a timer. If we decide to support test and 'real' publications in the same instance, this script would need to read the DOI information in the metadata to ensure that only data with test DOIs is deleted.
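
A sketch of such a cleanup pass, under these assumptions: publications live as zip files in one directory, the two associated cache files can be derived from the zip name (the naming here is a placeholder), and isTestDoi stands in for reading the DOI out of the package metadata.

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative cleanup: delete test publications (zip plus associated cache
// files) that are older than the two-week lifetime of a test DOI.
public class TestPublicationCleaner {

    public static void clean(Path publicationDir) throws IOException {
        Instant cutoff = Instant.now().minus(14, ChronoUnit.DAYS);
        try (DirectoryStream<Path> zips = Files.newDirectoryStream(publicationDir, "*.zip")) {
            for (Path zip : zips) {
                boolean oldEnough = Files.getLastModifiedTime(zip).toInstant().isBefore(cutoff);
                if (oldEnough && isTestDoi(zip)) {
                    Files.deleteIfExists(zip);
                    // The two associated cache files: names are placeholders,
                    // derived from the zip name in whatever way the SRP uses.
                    Files.deleteIfExists(siblingCacheFile(zip, 1));
                    Files.deleteIfExists(siblingCacheFile(zip, 2));
                }
            }
        }
    }

    private static boolean isTestDoi(Path zip) {
        // Placeholder: read the DOI from the package metadata and check whether
        // it uses a test prefix, so 'real' publications are never removed.
        return false;
    }

    private static Path siblingCacheFile(Path zip, int index) {
        return zip.resolveSibling(zip.getFileName() + ".cache" + index);
    }
}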

The SRP supports Google Analytics and should soon provide information both about views of the landing page and about any subsequent downloads of the individual files or the zipped publication. We may want to configure this to send information to SEAD or ARC.