
Initial Notes (Not organized):

 

Here are a few DOIs that resolve to data published using the SEAD Reference Publisher on a virtual machine at the University of Michigan's ARC-TS. They range from a few files to very many. If you look at the larger ones, you'll see a performance issue that is being resolved: there is a synchronous first step to load the top-level files and folders in the download table, which can take 30 seconds to more than a minute. If you wait for that, subsequent steps to open a folder and see its contents work much better, taking only a few seconds even when there are hundreds of items in a single folder. Downloads of any given file from within the publication are also very responsive, even though only a zip of the entire package exists in the file system.
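Part of why single-file downloads stay responsive is that the zip format's central directory makes it cheap to locate and stream one entry without unpacking the archive. A minimal Java sketch of that pattern follows; it is illustrative only (not the SEAD Reference Publisher's actual code), and the archive path and entry name are made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class SingleFileFromZip {

    /**
     * Streams one entry out of a package zip without expanding the archive.
     * ZipFile reads the central directory, so locating an entry is cheap
     * even when the archive holds tens of thousands of files.
     */
    public static void streamEntry(String zipPath, String entryName, OutputStream out)
            throws IOException {
        try (ZipFile zip = new ZipFile(zipPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                in.transferTo(out); // copy decompressed bytes to the caller
            }
        }
    }

    // Example: write one data file from a (hypothetical) published package to local disk.
    public static void main(String[] args) throws IOException {
        try (OutputStream out = Files.newOutputStream(Paths.get("observation.csv"))) {
            streamEntry("/sead_data/package.zip", "data/observation.csv", out);
        }
    }
}
```

In a servlet, the OutputStream would simply be the HTTP response stream, so the selected file is decompressed and streamed on the fly while the package stays stored as a single zip.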

3 files, 1.77 MB → 557 KB zipped, http://dx.doi.org/10.5072/FK22N54B43

11,625 files, 22 GB → 20 GB zipped, http://dx.doi.org/10.5072/FK2NG4NK6S

29,431 files, 31.2 GB → 30.8 GB zipped, http://dx.doi.org/10.5072/FK2930TN6Q

135K files, 169 GB → 113 GB zipped, http://dx.doi.org/10.5072/FK2XK89H7D

 

 

Publication Processing Time:

11,625 files, 22 GB:

<10 minutes to generate request and package metadata, pulling from SEAD’s 1.5 spaces database (RDF middleware over MySQL)

~40 minutes for SEAD C3PR at Indiana to sequentially test all file URLs (probably avoidable/optimizable; a parallel URL-check sketch follows the timing notes below)

~30 minutes to retrieve and process full package @ ARC-TS

~20 minutes to sequentially test SHA1 hashes of all files
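For reference, the hash-verification step amounts to streaming each retrieved file through a SHA-1 digest and comparing the result against the manifest value. The sketch below is illustrative only; the manifest format shown (a map of file path to expected hex digest) is an assumption, not the publisher's actual data structure.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;

public class Sha1Check {

    /** Computes the SHA-1 of a file by streaming it through a MessageDigest. */
    static String sha1Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }

    /** Sequentially checks each file against its expected hash; returns true if all match. */
    static boolean verifyAll(Map<Path, String> expected)
            throws IOException, NoSuchAlgorithmException {
        boolean ok = true;
        for (Map.Entry<Path, String> e : expected.entrySet()) {
            String actual = sha1Of(e.getKey());
            if (!actual.equalsIgnoreCase(e.getValue())) {
                System.err.println("Hash mismatch: " + e.getKey());
                ok = false;
            }
        }
        return ok;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical manifest entry: file path and its expected SHA-1 hex digest.
        Map<Path, String> manifest = Map.of(
                Path.of("/sead_data/tmp/data/observation.csv"),
                "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12");
        System.out.println(verifyAll(manifest) ? "all hashes match" : "mismatches found");
    }
}
```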

 

Processing @ ARC-TS is roughly linear with total size (~250 minutes for 113 GB, versus ~50 minutes for 20 GB)
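The sequential URL test (the ~40 minute C3PR step above) looks like the easiest win, since issuing the checks concurrently should cut that time substantially. A hedged sketch using HEAD requests from Java's HttpClient follows; the URL list is illustrative, this is not C3PR's actual code, and a real implementation would also cap concurrency and retry transient failures.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class UrlCheck {
    public static void main(String[] args) {
        // Illustrative URLs; in practice these would come from the package metadata.
        List<String> urls = List.of(
                "https://example.org/files/file1.csv",
                "https://example.org/files/file2.csv");

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();

        // Fire HEAD requests concurrently instead of testing each URL in turn.
        List<CompletableFuture<Void>> checks = urls.stream()
                .map(u -> HttpRequest.newBuilder(URI.create(u))
                        .method("HEAD", HttpRequest.BodyPublishers.noBody())
                        .build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.discarding())
                        .thenAccept(resp -> {
                            if (resp.statusCode() >= 400) {
                                System.err.println("Bad status " + resp.statusCode()
                                        + ": " + resp.uri());
                            }
                        })
                        .exceptionally(ex -> {
                            System.err.println("Request failed: " + ex.getMessage());
                            return null;
                        }))
                .toList();

        checks.forEach(CompletableFuture::join);
    }
}
```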

 

Size requirements:

Serving content appears to be lightweight: minimal CPU load and minimal memory use. The load is clearly higher than for serving static web pages, but the current VM should easily handle many simultaneous users.

Publishing and packaging is CPU and memory intensive. The largest package (135K files) has a ~158 MB metadata file, and we currently download and parse it as one unit, which required giving Java a 4 GB heap. Once that file is parsed, we set up 135K HTTP transfers. On the 2 cores, 8 threads looked roughly optimal, and the CPU load reported by top stayed at 150-200%.

To handle the largest packages, I moved the /tmp storage used during parallel retrieval (8 files, one per thread, each storing the compressed results as they stream in) onto the final storage (/sead_data). The bandwidth from NCSA to disk during the write phase was <= 20 MB/s (which includes the time for copying the 8 temp files into one final zip on the same file system, etc.). Since the write is one-time and we're primarily looking at smaller packages, this is probably fine, but it might be worth some discussion to see where the bottleneck(s) are and whether there are easy ones to remove.
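For discussion purposes, here is a sketch of the retrieval pattern described above: a fixed pool of 8 workers drains a shared queue of files, each worker appending compressed entries into its own temporary zip part on /sead_data, with a final merge (not shown) into the package zip. The Item record, URLs, and paths are illustrative; this is not the publisher's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ParallelPackageFetch {

    // Hypothetical item: the entry name inside the package zip plus its source URL.
    record Item(String entryName, String url) {}

    public static void main(String[] args) throws Exception {
        int threads = 8;                          // ~optimal on the 2-core VM per the notes above
        Path scratch = Paths.get("/sead_data");   // temp parts live on the final storage volume
        List<Item> items = List.of(
                new Item("data/file1.csv", "https://example.org/files/file1.csv"),
                new Item("data/file2.csv", "https://example.org/files/file2.csv"));

        BlockingQueue<Item> queue = new LinkedBlockingQueue<>(items);
        HttpClient http = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Each worker drains the shared queue and appends compressed entries to its own
        // temporary zip part, so the 8 parts can later be merged into one package zip.
        for (int i = 0; i < threads; i++) {
            Path part = scratch.resolve("part-" + i + ".zip");
            pool.submit(() -> {
                try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(part))) {
                    Item item;
                    while ((item = queue.poll()) != null) {
                        HttpRequest req = HttpRequest.newBuilder(URI.create(item.url())).build();
                        HttpResponse<InputStream> resp =
                                http.send(req, HttpResponse.BodyHandlers.ofInputStream());
                        zip.putNextEntry(new ZipEntry(item.entryName()));
                        try (InputStream in = resp.body()) {
                            in.transferTo(zip);   // stream straight into the zip entry
                        }
                        zip.closeEntry();
                    }
                } catch (IOException | InterruptedException e) {
                    throw new RuntimeException(e);
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        // A final step (not shown) would copy the entries from the 8 part files
        // into one package zip on the same file system.
    }
}
```

Timing the download phase and the final copy separately in a sketch like this would help narrow down where the <= 20 MB/s bottleneck sits.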

 
