Clowder Data Archiving Support

Driving Scenario I: The TERRA-REF project uses file path uploading to create file entries in Clowder that point to mounted file paths (i.e. the data bytes are not stored in MongoDB). TERRA-REF's 1 PB allocation on the storage condo is filling up, so some files (likely starting with raw data from 2016+) must be moved to tape storage or offsite backup. The move will be done manually, and recovering files from the archive will also require manual action.

Driving Scenario II: An industry partnership project would like files that have not been downloaded in X days to be moved automatically from S3 storage to Glacier. They would also like a button that schedules a file to be restored from Glacier back to S3.

In both scenarios, we want to retain the Clowder entries for archived data so that the files can still be referenced and their metadata remains available.

Completed work

https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/1364/overview

This pull request adds the following:

- Support for an ARCHIVED status (displayed in the right column of a File page).
- A /files/:id/archive endpoint that accepts a POST to assign that status.
- If a user attempts to download an archived file, a new window opens with the /email form, with the subject and body pre-populated to indicate that the user wants to retrieve the file from the archive. This email goes to the server admins.
- A "This email was sent on behalf of..." footer on emails when the mimicuser config is false, so that in cases where we don't want to spoof an email address (e.g. industry partner) there is still enough information to get back in touch with the correct user.
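As an illustration of how a script might assign the ARCHIVED status, here is a minimal Python sketch of a POST to the archive endpoint. Only the /files/:id/archive path comes from the pull request description; the base URL, the `key` query parameter used for authentication, and the function names are assumptions made for this example and should be checked against the actual Clowder instance.

```python
import urllib.parse
import urllib.request


def build_archive_request(base_url: str, file_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the archive endpoint."""
    # The /files/:id/archive path is taken from the pull request description.
    # The base URL prefix and the "key" query parameter are assumptions.
    path = f"/files/{urllib.parse.quote(file_id, safe='')}/archive"
    query = urllib.parse.urlencode({"key": api_key})
    url = f"{base_url.rstrip('/')}{path}?{query}"
    # An empty body is sent; the endpoint is assumed to need no payload.
    return urllib.request.Request(url, data=b"", method="POST")


def archive_file(base_url: str, file_id: str, api_key: str, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_archive_request(base_url, file_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending makes it possible to inspect the URL and method without contacting a server.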

Proposed future design

The pull request does not yet support automatically archiving or unarchiving files on user request. We have discussed one possible architecture that would use extractors to perform these two tasks.

- Implement Archiver and Unarchiver extractors per use case:
  - Basic (TERRA-REF)
    - Archiver does essentially nothing, or perhaps emails site admins
  - AWS (Industry Partner)
    - Archiver holds the credentials necessary to move a file from S3 to Glacier
    - Unarchiver moves a file from Glacier to S3
- Add a "life limit" policy that can be set at the User/Space/Instance level (in order of preference). If the limit is above 0, the Archiver is triggered automatically for any file that has not been downloaded in that many days.
- Any script, process, or other extractor could also trigger the Archiver as desired. For example, an extractor that is the final step in a workflow could notify Clowder to archive the working files needed for the workflow while preserving the final outputs.
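The "life limit" policy described above could be evaluated roughly as follows. This is a sketch under stated assumptions, not the planned implementation: the function names are hypothetical, an unset level is represented by `None`, an explicit 0 at a more preferred level is assumed to disable archiving rather than fall through to the next level, and the caller is assumed to supply a fallback timestamp (such as the upload time) for files that have never been downloaded.

```python
from datetime import datetime, timedelta
from typing import Optional


def effective_life_limit(
    user_days: Optional[int],
    space_days: Optional[int],
    instance_days: Optional[int],
) -> int:
    """Return the first configured limit in User > Space > Instance order.

    None means "not set at this level". Returns 0 (disabled) when no
    level defines a limit. Treating an explicit 0 as an override is an
    assumption of this sketch.
    """
    for days in (user_days, space_days, instance_days):
        if days is not None:
            return days
    return 0


def should_archive(last_downloaded: datetime, now: datetime, life_limit_days: int) -> bool:
    """True if the limit is above 0 and the file has gone that many days without a download."""
    if life_limit_days <= 0:
        return False
    return now - last_downloaded >= timedelta(days=life_limit_days)
```

A periodic job could call `should_archive` for each file and, when it returns true, submit the file to whichever Archiver extractor is configured for that instance.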
