Driving Scenario I: The TERRA-REF project uses file path uploading to create file entries in Clowder that point to mounted file paths (i.e. the data bytes are not stored in MongoDB). TERRA's 1 PB allocation on the storage condo is filling up, so some files (likely starting with raw data from 2016 onward) need to be moved to tape storage or an offsite backup. This will be done manually, and recovering files from the archive will also require manual action.

Driving Scenario II: An industry partnership project would like to automatically move files that have not been downloaded for X days from S3 storage to Glacier. They would also like a button that schedules a file to be restored from Glacier back to S3.

In both scenarios, we want to retain entries in Clowder for data that we archive for referencing and metadata purposes.

Completed work

https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/1364/overview

This pull request adds the following:

  • Support for an ARCHIVED status (displayed in the right column of a File page)
  • A /files/:id/archive endpoint that can be POSTed to in order to assign that status (see the sketch after this list)
  • If a user attempts to download an archived file, a new window opens with the /email form, with the subject and body pre-populated to indicate that the user wants to retrieve the file from the archive. This email goes to the server admins.
  • A "This email was sent on behalf of..." footer is added to emails when the mimicuser config is false, so in cases where we don't want to spoof an email address (e.g. the industry partner) there is still enough information to get back in touch with the correct user.
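For reference, a request against the new endpoint could look like the following sketch (the /api prefix, key-based authentication, instance URL, and file id are assumptions about a typical Clowder deployment):

```python
import requests

CLOWDER_URL = "https://clowder.example.org"   # hypothetical Clowder instance
API_KEY = "replace-with-your-key"             # hypothetical API key
file_id = "5cf4c5d7e4b0example"               # hypothetical file id

# POST to the archive endpoint added by the pull request to set Status: ARCHIVED.
resp = requests.post(
    f"{CLOWDER_URL}/api/files/{file_id}/archive",
    params={"key": API_KEY},
)
resp.raise_for_status()
```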

Proposed future design

The pull request does not yet address the desire to support archiving/unarchiving files automatically on user request. We have discussed one possible architecture that would leverage extractors to perform these two tasks.

  • Implement Archiver and Unarchiver extractors per use case
    • Basic (TERRA-REF)
      • Archiver basically does nothing, or maybe emails site admins
    • AWS (Industry Partner)
      • The Archiver will contain the credentials necessary to move a file from S3 to Glacier (a sketch follows this list)
      • The Unarchiver will move a file from Glacier back to S3
  • Add a "life limit" policy that can be set at the User/Space/Instance level (in order of preference) which, if above 0, will trigger Archiver automatically if the file has not been downloaded in that many days
  • There would also be the ability for any script/process/other extractor to trigger the Archiver as desired, so if there is an extractor that is the final step in a workflow, it could notify Clowder to Archive the working files necessary for the workflow while preserving the final outputs.
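For the AWS use case, here is a minimal sketch of what such an Archiver extractor could look like, assuming a pyclowder-style Extractor base class, boto3 credentials configured out of band, and that the S3 object key can be derived from the Clowder file id (the bucket and extractor names are illustrative):

```python
import logging

import boto3
import requests
from pyclowder.extractors import Extractor


class S3GlacierArchiver(Extractor):
    """Sketch of an Archiver that moves a file's S3 object to the GLACIER storage class."""

    def __init__(self):
        Extractor.__init__(self)
        self.setup()  # parse command line / environment (RabbitMQ URL, registration, etc.)
        logging.getLogger('pyclowder').setLevel(logging.INFO)
        self.s3 = boto3.client("s3")
        self.bucket = "clowder-file-bucket"  # illustrative bucket name

    def process_message(self, connector, host, secret_key, resource, parameters):
        # Assumption: the S3 key for the file's bytes is the Clowder file id;
        # a real implementation would look this up from the storage backend.
        key = resource["id"]

        # An in-place copy with a GLACIER storage class moves the bytes to cold storage.
        self.s3.copy_object(
            Bucket=self.bucket,
            Key=key,
            CopySource={"Bucket": self.bucket, "Key": key},
            StorageClass="GLACIER",
        )

        # Mark the file as ARCHIVED in Clowder using the endpoint from the pull request
        # above (assumes the usual /api prefix and key-based authentication).
        requests.post(f"{host}api/files/{resource['id']}/archive",
                      params={"key": secret_key})


if __name__ == "__main__":
    S3GlacierArchiver().start()
```

The basic (TERRA-REF) Archiver would look the same minus the boto3 calls, doing nothing or emailing the site admins.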

Low-level Implementation Ideas

  • When the module is disabled, we could offer an e-mail based strategy to process archive requests (as in "Completed Work" above)
  • To enable the module, configure one or more backup extractors (possibly in an array in the configuration?)
  • When the module is enabled, unarchived files offer an "Archive" option in the UI - when pressed, the configuration is checked for which extractor should be used for archival and a request is sent (see the sketch after this list)
    • If useful, this could potentially offer multiple archive targets if more than one is configured - e.g. S3, NFS, Glacier, etc
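A rough sketch of how the "Archive" option could dispatch work to a configured extractor, assuming Clowder's existing extraction-submission endpoint accepts an extractor name in the request body (the configuration list and extractor names are hypothetical):

```python
import requests

CLOWDER_URL = "https://clowder.example.org"   # hypothetical Clowder instance
API_KEY = "replace-with-your-key"             # hypothetical API key

# Hypothetical configuration: one or more archive targets, as suggested above.
ARCHIVE_EXTRACTORS = ["ncsa.archiver.s3-glacier", "ncsa.archiver.nfs"]


def request_archive(file_id, target=0):
    """Submit the file to the chosen archive extractor via the extractions endpoint."""
    resp = requests.post(
        f"{CLOWDER_URL}/api/files/{file_id}/extractions",
        params={"key": API_KEY},
        json={"extractor": ARCHIVE_EXTRACTORS[target]},
    )
    resp.raise_for_status()
    return resp.json()
```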


05/29 discussion notes 

  • We have an Archive/Unarchive button alongside Download, and Download is hidden if the file is Status: Archived
    • Sends an 'archive' or 'unarchive' parameter to the extractor so one extractor can handle both modes
  • Add a new Archive permission that we can configure like the others - gives us nice control over who can archive things, where
  • extractor_info has a category field that allows the UI to filter which extractors are shown (see the sketch after this list)
    • Could potentially implement support for the Process block in extractor_info as well - trigger on S3/Mongo/Disk storage, certain MIME types, etc.
  • Global and per-space lifetime setting (e.g. 30 = archive if the file has not been downloaded for 30 days)
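As a sketch, the extractor_info descriptor for such an archive extractor might carry the category and Process-block filtering mentioned above (shown here as a Python dict; the ARCHIVE category value and the idea of triggering on the storage backend are proposals, not existing Clowder behaviour):

```python
# Illustrative contents of an archive extractor's extractor_info descriptor.
EXTRACTOR_INFO = {
    "name": "ncsa.archiver.s3-glacier",   # hypothetical extractor name
    "version": "0.1",
    "description": "Moves file bytes from S3 to Glacier and marks them ARCHIVED",
    "categories": ["ARCHIVE"],            # would let the UI filter these out of normal listings
    "process": {
        "file": ["*/*"],                  # could be narrowed to certain MIME types
    },
}
```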

Open Questions

  • Is this new optional functionality a Play! Framework "module"? Perhaps this is something entirely different?
  • Should we support multiple archive options simultaneously (similar to the proposed Multiple Storage Backend feature)?
  • Could we easily provide a common pattern for extractor developers to use as a base for such archive/unarchive extractors?
    • Would such a pattern require two different extractors? Could we easily parameterize the `process_message` call?
  • Should the backup extractors also be listed in the Manual Submission view with the other extractors?
    • This could confuse end users, but as long as the extractor is safe this should be OK - this may take careful design



Comments

  1. Maybe change the download button to say "Request Unarchive" and have it send an email to the admins of the server.

    Discussion point: What to do when downloading a dataset that has archived files? Do we just add the metadata and no blob? Do we add a file listing the archived files to the root of the download?

    As for the API, I would like a simple PUT /api/files/{id} endpoint that takes an archive=true/false option.
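A sketch of the call proposed above (this PUT variant does not exist yet; the parameter name is the commenter's suggestion):

```python
import requests

# Hypothetical: toggle archival on an existing file with a single boolean option.
requests.put(
    "https://clowder.example.org/api/files/5cf4c5d7e4b0example",
    params={"key": "replace-with-your-key", "archive": "true"},
)
```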

  2. It would be cool if you could offer a hook into at least one archive solution, or at least be thinking about it while doing the design.

    For instance, AWS Glacier. That way you might consider options to request archive recovery as part of the API.

    So instead of "Unarchive" sending an email (configurable), it actually initiates a restore from tape request, what may notify the user when it is complete.  Obviously the possibilities can get complicated quickly, but if kept simple to start with, or even just laying the ground work for later could be considered, that might be helpful.

    I used to work in a mainframe environment where you accessed a catalog and the physical storage was a secondary concept. When you requested a file you didn't need to know where it was. If it was on tape, it would notify an operator (robot) to mount a tape and your job would wait for the request to be completed. I see Mongo providing a similar capability. The file has a location, and it may be NFS storage or it may be "????" storage, but the user doesn't know and just accepts the performance hit when it is archived.

    Simple behavior would be that anything unarchived is just relocated from offline to online storage, and since it was accessed, the archive rules are "reset" for that file. It may become a "candidate" for archiving (re-archiving) again at a later date.

    Again, there is a tremendous amount of complexity that I am glossing over here; I am just trying to look beyond simply having an archive bit.
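A minimal sketch of the restore-from-Glacier request described in the comment above, using boto3 (bucket, key, and retention window are illustrative; notifying the user on completion would need S3 event notifications or polling and is not shown):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to stage a Glacier-archived object back into S3 for 7 days.
# Completion can be detected by polling head_object (the "Restore" field)
# or via S3 event notifications.
s3.restore_object(
    Bucket="clowder-file-bucket",       # illustrative bucket
    Key="5cf4c5d7e4b0example",          # illustrative object key
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)
```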

  3. Dan Harwood - we've had some discussions and are thinking of a possible extractor-driven framework to support this in a very flexible way; the document has been updated above.

  4. Discussed this briefly with Luigi Marini and added some lower-level implementation possibilities and open questions to the Wiki above.


  5. Admins are able to set a global archival policy that defines the lifecycle in terms of archival destination (tape, Glacier, outright deletion, etc.), trigger (time since modification, time since upload, etc.), and duration (30 days, 120 days, etc.). Also, admins could allow users to override this on a per account/space/collection/dataset basis (see the sketch after this comment).

    Clowder would allow unarchival via the UI. This would allow users to restore archived files (unless they were deleted via the "destination" property). An instance-level administrative setting could prevent unarchiving so that retention policies could be enforced.

    Archival batch jobs would run periodically without user intervention.  Unarchival would be async, and restores would take place depending on archival destination limitations.  

    APIs would report an "archived" status for archived files. APIs would also allow callers to archive and unarchive files to override the automatic archiving rules.
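One possible shape for the policy described in this comment, sketched as a small data structure with the most specific override winning (field names and override levels are illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchivalPolicy:
    """Illustrative archival policy: destination, trigger, and duration."""
    destination: str               # e.g. "tape", "glacier", "delete"
    trigger: str                   # e.g. "days_since_modified", "days_since_upload"
    duration_days: int             # e.g. 30, 120
    allow_unarchive: bool = True   # instance admins could force False to enforce retention


def effective_policy(instance: ArchivalPolicy,
                     space: Optional[ArchivalPolicy] = None,
                     dataset: Optional[ArchivalPolicy] = None) -> ArchivalPolicy:
    """The most specific override wins: dataset, then space, then the instance default."""
    return dataset or space or instance
```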

  6. As requested:  Comments on trigger criteria on an image:


    • Last accessed (viewed, downloaded, any API actions on it) *
    • Elapsed time since first created

    These are the two most likely criteria I would use.


    Otherwise, other Clowder users could possibly use:

    • last downloaded
    • last modified (metadata or image)
    • last viewed

    But I tend to think any "read" operation would reset the clock on archiving the file. The "elapsed time since first created" criterion would cover the compliance need where files are always deleted after a certain date, in compliance with discovery or Sarbanes-Oxley type laws (a small sketch of this check follows this comment).


    Lastly, an idea: you might want to set instance-level policies that differ based on the classification of the data (secret, internal only, etc.). But my needs are rather modest; we simply need to archive files that meet the rule I starred (*).
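A small sketch of the starred rule plus the optional compliance-style cutoff described in this comment (function and parameter names are illustrative):

```python
from datetime import datetime, timedelta
from typing import Optional


def is_archive_candidate(last_accessed: datetime,
                         created: datetime,
                         max_idle_days: int,
                         max_age_days: Optional[int] = None,
                         now: Optional[datetime] = None) -> bool:
    """True if the file is idle past the limit (any "read" resets the clock) or,
    if a compliance-style age limit is configured, simply old enough."""
    now = now or datetime.utcnow()
    idle_too_long = now - last_accessed > timedelta(days=max_idle_days)
    too_old = max_age_days is not None and now - created > timedelta(days=max_age_days)
    return idle_too_long or too_old
```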