
The archival process in Clowder is an optional add-on to the (already optional) RabbitMQ extractor framework.

The definition of "archiving" is left to the implementation of each archival extractor, and "unarchiving" is the exact inverse of that process.

Two archival extractor implementations currently exist; which one you need depends on which ByteStorageDriver you are using:

• ncsa.archival.disk, for use with the DiskByteStorageDriver
• ncsa.archival.s3, for use with the S3ByteStorageDriver

The two options cannot currently be mixed: if Clowder uses DiskByteStorage, then you must use the Disk archiver.

If neither of the above two extractors fit your use case, pyclowder can be used to quickly create a new archival extractor that fits your needs.
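For instance, a new archival extractor is mostly a matter of subclassing pyclowder's Extractor and deciding what "archive" and "unarchive" mean for your storage. The sketch below is a minimal illustration, not the published extractors' code; the cold-storage path and the assumption that the requested operation arrives in the message parameters are hypothetical.

#!/usr/bin/env python
"""Minimal sketch of a custom archival extractor built on pyclowder."""

import shutil

from pyclowder.extractors import Extractor


class MyArchivalExtractor(Extractor):
    def __init__(self):
        Extractor.__init__(self)
        self.setup()  # parse command-line/environment options, set up logging

    def process_message(self, connector, host, secret_key, resource, parameters):
        # ASSUMPTION: Clowder passes the requested operation along with the
        # message, as described under "On Archive" / "On Unarchive" below.
        operation = parameters.get("operation", "archive")
        if operation == "archive":
            # "Archiving" is whatever your site needs; here, a plain copy of
            # the downloaded file bytes to a hypothetical cold-storage mount.
            shutil.copy(resource["local_paths"][0], "/mnt/cold-storage/")
        elif operation == "unarchive":
            # The exact inverse: bring the bytes back to where Clowder's
            # ByteStorageDriver can serve them for download.
            pass  # site-specific


if __name__ == "__main__":
    MyArchivalExtractor().start()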

Configuration Options / Defaults for Clowder

To use the archival feature, the RabbitMQ plugin must be enabled and properly configured.

With the RabbitMQ plugin enabled, the following defaults are configured in application.conf, but can be overridden by using a custom.conf file:

archiveEnabled (default: false)
If true, Clowder performs a lookup once per day to see if any files are candidates for automatic archival.

archiveDebug (default: false)
If true, Clowder temporarily uses "5 minutes" as the archive check interval (instead of once per day). In addition, it only considers candidate files that were uploaded in the past hour.

archiveExtractorId (default: "ncsa.archival.disk")
The id of the extractor to use for archival.

archiveAllowUnarchive (default: false)
If true, the UI offers a way to Unarchive a file that is ARCHIVED.

archiveAutoAfterDaysInactive (default: 90)
The number of days that a file can go without being downloaded before it is automatically archived.

archiveMinimumStorageSize (default: 1000000)
The minimum number of bytes for a file to be considered as a candidate for automatic archival.
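For example, a custom.conf that enables the feature with the Disk archiver and lets users request unarchival (using the configuration paths from the table above) might look like:

# custom.conf
archiveEnabled=true
archiveExtractorId="ncsa.archival.disk"
archiveAllowUnarchive=true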

Configuration Options: ncsa.archival.disk 

The following configuration options must match your configuration of the DiskByteStorageDriver:

ARCHIVE_SOURCE_DIRECTORY (flag: --archive-source; default: $HOME/clowder/data/uploads/)
The directory where Clowder currently stores its uploaded files.

ARCHIVE_TARGET_DIRECTORY (flag: --archive-target; default: $HOME/clowder/data/archive/)
The target directory where the archival extractor should store the files that it archives. Note that this path can be on a network or other persistent storage.

To build the Disk archival extractor's Docker image, execute the following commands:

git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/extractors-archival-disk.git
cd extractors-archival-disk/
docker build -t clowder/extractors-archival-disk .

Example Configuration: Archive to another folder

In Clowder, configure the following:

# storage driver
service.byteStorage=services.filesystem.DiskByteStorageService

# disk storage path
#clowder.diskStorage.path="/Users/lambert8/clowder/data"    # MacOSX
clowder.diskStorage.path="/home/clowder/clowder/data"      # Linux

To run the Disk archival extractor with this configuration:

docker run -itd --rm \
    -e ARCHIVE_SOURCE_DIRECTORY="/home/clowder/clowder/data/uploads/" \
    -e ARCHIVE_TARGET_DIRECTORY="/home/clowder/clowder/data/archive/" \
    clowder/extractors-archival-disk

NOTE: On Mac OS X, you may need to run the extractor with the --net=host option to connect to RabbitMQ.
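Since ARCHIVE_TARGET_DIRECTORY can point at network or other persistent storage, one option is to bind-mount that storage into the container. A sketch, where /mnt/nfs/clowder-archive is a hypothetical host path:

docker run -itd --rm \
    -v /mnt/nfs/clowder-archive:/home/clowder/clowder/data/archive \
    -e ARCHIVE_SOURCE_DIRECTORY="/home/clowder/clowder/data/uploads/" \
    -e ARCHIVE_TARGET_DIRECTORY="/home/clowder/clowder/data/archive/" \
    clowder/extractors-archival-disk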

Configuration Options: ncsa.archival.s3 

The following configuration options must match your configuration of the S3ByteStorageDriver:

AWS_S3_SERVICE_ENDPOINT (flag: --service-endpoint <value>; default: https://s3.amazonaws.com)
Which AWS service endpoint to use to connect to S3. Note that this may depend on the region used, but it can also be used to point at a running MinIO instance.

AWS_ACCESS_KEY (flag: --access-key <value>; default: "")
The AccessKey that should be used to authorize with AWS or MinIO.

AWS_SECRET_KEY (flag: --secret-key <value>; default: "")
The SecretKey that should be used to authorize with AWS or MinIO.

AWS_BUCKET_NAME (flag: --bucket-name <value>; default: clowder-archive)
The name of the bucket where the files are stored in Clowder.

AWS_REGION (flag: --region <value>; default: us-east-1)
AWS only: the region where the S3 bucket exists.

To build the S3 archival extractor's Docker image, execute the following commands:

git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/extractors-archival-s3.git
cd extractors-archival-s3/
docker build -t clowder/extractors-archival-s3 .

Example Configuration: S3 on AWS in us-east-2 Region

In Clowder, configure the following:

# AWS S3
clowder.s3.serviceEndpoint="https://s3-us-east-2.amazonaws.com"
clowder.s3.accessKey="AWSACCESSKEYKASOKD"
clowder.s3.secretKey="aWSseCretKey+asAfasf90asdASDADAOaisdoas"
clowder.s3.bucketName="bucket-on-aws"
clowder.s3.region="us-east-2"

NOTE: Changing the Region typically requires changing the S3 Service Endpoint.

To run the S3 archival extractor with this configuration:

docker run --net=host -it --rm \
    -e AWS_S3_SERVICE_ENDPOINT='https://s3-us-east-2.amazonaws.com' \
    -e AWS_ACCESS_KEY='AWSACCESSKEYKASOKD' \
    -e AWS_SECRET_KEY='aWSseCretKey+asAfasf90asdASDADAOaisdoas' \
    -e AWS_BUCKET_NAME='bucket-on-aws' \
    -e AWS_REGION='us-east-2' \
    clowder/extractors-archival-s3

NOTE: On Mac OS X, you may need to run the extractor with the --net=host option to connect to RabbitMQ.

Example Configuration: MinIO

In Clowder, configure the following to point the S3ByteStorageDriver and the archival extractor at your running MinIO instance:

# Minio S3
clowder.s3.serviceEndpoint="http://localhost:8000"
clowder.s3.accessKey="AMINIOACCESSKEYKASOKD"
clowder.s3.secretKey="aMinIOseCretKey+asAfasf90asdASDADAOaisdoas"
clowder.s3.bucketName="bucket-on-minio"

NOTE: MinIO does not use the "Region" value, if one is specified.

To run the S3 archival extractor with this configuration:

docker run --net=host -it --rm \
    -e AWS_S3_SERVICE_ENDPOINT='http://localhost:8000' \
    -e AWS_ACCESS_KEY='AMINIOACCESSKEYKASOKD' \
    -e AWS_SECRET_KEY='aMinIOseCretKey+asAfasf90asdASDADAOaisdoas' \
    -e AWS_BUCKET_NAME='bucket-on-minio' \
    clowder/extractors-archival-s3

NOTE: On Mac OS X, you may need to run the extractor with the --net=host option to connect to RabbitMQ.

Process Overview

When a file is first uploaded, it is placed into a temp folder and created in the DB with the state CREATED.

At this point, users can start associating metadata with the new file, even though the actual file bytes are not yet available through Clowder's API.
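For example, assuming a standard Clowder deployment exposing the JSON-LD metadata endpoint (the host, file id, API key, and context URL below are placeholders):

curl -X POST "http://localhost:9000/api/files/FILE_ID/metadata.jsonld?key=API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"@context": ["https://clowder.ncsa.illinois.edu/contexts/metadata.jsonld"],
         "content": {"note": "added while the file was still in CREATED state"}}'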

Clowder then begins transferring the file bytes to the configured ByteStorage driver.

Once the bytes are completely uploaded into Clowder and done transferring to the data store, the file is marked as PROCESSED.

At this point, users can access the file bytes via the Clowder API and UI, and can download the file as normal.

If the admin has configured the archival feature (see above), then the user is also offered a button to Archive the file.

On Archive

If a user chooses to Archive a file, then it is sent to the configured archival extractor with a parameter of operation=archive.

The extractor performs whatever operation it deems as "archiving" - for example, copying to a network file system.

Finally the file is marked as ARCHIVED, and (if configured) the user is given the option to Unarchive the file.

If a user attempts to download an ARCHIVED file, they are presented with a prompt to notify the admin and request that the file be unarchived.

On Unarchive

If a user chooses to Unarchive a file, then it is sent to the configured archival extractor with a parameter of operation=unarchive.

The extractor performs the inverse of whatever operation that it previously defined as "archiving", bringing the file bytes back to where Clowder can access them for download.

Finally, the file is marked as PROCESSED; the user is once again given the option to Archive the file, and requests to download the file bytes will succeed.
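Taken together, the lifecycle above is a small state machine. The following sketch is illustrative only, not Clowder's actual implementation:

from enum import Enum

class FileStatus(Enum):
    CREATED = "CREATED"      # bytes still transferring; metadata already editable
    PROCESSED = "PROCESSED"  # bytes stored; download and Archive available
    ARCHIVED = "ARCHIVED"    # bytes in cold storage; Unarchive on request

# Legal transitions: upload completion, operation=archive, operation=unarchive.
TRANSITIONS = {
    FileStatus.CREATED: {FileStatus.PROCESSED},
    FileStatus.PROCESSED: {FileStatus.ARCHIVED},
    FileStatus.ARCHIVED: {FileStatus.PROCESSED},
}

def transition(current, target):
    if target not in TRANSITIONS[current]:
        raise ValueError("illegal transition: %s -> %s" % (current, target))
    return target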

Automatic File Archival

If configured (see above), Clowder can automatically archive files of sufficient size after a predetermined period of inactivity. 

By default, files that are over 1 MB and have not been downloaded in the last 90 days will be automatically archived.

Both the file size and the inactivity period can be configured according to your preferences.
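For example, to automatically archive files larger than 10 MB after 30 days of inactivity, a custom.conf could set:

# custom.conf
archiveAutoAfterDaysInactive=30
archiveMinimumStorageSize=10000000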

