This documentation may be out of date. For the most recent documentation for the archival extractors, see the READMEs in GitHub: https://github.com/clowder-framework/extractors-archival
The archival process in Clowder is an optional add-on to the (already optional) RabbitMQ extractor framework.
...
Two archival extractor implementations exist; which one to use depends on which ByteStorageDriver you are using:
- ncsa.archival.disk: Moves the file from one specially-designated folder on disk to another (requires write access to Clowder's data directory)
- ncsa.archival.s3: Changes the Storage Class of an object stored in S3 (requires write access to Clowder's bucket in S3)
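The disk extractor's core operation amounts to moving a file from one directory tree to another while preserving its relative path. A minimal sketch of that idea (the function name and argument layout here are illustrative, not the extractor's actual API):

```python
import shutil
from pathlib import Path


def archive_on_disk(file_path, source_root, target_root):
    """Move a file from the source tree to the archive tree,
    preserving its path relative to the source root."""
    relative = Path(file_path).relative_to(source_root)
    destination = Path(target_root) / relative
    # Create any intermediate directories in the archive tree
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(file_path), str(destination))
    return str(destination)
```

Note that this requires write access to both directory trees, which is why the disk extractor must share (or mount) Clowder's data directory.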
...
Finally, the file is marked as PROCESSED: the user should once again be given the option to Archive the file, and requests to download the file bytes should succeed.
Automatic File Archival
If configured (see below), Clowder can automatically archive files of sufficient size after a predetermined period of inactivity. By default, files that are over 1MB and have not been downloaded in the last 90 days will be automatically archived.
Both the file size and the inactivity period can be configured according to your preferences.
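Under the defaults above, the candidate check amounts to a size threshold combined with an inactivity window. A rough sketch of that predicate (the names are illustrative, not Clowder's internal API):

```python
from datetime import datetime, timedelta


def is_archival_candidate(size_bytes, last_downloaded,
                          min_bytes=1_000_000, inactive_days=90, now=None):
    """True if the file is both large enough and idle long enough
    to qualify for automatic archival under the defaults described above."""
    now = now or datetime.now()
    too_big = size_bytes > min_bytes
    inactive = (now - last_downloaded) > timedelta(days=inactive_days)
    return too_big and inactive
```

Both thresholds correspond to configuration options described in the tables below.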
Configuration Options / Defaults for Clowder
...
With the RabbitMQ plugin enabled, the following defaults are configured in application.conf, but can be overridden by using a custom.conf file:
Configuration Path | Default | Description |
---|---|---|
archiveEnabled | false | If true, Clowder should perform a lookup once per day to see if any files uploaded in the past hour are candidates for archival |
archiveDebug | false | |
archiveExtractorId | "ncsa.archival.disk" | The id of the Extractor to use for archival |
archiveAllowUnarchive | false | If true, the UI should offer a way to Unarchive a file that is ARCHIVED |
archiveAutoAfterDaysInactive | 90 | Number of days a file can go un-downloaded before it is automatically archived |
Automatic File Archival
By default, automatic archival is disabled. Once enabled, the default values will cause files that are over 1MB and have not been downloaded in the last 90 days to be automatically archived. Both the file size and the inactivity period can be configured according to your preferences.
Configuration Path | Default | Description |
---|---|---|
archiveAutoInterval | 0 | If == 0, disable automatic archiving. If > 0, check for archival candidates every archiveAutoInterval seconds |
archiveAutoDelay | 120 | Number of seconds to wait before starting the first iteration of the automatic archival loop. |
archiveAutoAfterInactiveCount | 90 | Number of units a file can go un-downloaded before it is considered "inactive". |
archiveAutoAfterInactiveUnits | days | The units for the inactivity timeout above (e.g. "90 days" old) |
archiveAutoAboveMinimumStorageSize | 1000000 | The minimum number of bytes for a file to be considered as a candidate for automatic archival. |
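Taken together, archiveAutoDelay and archiveAutoInterval determine when the archival loop wakes up: the first check runs archiveAutoDelay seconds after startup, and subsequent checks run every archiveAutoInterval seconds. A small illustration of the resulting schedule (not Clowder's actual scheduler code):

```python
def check_times(delay, interval, count):
    """Seconds after startup at which the first `count` archival
    checks would run. An interval of 0 disables the loop entirely."""
    if interval == 0:
        return []
    return [delay + i * interval for i in range(count)]


# For example, delay=120 and interval=86400 gives a first check
# 2 minutes after startup, then one check per day thereafter.
```

An interval of 86400 (one day) is a common choice, as in the example configurations below.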
ncsa.archival.disk
This image has been pre-built as clowder/extractors-archival-disk.
(Optional) Building the Image
To build the Disk archival extractor's Docker image, execute the following commands:
...
The following configuration options must match your configuration of the DiskByteStorageDriver:
Environment Variable | Command-Line Flag | Default Value | Description |
---|---|---|---|
ARCHIVE_SOURCE_DIRECTORY | --archive-source | $HOME/clowder/data/uploads/ | The current directory where Clowder stores its uploaded files |
ARCHIVE_TARGET_DIRECTORY | --archive-target | $HOME/clowder/data/archive/ | The target directory where the archival extractor should store the files that it archives. Note that this path can be on a network or other persistent storage. |
Example Configuration: Archive to another folder
...
Code Block |
---|
# storage driver
service.byteStorage=services.filesystem.DiskByteStorageService

# disk storage path
#clowder.diskStorage.path="/Users/lambert8/clowder/data"   # MacOSX
clowder.diskStorage.path="/home/clowder/clowder/data"      # Linux

# disk archival settings
archiveEnabled=true
archiveDebug=false
archiveExtractorId="ncsa.archival.disk"
archiveAllowUnarchive=true
archiveAutoInterval=86400
archiveAutoDelay=300
archiveAutoAfterInactiveCount=90
archiveAutoAfterInactiveUnits=days
archiveAutoAboveMinimumStorageSize=1000000
To run the Disk archival extractor with this configuration:
...
NOTE 2: on MacOSX, you may need to run the extractor with the --net=host option to connect to RabbitMQ.
ncsa.archival.s3
This image has been pre-built as clowder/extractors-archival-s3.
(Optional) Building the Image
To build the S3 archival extractor's Docker image, execute the following commands:
...
The following configuration options must match your configuration of the S3ByteStorageDriver:
Environment Variable | Command-Line Flag | Default Value | Description |
---|---|---|---|
AWS_S3_SERVICE_ENDPOINT | --service-endpoint <value> | https://s3.amazonaws.com | Which AWS Service Endpoint to use to connect to S3. Note that this may depend on the region used, but can also be used to point at a running MinIO instance. |
AWS_ACCESS_KEY | --access-key <value> | "" | The AccessKey that should be used to authorize with AWS or MinIO |
AWS_SECRET_KEY | --secret-key <value> | "" | The SecretKey that should be used to authorize with AWS or MinIO |
AWS_BUCKET_NAME | --bucket-name <value> | clowder-archive | The name of the bucket where the files are stored in Clowder. |
AWS_REGION | --region <value> | us-east-1 | AWS only: the region where the S3 bucket exists |
AWS_ARCHIVED_STORAGE_CLASS | --archived-storage-class <value> | INTELLIGENT_TIERING | The S3 StorageClass to set for objects that are ARCHIVED. |
AWS_UNARCHIVED_STORAGE_CLASS | --unarchived-storage-class <value> | STANDARD | The S3 StorageClass to set for objects that are not archived (aka PROCESSED). |
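S3 changes an object's storage class by copying the object onto itself with a new StorageClass. A sketch of the parameters such an in-place copy would use with boto3's copy_object (illustrative; the extractor's actual implementation may differ):

```python
def storage_class_copy_args(bucket, key, storage_class):
    """Build the arguments for an in-place S3 copy that changes only
    the object's storage class (e.g. for boto3's client.copy_object)."""
    return {
        "Bucket": bucket,
        "Key": key,
        # Copying an object onto itself is how S3 rewrites storage class
        "CopySource": {"Bucket": bucket, "Key": key},
        "StorageClass": storage_class,
        "MetadataDirective": "COPY",  # keep the existing object metadata
    }


# usage (assuming an existing boto3 S3 client):
# s3_client.copy_object(**storage_class_copy_args(
#     "clowder-archive", "some/key", "INTELLIGENT_TIERING"))
```

This is why the extractor only needs write access to Clowder's bucket rather than a separate archive location.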
Example Configuration: S3 on AWS in us-east-2 Region
...
Code Block |
---|
# storage driver
service.byteStorage=services.s3.S3ByteStorageService

# AWS S3
clowder.s3.serviceEndpoint="https://s3-us-east-2.amazonaws.com"
clowder.s3.accessKey="AWSACCESSKEYKASOKD"
clowder.s3.secretKey="aWSseCretKey+asAfasf90asdASDADAOaisdoas"
clowder.s3.bucketName="bucket-on-aws"
clowder.s3.region="us-east-2"

# S3 archival settings
archiveEnabled=true
archiveDebug=false
archiveExtractorId="ncsa.archival.s3"
archiveAllowUnarchive=true
archiveAutoInterval=86400
archiveAutoDelay=300
archiveAutoAfterInactiveCount=90
archiveAutoAfterInactiveUnits=days
archiveAutoAboveMinimumStorageSize=1000000
NOTE: Changing the Region typically requires changing the S3 Service Endpoint.
...
Code Block |
---|
# storage driver
service.byteStorage=services.s3.S3ByteStorageService

# Minio S3
clowder.s3.serviceEndpoint="http://localhost:8000"
clowder.s3.accessKey="AMINIOACCESSKEYKASOKD"
clowder.s3.secretKey="aMinIOseCretKey+asAfasf90asdASDADAOaisdoas"
clowder.s3.bucketName="bucket-on-minio"

# S3 archival settings
archiveEnabled=true
archiveDebug=false
archiveExtractorId="ncsa.archival.s3"
archiveAllowUnarchive=true
archiveAutoInterval=86400
archiveAutoDelay=300
archiveAutoAfterInactiveCount=90
archiveAutoAfterInactiveUnits=days
archiveAutoAboveMinimumStorageSize=1000000
NOTE: MinIO ignores the value for "Region", if one is specified.
...
Code Block |
---|
docker run --net=host -itd \
    -e AWS_S3_SERVICE_ENDPOINT='http://localhost:8000' \
    -e AWS_ACCESS_KEY='AMINIOACCESSKEYKASOKD' \
    -e AWS_SECRET_KEY='aMinIOseCretKey+asAfasf90asdASDADAOaisdoas' \
    -e AWS_BUCKET_NAME='bucket-on-minio' \
    -e AWS_ARCHIVED_STORAGE_CLASS='REDUCED_REDUNDANCY' \
    clowder/extractors-archival-s3
...
Code Block |
---|
clowder.rabbitmq.uri="amqp://guest:guest@<PRIVATE IP>:5672/%2F"
clowder.rabbitmq.exchange="clowder"
clowder.rabbitmq.clowderurl="http://<PRIVATE IP>:9000"
Gotcha: extractor complains about Python's built-in Thread.isAlive(), and dies quickly after starting
pyclowder has an open issue regarding a minor incompatibility with Python 3.9, which removed the deprecated Thread.isAlive() alias in favor of Thread.is_alive().
...