This documentation may be out of date. For the most recent documentation for the archival extractors, see the READMEs in GitHub: https://github.com/clowder-framework/extractors-archival
The archival process in Clowder is an optional add-on to the (already optional) RabbitMQ extractor framework.
...
Two archival extractor implementations exist; which one to use depends on which ByteStorageDriver you are using:
- ncsa.archival.disk: Moves the file from one specially-designated folder on disk to another (requires write access to Clowder's data directory)
- ncsa.archival.s3: Changes the Storage Class of an object stored in S3 (requires write access to Clowder's bucket in S3)
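The disk extractor's core operation amounts to moving a file from one directory tree to another while preserving its relative path. A minimal sketch of that idea (the function name and argument layout here are illustrative, not the extractor's actual API):

```python
import shutil
from pathlib import Path


def archive_on_disk(file_path, source_root, target_root):
    """Move a file from the source tree to the archive tree,
    preserving its path relative to the source root."""
    relative = Path(file_path).relative_to(source_root)
    destination = Path(target_root) / relative
    # Create any intermediate directories in the archive tree
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(file_path), str(destination))
    return str(destination)
```

Note that this requires write access to both directory trees, which is why the disk extractor must share (or mount) Clowder's data directory.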
...
Finally, the file is marked as PROCESSED: the user should once again be given the option to Archive the file, and requests to download the file bytes should succeed.
Automatic File Archival
If configured (see below), Clowder can automatically archive files of sufficient size after a predetermined period of inactivity. By default, files that are over 1MB and have not been downloaded in the last 90 days will be automatically archived.
Both the file size and the inactivity period can be configured according to your preferences.
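Under the defaults above, the candidate check amounts to a size threshold combined with an inactivity window. A rough sketch of that predicate (the names are illustrative, not Clowder's internal API):

```python
from datetime import datetime, timedelta


def is_archival_candidate(size_bytes, last_downloaded,
                          min_bytes=1_000_000, inactive_days=90, now=None):
    """True if the file is both large enough and idle long enough
    to qualify for automatic archival under the defaults described above."""
    now = now or datetime.now()
    too_big = size_bytes > min_bytes
    inactive = (now - last_downloaded) > timedelta(days=inactive_days)
    return too_big and inactive
```

Both thresholds correspond to configuration options described in the tables below.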
Configuration Options / Defaults for Clowder
...
With the RabbitMQ plugin enabled, the following defaults are configured in application.conf, but can be overridden by using a custom.conf file:
Configuration Path | Default | Description |
---|---|---|
archiveEnabled | false | If true, Clowder should perform a lookup once per day to see if any files uploaded in the past hour are candidates for archival |
archiveDebug | false | |
archiveExtractorId | "ncsa.archival.disk" | The id of the Extractor to use for archival |
archiveAllowUnarchive | false | If true, the UI should offer a way to Unarchive a file that is ARCHIVED |
archiveAutoAfterDaysInactive | 90 | Number of days a file can go un-downloaded before it is automatically archived |
Automatic File Archival
By default, automatic archival is disabled. Once enabled, the default values will cause files that are over 1MB and have not been downloaded in the last 90 days to be automatically archived. Both the file size and the inactivity period can be configured according to your preferences.
Configuration Path | Default | Description |
---|---|---|
archiveAutoInterval | 0 | If == 0, disable automatic archiving. If > 0, check for archival candidates every archiveAutoInterval seconds |
archiveAutoDelay | 120 | Number of seconds to wait before starting the first iteration of the automatic archival loop. |
archiveAutoAfterInactiveCount | 90 | Number of units a file can go un-downloaded before it is considered "inactive". |
archiveAutoAfterInactiveUnits | days | The units for the inactivity timeout above (e.g. "90 days" old) |
archiveAutoAboveMinimumStorageSize | 1000000 | The minimum number of bytes for a file to be considered as a candidate for automatic archival. |
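Taken together, archiveAutoDelay and archiveAutoInterval determine when the archival loop wakes up: the first check runs archiveAutoDelay seconds after startup, and subsequent checks run every archiveAutoInterval seconds. A small illustration of the resulting schedule (not Clowder's actual scheduler code):

```python
def check_times(delay, interval, count):
    """Seconds after startup at which the first `count` archival
    checks would run. An interval of 0 disables the loop entirely."""
    if interval == 0:
        return []
    return [delay + i * interval for i in range(count)]


# For example, delay=120 and interval=86400 gives a first check
# 2 minutes after startup, then one check per day thereafter.
```

An interval of 86400 (one day) is a common choice, as in the example configurations below.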
ncsa.archival.disk
This image has been pre-built as clowder/extractors-archival-disk.
(Optional) Building the Image
To build the Disk archival extractor's Docker image, execute the following commands:
...
The following configuration options must match your configuration of the DiskByteStorageDriver:
Environment Variable | Command-Line Flag | Default Value | Description |
---|---|---|---|
ARCHIVE_SOURCE_DIRECTORY | --archive-source | $HOME/clowder/data/uploads/ | The current directory where Clowder stores its uploaded files |
ARCHIVE_TARGET_DIRECTORY | --archive-target | $HOME/clowder/data/archive/ | The target directory where the archival extractor should store the files that it archives. Note that this path can be on a network or other persistent storage. |
Example Configuration: Archive to another folder
...
Code Block |
---|
# storage driver
service.byteStorage=services.filesystem.DiskByteStorageService

# disk storage path
#clowder.diskStorage.path="/Users/lambert8/clowder/data"   # MacOSX
clowder.diskStorage.path="/home/clowder/clowder/data"      # Linux

# disk archival settings
archiveEnabled=true
archiveDebug=false
archiveExtractorId="ncsa.archival.disk"
archiveAllowUnarchive=true
archiveAutoInterval=86400
archiveAutoDelay=300
archiveAutoAfterInactiveCount=90
archiveAutoAfterInactiveUnits=days
archiveAutoAboveMinimumStorageSize=1000000
To run the Disk archival extractor with this configuration:
...
NOTE 2: on MacOSX, you may need to run the extractor with the --net=host option to connect to RabbitMQ.
ncsa.archival.s3
This image has been pre-built as clowder/extractors-archival-s3.
(Optional) Building the Image
To build the S3 archival extractor's Docker image, execute the following commands:
...
The following configuration options must match your configuration of the S3ByteStorageDriver:
Environment Variable | Command-Line Flag | Default Value | Description |
---|---|---|---|
AWS_S3_SERVICE_ENDPOINT | --service-endpoint <value> | https://s3.amazonaws.com | Which AWS Service Endpoint to use to connect to S3. Note that this may depend on the region used, but can also be used to point at a running MinIO instance. |
AWS_ACCESS_KEY | --access-key <value> | "" | The AccessKey that should be used to authorize with AWS or MinIO |
AWS_SECRET_KEY | --secret-key <value> | "" | The SecretKey that should be used to authorize with AWS or MinIO |
AWS_BUCKET_NAME | --bucket-name <value> | clowder-archive | The name of the bucket where the files are stored in Clowder. |
AWS_REGION | --region <value> | us-east-1 | AWS only: the region where the S3 bucket exists |
AWS_ARCHIVED_STORAGE_CLASS | --archived-storage-class <value> | INTELLIGENT_TIERING | The S3 StorageClass to set for objects that are ARCHIVED. |
AWS_UNARCHIVED_STORAGE_CLASS | --unarchived-storage-class <value> | STANDARD | The S3 StorageClass to set for objects that are not archived (aka PROCESSED). |
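S3 changes an object's storage class by copying the object onto itself with a new StorageClass. A sketch of the parameters such an in-place copy would use with boto3's copy_object (illustrative; the extractor's actual implementation may differ):

```python
def storage_class_copy_args(bucket, key, storage_class):
    """Build the arguments for an in-place S3 copy that changes only
    the object's storage class (e.g. for boto3's client.copy_object)."""
    return {
        "Bucket": bucket,
        "Key": key,
        # Copying an object onto itself is how S3 rewrites storage class
        "CopySource": {"Bucket": bucket, "Key": key},
        "StorageClass": storage_class,
        "MetadataDirective": "COPY",  # keep the existing object metadata
    }


# usage (assuming an existing boto3 S3 client):
# s3_client.copy_object(**storage_class_copy_args(
#     "clowder-archive", "some/key", "INTELLIGENT_TIERING"))
```

This is why the extractor only needs write access to Clowder's bucket rather than a separate archive location.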
Example Configuration: S3 on AWS in us-east-2 Region
...
Code Block |
---|
# storage driver
service.byteStorage=services.s3.S3ByteStorageService

# AWS S3
clowder.s3.serviceEndpoint="https://s3-us-east-2.amazonaws.com"
clowder.s3.accessKey="AWSACCESSKEYKASOKD"
clowder.s3.secretKey="aWSseCretKey+asAfasf90asdASDADAOaisdoas"
clowder.s3.bucketName="bucket-on-aws"
clowder.s3.region="us-east-2"

# S3 archival settings
archiveEnabled=true
archiveDebug=false
archiveExtractorId="ncsa.archival.s3"
archiveAllowUnarchive=true
archiveAutoInterval=86400
archiveAutoDelay=300
archiveAutoAfterInactiveCount=90
archiveAutoAfterInactiveUnits=days
archiveAutoAboveMinimumStorageSize=1000000
NOTE: Changing the Region typically requires changing the S3 Service Endpoint.
...
Code Block |
---|
# storage driver
service.byteStorage=services.s3.S3ByteStorageService

# Minio S3
clowder.s3.serviceEndpoint="http://localhost:8000"
clowder.s3.accessKey="AMINIOACCESSKEYKASOKD"
clowder.s3.secretKey="aMinIOseCretKey+asAfasf90asdASDADAOaisdoas"
clowder.s3.bucketName="bucket-on-minio"

# S3 archival settings
archiveEnabled=true
archiveDebug=false
archiveExtractorId="ncsa.archival.s3"
archiveAllowUnarchive=true
archiveAutoInterval=86400
archiveAutoDelay=300
archiveAutoAfterInactiveCount=90
archiveAutoAfterInactiveUnits=days
archiveAutoAboveMinimumStorageSize=1000000
NOTE: MinIO ignores the value for "Region", if one is specified.
...
Code Block |
---|
docker run --net=host -itd \
    -e AWS_S3_SERVICE_ENDPOINT='http://localhost:8000' \
    -e AWS_ACCESS_KEY='AMINIOACCESSKEYKASOKD' \
    -e AWS_SECRET_KEY='aMinIOseCretKey+asAfasf90asdASDADAOaisdoas' \
    -e AWS_BUCKET_NAME='bucket-on-minio' \
    -e AWS_ARCHIVED_STORAGE_CLASS='REDUCED_REDUNDANCY' \
    clowder/extractors-archival-s3
...
Code Block |
---|
clowder.rabbitmq.uri="amqp://guest:guest@<PRIVATE IP>:5672/%2F"
clowder.rabbitmq.exchange="clowder"
clowder.rabbitmq.clowderurl="http://<PRIVATE IP>:9000"
Gotcha: extractor complains about Python's built-in Thread.isAlive(), and dies quickly after starting
pyclowder has an open issue regarding a minor incompatibility with Python 3.9, which removed the deprecated Thread.isAlive() alias in favor of Thread.is_alive().
...