Extractors are services that run silently alongside Clowder. They can be configured to wait for specific file types to be uploaded into Clowder, and automatically execute scripts and processing on those files to extract metadata or generate new files.
In order to develop and test an extractor, it's useful to have a local instance of Clowder running that you can test against. This will allow you to upload target files to trigger your extractor, and verify any outputs are being submitted back into Clowder correctly.
The easiest way to get a local Clowder instance up and running is via Docker. We have created a docker image with the full Clowder stack already installed:
It's possible to install these elements individually, but unless you want to pursue Clowder development this is unnecessary.
- Run `docker-compose up`. This should start Clowder and its components.
- Use `docker network ls` and `ifconfig` to determine Docker's IP address. If you are using Docker Machine, `docker-machine ip` reports it.
- Clowder should be reachable at `<dockerIP>:9000`.
- The RabbitMQ management interface should be reachable at `<dockerIP>:15672`.
- Use `docker ps` to list running Docker containers.
- Use `docker logs <clowder container name>` to see the logs for your Clowder container.

In addition to Clowder, another package that makes extractor development easier is pyClowder 2. This package is not required; extractors can be written in any language that RabbitMQ supports.
pyClowder 2 provides helpful Python wrapper functions for many standard Clowder API calls and communication with RabbitMQ using pika. If you want to write extractors in other languages, you'll need to implement that functionality on your own.
- `git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/pyclowder2.git`
- `python setup.py install` from inside the new pyclowder2 directory

This will install pyClowder 2 on your system. In Python scripts you will refer to the package as:
import pyclowder
from pyclowder.extractors import Extractor
import pyclowder.files
...etc.
When certain events occur in Clowder, such as a new file being added to a dataset, messages are generated and sent to RabbitMQ. These messages describe the type of event, the ID of the file/dataset in question, the MIME type of the file, and other information.
Extractors are configured to listen to RabbitMQ for particular types of messages. For example, an extractor can listen for any file being added to a dataset, or specifically for image files being added to a dataset. The Clowder event types section below describes some of the available messages. RabbitMQ routes messages coming from Clowder to any extractors listening for messages with that signature, at which point the extractor can examine the message and decide whether to proceed with processing the file, dataset, etc.
Clowder event types
Extractors use the RabbitMQ message bus to communicate with Clowder instances. A queue is created for each extractor, and the queue's bindings filter the types of Clowder event messages the extractor is notified about. The following non-exhaustive list of events exists in Clowder (message types begin with an asterisk because the exchange name is not required to be 'clowder'):
message type | trigger event | message payload | examples |
---|---|---|---|
*.file.# | when any file is uploaded | | clowder.file.image.png, clowder.file.text.csv, clowder.file.application.json |
*.file.image.#, *.file.text.#, ... | when any file of the given MIME type is uploaded (a more specific match) | | see above |
*.dataset.file.added | when a file is added to a dataset | | clowder.dataset.file.added |
*.dataset.file.removed | when a file is removed from a dataset | | clowder.dataset.file.removed |
*.metadata.added | when metadata is added to a file or dataset | | clowder.metadata.added |
*.metadata.removed | when metadata is removed from a file or dataset | | clowder.metadata.removed |
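
The patterns in the table follow RabbitMQ's topic-exchange rules, where `*` stands for exactly one dot-separated word and `#` for zero or more words. The sketch below reimplements that matching in plain Python so you can check which bindings would receive a given message without a running broker. The helper names `topic_matches` and `file_routing_key` are hypothetical, and the way the routing key is built from a MIME type is an assumption inferred from the examples column, not taken from Clowder's source.

```python
def topic_matches(pattern, routing_key):
    """Return True if a topic binding pattern matches a routing key.

    '*' matches exactly one dot-separated word, '#' matches zero or more.
    """
    def match(p, k):
        if not p:
            # Pattern exhausted: match only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb any number of words, including none.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


def file_routing_key(mime_type, exchange="clowder"):
    """Build a file-upload routing key (assumed format, based on the table)."""
    return "{}.file.{}".format(exchange, mime_type.replace("/", "."))


if __name__ == "__main__":
    key = file_routing_key("image/png")
    for pattern in ("*.file.#", "*.file.image.#", "*.file.text.#"):
        print(pattern, "matches", key, ":", topic_matches(pattern, key))
```

An extractor bound to `*.file.image.#` would therefore be notified about PNG uploads, while one bound to `*.file.text.#` would not.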
In a pyClowder 2 context, extractor scripts will have 3 parts:
- `main()` will set up the connection with RabbitMQ and begin listening for messages. This typically will not change across extractors.
- `check_message(parameters)` receives the message from RabbitMQ and includes information about the message in the parameters argument. Extractors can count the number of files, look for particular file extensions, check metadata and so on.
- `process_message(parameters)` receives the message and, if specified in `check_message()`, the file(s) themselves. Here the actual extractor code is called on the files. Outputs can also be uploaded back to Clowder as files and/or metadata, for example.

In addition to the extractor script itself:

- `extractor_info.json` contains some metadata about the extractor for registration and documentation.

Now that we have our necessary dependencies, we can try running a simple extractor to make sure we've installed things correctly. The wordcount extractor is included with pyClowder 2 and will add metadata to text files when they are uploaded to Clowder.
- Go to `/pyclowder2/sample-extractors/wordcount/`.
- `python wordcount.py` is the basic example.
- To connect to RabbitMQ at the Docker IP address, use `python wordcount.py --rabbitmqURI amqp://guest:guest@<dockerIP>/%2f`.
- Use `python wordcount.py -h` to get other command-line options.
- When you see "Starting to listen for messages", you are ready.

Once you can run the sample extractor, you are ready to develop your own extractor. Much of this section will be specific to Python extractors using pyClowder 2, but the concepts apply to all extractors.
Often you will have a script that already performs the desired operations, perhaps by providing a directory of input and output files on the command line. The goal will be to call the correct parts of your existing script from within the `process_message()` function in your extractor, and to push the outputs from those methods back into Clowder.
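
As a rough illustration of that wrapping, the sketch below keeps the existing logic in its own function (`count_words`, a stand-in for your script) and calls it from `process_message()`. It deliberately does not import pyClowder, so it runs without Clowder or RabbitMQ. The parameter keys `fileid`, `filename` and `inputfile`, the `upload_metadata` callback and the `demo()` driver are all assumptions made for this example; check the pyClowder 2 sample extractors for the real signatures and upload calls. `main()` is omitted because it only wires up the RabbitMQ connection.

```python
import os
import tempfile


def count_words(path):
    """Stand-in for an existing script: count lines, words and characters."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "characters": len(text),
    }


def check_message(parameters):
    """Decide from the message alone whether the file is worth processing."""
    # Hypothetical key: 'filename' is assumed to carry the uploaded file's name.
    return parameters.get("filename", "").lower().endswith(".txt")


def process_message(parameters, upload_metadata):
    """Run the existing code on the local copy and push the result back."""
    # Hypothetical keys: 'inputfile' is the local path, 'fileid' the Clowder ID.
    result = count_words(parameters["inputfile"])
    # upload_metadata stands in for whatever call sends metadata to Clowder.
    upload_metadata(parameters["fileid"], result)
    return result


def demo():
    """Simulate one message end to end using a temporary text file."""
    uploaded = []  # collects (file_id, metadata) instead of contacting Clowder
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "notes.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("hello clowder\nhello extractor\n")
        parameters = {"fileid": "abc123", "filename": "notes.txt", "inputfile": path}
        if check_message(parameters):
            process_message(parameters, lambda fid, md: uploaded.append((fid, md)))
    return uploaded


if __name__ == "__main__":
    print(demo())
```

Keeping `count_words` free of any Clowder-specific code means the same function can still be run from the command line, and only the thin `process_message()` wrapper has to change if the extractor API does.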
Things to keep in mind:
    sudo -s
    export RABBITMQ_URL="amqp://guest:guest@localhost:5672/%2F"
    export EXTRACTORS_HOME="/home/clowder"
    apt-get -y install git python-pip
    pip install pika requests
    cd ${EXTRACTORS_HOME}
    git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/pyclowder.git
    chown -R clowder.users pyclowder

    cd /etc/init
    for x in clowder-*.conf; do
      start `basename $x .conf`
    done