Extractors are services that run silently alongside Clowder. They can be configured to wait for specific file types to be uploaded into Clowder, and automatically execute scripts and processing on those files to extract metadata or generate new files.

Setting up a development environment

In order to develop and test an extractor, it's useful to have a local instance of Clowder running that you can test against. This will allow you to upload target files to trigger your extractor, and verify any outputs are being submitted back into Clowder correctly.

Start a local Clowder instance

The easiest way to get a local Clowder instance up and running is via Docker. We have created a docker image with the full Clowder stack already installed:

Java (to run the Clowder application itself)
MongoDB (underlying database storage)
RabbitMQ (message bus for communication between Clowder and extractors)
Clowder itself

It's possible to install these elements individually, but unless you want to pursue Clowder development this is unnecessary.

Install Docker and download our docker-compose.yml file from GitHub. This file tells Docker how to fetch and start the component containers for Clowder and its dependencies.
From a docker-aware terminal, go to the directory where you put the .yml file and run docker-compose up. This should start Clowder and its components.
Now you need to determine which IP address Docker is running on.
1. Docker's Networking documentation shows how to use docker network ls and ifconfig to determine Docker's IP address.
2. Older installations that use docker-machine may need to use docker-machine ip.
You should be able to now access Clowder at <dockerIP>:9000.
You can also access the RabbitMQ management console at <dockerIP>:15672.
Finally, sign up for a local account in Clowder. Because we have not configured an email server, no confirmation email will be sent - however we can get the confirmation URL from the Clowder logs:
1. docker ps to list running Docker containers
2. docker logs <clowder container name> to see the logs for your Clowder container
3. Look for a block of HTML in the log that is the body of the unsent email:
4. Copy the link into your browser to activate your account. For example in the screenshot above you would visit:
  http://localhost:9000/clowder/signup/22af7d43-8260-4c6c-822a-22bca0cb8340

Install pyClowder 2

In addition to Clowder, another package that makes extractor development easier is pyClowder 2. This package is not required - extractors can be written in any language that RabbitMQ supports.

pyClowder 2 provides helpful Python wrapper functions for many standard Clowder API calls and communication with RabbitMQ using pika. If you want to write extractors in other languages, you'll need to implement that functionality on your own.

Install Git
git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/pyclowder2.git
python setup.py install from inside the new pyclowder2 directory

This will install pyClowder 2 on your system. In Python scripts you will refer to the package as:

import pyclowder
from pyclowder.extractors import Extractor
import pyclowder.files

...etc.

Running a sample extractor

Now that we have our necessary dependencies, we can try running a simple extractor to make sure we've installed things correctly. The wordcount extractor is included with pyClowder 2 and will add metadata to text files when they are uploaded to Clowder.

Go to /pyclowder2/sample-extractors/wordcount/
Run the extractor
1. python wordcount.py is basic example
2. If you're running Docker, you'll need to specify the correct RabbitMQ URL because Docker is not localhost:
  python wordcount.py --rabbitmqURI amqp://guest:guest@<dockerIP>/%2f
3. You can use python wordcount.py -h to get other commandline options.
When the extractor reports "Starting to listen for messages" you are ready.
Upload a .txt file into Clowder and verify the extractor triggers and metadata is added to the file, e.g.:

Writing an extractor

Extractor events

Extractors use the RabbitMQ message bus to communicate with Clowder instances. Queues are created for each extractor, and the queue bindings filter the types of Clowder event messages the extractor is notified about. The following non-exhaustive list of events exist in Clowder (messages begin with an asterisk because the exchange name is not required to be 'clowder'):

message type	trigger event	message payload	examples
*.file.#	when any file is uploaded	added file ID added filename destination dataset ID, if applicable	clowder.file.image.png clowder.file.text.csv clowder.file.application.json
.file.image.# .file.text.# ...	when any file of the given MIME type is uploaded (this is just a more specific matching)	added file ID added filename destination dataset ID, if applicable	see above
*.dataset.file.added	when a file is added to a dataset	added file ID dataset ID full list of files in dataset	clowder.dataset.file.added
*.dataset.file.removed	when a file is removed from a dataset	removed file ID dataset ID full list of files in dataset	clowder.dataset.file.removed
*.metadata.added	when metadata is added to a file or dataset	file or dataset ID the metadata that was added	clowder.metadata.added
*.metadata.removed	when metadata is removed from a file or dataset	file or dataset ID	clowder.metadata.removed

common requirements

sudo -s
export RABBITMQ_URL="amqp://guest:guest@localhost:5672/%2F"
export EXTRACTORS_HOME="/home/clowder"
 
apt-get -y install git python-pip
pip install pika requests
 
cd ${EXTRACTORS_HOME}
git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/pyclowder.git
chown -R clowder.users pyclowder

start extractors

cd /etc/init
for x in clowder-*.conf; do
  start `basename $x .conf`
done