Extractors are services that run silently alongside Clowder. They can be configured to wait for specific file types to be uploaded into Clowder, and automatically execute scripts and processing on those files to extract metadata or generate new files.

Setting up a development environment

In order to develop and test an extractor, it's useful to have a local instance of Clowder running that you can test against. This will allow you to upload target files to trigger your extractor, and verify any outputs are being submitted back into Clowder correctly.

Start a local Clowder instance

The easiest way to get a local Clowder instance up and running is via Docker. We have created a Docker image with the full Clowder stack already installed:

  • Java (to run the Clowder application itself)
  • MongoDB (underlying database storage)
  • RabbitMQ (message bus for communication between Clowder and extractors)
  • Clowder itself

It's possible to install these elements individually, but unless you want to pursue Clowder development this is unnecessary.

  1. Install Docker and download our docker-compose.yml file from GitHub. This file tells Docker how to fetch and start the component containers for Clowder and its dependencies.

  2. From a docker-aware terminal, go to the directory where you put the .yml file and run:
    docker-compose up

    This should start Clowder and its components.

  3. Now you need to determine which IP address Docker is running on. The Docker networking guide at https://docs.docker.com/engine/userguide/networking/ shows how to use the following commands to determine Docker's IP address:
    docker network ls
    ifconfig

    Older installations that use docker-machine may need to use:

    docker-machine ip
     
  4. You should now be able to access Clowder at:
    <dockerIP>:9000

    You can also access the RabbitMQ management console at:

    <dockerIP>:15672
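
If neither page loads, it can help to check whether the two ports are reachable at all before debugging anything else. The following is a minimal sketch, not part of Clowder: the helper names are hypothetical, and the http:// scheme is an assumption (the steps above only give the host and port).

```python
import socket

# Ports taken from step 4 above; the host is whatever address you found in step 3.
CLOWDER_PORT = 9000
RABBITMQ_MGMT_PORT = 15672


def service_urls(docker_ip):
    """Build the two URLs from step 4 for a given Docker IP address.

    The http:// scheme is an assumption; adjust it if your setup differs.
    """
    return {
        "clowder": f"http://{docker_ip}:{CLOWDER_PORT}",
        "rabbitmq_management": f"http://{docker_ip}:{RABBITMQ_MGMT_PORT}",
    }


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, is_port_open(docker_ip, CLOWDER_PORT) should return True once the containers have finished starting, where docker_ip is the address from step 3.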

Install pyClowder 2

Writing an extractor

Extractor events

Extractors communicate with Clowder instances over the RabbitMQ message bus. Each extractor gets its own queue, and the queue's bindings determine which Clowder event messages the extractor is notified about. The following is a non-exhaustive list of events in Clowder (message types begin with an asterisk because the exchange name is not required to be 'clowder'):

| message type | trigger event | message payload | examples |
|---|---|---|---|
| *.file.# | when any file is uploaded | added file ID; added filename; destination dataset ID, if applicable | clowder.file.image.png, clowder.file.text.csv, clowder.file.application.json |
| *.file.image.#, *.file.text.#, ... | when any file of the given MIME type is uploaded (this is just a more specific matching) | added file ID; added filename; destination dataset ID, if applicable | see above |
| *.dataset.file.added | when a file is added to a dataset | added file ID; dataset ID; full list of files in dataset | clowder.dataset.file.added |
| *.dataset.file.removed | when a file is removed from a dataset | removed file ID; dataset ID; full list of files in dataset | clowder.dataset.file.removed |
| *.metadata.added | when metadata is added to a file or dataset | file or dataset ID; the metadata that was added | clowder.metadata.added |
| *.metadata.removed | when metadata is removed from a file or dataset | file or dataset ID | clowder.metadata.removed |
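
The message types above are AMQP topic binding patterns: words are separated by dots, `*` stands for exactly one word and `#` for zero or more words. RabbitMQ does this matching itself inside the broker; the sketch below only re-implements the rule in plain Python so you can check which example routing keys a pattern would catch. The function name is hypothetical and not part of pyClowder.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP topic binding pattern matches a routing key.

    '*' matches exactly one dot-separated word, '#' matches zero or more.
    """
    def match(p, k):
        if not p:
            # Pattern exhausted: it is a match only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' either absorbs nothing, or absorbs one word and stays active.
            return match(rest, k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))
```

With this rule, `*.file.#` matches `clowder.file.image.png`, while `*.file.image.#` does not match `clowder.file.text.csv`.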

 

common requirements

 

sudo -s
export RABBITMQ_URL="amqp://guest:guest@localhost:5672/%2F"
export EXTRACTORS_HOME="/home/clowder"
 
apt-get -y install git python-pip
pip install pika requests
 
cd ${EXTRACTORS_HOME}
git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/pyclowder.git
chown -R clowder.users pyclowder
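
The trailing %2F in RABBITMQ_URL is the URL-encoded form of "/", which is the name of RabbitMQ's default virtual host. As a small illustration of how the URL breaks down, here is a sketch using only the Python standard library; the function name is hypothetical and this is not how pyclowder itself reads the setting.

```python
from urllib.parse import urlsplit, unquote


def parse_amqp_url(url):
    """Split an amqp:// URL into its connection parts."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Drop the leading "/" of the path, then decode %2F back to "/".
        "vhost": unquote(parts.path[1:]),
    }
```

For the URL exported above, this yields user and password "guest", host "localhost", port 5672 and virtual host "/".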

opencv

 

apt-get -y install python-opencv opencv-data
cd ${EXTRACTORS_HOME}
git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/extractors-cv.git
for x in opencv-closeups opencv-eyes opencv-faces opencv-profiles; do
  ln -s ${EXTRACTORS_HOME}/pyclowder/pyclowder ${EXTRACTORS_HOME}/extractors-cv/opencv/$x
  sed -i -e "s#rabbitmqURL = .*#rabbitmqURL = '${RABBITMQ_URL}'#" \
         -e "s#/usr/local/share/OpenCV#/usr/share/opencv#" ${EXTRACTORS_HOME}/extractors-cv/opencv/$x/config.py
  cp ${EXTRACTORS_HOME}/extractors-cv/opencv/$x/*.conf /etc/init
done
chown -R clowder.users extractors-cv
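
The sed expressions used here and in the sections below replace everything from a setting name to the end of its line in config.py. Because the pattern is not anchored to the start of the line, a pattern such as "Binary = .*" also matches a line that starts with a longer name ending in Binary. The sketch below shows the same substitution in Python; the function name and the sample config lines in the usage note are hypothetical.

```python
import re


def rewrite_setting(config_text, name, value):
    """Replace everything from `name = ` to the end of the line with a quoted value.

    Mirrors: sed -e "s#NAME = .*#NAME = 'VALUE'#"
    """
    pattern = re.escape(name) + r" = .*"
    replacement = f"{name} = '{value}'"
    # A function replacement keeps any backslashes in `value` literal.
    return re.sub(pattern, lambda _match: replacement, config_text)
```

Applied to a line such as imageBinary = '/usr/local/bin/convert', rewrite_setting(text, "Binary", "/usr/bin/convert") keeps the "image" prefix and only swaps the value.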

ocr

 

apt-get -y install tesseract-ocr
cd ${EXTRACTORS_HOME}
# Skip the clone if the opencv section already fetched this repository
[ -d extractors-cv ] || git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/extractors-cv.git
ln -s ${EXTRACTORS_HOME}/pyclowder/pyclowder ${EXTRACTORS_HOME}/extractors-cv/ocr
sed -i -e "s#rabbitmqURL = .*#rabbitmqURL = '${RABBITMQ_URL}'#" ${EXTRACTORS_HOME}/extractors-cv/ocr/config.py
cp ${EXTRACTORS_HOME}/extractors-cv/ocr/clowder-ocr.conf /etc/init
chown -R clowder.users extractors-cv

audio

 

apt-get -y install sox libsox-fmt-mp3
cd ${EXTRACTORS_HOME}
git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/extractors-audio.git
ln -s ${EXTRACTORS_HOME}/pyclowder/pyclowder ${EXTRACTORS_HOME}/extractors-audio/preview/
sed -i -e "s#Binary = .*#Binary = '`which sox`'#" -e "s#rabbitmqURL = .*#rabbitmqURL = '${RABBITMQ_URL}'#" extractors-audio/preview/config.py
cp ${EXTRACTORS_HOME}/extractors-audio/preview/clowder-audio-preview.conf /etc/init
chown -R clowder.users extractors-audio

image

 

apt-get -y install imagemagick
cd ${EXTRACTORS_HOME}
git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/extractors-image.git
ln -s ${EXTRACTORS_HOME}/pyclowder/pyclowder ${EXTRACTORS_HOME}/extractors-image/preview/
sed -i -e "s#imageBinary = .*#imageBinary = '`which convert`'#" -e "s#rabbitmqURL = .*#rabbitmqURL = '${RABBITMQ_URL}'#" extractors-image/preview/config.py
cp ${EXTRACTORS_HOME}/extractors-image/preview/clowder-image-preview.conf /etc/init
chown -R clowder.users extractors-image

pdf

 

apt-get -y install imagemagick
cd ${EXTRACTORS_HOME}
git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/extractors-pdf.git
ln -s ${EXTRACTORS_HOME}/pyclowder/pyclowder ${EXTRACTORS_HOME}/extractors-pdf/preview/
sed -i -e "s#Binary = .*#Binary = '`which convert`'#" -e "s#rabbitmqURL = .*#rabbitmqURL = '${RABBITMQ_URL}'#" extractors-pdf/preview/config.py
cp ${EXTRACTORS_HOME}/extractors-pdf/preview/clowder-pdf-preview.conf /etc/init
chown -R clowder.users extractors-pdf

video

apt-get -y install libav-tools
cd ${EXTRACTORS_HOME}
git clone https://opensource.ncsa.illinois.edu/stash/scm/cats/extractors-video.git
ln -s ${EXTRACTORS_HOME}/pyclowder/pyclowder ${EXTRACTORS_HOME}/extractors-video/preview/
sed -i -e "s#Binary = .*#Binary = '`which avconv`'#" -e "s#rabbitmqURL = .*#rabbitmqURL = '${RABBITMQ_URL}'#" extractors-video/preview/config.py
cp ${EXTRACTORS_HOME}/extractors-video/preview/clowder-video-preview.conf /etc/init
chown -R clowder.users extractors-video

start extractors

cd /etc/init
for x in clowder-*.conf; do
  start `basename $x .conf`
done