
The Brown Dog Data Transformation Service (DTS) is a highly extensible/distributed service providing a uniform means of managing and accessing transformation capabilities within the web. Utilized tools can come in the form of command line applications, GUI driven applications, libraries, and/or other services. Here we go over the process of preparing a new transformation tool, either an extractor or a converter, for usage with the DTS.

Using the BD Development Base

BD-base runs the necessary dockerized Brown Dog Data Transformation Service components (Clowder, Polyglot, Fence, RabbitMQ, MongoDB, Redis, an example extractor, an example converter, and the BD CLI) allowing a developer to get up and running more quickly as they create and debug new extractors/converters. You can get the BD-base by cloning the git repo:

git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/bd/bd-base.git

or download the VirtualBox VM image and run it:

https://browndog.ncsa.illinois.edu/downloads/bd-base.ova

After downloading BD-base, run the bd bash script from the command line to start up the BD development base.

cd bd-base 
./bd

The BD-base script will split your terminal into panes and start each of the services needed for the Brown Dog DTS. This provides a useful and convenient way to view the logs of running services in panes. 

Users can switch between panes using tmux commands. The panes are as follows: Fence (top), Clowder (middle-left), example extractor (middle-right), Polyglot (middle-left), example converter (middle-right), and the BD CLI (bottom). In the bottom pane, users can run BD-CLI commands to interact with the Brown Dog Data Transformation Service (username: bd, password: browndog).


CTRL-b <arrow key> will navigate panes.
Exit the bd-base session by typing CTRL-b then :kill-session.
NOTE: There is a .tmux.conf file included in bd-base. If you copy this file into your home directory before starting a bd-base session you will be able to navigate panes via the mouse and end the session by typing CTRL-b then CTRL-c.

Extractors

Here we describe the process for taking a working piece of code and deploying it as a Brown Dog extractor. For simplicity, it is assumed that the method can be invoked from a single call. In this example, we use the Python extractor wrapper and invoke a Python function. In a very similar fashion, a method developed in a language other than Python can be invoked using the subprocess module.
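As a rough illustration of the subprocess approach mentioned above, the sketch below wraps an external command in a Python function. The function name run_external_tool and the timeout value are hypothetical and not part of the template extractor; the only assumption about the tool is that it takes the input file path as its last argument and writes its result to standard output.

```python
import subprocess
import sys


def run_external_tool(command, input_path, timeout=60):
    """Run an external command on input_path and return its stdout as text.

    `command` is a list such as ["mytool", "--flag"] (hypothetical); the
    input path is appended as the final argument.
    """
    result = subprocess.run(
        list(command) + [input_path],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout,      # avoid hanging forever on a stuck tool
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

An extractor could then call something like run_external_tool(["mytool"], path) from its main method and turn the returned text into metadata.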

The main steps:

  1. Wrap the tool for use as an extractor in Clowder (and through that Brown Dog)
  2. Dockerize the extractor
  3. Deploy the extractor
  4. Add the extractor to the Tools Catalog

We assume that you have a tool that extracts some kind of metadata from a file or dataset, and that you have installed Python, Git, and Docker, as well as any other software needed by your extractor, on your computer.

1. Install pyClowder2

Install pyClowder2, a Python library that simplifies communication with Clowder, the backend service of Brown Dog that handles extractions. The advantage of using this library is that it manages all communication with Clowder and RabbitMQ (the distributed messaging bus), so the developer doesn't have to take care of such tasks. An extractor can also be written in native Python without pyClowder2, but the developer would then have to handle that communication directly.

pip install --upgrade pip
pip install -r https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/raw/requirements.txt git+https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/pyclowder2.git

2. Get Your Code Together

We have developed a template extractor written in Python. It is a simple word count extractor that counts lines, words, and characters in a text file. Clone the template extractor and rename the directory to an appropriate name that reflects the purpose of your extractor.

git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/bd/extractors-template.git
mv extractors-template/ <your_extractor_name>
cd <your_extractor_name>

Make changes to extractors.py (the main program). Treat the process_file method as the main method of the extractor; it should contain the main logic. You can call other methods in your Python code from this method after importing the necessary modules into this file.
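To make the idea concrete, here is a minimal sketch of the kind of logic a word count extractor would run from its main method. The function name count_text and the keys of the returned dictionary are hypothetical; the sketch deliberately leaves out the pyClowder2 calls that upload the result, since those depend on the template version you cloned.

```python
def count_text(path):
    """Return line, word, and character counts for a text file."""
    lines = words = characters = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())   # whitespace-separated tokens
            characters += len(line)      # includes the trailing newline
    return {"lines": lines, "words": words, "characters": characters}
```

Keeping the core logic in a plain function like this makes it possible to test it on local files before wiring it into the extractor.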


3. Edit extractor_info.json

This file contains information about the extractor in JSON-LD format. Update all relevant fields as needed.
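Before building the image, it can help to check that no field was left empty. The sketch below is one possible way to do that; the helper name find_unfilled_fields and the placeholder values are hypothetical, and it only inspects top-level string fields without assuming any particular field names.

```python
import json


def find_unfilled_fields(info_path, placeholders=("", "TODO")):
    """Return the top-level keys whose string value is empty or a placeholder."""
    with open(info_path, "r", encoding="utf-8") as handle:
        info = json.load(handle)
    return sorted(
        key
        for key, value in info.items()
        if isinstance(value, str) and value.strip() in placeholders
    )
```

Calling find_unfilled_fields("extractor_info.json") in the extractor directory would list the fields that still need attention.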

4. Configuration Parameters

Extractors obtain the configuration details required to connect to RabbitMQ, Clowder, etc., either from command-line arguments or from environment variables. If you look at the Dockerfile inside the template extractor directory, you can see some of the environment variables being set. To run your extractor using the BD Development Base, you DO NOT have to change anything.

Note: The remaining part of this section is relevant ONLY if you want to run your extractor against another production instance at some other location. Otherwise, you can skip and continue reading the next section.

If you are planning to run your extractor using Docker, you will need to modify the Dockerfile to set the environment variables as required.

Otherwise, if you run your extractor as a standalone program (outside of Docker), you will need to set the relevant command-line arguments. You can get a list of these parameters by running your extractor with the help option (-h, --help).
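The general pattern of reading a setting from the command line and falling back to an environment variable can be sketched as follows. This is not the pyClowder2 implementation: the option names, the variable names RABBITMQ_URI and RABBITMQ_EXCHANGE, and the default values are assumptions chosen for illustration, so check the help output of your extractor for the real ones.

```python
import argparse
import os


def parse_config(argv=None, environ=None):
    """Parse settings, letting command-line arguments override environment variables."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Example extractor configuration")
    # Assumed names and defaults; the environment value is used only when no flag is given
    parser.add_argument(
        "--rabbitmq-uri",
        default=environ.get("RABBITMQ_URI", "amqp://guest:guest@localhost:5672/"),
    )
    parser.add_argument(
        "--rabbitmq-exchange",
        default=environ.get("RABBITMQ_EXCHANGE", "clowder"),
    )
    return parser.parse_args(argv)
```

With this layout, a Dockerfile can set the environment variables once, while a developer running the program by hand can still override them with flags.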

5. Edit the Dockerfile

Update the Dockerfile to install your software dependencies by adding the necessary instructions with the RUN command. You will need to add a line to the Dockerfile that switches to the root user (USER root) to get the proper permissions. For example, to install the ImageMagick package using apt-get, add the following commands to the Dockerfile:

USER root
RUN apt-get update && apt-get install -y imagemagick

6. Test the Extractor

You can test your extractor as follows:

docker build -t <your_extractor_name> .
docker run -it --link browndog_clowder_1 --link browndog_rabbitmq_1:rabbitmq <your_extractor_name>

You should see the following in the terminal. This means that the extractor is running and waiting for messages:

INFO    : pyclowder.extractors -  Waiting for messages. To exit press CTRL+C

Converters

Here we describe the process for taking a working piece of code (an application, library, other service, etc.) and deploying it as a Brown Dog converter. In this example, we describe the creation of a converter using the popular image conversion tool, ImageMagick.

1. Get Your Code Together

We have developed a template converter. It is a simple image converter that converts between different image formats using the ImageMagick tool. Clone the template converter and rename the directory to an appropriate name that reflects the purpose of your converter.

git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/bd/convertors-template.git
mv convertors-template/ <your_converter_name>
cd <your_converter_name>

Rename and edit the ImageMagick_convert.sh script to wrap your conversion tool. The script file should be named in the format <alias>_convert.<script_type>. Here <alias> is the name of the conversion tool under which the converter registers with Polyglot, and <script_type> is the extension for the type of script the wrapper is written in. Polyglot currently supports scripts written in Python, Bash, R, AutoHotKey, AutoIT, and Sikuli (e.g. *.py, *.sh, etc.). For ease of explanation, we will rename the script file to MyTool_convert.sh. This script accepts three parameters:

  1. Full path to input file
  2. Full path to output file (including filename)
  3. Full local path to available scratch space (optional)
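Since Polyglot also accepts Python wrapper scripts, the handling of these three parameters could look roughly like the sketch below. The function names are hypothetical, and ImageMagick's `convert <input> <output>` command is used as a stand-in for your own tool's invocation.

```python
import subprocess
import sys


def parse_script_args(argv):
    """Split positional arguments into input path, output path, and optional scratch path."""
    if len(argv) < 2:
        raise SystemExit("usage: MyTool_convert.py <input> <output> [scratch_dir]")
    input_path, output_path = argv[0], argv[1]
    scratch_dir = argv[2] if len(argv) > 2 else None  # third parameter is optional
    return input_path, output_path, scratch_dir


def build_command(input_path, output_path):
    """Build the tool invocation (stand-in: ImageMagick's convert)."""
    return ["convert", input_path, output_path]


def main(argv):
    input_path, output_path, _scratch_dir = parse_script_args(argv)
    subprocess.run(build_command(input_path, output_path), check=True)

# In the wrapper script itself, the entry point would be: main(sys.argv[1:])
```

Separating argument parsing from the tool invocation keeps the wrapper easy to adapt: only build_command needs to change for a different tool.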

This script will be used by the Software Server to run the tool and carry out any requested conversions. The example script ImageMagick_convert.sh uses the ImageMagick tool to convert images between different formats. A conversion script begins with a specific header, written as comments:

  1. First line is the shebang line
  2. Second line contains the name of the converter followed by the version (if any)
  3. Third line refers to the type of the data that it can convert
  4. Fourth line contains a comma-separated list of the input file formats accepted by this converter
  5. Fifth line contains a comma-separated list of the output file formats that this converter can generate
  6. This is followed by the actual code that does conversion.
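The header layout listed above can be illustrated with a small parser. This is only a sketch for checking your own script, not the way Polyglot itself reads the header; the sample header, the MyTool name, its version string, and the format lists are all hypothetical, and `#` is assumed to be the comment marker.

```python
def parse_converter_header(script_text):
    """Extract name, data type, and input/output formats from a wrapper script header."""
    lines = script_text.splitlines()
    if len(lines) < 5 or not lines[0].startswith("#!"):
        raise ValueError("expected a shebang line followed by four header comment lines")

    def split_formats(text):
        return [item.strip() for item in text.split(",") if item.strip()]

    # Lines 2-5 are comments: strip the leading '#' and surrounding whitespace
    name, data_type, inputs, outputs = (line.lstrip("#").strip() for line in lines[1:5])
    return {
        "name": name,
        "type": data_type,
        "inputs": split_formats(inputs),
        "outputs": split_formats(outputs),
    }


# Hypothetical header for a Bash wrapper named MyTool_convert.sh
EXAMPLE = """#!/bin/bash
#MyTool (v1.0)
#image
#png, jpg
#gif, pdf
echo "conversion code goes here"
"""
```

Running parse_converter_header(EXAMPLE) returns the converter name, the data type, and the two format lists, which is a quick way to confirm that the header lines are in the expected order.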

2. Edit the Dockerfile

Modify the Dockerfile in the converter directory to replace ImageMagick with MyTool. Specifically, change lines 11, 15, 16, and 17. You also need to change other fields, such as the maintainer, and may need to add instructions to install any software required by your converter. For example, the example Dockerfile contains the instruction that installs ImageMagick:

Dockerfile
# Create softwareserver for polyglot.
FROM ncsapolyglot/polyglot:develop
MAINTAINER Rob Kooper <kooper@illinois.edu>

USER root
# - install requirements
# - enable shellscripts to be scanned
# - enable imagemagick conversion by adding to .aliases.txt
RUN apt-get update && apt-get -y install vim nano imagemagick && \
	/bin/sed -i -e 's/^\([^#]*Scripts=\)/#\1/' -e 's/^#\(ShellScripts=\)/\1/' /home/polyglot/polyglot/SoftwareServer.conf && \
	echo "ImageMagick" > /home/polyglot/polyglot/scripts/sh/.aliases.txt

# copy convert file to scripts/sh folder in container
# this is done to keep cache so you can debug script easily
COPY ImageMagick_convert.sh /home/polyglot/polyglot/scripts/sh/
RUN chown polyglot /home/polyglot/polyglot/scripts/sh/ImageMagick_convert.sh && \
    chmod +x /home/polyglot/polyglot/scripts/sh/ImageMagick_convert.sh

# back to polyglot
CMD ["softwareserver"]

Specifically, modify:

echo "ImageMagick" > /home/polyglot/polyglot/scripts/sh/.aliases.txt

to:

echo "MyTool" > /home/polyglot/polyglot/scripts/sh/.aliases.txt

modify:

COPY ImageMagick_convert.sh /home/polyglot/polyglot/scripts/sh/

to:

COPY MyTool_convert.sh /home/polyglot/polyglot/scripts/sh/

and modify:

RUN chown polyglot /home/polyglot/polyglot/scripts/sh/ImageMagick_convert.sh && \
    chmod +x /home/polyglot/polyglot/scripts/sh/ImageMagick_convert.sh

to:

RUN chown polyglot /home/polyglot/polyglot/scripts/sh/MyTool_convert.sh && \
    chmod +x /home/polyglot/polyglot/scripts/sh/MyTool_convert.sh

3. Test the Converter

You can test your converter as follows:

docker build -t mytool .
docker run -it --link browndog_rabbitmq_1:rabbitmq mytool

You should see the following in the terminal. This means that the converter is running and waiting for messages:

Available Software:
  ImageMagick (ImageMagick)
