...

At the end, it can call a sendresponse(files, metadata) function of some kind to automatically build the dictionary for the Simple Extractor to parse. We should think about this: maybe different sendresponse() functions for file vs. dataset extractors? We don't want users to have to build the JSON object themselves, although maybe they have to if the JSON object is complex and sendresponse() only covers basic responses?
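As a rough sketch of what such a helper could look like (purely hypothetical; sendresponse() is not an existing PyClowder function, and the mapping of files to previews is an assumption), it might simply assemble the response dictionary from the pieces the extractor already has:

Code Block
languagepy
themeConfluence
# Hypothetical sketch only -- sendresponse() does not exist in PyClowder today.
# It would assemble the response dictionary that the Simple Extractor expects,
# so extractor authors do not have to build the JSON object themselves.
def sendresponse(files=None, metadata=None):
    response = {}
    if metadata is not None:
        response["metadata"] = metadata   # dict of extracted metadata terms
    if files is not None:
        response["previews"] = files      # list of generated preview file paths
    return response

# Example usage inside an extractor's main function:
# return sendresponse(files=["histogram.png"], metadata={"words": "37"})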


Single File Extractor:

...

Introduction

Clowder is an open-source research data management system that supports curation of long-tail data and metadata across multiple research domains and diverse data types. It uses a metadata extraction bus to perform data curation. Extractors are software programs that extract specific metadata from a file or dataset (a group of related files). The Simple Extractor Wrapper is a piece of software being developed to make the process of developing an extractor easier. This document provides the details of writing an extractor program using the Simple Extractor Wrapper.

Goals of Simple Extractor Wrapper

An extractor can be written in any programming language as long as it can communicate with Clowder using a simple HTTP web service API and RabbitMQ. Developing an extractor from scratch can be hard when you also consider the code needed for this communication. To reduce this effort and to avoid code duplication, we created libraries written in Python (PyClowder) and Java (JClowder) that make writing extractors in these languages easy. We chose these languages because they are among the most popular and continue to remain so. Even with these libraries, however, there is still some overhead in developing an extractor. To make the process of writing extractors even easier, we created the Simple Extractor Wrapper, which wraps around your existing Python source code and turns it into an extractor. The main goal of this wrapper is to help create Python extractors with minimal effort. As the name suggests, the extractor itself needs to be simple in nature: it processes a file and generates metadata in JSON format and/or creates a file preview. Other Clowder API endpoints are not currently available through the Simple Extractor; for those, the developer has to fall back to PyClowder, JClowder, or writing the extractor from scratch.

Step-by-Step Instructions

Prerequisites

The step-by-step instructions to create an extractor using the Simple Extractor Wrapper assume the following:

  1. Docker is installed on your computer. You can download and install Docker from https://www.docker.com/products/docker-desktop.
  2. You already have a piece of code written in Python that can process a file and generate metadata.
  3. The extractor that you are trying to create will only generate metadata in JSON format and/or a file preview.
  4. Your code has been tested and does what it is supposed to do.

The main function of your Python program needs to accept the path of the input file as a string. It also needs to return a dictionary containing metadata information ("metadata"), details about file previews ("previews"), or both, in the following format:

Code Block
languagejs
themeConfluence
{
	"metadata": dict(),
	"previews": array()
}
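
For example, a minimal main function could look like the sketch below (the module and function names are placeholders, and the metadata fields are purely illustrative):

Code Block
languagepy
themeConfluence
# my_python_program.py -- placeholder module name; use your own
import os


def my_main_function(input_file):
    """Process the file at the given path and return metadata and/or previews."""
    metadata = {
        'filename': os.path.basename(input_file),
        'size_bytes': os.path.getsize(input_file)
    }
    # 'previews' lists the paths of any preview files your code generates;
    # leave it empty (or omit the key) if no previews are produced.
    return {
        'metadata': metadata,
        'previews': []
    }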

Instructions

Your extractor will contain several files. The ones used by the Simple Extractor Wrapper are listed below, and the instructions that follow will help you create them:

    • my_python_program.py (required): For simplicity, let us call the Python file that contains the main function my_python_program.py, the main function my_main_function, and your extractor my_extractor.
    • extractor_info.json (required): Contains metadata about the extractor

    • Dockerfile (required): Contains instructions to create a docker image of your extractor

    • requirements.txt (optional): Contains names of Python packages that will be installed using the pip command.

    • packages.apt (optional): Contains names of Linux packages that will be installed using the apt-get command (see the example after this list).
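
Both optional files use a plain one-package-per-line format. For example, a hypothetical requirements.txt might contain (the package names are placeholders for whatever your code actually needs):

No Format
numpy
pillow

and a hypothetical packages.apt might contain:

No Format
imagemagick
libopencv-dev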

Create and save extractor_info.json in your source code directory using any text editor. This file contains the metadata about the extractor that you are creating; please fill in the relevant details. The file follows the JSON-LD standard. As you can see in the template below, you can fill in details such as the name, version, author, contributors, source code repository, docker image name, the data types on which the extractor will work, external services used, any dependent libraries, and BibTeX-format citations for publications that the extractor refers to. A template extractor_info.json is provided below for reference:

Code Block
languagejs
themeConfluence
linenumberstrue
{
  "@context": "<context root URL>",
  "name": "<extractor name>",
  "version": "<version number>",
  "description": "<extractor description>",
  "author": "<first name> <last name> <<email address>>",
  "contributors": [
    "<first name> <last name> <<email address>>",
    "<first name> <last name> <<email address>>"
  ],
  "contexts": [
    {
      "<metadata term 1>": "<URL definition of metadata term 1>",
      "<metadata term 2>": "<URL definition of metadata term 2>"
    }
  ],
  "repository": [
    {
      "repType": "git",
      "repUrl": "<source code URL>"
    }
  ],
  "process": {
    "file": [
      "<MIME type/subtype>",
      "<MIME type/subtype>"
    ]
  },
  "external_services": [],
  "dependencies": [],
  "bibtex": []
}
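
For instance, a filled-in extractor_info.json for a hypothetical word counting extractor might look like the following (all names, URLs, and values are illustrative placeholders, not taken from an existing extractor):

Code Block
languagejs
themeConfluence
{
  "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
  "name": "ncsa.wordcount",
  "version": "1.0",
  "description": "Counts lines, words, and characters in a text file.",
  "author": "Jane Doe <jdoe@example.com>",
  "contributors": [],
  "contexts": [
    {
      "lines": "http://example.com/terms/lines",
      "words": "http://example.com/terms/words",
      "characters": "http://example.com/terms/characters"
    }
  ],
  "repository": [
    {
      "repType": "git",
      "repUrl": "https://example.com/my_extractor.git"
    }
  ],
  "process": {
    "file": [
      "text/plain",
      "application/json"
    ]
  },
  "external_services": [],
  "dependencies": [],
  "bibtex": []
}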

...

You can also use the curl command to download it from a terminal:

Code Block
languagebash
themeConfluence
curl https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/raw/docker-compose.yml?at=refs%2Fheads%2FBD-2226-add-docker-compose-file-to-pyclowder2 --output docker-compose.yml

Start up the Clowder services stack (Clowder, RabbitMQ, MongoDB, and ElasticSearch) by running the following command from the directory containing the downloaded docker-compose.yml file. This may take a few minutes when running for the first time:

Code Block
languagebash
themeConfluence
docker-compose up
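
If you want to verify that all services have started, you can list the containers in the stack from another terminal window in the same directory:

Code Block
languagebash
themeConfluence
docker-compose ps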

Create and save a Dockerfile in your existing source code directory. This can be done using any text editor on your computer. The content of the Dockerfile needs to be the following, where you should replace my_python_program.py and my_main_function with their actual names:

Code Block
languagebash
themeConfluence
FROM clowder/extractors-simple-extractor:onbuild
ENV EXTRACTION_FUNC="my_main_function"
ENV EXTRACTION_MODULE="my_python_program.py"

...

Now, create the Docker image for your extractor using the command below. Please note that there is a dot (.) at the end of the command. Open a terminal, change to the directory containing your Dockerfile using the cd command, and then run the command below (this will also install the Python packages from requirements.txt and the Linux apt-get packages from packages.apt):

Code Block
languagebash
themeConfluence
docker build -t my_extractor .

In the terminal, you should be able to see the logs of the services that are part of the Clowder stack.

From another terminal window, you can now run your extractor using the following command:

Code Block
languagebash
themeConfluence
docker run -t -i --rm --network clowder_clowder my_extractor

You should be able to see the extractor's startup logs in this terminal window.

...

To stop the Clowder services stack, open a terminal, change to the directory containing the Clowder docker-compose.yml using the cd command, and run the command below:

Code Block
languagebash
themeConfluence
docker-compose down

...

Creating a Word Count Extractor Using Simple Extractor Wrapper

Now, following the instructions in the previous section, we can create a word count extractor using the Simple Extractor Wrapper. This extractor consists of three files: Dockerfile, extractor_info.json, and wordcount.py. The word count extractor uses the wc command available on Ubuntu to count the lines, words, and characters in a text or JSON file. No additional packages are needed by this extractor, so there are no package installation files (requirements.txt, packages.apt).
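
A minimal sketch of what wordcount.py could contain is shown below. This is an illustrative version and may differ from the code in the repository linked at the end of this page: it runs wc on the input file and packs the counts into the response dictionary expected by the Simple Extractor.

Code Block
languagepy
themeConfluence
# wordcount.py -- illustrative sketch; the repository version may differ
import subprocess


def wordcount(input_file):
    """Run the wc command on input_file and return the counts as Clowder metadata."""
    # wc prints: <lines> <words> <characters> <filename>
    counts = subprocess.check_output(['wc', input_file]).decode('utf-8').split()
    return {
        'metadata': {
            'lines': counts[0],
            'words': counts[1],
            'characters': counts[2]
        }
    }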

...

Running wordcount.py as a regular Python program

In a terminal, you can run wordcount.py as regular Python code to extract the counts of lines, words, and characters from an input text file. For example, let us create a text file called poem.txt containing the first stanza of the poem "The Road Not Taken" by Robert Frost:

No Format
"Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;"

Now we can run the following command to test the word count code:

Code Block
languagebash
themeConfluence
python -c "import wordcount; print(wordcount.wordcount('poem.txt'))"

It will show the output as:

Code Block
languagebash
themeConfluence
{'metadata': {'lines': '5', 'characters': '182', 'words': '37'}}

Running wordcount.py as extractor

The word count extractor runs as a Docker container. To use it, refer to the instructions above to start the Clowder services stack and then launch the word count extractor. You can then submit files from the Clowder web interface by following the steps listed below:

...

The source code of the word count extractor created using the simple extractor wrapper can be found here: https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/browse/sample-extractors/wordcount-simple-extractor