Comment: Reverted from v. 2

...

At the end, it can call a sendresponse(files, metadata) function of some kind to automatically build the dict for the Simple Extractor to parse. We should think about this: maybe there should be different sendresponse() functions for file vs. dataset extractors? We don't necessarily want users to have to build the JSON object themselves, although maybe they have to if the JSON object is complex and sendresponse() is just for basic responses?
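To make the idea concrete, here is a rough sketch of what a basic sendresponse() helper might look like. Nothing here is an existing API: the function body, the optional arguments, and the choice to place the files under a "previews" key next to a "metadata" key are all assumptions for discussion.

```python
def sendresponse(files=None, metadata=None):
    """Hypothetical helper: build the result dict for the Simple Extractor to parse.

    Assumption: preview files go under "previews" and metadata under "metadata".
    """
    response = {}

    if metadata:
        # Copy so later changes to the caller's dict do not alter the response.
        response["metadata"] = dict(metadata)

    if files:
        # Accept either a single path or a list of paths.
        if isinstance(files, str):
            files = [files]
        response["previews"] = list(files)

    return response
```

A helper like this would cover the basic case only; an extractor with a more complex JSON object could still build the dict by hand.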



Single File Extractor:

...

Introduction

Clowder is an open-source research data management system that supports curation of long-tail data and metadata across multiple research domains and diverse data types. It uses a metadata extraction bus to perform data curation. Extractors are software programs that extract specific metadata from a file or dataset (a group of related files). The Simple Extractor Wrapper is a piece of software being developed to make developing an extractor easier. This document describes how to write an extractor program using the Simple Extractor Wrapper.

Goals of Simple Extractor Wrapper

An extractor can be written in any programming language as long as it can communicate with Clowder using a simple HTTP web service API and RabbitMQ. It can be hard to develop an extractor from scratch when you also consider the code that is needed for this communication. To reduce this effort and to avoid code duplication, we created libraries written in Python (PyClowder) and Java (JClowder) to make the process of writing extractors easy in these languages. We chose these languages because they are among the most popular ones and continue to remain so. Even so, there is still some overhead in developing an extractor using these libraries. To make the process of writing extractors even easier, we created a Simple Extractor Wrapper that wraps around your existing Python source code and converts it into an extractor. The main goal of this wrapper is to help create Python extractors with minimal effort. As the name says, the extractor itself needs to be simple in nature. The extractor will process a file and generate metadata in JSON format and/or create a file preview. Other Clowder API endpoints are not currently available through the Simple Extractor; a developer who needs them would have to fall back to using PyClowder or JClowder, or to writing the extractor from scratch.

Step-by-Step Instructions

Prerequisites

The step-by-step instructions to create an extractor using the Simple Extractor Wrapper assume the following:

  1. Docker is installed on your computer. You can download and install Docker from https://www.docker.com/products/docker-desktop.
  2. You already have a piece of code written in Python that can process a file and generate metadata.
  3. The extractor that you are trying to create will only generate metadata in JSON format and/or a file preview.
  4. Your code has been tested and does what it is supposed to do.
  5. The main function of your Python program needs to accept the path of the input file as a string. It also needs to return a dictionary containing metadata information ("metadata"), details about file previews ("previews"), or both, in the following format:
    {
        "metadata": dict(),
        "previews": list()
    }
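As an illustration of the last prerequisite, here is a minimal sketch of a main function with the required shape. The function name extract_line_stats and the metadata keys are hypothetical and chosen only for this example; the sketch assumes the input is a UTF-8 text file and that no previews are generated.

```python
def extract_line_stats(input_file_path):
    """Hypothetical main function: compute simple statistics for a text file.

    Takes the input file path as a string and returns a dictionary
    with "metadata" and "previews" keys.
    """
    with open(input_file_path, "r", encoding="utf-8") as handle:
        text = handle.read()

    # Illustrative metadata only; a real extractor would compute its own fields.
    metadata = {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "characters": len(text),
    }

    # This sketch produces no preview files, so "previews" is an empty list.
    return {"metadata": metadata, "previews": []}
```

Calling extract_line_stats("/path/to/file.txt") returns a dictionary whose "metadata" value can be serialized to JSON, which matches the format described above.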

Instructions

Your extractor will contain several files. Those used by the Simple Extractor Wrapper are listed below, and the instructions that follow will help you create them:

...