Adding Tools to the DAP and DTS, Overview and Examples

Introduction

This guide is an introduction for new users of the Brown Dog software platform. It introduces the three main components of the platform, Polyglot, Medici, and Versus, and provides example scripts and code. These three components can be used to add tools to the Data Access Proxy (DAP) and the Data Tilling Service (DTS).

Table of Contents
- Prerequisites
- Polyglot
  - Overview
  - Brief History
    - Software Servers
    - Information Loss
    - The I/O Graph
    - File Formats
  - Scripting
    - AutoHotKey
      - Acrobat
      - OpenOffice
    - AppleScript
    - Python
    - Bash
- Medici
  - Overview
  - Example Extractors
    - Java
    - C++
    - Python
- Versus
  - Overview
  - Example Measures
    - Java

...

This overview assumes a basic level of knowledge about the three main components of the Brown Dog software platform: Polyglot, Medici, and Versus. Some background information is provided here. For a more in-depth overview of each component and its function, we recommend viewing the online tutorial sessions on the ISDA's YouTube account: http://www.youtube.com/channel/UCGIXAeNEa2v7Gt-tvfdJPvw.

...

Polyglot is intended to be a universal and scalable file format converter. Data preservation and curation are difficult problems for the scientific community. One of the hardest is that, over time, the file formats used to store important scientific data may become unreadable, and preserving data is necessary for reproducible science. Polyglot addresses the fact that many different file formats can represent the same data (e.g. .jpg, .gif, and .png for images), each readable by a different set of software programs. This shortens the lifespan of the data: the software needed to read a file may no longer be supported by its creator, may not run on the user's current system, or may be unavailable to the user because of licensing or lack of access, as is often the case with proprietary formats. When this happens, the user must either find a different program that can read the files or convert the data to a format that some other program can use. Polyglot seeks to let the user convert from any file format to one that is supported by the software available to them. In this way, Polyglot preserves data, allowing data that might otherwise become unusable to persist over time.

...

One of the main components of the Polyglot system is the network of software servers used for converting between file types. The system is heterogeneous, supporting different operating systems, and it is designed to be extensible. A server running a specific set of software to perform conversions can be brought up and added to the system, expanding the variety of conversions Polyglot is capable of. A software server runs a set of scripts (described below in the scripting section) that open and save files, which makes it possible to convert from one file format to another through existing software. There is also a mechanism by which a script can be killed if a failed conversion or a crash of the program is detected.
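
As a rough illustration of such a kill mechanism, the following Python sketch runs a script as a child process and stops it if it does not finish within a time limit. The function name run_with_timeout and the time limit are assumptions made for this example. This is not Polyglot's own implementation, which may detect failures and crashes differently.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a script; return True on success, False on failure or hang."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception
        return False
    # A non-zero exit code is treated as a failed conversion
    return completed.returncode == 0


if __name__ == "__main__":
    # Hypothetical stand-in for a conversion script that hangs
    hanging_script = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging_script, timeout_seconds=1))
```

A time limit is the simplest way to detect a hang. A GUI application that is still running but stuck on a dialog would need its own kill script, as in the AutoHotKey examples later in this guide.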

...

One significant issue with file conversions that Polyglot addresses is that converting from one file format to another can often cause information loss. This is detrimental to the value of the data stored in these files, as information lost during conversion can render the files useless for their intended purpose. This type of loss often arises because different software programs implement different subsets of the specification for a file format. An example can be seen below. The original model in the image (top) was in the .stp format, was converted to the .igs format, and was finally converted back to the .stp format. The result is shown below the original image; the information loss is evident.

[Image: the original .stp model (top) and the result after conversion to .igs and back to .stp (bottom)]

Polyglot mitigates the information loss described above using a structure called an I/O graph, described below.

...

As the name "Brown Dog" suggests, the project aims to bring together a number of external tools as part of the two services being constructed. For the DAP, which handles conversions, these tools are incorporated as scripts for the Software Servers in Polyglot or as DFDL schemas for Daffodil. For the DTS, which handles the automatic extraction of metadata and content signatures, tools are incorporated as either extractors for Medici or extractors for Versus. Below we show examples for incorporating each of these components. This overview assumes a basic level of knowledge about the three main components of the Brown Dog software platform, i.e. Polyglot, Medici, and Versus. For a more in-depth overview of each of these components and their function, it is recommended that you first read through their online documentation and/or go through one of the online tutorial videos:

Polyglot Software Server Scripts

Software Server scripts are used by Polyglot to automate the interaction with software that is capable of converting from one file format to another. These scripts can directly wrap command line utilities that carry out conversions, or they can split the work into separate steps of opening a file in one format and saving it in a different format, as is typical of GUI-driven applications. These wrapper scripts can be written in almost any text-based scripting language. Below we show a few simple examples. For full details on the creation of these wrapper scripts, the required naming conventions, and the required header conventions, please refer to the Scripting Manual.

...

The I/O graph represents all possible paths from one file format to another based on the software within the Polyglot system. Below, a visual representation of one I/O graph is shown. It represents the information for 17 different 3D imaging applications. The vertices of this graph represent the input and output file formats supported by the programs that were used to generate it. Each edge indicates that an application can convert from the source format to the target format. Note that a conversion between two formats may require multiple steps, as indicated by the highlighted path below, which represents a conversion from the .lwo file format to the .stp file format.

[Image: I/O graph for 17 3D imaging applications, with the path from .lwo to .stp highlighted]
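
The idea of a multi-step conversion path can be sketched in Python as a breadth-first search over a small graph. The edge list, the application names (AppA, AppB, AppC), and the function names below are invented for illustration and do not come from a real Polyglot I/O graph. The search returns the path with the fewest steps, which is used here only as a simple stand-in; Polyglot's own path selection may weigh other factors, such as information loss.

```python
from collections import deque

# Hypothetical I/O graph: (source format, target format, application)
EDGES = [
    ("lwo", "obj", "AppA"),
    ("obj", "igs", "AppB"),
    ("igs", "stp", "AppC"),
    ("obj", "stl", "AppB"),
]


def build_graph(edges):
    """Map each source format to the (target format, application) pairs it reaches."""
    graph = {}
    for source, target, application in edges:
        graph.setdefault(source, []).append((target, application))
    return graph


def find_conversion_path(graph, source, target):
    """Return the shortest list of (application, from, to) steps, or None."""
    queue = deque([source])
    came_from = {source: None}
    while queue:
        fmt = queue.popleft()
        if fmt == target:
            break
        for next_fmt, application in graph.get(fmt, []):
            if next_fmt not in came_from:
                came_from[next_fmt] = (fmt, application)
                queue.append(next_fmt)
    if target not in came_from:
        return None
    # Walk back from the target to the source to recover the steps
    steps = []
    fmt = target
    while came_from[fmt] is not None:
        previous, application = came_from[fmt]
        steps.append((application, previous, fmt))
        fmt = previous
    steps.reverse()
    return steps


if __name__ == "__main__":
    for step in find_conversion_path(build_graph(EDGES), "lwo", "stp"):
        print(step)
```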

...

The name Polyglot comes from the term for someone who speaks many languages, reflecting that Polyglot is intended to provide conversions to and from as many file formats as possible. The number of available conversions is limited only by the software installed on the software servers (described above) and the available paths from one file format to another in the generated I/O graph (also described above).

...

Scripts designed for the software servers used by the Polyglot system each carry out a particular operation (e.g. open, save, kill) and, depending on the operation, take 0, 1, or 2 arguments indicating the input/output files. Each script starts with a header of up to four comment lines that gives the pretty name of the program, the domain of the files it is designed for, and, depending on the operation type, the valid input and output formats. Three example scripts written in the AutoHotKey scripting language are provided below. Together they convert a .pdf file using Adobe Acrobat, with one script each for the open, save, and kill operations. Polyglot supports other scripting languages as well; these are mentioned below, and examples will be provided in the future.
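
To make the header layout concrete, here is a minimal Python sketch that reads the leading comment lines of a script into a dictionary. It assumes the layout seen in the AutoHotKey examples in this guide (program name, then domain, then comma-separated format lists, each on a line starting with a semicolon). The function name parse_header, the dictionary keys, and the "convert" operation name are hypothetical and are not part of Polyglot.

```python
def parse_header(script_text, operation, comment_prefix=";"):
    """Read the leading comment lines of a wrapper script into a dictionary."""
    header = []
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith(comment_prefix):
            break  # the header ends at the first non-comment line
        header.append(line[len(comment_prefix):].strip())

    info = {
        "application": header[0] if header else None,
        "domain": header[1] if len(header) > 1 else None,
        "inputs": [],
        "outputs": [],
    }
    # Lines 3 and 4, when present, hold comma-separated format lists
    formats = [[f.strip() for f in entry.split(",")] for entry in header[2:4]]
    if operation == "open" and formats:
        info["inputs"] = formats[0]
    elif operation == "save" and formats:
        info["outputs"] = formats[0]
    elif operation == "convert" and len(formats) == 2:
        # Assumed layout for a script that both reads and writes
        info["inputs"], info["outputs"] = formats
    return info


if __name__ == "__main__":
    example = ";Adobe Acrobat (v9.3.0 Pro Extended)\n;document\n;pdf\n\narg1 = %1%\n"
    print(parse_header(example, "open"))
```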

AutoHotKey

Acrobat

The script below opens a .pdf file for conversion:

Code Block

title	Open

;Adobe Acrobat (v9.3.0 Pro Extended)
;document
;pdf

;Parse input filename
arg1 = %1%
StringGetPos, index, arg1, \, R
ifLess, index, 0, ExitApp
index += 2
input_filename := SubStr(arg1, index)

;Run program if not already running
IfWinNotExist, Adobe Acrobat Pro Extended
{
  Run, C:\Program Files\Adobe\Acrobat 9.0\Acrobat\Acrobat.exe
  WinWait, Adobe Acrobat Pro Extended
}

;Activate the window
WinActivate, Adobe Acrobat Pro Extended
WinWaitActive, Adobe Acrobat Pro Extended

;Open document
Send, ^o
WinWait, Open
ControlSetText, Edit1, %1%
ControlSend, Edit1, {Enter}

;Make sure model is loaded before exiting
Loop
{
  IfWinExist, %input_filename% - Adobe Acrobat Pro Extended
  {
    break
  }

  Sleep, 500
}

The script below saves the open .pdf file in the specified output format:

Code Block

title	Save

;Adobe Acrobat (v9.3.0 Pro Extended)
;document
;doc, html, jpg, pdf, ps, rtf, txt

;Parse output format
arg1 = %1%
StringGetPos, index, arg1, ., R
ifLess, index, 0, ExitApp
index += 2
out := SubStr(arg1, index)

;Parse filename root
StringGetPos, index, arg1, \, R
ifLess, index, 0, ExitApp
index += 2
name := SubStr(arg1, index)
StringGetPos, index, name, ., R
ifLess, index, 0, ExitApp
name := SubStr(name, 1, index)

;Activate the window
WinActivate, %name%.pdf - Adobe Acrobat Pro Extended
WinWaitActive, %name%.pdf - Adobe Acrobat Pro Extended

;Save document
Send, ^S
WinWait, Save As

;Select the output type by typing its first letter in the file type list
;(repeated keystrokes step through the entries that share that letter)
if(out = "doc"){
  ControlSend, ComboBox3, m
}else if(out = "html"){
  ControlSend, ComboBox3, h
}else if(out = "jpg"){
  ControlSend, ComboBox3, j
}else if(out = "pdf"){
  ControlSend, ComboBox3, a
}else if(out = "ps"){
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
}else if(out = "rtf"){
  ControlSend, ComboBox3, r
}else if(out = "txt"){
  ControlSend, ComboBox3, t
  ControlSend, ComboBox3, t
}

ControlSetText, Edit1, %1%
ControlSend, Edit1, {Enter}

;Return to main window before exiting
Loop
{
  ;Continue on if main window is active
  IfWinActive, %name%.pdf - Adobe Acrobat Pro Extended
  { 
    break
  }

  ;Click "Yes" if asked to overwrite files
  IfWinExist, Save As
  {
    ControlGetText, tmp, Button1, Save As

    if(tmp = "&Yes")
    {
      ControlClick, Button1, Save As
    }
  }

  Sleep, 500
}

;Wait a little bit more just in case
Sleep, 1000

;Close whatever document is currently open
Send, ^w

;Make sure it actually closed before exiting
Loop
{
  ;Continue on if main window is active
  IfWinActive, Adobe Acrobat Pro Extended
  { 
    break
  }

  Sleep, 500
}

The script below kills Adobe Acrobat in case the program hangs:

Code Block

title	Kill

;Adobe Acrobat (v9.3.0 Pro Extended)

;Kill any scripts that could be using this application first
RunWait, taskkill /f /im Acrobat_open.exe
RunWait, taskkill /f /im Acrobat_save.exe

;Kill the application
RunWait, taskkill /f /im Acrobat.exe

;Make sure it actually closed before exiting
Loop
{
  ;Continue on once the application window is gone
  IfWinNotExist, Adobe Acrobat Pro Extended
  {
    break
  }

  Sleep, 500
}

OpenOffice

AppleScript
- AppleScript is also supported by Polyglot. An example script will be provided in the future.

Python
- Python is also supported by Polyglot. An example script will be provided in the future.

Bash
- Bash is also supported by Polyglot. An example script will be provided in the future.

...

...

Medici

...

Medici Extractors

Medici extractors typically serve to automatically extract some new kind of information from a file's content when it is uploaded into Medici. They do this by connecting to a shared RabbitMQ bus. When a new file is uploaded to Medici, it is announced on this bus. Extractors that can handle a file of the type posted on the bus are triggered, and the data they create is returned to Medici as derived data to be associated with that file. The extractors themselves can be implemented in a variety of languages.
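
How an extractor is selected for a given file type can be illustrated with the matching rules of an AMQP topic exchange, where "*" stands for exactly one word and "#" for zero or more words. The Python sketch below re-implements those rules only for illustration; in practice the RabbitMQ broker performs the matching. The binding keys are the ones used in the example extractors in this guide, while the routing key "localhost.file.text.plain" is an assumed example of what an upload announcement might look like.

```python
def topic_matches(binding_key, routing_key):
    """Return True if an AMQP topic binding key matches a routing key."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p, w):
        if p == len(pattern):
            # Pattern used up: match only if all words are used up too
            return w == len(words)
        if pattern[p] == "#":
            # '#' stands for zero or more words
            return any(match(p + 1, k) for k in range(w, len(words) + 1))
        if w == len(words):
            return False
        if pattern[p] == "*" or pattern[p] == words[w]:
            # '*' stands for exactly one word
            return match(p + 1, w + 1)
        return False

    return match(0, 0)


if __name__ == "__main__":
    print(topic_matches("*.file.text.plain.#", "localhost.file.text.plain"))
    print(topic_matches("*.file.text.plain.#", "localhost.file.image.png"))
```

Under these rules, a binding key that ends in ".#" also accepts routing keys with additional trailing words, whereas a key without it accepts only routing keys of exactly that length.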

...

Medici is another part of the Brown Dog platform. It stores data uploaded to the system and allows users to curate it. To extract information from data uploaded into the Medici content repository, Medici requires a connection to a RabbitMQ bus, over which notifications are sent when new data has been uploaded. It also requires an extractor, which, as the name suggests, extracts the desired information from the uploaded data. The extracted information could be metadata (e.g. geolocation, file size, file creation date), a preview of the uploaded file, the provenance of the file, and more. Below, example code in multiple languages shows how to create an extractor for the Medici system.

Example Extractors

Each extractor must first connect to the RabbitMQ bus. Examples of how this may be accomplished in various languages are presented below. A receiver consumes messages from RabbitMQ and processes the data that has been uploaded. Below, the receivers attempt to extract a word count from an uploaded text document.

Java

Code Block

theme	Emacs
language	java
title	Connecting to RabbitMQ

// Imports needed by this excerpt (RabbitMQ Java client)
import java.io.IOException;
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

// Method of the extractor class. serverAddr, EXCHANGE_NAME, QUEUE_NAME, DURABLE,
// EXCLUSIVE, AUTO_DELETE, CONSUMER_TAG, messageReceived, replyProps, replyTo, and
// processMessageReceived() are members of that class and are not shown here.
protected void startExtractor(String rabbitMQUsername, String rabbitMQpassword) {
	try {
		// Open channel and declare exchange and queue
		ConnectionFactory factory = new ConnectionFactory();
		factory.setHost(serverAddr);
		factory.setUsername(rabbitMQUsername);
		factory.setPassword(rabbitMQpassword);
		Connection connection = factory.newConnection();

		final Channel channel = connection.createChannel();
		channel.exchangeDeclare(EXCHANGE_NAME, "topic", true);

		channel.queueDeclare(QUEUE_NAME, DURABLE, EXCLUSIVE, AUTO_DELETE, null);
		// Receive announcements for plain text files only
		channel.queueBind(QUEUE_NAME, EXCHANGE_NAME, "*.file.text.plain.#");

		this.channel = channel;

		// Create listener; messages are acknowledged manually after processing
		channel.basicConsume(QUEUE_NAME, false, CONSUMER_TAG, new DefaultConsumer(channel) {
			@Override
			public void handleDelivery(String consumerTag, Envelope envelope,
					AMQP.BasicProperties properties, byte[] body) throws IOException {
				messageReceived = new String(body);
				long deliveryTag = envelope.getDeliveryTag();
				// (process the message components here ...)
				System.out.println(" [x] Received '" + messageReceived + "'");

				replyProps = new AMQP.BasicProperties.Builder()
						.correlationId(properties.getCorrelationId()).build();
				replyTo = properties.getReplyTo();

				processMessageReceived();
				System.out.println(" [x] Done");
				channel.basicAck(deliveryTag, false);
			}
		});

		// Start listening; keep the main thread alive while the consumer runs
		System.out.println(" [*] Waiting for messages. To exit press CTRL+C");
		while (true) {
			Thread.sleep(1000);
		}
	}
	catch (Exception e) {
		e.printStackTrace();
		System.exit(1);
	}
}

C++

Code Block

language	cpp
title	Connecting to RabbitMQ

#include <cstddef>
#include <string>
#include <amqpcpp.h>

namespace CPPExample {

  class RabbitMQConnectionHandler : public AMQP::ConnectionHandler {
  public:
      /**
      *  Method that is called by the AMQP library every time it has data
      *  available that should be sent to RabbitMQ. 
      *  @param  connection  pointer to the main connection object  
      *  @param  data        memory buffer with the data that should be sent to RabbitMQ
      *  @param  size        size of the buffer
      */
     virtual void onData(AMQP::Connection *connection, const char *data, size_t size)
     {
         // @todo 
         //  Add your own implementation, for example by doing a call to the
         //  send() system call. But be aware that the send() call may not
         //  send all data at once, so you also need to take care of buffering
         //  the bytes that could not immediately be sent, and try to send 
         //  them again when the socket becomes writable again
     }

      /**
      *  Method that is called by the AMQP library when the login attempt 
      *  succeeded. After this method has been called, the connection is ready 
      *  to use.
      *  @param  connection      The connection that can now be used
      */
      virtual void onConnected(AMQP::Connection *connection)
      {
         // @todo
         //  add your own implementation, for example by creating a channel 
         //  instance, and start publishing or consuming
      }

      /**
      *  Method that is called by the AMQP library when a fatal error occurs
      *  on the connection, for example because data received from RabbitMQ
      *  could not be recognized.
      *  @param  connection      The connection on which the error occurred
      *  @param  message         A human readable error message
      */
      virtual void onError(AMQP::Connection *connection, const std::string &message)
      {
        // @todo
        //  add your own implementation, for example by reporting the error
        //  to the user of your program, log the error, and destruct the 
        //  connection object because it is no longer in a usable state
      }
  };

}

Code Block

language	cpp
title	Receiver

namespace CPPExample {

  /**
   *  Parse data that was received from RabbitMQ
   *  
   *  Every time that data comes in from RabbitMQ, you should call this method to parse
   *  the incoming data and have it handled by the AMQP-CPP library. This method returns the
   *  number of bytes that were processed.
   *
   *  If not all bytes could be processed because it only contained a partial frame, you should
   *  call this same method later on when more data is available. The AMQP-CPP library does not do
   *  any buffering, so it is up to the caller to ensure that the old data is also passed in that
   *  later call.
   *
   *  @param  buffer      buffer to decode
   *  @param  size        size of the buffer to decode
   *  @return             number of bytes that were processed
   */
  size_t parse(char *buffer, size_t size)
  {
     return _implementation.parse(buffer, size);
  }
}

Python

Code Block

theme	Emacs
language	py
title	Instantiating the logger and starting the extractor

import logging
import sys

def main():
    global logger, playserverKey

    # name of receiver
    receiver = 'ExamplePythonExtractor'

    # configure the logging system
    logging.basicConfig(format="%(asctime)-15s %(name)-10s %(levelname)-7s : %(message)s", level=logging.WARN)
    logger = logging.getLogger(receiver)
    logger.setLevel(logging.DEBUG)

    # expect three arguments: RabbitMQ username, RabbitMQ password, Medici REST API key
    if len(sys.argv) != 4:
        logger.info("Input RabbitMQ username, followed by RabbitMQ password and Medici REST API key.")
        sys.exit()

    playserverKey = sys.argv[3]

Code Block

theme	Emacs
language	py
title	Connecting to RabbitMQ

    # (continues main(); requires "import pika" at the top of the module)

    # connect to rabbitmq using input username and password
    credentials = pika.PlainCredentials(sys.argv[1], sys.argv[2])
    parameters = pika.ConnectionParameters(credentials=credentials)
    connection = pika.BlockingConnection(parameters)

    # connect to channel
    channel = connection.channel()

    # declare the exchange
    channel.exchange_declare(exchange='medici', exchange_type='topic', durable=True)

    # declare the queue
    channel.queue_declare(queue=receiver, durable=True)

    # connect queue and exchange
    channel.queue_bind(queue=receiver, exchange='medici', routing_key='*.file.text.plain')

    # create listener (pika < 1.0 call style; pika >= 1.0 uses
    # basic_consume(queue=receiver, on_message_callback=on_message, auto_ack=False))
    channel.basic_consume(on_message, queue=receiver, no_ack=False)

    # start listening
    logger.info("Waiting for messages. To exit press CTRL+C")
    try:
        channel.start_consuming()
    except KeyboardInterrupt:
        channel.stop_consuming()
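
The listener above refers to an on_message callback that is not shown. The following is a minimal sketch of what such a callback could look like for the word count example, not the actual Medici extractor code. It assumes that the message body is JSON, and it takes two hypothetical helper functions, fetch_text and store_result, which stand in for downloading the file from Medici and uploading the result; neither is part of pika or Medici. The callback signature and the basic_ack call follow pika's conventions.

```python
import json


def count_words(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())


def make_on_message(fetch_text, store_result):
    """Build a pika-style callback around two caller-supplied helpers.

    fetch_text(message) should return the text of the uploaded file and
    store_result(message, result) should send the result back; both are
    hypothetical placeholders for calls to the Medici REST API.
    """
    def on_message(channel, method, header, body):
        message = json.loads(body)  # assumption: the body is a JSON document
        text = fetch_text(message)
        store_result(message, {"word_count": count_words(text)})
        # acknowledge the message so RabbitMQ does not redeliver it
        channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message
```

The callback would be created once, for example with on_message = make_on_message(my_fetch, my_store), before it is passed to basic_consume.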

Versus

...

Extractors

Java

Code Block

language	java
title	Measure

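import java.io.Serializable;
// Measure, Similarity, SimilarityNumber, SimilarityPercentage, and Descriptor come from the Versus API.

// Placeholder measure: compare() sleeps and then returns a fixed similarity of 0.
public class WordCountMeasure implements Serializable, Measure {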

	private static final long SLEEP = 10000;

	@Override
	public Similarity compare(Descriptor feature1, Descriptor feature2)
			throws Exception {
		Thread.sleep(SLEEP);
		return new SimilarityNumber(0);
	}

	@Override
	public SimilarityPercentage normalize(Similarity similarity) {
		return null;
	}

	@Override
	public String getFeatureType() {
		return WordCountMeasure.class.getName();
	}

	@Override
	public String getName() {
		return "Word Count Measure";
	}

	@Override
	public Class<WordCountMeasure> getType() {
		return WordCountMeasure.class;
	}

}
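
The Java measure above is a placeholder that always returns 0. As a minimal sketch of what a word count comparison could compute, assuming each document is described by a single integer word count, the Python functions below return the absolute difference between two counts and a percentage derived from it. The names compare and normalize mirror the Java method names, but these are hypothetical stand-ins and not part of the Versus API, and the formulas are only one possible choice.

```python
def compare(word_count_1, word_count_2):
    """Return the absolute difference between two word counts."""
    return abs(word_count_1 - word_count_2)


def normalize(difference, word_count_1, word_count_2):
    """Map a difference to a similarity percentage between 0 and 100."""
    largest = max(word_count_1, word_count_2)
    if largest == 0:
        # Two empty documents are treated as identical
        return 100.0
    return 100.0 * (largest - difference) / largest


if __name__ == "__main__":
    difference = compare(100, 80)
    print(difference, normalize(difference, 100, 80))
```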

...
