Adding Tools to the DAP and DTS, Overview and Examples

As the name "Brown Dog" suggests, the project aims to bring together a number of external tools as part of the two services being constructed.  For the DAP, which handles conversions, these tools are incorporated as scripts for the Software Servers in Polyglot or as DFDL schemas for Daffodil.  For the DTS, which handles the automatic extraction of metadata and content signatures, tools are incorporated as extractors for either Medici or Versus.  Below we show examples for incorporating each of these components.  This overview assumes basic knowledge of the three main components of the Brown Dog software platform: Polyglot, Medici, and Versus.  For a more in-depth overview of each component and its function, we recommend first reading its online documentation and/or watching one of the online tutorial videos.

Polyglot Software Server Scripts

Software Server scripts are used by Polyglot to automate interaction with software that can convert one file format to another.  A script can either directly wrap a command line utility that carries out a conversion, or split the conversion into two steps, opening a file in one format and saving it in a different format, as is typical of GUI-driven applications.  These wrapper scripts can be written in almost any text-based scripting language.  Below we show a few simple examples.  For full details on creating these wrapper scripts, including the required naming and header conventions, please refer to the Scripting Manual.

  • AutoHotKey
    • Acrobat
      • The script below opens a .pdf file for conversion

        ;Adobe Acrobat (v9.3.0 Pro Extended)
        ;Parse input filename
        arg1 = %1%
        StringGetPos, index, arg1, \, R
        ifLess, index, 0, ExitApp
        index += 2
        input_filename := SubStr(arg1, index)
        ;Run program if not already running
        IfWinNotExist, Adobe Acrobat Pro Extended
        {
          Run, C:\Program Files\Adobe\Acrobat 9.0\Acrobat\Acrobat.exe
          WinWait, Adobe Acrobat Pro Extended
        }
        ;Activate the window
        WinActivate, Adobe Acrobat Pro Extended
        WinWaitActive, Adobe Acrobat Pro Extended
        ;Open document
        Send, ^o
        WinWait, Open
        ControlSetText, Edit1, %1%
        ControlSend, Edit1, {Enter}
        ;Make sure model is loaded before exiting
        Loop
        {
          IfWinExist, %input_filename% - Adobe Acrobat Pro Extended
            break
          Sleep, 500
        }

      • The script below saves the open .pdf file in the specified output format

        ;Adobe Acrobat (v9.3.0 Pro Extended)
        ;doc, html, jpg, pdf, ps, rtf, txt
        ;Parse output format
        arg1 = %1%
        StringGetPos, index, arg1, ., R
        ifLess, index, 0, ExitApp
        index += 2
        out := SubStr(arg1, index)
        ;Parse filename root
        StringGetPos, index, arg1, \, R
        ifLess, index, 0, ExitApp
        index += 2
        name := SubStr(arg1, index)
        StringGetPos, index, name, ., R
        ifLess, index, 0, ExitApp
        name := SubStr(name, 1, index)
        ;Activate the window
        WinActivate, %name%.pdf - Adobe Acrobat Pro Extended
        WinWaitActive, %name%.pdf - Adobe Acrobat Pro Extended
        ;Save document
        Send, ^S
        WinWait, Save As
        if(out = "doc"){
          ControlSend, ComboBox3, m
        }else if(out = "html"){
          ControlSend, ComboBox3, h
        }else if(out = "jpg"){
          ControlSend, ComboBox3, j
        }else if(out = "pdf"){
          ControlSend, ComboBox3, a
        }else if(out = "ps"){
          ControlSend, ComboBox3, p
          ControlSend, ComboBox3, p
          ControlSend, ComboBox3, p
          ControlSend, ComboBox3, p
          ControlSend, ComboBox3, p
        }else if(out = "rtf"){
          ControlSend, ComboBox3, r
        }else if(out = "txt"){
          ControlSend, ComboBox3, t
          ControlSend, ComboBox3, t
        }
        ControlSetText, Edit1, %1%
        ControlSend, Edit1, {Enter}
        ;Return to main window before exiting
        Loop
        {
          ;Continue on if main window is active
          IfWinActive, %name%.pdf - Adobe Acrobat Pro Extended
            break
          ;Click "Yes" if asked to overwrite files
          IfWinExist, Save As
          {
            ControlGetText, tmp, Button1, Save As
            if(tmp = "&Yes")
              ControlClick, Button1, Save As
          }
          Sleep, 500
        }
        ;Wait a little bit more just in case
        Sleep, 1000
        ;Close whatever document is currently open
        Send, ^w
        ;Make sure it actually closed before exiting
        Loop
        {
          ;Continue on if main window is active
          IfWinActive, Adobe Acrobat Pro Extended
            break
          Sleep, 500
        }
      • OpenOffice
  • AppleScript
    AppleScript is also supported by Polyglot.
  • Python
    Python is also supported by Polyglot.
  • Bash
    Bash is also supported by Polyglot.
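
As a rough illustration of the command line case described above, the following Python sketch wraps a converter utility.  It is a sketch under stated assumptions, not a script taken from Polyglot: the header comment layout, the `example-convert` command, and the convention that the input and output paths arrive as the first two arguments are all assumptions made for illustration.  The Scripting Manual defines the actual naming and header conventions.

```python
# ExampleConverter (hypothetical application name)
# txt (hypothetical input formats)
# pdf (hypothetical output formats)
"""Sketch of a wrapper script around a command line conversion utility."""
import subprocess
import sys


def build_command(input_path, output_path):
    """Return the command line that would convert input_path to output_path.

    'example-convert' is a placeholder, not a real utility.
    """
    return ["example-convert", input_path, output_path]


def main(argv):
    """Check the arguments, run the converter, and return its exit status."""
    if len(argv) != 3:
        sys.stderr.write("usage: %s <input file> <output file>\n" % argv[0])
        return 2
    # Run the converter and hand its exit status back to the caller
    return subprocess.call(build_command(argv[1], argv[2]))
```

To run this as a script, one would call `sys.exit(main(sys.argv))` at the end of the file.  Keeping the command construction in its own function makes it possible to check the wrapper without the wrapped application being installed.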

Medici Extractors

Medici extractors typically serve to automatically extract some new kind of information from a file's content when it is uploaded into Medici.  They do this by connecting to a shared RabbitMQ bus.  When a new file is uploaded to Medici, it is announced on this bus.  Extractors that can handle a file of the announced type are triggered, and the data they create is returned to Medici as derived data to be associated with that file.  The extractors themselves can be implemented in a variety of languages.
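
Which extractors are triggered depends on how each extractor's queue is bound to the exchange.  The extractors in this section use a RabbitMQ topic exchange with binding patterns such as `*.file.text.plain.#`, where `*` stands for exactly one dot-separated word and `#` for zero or more words.  The sketch below re-implements that matching rule in plain Python purely to illustrate it; RabbitMQ does the matching itself, and the routing keys in the usage note are hypothetical examples rather than keys taken from Medici.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP topic pattern matches a routing key.

    Words are separated by dots; '*' stands for exactly one word and
    '#' for zero or more words.
    """
    def match(p, k):
        if not p:
            # Pattern exhausted: match only if the key is exhausted too
            return not k
        if p[0] == "#":
            # '#' may absorb zero words, or one word and stay active
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return p[0] in ("*", k[0]) and match(p[1:], k[1:])

    return match(pattern.split("."), routing_key.split("."))
```

With this rule, `*.file.text.plain.#` matches both `medici.file.text.plain` and `medici.file.text.plain.utf8`, whereas `*.file.text.plain` matches only the first of the two.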

  • Java

    Connecting to RabbitMQ
    protected void startExtractor(String rabbitMQUsername,
            String rabbitMQpassword) {
        try {
            // Open channel and declare exchange, queue, and consumer
            ConnectionFactory factory = new ConnectionFactory();
            factory.setUsername(rabbitMQUsername);
            factory.setPassword(rabbitMQpassword);
            Connection connection = factory.newConnection();
            final Channel channel = connection.createChannel();
            channel.exchangeDeclare(EXCHANGE_NAME, "topic", true);
            // Declare a durable queue so that it exists before it is bound
            channel.queueDeclare(QUEUE_NAME, true, false, false, null);
            channel.queueBind(QUEUE_NAME, EXCHANGE_NAME, "*.file.text.plain.#");
            this.channel = channel;
            // Create listener (messageReceived, replyProps, and replyTo are fields of the enclosing class)
            channel.basicConsume(QUEUE_NAME, false, CONSUMER_TAG, new DefaultConsumer(channel) {
                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                        AMQP.BasicProperties properties, byte[] body) throws IOException {
                    messageReceived = new String(body);
                    long deliveryTag = envelope.getDeliveryTag();
                    // (process the message components here ...)
                    System.out.println(" [x] Received '" + messageReceived + "'");
                    replyProps = new AMQP.BasicProperties.Builder()
                            .correlationId(properties.getCorrelationId()).build();
                    replyTo = properties.getReplyTo();
                    System.out.println(" [x] Done");
                    channel.basicAck(deliveryTag, false);
                }
            });
            // Start listening
            System.out.println(" [*] Waiting for messages. To exit press CTRL+C");
            while (true) {
                Thread.sleep(1000);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }


  • C++

    Connecting to RabbitMQ
    #include <amqpcpp.h>
    #include <string>
    namespace CPPExample {
      class RabbitMQConnectionHandler : public AMQP::ConnectionHandler {
          /**
           *  Method that is called by the AMQP library every time it has data
           *  available that should be sent to RabbitMQ.
           *  @param  connection  pointer to the main connection object
           *  @param  data        memory buffer with the data that should be sent to RabbitMQ
           *  @param  size        size of the buffer
           */
          virtual void onData(AMQP::Connection *connection, const char *data, size_t size)
          {
             // @todo
             //  Add your own implementation, for example by doing a call to the
             //  send() system call. But be aware that the send() call may not
             //  send all data at once, so you also need to take care of buffering
             //  the bytes that could not immediately be sent, and try to send
             //  them again when the socket becomes writable again
          }
          /**
           *  Method that is called by the AMQP library when the login attempt
           *  succeeded. After this method has been called, the connection is ready
           *  to use.
           *  @param  connection      The connection that can now be used
           */
          virtual void onConnected(AMQP::Connection *connection)
          {
             // @todo
             //  add your own implementation, for example by creating a channel
             //  instance, and start publishing or consuming
          }
          /**
           *  Method that is called by the AMQP library when a fatal error occurs
           *  on the connection, for example because data received from RabbitMQ
           *  could not be recognized.
           *  @param  connection      The connection on which the error occurred
           *  @param  message         A human readable error message
           */
          virtual void onError(AMQP::Connection *connection, const std::string &message)
          {
             // @todo
             //  add your own implementation, for example by reporting the error
             //  to the user of your program, log the error, and destruct the
             //  connection object because it is no longer in a usable state
          }
      };
    }
    namespace CPPExample {
       /**
        *  Parse data that was received from RabbitMQ
        *  Every time that data comes in from RabbitMQ, you should call this method to parse
        *  the incoming data, and let the AMQP-CPP library handle it. This method returns the number
        *  of bytes that were processed.
        *  If not all bytes could be processed because it only contained a partial frame, you should
        *  call this same method later on when more data is available. The AMQP-CPP library does not do
        *  any buffering, so it is up to the caller to ensure that the old data is also passed in that
        *  later call.
        *  @param  buffer      buffer to decode
        *  @param  size        size of the buffer to decode
        *  @return             number of bytes that were processed
        */
       // Excerpt: _implementation is assumed to be a member of the enclosing connection class
       size_t parse(char *buffer, size_t size)
       {
          return _implementation.parse(buffer, size);
       }
    }
  • Python

    Instantiating the logger and starting the extractor
    import logging
    import sys
    import pika
    def main():
        global logger
        # name of receiver (placeholder; also used as the queue name)
        receiver = 'ExamplePythonExtractor'
        # configure the logging system
        logging.basicConfig(format="%(asctime)-15s %(name)-10s %(levelname)-7s : %(message)s", level=logging.WARN)
        logger = logging.getLogger(receiver)
        # let this logger emit the info messages used below
        logger.setLevel(logging.DEBUG)
        if len(sys.argv) != 4:
            logger.info("Input RabbitMQ username, followed by RabbitMQ password and Medici REST API key.")
            sys.exit()
        global playserverKey
        playserverKey = sys.argv[3]
    Connecting to RabbitMQ (continuation of main())
        # connect to rabbitmq using input username and password
        credentials = pika.PlainCredentials(sys.argv[1], sys.argv[2])
        parameters = pika.ConnectionParameters(credentials=credentials)
        connection = pika.BlockingConnection(parameters)
        # connect to channel
        channel = connection.channel()
        # declare the exchange
        channel.exchange_declare(exchange='medici', exchange_type='topic', durable=True)
        # declare the queue
        channel.queue_declare(queue=receiver, durable=True)
        # connect queue and exchange
        channel.queue_bind(queue=receiver, exchange='medici', routing_key='*.file.text.plain')
        # create listener (pika 0.x signature; pika 1.0+ uses
        # basic_consume(queue=..., on_message_callback=..., auto_ack=False))
        channel.basic_consume(on_message, queue=receiver, no_ack=False)
        # start listening
        logger.info("Waiting for messages. To exit press CTRL+C")
        try:
            channel.start_consuming()
        except KeyboardInterrupt:
            channel.stop_consuming()
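
The Python example above passes an `on_message` callback to `basic_consume` without showing it.  A minimal sketch of such a callback follows, assuming the message body is a JSON object.  The `id` and `text` fields and the word-count extraction are hypothetical, chosen only for illustration; a real extractor would fetch the file from Medici and upload its results through the REST API, neither of which is shown here.

```python
import json


def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())


def on_message(channel, method, header, body):
    """pika-style callback: decode the message, extract metadata, acknowledge."""
    # Assumption: the body is a JSON object
    message = json.loads(body)
    # Hypothetical field; a real extractor would download the file instead
    text = message.get("text", "")
    metadata = {"file_id": message.get("id"), "word_count": count_words(text)}
    # A real extractor would upload `metadata` to Medici at this point
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return metadata
```

The message is acknowledged only after processing has finished, so a message whose processing raises an exception is left unacknowledged and can be redelivered by RabbitMQ.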

Versus Extractors

  • Java

    import java.io.Serializable;
    // Measure, Descriptor, Similarity, SimilarityNumber, and SimilarityPercentage come from the Versus API
    public class WordCountMeasure implements Serializable, Measure {
        private static final long SLEEP = 10000;
        public Similarity compare(Descriptor feature1, Descriptor feature2)
                throws Exception {
            return new SimilarityNumber(0);
        }
        public SimilarityPercentage normalize(Similarity similarity) {
            return null;
        }
        public String getFeatureType() {
            return WordCountMeasure.class.getName();
        }
        public String getName() {
            return "Word Count Measure";
        }
        public Class<WordCountMeasure> getType() {
            return WordCountMeasure.class;
        }
    }


