As the name "Brown Dog" suggests the project aims at bringing together a number of external tools as part of the two services being constructed. For the DAP, which handles conversions, these tools are incorporated as scripts to the Software Servers in Polyglot or as DFDL schemas for Daffodil. For the DTS, which handles the automatic extraction of metadata and content signatures, tools are incorporated as either extractors for Medici or extractors for Versus. Below we show examples for incorporating each of these components. This overview assumes a basic level of knowledge about the three four main components of the Brown Dog software platform, i.e. Polyglot, Medici, Versus, and VersusDaffodil. For a more in depth overview of each of these components and their function it is recommended that you first read through their online documentation and/or go through one of the online tutorial videos:
The purpose of this document is to provide quick examples of each means of incorporating tools so as to bootstrap ones ability to include their code within one of the two services.
...
...
Start Here
To begin does your code, software, or tool carry out a data conversion or a data extraction? If a conversion the tool should be included in the Data Access Proxy. If an extraction the tool should be included in the Data Tilling Service.
The Data Access Proxy (DAP)The Data Access Proxy handles data conversions. If a piece of software or tool exists to carry out the conversion its incorporation into the DAP will be through Polyglot. If the specification of the file format is known then in can be incorporated as a DFDL schema within Daffodil.
Polyglot Software Server Scripts
Software Server scripts are used by Polyglot to automate the interaction with software that is capable of converting from one file format to another. These scripts can directly wrap command line utilities that carry out conversions for use in Polyglot or split the steps of opening a file in one format and saving a file in a different format, typical of GUI driven applications. These wrapper scripts can be written in pretty much any text based scripting language. Below we show a few simple examples. Full details on the creation of these wrapper scripts, the required naming conventions, and required header conventions please refer to the the Scripting Manual.
The following is an example of a bash wrapper script for ImageMagick. Note that it is fairly straight forward. The comments at the top contain the information Polyglot needs to use the application: the name and version of the application, they type of data it supports, the input formats it supports, and the output formats it supports.
Code Block |
---|
|
#!/bin/sh
#ImageMagick (v6.5.2)
#image
#bmp, dib, eps, fig, gif, ico, jpg, jpeg, pdf, pgm, pict, pix, png, pnm, ppm, ps, rgb, |
Software Server scripts are used by Polyglot to automate the interaction with software that is capable of converting from one file format to another. These scripts can directly wrap command line utilities that carry out conversions for use in Polyglot or split the steps of opening a file in one format and saving a file in a different format, typical of GUI driven applications. These wrapper scripts can be written in pretty much any text based scripting language. Below we show a few simple examples. Full details on the creation of these wrapper scripts, the required naming convensions, and required header convensions please refer to the the Scripting Manual.
Anchor |
---|
CommandLIne | CommandLIne | Command Line ApplicationsThe following is an example of a bash wrapper script for ImageMagick. Note that it is fairly straight forward. The comments at the top contain the information Polyglot needs to use the application: the name and version of the application, they type of data it supports, the input formats it supports, and the output formats it supports.
Code Block |
---|
|
#!/bin/sh
#ImageMagick (v6.5.2)
#image
#bmp, dib, eps, fig, gif, ico, jpg, jpeg, pdf, pgm, pict, pix, png, pnm, ppm, ps, rgb, rgba, sgi, sun, svg, tga, tif, tiff, ttf, x, xbm, xcf, xpm, xwd, yuv
#bmp, dib, eps, gif, jpg, jpeg, pdf, pgm, pict, png, pnm, ppm, ps, rgb, rgba, sgi, sun, svg, tga, tif, tiff, ttf, x, xbm, xpm, xwd, yuv
convert $1 $2 |
Some GUI based applications are capable of being called in a headless mode. The following is an example wrapper script for OpenOffice called in its headless mode.
Code Block |
---|
title | OpenOffice_convert.bat |
---|
|
REM OpenOffice (v3.1.0)
REM document
REM doc, odt, rtf, txt
REM doc, odt, pdf, rtf, txt
"C:\Program Files\OpenOffice.org 3\program\soffice.exe" -headless -norestore "-accept=socket`,host=localhost`,port=8100;urp;StarOffice.ServiceManager"
"C:\Program Files\OpenOffice.org 3\program\python.exe" "C:\Converters\DocumentConverter.py" "%1%" "%2%" |
GUI Applications
The following is an example of an AutoHotKey script to convert files with Adobe Acrobat, a GUI driven application. Note it contains a similar header in the comments at the beginning of the script. Also note that the open and save operation can be broken into two separate scripts.
...
Code Block |
---|
|
<?xml version="1.0" encoding="UTF-8"?>
<!--
Load image data from a PGM file and represent the data as a sequence of pixels in row major order.
-->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:ex="http://example.com" targetNamespace="http://example.com">
<xs:include schemaLocation="xsd/built-in-formats.xsd"/>
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:format ref="ex:daffodilTest1" separator="" initiator="" terminator="" leadingSkip='0' textTrimKind="none" initiatedContent="no"
alignment="implicit" alignmentUnits="bits" trailingSkip="0" ignoreCase="no" separatorPolicy="suppressed"
separatorPosition="infix" occursCountKind="parsed" emptyValueDelimiterPolicy="both" representation="text"
textNumberRep="standard" lengthKind="delimited" encoding="ASCII"/>
</xs:appinfo>
</xs:annotation>
<xs:element name="file">
<xs:complexType>
<xs:sequence>
<xs:element name="header" dfdl:lengthKind="implicit" maxOccurs="1">
<xs:complexType>
<xs:sequence dfdl:sequenceKind="ordered" dfdl:separator="%NL;" dfdl:separatorPosition="postfix">
<xs:element name="type" type="xs:string"/>
<xs:element name="dimensions" maxOccurs="1" dfdl:occursCountKind="implicit">
<xs:complexType>
<xs:sequence dfdl:sequenceKind="ordered" dfdl:separator="%SP;">
<xs:element name="width" type="xs:integer"/>
<xs:element name="height" type="xs:integer"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="depth" type="xs:integer"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="pixels" dfdl:lengthKind="implicit" maxOccurs="1">
<xs:complexType>
<xs:sequence dfdl:separator="%SP; %NL; %SP;%NL;" dfdl:separatorPosition="postfix" dfdl:separatorSuppressionPolicy="anyEmpty">
<xs:element name="pixel" type="xs:integer" maxOccurs="unbounded"/> dfdl:occursCountKind="expression"
dfdl:occursCount="{../../ex:header/ex:dimensions/ex:width * ../../ex:header/ex:dimensions/ex:height }"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema> |
...
Code Block |
---|
<ex:file xmlns:ex="http://example.com">
<ex:header>
<ex:type>P2</ex:type>
<ex:dimensions>
<ex:width>16</ex:width>
<ex:height>16</ex:height>
</ex:dimensions>
<ex:depth>255</ex:depth>
</ex:header>
<ex:pixels>
<ex:pixel>136</ex:pixel>
<ex:pixel>136</ex:pixel>
<ex:pixel>136</ex:pixel>
...
<ex:pixel>136</ex:pixel>
<ex:pixel>136</ex:pixel>
</ex:pixels>
</ex:file> |
...
...
Medici extractors typically serve to automatically extract some new kind of information from a file's content when it is uploaded into Medici. These extractors do this by connecting to a shared RabbitMQ bus. When a new file is uploaded to Medici it is announced on this bus. Extractors that can handle a file of the type posted on the bus are triggered and the data they in turn create is returned to Medici as derived data to be associated with that file. The extractors themselves can be implemented in a variety of languages.
...
The Data Tilling Service (DTS)
The Data Tilling Services handles data extractions. If your code, tool, or software extracts information such as keywords from a file or its contents then it should be included in the DTS as a Medici extractor. If your code, tool, or software extracts a signature from the file's contents which in turn can be compared to the signatures of other files via some distance measure to find similar pieces of data, then, it should be included in the DTS as a Versus extractor.
Medici Extractors
Medici extractors typically serve to automatically extract some new kind of information from a file's content when it is uploaded into Medici. These extractors do this by connecting to a shared RabbitMQ bus. When a new file is uploaded to Medici it is announced on this bus. Extractors that can handle a file of the type posted on the bus are triggered and the data they in turn create is returned to Medici as derived data to be associated with that file. The extractors themselves can be implemented in a variety of languages. Examples of these extractors in different languages can be found in the extractors-templates code repository.
JavaJava extractors can be created using the amqp-client jar file. This allows you to connect to the RabitMQ bus and received messages. The easiest way to get up and running is to use maven with java to add all required dependencies. An example of an extractor written in java can be found at Medici Extractor in Java.
PythonPython extractors will often be based on the packages pika and requests. This allows you to connect to the RabittMQ message bus and easily send requests to medici. A complete example of the python extractor can be found at Medici Extractor in Python
Calling R Scripts from Python
Coming soon...
Versus Extractors
Versus extractors serve to extract a signature from a file's content. These signatures, effectively a hash for the data, are typically numerical vectors which capture some semantically meaningful aspect of the content so that two such signatures can then be compared using some distance measure. Within Versus extractors operate on a data structure representing the content of a file, produced a Versus adapter, and the returned signatures compared by either a Versus similarity or distance measure. The combination of these adapters, extractors, and measures in turn compose a comparison which can be used for relating files according their contents.
JavaThe main class sets up the comparison, this is done by adding the two files that need to be compared, as well as the adapter to load the file, the extractor to extract a feature from the file, and a measurement to compare the two features.
Code Block |
---|
|
static public void main(String[] args) {
PairwiseComparison comparison = new PairwiseComparison();
comparison.setId(UUID.randomUUID().toString());
comparison.setFirstDataset(new File("data/test1.txt"));
comparison.setSecondDataset(new File("data/test2.txt"));
comparison.setAdapterId(TextAdapter.class.getName());
comparison.setExtractorId(TextHistogramExtractor.class.getName());
comparison.setMeasureId(LabelHistogramEuclidianDistanceMeasure.class.getName());
ExecutionEngine ee = new ExecutionEngine();
ee.submit(comparison, new ComparisonStatusHandler() {
@Override
public void onStarted() {
System.out.println("STARTED : " |
Code Block |
---|
theme | Emacs |
---|
language | java |
---|
title | Connecting to RabbitMQ |
---|
|
protected void startExtractor(String rabbitMQUsername,
String rabbitMQpassword) {
try{
//Open channel and declare exchange and consumer
ConnectionFactory factory = new ConnectionFactory();
factory.setHost(serverAddr);
factory.setUsername(rabbitMQUsername);
factory.setPassword(rabbitMQpassword);
Connection connection = factory.newConnection();
final Channel channel = connection.createChannel();
channel.exchangeDeclare(EXCHANGE_NAME, "topic", true);
channel.queueDeclare(QUEUE_NAME,DURABLE,EXCLUSIVE,AUTO_DELETE,null);
channel.queueBind(QUEUE_NAME, EXCHANGE_NAME, "*.file.text.plain.#");
this.channel = channel;
// create listener
channel.basicConsume(QUEUE_NAME, false, CONSUMER_TAG, new DefaultConsumer(channel) {
@Override
public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties, byte[] body) throws IOException {
messageReceived = new String(body);
long deliveryTag = envelope.getDeliveryTag();
// (process the message components here ...)
System.out.println(" {x} Received '" + messageReceived + "'");
replyProps = new AMQP.BasicProperties.Builder().correlationId(properties.getCorrelationId()).build();
replyTo = properties.getReplyTo();
processMessageReceived();
System.out.println(" [x] Done");
channel.basicAck(deliveryTag, false);
}
});
// start listening
System.out.println(" [*] Waiting for messages. To exit press CTRL+C");
while (true) {
Thread.sleep(1000);
}
}
catch(Exception e){
e.printStackTrace();
System.exit(1);
}
} |
Code Block |
---|
language | java |
---|
title | Processing Messages Received From RabbitMQ |
---|
|
protected void processMessageReceived() {
try {
try {
ExampleJavaExtractorService extrServ = new ExampleJavaExtractorService(this);
jobReceived = getRepresentation(messageReceived, ExtractionJob.class);
File textFile = extrServ.processJob(jobReceived);
jobReceived.setFlag("wasText");}
log.info("Word count extraction complete. Returning word count@Override
file as intermediate result.");
sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "Word count extraction complete. Returning word count file as intermediate result.", log);
public void onFailed(String msg, Throwable e) {
System.out.println("FAILED uploadIntermediate(textFile,: "text/plain", log + msg);
textFilee.deleteprintStackTrace();
sendStatus(jobReceivedSystem.getIdexit(0), this.getClass().getSimpleName(), "DONE.", log);
;
} catch (Exception ioe) {}
log.error("Could not finish extraction job.", ioe); @Override
sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "Could not finish extraction job.", log);
public void onDone(double value) {
sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "DONE.", log System.out.println("DONE : " + value);
}
} catch(Exception e) {
eSystem.printStackTraceexit(0);
System.exit(1); }
}
} |
...
Code Block |
---|
language | cpp |
---|
title | Connecting to RabbitMQ |
---|
|
#include <amqpcpp.h>
namespace CPPExample {
class RabbitMQConnectionHandler : public AMQP::ConnectionHandler {@Override
/**
*public void Method that is called by the AMQP library every time it has data
onAborted(String msg) {
* available that should be sent to RabbitMQ. System.out.println("ABORTED : " + msg);
* @param connection pointer to the main connection object System.exit(0);
* @param data }
memory buffer with the data that should be sent to RabbitMQ
* @param size size of the buffer
*/
virtual void onData(AMQP::Connection *connection, const char *data, size_t size)
{
// @todo
// Add your own implementation, for example by doing a call to the
// send() system call. But be aware that the send() call may not });
} |
The text adapter will take a text file, and load all the file, splitting the text into words and return a list of all words in the text. The words are still in the right order, and it is possible to read the original information of the file by reading the words in the order as they are returned by getWords().
Code Block |
---|
language | java |
---|
title | Text Adapter |
---|
|
public class TextAdapter implements FileLoader, HasText {
private File file;
private List<String> words;
public TextAdapter() {}
// ----------------------------------------------------------------------
// FileLoader
// ----------------------------------------------------------------------
@Override
public void load(File file) {
//this.file = file;
send all data}
at once, so you@Override
also need to takepublic careString ofgetName() buffering{
return // the bytes that could not immediately be sent, and try to send
"Text Document";
}
@Override
public List<String> getSupportedMediaTypes() {
// List<String> themmediaTypes again= when the socket becomes writable againnew ArrayList<String>();
}
mediaTypes.add("text/**");
* Methodreturn that is called by the AMQP library when the login attempt mediaTypes;
}
* succeeded. After this method has been called, the connection is ready // ----------------------------------------------------------------------
* to use.// HasText
* @param connection// ----------------------------------------------------------------------
@Override
public The connection that can now be used
*/List<String> getWords() {
if (words == null) {
virtual void onConnected(Connection *connection)
words = new {ArrayList<String>();
// @todo
try {
// add your own implementation, for example by creating aBufferedReader channelbr
= new BufferedReader(new FileReader(file));
// instance, and start publishing or consuming
String line;
}
/**
* Method that is called by the AMQP library when a fatal error occurs
while((line = br.readLine()) != null) {
* on the connection, for example becauseString[] dataw received from RabbitMQ
= line.split(" ");
* could not be recognized.
* @paramwords.addAll(Arrays.asList(w));
connection The connection on which the error occured
}
* @param message A human readable error messagebr.close();
*/
virtual} voidcatch onError(Connection *connection, const std::string &message)
IOException e) {
{
// @todo e.printStackTrace();
// add your}
own implementation, for example by reporting the error}
//return words;
to the user of your program, log the error, and destruct the
// connection object because it is no longer in a usable state
}
};
} |
The extractor will take the words returned by the adapter and count the occurrence of each word. At this point we are left with a histogram with all words and how often they occur in the text, we can no longer read the text since the information about the order of the words is lost.
Code Block |
---|
language | java |
---|
title | Text Histogram Extractor |
---|
|
public class TextHistogramExtractor implements Extractor
|
Code Block |
---|
language | java |
---|
title | Processing the Job |
---|
|
public File processJob(ExtractionJob receivedMsg) throws Exception{
@Override
log.info("Downloading text file with ID "+ receivedMsg.getIntermediateId() +" from " + receivedMsg.getHost());
callingExtractor.sendStatus(receivedMsg.getId(), callingExtractor.getClass().getSimpleName(), "Downloading text file.", logpublic Adapter newAdapter() {
throw (new RuntimeException("Not supported."));
}
@Override
DefaultHttpClient httpclient =public newString DefaultHttpClientgetName(); {
HttpGet httpGet = new HttpGet(receivedMsg.getHost() +"api/files/"+ receivedMsg.getIntermediateId()+"?key="+playserverKey); return "Text Histogram Extractor";
}
HttpResponse fileResponse = httpclient.execute(httpGet);
log.info(fileResponse.getStatusLine());
if(fileResponse.getStatusLine().toString().indexOf("200") == -1){
throw new IOException("File not found." @Override
public Set<Class<? extends Adapter>> supportedAdapters() {
Set<Class<? extends Adapter>> adapters = new HashSet<Class<? extends Adapter>>();
}
HttpEntity fileEntity = fileResponseadapters.getEntityadd(HasText.class);
InputStream fileIs = fileEntity.getContent()return adapters;
}
Header[] hdrs = fileResponse.getHeaders("content-disposition"); @Override
String contentDisp = hdrs[0].toString();public Class<? extends Descriptor> getFeatureType() {
String fileName =return contentDisp.substring(contentDisp.indexOf("filename=")+9);
File tempFile = File.createTempFile(fileName.substring(0, fileName.lastIndexOf(".")), fileName.substring(fileName.lastIndexOf(".")).toLowerCase());
OutputStream fileOs = new FileOutputStream(tempFile);
IOUtils.copy(fileIs,fileOs);
fileIs.close();
fileOs.close();
EntityUtils.consume(fileEntity); LabelHistogramDescriptor.class;
}
@Override
public Descriptor extract(Adapter adapter) throws Exception {
if (adapter instanceof HasText) {
LabelHistogramDescriptor desc = new LabelHistogramDescriptor();
log.info("Download complete. Initiating for (String word count generation");
File textFile = processFile(tempFile, receivedMsg.getId());
return textFile;: ((HasText) adapter).getWords()) {
desc.increaseBin(word);
} |
Code Block |
---|
|
namespace CPPExample {
/**
* Parse data that was recevied from RabbitMQ return desc;
*
*} else Every{
time that data comes in from RabbitMQ, you should call this methodthrow to parsenew UnsupportedTypeException();
* the incoming data, and}
let it handle by}
the AMQP-CPP library. This
method returns the number@Override
* public of bytes that were processed.
boolean hasPreview(){
*
return *false;
If not all}
bytes could be processed
because it only contained@Override
a partial frame,public you shouldString previewName(){
* call this same methodreturn laternull;
on when more data is available. The AMQP-CPP library does not do
* any buffering, so it is up to the caller to ensure that the old data is also passed in that
* later call.
*
* @param buffer buffer to decode
* @param size size of the buffer to decode
* @return number of bytes that were processed
*/
size_t parse(char *buffer, size_t size)
{
return _implementation.parse(buffer, size);
}
} |
...
Code Block |
---|
theme | Emacs |
---|
language | py |
---|
title | Instantiating the logger and starting the extractor |
---|
|
def main():
global logger
# name of receiver
receiver='ExamplePythonExtractor'
# configure the logging system
logging.basicConfig(format="%(asctime)-15s %(name)-10s %(levelname)-7s : %(message)s", level=logging.WARN)
logger = logging.getLogger(receiver)
logger.setLevel(logging.DEBUG)
if len(sys.argv) != 4:
logger.info("Input RabbitMQ username, followed by RabbitMQ password and Medici REST API key.")
sys.exit()
global playserverKey
playserverKey = sys.argv[3] |
Code Block |
---|
theme | Emacs |
---|
language | py |
---|
title | Connecting to RabbitMQ |
---|
|
# connect to rabbitmq using input username and password
credentials = pika.PlainCredentials(sys.argv[1], sys.argv[2])
parameters = pika.ConnectionParameters(credentials=credentials)
connection = pika.BlockingConnection(parameters)
# connect to channel
channel = connection.channel()
# declare the exchange
channel.exchange_declare(exchange='medici', exchange_type='topic', durable=True)
# declare the queue
channel.queue_declare(queue=receiver, durable=True)
# connect queue and exchange
channel.queue_bind(queue=receiver, exchange='medici', routing_key='*.file.text.plain')
# create listener
channel.basic_consume(on_message, queue=receiver, no_ack=False)
# start listening
logger.info("Waiting for messages. To exit press CTRL+C")
try:
channel.start_consuming()
except KeyboardInterrupt:
channel.stop_consuming()
|
...
To compare two texts we use the euclidian distance measure of two histograms. First we normalize each histogram, so we can compare a large text with a small text, next we compare each big of the two histograms. If the bin is missing from either histogram it is assumed to have a value of 0.
Code Block |
---|
language | java |
---|
title | Euclidian Distance Measure |
---|
|
public class LabelHistogramEuclidianDistanceMeasure implements Measure
{
@Override
public SimilarityPercentage normalize(Similarity similarity) {
return new SimilarityPercentage(1 - similarity.getValue());
}
@Override
public String getFeatureType() {
return LabelHistogramDescriptor.class.getName();
}
@Override
public String getName() {
return "Histogram Distance";
}
@Override
public Class<LabelHistogramEuclidianDistanceMeasure> getType() {
return LabelHistogramEuclidianDistanceMeasure.class;
}
// correlation
@Override
public Similarity compare(Descriptor desc1, Descriptor desc2) throws Exception {
if ((desc1 instanceof LabelHistogramDescriptor) && (desc2 instanceof LabelHistogramDescriptor)) {
LabelHistogramDescriptor lhd1 = (LabelHistogramDescriptor) desc1;
LabelHistogramDescriptor lhd2 = (LabelHistogramDescriptor) desc2;
// get all possible labels
Set<String> labels = new HashSet<String>();
labels.addAll(lhd1.getLabels());
labels.addAll(lhd2.getLabels());
// normalize
lhd1.normalize();
lhd2.normalize();
// compute distance
double sum = 0;
for (String s : labels) {
Double b1 = lhd1.getBin(s);
Double b2 = lhd2.getBin(s);
if (b1 == null) {
sum += b2 * b2;
} else if (b2 == null) {
sum += b1 * b1;
} else {
sum += (b1 - b2) * (b1 - b2);
}
}
return new SimilarityNumber(Math.sqrt(sum), 0, 1, 0);
} else {
throw new UnsupportedTypeException();
}
} |
...
Versus extractors serve to extract a signature from a file's content. These signatures, effectively a hash for the data, are typically numerical vectors which capture some semantically meaningful aspect of the content so that two such signatures can then be compared using some distance measure. Within Versus extractors operate on a data structure representing the content of a file, produced a Versus adapter, and the returned signatures compared by either a Versus similarity or distance measure. The combination of these adapters, extractors, and measures in turn compose a comparison which can be used for relating files according their contents.
...
Code Block |
---|
|
public class WordCountMeasure implements Serializable,Measure {
private static final long SLEEP = 10000;
@Override
public Similarity compare(Descriptor feature1, Descriptor feature2)
throws Exception {
Thread.sleep(SLEEP);
return new SimilarityNumber(0);
}
@Override
public SimilarityPercentage normalize(Similarity similarity) {
return null;
}
@Override
public String getFeatureType() {
return WordCountMeasure.class.getName();
}
@Override
public String getName() {
return "Word Count Measure";
}
@Override
public Class<WordCountMeasure> getType() {
return WordCountMeasure.class;
}
} |