As the name "Brown Dog" suggests, the project aims to bring together a number of external tools as part of the two services being constructed. For the DAP, which handles conversions, these tools are incorporated either as scripts for the Software Servers in Polyglot or as DFDL schemas for Daffodil. For the DTS, which handles the automatic extraction of metadata and content signatures, tools are incorporated as either extractors for Medici or extractors for Versus. Below we show examples of incorporating each of these components. This overview assumes a basic level of knowledge about the four main components of the Brown Dog software platform, i.e. Polyglot, Medici, Versus, and Daffodil. For a more in-depth overview of each of these components and their function, it is recommended that you first read through their online documentation and/or go through one of the online tutorial videos:
- Polyglot Documentation
- Medici Documentation
- Versus Documentation
- Daffodil and DFDL Documentation
- Tutorial Videos
The purpose of this document is to provide quick examples of each means of incorporating tools, so as to bootstrap your ability to include your own code within one of the two services.
Start Here
To begin, ask whether your code, software, or tool carries out a data conversion or a data extraction. If a conversion, the tool should be included in the Data Access Proxy. If an extraction, the tool should be included in the Data Tilling Service.
The Data Access Proxy (DAP)
The Data Access Proxy handles data conversions. If a piece of software or a tool exists to carry out the conversion, its incorporation into the DAP will be through Polyglot. If the specification of the file format is known, then it can be incorporated as a DFDL schema within Daffodil.
Polyglot Software Server Scripts
Software Server scripts are used by Polyglot to automate the interaction with software that is capable of converting from one file format to another. These scripts can directly wrap command line utilities that carry out conversions, or can split up the steps of opening a file in one format and saving it in a different format, as is typical of GUI driven applications. These wrapper scripts can be written in nearly any text based scripting language. Below we show a few simple examples. For full details on the creation of these wrapper scripts, the required naming conventions, and the required header conventions, please refer to the Scripting Manual.
Command Line Applications
Bash Script
The following is an example of a bash wrapper script for ImageMagick. Note that it is fairly straightforward. The comments at the top contain the information Polyglot needs to use the application: the name and version of the application, the type of data it supports, the input formats it supports, and the output formats it supports.
#!/bin/sh
#ImageMagick (v6.5.2)
#image
#bmp, dib, eps, fig, gif, ico, jpg, jpeg, pdf, pgm, pict, pix, png, pnm, ppm, ps, rgb, rgba, sgi, sun, svg, tga, tif, tiff, ttf, x, xbm, xcf, xpm, xwd, yuv
#bmp, dib, eps, gif, jpg, jpeg, pdf, pgm, pict, png, pnm, ppm, ps, rgb, rgba, sgi, sun, svg, tga, tif, tiff, ttf, x, xbm, xpm, xwd, yuv

convert $1 $2
Batch File
Some GUI based applications are capable of being called in a headless mode. The following is an example wrapper script for OpenOffice called in its headless mode.
REM OpenOffice (v3.1.0)
REM document
REM doc, odt, rtf, txt
REM doc, odt, pdf, rtf, txt

"C:\Program Files\OpenOffice.org 3\program\soffice.exe" -headless -norestore "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager"
"C:\Program Files\OpenOffice.org 3\program\python.exe" "C:\Converters\DocumentConverter.py" "%1%" "%2%"
GUI Applications
AutoHotKey
The following is an example of an AutoHotKey script to convert files with Adobe Acrobat, a GUI driven application. Note it contains a similar header in the comments at the beginning of the script. Also note that the open and save operation can be broken into two separate scripts.
;Adobe Acrobat (v9.3.0 Pro Extended)
;document
;pdf

;Parse input filename
arg1 = %1%
StringGetPos, index, arg1, \, R
ifLess, index, 0, ExitApp
index += 2
input_filename := SubStr(arg1, index)

;Run program if not already running
IfWinNotExist, Adobe 3D Reviewer
{
  Run, C:\Program Files\Adobe\Acrobat 9.0\Acrobat\Acrobat.exe
  WinWait, Adobe Acrobat Pro Extended
}

;Activate the window
WinActivate, Adobe Acrobat Pro Extended
WinWaitActive, Adobe Acrobat Pro Extended

;Open document
Send, ^o
WinWait, Open
ControlSetText, Edit1, %1%
ControlSend, Edit1, {Enter}

;Make sure model is loaded before exiting
Loop
{
  IfWinExist, %input_filename% - Adobe Acrobat Pro Extended
  {
    break
  }

  Sleep, 500
}
;Adobe Acrobat (v9.3.0 Pro Extended)
;document
;doc, html, jpg, pdf, ps, rtf, txt

;Parse output format
arg1 = %1%
StringGetPos, index, arg1, ., R
ifLess, index, 0, ExitApp
index += 2
out := SubStr(arg1, index)

;Parse filename root
StringGetPos, index, arg1, \, R
ifLess, index, 0, ExitApp
index += 2
name := SubStr(arg1, index)
StringGetPos, index, name, ., R
ifLess, index, 0, ExitApp
name := SubStr(name, 1, index)

;Activate the window
WinActivate, %name%.pdf - Adobe Acrobat Pro Extended
WinWaitActive, %name%.pdf - Adobe Acrobat Pro Extended

;Save document
Send, ^S
WinWait, Save As

if(out = "doc"){
  ControlSend, ComboBox3, m
}else if(out = "html"){
  ControlSend, ComboBox3, h
}else if(out = "jpg"){
  ControlSend, ComboBox3, j
}else if(out = "pdf"){
  ControlSend, ComboBox3, a
}else if(out = "ps"){
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
  ControlSend, ComboBox3, p
}else if(out = "rtf"){
  ControlSend, ComboBox3, r
}else if(out = "txt"){
  ControlSend, ComboBox3, t
  ControlSend, ComboBox3, t
}

ControlSetText, Edit1, %1%
ControlSend, Edit1, {Enter}

;Return to main window before exiting
Loop
{
  ;Continue on if main window is active
  IfWinActive, %name%.pdf - Adobe Acrobat Pro Extended
  {
    break
  }

  ;Click "Yes" if asked to overwrite files
  IfWinExist, Save As
  {
    ControlGetText, tmp, Button1, Save As

    if(tmp = "&Yes")
    {
      ControlClick, Button1, Save As
    }
  }

  Sleep, 500
}

;Wait a little bit more just in case
Sleep, 1000

;Close whatever document is currently open
Send, ^w

;Make sure it actually closed before exiting
Loop
{
  ;Continue on if main window is active
  IfWinActive, Adobe Acrobat Pro Extended
  {
    break
  }

  Sleep, 500
}
DFDL Schemas
The Data Format Description Language (DFDL) allows one to write an XML schema definition which defines how to automatically parse a file in that format into an XML representation of the data. DFDL provides an ideal means of preserving the many ad hoc formats created and used in labs. The DFDL schema below is a simple example that parses the data from a PGM image file.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Load image data from a PGM file and represent the data as a sequence of pixels in row major order. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
           xmlns:ex="http://example.com"
           targetNamespace="http://example.com">

  <xs:include schemaLocation="xsd/built-in-formats.xsd"/>

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format ref="ex:daffodilTest1" separator="" initiator="" terminator=""
                   leadingSkip="0" textTrimKind="none" initiatedContent="no"
                   alignment="implicit" alignmentUnits="bits" trailingSkip="0"
                   ignoreCase="no" separatorPolicy="suppressed"
                   separatorPosition="infix" occursCountKind="parsed"
                   emptyValueDelimiterPolicy="both" representation="text"
                   textNumberRep="standard" lengthKind="delimited" encoding="ASCII"/>
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="file">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="header" dfdl:lengthKind="implicit" maxOccurs="1">
          <xs:complexType>
            <xs:sequence dfdl:sequenceKind="ordered" dfdl:separator="%NL;" dfdl:separatorPosition="postfix">
              <xs:element name="type" type="xs:string"/>
              <xs:element name="dimensions" maxOccurs="1" dfdl:occursCountKind="implicit">
                <xs:complexType>
                  <xs:sequence dfdl:sequenceKind="ordered" dfdl:separator="%SP;">
                    <xs:element name="width" type="xs:integer"/>
                    <xs:element name="height" type="xs:integer"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="depth" type="xs:integer"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="pixels" dfdl:lengthKind="implicit" maxOccurs="1">
          <xs:complexType>
            <xs:sequence dfdl:separator="%SP; %NL; %SP;%NL;" dfdl:separatorPosition="postfix" dfdl:separatorSuppressionPolicy="anyEmpty">
              <xs:element name="pixel" type="xs:integer" maxOccurs="unbounded"
                          dfdl:occursCountKind="expression"
                          dfdl:occursCount="{ ../../ex:header/ex:dimensions/ex:width * ../../ex:header/ex:dimensions/ex:height }"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
An example of the XML produced by the above DFDL schema when applied to a PGM file is shown below.
<ex:file xmlns:ex="http://example.com">
  <ex:header>
    <ex:type>P2</ex:type>
    <ex:dimensions>
      <ex:width>16</ex:width>
      <ex:height>16</ex:height>
    </ex:dimensions>
    <ex:depth>255</ex:depth>
  </ex:header>
  <ex:pixels>
    <ex:pixel>136</ex:pixel>
    <ex:pixel>136</ex:pixel>
    <ex:pixel>136</ex:pixel>
    ...
    <ex:pixel>136</ex:pixel>
    <ex:pixel>136</ex:pixel>
  </ex:pixels>
</ex:file>
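For comparison, the same parse can be sketched procedurally. The snippet below reads a plain (P2) PGM file into the header fields and flat pixel list that the DFDL schema extracts; `parse_pgm` is our own helper name, and PGM comment lines are not handled in this minimal sketch.

```python
def parse_pgm(text):
    """Parse a plain (P2) PGM file into a header dict and a flat pixel list,
    mirroring the structure produced by the DFDL schema above.
    Assumes the file contains no comment lines."""
    tokens = text.split()
    if tokens[0] != "P2":
        raise ValueError("not a plain PGM file")
    width, height, depth = int(tokens[1]), int(tokens[2]), int(tokens[3])
    # Pixels follow the header in row major order.
    pixels = [int(t) for t in tokens[4:4 + width * height]]
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match dimensions")
    return {"type": "P2", "width": width, "height": height, "depth": depth}, pixels
```

For example, `parse_pgm("P2\n2 2\n255\n136 136\n136 136\n")` yields a header with width and height 2 and a list of four pixel values.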
The Data Tilling Service (DTS)
The Data Tilling Service handles data extractions. If your code, tool, or software extracts information such as keywords from a file or its contents, then it should be included in the DTS as a Medici extractor. If it extracts a signature from the file's contents, which in turn can be compared to the signatures of other files via some distance measure to find similar pieces of data, then it should be included in the DTS as a Versus extractor.
Medici Extractors
Medici extractors typically serve to automatically extract some new kind of information from a file's content when it is uploaded into Medici. These extractors do this by connecting to a shared RabbitMQ bus. When a new file is uploaded to Medici it is announced on this bus. Extractors that can handle a file of the type posted on the bus are triggered and the data they in turn create is returned to Medici as derived data to be associated with that file. The extractors themselves can be implemented in a variety of languages.
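The routing between files and extractors is keyed off the file's MIME type. As an illustrative sketch (the `binding_key` helper is our own; the key pattern follows the `*.file.text.plain.#` binding used in the extractor examples in this section), an extractor can derive its topic binding key from the MIME type it supports:

```python
def binding_key(mime_type):
    """Build a RabbitMQ topic binding key for a given MIME type,
    following the pattern used by the extractor examples in this section
    (e.g. 'text/plain' -> '*.file.text.plain.#')."""
    return "*.file." + mime_type.replace("/", ".") + ".#"
```

An extractor bound with this key receives every message whose routing key names a file of that type, regardless of the leading and trailing segments.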
Java
An extractor must establish a connection with the Medici RabbitMQ bus, handle incoming messages, start jobs based on received messages, and ultimately carry out a job on a given file. The example below simply counts the number of words in a document and returns this information as a piece of metadata to be associated with the file.
protected void startExtractor(String rabbitMQUsername, String rabbitMQpassword) {
  try {
    //Open channel and declare exchange and consumer
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost(serverAddr);
    factory.setUsername(rabbitMQUsername);
    factory.setPassword(rabbitMQpassword);
    Connection connection = factory.newConnection();

    final Channel channel = connection.createChannel();
    channel.exchangeDeclare(EXCHANGE_NAME, "topic", true);
    channel.queueDeclare(QUEUE_NAME, DURABLE, EXCLUSIVE, AUTO_DELETE, null);
    channel.queueBind(QUEUE_NAME, EXCHANGE_NAME, "*.file.text.plain.#");
    this.channel = channel;

    //Create listener
    channel.basicConsume(QUEUE_NAME, false, CONSUMER_TAG, new DefaultConsumer(channel) {
      @Override
      public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties, byte[] body) throws IOException {
        messageReceived = new String(body);
        long deliveryTag = envelope.getDeliveryTag();

        //(process the message components here ...)
        System.out.println(" {x} Received '" + messageReceived + "'");
        replyProps = new AMQP.BasicProperties.Builder().correlationId(properties.getCorrelationId()).build();
        replyTo = properties.getReplyTo();
        processMessageReceived();
        System.out.println(" [x] Done");
        channel.basicAck(deliveryTag, false);
      }
    });

    //Start listening
    System.out.println(" [*] Waiting for messages. To exit press CTRL+C");

    while (true) {
      Thread.sleep(1000);
    }
  } catch (Exception e) {
    e.printStackTrace();
    System.exit(1);
  }
}
protected void processMessageReceived() {
  try {
    try {
      ExampleJavaExtractorService extrServ = new ExampleJavaExtractorService(this);
      jobReceived = getRepresentation(messageReceived, ExtractionJob.class);
      File textFile = extrServ.processJob(jobReceived);
      jobReceived.setFlag("wasText");

      log.info("Word count extraction complete. Returning word count file as intermediate result.");
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "Word count extraction complete. Returning word count file as intermediate result.", log);
      uploadIntermediate(textFile, "text/plain", log);
      textFile.delete();
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "DONE.", log);
    } catch (Exception ioe) {
      log.error("Could not finish extraction job.", ioe);
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "Could not finish extraction job.", log);
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "DONE.", log);
    }
  } catch (Exception e) {
    e.printStackTrace();
    System.exit(1);
  }
}
public File processJob(ExtractionJob receivedMsg) throws Exception {
  log.info("Downloading text file with ID " + receivedMsg.getIntermediateId() + " from " + receivedMsg.getHost());
  callingExtractor.sendStatus(receivedMsg.getId(), callingExtractor.getClass().getSimpleName(), "Downloading text file.", log);

  DefaultHttpClient httpclient = new DefaultHttpClient();
  HttpGet httpGet = new HttpGet(receivedMsg.getHost() + "api/files/" + receivedMsg.getIntermediateId() + "?key=" + playserverKey);
  HttpResponse fileResponse = httpclient.execute(httpGet);
  log.info(fileResponse.getStatusLine());

  if (fileResponse.getStatusLine().toString().indexOf("200") == -1) {
    throw new IOException("File not found.");
  }

  HttpEntity fileEntity = fileResponse.getEntity();
  InputStream fileIs = fileEntity.getContent();

  Header[] hdrs = fileResponse.getHeaders("content-disposition");
  String contentDisp = hdrs[0].toString();
  String fileName = contentDisp.substring(contentDisp.indexOf("filename=") + 9);

  File tempFile = File.createTempFile(fileName.substring(0, fileName.lastIndexOf(".")), fileName.substring(fileName.lastIndexOf(".")).toLowerCase());
  OutputStream fileOs = new FileOutputStream(tempFile);
  IOUtils.copy(fileIs, fileOs);
  fileIs.close();
  fileOs.close();
  EntityUtils.consume(fileEntity);

  log.info("Download complete. Initiating word count generation");
  File textFile = processFile(tempFile, receivedMsg.getId());
  return textFile;
}
private File processFile(File tempFile, String originalFileId) throws Exception {
  Runtime r = Runtime.getRuntime();
  Process p;    //Process tracks one external native process

  String tempDir = System.getProperty("java.io.tmpdir");
  if (!tempDir.endsWith(System.getProperty("file.separator"))) {
    tempDir = tempDir + System.getProperty("file.separator");
  }

  String processCmd = "";
  String operSystem = System.getProperty("os.name").toLowerCase();

  //TODO: windows impl
  if (operSystem.indexOf("nix") >= 0 || operSystem.indexOf("nux") >= 0 || operSystem.indexOf("aix") > 0) {
    processCmd = "wc -w " + tempDir + tempFile.getName();
  }

  p = r.exec(processCmd, null, new File(tempDir));

  StreamGobbler outputGobbler = new StreamGobbler(p.getInputStream(), "INFO", log);
  StreamGobbler errorGobbler = new StreamGobbler(p.getErrorStream(), "ERROR", log);
  outputGobbler.start();
  errorGobbler.start();
  p.waitFor();

  File outFile = new File(tempDir + tempFile.getName().substring(0, tempFile.getName().lastIndexOf(".")) + ".txt");
  tempFile.delete();

  if (!Files.exists(outFile.toPath()))
    throw new Exception("File not processed correctly. File is possibly corrupt.");

  return outFile;
}
Python
def main():
    global logger

    # name of receiver
    receiver = 'ExamplePythonExtractor'

    # configure the logging system
    logging.basicConfig(format="%(asctime)-15s %(name)-10s %(levelname)-7s : %(message)s", level=logging.WARN)
    logger = logging.getLogger(receiver)
    logger.setLevel(logging.DEBUG)

    if len(sys.argv) != 5:
        logger.info("Input RabbitMQ username, followed by RabbitMQ password, Medici REST API key, and exchange name.")
        sys.exit()

    global playserverKey
    playserverKey = sys.argv[3]
    global exchange_name
    exchange_name = sys.argv[4]
    # connect to rabbitmq using input username and password
    credentials = pika.PlainCredentials(sys.argv[1], sys.argv[2])
    parameters = pika.ConnectionParameters(credentials=credentials)
    connection = pika.BlockingConnection(parameters)

    # connect to channel
    channel = connection.channel()

    # declare the exchange
    channel.exchange_declare(exchange='medici', exchange_type='topic', durable=True)

    # declare the queue
    channel.queue_declare(queue=receiver, durable=True)

    # connect queue and exchange
    channel.queue_bind(queue=receiver, exchange='medici', routing_key='*.file.text.plain')

    # create listener
    channel.basic_consume(on_message, queue=receiver, no_ack=False)

    # start listening
    logger.info("Waiting for messages. To exit press CTRL+C")
    try:
        channel.start_consuming()
    except KeyboardInterrupt:
        channel.stop_consuming()

    # close connection
    connection.close()
def on_message(channel, method, header, body):
    global logger
    statusreport = {}
    inputfile = None
    try:
        # parse body back from json
        jbody = json.loads(body)
        host = jbody['host']
        fileid = jbody['id']
        intermediatefileid = jbody['intermediateId']
        if not host.endswith('/'):
            host += '/'
        ext = 'txt'    # extension for the word count file (this extractor is bound to text/plain)

        # for status reports
        statusreport['file_id'] = fileid
        statusreport['extractor_id'] = 'wordCount'

        # print what we are doing
        logger.debug("[%s] started processing", fileid)

        # fetch data
        statusreport['status'] = 'Downloading file.'
        statusreport['start'] = time.strftime('%Y-%m-%dT%H:%M:%S')
        channel.basic_publish(exchange='',
                              routing_key=header.reply_to,
                              properties=pika.BasicProperties(correlation_id=header.correlation_id),
                              body=json.dumps(statusreport))
        url = host + 'api/files/' + intermediatefileid + '?key=' + playserverKey
        r = requests.get(url, stream=True)
        r.raise_for_status()
        (fd, inputfile) = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            for chunk in r.iter_content(chunk_size=10*1024):
                f.write(chunk)

        # create word count
        statusreport['status'] = 'Creating word count.'
        statusreport['start'] = time.strftime('%Y-%m-%dT%H:%M:%S')
        channel.basic_publish(exchange='',
                              routing_key=header.reply_to,
                              properties=pika.BasicProperties(correlation_id=header.correlation_id),
                              body=json.dumps(statusreport))
        create_word_count(inputfile, ext, host, fileid)

        # Ack
        channel.basic_ack(method.delivery_tag)
        logger.debug("[%s] finished processing", fileid)
    except subprocess.CalledProcessError as e:
        logger.exception("[%s] error processing [exit code=%d]\n%s", fileid, e.returncode, e.output)
        statusreport['status'] = 'Error processing.'
        statusreport['start'] = time.strftime('%Y-%m-%dT%H:%M:%S')
        channel.basic_publish(exchange='',
                              routing_key=header.reply_to,
                              properties=pika.BasicProperties(correlation_id=header.correlation_id),
                              body=json.dumps(statusreport))
    except:
        logger.exception("[%s] error processing", fileid)
        statusreport['status'] = 'Error processing.'
        statusreport['start'] = time.strftime('%Y-%m-%dT%H:%M:%S')
        channel.basic_publish(exchange='',
                              routing_key=header.reply_to,
                              properties=pika.BasicProperties(correlation_id=header.correlation_id),
                              body=json.dumps(statusreport))
    finally:
        statusreport['status'] = 'DONE.'
        statusreport['start'] = time.strftime('%Y-%m-%dT%H:%M:%S')
        channel.basic_publish(exchange='',
                              routing_key=header.reply_to,
                              properties=pika.BasicProperties(correlation_id=header.correlation_id),
                              body=json.dumps(statusreport))
        if inputfile is not None:
            try:
                os.remove(inputfile)
            except (OSError, UnboundLocalError):
                pass
def create_word_count(inputfile, ext, host, fileid):
    global logger

    (fd, wcfile) = tempfile.mkstemp(suffix='.' + ext)
    try:
        # make syscall to wc and write its output to the word count file
        output = subprocess.check_output(['wc', inputfile], stderr=subprocess.STDOUT)
        with os.fdopen(fd, 'w') as f:
            f.write(output)

        if os.path.getsize(wcfile) == 0:
            raise Exception("File is empty.")

        # upload word count file
Calling R Scripts from Python
Coming soon...
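Until that section is available, a common stopgap is to shell out to an external interpreter from Python. The sketch below is our own and not part of the official Brown Dog documentation: `run_script` writes a code snippet to a temporary file and executes it with a given interpreter, so passing "Rscript" (assuming R is installed and Rscript is on the PATH) runs R code.

```python
import os
import subprocess
import sys
import tempfile

def run_script(interpreter, code, suffix, *args):
    """Write `code` to a temporary file and execute it with the given
    interpreter (e.g. 'Rscript' for R), returning its standard output."""
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(code)
        path = f.name
    try:
        out = subprocess.check_output([interpreter, path] + list(args),
                                      stderr=subprocess.STDOUT)
        return out.decode().strip()
    finally:
        os.remove(path)

# For R this would be, assuming Rscript is on the PATH:
#   run_script("Rscript", 'cat(commandArgs(TRUE)[1])', ".R", "hello")
```

The same helper works for any scripting language with a command line interpreter, which also makes it easy to test without an R installation.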
Versus Extractors
Versus extractors serve to extract a signature from a file's content. These signatures, effectively a hash for the data, are typically numerical vectors which capture some semantically meaningful aspect of the content so that two such signatures can then be compared using some distance measure. Within Versus, extractors operate on a data structure representing the content of a file, produced by a Versus adapter, and the returned signatures are compared by either a Versus similarity or distance measure. The combination of these adapters, extractors, and measures in turn composes a comparison which can be used for relating files according to their contents.
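To make the extract-then-compare idea concrete before the full Java example, the sketch below (our own `histogram_distance` helper, not part of Versus) performs the same comparison as the measure at the end of this section: each word histogram is normalized to sum to 1, and the Euclidean distance between the two is returned.

```python
import math

def histogram_distance(h1, h2):
    """Euclidean distance between two label histograms (dicts mapping
    label -> count), after normalizing each so its counts sum to 1.
    This mirrors the comparison done by the Java measure in this section."""
    def normalize(h):
        total = float(sum(h.values()))
        return {k: v / total for k, v in h.items()}
    n1, n2 = normalize(h1), normalize(h2)
    # Iterate over the union of labels; a missing label counts as 0.
    labels = set(n1) | set(n2)
    return math.sqrt(sum((n1.get(k, 0.0) - n2.get(k, 0.0)) ** 2 for k in labels))
```

Identical histograms yield a distance of 0, and histograms with no labels in common yield the maximum distance for normalized vectors.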
Java
public class PDFAdapter implements FileLoader, HasRGBPixels, HasText, HasLineGraphics {
  private File file;
  private double[][][] pixels;
  private List<String> words;
  private List<Path2D> graphics;

  static public void main(String[] args) {
    List<Double> weights = new ArrayList<Double>();
    List<PairwiseComparison> comparisons = new ArrayList<PairwiseComparison>();

    PairwiseComparison comparison = new PairwiseComparison();
    comparison.setId(UUID.randomUUID().toString());
    comparison.setFirstDataset(new File("data/test1.pdf"));
    comparison.setSecondDataset(new File("data/test2.pdf"));
    comparison.setAdapterId(PDFAdapter.class.getName());
    comparison.setExtractorId(TextHistogramExtractor.class.getName());
    comparison.setMeasureId(LabelHistogramEuclidianDistanceMeasure.class.getName());
    comparisons.add(comparison);
    weights.add(0.7);

    comparison = new PairwiseComparison();
    comparison.setId(UUID.randomUUID().toString());
    comparison.setFirstDataset(new File("data/test1.pdf"));
    comparison.setSecondDataset(new File("data/test2.pdf"));
    comparison.setAdapterId(PDFAdapter.class.getName());
    comparison.setExtractorId(TextHistogramExtractor.class.getName());
    comparison.setMeasureId(LabelHistogramEuclidianDistanceMeasure.class.getName());
    comparisons.add(comparison);
    weights.add(0.2);

    comparison = new PairwiseComparison();
    comparison.setId(UUID.randomUUID().toString());
    comparison.setFirstDataset(new File("data/test1.pdf"));
    comparison.setSecondDataset(new File("data/test2.pdf"));
    comparison.setAdapterId(PDFAdapter.class.getName());
    comparison.setExtractorId(TextHistogramExtractor.class.getName());
    comparison.setMeasureId(LabelHistogramEuclidianDistanceMeasure.class.getName());
    comparisons.add(comparison);
    weights.add(0.1);

    ComprehensiveEngine engine = new ComprehensiveEngine();
    Double d = engine.compute(comparisons, weights);
    System.out.println(d);
    System.exit(0);

    ExecutionEngine ee = new ExecutionEngine();
    ee.submit(comparison, new ComparisonStatusHandler() {
      @Override
      public void onStarted() {
        System.out.println("STARTED : ");
      }

      @Override
      public void onFailed(String msg, Throwable e) {
        System.out.println("FAILED : " + msg);
        e.printStackTrace();
        System.exit(0);
      }

      @Override
      public void onDone(double value) {
        System.out.println("DONE : " + value);
        System.exit(0);
      }

      @Override
      public void onAborted(String msg) {
        System.out.println("ABORTED : " + msg);
        System.exit(0);
      }
    });
  }

  public PDFAdapter() {}

  // ----------------------------------------------------------------------
  // FileLoader
  // ----------------------------------------------------------------------
  @Override
  public void load(File file) {
    this.file = file;
  }

  @Override
  public String getName() {
    return "PDF Document";
  }

  @Override
  public List<String> getSupportedMediaTypes() {
    List<String> mediaTypes = new ArrayList<String>();
    mediaTypes.add("application/pdf");
    return mediaTypes;
  }

  // ----------------------------------------------------------------------
  // HasRGBPixels
  // ----------------------------------------------------------------------
  @Override
  public double getRGBPixel(int row, int column, int band) {
    if ((pixels == null) && (getRGBPixels() == null)) {
      return Double.NaN;
    } else {
      return pixels[row][column][band];
    }
  }

  @Override
  public double[][][] getRGBPixels() {
    if (pixels == null) {
      // create monster array
      try {
        loadImages();
      } catch (IOException e) {
        e.printStackTrace();
        return null;
      }
    }
    return pixels;
  }

  private void loadImages() throws IOException {
    PDFParser parser = new PDFParser(new FileInputStream(file), PDFParser.EXTRACT_IMAGES);

    // get all images in the pdf document
    List<PDFObjectImage> images = new ArrayList<PDFObjectImage>();
    for (int i = 0; i < parser.getPageCount(); i++) {
      parser.parse(i);
      for (PDFObject po : parser.getObjects()) {
        if (po instanceof PDFObjectImage) {
          PDFObjectImage poi = (PDFObjectImage) po;
          images.add(poi);
        }
      }
    }

    // create a virtual image that is all the images combined
    // first column is the image number
    // second column is pixel (col + row*width)
    // third column is RGB value
    pixels = new double[images.size()][][];
    for (int i = 0; i < images.size(); i++) {
      PDFObjectImage poi = images.get(i);
      int w = poi.getImage().getWidth();
      int h = poi.getImage().getHeight();
      int[] rgb = poi.getImage().getRGB(0, 0, w, h, null, 0, w);
      pixels[i] = new double[rgb.length][3];
      for (int j = 0; j < rgb.length; j++) {
        pixels[i][j][0] = (rgb[j] & 0xff0000) >> 16;
        pixels[i][j][1] = (rgb[j] & 0x00ff00) >> 8;
        pixels[i][j][2] = (rgb[j] & 0x0000ff) >> 0;
      }
    }

    // close the parser
    parser.close();
  }

  // ----------------------------------------------------------------------
  // HasText
  // ----------------------------------------------------------------------
  @Override
  public List<String> getWords() {
    if (words == null) {
      words = new ArrayList<String>();
      try {
        PDFParser parser = new PDFParser(new FileInputStream(file), PDFParser.EXTRACT_TEXT);
        PDFGroupingText textgroup = new PDFGroupingText(PDFGroupingText.REMOVE_EMPTY_LINES);
        for (int i = 0; i < parser.getPageCount(); i++) {
          parser.parse(i);
          for (PDFObject po : textgroup.group(parser.getObjects())) {
            if (po instanceof PDFObjectText) {
              for (String s : ((PDFObjectText) po).getText().split("\\W+")) { //$NON-NLS-1$
                if (!s.isEmpty()) {
                  words.add(s);
                }
              }
            }
          }
        }
        parser.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
    return words;
  }

  // ----------------------------------------------------------------------
  // HasLineGraphics
  // ----------------------------------------------------------------------
  @Override
  public List<Path2D> getLineGraphics() {
    if (graphics == null) {
      graphics = new ArrayList<Path2D>();
      try {
        PDFParser parser = new PDFParser(new FileInputStream(file), PDFParser.EXTRACT_GRAPHICS);
        PDFGroupingGraphics textgroup = new PDFGroupingGraphics();
        for (int i = 0; i < parser.getPageCount(); i++) {
          parser.parse(i);
          for (PDFObject po : textgroup.group(parser.getObjects())) {
            if (po instanceof PDFObjectGraphics) {
              graphics.add(((PDFObjectGraphics) po).getPath());
            }
          }
        }
        parser.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
    return graphics;
  }
}
public class TextHistogramExtractor implements Extractor {
  @Override
  public Adapter newAdapter() {
    throw (new RuntimeException("Not supported."));
  }

  @Override
  public String getName() {
    return "Text Histogram Extractor";
  }

  @Override
  public Set<Class<? extends Adapter>> supportedAdapters() {
    Set<Class<? extends Adapter>> adapters = new HashSet<Class<? extends Adapter>>();
    adapters.add(HasText.class);
    return adapters;
  }

  @Override
  public Class<? extends Descriptor> getFeatureType() {
    return LabelHistogramDescriptor.class;
  }

  @Override
  public Descriptor extract(Adapter adapter) throws Exception {
    if (adapter instanceof HasText) {
      LabelHistogramDescriptor desc = new LabelHistogramDescriptor();
      for (String word : ((HasText) adapter).getWords()) {
        desc.increaseBin(word);
      }
      return desc;
    } else {
      throw new UnsupportedTypeException();
    }
  }

  @Override
  public boolean hasPreview() {
    return false;
  }

  @Override
  public String previewName() {
    return null;
  }
}
public class LabelHistogramEuclidianDistanceMeasure implements Measure {
  @Override
  public SimilarityPercentage normalize(Similarity similarity) {
    return new SimilarityPercentage(1 - similarity.getValue());
  }

  @Override
  public String getFeatureType() {
    return LabelHistogramDescriptor.class.getName();
  }

  @Override
  public String getName() {
    return "Histogram Distance";
  }

  @Override
  public Class<LabelHistogramEuclidianDistanceMeasure> getType() {
    return LabelHistogramEuclidianDistanceMeasure.class;
  }

  @Override
  public Similarity compare(Descriptor desc1, Descriptor desc2) throws Exception {
    if ((desc1 instanceof LabelHistogramDescriptor) && (desc2 instanceof LabelHistogramDescriptor)) {
      LabelHistogramDescriptor lhd1 = (LabelHistogramDescriptor) desc1;
      LabelHistogramDescriptor lhd2 = (LabelHistogramDescriptor) desc2;

      // get all possible labels
      Set<String> labels = new HashSet<String>();
      labels.addAll(lhd1.getLabels());
      labels.addAll(lhd2.getLabels());

      // normalize
      lhd1.normalize();
      lhd2.normalize();

      // compute distance
      double sum = 0;
      for (String s : labels) {
        Double b1 = lhd1.getBin(s);
        Double b2 = lhd2.getBin(s);
        if (b1 == null) {
          sum += b2 * b2;
        } else if (b2 == null) {
          sum += b1 * b1;
        } else {
          sum += (b1 - b2) * (b1 - b2);
        }
      }
      return new SimilarityNumber(Math.sqrt(sum), 0, 1, 0);
    } else {
      throw new UnsupportedTypeException();
    }
  }
}