Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Medici extractors typically serve to automatically extract some new kind of information from a file's content when it is uploaded into Medici.  These extractors do this by connecting to a shared RabbitMQ bus.  When a new file is uploaded to Medici it is announced on this bus.  Extractors that can handle a file of the type posted on the bus are triggered and the data they in turn create is returned to Medici as derived data to be associated with that file.  The extractors themselves can be implemented in a variety of languages. Examples of these extractors in different languages can be found in the extractors-templates code repository.

Anchor
Java
Java
Java

An extractor must establish a connection with the Medici RabbitMQ bus, handle incoming messages, start jobs based on received messages, and ultimatley carry out a job on a given file.  The example below simply counts the number of words in a document and returns this information as a piece of metadata to be associated with the file.

Code Block
themeEmacs
languagejava
titleConnecting to RabbitMQ
protected void startExtractor(String rabbitMQUsername, String rabbitMQpassword)
{
	try{ 
 		//Open channel and declare exchange and consumer
		ConnectionFactory factory = new ConnectionFactory();
		factory.setHost(serverAddr);
		factory.setUsername(rabbitMQUsername);
		factory.setPassword(rabbitMQpassword);
		Connection connection = factory.newConnection();

 		final Channel channel = connection.createChannel();
		channel.exchangeDeclare(EXCHANGE_NAME, "topic", true);

		channel.queueDeclare(QUEUE_NAME,DURABLE,EXCLUSIVE,AUTO_DELETE,null);
		channel.queueBind(QUEUE_NAME, EXCHANGE_NAME, "*.file.text.plain.#");
 
 		this.channel = channel;

 		// create listener
		channel.basicConsume(QUEUE_NAME, false, CONSUMER_TAG, new DefaultConsumer(channel) {
 			@Override
 			public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties, byte[] body) throws IOException {
				messageReceived = new String(body);
 				long deliveryTag = envelope.getDeliveryTag();
 				// (process the message components here ...)
				System.out.println(" {x} Received '" + messageReceived + "'");
 
				replyProps = new AMQP.BasicProperties.Builder().correlationId(properties.getCorrelationId()).build();
				replyTo = properties.getReplyTo();
 
				processMessageReceived();
				System.out.println(" [x] Done");
				channel.basicAck(deliveryTag, false);
			}
		});

 		// start listening 
		System.out.println(" [*] Waiting for messages. To exit press CTRL+C");

  		while (true) {
			Thread.sleep(1000);
		}
	}catch(Exception e){
		e.printStackTrace();
		System.exit(1);
	} 
}
Code Block
languagejava
titleProcessing Messages
protected void processMessageReceived()
{
  try {
    try {
      ExampleJavaExtractorService extrServ = new ExampleJavaExtractorService(this);
      jobReceived = getRepresentation(messageReceived, ExtractionJob.class);        
    
      File textFile = extrServ.processJob(jobReceived);
      jobReceived.setFlag("wasText");
      log.info("Word count extraction complete. Returning word count file as intermediate result.");
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "Word count extraction complete. Returning word count file as intermediate result.", log);
      uploadIntermediate(textFile, "text/plain", log);
      textFile.delete();
        
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "DONE.", log);
    } catch (Exception ioe) {
      log.error("Could not finish extraction job.", ioe);
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "Could not finish extraction job.", log);
      sendStatus(jobReceived.getId(), this.getClass().getSimpleName(), "DONE.", log);
    }
  } catch(Exception e) {
      e.printStackTrace();
      System.exit(1);
  } 
}
Code Block
languagejava
titleProcessing Jobs
public File processJob(ExtractionJob receivedMsg) throws Exception 
{    
  log.info("Downloading text file with ID "+ receivedMsg.getIntermediateId() +" from " + receivedMsg.getHost());
  callingExtractor.sendStatus(receivedMsg.getId(), callingExtractor.getClass().getSimpleName(), "Downloading text file.", log);
      
  DefaultHttpClient httpclient = new DefaultHttpClient();
  HttpGet httpGet = new HttpGet(receivedMsg.getHost() +"api/files/"+ receivedMsg.getIntermediateId()+"?key="+playserverKey);    
  HttpResponse fileResponse = httpclient.execute(httpGet);
  log.info(fileResponse.getStatusLine());

  if(fileResponse.getStatusLine().toString().indexOf("200") == -1){
    throw new IOException("File not found.");
  }

  HttpEntity fileEntity = fileResponse.getEntity();
  InputStream fileIs = fileEntity.getContent();
  
  Header[] hdrs = fileResponse.getHeaders("content-disposition");
  String contentDisp = hdrs[0].toString();
  
  String fileName = contentDisp.substring(contentDisp.indexOf("filename=")+9);
  File tempFile = File.createTempFile(fileName.substring(0, fileName.lastIndexOf(".")),      fileName.substring(fileName.lastIndexOf(".")).toLowerCase());
  OutputStream fileOs = new FileOutputStream(tempFile);   
  IOUtils.copy(fileIs,fileOs);
  fileIs.close();
  fileOs.close();
      
  EntityUtils.consume(fileEntity); 
      
  log.info("Download complete. Initiating word count generation");
  
  File textFile = processFile(tempFile, receivedMsg.getId());  
  return textFile;       
}

...

languagejava
titleProcessing Files

...

Java extractors can be created using the amqp-client jar file. This allows you to connect to the RabitMQ bus and received messages. The easiest way to get up and running is to use maven with java to add all required dependencies. An example of an extractor written in java can be found at Medici Extractor in Java.

Anchor
Python
Python
Python

Python extractors will often be based on the packages pika and requests. This allows you to connect to the RabittMQ message bus and easily send requests to medici. A complete example of the python extractor can be found at Medici Extractor in Python

...