This page is under construction. It describes the features of the Extractor Info Fetcher Service, the options for implementing it, and the implementation method that was chosen, along with the reasons behind that choice.

Feature Description

To move compute to the data, Brown Dog clients need to know which extractors can process a given file. Clients will first call an "Extractor Info Fetcher Service" through the BD-API endpoint /extractors. The service should return a list of extractors (including the Docker image name) that can process files of a given file type. The list should come from the same instance of BD-API that received the request (Dev, Prod, etc.); that is, only extractors bound to that Brown Dog instance through its RabbitMQ virtual host should be returned. Ideally these extractors should be either currently running or available, i.e., they shouldn't be obsolete.

To find the extractors that can process a file, the service needs to know the file type. There are two options: the client can determine the MIME type of the file and send it to the service, or the client can send the file extension and the service can derive the MIME type from it. Each result entry returned by the service should contain the extractor name, extractor id, Docker image name, and git repository name. The service needs to be an independent one.
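As an illustration of the result entry described above, here is a minimal sketch of how one entry could be modeled and serialized. The field names, the `extractors` wrapper key, and the example values are assumptions made for this sketch; the page only specifies which four pieces of information each entry should carry.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExtractorEntry:
    # Field names are hypothetical; only the four pieces of
    # information come from the feature description.
    extractor_name: str
    extractor_id: str
    docker_image_name: str
    git_repository: str


def to_response(entries):
    """Serialize a list of ExtractorEntry objects to a JSON string."""
    # The top-level "extractors" key is an assumption of this sketch.
    return json.dumps({"extractors": [asdict(e) for e in entries]})


# Placeholder values, not a real extractor.
example = ExtractorEntry(
    extractor_name="example-extractor",
    extractor_id="example.extractor.id",
    docker_image_name="example/extractor:latest",
    git_repository="example-extractor",
)
```

Calling `to_response([example])` produces a JSON object with a single-element `extractors` list.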

Implementation Details

This feature can be implemented as a Python Flask app. Flask makes it easy to develop and debug API endpoints.

Server side vs client side calculation of MIME type

There are two options for determining the MIME type of a file. In the first, the client finds the MIME type and sends it to the service. This is easy when the client is written in a language like Python, but may be difficult in other clients. In the second, the client sends the file extension and the service derives the MIME type from it. This gives the service much more control, and it means the logic for determining the MIME type does not have to be implemented in multiple places (the clients). With either option, some MIME types are non-standard and cannot be determined from the file extension.
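A minimal sketch of the server-side option, assuming the service uses Python's standard `mimetypes` module. The function name `mime_type_from_extension` and the `default` parameter are hypothetical. Non-standard types that the module does not know fall back to the default, which reflects the limitation noted above.

```python
import mimetypes


def mime_type_from_extension(extension, default=None):
    """Return the MIME type for an extension such as 'png' or '.png'.

    Returns `default` when the extension is empty or not recognized.
    """
    ext = extension.strip().lower()
    if not ext:
        return default
    if not ext.startswith("."):
        ext = "." + ext
    # guess_type expects a file name or URL, so a placeholder name is used.
    mime_type, _encoding = mimetypes.guess_type("file" + ext)
    return mime_type or default
```

For example, `mime_type_from_extension("png")` yields `image/png`, while an unrecognized extension yields whatever `default` was passed.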

Finding extractors that are running (or active) at any time

It is very important that this service returns only those extractors that are running or currently active (in the sense that the extractor would be triggered if a file were submitted to a Brown Dog extraction service). This is tricky. If a RabbitMQ extractor queue has consumers, the extractor is running, but the absence of consumers does not imply the opposite. Brown Dog uses its Elasticity module to automatically change the number of consumers based on demand, and it allows the number of consumers to be set to 0. This helps conserve resources, especially for extractors that use a lot of disk space or memory but are rarely used. As a result, a queue with 0 consumers does not necessarily belong to an obsolete extractor; the Elasticity module may simply have scaled it down. What seems practical is to maintain the Brown Dog RabbitMQ queues properly, i.e., to delete unused or old extractor queues, so that the RabbitMQ API can be used to find the extractors that are presently in active use.
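The approach above could be sketched as follows, assuming the RabbitMQ management plugin is enabled and its HTTP API (`GET /api/queues/{vhost}`) is reachable. The function names, the base URL, and the credentials are hypothetical, and the sketch assumes every queue in the virtual host is an extractor queue named after its extractor. Queues with 0 consumers are deliberately kept, because the Elasticity module may have scaled them down.

```python
import base64
import json
import urllib.parse
import urllib.request


def queues_url(base_url, vhost):
    """Build the management API URL that lists the queues of one vhost."""
    # The vhost is percent-encoded, so the default vhost "/" becomes "%2F".
    return base_url.rstrip("/") + "/api/queues/" + urllib.parse.quote(vhost, safe="")


def fetch_queues(base_url, vhost, user, password):
    """Fetch the queue list for a vhost using HTTP basic authentication."""
    request = urllib.request.Request(queues_url(base_url, vhost))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def current_extractors(queues):
    """Treat every existing queue as a current extractor.

    A queue with 0 consumers is still reported, since it may only
    have been scaled down rather than retired.
    """
    return [
        {"extractor": q["name"], "consumers": q.get("consumers", 0)}
        for q in queues
    ]
```

This only works as intended if stale queues are deleted, since the existence of a queue is the only signal used.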

Finding extractors that are bound to a specific Brown Dog instance

Currently, we have two instances of Brown Dog: the development (dev) instance and the production (prod) instance. In the future, as Brown Dog is adopted by other institutions, some of them may want their own internal instances. A file submitted for processing must therefore be processed by the specific Brown Dog instance it was intended for. For the Extractor Info Fetcher service, this means that when returning the list of extractors that can work on a given file type, it must also take into account the Brown Dog instance to which those extractors belong. In RabbitMQ for Brown Dog, we use a specific virtual host for each instance. Extractors can also register themselves with multiple instances of Clowder (e.g. BD-Clowder-Dev, BD-Clowder, etc.) using the registration API endpoint, and this information could likewise be used to find the extractors bound to a Brown Dog instance. However, when a new Brown Dog instance (and with it a new Clowder instance behind the scenes for managing data) is created, the extractor code (extractor_info.json) has to be modified to register the extractor with the new instance. This makes the registration-based approach less scalable.
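A minimal sketch of the virtual-host approach, assuming the service keeps a mapping from instance name to RabbitMQ virtual host. The mapping contents, the virtual host names, and the function name are all hypothetical. With this approach, adding a new instance would mean adding one entry to the mapping rather than changing any extractor code.

```python
# Hypothetical mapping from Brown Dog instance name to RabbitMQ virtual host.
INSTANCE_VHOSTS = {
    "dev": "browndog-dev",
    "prod": "browndog-prod",
}


def vhost_for_instance(instance):
    """Return the virtual host for an instance name, ignoring case and spaces."""
    key = instance.strip().lower()
    try:
        return INSTANCE_VHOSTS[key]
    except KeyError:
        # Reject unknown instances instead of falling back to another vhost.
        raise ValueError(f"unknown Brown Dog instance: {instance!r}") from None
```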
