...

BD-Clowder is the web application that handles content management for Brown Dog. It stores the files that BD clients submit for processing, sends those files to remote extractor services, and stores the generated metadata and other auxiliary information for future use. A Mongo database currently serves as the data store. Inside this database, a collection named extractors.info holds the details of each extractor registered with this Clowder instance. Those details come from the extractor_info.json file that is now part of every extractor. There is also a plan to list, inside extractor_info.json, the filetypes on which an extractor will fire. All the information needed to find the extractors that can process a given file type will then be available in the extractor itself. So, if the extractor info fetcher service resides inside BD-Clowder, it can query the database and obtain the list of matching extractors along with their details.
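The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the actual BD-Clowder code: the documents are hypothetical stand-ins for entries in extractors.info, the "filetypes" key is an assumed name for the planned field, and the extractor names are made up. A plain list is filtered here; a real service would query the Mongo collection instead.

```python
from fnmatch import fnmatchcase

# Hypothetical documents shaped like entries in the extractors.info collection.
# The "filetypes" key is an assumption standing in for the planned field.
EXTRACTORS = [
    {"name": "example.image.ocr", "version": "1.0",
     "filetypes": ["image/png", "image/jpeg"]},
    {"name": "example.pdf.text", "version": "2.1",
     "filetypes": ["application/pdf"]},
    {"name": "example.image.thumbnail", "version": "0.3",
     "filetypes": ["image/*"]},
]


def find_extractors(extractors, filetype):
    """Return the extractor documents whose declared filetypes match `filetype`.

    A declared filetype may be an exact MIME type ("image/png") or a
    wildcard pattern ("image/*"); both are handled by fnmatchcase.
    """
    matches = []
    for extractor in extractors:
        patterns = extractor.get("filetypes", [])
        if any(fnmatchcase(filetype, pattern) for pattern in patterns):
            matches.append(extractor)
    return matches


if __name__ == "__main__":
    for extractor in find_extractors(EXTRACTORS, "image/png"):
        print(extractor["name"], extractor["version"])
```

Supporting wildcard entries such as "image/*" is a design choice of this sketch; the source does not say whether extractor_info.json will allow them.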

Implementation Details

The prototype of this feature has been developed as a Python Flask app, since API endpoints are easy to develop and debug with Flask.

Server side vs client side calculation of filetype 

There are two methods to calculate the filetype. In the first, the client finds the filetype of a file and sends it to the service. This is easy when the client is written in a language like Python, but may be difficult in some other clients. In the second, the client sends the file extension and the service figures out the filetype from it. This gives much more control to the service, and it also means that the logic for calculating the MIME type need not be implemented in multiple places (the clients). If file extensions cannot be uniquely mapped to filetypes, however, this approach will probably not work; in that case the client is better placed to determine the filetype by scanning the input file header. With either method, there are cases where the MIME type is not standard and cannot be found from the file extension.
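The two methods can be compared in a short sketch that uses only the Python standard library. The function names and the small table of magic numbers are illustrative assumptions, not part of the prototype: filetype_from_extension shows what the service could do with an extension sent by a client, and filetype_from_header shows the kind of header scan a client could perform.

```python
import mimetypes

# A few well-known file signatures (magic numbers), for illustration only.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def filetype_from_extension(extension):
    """Service-side method: map a file extension to a MIME type.

    Accepts "png" or ".png"; returns None when the extension is unknown.
    """
    if not extension.startswith("."):
        extension = "." + extension
    mime_type, _encoding = mimetypes.guess_type("file" + extension.lower())
    return mime_type


def filetype_from_header(header_bytes):
    """Client-side method: identify the filetype from the first bytes of a file.

    Returns None when no known signature matches.
    """
    for magic, mime_type in MAGIC_NUMBERS:
        if header_bytes.startswith(magic):
            return mime_type
    return None


if __name__ == "__main__":
    print(filetype_from_extension("png"))
    print(filetype_from_header(b"%PDF-1.7 example"))
```

Both functions return None when they cannot decide, which corresponds to the non-standard MIME type cases mentioned above; how the service should respond in that situation is left open here.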

...