In addition to resolving the problem stated above, expanding extractor job tracking would also allow extractor developers to better debug and analyze their extractor jobs, or to determine if there are performance enhancements that could be made.
The API defines the ExtractorInfo and Extraction models, and offers an Extractions API for fetching and updating these events for display. The MongoDBExtractorService can be used internally to create and interact with those models within MongoDB.
The Clowder API for file uploads includes an optional call calls to the RabbitMqPlugin. This plugin must be enabled to use extractors.
If the RabbitMQ plugin is enabled, Clowder subscribes to RabbitMQ to receive heartbeat messages from each extractor. Clowder subscribes to these messages to determine if an extractor is still online and functioning. (This is the ExtractorInfo model.)
When a heartbeat is discovered and/or an extractor is registered manually via the API, Clowder subscribes to a "reply queue" to receive updates back about the extractor.
Each new file upload will push an event into the appropriate extractor queue(s) in RabbitMQ based on the types defined by each extractor.
When a reply comes back containing a file + extractor combo, Clowder creates a new Extraction in the database housing this message.
pyClowder pyClowder, as the name implies, is a framework for writing Python-based extractors for Clowder. It includes, among other utilities, a client for subscribing to the appropriate queues in RabbitMQ. It pyClowder knows how to automatically send back heartbeat signals and status updates to Clowder.
When a pyClowder-based extractor starts up, it begins sending heartbeat messages to RabbitMQ indicating the status of the extractor. Clowder subscribes to these messages to determine if an extractor is still online and functioning. (This is the ExtractorInfo model.)
At this time, the extractor also creates and/or subscribes to a queue unique to its extractor identifier, to which Clowder will send new upload events to be processed.
TBDFor each new file upload, Clowder will push an event into the appropriate extractor queue(s) in RabbitMQ based on the types/rules defined in the ExtractorInfo.
If an extractor is idle and sees a new message in the queue it is watching, it will grab the message and start processing.
Whenever a stdout/stderr log event happens or if the Extraction status changes (e.g. SUBMITTED → PROCESSED or PROCESSED → DONE), pyClowder will send an update back to Clowder on the "reply queue".
When a reply comes back on the "reply queue" containing a file + extractor combo, Clowder creates a new Extraction in the database housing this message.
The UI UIs for viewing a File offers or Dataset offer an Extractions tab, which simply lists out all Extraction objects associated with this the file or dataset and groups by extractor type.
pyClowder may need minor updates to send more information regarding a specific job. Alongside the user, resource (e.g. dataset or file), and extractor name, pyClowder should also attach a unique identifier (e.g. UUID) to related job messages. That is, when an extractor begins working on a new item from the queue, it should assign a unique identifier to that workload that is unique to each run of the desired extractor.
This way, we can easily create a UI that can filter and sort through these events without making major modifications to the UI and without taking a large performance hit in the frontend.
The API changes proposed should be fairly minimal, as we are simply extending the existing Extraction API to account for the new identifier that will need to be added to pyClowder.
- What do we do about past extractions/jobs that are missing this identifier? Is it worth preserving the existing display in the UI for this purpose?
- Extraction already has an id field that is unique per-status update - would adding a job_id field work in a way that keeps this data/view backward-compatible as possible?
- Should Clowder or pyClowder create this unique identifier? What about cases where an extractor dies mid-job? Can/should another extractor pick this up with the same ID, or should this generate a new ID instead?