
Extractors are useful for performing a specific task, such as calculating metadata to attach to a resource or generating additional output files. However, more complex processes sometimes require several extractors to be daisy-chained, particularly when an extractor needs multiple inputs that must each be produced by a separate extractor run first.


Some projects have started employing solutions for this situation, and this page discusses the models currently in use to determine whether these use cases can be better supported.


TERRA-REF Pipeline

TERRA uses a distinct pipeline of extractors for each of the ~10 main sensors on the sensing platform, and each pipeline includes 5-10 extractors that operate as discrete steps in the workflow.

Rulechecker

If the extractors for all 10 sensors were configured to listen for files being uploaded to datasets and used the check_message() step to decide whether to process, there would be tremendous overhead: RabbitMQ would send each event notification to every extractor, so each one would constantly comb through nine other unrelated sensors' data looking for anything relevant. Instead, TERRA developed the "rulechecker" extractor (colloquially called the switchboard) to manage the data flow.
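
For context, a naive dataset-level extractor would do its own filtering in pyclowder's check_message() hook, along the lines of the sketch below (a minimal sketch; the class name, sensor-name test, and resource fields are illustrative assumptions):

    # Naive approach, shown for contrast: each extractor listens to every
    # dataset event and filters for itself in check_message().
    # Class name, "stereoTop" test, and resource fields are assumptions.
    from pyclowder.extractors import Extractor
    from pyclowder.utils import CheckMessage

    class Bin2TifExtractor(Extractor):
        def check_message(self, connector, host, secret_key, resource, parameters):
            # Invoked for EVERY dataset.file.added event across all ~10
            # sensors, so most invocations are wasted filtering work.
            if "stereoTop" in resource.get("name", ""):
                return CheckMessage.download
            return CheckMessage.ignore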

https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/extractors-rulechecker/browse 

(needs some updates from the TERRA fork: https://opensource.ncsa.illinois.edu/bitbucket/users/mburnet2/repos/terraref-rulechecker/browse)

The basic setup:

  • Individual extractors such as the bin2tif converter are NOT configured to listen for any events. They can only be triggered directly (e.g. something passes a message to their queue specifically by name).
  • The rulechecker switchboard is configured to listen for "dataset.file.added" events. When an event lands, rulechecker evaluates many rules at once; these rules are akin to the check_message() functions of the downstream extractors (see the sketch after this list).
    • e.g. "Is this a stereoTop dataset with 2 BIN files? If yes, trigger bin2tif. If no, move on to the next rule."
  • In this way, uploading 10,000 files generates 10,000 messages for rulechecker, which uses its switchboard set of rules to pass those messages along to the necessary downstream extractors. So instead of bin2tif, flir2tif, ply2las, etc. all getting 10,000 messages, they only get ~3,300 each; rulechecker alone sees every message.
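
To make a rule concrete, a hypothetical switchboard rule for the stereoTop case might look like the following (the function signature, resource fields, and queue name are illustrative, not rulechecker's actual API):

    # Hypothetical switchboard rule; the signature, resource fields, and
    # queue name are illustrative, not rulechecker's actual API.
    def rule_stereo_bin2tif(resource):
        """Route stereoTop datasets holding a complete BIN pair to bin2tif."""
        files = resource.get("files", [])
        bin_files = [f for f in files if f["filename"].lower().endswith(".bin")]
        if "stereoTop" in resource.get("name", "") and len(bin_files) == 2:
            return ["terra.stereo-rgb.bin2tif"]  # queue(s) to trigger by name
        return []  # no match; the switchboard moves on to the next rule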

This arrangement works around the fact that dataset-level extractors cannot specify a MIME type or other parameter for triggering: a dataset extractor must either evaluate every dataset message (as rulechecker does) or be triggered by alternative means (as the extractors rulechecker invokes are).

Daisy-chaining

From there, each extractor can trigger the next extractor in the chain directly at the end of its process_message() function.

  • For example, bin2tif uses pyclowder's submit_extraction() function to pass its geoTIFF outputs along to 3 additional extractors, as sketched below.
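
A minimal sketch of that hand-off, assuming pyclowder 2's files.submit_extraction() and upload_to_dataset(); the downstream queue names and output path are illustrative:

    # Daisy-chaining at the end of process_message(); downstream queue
    # names and the output path are illustrative assumptions.
    import pyclowder.files
    from pyclowder.extractors import Extractor

    class Bin2TifExtractor(Extractor):
        def process_message(self, connector, host, secret_key, resource, parameters):
            # ... convert the BIN pair to a geoTIFF, then upload it ...
            tif_id = pyclowder.files.upload_to_dataset(
                connector, host, secret_key, resource["id"], "/tmp/output.tif")
            # Submit the new geoTIFF directly to the next extractors' queues
            # by name, bypassing event-based triggering entirely.
            for queue in ["terra.plotclipper", "terra.canopycover",
                          "ncsa.rulechecker.terra"]:
                pyclowder.files.submit_extraction(
                    connector, host, secret_key, tif_id, queue)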

Collection-level extractors

For extractors such as the fieldmosaic stitcher, which mosaics together 9,000+ images from a single day, the rulechecker extractor is used once again. Each geoTIFF is passed back to rulechecker, which triggers a special rule to add that geoTIFF to a PostgreSQL database maintaining the list of geoTIFFs for a specific day or scan. Once a threshold is met, all 9,000 geoTIFFs (each from a different dataset) are passed to the fieldmosaic extractor to stitch at once.
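
In outline, that aggregation rule is an insert-then-count against the PostgreSQL table. A rough sketch, assuming a simple geotiffs table keyed by scan date (the table layout, field names, and threshold are assumptions; the real rule lives in the TERRA rulechecker fork linked above):

    # Rough sketch of a threshold rule backed by PostgreSQL; the table
    # layout, field names, and threshold are illustrative assumptions.
    import psycopg2

    def rule_fieldmosaic(resource, conn, expected_count=9000):
        """Record each geoTIFF; fire fieldmosaic once a day's set is complete."""
        scan_date = resource.get("scan_date")  # assumed derivable from the dataset
        with conn.cursor() as cur:
            cur.execute("INSERT INTO geotiffs (scan_date, file_id) VALUES (%s, %s)",
                        (scan_date, resource["id"]))
            cur.execute("SELECT count(*) FROM geotiffs WHERE scan_date = %s",
                        (scan_date,))
            (count,) = cur.fetchone()
        conn.commit()
        if count >= expected_count:
            return ["terra.stitchers.fieldmosaic"]  # trigger the stitcher
        return []  # still accumulating geoTIFFs for this scan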
