Extractors are useful for performing a specific task, such as calculating metadata to attach to a resource or generating additional output files. However, some processes are more complex and require several extractors to be daisy-chained, particularly when an extractor needs multiple inputs that must each be processed by a separate extractor first.

Some projects have already adopted solutions for this situation. This page discusses the models currently in use, in order to determine whether these use cases can be better supported.

TERRA-REF Pipeline

TERRA uses a unique pipeline of extractors for each of the ~10 main sensors on the sensing platform, and each pipeline includes 5-10 extractors that operate as discrete steps in the workflow. 


If all 10 extractors were configured to listen for files being uploaded to datasets and used the check_message() step to decide whether to process them, there would be tremendous overhead: RabbitMQ would send every event notification to all of them, so each extractor would constantly have to comb through the data of 9 unrelated sensors looking for relevant data. Instead, TERRA developed the "rulechecker" extractor (colloquially called the switchboard) to manage the data flow.


(needs some updates from the TERRA fork: https://opensource.ncsa.illinois.edu/bitbucket/users/mburnet2/repos/terraref-rulechecker/browse)

The basic setup:

The rulechecker is a way to work around a limitation of dataset-level extractors: it is not possible to specify a MIME type or other parameter for triggering them. Dataset extractors must therefore either evaluate every dataset message (as rulechecker does) or be triggered by alternative means (as the extractors that rulechecker triggers are).
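As a rough illustration of the switchboard idea, the sketch below evaluates every dataset message against a list of rules and triggers only the matching extractors. It is not TERRA's rules.py: the sensor names, extractor names, the "sensor" metadata key, and the submit callback are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# Hypothetical rule signature: takes dataset info, returns extractor names to trigger.
Rule = Callable[[Dict[str, str]], List[str]]


def sensor_rule(sensor: str, first_extractor: str) -> Rule:
    """Build a rule that fires first_extractor for datasets from one sensor."""
    def rule(dataset: Dict[str, str]) -> List[str]:
        return [first_extractor] if dataset.get("sensor") == sensor else []
    return rule


# Illustrative names only, not TERRA's real configuration.
RULES: List[Rule] = [
    sensor_rule("sensor_a", "example.sensor_a.step1"),
    sensor_rule("sensor_b", "example.sensor_b.step1"),
]


def switchboard(dataset: Dict[str, str], rules: List[Rule] = RULES) -> List[str]:
    """Evaluate every rule against one dataset message; collect extractors to trigger."""
    triggered: List[str] = []
    for rule in rules:
        triggered.extend(rule(dataset))
    return triggered


def route(dataset: Dict[str, str], submit: Callable[[str, str], None]) -> None:
    """Trigger each matching extractor via submit(dataset_id, extractor_name).

    submit is an assumed callback standing in for whatever call the
    deployment uses to request an extraction.
    """
    for extractor_name in switchboard(dataset):
        submit(dataset["id"], extractor_name)
```

With this shape, only the switchboard inspects every message; the downstream extractors receive only the datasets routed to them.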

Daisy-chaining directly

From there, each extractor can, where possible, trigger the next extractor in the chain directly at the end of its process_message() function.

This is the simplest option if the extractor pipelines are simple, direct 1-to-1 paths.
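A minimal sketch of direct daisy-chaining might look like the following. The NEXT_STEP table, the ChainedExtractor class, the extractor names, and the submit callback are assumptions made for illustration; a real extractor would replace do_work() with its processing and submit with the deployment's own extraction-request call.

```python
from typing import Callable, Dict, Optional

# Hypothetical chain: each extractor name maps to the extractor that follows it.
NEXT_STEP: Dict[str, Optional[str]] = {
    "example.step1": "example.step2",
    "example.step2": "example.step3",
    "example.step3": None,  # end of the chain
}


class ChainedExtractor:
    """Toy extractor that hands its resource to the next step when it finishes."""

    def __init__(self, name: str, submit: Callable[[str, str], None]):
        self.name = name
        # submit(resource_id, extractor_name) is an assumed stand-in for the
        # call that asks the system to run another extractor.
        self.submit = submit

    def do_work(self, resource: Dict[str, str]) -> None:
        """Placeholder for the extractor's actual processing."""

    def process_message(self, resource: Dict[str, str]) -> None:
        self.do_work(resource)
        # Last action of process_message(): trigger the next extractor, if any.
        next_extractor = NEXT_STEP.get(self.name)
        if next_extractor is not None:
            self.submit(resource["id"], next_extractor)
```

Keeping the chain in one table makes the 1-to-1 path easy to read, but each extractor then depends on knowing its successor, which is one reason branching pipelines are harder to express this way.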

Collection-level extractors

For extractors such as the fieldmosaic stitcher, which mosaics together 9000+ images from a single day, the rulechecker extractor is used once again. Each geoTIFF is passed back to rulechecker, which triggers a special rule that adds the geoTIFF to a PSQL database maintaining a list of geoTIFFs for a specific day or scan. Once a threshold is met, all of the geoTIFFs (each from a different dataset) are passed to the fieldmosaic extractor so that they can be stitched together at once.
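The accumulate-until-threshold pattern can be sketched as follows. To stay self-contained, this example uses an in-memory SQLite database in place of the PSQL database; the table layout, function names, and scan key format are hypothetical and are not taken from the TERRA rulechecker.

```python
import sqlite3
from typing import List, Optional


def open_tracker() -> sqlite3.Connection:
    """Create an in-memory table tracking which geoTIFFs belong to each scan."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE geotiffs ("
        "scan_key TEXT, file_id TEXT, PRIMARY KEY (scan_key, file_id))"
    )
    return conn


def record_geotiff(
    conn: sqlite3.Connection, scan_key: str, file_id: str, threshold: int
) -> Optional[List[str]]:
    """Record one geoTIFF for a scan.

    Returns the full list of file ids at the moment the count first reaches
    the threshold, and None otherwise (including for duplicate files).
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO geotiffs VALUES (?, ?)", (scan_key, file_id)
    )
    conn.commit()
    if cur.rowcount == 0:
        return None  # duplicate message; nothing new was recorded
    rows = conn.execute(
        "SELECT file_id FROM geotiffs WHERE scan_key = ? ORDER BY file_id",
        (scan_key,),
    ).fetchall()
    if len(rows) == threshold:
        return [row[0] for row in rows]
    return None
```

When record_geotiff() returns a list, the caller would submit that whole list to the collection-level extractor. Triggering only when the count first equals the threshold avoids resubmitting the batch for late or repeated files, which is a design choice of this sketch rather than a documented behavior.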

Rulechecker is useful when:

For an in-depth example, see the terraref_switchboard() function in TERRA's rules.py. This file is required for a rulechecker deployment and defines which rules to execute on each dataset (https://opensource.ncsa.illinois.edu/bitbucket/users/mburnet2/repos/terraref-rulechecker/browse/rules.py).