Notes about how we handle extractors/RabbitMQ messages.

Currently

  • FILE - Originally, extractors operated primarily on a single file. These trigger when a file is added to Clowder.
    • *.file.#
    • *.file.image.#

      field           description
      id              file UUID
      intermediateId  file UUID (deprecated)
      datasetId       id of dataset the file was added to
      filename        file name
      secretKey       Clowder API key
      host            Clowder host URL
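
      For reference, a sketch of what a *.file.# message body might look like, written here as a Python dict (all values invented; fields match the table above):

        file_message = {
            "id": "576d3a8ee4b0ef1f4d5c13e7",              # file UUID (invented)
            "intermediateId": "576d3a8ee4b0ef1f4d5c13e7",  # deprecated, same as id
            "datasetId": "576d3a7ce4b0ef1f4d5c13aa",       # dataset the file was added to
            "filename": "sample_image.tif",
            "secretKey": "<Clowder API key>",
            "host": "http://localhost:9000/",
        }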
  • DATASET - Later, support for dataset extractors was added. These trigger when a file is added to or removed from a dataset.
    • *.dataset.file.added
    • *.dataset.file.removed

      field           description
      id              file UUID
      intermediateId  file UUID (deprecated)
      datasetId       id of dataset the file was added to
      secretKey       Clowder API key
      host            Clowder host URL
    • Because the message contents are otherwise identical, PyClowder currently uses the presence of the 'filename' field in the message to decide whether to handle it as a file or a dataset extraction.
    • Max just updated the PyClowder2 pull request to include routing_key in the parameters passed to extractors, so we can check that instead of the 'filename' field (see the sketch below).
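
      A rough sketch of what routing-key-based dispatch could look like in the message callback (pika-style signature; the handler names are illustrative, not the actual PyClowder2 API):

        import json

        def process_file(message):
            print("file event:", message["id"])                   # illustrative stub

        def process_dataset_event(message, routing_key):
            print("dataset event:", routing_key, message["id"])   # illustrative stub

        def on_message(channel, method, properties, body):
            """Dispatch on the routing key instead of sniffing for 'filename'."""
            message = json.loads(body)
            routing_key = method.routing_key  # e.g. "clowder.dataset.file.added"
            if ".dataset." in routing_key:
                process_dataset_event(message, routing_key)
            elif ".file." in routing_key:
                process_file(message)
            channel.basic_ack(delivery_tag=method.delivery_tag)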
  • METADATA - Later, support for metadata-triggered extractors was added.
    • *.metadata.added
    • *.metadata.removed

      field      description
      id         file or dataset UUID
      metadata   the metadata that was added/removed
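
      As above, a sketch of what a *.metadata.added body might look like as a Python dict (values invented):

        metadata_message = {
            "id": "576d3a7ce4b0ef1f4d5c13aa",  # file or dataset UUID (invented)
            "metadata": {                      # the metadata that was added/removed
                "instrument": "scanner-3d",
                "ph": 6.8,
            },
        }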
  • COLLECTION - Eventually, we may want to trigger extractors that process arbitrary collections of datasets.
    • Not sure yet how we'll do this.

Possible Improvement Ideas

  • Reduce distinction between 'file' and 'dataset' extractors
    • Clowder has changed so that we no longer present files as separate from datasets - that is, files don't exist outside of datasets. So the distinction between the two kinds of extractors may be unnecessary now - in both cases, extraction begins when a file is uploaded to a dataset.

    • Once that happens, extractor might want to do several things:
      1. Use the file to generate metadata and attach to the file
      2. Use the file to generate metadata and attach to the dataset
      3. Use the file to convert to a different format and upload to the dataset
      4. Use many files from dataset to generate output files/metadata and add to dataset or files
      ...we can do #1-3 with currently existing file extractors. #4 just requires a way to get the list of other files in the dataset (as Rob suggested, ordered by date added) - a sketch of that follows below.
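
      A sketch of how #4's file listing might work against the Clowder REST API (the exact route and the 'date-created' field name are assumptions):

        import requests

        def get_dataset_files(host, secret_key, dataset_id):
            """Fetch the files in a dataset, ordered by date added (oldest first)."""
            r = requests.get("%sapi/datasets/%s/files" % (host, dataset_id),
                             params={"key": secret_key})
            r.raise_for_status()
            # Assumes each entry carries a 'date-created' timestamp.
            return sorted(r.json(), key=lambda f: f.get("date-created", ""))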

    • With that in mind, I think we could get rid of these messages:
      • *.dataset.file.added
      • *.dataset.file.removed

      ...and instead just use *.file.#. Then each extractor can have a flag that says whether to fetch the list of all files in the dataset or just the file that triggered the extraction (sketched below).
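
      That flag might be nothing more than a boolean the extractor declares, which PyClowder checks before building the resource it hands to process_message() (the flag and field names here are made up):

        class MyExtractor:
            # Made-up flag: if True, PyClowder would fetch the whole dataset file
            # list before invoking the extractor; if False, just the trigger file.
            fetch_all_files = True

            def process_message(self, connector, host, secret_key, resource, parameters):
                if self.fetch_all_files:
                    files = resource.get("dataset_files", [])  # hypothetical field
                else:
                    files = [resource["id"]]
                print("processing %d file(s)" % len(files))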

    • One problem: how do we handle:
      • Dataset-level extractor events (STARTED, PROCESSING, DONE) if these are handled as file events
        • Currently TERRA extractors write 'COMPLETED' as extractor metadata to the dataset and check for it in later extractions (see the sketch after this list)
      • Rerunning extractors on a dataset
        • Do we just send the last added file as the 'file' event and trigger that way?
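
      For reference, a sketch of the TERRA-style 'COMPLETED' marker, using Clowder's dataset metadata endpoints (the JSON-LD payload shape here is simplified):

        import requests

        def mark_completed(host, secret_key, dataset_id, extractor_name):
            """Write a 'COMPLETED' status record as dataset metadata."""
            requests.post(
                "%sapi/datasets/%s/metadata.jsonld" % (host, dataset_id),
                params={"key": secret_key},
                json={"agent": {"@type": "extractor", "name": extractor_name},
                      "content": {"status": "COMPLETED"}}).raise_for_status()

        def already_completed(host, secret_key, dataset_id, extractor_name):
            """Check whether a prior run left the 'COMPLETED' marker."""
            r = requests.get("%sapi/datasets/%s/metadata.jsonld" % (host, dataset_id),
                             params={"key": secret_key})
            r.raise_for_status()
            return any(m.get("agent", {}).get("name") == extractor_name
                       and m.get("content", {}).get("status") == "COMPLETED"
                       for m in r.json())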

  • Remove intermediateId if no one is using it

  • Revisit what we include in RabbitMQ messages
    • drop intermediateId
    • other info that could help us make this more efficient?
      • Data that PyClowder fetches on processing to give to extractors
        • List of files in a dataset w/ creation date, file paths
        • List of metadata attached to file/dataset
        (these are likely to make the messages too big, unless we cap them and only include the data if it's under a certain length, letting PyClowder fetch it otherwise - see the sketch below)
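
      A sketch of how that cap might work when building the message (the threshold and the 'datasetFiles' field name are arbitrary):

        import json

        MAX_EMBED_BYTES = 60 * 1024  # arbitrary cap

        def build_message(base_fields, dataset_files):
            """Embed the dataset file list only if it stays under the cap;
            otherwise leave it out and let PyClowder fetch it via the API."""
            message = dict(base_fields)
            if len(json.dumps(dataset_files).encode("utf-8")) <= MAX_EMBED_BYTES:
                message["datasetFiles"] = dataset_files  # hypothetical field
            return message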
