Idea 1: Emphasize workflow

  • Replace the IOGraph that Polyglot builds from the Software Servers with a full DataWolf workflow.
  • Need a mechanism to trigger only a subset of that graph (i.e. the part selected by a given input format and desired output format)
  • Track intermediate step outputs in workflow for error tracking (also quality measures)
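
Selecting the subset of the graph for a given input and output format can be sketched as a shortest-path search. The following is a minimal illustration only, not how Polyglot or DataWolf implement it; the edge list, the application names (`AppA`, `AppB`) and the function name `conversion_path` are all hypothetical.

```python
from collections import deque

# Hypothetical I/O graph: each edge is (input format, output format, application).
EDGES = [
    ("doc", "pdf", "AppA"),
    ("pdf", "png", "AppB"),
    ("png", "jpg", "AppB"),
    ("doc", "txt", "AppA"),
]

def conversion_path(edges, source, target):
    """Return the shortest list of edges leading from source to target, or None."""
    # Index edges by their input format for quick lookup.
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)

    # Breadth-first search, so the first path found has the fewest steps.
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for edge in outgoing.get(fmt, []):
            if edge[1] not in visited:
                visited.add(edge[1])
                queue.append((edge[1], path + [edge]))
    return None
```

Each edge in the returned path corresponds to one intermediate step whose output could be recorded for error tracking.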

Idea 2: Emphasize tools

  • Software Servers as extractors
  • Needs a means of registering Software Servers with Medici in order to build the IOGraph and issue the sequence of extractions
  • Extractors will need to be able to spawn other extractors (the next steps in multi-step conversion paths)
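
One way to picture extractors spawning other extractors is a message that carries the remaining steps of the conversion path: each handler performs the first step and enqueues the rest. This is a rough sketch under assumed names; the message layout, `handle`, `run` and the in-memory queue are hypothetical stand-ins and do not reflect the Medici extractor API.

```python
from collections import deque

def handle(message, queue):
    """Run the first remaining step, then enqueue the rest as a new message."""
    step, *rest = message["steps"]
    # Stand-in for the real conversion: derive a hypothetical output URL.
    output_url = f"{message['file_url']}.{step['output']}"
    if rest:
        # "Spawn" the next extractor by queueing the remaining steps.
        queue.append({"file_url": output_url, "steps": rest})
    return output_url

def run(file_url, steps):
    """Drive the chain with an in-memory queue standing in for a message bus."""
    queue = deque([{"file_url": file_url, "steps": steps}])
    result = None
    while queue:
        result = handle(queue.popleft(), queue)
    return result
```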

Lessons from Medici

  • Make the Steward stateless so that multiple instances can run in parallel and a request (or task) is not lost if one instance fails.  As with Medici, a service such as NGINX could then be used to distribute requests across the concurrently running Stewards.  State, specifically the current I/O graph, can be stored in a distributed MongoDB database.
  • Use a discovery service such as etcd for Software Servers rather than the current ad hoc notification mechanism implemented through Java sockets over TCP and/or UDP.
  • Leverage a distributed bus like RabbitMQ to handle the delegation of jobs or sub-jobs to currently active Software Servers.
    • Leverage Medici VM elasticity work
  • Enforce a stateless REST interface that immediately returns IDs, which can then be polled on a different endpoint until the task is completed and the result file is available for download.
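
The submit-then-poll pattern from the last bullet might look roughly like the sketch below. It assumes all task state lives in a shared store passed to every call, so that any Steward instance could answer any request; a plain dictionary stands in for the distributed database here, and the function names and record fields are hypothetical.

```python
import uuid

def submit_task(store, file_url, target_format):
    """Record a new task in the shared store and return its ID immediately."""
    task_id = uuid.uuid4().hex
    store[task_id] = {
        "file_url": file_url,
        "target": target_format,
        "status": "pending",
        "result_url": None,
    }
    return task_id

def complete_task(store, task_id, result_url):
    """Mark a task as done once its result file is available."""
    store[task_id].update(status="done", result_url=result_url)

def poll_task(store, task_id):
    """Report the status of a task; unknown IDs are reported as such."""
    task = store.get(task_id)
    if task is None:
        return {"status": "unknown", "result_url": None}
    return {"status": task["status"], "result_url": task["result_url"]}
```

Because no function keeps state of its own, the handler that accepts a submission and the one that answers a later poll need not be the same process.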

Polyglot Steward Requirements

  • Identify and query available Software Servers
  • Construct and keep up to date the Input/Output graph
  • Accept tasks of the form source format to target format and carry them out
  • Identify non-busy Software Servers with needed software for each step and issue a task to them
    • Pass URLs to files to Software Servers and not the files themselves
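
The last two requirements can be illustrated with a small sketch: pick a Software Server that has the needed application and is not busy, then build a task that carries only the file's URL. The server records, the field names and the `.example` URLs are assumptions made for illustration and are not the actual Software Server interface.

```python
# Hypothetical registry of Software Servers known to the Steward.
SERVERS = [
    {"url": "http://ss1.example", "applications": {"AppA"}, "busy": True},
    {"url": "http://ss2.example", "applications": {"AppA", "AppB"}, "busy": False},
]

def pick_server(servers, application):
    """Return the first non-busy server offering the application, or None."""
    for server in servers:
        if application in server["applications"] and not server["busy"]:
            return server
    return None

def build_task(server, application, file_url, output_format):
    """Describe one conversion step; the file is referenced by URL, never embedded."""
    return {
        "server": server["url"],
        "application": application,
        "file_url": file_url,
        "output_format": output_format,
    }
```

For example, `build_task(pick_server(SERVERS, "AppA"), "AppA", "http://host.example/a.doc", "pdf")` yields a task addressed to the second server, since the first one is busy.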

Additional Thoughts

  • Utilize only the REST interface to the Software Servers
  • Pass URLs to data rather than the data itself, with intermediary files hosted on the Software Server that last operated on them