Idea 1: Emphasize workflow
- Replace the IOGraph that Polyglot builds from the Software Servers with a full DataWolf workflow.
- Need a mechanism to trigger a subset of that graph (i.e. the steps between a given input format and a desired output format)
- Track intermediate step outputs in workflow for error tracking (also quality measures)
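
Selecting the subset of the graph for a given input and output format can be sketched as a shortest-path search. This is a minimal illustration, not Polyglot's or DataWolf's actual implementation: the edge list, the application names, and the function `find_conversion_path` are all hypothetical.

```python
from collections import deque

# Hypothetical I/O graph: each edge is (input format, output format, application).
EDGES = [
    ("doc", "odt", "AppA"),
    ("odt", "pdf", "AppB"),
    ("doc", "txt", "AppC"),
    ("pdf", "png", "AppD"),
]


def find_conversion_path(edges, source, target):
    """Return the shortest list of (input, output, application) steps, or None."""
    adjacency = {}
    for src, dst, app in edges:
        adjacency.setdefault(src, []).append((dst, app))

    # Breadth-first search, remembering how each format was first reached.
    came_from = {source: None}
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        if fmt == target:
            break
        for nxt, app in adjacency.get(fmt, []):
            if nxt not in came_from:
                came_from[nxt] = (fmt, app)
                queue.append(nxt)

    if target not in came_from:
        return None  # no conversion path exists

    # Walk back from the target to recover the steps in order.
    steps = []
    fmt = target
    while came_from[fmt] is not None:
        prev, app = came_from[fmt]
        steps.append((prev, fmt, app))
        fmt = prev
    steps.reverse()
    return steps


path = find_conversion_path(EDGES, "doc", "png")
```

Each returned step is a natural place to record the intermediate output for error tracking or quality measures.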
Idea 2: Emphasize tools
- Software Servers as extractors
- Needs a means of registering Software Servers with Medici to build the IOGraph and issue the sequence of extractions
- Extractors will need to be able to spawn other extractors (the next steps in multi-step conversion paths)
Lessons from Medici
- Make the Steward stateless so that multiple instances can run in parallel and a request (or task) is not lost if one instance fails. As with Medici, a service such as NGINX could then delegate requests to these concurrently running Stewards. State, specifically the current I/O graph, can be stored in a distributed MongoDB database.
- Use a discovery service such as etcd for Software Servers rather than the current ad hoc notification mechanism implemented through Java sockets over TCP and/or UDP.
- Leverage a distributed bus like RabbitMQ to handle the delegation of jobs or sub-jobs to currently active Software Servers.
- Leverage the Medici VM elasticity work
- Enforce a stateless REST interface that immediately returns IDs, which can then be polled on a different endpoint until the task is completed and the result file is available for download.
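
The submit-then-poll pattern from the last point can be sketched as two handlers that keep no state of their own. This is only an illustration under assumptions: `TaskStore` is a dict-backed stand-in for shared storage (such as a MongoDB collection), and the handler names, status values, and endpoint paths mentioned in the comments are hypothetical.

```python
import uuid


class TaskStore:
    """Stand-in for shared storage; a real deployment would use a database."""

    def __init__(self):
        self._tasks = {}

    def create(self, source_url, target_format):
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = {
            "status": "queued",
            "source_url": source_url,
            "target_format": target_format,
            "result_url": None,
        }
        return task_id

    def get(self, task_id):
        return self._tasks.get(task_id)

    def complete(self, task_id, result_url):
        self._tasks[task_id].update(status="done", result_url=result_url)


def handle_submit(store, source_url, target_format):
    """Model of a hypothetical POST /tasks: return an ID immediately."""
    return {"id": store.create(source_url, target_format)}, 202


def handle_status(store, task_id):
    """Model of a hypothetical GET /tasks/<id>: report progress and result URL."""
    task = store.get(task_id)
    if task is None:
        return {"error": "unknown task"}, 404
    return {"status": task["status"], "result_url": task["result_url"]}, 200
```

Because the handlers read and write only the shared store, any Steward instance behind the load balancer could answer either request.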
Polyglot Steward Requirements
- Identify and query available Software Servers
- Construct and keep up to date the Input/Output graph
- Accept tasks of the form source format to target format and carry them out
- Identify non-busy Software Servers with needed software for each step and issue a task to them
- Pass URLs to files to Software Servers and not the files themselves
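
The last two requirements can be sketched as a small selection step followed by a task message that carries a URL rather than file contents. The `SoftwareServer` record, its fields, and the task dictionary layout are assumptions made for illustration, not the actual Steward data model.

```python
from dataclasses import dataclass


@dataclass
class SoftwareServer:
    # Hypothetical view of a Software Server as the Steward might track it.
    name: str
    applications: frozenset
    busy: bool = False


def pick_server(servers, application):
    """Return the first non-busy server offering the application, or None."""
    for server in servers:
        if application in server.applications and not server.busy:
            return server
    return None


def build_task(server, application, input_url, output_format):
    """Describe one conversion step; the file itself is never embedded."""
    return {
        "server": server.name,
        "application": application,
        "input_url": input_url,
        "output_format": output_format,
    }


servers = [
    SoftwareServer("ss1", frozenset({"AppA"}), busy=True),
    SoftwareServer("ss2", frozenset({"AppA", "AppB"})),
]
chosen = pick_server(servers, "AppA")
task = build_task(chosen, "AppA", "http://example.org/input.doc", "odt")
```

If `pick_server` returns None, the Steward would have to queue the step or retry later; that policy is left open here.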
Additional Thoughts
- Utilize only the REST interface to the Software Servers
- Pass URLs to data rather than the data itself, with intermediary files hosted on the Software Server that last operated on them
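
The URL hand-off between steps might look like the following sketch, in which each step's output is hosted on the server that produced it and becomes the next step's input. The URL layout (`/files/stepN.ext`) and the function name `chain_urls` are invented for illustration.

```python
def chain_urls(steps, source_url):
    """steps: list of (server host, output format), in conversion order.

    Returns one record per step with the URL it reads and the URL it serves.
    """
    handoffs = []
    current = source_url
    for index, (host, output_format) in enumerate(steps):
        # Hypothetical layout: the producing server hosts its own output.
        output_url = f"http://{host}/files/step{index}.{output_format}"
        handoffs.append(
            {"host": host, "input_url": current, "output_url": output_url}
        )
        current = output_url
    return handoffs


handoffs = chain_urls(
    [("ss1.example.org", "odt"), ("ss2.example.org", "pdf")],
    "http://example.org/input.doc",
)
```

Only the final `output_url` needs to be handed back to the client; the intermediate files stay on the Software Servers.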