This document contains discussion and plans about moving computation closer to the data.

Moving the computation, i.e. the data manipulation or analysis code, closer to the data is an increasingly common approach when dealing with large data sets. For example, if machine A hosts a data set and the analysis code runs on machine B, then as the size of the data grows it becomes increasingly impractical to move the data from A to B for the analysis to run.  The more frequently used alternative in these cases, especially now that portable containerized code has become practical with technologies such as Docker, is to move the containerized analysis code to the machine hosting the data and execute it there instead of moving the data (given that the containers are significantly smaller than the datasets, and assuming some computational resource is also available on or near the server hosting the data).  With the dockerization of extractors and converters as part of the previous milestone, we now address this optional means of carrying out Brown Dog transformations locally on larger datasets.

Scenario

A user wishes to convert, and then extract some statistics from, a locally hosted 8 GB weather dataset using the Brown Dog command line interface.  Below we outline the modifications needed in each of the components in order to allow these transformations to be carried out locally (i.e. moving the compute to the local data).

DTS (Clowder)

DAP (Polyglot)

BD CLI

Once completed, the bd command line interface might be used as follows to carry out the desired data transformations:

bd -o --bigdata pecan.zip region.narr.zip | bd -v --bigdata



Development Notes

DTS endpoint to determine needed containers:

  1. Queries the RabbitMQ management API to get all the available queues (/api/queues/vhost). If vhost is not set in the config file, the default one (%2F) is used.
  2. Then, for each of the queues, it queries the server again to retrieve the bindings (/api/queues/vhost/name/bindings), where vhost (default %2F) is obtained from the config file and name (i.e. the queue name) is obtained from the previous step.
  3. The bindings returned for a particular queue are searched for matching MIME types in the routing key. If a match is found, the corresponding queue name is appended to the result array.
  4. Finally, when all the queues have been traversed, the result array is returned to the user in JSON format (see the sketch below).
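
A minimal sketch of this logic using the RabbitMQ management HTTP API via curl and jq; the host, port, credentials, and the substring match on the routing key are assumptions made for illustration:

    #!/bin/bash
    # Given a MIME type, list the extractor queues whose bindings match it.
    MIME_TYPE="$1"                              # e.g. "text/csv"
    VHOST="${VHOST:-%2F}"                       # default vhost, URL-encoded
    API="http://${RABBITMQ_HOST:-localhost}:15672/api"
    AUTH="${RABBITMQ_USER:-guest}:${RABBITMQ_PASS:-guest}"

    RESULT=()
    # 1. Get all the available queues on the vhost.
    for QUEUE in $(curl -s -u "$AUTH" "$API/queues/$VHOST" | jq -r '.[].name'); do
      # 2. Get this queue's bindings; 3. look for the MIME type in the routing keys.
      # NOTE: routing keys may encode the MIME type differently (e.g. '.' for '/');
      # the exact match rule here is an assumption.
      if curl -s -u "$AUTH" "$API/queues/$VHOST/$QUEUE/bindings" \
         | jq -e --arg mt "$MIME_TYPE" \
             'map(select(.routing_key | contains($mt))) | length > 0' >/dev/null; then
        RESULT+=("$QUEUE")
      fi
    done

    # 4. Return the matching queue names as a JSON array.
    printf '%s\n' "${RESULT[@]}" | jq -R . | jq -s .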

DAP:

  1. Query the DAP for a conversion path, http://bd-api-dev.ncsa.illinois.edu/dap/path/<output>/<input>, and get the path back as JSON; e.g. nc to pdf would return:
    1. [{"input":"nc","application":"ncdump","output":"txt"},{"input":"txt","application":"unoconv","output":"pdf"}]
  2. Pull the docker containers for the applications specified in the conversion path (see the end-to-end sketch after this list)
    1. Requires
  3. Modify https://opensource.ncsa.illinois.edu/bitbucket/projects/POL/repos/polyglot/browse/docker/polyglot/entrypoint.sh so that if the RabbitMQ URL is not set it instead runs the local version (see the sketch after this list):
    1. SoftwareServer.sh -nocache <Application> <operation> <output format> <input file>
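
A minimal sketch of the proposed entrypoint.sh change; the RABBITMQ_URI variable name and the use of "$@" to forward the local arguments are assumptions, and the actual script may differ:

    #!/bin/bash
    if [ -n "$RABBITMQ_URI" ]; then
      # Queue mode: connect to RabbitMQ and wait for conversion requests.
      exec SoftwareServer.sh
    else
      # Local mode: SoftwareServer.sh -nocache <Application> <operation> <output format> <input file>
      exec SoftwareServer.sh -nocache "$@"
    fi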
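
Putting the steps together, a rough sketch of the local conversion flow; the ncsapolyglot/converters-<app> image naming, the "convert" operation name, SoftwareServer.sh being on the image's PATH, and the /data mount point are all assumptions for illustration:

    #!/bin/bash
    INPUT_FILE="$1"                    # e.g. region.narr.nc
    OUTPUT_FMT="$2"                    # e.g. pdf
    INPUT_FMT="${INPUT_FILE##*.}"

    # 1. Ask the DAP for a conversion path (a JSON array of steps).
    PATH_JSON=$(curl -s "http://bd-api-dev.ncsa.illinois.edu/dap/path/$OUTPUT_FMT/$INPUT_FMT")

    CURRENT="$INPUT_FILE"
    echo "$PATH_JSON" | jq -c '.[]' | while read -r STEP; do
      APP=$(echo "$STEP" | jq -r '.application')
      OUT=$(echo "$STEP" | jq -r '.output')

      # 2. Pull the container for this application.
      docker pull "ncsapolyglot/converters-$APP"

      # 3. Run the conversion locally, bypassing the queue.
      docker run --rm -v "$PWD:/data" --entrypoint SoftwareServer.sh \
        "ncsapolyglot/converters-$APP" -nocache "$APP" convert "$OUT" "/data/$CURRENT"
      CURRENT="${CURRENT%.*}.$OUT"
    done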