This document contains discussions / plans about moving computation towards data.

...

Moving the computation, i.e. the data manipulation or analysis code, closer to the data is a well-known paradigm in the realm of Big Data and is becoming a much more frequently utilized approach when dealing with large data sets. For example, if site A hosts a data set and the analysis code for that data runs on machine B, then as the size of the data grows it becomes increasingly impractical to move the data from A to B just to process it. The more frequently used alternative in these cases, especially as portable containerized code has become practical with technologies such as Docker, is to move the containerized analysis code over to the machine hosting the data and execute it there instead of moving the data, given that the containers are significantly smaller than the datasets and assuming some computational resource is also available on or near the server hosting the data.
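
As a minimal sketch of this pattern (not Brown Dog's actual mechanism), the following Python snippet uses the Docker SDK to run a containerized analysis on the machine hosting the data, bind-mounting the data directory read-only instead of copying it anywhere; the image name, command, and paths are illustrative assumptions.

    import docker  # pip install docker

    client = docker.from_env()

    # Run a hypothetical containerized analysis image next to the data:
    # the host's data directory is mounted read-only into the container,
    # so only the (small) image moves over the network, never the dataset.
    logs = client.containers.run(
        "example.org/analysis:latest",             # assumed analysis image
        command=["analyze", "/data/dataset.csv"],  # assumed command and path
        volumes={"/srv/data": {"bind": "/data", "mode": "ro"}},
        remove=True,
    )
    print(logs.decode())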

Rough Outline of Steps

  1. From the site where the data resides (Site A), a Brown Dog client application (BD-client) will first open the local file that needs to be processed.
  2. BD-client will determine the local file's type.
  3. BD-client then calls an endpoint to find out which extractors are currently running and can process that file type.
  4. BD-client queries those extractors (detailed information is needed here) to find out what dependencies they have, installs them, and submits the file for extraction at Site A (see the sketch after this list).
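
A rough Python sketch of this client-side flow is below. The endpoint paths (/extractors/running, /extractions), the response fields, and the assumption that extractor dependencies are shipped as Docker images are all hypothetical placeholders, since the concrete Brown Dog API is not specified in this outline.

    import mimetypes
    import subprocess
    import requests

    BD_API = "https://bd-api.example.org"  # hypothetical BD endpoint URL
    LOCAL_FILE = "dataset.csv"             # file residing at Site A

    # Steps 1-2: open the local file and determine its type
    # (a MIME guess from the file name is used here).
    file_type, _ = mimetypes.guess_type(LOCAL_FILE)

    # Step 3: ask the endpoint which currently running extractors
    # can process this file type.
    resp = requests.get(f"{BD_API}/extractors/running",
                        params={"file_type": file_type})
    resp.raise_for_status()

    # Step 4: fetch each extractor's dependencies (assumed here to be
    # Docker images), install them locally, then submit the file for
    # extraction at Site A.
    for extractor in resp.json():
        for image in extractor.get("docker_images", []):
            subprocess.run(["docker", "pull", image], check=True)
        with open(LOCAL_FILE, "rb") as f:
            requests.post(f"{BD_API}/extractions",
                          params={"extractor": extractor["name"]},
                          files={"file": f})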

Endpoint Tasks

  1. The endpoint first queries the RabbitMQ server's management API to get all the available queues (/api/queues). This can also be scoped to a specific virtual host, as sketched below.
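
For example, such a query could look like the following Python sketch; the host, port, and credentials are placeholders. /api/queues and /api/queues/{vhost} are the standard RabbitMQ management endpoints, and the vhost name must be percent-encoded (the default vhost "/" becomes %2F).

    import requests

    RABBITMQ_MGMT = "http://rabbitmq.example.org:15672"  # assumed mgmt host/port
    AUTH = ("guest", "guest")                            # assumed credentials

    def list_queues(vhost=None):
        """Return all queues known to the broker, optionally for one vhost."""
        path = "/api/queues"
        if vhost is not None:
            path += "/" + requests.utils.quote(vhost, safe="")
        resp = requests.get(RABBITMQ_MGMT + path, auth=AUTH)
        resp.raise_for_status()
        return resp.json()

    # e.g. print queue names and consumer counts for the default vhost "/"
    for q in list_queues("/"):
        print(q["name"], q.get("consumers", 0))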

...