
This document contains discussion and plans about moving computation towards data.

Moving the computation, i.e. the data manipulation or analysis code, closer to the data is an increasingly common approach for dealing with large data sets. For example, if machine A hosts a data set and the analysis code for that data runs on machine B, then as the data grows it becomes increasingly impractical to move the data from A to B in order to run the analysis. The more frequently used alternative in these cases, especially now that portable containerized code has become practical with technologies such as Docker, is to move the containerized analysis code to the machine hosting the data and execute it there instead of moving the data (given that the containers are significantly smaller than the data sets, and assuming some computational resource is also available on or near the server hosting the data).

Rough Outline of Steps

  1. From the site where the data resides (Site A), a Brown Dog client application (BD-client) will first open the local file that needs to be processed
  2. BD-client will determine the local file's type
  3. BD-client then hits an endpoint to find out which extractors are running at that moment and can process that file type
  4. The client queries those extractors (detailed information is needed here) to find out what dependencies they have, installs them, and submits the file for extraction at Site A (see the sketch after this list)
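As a concrete illustration, the following is a minimal Python sketch of these client-side steps. The API base URL, endpoint paths, parameter names, and response fields below are assumptions for illustration only; the actual Brown Dog API may differ.

    import mimetypes
    import requests

    # Hypothetical Brown Dog API base URL: the real endpoint paths and
    # response fields are not specified in this document.
    BD_API = "https://bd.example.org/api"

    def find_and_submit(path):
        # Steps 1-2: open the local file and determine its type (by extension).
        mime_type, _ = mimetypes.guess_type(path)
        if mime_type is None:
            raise ValueError("could not determine file type for " + path)

        # Step 3: ask the (assumed) endpoint which currently running
        # extractors can process this file type.
        resp = requests.get(BD_API + "/extractors",
                            params={"file_type": mime_type})
        resp.raise_for_status()
        extractors = resp.json()

        # Step 4: after installing the chosen extractor's dependencies,
        # submit the file for extraction at Site A (assumed endpoint).
        with open(path, "rb") as f:
            submit = requests.post(BD_API + "/extractions",
                                   data={"extractor": extractors[0]["name"]},
                                   files={"file": f})
        submit.raise_for_status()
        return submit.json()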

Endpoint Tasks

  1. The endpoint first queries the RabbitMQ management API to get all of the available queues (/api/queues). This can also be scoped to a specific virtual host (/api/queues/{vhost}).
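For example, a minimal Python sketch of this query against the RabbitMQ management HTTP API (the host, port, and guest credentials below are the RabbitMQ defaults; adjust them for the actual deployment):

    import requests

    MGMT = "http://localhost:15672"

    def list_queues(vhost=None):
        # GET /api/queues returns all queues; GET /api/queues/{vhost} scopes
        # the result to one virtual host. The vhost name must be URL-encoded,
        # so the default vhost "/" becomes "%2F".
        url = MGMT + "/api/queues"
        if vhost is not None:
            url += "/" + requests.utils.quote(vhost, safe="")
        resp = requests.get(url, auth=("guest", "guest"))
        resp.raise_for_status()
        return [q["name"] for q in resp.json()]

    # Example: list_queues() for all vhosts, or list_queues("/") for the
    # default vhost.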
