Current Implementation

The current implementation is enabled by setting geostream.cache to a path on disk where the cached files are stored. Calls to api.Geostreams.binDatapoints(...) and api.Geostreams.searchDatapoints(...) use cacheFetch() to retrieve the response from the cache if available; otherwise they create the response, put it in the cache, and return it. The files on disk are named by hashing the request. Two files are stored under the hash: the response as a JSON text file (without whitespace), and a .json file that contains the actual request as JSON (for inspection of the raw files). For example:


Here are two example queries from a .json file:


Here is an example of the response:

{"depth":0.0,"label":"1991 winter","sources":[""],"year":1991,"date":"1991-02-01T12:00:00.000-06:00","depth_code":"NA","average":0.0477260702109212,"count":1},
{"depth":0.0,"label":"1991 spring","sources":[""],"year":1991,"date":"1991-05-01T12:00:00.000-05:00","depth_code":"NA","average":0.0477260702109212,"count":1},
{"depth":0.0,"label":"1991 summer","sources":[""],"year":1991,"date":"1991-08-01T12:00:00.000-05:00","depth_code":"NA","average":0.0477260702109212,"count":1},
{"depth":0.0,"label":"1991 fall","sources":[""],"year":1991,"date":"1991-11-01T12:00:00.000-06:00","depth_code":"NA","average":0.0477260702109212,"count":1},
{"depth":0.0,"label":"1993 winter","sources":[""],"year":1993,"date":"1993-02-01T12:00:00.000-06:00","depth_code":"NA","average":0.139673763930734,"count":1},
{"depth":0.0,"label":"1993 spring","sources":[""],"year":1993,"date":"1993-05-01T12:00:00.000-05:00","depth_code":"NA","average":0.139673763930734,"count":1},
{"depth":0.0,"label":"1993 summer","sources":[""],"year":1993,"date":"1993-08-01T12:00:00.000-05:00","depth_code":"NA","average":0.139673763930734,"count":1}
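The flow described above can be sketched in Python. This is a hypothetical illustration, not the actual cacheFetch() code: the function name cache_fetch, the fetch callback, and the temporary cache directory are all assumptions made for the sketch; only the two-file layout (hash-named response without whitespace, plus a .json copy of the request) comes from the description above.

```python
import hashlib
import json
import os
import tempfile

# Stand-in for the geostream.cache setting; a throwaway directory for the sketch.
CACHE_DIR = tempfile.mkdtemp()

def cache_fetch(request, fetch):
    """Return the cached response for `request`, computing and storing it on a miss."""
    # Hash a canonical (sorted-key, no-whitespace) form of the request.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    response_path = os.path.join(CACHE_DIR, digest)
    request_path = response_path + ".json"

    if os.path.exists(response_path):
        with open(response_path) as f:
            return json.load(f)

    response = fetch(request)
    # Response is stored without whitespace; the request is stored alongside
    # it, pretty-printed, for inspection of the raw files.
    with open(response_path, "w") as f:
        json.dump(response, f, separators=(",", ":"))
    with open(request_path, "w") as f:
        json.dump(request, f, indent=2)
    return response
```

On a second identical request the stored response is returned without calling the backend again.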


Current potential issues:

  1. Lots of files are created on disk
  2. Files on disk don't seem to include since and until, resulting in potentially more data being sent to the client than was requested
  3. Admin has to prime the cache (sometimes we forget)
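Issue 2 could be mitigated by clipping the cached response to the requested window before returning it. A hypothetical sketch (the function name clip_datapoints and ISO-8601 string parameters are assumptions; the date field matches the response example above):

```python
from datetime import datetime

def clip_datapoints(datapoints, since=None, until=None):
    """Filter cached datapoints to the requested [since, until] window
    (ISO-8601 strings with offsets) before sending them to the client."""
    def in_range(dp):
        d = datetime.fromisoformat(dp["date"])
        if since is not None and d < datetime.fromisoformat(since):
            return False
        if until is not None and d > datetime.fromisoformat(until):
            return False
        return True
    return [dp for dp in datapoints if in_range(dp)]
```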

Proposed Implementation

  1. Move the cache to PostgreSQL.
  2. Each aggregate datapoint could be a row in a table so that queries could be more specific to a certain range / sensor / stream.
  3. Keep aggregations (yearly, semi, seasonal, monthly, daily, hourly) in separate tables. 
    1. For example bins_year would include all yearly averages. 
    2. Columns could be (id:int, sensor:int, stream:int, year:int, data:json, averages:json).
      1. The data column could store the running total and count for each variable, so running averages can be updated incrementally.
      2. The averages column could store the current average. This way returning the actual values will not require any further computation.
  4. Each newly added datapoint triggers an update on the aggregation tables. This will only update one row per table.
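The proposed bins_year table and the one-row-per-table update could look roughly like the sketch below. It uses Python with sqlite3 purely as a stand-in for PostgreSQL; the table name and columns come from point 3 above, while the add_datapoint function, the UNIQUE constraint, and the upsert are assumptions about how the trigger logic might work:

```python
import json
import sqlite3  # stand-in for PostgreSQL in this sketch

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bins_year (
        id       INTEGER PRIMARY KEY,
        sensor   INTEGER NOT NULL,
        stream   INTEGER NOT NULL,
        year     INTEGER NOT NULL,
        data     TEXT NOT NULL,   -- running {"total": ..., "count": ...} per variable
        averages TEXT NOT NULL,   -- precomputed {"variable": average}
        UNIQUE (sensor, stream, year)
    )
""")

def add_datapoint(sensor, stream, year, values):
    """Fold one new datapoint into bins_year, touching exactly one row."""
    row = conn.execute(
        "SELECT data FROM bins_year WHERE sensor=? AND stream=? AND year=?",
        (sensor, stream, year)).fetchone()
    data = json.loads(row[0]) if row else {}
    for var, v in values.items():
        acc = data.setdefault(var, {"total": 0.0, "count": 0})
        acc["total"] += v
        acc["count"] += 1
    # Precompute averages so reads need no further computation.
    averages = {var: acc["total"] / acc["count"] for var, acc in data.items()}
    conn.execute("""
        INSERT INTO bins_year (sensor, stream, year, data, averages)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (sensor, stream, year) DO UPDATE
        SET data = excluded.data, averages = excluded.averages
    """, (sensor, stream, year, json.dumps(data), json.dumps(averages)))
```

The same upsert shape would apply to the other aggregation tables (bins_semi, bins_season, and so on), each keyed by its own time bucket.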


