
Method 1: Group by extracted time component

This SQL statement computes the average of bins that are created by grouping on an extracted time component (e.g. the year).
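The statement itself is not reproduced here; a minimal sketch of the pattern being described, assuming the datapoints table and the example stream_id 1584 used later on this page, would be:

select
    extract(year from start_time) as year,
    avg(cast(data ->> 'temperature' as DOUBLE PRECISION))
from datapoints
where stream_id = 1584
group by 1
order by 1;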

Explanation of the SQL:

"where" statement contains the filtering, it should include the stream_id (sensor_id), could include start time, end time, source .....

"group by" is to do the grouping, "by 1" meaning group by the first selection, i.e. extract(year from start_time) in this example. it can also by "by 1,2"....

"cast( data ->> 'temperature' as DOUBLE PRECISION)" is to find the 'temperature' in data and convert it to double format. Using the following SQL can make it more responsive ( just running once):



create or replace function cast_to_double(text) returns DOUBLE PRECISION as $$
begin
    -- Note the double casting to avoid infinite recursion.
    return cast($1::varchar as DOUBLE PRECISION);
exception
    when invalid_text_representation then
        return 0.0;
end;
$$ language plpgsql immutable;
create cast (text as DOUBLE PRECISION) with function cast_to_double(text);
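With the function and cast in place, the explicit cast used in the queries routes through cast_to_double, and the helper can also be called directly. A small usage sketch (stream_id 1584 is just the example value used elsewhere on this page):

-- the explicit cast now goes through cast_to_double()
select avg(cast(data ->> 'temperature' as DOUBLE PRECISION))
from datapoints
where stream_id = 1584;

-- or call the helper directly
select avg(cast_to_double(data ->> 'temperature'))
from datapoints
where stream_id = 1584;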

"avg" is to get the average, it usually used alone with "group by", you can have sum, count.....

Overview

By returning only the averages of the datapoints, we shorten the time needed for streaming.

Then we just need to convert each SQL result row to JSON.
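One option is to do that conversion in Postgres itself with row_to_json/json_agg; a sketch using the Method 1 query as the inner query:

select json_agg(row_to_json(t) order by t.year)
from (
    select extract(year from start_time) as year,
           avg(cast(data ->> 'temperature' as DOUBLE PRECISION)) as average
    from datapoints
    where stream_id = 1584
    group by 1
) t;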

Limitation

  • Monthly or daily binning will not work. For example, grouping by the extracted month would group all Decembers together regardless of year.
  • Customized binning, such as water years, cannot be used.

Method 2: Group by pre-generated bins with join

This SQL statement pre-generates bins using "generate_series()" and the tstzrange type; it then joins them with the datapoints table and groups by bin.

with bin as (
	select tstzrange(s, s+'1 year'::interval) as r 
	from generate_series('2002-01-01 00:00:00-05'::timestamp, '2017-12-31 23:59:59-05'::timestamp, '1 year') as s
)
select 
	bin.r, 
	avg(cast( data->> 'pH' as DOUBLE PRECISION))
from datapoints
right join bin on datapoints.start_time <@ bin.r
where datapoints.stream_id = 1584
group by 1 
order by 1;

Method 3: Group by pre-generated bins with filter for aggregate function

This SQL statement pre-generates bins using "generate_series()" and the tstzrange type; it then uses a filter clause on the avg function instead of a join.


with bin as (
	select tstzrange(s, s+'1 year'::interval) as r 
	from generate_series('2002-01-01 00:00:00-05'::timestamp, '2017-12-31 23:59:59-05'::timestamp, '1 year') as s
)
select 
	bin.r, 
	avg(cast( data->> 'pH' as DOUBLE PRECISION)) filter(where datapoints.start_time <@ bin.r)
from datapoints, bin
where datapoints.stream_id = 1584
group by 1 
order by 1;


Method 4: Using aggregate function with "over" and "window" (not working yet)

Jong Lee looked into this option but couldn't find a way to do it; he may not fully understand the functionality.

Method 5: Using procedure function (PL/pgsql or PL/python)

TODO... 
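One possible shape for the PL/pgSQL variant, as a sketch only (bin_yearly and its signature are hypothetical, not an agreed design):

create or replace function bin_yearly(sid integer, field_name text)
returns table(yr integer, avg_value DOUBLE PRECISION) as $$
begin
    -- yearly binning of one field for one stream, same logic as Method 1
    return query
        select extract(year from d.start_time)::int,
               avg(cast(d.data ->> field_name as DOUBLE PRECISION))
        from datapoints d
        where d.stream_id = sid
        group by 1
        order by 1;
end;
$$ language plpgsql stable;

-- select * from bin_yearly(1584, 'pH');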

Performance

Tested with the GLTG production database, using stream_id 1584 (987,384 datapoints) and "explain analyze".


                  Method 1        Method 2        Method 3
Planning time     0.269 ms        0.259 ms        0.145 ms
Execution time    3069.059 ms     6029.105 ms     12008.801 ms


Caching design

If we store the count and sum of the values in addition to average, it becomes easy to update the bin with a new datapoint.

bins_year

sensor_id   year   field         count   sum      average
12345       2003   temperature   120     8400.0   60.0
12345       2003   pH            120     240.0    2.0

bins_month

sensor_id   month   year   field         count   sum     average
12345       6       2003   temperature   10      600.0   60.0
12345       6       2003   pH            10      20.0    2.0

bins_day


sensor_id   day   month   year   field         count   sum    average
12345       13    6       2003   temperature   1       60.0   60.0
12345       13    6       2003   pH            1       2.0    2.0

bins_hour


sensor_id   hour   day   month   year   field         count   sum    average
12345       19     13    6       2003   temperature   1       60.0   60.0
12345       19     13    6       2003   pH            1       2.0    2.0

Other possible tables:

bins_season - do we need to cache this, or calculate from monthly bins?

bins_total - do we need to cache this, or is it fast enough to calculate from yearly bins?

I don't think we want a cache table for water_year, for example, because that is specific to GLM/GLTG and not generic for Clowder. We could use the monthly caches to quickly calculate that on the fly.

When do we update cache tables?

  • cron job (hourly? every 5 minutes?)
  • whenever a new datapoint is added (at most one bin per table would need to be created or updated) - an upsert, as sketched below
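A minimal sketch of the per-datapoint upsert for bins_year (assuming a unique constraint on (sensor_id, year, field); the inserted values are just an example new reading of 58.0):

insert into bins_year (sensor_id, year, field, count, sum, average)
values (12345, 2003, 'temperature', 1, 58.0, 58.0)
on conflict (sensor_id, year, field) do update
set count   = bins_year.count + 1,
    sum     = bins_year.sum + excluded.sum,
    average = (bins_year.sum + excluded.sum) / (bins_year.count + 1);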