...

"cast( data ->> 'temperature' as DOUBLE PRECISION)" is to find the 'temperature' in data and convert it to double format. Using the following SQL can make it more responsive ( just running once):

Code Block

language	sql

create or replace function cast_to_double(text) returns DOUBLE PRECISION as

...

Code Block

language	sql

 $$


begin
    -- Note the double casting to avoid infinite recursion.
    return cast($1::varchar as DOUBLE PRECISION);
exception
    when invalid_text_representation then
        return 0.0;
end;
$$ language plpgsql immutable;
create cast (text as DOUBLE PRECISION) with function cast_to_double(text);

...

This SQL statement pre-generate bins using "generate_series()" and tstzrange type; then it uses filter with avg function instead of join

Code Block

language	sql

with bin as (
	select tstzrange(s, s+'1 year'::interval) as r 
	from generate_series('2002-01-01 00:00:00-05'::timestamp, '2017-12-31 23:59:59-05'::timestamp, '1 year') as s
)
select 
	bin.r, 
	avg(cast( data->> 'pH' as DOUBLE PRECISION)) filter(where datapoints.start_time <@ bin.r)
from datapoints, bin
where datapoints.stream_id = 1584
group by 1 
order by 1;

Method 4: Using aggregate function with "over" and "window" (not working yet)

Jong Lee Looked into this option; but couldn't find a way to do it. Jong Lee may not understand the functionality.

Method 5: Using procedure function (PL/pgsql or PL/python)

TODO...

Performance

Tested with GLGT production database. Used the stream_id 1584 which has 987,384 datapoints. Used "explain analyze"

	Method 1	Method 2	Method 3
Planning time	0.269 ms	0.259 ms	0.145 ms
Execution time	3069.059 ms	6029.105 ms	12008.801 ms

Caching design

If we store the count and sum of the values in addition to average, it becomes easy to update the bin with a new datapoint.

bins_year

sensor_id	year	field	count	sum	average	start_time	end_time	updated
12345	2003	temperature	120	8400.0	60.0
12345	2003	pH	120	240.0	2.0

are start/end times for bins actually useful for anything? there could be holes in between endpoints
store completeness by sensor /stream?

bins_month

sensor_id	month	year	field	count	sum	average	start_time	end_time	updated
12345	6	2003	temperature	10	600.0	60.0
12345	6	2003	pH	10	20.0	2.0

bins_day

sensor_id	day	month	year	field	count	sum	average	start_time	end_time	updated
12345	13	6	2003	temperature	1	60.0	60.0
12345	13	6	2003	pH	1	2.0	2.0

bins_hour

sensor_id	hour	day	month	year	field	count	sum	average	start_time	end_time	updated
12345	19	13	6	2003	temperature	1	60.0	60.0
12345	19	13	6	2003	pH	1	2.0	2.0

bins_special

sensor_id	label	field	count	sum	average	start_time	end_time
12345	spring	temperature	3	180.0	60.0	01/01/2003	03/31/2003
12345	spring	pH	3	6.0	2.0	01/01/2003	03/31/2003
12345	spring	pH	1	2.0	2.0	01/01/2003	01/31/2003

for bins_special, do we actually need count/sum/average here, or does it simply need a start/end time and an aggregation level (year/month/etc) that defines the custom aggregation unit and use the bins_year, bins_month to populate?

for spring, we get monthly averages only within start/end time
for spring, we get yearly average only including months within start/end

bins_special (alt option)

label	start_time	end_time	updated
spring	Jan 1	Mar 31
spring	Jan 1	Mar 31

if i want special bin by year, only consider points between start and end time.
- if start/end time includes entire year, use bins_year
- if < 1 year time span, aggregate month + day bins until you cover entire time span
if i want by months, include each month between start/end time
- for complete months, use bins_month
- for partial months, aggregate day bins until you cover entire time span

Other possible tables:

bins_season - do we need to cache this, or calculate from monthly bins? latter option suggested above.

bins_total - do we need to cache this, or is it fast enough to calculate from yearly bins?

I don't think we want cache table for water_year, for example, because that is specific to GLM/GLTG and not generic for clowder. We could use the month caches to quickly calculate that on the fly.

When do we update cache tables?

cron job (hourly? 5 minutes?)
whenever new datapoint is added (at most 1 bin per table would need to be created or updated) - upsert

Trends region

Image Added

Page tree

Versions Compared

Old Version 3

New Version Current

Key

Method 4: Using aggregate function with "over" and "window" (not working yet)

Method 5: Using procedure function (PL/pgsql or PL/python)

Performance

Caching design

Trends region

Page tree

Page History

Versions Compared

Old Version 3

New Version Current

Key

Method 4: Using aggregate function with "over" and "window" (not working yet)

Method 5: Using procedure function (PL/pgsql or PL/python)

Performance

Caching design

Trends region