...
- Get data from source (unique)
- Parsing begins by getting the data from the source. Two types of data are needed:
- data that describes the site such as geocodes, name, and source.
- measurements (the data)
- The format and retrieval method vary from source to source
- Some source formats
- API for a single station (USGS, NOAA)
- API for mixed stations (Water Quality Portal)
- Files stored on a server with LoggerNet (GREON)
- CSV download (LRTM)
- Parse data to sensor (unique and general)
- Until now, this has been a unique process for each source; however, it should be split into two portions
- reformat data into a standard that can be input into general parser (unique)
- parse data to sensor json (general)
- post to geostreaming api
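The unique/general split for sensors could look roughly like the sketch below. Everything named here is an assumption made for illustration: the `StandardSite` record, the raw field names in `example_source_to_standard`, and the payload layout returned by `site_to_sensor_json` are hypothetical and are not taken from the geostreaming API schema.

```python
from dataclasses import dataclass


@dataclass
class StandardSite:
    """Hypothetical standard format that every source-specific step produces."""
    name: str
    source: str
    latitude: float
    longitude: float


def example_source_to_standard(raw: dict) -> StandardSite:
    """Unique step: adapt one source's raw site record (field names assumed)."""
    return StandardSite(
        name=raw["site_name"],
        source=raw["agency"],
        latitude=float(raw["lat"]),
        longitude=float(raw["lon"]),
    )


def site_to_sensor_json(site: StandardSite) -> dict:
    """General step: build a sensor payload (layout assumed, not the real schema)."""
    return {
        "name": site.name,
        # GeoJSON-style point: longitude first, then latitude.
        "geometry": {
            "type": "Point",
            "coordinates": [site.longitude, site.latitude],
        },
        "properties": {"source": site.source},
    }
```

With this shape, a new source only needs its own adapter function; the JSON-building step is shared.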
- Parse data to stream(s) (unique and general)
- Similar to parse to sensor, with the main difference being that a source can have multiple streams for different reasons.
- For example:
- GREON uses 2 streams - one for water quality data and one for environmental data
- USGS uses 5 streams - water quality measurements, gap filled nitrate, gap filled discharge, load, and cumulative load
- Two different conventions have been used to distinguish streams, and they need to be standardized
- GREON names the streams differently: GREON-07_MD or GREON-07_WQ
- USGS puts a data_type key in properties with possible values: source_data, fill_nitrate, fill_discharge, calc_load, and calc_cumul_load
- Which convention to adopt should probably be discussed and decided.
- Currently, each source has its own implementation for all of these steps; as with sensors, it should be broken into unique and general portions
- get stream data from source including determining number of streams needed and parsing data to standard format (unique)
- parse data to stream json (general)
- post to geostreaming api
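Until one convention is chosen, general code could read both through a single helper. The sketch below is only an illustration: `describe_stream` is a hypothetical function, and it assumes that a name-suffix stream uses an underscore separator (as in `GREON-07_WQ`) and that a `data_type` key, when present, takes precedence.

```python
from typing import Optional, Tuple


def describe_stream(name: str, properties: Optional[dict] = None) -> Tuple[str, Optional[str]]:
    """Return (base_name, stream_kind) under either existing convention."""
    properties = properties or {}
    # USGS-style convention: the kind is stored in properties["data_type"].
    if "data_type" in properties:
        return name, properties["data_type"]
    # GREON-style convention: the kind is the suffix after the last underscore.
    base, separator, suffix = name.rpartition("_")
    if separator:
        return base, suffix
    # Neither convention applies: a single, unlabeled stream.
    return name, None
```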
- Parse data to datapoints (unique and general)
- This is by far the most involved portion of parsing
- First parse versus continuous parsing
- Parsing data to a site continuously (that is, at a regular interval) has some challenges
- need to know the last datapoint parsed (general)
- more efficient to only fetch source data from that time forward (unique)
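A minimal sketch of the continuous-parsing step, assuming a hypothetical source-specific callable `fetch_since` that accepts the time of the last parsed datapoint (or `None` on a first parse) and returns rows with a timezone-aware `"time"` key. None of these names come from the existing parsers.

```python
from datetime import datetime
from typing import Callable, Iterable, List, Optional


def fetch_new_rows(
    last_parsed: Optional[datetime],
    fetch_since: Callable[[Optional[datetime]], Iterable[dict]],
) -> List[dict]:
    """Return only the rows that are newer than the last parsed datapoint."""
    # Unique part: the source decides how to limit the request by time.
    rows = fetch_since(last_parsed)
    if last_parsed is None:
        # First parse: nothing has been posted yet, so keep everything.
        return list(rows)
    # General part: drop rows at or before the last parsed time, in case
    # the source returns the boundary row again.
    return [row for row in rows if row["time"] > last_parsed]
```

Filtering again on the general side is a defensive choice here; it keeps the boundary datapoint from being posted twice even when a source cannot filter by time itself.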
- Considerations
- Time - keeping track of time zones and daylight saving time is challenging
- The geostreaming api stores time in UTC (Zulu)
- Sources store time in a variety of formats
- Suggested standard: convert all retrieved timestamps to UTC as early as possible in the scripts, and handle them in UTC from then on.
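One way to apply the convert-early rule is sketched below with the standard-library `zoneinfo` module (Python 3.9+). The helper names, the input format string, and the `America/Chicago` zone in the usage lines are assumptions for illustration; each source would supply its own format and zone.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_text: str, fmt: str, tz_name: str) -> datetime:
    """Parse a naive local timestamp and convert it to an aware UTC datetime."""
    naive = datetime.strptime(local_text, fmt)
    # Attach the source's zone; zoneinfo applies the daylight saving rules.
    aware = naive.replace(tzinfo=ZoneInfo(tz_name))
    return aware.astimezone(timezone.utc)


def to_api_time(moment: datetime) -> str:
    """Format an aware datetime as a UTC (Zulu) ISO 8601 string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Same wall-clock hour, different UTC offsets in summer and winter.
summer = to_api_time(to_utc("2019-07-01 12:00", "%Y-%m-%d %H:%M", "America/Chicago"))
winter = to_api_time(to_utc("2019-01-15 12:00", "%Y-%m-%d %H:%M", "America/Chicago"))
```

Local times that occur twice when clocks fall back are ambiguous; this sketch takes the first occurrence (the `fold=0` default), which may not be right for every source.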
- Memory
- The earliest versions of the parsers loaded all data for a single station into memory and parsed from memory
- After adding new USGS sites, it became apparent that this would not work on a server with ~2GB of memory (or even a larger one as datasets grow)
- Two approaches were considered: disk storage and parsing by time period
- It was decided to parse one year at a time
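Parsing one year at a time can be driven by a small generator such as the sketch below. `year_windows` is a hypothetical helper, and splitting on calendar-year boundaries is an assumption; the notes do not say whether the existing parsers use calendar years or rolling 12-month periods.

```python
from datetime import datetime
from typing import Iterator, Tuple


def year_windows(start: datetime, end: datetime) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [window_start, window_end) spans, at most one calendar year each."""
    cursor = start
    while cursor < end:
        # The next January 1st, in the same timezone as the inputs.
        next_year = datetime(cursor.year + 1, 1, 1, tzinfo=cursor.tzinfo)
        window_end = min(next_year, end)
        yield cursor, window_end
        cursor = window_end
```

Each window can then be fetched, parsed, and posted before the next one is loaded, so only about one year of data for a station is held in memory at a time.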
- Parsing
- fetch data from source (unique)
- parse to standard format (unique)
- map measurement names to standardized parameter ids
- parse to datapoint json (general)
- post to geostreaming api (general)
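The middle steps of this pipeline (map names, then build datapoint json) might be sketched as follows. The column names, parameter ids, and payload keys are all hypothetical placeholders chosen for the example; they are not the project's actual parameter ids or the geostreaming API's confirmed schema.

```python
from datetime import datetime, timezone
from typing import Dict

# Hypothetical per-source mapping from column names to standard parameter ids.
EXAMPLE_NAME_MAP: Dict[str, str] = {
    "Nitrate (mg/L)": "nitrate-mgl",
    "Discharge (cfs)": "discharge-cfs",
}


def map_names(row: dict, name_map: Dict[str, str]) -> dict:
    """Rename known columns to parameter ids and drop unmapped columns."""
    return {name_map[key]: value for key, value in row.items() if key in name_map}


def to_datapoint_json(time_utc: datetime, values: dict, stream_id: int) -> dict:
    """General step: build one datapoint payload (key names assumed)."""
    stamp = time_utc.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "start_time": stamp,
        "end_time": stamp,
        "stream_id": stream_id,
        "properties": values,
    }
```

Keeping the mapping as plain data means a new source contributes a dictionary instead of new parsing code.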
- List of methods that should be general (or have several general versions)
- create or get sensor/stream
- parse to sensor json
- parse to stream json
- parse to datapoint json
- post
- get
- get most recent datapoint
- iterate over a year of data
- convert time
- map names
- needs to be standardized across sources, with each source's column names mapped to standard parameter ids