Development of pyGeodashboard started on 2016-02-04.  It is a library that contains the basic functions needed for parsing sensors, streams, and datapoints to the geostreaming API.

  • Step 1: Outline Parser Functions
    • Create an outline that describes the process of parsing, with a focus on separating the reusable portions of a parser from those that are particular to a specific data source.

Outline Parser Functions

Functions will be described as either unique (code particular to the source) or general (code that should be able to run as part of every source).

  1. Get data from source (unique)
    1. Parsing begins by getting the data from the source. Two types of data are needed:
      1. data that describes the site, such as geocodes, name, and source
      2. measurements (the data)
    2. The format and retrieval method vary from source to source (see the fetch sketch after this list)
      1. Some source formats
        1. API for a single station (USGS, NOAA, EPA)
        2. API for mixed stations (Water Quality Portal)
        3. Files stored to the server with LoggerNet (GREON)
        4. CSV download (LRTM)
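
To make the unique fetch step concrete, below is a minimal sketch for the single-station API case, using the public USGS instantaneous-values service. The site and parameter codes shown are illustrative; every other source would need its own fetch function.

    import requests

    def get_usgs_data(site_code, parameter_codes, start, end):
        """Unique step: fetch raw measurements for one USGS station."""
        url = "https://waterservices.usgs.gov/nwis/iv/"
        params = {
            "format": "json",
            "sites": site_code,                        # e.g. "03339000"
            "parameterCd": ",".join(parameter_codes),  # e.g. ["00060"]
            "startDT": start,  # ISO 8601, e.g. "2016-01-01T00:00:00Z"
            "endDT": end,
        }
        response = requests.get(url, params=params)
        response.raise_for_status()
        return response.json()
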
  2. Parse data to sensor (unique and general)
    1. Up to now, this has been a unique process for each source; however, it should be broken into unique and general portions (sketched after this list)
      1. reformat data into a standard that can be input into general parser (unique)
      2. parse data to sensor json (general)
      3. post to geostreaming api (general)
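
A minimal sketch of this split, assuming a hypothetical sensor JSON shape and endpoint path; the actual field names and route should be taken from the geostreaming API.

    import requests

    def reformat_site(raw):
        """Unique step: reduce source-specific site metadata to a
        standard dict (the raw keys shown are USGS site-service names)."""
        return {"name": raw["station_nm"],
                "longitude": float(raw["dec_long_va"]),
                "latitude": float(raw["dec_lat_va"]),
                "source": "usgs"}

    def parse_sensor_json(site):
        """General step: build sensor JSON from the standard site dict
        (field names here are assumptions)."""
        return {"name": site["name"],
                "type": "Point",
                "geometry": {"type": "Point",
                             "coordinates": [site["longitude"],
                                             site["latitude"], 0]},
                "properties": {"source": site["source"]}}

    def post_sensor(api_url, sensor_json):
        """General step: post to the geostreaming API (path is assumed)."""
        response = requests.post(api_url + "/sensors", json=sensor_json)
        response.raise_for_status()
        return response.json()
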
  3. Parse data to stream(s) (unique and general)
    1. Similar to parse to sensor, with the main difference being that sources can have multiple streams for different reasons.
      1. For example:
        1. GREON uses 2 streams - one for water quality data and one for environmental data
        2. USGS uses 5 streams - water quality measurements, gap-filled nitrate, gap-filled discharge, load, and cumulative load
      2. Two different naming conventions have been used and need to be standardized
        1. GREON encodes the stream type in the stream name: GREON-07_MD or GREON-07_WQ
        2. USGS puts a data_type key in properties with possible values: source_data, fill_nitrate, fill_discharge, calc_load, and calc_cumul_load
          1. This should be discussed and a single convention decided on.
    2. Currently, each source has its own implementation; as with sensors, it should be broken into unique and general portions (sketched after this list)
      1. get stream data from the source, including determining the number of streams needed and parsing the data to a standard format (unique)
      2. parse data to stream json (general)
      3. post to geostreaming api (general)
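
A rough sketch of the stream split. Pending the standardization decision noted above, this sketch stores the type both in the stream name and in a data_type property; the stream JSON fields are assumptions.

    def parse_stream_json(sensor, data_type):
        """General step: build stream JSON tied to an existing sensor."""
        return {"name": "%s_%s" % (sensor["name"], data_type),
                "type": sensor["type"],
                "geometry": sensor["geometry"],
                "sensor_id": sensor["id"],
                "properties": {"data_type": data_type}}

    def streams_for_usgs_sensor(sensor):
        """Unique step: USGS needs five streams, one per data_type."""
        data_types = ["source_data", "fill_nitrate", "fill_discharge",
                      "calc_load", "calc_cumul_load"]
        return [parse_stream_json(sensor, dt) for dt in data_types]
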
  4. Parse data to datapoints (unique and general)
    1. Initial parse versus continuous parsing
      1. Parsing data to a site continuously (at a regular interval) has some challenges
        1. need to know the last datapoint parsed (general)
        2. more efficient to only fetch source data from that time forward (unique); see the resume sketch after this item
    2. Considerations
      1. Time - keeping track of time zones and daylight saving time is challenging
        1. The geostreaming api stores time in UTC (Zulu)
          1. sources store time in a variety of formats
        2. The suggested standard: all retrieved data should be converted to UTC as soon as possible within the scripts and handled in UTC from then on (see the conversion sketch after this item)
      2. Memory
        1. The earliest versions of the parsers loaded all data for a single station into memory and parsed from memory
          1. after adding new USGS sites, it became apparent that this would not work on a server with ~2 GB of memory (or even on larger servers as datasets grow)
          2. 2 approaches were considered: disk storage and parsing by time period
          3. it was decided to parse one year at a time (see the year-iteration sketch after this item)
    3. Parsing
      1. fetch data from source (unique)
      2. parse to standard format (unique)
        1. map measurement names to standardized parameter ids
      3. parse to datapoint json (general)
      4. post to geostreaming api (general)
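
One possible way to resume a continuous parse is to ask the geostreaming API for a stream's most recent datapoint and fetch source data from that time forward. The endpoint path, query parameters, and field names below are assumptions, not the confirmed API.

    import requests

    def get_last_parsed_time(api_url, stream_id):
        """General step: find the time of the most recent datapoint."""
        response = requests.get(api_url + "/datapoints",
                                params={"stream_id": stream_id,
                                        "limit": 1,
                                        "order": "descending"})
        response.raise_for_status()
        datapoints = response.json()
        return datapoints[0]["end_time"] if datapoints else None
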
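
For the time consideration, a minimal conversion sketch using the standard-library zoneinfo module (Python 3.9+); the timestamp format and zone name are illustrative, and each source's unique code would supply its own.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    def to_utc(local_string, source_zone):
        """Convert a source timestamp to UTC as early as possible;
        zoneinfo accounts for daylight saving time."""
        naive = datetime.strptime(local_string, "%Y-%m-%d %H:%M:%S")
        aware = naive.replace(tzinfo=ZoneInfo(source_zone))
        return aware.astimezone(timezone.utc)

    # to_utc("2016-06-01 09:30:00", "US/Central")
    #   -> 2016-06-01 14:30:00+00:00 (CDT is UTC-5)
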
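
And a sketch of the year-at-a-time loop wrapped around the general datapoint steps. The datapoint JSON shape and endpoint are assumptions, and fetch_and_standardize is a hypothetical stand-in for the unique steps 1-2 above.

    from datetime import datetime
    import requests

    def iterate_years(start_year, end_year):
        """Yield (start, end) bounds so only one year of source data
        is held in memory at a time."""
        for year in range(start_year, end_year + 1):
            yield datetime(year, 1, 1), datetime(year + 1, 1, 1)

    def parse_datapoint_json(stream_id, utc_time, properties):
        """General step: build datapoint JSON (shape is an assumption)."""
        return {"stream_id": stream_id,
                "start_time": utc_time.isoformat(),
                "end_time": utc_time.isoformat(),
                "properties": properties}

    def parse_datapoints(api_url, stream_id, fetch_and_standardize,
                         start_year, end_year):
        """Steps 3-4: build and post datapoints one year at a time."""
        for start, end in iterate_years(start_year, end_year):
            for utc_time, props in fetch_and_standardize(start, end):
                dp = parse_datapoint_json(stream_id, utc_time, props)
                response = requests.post(api_url + "/datapoints", json=dp)
                response.raise_for_status()
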
  5. Gap filling and load calculations
    1. At the moment, these are only run on USGS data
    2. This should be generalizable to any stream that contains the needed parameters (a load sketch follows this list)
      1. Gap filling: should take the source data stream and the parameter to fill as input
      2. Load calculation: should take the source data stream and the gap-fill streams as input
      3. Cumulative load calculation: should take the load stream as input
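
For illustration, load and cumulative load in their generic form. The 86.4 factor is the standard conversion from mg/L times m3/s to kg/day; the gap-filling method itself is source-dependent and not sketched here.

    def calc_load(concentration_mg_l, discharge_m3_s):
        """Load in kg/day: mg/L equals g/m3, so g/s = C * Q, and
        86400 s/day / 1000 g/kg gives the factor 86.4."""
        return 86.4 * concentration_mg_l * discharge_m3_s

    def calc_cumulative_load(daily_loads):
        """Running total over a time-ordered series of daily loads."""
        total, cumulative = 0.0, []
        for load in daily_loads:
            total += load
            cumulative.append(total)
        return cumulative
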
  6. List of methods that should be general (or have several general versions)
    1. create or get sensor/stream
    2. parse to sensor json 
    3. parse to stream json
    4. parse to datapoint json
    5. post
    6. get
    7. get most recent datapoint
    8. iterate over a year of data
    9. convert time
    10. map_names
      1. needs to be standardized across sources, with each source's differing column names mapped to standard parameter ids (see the sketch below)
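
A sketch of one possible map_names: a per-source mapping table keyed by source column name. The USGS parameter codes on the left are real; the standard parameter ids on the right are placeholders to be decided.

    # Hypothetical mapping: USGS parameter codes -> standard parameter ids
    USGS_NAME_MAP = {
        "00060": "discharge",        # discharge, cubic feet per second
        "00065": "gage-height",      # gage height, feet
        "99133": "nitrate-nitrite",  # nitrate plus nitrite, mg/L as N
    }

    def map_names(record, name_map):
        """General step: rename a source record's columns to standard
        parameter ids, dropping columns without a mapping."""
        return {name_map[k]: v for k, v in record.items() if k in name_map}
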
