Configuring data fetchers
DSAPI2 provides a framework for fetching, parsing, and tokenizing external data sources to produce streams.
An external fetcher is a standalone Java application intended to be invoked periodically by a server, e.g., as a cron job. There is a single entry point, and a properties-based configuration determines which classes are used for fetching, parsing, and filtering stream tokens. This is a simple form of dependency injection of the sort that is often done with more complex frameworks such as Spring.
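As a sketch, a cron entry invoking the fetcher might look like the following; the jar name, main-class name, and properties file path are placeholders, not actual DSAPI2 artifact names:

```
# Hypothetical crontab entry: run the external fetcher every five minutes.
# Jar, main class, and properties path are placeholders.
*/5 * * * * java -cp dsapi2-fetcher.jar org.example.ExternalFetcher /etc/dsapi2/fetcher.properties
```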
Several properties are generic and apply to all fetcher and parser implementations. These are:
fetcher.class - the fully-qualified name of the class that will be used to fetch data
fetcher.realtime - if true, only the most recent token from each execution will be written to the stream; if false, all tokens produced from each execution will be written to the stream
fetcher.delay - how long to wait between executions (in milliseconds)
parser.class - the fully-qualified name of the class that will be used to parse data (some parser implementations may ignore the fetcher and perform the fetching themselves)
date.extractor.class - the fully-qualified name of the class that will be used to extract dates (some parsers may ignore the date extractor and perform timestamping themselves)
stream.assigner.class - the fully-qualified name of the class that will determine which stream tokens will be written to (some parsers may ignore the stream assigner)
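Put together, a minimal configuration using only the generic properties might look like the following sketch. The property names are those documented above; the class names are hypothetical placeholders, not actual DSAPI2 classes:

```properties
# Generic fetcher configuration (class names are hypothetical placeholders).
fetcher.class=org.example.fetch.HttpFetcher
# Write every token produced by each execution, not just the newest.
fetcher.realtime=false
# Wait one minute between executions.
fetcher.delay=60000
parser.class=org.example.parse.CsvParser
date.extractor.class=org.example.date.PatternDateExtractor
stream.assigner.class=org.example.stream.ConstantStreamAssigner
```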
Example configuration: Twitter
DSAPI2 includes a Twitter parser. It ignores its fetcher, because Twitter-specific parameters determine which URL must be fetched. The following example is annotated to describe each parameter.
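The sketch below only illustrates the shape such a configuration could take; the Twitter-specific property names and all class names are assumptions, not taken from the DSAPI2 distribution:

```properties
# Hypothetical sketch of a Twitter fetcher configuration.
# The twitter.* property names and all class names are assumed, not verified.

# Ignored by the Twitter parser, which performs its own fetching.
fetcher.class=org.example.fetch.NullFetcher
# Keep only the newest token from each execution.
fetcher.realtime=true
# Run once a minute.
fetcher.delay=60000
parser.class=org.example.parse.TwitterParser
parser.twitter.username=alice
parser.twitter.password=secret
```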
The pipeline stages referred to throughout this document are defined as follows:
Fetching. “Pulling” data from a remote service, including authentication and any other protocol interactions required to produce a binary input stream representing the desired data.
Parsing. Given a binary input stream produced by a fetcher, locating and interpreting time-series data points (e.g., timestamped rows in a CSV file).
Tokenizing. Given time-series data, producing DSAPI2 stream tokens for insertion into a stream.
Stream assignment. Producing the URI of a stream which will serve as a destination for tokens produced by a parser/tokenizer.
Date extraction. Conversion of a textual representation of a timestamp into an unambiguous numerical representation of the date.
Filtering. Examining a token and either 1) deciding whether or not to insert it in a stream, or 2) taking additional actions such as associating metadata with the token or its destination stream.
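Of these stages, date extraction is the most self-contained. A minimal sketch in plain Java, assuming a hypothetical extractor that converts a textual timestamp into epoch milliseconds (DSAPI2's actual date-extractor interface is not reproduced here):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/**
 * Hypothetical date extractor: converts a textual timestamp, matched
 * against a configurable pattern, into an unambiguous epoch-millisecond
 * value. UTC is assumed when the text carries no zone information.
 */
public class PatternDateExtractor {
    private final DateTimeFormatter formatter;

    public PatternDateExtractor(String pattern) {
        this.formatter = DateTimeFormatter.ofPattern(pattern);
    }

    /** Parses the text and returns milliseconds since the epoch (UTC). */
    public long extract(String text) {
        LocalDateTime dt = LocalDateTime.parse(text, formatter);
        return dt.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    public static void main(String[] args) {
        PatternDateExtractor ex = new PatternDateExtractor("yyyy-MM-dd HH:mm:ss");
        System.out.println(ex.extract("1970-01-01 00:00:10")); // prints 10000
    }
}
```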
=========================== The section below is an older version of this documentation. If in doubt, refer to the documentation above. ===========================
The External Data Fetcher is a tool that allows fetching data "continuously" from an external source (e.g. HTTP, FTP, POP) and publishing into a stream to be consumed at a later time using the Data Stream API. The External fetcher is extensible and configurable to fetch data from more or less arbitrary sources and perform user-provided operations when data is fetched, such as adding metadata to a stream or triggering a workflow.
Installing and Running
The External Fetcher utilizes several pluggable components, all of which are configured through a single .properties file:
- Fetcher: The plug-in responsible for fetching the data via whatever protocol it is available over. It is specified in the .properties file by setting the property fetcher.class to the qualified name of the class. Additional properties control the behavior of the fetcher:
- There are currently three fetchers available:
- FileFetcher, with property fetcher.file.filename
- HTMLFetcher, with properties fetcher.html.url, fetcher.html.username, fetcher.html.password
- FTPFetcher. Use in conjunction with FTPParser
- Parser: Takes the raw data fetched by the fetcher (a file, an FTP directory, etc.) and extracts the relevant data points and their timestamps. For instance, the LineParser reads a file line by line and extracts data and timestamps according to a configurable pattern. The available parsers are:
- XMLParser, with property parser.xml.xslt. It transforms XML/HTML files using an XSLT transformation into an XML file with the following schema:
- XMLLinkParser, with property parser.xmllink.xslt. It transforms XML/HTML files using an XSLT transformation into the following format, with links to the actual data:
- FTPParser, with properties parser.ftp.url, parser.ftp.username, parser.ftp.password, and parser.ftp.filename.pattern. Downloads files with a given name pattern from an FTP directory.
- Date Extractor: Transforms a string (as provided by the parser) into a timestamp.
- Stream Name Provider: Provides a URI for the stream corresponding to a set of data, either by looking up an ID provided by the parser or by using a constant URI.
- Pre Filter, Post Filter, and One-Time Filter: Filters that determine whether a data point should be accepted, and that can optionally perform additional actions (such as triggering a workflow or registering metadata) after the data is posted to the stream, or the first time data is posted to a stream.
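As a sketch, a configuration wiring the FileFetcher to the LineParser might look like the following. Only fetcher.class and fetcher.file.filename are documented above; the remaining property names, paths, and class names are assumptions:

```properties
# Hypothetical end-to-end configuration (package names, LineParser property
# names, and paths are assumed, not documented).
fetcher.class=org.example.fetch.FileFetcher
fetcher.file.filename=/var/data/sensor.csv
parser.class=org.example.parse.LineParser
# Assumed pattern property: first group is the timestamp, second the value.
parser.line.pattern=(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),(.*)
date.extractor.class=org.example.date.PatternDateExtractor
stream.assigner.class=org.example.stream.ConstantStreamAssigner
```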
Twitter Example
This example fetches friends' statuses every minute for a user account, authenticated via the command prompt.
Yahoo! Weather Example
This example fetches temperature data for Champaign from the Yahoo! Weather Web API.
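A configuration for this example could plausibly combine the HTMLFetcher with the XMLParser. The sketch below is illustrative only; the URL, the XSLT path, and the class names are assumptions:

```properties
# Hypothetical Yahoo! Weather configuration (URL, paths, and class names assumed).
fetcher.class=org.example.fetch.HTMLFetcher
fetcher.html.url=http://weather.yahooapis.com/forecastrss?p=61820
parser.class=org.example.parse.XMLParser
# Assumed XSLT that extracts the temperature element into the parser's schema.
parser.xml.xslt=/etc/dsapi2/yahoo-weather.xsl
```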