These design notes concern exposing the UNO NBI data (NDS-992) to users via Workbench.

See also Shared data directories

Background

From Robin Gandhi, Univ. of Nebraska at Omaha:

Compute and query infrastructure for National Bridge Inventory data: Federal Highway Administration (FHWA) requires all state Departments of Transportation/Roads to annually report information on bridges and tunnels that have road traffic. This data, which is called the National Bridge Inventory (NBI), is made available through position-aligned or comma-separated values, sometimes compressed, on the FHWA website (https://www.fhwa.dot.gov/bridge/nbi/ascii.cfm). Since 1992, this dataset has collected approximately 17 million bridge inspection records. Each bridge inspection record conforms to a data coding guide, which allows the dataset to capture a great amount of information in a dense format. Due to the sheer size of these records, simple tools such as Excel are not suitable for any advanced data analytics. This has been noted by many researchers attempting to analyze this dataset. To make this dataset more accessible, we have developed scripts to transfer this dataset into a big data pipeline. In particular, we have set up a MongoDB instance using infrastructure available from a cloud provider (Digital Ocean). A simple example of the data analytics for all the bridges in Nebraska, which is possible through the new prototype we developed, is available here: http://faculty.ist.unomaha.edu/rgandhi/r/mongoNBI.html. All data export scripts (in active development) are available on GitHub (https://github.com/kaleoyster/ProjectNBI) to replicate these activities.
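To make the prototype concrete, the sketch below shows the kind of query the MongoDB instance enables, using pymongo to count Nebraska inspection records per year. The host name, database, collection, and field names are placeholders for illustration, not the actual ProjectNBI schema.

from pymongo import MongoClient

# Hypothetical connection details; the actual host, database, collection,
# and field names in the UNO prototype may differ.
client = MongoClient("mongodb://nbi-mongo.example.org:27017/")
bridges = client["nbi"]["bridges"]

# Count Nebraska inspection records per year
# (Nebraska is FIPS state code 31; field names here are placeholders).
pipeline = [
    {"$match": {"stateCode": "31"}},
    {"$group": {"_id": "$year", "records": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
]
for row in bridges.aggregate(pipeline):
    print(row["_id"], row["records"])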

Discussions began in June 2017 about "transitioning" this work to DataDNS. This project presents several interesting opportunities:

  • Hosting data in an active (i.e., database) format. While we can keep a copy of the raw data, the active data, even if read-only, is likely of more interest to the community.
  • The raw data is available from DOT, but software has been written to convert it to a more usable format (a minimal loading sketch follows this list).
  • UNO has provided a sample Jupyter notebook to analyze the data. 
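As a rough illustration of the conversion step, the following sketch loads one year's comma-separated NBI file into MongoDB with pandas and pymongo. The file name, year tag, and database/collection names are assumptions; the actual ProjectNBI export scripts should be treated as the reference implementation.

import pandas as pd
from pymongo import MongoClient

# Assumes one year's delimited NBI file has already been downloaded from the
# FHWA site; "NE17.txt" and the year tag below are placeholders.
records = pd.read_csv("NE17.txt", dtype=str)

client = MongoClient("mongodb://localhost:27017/")
bridges = client["nbi"]["bridges"]

# Insert one document per inspection record, tagging the inspection year so
# that multi-year queries are possible.
docs = records.to_dict(orient="records")
for doc in docs:
    doc["year"] = 2017
bridges.insert_many(docs)
print("Inserted", len(docs), "records")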

Workbench, Share, and DataDNS

  • The Labs Workbench service platform supports launching a variety of services, including analysis environments. It would be easy to include the UNO Jupyter notebook in a Workbench instance.
  • Workbench could easily host a "public" MongoDB for the NBI data. Of course, this raises questions (for how long, how do we maintain it going forward, is there an SLA, do we offer this for everyone?). As a pilot, this seems like an interesting opportunity.
  • The NDS Share Globus endpoint provides a place and mechanism to transfer data (see the transfer sketch after this list). Via Globus Publish, it also supports metadata for dataset description.
  • We end up with the following:
    • Raw data via Globus on santiago
    • Metadata record in Globus Online
    • Code in Zenodo
    • MongoDB container running in "public" space on Workbench
    • Raw data optionally mounted via NFS (however, Workbench is currently at SDSC)
    • Notebook container available in Workbench
  • We could either install Workbench on santiago or put the data on the SDSC Workbench instance while exploring how best to expose it.
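For the raw-data transfer to the NDS Share endpoint, a Globus transfer could be scripted with the Globus Python SDK roughly as follows. The endpoint UUIDs, paths, and access token are placeholders and would need to be replaced with the actual santiago endpoint and destination.

import globus_sdk

# Placeholder endpoint UUIDs and paths; substitute the actual NDS Share
# (santiago) endpoint and a destination path that Workbench can see.
SOURCE_ENDPOINT = "SOURCE-ENDPOINT-UUID"
DEST_ENDPOINT = "DEST-ENDPOINT-UUID"

# Assumes a transfer access token obtained through the usual Globus OAuth2 flow.
authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER-ACCESS-TOKEN")
tc = globus_sdk.TransferClient(authorizer=authorizer)

tdata = globus_sdk.TransferData(tc, SOURCE_ENDPOINT, DEST_ENDPOINT,
                                label="NBI raw data")
tdata.add_item("/nbi/raw/", "/shared/nbi/raw/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])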


Download the NBI data

git clone https://github.com/kaleoyster/ProjectNBI

...