Discussion Notes

Notes from 8/26/16 discussion

Data search/cross search is relevant to TERRA-REF, NDS and likely other projects.
TERRA-REF (see requirements section below)
- Goal: ability to do a search that spans multiple databases
- Clowder, BETYdb, CoGe, fieldbook.
- Find me everything you know about this genome; Genomics (type of query)
- Ingest in one place or search across them
- Search across
  - Translate queries/results
  - Translate queries (e.g., synonyms)
- Potentially also search DataOne, Dataverse, etc.
- Best with API
  - Return list of hits
- Plugins that know how to get data from BETYdb and Clowder
- https://github.com/terraref/computing-pipeline/issues/47
Ability to download the results
- Transform on the fly
- Clowder: zip, file, metadata
- BETYdb: csv, json
- Linking between datasets
Authentication/Authorization
- Some data is only available if you're logged in
- BETY has access levels, as does Clowder
Needs to have smarts
- Linking/translation of keywords
- Clowder = datasets/collections
- Crowdsourcing links between databases, versus requiring some published thing.
TERRA
- Arizona plant field
  - Site in BETYdb
  - ID is in Clowder
ECGS
- User defined mappings between CSV and vocabularies
- Semantic annotation service
Instead of just showing the dataset summary page, show related things in related repositories
Publication data as well
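The "translate queries (e.g., synonyms)" and "linking/translation of keywords" points above could be sketched as a lookup from a shared search term to each database's own term. This is a minimal illustration only: the TERM_MAP contents and the field names "sitename", "dataset", and "tag" are hypothetical and are not taken from the BETYdb or Clowder APIs.

```python
from typing import Dict

# Hypothetical mapping from a shared search term to each backend's own term.
# The backend field names here are placeholders, not real API fields.
TERM_MAP: Dict[str, Dict[str, str]] = {
    "site": {"betydb": "sitename", "clowder": "dataset"},
    "cultivar": {"betydb": "cultivar", "clowder": "tag"},
}


def translate_query(query: Dict[str, str], backend: str) -> Dict[str, str]:
    """Rewrite the keys of a generic query into one backend's vocabulary.

    Terms without a known mapping are passed through unchanged.
    """
    translated = {}
    for term, value in query.items():
        backend_term = TERM_MAP.get(term, {}).get(backend, term)
        translated[backend_term] = value
    return translated


generic = {"site": "Arizona plant field", "year": "2016"}
betydb_query = translate_query(generic, "betydb")
clowder_query = translate_query(generic, "clowder")
```

A crowdsourced mapping, as raised in the discussion, would amount to letting users add entries to a table like TERM_MAP rather than requiring a published vocabulary.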

TERRA-REF Requirements

Source issues:

Use Cases

User is looking at an image in Clowder, has identified a particular trait, and wants to find all plants with this trait (within some range, or above a threshold, e.g., top 10% biomass) and find other data associated with these plants.
I have noticed an interesting feature; can I find all plants with the same feature +/- X%?
Want to upload data so someone else can get to it and its metadata
Want to publish a collection from Clowder
Find all plots where [accession X] was planted
1. Find all points and dates where seeds of [accession X] were planted
2. find bounding boxes associated with these plots
3. use bounding boxes to query data from sensor X from date range
Find all trait values (in BETYdb traits table, summarized to plant or plot level) associated with [accession X] collected using [method X]
After sensor data have been processed (geospatially orthorectified and aligned, with overlap and artifacts removed), use bounding boxes to clip / select data from some set of sensors
Return experimental design 'plot plan' from PostGIS database (BETYdb)
Return information about accessions from the BETYdb cultivars table and from BMS
Return information about lineage, seed packets, and experimental design from BMS
Load field measurements collected in FieldBook APP into BMS
Import traits from BMS to BETYdb and vice versa. The reason we are using both BETYdb and BMS despite substantial overlap is that BETYdb has better support for geospatial data, numeric traits, and large external raster files; BMS does a better job at tracking experimental design, lineage, and genomics, and can (apparently now or in the near future) import directly from the FieldBook app.
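The three-step "find all plots where [accession X] was planted" use case above can be sketched with in-memory records. This is an illustration under stated assumptions, not the BETYdb or Clowder query interface: the Planting and Observation classes, the accession IDs, and the sensor names are all hypothetical, and bounding boxes are plain (min_x, min_y, max_x, max_y) tuples rather than PostGIS geometries.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class Planting:  # hypothetical stand-in for a plot/planting record
    accession: str
    plot: str
    planted_on: date
    bbox: BBox


@dataclass
class Observation:  # hypothetical stand-in for a sensor data record
    sensor: str
    x: float
    y: float
    taken_on: date


def plantings_for(accession: str, plantings: List[Planting]) -> List[Planting]:
    """Step 1: all plots (with dates) where the accession was planted."""
    return [p for p in plantings if p.accession == accession]


def contains(bbox: BBox, x: float, y: float) -> bool:
    min_x, min_y, max_x, max_y = bbox
    return min_x <= x <= max_x and min_y <= y <= max_y


def observations_for(accession, sensor, start, end, plantings, observations):
    """Steps 2 and 3: collect plot bounding boxes, then filter sensor data."""
    boxes = [p.bbox for p in plantings_for(accession, plantings)]
    return [
        o for o in observations
        if o.sensor == sensor
        and start <= o.taken_on <= end
        and any(contains(b, o.x, o.y) for b in boxes)
    ]


plantings = [
    Planting("PI-001", "plot-1", date(2016, 5, 1), (0, 0, 10, 10)),
    Planting("PI-001", "plot-2", date(2016, 5, 2), (20, 0, 30, 10)),
    Planting("PI-002", "plot-3", date(2016, 5, 1), (40, 0, 50, 10)),
]
observations = [
    Observation("thermal", 5, 5, date(2016, 6, 1)),   # inside plot-1
    Observation("thermal", 45, 5, date(2016, 6, 1)),  # plot-3, other accession
    Observation("thermal", 25, 5, date(2016, 9, 1)),  # plot-2, outside dates
    Observation("rgb", 25, 5, date(2016, 6, 2)),      # different sensor
]
hits = observations_for(
    "PI-001", "thermal", date(2016, 6, 1), date(2016, 6, 30),
    plantings, observations,
)
```

In a real deployment steps 1 and 2 would presumably be answered by BETYdb and step 3 by Clowder, which is why the bounding box and date range act as the link between the two.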

Notes from email exchange with dlebauer:

Make BETYdb + Clowder the priority.
Not clear exactly how to query CoGe --> most important is that we can link to CoGe with the cultivars.name field from BETYdb. Wouldn't spend too much time at this point with the genomics databases (CoGe is one, GOBII will be another to support) because the focus of the grant is on the phenomics rather than genomics.
BMS would be the next database - this uses the BRAPI API.
In general, thinking that the most important unique identifier linking BETYdb to Clowder would be the geospatial and temporal information. The most important link between BETYdb and BMS and CoGe would be the cultivars.name field. BMS and CoGe do not have spatial information.

Brainstorming

A few ideas based on the discussion last week.

Goals:

Reusable across projects – not TERRA-REF specific
More than "search these two databases" – should include linking support
More than metadata search – DataMed, B2FIND, DataOne are all metadata centric

Ideas

Each database publishes a description (e.g., JSON) of capabilities and supported variables or standards. For example, we might know that the TERRA Clowder instance has spatial and temporal operator support and supports a specific set of metadata variables. We also know that BETYdb supports spatial and temporal operators and supports an overlapping set of metadata variables. There could also be a mapping via ECGS or similar between IDs in each. Each system could implement a standard search API, the search engine could have adaptors/translators. When searching, the search interface would call out to the separate services, aggregate and link results where possible. For example, if Clowder and TERRA both return JSON-LD objects as search results and the context for certain data is the same, we could link the results to the other system based on that variable (e.g., accession number, plot, cultivar, trait, etc).
- Building on this, we could also implement a metadata index similar to DataMed, but identifying and linking known attributes (e.g., this ID in Clowder is mapped to this ID in BETYdb). This would allow us to support a centralized index for some information but link out to other services where appropriate. However, we don't want to index all of BETYdb and all of Clowder to make this happen, but maybe each could produce a summary "record" similar to those found in DataOne or DataMed.
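One way to picture the capability-description idea is the sketch below: each service publishes a small JSON document, the search layer intersects the supported variables, and hits are grouped on a shared variable. The descriptor layout, the variable names, and the hit records are all assumptions made for illustration; neither Clowder nor BETYdb is known to publish such a document.

```python
import json
from collections import defaultdict

# Hypothetical capability descriptions, as each service might publish them.
CLOWDER_CAPS = json.loads("""
{"name": "clowder",
 "operators": ["spatial", "temporal"],
 "variables": ["plot", "cultivar", "sensor"]}
""")
BETYDB_CAPS = json.loads("""
{"name": "betydb",
 "operators": ["spatial", "temporal"],
 "variables": ["plot", "cultivar", "trait"]}
""")


def shared_variables(a: dict, b: dict) -> list:
    """Variables both services support, i.e. candidates for linking."""
    return sorted(set(a["variables"]) & set(b["variables"]))


def link_results(results_by_service: dict, variable: str) -> dict:
    """Group hits from all services by their value for one shared variable."""
    linked = defaultdict(list)
    for service, hits in results_by_service.items():
        for hit in hits:
            if variable in hit:
                linked[hit[variable]].append((service, hit["id"]))
    return dict(linked)


# Made-up search results standing in for responses from each service.
results = {
    "clowder": [{"id": "ds-12", "cultivar": "BTx623", "sensor": "rgb"}],
    "betydb": [
        {"id": "trait-7", "cultivar": "BTx623", "trait": "biomass"},
        {"id": "trait-8", "cultivar": "RTx430", "trait": "height"},
    ],
}
link_keys = shared_variables(CLOWDER_CAPS, BETYDB_CAPS)
linked = link_results(results, "cultivar")
```

Here linked["BTx623"] holds one hit from each service, which is the kind of cross-database link the search interface could display next to a result.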

Background:

What is data search?

An open question:

What do we index?
- Descriptive metadata only (i.e., the catalog record)
- Internal metadata (e.g., structured information included in the data package)
- Data content
How do we support search
- Federated search (sense 1)
  - Search each data source separately, merge results
  - Would require writing per-repository adaptors for both queries and results (old-school fed search)
  - Could restrict to repositories that conform to specific standards
- Federated search (sense 2)
  - Crawl but categorize data (think Google web search vs image search vs. scholar) into big buckets
  - Information is stored separately for each bucket, but integrated depending on user query (e.g., how images appear in Google Search results when relevant)
- Centralized search
  - Crawl/harvest content and index it in a central index
  - Choice between metadata only, metadata + data
  - Will face the "deep web" problem – data stored in databases
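To make the centralized option concrete, here is a minimal sketch of a metadata-only central index: harvested summary records are tokenized into an inverted index, and a query returns the records that contain every query term. The CentralIndex class, the record IDs, and the descriptions are hypothetical; a production system would more likely use an existing engine such as Solr or ElasticSearch, as the systems surveyed below do.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CentralIndex:
    """Toy inverted index over harvested descriptive metadata."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> record ids
        self._sources = {}                 # record id -> source repository

    def add(self, record_id: str, source: str, description: str) -> None:
        self._sources[record_id] = source
        for token in tokenize(description):
            self._postings[token].add(record_id)

    def search(self, query: str) -> list:
        """Return (source, record_id) pairs matching all query tokens."""
        tokens = tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted((self._sources[i], i) for i in ids)


index = CentralIndex()
index.add("clowder:1", "clowder", "Sorghum canopy height images")
index.add("betydb:7", "betydb", "Sorghum biomass trait values")
all_sorghum = index.search("sorghum")
biomass_only = index.search("sorghum biomass")
```

Because only the descriptions are indexed, anything stored inside the databases themselves stays invisible to this index, which is the "deep web" limitation noted above.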

TERRA-REF use case

The TERRA project currently has data stored in Clowder and BETYdb. BETYdb is a Postgres RDB with a REST interface. There are no community-defined metadata standards (yet).

How do we allow the user to search across BETYdb and Clowder in a single result set?
Open question: why would we?

Existing systems

DataMed/bioCADDIE

Primary site: https://datamed.org/

DataMed is a biomedical data search engine or "data discovery index" described as "Pubmed for data." It is a metadata-based search engine where content is indexed from funding agencies, publishers, and data producers. They currently support 23 different repositories including GEO, PDB, Dryad, Dataverse (subset). It is focused on biomedical data.

The use cases are interesting:

Find all data sets from Alzheimer’s patients that have RNA-seq, behavioral and imaging data available.
A user wants to get all proteomics and metabolomics data sets related to the same biological process
A user wants to know what datasets are available that have genome data about IDH1 and IDH2 in humans or other species for a particular phenotype of interest

They've put some effort into defining a specific metadata schema for ingestion:

- DATS: Descriptive metadata for datasets
- Mirrors how journals submit data to PubMed.
- MongoDB, ActiveMQ, ElasticSearch, custom web interface
- https://biocaddie.org/sites/default/files/d7/project/1493/jeff-biocaddie-ingestion-2016jun.pdf
More on DATS:
- https://biocaddie.org/group/working-group/working-group-3-descriptive-metadata-datasets
- https://docs.google.com/document/d/1hVcYRleE6-dFfn7qbF9Bv1Ohs1kTF6a8OwWUvoZlDto/edit?usp=sharing
- The Google doc provides a detailed list of data search initiatives in section 2.1.1 – including NDS... (#5)

DataOne/OneMercury

https://search.dataone.org/#data/page/0

https://cn.dataone.org/onemercury/

DataONE is a network composed of member nodes that use the same system. DataONE is focused on earth/environment science data, centered on the Ecological Metadata Language (EML). In that sense, DataONE has distributed but heterogeneous data. Search is based on periodic harvesting of member node data (https://cn.dataone.org/cn/v1/node).

DataONE search was/is based on the ORNL Mercury system.

"The OneMercury search tool allows users to search environmental science and Earth observational data sets through a distributed framework. The search tool is based on Mercury, a Web-based system to search for metadata and retrieve associated data."
Earth-science centric
Distributed search (DataOne member nodes)
https://releases.dataone.org/online/api-documentation-v1.2.0/design/SearchMetadata.html
http://ceur-ws.org/Vol-951/paper4.pdf

Mercury

EUDAT B2FIND

http://b2find.eudat.eu/

https://eudat.eu/sites/default/files/DaanBroeder.pdf

https://eudat.eu/services/userdoc/b2find-integration
Metadata-centric
OAI-PMH
Solr/Lucene
CKAN

Datacite

http://search.datacite.org/ui

Search datasets registered with Datacite

Related standards

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

https://www.openarchives.org/pmh/

Defines a mechanism for harvesting records containing metadata from repositories based on HTTP/XML
Originally developed for e-print (e.g., paper) repositories, not data
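As a rough illustration of what HTTP/XML harvesting looks like, the sketch below builds a ListRecords request URL and pulls record identifiers and the resumption token out of a response. The base URL https://example.org/oai and the sample response are made up; the verb, the metadataPrefix and resumptionToken arguments, and the XML namespace follow the OAI-PMH 2.0 specification. No network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces other args."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [
        e.text
        for e in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token or None)


# Made-up response body, trimmed to the elements the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("https://example.org/oai")
identifiers, next_token = parse_identifiers(SAMPLE)
next_url = list_records_url("https://example.org/oai", resumption_token=next_token)
```

A harvester would repeat the request with each returned token until the response no longer carries one.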

re3data.org registered protocols

Quick count of 'protocols' supported by repositories in re3data.org.

Protocol	Count	What is it
FTP	359	You know
other	236
REST	192
OAI-PMH	99
SOAP	56
NetCDF	37
OpenDAP	24	Discipline-neutral means of requesting and providing data across the World Wide Web
SWORD	23	Simple Web-service Offering Repository Deposit
SPARQL	13	RDF query
