The end product will be a search engine that merges records from re3data, BioSharing (if available), and funder and publisher lists, ranked using models of relevance.

Search Engine

We can use either a research-oriented (Indri/Galago/Terrier) or general-purpose (Lucene) search engine platform. The goal would be to identify features/characteristics of repositories that can be used to improve rankings, aside from basic language models.
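As a point of reference for the "basic language models" baseline, here is a minimal sketch of query-likelihood scoring with Dirichlet smoothing, one common language-model ranking function. It is not tied to Indri, Galago, Terrier, or Lucene; the function names, the whitespace tokenizer, and the `mu` default are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real engine would do more).
    return text.lower().split()


def query_likelihood(query, doc_tokens, collection_counts, collection_len, mu=2000.0):
    """Return log P(query | document) under Dirichlet smoothing."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term never seen in the collection: ignore it
        p = (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score


def rank(query, docs, mu=2000.0):
    """Rank repository descriptions (name -> text) for a query, best first."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    collection_counts = Counter(t for toks in tokenized.values() for t in toks)
    collection_len = sum(collection_counts.values())
    if collection_len == 0:
        return []
    scored = [
        (query_likelihood(query, toks, collection_counts, collection_len, mu), name)
        for name, toks in tokenized.items()
    ]
    return [name for _, name in sorted(scored, key=lambda s: s[0], reverse=True)]
```

Repository-level features would then be combined with, or used to re-rank, this baseline score.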

Potential features:

  • Retrieval score based on name, description, subject, information crawled from associated URLs, keywords, language, startDate, size
  • URL format (e.g. presence of non-standard ports, path depth)
  • Number of results in Google Scholar
  • Completeness of the re3data record (how much information does it contain?)
  • Number of policies
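Some of the features listed above can be computed directly from a registry record. The sketch below is illustrative only: the record is assumed to be a plain dict, and the field names (`url`, `policies`, and those in `RECORD_FIELDS`) are hypothetical rather than the actual re3data schema.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

# Hypothetical field names used to estimate how complete a record is.
RECORD_FIELDS = ("name", "description", "subject", "keywords",
                 "language", "startDate", "size")


def url_features(url):
    """URL-format features: non-standard port flag and path depth."""
    parts = urlsplit(url)
    port = parts.port  # None when the URL carries no explicit port
    nonstandard = port is not None and port != DEFAULT_PORTS.get(parts.scheme)
    depth = len([seg for seg in parts.path.split("/") if seg])
    return {"nonstandard_port": int(nonstandard), "path_depth": depth}


def completeness(record):
    """Fraction of the expected fields that are present and non-empty."""
    filled = sum(1 for f in RECORD_FIELDS if record.get(f))
    return filled / len(RECORD_FIELDS)


def extract_features(record):
    """Combine URL, completeness, and policy-count features for one record."""
    features = url_features(record.get("url", ""))
    features["completeness"] = completeness(record)
    features["num_policies"] = len(record.get("policies", []))
    return features
```

The resulting feature dict could feed a learned ranking model or a simple weighted re-ranker; which features actually help is the open question this section raises.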

Test collection:

A key requirement is the ability to evaluate the retrieval model, which requires a suitable test collection. For NDSC6, we would only pilot this.

  • Find researchers with real datasets and have them identify the top repositories from re3data?
  • For some subset of repositories, go find a dataset. 
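Once such a test collection exists (queries paired with the repositories that researchers judged appropriate), a ranking can be scored with standard measures. The following is a minimal sketch of precision at k and average precision; the function names and the assumption that each ranked list contains no duplicates are ours.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked repositories that are judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for r in ranked[:k] if r in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for i, repo in enumerate(ranked, start=1):
        if repo in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)


def mean_average_precision(runs, judgments):
    """Average AP over queries that have both a ranking and judgments."""
    queries = [q for q in judgments if q in runs]
    if not queries:
        return 0.0
    return sum(average_precision(runs[q], judgments[q]) for q in queries) / len(queries)
```

For a pilot with only a handful of judged queries, precision at a small k (for example, the top five) may be the most interpretable figure.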

Background/Analysis

What tools already exist in this space?

Registries of Research Data Repositories

| Registry | Description | Notes |
| --- | --- | --- |
| Re3Data | Registry of research data repositories | Started from Databib, crowd-sourced. Metadata is too general for search; user feedback "precision is horrible"; not based on natural language. |
| Biosharing.org | Registry of databases and policies for life/environmental/bio sciences | Schema based on BioDBCore (http://biocuration.org/community/standards-biodbcore/). Data is not available, but will be. BioSharing: curated and crowd-sourced metadata standards, databases and data policies in the life sciences. |
| Cinergi | Community Inventory of EarthCube Resources for Geosciences Interoperability | Curated database of geoscience information resources |
| OpenAIRE | OpenAIRE data provider search | Publishes guidelines for data archives |
| LA Referencia | | |
| bioCADDIE | Data discovery index | Index of data: "do for data what pubmed did for literature" |
| OpenDOAR | Directory of open-access repositories | |
| SHARE | Index of research activities/outputs including data management plans, grant proposals, preprints, presentations, and data repository deposits | |

Approved and Recommended Repositories 

Publishers, funding agencies, research/domain organizations (e.g., AGU, ACM), and libraries often provide lists of recommended or supported repositories for depositing research data. Their motivations and requirements often differ, but the lists themselves might serve as the basis for our analysis. We can review these (and other) lists to determine which factors drive the recommendation of data repositories to researchers.

(This list is not exhaustive – it's likely that many publishers, agencies, and organizations will provide similar lists):

...

https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

...

Note that the Biosharing database already includes information about whether a repository is recommended by a funding agency:

...

Supported Data Repositories

Public repositories to store and find data (Data in Brief)

...

  • List of databases with bi-directional linking

...

Data policy

Recommended Data Repositories

Data Policies: Nature Scientific Data

...

  • Includes mandates
  • Drawn from re3data and biosharing

...

http://blogs.plos.org/everyone/2015/07/02/plos-recommended-data-repositories/

http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories

...

https://library.uoregon.edu/datamanagement/sharingdata.html

http://www.library.cmu.edu/datapub/dms/respositories

...

http://www.ijdc.net/index.php/ijdc/article/viewFile/9.1.152/349

http://www.rdc-drc.ca/wp-content/uploads/Review-of-Research-Data-Repositories-2015.pdf

AMS: https://www.ametsoc.org/ams/index.cfm/publications/authors/journal-and-bams-authors/journal-and-bams-authors-guide/data-archiving-and-citation/

AGU: http://publications.agu.org/files/2014/06/Data-Repositories.pdf

http://openarchaeologydata.metajnl.com/about/#repo

https://www.datacite.org/services/find-repository.html

...

SEAD Publication API

See also  and the actual 

A primary function of the SEAD Publication API (C3PR) is to match or recommend a repository given a research data object based on a set of technical requirements implemented as rules:

...
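To make the idea of rule-based matching concrete, here is a minimal sketch in which each technical requirement is a predicate over a data object and a repository, and a repository matches only if every rule passes. This is not the SEAD C3PR implementation: the two rules shown (a size limit and accepted formats), the class names, and all fields are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    # Hypothetical description of a research data object.
    total_bytes: int
    formats: frozenset


@dataclass
class Repository:
    # Hypothetical technical requirements advertised by a repository.
    name: str
    max_bytes: int
    accepted_formats: frozenset


def size_rule(obj, repo):
    """The object must fit within the repository's size limit."""
    return obj.total_bytes <= repo.max_bytes


def format_rule(obj, repo):
    """Every format in the object must be accepted by the repository."""
    return obj.formats <= repo.accepted_formats


RULES = (size_rule, format_rule)


def match(obj, repos, rules=RULES):
    """Return the names of repositories that satisfy every rule."""
    return [r.name for r in repos if all(rule(obj, r) for rule in rules)]
```

A hard filter like this complements the ranked search described earlier: rules exclude repositories that cannot accept the data at all, and the retrieval model orders the ones that remain.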

  1. Do researchers come to you looking for places to put their data?
    1. Of those who come to you, can you estimate the percentage who eventually find a place to put their data?
  2. Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
  3. Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
  4. What are some of the tools you recommend and how well do they meet the needs of the researcher?
  5. Do you have any ideas of tools or services that could help you/them better?
  6. We’re thinking of this service (describe current vision of recommender), what do you think? Would it be useful?
  7. Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
  8. Is there anyone else working in this space that you think we should talk to?

...

References


Elsevier. Supported Data Repositories.

...