...
The end product will be a search engine that merges the re3data and Biosharing (if available) registries with funder and publisher lists, along with models of relevance.
Analysis
What tools already exist in this space?
Registries of Research Data Repositories
...
Started from Databib, crowd-sourced.
Metadata is too general for search; user feedback "precision is horrible"; not based on natural language
...
Schema based on BioDBCore: http://biocuration.org/community/standards-biodbcore/
Data is not available, but will be.
...
Curated database of geoscience information resources
...
Directory of open-access repositories
...
Index of research activities/outputs including data management plans, grant proposals, preprints, presentations, and data repository deposits
Search Engine
We can use either a research-oriented (Indri/Galago/Terrier) or general-purpose (Lucene) search engine platform. The goal would be to identify features/characteristics of repositories that can be used to improve rankings, aside from basic language models.
Potential features:
- Retrieval score based on name, description, subject, keywords, and information crawled from associated URLs
- language, startDate, size
- URL format (e.g. presence of non-standard ports, path depth)
- Number of results in Google Scholar
- Completeness of the re3data record (how much information does it contain?)
- Number of policies
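Some of the features above can be computed directly from a registry record. The following is a minimal sketch in Python, not a settled design: the function names, the `RECORD_FIELDS` list, and the idea of measuring completeness as the fraction of non-empty fields are assumptions made for illustration, and the field names only loosely follow the metadata mentioned in the list.

```python
from urllib.parse import urlsplit

# Hypothetical subset of registry fields used to gauge record completeness.
RECORD_FIELDS = ("name", "description", "subject", "keywords",
                 "language", "startDate", "size")

# Ports considered "standard" for each URL scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_features(url):
    """Return URL-format features: a non-standard port flag and the path depth."""
    parts = urlsplit(url)
    port = parts.port  # None when the URL does not name a port explicitly
    nonstandard = port is not None and port != DEFAULT_PORTS.get(parts.scheme)
    # Count non-empty path segments, so "/data/repo/" has depth 2.
    depth = len([segment for segment in parts.path.split("/") if segment])
    return {"nonstandard_port": nonstandard, "path_depth": depth}


def record_completeness(record, fields=RECORD_FIELDS):
    """Return the fraction of the listed fields that are present and non-empty."""
    filled = sum(1 for field in fields if record.get(field))
    return filled / len(fields)


if __name__ == "__main__":
    print(url_features("http://example.org:8080/data/repo/"))
    print(record_completeness({"name": "Example", "description": "A repository"}))
```

Features like these would be combined with the retrieval score at ranking time; how they are weighted is left open here.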
Test collection:
A key requirement is the ability to evaluate the retrieval model, which requires a suitable test collection. For NDSC6, we would only pilot this.
- Find researchers with real datasets and have them identify the top repositories from re3data?
- For some subset of repositories, go find a dataset.
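Once a pilot test collection exists, rankings can be scored against the researchers' judgments. The sketch below assumes a hypothetical layout: `runs` maps each dataset (query) to the ranked list of repository identifiers returned by the engine, and `judgments` maps each dataset to the set of repositories judged appropriate. Precision at k and mean reciprocal rank are used here only as familiar examples; the choice of metrics is an assumption, not something this plan specifies.

```python
def precision_at_k(ranked, relevant, k):
    """Return the fraction of the top-k ranked repositories judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for repo in ranked[:k] if repo in relevant) / k


def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant repository, or 0.0 if none is found."""
    for rank, repo in enumerate(ranked, start=1):
        if repo in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, judgments, k=5):
    """Average both metrics over the queries that have relevance judgments."""
    queries = [q for q in runs if q in judgments]
    if not queries:
        return {"p_at_k": 0.0, "mrr": 0.0}
    p_at_k = sum(precision_at_k(runs[q], judgments[q], k) for q in queries)
    mrr = sum(reciprocal_rank(runs[q], judgments[q]) for q in queries)
    return {"p_at_k": p_at_k / len(queries), "mrr": mrr / len(queries)}


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    runs = {"dataset1": ["repoA", "repoB", "repoC"]}
    judgments = {"dataset1": {"repoB"}}
    print(evaluate(runs, judgments, k=2))
```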
Background/Analysis
What tools already exist in this space?
Registries of Research Data Repositories
Registry | Description | Notes |
---|---|---|
Re3Data | Registry of research data repositories | Started from Databib, crowd-sourced. Metadata is too general for search; user feedback "precision is horrible"; not based on natural language |
Biosharing.org | Registry of databases and policies for life/environmental/bio sciences | Schema based on BioDBCore: http://biocuration.org/community/standards-biodbcore/ Data is not available, but will be. |
Cinergi | Community Inventory of EarthCube Resources for Geosciences Interoperability | Curated database of geoscience information resources |
OpenAIRE | OpenAIRE data provider search | Publishes guidelines for data archives |
LA Referencia | | |
bioCADDIE | Data discovery index | Index of data; aims to "do for data what PubMed did for literature" |
OpenDOAR | Directory of open-access repositories | |
SHARE | Index of research activities/outputs including data management plans, grant proposals, preprints, presentations, and data repository deposits | |
Approved and Recommended Repositories
Publishers, funding agencies, research/domain organizations (e.g., AGU, ACM), and libraries often provide lists of recommended or supported repositories for depositing research data. Their motivations and requirements often differ, but the lists themselves might serve as the basis for our analysis. We can review these (and other) lists to determine which factors go into recommending data repositories to researchers.
(This list is not exhaustive; many publishers, agencies, and organizations likely provide similar lists):
Source | Repository lists | Notes |
---|---|---|
NIH | https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html | Note that the Biosharing database already includes information about whether a repository is recommended by a funding agency. |
Elsevier | Public repositories to store and find data (Data in Brief) | |
Nature | | |
PLOS | http://blogs.plos.org/everyone/2015/07/02/plos-recommended-data-repositories/ http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories | |
Libraries | | |
Other | http://www.ijdc.net/index.php/ijdc/article/viewFile/9.1.152/349 http://www.rdc-drc.ca/wp-content/uploads/Review-of-Research-Data-Repositories-2015.pdf AGU: http://publications.agu.org/files/2014/06/Data-Repositories.pdf | |
...
https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
...
Note that the Biosharing database already includes information about whether a repository is recommended by a funding agency:
...
Public repositories to store and find data (Data in Brief)
...
- List of databases with bi-directional linking
...
Data Policies: Nature Scientific Data
...
- Includes mandates
- Drawn from re3data and biosharing
...
http://blogs.plos.org/everyone/2015/07/02/plos-recommended-data-repositories/
http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories
...
https://library.uoregon.edu/datamanagement/sharingdata.html
http://www.library.cmu.edu/datapub/dms/respositories
...
http://www.ijdc.net/index.php/ijdc/article/viewFile/9.1.152/349
http://www.rdc-drc.ca/wp-content/uploads/Review-of-Research-Data-Repositories-2015.pdf
AGU: http://publications.agu.org/files/2014/06/Data-Repositories.pdf
http://openarchaeologydata.metajnl.com/about/#repo
https://www.datacite.org/services/find-repository.html
...
SEAD Publication API
See also and the actual
A primary function of the SEAD Publication API (C3PR) is to match or recommend a repository given a research data object based on a set of technical requirements implemented as rules:
...
- Do researchers come to you looking for places to put their data?
- Of those who come to you, can you estimate the percentage who eventually find a place to put their data?
- Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
- Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
- What are some of the tools you recommend and how well do they meet the needs of the researcher?
- Do you have any ideas of tools or services that could help you/them better?
- We’re thinking of this service (describe current vision of recommender). What do you think? Would it be useful?
- Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
- Is there anyone else working in this space that you think we should talk to?
...
...
References
Elsevier. Supported Data Repositories.
...