This page captures information related to the NDS repository recommender project.
The goal of this project is to develop a general-purpose research data repository "recommender" service to be hosted by the NDS. The basic use case is very broad: a researcher has data that they want to deposit, but they don't know where to put it.
How is this problem currently addressed? We can find a few cases in the wild:
Service | Data Repository Recommendation |
---|---|
U of I Research Data Service | "Deposition of data into a web-accessible repository is generally the preferred mechanism for public data sharing because it ensures wide-spread and consistent access to the data. If your discipline already has a trusted repository, we recommend you deposit where your community knows to look. To find a repository, re3data.org is a large, vetted, and searchable catalog of data repositories. If no discipline-specific repository exists, there are several options, including Illinois’ IDEALS repository (free) and other general-purpose repositories like DataDryad (fee-based)." |
Elsevier | List of supported data repositories |
Nature | "Supporting data must be made available to editors and peer-reviewers at the time of submission for the purposes of evaluating the manuscript...For information about suitable public repositories, see sections that follow." |
PLOS | PLOS Data Repository Recommendation Guide "PLOS has identified a set of established repositories below, which are recognized and trusted within their respective communities. Additionally, the Registry of Research Data Repositories (Re3Data) is a full scale resource of registered repositories across subject areas." |
DCC | Where can I find a data repository? |
A researcher looking for a repository has many options, all of which require manual analysis: determine funding agency requirements, identify field/domain recommendations, review publisher recommendations, or search repository registries. The NDS repository recommender will try to provide a single point where users can go to search for an appropriate repository.
There are several existing services in this space including the Registry of Research Data Repositories (RE3Data), Biosharing.org, and the SEAD C3PR service. In addition to these existing registries of research data repositories, funding agencies and publishers provide lists of recommended repositories.
To be useful, the NDS repository recommender must differentiate itself from these existing tools and services.
Is it really a "recommender"? Broadly speaking, a "recommender system" attempts to predict the relevance of an item to a user based on information known about the user. This could be profile information, previous ratings or related activities. It is more likely that this system will be a "search engine" in the sense that the user comes with an information need and is looking for a ranked list of candidate repositories. The information need might be a query or the dataset itself.
The end product will be a search engine that merges the re3data, biosharing (if available), funder and publisher lists along with models of relevance.
We can use either a research-oriented (Indri/Galago/Terrier) or general-purpose (Lucene) search engine platform. The goal would be to identify features/characteristics of repositories that can be used to improve rankings, aside from basic language models.
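As a point of reference for what a "basic language model" baseline could look like, here is a minimal sketch of query-likelihood ranking with Dirichlet smoothing over repository descriptions. It is not the project's implementation: the function names, the repository names, and the `log_priors` parameter (one possible way to fold repository features into the ranking as a document prior) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real engine would use a proper analyzer."""
    return text.lower().split()


def rank_repositories(query, descriptions, mu=2000.0, log_priors=None):
    """Rank repository names by Dirichlet-smoothed query likelihood.

    descriptions: {repository name: free-text description} (hypothetical input)
    log_priors:   optional {repository name: log prior}, e.g. derived from
                  features such as funder approval (an assumption, not a spec)
    """
    docs = {name: Counter(tokenize(text)) for name, text in descriptions.items()}
    collection = Counter()
    for counts in docs.values():
        collection.update(counts)
    collection_len = sum(collection.values())
    if collection_len == 0:
        return list(descriptions)  # nothing to score against
    log_priors = log_priors or {}

    scores = {}
    for name, counts in docs.items():
        doc_len = sum(counts.values())
        score = log_priors.get(name, 0.0)
        for term in tokenize(query):
            p_coll = collection[term] / collection_len
            if p_coll == 0.0:
                continue  # term never seen in any description; ignore it
            score += math.log((counts[term] + mu * p_coll) / (doc_len + mu))
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical repository descriptions, for illustration only.
descriptions = {
    "geo-repo": "geoscience earth observation data seismic",
    "bio-repo": "genomics protein sequence data biology",
    "general-repo": "general purpose research data any discipline",
}
ranking = rank_repositories("protein sequence", descriptions)
```

Platforms such as Lucene or Indri provide their own smoothed language-model scoring, so in practice this logic would come from the chosen engine rather than being hand-written.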
A key requirement is the ability to evaluate the retrieval model, which requires a suitable test collection. For NDSC6, we would just pilot this.
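To illustrate what evaluating against a test collection could involve, the sketch below computes precision at k and mean average precision from ranked results and relevance judgments. The data layout (dictionaries keyed by query ID) and all identifiers are assumptions; no test collection exists yet, so the example values are made up purely to show the arithmetic.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked repositories that are judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for repo in ranked[:k] if repo in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, repo in enumerate(ranked, start=1):
        if repo in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, judgments):
    """runs: {query id: ranked list}; judgments: {query id: set of relevant repos}."""
    if not runs:
        return 0.0
    return sum(
        average_precision(ranked, judgments.get(qid, set()))
        for qid, ranked in runs.items()
    ) / len(runs)


# Made-up run and judgments for two hypothetical queries.
runs = {"q1": ["A", "B", "C", "D"], "q2": ["B", "A"]}
judgments = {"q1": {"A", "C"}, "q2": {"A"}}
map_score = mean_average_precision(runs, judgments)
```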
Registry | Description | Notes |
---|---|---|
Re3Data | Registry of research data repositories | Started from Databib, crowd-sourced. Metadata is too general for search; user feedback "precision is horrible"; not based on natural language |
Biosharing.org | Registry of databases and policies for life/environmental/bio sciences | Schema based on BioDBCore: http://biocuration.org/community/standards-biodbcore/ Data is not available, but will be. |
Cinergi | Community Inventory of EarthCube Resources for Geosciences Interoperability | Curated database of geoscience information resources |
OpenAIRE | OpenAIRE data provider search | Publishes guidelines for data archives |
LA Referencia | | |
bioCADDIE | Data discovery index | Index of data "do for data what pubmed did for literature" |
OpenDOAR | Directory of open-access repositories | |
SHARE | Index of research activities/outputs including data management plans, grant proposals, preprints, presentations, and data repository deposits | |
Publishers, funding agencies, research/domain organizations (e.g., AGU, ACM), and libraries often provide lists of recommended or supported repositories for depositing research data. Their motivations and requirements often differ, but the lists themselves might serve as the basis for our analysis. We can review these (and other) lists to determine the factors in recommending data repositories to researchers.
(This list is not exhaustive – it's likely that many publishers, agencies, and organizations will provide similar lists):
Source | Repository list | Notes |
---|---|---|
NIH | https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html | Note that the Biosharing database already includes information about whether a repository is recommended by a funding agency. |
Elsevier | Public repositories to store and find data (Data in Brief) | |
Nature | | |
PLOS | http://blogs.plos.org/everyone/2015/07/02/plos-recommended-data-repositories/ http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories | |
Libraries | | |
Other | http://www.ijdc.net/index.php/ijdc/article/viewFile/9.1.152/349 http://www.rdc-drc.ca/wp-content/uploads/Review-of-Research-Data-Repositories-2015.pdf AGU: http://publications.agu.org/files/2014/06/Data-Repositories.pdf | |
See also and the actual
A primary function of the SEAD Publication API (C3PR) is to match or recommend a repository given a research data object based on a set of technical requirements implemented as rules:
SEAD 2.0 introduces the "publish" workflow. The user selects "publish" and the "live object" is copied to a staging area as a "curation object". The user can modify the curation object by adding or removing files, metadata, etc. There can be many curation objects for a live object. During the publish workflow, SEAD/Clowder represents the curation object as an ORE map and sends a request to the C3PR service. The C3PR service matches the ORE map to available repositories based on a set of rules/criteria. The user is presented with a list of repositories ranked by how well each matches the ORE map, and can opt to publish to any listed repository.
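The matching step described above can be pictured as rule-based scoring. The following is a rough sketch only, not the C3PR API: the `CurationObject` and `Repository` classes, their fields, and the two rules (size limit and accepted formats) are hypothetical stand-ins for whatever the ORE map and repository profiles actually contain.

```python
from dataclasses import dataclass, field


@dataclass
class CurationObject:
    # Hypothetical summary of what an ORE map might describe.
    total_bytes: int
    formats: set = field(default_factory=set)


@dataclass
class Repository:
    # Hypothetical repository profile with technical restrictions.
    name: str
    max_bytes: int
    accepted_formats: set = field(default_factory=set)  # empty set = any format


def size_rule(obj, repo):
    """The object must fit within the repository's size limit."""
    return obj.total_bytes <= repo.max_bytes


def format_rule(obj, repo):
    """Every file format in the object must be accepted by the repository."""
    return not repo.accepted_formats or obj.formats <= repo.accepted_formats


RULES = [size_rule, format_rule]


def match_repositories(obj, repositories):
    """Return (rules satisfied, repository name) pairs, best match first."""
    scored = [(sum(rule(obj, repo) for rule in RULES), repo.name) for repo in repositories]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


GB = 1024 ** 3
obj = CurationObject(total_bytes=5 * GB, formats={"csv", "nc"})
repos = [
    Repository("repo-a", max_bytes=10 * GB),
    Repository("repo-b", max_bytes=1 * GB, accepted_formats={"csv"}),
    Repository("repo-c", max_bytes=10 * GB, accepted_formats={"csv"}),
]
matches = match_repositories(obj, repos)
```

Counting satisfied rules is only one way to produce a best-match ordering; weighting rules or treating some as hard filters would be equally plausible designs.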
See also:
C3PR API server
Plale et al. (2013). SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science.
What other sources of information might we include in a recommender service?
Use cases
Researchers have data but, for various reasons, don't know where to put it.
User | Situation |
---|---|
No community repository | The researcher is in a community without a repository |
Doesn't fit neatly | A researcher is becoming interdisciplinary, moving to a new discipline, or has data they think might be useful for other disciplines |
Novice/lazy | New researcher not aware of existing resources (note: most advice would come from social media, conferences, training) |
Reviewing the above publisher lists and registries, we can identify factors in the recommendation of repositories to researchers:
Factor | Description |
---|---|
Funding agency approval | Funding agencies (e.g. NIH) have lists of approved repositories |
Researcher communities | Some repositories restrict to researchers in certain communities |
Publisher integration | Publishers (e.g., Elsevier) have arrangements with repositories (e.g., bi-directional linking) |
Domain/Field | Repositories are often restricted by domain, with some generalist services |
Technical restrictions | Repositories have technical restrictions (e.g., maximum file size, supported formats) |
Community mandates | Some research communities have mandated repositories (see Nature list) |
Data type | Does the repository take the data you want to deposit? Some repositories are restricted to specific types of data, and these criteria vary. Data types are often directly related to domain/field of study. |
Metadata format | Some repositories are restricted to specific types of metadata (e.g., MIAME) |
Licensing | Free and unrestricted use or public domain (PLOS) |
Best practices | Repository adheres to best practices pertaining to responsible data sharing, digital preservation, citation, and openness (PLOS) |
Additional factors from the DCC:
Publishers, funding agencies, and libraries construct these lists of approved repositories to meet the needs of researchers. Many of these sites now link to centralized services, such as re3data.org. However, re3data.org does not capture all of the information needed to make a recommendation (e.g., C3PR technical restrictions).
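One way to picture closing this gap is a merged record per repository that combines a registry entry with the funder and publisher lists. The sketch below assumes a very simple input shape (registry entries as dictionaries, endorsements as source/name pairs) and matches repositories by normalized name; the field names and the example repositories are hypothetical and do not reflect the re3data schema.

```python
def normalize(name):
    """Crude matching key; real data would need proper identifier reconciliation."""
    return name.strip().lower()


def merge_sources(registry_entries, endorsements):
    """Combine registry metadata with recommendation lists.

    registry_entries: list of dicts, each with a "name" plus arbitrary metadata
    endorsements:     list of (source, repository name) pairs, e.g. from a
                      funder or publisher list
    """
    merged = {}
    for entry in registry_entries:
        record = merged.setdefault(
            normalize(entry["name"]),
            {"name": entry["name"], "recommended_by": set()},
        )
        for key, value in entry.items():
            if key != "name":
                record.setdefault(key, value)  # first source wins on conflicts
    for source, repo_name in endorsements:
        record = merged.setdefault(
            normalize(repo_name),
            {"name": repo_name, "recommended_by": set()},
        )
        record["recommended_by"].add(source)
    return merged


# Hypothetical inputs, for illustration only.
registry_entries = [
    {"name": "Repo A", "subjects": ["life sciences"], "max_file_gb": 10},
    {"name": "Repo B", "subjects": ["general"]},
]
endorsements = [("Funder X", "repo a"), ("Publisher Y", "Repo A"), ("Publisher Y", "Repo C")]
merged = merge_sources(registry_entries, endorsements)
```

Here "Repo C" appears only on a publisher list, so it gets a record with no registry metadata, which is itself useful information when assessing coverage of the registries.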
Elsevier. Supported Data Repositories.
Myers, Jim. (2016). SEAD 2.0 Publication API Walkthrough.
Nature. Availability of data and materials.
PLOS ONE. Data availability.
UI RDS. Saving and Sharing your Data.
Whyte, A. (2015). ‘Where to keep research data: DCC checklist for evaluating data repositories’ v.1.1 Edinburgh: Digital Curation Centre.