Data Repository Recommender

This page is intended to capture information related to NDS-211.

Background

Registries of Research Data Repositories

There are (at least) two major registries of research data repositories, re3data and BioSharing. Publishers and funding agencies often direct researchers to search for repositories using these and related tools:

- http://www.re3data.org/
- https://biosharing.org/
  - The schema is based on BioDBCore: http://biocuration.org/community/standards-biodbcore/
  - License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
  - See also:
    - BioSharing: curated and crowd-sourced metadata standards, databases and data policies in the life sciences
- http://cinergi.sdsc.edu/ (used by EarthCube, includes re3data?)
- http://www.share-research.org/
- https://www.openaire.eu/
- http://lareferencia.redclara.net/rfr/

Publishers refer to both re3data and BioSharing in their lists of recommended repositories, but both services appear to be intended for librarians, curators, publishers, and funding agencies rather than the average researcher. The re3data registry data is easily available for download and could be incorporated into our system. It's not clear whether the BioSharing data is available for download (technically, it could be crawled).

Question: How is our recommender different from these systems? What need are we meeting that these systems don't meet?

Approved and Recommended Repositories

Publishers, funding agencies, research/domain organizations (e.g., AGU, ACM), and libraries often provide lists of recommended or supported repositories for depositing research data. Their motivations and requirements often differ, but the lists themselves might serve as the basis for our analysis. We can review these (and other) lists to determine the factors that go into recommending data repositories to researchers.

(This list is not exhaustive – it's likely that many publishers, agencies, and organizations will provide similar lists):

NIH:

- https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

Elsevier:

Nature

PLOS

- http://blogs.plos.org/everyone/2015/07/02/plos-recommended-data-repositories/
- http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories

Libraries

- https://library.uoregon.edu/datamanagement/sharingdata.html
- http://www.library.cmu.edu/datapub/dms/respositories

Other

Note that the BioSharing database already includes information about whether a repository is recommended by a funding agency.

SEAD C3PR/Matchmaker

The SEAD Matchmaker is used to pair datasets to repositories. The matchmaker is part of the SEAD 2.0 C3PR service (https://github.com/Data-to-Insight-Center/sead2/tree/master/sead-matchmaker). Repositories can register with C3PR, providing information including accepted data types, maximum collection depth, maximum dataset size, minimum metadata fields, affiliations, and global identifier requirements.

SEAD 2.0 introduces the "publish" workflow. The user selects "publish" and the "live object" is copied to a staging area as a "curation object". The user can modify the curation object (adding or removing files, metadata, etc.). There can be many curation objects for a live object. During the publish workflow, SEAD/Clowder represents the curation object as an ORE map and sends a request to the C3PR service. The C3PR service matches the ORE map to available repositories based on a set of rules/criteria. The user is presented with a list of repositories ranked by how well each matches the ORE map, and can opt to publish to any listed repository.
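
The matching step can be pictured with a small sketch. This is not the SEAD Matchmaker's code or rule set: the class names, fields, and scoring below are hypothetical, chosen only to mirror the registration criteria listed above (accepted data types, maximum dataset size, minimum metadata fields, affiliations).

```python
from dataclasses import dataclass, field


@dataclass
class RepositoryProfile:
    """Hypothetical registration record; field names are illustrative only."""
    name: str
    accepted_types: set
    max_dataset_bytes: int
    required_metadata: set
    affiliations: set = field(default_factory=set)


@dataclass
class CurationObject:
    """Hypothetical summary of what a matcher might read from an ORE map."""
    file_types: set
    total_bytes: int
    metadata_fields: set
    affiliation: str = ""


def match(obj, repos):
    """Return (repository name, score) pairs, best match first.

    Hard rules (type, size, metadata) exclude a repository outright;
    a shared affiliation only raises the score.
    """
    ranked = []
    for repo in repos:
        if not obj.file_types <= repo.accepted_types:
            continue  # contains a data type the repository does not accept
        if obj.total_bytes > repo.max_dataset_bytes:
            continue  # dataset is too large
        if not repo.required_metadata <= obj.metadata_fields:
            continue  # minimum metadata fields are missing
        score = 1 + (1 if obj.affiliation in repo.affiliations else 0)
        ranked.append((repo.name, score))
    # Sort by descending score, then by name for a stable order.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Placeholder repositories and dataset, for illustration only.
repos = [
    RepositoryProfile("Repo A", {"csv", "pdf"}, 10**9, {"title", "creator"}),
    RepositoryProfile("Repo B", {"csv"}, 10**10, {"title"}, {"Example University"}),
    RepositoryProfile("Repo C", {"netcdf"}, 10**12, {"title"}),
]
obj = CurationObject({"csv"}, 5 * 10**8, {"title", "creator"}, "Example University")
ranking = match(obj, repos)
print(ranking)
```

Treating some criteria as hard filters and others as score adjustments is one possible design; the actual C3PR rules may weigh these criteria differently.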

Other sources of information:

What other sources of information might we include in a recommender service?

- Researcher identifiers, such as ORCID (a persistent digital identifier for researchers): these might be helpful in collecting researcher profile information that can be used for recommendation.
- Journal/publication information: We can relate specific journals to data repositories. If the user is publishing in a specific journal, we can recommend where to put the data.
- Abstract: Use text matching techniques to match an abstract to a repository.
- DataCite: https://www.datacite.org/
- BrownDog: Can we use information from extractors to identify criteria for recommendation?
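
As a rough illustration of the abstract-matching idea, the sketch below ranks repositories by bag-of-words cosine similarity between an abstract and a short description of each repository. The repository names and descriptions are placeholders, not registry data, and the choice of cosine similarity is an assumption; this page does not commit to a particular text matching technique.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word counts; a real system would also drop stop words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_repositories(abstract, descriptions):
    """Rank repository names by similarity of their description to the abstract."""
    query = tokens(abstract)
    scores = {name: cosine(query, tokens(text)) for name, text in descriptions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder descriptions, not taken from any real registry entry.
descriptions = {
    "protein-archive": "three dimensional protein structures and nucleic acids",
    "ocean-archive": "oceanographic observations temperature salinity cruise data",
}
abstract = "We report crystal structures of a membrane protein at high resolution."
ranked = rank_repositories(abstract, descriptions)
print(ranked[0][0])
```

Repository descriptions in registries tend to be short, so a production version would probably need richer text per repository (for example, harvested dataset titles) and term weighting such as TF-IDF.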

Harvesting information

Many of the data repositories are crawlable or implement standard APIs (e.g., OAI-PMH) for harvesting metadata. It might be interesting to consider whether we can harvest descriptive metadata, particularly citation information, and use journal or other publication metadata as part of the recommendation process.
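
A minimal sketch of what such harvesting might look like, assuming a repository exposes an OAI-PMH endpoint that serves the standard `oai_dc` (Dublin Core) format. The endpoint URL and the sample record are placeholders, and `list_records_url` and `parse_records` are hypothetical helper names. The example parses an inline sample response rather than making a network request.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract identifier, titles, and dc:relation values from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
            "relations": [e.text for e in record.iterfind(".//dc:relation", NS)],
        })
    return records


# Placeholder response; the identifier and DOI are made up for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:relation>doi:10.0000/example</dc:relation>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")
records = parse_records(SAMPLE)
print(url)
print(records)
```

Whether a repository records related publications in `dc:relation` (or anywhere at all) varies, so this field choice is an assumption to verify per repository. A real harvester would also need to follow the `resumptionToken` that OAI-PMH uses to page through large result sets.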

Analysis

Reviewing the above publisher lists and registries, we can identify factors in the recommendation of repositories to researchers:

Factor	Description
Funding agency approval	Funding agencies (e.g. NIH) have lists of approved repositories
Researcher communities	Some repositories restrict to researchers in certain communities
Publisher integration	Publishers (e.g., Elsevier) have arrangements with repositories (e.g., bi-directional linking)
Domain	Repositories are often restricted by domain, with some generalist services
Technical restrictions	Repositories have technical restrictions (e.g., maximum file size, supported formats)
Community mandates	Some research communities have mandated repositories (see Nature list)
Data type	Some repositories are restricted to specific types of data. These criteria vary, for example: protein structures; human or non-human derived data; phenotypes. Data types are often directly related to domain/field of study.
Metadata format	Some repositories are restricted to specific types of metadata (e.g., MIAME)

Publishers, funding agencies, and libraries construct these lists of approved repositories to meet the needs of researchers. Many of these sites now link to centralized services, such as re3data.org. However, re3data.org does not capture all of the information needed to make a recommendation (e.g., technical restrictions).

Use cases

Q. Who are the users? While the re3data and BioSharing sites seem targeted at experts, perhaps our service is targeted at the novice researcher?

For example:

A researcher in the area of information retrieval has code and data to deposit related to a recent publication. How do they determine where to publish the data?
- What does the publisher require? JASIST, TOIS/ACM
- What does the funding agency require? NSF
- What does the community generally do?
- Where have I or my collaborators previously published data?
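
One way to picture how the answers to these questions could be combined is a weighted tally across sources, sketched below. The weights, source kinds, and repository names are all hypothetical placeholders; how much each source should count is an open question for this project.

```python
from collections import defaultdict

# Hypothetical weights; how much each source should count is undecided.
WEIGHTS = {"publisher": 3, "funder": 3, "community": 2, "previous": 1}


def recommend(sources):
    """Combine per-source repository lists into one ranked list.

    `sources` maps a source kind (a key of WEIGHTS) to the repositories
    that source recommends. Each result carries the reasons behind its score.
    """
    scores = defaultdict(int)
    reasons = defaultdict(list)
    for kind, repositories in sources.items():
        for repo in repositories:
            scores[repo] += WEIGHTS[kind]
            reasons[repo].append(kind)
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda repo: (-scores[repo], repo))
    return [(repo, scores[repo], reasons[repo]) for repo in ranked]


# Placeholder repository names, not real recommendations.
sources = {
    "publisher": ["general-repo", "code-archive"],
    "funder": ["general-repo"],
    "community": ["ir-test-collections"],
    "previous": ["code-archive"],
}
result = recommend(sources)
for repo, score, why in result:
    print(repo, score, why)
```

Returning the reasons alongside each score would let the service explain a recommendation ("recommended by your publisher and funder"), which may matter for novice researchers.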

Draft Questions for RDS

Do researchers come to you looking for places to put their data?
Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
What are some of the tools you recommend and how well do they meet the needs of the researcher?
Do you have any ideas of tools or services that could help you/them better?
We’re thinking of this service (describe current vision of recommender), what do you think? Would it be useful?
Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
Is there anyone else working in this space that you think we should talk to?
Of those that come to you, do you have some estimate of the percentage of those that eventually do find a place to put their data?
