This page is intended to capture information related to NDS-211.
Overview
The goal of this project is to develop a general-purpose research data repository "recommender" service to be hosted by the NDS. The basic use case is a researcher who has data to deposit but does not know where to put it. A few possible use cases:
- There is no existing community repository
- The data doesn't fit the researcher's usual repository – for example, a researcher working in a new interdisciplinary space, or one who has data they believe might be useful to another community.
- Novice or "lazy" user – however, most advice from these users will come from social media, conferences, and training.
There are several existing services in this space, including the Registry of Research Data Repositories (re3data), Biosharing.org, and the SEAD C3PR service. Informal discussion with the U of I Research Data Service yielded the following recommendation:
"Deposition of data into a web-accessible repository is generally the preferred mechanism for public data sharing because it ensures wide-spread and consistent access to the data. If your discipline already has a trusted repository, we recommend you deposit where your community knows to look. To find a repository, re3data.org is a large, vetted, and searchable catalog of data repositories. If no discipline-specific repository exists, there are several options, including Illinois’ IDEALS repository (free) and other general-purpose repositories like DataDryad (fee-based)."
In addition to these existing registries of research data repositories, funding agencies and publishers provide lists of recommended repositories.
To be useful, the NDS repository recommender must differentiate itself from these existing tools and services. For example:
- Improved search over Re3Data through the use of priors (e.g., "trustworthiness" or some sort of impact factor)
- Accounting for user motivations (funding agency requirements, publisher requirements, data size) through guided search
Background
What tools already exist in this space?
Registries of Research Data Repositories
Registry | Description | Notes |
---|---|---|
Re3Data | Registry of research data repositories | Started from Databib, crowd-sourced. Metadata is too general for search; user feedback "precision is horrible"; not based on natural language |
Biosharing.org | Registry of databases and policies for life/environmental/bio sciences | Schema based on BioDBCore: http://biocuration.org/community/standards-biodbcore/ Data is not available, but will be. |
Cinergi | Community Inventory of EarthCube Resources for Geosciences Interoperability | Curated database of geoscience information resources |
OpenAIRE | OpenAIRE data provider search | Publishes guidelines for data archives |
LA Referencia | ||
bioCADDIE | Data discovery index | Index of data "do for data what pubmed did for literature" |
OpenDOAR | Directory of open-access repositories | |
SHARE | Index of research activities/outputs including data management plans, grant proposals, preprints, presentations, and data repository deposits | |
Publishers refer to both re3data and Biosharing in their lists of recommended repositories, but both services appear to be intended for librarians, curators, publishers, and funding agencies rather than the average researcher. The re3data records are easily available for download and could be incorporated into our system. It's not clear whether the Biosharing data is available (technically, it could be crawled).
Question: How is our recommender different than these systems? What need are we meeting that these systems don't meet?
Approved and Recommended Repositories
Publishers, funding agencies, research/domain organizations (e.g., AGU, ACM), and libraries often provide lists of recommended or supported repositories for depositing research data. The motivations and requirements are often different, but the lists themselves might serve as the basis for our analysis. We can review these (and other) lists to determine the factors in recommending data repositories to researchers.
(This list is not exhaustive – it's likely that many publishers, agencies, and organizations will provide similar lists):
Source | URL | Notes |
---|---|---|
NIH | https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html | Note that the Biosharing database already includes information about whether a repository is recommended by a funding agency. |
Elsevier | | |
Nature | http://www.nature.com/authors/policies/availability.html | |
PLOS | http://blogs.plos.org/everyone/2015/07/02/plos-recommended-data-repositories/ http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories | |
Libraries | | |
Other | http://www.ijdc.net/index.php/ijdc/article/viewFile/9.1.152/349 http://www.rdc-drc.ca/wp-content/uploads/Review-of-Research-Data-Repositories-2015.pdf AGU: http://publications.agu.org/files/2014/06/Data-Repositories.pdf | |
SEAD Publication API
A primary function of the SEAD Publication API (C3PR) is to match or recommend a repository given a research data object based on a set of technical requirements implemented as rules:
- Maximum dataset size
- Purpose
- Organization match
- Acceptable data types (based on mimetypes)
- Minimal required metadata
- Maximum total size
- Maximum collection depth
- Rights-holder requirements
SEAD 2.0 introduces the "publish" workflow. The user selects "publish" and the "live object" is copied to a staging area as a "curation object". The user can modify the curation object – adding or removing files, metadata, etc. There can be many curation objects for one live object. During the publish workflow, SEAD/Clowder represents the curation object as an OAI-ORE map and sends a request to the C3PR service. C3PR matches the ORE map to available repositories based on a set of rules/criteria, and the user is presented with a ranked list of repositories based on the best match against the ORE map. The user can opt to publish to any listed repository.
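The C3PR-style matching described above could be prototyped roughly as follows. This is a minimal sketch, not the actual C3PR implementation: the repository profiles, field names, and thresholds are invented for illustration, and only a few of the rules listed earlier (size, mimetypes, depth, required metadata) are modeled.

```python
# Sketch of rule-based repository matching: each repository profile declares
# hard constraints; a dataset is kept only if it satisfies all of them.
from dataclasses import dataclass, field

@dataclass
class RepoProfile:
    name: str
    max_dataset_bytes: int
    accepted_mimetypes: set = field(default_factory=set)  # empty = accept all
    max_collection_depth: int = 10
    required_metadata: set = field(default_factory=set)

def matches(profile, dataset):
    """Return True if the dataset satisfies every hard constraint."""
    if dataset["total_bytes"] > profile.max_dataset_bytes:
        return False
    if profile.accepted_mimetypes and not set(dataset["mimetypes"]) <= profile.accepted_mimetypes:
        return False
    if dataset["depth"] > profile.max_collection_depth:
        return False
    return profile.required_metadata <= set(dataset["metadata"])

repos = [
    RepoProfile("Generalist", max_dataset_bytes=10**12),
    RepoProfile("ImageArchive", max_dataset_bytes=10**9,
                accepted_mimetypes={"image/tiff", "image/png"},
                required_metadata={"title", "creator"}),
]
dataset = {"total_bytes": 5 * 10**8, "mimetypes": ["image/tiff"],
           "depth": 2, "metadata": {"title", "creator", "abstract"}}
candidates = [r.name for r in repos if matches(r, dataset)]
print(candidates)  # → ['Generalist', 'ImageArchive']
```

A real service would additionally score the surviving candidates (the ranked list the user sees) rather than just filtering.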
See also:
- C3PR API server
- Workflow of Operating Your Repository with SEAD
- Plale et al. (2013). SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science
- Git repository: https://github.com/Data-to-Insight-Center/sead2
Other sources of information:
What other sources of information might we include in a recommender service?
- Researcher identifiers, such as ORCID (a persistent digital identifier for researchers): these might be helpful in collecting researcher profile information that can be used for recommendation.
- Journal/publication information: We can relate specific journals to data repositories. If the user is publishing in a specific journal, we can recommend where to put the data.
- Abstract: Use text matching techniques to match an abstract to a repository.
- DataCite (https://www.datacite.org/): DOI registration and metadata for research datasets
- BrownDog: Can we use information from extractors to identify criteria for recommendation?
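The abstract-matching idea above could start as simple term-overlap scoring between the abstract and repository descriptions. A rough sketch (the repository descriptions and abstract below are invented; a real version would use re3data/Biosharing metadata and a proper IR model such as TF-IDF or embeddings):

```python
# Match an abstract to repository descriptions via cosine similarity
# over raw word counts (bag-of-words).
import math
from collections import Counter

def vec(text):
    """Word-count vector for a text (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

repo_descriptions = {
    "GenBank": "annotated collection of dna sequences genomics",
    "PANGAEA": "earth environmental science georeferenced data",
}
abstract = "we sequenced dna from soil samples for genomics analysis"
scores = {name: cosine(vec(abstract), vec(desc))
          for name, desc in repo_descriptions.items()}
best = max(scores, key=scores.get)
print(best)  # → GenBank
```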
Harvesting information
- Many data repositories are crawlable or implement standard APIs (e.g., OAI-PMH) for harvesting metadata. It might be interesting to consider whether we can harvest descriptive metadata – particularly citation information – and use journal or other publication metadata as part of the recommendation process.
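Extracting citation-like metadata from an OAI-PMH feed might look like the sketch below. To keep it self-contained it parses a canned ListRecords response; a real harvester would fetch the feed over HTTP and follow resumptionTokens. The sample record contents are invented.

```python
# Parse Dublin Core records from an OAI-PMH ListRecords response and pull
# out (title, relation) pairs; dc:relation often carries the link back to
# an associated journal article.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Soil respiration dataset</dc:title>
        <dc:relation>doi:10.0000/example-article</dc:relation>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

def extract_citations(xml_text):
    """Collect (title, relation) pairs from each record in the response."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(OAI + "record"):
        title = rec.find(".//" + DC + "title")
        rel = rec.find(".//" + DC + "relation")
        out.append((title.text if title is not None else None,
                    rel.text if rel is not None else None))
    return out

print(extract_citations(SAMPLE))  # → [('Soil respiration dataset', 'doi:10.0000/example-article')]
```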
What would make the existing tools better?
- Natural language search
- Ranking based on different characteristics
- Does it support my requirements (identifiers, metadata formats, etc.)?
- Is it trusted (sustainability/certification)? How long is the commitment?
- Repository "impact factor"
- Additional value adds (curatorial, linked)
- Specialized vs. general
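Ranking by characteristics such as trust or an "impact factor" (the priors mentioned in the overview) could start as a simple weighted linear combination. A hedged sketch – the characteristic names, weights, and scores below are all invented:

```python
# Rank candidate repositories by a weighted sum of normalized
# characteristic scores; higher is better.
def rank(repos, weights):
    def score(repo):
        return sum(weights[k] * repo.get(k, 0.0) for k in weights)
    return sorted(repos, key=score, reverse=True)

weights = {"trust": 0.5, "domain_fit": 0.3, "impact": 0.2}
repos = [
    {"name": "RepoA", "trust": 0.9, "domain_fit": 0.2, "impact": 0.5},
    {"name": "RepoB", "trust": 0.6, "domain_fit": 0.9, "impact": 0.4},
]
ranked = [r["name"] for r in rank(repos, weights)]
print(ranked)  # → ['RepoB', 'RepoA']
```

How the weights are chosen (fixed editorially, learned from user feedback, or adjusted per user motivation) is exactly the open design question for the recommender.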
Analysis
Reviewing the above publisher lists and registries, we can identify factors in the recommendation of repositories to researchers:
Factor | Description |
---|---|
Funding agency approval | Funding agencies (e.g. NIH) have lists of approved repositories |
Researcher communities | Some repositories restrict to researchers in certain communities |
Publisher integration | Publishers (e.g., Elsevier) have arrangements with repositories (e.g., bi-directional linking) |
Domain | Repositories are often restricted by domain, with some generalist services |
Technical restrictions | Repositories have technical restrictions (e.g., maximum file size, supported formats) |
Community mandates | Some research communities have mandated repositories (see Nature list) |
Data type | Some repositories are restricted to specific types of data; these criteria vary. Data types are often directly related to domain/field of study. |
Metadata format | Some repositories are restricted to specific types of metadata (e.g., MIAME) |
Publishers, funding agencies, and libraries construct these lists of approved repositories to meet the needs of researchers. Many of these sites now link to centralized services, such as re3data.org. However, re3data.org does not capture all of the information needed to make a recommendation (e.g., technical restrictions).
Use cases
Who are the users?
Researchers who have data and don't know where to put it, for various reasons.
User | Situation |
---|---|
No community repository | The researcher is in a community without a repository |
Doesn't fit neatly | A researcher is becoming interdisciplinary, moving to a new discipline, or has data they think might be useful for other disciplines |
Novice/lazy | A new researcher not aware of existing resources (note: most advice would come from social media, conferences, and training) |
What are their motivations?
- Responding to a request from a funding agency; might need specific characteristics (DOIs, linking, etc.)
- Has very large data (the university can't handle it, domain repositories can't handle it)
- Has specific availability requirements (5 years, 10 years)
- Has really complicated data (a lot of contextual information – does the service support it?)
- Sharing: not responding to a regulatory requirement, just wants to make things available for reuse
Use cases
Q. Who are the users? While the re3data and biosharing sites seem more targeted at experts, perhaps our service is targeted at the novice researcher?
For example:
- A researcher in the area of information retrieval has code and data to deposit related to a recent publication. How do they determine where to publish the data?
- What does the publisher require? JASIST, TOIS/ACM
- What does the funding agency require? NSF
- What does the community generally do?
- Where have I or my collaborators previously published data?
Draft Questions for RDS
- Do researchers come to you looking for places to put their data?
- Of those that come to you, do you have some estimate of the percentage of those that eventually do find a place to put their data?
- Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
- Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
- What are some of the tools you recommend and how well do they meet the needs of the researcher?
- Do you have any ideas of tools or services that could help you/them better?
- We’re thinking of this service (describe current vision of recommender), what do you think? Would it be useful?
- Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
- Is there anyone else working in this space that you think we should talk to?