...

The goal of this project is to develop a general-purpose research data repository "recommender" service to be hosted by the NDS. The basic use case is very broad: a researcher has data that they want to deposit, but they don't know where to put it. A few possible use cases:

  • There is no existing community repository
  • The data doesn't fit the researcher's usual repository. For example, someone who is working in a new interdisciplinary space or who has data they believe might be useful to another community.
  • Novice or "lazy" user – however, most advice for these users will come from social media, conferences, and training.

...

 

How is this problem currently addressed? We can find a few cases in the wild:

Service | Data Repository Recommendation
U of I Research Data Service | ... "Deposition of data into a web-accessible repository is generally the preferred mechanism for public data sharing because it ensures wide-spread and consistent access to the data.  If your discipline already has a trusted repository, we recommend you deposit where your community knows to look.  To find a repository, re3data.org is a large, vetted, and searchable catalog of data repositories.  If no discipline-specific repository exists, there are several options, including Illinois’ IDEALS repository (free) and other general-purpose repositories like DataDryad (fee-based)."
Elsevier | List of supported data repositories
Nature | Data availability policy: "Supporting data must be made available to editors and peer-reviewers at the time of submission for the purposes of evaluating the manuscript...For information about suitable public repositories, see sections that follow."
PLOS | PLOS Data Repository Recommendation Guide: "PLOS has identified a set of established repositories below, which are recognized and trusted within their respective communities. Additionally, the Registry of Research Data Repositories (Re3Data) is a full scale resource of registered repositories across subject areas."

...

 

A researcher at the U of I looking for a repository to publish their data has several options: select a field-specific repository from curated lists based on funding agency or publisher requirements, search re3data.org, or use their local institutional repository.

There are several existing services in this space, including the Registry of Research Data Repositories (RE3Data), Biosharing.org, and the SEAD C3PR service. In addition to these existing registries of research data repositories, funding agencies and publishers provide lists of recommended repositories.

To be useful, the NDS repository recommender must differentiate itself from these existing tools and services. For example:

  • Improved search over Re3Data through the use of priors (e.g., "trustworthiness" or some sort of impact factor)
  • Accounting for user motivations (funding agency requirements, publisher requirements, data size) through guided search
  • Suitable for use by publishers (via API or otherwise)

Recommender?

Is it really a "recommender"? Broadly speaking, a "recommender system" attempts to predict the relevance of an item to a user based on information known about the user. This could be profile information, previous ratings or related activities.  It is more likely that this system will be a "search engine" in the sense that the user comes with an information need and is looking for a ranked list of candidate repositories. The information need might be a query or the dataset itself.
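To make the "search engine" framing concrete, here is a minimal sketch of ranking candidate repositories for a query, combining a simple text-match score with a per-repository prior of the kind mentioned earlier (e.g., "trustworthiness"). Everything in it is an assumption made for illustration: the repository names, descriptions, prior values, the `prior_weight` parameter, and the term-overlap scoring are all hypothetical, and a real system would likely use a proper retrieval model over registry metadata.

```python
# Hypothetical records: names, descriptions, and prior values are invented.
REPOSITORIES = [
    {"name": "repo-a", "description": "protein structures and sequence data", "prior": 0.9},
    {"name": "repo-b", "description": "general purpose research data of any type", "prior": 0.5},
    {"name": "repo-c", "description": "geoscience field observations and sensor data", "prior": 0.7},
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def text_score(query, description):
    """Fraction of distinct query terms that appear in the description."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(description))) / len(query_terms)


def rank(query, repositories, prior_weight=0.3):
    """Return repository names ordered by a weighted mix of relevance and prior."""
    scored = []
    for repo in repositories:
        relevance = text_score(query, repo["description"])
        score = (1 - prior_weight) * relevance + prior_weight * repo["prior"]
        scored.append((score, repo["name"]))
    # Highest score first; break ties by name so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


print(rank("protein structures", REPOSITORIES))
print(rank("research data", REPOSITORIES))
```

The linear mix keeps the prior from overwhelming the query match; how much weight a prior should get is an open design question for the project.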

Background

What tools already exist in this space?

Registries of Research Data Repositories

Registry | Description | Notes
Re3Data | Registry of research data repositories | Started from Databib, crowd-sourced. Metadata is too general for search; user feedback "precision is horrible"; not based on natural language
Biosharing.org | Registry of databases and policies for life/environmental/bio sciences | Schema based on BioDBCore: http://biocuration.org/community/standards-biodbcore/ ; Data is not available, but will be. BioSharing: curated and crowd-sourced metadata standards, databases and data policies in the life sciences
Cinergi | Community Inventory of EarthCube Resources for Geosciences Interoperability | Curated database of geoscience information resources
OpenAIRE | OpenAIRE data provider search | Publishes guidelines for data archives
LA Referencia | |
bioCADDIE | Data discovery index | Index of data "do for data what pubmed did for literature"
OpenDOAR | Directory of open-access repositories |
SHARE | Index of research activities/outputs including data management plans, grant proposals, preprints, presentations, and data repository deposits |

Publishers refer to both re3data and Biosharing in their lists of recommended repositories, but both services appear to be intended for librarians, curators, publishers, and funding agencies rather than the average researcher. The re3data data is readily available for download and could be incorporated into our system. It's not clear whether the Biosharing data is available (technically, it could be crawled).

...

 

Approved and Recommended Repositories 

...

(This list is not exhaustive – it's likely that many publishers, agencies, and organizations will provide similar lists):

   
NIH

https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

Note that the Biosharing database already includes information about whether a repository is recommended by a funding agency.

Elsevier

https://www.elsevier.com/?a=57755

Supported Data Repositories: https://www.elsevier.com/books-and-journals/content-innovation/data-base-linking/supported-data-repositories

Public repositories to store and find data (Data in Brief): http://www.journals.elsevier.com/data-in-brief/policies-and-guidelines/public-repositories-to-store-and-find-data

  • List of databases with bi-directional linking
Nature

Data policy: http://www.nature.com/authors/policies/availability.html

Recommended Data Repositories: http://www.nature.com/sdata/policies/repositories

Data Policies: Nature Scientific Data: http://www.nature.com/sdata/policies/data-policies

  • Includes mandates
  • Drawn from re3data and biosharing

 
PLOS

http://blogs.plos.org/everyone/2015/07/02/plos-recommended-data-repositories/

http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories

 
Libraries

https://library.uoregon.edu/datamanagement/sharingdata.html

http://www.library.cmu.edu/datapub/dms/respositories

 
Other

http://www.ijdc.net/index.php/ijdc/article/viewFile/9.1.152/349

http://www.rdc-drc.ca/wp-content/uploads/Review-of-Research-Data-Repositories-2015.pdf

AMS: https://www.ametsoc.org/ams/index.cfm/publications/authors/journal-and-bams-authors/journal-and-bams-authors-guide/data-archiving-and-citation/

AGU: http://publications.agu.org/files/2014/06/Data-Repositories.pdf

http://openarchaeologydata.metajnl.com/about/#repo

https://www.datacite.org/services/find-repository.html

 

 

SEAD Publication API

See also  and the actual 

...

Factor | Description
Funding agency approval | Funding agencies (e.g., NIH) have lists of approved repositories
Researcher communities | Some repositories restrict deposits to researchers in certain communities
Publisher integration | Publishers (e.g., Elsevier) have arrangements with repositories (e.g., bi-directional linking)
Domain/Field | Repositories are often restricted by domain, with some generalist services
Technical restrictions | Repositories have technical restrictions (e.g., maximum file size, supported formats)
Community mandates | Some research communities have mandated repositories (see Nature list)
Data type | Some repositories are restricted to specific types of data. These criteria vary, for example: protein structures; human or non-human derived; phenotypes. Data types are often directly related to domain/field of study.
Metadata format | Some repositories are restricted to specific types of metadata (e.g., MIAME)
Licensing | Free and unrestricted use or public domain (PLOS)
Best practices | Repository adheres to best practices pertaining to responsible data sharing, digital preservation, citation, and openness (PLOS)

 

Publishers, funding agencies, and libraries construct these lists of approved repositories to meet the needs of researchers. Many of these sites now link to centralized services, such as re3data.org. However, re3data.org does not capture all of the information needed to make a recommendation (e.g., C3PR technical restrictions).
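As a rough illustration of how a few of the factors listed above could be captured per repository and checked against a researcher's needs, consider the sketch below. It is not a proposed schema: the `RepositoryProfile` and `DepositNeed` classes, their field names, and the sample repositories and values are all hypothetical, and only three factors (domain, funding agency approval, maximum file size) are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class RepositoryProfile:
    """Hypothetical record covering a few of the factors discussed above."""
    name: str
    domains: Set[str] = field(default_factory=set)  # empty set means generalist
    approved_by_funders: Set[str] = field(default_factory=set)
    max_file_size_gb: Optional[float] = None  # None means no stated limit


@dataclass
class DepositNeed:
    """Hypothetical description of what the researcher wants to deposit."""
    domain: str
    funder: Optional[str] = None
    largest_file_gb: float = 0.0


def mismatches(repo: RepositoryProfile, need: DepositNeed) -> List[str]:
    """List the reasons a repository does not fit; an empty list means it fits."""
    reasons = []
    if repo.domains and need.domain not in repo.domains:
        reasons.append("domain not accepted")
    if need.funder and need.funder not in repo.approved_by_funders:
        reasons.append("not on funder's approved list")
    if repo.max_file_size_gb is not None and need.largest_file_gb > repo.max_file_size_gb:
        reasons.append("file exceeds size limit")
    return reasons


def candidates(repos: List[RepositoryProfile], need: DepositNeed) -> List[str]:
    """Names of repositories with no mismatches, in input order."""
    return [repo.name for repo in repos if not mismatches(repo, need)]


# Invented sample data for illustration only.
REPOS = [
    RepositoryProfile("domain-repo", {"structural biology"}, {"NIH"}, 2.0),
    RepositoryProfile("generalist-repo", set(), {"NIH", "NSF"}, 50.0),
]
NEED = DepositNeed(domain="information retrieval", funder="NSF", largest_file_gb=10.0)

print(candidates(REPOS, NEED))
print(mismatches(REPOS[0], NEED))
```

Returning the reasons for a mismatch, rather than only a yes/no answer, would let a guided search explain why a repository was excluded.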

Use cases

...

  • Responding to a request from a funding agency. Might need different characteristics (needs DOI, linking, etc.)
  • Has very large data (university can't handle it, domain repos can't handle it)
  • Has specific availability requirements (5 years, 10 years)
  • Has really complicated data (a lot of contextual information; does the service support it?)
  • Sharing – not responding to a regulatory requirement – just wants to make things available for reuse

Use cases

Q. Who are the users? While the re3data and biosharing sites seem more targeted at experts, perhaps our service is targeted at the novice researcher?

For example:

  • A researcher in the area of information retrieval has code and data to deposit related to a recent publication. How do they determine where to publish the data?
    • What does the publisher require? JASIST, TOIS/ACM
    • What does the funding agency require? NSF
    • What does the community generally do?
    • Where have I or my collaborators previously published data?

 

Draft Questions

...

  1. Do researchers come to you looking for places to put their data?
    1. Of those that come to you, do you have some estimate of the percentage of those that eventually do find a place to put their data?
  2. Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
  3. Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
  4. What are some of the tools you recommend and how well do they meet the needs of the researcher?
  5. Do you have any ideas of tools or services that could help you/them better?
  6. We’re thinking of this service (describe current vision of recommender), what do you think? Would it be useful?
  7. Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
  8. Is there anyone else working in this space that you think we should talk to?

...