...
How is this problem currently addressed? We can find a few cases in the wild:
Service | Data Repository Recommendation |
---|---|
U of I Research Data Service | "Deposition of data into a web-accessible repository is generally the preferred mechanism for public data sharing because it ensures wide-spread and consistent access to the data. If your discipline already has a trusted repository, we recommend you deposit where your community knows to look. To find a repository, re3data.org is a large, vetted, and searchable catalog of data repositories. If no discipline-specific repository exists, there are several options, including Illinois’ IDEALS repository (free) and other general-purpose repositories like DataDryad (fee-based)." |
Elsevier | List of supported data repositories |
Nature | "Supporting data must be made available to editors and peer-reviewers at the time of submission for the purposes of evaluating the manuscript...For information about suitable public repositories, see sections that follow." |
PLOS | PLOS Data Repository Recommendation Guide: "PLOS has identified a set of established repositories below, which are recognized and trusted within their respective communities. Additionally, the Registry of Research Data Repositories (Re3Data) is a full scale resource of registered repositories across subject areas." |
A researcher at the U of I looking for a repository in which to publish their data has several options: select a field-specific repository from the curated lists that funding agencies or publishers provide, search re3data.org, or use their local institutional repository.
...
Is it really a "recommender"? Broadly speaking, a "recommender system" attempts to predict the relevance of an item to a user based on information known about the user, such as profile information, previous ratings, or related activities. This system is more likely to be a "search engine" in the sense that the user comes with an information need and is looking for a ranked list of candidate repositories. The information need might be expressed as a query or as the dataset itself.
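To make the "search engine" framing concrete, here is a minimal sketch of ranking repository records against a free-text query with TF-IDF cosine similarity. The function names (`tokenize`, `rank_repositories`) and the three sample records are made up for illustration; they are not real re3data entries, and TF-IDF is only one of many possible retrieval models.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on anything that is not alphanumeric."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def rank_repositories(query, records):
    """Rank repositories by TF-IDF cosine similarity to the query.

    records maps a repository name to its descriptive text.
    Returns (name, score) pairs, best match first, ties broken by name.
    """
    docs = {name: Counter(tokenize(text)) for name, text in records.items()}
    n_docs = len(docs)

    # Document frequency: in how many records does each term occur?
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    def idf(term):
        # Smoothed inverse document frequency, always positive.
        return math.log((n_docs + 1) / (df[term] + 1)) + 1.0

    q_counts = Counter(tokenize(query))
    q_norm = math.sqrt(sum((c * idf(t)) ** 2 for t, c in q_counts.items()))

    scored = []
    for name, counts in docs.items():
        dot = sum(
            q_counts[t] * idf(t) * counts[t] * idf(t)
            for t in q_counts
            if t in counts
        )
        d_norm = math.sqrt(sum((c * idf(t)) ** 2 for t, c in counts.items()))
        score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
        scored.append((name, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical records, invented for this example only.
sample_records = {
    "Repo A": "genomic sequence data archive",
    "Repo B": "social science survey data",
    "Repo C": "general purpose repository for research data",
}
ranked = rank_repositories("genomic sequence", sample_records)
```

If the information need is the dataset itself rather than a typed query, the same function could be called with text extracted from the dataset (for example, its README or column names) in place of `query`.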
Broad vision
- Start with re3data and biosharing.org records as core
- Develop and test priors based on repository attributes
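One way to read "priors based on repository attributes" is as a query-independent weight that is combined with the query-dependent retrieval score. The sketch below assumes both quantities are positive numbers and combines them in log space, in the spirit of P(R|Q) ∝ P(Q|R)·P(R). The function name `combine_with_prior`, the `floor` parameter, and the example values are hypothetical.

```python
import math


def combine_with_prior(retrieval_scores, priors, floor=1e-9):
    """Combine per-repository retrieval scores with attribute-based priors.

    retrieval_scores and priors both map repository name -> positive number.
    Missing or non-positive values are replaced by `floor` so the log is defined.
    Returns (name, combined_log_score) pairs, best first, ties broken by name.
    """
    ranked = []
    for name, score in retrieval_scores.items():
        prior = priors.get(name, floor)
        combined = math.log(max(score, floor)) + math.log(max(prior, floor))
        ranked.append((name, combined))
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Invented numbers: "Repo A" matches the query slightly better,
# but "Repo B" has a much stronger prior.
example_scores = {"Repo A": 0.5, "Repo B": 0.4}
example_priors = {"Repo A": 0.1, "Repo B": 0.9}
reranked = combine_with_prior(example_scores, example_priors)
```

Testing a prior then amounts to checking whether the reranked list agrees with human judgments more often than the list ranked by retrieval score alone.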
Analysis
What tools already exist in this space?
...
- Do researchers come to you looking for places to put their data?
- Of the researchers who come to you, can you estimate what percentage eventually find a place to put their data?
- Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
- Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
- What are some of the tools you recommend and how well do they meet the needs of the researcher?
- Do you have any ideas of tools or services that could help you/them better?
- We’re thinking of this service (describe current vision of recommender). What do you think? Would it be useful?
- Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
- Is there anyone else working in this space that you think we should talk to?
Potential features:
- Retrieval score based on name, description, subject, information crawled from associated URLs, keywords, language, startDate, size
- URL format (e.g. presence of non-standard ports, path depth)
- Number of results in Google Scholar
- Amount of information in re3data (how complete is the record?)
- Number of policies
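Several of the features listed above can be computed directly from a repository record. The following is a minimal sketch of the URL-format and record-completeness features. It assumes a record is a plain dictionary; the field list and the `policies` key are illustrative only and do not reflect the actual re3data schema.

```python
from urllib.parse import urlsplit

# Default ports per scheme; any other explicit port counts as non-standard.
STANDARD_PORTS = {"http": 80, "https": 443}


def url_features(url):
    """Return simple URL-format features: non-standard port and path depth."""
    parts = urlsplit(url)
    port = parts.port  # None when the URL has no explicit port
    nonstandard = port is not None and port != STANDARD_PORTS.get(parts.scheme)
    depth = len([segment for segment in parts.path.split("/") if segment])
    return {"nonstandard_port": nonstandard, "path_depth": depth}


def record_completeness(record, fields):
    """Fraction of the given fields that are present and non-empty."""
    if not fields:
        return 0.0
    filled = sum(1 for f in fields if record.get(f) not in (None, "", [], {}))
    return filled / len(fields)


# Hypothetical record, invented for this example only.
example_record = {
    "name": "Repo X",
    "description": "",
    "subject": ["biology"],
    "url": "http://example.org:8080/data/repo/index.html",
    "policies": [],
}
features = url_features(example_record["url"])
features["completeness"] = record_completeness(
    example_record, ["name", "description", "subject", "policies"]
)
features["num_policies"] = len(example_record["policies"])
```

Features such as the number of Google Scholar results would need an external lookup and are left out of this sketch.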
Test collection:
- Find researchers with real datasets and have them identify the top repositories from re3data (possibly a future IRB study).
- For some subset of repositories, go find a dataset.
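Once a test collection of this kind exists, a ranked list from the system can be scored against the repositories a researcher identified. The sketch below uses two standard retrieval measures, precision at k and reciprocal rank. The choice of measures is an assumption rather than something fixed by the plan above, and the example ranking and judgments are invented.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked repositories that were judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for name in ranked[:k] if name in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant repository, or 0.0 if none is retrieved."""
    for position, name in enumerate(ranked, start=1):
        if name in relevant:
            return 1.0 / position
    return 0.0


# Invented example: the system's ranking and one researcher's judgments.
system_ranking = ["Repo A", "Repo B", "Repo C", "Repo D"]
judged_relevant = {"Repo B", "Repo D"}
p_at_2 = precision_at_k(system_ranking, judged_relevant, 2)
rr = reciprocal_rank(system_ranking, judged_relevant)
```

Averaging these values over all researchers or datasets in the collection gives a single number for comparing alternative feature sets or priors.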