This page is intended to capture information related to
...
- Use an existing search engine (e.g., Solr/Lucene) to index the re3data
- Create a test collection of datasets/queries/relevance judgements
- This can be done manually (find a set of researchers to give us a dataset and/or query and the repository they selected)
- This can be done automatically by sampling datasets from existing repositories and assuming that their source repositories are the "most relevant"
- Develop demonstration UI
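The automatic option above could be sketched roughly as follows. This is a minimal illustration, not a settled design: the function name `sample_test_collection`, the `datasets_by_repo` structure, and the toy data are all hypothetical, and it assumes each sampled dataset description serves as a query whose only relevant repository is the one it came from.

```python
import random

def sample_test_collection(datasets_by_repo, per_repo=2, seed=0):
    """Return (query_text, relevant_repo) pairs sampled from each repository."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    judgements = []
    for repo in sorted(datasets_by_repo):
        descriptions = datasets_by_repo[repo]
        k = min(per_repo, len(descriptions))  # small repositories contribute what they have
        for text in rng.sample(descriptions, k):
            # Assumption: the source repository is the single relevant answer.
            judgements.append((text, repo))
    return judgements

# Made-up data for illustration only.
datasets_by_repo = {
    "repo_a": [
        "soil moisture sensor readings",
        "stream gauge time series",
        "lidar elevation tiles",
    ],
    "repo_b": ["protein structure files"],
}
collection = sample_test_collection(datasets_by_repo, per_repo=2)
```

Treating the source repository as the only relevant one is a strong simplification, since a dataset may fit several repositories equally well; manual judgements from researchers would be needed to check it.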
The end product will be a search engine that merges re3data, Biosharing (if available), and funder and publisher lists, along with models of relevance.
Search Engine
We can use either a research-oriented (Indri/Galago/Terrier) or general-purpose (Lucene) search engine platform. The goal would be to identify features/characteristics of repositories that can be used to improve rankings, aside from basic language models.
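As a point of reference for the "basic language models" baseline, here is a small pure-Python sketch of query-likelihood ranking with Dirichlet smoothing. It is only an illustration under stated assumptions: the function names, the whitespace tokenizer, the smoothing value `mu`, and the two toy repository descriptions are hypothetical, and a real system would rely on the chosen platform (Indri, Galago, Terrier, or Lucene) rather than this code.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real engine would also stem and remove stopwords.
    return text.lower().split()

def rank_repositories(query, docs, mu=10.0):
    """Rank docs (name -> text) by Dirichlet-smoothed query log-likelihood."""
    doc_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    total = sum(collection.values())

    scores = {}
    for name, counts in doc_counts.items():
        doc_len = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            p_coll = collection[term] / total
            if p_coll == 0:
                continue  # term appears in no document, so it cannot discriminate
            score += math.log((counts[term] + mu * p_coll) / (doc_len + mu))
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

# Made-up repository descriptions for illustration only.
docs = {
    "geo_repo": "geoscience earth observation data archive",
    "bio_repo": "protein sequence and genomics data archive",
}
ranking = rank_repositories("protein genomics data", docs)
```

Repository-level features would then be combined with this text score, for example in a learned ranking function, which is where the proposed work goes beyond the baseline.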
Potential features:
- Retrieval score based on name, description, subject, keywords, and information crawled from associated URLs
- Language, startDate, size
- URL format (e.g., presence of non-standard ports, path depth)
- Number of results in Google Scholar
- Completeness of the re3data record (how much information does it contain?)
- Number of policies
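Some of the non-textual features listed above could be computed along these lines. This is a sketch under assumptions: the record is represented as a plain dictionary, and the field names in `FIELDS`, the `policies` and `url` keys, and the example record are hypothetical rather than taken from the re3data schema. The Google Scholar count is left out because it would require an external lookup.

```python
from urllib.parse import urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}
# Hypothetical field names used to estimate how complete a record is.
FIELDS = ("name", "description", "subject", "keywords", "language", "startDate", "size")

def url_features(url):
    """Return the port and path-depth features for a repository URL."""
    parsed = urlparse(url)
    port = parsed.port  # None when the URL does not specify a port
    nonstandard = port is not None and port != DEFAULT_PORTS.get(parsed.scheme)
    depth = len([segment for segment in parsed.path.split("/") if segment])
    return {"nonstandard_port": nonstandard, "path_depth": depth}

def record_features(record):
    """Combine completeness, policy count, and URL features for one record."""
    filled = sum(1 for field in FIELDS if record.get(field))
    features = {
        "completeness": filled / len(FIELDS),  # fraction of non-empty fields
        "num_policies": len(record.get("policies", [])),
    }
    features.update(url_features(record.get("url", "")))
    return features

# Made-up record for illustration only.
record = {
    "name": "Example Repo",
    "description": "A made-up repository",
    "subject": "",
    "keywords": ["soil"],
    "url": "http://example.org:8080/data/archive",
    "policies": ["access", "preservation"],
}
features = record_features(record)
```

Whether any of these signals actually correlates with repository quality or relevance is exactly what the test collection would need to establish.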
Test collection:
A key requirement is the ability to evaluate the retrieval model, which requires a suitable test collection. For NDSC6, we would only pilot this.
- Find researchers with real datasets and have them identify the top repositories from re3data (possible future IRB study)?
- For some subset of repositories, go find a dataset.
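Once a test collection exists, evaluation could use a standard ranking metric. The sketch below computes mean reciprocal rank (MRR), which suits the case where each query has one or a few known-good repositories. The choice of MRR is an assumption made for illustration, not something this page has decided, and the query identifiers and repository names are made up.

```python
def reciprocal_rank(ranking, relevant):
    """Return 1/position of the first relevant repository, or 0.0 if none is ranked."""
    for position, repo in enumerate(ranking, start=1):
        if repo in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs, judgements):
    """runs: query_id -> ranked repositories; judgements: query_id -> set of relevant repositories."""
    scores = [
        reciprocal_rank(runs.get(query_id, []), relevant)
        for query_id, relevant in judgements.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up judgements and system output for illustration only.
judgements = {"q1": {"repo_a"}, "q2": {"repo_b"}}
runs = {"q1": ["repo_a", "repo_c"], "q2": ["repo_c", "repo_b"]}
mrr = mean_reciprocal_rank(runs, judgements)  # (1/1 + 1/2) / 2 = 0.75
```

If researchers supply graded or multiple relevant repositories, a metric such as precision at k or nDCG may be a better fit than MRR.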
Background/Analysis
What tools already exist in this space?
Registries of Research Data Repositories
Registry | Description | Notes |
---|---|---|
Re3Data | Registry of research data repositories | Started from Databib, crowd-sourced. Metadata is too general for search; user feedback "precision is horrible"; not based on natural language |
Biosharing.org | Registry of databases and policies for life/environmental/bio sciences | Schema based on BioDBCore: http://biocuration.org/community/standards-biodbcore/ Data is not yet available, but will be. |
Cinergi | Community Inventory of EarthCube Resources for Geosciences Interoperability | Curated database of geoscience information resources |
OpenAIRE | OpenAIRE data provider search | Publishes guidelines for data archives |
LA Referencia | | |
bioCADDIE | Data discovery index | Index of data; aims to "do for data what PubMed did for literature" |
OpenDOAR | Directory of open-access repositories | |
SHARE | Index of research activities/outputs including data management plans, grant proposals, preprints, presentations, and data repository deposits | |
...
- Do researchers come to you looking for places to put their data?
- Of those that come to you, do you have some estimate of the percentage of those that eventually do find a place to put their data?
- Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
- Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
- What are some of the tools you recommend and how well do they meet the needs of the researcher?
- Do you have any ideas of tools or services that could help you/them better?
- We’re thinking of this service (describe current vision of recommender). What do you think? Would it be useful?
- Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
- Is there anyone else working in this space that you think we should talk to?
References
Elsevier. Supported Data Repositories.
Myers, Jim. (2016). SEAD 2.0 Publication API Walkthrough.
Nature. Availability of data and materials.
PLOS ONE. Data availability.
UI RDS. Saving and Sharing your Data.
Whyte, A. (2015). ‘Where to keep research data: DCC checklist for evaluating data repositories’ v.1.1. Edinburgh: Digital Curation Centre.