...
- Use an existing search engine (e.g., Solr/Lucene) to index the re3data records
- Create a test collection of datasets/queries/relevance judgements
- This can be done manually (recruit a set of researchers to give us a dataset and/or query along with the repository they selected)
- This can be done automatically by sampling datasets from existing repositories and assuming that these are the "most relevant"
- Develop demonstration UI
The end product will be a search engine that merges the re3data, biosharing (if available), funder, and publisher lists, along with models of relevance.
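Merging these lists mostly means deduplicating repositories that appear in more than one source. A minimal sketch in Python, assuming each source provides records as dicts with a homepage `url` field (the field names here are illustrative, not an actual re3data or biosharing schema):

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Reduce a homepage URL to a comparable key (scheme-, www-, and trailing-slash-insensitive)."""
    parts = urlparse(url.strip().lower())
    host = parts.netloc.split(":")[0].removeprefix("www.")
    return host + parts.path.rstrip("/")

def merge_repository_lists(*sources):
    """Merge repository records from several sources, deduplicating by homepage URL.

    Later sources only fill in fields missing from earlier ones, so list order
    encodes trust (e.g. re3data first, publisher lists last).
    """
    merged = {}
    for source in sources:
        for record in source:
            key = normalize_url(record["url"])
            existing = merged.setdefault(key, {})
            for field, value in record.items():
                existing.setdefault(field, value)
    return list(merged.values())
```

URL-based deduplication is a simplifying assumption; in practice some repositories change domains or are listed under mirror URLs, so a name-similarity pass would likely be needed as well.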
Search Engine
We can use either a research-oriented (Indri/Galago/Terrier) or general-purpose (Lucene) search engine platform. The goal would be to identify features/characteristics of repositories that can be used to improve rankings, aside from basic language models.
Potential features:
- Retrieval score based on name, description, subject, information crawled from associated URLs, keywords, language, startDate, size
- URL format (e.g. presence of non-standard ports, path depth)
- Number of results in Google Scholar
- Amount of information in re3data (how complete is the record)?
- Number of policies
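As a concrete sketch, two of the features above (URL format and record completeness) could be computed as follows; the field names are illustrative rather than the actual re3data schema:

```python
from urllib.parse import urlparse

STANDARD_PORTS = {"http": 80, "https": 443}

def url_features(url):
    """URL-format features: presence of a non-standard port, and path depth."""
    parts = urlparse(url)
    port = parts.port  # None when no explicit port appears in the URL
    nonstandard_port = port is not None and port != STANDARD_PORTS.get(parts.scheme)
    path_depth = len([seg for seg in parts.path.split("/") if seg])
    return {"nonstandard_port": nonstandard_port, "path_depth": path_depth}

def completeness(record, expected_fields):
    """Fraction of expected record fields that are actually filled in."""
    filled = sum(1 for field in expected_fields if record.get(field))
    return filled / len(expected_fields)
```

These values would then be combined with the base retrieval score, e.g. as inputs to a learning-to-rank model or as simple re-ranking penalties.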
Test collection:
A key requirement will be the ability to evaluate the retrieval model, which requires a suitable test collection. For NDSC6, we would just pilot this.
- Find researchers with real datasets and have them identify the top repositories from re3data (possible future IRB study)?
- For some subset of repositories, go find a dataset.
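Either collection method yields judgements pairing a query (or dataset description) with the repository the researcher actually chose. A minimal evaluation sketch, assuming a single relevant repository per query, using mean reciprocal rank:

```python
def mean_reciprocal_rank(rankings, judgements):
    """Score a pilot test collection.

    rankings:   {query_id: [repo_id, ...]}  system output, best first
    judgements: {query_id: repo_id}         repository the researcher chose
    """
    total = 0.0
    for qid, relevant in judgements.items():
        ranked = rankings.get(qid, [])
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)  # reciprocal of 1-based rank
    return total / len(judgements)
```

With multiple acceptable repositories per query (researchers often name several), precision@k or nDCG would be the natural generalizations.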
Background/Analysis
What tools already exist in this space?
...
- Do researchers come to you looking for places to put their data?
- Of those that come to you, do you have some estimate of the percentage of those that eventually do find a place to put their data?
- Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
- Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
- What are some of the tools you recommend and how well do they meet the needs of the researcher?
- Do you have any ideas of tools or services that could help you/them better?
- We’re thinking of this service (describe current vision of recommender) — what do you think? Would it be useful?
- Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
- Is there anyone else working in this space that you think we should talk to?
References
Elsevier. Supported Data Repositories.
Myers, Jim. (2016). SEAD 2.0 Publication API Walkthrough.
Nature. Availability of data and materials.
PLOS ONE. Data availability.
UI RDS. Saving and Sharing your Data.
Whyte, A. (2015). ‘Where to keep research data: DCC checklist for evaluating data repositories’ v.1.1. Edinburgh: Digital Curation Centre.