...

  1. Use an existing search engine (e.g., Solr/Lucene) to index the re3data registry
  2. Create a test collection of datasets/queries/relevance judgements
    1. This can be done manually (find a set of researchers to give us a dataset and/or query and the repository they selected)
    2. This can be done automatically by sampling datasets from existing repositories and assuming that these are the "most relevant"
  3. Develop demonstration UI

The end product will be a search engine that merges re3data, BioSharing (if available), and funder and publisher lists, along with models of relevance.
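A minimal sketch of the core idea, assuming the merged records are available as plain dicts (the field names and sample repositories below are illustrative, not actual re3data entries):

```python
from collections import defaultdict

# Toy repository records; in practice these would be merged from
# re3data, BioSharing, and funder/publisher lists (fields assumed).
REPOS = [
    {"id": "r1", "name": "GenBank", "description": "genetic sequence database"},
    {"id": "r2", "name": "Dryad", "description": "general purpose data repository"},
]

def build_index(repos):
    """Map each term to the ids of repositories whose text contains it."""
    index = defaultdict(set)
    for repo in repos:
        text = f"{repo['name']} {repo['description']}".lower()
        for term in text.split():
            index[term].add(repo["id"])
    return index

def search(index, query):
    """Rank repositories by the number of query terms they match."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for repo_id in index.get(term, ()):
            scores[repo_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

index = build_index(REPOS)
print(search(index, "sequence database"))  # → ['r1']
```

A production system would delegate indexing and term weighting to the chosen engine; this only illustrates the record-merging and lookup shape.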

Search Engine

We can use either a research-oriented (Indri/Galago/Terrier) or general-purpose (Lucene) search engine platform. The goal would be to identify features/characteristics of repositories that can be used to improve rankings, aside from basic language models.
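For the "basic language models" baseline, research engines such as Indri and Galago commonly use Dirichlet-smoothed query likelihood. A sketch (mu=2000 is a conventional default; the snippet assumes every query term occurs at least once in the collection):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_counts,
                     collection_len, mu=2000):
    """Dirichlet-smoothed query-likelihood log score.

    Smoothing mixes the document's term frequencies with the
    collection's background probabilities so unseen terms do not
    zero out the score. Assumes collection_counts[t] > 0 for every
    query term t.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = collection_counts[t] / collection_len  # background model
        p = (doc_counts[t] + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score
```

Repository-specific features (below) would then be combined with this text score rather than replace it.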

Potential features:

  • Retrieval score based on name, description, subject, keywords, information crawled from associated URLs, language, startDate, size
  • URL format (e.g. presence of non-standard ports, path depth)
  • Number of results in Google Scholar
  • Completeness of the re3data record (how much information is filled in)?
  • Number of policies
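Two of the features above can be computed directly with the standard library; a sketch (the completeness field list is an assumption for illustration, not the actual re3data schema):

```python
from urllib.parse import urlparse

STANDARD_PORTS = {None, 80, 443}

def url_features(url):
    """URL-shape features from the list above: non-standard port
    and path depth."""
    parsed = urlparse(url)
    depth = len([p for p in parsed.path.split("/") if p])
    return {
        "nonstandard_port": parsed.port not in STANDARD_PORTS,
        "path_depth": depth,
    }

def completeness(record, expected_fields=("name", "description", "subject",
                                          "keywords", "language",
                                          "startDate", "size")):
    """Fraction of expected fields that are actually filled in."""
    filled = sum(1 for f in expected_fields if record.get(f))
    return filled / len(expected_fields)

print(url_features("http://example.org:8080/api/v1/datasets"))
# → {'nonstandard_port': True, 'path_depth': 3}
```

How these features are weighted against the retrieval score (hand-tuned vs. learned from the test collection) is an open design question.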

Test collection:

A key requirement will be the ability to evaluate the retrieval model, which requires a suitable test collection. For NDSC6, we would just pilot this.

  • Find researchers with real datasets and have them identify the top repositories from re3data (possible future IRB study)?
  • For some subset of repositories, find a dataset that was deposited there.
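Once even a small set of relevance judgements exists, the pilot can be scored with standard retrieval measures; a sketch of precision@k and reciprocal rank (MRR is the mean of reciprocal rank over all test queries):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results judged relevant."""
    top = ranked_ids[:k]
    return sum(1 for r in top if r in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0 if none retrieved."""
    for i, r in enumerate(ranked_ids, start=1):
        if r in relevant_ids:
            return 1 / i
    return 0.0

print(precision_at_k(["a", "b", "c"], {"b", "c"}, 2))  # → 0.5
print(reciprocal_rank(["a", "b", "c"], {"b", "c"}))    # → 0.5
```

With only a handful of judged queries these numbers will be noisy, which is consistent with treating NDSC6 as a pilot.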

Background/Analysis

What tools already exist in this space?

...

  1. Do researchers come to you looking for places to put their data?
    1. Of those that come to you, roughly what percentage eventually find a place to put their data?
  2. Thinking about the researchers that come to you, what is the typical consultation like? What types of questions or concerns do they have?
  3. Do you notice any common challenges or themes across the campus for researchers looking for places to deposit data?
  4. What are some of the tools you recommend and how well do they meet the needs of the researcher?
  5. Do you have any ideas of tools or services that could help you/them better?
  6. We’re thinking of this service (describe current vision of recommender), what do you think? Would it be useful?
  7. Are there any departments/researchers/labs that you think are representative of this problem that we could talk to? (Looking for most common cases)
  8. Is there anyone else working in this space that you think we should talk to?

References

Elsevier. Supported Data Repositories.

Myers, Jim. (2016). SEAD 2.0 Publication API Walkthrough.

Nature. Availability of data and materials.

PLOS ONE. Data availability.

UI RDS. Saving and Sharing your Data.

Whyte, A. (2015). ‘Where to keep research data: DCC checklist for evaluating data repositories’ v.1.1. Edinburgh: Digital Curation Centre.