The goal of this task is to run our baselines against the OHSUMED test collection:
http://trec.nist.gov/data/t9_filtering.html
Basic steps:
- Download test collection. Put data in shared data directory
- Create new build_index scripts in biocaddie/index for collection.
- Build index, put output in shared directory (but also keep a copy on your VM for performance)
- Convert topics to Indri format. Check the converted topics into biocaddie/queries
- Run baselines (all non-feedback + RM3), add results to the Wiki.
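For the topic conversion step, here is a minimal Python sketch. It assumes the OHSUMED topics come as TREC-style `<top>` blocks with `<num>` and `<title>` fields, and that the target is an IndriRunQuery parameter file with one `#combine(...)` query per topic. The function names `parse_topics` and `to_indri` are hypothetical, and the layout of the existing files in biocaddie/queries should be checked before relying on this.

```python
import re


def parse_topics(text):
    """Yield (number, title) pairs from TREC-style <top> blocks.

    Assumption: each topic has a <num> line (optionally 'Number: X')
    and a <title> that runs until the next tag.
    """
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.S):
        num = re.search(r"<num>\s*(?:Number:)?\s*(\S+)", block).group(1)
        title = re.search(r"<title>\s*(.*?)\s*(?=<|$)", block, flags=re.S).group(1)
        yield num, " ".join(title.split())


def to_indri(topics):
    """Render (number, title) pairs as an Indri query parameter file."""
    lines = ["<parameters>"]
    for num, title in topics:
        # Drop punctuation so it is not parsed as Indri query syntax.
        terms = re.sub(r"[^A-Za-z0-9]+", " ", title).strip().lower()
        lines.append(
            f"<query><number>{num}</number>"
            f"<text>#combine({terms})</text></query>"
        )
    lines.append("</parameters>")
    return "\n".join(lines) + "\n"
```

Usage would be something like `open(out_path, "w").write(to_indri(parse_topics(open(topics_path).read())))`, where both paths are placeholders. Only the title field is used here; whether the baselines should also use the description is a separate decision.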
When done, create PR with your changes to the biocaddie repo and assign this ticket to Craig for review.
Note: The OHSUMED test collection was originally used for filtering, so the queries and qrels appear to be split into train/test. You'll want to combine these for ad-hoc evaluation. Also, the documents are in a non-standard format (probably old PubMed). You might spend some time looking to see if there are TREC-formatted documents or if someone has provided a script to convert the data.
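If no existing converter turns up, the two chores in the note could be sketched roughly as below. Everything about the input layout is an assumption to verify against the downloaded files: that each document starts with a `.I <n>` line followed by single-letter field markers (`.U` for the MEDLINE identifier, `.T` for the title, `.W` for the abstract), that the qrels reference the `.U` identifier, and that qrels lines are whitespace-separated with the topic first and the document id and relevance last. The function names are hypothetical.

```python
import re


def ohsumed_to_trec(text):
    """Convert OHSUMED-style records into TREC text <DOC> blocks."""
    out = []
    # Each record is assumed to begin with a line like '.I 1'.
    for record in re.split(r"^\.I\s+\d+[ \t]*$", text, flags=re.M)[1:]:
        fields, tag = {}, None
        for line in record.splitlines():
            line = line.strip()
            if re.fullmatch(r"\.[A-Z]", line):
                tag = line[1]          # start of a new field, e.g. '.T'
                fields[tag] = []
            elif tag and line:
                fields[tag].append(line)
        if not fields.get("U"):
            continue                   # skip records without an identifier
        docno = fields["U"][0]
        title = " ".join(fields.get("T", []))
        body = " ".join(fields.get("W", []))
        out.append(
            f"<DOC>\n<DOCNO>{docno}</DOCNO>\n<TITLE>{title}</TITLE>\n"
            f"<TEXT>\n{body}\n</TEXT>\n</DOC>\n"
        )
    return "".join(out)


def merge_qrels(*qrels_texts):
    """Combine train/test qrels into one four-column trec_eval listing."""
    judgments = {}
    for text in qrels_texts:
        for line in text.splitlines():
            parts = line.split()
            if len(parts) < 3:
                continue               # ignore blank or malformed lines
            # Works for 3-column (topic doc rel) and 4-column input.
            topic, docid, rel = parts[0], parts[-2], parts[-1]
            judgments[(topic, docid)] = rel
    return "".join(
        f"{topic} 0 {docid} {rel}\n"
        for (topic, docid), rel in sorted(judgments.items())
    )
```

If the same topic/document pair appears in both splits, the later file wins. The relevance values are passed through unchanged, so any graded-versus-binary decision still has to be made at evaluation time.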