...
Also make a copy at /data/trecgenomics/data/
2. Indexes (/shared/trecgenomics/indexes/trecgenomics_all)
...
Also make a copy at /data/trecgenomics/indexes/trecgenomics_all
3. Queries
#download topics to /shared/trecgenomics/queries folder
...
Also make a copy of the query at /data/trecgenomics/queries
4. Qrels
#download qrels to /shared/trecgenomics/qrels folder
No Format |
---|
wget http://skynet.ohsu.edu/trec-gen/data/2007/trecgen2007.all.judgments.tsv.txt |
#convert qrels into the format trec_eval expects (insert 0 as the second column, replace NOT_RELEVANT with 0 and RELEVANT with 2, remove columns 4 and 5)
No Format |
---|
grep -v "#" /shared/trecgenomics/qrels/trecgen2007.all.judgments.tsv.txt | sed -e 's/\tRELEVANT/\t2/g' -e 's/\tNOT_RELEVANT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,6 > trecgenomics-qrels.txt |
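To sanity-check the conversion before running it on the full judgments file, the same pipeline can be applied to a couple of made-up rows. This is only a sketch: the sample rows and their column layout (topic, document ID, two span columns, label) are assumptions inferred from the comment above, not copied from the real file, and the `\t` escapes assume GNU sed.

```shell
# Hypothetical sample rows: topic, doc ID, two span columns, label (tab-separated).
sample=$(mktemp)
printf '#comment line\n' > "$sample"
printf '200\t12345\t10\t50\tRELEVANT\n' >> "$sample"
printf '200\t67890\t0\t30\tNOT_RELEVANT\n' >> "$sample"

# Same pipeline as above, applied to the sample instead of the real judgments file.
converted=$(grep -v "#" "$sample" \
  | sed -e 's/\tRELEVANT/\t2/g' -e 's/\tNOT_RELEVANT/\t0/g' -e 's/\t/\t0\t/1' \
  | cut -f 1,2,3,6)

# Each output row should have four columns: topic, 0, doc ID, relevance (2 or 0).
printf '%s\n' "$converted"
rm -f "$sample"
```

If the printed rows do not have exactly four tab-separated columns, the real file's layout probably differs from the assumed one and the `cut` field list needs adjusting.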
...
The relevance judgments generated above contain duplicates: a document may have several judgments (RELEVANT/NOT_RELEVANT) for the same query, because each judgment applies to a maximum-length span within the document.
Eg: In trecgen2007.all.judgments.tsv.txt file:
...
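One way to collapse such duplicates is to sort by relevance in descending order and keep the first row per topic/document pair. This is a sketch that assumes the highest judgment for a pair should win; the source does not state the policy. The sample rows are hypothetical.

```shell
# Hypothetical converted qrels with duplicate topic/document pairs (tab-separated).
qrels=$(mktemp)
printf '200\t0\t12345\t0\n200\t0\t12345\t2\n200\t0\t67890\t0\n201\t0\t12345\t0\n' > "$qrels"

# Sort by topic, document, then relevance (highest first); keep the first row per pair.
deduped=$(sort -t "$(printf '\t')" -k1,1 -k3,3 -k4,4nr "$qrels" \
  | awk -F '\t' '!seen[$1 FS $3]++')

printf '%s\n' "$deduped"
rm -f "$qrels"
```

Keeping the maximum treats a document as relevant if any of its spans was judged relevant. If a different rule is wanted (for example, the minimum), only the sort direction on the fourth key needs to change.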
Also make a copy of the qrels at /data/trecgenomics/qrels
5. IndriRunQuery - Output
No Format |
---|
cd ~/biocaddie/baselines/trecgenomics && ./<model>.sh <topic> <collection> | parallel -j 20 bash -c "{}" |
...
/data/trecgenomics/output/rm3/combined/orig
6. Cross-validation
No Format |
---|
cd ~/biocaddie && scripts/mkeval_trecgenomics.sh <model> <topics> <collection> |
Eg: scripts/mkeval_trecgenomics.sh tfidf orig combined
7. Compare models
No Format |
---|
cd ~/biocaddie && Rscript scripts/compare_trecgenomics.R <collection> <from model> <to model> <topic> |
...