1. Variables and notations
Term | Meaning | Variable in script | Scripts |
---|---|---|---|
collection | name of the collection/dataset (biocaddie, ohsumed, treccds, trecgenomics) | col | all |
subset | set of data used for running baselines (combined, test, train) | subset | all |
topics | name of the topics file (short, orig, stopped, etc) | topics | all |
year | year of the collection/dataset, available for only a few collections such as trecgenomics (2006, 2007) | year | all |
model | retrieval model (dir, rm3, jm, pubmed, etc) | model | mkeval.sh, mkeval-lucene.sh |
from model | retrieval model (dir, rm3, jm, pubmed, etc), for comparison (t-test) | from | compare.R |
to model | retrieval model (dir, rm3, jm, pubmed, etc), for comparison (t-test) | to | compare.R |
metric | evaluation metric (map, ndcg, P_20, ndcg_cut_20, etc) | metric | mkeval.sh, mkeval-lucene.sh, compare.R |
run method | running method (indri, lucene) | run | compare.R |
2. Files and their locations
Collections without year: biocaddie, ohsumed
Collections with year: treccds (2015), trecgenomics (2006, 2007)
Type | Location | Example |
---|---|---|
Indexes | /data/<col>/indexes/<col>_all /data/<col>/indexes/<col><year>_all | /data/biocaddie/indexes/biocaddie_all /data/trecgenomics/indexes/trecgenomics2006_all |
Queries | /data/<col>/queries/queries.<subset>.<topics> /data/<col>/queries/queries.<subset>.<topics>.<year> | /data/biocaddie/queries/queries.test.short /data/trecgenomics/queries/queries.combined.orig.2006 |
Qrels | /data/<col>/qrels/<col>.qrels.<subset> /data/<col>/qrels/<col>.qrels.<subset>.<year> | /data/biocaddie/qrels/biocaddie.qrels.test /data/trecgenomics/qrels/trecgenomics.qrels.combined.2006 |
Output | /data/<col>/output/<model>/<subset>/<topics> /data/<col>/output/<year>/<model>/<subset>/<topics> | /data/biocaddie/output/dir/test/short /data/trecgenomics/output/2006/dir/combined/orig |
Eval | /data/<col>/eval/<model>/<subset>/<topics> /data/<col>/eval/<year>/<model>/<subset>/<topics> | /data/biocaddie/eval/dir/test/short /data/trecgenomics/eval/2006/dir/combined/orig |
Loocv | /data/<col>/loocv/<model>.<subset>.<topics>.<metric>.indri.out /data/<col>/loocv/<year>/<model>.<subset>.<topics>.<metric>.indri.out | /data/biocaddie/loocv/dir.test.short.ndcg.indri.out /data/trecgenomics/loocv/2006/dir.combined.orig.ndcg.indri.out |
*** Note: loocv results from both Lucene and Indri runs are saved in the same location for easy comparison across different runs.
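The naming scheme in the table can be sketched as a small shell function. This is an illustrative helper, not part of the repository: the name `show_paths` and its argument order are assumptions, while the `/data` root and the path patterns are copied from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): prints the expected file
# locations for one run, following the patterns in the table above.
# Usage: show_paths <col> <model> <subset> <topics> <metric> [year]
show_paths() {
  local col=$1 model=$2 subset=$3 topics=$4 metric=$5 year=${6:-}
  local base="/data/$col"
  if [ -z "$year" ]; then
    # Collections without year (biocaddie, ohsumed)
    echo "index: $base/indexes/${col}_all"
    echo "queries: $base/queries/queries.$subset.$topics"
    echo "qrels: $base/qrels/$col.qrels.$subset"
    echo "output: $base/output/$model/$subset/$topics"
    echo "eval: $base/eval/$model/$subset/$topics"
    echo "loocv: $base/loocv/$model.$subset.$topics.$metric.indri.out"
  else
    # Collections with year (treccds, trecgenomics)
    echo "index: $base/indexes/${col}${year}_all"
    echo "queries: $base/queries/queries.$subset.$topics.$year"
    echo "qrels: $base/qrels/$col.qrels.$subset.$year"
    echo "output: $base/output/$year/$model/$subset/$topics"
    echo "eval: $base/eval/$year/$model/$subset/$topics"
    echo "loocv: $base/loocv/$year/$model.$subset.$topics.$metric.indri.out"
  fi
}

show_paths biocaddie dir test short ndcg
show_paths trecgenomics dir combined orig ndcg 2006
```

The two calls at the end reproduce the example column of the table, which makes it easy to check a new collection against the convention before running anything.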
3. Run Indri baselines
a) IndriRunQuery (output)
cd ~/biocaddie
baselines/new/<model>.sh <topics> <subset> <col> | parallel -j 20 bash -c "{}"
baselines/new/<model>.sh <topics> <subset> <col> <year> | parallel -j 20 bash -c "{}"
Eg: baselines/new/dir.sh short test biocaddie | parallel -j 20 bash -c "{}"
baselines/new/dir.sh orig combined trecgenomics 2006 | parallel -j 20 bash -c "{}"
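The pipe into `parallel` suggests that each baseline script prints one shell command per line rather than running the queries itself; this is an inference from the invocation, not something the scripts are confirmed to do. Under that assumption, the pattern can be tried out with a stand-in generator. The function `emit_jobs` and its placeholder commands are made up for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for a baseline script such as baselines/new/dir.sh (hypothetical):
# it prints one shell command per line and runs nothing itself.
emit_jobs() {
  local p
  for p in 1 2 3; do      # placeholder parameter values
    echo "echo job $p"    # a real script would print a retrieval command here
  done
}

# Preview the generated commands without executing them.
emit_jobs | head -n 2

# Execute them one after another. Swapping `bash` for
# `parallel -j 20 bash -c "{}"` would run up to 20 of them concurrently.
emit_jobs | bash
```

Previewing with `head` before piping into `parallel` is a cheap way to confirm that the arguments (topics, subset, collection, year) ended up in the right positions.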
b) Evaluation and Cross-validation (eval, loocv)
cd ~/biocaddie
scripts/new/mkeval.sh <model> <topics> <subset> <col>
scripts/new/mkeval.sh <model> <topics> <subset> <col> <year>
Eg: scripts/new/mkeval.sh dir short test biocaddie
scripts/new/mkeval.sh dir orig combined trecgenomics 2006
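To evaluate several models with the same topics, subset and collection, the commands can be generated in a loop. The sketch below only prints the `mkeval.sh` command lines (a dry run); the function name `mkeval_cmds` and the use of `-` to mean "no year" are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints one mkeval.sh command per model.
# Usage: mkeval_cmds <topics> <subset> <col> <year|-> <model>...
# Pass "-" as the year for collections without year.
mkeval_cmds() {
  local topics=$1 subset=$2 col=$3 year=$4
  shift 4
  local model
  for model in "$@"; do
    if [ "$year" = "-" ]; then
      echo "scripts/new/mkeval.sh $model $topics $subset $col"
    else
      echo "scripts/new/mkeval.sh $model $topics $subset $col $year"
    fi
  done
}

mkeval_cmds short test biocaddie - dir rm3 jm
mkeval_cmds orig combined trecgenomics 2006 dir rm3
```

Once the printed lines look right, piping them into `bash` from `~/biocaddie` would execute them in order.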
c) Compare models
The running method for the comparison has to be entered as one of the following codes:
0 - both from and to models are from Indri run
1 - both from and to models are from Lucene run
2 - from model is from Indri run, to model is from Lucene run
3 - from model is from Lucene run, to model is from Indri run
cd ~/biocaddie
Rscript scripts/new/compare.R <subset> <from> <to> <topics> <col>
Rscript scripts/new/compare.R <subset> <from> <to> <topics> <col> <year>
Eg: Rscript scripts/new/compare.R test tfidf dir short biocaddie
-- Then select the running method for the comparison (e.g. '0' to compare tfidf and dir results that both come from Indri runs)
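The four running method codes listed above can be looked up from the pair of run types. This is a small convenience sketch; `run_method_code` is a hypothetical name and is not part of `compare.R`.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: maps (from run, to run) to the running method code
# that compare.R expects, following the list above.
# Usage: run_method_code <from_run> <to_run>   (each is indri or lucene)
run_method_code() {
  case "$1:$2" in
    indri:indri)   echo 0 ;;  # both models from Indri runs
    lucene:lucene) echo 1 ;;  # both models from Lucene runs
    indri:lucene)  echo 2 ;;  # from model is Indri, to model is Lucene
    lucene:indri)  echo 3 ;;  # from model is Lucene, to model is Indri
    *) echo "unknown run pair: $1 $2" >&2; return 1 ;;
  esac
}

run_method_code indri lucene
```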