(This is a description of my current evaluation process, mainly for review).
Overview
The evaluation process consists of the following steps:
- Build BioCADDIE index
- Run baseline models with parameter sweeps
- Run our models with parameter sweeps
- Run leave-one-query-out cross validation
- Compare cross-validation results using a paired t-test
Some details:
Working directory is assumed to be:
biocaddie.ndslabs.org:/data/willis8/bioCaddie
Java classes are in https://github.com/craig-willis/biocaddie
Building the index
Currently using IndriBuildIndex.
Convert documents to TREC text format for indexing. This conversion is handled by:
scripts/dats2trec.sh
Note: the edu.gslis.biocaddie.util.DATSToTrecText class can operate on all fields or on a subset of fields (title, description). At this point, I'm using all fields.
The output of this process is the file:
/data/willis8/bioCaddie/data/biocaddie_all.txt
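Each document in this file is a TREC-text record, roughly of the following form (the exact field layout depends on the DATSToTrecText settings, so this is just a sketch):

<DOC>
<DOCNO>unique-dataset-id</DOCNO>
<TEXT>
title, description, and other DATS fields as free text ...
</TEXT>
</DOC>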
Use IndriBuildIndex to construct the index:
IndriBuildIndex build_index.all.params
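build_index.all.params is a standard Indri parameter file pointing IndriBuildIndex at the trectext corpus and the target index location. A minimal sketch (the paths and options here are placeholders, not the file's actual contents):

<parameters>
  <index>/data/willis8/bioCaddie/indexes/all</index>
  <memory>2G</memory>
  <corpus>
    <path>/data/willis8/bioCaddie/data/biocaddie_all.txt</path>
    <class>trectext</class>
  </corpus>
  <stemmer><name>krovetz</name></stemmer>
</parameters>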
Run baseline models
I have scripts that sweep parameters for several baseline models under the baselines/ directory:
- dir.sh: LM/Dirichlet
- jm.sh: LM/Jelinek-Mercer
- okapi.sh: Indri's Okapi implementation
- rm3.sh: Indri's RM3 implementation
- tfidf.sh: Indri's TFIDF baseline
- two.sh: LM/Two-stage smoothing
Each script takes two arguments:
- topics: orig, short, stopped
- collection: combined, train, test
Each of these scripts produces a set of TREC-formatted output files under the following directory structure:
- output/
  - model (dir, jm, okapi, rm3, tfidf, two)
    - collection (combined, train, test)
      - topics (orig, short, stopped)
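For example, all of the baseline sweeps can be run from the working directory with a loop along these lines (each script encodes its own parameter grid internally):

for model in dir jm okapi rm3 tfidf two; do
  for topics in orig short stopped; do
    for collection in combined train test; do
      ./baselines/$model.sh $topics $collection
    done
  done
done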
Cross validation:
The "mkeval.sh" script generates trec_eval -c -q -m all_trec formatted output for each parameter combination. For example:
./mkeval.sh dir short combined
Produces:
eval/dir/combined/short
with one file per parameter combination.
The script then runs a simple leave-one-query-out CrossValidation utility, optimizing separately for each of several metrics (map, ndcg, ndcg_cut_20, p_20): for each held-out query, the parameter setting that scores best on the remaining queries is selected and used to score the held-out query. This produces a set of output files in the loocv/ directory of the form:
model.collection.topics.metric.out
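The evaluation and cross-validation output for every combination can be generated with a similar loop, for example:

for model in dir jm okapi rm3 tfidf two; do
  for topics in orig short stopped; do
    for collection in combined train test; do
      ./mkeval.sh $model $topics $collection
    done
  done
done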
Comparing runs:
A simple R script, compare.R, reads the cross-validation output for two models and compares them across multiple metrics via a paired t-test. For example:
Rscript compare.R combined tfidf dir short
[1] "map 0.2444 0.2776 p= 0.0257"
[1] "ndcg 0.4545 0.5252 p= 0.0356"
[1] "P_20 0.431 0.531 p= 0.0266"
[1] "ndcg_cut_20 0.3982 0.4859 p= 0.0161"
The columns are:
- Metric
- First model (tfidf)
- Second model (dir)
- p-value from a one-tailed paired t-test (alternative hypothesis: the first model scores lower than the second)
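To compare the Dirichlet run against each of the other baselines on the short topics, for example:

for model in tfidf jm okapi rm3 two; do
  echo "== $model vs. dir =="
  Rscript compare.R combined $model dir short
done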