(This is a description of my current evaluation process, mainly for review).
Overview
The evaluation process consists of the following steps:
- Build BioCADDIE index
- Run models with parameter sweeps
- Run leave-one-query-out cross validation
- Compare results of cross-validation using ttest
Some details:
Working directory is assumed to be:
biocaddie.ndslabs.org:/data/willis8/bioCaddie
Java classes are in https://github.com/craig-willis/biocaddie
Building the index
Currently using IndriBuildIndex.
Convert documents to TREC text format for indexing. This conversion is handled by the script:
scripts/dats2trec.sh
Note, the edu.gslis.biocaddie.util.DATSToTrecText class will operate on all fields or a subset of fields (title, description). At this point, I'm using all.
The output of this process is the file:
/data/willis8/bioCaddie/data/biocaddie_all.txt
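For reference, TREC text format wraps each document in DOC/DOCNO/TEXT markup. A minimal Python sketch of that wrapping (illustrative only; the actual conversion is done by the DATSToTrecText Java class, and the function name and fields here are hypothetical):

```python
# Minimal sketch of TREC text output (illustrative; not the real
# edu.gslis.biocaddie.util.DATSToTrecText implementation).
def to_trec_text(docno, title, description):
    """Wrap one document's fields in TREC text markup."""
    return (
        "<DOC>\n"
        f"<DOCNO>{docno}</DOCNO>\n"
        "<TEXT>\n"
        f"{title}\n{description}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

print(to_trec_text("bioc-0001", "Example dataset", "A short description."), end="")
```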
Use IndriBuildIndex to construct the index:
IndriBuildIndex build_index.all.params
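build_index.all.params is an Indri parameter file; a typical one looks roughly like this (the index path and memory setting here are illustrative, not copied from the actual file):

```xml
<parameters>
  <index>/data/willis8/bioCaddie/indexes/all</index>
  <corpus>
    <path>/data/willis8/bioCaddie/data/biocaddie_all.txt</path>
    <class>trectext</class>
  </corpus>
  <memory>2G</memory>
</parameters>
```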
Run baseline models
I have scripts that sweep parameters for several baseline models under the baselines/ directory:
- dir.sh: LM/Dirichlet
- jm.sh: LM/Jelinek-Mercer
- okapi.sh: Indri's Okapi implementation
- rm3.sh: Indri's RM3 implementation
- tfidf.sh: Indri's TFIDF baseline
- two.sh: LM/Two-stage smoothing
Each script takes two arguments:
- topics: orig, short, stopped
- collection: combined, train, test
Each of these scripts produces a set of TREC-formatted output files under the following directory structure:
- output
  - model (dir, jm, okapi, rm3, tfidf, two)
    - collection (combined, train, test)
      - topics (orig, short, stopped)
Cross validation:
Craig:
The "mkeval.sh" script generates trec_eval -c -q -m all_trec formatted output for each parameter combination. For example:
./mkeval.sh dir short combined
Produces:
eval/dir/combined/short
With one trec_eval output file per parameter combination.
The script then runs a simple leave-one-query-out CrossValidation utility optimizing for multiple metrics (map, ndcg, ndcg_cut_20, p_20). This produces a set of output files in the loocv/ directory of the form:
model.collection.topics.metric.out
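The leave-one-query-out procedure amounts to: for each query, pick the parameter setting that does best on all the other queries, then score the held-out query with that setting. A minimal Python sketch (the data layout is hypothetical; the actual utility reads the per-parameter trec_eval output files):

```python
# Sketch of leave-one-query-out cross-validation (hypothetical data layout;
# the real utility parses per-parameter trec_eval output files).
def loocv(scores):
    """scores: {param_setting: {query_id: metric_value}}.
    Returns the mean held-out score across queries."""
    queries = sorted(next(iter(scores.values())))
    held_out = []
    for q in queries:
        # Choose the setting that does best on all *other* queries...
        best = max(scores, key=lambda p: sum(v for qq, v in scores[p].items() if qq != q))
        # ...then record its score on the held-out query.
        held_out.append(scores[best][q])
    return sum(held_out) / len(held_out)

scores = {
    "mu=1000": {"q1": 0.30, "q2": 0.40, "q3": 0.20},
    "mu=2500": {"q1": 0.25, "q2": 0.45, "q3": 0.35},
}
print(round(loocv(scores), 4))  # prints 0.3
```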
Garrick:
My old cross-validation framework is most useful as a library if you need to run relatively small computations as part of a larger process: https://github.com/gtsherman/cross-validation
For most uses, my "generic" cross-validation framework is a better fit, and is in the same vein as Craig's LOOCV above: https://github.com/gtsherman/generic-cross-validation
My framework leaves it up to you to produce the directory containing trec_eval output (one file per parameter combination, as above). Given this directory, you can run:
generic-cross-validation/run.py -d <dir> -r <seed> -k <num_folds> -m <metric> -s
For LOOCV, set k equal to the number of queries; the seed is then irrelevant. If you instead use, e.g., 10-fold cross-validation, setting the seed lets you replicate your cross-validation results later by duplicating the random split of queries into folds. The metric may be any of the metrics available in the trec_eval output; run the script once per metric of interest.
If you want to see the optimal parameter settings for each fold, you can add the -v option. This will cause fold information to be printed to stderr like so:
Split into 10 folds
Items per fold: 10
Best params for fold 0 (n=10): 0.3_0.7 (0.53798)
Best params for fold 1 (n=10): 0.4_0.6 (0.545085555556)
Best params for fold 2 (n=10): 0.4_0.6 (0.534586666667)
Best params for fold 3 (n=10): 0.4_0.6 (0.539261111111)
n shows the number of items in the fold (this is sometimes fewer than the "Items per fold" suggests, if the number of queries is not evenly divisible by the number of folds). The value in parentheses at the end is the value of the target metric obtained for that fold with that parameter setting.
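Under this scheme, the split can be sketched as follows (a sketch of the behavior described above, assuming a shuffle-then-chunk split; this is not the actual run.py implementation):

```python
import math
import random

def split_folds(queries, k, seed):
    """Shuffle queries with a fixed seed, then chunk into k folds.
    The last fold may be smaller when len(queries) is not divisible by k."""
    qs = list(queries)
    random.Random(seed).shuffle(qs)       # fixed seed => reproducible split
    per_fold = math.ceil(len(qs) / k)     # "Items per fold"
    return [qs[i:i + per_fold] for i in range(0, len(qs), per_fold)]

folds = split_folds([f"q{i}" for i in range(20)], k=3, seed=42)
print([len(f) for f in folds])  # prints [7, 7, 6]
```

With 20 queries and 3 folds, the last fold holds only 6 items, which is why n can fall below the reported "Items per fold".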
Comparing runs:
Craig:
A simple R script, compare.R, reads the output from two models and compares them across multiple metrics via paired t-test. For example:
Rscript compare.R combined tfidf dir short
[1] "map 0.2444 0.2776 p= 0.0257"
[1] "ndcg 0.4545 0.5252 p= 0.0356"
[1] "P_20 0.431 0.531 p= 0.0266"
[1] "ndcg_cut_20 0.3982 0.4859 p= 0.0161"
The columns are:
- Metric
- First model (tfidf)
- Second model (dir)
- p-value from a one-tailed paired t-test (alternative: first model < second model)
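The underlying test is a paired t-test over per-query metric values. A stdlib-only Python sketch of the t statistic (illustrative; compare.R presumably calls R's t.test, and the per-query values below are made up):

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-query scores a vs. b.
    A negative t supports the one-sided alternative 'a < b'."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Hypothetical per-query MAP values for two models:
tfidf = [0.10, 0.20, 0.30, 0.40]
dirch = [0.30, 0.35, 0.50, 0.45]
t, df = paired_t(tfidf, dirch)
print(round(t, 3), df)  # prints -4.243 3
```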
Garrick:
For evaluating cross-validation, run ttest_generic.py with two "generic" cross-validation output files as arguments. The script expects the files to contain the same queries.
Since each cross-validation output is for a single metric, you will need to run this repeatedly for each metric of interest. It will return the p-value for a paired one-sided t-test.
If you want to compare two TREC-formatted run files (actual run output, not cross-validation output), you can also use ttest.py. This takes a few parameters:
./ttest.py -q <qrels> -m [map,ndcg] -f <file1> <file2> [-g]
This script will run trec_eval for you and read in the data for either MAP or nDCG@20. If -g is specified, it will run a one-tailed t-test. This is not the greatest piece of code in existence; it's handy once in a while, but you probably won't want to use it too often.