(This is a description of my current evaluation process, mainly for review).
Overview
The evaluation process consists of the following steps:
- Build BioCADDIE index
- Run models with parameter sweeps
- Run leave-one-query-out cross validation
- Compare results of cross-validation using ttest
Some details:
Working directory is assumed to be:
biocaddie.ndslabs.org:/data/willis8/bioCaddie
Java classes are in https://github.com/craig-willis/biocaddie
Building the index
Currently using IndriBuildIndex.
Convert documents to TREC text format for indexing. This conversion is handled by the script:
scripts/dats2trec.sh
Note, the edu.gslis.biocaddie.util.DATSToTrecText class will operate on all fields or a subset of fields (title, description). At this point, I'm using all.
The output of this process is the file:
/data/willis8/bioCaddie/data/biocaddie_all.txt
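For reference, TREC text format wraps each document in DOC/DOCNO/TEXT markup. A minimal Python sketch of that wrapping (illustrative only; the actual conversion is done by the DATSToTrecText Java class, and the function name and fields here are hypothetical):

```python
# Minimal sketch of TREC text output (illustrative; not the real
# edu.gslis.biocaddie.util.DATSToTrecText implementation).
def to_trec_text(docno, title, description):
    """Wrap one document's fields in TREC text markup."""
    return (
        "<DOC>\n"
        f"<DOCNO>{docno}</DOCNO>\n"
        "<TEXT>\n"
        f"{title}\n{description}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

print(to_trec_text("bioc-0001", "Example dataset", "A short description."), end="")
```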
Use IndriBuildIndex to construct the index:
IndriBuildIndex build_index.all.params
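build_index.all.params is an Indri parameter file; a typical one looks roughly like this (the index path and memory setting here are illustrative, not copied from the actual file):

```xml
<parameters>
  <index>/data/willis8/bioCaddie/indexes/all</index>
  <corpus>
    <path>/data/willis8/bioCaddie/data/biocaddie_all.txt</path>
    <class>trectext</class>
  </corpus>
  <memory>2G</memory>
</parameters>
```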
Run baseline models
I have scripts that sweep parameters for several baseline models under the baselines/ directory:
- dir.sh: LM/Dirichlet
- jm.sh: LM/Jelinek-Mercer
- okapi.sh: Indri's Okapi implementation
- rm3.sh: Indri's RM3 implementation
- tfidf.sh: Indri's TFIDF baseline
- two.sh: LM/Two-stage smoothing
Each script takes two arguments:
- topics: orig, short, stopped
- collection: combined, train, test
Each of these scripts produces a set of TREC-formatted output files under the following directory structure:
- output
  - model (dir, jm, okapi, rm3, tfidf, two)
    - collection (combined, train, test)
      - topics (orig, short, stopped)
Cross validation:
Craig:
The "mkeval.sh" script generates trec_eval -c -q -m all_trec formatted output for each parameter combination. For example:
./mkeval.sh dir short combined
Produces:
eval/dir/combined/short
With one trec_eval output file per parameter combination.
The script then runs a simple leave-one-query-out CrossValidation utility optimizing for multiple metrics (map, ndcg, ndcg_cut_20, p_20). This produces a set of output files in the loocv/ directory of the form:
model.collection.topics.metric.out
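The leave-one-query-out procedure amounts to: for each query, pick the parameter setting that does best on all the other queries, then score the held-out query with that setting. A minimal Python sketch (the data layout is hypothetical; the actual utility reads the per-parameter trec_eval output files):

```python
# Sketch of leave-one-query-out cross-validation (hypothetical data layout;
# the real utility parses per-parameter trec_eval output files).
def loocv(scores):
    """scores: {param_setting: {query_id: metric_value}}.
    Returns the mean held-out score across queries."""
    queries = sorted(next(iter(scores.values())))
    held_out = []
    for q in queries:
        # Choose the setting that does best on all *other* queries...
        best = max(scores, key=lambda p: sum(v for qq, v in scores[p].items() if qq != q))
        # ...then record its score on the held-out query.
        held_out.append(scores[best][q])
    return sum(held_out) / len(held_out)

scores = {
    "mu=1000": {"q1": 0.30, "q2": 0.40, "q3": 0.20},
    "mu=2500": {"q1": 0.25, "q2": 0.45, "q3": 0.35},
}
print(round(loocv(scores), 4))  # prints 0.3
```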
Garrick:
My old cross-validation framework is most useful as a library if you need to run relatively small computations as part of a larger process: https://github.com/gtsherman/cross-validation
For most uses, my "generic" cross-validation framework is a better fit, and is in the same vein as Craig's LOOCV above: https://github.com/gtsherman/generic-cross-validation
My framework leaves it up to you to produce the directory containing trec_eval output (one file per parameter combination, as above). Given this directory, you can run:
generic-cross-validation/run.py -d <dir> -r <seed> -k <num_folds> -m <metric> -s
For LOOCV, set k equal to the number of queries; the seed is then irrelevant. If you instead use, e.g., 10-fold cross-validation, setting the seed lets you replicate your cross-validation results later by duplicating the random split of queries into folds. The metric may be any of the metrics available in the trec_eval output; run the script once per metric of interest.
If you want to see the optimal parameter settings for each fold, you can add the -v option. This will cause fold information to be printed to stderr like so:
Split into 10 folds
Items per fold: 10
Best params for fold 0 (n=10): 0.3_0.7 (0.53798)
Best params for fold 1 (n=10): 0.4_0.6 (0.545085555556)
Best params for fold 2 (n=10): 0.4_0.6 (0.534586666667)
Best params for fold 3 (n=10): 0.4_0.6 (0.539261111111)
n shows the number of items in the fold (this is sometimes fewer than the "Items per fold" suggests, if the number of queries is not evenly divisible by the number of folds). The value in parentheses at the end is the value of the target metric obtained for that fold with that parameter setting.
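Under this scheme, the split can be sketched as follows (a sketch of the behavior described above, assuming a shuffle-then-chunk split; this is not the actual run.py implementation):

```python
import math
import random

def split_folds(queries, k, seed):
    """Shuffle queries with a fixed seed, then chunk into k folds.
    The last fold may be smaller when len(queries) is not divisible by k."""
    qs = list(queries)
    random.Random(seed).shuffle(qs)       # fixed seed => reproducible split
    per_fold = math.ceil(len(qs) / k)     # "Items per fold"
    return [qs[i:i + per_fold] for i in range(0, len(qs), per_fold)]

folds = split_folds([f"q{i}" for i in range(20)], k=3, seed=42)
print([len(f) for f in folds])  # prints [7, 7, 6]
```

With 20 queries and 3 folds, the last fold holds only 6 items, which is why n can fall below the reported "Items per fold".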
Comparing runs:
Craig:
A simple R script, compare.R, reads the output from two models and compares them across multiple metrics via paired t-test. For example:
Rscript compare.R combined tfidf dir short
[1] "map 0.2444 0.2776 p= 0.0257"
[1] "ndcg 0.4545 0.5252 p= 0.0356"
[1] "P_20 0.431 0.531 p= 0.0266"
[1] "ndcg_cut_20 0.3982 0.4859 p= 0.0161"
The columns are:
- Metric
- First model (tfidf)
- Second model (dir)
- p-value from a one-tailed paired t-test (alternative: first model < second model)
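The underlying test is a paired t-test over per-query metric values. A stdlib-only Python sketch of the t statistic (illustrative; compare.R presumably calls R's t.test, and the per-query values below are made up):

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-query scores a vs. b.
    A negative t supports the one-sided alternative 'a < b'."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Hypothetical per-query MAP values for two models:
tfidf = [0.10, 0.20, 0.30, 0.40]
dirch = [0.30, 0.35, 0.50, 0.45]
t, df = paired_t(tfidf, dirch)
print(round(t, 3), df)  # prints -4.243 3
```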
Garrick:
For evaluating cross-validation, run ttest_generic.py with two "generic" cross-validation output files as arguments. The script expects the files to contain the same queries.
Since each cross-validation output is for a single metric, you will need to run this repeatedly for each metric of interest. It will return the p-value for a paired one-sided t-test.
If you want to compare two TREC-formatted run files (actual run output, not cross-validation output), you can also use ttest.py. This takes a few parameters:
./ttest.py -q <qrels> -m [map,ndcg] -f <file1> <file2> [-g]
This script will run trec_eval for you and read in the data for either MAP or nDCG@20. If -g is specified, it will run a one-tailed t-test. This is not the greatest piece of code in existence; it's handy once in a while, but you probably won't want to use it too often.