(This is a description of my current evaluation process, mainly for review).
Overview
The evaluation process consists of the following steps:
- Build BioCADDIE index
- Run baseline models with parameter sweeps
- Run our models with parameter sweeps
- Run leave-one-query-out cross validation
- Compare cross-validation results using a paired t-test
Some details:
Working directory is assumed to be:
biocaddie.ndslabs.org:/data/willis8/bioCaddie
Java classes are in https://github.com/craig-willis/biocaddie
Building the index
Currently using IndriBuildIndex.
Convert documents to TREC text format for indexing. This conversion is handled by:
scripts/dats2trec.sh
Note: the edu.gslis.biocaddie.util.DATSToTrecText class can operate on all fields or on a subset of fields (title, description). At this point, I'm using all fields.
The output of this process is the file:
/data/willis8/bioCaddie/data/biocaddie_all.txt
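Each document in this file is a TREC-text record, roughly of the following form (the exact field layout depends on the DATSToTrecText settings, so this is just a sketch):

<DOC>
<DOCNO>unique-dataset-id</DOCNO>
<TEXT>
title, description, and other DATS fields as free text ...
</TEXT>
</DOC>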
Use IndriBuildIndex to construct the index:
IndriBuildIndex build_index.all.params
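build_index.all.params is a standard Indri parameter file pointing IndriBuildIndex at the trectext corpus and the target index location. A minimal sketch (the paths and options here are placeholders, not the file's actual contents):

<parameters>
  <index>/data/willis8/bioCaddie/indexes/all</index>
  <memory>2G</memory>
  <corpus>
    <path>/data/willis8/bioCaddie/data/biocaddie_all.txt</path>
    <class>trectext</class>
  </corpus>
  <stemmer><name>krovetz</name></stemmer>
</parameters>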
Run baseline models
I have scripts that sweep parameters for several baseline models under the baselines/ directory:
- dir.sh: LM/Dirichlet
- jm.sh: LM/Jelinek-Mercer
- okapi.sh: Indri's Okapi implementation
- rm3.sh: Indri's RM3 implementation
- tfidf.sh: Indri's TFIDF baseline
- two.sh: LM/Two-stage smoothing
Each script takes two arguments:
- topics: orig, short, stopped
- collection: combined, train, test
Each of these scripts produces a set of TREC-formatted output files under the following directory structure:
- output/
  - model (dir, jm, okapi, rm3, tfidf, two)
    - collection (combined, train, test)
      - topics (orig, short, stopped)
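For example, all of the baseline sweeps can be run from the working directory with a loop along these lines (each script encodes its own parameter grid internally):

for model in dir jm okapi rm3 tfidf two; do
  for topics in orig short stopped; do
    for collection in combined train test; do
      ./baselines/$model.sh $topics $collection
    done
  done
done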
Cross validation:
The "mkeval.sh" script generates trec_eval -c -q -m all_trec formatted output for each parameter combination. For example:
./mkeval.sh dir short combined
Produces:
eval/dir/combined/short
with one file per parameter combination.
The script then runs a simple leave-one-query-out CrossValidation utility, optimizing separately for each of several metrics (map, ndcg, ndcg_cut_20, p_20): for each held-out query, the parameter setting that scores best on the remaining queries is selected and used to score the held-out query. This produces a set of output files in the loocv/ directory of the form:
model.collection.topics.metric.out
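The evaluation and cross-validation output for every combination can be generated with a similar loop, for example:

for model in dir jm okapi rm3 tfidf two; do
  for topics in orig short stopped; do
    for collection in combined train test; do
      ./mkeval.sh $model $topics $collection
    done
  done
done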
Comparing runs:
A simple R script, compare.R, reads the cross-validation output for two models and compares them across multiple metrics via a paired t-test. For example:
Rscript compare.R combined tfidf dir short
[1] "map 0.2444 0.2776 p= 0.0257"
[1] "ndcg 0.4545 0.5252 p= 0.0356"
[1] "P_20 0.431 0.531 p= 0.0266"
[1] "ndcg_cut_20 0.3982 0.4859 p= 0.0161"
The columns are:
- Metric
- First model (tfidf)
- Second model (dir)
- p-value from a one-tailed paired t-test (alternative hypothesis: the first model scores lower than the second)
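To compare the Dirichlet run against each of the other baselines on the short topics, for example:

for model in tfidf jm okapi rm3 two; do
  echo "== $model vs. dir =="
  Rscript compare.R combined $model dir short
done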