(This is a description of my current evaluation process, mainly for review.)


The evaluation process consists of the following steps:

  • Build BioCADDIE index
  • Run baseline models with parameter sweeps
  • Run our models with parameter sweeps
  • Run leave-one-query-out cross-validation
  • Compare results of cross-validation using a paired t-test

Some details:

Working directory is assumed to be:

Java classes are in

Building the index

The index is currently built with IndriBuildIndex.

Convert documents to TREC text format for indexing. This is the process used to produce 


Note that the edu.gslis.biocaddie.util.DATSToTrecText class can operate on all fields or on a subset of fields (title, description). At this point, I'm using all fields.
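To illustrate the conversion step, here is a minimal Python sketch of rendering one record as a TREC text document. It is not the DATSToTrecText class itself: the flat record layout, the "docno" key, and the function name dats_to_trec_text are assumptions made for the example, and the real DATS JSON is more deeply nested.

```python
def dats_to_trec_text(record, fields=None):
    """Render one record as a TREC text <DOC> block.

    `record` is assumed to be a flat dict with a "docno" key plus text
    fields (hypothetical layout, not the actual DATS schema).
    """
    if fields is None:
        # Default: use every field except the identifier ("all fields").
        fields = [k for k in record if k != "docno"]
    # Skip fields that are missing or empty in this record.
    body = "\n".join(str(record[f]) for f in fields if record.get(f))
    return (
        "<DOC>\n"
        f"<DOCNO>{record['docno']}</DOCNO>\n"
        "<TEXT>\n"
        f"{body}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )
```

Passing fields=["title", "description"] would correspond to the subset mode mentioned above; leaving it unset uses every field.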

The output of this process is the file:


Use IndriBuildIndex to construct the index:

IndriBuildIndex build_index.all.params
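The contents of build_index.all.params are not shown here. As a rough sketch, an Indri build parameter file is an XML document with an index path and one or more corpus entries; the Python helper below writes a minimal one. The helper name and the example paths are hypothetical, and the actual file may set further options (memory, stemmer, fields).

```python
def build_index_params(index_path, corpus_path, corpus_class="trectext"):
    """Return a minimal IndriBuildIndex parameter file as a string.

    Only <index> and a single <corpus> entry are emitted; the paths
    passed in are placeholders chosen by the caller.
    """
    return (
        "<parameters>\n"
        f"  <index>{index_path}</index>\n"
        "  <corpus>\n"
        f"    <path>{corpus_path}</path>\n"
        f"    <class>{corpus_class}</class>\n"
        "  </corpus>\n"
        "</parameters>\n"
    )


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(build_index_params("/tmp/example_index", "/tmp/example_trec_text"))
```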

Run baseline models

I have scripts that sweep parameters for several baseline models under the baselines/ directory:

  • LM/Dirichlet
  • LM/Jelinek-Mercer
  • Indri's Okapi implementation
  • Indri's RM3 implementation
  • Indri's TFIDF baseline
  • LM/Two-stage smoothing

Each script takes two arguments:

  • topics: orig, short, stopped
  • collection: combined, train, test

Each of these scripts produces a set of TREC-formatted output files under the following directory structure:

  • output
    • model (dir, jm, okapi, rm3, tfidf, two)
      • collection (combined, train, test)
        • topics (orig, short, stopped)
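The sweep-and-write pattern above can be sketched in Python as follows. This is an illustration rather than the actual baseline scripts: the parameter name mu, its values, and the file naming scheme (name=value with a .out suffix) are assumptions.

```python
import itertools
import posixpath


def sweep(grid):
    """Yield one dict per combination of parameter values in `grid`."""
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        yield dict(zip(names, values))


def output_path(model, collection, topics, params, root="output"):
    """Build output/<model>/<collection>/<topics>/<params>.out.

    The file name encoding is hypothetical; only the directory layout
    follows the structure described above.
    """
    tag = ",".join(f"{k}={v}" for k, v in sorted(params.items()))
    return posixpath.join(root, model, collection, topics, tag + ".out")


if __name__ == "__main__":
    # Hypothetical Dirichlet sweep over the smoothing parameter mu.
    for params in sweep({"mu": [500, 1000, 2500]}):
        print(output_path("dir", "combined", "short", params))
```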

Cross validation:

The "" script generates trec_eval-formatted output (trec_eval -c -q -m all_trec) for each parameter combination. For example:

./ dir short combined 



There is one output file per parameter combination.
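With the -q flag, trec_eval prints one whitespace-separated line per metric and query (metric, query id, value), followed by summary lines whose query id is "all". A minimal Python sketch for reading such a file into per-query scores might look like this; the function name parse_trec_eval is an assumption, not part of the existing scripts.

```python
def parse_trec_eval(lines):
    """Parse `trec_eval -q` output into {metric: {query_id: value}}.

    Summary rows (query id "all") and non-numeric rows such as runid
    are skipped, since only per-query values are needed downstream.
    """
    scores = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        metric, qid, value = parts
        if qid == "all":
            continue
        try:
            scores.setdefault(metric, {})[qid] = float(value)
        except ValueError:
            continue
    return scores
```

In use, the lines would come from iterating over an open file handle for one parameter combination's trec_eval output.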

The script then runs a simple leave-one-query-out CrossValidation utility optimizing for multiple metrics (map, ndcg, ndcg_cut_20, p_20). This produces a set of output files in the loocv/ directory of the form:


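The leave-one-query-out step can be sketched in Python as below, for a single metric. This is a simplified stand-in for the CrossValidation utility, not its actual code: for each held-out query, it picks the parameter setting with the best mean score on the remaining queries and records the held-out query's score under that setting. The input layout and the setting labels are assumptions.

```python
def loocv(scores):
    """Leave-one-query-out cross-validation for one metric.

    scores: {param_setting: {query_id: metric_value}}, with every
    setting covering the same queries (at least two are required).
    Returns {query_id: score under the setting chosen without it}.
    """
    queries = sorted(next(iter(scores.values())))
    held_out = {}
    for q in queries:
        train = [t for t in queries if t != q]

        def train_mean(setting):
            # Mean metric value over the training queries only.
            return sum(scores[setting][t] for t in train) / len(train)

        # Sorting makes tie-breaking deterministic.
        best = max(sorted(scores), key=train_mean)
        held_out[q] = scores[best][q]
    return held_out


if __name__ == "__main__":
    # Made-up scores for two hypothetical settings and three queries.
    example = {
        "mu=500": {"1": 0.2, "2": 0.4, "3": 0.6},
        "mu=1000": {"1": 0.9, "2": 0.1, "3": 0.2},
    }
    print(loocv(example))
```

Running this once per metric (map, ndcg, ndcg_cut_20, p_20) gives one held-out score per query and metric, which is the input a paired comparison between models needs.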
Comparing runs:

A simple R script, compare.R, reads the output from two models and compares them across multiple metrics using a paired t-test. For example:

Rscript compare.R combined tfidf dir short
[1] "map 0.2444 0.2776 p= 0.0257"
[1] "ndcg 0.4545 0.5252 p= 0.0356"
[1] "P_20 0.431 0.531 p= 0.0266"
[1] "ndcg_cut_20 0.3982 0.4859 p= 0.0161"


The columns are:

  • Metric
  • First model (tfidf)
  • Second model (dir)
  • p-value from a one-tailed paired t-test (alternative hypothesis: first model < second model)
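For reference, the one-tailed paired t-test can be sketched in plain Python. This is not compare.R: it is an illustrative equivalent that computes the t statistic on per-query differences and obtains the p-value by numerically integrating the Student-t density with Simpson's rule. The function names are assumptions, and the sketch assumes the differences are not all identical (non-zero variance).

```python
import math


def t_cdf(t, df, steps=2000):
    """Student-t CDF via Simpson's rule on the density (steps is even)."""
    if t == 0:
        return 0.5
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    # Integrate the density from 0 to |t|, then use symmetry.
    h = abs(t) / steps
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def paired_t_less(first, second):
    """Return (t, p) for the alternative mean(first - second) < 0.

    Assumes paired lists of equal length with non-zero variance in
    the differences; no guard is included for the degenerate case.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, t_cdf(t, n - 1)
```

Here first and second would hold the per-query scores of the two models in the same query order, so a small p-value supports the claim that the second model scores higher.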

