...

  • output
    • model (dir, jm, okapi, rm3, tfidf, two)
      • collection (combined, train, test)
        • topics (orig, short, stopped)

Cross-validation:

Craig:

The "mkeval.sh" script generates trec_eval -c -q -m all_trec formatted output for each parameter combination. For example:

...

model.collection.topics.metric.out
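
mkeval.sh is a shell script; purely as an illustration of the sweep it performs (the qrels file, run layout, and output names below are hypothetical, and the real script also breaks output out per metric, as the naming scheme above shows), a rough Python equivalent might be:

import subprocess
from pathlib import Path

QRELS = "qrels/robust.qrels"   # hypothetical qrels file
RUNS = Path("runs")            # hypothetical: one TREC run file per model/collection/topics combination
OUT = Path("output")
OUT.mkdir(exist_ok=True)

for run_file in sorted(RUNS.glob("*.run")):        # e.g. dir.train.stopped.run
    out_file = OUT / (run_file.stem + ".out")
    # -c averages over all topics in the qrels, -q adds per-query rows,
    # -m all_trec emits every metric trec_eval knows about
    with open(out_file, "w") as out:
        subprocess.run(["trec_eval", "-c", "-q", "-m", "all_trec", QRELS, str(run_file)],
                       stdout=out, check=True)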

Garrick:

My old cross-validation framework is most useful as a library if you need to run relatively small computations as part of a larger process: https://github.com/gtsherman/cross-validation

For most uses, my "generic" cross-validation framework is a better fit, and is in the same vein as Craig's LOOCV above: https://github.com/gtsherman/generic-cross-validation

My framework leaves it up to you to produce the directory containing trec_eval output (one file per parameter combination, as above). Given this directory, you can run:

generic-cross-validation/run.py -d <dir> -r <seed> -k <num_folds> -m <metric> -s

For LOOCV, set k equal to the number of queries; in that case the seed is irrelevant. If you instead use, e.g., 10-fold cross-validation, setting the seed allows you to replicate your cross-validation results later by reproducing the random split of queries into folds. The metric may be any of the metrics available in the trec_eval output; run the script once for each metric of interest.
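
To make the role of the seed and of k concrete, here is a minimal sketch (not the framework's actual code) of a seeded split of queries into folds:

import random

def split_into_folds(query_ids, k, seed):
    # Shuffle a copy of the query list with a fixed seed, so the same seed
    # always yields the same assignment of queries to folds.
    ids = sorted(query_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

queries = [str(q) for q in range(301, 351)]                    # hypothetical topic numbers
folds_10 = split_into_folds(queries, k=10, seed=42)            # 10-fold CV: the seed decides the split
folds_loo = split_into_folds(queries, k=len(queries), seed=0)  # LOOCV: one query per fold, seed irrelevant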

If you want to see the optimal parameter settings for each fold, you can add the -v option. This will cause fold information to be printed to stderr like so:

Split into 10 folds
Items per fold: 10
Best params for fold 0 (n=10): 0.3_0.7 (0.53798)
Best params for fold 1 (n=10): 0.4_0.6 (0.545085555556)
Best params for fold 2 (n=10): 0.4_0.6 (0.534586666667)
Best params for fold 3 (n=10): 0.4_0.6 (0.539261111111)

n is the number of items in the fold (this can be fewer than the "Items per fold" line suggests when the number of queries is not evenly divisible by the number of folds). The value in parentheses at the end of each line is the target metric obtained for that fold with that parameter setting.

Comparing runs:

Craig:

A simple R script, compare.R, reads the output from two models and compares them across multiple metrics via t-test. For example:

...

The first column is the metric, the second is the first model (tfidf), the third is the second model (dir), and the fourth is the p-value.
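
compare.R is not reproduced here, but if you want to build the same kind of comparison yourself, the starting point is reading the per-query trec_eval output. A minimal Python sketch (assuming the usual three-column metric / query-id / value lines, with "all" marking the aggregate rows) is:

from collections import defaultdict

def read_trec_eval(path):
    # Returns {metric: {query_id: value}}, skipping the aggregate "all" rows.
    scores = defaultdict(dict)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            metric, qid, value = parts
            if qid == "all":
                continue
            try:
                scores[metric][qid] = float(value)
            except ValueError:
                pass   # a few fields (e.g. the run id) are not numeric
    return scores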

Garrick:

To compare cross-validation results, run ttest_generic.py with two "generic" cross-validation output files as arguments. The script expects both files to contain the same queries.

Since each cross-validation output file covers a single metric, you will need to run this once per metric of interest. The script returns the p-value of a paired one-sided t-test.
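
The internals of ttest_generic.py aren't shown here, but the underlying computation is a standard paired one-sided t-test over per-query values. A sketch using scipy (illustrative only; the {query: value} dictionaries are whatever you parse out of the two output files) is:

from scipy import stats

def paired_one_sided_ttest(a_scores, b_scores):
    # a_scores and b_scores map query id -> metric value for the two systems;
    # both are expected to cover the same queries.
    qids = sorted(set(a_scores) & set(b_scores))
    a = [a_scores[q] for q in qids]
    b = [b_scores[q] for q in qids]
    t, p_two_sided = stats.ttest_rel(a, b)
    # ttest_rel is two-sided by default; for the one-sided alternative "a > b",
    # halve the p-value when the observed difference is in that direction.
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return t, p_one_sided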

If you want to compare two TREC-formatted output files (not cross-validation output, but actual run output), you can also use ttest.py. It takes a few parameters:

./ttest.py -q <qrels> -m [map,ndcg] -f <file1> <file2> [-g]

This script runs trec_eval for you and reads in the data for either MAP or nDCG@20. If -g is specified, it runs a one-tailed t-test. This is not the greatest piece of code in existence; it's handy once in a while, but you probably won't want to use it too often.