We currently use Indri for our evaluation process. The goal of NDS-867 is to implement a similar evaluation framework based on ElasticSearch.

Requirements

Basic requirements for an evaluation framework:

With Indri (and related tools) we can do the following:

What we don't have under this framework:

ElasticSearch Woes

Unfortunately, the ElasticSearch similarity is fixed for an index at creation time. This means that evaluating a particular parameter combination would require re-indexing the complete collection for each combination, which is likely prohibitive. There are changes proposed for later versions of ElasticSearch, but for now it seems we may want to stick with Indri.
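
For illustration, a minimal sketch of creating an index with one specific BM25 parameterization via the low-level Java REST client. The index name, mapping, field name, and the similarity name "tuned_bm25" are all placeholders, and the exact client API and mapping syntax depend on which ElasticSearch version we'd target. The point is that the similarity block lives in the index settings, so every (k1, b) combination implies a separate index and a full re-index:

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.RestClient;

    public class CreateTunedIndex {
        public static void main(String[] args) throws Exception {
            // Low-level REST client; host and port are placeholders.
            RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

            // The similarity (and its k1/b values) is part of the index settings,
            // so changing either parameter means building a new index.
            Request create = new Request("PUT", "/eval-k1-09-b-04");
            create.setJsonEntity(
                "{"
              + " \"settings\": { \"index\": { \"similarity\": {"
              + "   \"tuned_bm25\": { \"type\": \"BM25\", \"k1\": 0.9, \"b\": 0.4 } } } },"
              + " \"mappings\": { \"doc\": { \"properties\": {"
              + "   \"text\": { \"type\": \"text\", \"similarity\": \"tuned_bm25\" } } } }"
              + "}");
            client.performRequest(create);
            client.close();
        }
    }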

What options do we have:

Evaluation with Lucene

A very relevant workshop report from SIGIR: Lucene4IR: Developing Information Retrieval Evaluation Resources using Lucene.

Also worth a read:  Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR)

In short, it looks like there's been recent work to develop an evaluation framework around Lucene. We have some support for this in ir-utils, but it hasn't been widely used (we've always used the Indri implementation for consistency). So we have a choice: work with the lucene4ir workshop code, which is open source but was primarily developed for a single workshop, or continue working in ir-utils, since that's what we've got. In the latter case, we'd need to extend ir-utils to have better support for Lucene similarities.
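
For contrast with the ElasticSearch situation, a rough sketch of the kind of Lucene support we'd want: Lucene lets you set the similarity on the IndexSearcher at query time, so a BM25 parameter sweep can reuse a single index. None of this is existing ir-utils or lucene4ir API; the index path, field names ("text", "docno"), topic id, and run tags are placeholders.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.store.FSDirectory;

    public class Bm25Sweep {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
            QueryParser parser = new QueryParser("text", new StandardAnalyzer());
            Query query = parser.parse("hubble telescope achievements");  // e.g., one TREC topic

            // Sweep BM25 parameters against the same index -- no re-indexing needed,
            // since Lucene applies the similarity at search time.
            for (float k1 : new float[] {0.9f, 1.2f, 2.0f}) {
                for (float b : new float[] {0.4f, 0.75f}) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    searcher.setSimilarity(new BM25Similarity(k1, b));
                    TopDocs hits = searcher.search(query, 1000);

                    // Emit TREC run format for trec_eval: qid Q0 docno rank score tag
                    int rank = 1;
                    for (ScoreDoc sd : hits.scoreDocs) {
                        String docno = searcher.doc(sd.doc).get("docno");
                        System.out.printf("301 Q0 %s %d %f bm25_k1_%.1f_b_%.2f%n",
                                docno, rank++, sd.score, k1, b);
                    }
                }
            }
            reader.close();
        }
    }

One caveat worth checking: length norms are encoded at index time by the indexing-time similarity, so sweeping across similarities that encode norms differently may still require re-indexing, even though sweeping k1 and b for BM25 does not.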

Lucene4IR Framework

Supports the following:

IR-Utils

The ir-utils project is maybe the best of both worlds, supporting evaluation with both Indri and Lucene. It's also a bit of a mess, and it's missing things we've added on our own forks.

What it has:

What it could have with a few PRs:


Other notes

Re-reading Zhai's SLMIR (Statistical Language Models for Information Retrieval), I noticed different recommended ranges for the Okapi BM25 parameters.
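
For reference (my summary, not quoted from SLMIR), the usual Okapi BM25 scoring function and its tunable parameters:

    \mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
        \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}

where f(t, D) is the frequency of term t in document D, |D| is the document length, and avgdl is the average document length in the collection. The tunable parameters are k_1 (term-frequency saturation) and b (length normalization). Commonly cited defaults are k_1 = 1.2 and b = 0.75, but the recommended range for k_1 in particular varies across sources, which is presumably the discrepancy to reconcile before fixing a sweep range.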