We currently use Indri for our evaluation process. The goal of NDS-867 is to implement a similar evaluation framework based on ElasticSearch.

Requirements

Basic requirements for an evaluation framework:

With Indri (and related tools) we can do the following:

What we don't have under this framework:

ElasticSearch Woes

Unfortunately, the ElasticSearch similarity is fixed for an index at creation time. This means that evaluating a particular parameter combination would require re-indexing the complete collection for each combination, which is likely prohibitive. There are changes proposed for later versions of ElasticSearch, but for now it seems we may want to stick with Indri.
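
For illustration, a minimal sketch of creating an index with one specific BM25 parameterization via the low-level Java REST client. The index name, mapping, field name, and the similarity name "tuned_bm25" are all placeholders, and the exact client API and mapping syntax depend on which ElasticSearch version we'd target. The point is that the similarity block lives in the index settings, so every (k1, b) combination implies a separate index and a full re-index:

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.RestClient;

    public class CreateTunedIndex {
        public static void main(String[] args) throws Exception {
            // Low-level REST client; host and port are placeholders.
            RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

            // The similarity (and its k1/b values) is part of the index settings,
            // so changing either parameter means building a new index.
            Request create = new Request("PUT", "/eval-k1-09-b-04");
            create.setJsonEntity(
                "{"
              + " \"settings\": { \"index\": { \"similarity\": {"
              + "   \"tuned_bm25\": { \"type\": \"BM25\", \"k1\": 0.9, \"b\": 0.4 } } } },"
              + " \"mappings\": { \"doc\": { \"properties\": {"
              + "   \"text\": { \"type\": \"text\", \"similarity\": \"tuned_bm25\" } } } }"
              + "}");
            client.performRequest(create);
            client.close();
        }
    }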

What options do we have:

Evaluation with Lucene

A very relevant workshop report from SIGIR: Lucene4IR: Developing Information Retrieval Evaluation Resources using Lucene.

Also worth a read:  Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR)

In short, it looks like there's been recent work to develop an evaluation framework around Lucene. We have some support for this in ir-utils, but it hasn't been widely used (we've always used the Indri implementation for consistency). So we have a choice: work with the lucene4ir workshop code, which is open source but was primarily developed for a single workshop, or continue working in ir-utils, since that's what we've got. In the latter case, we'd need to extend ir-utils to have better support for Lucene similarities.
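
For contrast with the ElasticSearch situation, a rough sketch of the kind of Lucene support we'd want: Lucene lets you set the similarity on the IndexSearcher at query time, so a BM25 parameter sweep can reuse a single index. None of this is existing ir-utils or lucene4ir API; the index path, field names ("text", "docno"), topic id, and run tags are placeholders.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.store.FSDirectory;

    public class Bm25Sweep {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
            QueryParser parser = new QueryParser("text", new StandardAnalyzer());
            Query query = parser.parse("hubble telescope achievements");  // e.g., one TREC topic

            // Sweep BM25 parameters against the same index -- no re-indexing needed,
            // since Lucene applies the similarity at search time.
            for (float k1 : new float[] {0.9f, 1.2f, 2.0f}) {
                for (float b : new float[] {0.4f, 0.75f}) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    searcher.setSimilarity(new BM25Similarity(k1, b));
                    TopDocs hits = searcher.search(query, 1000);

                    // Emit TREC run format for trec_eval: qid Q0 docno rank score tag
                    int rank = 1;
                    for (ScoreDoc sd : hits.scoreDocs) {
                        String docno = searcher.doc(sd.doc).get("docno");
                        System.out.printf("301 Q0 %s %d %f bm25_k1_%.1f_b_%.2f%n",
                                docno, rank++, sd.score, k1, b);
                    }
                }
            }
            reader.close();
        }
    }

One caveat worth checking: length norms are encoded at index time by the indexing-time similarity, so sweeping across similarities that encode norms differently may still require re-indexing, even though sweeping k1 and b for BM25 does not.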

Lucene4IR Framework

Supports the following:

IR-Utils

The ir-utils project is maybe the best of both worlds, supporting evaluation with both Indri and Lucene. It's also a bit of a mess, and it's missing things we've added on our own forks.

What it has:

What it could have with a few PRs:


Other notes

Re-reading Zhai's SLMIR (Statistical Language Models for Information Retrieval), I noticed different recommended ranges for the Okapi BM25 parameters.
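
For reference (my summary, not quoted from SLMIR), the usual Okapi BM25 scoring function and its tunable parameters:

    \mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
        \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}

where f(t, D) is the frequency of term t in document D, |D| is the document length, and avgdl is the average document length in the collection. The tunable parameters are k_1 (term-frequency saturation) and b (length normalization). Commonly cited defaults are k_1 = 1.2 and b = 0.75, but the recommended range for k_1 in particular varies across sources, which is presumably the discrepancy to reconcile before fixing a sweep range.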