...

Ability to create an index while controlling for specific transformations (stemming, stopping, field storage, etc.)
Ability to index standard TREC collection formats as well as BioCADDIE JSON or XML data, generic XML, HTML, etc.
Using a single index, ability to change retrieval models and parameters dynamically (as with IndriRunQuery)
Output in TREC format for evaluation using trec_eval and related tools
Ability to add new retrieval model implementations
Standard baselines for comparison
Handles standard TREC topic formats
Multi-threaded and distributed processing for parameter sweeps
- Ideally, works with large collections, such as ClueWeb
Cross validation
Hypothesis/significance testing
Query performance prediction (at least the basics)
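
As an illustration of how the parameter-sweep and cross-validation requirements fit together, here is a minimal sketch of k-fold cross-validation over queries. It assumes the sweep has already produced a per-query metric (for example average precision) for every parameter setting; the function name, the input layout, and the fold assignment are hypothetical and not taken from Indri, Lucene, or ir-utils.

```python
from statistics import mean


def cross_validate(scores, k=5):
    """Pick the best parameter setting on training queries, score held-out queries.

    scores: {param_setting: {query_id: metric_value}} (hypothetical layout;
            every setting must cover the same query ids).
    Returns {query_id: metric_value} using, for each query, the setting that
    was selected without looking at that query's fold.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    qids = sorted(next(iter(scores.values())))
    # Deterministic round-robin folds; shuffle qids first if order matters.
    folds = [qids[i::k] for i in range(k)]
    held_out = {}
    for test in folds:
        test_set = set(test)
        train = [q for q in qids if q not in test_set]
        # Setting with the highest mean metric on the training queries.
        best = max(scores, key=lambda p: mean(scores[p][q] for q in train))
        for q in test:
            held_out[q] = scores[best][q]
    return held_out
```

The mean of the returned values is the cross-validated estimate for the model; the per-query values can be passed on to a significance test.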

With Indri (and related tools) we can do the following:

...

Indexing parameters in XML format
Retrieval parameters in XML format
Index support for CACM, TRECAquaint, TRECNEWS, Tipster formats
Similarities beyond the Lucene built-ins: BM25L, Okapi BM25, SMART BNNBNN
IndexerApp
RetrievalApp
RetrievalAppQueryExpansion

IR-Utils

The ir-utils project is arguably the best of both worlds: it supports evaluation with both Indri and Lucene. It is also a bit of a mess and is missing things we have added on our own forks.

What it has:

Basic framework for running models with parameterization
A variety of scorers
Weak evaluation support (we mainly use trec_eval)
Abstraction of Indri and Lucene indexes
Lucene indexer with support for Trec, StreamCorpus, Wiki, and Xml formats
LuceneRunQuery, LuceneBuildIndex classes
Trec-formatted output
Feedback models
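
For reference, the TREC-formatted output mentioned above is the six-column run format that trec_eval reads: query id, the literal `Q0`, document id, rank, score, and run tag. The sketch below is a generic formatter, not the ir-utils implementation; the function name and the default run tag are hypothetical.

```python
def format_trec_run(query_id, results, run_tag="example_run"):
    """Return TREC run-file lines for one query.

    results: iterable of (docno, score) pairs in any order.
    Lines are ordered by descending score and ranked from 1.
    """
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    lines = []
    for rank, (docno, score) in enumerate(ranked, start=1):
        # Columns: qid, iteration (conventionally "Q0"), docno, rank, score, run tag
        lines.append(f"{query_id} Q0 {docno} {rank} {score:.6f} {run_tag}")
    return lines
```

Writing the lines for all queries to one file gives a run that can be scored with `trec_eval <qrels> <run>`.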

What it could have with a few PRs:

YAML-based collection/model parameterization framework
Multi-threaded query runner
Distributed query runner (via Mike's Kubernetes work)
Cross-validation framework
Permutation test (via Galago ireval)
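
To make the permutation test item concrete, here is a minimal sketch of a paired randomization test on per-query metric values from two runs. It is a generic two-sided sign-flipping test and is not claimed to match the Galago ireval implementation; the function name, trial count, and seed are assumptions.

```python
import random


def paired_randomization_test(a, b, trials=10000, seed=0):
    """Two-sided paired randomization test on the mean per-query difference.

    a, b: per-query metric values for two runs, aligned by query.
    Returns the fraction of random sign assignments whose absolute mean
    difference is at least as large as the observed one.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # fixed seed so results are repeatable
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            extreme += 1
    return extreme / trials
```

The inputs would typically be the per-query values reported by `trec_eval -q` for a baseline run and an experimental run.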

Other notes

Re-reading Zhai's SLMIR, noticed different ranges for Okapi BM25 parameters.
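
For context on which parameters are in question, here is a minimal sketch of one common Okapi BM25 term-scoring formulation, showing where k1 (term-frequency saturation) and b (length normalization) enter. The IDF variant and the defaults k1=1.2 and b=0.75 are commonly used choices assumed here for illustration; they are not the ranges from the book or the ir-utils defaults.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    tf: term frequency in the document; df: number of documents containing
    the term. k1 must be positive when tf is 0 to avoid division by zero.
    """
    # Non-negative IDF variant (assumed; other formulations exist).
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # b=0 disables length normalization, b=1 applies it fully.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

A document's score is the sum of this value over the query terms, so a sweep over k1 and b only needs to vary these two arguments.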

...
