...
- Ability to create an index controlling for specific transformations (stemming, stopping, field storage, etc.)
- Ability to index standard TREC collection formats as well as BioCADDIE JSON or XML data, generic XML, HTML, etc.
- Using a single index, ability to dynamically change retrieval models and parameters (e.g., IndriRunQuery)
- Output in TREC format for evaluation using trec_eval and related tools
- Ability to add new retrieval model implementations
- Standard baselines for comparison
- Handles standard TREC topic formats
- Multi-threaded and distributed processing for parameter sweeps
- Ideally, works with large collections, such as ClueWeb
- Cross validation
- Hypothesis/significance testing
- Query performance prediction: implement the basics
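The cross-validation and parameter-sweep requirements above could fit together roughly as follows. This is a minimal sketch, not taken from any of the tools discussed here: it assumes per-query metric values (for example, average precision from per-query trec_eval output) have already been collected for every parameter setting, and the `scores` layout and the `k_fold_cv` name are hypothetical.

```python
from statistics import mean


def k_fold_cv(scores, k=5):
    """Cross-validate a parameter sweep over queries.

    scores: {param_setting: {query_id: metric_value}} (hypothetical layout,
            e.g. built from per-query evaluation output for each run).
    Returns {query_id: metric_value} where each held-out query is scored
    with the setting that maximized the mean metric on the other folds.
    Requires k >= 2 so that every fold has training queries.
    """
    params = sorted(scores)
    qids = sorted(next(iter(scores.values())))
    # Deterministic round-robin assignment of queries to folds.
    folds = [qids[i::k] for i in range(k)]
    held_out = {}
    for test in folds:
        if not test:
            continue
        test_set = set(test)
        train = [q for q in qids if q not in test_set]
        # Pick the setting with the best mean metric on the training queries.
        best = max(params, key=lambda p: mean(scores[p][q] for q in train))
        for q in test:
            held_out[q] = scores[best][q]
    return held_out
```

The mean of the returned values is the cross-validated estimate for the sweep. Because the function returns per-query values, they can also be fed to a significance test.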
With Indri (and related tools) we can do the following:
...
- Indexing parameters in XML format
- Retrieval parameters in XML format
- Index support for CACM, TRECAquaint, TRECNEWS, Tipster formats
- In addition to Lucene similarities, BM25L, Okapi BM25, SMART BNNBNN
- IndexerApp
- RetrievalApp
- RetrievalAppQueryExpansion
IR-Utils
The ir-utils project is perhaps the best of both worlds: it supports evaluation using both Indri and Lucene. It's also a bit of a mess and is missing things we've added in our own forks.
What it has:
- Basic framework for running models with parameterization
- A variety of scorers
- Weak evaluation support (mainly use trec_eval)
- Abstraction of Indri and Lucene indexes
- Lucene indexer support for TREC, StreamCorpus, Wiki, and XML formats
- LuceneRunQuery, LuceneBuildIndex classes
- TREC-formatted output
- Feedback models
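For reference, TREC-formatted output as consumed by trec_eval is one line per retrieved document with six whitespace-separated columns: query id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below illustrates that format only. It is not the ir-utils implementation, and the `write_trec_run` name and the `results` layout are assumptions made for the example.

```python
def write_trec_run(results, run_tag, out):
    """Write ranked results in the standard TREC run format.

    results: {query_id: [(docno, score), ...]} (hypothetical layout)
    run_tag: short identifier for the run
    out:     any writable text stream (open file, sys.stdout, ...)
    """
    for qid in sorted(results):
        # Rank documents by descending score within each query.
        ranked = sorted(results[qid], key=lambda pair: pair[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            out.write(f"{qid} Q0 {docno} {rank} {score:.6f} {run_tag}\n")
```

Passing `sys.stdout` as `out` prints the run. Passing an open file produces something that can be handed to trec_eval together with a qrels file.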
What it could have with a few PRs:
- YAML-based collection/model parameterization framework
- Multi-threaded query runner
- Distributed query runner (via Mike's Kubernetes work)
- Cross-validation framework
- Permutation test (via Galago ireval)
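To make the permutation-test item concrete, here is a generic sketch of a two-sided paired randomization test over per-query metric values. It is not Galago's ireval code and makes no claim about how ireval implements its test. The function name, the number of trials, and the fixed seed are assumptions chosen for illustration.

```python
import random
from statistics import mean


def paired_randomization_test(baseline, treatment, trials=10000, seed=0):
    """Two-sided paired randomization test on per-query metric values.

    baseline, treatment: equal-length lists aligned by query.
    Returns the estimated p-value for the observed mean difference.
    """
    if len(baseline) != len(treatment):
        raise ValueError("score lists must be aligned by query")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each per-query difference
        # is arbitrary, so flip each one with probability 0.5.
        permuted = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    return extreme / trials
```

The two input lists would typically be the per-query values of one metric (for example, average precision) for a baseline run and a candidate run over the same topics.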
Other notes
Re-reading Zhai's SLMIR, I noticed different ranges for the Okapi BM25 parameters.
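As a reminder of what the two parameters in question do, here is a minimal sketch of one common Okapi BM25 formulation for a single query term. The defaults `k1=1.2` and `b=0.75` are the values Lucene's BM25Similarity uses by default, not the ranges from either source mentioned above. The non-negative IDF variant and the function name are assumptions made for this example.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Score one query term against one document with Okapi BM25.

    tf:          term frequency in the document
    df:          number of documents containing the term
    k1:          controls term-frequency saturation
    b:           controls document-length normalization
                 (b=0 disables it, b=1 applies it fully)
    """
    # Non-negative IDF variant (assumption: log(1 + ...) form).
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length-normalized saturation term.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

A document's score for a query is the sum of this value over the query terms, so a sweep over `k1` and `b` only needs to vary these two arguments.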
...