Lucene Notes

Notes from an attempt to move from Indri to Lucene for evaluation.

Similarity

The LMSimilarity classes implement basic query likelihood (QL), not KL divergence.
Similarities are used at both index and query time.
At index time:
- computeNorm – stores a per-document normalization value, later read back through getNormValues
At query time:
- computeWeight is called once per query
- getValueForNormalization performs query normalization, also called once per query
- score() is called for each document
- exactSimScorer
- sloppySimScorer
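The call order above can be mimicked with a toy class. This is only a sketch to keep the sequence straight: ToySimilarity and its method signatures are hypothetical stand-ins, not the Lucene API, and the scoring formula inside is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

class SimilarityLifecycleSketch {
    // Hypothetical stand-in for a Similarity; this is not a Lucene class.
    static class ToySimilarity {
        final List<String> calls = new ArrayList<>();

        // Index time: one normalization value per document (here, just its length).
        long computeNorm(int docLength) {
            calls.add("computeNorm");
            return docLength;
        }

        // Query time: collection-level statistics, computed once per query.
        double computeWeight(long termCollectionFreq, long totalTerms) {
            calls.add("computeWeight");
            return (double) termCollectionFreq / totalTerms;
        }

        // Query time: called once for every matching document.
        double score(double collectionProb, int freq, long norm) {
            calls.add("score");
            return Math.log(1.0 + (freq / (double) norm) / collectionProb);
        }
    }

    static List<String> run() {
        ToySimilarity sim = new ToySimilarity();
        int[] docLengths = {100, 250};
        int[] termFreqs = {3, 1};

        // Indexing pass: norms are computed and stored per document.
        long[] norms = new long[docLengths.length];
        for (int i = 0; i < docLengths.length; i++) {
            norms[i] = sim.computeNorm(docLengths[i]);
        }

        // Query pass: one weight per query, then one score per document.
        double weight = sim.computeWeight(4, 350);
        for (int i = 0; i < docLengths.length; i++) {
            sim.score(weight, termFreqs[i], norms[i]);
        }
        return sim.calls;
    }

    public static void main(String[] args) {
        // prints [computeNorm, computeNorm, computeWeight, score, score]
        System.out.println(run());
    }
}
```

The point is that anything score() needs about a document has to have been captured at index time, which is why the stored norm matters.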
Document length is only accessible to the Similarity; our re-ranking approach cannot see it unless it is explicitly stored as a field.
For some reason Lucene JM is giving very different results than Indri JM. This may be related to how I'm indexing the fields: for Indri I dump the JSON data into TEXT, while for Lucene I create separate TITLE and METADATA fields.
- This was the case. By putting all data in the TEXT field, the scores are much closer.
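When comparing raw scores rather than rankings, the two systems may also use different forms of the JM formula. The sketch below assumes Lucene's LMJelinekMercerSimilarity computes log(1 + ((1 - lambda) * freq / docLen) / (lambda * p(w|C))) with boost omitted, and that Indri reports the plain log of the smoothed probability; both are assumptions to check against the respective sources. Here lambda is the weight on the collection model, and the class and method names are made up for illustration.

```java
class JmScoreSketch {
    // Assumed Lucene-style JM term score (query boost omitted).
    static double luceneStyle(double freq, double docLen, double collProb, double lambda) {
        return Math.log(1.0 + ((1.0 - lambda) * freq / docLen) / (lambda * collProb));
    }

    // Plain log of the JM-smoothed probability (assumed Indri-style form).
    static double logProbStyle(double freq, double docLen, double collProb, double lambda) {
        return Math.log((1.0 - lambda) * freq / docLen + lambda * collProb);
    }

    public static void main(String[] args) {
        double collProb = 0.001;
        double lambda = 0.4;
        double[][] docs = {{3, 100}, {1, 250}, {0, 80}}; // {freq, docLen}
        for (double[] d : docs) {
            double a = luceneStyle(d[0], d[1], collProb, lambda);
            double b = logProbStyle(d[0], d[1], collProb, lambda);
            // The gap is the same for every document: -log(lambda * collProb).
            System.out.println(a + " " + b + " diff=" + (a - b));
        }
    }
}
```

Under these assumptions the two forms differ only by a per-term constant that does not depend on the document, so they rank documents identically even though the absolute values look nothing alike (one is non-negative, the other is a negative log probability).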
The explain method ends up being handy for reverse-engineering the scores.
Lucene's LMDirichlet model is odd: I'm not clear why it uses Math.log(1 + freq...). If the 1 is meant as an epsilon, then it should be in parentheses. They are also ignoring negative scores:
- return score > 0.0f ? score : 0.0f;
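One possible reading of the 1: it is not an epsilon but falls out of dividing the Dirichlet-smoothed probability by the collection probability. The sketch below assumes the full Lucene expression is log(1 + freq / (mu * p(w|C))) + log(mu / (docLen + mu)) with boost omitted, which should be verified against the LMDirichletSimilarity source. The class and method names are hypothetical.

```java
class DirichletScoreSketch {
    // Assumed Lucene-style Dirichlet term score before clamping (boost omitted).
    static double unclamped(double freq, double docLen, double collProb, double mu) {
        return Math.log(1.0 + freq / (mu * collProb)) + Math.log(mu / (docLen + mu));
    }

    // Same score with the clamp from the notes: negative values become 0.
    static double clamped(double freq, double docLen, double collProb, double mu) {
        double score = unclamped(freq, docLen, collProb, mu);
        return score > 0.0 ? score : 0.0;
    }

    // log( p_dirichlet(w|d) / p(w|C) ), the textbook smoothed probability
    // divided by the collection probability.
    static double logRatio(double freq, double docLen, double collProb, double mu) {
        double pDir = (freq + mu * collProb) / (docLen + mu);
        return Math.log(pDir / collProb);
    }

    public static void main(String[] args) {
        double mu = 2000;
        // Term over-represented in a short document: positive score.
        System.out.println(unclamped(5, 100, 0.001, mu) + " vs " + logRatio(5, 100, 0.001, mu));
        // Term under-represented in a long document: negative, so clamped to 0.
        System.out.println(unclamped(1, 5000, 0.01, mu) + " -> " + clamped(1, 5000, 0.01, mu));
    }
}
```

If that reading is right, the unclamped score is just the log of the smoothed probability relative to the collection model, and the clamp means a term that occurs less often than the collection rate predicts scores the same as an absent term (both 0).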