Notes from an attempt to move from Indri to Lucene for evaluation.

Similarity

  • The LMSimilarity classes implement basic query likelihood (QL), not KL divergence.
  • Similarities are used at both index and query time.
  • At index time:
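    • computeNorm: stores a per-document normalization value that is later read back through getNormValues
  • At query time:
    • computeWeight is called once per query
    • getValueForNormalization handles query normalization and is called once per query
    • the score() method is called for each document
    • exactSimScorer: scorer for exact matches (e.g. term queries)
    • sloppySimScorer: scorer for sloppy matches (e.g. phrase and span queries)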
  • Document length is only accessible inside Similarity; our re-ranking approach cannot use it unless it is explicitly stored as a field
  • Lucene JM (Jelinek-Mercer) initially gave very different results from Indri JM. I suspected the way I was indexing the fields: for Indri, dumping all the JSON data into TEXT; for Lucene, creating separate TITLE and METADATA fields.
    • This was the case. With all data in the TEXT field, the scores are much closer.
  • The explain method ends up being handy for reverse-engineering the scores
  • Lucene's LMDirichlet model is odd: I'm not clear why it uses Math.log(1 + freq...). If the 1 is an epsilon, it should be in parentheses. They also clamp negative scores to zero:
    • return score > 0.0f ? score : 0.0f;
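
One possible reading of the Math.log(1 + freq...) form noted above: it may be the rank-equivalent rewrite of Dirichlet-smoothed query likelihood, log p(w|d) - log p(w|C), in which case the 1 falls out of the algebra rather than acting as an epsilon. The sketch below is plain Java with no Lucene dependency. The class name, method names, and sample values (mu = 2000, collection probability 0.001) are assumptions for illustration, and the formula is written from the standard Dirichlet derivation, not copied from Lucene's source.

```java
public class DirichletSketch {

    /** Full Dirichlet-smoothed log-likelihood log p(w|d); never positive. */
    public static double logProb(double freq, double docLen, double mu, double pC) {
        return Math.log((freq + mu * pC) / (docLen + mu));
    }

    /**
     * Assumed rank-equivalent form: log p(w|d) - log p(w|C).
     * Dividing numerator and denominator by mu * pC yields the "1 + freq / (mu * pC)" term.
     */
    public static double rankEquivalent(double freq, double docLen, double mu, double pC) {
        return Math.log(1 + freq / (mu * pC)) + Math.log(mu / (docLen + mu));
    }

    /** Same value with negative scores clamped to zero, mirroring the quoted return statement. */
    public static double clamped(double freq, double docLen, double mu, double pC) {
        double score = rankEquivalent(freq, docLen, mu, pC);
        return score > 0.0 ? score : 0.0;
    }

    public static void main(String[] args) {
        double mu = 2000, pC = 0.001, docLen = 100; // hypothetical sample values
        for (double freq : new double[] {0, 1, 3}) {
            System.out.printf("freq=%.0f full=%.4f rankEq=%.4f clamped=%.4f%n",
                freq,
                logProb(freq, docLen, mu, pC),
                rankEquivalent(freq, docLen, mu, pC),
                clamped(freq, docLen, mu, pC));
        }
    }
}
```

If this reading holds, the two forms differ only by log p(w|C), which is constant per query term, so rankings are unchanged but raw scores are not directly comparable with a log-probability score. The clamp would then zero out any term whose smoothed document probability falls below its collection probability, including every term absent from the document, and that could be a further source of score differences.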