Notes from attempt to move from Indri to Lucene for evaluation.

Similarity

  • The LMSimilarity classes are basic QL, not KL.
  • Similarities are used at both index and query time.
  • At index time:
    • computeNorm – stores per-document normalization value later used by getNormValues
  • At query time:
    • computeWeight – called once per query
    • getValueForNormalization – query normalization, called once per query
    • score() – called once for each document scored
    • exactSimScorer
    • sloppySimScorer
  • Document length is only accessible to the Similarity (not available to our re-ranking approach unless we explicitly store it as a field; see the doc-length sketch after this list)
  • For some reason, Lucene JM is giving very different results than Indri JM, but this may be related to how I'm indexing the fields (for Indri, dumping the JSON data into TEXT; for Lucene, actually creating separate TITLE and METADATA fields)
    • This was the case. By putting all data in the TEXT field, the scores are much closer.
  • The explain method ends up being handy for reverse-engineering the scores (see the explain sketch after this list)
  • Lucene's LMDirichlet model is odd – I'm not clear why it uses Math.log(1 + freq...). If the 1 is an epsilon, then it should be inside the parentheses. However, they are also ignoring negative scores (see the similarity sketch after this list):
    •  return score > 0.0f ? score : 0.0f;
    • The explanation for this is part of NDS-914
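
One way around the document-length problem noted above is to compute the length ourselves at index time and store it as doc values, so the re-ranker can read it back without going through a Similarity. This is only a sketch: the field name "doclen", the class name, and the Lucene 5/6 random-access NumericDocValues API are assumptions.

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.NumericDocValues;

    public class DocLengthField {

        // Index time: count tokens with the same analyzer used for indexing and
        // attach the count to the document as doc values ("doclen" is an assumed name).
        public static void addDocLength(Document doc, Analyzer analyzer,
                                        String field, String text) throws IOException {
            long docLen = 0;
            try (TokenStream ts = analyzer.tokenStream(field, text)) {
                ts.reset();
                while (ts.incrementToken()) {
                    docLen++;
                }
                ts.end();
            }
            doc.add(new NumericDocValuesField("doclen", docLen));
        }

        // Re-rank time: read the length back per segment (docId is leaf-local).
        public static long getDocLength(LeafReader reader, int docId) throws IOException {
            NumericDocValues lengths = reader.getNumericDocValues("doclen");
            return lengths == null ? 0 : lengths.get(docId);
        }
    }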
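The explain call mentioned above is just a method on IndexSearcher; something like the following (hypothetical helper, names assumed) dumps Lucene's own arithmetic for a single (query, document) pair:

    import java.io.IOException;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class ExplainDemo {
        // Print the nested score breakdown for the top hit of a query.
        public static void explainTopHit(IndexSearcher searcher, Query query) throws IOException {
            TopDocs topDocs = searcher.search(query, 1);
            Explanation explanation = searcher.explain(query, topDocs.scoreDocs[0].doc);
            System.out.println(explanation);
        }
    }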
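For reference, here is a minimal sketch of the kind of LM similarity we are plugging in. This is not the actual OurDirichletSimilarity from the stack trace below – it is a hypothetical version written against the Lucene 5/6 SimilarityBase/LMSimilarity API, where the base class handles the computeNorm/simScorer plumbing and only asks us for a per-matching-document score(stats, freq, docLen):

    import org.apache.lucene.search.similarities.BasicStats;
    import org.apache.lucene.search.similarities.LMSimilarity;

    // Hypothetical sketch; class name and mu parameter are illustrative only.
    public class SketchDirichletSimilarity extends LMSimilarity {
        private final float mu;

        public SketchDirichletSimilarity(float mu) {
            this.mu = mu;
        }

        @Override
        protected float score(BasicStats stats, float freq, float docLen) {
            // Textbook Dirichlet-smoothed QL: log((tf + mu * p(w|C)) / (|D| + mu)).
            // Lucene's built-in LMDirichletSimilarity instead computes
            // log(1 + tf / (mu * p(w|C))) + log(mu / (|D| + mu)) and clamps negatives
            // to 0; algebraically that is the textbook score minus the
            // document-independent constant log p(w|C), which is one reading of why
            // the "1" is there and why negative scores get thrown away.
            float collectionProb = ((LMStats) stats).getCollectionProbability();
            return (float) Math.log((freq + mu * collectionProb) / (docLen + mu));
        }

        @Override
        public String getName() {
            return "SketchDirichlet(" + mu + ")";
        }
    }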

Internal Query Execution Procedure

  • A Query is a reusable representation of the search query. Internally, Lucene converts queries into non-reusable Weight objects.
  • Queries are automatically parsed by a QueryParser. In the case of a raw text string, they appear to be parsed into BooleanQuery objects (see the parsing sketch after this list).
  • When IndexSearcher.search is called, it eventually calls IndexSearcher.createNormalizedWeight, which itself calls createWeight on the query object
  • BooleanQuery.createWeight constructs a new BooleanWeight from the BooleanQuery
  • The BooleanWeight constructor iterates over BooleanClause objects in the BooleanQuery. While we never explicitly see these BooleanClauses get created (because it happens during query parsing), I believe there is one BooleanClause per query term. In the constructor, a new Weight is created from each BooleanClause that comprises the BooleanQuery. This means that a single BooleanWeight is made up of a list of Weights representing each term in the query. My guess is that the QueryParser has constructed these BooleanClauses to contain one TermQuery per term in the query (a TermQuery is a single term query).
  • At some point, Lucene calls TermQuery.scorer (I believe starting with IndexSearcher.search calling (Boolean)Weight.bulkScorer), which returns a new TermScorer object that is used to score the TermQuery (which, remember, covers just a single word of the overall BooleanQuery).
  • Part of TermQuery.scorer is getting a PostingsEnum of all the documents containing the term (a.k.a. the postings). DefaultBulkScorer.scoreRange (which we can see in the stack trace below) involves advancing through the document postings (the iterator.nextDoc() call) and, for each document, calling OrCollector.collect, which itself calls TermScorer.score to score the individual query term in the document currently selected from the PostingsEnum. TermScorer.score in turn ultimately calls OurDirichletSimilarity.score (it is given our Similarity implementation during construction in TermQuery.scorer, which itself fetches it from the IndexSearcher).
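
A quick way to confirm the parsing behavior described above – the field name "text" and the query string are just examples, and the class is a throwaway demo:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    public class ParseDemo {
        public static void main(String[] args) throws Exception {
            Query query = new QueryParser("text", new StandardAnalyzer())
                    .parse("hubble space telescope");
            System.out.println(query.getClass().getSimpleName());  // BooleanQuery
            for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
                // One single-term TermQuery per clause, e.g. "SHOULD text:hubble"
                System.out.println(clause.getOccur().name() + " " + clause.getQuery());
            }
        }
    }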

So there's our problem: the DefaultBulkScorer only scores documents that appear in the postings list for a given term. These scores are then collected by the OrCollector to construct a single overall document score, but since documents that don't contain a query term never appear in that term's postings list, the final score of the document doesn't include any contribution for that term.
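
To see this concretely, here is a sketch (Lucene 5/6 postings API; class and argument names are illustrative) that walks a single term's postings directly. The only documents that ever come back are documents containing the term, which is exactly the set DefaultBulkScorer hands to the collector:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    public class PostingsDemo {
        // Print every (docID, term frequency) pair in the postings of one term.
        public static void dumpPostings(IndexReader reader, String field, String termText)
                throws IOException {
            for (LeafReaderContext leaf : reader.leaves()) {
                Terms terms = leaf.reader().terms(field);
                if (terms == null) continue;
                TermsEnum termsEnum = terms.iterator();
                if (!termsEnum.seekExact(new BytesRef(termText))) continue;
                PostingsEnum postings = termsEnum.postings(null, PostingsEnum.FREQS);
                int doc;
                while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    // Documents without the term simply never appear here.
                    System.out.println("doc " + (leaf.docBase + doc) + " tf=" + postings.freq());
                }
            }
        }
    }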

We will need to find a way to either make the BulkScorer iterate over the combined postings of all query terms, or somehow add in the smoothing during the collection phase (when the scores of all terms are being combined into one per-document score). The former is probably much slower but more flexible than the latter when it comes to implementing our own custom scorers. I'll be thinking about how this might be accomplished.

java.lang.Exception: Stack trace
        at java.lang.Thread.dumpStack(Thread.java:1333)
        at org.retrievable.lucene.searching.similarities.OurDirichletSimilarity.score(OurDirichletSimilarity.java:37)
        at org.apache.lucene.search.similarities.SimilarityBase$BasicSimScorer.score(SimilarityBase.java:281)
        at org.apache.lucene.search.TermScorer.score(TermScorer.java:66)
        at org.apache.lucene.search.BooleanScorer$OrCollector.collect(BooleanScorer.java:143)
        at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:221)
        at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:208)
        at org.apache.lucene.search.BooleanScorer$BulkScorerAndDoc.score(BooleanScorer.java:61)
        at org.apache.lucene.search.BooleanScorer.scoreWindowIntoBitSetAndReplay(BooleanScorer.java:219)
        at org.apache.lucene.search.BooleanScorer.scoreWindowMultipleScorers(BooleanScorer.java:266)
        at org.apache.lucene.search.BooleanScorer.scoreWindow(BooleanScorer.java:311)
        at org.apache.lucene.search.BooleanScorer.score(BooleanScorer.java:335)
        at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:591)
        at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:449)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:460)
        at org.retrievable.lucene.searching.Searcher.search(Searcher.java:106)
        at org.retrievable.lucene.searching.Searcher.main(Searcher.java:85)