...
- The LMSimilarity classes are basic query likelihood (QL), not KL-divergence.
- Similarities are used at both index and query time.
- At index time:
    - computeNorm stores a per-document normalization value that is later read back via getNormValues
- At query time:
    - computeWeight is called once per query
    - getValueForNormalization handles query normalization and is called once per query
    - the score() method is called for each document
    - exactSimScorer
    - sloppySimScorer
- Document length is only accessible to Similarity (not for our re-ranking approach without explicitly storing as a field)
- For some reason, Lucene JM is giving very different results than Indri JM, but this may be related to how I'm indexing the fields (for Indri, I dump the JSON data into TEXT; for Lucene, I actually create separate TITLE and METADATA fields)
- This was the case. With all data in the TEXT field, the scores are much closer.
- The explain method ends up being handy for reverse-engineering the scores
- Lucene's LMDirichlet model is odd. I'm not clear why it computes Math.log(1 + freq...). If the 1 is an epsilon, it should be inside the parentheses. They are also clamping negative scores to zero:
    - return score > 0.0f ? score : 0.0f;
- The explanation for this is part of NDS-914
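It turns out the log(1 + freq...) form has an algebraic explanation: the 1 is not an epsilon but falls out of factoring the classic Dirichlet-smoothed term score, shifted by the document-independent constant -log p(w|C). That shift is also why scores can go negative (and get clamped). A quick numeric check with made-up values for tf, document length, and collection probability:

```java
// Verify: log(1 + tf/(mu*p)) + log(mu/(dl + mu))
//       = log((tf + mu*p)/(dl + mu)) - log(p)
// i.e., Lucene's LMDirichlet score is the classic Dirichlet-smoothed
// query likelihood per term, shifted by the document-independent
// constant -log p(w|C). All numbers below are hypothetical.
public class DirichletForms {
    public static void main(String[] args) {
        double tf = 3.0;      // term frequency in the document (made up)
        double dl = 250.0;    // document length (made up)
        double p  = 0.001;    // collection probability p(w|C) (made up)
        double mu = 2500.0;   // Dirichlet smoothing parameter

        // Lucene-style form
        double lucene = Math.log(1 + tf / (mu * p)) + Math.log(mu / (dl + mu));

        // Classic query-likelihood form, shifted by -log p(w|C)
        double classic = Math.log((tf + mu * p) / (dl + mu)) - Math.log(p);

        // The two agree up to floating-point error
        System.out.println(lucene);
        System.out.println(classic);
        System.out.println(Math.abs(lucene - classic) < 1e-12);
    }
}
```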
Internal Query Execution Procedure
- A Query is a reusable representation of the search query. Internally, Lucene converts queries into non-reusable Weight objects.
- Queries are automatically parsed by a QueryParser. In the case of a raw text string, they appear to be parsed into BooleanQuery objects.
- When IndexSearcher.search is called, it eventually calls IndexSearcher.createNormalizedWeight, which itself calls createWeight on the query object. BooleanQuery.createWeight constructs a new BooleanWeight from the BooleanQuery.
- The BooleanWeight constructor iterates over the BooleanClause objects in the BooleanQuery. While we never explicitly see these BooleanClauses get created (because it happens during query parsing), I believe there is one BooleanClause per query term. In the constructor, a new Weight is created from each BooleanClause that comprises the BooleanQuery. This means that a single BooleanWeight is made up of a list of Weights representing each term in the query. My guess is that the QueryParser has constructed these BooleanClauses to contain one TermQuery per term in the query (a TermQuery is a single-term query).
- At some point, Lucene calls TermQuery.scorer (I believe starting with IndexSearcher.search calling (Boolean)Weight.bulkScorer), which returns a new TermScorer object, which is used to score the TermQuery (which, remember, is just a single word in the overall BooleanQuery).
- Part of TermQuery.scorer is getting a PostingsEnum of all the documents containing the term (a.k.a. the postings). DefaultBulkScorer.scoreRange (which we can see in the stack trace below) involves advancing through the document postings (the iterator.nextDoc() call) and, for each document, calling OrCollector.collect, which itself calls TermScorer.score to score the individual query term in the document currently selected from the PostingsEnum. TermScorer.score in turn ultimately calls OurDirichletSimilarity.score (it is given our Similarity implementation during construction in TermQuery.scorer, which itself fetches it from the IndexSearcher).
So there's our problem: the DefaultBulkScorer only scores documents that appear in the postings list for a given term. These scores are then collected by the OrCollector to construct a single overall document score, but since documents that don't contain the query term don't appear in the term's postings list, the final score of the document doesn't include any score for that term.
We will need to find a way to either make the BulkScorer iterate over the combined postings of all query terms, or somehow add in the smoothing during the collection phase (when the scores of all terms are being combined into one per-document score). The former is probably much slower but more flexible than the latter when it comes to implementing our own custom scorers. I'll be thinking about how this might be accomplished.
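To make the gap concrete: under Dirichlet smoothing, a query term absent from a document still contributes log(mu * p(w|C) / (dl + mu)) to the query likelihood, which a postings-driven scorer never adds. A sketch with made-up numbers (not Lucene code) showing that the difference between the full QL score and the matched-terms-only score is exactly the absent term's background contribution:

```java
// Compare full Dirichlet QL (all query terms) against a postings-driven
// sum (matched terms only). Numbers are hypothetical.
public class MissingTermSmoothing {
    // classic Dirichlet-smoothed per-term log score
    static double termScore(double tf, double dl, double p, double mu) {
        return Math.log((tf + mu * p) / (dl + mu));
    }

    public static void main(String[] args) {
        double dl = 250.0, mu = 2500.0;
        // each row: {tf in doc, collection probability p(w|C)};
        // the second query term does not occur in the document
        double[][] terms = { {3.0, 0.001}, {0.0, 0.0005} };

        double fullQL = 0.0, postingsOnly = 0.0;
        for (double[] t : terms) {
            double s = termScore(t[0], dl, t[1], mu);
            fullQL += s;                      // every query term contributes
            if (t[0] > 0) postingsOnly += s;  // postings-driven: matched only
        }

        // the gap is the absent term's background contribution
        double background = termScore(0.0, dl, 0.0005, mu);
        System.out.println(fullQL);
        System.out.println(postingsOnly);
        System.out.println(Math.abs((fullQL - postingsOnly) - background) < 1e-12);
    }
}
```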
java.lang.Exception: Stack trace
at java.lang.Thread.dumpStack(Thread.java:1333)
at org.retrievable.lucene.searching.similarities.OurDirichletSimilarity.score(OurDirichletSimilarity.java:37)
at org.apache.lucene.search.similarities.SimilarityBase$BasicSimScorer.score(SimilarityBase.java:281)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:66)
at org.apache.lucene.search.BooleanScorer$OrCollector.collect(BooleanScorer.java:143)
at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:221)
at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:208)
at org.apache.lucene.search.BooleanScorer$BulkScorerAndDoc.score(BooleanScorer.java:61)
at org.apache.lucene.search.BooleanScorer.scoreWindowIntoBitSetAndReplay(BooleanScorer.java:219)
at org.apache.lucene.search.BooleanScorer.scoreWindowMultipleScorers(BooleanScorer.java:266)
at org.apache.lucene.search.BooleanScorer.scoreWindow(BooleanScorer.java:311)
at org.apache.lucene.search.BooleanScorer.score(BooleanScorer.java:335)
at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:591)
at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:449)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:460)
at org.retrievable.lucene.searching.Searcher.search(Searcher.java:106)
at org.retrievable.lucene.searching.Searcher.main(Searcher.java:85)