There are differences in the number of qrels between the train (EA1-EA6) and test (T1-T9) queries. It would be good to know whether these differences in the number of judgments are having a negative effect on our retrieval metrics.
I suggest you start with the following:
- Find the exact numbers of judged relevant (qrel >= 1) and non-relevant (qrel = 0) documents for each query
- For the usual baseline runs (QL, TF-IDF, Okapi, RM3), get the usual metrics (MAP, nDCG, P@20, nDCG@20, etc.) using just the EA queries and then just the T queries.
- Calculate the variance of each metric across runs for each type of query. In R, you can use the var() function with a vector of the metric values. For example, if 0.23 is the MAP for the EA queries under QL, 0.35 is the MAP for the EA queries under TF-IDF, etc.:
var(c(0.23, 0.35, 0.56, ...))
- Record all of the above on the wiki
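For the first step, counting judged relevant and non-relevant documents per query can be scripted directly from the qrels file. A minimal Python sketch, assuming the standard TREC qrels format (query-id, iteration, doc-id, judgment) and using inline sample lines in place of the real file:

```python
from collections import Counter

# Sample lines standing in for the real qrels file; the query IDs and
# judgments here are illustrative only.
qrels_lines = """\
EA1 0 doc1 1
EA1 0 doc2 0
EA1 0 doc3 2
T1 0 doc1 0
T1 0 doc4 1
""".splitlines()

relevant = Counter()     # qrel >= 1
nonrelevant = Counter()  # qrel == 0
for line in qrels_lines:
    qid, _iteration, _docid, judgment = line.split()
    if int(judgment) >= 1:
        relevant[qid] += 1
    else:
        nonrelevant[qid] += 1

for qid in sorted(set(relevant) | set(nonrelevant)):
    print(qid, relevant[qid], nonrelevant[qid])
```

To run it on the real data, replace `qrels_lines` with the lines read from the qrels file.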
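For the second step, one way to score the EA and T queries separately is to split the run (and qrels) files by query-ID prefix and evaluate each half on its own, e.g. with trec_eval. A sketch, assuming standard TREC run lines (qid Q0 docid rank score tag) and that the query sets are distinguishable by the "EA"/"T" prefix:

```python
def split_by_query_set(lines):
    """Separate TREC-format lines into EA-query and T-query lines."""
    ea, t = [], []
    for line in lines:
        qid = line.split()[0]
        (ea if qid.startswith("EA") else t).append(line)
    return ea, t

# Illustrative run lines only; read the real run file in practice.
run_lines = [
    "EA1 Q0 doc1 1 12.3 QL",
    "T1 Q0 doc4 1 10.1 QL",
    "EA2 Q0 doc2 1 11.7 QL",
]
ea_run, t_run = split_by_query_set(run_lines)
```

Writing `ea_run` and `t_run` out to separate files (and doing the same for the qrels) lets the usual evaluation produce MAP, nDCG, P@20, etc. per query set without any other changes.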