National Data Service
NDS-935

Investigate differences between train and test query scores


    • Type: Task
    • Resolution: Fixed
    • Priority: Normal

      There are differences in the number of qrels for the train (EA1-EA6) and test (T1-T9) queries. It would be good to know whether the differences in the number of judgments are having a negative effect on our retrieval metrics.

      I suggest you start with the following:

      • Find the exact numbers of judged relevant (qrel >= 1) and non-relevant (qrel = 0) documents for each query
      • For the usual baseline runs (QL, TF-IDF, Okapi, RM3), get the usual metrics (MAP, nDCG, P@20, nDCG@20, etc.) using just the EA queries and then just the T queries.
      • Calculate the variance of each metric for each type of query across runs. In R, you can use the var() function on a vector of metric values. For example, if 0.23 is MAP for EA queries under QL, 0.35 is MAP for EA queries under TF-IDF, etc.:

      var(c(0.23, 0.35, 0.56, ...))

      • Record all of the above in the wiki
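      The steps above can be sketched in Python. This assumes the qrels files use the standard TREC format (qid iteration docid rel); the query IDs, document IDs, judgments, and MAP values below are illustrative, not real numbers from our runs:

      ```python
      from collections import Counter, defaultdict
      from statistics import variance

      def count_judgments(qrels_lines):
          """Per-query counts of judged relevant (rel >= 1) and
          non-relevant (rel == 0) documents.

          Expects standard TREC qrels lines: "qid iteration docid rel".
          """
          counts = defaultdict(Counter)
          for line in qrels_lines:
              qid, _it, _docid, rel = line.split()
              counts[qid]["relevant" if int(rel) >= 1 else "nonrelevant"] += 1
          return counts

      # Illustrative qrels fragment (made-up document IDs and judgments)
      sample = [
          "EA1 0 doc1 1",
          "EA1 0 doc2 0",
          "EA1 0 doc3 2",
          "T1 0 doc1 0",
          "T1 0 doc4 1",
      ]
      per_query = count_judgments(sample)

      # Sample variance of a metric across runs; statistics.variance uses
      # the n-1 denominator, the same as R's var().
      # Hypothetical MAP for EA queries under QL, TF-IDF, Okapi, RM3:
      map_ea = [0.23, 0.35, 0.56, 0.41]
      map_var = variance(map_ea)
      ```

      Running count_judgments over each EA and T qrels file gives the per-query table for the wiki, and the same variance call can be repeated for each metric (nDCG, P@20, etc.) and each query type.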

              Assignee: gsherma2 Garrick Sherman
              Reporter: gsherma2 Garrick Sherman
              Votes: 0
              Watchers: 3

                Created:
                Updated:
                Resolved:

                  Estimated: 4h
                  Remaining: 4h
                  Logged: Not Specified