
  • Look at the data structure; currently we treat it as unstructured. What about weighting fields?
  • Subset the expansion collections
  • Find other expansion collections (SNOMED/UMLS?)
  • Expand datasets with their repositories or associated publications
  • Use MeSH and/or Wikipedia categories to work around vocabulary mismatch
  • Use Boolean queries to ensure high precision
    • Take the word "estrogen," which appears in some queries. It probably has a reasonably high IDF, since most documents won't be about estrogen, but its TF is not necessarily indicative of a document's relevance, since estrogen is relevant to many unrelated biomedical subjects. However, if both "estrogen" and "cancer" (which appear together in the query) appear in a document, it is safer to assume that the TF of "estrogen" indicates relevance (to breast cancer, in this case).
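The co-occurrence idea above could be sketched as a gated term frequency: only count a broad query term's TF as evidence when an anchoring query term also appears in the document. This is a hypothetical illustration, not a proposed implementation; the function name and the toy documents are made up.

```python
def gated_tf(doc_tokens, broad_term, anchor_term):
    """Count the broad term's TF only when the anchor term co-occurs.

    Without the anchor present, the broad term contributes no evidence,
    which is the Boolean-precision behavior described above.
    """
    if anchor_term not in doc_tokens:
        return 0
    return doc_tokens.count(broad_term)

# "cancer" is present, so both occurrences of "estrogen" count.
doc = ["estrogen", "therapy", "in", "breast", "cancer", "estrogen", "receptors"]
gated_tf(doc, "estrogen", "cancer")  # 2

# No "cancer" anchor, so "estrogen" contributes nothing.
gated_tf(["estrogen", "and", "bone", "density"], "estrogen", "cancer")  # 0
```

In a real system this gate would sit inside the scoring function rather than replace it, but the sketch shows the intended effect on TF.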
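The field-weighting question could likewise be sketched as a BM25F-style weighted TF, where per-field counts are scaled before scoring instead of flattening the record into one unstructured blob. The field names and weights here are assumptions for illustration only.

```python
# Assumed field weights; real values would need to be tuned.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 1.5, "body": 1.0}

def weighted_tf(term, doc):
    """Sum per-field term frequencies, each scaled by its field's weight."""
    return sum(
        FIELD_WEIGHTS.get(field, 1.0) * tokens.count(term)
        for field, tokens in doc.items()
    )

doc = {
    "title": ["estrogen", "receptor", "signaling"],
    "abstract": ["estrogen", "levels", "in", "breast", "cancer"],
}
weighted_tf("estrogen", doc)  # 3.0*1 + 1.5*1 = 4.5
```

The weighted TF would then feed the usual TF-IDF or BM25 saturation, so a title match moves the score more than a body match.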

Other thoughts

  • Analyze the relevance judgments and the search results
    • Are we seeing a lot of unjudged but relevant results with anything beyond a baseline model?
    • Is there a systematic difference between the performance of the train and test queries?
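A first cut at the unjudged-results question could be as simple as measuring what fraction of a run's top-k documents have no entry in the qrels. This is a minimal sketch with toy data; the qrels format and document IDs are assumptions.

```python
def unjudged_fraction(ranked_ids, qrels, k=10):
    """Share of the top-k retrieved documents that have no relevance judgment."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id not in qrels) / len(top)

# Toy qrels: doc_id -> judgment (1 = relevant, 0 = not relevant).
qrels = {"d1": 1, "d2": 0, "d5": 1}
run = ["d1", "d7", "d2", "d9", "d5"]
unjudged_fraction(run, qrels, k=5)  # 0.4 (d7 and d9 are unjudged)
```

If this fraction grows as the model moves away from the baseline, that would suggest the judgment pool is biased toward baseline-style results.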