
Using original ("orig") queries (pre-test queries included)

||Model||MAP||NDCG||P@20||NDCG@20||P@100||NDCG@100||Notes||Date||
|tfidf|0.2204|0.4538|0.2995|0.2904|0.1735|0.3376|Sweep b and k1|06/07/17|
|Okapi|0.2218|0.4557|0.2819-|0.3035|0.1717|0.3386|Sweep b, k1, k3|06/07/17|
|QL (JM)|0.1876-|0.4212-|0.2505-|0.2773|0.1403-|0.295-|Sweep lambda|06/07/17|
|QL (Dir)|0.2032-|0.4359-|0.2713-|0.2927|0.1633-|0.3304|Sweep mu|06/07/17|
|QL (TS)|0.2101-|0.4415-|0.2761-|0.3029|0.1638-|0.3277|Sweep mu and lambda|06/07/17|
|RM3|0.2618+|0.4592|0.3277+|0.2965|0.1913+|0.3662+|Sweep mu, fbDocs, fbTerms, and lambda|06/08/17|

(In all tables, + / - mark a statistically significant improvement / degradation relative to the tfidf baseline, one-sided p < 0.05; see the significance tests below.)
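
The "Sweep" notes mean each row reports the best setting found over a grid of that model's parameters. As a rough illustration of how such a sweep can be scored (not the actual pipeline), a minimal R sketch that picks the best lambda for QL (JM) by MAP; the eval/ directory layout and file names are hypothetical:

No Format
# Hypothetical sketch: choose the best parameter setting by MAP.
# Assumes one trec_eval summary file per setting, e.g. eval/jm_lambda_0.1.txt,
# each containing whitespace-separated lines like:  map   all   0.1876
files <- Sys.glob("eval/jm_lambda_*.txt")

read_map <- function(f) {
  ev <- read.table(f, col.names = c("metric", "query", "value"),
                   stringsAsFactors = FALSE)
  as.numeric(ev$value[ev$metric == "map" & ev$query == "all"])
}

maps <- sapply(files, read_map)
cat("best setting:", files[which.max(maps)], "map =", max(maps), "\n")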


No Format
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir orig 
[1] "map 0.2204 0.2032 p= 0.9988" 
[1] "ndcg 0.4538 0.4359 p= 0.9987" 
[1] "P_20 0.2995 0.2713 p= 0.9985" 
[1] "ndcg_cut_20 0.2904 0.2927 p= 0.417" 
[1] "P_100 0.1735 0.1633 p= 0.9945" 
[1] "ndcg_cut_100 0.3376 0.3304 p= 0.7764" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm orig 
[1] "map 0.2204 0.1876 p= 0.9966" 
[1] "ndcg 0.4538 0.4212 p= 0.9992" 
[1] "P_20 0.2995 0.2505 p= 0.9999" 
[1] "ndcg_cut_20 0.2904 0.2773 p= 0.8572" 
[1] "P_100 0.1735 0.1403 p= 1" 
[1] "ndcg_cut_100 0.3376 0.295 p= 0.9996" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two orig 
[1] "map 0.2204 0.2101 p= 0.972" 
[1] "ndcg 0.4538 0.4415 p= 0.9859" 
[1] "P_20 0.2995 0.2761 p= 0.9954" 
[1] "ndcg_cut_20 0.2904 0.3029 p= 0.1072" 
[1] "P_100 0.1735 0.1638 p= 0.9992" 
[1] "ndcg_cut_100 0.3376 0.3277 p= 0.857" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi orig 
[1] "map 0.2204 0.2218 p= 0.4445" 
[1] "ndcg 0.4538 0.4557 p= 0.414" 
[1] "P_20 0.2995 0.2819 p= 0.975" 
[1] "ndcg_cut_20 0.2904 0.3035 p= 0.1157" 
[1] "P_100 0.1735 0.1717 p= 0.6907" 
[1] "ndcg_cut_100 0.3376 0.3386 p= 0.4437" 

Using short queries (pre-test queries not included)

||Model||MAP||NDCG||P@20||NDCG@20||P@100||NDCG@100||Notes||Date||
|tfidf|0.3188|0.6084|0.45|0.4255|0.2657|0.4625|Sweep b and k1|06/07/17|
|Okapi|0.3117|0.6044|0.4408|0.4277|0.261|0.4569|Sweep b, k1, k3|06/07/17|
|QL (JM)|0.2545-|0.5527-|0.3908-|0.3882-|0.2135-|0.3883-|Sweep lambda|06/07/17|
|QL (Dir)|0.2924-|0.5866-|0.3975|0.4018-|0.2492-|0.432-|Sweep mu|06/07/17|
|QL (TS)|0.2934-|0.5828-|0.4092-|0.4122|0.2508-|0.4385-|Sweep mu and lambda|06/07/17|
|RM3|0.3717+|0.6087|0.5067+|0.4529 (p-value: 0.0541)|0.291+|0.4934+|Sweep mu, fbDocs, fbTerms, and lambda|06/08/17|


No Format
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir short
[1] "map 0.3188 0.2924 p= 0.9997"
[1] "ndcg 0.6084 0.5866 p= 0.9994"
[1] "P_20 0.45 0.3975 p= 0.9998"
[1] "ndcg_cut_20 0.4255 0.4018 p= 0.9881"
[1] "P_100 0.2657 0.2492 p= 0.9947"
[1] "ndcg_cut_100 0.4625 0.432 p= 0.9999"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm short
[1] "map 0.3188 0.2545 p= 1"
[1] "ndcg 0.6084 0.5527 p= 1"
[1] "P_20 0.45 0.3908 p= 0.9984"
[1] "ndcg_cut_20 0.4255 0.3882 p= 0.9973"
[1] "P_100 0.2657 0.2135 p= 1"
[1] "ndcg_cut_100 0.4625 0.3883 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi short
[1] "map 0.3188 0.3117 p= 0.7974"
[1] "ndcg 0.6084 0.6044 p= 0.6834"
[1] "P_20 0.45 0.4408 p= 0.7506"
[1] "ndcg_cut_20 0.4255 0.4277 p= 0.4236"
[1] "P_100 0.2657 0.261 p= 0.791"
[1] "ndcg_cut_100 0.4625 0.4569 p= 0.747"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two short
[1] "map 0.3188 0.2934 p= 1"
[1] "ndcg 0.6084 0.5828 p= 0.9997"
[1] "P_20 0.45 0.4092 p= 0.9989"
[1] "ndcg_cut_20 0.4255 0.4122 p= 0.89"
[1] "P_100 0.2657 0.2508 p= 0.9991"
[1] "ndcg_cut_100 0.4625 0.4385 p= 0.9992"


8. Comments:

The bioCADDIE dataset contains descriptive metadata (structured and unstructured) for more than 1.5 million documents from biomedical datasets. There are 20 queries, which were manually refined and shortened to keep the important keywords. Relevance judgements have 3 categories: 0 "not relevant", 1 "possibly relevant", and 2 "definitely relevant".

The TREC CDS dataset is a collection of 733,328 full-text biomedical journal articles. 30 topics are provided, each of which includes a topic "description" (a complete account of the patient's visit, including details such as vital statistics, drug dosages, etc.) and a topic "summary" (a simplified version of the narrative that contains less irrelevant information). Queries are constructed from the topic summaries. Similar to bioCADDIE, relevance judgements are divided into 3 categories: 0 "not relevant", 1 "possibly relevant", and 2 "definitely relevant".

The OHSUMED test collection is a set of 348,566 references/documents from MEDLINE, the online medical information database, consisting of titles and/or abstracts from 270 medical journals. Compared to the two collections above, the OHSUMED dataset is quite small. OHSUMED topics include 2 fields: "title" (patient description) and "description" (information request). The topic descriptions were selected to construct queries. Relevance judgements include 2 categories: 1 "possibly relevant" and 2 "definitely relevant".

Based on the characteristics of the 3 collections, TREC CDS is far different from bioCADDIE and OHSUMED, as it uses full-text search and its queries are summaries of patient visit records rather than common information queries. The OHSUMED collection is closer to bioCADDIE in terms of dataset similarity (non-full-text). However, bioCADDIE queries are short keyword queries, while OHSUMED queries are short but verbose (natural-language) queries.

Judging by the baseline results over all 3 collections, the RM3 baselines generally perform well and consistently. In particular, for TREC CDS and OHSUMED, RM3 gives the best results on most metrics compared to the other baselines. This was expected, since RM3 is based on Rocchio-style relevance feedback, which can help generate a good expanded query even when we do not know the collection well.
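
For reference, RM3 interpolates the original query model with a relevance model estimated from the top-ranked feedback documents. A sketch of the standard formulation (the exact implementation behind these runs may differ):

\[
P(w \mid \theta_{Q'}) = \lambda \, P(w \mid \theta_Q) + (1 - \lambda) \, P(w \mid R),
\qquad
P(w \mid R) \propto \sum_{d \in D_{fb}} P(w \mid \theta_d) \prod_{q \in Q} P(q \mid \theta_d)
\]

Here D_fb is the set of the top fbDocs feedback documents, P(w | R) is truncated to the top fbTerms terms, and lambda is the interpolation weight; these are exactly the parameters swept in the RM3 rows above.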

One surprising result was that the query-likelihood baselines with smoothing (JM, Dir, and TS) did not improve retrieval over TFIDF on any metric for the TREC CDS and OHSUMED collections, as they did for bioCADDIE and in previous studies (http://trec.nist.gov/pubs/trec23/papers/pro-UCLA_MII_clinical.pdf). However, the type of query could be an important factor behind these differences. This was also noted in the study by Zhai (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.8978): queries containing only keywords tend to perform better than more verbose queries.
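
To make the smoothing contrast concrete, the standard formulations (following the Zhai study cited above; shown as a sketch, with parameter names matching the sweeps in the tables) are:

\[
p_{\lambda}(w \mid d) = (1 - \lambda) \, p_{ml}(w \mid d) + \lambda \, p(w \mid C)
\quad \text{(Jelinek-Mercer)}
\]
\[
p_{\mu}(w \mid d) = \frac{c(w; d) + \mu \, p(w \mid C)}{|d| + \mu}
\quad \text{(Dirichlet)}
\]

where c(w; d) is the count of w in d, |d| is the document length, and p(w | C) is the collection language model. Two-stage (TS) smoothing first applies Dirichlet smoothing to the document model and then interpolates JM-style with a background model, which is why both mu and lambda are swept for QL (TS).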

We tried to examine the difference between using verbose queries and keyword queries on the OHSUMED collection, as shown in the two tables below.

Using original queries (verbose queries) for OHSUMED

||Model||MAP||NDCG||P@20||NDCG@20||P@100||NDCG@100||Notes||Date||
|tfidf|0.3188|0.6084|0.45|0.4255|0.2657|0.4625|Sweep b and k1|06/07/17|
|QL (JM)|0.2545-|0.5527-|0.3908-|0.3882-|0.2135-|0.3883-|Sweep lambda|06/07/17|
|QL (Dir)|0.2924-|0.5866-|0.3975|0.4018-|0.2492-|0.432-|Sweep mu|06/07/17|
|QL (TS)|0.2934-|0.5828-|0.4092-|0.4122|0.2508-|0.4385-|Sweep mu and lambda|06/07/17|

Using manually refined queries (mostly keywords) for OHSUMED

||Model||MAP||NDCG||P@20||NDCG@20||P@100||NDCG@100||Notes||Date||
|tfidf|0.315|0.5949|0.4198|0.3802|0.2614|0.4454|Sweep b and k1|06/09/17|
|QL (JM)|0.2587-|0.5466-|0.3817-|0.3608|0.2257-|0.3806-|Sweep lambda|06/09/17|
|QL (Dir)|0.3027-|0.5883|0.4087|0.379|0.261|0.4333-|Sweep mu|06/09/17|
|QL (TS)|0.3052-|0.5871|0.4159|0.3896|0.2627|0.4354-|Sweep mu and lambda|06/09/17|

We can see that when using keyword queries, the difference in retrieval results between tfidf and QL is smaller.

Specifically, tfidf performed worse on all metrics with keyword queries than with the verbose/original queries. QL (JM) also performed worse for NDCG, P@20, NDCG@20, and NDCG@100. However, QL (Dir) and QL (TS) performed better on most metrics. This matches the finding in Zhai's study that JM works worst for short keyword queries but is more effective when queries are verbose, while Dir works better for concise keyword queries than for verbose queries.

The number of queries used for running the baselines in each collection could also account for some of the differences.

(to be continued)