
1. Data (/shared/ohsumed/data) 

#download Ohsumed data 

mkdir -p /shared/ohsumed 
cd /shared/ohsumed
wget http://trec.nist.gov/data/filtering/t9.filtering.tar.gz

#untar Ohsumed data 

tar xvzf t9.filtering.tar.gz --owner root --group root --no-same-owner 2>&1 >> ohsumeddata.log  

 #copy data into /shared/ohsumed/data 

cp /shared/ohsumed/ohsu-trec/trec9-test/ohsumed.88-91 /shared/ohsumed/data
cp /shared/ohsumed/ohsu-trec/trec9-train/ohsumed.87 /shared/ohsumed/data

#convert data into trec format 

cd ~/biocaddie/scripts   
./ohsumed2trec.sh 

Number of documents: 348,566 

Output: /shared/ohsumed/data/trecText/ohsumed_all.txt 

Also make a copy at /data/ohsumed/data/ 
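For reference, each record in the converted file should follow the standard trectext layout that IndriBuildIndex parses in step 2. The record below is only illustrative (the DOCNO and text are made up, and the exact fields emitted by ohsumed2trec.sh may differ):

<DOC>
<DOCNO>87049087</DOCNO>
<TEXT>
Title and abstract text of the MEDLINE reference goes here.
</TEXT>
</DOC>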
 

2. Indexes (/shared/ohsumed/indexes/ohsumed_all) 

Index param file: ~/biocaddie/index/build_index.ohsumed.params 

#Content 

<parameters> 
  <index>/shared/ohsumed/indexes/ohsumed_all</index> 
  <indexType>indri</indexType> 
  <corpus> 
    <path>/shared/ohsumed/data/trecText/ohsumed_all.txt</path> 
    <class>trectext</class> 
  </corpus> 
</parameters> 

#Build index 

mkdir -p /shared/ohsumed/indexes/  
cd ~/biocaddie  
IndriBuildIndex index/build_index.ohsumed.params  

Output is saved at /shared/ohsumed/indexes/ohsumed_all 

Also make a copy at /data/ohsumed/indexes/ohsumed_all 
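As a sanity check, Indri's dumpindex utility can print collection statistics for the new index; the document count should match the 348,566 documents noted in step 1:

dumpindex /shared/ohsumed/indexes/ohsumed_all stats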


3. Queries 

#copy topics to /shared/ohsumed/queries folder 

cp /shared/ohsumed/ohsu-trec/trec9-train/query.ohsu.1-63 /shared/ohsumed/queries 
cp /shared/ohsumed/ohsu-trec/pre-test/query.ohsu.test.1-43 /shared/ohsumed/queries 

There are two query files: one for the pre-test set and one for the training (and test) sets.

We create queries.combined.orig (106 queries in total), which includes all the queries, and queries.combined.short (63 queries in total), which contains only the queries used for the training and test sets (pre-test queries excluded).

#convert queries into trec format (use ohsumedtopics2trec.sh to create queries.combined.orig and ohsumedtopics2trec_v2.sh to create queries.combined.short)

cd ~/biocaddie  
scripts/ohsumedtopics2trec.sh

Output is saved at /shared/ohsumed/queries

Also make a copy of the query at /data/ohsumed/queries 
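The baseline scripts in step 5 feed these queries to IndriRunQuery, which reads an Indri query parameter file. A minimal hand-written query file looks like the sketch below (the topic number and text are illustrative, and the exact layout produced by the conversion scripts may differ):

<parameters>
  <index>/shared/ohsumed/indexes/ohsumed_all</index>
  <query>
    <number>1</number>
    <text>adverse effects of estrogen replacement therapy on lipids</text>
  </query>
</parameters>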


4. Qrels

#copy qrels to /shared/ohsumed/qrels folder 

cp /shared/ohsumed/ohsu-trec/trec9-train/qrels.ohsu.batch.87 /shared/ohsumed/qrels 
cp /shared/ohsumed/ohsu-trec/pre-test/qrels.ohsu.test.87 /shared/ohsumed/qrels 
cp /shared/ohsumed/ohsu-trec/trec9-test/qrels.ohsu.88-91 /shared/ohsumed/qrels 

Similar to the queries, the three qrels files contain relevance judgments for the pre-test, training, and test sets.

For queries.combined.orig, all three qrels files are used - qrels.all.

For queries.combined.short, only the qrels for the training and test sets are used (qrels.ohsu.batch.87 and qrels.ohsu.88-91) - qrels.notest.

However, the downloaded qrels are missing one column that trec_eval needs, so we have to add the missing column before use.

#convert qrels into correct format for trec_eval (add in 0 in second column) 

cat qrels.ohsu.* | sed 's/\t/\t0\t/1' > qrels.all 
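Each corrected line then has the four columns trec_eval expects (topic, iteration, document ID, relevance). qrels.notest can be built the same way from only the training and test judgments; a sketch, assuming the same tab-separated input format:

cat qrels.ohsu.batch.87 qrels.ohsu.88-91 | sed 's/\t/\t0\t/1' > qrels.notest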

Output is saved at /shared/ohsumed/qrels

Also make a copy of the qrels at /data/ohsumed/qrels  
 

5. IndriRunQuery - Output

cd ~/biocaddie/baselines/ohsumed 
./<model>.sh <topic> <collection> | parallel -j 20 bash -c "{}" 

For orig queries:

./tfidf.sh orig combined | parallel -j 20 bash -c "{}" 
./jm.sh orig combined | parallel -j 20 bash -c "{}" 
./dir.sh orig combined | parallel -j 20 bash -c "{}" 
./two.sh orig combined | parallel -j 20 bash -c "{}" 
./okapi.sh orig combined | parallel -j 20 bash -c "{}" 
./rm3.sh orig combined | parallel -j 20 bash -c "{}" 

For short queries:

./tfidf.sh short combined | parallel -j 20 bash -c "{}" 
./jm.sh short combined | parallel -j 20 bash -c "{}" 
./dir.sh short combined | parallel -j 20 bash -c "{}" 
./two.sh short combined | parallel -j 20 bash -c "{}" 
./okapi.sh short combined | parallel -j 20 bash -c "{}" 
./rm3.sh short combined | parallel -j 20 bash -c "{}" 

IndriRunQuery outputs for different baselines are stored at: 

/data/ohsumed/output/tfidf/combined/orig
/data/ohsumed/output/dir/combined/orig
/data/ohsumed/output/okapi/combined/orig
/data/ohsumed/output/jm/combined/orig
/data/ohsumed/output/two/combined/orig
/data/ohsumed/output/rm3/combined/orig 
---
/data/ohsumed/output/tfidf/combined/short
/data/ohsumed/output/dir/combined/short
/data/ohsumed/output/okapi/combined/short
/data/ohsumed/output/jm/combined/short
/data/ohsumed/output/two/combined/short
/data/ohsumed/output/rm3/combined/short 
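Each run file contains one line per retrieved document in the standard TREC result format that trec_eval consumes (assuming the scripts invoke IndriRunQuery with -trecFormat=true); the document IDs and scores below are illustrative:

1 Q0 87049087 1 -4.83646 indri
1 Q0 87064289 2 -4.91225 indri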

 

6. Cross-validation 

cd ~/biocaddie  

For orig queries, which use qrels.all:

scripts/mkeval_ohsumed.sh <model> <topics> <collection>
Eg: scripts/mkeval_ohsumed.sh tfidf orig combined 

For short queries, which use qrels.notest:

scripts/mkeval_ohsumed_v2.sh <model> <topics> <collection>
Eg: scripts/mkeval_ohsumed_v2.sh tfidf short combined 
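To evaluate every baseline in one pass, the per-model invocations above can be wrapped in a simple loop (a convenience sketch; the model list is taken from the output directories in step 5):

cd ~/biocaddie
for model in tfidf okapi jm dir two rm3; do
    scripts/mkeval_ohsumed.sh $model orig combined
    scripts/mkeval_ohsumed_v2.sh $model short combined
done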


7. Compare models

cd ~/biocaddie  
Rscript scripts/compare_ohsumed.R <collection> <from model> <to model> <topic> 

Results (compared to tfidf baseline) 

Using orig queries (pre-test queries included)

Model     | MAP     | NDCG    | P@20    | NDCG@20 | P@100   | NDCG@100 | Notes                                 | Date
tfidf     | 0.2204  | 0.4538  | 0.2995  | 0.2904  | 0.1735  | 0.3376   | Sweep b and k1                        | 06/07/17
Okapi     | 0.2218  | 0.4557  | 0.2819- | 0.3035  | 0.1717  | 0.3386   | Sweep b, k1, k3                       | 06/07/17
QL (JM)   | 0.1876- | 0.4212- | 0.2505- | 0.2773  | 0.1403- | 0.295-   | Sweep lambda                          | 06/07/17
QL (Dir)  | 0.2032- | 0.4359- | 0.2713- | 0.2927  | 0.1633- | 0.3304   | Sweep mu                              | 06/07/17
QL (TS)   | 0.2101- | 0.4415- | 0.2761- | 0.3029  | 0.1638- | 0.3277   | Sweep mu and lambda                   | 06/07/17
RM3       | 0.2618+ | 0.4592  | 0.3277+ | 0.2965  | 0.1913+ | 0.3662+  | Sweep mu, fbDocs, fbTerms, and lambda | 06/08/17
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir orig 
[1] "map 0.2204 0.2032 p= 0.9988" 
[1] "ndcg 0.4538 0.4359 p= 0.9987" 
[1] "P_20 0.2995 0.2713 p= 0.9985" 
[1] "ndcg_cut_20 0.2904 0.2927 p= 0.417" 
[1] "P_100 0.1735 0.1633 p= 0.9945" 
[1] "ndcg_cut_100 0.3376 0.3304 p= 0.7764" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm orig 
[1] "map 0.2204 0.1876 p= 0.9966" 
[1] "ndcg 0.4538 0.4212 p= 0.9992" 
[1] "P_20 0.2995 0.2505 p= 0.9999" 
[1] "ndcg_cut_20 0.2904 0.2773 p= 0.8572" 
[1] "P_100 0.1735 0.1403 p= 1" 
[1] "ndcg_cut_100 0.3376 0.295 p= 0.9996" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two orig 
[1] "map 0.2204 0.2101 p= 0.972" 
[1] "ndcg 0.4538 0.4415 p= 0.9859" 
[1] "P_20 0.2995 0.2761 p= 0.9954" 
[1] "ndcg_cut_20 0.2904 0.3029 p= 0.1072" 
[1] "P_100 0.1735 0.1638 p= 0.9992" 
[1] "ndcg_cut_100 0.3376 0.3277 p= 0.857" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi orig 
[1] "map 0.2204 0.2218 p= 0.4445" 
[1] "ndcg 0.4538 0.4557 p= 0.414" 
[1] "P_20 0.2995 0.2819 p= 0.975" 
[1] "ndcg_cut_20 0.2904 0.3035 p= 0.1157" 
[1] "P_100 0.1735 0.1717 p= 0.6907" 
[1] "ndcg_cut_100 0.3376 0.3386 p= 0.4437" 

Using short queries (pre-test queries not included)

Model     | MAP     | NDCG    | P@20    | NDCG@20                  | P@100   | NDCG@100 | Notes                                 | Date
tfidf     | 0.3188  | 0.6084  | 0.45    | 0.4255                   | 0.2657  | 0.4625   | Sweep b and k1                        | 06/07/17
Okapi     | 0.3117  | 0.6044  | 0.4408  | 0.4277                   | 0.261   | 0.4569   | Sweep b, k1, k3                       | 06/07/17
QL (JM)   | 0.2545- | 0.5527- | 0.3908- | 0.3882-                  | 0.2135- | 0.3883-  | Sweep lambda                          | 06/07/17
QL (Dir)  | 0.2924- | 0.5866- | 0.3975  | 0.4018-                  | 0.2492- | 0.432-   | Sweep mu                              | 06/07/17
QL (TS)   | 0.2934- | 0.5828- | 0.4092- | 0.4122                   | 0.2508- | 0.4385-  | Sweep mu and lambda                   | 06/07/17
RM3       | 0.3717+ | 0.6087  | 0.5067+ | 0.4529 (p-value: 0.0541) | 0.291+  | 0.4934+  | Sweep mu, fbDocs, fbTerms, and lambda | 06/08/17
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir short
[1] "map 0.3188 0.2924 p= 0.9997"
[1] "ndcg 0.6084 0.5866 p= 0.9994"
[1] "P_20 0.45 0.3975 p= 0.9998"
[1] "ndcg_cut_20 0.4255 0.4018 p= 0.9881"
[1] "P_100 0.2657 0.2492 p= 0.9947"
[1] "ndcg_cut_100 0.4625 0.432 p= 0.9999"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm short
[1] "map 0.3188 0.2545 p= 1"
[1] "ndcg 0.6084 0.5527 p= 1"
[1] "P_20 0.45 0.3908 p= 0.9984"
[1] "ndcg_cut_20 0.4255 0.3882 p= 0.9973"
[1] "P_100 0.2657 0.2135 p= 1"
[1] "ndcg_cut_100 0.4625 0.3883 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi short
[1] "map 0.3188 0.3117 p= 0.7974"
[1] "ndcg 0.6084 0.6044 p= 0.6834"
[1] "P_20 0.45 0.4408 p= 0.7506"
[1] "ndcg_cut_20 0.4255 0.4277 p= 0.4236"
[1] "P_100 0.2657 0.261 p= 0.791"
[1] "ndcg_cut_100 0.4625 0.4569 p= 0.747"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two short
[1] "map 0.3188 0.2934 p= 1"
[1] "ndcg 0.6084 0.5828 p= 0.9997"
[1] "P_20 0.45 0.4092 p= 0.9989"
[1] "ndcg_cut_20 0.4255 0.4122 p= 0.89"
[1] "P_100 0.2657 0.2508 p= 0.9991"
[1] "ndcg_cut_100 0.4625 0.4385 p= 0.9992"

8. Comments:

The bioCADDIE dataset contains descriptive metadata (structured and unstructured) for more than 1.5 million documents drawn from biomedical datasets. It provides 20 queries, which have been manually refined and shortened to the important keywords. Relevance judgments have three categories: 0 "not relevant", 1 "possibly relevant", and 2 "definitely relevant". 

The TREC CDS dataset is a collection of 733,328 full-text biomedical journal articles. Thirty topics are provided; each includes a topic "description" (a complete account of the patient's visit, including details such as vital statistics, drug dosages, etc.) and a topic "summary" (a simplified version of the narrative containing less irrelevant information). Queries are constructed from the topic summaries. As with bioCADDIE, relevance judgments are divided into three categories: 0 "not relevant", 1 "possibly relevant", and 2 "definitely relevant". 

The OHSUMED test collection is a set of 348,566 references from MEDLINE, the online medical information database, consisting of titles and/or abstracts from 270 medical journals. Compared with the two collections above, OHSUMED is quite small. OHSUMED topics include two fields: title (patient description) and description (information request). The topic descriptions are used to construct the queries. Relevance judgments have two categories: 1 "possibly relevant" and 2 "definitely relevant". 

Based on the characteristics of the three collections, TREC CDS differs markedly from bioCADDIE and OHSUMED: it searches full text, and its queries are summaries of patient visit records rather than ordinary information requests. The OHSUMED collection is closer to bioCADDIE in terms of the data itself (not full text). However, bioCADDIE queries are short keyword queries, while OHSUMED queries are short verbose queries. 

Across the baseline runs on all three collections, RM3 generally performs well and consistently. In particular, for TREC CDS and OHSUMED, RM3 gives the best results on most metrics compared with the other baselines. This was expected, since RM3 is based on Rocchio-style relevance feedback, which can produce a good expanded query even when we do not know the collection well. 

One surprising result was that the query-likelihood baselines with smoothing (JM, Dir, and TS) did not improve retrieval over TF-IDF on any metric for the TREC CDS and OHSUMED collections, as they did for bioCADDIE and in previous studies (http://trec.nist.gov/pubs/trec23/papers/pro-UCLA_MII_clinical.pdf). However, the type of query could be an important factor behind these differences in retrieval results. This was also noted in Zhai's study (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.8978), which found that keyword-only queries tend to perform better than more verbose queries.

We examined the difference between using verbose queries and keyword queries on the OHSUMED collection.

Using original queries (verbose queries) for OHSUMED

Model     | MAP     | NDCG    | P@20    | NDCG@20 | P@100   | NDCG@100 | Notes               | Date
tfidf     | 0.3188  | 0.6084  | 0.45    | 0.4255  | 0.2657  | 0.4625   | Sweep b and k1      | 06/07/17
QL (JM)   | 0.2545- | 0.5527- | 0.3908- | 0.3882- | 0.2135- | 0.3883-  | Sweep lambda        | 06/07/17
QL (Dir)  | 0.2924- | 0.5866- | 0.3975  | 0.4018- | 0.2492- | 0.432-   | Sweep mu            | 06/07/17
QL (TS)   | 0.2934- | 0.5828- | 0.4092- | 0.4122  | 0.2508- | 0.4385-  | Sweep mu and lambda | 06/07/17

Using manually refined queries (mostly keywords) for OHSUMED

Model     | MAP     | NDCG    | P@20    | NDCG@20 | P@100   | NDCG@100 | Notes               | Date
tfidf     | 0.315   | 0.5949  | 0.4198  | 0.3802  | 0.2614  | 0.4454   | Sweep b and k1      | 06/09/17
QL (JM)   | 0.2587- | 0.5466- | 0.3817- | 0.3608  | 0.2257- | 0.3806-  | Sweep lambda        | 06/09/17
QL (Dir)  | 0.3027- | 0.5883  | 0.4087  | 0.379   | 0.261   | 0.4333-  | Sweep mu            | 06/09/17
QL (TS)   | 0.3052- | 0.5871  | 0.4159  | 0.3896  | 0.2627  | 0.4354-  | Sweep mu and lambda | 06/09/17

We can see that when using keyword queries, the difference in retrieval results between tfidf and QL is smaller.

Specifically, tfidf performed worse on all metrics with keyword queries than with the verbose/original queries. QL (JM) also performed worse on NDCG, P@20, NDCG@20, and NDCG@100. However, QL (Dir) and QL (TS) performed better on most metrics. This matches the finding in Zhai's study that JM is least effective for short keyword queries but more effective when queries are verbose, while Dir works better for concise keyword queries than for verbose ones.

The number of queries used to run the baselines in each collection could also contribute to the differences.

(to be continued)



