1. Data (/shared/ohsumed/data)
#download Ohsumed data
mkdir -p /shared/ohsumed
cd /shared/ohsumed
wget http://trec.nist.gov/data/filtering/t9.filtering.tar.gz
#untar Ohsumed data
tar xvzf t9.filtering.tar.gz --owner root --group root --no-same-owner >> ohsumeddata.log 2>&1
#copy data into /shared/ohsumed/data
mkdir -p /shared/ohsumed/data
cp /shared/ohsumed/ohsu-trec/trec9-test/ohsumed.88-91 /shared/ohsumed/data
cp /shared/ohsumed/ohsu-trec/trec9-train/ohsumed.87 /shared/ohsumed/data
#convert data into trec format
cd ~/biocaddie/scripts
./ohsumed2trec.sh
Number of documents: 348566
Output: /shared/ohsumed/data/trecText/ohsumed_all.txt
Also make a copy at /data/ohsumed/data/
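The document count above can be sanity-checked by counting the opening `<DOC>` tags in the converted file. The snippet below is a minimal sketch that runs on a small synthetic file. It assumes that ohsumed2trec.sh writes standard TREC text with each `<DOC>` tag on its own line, which is not confirmed here, and `count_trec_docs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_trec_docs FILE: print the number of lines that are exactly "<DOC>".
# Hypothetical helper; assumes every document opens with <DOC> on its own line.
count_trec_docs() {
    grep -c '^<DOC>$' "$1" || true
}

# Synthetic two-document sample standing in for ohsumed_all.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<DOC>
<DOCNO>D1</DOCNO>
<TEXT>
first abstract
</TEXT>
</DOC>
<DOC>
<DOCNO>D2</DOCNO>
<TEXT>
second abstract
</TEXT>
</DOC>
EOF

doc_count=$(count_trec_docs "$sample")
echo "documents: $doc_count"
rm -f "$sample"
```

If the layout assumption holds, running `count_trec_docs /shared/ohsumed/data/trecText/ohsumed_all.txt` on the real collection should print 348566.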
2. Indexes (/shared/ohsumed/indexes/ohsumed_all)
Index param file: ~/biocaddie/index/build_index.ohsumed.params
#Content
<parameters>
  <index>/shared/ohsumed/indexes/ohsumed_all</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/shared/ohsumed/data/trecText/ohsumed_all.txt</path>
    <class>trectext</class>
  </corpus>
</parameters>
#Build index
mkdir -p /shared/ohsumed/indexes/
cd ~/biocaddie
IndriBuildIndex index/build_index.ohsumed.params
Output is saved at /shared/ohsumed/indexes/ohsumed_all
Also make a copy at /data/ohsumed/indexes/ohsumed_all
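If the index needs to be rebuilt at a different location, one option is to generate the parameter file from shell variables so the index and corpus paths stay in sync. This is only a sketch: the variable names are illustrative, and it writes to a temporary file rather than to ~/biocaddie/index/build_index.ohsumed.params so that it has no side effects.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Paths copied from the steps above; edit these to build elsewhere.
index_dir="/shared/ohsumed/indexes/ohsumed_all"
corpus_file="/shared/ohsumed/data/trecText/ohsumed_all.txt"

# Temporary output file (illustrative only).
params_file=$(mktemp)

# Same parameters as the file shown above, with the paths substituted in.
cat > "$params_file" <<EOF
<parameters>
  <index>${index_dir}</index>
  <indexType>indri</indexType>
  <corpus>
    <path>${corpus_file}</path>
    <class>trectext</class>
  </corpus>
</parameters>
EOF

cat "$params_file"
```

The generated file can then be passed to IndriBuildIndex in place of index/build_index.ohsumed.params.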
3. Queries
#copy topics to /shared/ohsumed/queries folder
mkdir -p /shared/ohsumed/queries
cp /shared/ohsumed/ohsu-trec/trec9-train/query.ohsu.1-63 /shared/ohsumed/queries
cp /shared/ohsumed/ohsu-trec/pre-test/query.ohsu.test.1-43 /shared/ohsumed/queries
There are two query files: one for the pre-test set (43 queries) and one for the training and test sets (63 queries).
We create queries.combined.orig (106 queries in total), which includes all the queries, and queries.combined.short (63 queries), which contains only the queries used for the training and test sets (pre-test queries excluded).
#convert queries into trec format (use ohsumedtopics2trec.sh to create queries.combined.orig and ohsumedtopics2trec_v2.sh to create queries.combined.short)
cd ~/biocaddie
scripts/ohsumedtopics2trec.sh
Output is saved at /shared/ohsumed/queries
Also make a copy of the query at /data/ohsumed/queries
4. Qrels
#copy qrels to /shared/ohsumed/qrels folder
mkdir -p /shared/ohsumed/qrels
cp /shared/ohsumed/ohsu-trec/trec9-train/qrels.ohsu.batch.87 /shared/ohsumed/qrels
cp /shared/ohsumed/ohsu-trec/pre-test/qrels.ohsu.test.87 /shared/ohsumed/qrels
cp /shared/ohsumed/ohsu-trec/trec9-test/qrels.ohsu.88-91 /shared/ohsumed/qrels
As with the queries, the three qrels files contain relevance judgments for the pre-test, training, and test sets.
For queries.combined.orig, all qrels files are used (qrels.all).
For queries.combined.short, only the qrels for the training and test sets are used (qrels.ohsu.batch.87 and qrels.ohsu.88-91), giving qrels.notest.
However, the downloaded qrels are missing one column that trec_eval requires, so we have to add it before using them.
#convert qrels into the format trec_eval expects (insert 0 as the second column)
cat qrels.ohsu.* | sed 's/\t/\t0\t/1' > qrels.all
Output is saved at /shared/ohsumed/qrels
Also make a copy of the qrels at /data/ohsumed/qrels
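The conversion above only shows how qrels.all is produced. The sketch below illustrates the same column insertion on synthetic data and also builds qrels.notest from the training and test files only. It uses awk instead of sed because `\t` in sed expressions is a GNU extension. The three-column, tab-separated input layout (topic, document, relevance) and the helper name `add_iteration_column` are assumptions; trec_eval itself expects four columns (query ID, iteration, document ID, relevance).

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Synthetic three-column qrels: topic, document, relevance (tab-separated).
printf 'Q1\tD100\t1\n' > qrels.ohsu.batch.87
printf 'Q1\tD200\t2\n' > qrels.ohsu.88-91
printf 'Q2\tD300\t1\n' > qrels.ohsu.test.87

# add_iteration_column FILE...: insert a constant 0 as the second field.
add_iteration_column() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, 0, $2, $3 }' "$@"
}

# All three files for the orig queries; training and test only for short.
add_iteration_column qrels.ohsu.batch.87 qrels.ohsu.88-91 qrels.ohsu.test.87 > qrels.all
add_iteration_column qrels.ohsu.batch.87 qrels.ohsu.88-91 > qrels.notest

cat qrels.all
```

Listing the input files explicitly, rather than using the `qrels.ohsu.*` glob, makes it clear which judgments end up in each output file.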
5. IndriRunQuery - Output
cd ~/biocaddie/baselines/ohsumed
./<model>.sh <topic> <collection> | parallel -j 20 bash -c "{}"
For orig queries:
./tfidf.sh orig combined | parallel -j 20 bash -c "{}"
./jm.sh orig combined | parallel -j 20 bash -c "{}"
./dir.sh orig combined | parallel -j 20 bash -c "{}"
./two.sh orig combined | parallel -j 20 bash -c "{}"
./okapi.sh orig combined | parallel -j 20 bash -c "{}"
./rm3.sh orig combined | parallel -j 20 bash -c "{}"
For short queries:
./tfidf.sh short combined | parallel -j 20 bash -c "{}"
./jm.sh short combined | parallel -j 20 bash -c "{}"
./dir.sh short combined | parallel -j 20 bash -c "{}"
./two.sh short combined | parallel -j 20 bash -c "{}"
./okapi.sh short combined | parallel -j 20 bash -c "{}"
./rm3.sh short combined | parallel -j 20 bash -c "{}"
IndriRunQuery outputs for different baselines are stored at:
/data/ohsumed/output/tfidf/combined/orig
/data/ohsumed/output/dir/combined/orig
/data/ohsumed/output/okapi/combined/orig
/data/ohsumed/output/jm/combined/orig
/data/ohsumed/output/two/combined/orig
/data/ohsumed/output/rm3/combined/orig
---
/data/ohsumed/output/tfidf/combined/short
/data/ohsumed/output/dir/combined/short
/data/ohsumed/output/okapi/combined/short
/data/ohsumed/output/jm/combined/short
/data/ohsumed/output/two/combined/short
/data/ohsumed/output/rm3/combined/short
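Since every baseline is run the same way for both topic sets, the command lines can be generated in a loop. The sketch below is a dry run that only prints the commands. The model names are taken from the output directories listed above; that each one has a matching `<model>.sh` script in ~/biocaddie/baselines/ohsumed is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Model names come from the output directories above; a matching
# <model>.sh script for each one is assumed, not verified.
models=(tfidf dir okapi jm two rm3)
topics=(orig short)
collection=combined
jobs=20

# Build the command lines without executing them (dry run).
commands=()
for topic in "${topics[@]}"; do
    for model in "${models[@]}"; do
        commands+=("./${model}.sh ${topic} ${collection} | parallel -j ${jobs} bash -c \"{}\"")
    done
done

printf '%s\n' "${commands[@]}"
```

Reviewing the printed list first makes it easy to spot a missing or repeated baseline before launching the runs.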
6. Cross-validation
cd ~/biocaddie
For orig queries, which use qrels.all:
scripts/mkeval_ohsumed.sh <model> <topics> <collection>
E.g.: scripts/mkeval_ohsumed.sh tfidf orig combined
For short queries, which use qrels.notest:
scripts/mkeval_ohsumed_v2.sh <model> <topics> <collection>
E.g.: scripts/mkeval_ohsumed_v2.sh tfidf short combined
7. Compare models
cd ~/biocaddie
Rscript scripts/compare_ohsumed.R <collection> <from model> <to model> <topic>
Results (compared to tfidf baseline)
Using orig queries (pre-test queries included)
Model | MAP | NDCG | P@20 | NDCG@20 | P@100 | NDCG@100 | Notes | Date |
---|---|---|---|---|---|---|---|---|
tfidf | 0.2204 | 0.4538 | 0.2995 | 0.2904 | 0.1735 | 0.3376 | Sweep b and k1 | 06/07/17 |
Okapi | 0.2218 | 0.4557 | 0.2819- | 0.3035 | 0.1717 | 0.3386 | Sweep b, k1, k3 | 06/07/17 |
QL (JM) | 0.1876- | 0.4212- | 0.2505- | 0.2773 | 0.1403- | 0.295- | Sweep lambda | 06/07/17 |
QL (Dir) | 0.2032- | 0.4359- | 0.2713- | 0.2927 | 0.1633- | 0.3304 | Sweep mu | 06/07/17 |
QL (TS) | 0.2101- | 0.4415- | 0.2761- | 0.3029 | 0.1638- | 0.3277 | Sweep mu and lambda | 06/07/17 |
RM3 | 0.2618+ | 0.4592 | 0.3277+ | 0.2965 | 0.1913+ | 0.3662+ | Sweep mu, fbDocs, fbTerms, and lambda | 06/08/17 |
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir orig
[1] "map 0.2204 0.2032 p= 0.9988"
[1] "ndcg 0.4538 0.4359 p= 0.9987"
[1] "P_20 0.2995 0.2713 p= 0.9985"
[1] "ndcg_cut_20 0.2904 0.2927 p= 0.417"
[1] "P_100 0.1735 0.1633 p= 0.9945"
[1] "ndcg_cut_100 0.3376 0.3304 p= 0.7764"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm orig
[1] "map 0.2204 0.1876 p= 0.9966"
[1] "ndcg 0.4538 0.4212 p= 0.9992"
[1] "P_20 0.2995 0.2505 p= 0.9999"
[1] "ndcg_cut_20 0.2904 0.2773 p= 0.8572"
[1] "P_100 0.1735 0.1403 p= 1"
[1] "ndcg_cut_100 0.3376 0.295 p= 0.9996"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two orig
[1] "map 0.2204 0.2101 p= 0.972"
[1] "ndcg 0.4538 0.4415 p= 0.9859"
[1] "P_20 0.2995 0.2761 p= 0.9954"
[1] "ndcg_cut_20 0.2904 0.3029 p= 0.1072"
[1] "P_100 0.1735 0.1638 p= 0.9992"
[1] "ndcg_cut_100 0.3376 0.3277 p= 0.857"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi orig
[1] "map 0.2204 0.2218 p= 0.4445"
[1] "ndcg 0.4538 0.4557 p= 0.414"
[1] "P_20 0.2995 0.2819 p= 0.975"
[1] "ndcg_cut_20 0.2904 0.3035 p= 0.1157"
[1] "P_100 0.1735 0.1717 p= 0.6907"
[1] "ndcg_cut_100 0.3376 0.3386 p= 0.4437"
Using short queries (pre-test queries not included)
Model | MAP | NDCG | P@20 | NDCG@20 | P@100 | NDCG@100 | Notes | Date |
---|---|---|---|---|---|---|---|---|
tfidf | 0.3188 | 0.6084 | 0.45 | 0.4255 | 0.2657 | 0.4625 | Sweep b and k1 | 06/07/17 |
Okapi | 0.3117 | 0.6044 | 0.4408 | 0.4277 | 0.261 | 0.4569 | Sweep b, k1, k3 | 06/07/17 |
QL (JM) | 0.2545- | 0.5527- | 0.3908- | 0.3882- | 0.2135- | 0.3883- | Sweep lambda | 06/07/17 |
QL (Dir) | 0.2924- | 0.5866- | 0.3975- | 0.4018- | 0.2492- | 0.432- | Sweep mu | 06/07/17 |
QL (TS) | 0.2934- | 0.5828- | 0.4092- | 0.4122 | 0.2508- | 0.4385- | Sweep mu and lambda | 06/07/17 |
RM3 | 0.3717+ | 0.6087 | 0.5067+ | 0.4529 (p-value: 0.0541) | 0.291+ | 0.4934+ | Sweep mu, fbDocs, fbTerms, and lambda | 06/08/17 |
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir short
[1] "map 0.3188 0.2924 p= 0.9997"
[1] "ndcg 0.6084 0.5866 p= 0.9994"
[1] "P_20 0.45 0.3975 p= 0.9998"
[1] "ndcg_cut_20 0.4255 0.4018 p= 0.9881"
[1] "P_100 0.2657 0.2492 p= 0.9947"
[1] "ndcg_cut_100 0.4625 0.432 p= 0.9999"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm short
[1] "map 0.3188 0.2545 p= 1"
[1] "ndcg 0.6084 0.5527 p= 1"
[1] "P_20 0.45 0.3908 p= 0.9984"
[1] "ndcg_cut_20 0.4255 0.3882 p= 0.9973"
[1] "P_100 0.2657 0.2135 p= 1"
[1] "ndcg_cut_100 0.4625 0.3883 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi short
[1] "map 0.3188 0.3117 p= 0.7974"
[1] "ndcg 0.6084 0.6044 p= 0.6834"
[1] "P_20 0.45 0.4408 p= 0.7506"
[1] "ndcg_cut_20 0.4255 0.4277 p= 0.4236"
[1] "P_100 0.2657 0.261 p= 0.791"
[1] "ndcg_cut_100 0.4625 0.4569 p= 0.747"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two short
[1] "map 0.3188 0.2934 p= 1"
[1] "ndcg 0.6084 0.5828 p= 0.9997"
[1] "P_20 0.45 0.4092 p= 0.9989"
[1] "ndcg_cut_20 0.4255 0.4122 p= 0.89"
[1] "P_100 0.2657 0.2508 p= 0.9991"
[1] "ndcg_cut_100 0.4625 0.4385 p= 0.9992"
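The + and - suffixes in the result tables appear to follow the p-values printed by compare_ohsumed.R: values with p above 0.95 are marked "-", and the one value with a reported p just above 0.05 (RM3 NDCG@20 on short queries) is left unmarked. The sketch below applies that rule to a few lines of the output. The rule itself (a one-sided test with a 0.05 threshold, so p < 0.05 gives "+" and p > 0.95 gives "-") is inferred from the tables, not taken from the script, and `mark_significance` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# mark_significance: read compare_ohsumed.R output lines on stdin and print
# "metric value[+|-]" for the "to" model.
# Assumed rule (inferred from the tables, not from the R script):
#   p < 0.05 -> "+", p > 0.95 -> "-", otherwise no marker.
mark_significance() {
    sed -e 's/^\[1] "//' -e 's/"$//' |
    awk '{
        metric = $1; to = $3; p = $NF + 0
        mark = ""
        if (p < 0.05) mark = "+"
        else if (p > 0.95) mark = "-"
        print metric, to mark
    }'
}

# Three lines copied from the tfidf-vs-dir comparison on orig queries.
marked=$(mark_significance <<'EOF'
[1] "map 0.2204 0.2032 p= 0.9988"
[1] "ndcg_cut_20 0.2904 0.2927 p= 0.417"
[1] "P_100 0.1735 0.1633 p= 0.9945"
EOF
)
echo "$marked"
```

Piping the full output of a comparison through `mark_significance` yields one table cell per metric, which reduces the chance of a marker being dropped when the table is filled in by hand.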