
1. Data (/shared/ohsumed/data) 

#download Ohsumed data 

mkdir -p /shared/ohsumed 
cd /shared/ohsumed
wget http://trec.nist.gov/data/filtering/t9.filtering.tar.gz

#untar Ohsumed data 

tar xvzf t9.filtering.tar.gz --owner root --group root --no-same-owner >> ohsumeddata.log 2>&1

#copy data into /shared/ohsumed/data 

mkdir -p /shared/ohsumed/data
cp /shared/ohsumed/ohsu-trec/trec9-test/ohsumed.88-91 /shared/ohsumed/data
cp /shared/ohsumed/ohsu-trec/trec9-train/ohsumed.87 /shared/ohsumed/data

#convert data into trec format 

cd ~/biocaddie/scripts   
./ohsumed2trec.sh 

***#documents=348566 

Output: /shared/ohsumed/data/trecText/ohsumed_all.txt 

Also make a copy at /data/ohsumed/data/ 
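
A quick sanity check after the conversion is to count the documents in the output file and compare the result with the 348566 reported above. The sketch below assumes that ohsumed2trec.sh writes standard TREC text in which every document opens with a `<DOC>` tag; the function name count_trec_docs is made up for illustration and is not part of the biocaddie scripts.

```shell
# Count the documents in a TREC-text file.
# Assumption: each document contributes exactly one line containing <DOC>.
# (<DOCNO> lines do not match, because the pattern requires ">" right after "DOC".)
count_trec_docs() {
  grep -c '<DOC>' "$1"
}
```

If the assumption holds, `count_trec_docs /shared/ohsumed/data/trecText/ohsumed_all.txt` should print 348566.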
 

2. Indexes (/shared/ohsumed/indexes/ohsumed_all) 

Index param file: ~/biocaddie/index/build_index.ohsumed.params 

#Content 

<parameters> 
  <index>/shared/ohsumed/indexes/ohsumed_all</index> 
  <indexType>indri</indexType> 
  <corpus> 
    <path>/shared/ohsumed/data/trecText/ohsumed_all.txt</path> 
    <class>trectext</class> 
  </corpus> 
</parameters> 

#Build index 

mkdir -p /shared/ohsumed/indexes/  
cd ~/biocaddie  
IndriBuildIndex index/build_index.ohsumed.params  

Output is saved at /shared/ohsumed/indexes/ohsumed_all 

Also make a copy at /data/ohsumed/indexes/ohsumed_all 
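
If the index or corpus location changes, the parameter file can be regenerated rather than edited by hand. The following is a minimal sketch that prints the same XML as shown above for a given index directory and corpus file; write_index_params is a hypothetical helper, not an existing script.

```shell
# Print an IndriBuildIndex parameter file to stdout.
# $1 = index directory, $2 = TREC-text corpus file.
# write_index_params is a hypothetical helper used only for illustration.
write_index_params() {
  index_dir=$1
  corpus_file=$2
  cat <<EOF
<parameters>
  <index>${index_dir}</index>
  <indexType>indri</indexType>
  <corpus>
    <path>${corpus_file}</path>
    <class>trectext</class>
  </corpus>
</parameters>
EOF
}
```

For example, `write_index_params /shared/ohsumed/indexes/ohsumed_all /shared/ohsumed/data/trecText/ohsumed_all.txt > index/build_index.ohsumed.params` would reproduce the file used here.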


3. Queries 

#copy topics to /shared/ohsumed/queries folder 

mkdir -p /shared/ohsumed/queries
cp /shared/ohsumed/ohsu-trec/trec9-train/query.ohsu.1-63 /shared/ohsumed/queries 
cp /shared/ohsumed/ohsu-trec/pre-test/query.ohsu.test.1-43 /shared/ohsumed/queries 

There are two query files: one for the pre-test set (43 queries) and one for the training and test sets (63 queries).

We create queries.combined.orig (106 queries in total), which includes all the queries, and queries.combined.short (63 queries in total), which contains only the queries used for the training and test sets (pre-test queries are excluded).

#convert queries into trec format (ohsumedtopics2trec.sh creates queries.combined.orig; ohsumedtopics2trec_v2.sh creates queries.combined.short)

cd ~/biocaddie  
scripts/ohsumedtopics2trec.sh
scripts/ohsumedtopics2trec_v2.sh

Output is saved at /shared/ohsumed/queries

Also make a copy of the query at /data/ohsumed/queries 
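
The query totals can be checked against the raw topic files before conversion. This sketch assumes the TREC-9 filtering topic files wrap each topic in a `<top>` ... `</top>` block; count_topics is an illustrative name, not an existing script.

```shell
# Count topics across one or more raw topic files.
# Assumption: each topic contributes exactly one line containing <top>.
count_topics() {
  cat "$@" | grep -c '<top>'
}
```

Under that assumption, `count_topics /shared/ohsumed/queries/query.ohsu.1-63` should print 63, and passing both topic files should print 106.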


4. Qrels

#copy qrels to /shared/ohsumed/qrels folder 

mkdir -p /shared/ohsumed/qrels
cp /shared/ohsumed/ohsu-trec/trec9-train/qrels.ohsu.batch.87 /shared/ohsumed/qrels 
cp /shared/ohsumed/ohsu-trec/pre-test/qrels.ohsu.test.87 /shared/ohsumed/qrels 
cp /shared/ohsumed/ohsu-trec/trec9-test/qrels.ohsu.88-91 /shared/ohsumed/qrels 

As with the queries, the three qrels files contain the relevance judgments for the pre-test, training, and test sets.

For queries.combined.orig, all qrels files are used (qrels.all).

For queries.combined.short, only the qrels for the training and test sets are used (qrels.ohsu.batch.87 and qrels.ohsu.88-91), giving qrels.notest.

However, the downloaded qrels lack one column that trec_eval expects, so the missing column has to be added before use.

#convert qrels into the correct format for trec_eval (insert 0 as the second column) 

cd /shared/ohsumed/qrels
cat qrels.ohsu.* | sed 's/\t/\t0\t/1' > qrels.all 

Output is saved at /shared/ohsumed/qrels

Also make a copy of the qrels at /data/ohsumed/qrels  
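
trec_eval expects four columns per qrels line (query id, iteration, document id, relevance), while the downloaded files have three. The sketch below does the same column insertion with awk, which avoids relying on the GNU sed `\t` escape; add_iteration_column is an illustrative name, and the sample line is invented rather than taken from the real qrels.

```shell
# Insert a constant 0 as the second column of tab-separated qrels read from stdin.
add_iteration_column() {
  awk 'BEGIN { FS = OFS = "\t" } { $1 = $1 OFS "0"; print }'
}

# Invented sample line: query id, document id, relevance.
printf 'OHSU1\t87049087\t1\n' | add_iteration_column
```

qrels.notest is not built by the command above; presumably it can be produced the same way from the two files it is described as using, for example `cat qrels.ohsu.batch.87 qrels.ohsu.88-91 | add_iteration_column > qrels.notest`.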
 

5. IndriRunQuery - Output

cd ~/biocaddie/baselines/ohsumed 
./<model>.sh <topic> <collection> | parallel -j 20 bash -c "{}" 

For orig queries:

./jm.sh orig combined | parallel -j 20 bash -c "{}" 
./dir.sh orig combined | parallel -j 20 bash -c "{}" 
./tfidf.sh orig combined | parallel -j 20 bash -c "{}" 
./two.sh orig combined | parallel -j 20 bash -c "{}" 
./okapi.sh orig combined | parallel -j 20 bash -c "{}" 
./rm3.sh orig combined | parallel -j 20 bash -c "{}" 

For short queries:

./jm.sh short combined | parallel -j 20 bash -c "{}" 
./dir.sh short combined | parallel -j 20 bash -c "{}" 
./tfidf.sh short combined | parallel -j 20 bash -c "{}" 
./two.sh short combined | parallel -j 20 bash -c "{}" 
./okapi.sh short combined | parallel -j 20 bash -c "{}" 
./rm3.sh short combined | parallel -j 20 bash -c "{}" 

IndriRunQuery outputs for different baselines are stored at: 

/data/ohsumed/output/tfidf/combined/orig
/data/ohsumed/output/dir/combined/orig
/data/ohsumed/output/okapi/combined/orig
/data/ohsumed/output/jm/combined/orig
/data/ohsumed/output/two/combined/orig
/data/ohsumed/output/rm3/combined/orig 
---
/data/ohsumed/output/tfidf/combined/short
/data/ohsumed/output/dir/combined/short
/data/ohsumed/output/okapi/combined/short
/data/ohsumed/output/jm/combined/short
/data/ohsumed/output/two/combined/short
/data/ohsumed/output/rm3/combined/short 
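
Each <model>.sh script appears to print one IndriRunQuery command per line (one per parameter setting), and parallel then runs up to 20 of those lines at a time through bash -c. To avoid typing the twelve invocations by hand, the following sketch prints them as a dry run. It assumes that every model with an output directory listed above has a matching <model>.sh script; print_run_commands is an illustrative name.

```shell
# Print (without executing) the run command for every model and topic set.
# Assumption: each model under /data/ohsumed/output has a matching <model>.sh script.
print_run_commands() {
  for topics in orig short; do
    for model in tfidf dir okapi jm two rm3; do
      printf './%s.sh %s combined | parallel -j 20 bash -c "{}"\n' "$model" "$topics"
    done
  done
}
```

After checking the printed list, it can be executed from ~/biocaddie/baselines/ohsumed with `print_run_commands | bash`.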

 

6. Cross-validation 

cd ~/biocaddie  

For orig queries, which use qrels.all:

scripts/mkeval_ohsumed.sh <model> <topics> <collection>
E.g.: scripts/mkeval_ohsumed.sh tfidf orig combined 

For short queries, which use qrels.notest:

scripts/mkeval_ohsumed_v2.sh <model> <topics> <collection>
E.g.: scripts/mkeval_ohsumed_v2.sh tfidf short combined 
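
Because the evaluation script depends on the topic set, it is easy to pair the wrong script with the wrong qrels. The sketch below encodes that pairing and prints the evaluation commands for all models as a dry run; the function names are illustrative, and the model list is taken from the output directories in step 5.

```shell
# Pick the evaluation script for a topic set:
#   orig  -> mkeval_ohsumed.sh    (uses qrels.all)
#   short -> mkeval_ohsumed_v2.sh (uses qrels.notest)
eval_script_for() {
  case "$1" in
    orig)  echo scripts/mkeval_ohsumed.sh ;;
    short) echo scripts/mkeval_ohsumed_v2.sh ;;
    *)     echo "unknown topic set: $1" >&2; return 1 ;;
  esac
}

# Print (without executing) the evaluation command for each model and topic set.
print_eval_commands() {
  for topics in orig short; do
    script=$(eval_script_for "$topics") || return 1
    for model in tfidf dir okapi jm two rm3; do
      echo "$script $model $topics combined"
    done
  done
}
```

Running `print_eval_commands | bash` from ~/biocaddie would then evaluate every model on both topic sets.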


7. Compare models

cd ~/biocaddie  
Rscript scripts/compare_ohsumed.R <collection> <from model> <to model> <topic> 

Results (compared to tfidf baseline) 

Using orig queries (pre-test queries included)

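Model    | MAP     | NDCG    | P@20    | NDCG@20 | P@100   | NDCG@100 | Notes                                 | Date
---------+---------+---------+---------+---------+---------+----------+---------------------------------------+---------
tfidf    | 0.2204  | 0.4538  | 0.2995  | 0.2904  | 0.1735  | 0.3376   | Sweep b and k1                        | 06/07/17
Okapi    | 0.2218  | 0.4557  | 0.2819- | 0.3035  | 0.1717  | 0.3386   | Sweep b, k1, k3                       | 06/07/17
QL (JM)  | 0.1876- | 0.4212- | 0.2505- | 0.2773  | 0.1403- | 0.295-   | Sweep lambda                          | 06/07/17
QL (Dir) | 0.2032- | 0.4359- | 0.2713- | 0.2927  | 0.1633- | 0.3304   | Sweep mu                              | 06/07/17
QL (TS)  | 0.2101- | 0.4415- | 0.2761- | 0.3029  | 0.1638- | 0.3277   | Sweep mu and lambda                   | 06/07/17
RM3      | 0.2618+ | 0.4592  | 0.3277+ | 0.2965  | 0.1913+ | 0.3662+  | Sweep mu, fbDocs, fbTerms, and lambda | 06/08/17
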
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir orig 
[1] "map 0.2204 0.2032 p= 0.9988" 
[1] "ndcg 0.4538 0.4359 p= 0.9987" 
[1] "P_20 0.2995 0.2713 p= 0.9985" 
[1] "ndcg_cut_20 0.2904 0.2927 p= 0.417" 
[1] "P_100 0.1735 0.1633 p= 0.9945" 
[1] "ndcg_cut_100 0.3376 0.3304 p= 0.7764" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm orig 
[1] "map 0.2204 0.1876 p= 0.9966" 
[1] "ndcg 0.4538 0.4212 p= 0.9992" 
[1] "P_20 0.2995 0.2505 p= 0.9999" 
[1] "ndcg_cut_20 0.2904 0.2773 p= 0.8572" 
[1] "P_100 0.1735 0.1403 p= 1" 
[1] "ndcg_cut_100 0.3376 0.295 p= 0.9996" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two orig 
[1] "map 0.2204 0.2101 p= 0.972" 
[1] "ndcg 0.4538 0.4415 p= 0.9859" 
[1] "P_20 0.2995 0.2761 p= 0.9954" 
[1] "ndcg_cut_20 0.2904 0.3029 p= 0.1072" 
[1] "P_100 0.1735 0.1638 p= 0.9992" 
[1] "ndcg_cut_100 0.3376 0.3277 p= 0.857" 
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi orig 
[1] "map 0.2204 0.2218 p= 0.4445" 
[1] "ndcg 0.4538 0.4557 p= 0.414" 
[1] "P_20 0.2995 0.2819 p= 0.975" 
[1] "ndcg_cut_20 0.2904 0.3035 p= 0.1157" 
[1] "P_100 0.1735 0.1717 p= 0.6907" 
[1] "ndcg_cut_100 0.3376 0.3386 p= 0.4437" 

Using short queries (pre-test queries not included)

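Model    | MAP     | NDCG    | P@20    | NDCG@20                  | P@100   | NDCG@100 | Notes                                 | Date
---------+---------+---------+---------+--------------------------+---------+----------+---------------------------------------+---------
tfidf    | 0.3188  | 0.6084  | 0.45    | 0.4255                   | 0.2657  | 0.4625   | Sweep b and k1                        | 06/07/17
Okapi    | 0.3117  | 0.6044  | 0.4408  | 0.4277                   | 0.261   | 0.4569   | Sweep b, k1, k3                       | 06/07/17
QL (JM)  | 0.2545- | 0.5527- | 0.3908- | 0.3882-                  | 0.2135- | 0.3883-  | Sweep lambda                          | 06/07/17
QL (Dir) | 0.2924- | 0.5866- | 0.3975  | 0.4018-                  | 0.2492- | 0.432-   | Sweep mu                              | 06/07/17
QL (TS)  | 0.2934- | 0.5828- | 0.4092- | 0.4122                   | 0.2508- | 0.4385-  | Sweep mu and lambda                   | 06/07/17
RM3      | 0.3717+ | 0.6087  | 0.5067+ | 0.4529 (p-value: 0.0541) | 0.291+  | 0.4934+  | Sweep mu, fbDocs, fbTerms, and lambda | 06/08/17
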
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf dir short
[1] "map 0.3188 0.2924 p= 0.9997"
[1] "ndcg 0.6084 0.5866 p= 0.9994"
[1] "P_20 0.45 0.3975 p= 0.9998"
[1] "ndcg_cut_20 0.4255 0.4018 p= 0.9881"
[1] "P_100 0.2657 0.2492 p= 0.9947"
[1] "ndcg_cut_100 0.4625 0.432 p= 0.9999"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf jm short
[1] "map 0.3188 0.2545 p= 1"
[1] "ndcg 0.6084 0.5527 p= 1"
[1] "P_20 0.45 0.3908 p= 0.9984"
[1] "ndcg_cut_20 0.4255 0.3882 p= 0.9973"
[1] "P_100 0.2657 0.2135 p= 1"
[1] "ndcg_cut_100 0.4625 0.3883 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf okapi short
[1] "map 0.3188 0.3117 p= 0.7974"
[1] "ndcg 0.6084 0.6044 p= 0.6834"
[1] "P_20 0.45 0.4408 p= 0.7506"
[1] "ndcg_cut_20 0.4255 0.4277 p= 0.4236"
[1] "P_100 0.2657 0.261 p= 0.791"
[1] "ndcg_cut_100 0.4625 0.4569 p= 0.747"
root@integration-1:~/biocaddie# Rscript scripts/compare_ohsumed.R combined tfidf two short
[1] "map 0.3188 0.2934 p= 1"
[1] "ndcg 0.6084 0.5828 p= 0.9997"
[1] "P_20 0.45 0.4092 p= 0.9989"
[1] "ndcg_cut_20 0.4255 0.4122 p= 0.89"
[1] "P_100 0.2657 0.2508 p= 0.9991"
[1] "ndcg_cut_100 0.4625 0.4385 p= 0.9992"
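
The + and - markers in the results tables can be derived from these logs mechanically. The sketch below reads compare_ohsumed.R output and appends a marker to the second model's score. The thresholds are an assumption inferred from the numbers on this page, not taken from the R script: the test is treated as one-sided, so p < 0.05 is marked + (better than the tfidf baseline) and p > 0.95 is marked - (worse). mark_significance is an illustrative name.

```shell
# Read compare_ohsumed.R output on stdin and print "<metric> <score><marker>".
# Assumed thresholds (not confirmed by the R script): p < 0.05 -> "+", p > 0.95 -> "-".
mark_significance() {
  awk '
    /^\[1\]/ {
      gsub(/"/, "")            # drop the quotes R prints around each string
      p = $6 + 0               # fields: [1] metric baseline model "p=" p-value
      mark = ""
      if (p < 0.05) mark = "+"
      else if (p > 0.95) mark = "-"
      printf "%s %s%s\n", $2, $4, mark
    }'
}
```

For example, `Rscript scripts/compare_ohsumed.R combined tfidf dir orig | mark_significance` would print one line per metric; shell prompt lines and anything else not starting with [1] are ignored.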