1. TREC CDS 2015 collection (http://trec-cds.appspot.com/2015.html)

1.1. Data (/shared/treccds/data/2015) 

#download TREC CDS 2015 data 

mkdir -p /shared/treccds/data/2015 
cd /shared/treccds/data/2015 
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-00.tar.gz 
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-01.tar.gz 
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-02.tar.gz 
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-03.tar.gz 

#untar TREC CDS 2015 data 

tar xvzf pmc-text-00.tar.gz --owner root --group root --no-same-owner >> 2015data.log 2>&1 
tar xvzf pmc-text-01.tar.gz --owner root --group root --no-same-owner >> 2015data.log 2>&1 
tar xvzf pmc-text-02.tar.gz --owner root --group root --no-same-owner >> 2015data.log 2>&1 
tar xvzf pmc-text-03.tar.gz --owner root --group root --no-same-owner >> 2015data.log 2>&1 
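
The four archives differ only in their two-digit suffix, so the extraction step can be written as a loop. The following is a minimal sketch, assuming bash and a tar that supports --no-same-owner; the function name extract_archives is hypothetical and not part of the biocaddie scripts. Both the verbose listing and any errors are appended to the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract every pmc-text-NN.tar.gz found in a directory,
# appending tar's verbose listing and any error messages to one log file.
extract_archives() {
    local dir="$1" log="$2" archive
    for archive in "$dir"/pmc-text-*.tar.gz; do
        [ -e "$archive" ] || continue   # the glob matched nothing
        # '>> log 2>&1' points stdout at the log first, then stderr at the same place.
        tar xvzf "$archive" -C "$dir" --no-same-owner >> "$log" 2>&1 || return 1
    done
}

# Usage with the paths from this page:
# extract_archives /shared/treccds/data/2015 /shared/treccds/data/2015/2015data.log
```

The function stops at the first archive that fails to extract, so a truncated download is noticed before the conversion step runs.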

#convert data into trec format 

cd ~/biocaddie/scripts 
./xml2trec.sh 2015 

Output: /shared/treccds/data/2015/trecText/treccds2015_all.txt 

Also make a copy at /data/treccds/data/2015 


1.2. Indexes (/shared/treccds/indexes/treccds2015_all) 

Index param file: ~/biocaddie/index/build_index.treccds2015.params 

#Content

<parameters>
   <index>/shared/treccds/indexes/treccds2015_all</index>
   <indexType>indri</indexType>
   <corpus>
     <path>/shared/treccds/data/2015/trecText/treccds2015_all.txt</path>
     <class>trectext</class>
   </corpus>
</parameters> 

#Build index

mkdir -p /shared/treccds/indexes/ 
cd ~/biocaddie 
IndriBuildIndex index/build_index.treccds2015.params 

Output is saved at /shared/treccds/indexes/treccds2015_all 

Also make a copy at /data/treccds/indexes/treccds2015_all 

 

1.3. Queries (/shared/treccds/queries/2015) 

Because topics2015A.xml and topics2015B.xml are nearly identical (only a few topics in topics2015B.xml carry an extra <diagnosis> tag), we use only topics2015B.xml and rename it to topics2015.xml. 

#download topics 2015 

mkdir -p /shared/treccds/queries/2015 
cd /shared/treccds/queries/2015 
wget http://trec-cds.appspot.com/topics2015B.xml 
mv topics2015B.xml topics2015.xml 
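
To check how many topics actually carry the extra <diagnosis> tag, the opening tags can be counted directly. This is a minimal sketch; count_tag is a hypothetical helper, and it assumes opening tags are written as <topic ...> and <diagnosis> with no unusual whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count occurrences of an opening XML tag in a file.
# Usage: count_tag <file> <tag-name>
count_tag() {
    # grep -o prints each match on its own line, so wc -l counts occurrences
    # rather than matching lines. '[ >]' keeps <topic from matching <topics.
    grep -o "<$2[ >]" "$1" | wc -l | tr -d ' '
}

# Usage:
# count_tag topics2015.xml topic
# count_tag topics2015.xml diagnosis
```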

#convert query into trec format (./topics2trec.sh <year>)

cd ~/biocaddie 
scripts/topics2trec.sh 2015  

Output is saved at /shared/treccds/queries/2015/queries.combined.orig 

Also make a copy of the query at /data/treccds/queries/2015 

 

1.4. Qrels (/shared/treccds/qrels/2015) 

#download qrels 2015 

mkdir -p /shared/treccds/qrels/2015 
cd /shared/treccds/qrels/2015 
wget http://trec-cds.appspot.com/qrels-treceval-2015.txt 

Also make a copy at /data/treccds/qrels/2015 
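
A quick sanity check on the downloaded qrels can catch a truncated or mis-saved file. The sketch below assumes the file follows the standard trec_eval layout of four whitespace-separated columns (topic, iteration, document ID, relevance) and treats any relevance above 0 as relevant; qrels_summary is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a qrels file in the standard trec_eval layout
# (topic, iteration, document ID, relevance), one judgment per line.
qrels_summary() {
    awk 'NF == 4 {
             judged++
             if (!($1 in seen)) { seen[$1] = 1; topics++ }
             if ($4 > 0) relevant++
         }
         END { printf "topics=%d judged=%d relevant=%d\n", topics, judged, relevant }' "$1"
}

# Usage:
# qrels_summary /shared/treccds/qrels/2015/qrels-treceval-2015.txt
```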


1.5. IndriRunQuery - Output (/shared/treccds/output/2015) 

cd ~/biocaddie/baselines
./<model>.sh <topic> <collection> <year> | parallel -j 20 bash -c "{}"

IndriRunQuery outputs for different baselines are stored at: 

/data/treccds/output/2015/two/combined/orig  

/data/treccds/output/2015/tfidf/combined/orig 

/data/treccds/output/2015/dir/combined/orig 

/data/treccds/output/2015/okapi/combined/orig 

/data/treccds/output/2015/jm/combined/orig 

/data/treccds/output/2015/rm3/combined/orig 
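
The six directories above share the layout <root>/<year>/<model>/<collection>/<topic>, so a short loop can confirm that every baseline has produced output before cross-validation is run. This is a sketch only: count_runs is a hypothetical helper, and it assumes each run is written as a regular file somewhere under the model's directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many output files each baseline has produced.
# Usage: count_runs <root> <year> <collection> <topic>
count_runs() {
    local root="$1" year="$2" collection="$3" topic="$4" model dir n
    for model in two tfidf dir okapi jm rm3; do
        dir="$root/$year/$model/$collection/$topic"
        if [ -d "$dir" ]; then
            n=$(find "$dir" -type f | wc -l | tr -d ' ')
        else
            n=missing
        fi
        printf '%s %s\n' "$model" "$n"
    done
}

# Usage:
# count_runs /data/treccds/output 2015 combined orig
```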

 

1.6. Cross-validation 

cd ~/biocaddie 
scripts/mkeval_treccds.sh <model> <topics> <collection> <year> 


1.7. Compare models 

cd ~/biocaddie 
Rscript scripts/compare_treccds.R <collection> <from model> <to model> <topic> <year> 

Results (compared to tfidf baseline)

Model     MAP      NDCG     P@20     NDCG@20  P@100    NDCG@100  Notes                                  Date
tfidf     0.0957   0.3017   0.3      0.2353   0.1857   0.194     Sweep b and k1                         06/05/17
Okapi     0.0552-  0.1831-  0.23-    0.1824-  0.128-   0.1477-   Sweep b, k1, k3                        06/05/17
QL (JM)   0.0971   0.2901   0.3167   0.2708   0.1827   0.212     Sweep lambda                           06/05/17
QL (Dir)  0.1026   0.3002   0.3217   0.2705   0.1937   0.2197+   Sweep mu                               06/05/17
QL (TS)   0.0976   0.3001   0.295    0.2555   0.1857   0.2145    Sweep mu and lambda                    06/05/17
RM3       0.156+   0.3614+  0.375+   0.2926+  0.236+   0.2595+   Sweep mu, fbDocs, fbTerms, and lambda  06/06/17


root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf dir orig 2015 
[1] "map 0.0957 0.1026 p= 0.2837" 
[1] "ndcg 0.3017 0.3002 p= 0.5381" 
[1] "P_20 0.3 0.3217 p= 0.2352" 
[1] "ndcg_cut_20 0.2353 0.2705 p= 0.0834" 
[1] "P_100 0.1857 0.1937 p= 0.2639" 
[1] "ndcg_cut_100 0.194 0.2197 p= 0.0396" 
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf jm orig 2015 
[1] "map 0.0957 0.0971 p= 0.4594" 
[1] "ndcg 0.3017 0.2901 p= 0.7606" 
[1] "P_20 0.3 0.3167 p= 0.3204" 
[1] "ndcg_cut_20 0.2353 0.2708 p= 0.1038" 
[1] "P_100 0.1857 0.1827 p= 0.5918" 
[1] "ndcg_cut_100 0.194 0.212 p= 0.1283" 
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf okapi orig 2015 
[1] "map 0.0957 0.0552 p= 0.9997" 
[1] "ndcg 0.3017 0.1831 p= 1" 
[1] "P_20 0.3 0.23 p= 0.9979" 
[1] "ndcg_cut_20 0.2353 0.1824 p= 0.9991" 
[1] "P_100 0.1857 0.128 p= 0.9999" 
[1] "ndcg_cut_100 0.194 0.1477 p= 0.9993" 
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf two orig 2015 
[1] "map 0.0957 0.0976 p= 0.432" 
[1] "ndcg 0.3017 0.3001 p= 0.5467" 
[1] "P_20 0.3 0.295 p= 0.5667" 
[1] "ndcg_cut_20 0.2353 0.2555 p= 0.2152" 
[1] "P_100 0.1857 0.1857 p= 0.5" 
[1] "ndcg_cut_100 0.194 0.2145 p= 0.0652"
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf rm3 orig 2015
[1] "map 0.0957 0.156 p= 0.0042"
[1] "ndcg 0.3017 0.3614 p= 0.005"
[1] "P_20 0.3 0.375 p= 0.0185"
[1] "ndcg_cut_20 0.2353 0.2926 p= 0.0348"
[1] "P_100 0.1857 0.236 p= 0.0155"
[1] "ndcg_cut_100 0.194 0.2595 p= 0.0052"
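
The + and - marks in the results table can be derived from this output mechanically. The sketch below is an assumption-laden illustration: flag_significant is a hypothetical helper, the 0.05 threshold is inferred from which rows carry a + in the table, and the mirrored rule (p above 0.95 yields -) is inferred from the Okapi row rather than stated anywhere in the scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read compare_treccds.R output on stdin and mark each
# metric with '+' (p < alpha) or '-' (p > 1 - alpha). The threshold and the
# mirrored '-' rule are assumptions inferred from the results table.
# Usage: Rscript scripts/compare_treccds.R ... | flag_significant [alpha]
flag_significant() {
    local alpha="${1:-0.05}"
    awk -v alpha="$alpha" '
        /^\[1\] "/ {
            gsub(/"/, "")          # drop the quotes R prints; fields are re-split
            p = $6 + 0
            mark = ""
            if (p < alpha) mark = "+"
            else if (p > 1 - alpha) mark = "-"
            printf "%s %s -> %s%s (p=%s)\n", $2, $3, $4, mark, $6
        }'
}
```

Shell prompt lines in a pasted transcript are skipped, because only lines starting with [1] " are processed.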


2. TREC CDS 2016 collection (http://trec-cds.appspot.com/2016.html)

The data and query formats are the same as for TREC CDS 2015. However, qrels are not available.

3. TREC PM 2017 collection (http://trec-cds.appspot.com/2017.html)

The data and query formats differ from those of TREC CDS. Qrels are not available here either.

