1. TREC CDS 2015 collection (http://trec-cds.appspot.com/2015.html)
1.1. Data (/shared/treccds/data/2015)
#download TREC CDS 2015 data
No Format |
---|
mkdir –p /shared/treccds/data/2015
cd /shared/treccds/data/2015
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-00.tar.gz
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-01.tar.gz
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-02.tar.gz
wget http://ceb.nlm.nih.gov/~simpsonmatt/pmc-text-03.tar.gz |
#untar TREC CDS 2015 data
No Format |
---|
tar xvzf pmc-text-00.tar.gz --owner root --group root --no-same-owner 2>&1 >> 2015data.log
tar xvzf pmc-text-01.tar.gz --owner root --group root --no-same-owner 2>&1 >> 2015data.log
tar xvzf pmc-text-02.tar.gz --owner root --group root --no-same-owner 2>&1 >> 2015data.log
tar xvzf pmc-text-03.tar.gz --owner root --group root --no-same-owner 2>&1 >> 2015data.log |
#convert data into trec format
No Format |
---|
cd ~/biocaddie/scripts
./xml2trec.sh 2015 |
Output:/shared/treccds/data/2015/trecText/treccds2015_all.txt
Also make a copy at /data/treccds/data/2015
1.2. Indexes (/shared/treccds/indexes/treccds2015_all)
...
No Format |
---|
<parameters> <index>/shared/treccds/indexes/treccds2015_all</index> <indexType>indri</indexType> <corpus> <path>/shared/treccds/data/2015/trecText/treccds2015_all.txt</path> <class>trectext</class> </corpus> </parameters> |
#Build index
No Format |
---|
mkdir -p /shared/treccds/indexes/
cd ~/biocaddie IndriBuildIndex
index/build_index.treccds2015.params |
Output is saved at /shared/treccds/indexes/treccds2015_all
...
As topics2015A.xml and topics2015B.xml are quite similar (only few topics in topics2015B.xml have extra <diagnosis> tag, we use topics2015B.xml only and rename it to topics2015.xml
#download topics 2015
No Format |
---|
mkdir -p /shared/treccds/queries/2015
cd /shared/treccds/queries/2015
wget http://trec-cds.appspot.com/topics2015B.xml
mv topics2015B.xml topics2015.xml |
#convert query into trec format (./topics2trec.sh <year>)
...
1.5. IndriRunQuery - Output (/shared/treccds/output/2015)
No Format |
---|
cd ~/biocaddie/baselines
./<model>.sh <topic> <collection> <year> |
IndriRunQuery outputs for different baselines are stored at:
...
/data/treccds/output/2015/rm3/combined/orig
1.6. Cross-validation
No Format |
---|
cd ~/biocaddie
scripts/mkeval_treccds.sh <model> <topics> <collection> <year> |
1.7. Compare models
No Format |
---|
cd ~/biocaddie Rscript
scripts/compare_treccds.R <collection> <from model> <to model> <topic> <year> |
Results: (rm3 to be updated)
Model | MAP | NDCG | P@20 | NDCG@20 | P@100 | NDCG@100 | Notes | Date |
---|---|---|---|---|---|---|---|---|
tfidf | 0.0957 | 0.3017 | 0.3 | 0.2353 | 0.1857 | 0.194 | Sweep b and k1 | 06/05/17 |
Okapi | 0.0552- | 0.1831- | 0.23- | 0.1824- | 0.128- | 0.1477- | Sweep b, k1, k3 | 06/05/17 |
QL (JM) | 0.0971 | 0.2901 | 0.3167 | 0.2708 | 0.1827 | 0.212 | Sweep lambda | 06/05/17 |
QL (Dir) | 0.1026 | 0.3002 | 0.3217 | 0.2705 | 0.1937 | 0.2197+ | Sweep mu | 06/05/17 |
QL (TS) | 0.0976 | 0.3001 | 0.295 | 0.2555 | 0.1857 | 0.2145 | Sweep mu and lambda | 06/05/17 |
RM3 | Sweep mu, fbDocs, fbTerms, and lambda |
No Format |
---|
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf dir orig 2015
[1] "map 0.0957 0.1026 p= 0.2837"
[1] "ndcg 0.3017 0.3002 p= 0.5381"
[1] "P_20 0.3 0.3217 p= 0.2352"
[1] "ndcg_cut_20 0.2353 0.2705 p= 0.0834"
[1] "P_100 0.1857 0.1937 p= 0.2639"
[1] "ndcg_cut_100 0.194 0.2197 p= 0.0396"
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf jm orig 2015
[1] "map 0.0957 0.0971 p= 0.4594"
[1] "ndcg 0.3017 0.2901 p= 0.7606"
[1] "P_20 0.3 0.3167 p= 0.3204"
[1] "ndcg_cut_20 0.2353 0.2708 p= 0.1038"
[1] "P_100 0.1857 0.1827 p= 0.5918"
[1] "ndcg_cut_100 0.194 0.212 p= 0.1283"
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf okapi orig 2015
[1] "map 0.0957 0.0552 p= 0.9997"
[1] "ndcg 0.3017 0.1831 p= 1"
[1] "P_20 0.3 0.23 p= 0.9979"
[1] "ndcg_cut_20 0.2353 0.1824 p= 0.9991"
[1] "P_100 0.1857 0.128 p= 0.9999"
[1] "ndcg_cut_100 0.194 0.1477 p= 0.9993"
root@integration-1:~/biocaddie# Rscript scripts/compare_treccds.R combined tfidf two orig 2015
[1] "map 0.0957 0.0976 p= 0.432"
[1] "ndcg 0.3017 0.3001 p= 0.5467"
[1] "P_20 0.3 0.295 p= 0.5667"
[1] "ndcg_cut_20 0.2353 0.2555 p= 0.2152"
[1] "P_100 0.1857 0.1857 p= 0.5"
[1] "ndcg_cut_100 0.194 0.2145 p= 0.0652" |
2. TREC CDS 2016 collection (http://trec-cds.appspot.com/2016.html)
Data and query format are same as TREC CDS 2015. However, qrels are not available.
3. TREC PM 2017 collection (http://trec-cds.appspot.com/2017.html)