1. Data (/shared/trecgenomics/data) 

#download TREC Genomics data 

mkdir -p /shared/trecgenomics/data 
cd /shared/trecgenomics/data 
wget http://skynet.ohsu.edu/trec-gen/data/2006/documents/ajepidem.zip 
wget http://skynet.ohsu.edu/trec-gen/data/2006/documents/ajpcell.zip 
… 

(total 59 files to be downloaded) 
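
The per-journal downloads can be scripted instead of typed one by one. A minimal sketch, assuming a hypothetical file journals.txt that you create yourself with one archive basename per line (ajepidem, ajpcell, and so on; the full list of 59 is not reproduced here). The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Base URL taken from the wget commands above.
BASE_URL="http://skynet.ohsu.edu/trec-gen/data/2006/documents"

# Print the download URL for one journal archive, e.g. "ajepidem".
journal_url() {
  printf '%s/%s.zip\n' "$BASE_URL" "$1"
}

# Download every archive named in a list file (one basename per line).
# journals.txt is a hypothetical file, not part of the repository.
download_all() {
  local list="${1:-journals.txt}" name
  while IFS= read -r name; do
    [ -n "$name" ] || continue          # skip blank lines
    wget -nc "$(journal_url "$name")"   # -nc: skip files already downloaded
  done < "$list"
}
```

Usage, from the data directory: cd /shared/trecgenomics/data && download_all journals.txt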

#unzip TREC Genomics data 

unzip '*.zip' 

#convert data into trec format 

cd ~/biocaddie/scripts    
./trecgenomics2trec.sh  

***#documents=162259 

Output: /shared/trecgenomics/data/trecText/trecgenomics_all.txt 

Also make a copy at /data/trecgenomics/data/ 

2. Indexes (/shared/trecgenomics/indexes/trecgenomics_all) 

Index param file: ~/biocaddie/index/build_index.trecgenomics.params 

#Content 

<parameters>
  <index>/shared/trecgenomics/indexes/trecgenomics_all</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/shared/trecgenomics/data/trecText/trecgenomics_all.txt</path>
    <class>trectext</class>
  </corpus>
</parameters>

#Build index 

mkdir -p /shared/trecgenomics/indexes/   
cd ~/biocaddie   
IndriBuildIndex index/build_index.trecgenomics.params   

Output is saved at /shared/trecgenomics/indexes/trecgenomics_all 

Also make a copy at /data/trecgenomics/indexes/trecgenomics_all 

3. Queries 

#download topics to /shared/trecgenomics/queries folder 

wget http://skynet.ohsu.edu/trec-gen/data/2007/2007topics.txt 

#convert query into trec format (use trecgentopics2trec.sh to create queries.combined.orig) 

cd ~/biocaddie   
scripts/trecgentopics2trec.sh      #for 2007 queries
scripts/trecgentopics2trec2006.sh  #for 2006 queries  

Output is saved at /shared/trecgenomics/queries 

Also make a copy of the query at /data/trecgenomics/queries 

4. Qrels 

2007 qrels:

#download qrels to /shared/trecgenomics/qrels folder 

wget http://skynet.ohsu.edu/trec-gen/data/2007/trecgen2007.all.judgments.tsv.txt 

#convert qrels into correct format for trec_eval (add in 0 in second column, replace NOT_RELEVANT with 0 and RELEVANT with 2, remove columns 4 and 5) 

grep -v "#" /shared/trecgenomics/qrels/trecgen2007.all.judgments.tsv.txt | sed -e 's/\tRELEVANT/\t2/g' -e 's/\tNOT_RELEVANT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,6 > trecgenomics-qrels-2007.txt  
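
To make the column handling in the pipeline above easier to follow, here is a rough awk equivalent. It is a sketch, not the command used to produce the results. The function name convert_qrels_2007 is hypothetical, and the code assumes five tab-separated input columns: topic, document ID, two span columns, and the label. Unlike the sed version, it silently drops rows whose label is neither RELEVANT nor NOT_RELEVANT.

```shell
#!/usr/bin/env bash
# Read the raw 2007 judgments on stdin, write trec_eval qrels on stdout.
convert_qrels_2007() {
  awk -F'\t' -v OFS='\t' '
    /#/ { next }                                    # same effect as grep -v "#"
    $5 == "RELEVANT"     { print $1, 0, $2, 2; next }   # topic, 0, docid, 2
    $5 == "NOT_RELEVANT" { print $1, 0, $2, 0 }         # topic, 0, docid, 0
  '
}
```

Usage: convert_qrels_2007 < trecgen2007.all.judgments.tsv.txt > trecgenomics-qrels-2007.txt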

2006 qrels:

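#download qrels to /shared/trecgenomics/qrels folder (NOTE: the URL below points at the 2006 topics file, while the conversion step reads trec2006.raw.relevance.tsv.txt; fetch that qrels file instead) 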

wget http://skynet.ohsu.edu/trec-gen/data/2006/topics/2006topics.txt

#convert qrels into correct format for trec_eval (add in 0 in second column, replace NOT with 0, POSSIBLY with 1 and DEFINITELY with 2, remove columns 4, 5 and 6) 

grep -v "#" /shared/trecgenomics/qrels/trec2006.raw.relevance.tsv.txt | sed -e 's/\tDEFINITELY/\t2/g' -e 's/\tPOSSIBLY/\t1/g' -e 's/\tNOT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,7 > trecgenomics-qrels-2006.txt 

***Problem with TREC Genomics qrels. 

The relevance judgments generated above contain duplicates: a document can have several judgments (RELEVANT/NOT_RELEVANT) for the same query, because judgments were made per maximum-length span within the document rather than once per document. 

Eg: In trecgen2007.all.judgments.tsv.txt file: 

200 9063387 2059 1870 NOT_RELEVANT 
200 9063387 7300 1702 RELEVANT 
200 9063387 58122 4989 NOT_RELEVANT 
200 9063387 82135 1426 RELEVANT 
200 9063387 83588 3235 RELEVANT 
200 9063387 97901 27036 NOT_RELEVANT 

In trecgenomics-qrels.txt: 

root@integration-1:/data/trecgenomics/qrels# grep 9063387  trecgenomics-qrels.txt 
200     0       9063387 0 
200     0       9063387 2 
200     0       9063387 0 
200     0       9063387 2 
200     0       9063387 2 
200     0       9063387 0 

To fix this problem, run Rscript trecgenqrels.R (in ~/biocaddie/scripts). The script groups the judgments by query and document number and sums the relevance values. If the sum is 0, the document is non-relevant and its relevance stays 0; if the sum is >= 2, the document has at least one relevant judgment (possibly mixed with non-relevant ones), so its relevance is set to 2.   

The output file, trecgenomics-qrels-nondup-<year>.txt, is saved at /shared/trecgenomics/qrels 
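
The R script is not reproduced here. As a rough illustration of the grouping rule described above, the following awk sketch groups by query and document and caps the summed relevance at 2. It is an assumption that a sum of 1 (possible with the 2006 POSSIBLY label) should stay 1; the text only covers sums of 0 and >= 2. The function name dedup_qrels is hypothetical.

```shell
#!/usr/bin/env bash
# Read qrels (topic, 0, docid, rel) on stdin, write one line per topic/docid pair.
dedup_qrels() {
  awk -v OFS='\t' '
    {
      key = $1 OFS $3                         # group by topic and document
      if (!(key in sum)) order[++n] = key     # remember first-seen order
      sum[key] += $4                          # add up the relevance values
    }
    END {
      for (i = 1; i <= n; i++) {
        split(order[i], k, OFS)
        s = sum[order[i]]
        print k[1], 0, k[2], (s >= 2 ? 2 : s)  # cap at 2 (assumption for s == 1)
      }
    }
  '
}
```

Applied to the six judgments for topic 200 and document 9063387 shown above, this yields a single line with relevance 2.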

Also make a copy of the qrels at /data/trecgenomics/qrels 

5. IndriRunQuery - Output  

cd ~/biocaddie/baselines/trecgenomics 
./<model>.sh <topic> <collection> <year> | parallel -j 20 bash -c "{}"  

Eg: 

./jm.sh orig combined 2006 | parallel -j 20 bash -c "{}"  
./dir.sh orig combined 2006 | parallel -j 20 bash -c "{}"  
./tfidf.sh orig combined 2006 | parallel -j 20 bash -c "{}"  
./two.sh orig combined 2006 | parallel -j 20 bash -c "{}"  
./okapi.sh orig combined 2006 | parallel -j 20 bash -c "{}"  
./rm3.sh orig combined 2006 | parallel -j 20 bash -c "{}"  
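
Judging by the pipeline above, each baseline script writes one shell command per line to standard output, and parallel runs up to 20 of those lines at a time through bash -c. The scripts themselves are not shown here, so the following is only a hypothetical generator for a Dirichlet mu sweep. The mu values, the query file name, the output file naming, and the function name emit_dir_sweep are all assumptions; the IndriRunQuery options used (-index, -rule, -count, -trecFormat) are standard Indri parameters.

```shell
#!/usr/bin/env bash
# Print one IndriRunQuery command per mu value; nothing is executed here.
emit_dir_sweep() {
  local topic="$1" col="$2" year="$3" mu
  local index="/shared/trecgenomics/indexes/trecgenomics_all"
  # Assumed query file name, modeled on queries.combined.orig.
  local queries="/shared/trecgenomics/queries/queries.${col}.${topic}"
  local outdir="/data/trecgenomics/output/${year}/dir/${col}/${topic}"
  for mu in 500 1000 2500; do   # illustrative values, not the real sweep
    echo "IndriRunQuery ${queries} -index=${index} -rule=method:dirichlet,mu:${mu} -count=1000 -trecFormat=true > ${outdir}/mu=${mu}.out"
  done
}
```

Usage: emit_dir_sweep orig combined 2006 | parallel -j 20 bash -c "{}"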

IndriRunQuery outputs for different baselines are stored at: 

/data/trecgenomics/output/<year>/tfidf/combined/orig 

/data/trecgenomics/output/<year>/dir/combined/orig 

/data/trecgenomics/output/<year>/okapi/combined/orig 

/data/trecgenomics/output/<year>/jm/combined/orig 

/data/trecgenomics/output/<year>/two/combined/orig 

/data/trecgenomics/output/<year>/rm3/combined/orig  

6. Cross-validation 

cd ~/biocaddie  
scripts/mkeval_trecgenomics.sh <model> <topics> <collection> <year>

Eg: scripts/mkeval_trecgenomics.sh tfidf orig combined 
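
The evaluation script is not shown here. The measure names in the comparison output further down (map, ndcg, P_20, ndcg_cut_20, P_100, ndcg_cut_100) are trec_eval measures, so a single run file can be scored roughly as follows. This is a sketch assuming trec_eval is on the PATH; the function name eval_run is hypothetical, and the sketch does not reproduce whatever cross-validation the script performs.

```shell
#!/usr/bin/env bash
# Score one run file against a qrels file with the measures reported in Results.
eval_run() {
  local qrels="$1" run="$2"
  # -q also prints per-query values, which a paired comparison of models needs.
  trec_eval -q -m map -m ndcg -m P.20,100 -m ndcg_cut.20,100 "$qrels" "$run"
}
```

Usage: eval_run /shared/trecgenomics/qrels/trecgenomics-qrels-nondup-2006.txt <run file>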

7. Compare models 

cd ~/biocaddie   
Rscript scripts/compare_trecgenomics.R <collection> <from model> <to model> <topic> <year>

Results (compared to tfidf baseline) 

2007 data

Model     MAP      NDCG     P@20     NDCG@20  P@100    NDCG@100  Notes                                  Date
tfidf     0.2465   0.528    0.3361   0.4077   0.2081   0.3915    Sweep b and k1                         06/23/17
Okapi     0.0666-  0.2568-  0.1389-  0.1393-  0.0953-  0.1415-   Sweep b, k1, k3                        06/23/17
QL (JM)   0.2136-  0.4771-  0.3403   0.3951   0.1847-  0.3583-   Sweep lambda                           06/23/17
QL (Dir)  0.2176   0.4772-  0.3514   0.4069   0.1881-  0.3576    Sweep mu                               06/23/17
QL (TS)   0.2379   0.5128   0.3569   0.437    0.1986   0.399     Sweep mu and lambda                    06/23/17
RM3       0.2536   0.5252   0.3653   0.4218   0.2164   0.3874    Sweep mu, fbDocs, fbTerms, and lambda  06/23/17


root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf dir orig 2007
[1] "map 0.2465 0.2176 p= 0.9297" 
[1] "ndcg 0.528 0.4772 p= 0.9838" 
[1] "P_20 0.3361 0.3514 p= 0.2011" 
[1] "ndcg_cut_20 0.4077 0.4069 p= 0.5111" 
[1] "P_100 0.2081 0.1881 p= 0.9771" 
[1] "ndcg_cut_100 0.3915 0.3576 p= 0.885" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf two orig 2007
[1] "map 0.2465 0.2379 p= 0.7532" 
[1] "ndcg 0.528 0.5128 p= 0.8973" 
[1] "P_20 0.3361 0.3569 p= 0.1197" 
[1] "ndcg_cut_20 0.4077 0.437 p= 0.1039" 
[1] "P_100 0.2081 0.1986 p= 0.8416" 
[1] "ndcg_cut_100 0.3915 0.399 p= 0.3308" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf jm orig 2007
[1] "map 0.2465 0.2136 p= 0.996" 
[1] "ndcg 0.528 0.4771 p= 1" 
[1] "P_20 0.3361 0.3403 p= 0.4073" 
[1] "ndcg_cut_20 0.4077 0.3951 p= 0.7083" 
[1] "P_100 0.2081 0.1847 p= 0.9802" 
[1] "ndcg_cut_100 0.3915 0.3583 p= 0.9727" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf okapi orig 2007
[1] "map 0.2465 0.0666 p= 1" 
[1] "ndcg 0.528 0.2568 p= 1" 
[1] "P_20 0.3361 0.1389 p= 0.9999" 
[1] "ndcg_cut_20 0.4077 0.1393 p= 1" 
[1] "P_100 0.2081 0.0953 p= 0.9998" 
[1] "ndcg_cut_100 0.3915 0.1415 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf rm3 orig 2007
[1] "map 0.2465 0.2536 p= 0.3791"
[1] "ndcg 0.528 0.5252 p= 0.5405"
[1] "P_20 0.3361 0.3653 p= 0.1006"
[1] "ndcg_cut_20 0.4077 0.4218 p= 0.3338"
[1] "P_100 0.2081 0.2164 p= 0.2333"
[1] "ndcg_cut_100 0.3915 0.3874 p= 0.5519"

2006 data 

Model     MAP      NDCG     P@20     NDCG@20  P@100    NDCG@100  Notes                                  Date
tfidf     0.2714   0.4897   0.3375   0.3922   0.1725   0.3884    Sweep b and k1                         06/27/17
Okapi     0.2372   0.3963-  0.2393-  0.3176-  0.1343-  0.3216    Sweep b, k1, k3                        06/27/17
QL (JM)   0.2774   0.5003   0.3268   0.4354+  0.1764   0.4307+   Sweep lambda                           06/27/17
QL (Dir)  0.2939   0.5196   0.3375   0.4524+  0.1836   0.4554+   Sweep mu                               06/27/17
QL (TS)   0.2884   0.5153   0.3268   0.4509+  0.1861   0.4479+   Sweep mu and lambda                    06/27/17
RM3       0.3415+  0.5093   0.3768   0.4521+  0.2168+  0.448+    Sweep mu, fbDocs, fbTerms, and lambda  06/27/17


root@integration-1:~/biocaddie#  Rscript scripts/compare_trecgenomics.R combined tfidf dir orig 2006
[1] "map 0.2714 0.2939 p= 0.1046"
[1] "ndcg 0.4897 0.5196 p= 0.0642"
[1] "P_20 0.3375 0.3375 p= 0.5"
[1] "ndcg_cut_20 0.3922 0.4524 p= 0.006"
[1] "P_100 0.1725 0.1836 p= 0.1524"
[1] "ndcg_cut_100 0.3884 0.4554 p= 0.0023"
root@integration-1:~/biocaddie#  Rscript scripts/compare_trecgenomics.R combined tfidf jm orig 2006
[1] "map 0.2714 0.2774 p= 0.3499"
[1] "ndcg 0.4897 0.5003 p= 0.2605"
[1] "P_20 0.3375 0.3268 p= 0.6859"
[1] "ndcg_cut_20 0.3922 0.4354 p= 0.0334"
[1] "P_100 0.1725 0.1764 p= 0.3434"
[1] "ndcg_cut_100 0.3884 0.4307 p= 0.0157"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf two orig 2006
[1] "map 0.2714 0.2884 p= 0.1384"
[1] "ndcg 0.4897 0.5153 p= 0.0823"
[1] "P_20 0.3375 0.3268 p= 0.6934"
[1] "ndcg_cut_20 0.3922 0.4509 p= 0.0062"
[1] "P_100 0.1725 0.1861 p= 0.1425"
[1] "ndcg_cut_100 0.3884 0.4479 p= 0.0032"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf okapi orig 2006
[1] "map 0.2714 0.2372 p= 0.8888"
[1] "ndcg 0.4897 0.3963 p= 0.9909"
[1] "P_20 0.3375 0.2393 p= 0.9909"
[1] "ndcg_cut_20 0.3922 0.3176 p= 0.953"
[1] "P_100 0.1725 0.1343 p= 0.9559"
[1] "ndcg_cut_100 0.3884 0.3216 p= 0.9465"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf rm3 orig 2006
[1] "map 0.2714 0.3415 p= 0.0136"
[1] "ndcg 0.4897 0.5093 p= 0.2485"
[1] "P_20 0.3375 0.3768 p= 0.0741"
[1] "ndcg_cut_20 0.3922 0.4521 p= 0.0199"
[1] "P_100 0.1725 0.2168 p= 0.0189"
[1] "ndcg_cut_100 0.3884 0.448 p= 0.0206"
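
The + and - suffixes in the result tables are not explained in the text. They appear consistent with flagging p < 0.05 as + and p > 0.95 as - in the comparison output above (presumably a one-sided test against tfidf). That reading is inferred from the numbers, not confirmed by the scripts. Under that assumption, the markers can be derived from the Rscript output with a small filter; mark_significance is a hypothetical name.

```shell
#!/usr/bin/env bash
# Read compare_trecgenomics.R output on stdin; print "<measure> <score><marker>".
mark_significance() {
  tr -d '"' | awk '
    $1 == "[1]" {
      p = $6 + 0                                   # p-value is the last field
      mark = (p < 0.05) ? "+" : ((p > 0.95) ? "-" : "")
      print $2, $4 mark                            # measure, compared model score
    }
  '
}
```

Usage: Rscript scripts/compare_trecgenomics.R combined tfidf rm3 orig 2006 | mark_significance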

