...

Also keep a copy of the queries in /data/trecgenomics/queries

4. Qrels 

2007 qrels:

#download qrels to /shared/trecgenomics/qrels folder 

...

No Format
grep -v "#" /shared/trecgenomics/qrels/trecgen2007.all.judgments.tsv.txt | sed -e 's/\tRELEVANT/\t2/g' -e 's/\tNOT_RELEVANT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,6 > trecgenomics-qrels-2007.txt  
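
To make the effect of the 2007 pipeline concrete, here is a minimal sketch that runs the same grep/sed/cut steps on two sample lines. It assumes GNU sed (for `\t` in patterns) and a tab-separated input with five columns (topic, document, offset, length, label), as in the sample judgements listed in the comments at the end of this page. The function name `convert_2007_qrels` is hypothetical and only wraps the command above so it can read from standard input.

```shell
# Hypothetical wrapper around the 2007 conversion pipeline; reads stdin, writes stdout.
# Assumes GNU sed, which understands \t in the pattern and the replacement.
convert_2007_qrels() {
  grep -v "#" \
    | sed -e 's/\tRELEVANT/\t2/g' -e 's/\tNOT_RELEVANT/\t0/g' -e 's/\t/\t0\t/1' \
    | cut -f 1,2,3,6
}

# Two passage-level judgements for topic 200, document 10090921.
printf '200\t10090921\t10160\t2221\tRELEVANT\n200\t10090921\t12404\t720\tNOT_RELEVANT\n' \
  | convert_2007_qrels
# With GNU sed this prints (tab-separated):
# 200	0	10090921	2
# 200	0	10090921	0
```

The third sed expression replaces only the first tab on each line, which is what inserts the constant 0 as the second column; cut then keeps topic, 0, document and the numeric label.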

2006 qrels:

#download qrels to /shared/trecgenomics/qrels folder 

No Format
wget http://skynet.ohsu.edu/trec-gen/data/2006/topics/2006topics.txt


#convert qrels into the format trec_eval expects (insert 0 as the second column; replace NOT with 0, POSSIBLY with 1 and DEFINITELY with 2; drop columns 4, 5 and 6)


No Format
grep -v "#" /shared/trecgenomics/qrels/trec2006.raw.relevance.tsv.txt | sed -e 's/\tDEFINITELY/\t2/g' -e 's/\tPOSSIBLY/\t1/g' -e 's/\tNOT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,7 > trecgenomics-qrels-2006.txt 


***Problem with TREC Genomics qrels. 

...

The output file is trecgenomics-qrels-nondup-<year>.txt, saved in /shared/trecgenomics/qrels

...

No Format
cd ~/biocaddie/baselines/trecgenomics 
./<model>.sh <topic> <collection> <year> | parallel -j 20 bash -c "{}"

Eg:

./jm.sh orig combined 2006 | parallel -j 20 bash -c "{}"
./dir.sh orig combined 2006 | parallel -j 20 bash -c "{}"
./tfidf.sh orig combined 2006 | parallel -j 20 bash -c "{}"
./two.sh orig combined 2006 | parallel -j 20 bash -c "{}"
./okapi.sh orig combined 2006 | parallel -j 20 bash -c "{}"
./rm3.sh orig combined 2006 | parallel -j 20 bash -c "{}"
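
The six invocations above differ only in the model name, so they can be generated in a loop. The sketch below is a hedged convenience, not part of the original scripts: `baseline_commands` is a hypothetical helper that only prints the command lines, so the list can be reviewed before anything is run.

```shell
# Hypothetical helper: print one baseline pipeline per model for the given
# topic set, collection and year. Nothing is executed here.
baseline_commands() {
  topic="$1"; collection="$2"; year="$3"
  for model in jm dir tfidf two okapi rm3; do
    printf './%s.sh %s %s %s | parallel -j 20 bash -c "{}"\n' \
      "$model" "$topic" "$collection" "$year"
  done
}

baseline_commands orig combined 2006
# To actually run them, pipe the list to bash from ~/biocaddie/baselines/trecgenomics:
#   baseline_commands orig combined 2006 | bash
```

Printing first and piping to bash second keeps the sweep scripts untouched and makes it easy to drop a model from the list.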

...

/data/trecgenomics/output/<year>/tfidf/combined/orig 

/data/trecgenomics/output/<year>/dir/combined/orig 

/data/trecgenomics/output/<year>/okapi/combined/orig 

/data/trecgenomics/output/<year>/jm/combined/orig 

/data/trecgenomics/output/<year>/two/combined/orig 

/data/trecgenomics/output/<year>/rm3/combined/orig  

6. Cross-validation 

No Format
cd ~/biocaddie  
scripts/mkeval_trecgenomics.sh <model> <topics> <collection> <year>

Eg: scripts/mkeval_trecgenomics.sh tfidf orig combined 2006

...

No Format
cd ~/biocaddie   
Rscript scripts/compare_trecgenomics.R <collection> <from model> <to model> <topic> <year>

 Results (compared to tfidf baseline) 

2007 data

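Model    | MAP     | NDCG    | P@20    | NDCG@20 | P@100   | NDCG@100 | Notes                                 | Date
tfidf    | 0.2465  | 0.528   | 0.3361  | 0.4077  | 0.2081  | 0.3915   | Sweep b and k1                        | 06/23/17
Okapi    | 0.0666- | 0.2568- | 0.1389- | 0.1393- | 0.0953- | 0.1415-  | Sweep b, k1, k3                       | 06/23/17
QL (JM)  | 0.2136- | 0.4771- | 0.3403  | 0.3951  | 0.1847- | 0.3583-  | Sweep lambda                          | 06/23/17
QL (Dir) | 0.2176  | 0.4772- | 0.3514  | 0.4069  | 0.1881- | 0.3576   | Sweep mu                              | 06/23/17
QL (TS)  | 0.2379  | 0.5128  | 0.3569  | 0.437   | 0.1986  | 0.399    | Sweep mu and lambda                   | 06/23/17
RM3      | 0.2536  | 0.5252  | 0.3653  | 0.4218  | 0.2164  | 0.3874   | Sweep mu, fbDocs, fbTerms, and lambda | 06/23/17

A trailing "-" appears to mark a value that is significantly worse than the tfidf baseline.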

...

No Format
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf dir orig 2007
[1] "map 0.2465 0.2176 p= 0.9297" 
[1] "ndcg 0.528 0.4772 p= 0.9838" 
[1] "P_20 0.3361 0.3514 p= 0.2011" 
[1] "ndcg_cut_20 0.4077 0.4069 p= 0.5111" 
[1] "P_100 0.2081 0.1881 p= 0.9771" 
[1] "ndcg_cut_100 0.3915 0.3576 p= 0.885" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf two orig 2007
[1] "map 0.2465 0.2379 p= 0.7532" 
[1] "ndcg 0.528 0.5128 p= 0.8973" 
[1] "P_20 0.3361 0.3569 p= 0.1197" 
[1] "ndcg_cut_20 0.4077 0.437 p= 0.1039" 
[1] "P_100 0.2081 0.1986 p= 0.8416" 
[1] "ndcg_cut_100 0.3915 0.399 p= 0.3308" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf jm orig 2007
[1] "map 0.2465 0.2136 p= 0.996" 
[1] "ndcg 0.528 0.4771 p= 1" 
[1] "P_20 0.3361 0.3403 p= 0.4073" 
[1] "ndcg_cut_20 0.4077 0.3951 p= 0.7083" 
[1] "P_100 0.2081 0.1847 p= 0.9802" 
[1] "ndcg_cut_100 0.3915 0.3583 p= 0.9727" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf okapi orig 2007
[1] "map 0.2465 0.0666 p= 1" 
[1] "ndcg 0.528 0.2568 p= 1" 
[1] "P_20 0.3361 0.1389 p= 0.9999" 
[1] "ndcg_cut_20 0.4077 0.1393 p= 1" 
[1] "P_100 0.2081 0.0953 p= 0.9998" 
[1] "ndcg_cut_100 0.3915 0.1415 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf rm3 orig 2007
[1] "map 0.2465 0.2536 p= 0.3791"
[1] "ndcg 0.528 0.5252 p= 0.5405"
[1] "P_20 0.3361 0.3653 p= 0.1006"
[1] "ndcg_cut_20 0.4077 0.4218 p= 0.3338"
[1] "P_100 0.2081 0.2164 p= 0.2333"
[1] "ndcg_cut_100 0.3915 0.3874 p= 0.5519"

2006 data 

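Model    | MAP | NDCG | P@20 | NDCG@20 | P@100 | NDCG@100 | Notes                                 | Date
tfidf    |     |      |      |         |       |          | Sweep b and k1                        | 06/23/17
Okapi    |     |      |      |         |       |          | Sweep b, k1, k3                       | 06/23/17
QL (JM)  |     |      |      |         |       |          | Sweep lambda                          | 06/23/17
QL (Dir) |     |      |      |         |       |          | Sweep mu                              | 06/23/17
QL (TS)  |     |      |      |         |       |          | Sweep mu and lambda                   | 06/23/17
RM3      |     |      |      |         |       |          | Sweep mu, fbDocs, fbTerms, and lambda | 06/23/17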


 Comments:

    • The TREC Genomics collection is a full-text collection, unlike bioCaddie, which is a descriptive metadata collection. It consists of full-text HTML documents from 49 journals published via Highwire Press, so each document's text is much longer.
    • Topics in the TREC Genomics collection are common queries, quite similar to the original bioCaddie queries.
      Eg:
      <200>What serum [PROTEINS] change expression in association with high disease activity in lupus?
      <201>What [MUTATIONS] in the Raf gene are associated with cancer?
    • Relevance judgements are made per passage (RELEVANT or NOT_RELEVANT). A document can be divided into multiple passages of different lengths, each with its own judgement. In our baseline runs, however, we use one judgement for the whole document: if a document has one or more relevant passages, it is considered RELEVANT.
      Eg:
      200 10090921 10160 2221 RELEVANT
      200 10090921 12404 720 NOT_RELEVANT
      200 10090921 13147 1084 NOT_RELEVANT
      200 10090921 59180 515 RELEVANT
      200 10090921 101717 349 RELEVANT
      Document number 10090921 is considered RELEVANT as it has at least 1 RELEVANT passage.
    • Baseline run results (2007 data):
      Okapi performed significantly worse than tfidf on all metrics.
      The Query Likelihood baselines showed no significant improvement over tfidf, and several metrics were significantly worse.
      RM3 scored higher than tfidf on MAP, P@20, NDCG@20 and P@100, but none of the differences was significant.
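
The document-level rule from the relevance judgement comment above (a document counts as relevant if any of its passages is relevant) can be sketched with awk. This is an illustration under stated assumptions, not necessarily how trecgenomics-qrels-nondup-<year>.txt was produced: the input is assumed to be the converted, tab-separated qrels (topic, 0, document, label), `collapse_to_documents` is a hypothetical name, and document 99999999 is a made-up ID added only to show the all-zero case.

```shell
# Hypothetical helper: keep one line per (topic, document), carrying the
# highest label seen among that document's passages.
collapse_to_documents() {
  awk -F'\t' -v OFS='\t' '
    {
      key = $1 OFS $3
      rel = $4 + 0                     # force a numeric comparison
      if (!(key in best)) { keys[++n] = key; best[key] = rel }
      else if (rel > best[key]) { best[key] = rel }
    }
    END {
      # Emit pairs in first-seen order so the output is stable.
      for (i = 1; i <= n; i++) {
        split(keys[i], part, OFS)
        print part[1], 0, part[2], best[keys[i]]
      }
    }'
}

# Five passages of document 10090921 plus one made-up document with no relevant passage.
printf '200\t0\t10090921\t2\n200\t0\t10090921\t0\n200\t0\t10090921\t0\n200\t0\t10090921\t2\n200\t0\t10090921\t2\n200\t0\t99999999\t0\n' \
  | collapse_to_documents
# Prints (tab-separated):
# 200	0	10090921	2
# 200	0	99999999	0
```

Taking the maximum label per (topic, document) pair also removes duplicate pairs, so the result has one judgement per document for each topic.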