1. Data (/shared/trecgenomics/data) 

#download TREC Genomics data 

mkdir -p /shared/trecgenomics/data 
cd /shared/trecgenomics/data 
wget http://skynet.ohsu.edu/trec-gen/data/2006/documents/ajepidem.zip 
wget http://skynet.ohsu.edu/trec-gen/data/2006/documents/ajpcell.zip 
… 

(total 59 files to be downloaded) 
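The remaining downloads can be scripted instead of typed one by one. This is a minimal sketch, assuming a hypothetical file journals.txt that lists one archive base name per line (for example ajepidem, ajpcell); neither that file nor the build_urls helper is part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: turn a list of archive base names into download URLs.
# Assumption: journals.txt (hypothetical) holds one base name per line.
BASE_URL="http://skynet.ohsu.edu/trec-gen/data/2006/documents"

build_urls() {
  # Read base names from stdin, skip blank lines, print one URL per name.
  while IFS= read -r name || [ -n "$name" ]; do
    [ -n "$name" ] && printf '%s/%s.zip\n' "$BASE_URL" "$name"
  done
  return 0
}

# Usage (not executed here):
#   cd /shared/trecgenomics/data
#   build_urls < journals.txt | wget -nc -i -
```

wget -i - reads the URL list from standard input, and -nc skips archives that are already present, so the command can be rerun after an interrupted download.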

#unzip TREC Genomics data 

unzip '*.zip' 

#convert data into trec format 

cd ~/biocaddie/scripts    
./trecgenomics2trec.sh  

***#documents=162259 

Output: /shared/trecgenomics/data/trecText/trecgenomics_all.txt 

Also make a copy at /data/trecgenomics/data/ 
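The document count can be checked against the converted file. This is a minimal sketch, assuming trecgenomics2trec.sh writes standard trectext in which every document starts with a line beginning with the <DOC> tag; count_docs is a hypothetical helper, not part of the biocaddie scripts.

```shell
#!/usr/bin/env bash
# Sketch: count documents in a trectext file.
# Assumption: each document starts with a line beginning with "<DOC>".
count_docs() {
  # "<DOC>" does not match "<DOCNO>", so only opening document tags are counted.
  grep -c '^<DOC>' "$1"
}

# Usage (not executed here):
#   count_docs /shared/trecgenomics/data/trecText/trecgenomics_all.txt
```

If the assumption holds, the printed number should match the 162259 documents noted above.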

2. Indexes (/shared/trecgenomics/indexes/trecgenomics_all) 

Index param file: ~/biocaddie/index/build_index.trecgenomics.params 

#Content 

<parameters>
  <index>/shared/trecgenomics/indexes/trecgenomics_all</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/shared/trecgenomics/data/trecText/trecgenomics_all.txt</path>
    <class>trectext</class>
  </corpus>
</parameters>

#Build index 

mkdir -p /shared/trecgenomics/indexes/   
cd ~/biocaddie   
IndriBuildIndex index/build_index.trecgenomics.params   

Output is saved at /shared/trecgenomics/indexes/trecgenomics_all 

Also make a copy at /data/trecgenomics/indexes/trecgenomics_all 

3. Queries 

#download topics to /shared/trecgenomics/queries folder 

wget http://skynet.ohsu.edu/trec-gen/data/2007/2007topics.txt 

#convert query into trec format (use trecgentopics2trec.sh to create queries.combined.orig) 

cd ~/biocaddie   
scripts/trecgentopics2trec.sh 

Output is saved at /shared/trecgenomics/queries 

Also make a copy of the query at /data/trecgenomics/queries 

4. Qrels 

#download qrels to /shared/trecgenomics/qrels folder 

wget http://skynet.ohsu.edu/trec-gen/data/2007/trecgen2007.all.judgments.tsv.txt 

#convert qrels into correct format for trec_eval (add in 0 in second column, replace NOT_RELEVANT with 0 and RELEVANT with 2, remove columns 4 and 5) 

grep -v "#" /shared/trecgenomics/qrels/trecgen2007.all.judgments.tsv.txt | sed -e 's/\tRELEVANT/\t2/g' -e 's/\tNOT_RELEVANT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,6 > trecgenomics-qrels.txt  

 ***Problem with TREC Genomics qrels. 

The qrels generated above contain duplicate entries: a document can have several judgments for the same query (RELEVANT or NOT_RELEVANT), one for each judged maximum-length span of the document. 

Eg: In trecgen2007.all.judgments.tsv.txt file: 

200 9063387 2059 1870 NOT_RELEVANT 
200 9063387 7300 1702 RELEVANT 
200 9063387 58122 4989 NOT_RELEVANT 
200 9063387 82135 1426 RELEVANT 
200 9063387 83588 3235 RELEVANT 
200 9063387 97901 27036 NOT_RELEVANT 

In trecgenomics-qrels.txt: 

root@integration-1:/data/trecgenomics/qrels# grep 9063387  trecgenomics-qrels.txt 
200     0       9063387 0 
200     0       9063387 2 
200     0       9063387 0 
200     0       9063387 2 
200     0       9063387 2 
200     0       9063387 0 

To fix this problem, run trecgenqrels.R (in ~/biocaddie/scripts) with Rscript. The script groups the judgments by query and document number and sums the relevance values. If the sum is 0, every judged span was NOT_RELEVANT, so the document keeps relevance 0. If the sum is >= 2, at least one span was judged RELEVANT (possibly alongside NOT_RELEVANT spans), so the document is assigned relevance 2.   

Output file is trecgenomics-qrels-nondup.txt and saved at /shared/trecgenomics/qrels 

Also make a copy of the qrels at /data/trecgenomics/qrels 
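The grouping rule described above can be sketched in awk for a quick cross-check. This is not the trecgenqrels.R script; it is an equivalent illustration under the assumption that the input has four whitespace-separated columns (query, 0, document number, relevance). The dedup_qrels name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: collapse duplicate (query, document) judgments in a qrels file.
# Assumption: columns are query, 0, docno, relevance (0 or 2).
dedup_qrels() {
  awk -v OFS='\t' '
    {
      key = $1 SUBSEP $3                   # group by query and document number
      if (!(key in sum)) order[++n] = key  # remember first-seen order
      sum[key] += $4                       # sum the relevance values
    }
    END {
      for (i = 1; i <= n; i++) {
        split(order[i], k, SUBSEP)
        # sum >= 2 means at least one RELEVANT span; otherwise keep 0
        print k[1], 0, k[2], (sum[order[i]] >= 2 ? 2 : 0)
      }
    }
  ' "$@"
}

# Usage (not executed here):
#   dedup_qrels trecgenomics-qrels.txt > trecgenomics-qrels-nondup.txt
```

For the six lines shown for document 9063387 the sum is 6, so the sketch emits a single line with relevance 2.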

5. IndriRunQuery - Output  

cd ~/biocaddie/baselines/trecgenomics 
./<model>.sh <topic> <collection> | parallel -j 20 bash -c "{}"  

Eg: 

./jm.sh orig combined | parallel -j 20 bash -c "{}"  
./dir.sh orig combined | parallel -j 20 bash -c "{}"  
./tfidf.sh orig combined | parallel -j 20 bash -c "{}"  
./two.sh orig combined | parallel -j 20 bash -c "{}"  
./okapi.sh orig combined | parallel -j 20 bash -c "{}"  
./rm3.sh orig combined | parallel -j 20 bash -c "{}"  

IndriRunQuery outputs for different baselines are stored at: 

/data/trecgenomics/output/tfidf/combined/orig 

/data/trecgenomics/output/dir/combined/orig 

/data/trecgenomics/output/okapi/combined/orig 

/data/trecgenomics/output/jm/combined/orig 

/data/trecgenomics/output/two/combined/orig 

/data/trecgenomics/output/rm3/combined/orig  
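The commands in this section pipe a script's output into parallel, which suggests that each <model>.sh prints one IndriRunQuery command per parameter setting and that parallel runs those lines 20 at a time. The sketch below illustrates that pattern for Dirichlet smoothing. It is not the content of dir.sh: the gen_dir_cmds helper, the mu values, the output file names, and the assumption that the query file is an Indri query parameter file are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print one IndriRunQuery command per mu value (nothing is executed).
# Assumptions: helper name, mu values, and output file naming are hypothetical.
gen_dir_cmds() {
  local queries="$1" index="$2" outdir="$3"
  shift 3
  local mu
  for mu in "$@"; do
    # Each printed line is a complete shell command for "parallel" to run.
    printf 'IndriRunQuery %s -index=%s -rule=method:dirichlet,mu:%s -trecFormat=true > %s/mu=%s.out\n' \
      "$queries" "$index" "$mu" "$outdir" "$mu"
  done
}

# Usage (not executed here):
#   gen_dir_cmds /shared/trecgenomics/queries/queries.combined.orig \
#     /shared/trecgenomics/indexes/trecgenomics_all \
#     /data/trecgenomics/output/dir/combined/orig 500 1000 2500 \
#     | parallel -j 20 bash -c "{}"
```

Printing the commands rather than running them keeps the sweep easy to inspect first: drop the pipe to parallel to see exactly what would be executed.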

6. Cross-validation 

cd ~/biocaddie  
scripts/mkeval_trecgenomics.sh <model> <topics> <collection> 

Eg: scripts/mkeval_trecgenomics.sh tfidf orig combined 

7. Compare models 

cd ~/biocaddie   
Rscript scripts/compare_trecgenomics.R <collection> <from model> <to model> <topic> 

 Results (compared to tfidf baseline) 

Model     MAP      NDCG     P@20     NDCG@20  P@100    NDCG@100  Notes                                  Date
tfidf     0.2465   0.528    0.3361   0.4077   0.2081   0.3915    Sweep b and k1                         06/23/17
Okapi     0.0666-  0.2568-  0.1389-  0.1393-  0.0953-  0.1415-   Sweep b, k1, k3                        06/23/17
QL (JM)   0.2136-  0.4771-  0.3403   0.3951   0.1847-  0.3583-   Sweep lambda                           06/23/17
QL (Dir)  0.2176   0.4772-  0.3514   0.4069   0.1881-  0.3576    Sweep mu                               06/23/17
QL (TS)   0.2379   0.5128   0.3569   0.437    0.1986   0.399     Sweep mu and lambda                    06/23/17
RM3       0.2536   0.5252   0.3653   0.4218   0.2164   0.3874    Sweep mu, fbDocs, fbTerms, and lambda  06/23/17

A trailing "-" appears to mark a score that is significantly worse than the tfidf baseline (p > 0.95 in the comparison output below).


root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf dir orig 
[1] "map 0.2465 0.2176 p= 0.9297" 
[1] "ndcg 0.528 0.4772 p= 0.9838" 
[1] "P_20 0.3361 0.3514 p= 0.2011" 
[1] "ndcg_cut_20 0.4077 0.4069 p= 0.5111" 
[1] "P_100 0.2081 0.1881 p= 0.9771" 
[1] "ndcg_cut_100 0.3915 0.3576 p= 0.885" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf two orig 
[1] "map 0.2465 0.2379 p= 0.7532" 
[1] "ndcg 0.528 0.5128 p= 0.8973" 
[1] "P_20 0.3361 0.3569 p= 0.1197" 
[1] "ndcg_cut_20 0.4077 0.437 p= 0.1039" 
[1] "P_100 0.2081 0.1986 p= 0.8416" 
[1] "ndcg_cut_100 0.3915 0.399 p= 0.3308" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf jm orig 
[1] "map 0.2465 0.2136 p= 0.996" 
[1] "ndcg 0.528 0.4771 p= 1" 
[1] "P_20 0.3361 0.3403 p= 0.4073" 
[1] "ndcg_cut_20 0.4077 0.3951 p= 0.7083" 
[1] "P_100 0.2081 0.1847 p= 0.9802" 
[1] "ndcg_cut_100 0.3915 0.3583 p= 0.9727" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf okapi orig 
[1] "map 0.2465 0.0666 p= 1" 
[1] "ndcg 0.528 0.2568 p= 1" 
[1] "P_20 0.3361 0.1389 p= 0.9999" 
[1] "ndcg_cut_20 0.4077 0.1393 p= 1" 
[1] "P_100 0.2081 0.0953 p= 0.9998" 
[1] "ndcg_cut_100 0.3915 0.1415 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf rm3 orig
[1] "map 0.2465 0.2536 p= 0.3791"
[1] "ndcg 0.528 0.5252 p= 0.5405"
[1] "P_20 0.3361 0.3653 p= 0.1006"
[1] "ndcg_cut_20 0.4077 0.4218 p= 0.3338"
[1] "P_100 0.2081 0.2164 p= 0.2333"
[1] "ndcg_cut_100 0.3915 0.3874 p= 0.5519"

 
 Comments:

    • The TREC Genomics collection is a full-text collection, unlike bioCaddie, which is a descriptive-metadata collection. It consists of full-text HTML documents from 49 journals published via Highwire Press, so each document is much longer.
    • The topics used in the TREC Genomics collection are common queries and are quite similar to the original bioCaddie queries.
      Eg:
      <200>What serum [PROTEINS] change expression in association with high disease activity in lupus?
      <201>What [MUTATIONS] in the Raf gene are associated with cancer?
    • The relevance judgments are made per passage (RELEVANT or NOT_RELEVANT). A document can be divided into multiple passages of different lengths, each with its own judgment. In our baseline runs, however, we use a judgment for the whole document: if a document has one or more relevant passages, it is considered RELEVANT.
      Eg:
      200 10090921 10160 2221 RELEVANT
      200 10090921 12404 720 NOT_RELEVANT
      200 10090921 13147 1084 NOT_RELEVANT
      200 10090921 59180 515 RELEVANT
      200 10090921 101717 349 RELEVANT
      Document number 10090921 is considered RELEVANT as it has at least 1 RELEVANT passage.
    • Baseline run results:
      Okapi performed significantly worse than tfidf on all metrics.
      The Query Likelihood baselines showed no significant improvement over tfidf (a few metrics were even worse).
      RM3 scored higher than tfidf on MAP, P@20, NDCG@20, and P@100, but none of the differences were significant.

