1. Data (/shared/trecgenomics/data)
#download TREC Genomics data
mkdir -p /shared/trecgenomics/data
cd /shared/trecgenomics/data
wget http://skynet.ohsu.edu/trec-gen/data/2006/documents/ajepidem.zip
wget http://skynet.ohsu.edu/trec-gen/data/2006/documents/ajpcell.zip
…
(total 59 files to be downloaded)
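Since every archive shares the same URL prefix, the downloads can be scripted instead of typed one by one. A minimal sketch, assuming a hypothetical file `journals.txt` that lists one archive base name per line (for example `ajepidem`, `ajpcell`); both that file and the `trecgen_urls` helper are illustrative and not part of the biocaddie scripts:

```shell
# Base URL taken from the wget commands above.
BASE_URL="http://skynet.ohsu.edu/trec-gen/data/2006/documents"

# Hypothetical helper: read archive base names on stdin, print one URL per line.
trecgen_urls() {
    while IFS= read -r name; do
        # Skip blank lines so they do not produce a malformed URL.
        [ -n "$name" ] && printf '%s/%s.zip\n' "$BASE_URL" "$name"
    done
    return 0
}

# Dry run: print the URLs for the two archives named above.
printf 'ajepidem\najpcell\n' | trecgen_urls
```

To actually download, the URL list can be handed to wget on standard input: `trecgen_urls < journals.txt | wget -i -`.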
#unzip TREC Genomics data
unzip '*.zip'
#convert data into trec format
cd ~/biocaddie/scripts
./trecgenomics2trec.sh
Number of documents: 162259
Output: /shared/trecgenomics/data/trecText/trecgenomics_all.txt
Also make a copy at /data/trecgenomics/data/
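A quick sanity check on the converted file is to count its documents and compare the result with the 162259 reported above. A sketch, assuming the conversion script starts each document with an opening `<DOC>` tag at the beginning of a line, as is conventional for trectext; `count_docs` is an illustrative helper name:

```shell
# Hypothetical helper: count lines that begin with an opening <DOC> tag.
# The pattern includes ">" so that <DOCNO> lines are not counted.
count_docs() {
    grep -c '^<DOC>' "$1"
}

# Tiny hand-made sample in trectext layout (not real TREC Genomics data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<DOC>
<DOCNO>1</DOCNO>
<TEXT>
first document
</TEXT>
</DOC>
<DOC>
<DOCNO>2</DOCNO>
<TEXT>
second document
</TEXT>
</DOC>
EOF
count_docs "$sample"
rm -f "$sample"
```

On the real collection this would be run as `count_docs /shared/trecgenomics/data/trecText/trecgenomics_all.txt`.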
2. Indexes (/shared/trecgenomics/indexes/trecgenomics_all)
Index param file: ~/biocaddie/index/build_index.trecgenomics.params
#Content
<parameters>
  <index>/shared/trecgenomics/indexes/trecgenomics_all</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/shared/trecgenomics/data/trecText/trecgenomics_all.txt</path>
    <class>trectext</class>
  </corpus>
</parameters>
#Build index
mkdir -p /shared/trecgenomics/indexes/
cd ~/biocaddie
IndriBuildIndex index/build_index.trecgenomics.params
Output is saved at /shared/trecgenomics/indexes/trecgenomics_all
Also make a copy at /data/trecgenomics/indexes/trecgenomics_all
3. Queries
#download topics to /shared/trecgenomics/queries folder
wget http://skynet.ohsu.edu/trec-gen/data/2007/2007topics.txt
#convert query into trec format (use trecgentopics2trec.sh to create queries.combined.orig)
cd ~/biocaddie
scripts/trecgentopics2trec.sh
Output is saved at /shared/trecgenomics/queries
Also make a copy of the query at /data/trecgenomics/queries
4. Qrels
#download qrels to /shared/trecgenomics/qrels folder
wget http://skynet.ohsu.edu/trec-gen/data/2007/trecgen2007.all.judgments.tsv.txt
#convert qrels into correct format for trec_eval (add in 0 in second column, replace NOT_RELEVANT with 0 and RELEVANT with 2, remove columns 4 and 5)
grep -v "#" /shared/trecgenomics/qrels/trecgen2007.all.judgments.tsv.txt | sed -e 's/\tRELEVANT/\t2/g' -e 's/\tNOT_RELEVANT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,6 > trecgenomics-qrels.txt
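The `\t` escape inside `sed` patterns is a GNU extension, so the pipeline above may not behave the same with other `sed` implementations. A roughly equivalent sketch in `awk`, assuming the judgments file has five tab-separated columns (topic, document ID, two span columns, judgment) as in the example rows from trecgen2007.all.judgments.tsv.txt; `convert_qrels` is an illustrative name, and unlike the original pipeline it skips lines with fewer than five fields:

```shell
# Hypothetical helper: convert TREC Genomics judgments (stdin) to trec_eval qrels (stdout).
convert_qrels() {
    awk -F '\t' -v OFS='\t' '
        /#/    { next }                    # drop comment lines, like grep -v "#"
        NF < 5 { next }                    # skip blank or malformed lines
        {
            rel = $5
            if (rel == "RELEVANT")     rel = 2
            if (rel == "NOT_RELEVANT") rel = 0
            print $1, 0, $2, rel           # topic, constant 0, document ID, relevance
        }'
}

# Two of the example rows for topic 200, document 9063387.
printf '200\t9063387\t2059\t1870\tNOT_RELEVANT\n200\t9063387\t7300\t1702\tRELEVANT\n' | convert_qrels
```

Usage on the downloaded file would look like `convert_qrels < /shared/trecgenomics/qrels/trecgen2007.all.judgments.tsv.txt > trecgenomics-qrels.txt`.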
Problem with TREC Genomics qrels
The relevance judgments generated above contain duplicates: a document can have several judgments for the same query (RELEVANT and NOT_RELEVANT), because judgments are recorded per maximum-length span within the document.
E.g., in trecgen2007.all.judgments.tsv.txt:
200 9063387 2059 1870 NOT_RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 58122 4989 NOT_RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063387 97901 27036 NOT_RELEVANT
In trecgenomics-qrels.txt:
root@integration-1:/data/trecgenomics/qrels# grep 9063387 trecgenomics-qrels.txt
200 0 9063387 0
200 0 9063387 2
200 0 9063387 0
200 0 9063387 2
200 0 9063387 2
200 0 9063387 0
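To see how widespread the duplication is, one can list every query/document pair that occurs on more than one qrels line. A sketch using standard tools, assuming the qrels file is tab-separated; `dup_pairs` is an illustrative name and the second pair in the sample is made up:

```shell
# Hypothetical helper: list each query/document pair that appears on more than
# one line of a tab-separated qrels file (columns: query, 0, docno, relevance).
dup_pairs() {
    cut -f 1,3 "$1" | sort | uniq -d
}

# Small sample: two rows for the document shown above, plus one made-up,
# non-duplicated pair that should not be reported.
qrels=$(mktemp)
printf '200\t0\t9063387\t0\n200\t0\t9063387\t2\n201\t0\t1234\t2\n' > "$qrels"
dup_pairs "$qrels"
rm -f "$qrels"
```

Piping the result through `wc -l`, as in `dup_pairs trecgenomics-qrels.txt | wc -l`, gives the number of affected pairs.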
To fix this problem, run trecgenqrels.R (in ~/biocaddie/scripts) with Rscript. The script groups the judgments by query and document number and sums the relevance values. If the sum is 0, the document is non-relevant and its relevance stays 0. If the sum is 2 or more, the document has at least one RELEVANT judgment (possibly mixed with NOT_RELEVANT ones), so its relevance is set to 2.
The output file, trecgenomics-qrels-nondup.txt, is saved at /shared/trecgenomics/qrels
Also make a copy of the qrels at /data/trecgenomics/qrels
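The grouping rule described above (sum the relevance per query and document, then map any sum of 2 or more to 2) can be illustrated without R. This is a sketch of the rule only, not the actual trecgenqrels.R; `dedup_qrels` is a hypothetical name, the input is assumed to be tab-separated, and the rows other than document 9063387 are made up:

```shell
# Hypothetical helper: collapse a tab-separated qrels file ($1) to one line
# per (query, docno) pair, following the sum rule described above.
dedup_qrels() {
    awk -F '\t' -v OFS='\t' '
        { sum[$1 OFS $3] += $4 }           # add up relevance per (query, docno)
        END {
            for (key in sum) {
                split(key, part, OFS)
                rel = (sum[key] >= 2) ? 2 : 0
                print part[1], 0, part[2], rel
            }
        }' "$1" | sort -t "$(printf '\t')" -k1,1n -k3,3
    # The sort makes the output order deterministic; awk iterates keys in
    # an unspecified order.
}

# Sample: mixed judgments for one document, plus an all-zero document.
qrels=$(mktemp)
printf '200\t0\t9063387\t0\n200\t0\t9063387\t2\n200\t0\t9063387\t2\n200\t0\t1111\t0\n200\t0\t1111\t0\n' > "$qrels"
dedup_qrels "$qrels"
rm -f "$qrels"
```

Summing and thresholding gives the same result as taking the maximum judgment per pair, since the only values in the file are 0 and 2.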
5. IndriRunQuery - Output
cd ~/biocaddie/baselines/trecgenomics
./<model>.sh <topic> <collection> | parallel -j 20 bash -c "{}"
Eg:
./jm.sh orig combined| parallel -j 20 bash -c "{}"
./dir.sh orig combined| parallel -j 20 bash -c "{}"
./tfidf.sh orig combined| parallel -j 20 bash -c "{}"
./two.sh orig combined| parallel -j 20 bash -c "{}"
./okapi.sh orig combined| parallel -j 20 bash -c "{}"
./rm3.sh orig combined| parallel -j 20 bash -c "{}"
IndriRunQuery outputs for different baselines are stored at:
/data/trecgenomics/output/tfidf/combined/orig
/data/trecgenomics/output/dir/combined/orig
/data/trecgenomics/output/okapi/combined/orig
/data/trecgenomics/output/jm/combined/orig
/data/trecgenomics/output/two/combined/orig
/data/trecgenomics/output/rm3/combined/orig
6. Cross-validation
cd ~/biocaddie
scripts/mkeval_trecgenomics.sh <model> <topics> <collection>
Eg: scripts/mkeval_trecgenomics.sh tfidf orig combined
7. Compare models
cd ~/biocaddie
Rscript scripts/compare_trecgenomics.R <collection> <from model> <to model> <topic>
Results (compared to tfidf baseline)
Model | MAP | NDCG | P@20 | NDCG@20 | P@100 | NDCG@100 | Notes | Date |
---|---|---|---|---|---|---|---|---|
tfidf | 0.2465 | 0.528 | 0.3361 | 0.4077 | 0.2081 | 0.3915 | Sweep b and k1 | 06/23/17 |
Okapi | 0.0666- | 0.2568- | 0.1389- | 0.1393- | 0.0953- | 0.1415- | Sweep b, k1, k3 | 06/23/17 |
QL (JM) | 0.2136 | 0.4771 | 0.3403 | 0.3951 | 0.1847 | 0.3583 | Sweep lambda | 06/23/17 |
QL (Dir) | 0.2176 | 0.4772- | 0.3514 | 0.4069 | 0.1881- | 0.3576 | Sweep mu | 06/23/17 |
QL (TS) | 0.2379 | 0.5128 | 0.3569 | 0.437 | 0.1986 | 0.399 | Sweep mu and lambda | 06/23/17 |
RM3 | 0.2536 | 0.5252 | 0.3653 | 0.4218 | 0.2164 | 0.3874 | Sweep mu, fbDocs, fbTerms, and lambda | 06/23/17 |
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf dir orig
[1] "map 0.2465 0.2176 p= 0.9297"
[1] "ndcg 0.528 0.4772 p= 0.9838"
[1] "P_20 0.3361 0.3514 p= 0.2011"
[1] "ndcg_cut_20 0.4077 0.4069 p= 0.5111"
[1] "P_100 0.2081 0.1881 p= 0.9771"
[1] "ndcg_cut_100 0.3915 0.3576 p= 0.885"

root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf two orig
[1] "map 0.2465 0.2379 p= 0.7532"
[1] "ndcg 0.528 0.5128 p= 0.8973"
[1] "P_20 0.3361 0.3569 p= 0.1197"
[1] "ndcg_cut_20 0.4077 0.437 p= 0.1039"
[1] "P_100 0.2081 0.1986 p= 0.8416"
[1] "ndcg_cut_100 0.3915 0.399 p= 0.3308"

root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf jm orig
[1] "map 0.2465 0.2136 p= 0.996"
[1] "ndcg 0.528 0.4771 p= 1"
[1] "P_20 0.3361 0.3403 p= 0.4073"
[1] "ndcg_cut_20 0.4077 0.3951 p= 0.7083"
[1] "P_100 0.2081 0.1847 p= 0.9802"
[1] "ndcg_cut_100 0.3915 0.3583 p= 0.9727"

root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf okapi orig
[1] "map 0.2465 0.0666 p= 1"
[1] "ndcg 0.528 0.2568 p= 1"
[1] "P_20 0.3361 0.1389 p= 0.9999"
[1] "ndcg_cut_20 0.4077 0.1393 p= 1"
[1] "P_100 0.2081 0.0953 p= 0.9998"
[1] "ndcg_cut_100 0.3915 0.1415 p= 1"