
...

No Format
cd ~/biocaddie
scripts/trecgentopics2trec.sh #for 2007 queries
scripts/trecgentopics2trec2006.sh #for 2006 queries

Output is saved at /shared/trecgenomics/queries

...

No Format

grep -v "#" /shared/trecgenomics/qrels/trec2006.raw.relevance.tsv.txt | sed -e 's/\tDEFINITELY/\t2/g' -e 's/\tPOSSIBLY/\t1/g' -e 's/\tNOT/\t0/g' -e 's/\t/\t0\t/1' | cut -f 1,2,3,7 > trecgenomics-qrels-2006.txt

***Problem with TREC Genomics qrels.

The relevance judgements generated above contain duplicates: a document can have multiple judgements for the same query (RELEVANT/NOT_RELEVANT), because judgements are recorded per passage (maximum-length span) rather than per document.
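One way to remove the duplicates is to keep a single judgement per (topic, document) pair, taking the highest relevance value seen. The sketch below is illustrative only and is not part of the biocaddie scripts: the function name `collapse_qrels` is hypothetical, and it assumes the generated file has four whitespace-separated columns (topic, iteration, document ID, numeric relevance).

```python
def collapse_qrels(lines):
    """Keep one judgement per (topic, doc): the highest relevance seen.

    Assumes each line is: topic, iteration, doc_id, relevance (integer).
    """
    best = {}  # (topic, doc_id) -> (iteration, relevance); keeps first-seen order
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, iteration, doc_id, rel = parts
        key = (topic, doc_id)
        rel = int(rel)
        if key not in best or rel > best[key][1]:
            best[key] = (iteration, rel)
    return [f"{t}\t{it}\t{d}\t{r}" for (t, d), (it, r) in best.items()]


if __name__ == "__main__":
    sample = [
        "200\t0\t10090921\t1",
        "200\t0\t10090921\t0",
        "200\t0\t10090921\t0",
    ]
    for row in collapse_qrels(sample):
        print(row)
```

Taking the maximum means a document with at least one relevant passage stays relevant, which matches the document-level rule used for the baseline runs.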

...

No Format

root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf dir orig 2007
[1] "map 0.2465 0.2176 p= 0.9297" 
[1] "ndcg 0.528 0.4772 p= 0.9838" 
[1] "P_20 0.3361 0.3514 p= 0.2011" 
[1] "ndcg_cut_20 0.4077 0.4069 p= 0.5111" 
[1] "P_100 0.2081 0.1881 p= 0.9771" 
[1] "ndcg_cut_100 0.3915 0.3576 p= 0.885" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf two orig 2007
[1] "map 0.2465 0.2379 p= 0.7532" 
[1] "ndcg 0.528 0.5128 p= 0.8973" 
[1] "P_20 0.3361 0.3569 p= 0.1197" 
[1] "ndcg_cut_20 0.4077 0.437 p= 0.1039" 
[1] "P_100 0.2081 0.1986 p= 0.8416" 
[1] "ndcg_cut_100 0.3915 0.399 p= 0.3308" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf jm orig 2007
[1] "map 0.2465 0.2136 p= 0.996" 
[1] "ndcg 0.528 0.4771 p= 1" 
[1] "P_20 0.3361 0.3403 p= 0.4073" 
[1] "ndcg_cut_20 0.4077 0.3951 p= 0.7083" 
[1] "P_100 0.2081 0.1847 p= 0.9802" 
[1] "ndcg_cut_100 0.3915 0.3583 p= 0.9727" 
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf okapi orig 2007
[1] "map 0.2465 0.0666 p= 1" 
[1] "ndcg 0.528 0.2568 p= 1" 
[1] "P_20 0.3361 0.1389 p= 0.9999" 
[1] "ndcg_cut_20 0.4077 0.1393 p= 1" 
[1] "P_100 0.2081 0.0953 p= 0.9998" 
[1] "ndcg_cut_100 0.3915 0.1415 p= 1"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf rm3 orig 2007
[1] "map 0.2465 0.2536 p= 0.3791"
[1] "ndcg 0.528 0.5252 p= 0.5405"
[1] "P_20 0.3361 0.3653 p= 0.1006"
[1] "ndcg_cut_20 0.4077 0.4218 p= 0.3338"
[1] "P_100 0.2081 0.2164 p= 0.2333"
[1] "ndcg_cut_100 0.3915 0.3874 p= 0.5519"

2006 data

Model	MAP	NDCG	P@20	NDCG@20	P@100	NDCG@100	Notes	Date
tfidf	0.2714	0.4897	0.3375	0.3922	0.1725	0.3884	Sweep b and k1	06/27/17
Okapi	0.2372	0.3963-	0.2393-	0.3176-	0.1343-	0.3216	Sweep b, k1, k3	06/27/17
QL (JM)	0.2774	0.5003	0.3268	0.4354+	0.1764	0.4307+	Sweep lambda	06/27/17
QL (Dir)	0.2939	0.5196	0.3375	0.4524+	0.1836	0.4554+	Sweep mu	06/27/17
QL (TS)	0.2884	0.5153	0.3268	0.4509+	0.1861	0.4479+	Sweep mu and lambda	06/27/17
RM3	0.3415+	0.5093	0.3768	0.4521+	0.2168+	0.448+	Sweep mu, fbDocs, fbTerms, and lambda	06/27/17

No Format

root@integration-1:~/biocaddie#  Rscript scripts/compare_trecgenomics.R combined tfidf dir orig 2006
[1] "map 0.2714 0.2939 p= 0.1046"
[1] "ndcg 0.4897 0.5196 p= 0.0642"
[1] "P_20 0.3375 0.3375 p= 0.5"
[1] "ndcg_cut_20 0.3922 0.4524 p= 0.006"
[1] "P_100 0.1725 0.1836 p= 0.1524"
[1] "ndcg_cut_100 0.3884 0.4554 p= 0.0023"
root@integration-1:~/biocaddie#  Rscript scripts/compare_trecgenomics.R combined tfidf jm orig 2006
[1] "map 0.2714 0.2774 p= 0.3499"
[1] "ndcg 0.4897 0.5003 p= 0.2605"
[1] "P_20 0.3375 0.3268 p= 0.6859"
[1] "ndcg_cut_20 0.3922 0.4354 p= 0.0334"
[1] "P_100 0.1725 0.1764 p= 0.3434"
[1] "ndcg_cut_100 0.3884 0.4307 p= 0.0157"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf two orig 2006
[1] "map 0.2714 0.2884 p= 0.1384"
[1] "ndcg 0.4897 0.5153 p= 0.0823"
[1] "P_20 0.3375 0.3268 p= 0.6934"
[1] "ndcg_cut_20 0.3922 0.4509 p= 0.0062"
[1] "P_100 0.1725 0.1861 p= 0.1425"
[1] "ndcg_cut_100 0.3884 0.4479 p= 0.0032"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf okapi orig 2006
[1] "map 0.2714 0.2372 p= 0.8888"
[1] "ndcg 0.4897 0.3963 p= 0.9909"
[1] "P_20 0.3375 0.2393 p= 0.9909"
[1] "ndcg_cut_20 0.3922 0.3176 p= 0.953"
[1] "P_100 0.1725 0.1343 p= 0.9559"
[1] "ndcg_cut_100 0.3884 0.3216 p= 0.9465"
root@integration-1:~/biocaddie# Rscript scripts/compare_trecgenomics.R combined tfidf rm3 orig 2006
[1] "map 0.2714 0.3415 p= 0.0136"
[1] "ndcg 0.4897 0.5093 p= 0.2485"
[1] "P_20 0.3375 0.3768 p= 0.0741"
[1] "ndcg_cut_20 0.3922 0.4521 p= 0.0199"
[1] "P_100 0.1725 0.2168 p= 0.0189"
[1] "ndcg_cut_100 0.3884 0.448 p= 0.0206"
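The "+" and "-" suffixes in the 2006 table can be reproduced from this output. The sketch below is a hypothetical helper, not part of `compare_trecgenomics.R`. It assumes the printed p-value is one-sided (small when the second model beats tfidf) and that a cell is marked "+" when p < 0.05 and "-" when p > 0.95; both assumptions are inferred from the table values, not confirmed by the script.

```python
import re

# Matches lines such as: [1] "map 0.2714 0.3415 p= 0.0136"
LINE_RE = re.compile(r'"(\S+)\s+([\d.]+)\s+([\d.]+)\s+p=\s*([\d.]+)\s*"')


def marker(p, alpha=0.05):
    """Return '+', '-' or '' for a one-sided p-value (assumed convention)."""
    if p < alpha:
        return "+"
    if p > 1 - alpha:
        return "-"
    return ""


def annotate(lines, alpha=0.05):
    """Map each metric to the compared model's score plus its marker."""
    cells = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # shell prompts and other non-result lines
        metric, _baseline, score, p = m.groups()
        cells[metric] = score + marker(float(p), alpha)
    return cells


if __name__ == "__main__":
    rm3_2006 = [
        '[1] "map 0.2714 0.3415 p= 0.0136"',
        '[1] "ndcg 0.4897 0.5093 p= 0.2485"',
    ]
    print(annotate(rm3_2006))
```

The score is kept as the original string so that values such as 0.448 are not reformatted.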

Comments:

The TREC Genomics collection is a full-text collection, unlike bioCaddie, which is a collection of descriptive metadata. It consists of full-text HTML documents from 49 journals published via Highwire Press, so each document's text is much longer.
The topics used in TREC Genomics are common queries, quite similar to the original bioCaddie queries.
E.g.:
<200>What serum [PROTEINS] change expression in association with high disease activity in lupus?
<201>What [MUTATIONS] in the Raf gene are associated with cancer?
The relevance judgements are made per passage (RELEVANT or NOT_RELEVANT). A document can be divided into multiple passages of different lengths, and each passage can have a different judgement. In our baseline runs, however, we use a judgement for the whole document: if a document has one or more relevant passages, it is considered RELEVANT.
E.g.:
200 10090921 10160 2221 RELEVANT
200 10090921 12404 720 NOT_RELEVANT
200 10090921 13147 1084 NOT_RELEVANT
200 10090921 59180 515 RELEVANT
200 10090921 101717 349 RELEVANT
Document number 10090921 is considered RELEVANT as it has at least 1 RELEVANT passage.
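Baseline run results:
Okapi performed worse than tfidf on every metric in both years. The difference is significant on all metrics for the 2007 queries and on NDCG, P@20, NDCG@20, and P@100 for the 2006 queries.
The Query Likelihood baselines showed no significant improvement over tfidf on the 2007 queries (a few metrics were even significantly worse). On the 2006 queries they improved significantly on NDCG@20 and NDCG@100 only.
RM3 yielded slightly better results on most metrics for the 2007 queries, but none of the improvements was significant. On the 2006 queries its improvements in MAP, NDCG@20, P@100, and NDCG@100 were significant.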
