Run Lucene baselines

1. Variables and notations

Term	Meaning	Variable in script	Scripts
collection	name of the collection/dataset (biocaddie, ohsumed, treccds, trecgenomics)	col	all
subset	set of data used for running baselines (combined, test, train)	subset	all
topics	name of the topics file (short, orig, stopped, etc)	topics	all
year	year of the collection/dataset; applies only to collections with multiple years, e.g. treccds (2015), trecgenomics (2006, 2007)	year	all
model	retrieval model (dir, rm3, jm, pubmed, etc)	model	mkeval.sh, mkeval-lucene.sh
from model	retrieval model (dir, rm3, jm, pubmed, etc), for comparison (t-test)	from	compare.R
to model	retrieval model (dir, rm3, jm, pubmed, etc), for comparison (t-test)	to	compare.R
metric	evaluation metric (map, ndcg, P_20, ndcg_cut_20, etc)	metric	mkeval.sh, mkeval-lucene.sh, compare.R
run method	running method (indri, lucene)	run	compare.R

2. Files and their locations

Collections without year: biocaddie, ohsumed

Collections with year: treccds (2015), trecgenomics (2006, 2007)

Type	Location	Example
Indexes	/data/<col>/lucene/<col>_all/shard0	/data/biocaddie/lucene/biocaddie_all/shard0
Indexes (with year)	/data/<col>/lucene/<col><year>_all/shard0	/data/trecgenomics/lucene/trecgenomics2006_all/shard0
Queries	/data/<col>/queries/queries.<subset>.<topics>	/data/biocaddie/queries/queries.test.short
Queries (with year)	/data/<col>/queries/queries.<subset>.<topics>.<year>	/data/trecgenomics/queries/queries.combined.orig.2006
Qrels	/data/<col>/qrels/<col>.qrels.<subset>	/data/biocaddie/qrels/biocaddie.qrels.test
Qrels (with year)	/data/<col>/qrels/<col>.qrels.<subset>.<year>	/data/trecgenomics/qrels/trecgenomics.qrels.combined.2006
Output	/data/<col>/lucene-output/<model>/<subset>/<topics>	/data/biocaddie/lucene-output/dir/test/short
Output (with year)	/data/<col>/lucene-output/<year>/<model>/<subset>/<topics>	/data/trecgenomics/lucene-output/2006/dir/combined/orig
Eval	/data/<col>/lucene-eval/<model>/<subset>/<topics>	/data/biocaddie/lucene-eval/dir/test/short
Eval (with year)	/data/<col>/lucene-eval/<year>/<model>/<subset>/<topics>	/data/trecgenomics/lucene-eval/2006/dir/combined/orig
Loocv	/data/<col>/loocv/<model>.<subset>.<topics>.<metric>.lucene.out	/data/biocaddie/loocv/dir.test.short.ndcg.lucene.out
Loocv (with year)	/data/<col>/loocv/<year>/<model>.<subset>.<topics>.<metric>.lucene.out	/data/trecgenomics/loocv/2006/dir.combined.orig.ndcg.lucene.out

Note: the loocv results of both Lucene and Indri runs are saved in the same location, so that different runs can be compared easily.
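The naming scheme above can be sketched as a small shell function. This is only an illustration: `lucene_paths` is a hypothetical helper, not one of the project scripts, and it covers the index, queries, qrels, output and eval locations (loocv is left out because it also needs a metric).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the file
# locations for one collection. The 5th argument (year) is optional.
lucene_paths() {
    local col=$1 subset=$2 topics=$3 model=$4 year=${5:-}
    local base="/data/${col}"
    if [ -z "$year" ]; then
        # Collections without year, e.g. biocaddie, ohsumed
        echo "index=${base}/lucene/${col}_all/shard0"
        echo "queries=${base}/queries/queries.${subset}.${topics}"
        echo "qrels=${base}/qrels/${col}.qrels.${subset}"
        echo "output=${base}/lucene-output/${model}/${subset}/${topics}"
        echo "eval=${base}/lucene-eval/${model}/${subset}/${topics}"
    else
        # Collections with year, e.g. treccds, trecgenomics
        echo "index=${base}/lucene/${col}${year}_all/shard0"
        echo "queries=${base}/queries/queries.${subset}.${topics}.${year}"
        echo "qrels=${base}/qrels/${col}.qrels.${subset}.${year}"
        echo "output=${base}/lucene-output/${year}/${model}/${subset}/${topics}"
        echo "eval=${base}/lucene-eval/${year}/${model}/${subset}/${topics}"
    fi
}

lucene_paths biocaddie test short dir
lucene_paths trecgenomics combined orig dir 2006
```

The two calls print the same paths as the examples in the table.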

3. Run Lucene baselines

a) Lucene Run (lucene-output)

cd ~/biocaddie
baselines/new/<model>-lucene.sh <topics> <subset> <col> | parallel -j 20 bash -c "{}"
baselines/new/<model>-lucene.sh <topics> <subset> <col> <year> | parallel -j 20 bash -c "{}"

E.g.: baselines/new/dir-lucene.sh short test biocaddie | parallel -j 20 bash -c "{}"

b) Evaluation and Cross-validation (lucene-eval, loocv)

cd ~/biocaddie
scripts/new/mkeval-lucene.sh <model> <topics> <subset> <col>
scripts/new/mkeval-lucene.sh <model> <topics> <subset> <col> <year>

E.g.: scripts/new/mkeval-lucene.sh dir short test biocaddie
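Steps a) and b) are usually repeated for several models. The sketch below only prints the commands (a dry run) so they can be checked before running them from ~/biocaddie. `baseline_commands` is a hypothetical helper, and the model list in the usage line is an assumption: it presumes a baselines/new/<model>-lucene.sh script exists for each name.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print the run (a) and evaluation (b)
# commands for each model. Pass "" as the year for collections without one.
baseline_commands() {
    local topics=$1 subset=$2 col=$3 year=$4
    shift 4
    local model
    for model in "$@"; do
        # ${year:+ $year} appends the year only when it is non-empty.
        echo "baselines/new/${model}-lucene.sh $topics $subset $col${year:+ $year} | parallel -j 20 bash -c \"{}\""
        echo "scripts/new/mkeval-lucene.sh $model $topics $subset $col${year:+ $year}"
    done
}

# Assumed model names; adjust to the scripts available in baselines/new.
baseline_commands short test biocaddie "" tfidf bm25 jm dir
```

Once the printed commands look right, they can be pasted into the shell or piped to `bash` from the ~/biocaddie directory.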

c) Compare models

compare.R prompts for the run method of the two models being compared. Enter one of:

0 - both from and to models are from Indri run

1 - both from and to models are from Lucene run

2 - from model is from Indri run, to model is from Lucene run

3 - from model is from Lucene run, to model is from Indri run

cd ~/biocaddie
Rscript scripts/new/compare.R <subset> <from> <to> <topics> <col>
Rscript scripts/new/compare.R <subset> <from> <to> <topics> <col> <year>

E.g.: Rscript scripts/new/compare.R test tfidf dir short biocaddie
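The four run-method codes can be looked up with a small function, which helps when the comparison is scripted. `run_method_code` is a hypothetical helper. The commented-out pipe at the end assumes that compare.R reads the choice from standard input; this has not been verified against the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the engines of the <from> and <to> models
# to the run-method code that compare.R asks for.
run_method_code() {
    case "$1:$2" in
        indri:indri)   echo 0 ;;  # both models from Indri runs
        lucene:lucene) echo 1 ;;  # both models from Lucene runs
        indri:lucene)  echo 2 ;;  # from is Indri, to is Lucene
        lucene:indri)  echo 3 ;;  # from is Lucene, to is Indri
        *) echo "unknown engines: $1 $2" >&2; return 1 ;;
    esac
}

run_method_code lucene lucene

# Assumption: compare.R reads the code from stdin, so it can be piped in:
# run_method_code lucene lucene | Rscript scripts/new/compare.R test tfidf dir short biocaddie
```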

4. Results

Model	MAP	NDCG	P@20	NDCG@20	P@100	NDCG@100	Notes	Date
classic tfidf	0.3283	0.5816	0.6933	0.5462	0.5007	0.4996	No parameters	06/30/17
BM25	0.3544	0.6061+	0.75	0.5958+	0.5067	0.5182	Sweep b, k1	06/30/17
QL (JM)	0.3367	0.6016	0.7233	0.5713	0.5007	0.5017	Sweep lambda	06/30/17
QL (Dir)	0.3677 (p-value=0.0526)	0.6169+	0.6667	0.5676	0.5213	0.5221	Sweep mu	06/30/17

Scores are for biocaddie (test subset, short topics, Lucene runs). + marks a score whose improvement over classic tfidf has p < 0.05 in the compare.R output below.

root@integration-1:~/biocaddie# Rscript scripts/new/compare.R test tfidf dir short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3283 0.3677 p= 0.0526"
[1] "ndcg 0.5816 0.6169 p= 0.0408"
[1] "P_20 0.6933 0.6667 p= 0.9251"
[1] "ndcg_cut_20 0.5462 0.5676 p= 0.1533"
[1] "P_100 0.5007 0.5213 p= 0.2053"
[1] "ndcg_cut_100 0.4996 0.5221 p= 0.1162"

root@integration-1:~/biocaddie# Rscript scripts/new/compare.R test tfidf bm25 short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3283 0.3544 p= 0.0825"
[1] "ndcg 0.5816 0.6061 p= 0.0239"
[1] "P_20 0.6933 0.75 p= 0.0612"
[1] "ndcg_cut_20 0.5462 0.5958 p= 0.0302"
[1] "P_100 0.5007 0.5067 p= 0.4072"
[1] "ndcg_cut_100 0.4996 0.5182 p= 0.1975"

root@integration-1:~/biocaddie# Rscript scripts/new/compare.R test tfidf jm short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3283 0.3367 p= 0.2231"
[1] "ndcg 0.5816 0.6016 p= 0.0753"
[1] "P_20 0.6933 0.7233 p= 0.1356"
[1] "ndcg_cut_20 0.5462 0.5713 p= 0.1312"
[1] "P_100 0.5007 0.5007 p= 0.5"
[1] "ndcg_cut_100 0.4996 0.5017 p= 0.4441"
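
The compare.R output above can be turned into table cells with a short awk filter. This is a sketch under two assumptions: the output lines keep the format shown above, and + means that the to model scores higher than the from model with p below 0.05. `flag_significant` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read compare.R output on stdin and print
# "<metric><TAB><to score>", with "+" appended when p < alpha (default 0.05)
# and the to score is higher than the from score.
flag_significant() {
    awk -v alpha="${1:-0.05}" '
        /^\[1\] "/ {
            gsub(/^\[1\] "|"$/, "")   # strip the R print wrapper
            # Fields are now: metric, from score, to score, "p=", p-value
            mark = ($5 + 0 < alpha + 0 && $3 + 0 > $2 + 0) ? "+" : ""
            printf "%s\t%s%s\n", $1, $3, mark
        }'
}

# Two lines copied from the tfidf vs dir comparison above.
flag_significant <<'EOF'
[1] "map 0.3283 0.3677 p= 0.0526"
[1] "ndcg 0.5816 0.6169 p= 0.0408"
EOF
```

For these two lines the filter prints `map` with 0.3677 and `ndcg` with 0.6169+, matching the QL (Dir) row of the table. Prompt lines and the typed run method are ignored because they do not start with `[1] "`.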
