    a) Lucene Run (lucene-output)

Using biocaddie_all indexes

No Format
cd ~/biocaddie
baselines/new/<model>-lucene.sh <topics> <subset> <col>| parallel -j 20 bash -c "{}"
baselines/new/<model>-lucene.sh <topics> <subset> <col> <year>| parallel -j 20 bash -c "{}"

         Eg: baselines/new/dir-lucene.sh short test biocaddie| parallel -j 20 bash -c "{}"
             baselines/new/tfidf-lucene.sh short test biocaddie| parallel -j 20 bash -c "{}"
             baselines/new/jm-lucene.sh short test biocaddie| parallel -j 20 bash -c "{}"
             baselines/new/bm25-lucene.sh short test biocaddie| parallel -j 20 bash -c "{}"
             baselines/new/rocchio-lucene.sh short test biocaddie| parallel -j 20 bash -c "{}"

Using biocaddie_all.snowball indexes

No Format
cd ~/biocaddie
baselines/new/<model>-lucene-snowball.sh <topics> <subset> <col>| parallel -j 20 bash -c "{}"
baselines/new/<model>-lucene-snowball.sh <topics> <subset> <col> <year>| parallel -j 20 bash -c "{}"

Eg: baselines/new/dir-lucene-snowball.sh short test biocaddie| parallel -j 20 bash -c "{}"
    baselines/new/tfidf-lucene-snowball.sh short test biocaddie| parallel -j 20 bash -c "{}"
    baselines/new/jm-lucene-snowball.sh short test biocaddie| parallel -j 20 bash -c "{}"
    baselines/new/bm25-lucene-snowball.sh short test biocaddie| parallel -j 20 bash -c "{}"
    baselines/new/rocchio-lucene-snowball.sh short test biocaddie| parallel -j 20 bash -c "{}"
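
The ten example runs above differ only in the model name and the index suffix, so they can be generated in a loop. The sketch below only prints the command lines and executes nothing. The function name `print_run_commands` is hypothetical, and the model list is copied from the examples rather than taken from the scripts themselves.

```shell
# Print the baseline run command for every model and both index variants.
# Dry run only: the commands are printed, not executed.
print_run_commands() {
  topics=$1; subset=$2; col=$3
  # "" selects the biocaddie_all scripts, "-snowball" the stemmed ones.
  for suffix in "" "-snowball"; do
    for model in dir tfidf jm bm25 rocchio; do
      printf 'baselines/new/%s-lucene%s.sh %s %s %s | parallel -j 20 bash -c "{}"\n' \
        "$model" "$suffix" "$topics" "$subset" "$col"
    done
  done
}

print_run_commands short test biocaddie
```

Once the printed lines look right, they could be run one after another from ~/biocaddie by piping the output to bash.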

    b) Evaluation and Cross-validation (lucene-eval, loocv) 

No Format
cd ~/biocaddie
scripts/new/mkeval-lucene.sh <model> <topics> <subset> <col>
scripts/new/mkeval-lucene.sh <model> <topics> <subset> <col> <year>

Eg: scripts/new/mkeval-lucene.sh dir short test biocaddie
    scripts/new/mkeval-lucene.sh tfidf short test biocaddie
    scripts/new/mkeval-lucene.sh jm short test biocaddie
    scripts/new/mkeval-lucene.sh bm25 short test biocaddie
    scripts/new/mkeval-lucene.sh rocchio short test biocaddie

    scripts/new/mkeval-lucene.sh dir-snowball short test biocaddie
    scripts/new/mkeval-lucene.sh tfidf-snowball short test biocaddie
    scripts/new/mkeval-lucene.sh jm-snowball short test biocaddie
    scripts/new/mkeval-lucene.sh bm25-snowball short test biocaddie
    scripts/new/mkeval-lucene.sh rocchio-snowball short test biocaddie

    c) Compare models

        The script prompts for the run method of the two models being compared:

        0 - both from and to models are from Indri run

        1 - both from and to models are from Lucene run

        2 - from model is from Indri run, to model is from Lucene run

        3 - from model is from Lucene run, to model is from Indri run

No Format
cd ~/biocaddie
Rscript scripts/new/compare.R <subset> <from> <to> <topics> <col>
Rscript scripts/new/compare.R <subset> <from> <to> <topics> <col> <year>


        Eg: Rscript scripts/new/compare.R test tfidf dir short biocaddie

              Rscript scripts/new/compare.R test tfidf-snowball dir-snowball short biocaddie
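
For batch use, the run method could be supplied without typing it at the prompt. This is a minimal sketch that assumes compare.R reads the run method from standard input, which this page does not confirm. `compare_lucene` and the `RSCRIPT` override are hypothetical names introduced for illustration.

```shell
# Assumption: compare.R reads the run method from standard input.
# RSCRIPT is a hypothetical override, useful for dry runs and testing.
compare_lucene() {
  # "1" selects "both from and to models are from Lucene run".
  printf '1\n' | "${RSCRIPT:-Rscript}" scripts/new/compare.R "$@"
}
```

From ~/biocaddie, `compare_lucene test tfidf dir short biocaddie` would then correspond to the first example above with run method 1.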

4. Results

Using biocaddie_all indexes.

Model          MAP                      NDCG                      P@20     NDCG@20  P@100   NDCG@100  Notes                                       Date
classic tfidf  0.3282                   0.5824                    0.6867   0.5478   0.5013  0.5018    No parameters                               07/05/17
BM25           0.3543                   0.6105+                   0.7467+  0.5917+  0.506   0.5186    Sweep b, k1                                 07/05/17
QL (JM)        0.3382                   0.6022                    0.7233   0.571    0.5     0.4996    Sweep lambda                                07/05/17
QL (Dir)       0.3675 (p-value=0.0548)  0.6163+ (p-value=0.0502)  0.6567   0.5664   0.5213  0.522     Sweep mu                                    07/05/17
Rocchio        0.4044+                  0.6417 (p-value=0.0533)   0.6967   0.5403   0.492   0.4912    Sweep b, k1, fbTerms, fbDocs, fbOrigWeight  07/05/17

Using biocaddie_all.snowball indexes

Model                           MAP      NDCG     P@20     NDCG@20  P@100    NDCG@100  Notes                                       Date
classic tfidf (tfidf-snowball)  0.3375   0.5944   0.6667   0.5256   0.4987   0.5002    No parameters                               07/06/17
BM25 (bm25-snowball)            0.3764+  0.6239+  0.73+    0.6006+  0.5413+  0.539+    Sweep b, k1                                 07/06/17
QL (JM) (jm-snowball)           0.3448   0.6058   0.67     0.5813+  0.4987   0.5289+   Sweep lambda                                07/06/17
QL (Dir) (dir-snowball)         0.3776+  0.6315+  0.7033   0.6006+  0.5307   0.5365+   Sweep mu                                    07/06/17
Rocchio (rocchio-snowball)      0.3959   0.6052   0.7267   0.598+   0.5453   0.525     Sweep b, k1, fbTerms, fbDocs, fbOrigWeight  07/06/17

Difference between unstemmed and stemmed indexes

Model                           MAP      NDCG                     P@20     NDCG@20  P@100    NDCG@100  Notes                                       Date
classic tfidf                   0.3282   0.5824                   0.6867   0.5478   0.5013   0.5018    No parameters                               07/10/17
classic tfidf (tfidf-snowball)  0.3375   0.5944                   0.6667   0.5256   0.4987   0.5002    No parameters                               07/10/17
BM25                            0.3543   0.6105                   0.7467   0.5917   0.506    0.5186    Sweep b, k1                                 07/10/17
BM25 (bm25-snowball)            0.3764+  0.6239                   0.73     0.6006   0.5413+  0.539+    Sweep b, k1                                 07/10/17
QL (JM)                         0.3382   0.6022                   0.7233   0.571    0.5      0.4996    Sweep lambda                                07/10/17
QL (JM) (jm-snowball)           0.3448   0.6058                   0.67     0.5813   0.4987   0.5289+   Sweep lambda                                07/10/17
QL (Dir)                        0.3675   0.6163                   0.6567   0.5664   0.5213   0.522     Sweep mu                                    07/10/17
QL (Dir) (dir-snowball)         0.3776   0.6315 (p-value=0.0534)  0.7033+  0.6006+  0.5307   0.5365    Sweep mu                                    07/10/17
Rocchio                         0.4044   0.6417                   0.6967   0.5403   0.492    0.4912    Sweep b, k1, fbTerms, fbDocs, fbOrigWeight  07/10/17
Rocchio (rocchio-snowball)      0.3959   0.6052                   0.7267   0.598    0.5453   0.525     Sweep b, k1, fbTerms, fbDocs, fbOrigWeight  07/10/17


Verification

Using biocaddie_all indexes:

No Format
thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf dir short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3282 0.3675 p= 0.0548"
[1] "ndcg 0.5824 0.6163 p= 0.0502"
[1] "P_20 0.6867 0.6567 p= 0.9461"
[1] "ndcg_cut_20 0.5478 0.5664 p= 0.186"
[1] "P_100 0.5013 0.5213 p= 0.2168"
[1] "ndcg_cut_100 0.5018 0.522 p= 0.1401"


thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf jm short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3282 0.3382 p= 0.1719"
[1] "ndcg 0.5824 0.6022 p= 0.0932"
[1] "P_20 0.6867 0.7233 p= 0.0831"
[1] "ndcg_cut_20 0.5478 0.571 p= 0.145"
[1] "P_100 0.5013 0.5 p= 0.5301"
[1] "ndcg_cut_100 0.5018 0.4996 p= 0.5552"


thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf bm25 short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3282 0.3543 p= 0.0846"
[1] "ndcg 0.5824 0.6105 p= 0.0148"
[1] "P_20 0.6867 0.7467 p= 0.0491"
[1] "ndcg_cut_20 0.5478 0.5917 p= 0.0496"
[1] "P_100 0.5013 0.506 p= 0.428"
[1] "ndcg_cut_100 0.5018 0.5186 p= 0.2195"


thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf rocchio short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3282 0.4044 p= 0.0188"
[1] "ndcg 0.5824 0.6417 p= 0.0533"
[1] "P_20 0.6867 0.6967 p= 0.3785"
[1] "ndcg_cut_20 0.5478 0.5403 p= 0.6276"
[1] "P_100 0.5013 0.492 p= 0.6184"
[1] "ndcg_cut_100 0.5018 0.4912 p= 0.6071"
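
The result lines printed by compare.R above can be scanned mechanically for significant improvements. The sketch below assumes that the "+" in the results tables marks an improvement with p < 0.05. That reading is inferred from the tables rather than stated on this page, and at least one table entry (NDCG for QL (Dir), p = 0.0502) does not fit a strict 0.05 cut-off. The helper name `flag_significant` is hypothetical.

```shell
# Read compare.R output on stdin and print one summary line per metric.
# A "+" is appended when the "to" score beats the "from" score and p < alpha.
# Assumption: "+" in the tables means p < 0.05 (not confirmed by the page).
flag_significant() {
  awk -v alpha="${1:-0.05}" '
    /^\[1\] "/ {
      gsub(/"/, "")   # drop the quotes R prints around each string
      # Fields are now: [1] metric from to p= pvalue
      metric = $2; from = $3; to = $4; p = $6
      mark = (p + 0 < alpha + 0 && to + 0 > from + 0) ? "+" : ""
      printf "%s %s -> %s%s (p=%s)\n", metric, from, to, mark, p
    }'
}
```

Fed the tfidf versus rocchio output above, only the map line is marked (0.3282 -> 0.4044+, p=0.0188). The prompt and menu lines are ignored because they do not start with `[1] "`.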

Using biocaddie_all.snowball indexes

No Format
thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf-snowball dir-snowball short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3375 0.3776 p= 0.0387"
[1] "ndcg 0.5944 0.6315 p= 0.0072"
[1] "P_20 0.6667 0.7033 p= 0.1042"
[1] "ndcg_cut_20 0.5256 0.6006 p= 0.0046"
[1] "P_100 0.4987 0.5307 p= 0.0652"
[1] "ndcg_cut_100 0.5002 0.5365 p= 0.0207"


thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf-snowball jm-snowball short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3375 0.3448 p= 0.2782"
[1] "ndcg 0.5944 0.6058 p= 0.2069"
[1] "P_20 0.6667 0.67 p= 0.475"
[1] "ndcg_cut_20 0.5256 0.5813 p= 0.0161"
[1] "P_100 0.4987 0.4987 p= 0.5"
[1] "ndcg_cut_100 0.5002 0.5289 p= 0.0117"

thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf-snowball bm25-snowball short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3375 0.3764 p= 0.0284"
[1] "ndcg 0.5944 0.6239 p= 0.011"
[1] "P_20 0.6667 0.73 p= 0.0331"
[1] "ndcg_cut_20 0.5256 0.6006 p= 0.0045"
[1] "P_100 0.4987 0.5413 p= 0.0326"
[1] "ndcg_cut_100 0.5002 0.539 p= 0.0149"

...

Compare results between unstemmed and stemmed indexes:

No Format
thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test tfidf tfidf-snowball short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3282 0.3375 p= 0.1463"
[1] "ndcg 0.5824 0.5944 p= 0.1454"
[1] "P_20 0.6867 0.6667 p= 0.808"
[1] "ndcg_cut_20 0.5478 0.5256 p= 0.8715"
[1] "P_100 0.5013 0.4987 p= 0.5819"
[1] "ndcg_cut_100 0.5018 0.5002 p= 0.5652"

thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test dir dir-snowball short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3675 0.3776 p= 0.0842"
[1] "ndcg 0.6163 0.6315 p= 0.0534"
[1] "P_20 0.6567 0.7033 p= 0.0011"
[1] "ndcg_cut_20 0.5664 0.6006 p= 0.0222"
[1] "P_100 0.5213 0.5307 p= 0.1942"
[1] "ndcg_cut_100 0.522 0.5365 p= 0.0645"

thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test jm jm-snowball short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3382 0.3448 p= 0.2603"
[1] "ndcg 0.6022 0.6058 p= 0.3358"
[1] "P_20 0.7233 0.67 p= 0.8885"
[1] "ndcg_cut_20 0.571 0.5813 p= 0.2551"
[1] "P_100 0.5 0.4987 p= 0.55"
[1] "ndcg_cut_100 0.4996 0.5289 p= 0.0026"


thphan@biocaddie-dev:/data/thphan/biocaddie$ Rscript scripts/new/compare.R test bm25 bm25-snowball short biocaddie
Please enter run methods for comparison:
        0: both are Indri
        1: both are Lucene
        2: from is Indri, to is Lucene
        3: from is Lucene, to is Indri
1
[1] "map 0.3543 0.3764 p= 0.0317"
[1] "ndcg 0.6105 0.6239 p= 0.0945"
[1] "P_20 0.7467 0.73 p= 0.548"
[1] "ndcg_cut_20 0.5917 0.6006 p= 0.2775"
[1] "P_100 0.506 0.5413 p= 0.0209"
[1] "ndcg_cut_100 0.5186 0.539 p= 0.0441"