Mike at PEARC this week; Thuong's last week
Final deliverables:
- ElasticSearch plugin (NDS-868) and test process (NDS-956)
- PubMed ingest process (new)
- biocaddie repo release
- Documentation/whitepaper
  - Results of comparative evaluation
  - Indri v Lucene
  - Baselines
  - BM25, BM25+Rocchio, BM25+PubMed Rocchio
Others
- Kubernetes + parallel
- Publish data?
Report/paper points (ECIR/10-16-17):
- BioCADDIE
  - Baseline results
  - Query expansion and document expansion results
  - Indri > Lucene/ElasticSearch
    - Lucene's models aren't valid
    - No built-in query expansion
    - Limitations of the real-world search engine
  - Test collection
    - Train v test
    - Short v orig
  - Query characterization and QPP
- Other collections
  - OHSUMED/TREC CDS?/Genomics
- Infrastructure
  - ir-tools/Maven
  - Cross-validation
  - Kubernetes/parallel

6/27

Thuong's last day ~7/15; Garrick out next week; Craig out all next week
Open discussion/status:
- Garrick: focusing on query expansion/Rocchio; how to make a plugin
- Mike: stress testing on Gluster/Kubernetes for BioCADDIE; 4 large nodes
- Thuong: re-ran baselines with test queries only; updated results; ran TREC Genomics 2006/7 baselines; compared to official results; started looking at Lucene baselines; runquery/mkeval/compare generalization
- Craig: merged LuceneRunQuery with 6.6 support; preliminary Rocchio implementation based on Garrick's work; QPP
Revisit statement of work and task status (BioCADDIE)
- What we've done:
  - Comparative evaluation of RM and Rocchio using BioCADDIE test collection
  - Comparative evaluation of SDM
  - Decided what to implement (ElasticSearch plugin, Rocchio expansion)
- Still need to do
  - Implement actual plugin
  - Implement PubMed OA index and ingest process (ElasticSearch)
  - Testing (test plan, integration, performance, execution)
  - Release packaging (in progress)
  - Documentation
- What we can't do
  - Analysis with respect to current pipeline (we never got it running)
- What we did that wasn't on the SOW
  - Comparative evaluation with CDS, OHSUMED, Genomics
  - Document expansion
  - Train/test analysis
  - Query performance prediction
Review "test" results + Genomics results
- A few open questions (why OKAPI is so bad on 2007; why 2006 results are better for LM than 2007)
Remaining priorities
- From SOW
  - Create ES plugin (NDS-868)
    - Mike had an early prototype (NDS-840)
    - Garrick implemented Rocchio/BM25 for Lucene (NDS-829)
    - We have a rudimentary example, but still need to implement the actual plugin.
  - Create ElasticSearch index for PubMed (NDS-876)
  - Lucene baseline runs: Use LuceneRunQuery to run baselines for biocaddie (NDS-949)
  - Lucene Rocchio runs: Once reviewed/merged, use LuceneRunQuery for Rocchio baselines for biocaddie
  - Testing (Mike?)
  - Release
  - Documentation
- Other
  - Create ElasticSearch index for Wikipedia
  - Lucene baseline runs: Use LuceneRunQuery to run baselines for other collections
  - Lucene Rocchio runs: Once reviewed/merged, use LuceneRunQuery for Rocchio baselines for other collections
  - Audit/cleanup results: Review everything we've done, make sure we've run all models we want to
  - Finalize QPP analysis
  - Revisit repository priors
