...
- Mike at PEARC this week; Thuong's last week
- Final deliverables:
- ElasticSearch plugin (NDS-868) and test process (NDS-956)
- PubMed ingest process (new)
- biocaddie repo release
- Documentation/whitepaper
- Results of comparative evaluation
- Indri v Lucene
- Baselines
- BM25, BM25+Rocchio, BM25+PubMed Rocchio
- Others
- Kubernetes + parallel
- Publish data?
- Report/paper points (ECIR/10-16-17)
- BioCADDIE
- Baseline results
- Query expansion and document expansion results
- Indri > Lucene/ElasticSearch
- Lucene's models aren't valid
- No built-in query expansion
- Limitations of the real-world search engine
- Test collection
- Train v test
- Short v orig
- Query characterization and QPP
- Other collections
- OHSUMED/TRECDS?/Genomics
- Infrastructure
- ir-tools/Maven
- Cross-validation
- Kubernetes/parallel
6/27
- Thuong's last day ~7/15; Garrick out next week; Craig out all next week
- Open discussion/status:
- Garrick: focusing on query expansion/Rocchio; how to make a plugin
- Mike: stress testing on Gluster/Kubernetes for BioCADDIE; 4 large nodes
- Thuong: re-ran baselines with test queries only; updated results; ran TREC Genomics 2006/7 baselines; compared to official results; started looking at Lucene baselines; runquery/mkeval/compare generalization
- Craig: merged LuceneRunQuery with 6.6 support; preliminary Rocchio implementation based on Garrick's work; QPP
- Revisit statement of work and task status (BioCADDIE)
- What we've done:
- Comparative evaluation of RM and Rocchio using BioCADDIE test collection
- Comparative evaluation of SDM
- Decided what to implement (ElasticSearch plugin, Rocchio expansion)
- Still need to do
- Implement actual plugin
- Implement PubMed OA index and ingest process (ElasticSearch)
- Testing (test plan, integration, performance, execution)
- Release packaging (in progress)
- Documentation
- What we can't do
- Analysis with respect to current pipeline (we never got it running)
- What we did that wasn't on the SOW
- Comparative evaluation with CDS, OHSUMED, Genomics
- Document expansion
- Train/test analysis
- Query performance prediction
- Review "test" results + Genomics results
- A few open questions (why OKAPI is so bad on 2007; why 2006 results are better for LM than 2007)
- Remaining priorities
- From SOW
- Create ES plugin (NDS-868)
- Mike had an early prototype
- Garrick implemented Rocchio/BM25 for Lucene (NDS-840)
- We have a rudimentary example, but now we need to implement (NDS-829)
- Create ElasticSearch index for PubMed (NDS-876)
- Lucene baseline runs: Use LuceneRunQuery to run baselines for biocaddie (NDS-949)
- Lucene Rocchio runs: Once reviewed/merged, use LuceneRunQuery for Rocchio baselines for biocaddie
- Testing (Mike?)
- Release
- Documentation
- Other
- Create ElasticSearch index for Wikipedia
- Lucene baseline runs: Use LuceneRunQuery to run baselines for other collections
- Lucene Rocchio runs: Once reviewed/merged, use LuceneRunQuery for Rocchio baselines for other collections
- Audit/cleanup results: Review everything we've done, make sure we've run all models we want to
- Finalize QPP analysis
- Revisit repository priors
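Reference sketch for the Rocchio expansion discussed above (illustrative only, not the project's LuceneRunQuery or plugin code). It assumes positive feedback only, raw term counts rather than BM25 weights, and hypothetical names and defaults (`rocchio_expand`, `alpha`, `beta`, `num_terms`):

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, num_terms=5):
    """Return the top `num_terms` (term, weight) pairs of
    alpha * query + beta * centroid(feedback_docs).

    Assumptions (not taken from the notes): documents are lists of tokens,
    weights are raw term counts, and there is no negative feedback term.
    """
    query_vec = Counter(query_terms)

    # Centroid of the feedback documents: mean term count per document.
    centroid = Counter()
    for doc in feedback_docs:
        for term, count in Counter(doc).items():
            centroid[term] += count / len(feedback_docs)

    # Combine original query weights with the feedback centroid.
    expanded = {
        term: alpha * query_vec[term] + beta * centroid[term]
        for term in set(query_vec) | set(centroid)
    }

    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(expanded.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:num_terms]


if __name__ == "__main__":
    docs = [["gene", "cancer", "cancer"], ["cancer", "expression", "mouse"]]
    for term, weight in rocchio_expand(["gene", "expression"], docs, num_terms=3):
        print(term, weight)
```

In a pseudo-relevance-feedback run, `feedback_docs` would be the top-ranked documents from an initial retrieval (for example a BM25 run), and the expanded terms would be issued as a weighted query.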
...