5/25/2017

Notes from the NDS/BioCADDIE team meeting. This meeting is primarily to plan for the next sprint. The following items are up for discussion:

  • Evaluation framework -- where should we go from here? 
    • Clean-up/prune ir-utils
    • Lucene-centric evaluation (lucene4ir)
    • Improving the shell-script approach (balance understandability/simplicity with scale)
    • Possible tasks:
      • Tie breaking
      • Retrieval models without rescoring
        • Hack Indri or extend Lucene
      • Extend Lucene
        • Dirichlet + TwoStage
        • RM/RM3
        • Is it KL?
        • PLM
        • LDA
        • k-means
        • Handling priors
        • CER 
  • Distributed evaluation (Kubernetes)
    • Mike has a prototype working with hyperkube
    • Comment about missing Okapi expansion
    • Possible tasks: 
      • Test on a real cluster via deploy-tools (NDS-hackathon project)
      • Provision attached storage for each node (already done with deploy-tools?)
      • How can we get data to and from all of the nodes? (For the prototype, manual is fine.) Ideally, something similar to hdfs dfs -put and hdfs dfs -get from Hadoop.
      • Garrick: qrels/topics?
      • Explore AWS/GCE/Azure?
  • ES RM plugin
    • Possible tasks:
      • 1.7.5 support!! (NDS-897)
      • Actually implement the plugin (NDS-868)
      • Custom scoring exploration (Garrick)
  • Stemming in ES (NDS-885)
    • Create index both stemmed (Snowball) and unstemmed
  • VM resources: 
    • SDSC vs NCSA
    • Shared data directories
  • Performance characterization (recommended by Kirk)
  • New ideas?
    • Boolean/"sufficient" query (Garrick)
      • Boolean queries in the Indri query language: #scoreif
    • Structured search (using the document structure somehow)
    • Try other collections (UMLS/MeSH, medical subsets)
    • Analyze relevance judgments
    • Compare baselines against medical collections
    • Cluster-based expansion models
    • Query performance prediction
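
For the tie-breaking task listed under the evaluation framework, one possible approach can be sketched as follows. This is a minimal Python sketch, not code from ir-utils: the function names, the (docno, score) tuple format, and the run tag are assumptions made for illustration. It ranks by descending score and breaks ties on docno, so repeated runs over the same scores produce the same order.

```python
def rank_with_tiebreak(scored_docs):
    """Sort (docno, score) pairs by descending score, then ascending docno.

    The secondary key makes the order deterministic when scores are equal.
    The choice of ascending docno is an assumption, not a stated requirement.
    """
    return sorted(scored_docs, key=lambda pair: (-pair[1], pair[0]))


def to_trec_run(topic, scored_docs, run_tag="example"):
    """Format results as TREC run lines: topic Q0 docno rank score tag."""
    ranked = rank_with_tiebreak(scored_docs)
    return [
        f"{topic} Q0 {docno} {rank} {score:.4f} {run_tag}"
        for rank, (docno, score) in enumerate(ranked, start=1)
    ]
```

Evaluation tools may apply their own ordering to tied scores, so any rule chosen here would need to be checked against the tool actually used for scoring.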



5/23/2017

Notes from BioCADDIE core developer meeting

  • Presented status update
  • BioCADDIE is running ES 1.7.5 in production, but is using more recent versions in development
  • Xiaoling emailed results from DataMed system for full test collection in TREC format.
  • Kirk suggested that we look at a fallback strategy – use one model for higher precision, another for long tail
    • When does it work? What queries does it work for?
    • Better characterization of what's working
    • DataMed is a P@20 system, mainly
  • Gerard (?) has installed the current pipeline and will document it. Maybe we can do the same.
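
One possible reading of the fallback strategy Kirk suggested can be sketched as below. This is a hedged Python sketch under assumptions: the notes do not say how the two models would be combined, so the merge rule here (top k from a high-precision ranking, followed by the remaining documents from a broader ranking) and the function names are hypothetical. P@k is computed as the number of relevant documents in the top k divided by k.

```python
def precision_at_k(ranked_docnos, relevant, k=20):
    """Relevant documents in the top k, divided by k.

    Divides by k even when fewer than k documents were retrieved.
    """
    top = ranked_docnos[:k]
    return sum(1 for docno in top if docno in relevant) / k


def fallback_merge(precise_ranking, broad_ranking, k=20):
    """Assumed merge rule: keep up to k documents from the high-precision
    ranking, then append documents from the broader ranking that are not
    already in that head, preserving the broader ranking's order."""
    head = precise_ranking[:k]
    seen = set(head)
    tail = [docno for docno in broad_ranking if docno not in seen]
    return head + tail
```

Comparing P@20 of the merged list against each model alone, query by query, would be one way to approach the questions of when the fallback works and for which queries.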

