Currently, we are checking for very basic attributes during validation:
- index and query must be explicitly specified
- type and field must have values; defaults are assumed if they are not passed
- fbDocs and fbTerms must be positive integers
- alpha and beta must be positive reals between 0 and 1
- k1 and b must be positive reals
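For discussion, the parameter checks listed above could be sketched roughly as follows. This is a minimal illustration, not the code currently in the plugin: the class name `RocchioParamCheck` and the method `validate` are hypothetical, "between 0 and 1" is read as 0 < x <= 1, and type/field are left out because their default values are not stated in this ticket.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for discussion; not the plugin's actual validation code. */
final class RocchioParamCheck {

    /** Returns a list of human-readable problems; an empty list means every check passed. */
    static List<String> validate(String index, String query, int fbDocs, int fbTerms,
                                 double alpha, double beta, double k1, double b) {
        List<String> errors = new ArrayList<>();

        // index and query must be explicitly specified
        if (index == null || index.trim().isEmpty()) errors.add("index must be specified");
        if (query == null || query.trim().isEmpty()) errors.add("query must be specified");

        // fbDocs and fbTerms must be positive integers
        if (fbDocs < 1) errors.add("fbDocs must be >= 1");
        if (fbTerms < 1) errors.add("fbTerms must be >= 1");

        // Assumption: "positive reals between 0 and 1" means 0 < x <= 1.
        // Writing the test as !(...) also rejects NaN.
        if (!(alpha > 0.0 && alpha <= 1.0)) errors.add("alpha must be in (0, 1]");
        if (!(beta > 0.0 && beta <= 1.0)) errors.add("beta must be in (0, 1]");

        // k1 and b must be positive reals; no upper bound is enforced here
        if (!(k1 > 0.0) || !Double.isFinite(k1)) errors.add("k1 must be a positive real");
        if (!(b > 0.0) || !Double.isFinite(b)) errors.add("b must be a positive real");

        return errors;
    }
}
```

Collecting all problems in a list, rather than throwing on the first one, lets a caller report every invalid parameter in a single response. If the open questions below settle on tighter bounds, only the individual comparisons would need to change.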
We also check that the target index has been instantiated properly to run Rocchio:
- index exists and contains the expected type
- index.type contains a mapping (either under "_all" or under the target field) that enables term vectors by setting store=true
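As a rough sketch of the mapping check, assuming the type mapping has already been fetched and parsed into nested `Map` objects: the class `TermVectorMappingCheck`, the method `hasStoredField`, and the assumed layout (`"_all"` at the top level of the type mapping, fields under `"properties"`) are illustrative assumptions, not the plugin's actual code. No Elasticsearch client calls are shown.

```java
import java.util.Map;

/** Hypothetical helper for discussion; works on an already-parsed type mapping. */
final class TermVectorMappingCheck {

    /** True if the node is a map whose "store" entry is boolean true or the string "true". */
    private static boolean storeEnabled(Object node) {
        if (!(node instanceof Map)) return false;
        Object store = ((Map<?, ?>) node).get("store");
        return Boolean.TRUE.equals(store) || "true".equals(store);
    }

    /**
     * Assumed layout: { "_all": {...}, "properties": { "<field>": {...} } }.
     * Returns true if store=true is set under "_all" or under the target field.
     */
    static boolean hasStoredField(Map<String, Object> typeMapping, String field) {
        if (typeMapping == null) return false;
        if (storeEnabled(typeMapping.get("_all"))) return true;
        Object props = typeMapping.get("properties");
        if (!(props instanceof Map)) return false;
        return storeEnabled(((Map<?, ?>) props).get(field));
    }
}
```

A check like this only inspects settings; it does not prove that term vectors can actually be retrieved, which is the concern raised in the open questions.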
Open questions:
- Must alpha = 1 - beta? Perhaps this is just a convention I've seen elsewhere? (right now I make no assumptions about the two values being related, but enforcing this would remove the need to explicitly specify beta)
- Is there a minimum fbTerms/fbDocs (perhaps 10?) under which the values would be nonsensical? (currently just checking that they are both >= 1)
- Do k1 and b have upper/lower bounds? (these are easy to adjust in Rocchio.java)
- Can any of these values ever be zero or negative? (my assumption is currently no)
- Are there other things we should verify on the index settings? (e.g., check that the target index has documents added?)
- Is the second index check above (the store=true mapping check) sufficient? Perhaps I should actually retrieve term vectors / field stats instead, to verify that they are accessible?
This ticket is complete when we have discussed and explored the edge cases described above.