LuceneContext is a Context wrapper that maintains a full-text index on any metadata that is written to it, using Lucene as the full-text indexing engine.

Note that LuceneContext will not index text that is already in the wrapped Context, instead only indexing text when new metadata is written to the LuceneContext.

LuceneContext works by creating a full text index for all literal objects, which may be searched using TripleMatcher.

Setting it up

This context requires two bits of information for the configuration.

  1. A context for its triples
  2. A directory for the Lucene index

Lucene, of course, has no concept of what a triple is, nor does it know anything about RDF, so a backing context is needed to store the triples.

Lucene also needs a directory for its indices. This directory needs to be initialized before use, so before the very first use of the context, clients need to call the initialize method. Warning: This will remove any existing index! If you do not need to initialize (or re-initialize) the index, do not call this method.

Usage

You can write metadata and data to LuceneContext as with most other Context implementations; the operations are forwarded to the TripleContext. For TripleMatcher, patterns are allowed that contain Lucene-specific query syntax. For example:

TripleMatcher tm = new TripleMatcher();
tm.setObject(Resource.literal("comment*"));
luceneContext.perform(tm);
//... etc.

This would find every triple in the context whose object has a string that starts with "comment". Specifying subject and/or predicate would filter the results. Leaving the object unbound simply passes the matcher off to the backing context. Lucene does allow initial wildcard searches, such as "*fnord", but these are implemented (in Lucene) using various nested looping constructs and are therefore likely to be slow.
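To illustrate why a leading wildcard tends to cost more than a trailing one, here is a minimal sketch in plain Java. It is not Lucene's implementation; it only assumes that indexed terms are kept in sorted order, which lets a prefix search jump straight to the first candidate while a suffix search has to visit every term. The class and method names (WildcardCostSketch, prefixMatch, suffixMatch) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class WildcardCostSketch {

    // Trailing wildcard ("comment*"): seek to the prefix in the sorted set
    // and stop at the first term that no longer starts with it.
    static List<String> prefixMatch(NavigableSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order guarantees no later term can match
            }
            hits.add(term);
        }
        return hits;
    }

    // Leading wildcard ("*fnord"): sorted order does not help, so every
    // term has to be examined.
    static List<String> suffixMatch(NavigableSet<String> terms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                List.of("comment", "commentary", "comments", "fnord", "xfnord", "zebra"));
        System.out.println(prefixMatch(terms, "comment"));
        System.out.println(suffixMatch(terms, "fnord"));
    }
}
```

The prefix search touches only the matching terms plus one more, whereas the suffix search always walks the whole term set, so its cost grows with the size of the index rather than with the number of hits.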

Matching and unification

A warning about matching versus unification: when matching, Lucene is queried for possible matches, as is the backing context. In unification, only the backing context is used.

Searching for dates

Dates are indexed using the DateTools from Lucene, so every date ends up in the format

yyyyMMddHHmmss

This can cause some problems. Because the full date (including the time) is indexed, date range queries must include times. Normally one would issue something like [19970122 TO 19980215] to search for all documents between Jan. 22, 1997 and Feb. 15, 1998, but here the query must be written as [19970122000000 TO 19980215000000]. Lucene will not do a range query on something like [19970122* TO 19980215*].
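As a rough sketch of how a client might build such a range string from plain calendar dates, the helper below pads each date to midnight in the yyyyMMddHHmmss format. It uses only java.time and does not call Lucene. The class and method names (DateRangeQuery, range) are hypothetical, and it assumes the dates are already expressed in whatever time zone was used when the index was written.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateRangeQuery {

    // Same 14-digit layout that the indexed dates use.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Builds "[<from at 00:00:00> TO <to at 00:00:00>]".
    static String range(LocalDate from, LocalDate to) {
        return "[" + from.atStartOfDay().format(STAMP)
                + " TO " + to.atStartOfDay().format(STAMP) + "]";
    }

    public static void main(String[] args) {
        System.out.println(range(LocalDate.of(1997, 1, 22), LocalDate.of(1998, 2, 15)));
        // prints [19970122000000 TO 19980215000000]
    }
}
```

Because the upper bound is midnight at the start of the end date, anything stamped later on that day falls outside the range. If the whole end day should be included, the upper bound would need a time of 235959 instead.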

Performance notes

Lucene, if not properly tweaked, is very, very slow at updating its indices. Straight out of the box it is optimized for simple testing rather than heavy use. Performance considerations also mean that it is not practical to use LuceneContext and JournalingContext together.
