
...

Infoset elements must be produced incrementally by the parser, but an element can only be delivered once the surrounding points of uncertainty have been fully resolved. An architecture for this is needed, and it may impose some limitations.
This is closely related to the backtracking issue of over-writing state rather than allocating new state when parsing - at minimum, the changes fall in the same area of the code.

Cursor-style Pull API

The Daffodil API ProcessorFactory class has an onPath("...") method. (Currently only "/" is allowed as a path.) This is intended to enable cursor-like behavior when given a path that identifies an array. Successive calls to the DataProcessor's parse method then advance through the data one array element at a time, each call returning an Infoset whose root is the next InfosetElement. Combined with co-routines, an event-based API can be built on top of this.
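
A minimal sketch of this loop in Scala is below. Only ProcessorFactory.onPath and DataProcessor.parse are named above; the trait signatures and the example path are assumptions for illustration, not the real API.

    // Hedged sketch of the cursor-style pull loop. The traits model the
    // Daffodil classes named above; their signatures here are assumptions.
    trait InfosetElement
    trait Infoset { def root: InfosetElement }
    trait ParseResult { def isError: Boolean; def infoset: Infoset }
    trait DataProcessor { def parse(in: java.io.InputStream): ParseResult }
    trait ProcessorFactory { def onPath(path: String): DataProcessor }

    object CursorLoop {
      // Given a path naming an array, each parse() call advances one element.
      def pullAll(pf: ProcessorFactory, in: java.io.InputStream)(handle: InfosetElement => Unit): Unit = {
        val dp = pf.onPath("/file/record") // hypothetical array path
        var continuing = true
        while (continuing) {
          val res = dp.parse(in)
          if (res.isError) continuing = false // end of data or parse failure
          else handle(res.infoset.root)       // one record's infoset at a time
        }
      }
    }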

A cursor-style API caters to schema-specific applications more than to generic ones. The notion is that each parse action yields some chunk of data that is meaningful to the schema. Dealing with any points of uncertainty that the onPath(...) path crosses becomes the application's problem in this style of API.

StAX-style Pull API

Because data formats can contain points of uncertainty, an entirely XML-oriented pull-parser API can be problematic. See the section below on Pathological Data Formats.

One design point: the lowest-level API pushes the uncertainty back to the consumer. Establishing a known-to-exist or known-not-to-exist resolution of a point of uncertainty is an event like any other; when a discriminator evaluates to true, that produces a discriminator-true event. In other words, the parser's backtracking is visible to the application as events, not hidden and invisible. Unwinding the stack from a failed element parse is a distinct kind of end-element event. This allows applications to parse and process flawed data: an application could implement recovery points, for example, skipping over broken data and retrying from some known place in the schema.
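
To make this concrete, here is a hedged Scala sketch of what such an event vocabulary might look like. All of these names are hypothetical, not Daffodil's actual API.

    // Hypothetical event vocabulary for the lowest-level API, in which
    // resolution of uncertainty and backtracking are ordinary events.
    sealed trait ParseEvent
    final case class StartElement(name: String) extends ParseEvent
    final case class EndElement(name: String) extends ParseEvent
    final case class SimpleValue(name: String, text: String) extends ParseEvent
    final case class StartUncertainty(id: Long) extends ParseEvent     // speculative parse begins
    final case class DiscriminatorTrue(id: Long) extends ParseEvent    // resolved: known-to-exist
    final case class ResolvedNotExists(id: Long) extends ParseEvent    // resolved: known-not-to-exist
    final case class FailedEndElement(name: String) extends ParseEvent // stack unwound from a failed parse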

Incremental parse events in this case look a lot like the Daffodil debugger's trace output. Callback handlers are the common way to deliver them: the application provides an object that implements a particular interface. If the user application then wants co-routines, so that it sees a flat stream of events rather than nested calls, it can build that on top.
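
For illustration, here is a sketch of such a handler interface, plus an adapter that flattens push-style callbacks into a pull-style iterator using a dedicated thread (a simple stand-in for co-routines). The ParseEvent type is the hypothetical one sketched above.

    import java.util.concurrent.LinkedBlockingQueue

    // Hypothetical callback interface: the parser pushes each event into it.
    trait ParseEventHandler { def handle(ev: ParseEvent): Unit }

    // Adapter turning push-style callbacks into a pull-style iterator by
    // running the parse on its own thread.
    final class EventStream(runParse: ParseEventHandler => Unit) extends Iterator[ParseEvent] {
      private object EndOfStream
      private val queue = new LinkedBlockingQueue[AnyRef]()
      new Thread(() => {
        runParse(ev => queue.put(ev))
        queue.put(EndOfStream)
      }).start()
      private var current: AnyRef = queue.take()
      def hasNext: Boolean = current ne EndOfStream
      def next(): ParseEvent = {
        val ev = current.asInstanceOf[ParseEvent]
        current = queue.take()
        ev
      }
    }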

The next layer up can implement a full StAX-style API in which no event is released until all points of uncertainty about it have been resolved. But I suspect many applications will want events for inner elements even while the outermost point of uncertainty is unresolved. They want to consume and process data incrementally, even though a later parse error may indicate the overall structure is not correct. (Classic example: the last record contains the number of records, and it's wrong. Another example: the file is damaged, but only near the end, and most of the data is good.)
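
A hedged sketch of that buffering layer, again in terms of the hypothetical event types above: events inside an open point of uncertainty are held back, released on a discriminator-true resolution, and dropped when the speculative branch fails.

    import scala.collection.mutable.{ArrayBuffer, Stack}

    // Sketch of the "full StAX style" layer: withhold events while a point
    // of uncertainty is open; release on success, discard on failure.
    final class ResolvingFilter(downstream: ParseEvent => Unit) {
      private val pending = Stack[ArrayBuffer[ParseEvent]]()

      // Route an event to the innermost open buffer, or straight downstream.
      private def emit(ev: ParseEvent): Unit =
        if (pending.nonEmpty) pending.top += ev else downstream(ev)

      def handle(ev: ParseEvent): Unit = ev match {
        case StartUncertainty(_)  => pending.push(ArrayBuffer())  // begin buffering
        case DiscriminatorTrue(_) => val buffered = pending.pop(); buffered.foreach(emit) // resolved: release
        case ResolvedNotExists(_) => pending.pop()                // failed branch: discard
        case other                => emit(other)
      }
    }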

Incremental Infoset

Infoset elements must be deleted from the infoset once they are no longer needed, so that the accumulated infoset does not grow without bound. To ensure elements aren't needed forever, any value that is to be referenced by an expression must be stored into a variable, and the referencing expression changed to dereference that variable. This requires the dfdl:newVariableInstance functionality, plus Daffodil compiler capabilities to identify and insert new variables at the proper scopes.
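
A hedged sketch of the idea, with illustrative names (these are not Daffodil internals): before an element is pruned, the value a later expression needs is copied into a variable, and the expression reads the variable instead of the infoset.

    import scala.collection.mutable

    // Illustrative variable store standing in for dfdl:newVariableInstance scopes.
    final class VariableMap {
      private val vars = mutable.Map[String, Any]()
      def newVariableInstance(name: String, value: Any): Unit = vars(name) = value
      def deref(name: String): Any = vars(name)
    }

    // Conceptually, when element "hdr" goes out of scope the compiler inserts:
    //   vmap.newVariableInstance("hdr.len", lengthValueOf(hdr))
    // and an expression like { ../hdr/len } is rewritten at schema-compile
    // time to dereference the variable instead, so "hdr" can be freed.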

...

Note: This notion - an array of single-assignment variables - is exactly like I-structures in the Id programming language (http://en.wikipedia.org/wiki/Id_%28programming_language%29). There may be alternative mechanisms using list-comprehensions/array-comprehensions as they are formulated in functional languages like Haskell (see http://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28list_comprehension%29#Haskell).
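
As an illustration of the I-structure idea, here is a minimal Scala sketch of an array of single-assignment slots, where each slot may be written exactly once and reads block until the value arrives. This is a sketch of the concept only, not a proposed Daffodil mechanism.

    import scala.concurrent.{Await, Future, Promise}
    import scala.concurrent.duration.Duration

    // I-structure-like array: each slot is single-assignment; reading an
    // empty slot blocks until some producer writes it.
    final class IStructure[T](size: Int) {
      private val slots = Array.fill(size)(Promise[T]())
      def write(i: Int, v: T): Unit = slots(i).success(v) // throws if written twice
      def read(i: Int): T = Await.result(slots(i).future, Duration.Inf)
      def readAsync(i: Int): Future[T] = slots(i).future  // non-blocking variant
    }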

Pathological Data Formats

It's always possible to create a schema where a point of uncertainty sits very near the top of the data; this can happen for both parsing and unparsing. For unparsing, the classic example is data whose first record must contain the length in bytes of the entire data contents. For parsing, the usual example is data with a deep discriminator: two schemas describe v0.1 and v0.2 of some data format, and the only way to tell the difference is that, way down inside, there's a date written with slashes in one and in ISO notation in the other. The infoset for the parse of the whole file is pending until that deep detail is detected.
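
To illustrate the deep-discriminator case, here is a sketch of the kind of check involved (the version labels and patterns are made up for illustration): the parse of the entire file hinges on which pattern a single deeply nested date field matches.

    object VersionDiscriminator {
      // Illustrative deep discriminator: slashes imply v0.1, ISO notation v0.2.
      // Until this one field is seen, the infoset for the whole file is pending.
      private val SlashDate = """\d{1,2}/\d{1,2}/\d{4}""".r
      private val IsoDate   = """\d{4}-\d{2}-\d{2}""".r

      def discriminate(dateField: String): Option[String] = dateField match {
        case SlashDate() => Some("v0.1")
        case IsoDate()   => Some("v0.2")
        case _           => None // neither branch resolves: parse error
      }
    }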