
...

  1. There is no analog to problem (1) above, as one can simply write data to a Java OutputStream.
  2. Support for providing access to a large data object (larger than the largest Java/Scala object size) via some sort of handle object that is placed into the DFDL Infoset Item, and having the Daffodil unparser obtain data from that handle for unparsing. This must not require bringing the entire object into memory, even in pieces. (A sketch of such a handle follows this list.)
  3. Support for incremental delivery of the Infoset to the unparser.
  4. Incremental conversion of XML input data (or JSON) to the DFDL Infoset, so that we don't require the entire incoming XML document (or JSON) to reside in memory for unparsing.
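
For item (2), here is a minimal sketch of what such a handle might look like. The name BlobHandle and its shape are hypothetical, not an existing Daffodil API; the point is only that the unparser pulls the referenced bytes in bounded chunks, so the whole object never has to be resident in memory.

```scala
import java.io.OutputStream
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{ Path, StandardOpenOption }

// Hypothetical handle stored in the Infoset in place of the actual bytes.
final case class BlobHandle(path: Path, offset: Long, length: Long) {

  // Copy the referenced region to the unparse output in fixed-size chunks.
  def writeTo(out: OutputStream, chunkSize: Int = 64 * 1024): Unit = {
    val ch = FileChannel.open(path, StandardOpenOption.READ)
    try {
      val buf = ByteBuffer.allocate(chunkSize)
      var pos = offset
      val end = offset + length
      while (pos < end) {
        buf.clear()
        buf.limit(math.min(chunkSize.toLong, end - pos).toInt)
        val n = ch.read(buf, pos)           // positional read; no shared seek state
        if (n < 0) throw new java.io.EOFException(s"expected ${end - pos} more bytes")
        out.write(buf.array(), 0, n)
        pos += n
      }
    } finally ch.close()
  }
}
```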

All About the Arrays

A key observation that this design depends on is this: It's all about the arrays. An Infoset can only be unbounded in size if one or more arrays in it are unbounded in size. Keeping the storage for the Infoset down to a finite size can be achieved if the number of Infoset nodes needed for arrays can be made finite. All other aspects of the Infoset that are not arrays can be held in memory as an Infoset tree without compromising the ability to stream. They are inherently not unbounded in size.

This holds so long as the DFDL language does not allow recursion, which it doesn't in v1.0.

So, if an algorithm can determine that (a) the application no longer needs an Infoset array element, and (b) no expression needs that array element, then the array element can be dropped. This holds true for both parsing and unparsing.

The simplest way to ensure this is to have a tunable limit on the expanse, across array elements, that expressions are allowed to have. For example, if we allow an expanse of 1 element, then every array has a window of 3 elements: N-1, N, and N+1. When N is advanced, element N-2 can be dropped, and any expression that still refers into it exceeds the implementation-defined limit. For unparsing, when N+1 is created, the expressions that are part of element N (dfdl:outputValueCalc on elements within an array element of complex type) can be evaluated; if they reach further into the future infoset than N+1, they fail for exceeding the implementation-defined limit.

This expanse could be enlarged, but examples of expressions that reach backward or forward arbitrarily far are hard to find and/or contrived. However, many useful and common formats reference backward or forward by exactly 1 array element.
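
A rough sketch of the window bookkeeping described above, with illustrative names (this is not Daffodil code):

```scala
import scala.collection.mutable

// Illustrative sketch of the 3-element window (N-1, N, N+1) over one array,
// with expanse = 1: element N-2 is released as soon as element N+1 arrives.
final class ArrayWindow[E](expanse: Int = 1) {
  private val live = mutable.Map[Long, E]()   // array index -> infoset element

  def add(index: Long, elem: E): Unit = {
    live(index) = elem
    val oldestKept = index - expanse - 1      // with expanse 1 and index N+1, keep N-1..N+1
    live.keysIterator.filter(_ < oldestKept).toList.foreach(live.remove)
  }

  // An expression that reaches a dropped element has exceeded the
  // implementation-defined expanse limit.
  def get(index: Long): E = live.getOrElse(index,
    throw new IllegalStateException(s"element $index is outside the allowed expanse"))
}
```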

Streaming Input

The I/O layer input system must be modified to do streaming; that is, all reading operations must work on finite buffers (CharBuffer and ByteBuffer) and must follow the underflow/overflow protocol so that they are restartable: if an operation does not get enough data, it can extend the buffer and fill it from the input, or make room for the data in the receiving buffer.
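
A sketch of that restartable-fill idea using NIO buffers. It assumes a blocking channel and that the requested amount fits in the buffer; a real implementation would grow the buffer instead, as described above.

```scala
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Ensure at least `needed` unread bytes are available in the finite buffer `buf`
// (kept in read mode), topping it up from the channel on underflow.
def ensureAvailable(buf: ByteBuffer, ch: ReadableByteChannel, needed: Int): Boolean = {
  while (buf.remaining() < needed) {
    buf.compact()                // keep unread bytes, open up space at the end
    val n = ch.read(buf)         // refill from the underlying input
    buf.flip()                   // back to read mode
    if (n == -1) return buf.remaining() >= needed   // channel exhausted: enough or not?
  }
  true                           // enough data: the operation can proceed
}
```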

...

Note that the regular expression Pattern and Matcher classes do not operate on an unbounded InputStream or Reader, but only on a finite CharSequence - realistically, this means a CharBuffer. The Matcher hitEnd() and related API features make it possible to identify when a match needs more data to determine its result. However, this implies one cannot hide the 64-bit layer behind an InputStream/Reader-like capability and then use the Pattern and Matcher classes unmodified. So, Daffodil-specific Pattern and Matcher variants are needed which hide the management of the finite-sized CharSequence and its filling from the input layer.
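
A minimal illustration of the hitEnd() protocol (not the actual Daffodil wrapper): a match that succeeded but touched the end of the available characters is treated as an underflow, because more input could still change the result.

```scala
import java.util.regex.Pattern

// `data` is the finite CharBuffer/CharSequence currently available.
def matchOrUnderflow(p: Pattern, data: CharSequence): Either[String, String] = {
  val m = p.matcher(data)
  if (m.lookingAt() && !m.hitEnd()) Right(m.group())      // definite match
  else if (m.hitEnd()) Left("underflow: need more characters to decide")
  else Left("no match")
}
```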

See the section below about The Bucket Algorithm.

Handle Objects for Large Strings or HexBinary

...

The next layer up can implement a full StAX-style API where no event is released until all points of uncertainty about it have been resolved. But I suspect many applications are going to want events for inner elements even though the outermost point of uncertainty is not yet resolved. They want to consume and process data incrementally, accepting that a later parse error may indicate the overall structure is not correct. (Classic example: the last record contains the number of records, and it's wrong. Another example: the file is damaged, but way down near the end, and most of the data is good.)
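
Purely to illustrate the consumption pattern (this uses the standard javax.xml.stream StAX API over XML, not a Daffodil API, and a Daffodil variant would deliver Infoset events rather than XML events), an application-side pull loop might look like:

```scala
import java.io.InputStream
import javax.xml.stream.{ XMLInputFactory, XMLStreamConstants }

// The application processes inner-element events as they arrive, accepting that
// a later parse error can still invalidate the enclosing structure.
def consume(in: InputStream): Unit = {
  val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
  while (reader.hasNext) {
    val event = reader.next()
    if (event == XMLStreamConstants.START_ELEMENT)
      println(s"start: ${reader.getLocalName}")   // incremental processing happens here
    else if (event == XMLStreamConstants.END_ELEMENT)
      println(s"end:   ${reader.getLocalName}")
    // other event types (characters, comments, ...) ignored in this sketch
  }
  reader.close()
}
```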

Incremental Infoset

Infoset elements must be deleted from the infoset once they are no longer needed, so that the accumulated infoset does not grow endlessly. To ensure they aren't needed forever, any value that is to be referenced by an expression must be stored into a variable, and the referencing expression changed to dereference that variable. This requires the dfdl:newVariableInstance functionality, plus Daffodil compiler capabilities to identify and insert new variables at the proper scopes.

There are some expressions which cannot be hoisted into newVariableInstance in this manner - specifically, those referring to a prior array element, since there is no scope for a new variable instance that spans two elements of an array. This is a DFDL v1.0 limitation. We may need a Daffodil-specific capability - some sort of array variable, i.e., an array of single-assignment locations that can be set one element at a time.

Note: This notion - an array of single-assignment variables - is exactly like I-structures in the Id programming language (http://en.wikipedia.org/wiki/Id_%28programming_language%29). There may be alternative mechanisms using list-comprehensions/array-comprehensions as they are formulated in functional languages like Haskell (see http://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28list_comprehension%29#Haskell).
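
A sketch of what such an array of single-assignment variables could look like (illustrative only; neither DFDL v1.0 nor current Daffodil defines this):

```scala
import scala.collection.mutable.ArrayBuffer

// I-structure style: each slot may be written at most once; reading an unwritten
// slot is an error here (a fuller version might instead suspend, as Id's
// I-structures do).
final class IStructureArray[T](size: Int) {
  private val slots = ArrayBuffer.fill[Option[T]](size)(None)

  def set(i: Int, v: T): Unit = slots(i) match {
    case None    => slots(i) = Some(v)
    case Some(_) => throw new IllegalStateException(s"slot $i already assigned")
  }

  def get(i: Int): T = slots(i).getOrElse(
    throw new IllegalStateException(s"slot $i not yet assigned"))
}
```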

Pathological Data Formats

It's always possible to create a schema where there is a point of uncertainty right near the very top of the data; this is possible for both parsing and unparsing. For unparsing, the classic example is data where the first record must contain the length in bytes of the entire data contents. For parsing, the usual example is data with deep discriminators: e.g., two schemas describe v0.1 and v0.2 of some data format, and the only way to tell the difference is that way down inside there's a date with slashes in it versus ISO notation. So the infoset corresponding to the parse of the whole file is pending until that deep detail is detected.

The Bucket Algorithm

Unfortunately, many operations such as regex matching operate only on finite CharSequence arguments. They don't take a java.io.Reader as argument.

To get streaming I/O behavior, when the low-level operations require these finite objects, one must deal with finite buffers. This includes filling finite byte buffers from the raw input source, and filling finite CharBuffers from the byte buffers.

The input system needs to work in a manner analogous to this idiom involving glasses of water, pitchers, buckets, and a well, which is the ultimate source of all the water.

Parsing is analogous to getting a glass of water or koolAid.

A beverage (water or koolAid) is "poured" into glasses (DISimple infoset element values) from a pitcher (koolAid = text) or a bucket (water = binary data).
Pour = any atomic I/O operation, like trying to grab data that matches a regex, grab N bytes, grab N characters, or grab N bits (operations on the InputStream object).

If the glass is filled (called "overflow" - this doesn't mean anything spilled, just that the container is filled to the brim), we're done.

Any time a glass is only partly filled (called "underflow"), one must (in the text case) take the pitcher and refill it from the bucket. This can overflow, so the pitcher is now full but some water is left in the bucket; or it can underflow, so the bucket has to be refilled from the well, after which we go back to filling the pitcher from the bucket and the glass from the pitcher. The water (binary data) case is similar, except with one fewer hop, because there is no pitcher.

It's possible there is no more water in the well, in which case the bucket may come back partly full or even empty, and similarly the pitcher may come back partly full or even empty. The glass can then be further filled (or not, if the pitcher/bucket was empty).

The customer (parser) that is trying to get a glass of koolAid or water can then decide whether this is ok (satisfied), another underflow, or an error (an error because it didn't get enough data and there is no more to be had).

* The well = ReadableByteChannel
* The bucket = ByteBuffer
* The pitcher = CharBuffer (implements CharSequence)
* The glass = Array[Byte] or StringBuilder (the unsynchronized counterpart of StringBuffer)
* koolAid = text data
* Water = binary data
* Pour = carry out a primitive input operation

What makes this work is that the input operations (pouring) are all restartable; that is, an underflow can be detected when they do not have sufficient data to satisfy their operation (fill the glass). Hence, the pitcher and bucket are finite things, not unbounded streams/channels.
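
A sketch of one such restartable "pour" using the standard NIO CharsetDecoder protocol: fill the pitcher (CharBuffer) from the bucket (ByteBuffer), refilling the bucket from the well (ReadableByteChannel) whenever the decoder reports underflow. Buffer handling and names are illustrative, not Daffodil code.

```scala
import java.nio.{ ByteBuffer, CharBuffer }
import java.nio.channels.ReadableByteChannel
import java.nio.charset.Charset

// The bucket is assumed to be in read mode (flipped), the pitcher in write mode
// (cleared). Returns true when the well has run dry.
def fillPitcher(well: ReadableByteChannel, bucket: ByteBuffer,
                pitcher: CharBuffer, charset: Charset): Boolean = {
  val decoder = charset.newDecoder()
  var wellDry = false
  var done = false
  while (!done) {
    val result = decoder.decode(bucket, pitcher, wellDry)
    if (result.isOverflow) done = true          // pitcher full: overflow
    else if (result.isUnderflow) {
      if (wellDry) done = true                  // nothing left anywhere
      else {
        bucket.compact()                        // keep any undecoded bytes
        wellDry = well.read(bucket) == -1       // refill the bucket from the well
        bucket.flip()
      }
    } else done = true   // decode error: handled per dfdl:encodingErrorPolicy (elided here)
  }
  pitcher.flip()                                // pitcher ready to pour into glasses
  wellDry
}
```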

Optimizations:

All Text: If the data is all text (the whole format uses nothing but text), and dfdl:encodingErrorPolicy='replace', then we can set up the Charset decoders to create Unicode replacement characters on decode errors. Then we throw the koolAid powder down the well, and now we fill the pitcher directly from the well, bypassing the bucket. (We create a java.io.Reader on top of the java.nio.ReadableByteChannel, which does the decoding directly.)
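
A sketch of that shortcut using the standard NIO APIs (names and charset handling are illustrative):

```scala
import java.nio.channels.{ Channels, ReadableByteChannel }
import java.nio.charset.{ Charset, CodingErrorAction }

// With 'replace' semantics the decoder substitutes U+FFFD on errors, so a
// java.io.Reader can sit directly on the channel and the byte-buffer "bucket"
// layer disappears.
def readerOverWell(well: ReadableByteChannel, charset: Charset): java.io.Reader = {
  val decoder = charset.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
  Channels.newReader(well, decoder, -1)   // -1 = default internal buffer size
}
```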

Small Data: (a less important optimization) If the data is small, we can memory-map it (the bucket IS the well). If it is also all text with dfdl:encodingErrorPolicy='replace', then we can create a giant pitcher filled once from the bucket, after which we can drop the bucket. (Code for this exists today.)
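
A sketch of that small-data path, assuming UTF-8 text purely for illustration (Charset.decode applies the 'replace' behavior by default):

```scala
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets
import java.nio.file.{ Path, StandardOpenOption }

// Memory-map the whole file (bucket IS the well), then decode it once into one
// big CharBuffer (a giant pitcher); the byte buffer can be dropped afterwards.
def mapSmallFile(path: Path): java.nio.CharBuffer = {
  val ch = FileChannel.open(path, StandardOpenOption.READ)
  try {
    val bytes = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())  // the mapped "bucket"
    StandardCharsets.UTF_8.decode(bytes)                             // one giant "pitcher"
  } finally ch.close()
}
```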

Additional issues/notes:

Overflow when filling the CharBuffer from the ByteBuffer (filling the pitcher from the bucket) just indicates that the pitcher is full.
Underflow when filling the CharBuffer from the ByteBuffer means the bucket ran out of bytes before the pitcher was full, so the pitcher can hold more.

dfdl:encodingErrorPolicy='error' adds complexity to filling the pitcher from the bucket: any given byte might experience a decode error, and without source-code modifications to every decoder for every encoding, we can't make this throw. Possible idea: maybe it's ok to be slow in this case, and use a Scala Stream[Char], decoding characters one by one?

Inefficiencies to watch out for: in the general case of mixed binary and text fields, the bucket can be big (4 MB?), but the pitcher shouldn't be big unless the data frequently contains large character strings. Consider a format with a mixture of binary and text elements averaging 8 bytes/chars per element, and imagine the pitcher is 4K characters in size. Every time we fill the pitcher from the bucket, we decode perhaps 4K characters from the binary data, but then use on average only 8 of them. If the next element is binary, that pitcher is discarded (because we bypass it and go directly to the bucket for binary data), so we may have no way to reuse the pitcher of koolAid; we then fill the pitcher again, decoding another 4K characters just to get, on average, 8 more.

Ways to fix:
* (Best idea) Implement our own character-by-character decode loop for the general case. Don't bother trying to preserve a CharBuffer or amortize the cost of filling it; just decode character by character and make that as fast as possible. This works fine when our DFA is consuming characters one by one, and also makes it easy to implement dfdl:encodingErrorPolicy="error". It is inefficient for specified-length strings, where getting N characters could be as cheap (on average) as getting a string of length N from a CharBuffer. This effectively punts on the notion of a pitcher in the bucket algorithm above. (A sketch of such a loop follows this list.)
* (another idea) Just use small pitchers (64 characters? - could do a micro benchmark to pick a reasonable size, or come up with some adaptive scheme.)
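
One possible shape of that per-character decode loop, assuming the CharsetDecoder is left at its default CodingErrorAction.REPORT so errors surface as CoderResult values rather than replacement characters (illustrative sketch only):

```scala
import java.nio.{ ByteBuffer, CharBuffer }
import java.nio.charset.{ CharsetDecoder, CoderResult }

// Decode into a one-character CharBuffer so that a decode error (for
// dfdl:encodingErrorPolicy='error') is detected exactly at the offending byte.
def decodeOne(decoder: CharsetDecoder, bucket: ByteBuffer): Either[CoderResult, Char] = {
  val one = CharBuffer.allocate(1)
  val result = decoder.decode(bucket, one, false)
  if (result.isError) Left(result)              // malformed or unmappable input
  else if (one.position() == 1) Right(one.get(0))
  else Left(result)                             // underflow: need more bytes in the bucket
}
```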