This page is about the various aspects of Daffodil that must change in order to support large objects and large input data.

There are several related problems. For parsing:

  1. Support for input streams or files that are larger than any particular fixed limit, e.g., larger than the largest Java/Scala Array[Byte], or larger than 4 GB.
  2. Support for individual DFDL Infoset objects larger than the largest Java/Scala object size. E.g., within a particular data format, an MPEG video is found which is larger than 4 GB.
  3. Support for DFDL Infosets larger than any particular Java/Scala JVM can hold in its virtual memory at one time.
  4. Conversion of DFDL Infosets to XML (or JSON) as they are incrementally produced, so as to avoid the need to hold the entire XML (or JSON) document in memory.

For unparsing, the problems are simpler:

  1. There is no analog to problem (1) above, as one can simply write data to a Java OutputStream.
  2. Support for providing access to a large data object (larger than the largest Java/Scala object size) via some sort of handle object that is placed into the DFDL Infoset item, with the Daffodil unparser obtaining the data from that handle for unparsing. This must not require bringing the entire object into memory, even in pieces.
  3. Support for incremental delivery of the Infoset to the unparser.
  4. Incremental conversion of XML (or JSON) input to the DFDL Infoset, so that the entire incoming XML (or JSON) document need not reside in memory for unparsing.

Streaming Input

The I/O layer input system must be modified to stream - that is, all reading operations must work on finite buffers (CharBuffer and ByteBuffer) and must follow the underflow/overflow protocol so that they are restartable: if a read does not get enough data (underflow), the buffer can be extended and refilled from the input; if the receiving buffer runs out of room (overflow), space must be made available before decoding continues.
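
To make this concrete, here is a minimal sketch of the kind of restartable fill/decode loop this implies, using java.nio's CharsetDecoder and its underflow/overflow results. The object and method names, and the buffer-handling policy, are illustrative only, not the actual Daffodil I/O layer.

    import java.nio.{ByteBuffer, CharBuffer}
    import java.nio.channels.ReadableByteChannel
    import java.nio.charset.{CharsetDecoder, CoderResult}

    // Sketch only: decode text from a channel into a finite CharBuffer, restarting
    // on underflow (need more bytes) and stopping on overflow (caller must drain
    // `chars`). `bytes` is assumed to already be in "read" (flipped) state.
    object StreamingDecodeSketch {
      def decodeSome(in: ReadableByteChannel,
                     bytes: ByteBuffer,
                     chars: CharBuffer,
                     decoder: CharsetDecoder): Unit = {
        var eof = false
        var done = false
        while (!done) {
          val cr: CoderResult = decoder.decode(bytes, chars, eof)
          if (cr.isUnderflow && !eof) {
            bytes.compact()              // keep any partial character, make room
            eof = in.read(bytes) == -1   // refill from the underlying input
            bytes.flip()                 // back to "read" state for the decoder
          } else {
            done = true                  // overflow, error, or underflow at end of data
          }
        }
      }
    }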

Note: Branch review-mjb-backend-improvements-01 has some work in this direction, but it never made it to mainline because it was simply too ambitious. Nevertheless, there is useful work to be mined from that branch.

This cleanup of the I/O layer for parsing is overdue anyway. There are far too many layers here, and Scala's Reader[T] and PagedSeq[T] are never going to be fast enough. Everything that scans for characters must work more in the manner of a Java InputStream, with its position, mark, and reset capabilities for backing up to a previously marked location. The 64-bit I/O layer must implement this InputStream-like capability and hide the management of finite-size buffers.
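
A hedged sketch of the sort of interface such a 64-bit, InputStream-like layer might expose follows. The trait and method names are invented for illustration; they are not the real Daffodil API.

    // Sketch only: the kind of InputStream-like interface the 64-bit I/O layer
    // might expose. The point is that mark/reset and a 64-bit position hide the
    // management of finite buffers behind this interface.
    trait DataInputStream64 {
      def bitPosition: Long                        // current position, in bits
      def mark(): DataInputStream64.Mark           // remember the current position
      def reset(m: DataInputStream64.Mark): Unit   // back up to a previously marked position
      def discard(m: DataInputStream64.Mark): Unit // mark no longer needed; buffers may be released
      def getByte(): Int                           // next byte, or -1 at end of data
    }

    object DataInputStream64 {
      trait Mark // opaque token for a saved position
    }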

Note that the regular expression Pattern and Matcher classes do not operate on an unbounded InputStream or Reader, but only on a finite CharSequence - realistically, this means using CharBuffer. The Matcher hitEnd() method and related API features make it possible to detect when a match needs more data to determine its result. However, this implies one cannot simply hide the 64-bit layer behind an InputStream/Reader-like capability and then use the Pattern and Matcher classes unmodified. So a Daffodil-specific Pattern and Matcher variant is needed which hides the management of the finite-sized CharSequence and its filling from the input layer.
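
The following sketch shows the match-then-extend loop that hitEnd() makes possible, assuming a fetchMore callback standing in for the layer that refills or enlarges the CharBuffer from the input. It illustrates the idea only; it is not a proposed implementation.

    import java.nio.CharBuffer
    import java.util.regex.{Matcher, Pattern}

    // Sketch only: a restartable "lookingAt" that hides refilling of the finite
    // CharSequence. `fetchMore` stands in for the input layer; it returns a larger
    // CharBuffer containing more decoded data, or None if the input is exhausted.
    object RestartableMatchSketch {
      def lookingAtWithRefill(pattern: Pattern,
                              chars: CharBuffer,
                              fetchMore: CharBuffer => Option[CharBuffer]): Option[Matcher] = {
        @annotation.tailrec
        def attempt(buf: CharBuffer): Option[Matcher] = {
          val m = pattern.matcher(buf)             // CharBuffer is a CharSequence
          val matched = m.lookingAt()
          if (!m.hitEnd()) {
            if (matched) Some(m) else None         // more data cannot change this result
          } else fetchMore(buf) match {
            case Some(bigger) => attempt(bigger)   // hit the end: retry with more data
            case None         => if (matched) Some(m) else None // input exhausted
          }
        }
        attempt(chars)
      }
    }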

Handle Objects for Large Strings or HexBinary

Large atomic objects of type xs:string and xs:hexBinary cannot be turned into ordinary Java String and Array[Byte] objects. Rather, they must be represented by some sort of small handle or proxy object. A tunable threshold should be available to tell Daffodil when to create a handle rather than an ordinary String or Array[Byte].

The DFDL Infoset doesn't really specify what the [value] member is for a hexBinary object - that is, it does not specify the API for accessing this value. Currently it is Array[Byte], but we can provide other abstractions. Likewise, the [value] member for type xs:string is assumed to be a java.lang.String, but we can provide other abstractions.

These handle objects would support the ability to open and access the contents of these large objects as a java.nio.Channel or java.io.InputStream (for hexBinary), or as a java.io.Reader (for String).
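
As a rough illustration, such handles might look something like the traits below. All names are hypothetical.

    import java.io.{InputStream, Reader}
    import java.nio.channels.ReadableByteChannel

    // Sketch only: proxy ("handle") objects standing in for values too large to
    // hold as an ordinary Array[Byte] or String.
    trait LargeHexBinaryHandle {
      def lengthInBytes: Long
      def openChannel(): ReadableByteChannel   // stream the bytes without loading them all
      def openInputStream(): InputStream
    }

    trait LargeStringHandle {
      def lengthInCharacters: Long
      def openReader(): Reader                 // stream the characters without loading them all
    }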

When projecting the DFDL Infoset into XML, these handle objects would have to appear as an XML serialization of the handle, with usable members so that other software can access the data the handle refers to. For example, the handle might contain a file name or URI, an offset into it (type Long), a length (type Long), and possibly the first N bytes/characters of the data.
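
Purely as a hypothetical example, such a serialization might render the handle as a small element carrying the reference information rather than the data itself. The element and attribute names here are invented.

    // Hypothetical example only: rendering a handle into XML as a reference to the
    // data (URI, offset, length, and a short prefix) rather than the data itself.
    object HandleXmlProjectionSketch {
      def toXml(uri: String, offsetInBytes: Long, lengthInBytes: Long, firstBytesHex: String): String =
        s"""<largeBlob uri="$uri" offset="$offsetInBytes" length="$lengthInBytes" firstBytes="$firstBytesHex"/>"""
    }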

This mechanism needs to work for both parsing and unparsing; hence, an API for constructing these large-data handle objects is needed.
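
Continuing the hypothetical handle traits sketched above, the construction API might be a small factory like the following. Signatures are invented for illustration, and the bodies are left unimplemented because only the API shape is being suggested.

    import java.net.URI

    // Sketch only: a factory an application might use to construct handle objects
    // for unparsing, so the data never has to be brought into the JVM heap.
    object LargeObjectHandles {
      def hexBinaryFromFile(file: URI, offsetInBytes: Long, lengthInBytes: Long): LargeHexBinaryHandle = ???
      def stringFromFile(file: URI, offsetInChars: Long, lengthInChars: Long, charsetName: String): LargeStringHandle = ???
    }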

Infoset Events

Infoset elements must be produced incrementally by the parser. They can only be produced once the surrounding points of uncertainty are fully resolved. An architecture for this is needed, and there may be some limitations.
This is closely related to the backtracking issue of overwriting state rather than allocating new state when parsing - at the very least, the changes fall in the same area of the code.

The Daffodil API's ProcessorFactory class has an onPath("...") method. (Currently only "/" is allowed as a path.) This is intended to enable cursor-like behavior when given a path that identifies an array: successive calls to the DataProcessor's parse method should advance through the data one array element at a time, each time returning an Infoset whose root is the next InfosetElement. Using this along with co-routines, an event-based API can be produced.
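
Here is a sketch of how that cursor-style usage might look from an application's point of view. The stand-in traits below merely mirror the API surface described above (onPath, parse); they are not the real Daffodil signatures, and the array-cursor behavior is the proposed one, not current behavior.

    import java.nio.channels.ReadableByteChannel

    // Stand-in interfaces for illustration only.
    trait InfosetDocument
    trait ParseResult { def isError: Boolean; def result: InfosetDocument }
    trait DataProcessor { def parse(input: ReadableByteChannel): ParseResult }
    trait ProcessorFactory { def onPath(path: String): DataProcessor }

    object ArrayCursorSketch {
      // With onPath pointing at an array, each parse call would return an infoset
      // rooted at the next array element, giving pull-style (cursor-like) streaming.
      def foreachElement(pf: ProcessorFactory,
                         input: ReadableByteChannel)(body: InfosetDocument => Unit): Unit = {
        val dp = pf.onPath("/file/record")   // hypothetical path identifying an array
        var more = true
        while (more) {
          val res = dp.parse(input)
          if (res.isError) more = false      // end of data or a parse error
          else body(res.result)              // hand one element's subtree to the caller
        }
      }
    }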

Infoset elements must be deleted from the infoset once they are no longer needed, so that the accumulated infoset does not grow without bound. To ensure they are not needed indefinitely, any value that is to be referenced by an expression must be stored into a variable, and the referencing expression changed to dereference that variable. This requires the dfdl:newVariableInstance functionality, plus Daffodil compiler capabilities to identify and insert new variables at the proper scopes.

There are some expressions which cannot be hoisted into newVariableInstance in this manner - specifically, references to a prior array element, since there is no scope for a new variable instance that spans two elements of an array. This is a DFDL v1.0 limitation. We may need a Daffodil-specific capability: some sort of array variable, so that one can set a single-assignment location within an array of them.

Note: This notion - an array of single-assignment variables - is exactly like I-structures in the Id programming language (http://en.wikipedia.org/wiki/Id_%28programming_language%29). There may be alternative mechanisms using list-comprehensions/array-comprehensions as they are formulated in functional languages like Haskell (see http://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28list_comprehension%29#Haskell).
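
For illustration, an I-structure-like array of single-assignment slots could be as simple as the following sketch. The class name and error-handling policy are just one possibility, not a design decision.

    // Sketch only: an array of single-assignment slots, in the spirit of the
    // I-structures mentioned above. Each slot may be written at most once; a
    // second write is an error, and an unwritten slot reads as None.
    final class SingleAssignmentArray[T](size: Int) {
      private val slots: Array[Option[T]] = Array.fill(size)(None)

      def set(index: Int, value: T): Unit = slots(index) match {
        case None    => slots(index) = Some(value)
        case Some(_) => throw new IllegalStateException(s"slot $index already assigned")
      }

      def get(index: Int): Option[T] = slots(index)
    }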

 
