This page is about the various aspects of Daffodil that must change in order to implement large objects and large input data.
There are several related problems. For parsing:
- Support for input streams or files that are larger than any particular fixed limit. (e.g. larger than the largest Java/Scala Array[Byte], or larger than 4Gbytes)
- Support for individual DFDL Infoset objects larger than the largest Java/Scala object size. For example, within a particular data format an MPEG video is found which is larger than 4Gbytes.
- Support for DFDL Infosets larger than any particular Java/Scala JVM can hold in its virtual memory at one time.
- Conversion of DFDL Infosets to XML (and JSON) as incrementally produced, so as to avoid the need to hold the entire XML document (or JSON) in memory.
For unparsing, the problem is simpler.
- There is no analog to the first parsing problem above (input larger than any fixed limit), as one can simply write data to a Java OutputStream.
- Support for providing access to a large data object (larger than largest Java/Scala object size) via some sort of handle object that is placed into the DFDL Infoset Item, and having the Daffodil unparser obtain data from that handle for unparsing. This must not require bringing the entire object into memory even in pieces.
- Support for incremental delivery of the Infoset to the unparser.
- Incremental conversion of XML input data (or JSON) to the DFDL Infoset, so that we don't require the entire incoming XML document (or JSON) to reside in memory for unparsing.
All About the Arrays
A key observation that this design depends on is this: It's all about the arrays. An Infoset can only be unbounded in size if one or more arrays in it are unbounded in size. Keeping the storage for the Infoset down to a finite size can be achieved if the number of Infoset nodes needed for arrays can be made finite. All other aspects of the Infoset that are not arrays can be held in memory as an Infoset tree without compromising the ability to stream. They are inherently not unbounded in size.
This holds so long as the DFDL language does not allow recursion, which it doesn't in v1.0.
So, if an algorithm can determine that (a) the application no longer needs an Infoset array element, and (b) no expression needs that Infoset array element, then that array element can be dropped. This holds true for both parsing and unparsing.
The simplest way to ensure this is to have a tunable parameter for the expanse, across array elements, that expressions are allowed to have. For example, if we allow an expanse of 1 element, then there is a window of 3 array elements for every array: N-1, N, and N+1. When N is advanced, element N-2 can be dropped, and any expression that refers into it exceeds the implementation-defined limit. For unparsing, when N+1 is created, expressions that are part of element N (dfdl:outputValueCalc on elements within an array element of complex type) can be evaluated, and if they reach further into the future infoset than N+1, they will error for exceeding the implementation-defined limit.
This expanse across array elements could be enlarged, but examples of expressions reaching backward or forward arbitrarily far are hard to find or are contrived. However, many useful and common formats reference backward by 1 array element or forward by 1 array element.
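The windowing rule above can be sketched as follows. This is a minimal illustration, not Daffodil code: the class ArrayElementWindow and its methods are hypothetical names, and it assumes array elements arrive strictly in order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Daffodil code: keeps only array elements near the current one. */
final class ArrayElementWindow<T> {
    private final int expanse;                 // tunable reach, in elements, allowed to expressions
    private final List<T> retained = new ArrayList<>();
    private long firstRetained = 0;            // array index of retained.get(0)
    private long current = 0;                  // index N of the element being processed

    ArrayElementWindow(int expanse) {
        if (expanse < 0) throw new IllegalArgumentException("expanse must be >= 0");
        this.expanse = expanse;
    }

    /** Adds the next array element; its index is firstRetained + retained.size(). */
    void append(T element) {
        retained.add(element);
    }

    /** Makes n the current element and drops every element older than n - expanse. */
    void advanceTo(long n) {
        if (n < current || n >= firstRetained + retained.size()) {
            throw new IllegalArgumentException("element " + n + " is not available");
        }
        current = n;
        while (firstRetained < n - expanse) {
            retained.remove(0);                // the window is small, so this is cheap
            firstRetained++;
        }
    }

    /** Looks up element i on behalf of an expression, enforcing the expanse limit. */
    T get(long i) {
        if (Math.abs(i - current) > expanse) {
            throw new IllegalStateException(
                "reference to element " + i + " exceeds expanse " + expanse + " around " + current);
        }
        long offset = i - firstRetained;
        if (offset < 0 || offset >= retained.size()) {
            throw new IllegalStateException("element " + i + " does not exist yet");
        }
        return retained.get((int) offset);
    }

    int retainedCount() {
        return retained.size();
    }
}
```

With an expanse of 1 and the current element at index 2, a lookup of element 1 or 3 succeeds, while a lookup of element 0 is rejected as exceeding the limit.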
The I/O layer input system must be modified to do streaming. That is, all reading operations must be on finite buffers (CharBuffer and ByteBuffer) and must handle the underflow/overflow protocol so as to be restartable: if an operation does not get "enough data", it can extend the buffer and fill it from the input, or make room for the data in the receiving buffer.
This cleanup of the I/O layer for parsing is overdue anyway. Everything that does scanning for characters must work more in the manner of a Java InputStream, with its position, mark, and reset capabilities for backing up to a previously marked location. The 64-bit I/O layer must implement this InputStream-like capability and hide the management of finite-size buffers.
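As an illustration of the underflow/overflow protocol on finite buffers, here is a minimal sketch using the standard java.nio.charset.CharsetDecoder. The class name StreamingDecodeSketch and the method decodeAll are hypothetical, UTF-8 is assumed, and the StringBuilder merely stands in for whatever downstream consumer a real I/O layer would feed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: decodes a byte stream through small fixed-size buffers. */
final class StreamingDecodeSketch {

    static String decodeAll(byte[] data, int byteBufSize, int charBufSize) {
        // A UTF-8 sequence is at most 4 bytes and may decode to a surrogate pair (2 chars).
        if (byteBufSize < 4 || charBufSize < 2) {
            throw new IllegalArgumentException("buffers too small to make progress");
        }
        ReadableByteChannel in = Channels.newChannel(new ByteArrayInputStream(data));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bytes = ByteBuffer.allocate(byteBufSize);
        CharBuffer chars = CharBuffer.allocate(charBufSize);
        StringBuilder sink = new StringBuilder(); // stand-in for the downstream consumer
        try {
            boolean eof = false;
            while (!eof) {
                eof = in.read(bytes) < 0;          // refill after underflow
                bytes.flip();
                CoderResult result;
                do {
                    result = decoder.decode(bytes, chars, eof);
                    if (result.isError()) result.throwException();
                    drain(chars, sink);            // make room after overflow
                } while (result.isOverflow());
                bytes.compact();                   // keep any incomplete trailing sequence
            }
            while (decoder.flush(chars).isOverflow()) {
                drain(chars, sink);
            }
            drain(chars, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    private static void drain(CharBuffer chars, StringBuilder sink) {
        chars.flip();
        sink.append(chars);
        chars.clear();
    }
}
```

The point of the sketch is that neither buffer ever needs to grow with the size of the input: an underflow is answered by refilling the byte buffer, and an overflow by emptying the char buffer.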
About Regular Expression Matching
DFDL's dfdl:lengthKind 'pattern', and dfdl:assert/dfdl:discriminator with testKind 'pattern' imply regular expression scanning of the input data stream.
Note that the regular expression Pattern and Matcher classes do not operate on an unbounded InputStream or Reader, but only on a finite CharSequence interface - realistically, this means using CharBuffer. The Matcher hitEnd() method and other API features make it possible to identify when a match needs more data to determine the result. However, it is generally true that the Pattern and Matcher objects are incompatible with streaming of data.
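A minimal sketch of how hitEnd() can drive a "read more and retry" loop follows. The class IncrementalMatchSketch and the method matchLength are hypothetical names. Note that the buffer here grows with the length of the candidate match, which is exactly the limitation described above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: retries a match with more data whenever the Matcher hit end of input. */
final class IncrementalMatchSketch {

    /** Returns the length of the match anchored at the start of the input, or -1 if none. */
    static int matchLength(Pattern pattern, Reader in, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be > 0");
        StringBuilder buf = new StringBuilder();   // grows as long as the result is undecided
        char[] chunk = new char[chunkSize];
        boolean eof = false;
        try {
            while (true) {
                if (!eof) {
                    int n = in.read(chunk);
                    if (n < 0) eof = true; else buf.append(chunk, 0, n);
                }
                Matcher m = pattern.matcher(buf);
                boolean matched = m.lookingAt();
                // hitEnd() == true means additional input could change the result.
                if (eof || !m.hitEnd()) {
                    return matched ? m.end() : -1;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For instance, matching \d+ against a Reader over "12345abc" in chunks of 2 characters keeps reading while the digits run to the end of the buffer, and stops once a non-digit has been seen.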
Regular expression matching using the java.util.Scanner class does operate on java.io.InputStream (or ReadableByteChannel), and takes a charset argument as well. So a revised streaming input layer may have to accommodate use of java.util.Scanner. The java.util.Scanner implementation would need to be evaluated to see if it in fact can carry out very large matches. Specifically, DFDL image file formats may involve large BLOB objects (hexBinary) which have marker delimiters. These BLOBs can be much larger than any single Java object, and may actually be larger than can be accommodated in the JVM heap memory. The input layer needs to accommodate identifying the ending position for an object of unlimited size.
See the section below about The Bucket Algorithm.
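For reference, the java.util.Scanner usage discussed above looks roughly like this. The class ScannerMarkerSketch and the method findMarker are hypothetical names, and ISO-8859-1 is chosen here only as an assumption so that every byte maps to exactly one character. Whether Scanner copes with very large spans between markers is not something this sketch demonstrates.

```java
import java.io.InputStream;
import java.util.Scanner;
import java.util.regex.Pattern;

/** Hypothetical sketch: searches an InputStream for a marker delimiter with java.util.Scanner. */
final class ScannerMarkerSketch {

    /**
     * Returns the matched marker text, or null if it is not found within the horizon.
     * A horizon of 0 means the search is unbounded.
     */
    static String findMarker(InputStream in, String markerRegex, int horizon) {
        // Assumption: a single-byte charset, so characters correspond 1:1 to bytes.
        // The Scanner is deliberately not closed, since closing it would close the stream.
        Scanner scanner = new Scanner(in, "ISO-8859-1");
        return scanner.findWithinHorizon(Pattern.compile(markerRegex), horizon);
    }
}
```

The horizon argument is the only built-in bound on how far the Scanner will buffer while searching, so it is the natural place for a tunable limit.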
Handle Objects for Large Strings or HexBinary
Large atomic objects of type xs:string and xs:hexBinary cannot be turned into ordinary Java String and Array[Byte]. Rather, they must be some sort of small handle or proxy object. A tunable threshold should be available to tell Daffodil when to create a handle versus an ordinary String or Array[Byte].
Data objects larger than a single JVM object can store (e.g., video or images) may have to be represented in the Infoset by a proxy object. Standard streaming-style events normally produce simple values as regular objects representing the value. If a simple value is larger than a single JVM object can store, then a streaming API to access the value is needed.
The DFDL Infoset doesn't really specify what the [value] member is for a hexBinary object; that is, it does not specify what the API is for accessing this value. Currently it is Array[Byte], but we can provide other abstractions. Also, the [value] member for type xs:string is assumed to be a java.lang.String, but we can provide other abstractions.
When projecting the DFDL infoset into XML, these handle objects would have to show up as the XML serialization of the handle object, with usable members so that other software can access the data the handle is referring to. One example would be that the handle contains a fileName or URI and an offset (type Long) into it, and a length (type Long), and possibly the first N bytes/characters of the data.
This mechanism needs to work both for parsing and unparsing; hence, an API way of constructing these large-data handle objects is needed.
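One possible shape for such a handle is sketched below. Everything here is an assumption made for illustration: the class name LargeDataHandle, the needsHandle threshold check, and the largeData element with its uri, offset, and length attributes are not part of Daffodil or the DFDL specification.

```java
import java.net.URI;

/** Hypothetical sketch: a small proxy for a large xs:string or xs:hexBinary value. */
final class LargeDataHandle {
    private final URI source;    // where the data actually lives
    private final long offset;   // starting position within the source
    private final long length;   // length of the value

    LargeDataHandle(URI source, long offset, long length) {
        if (source == null) throw new IllegalArgumentException("source is required");
        if (offset < 0 || length < 0) throw new IllegalArgumentException("negative offset or length");
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    /** The tunable threshold: values longer than this get a handle instead of an in-memory value. */
    static boolean needsHandle(long valueLength, long threshold) {
        return valueLength > threshold;
    }

    /** Serializes the handle itself (not the data) for projection into XML. */
    String toXml() {
        return "<largeData uri=\"" + escape(source.toASCIIString())
            + "\" offset=\"" + offset
            + "\" length=\"" + length + "\"/>";
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

Because offset and length are Long-sized, the handle can describe a value far larger than any single Java array, and the same constructor serves an application that builds handles for unparsing.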
Infoset elements must be produced incrementally by the parser. These can only be produced once surrounding points of uncertainty are resolved fully. An architecture for this is needed. There may be some limitations.
Cursor-style Pull API
The Daffodil API ProcessorFactory class has an onPath("...") method. (Currently only "/" is allowed as a path.) This is intended to enable a cursor-like behavior if given a path that identifies an array. Successive calls to the DataProcessor's parse method should advance through the data one element of the array at a time, returning an Infoset each time which has as its root the successive InfosetElement items.
A cursor-style API caters to applications that are schema-specific more than generic applications. The notion here is that each parse action is some meaningful (to the schema) chunk of stuff. Dealing with any points of uncertainty that the onPath(...path..) crosses becomes the application's problem in this sort of API.
Because data formats can have points of uncertainty in them, an entirely XML-oriented pull-parser API can be problematic. See section below on Pathological Data Formats.
One design point: the lowest-level API pushes the uncertainty back to the consumer. Establishing a known-to-exist or known-not-to-exist resolution of uncertainty is an event like any other event. So when a discriminator evaluates to true, that produces the discriminator-true event. Basically, the backtracking of the parser is visible to the application as events, not hidden and invisible. Unwinding the stack from a failed element parse is a different kind of end-element event. This allows applications to parse and process flawed data. For example, an application could implement recovery points so that it skips over broken data and tries again from some place in the schema.
Incremental parse events are a lot like the Daffodil debugger's trace output in this case. Callback handlers are the common way to do this, i.e., the application provides an object that implements a particular interface. Then, if the user application wants to use coroutines so that it sees a flat stream of events rather than nested calls, it can do so.
The next layer up can implement a full StAX style API where no event is released until all points-of-uncertainty about it have been resolved. But I suspect many applications are going to want events out for inner elements even though the outermost point of uncertainty is not resolved. They want to be incrementally consuming and processing data even though it may happen later that a parse error indicates the overall structure is not correct. (Classic example: last record contains number of records, and it's not correct. Another example: file is damaged - but way down near the end, and most of the data is good.)
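The two layers can be sketched as a callback interface plus a filter that withholds events until their points of uncertainty are resolved. The names ParseEvents and CertainOnlyEvents, and the event methods themselves, are hypothetical and are not the Daffodil API. Events are recorded as strings only to keep the sketch short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical low-level callback interface: backtracking is visible as events. */
interface ParseEvents {
    void startElement(String name);
    void endElement(String name);
    void beginUncertainty();   // the parser enters a point of uncertainty
    void resolved();           // e.g., a discriminator evaluated to true
    void backtracked();        // events since the matching beginUncertainty are withdrawn
}

/** Hypothetical upper layer: releases an event only once nothing uncertain encloses it. */
final class CertainOnlyEvents implements ParseEvents {
    private final List<String> released;
    // One buffer per open point of uncertainty, innermost first.
    private final Deque<List<String>> pending = new ArrayDeque<>();

    CertainOnlyEvents(List<String> released) {
        this.released = released;
    }

    private void emit(String event) {
        if (pending.isEmpty()) released.add(event); else pending.peek().add(event);
    }

    @Override public void startElement(String name) { emit("start:" + name); }
    @Override public void endElement(String name)   { emit("end:" + name); }
    @Override public void beginUncertainty()        { pending.push(new ArrayList<>()); }

    @Override public void resolved() {
        // Events move out to the enclosing point of uncertainty, or are released.
        for (String event : pending.pop()) emit(event);
    }

    @Override public void backtracked() {
        pending.pop();         // speculative events are discarded
    }
}
```

An application that wants inner events early would implement ParseEvents directly and handle backtracked() itself, and one that wants only settled events would sit behind a filter like CertainOnlyEvents.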
Pathological Data Formats
It's always possible to create a schema where there is a point of uncertainty right near the very top of the data. For both parsing and unparsing this is possible. For unparsing, data where the first record must contain the length in bytes of the entire data contents is the classic example. For parsing, data with deep discriminators is the usual example, i.e., two schemas are v0.1 and v0.2 of some data format, and the only way you can tell the difference is that way down inside there's a date with slashes in it, vs. ISO notation. So the infoset corresponding to the parse of the whole file is pending until you detect that deep detail.
The Bucket Algorithm
(Note: this whole section on the Bucket Algorithm was written before examining the java.util.Scanner class, which applies regex Patterns to a java.io.InputStream. This may be a superior API to use for regex matching than Matcher, but matches/scans are most likely limited in size to the maximum size of a single Java object. To accommodate very large objects, DFDL needs to be able to scan to determine the ending position of objects larger than memory.)