
...

Note: there is code that does this already as part of unit testing for the new Infoset introduced with the DPath expression implementation. This provides a way to create a DFDL Infoset from XML so as to execute DPath expressions against that Infoset.

This code was created with unit testing in mind. Performance was not a consideration at that time.

The "real" runtime XML-to-Infoset conversion is done by objects implementing InfosetCursor, assisted by XMLEventCursor.

The major issues are:

  1. Determining the DFDL schema component that corresponds to a specific unassociated Infoset element, given the presence of xs:choice and optional/array elements.
  2. Inferring arrays, i.e., deciding how many adjacent same-named elements belong to a single array.

...

Determining the DFDL Schema Component that Corresponds to an XML Element

Perhaps the most complex issue for creating a DFDL Infoset, or for unparsing one, is determining which DFDL schema component corresponds to a particular Infoset element. The same problem occurs if a relatively naive program is constructing the DFDL Infoset using the API: when an Infoset element is being created, one must identify the DFDL schema component that corresponds to it. The context for this is the enclosing parent element and any prior sibling elements. Unfortunately, in some cases one also needs some of the following elements.
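
As a rough illustration of the bookkeeping involved (ignoring xs:choice for brevity), the following sketch uses a hypothetical, flattened model in which the enclosing sequence is an ordered list of element declarations with occurrence bounds; optional or already-satisfied declarations that do not match the incoming name are skipped. This is not the Daffodil representation, only an outline of the idea.

```scala
/**
 * Hypothetical, simplified representation: an element declaration with a name
 * and occurrence bounds (maxOccurs = None means unbounded).
 */
final case class ElemDecl(name: String, minOccurs: Int, maxOccurs: Option[Int])

object ResolveSketch {
  /**
   * Given the declarations of the enclosing sequence in order, the name of the
   * incoming Infoset element, and how many occurrences of each declaration have
   * already been matched, find the declaration the new element corresponds to.
   * Optional or already-satisfied declarations that don't match are skipped;
   * a required, unmatched declaration is reported as an error.
   */
  def resolve(
    decls: List[ElemDecl],
    incomingName: String,
    occursSoFar: Map[String, Int]
  ): Either[String, ElemDecl] = decls match {
    case Nil => Left(s"no declaration matches <$incomingName>")
    case d :: rest =>
      val count = occursSoFar.getOrElse(d.name, 0)
      val canTakeMore = d.maxOccurs.forall(count < _)
      if (d.name == incomingName && canTakeMore) Right(d)
      else if (count >= d.minOccurs) resolve(rest, incomingName, occursSoFar)
      else Left(s"required element <${d.name}> is missing before <$incomingName>")
  }
}
```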

...

The algorithm for this was a recent topic of conversation in the DFDL working group (January 2015). It was resolved that the way the elements repeat must be taken into account, including the dfdl:occursCountKind. Specifically, when dfdl:occursCountKind is 'parsed', the min/maxOccurs are not considered, and any number of adjacent elements can be "matched up" to the element declaration of an array. However, when dfdl:occursCountKind is 'implicit' or 'delimited', the unparser must count the number of elements it has output and stop when maxOccurs (if bounded) is reached.
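
A hedged sketch of that rule, using simplified stand-in types rather than the real Daffodil metadata, might look like this:

```scala
object ArrayGroupingSketch {
  sealed trait OccursCountKind
  case object Parsed extends OccursCountKind
  case object Implicit extends OccursCountKind // stands in for 'implicit'/'delimited'

  final case class Decl(name: String, maxOccurs: Option[Int], ock: OccursCountKind)

  /**
   * Split off the run of adjacent, same-named elements that belong to `decl`'s
   * array, returning (arrayMembers, remainingElements).
   */
  def takeArray(decl: Decl, incoming: List[String]): (List[String], List[String]) = {
    val run = incoming.takeWhile(_ == decl.name)
    val taken = decl.ock match {
      // 'parsed': min/maxOccurs are not considered; take the whole adjacent run.
      case Parsed => run
      // 'implicit'/'delimited': stop once maxOccurs (if bounded) is reached.
      case Implicit => decl.maxOccurs.fold(run)(run.take)
    }
    (taken, incoming.drop(taken.length))
  }
}
```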


Note: The test-rig code for testing the DPath expressions constructs arrays of the Infoset blindly. That is, all adjacent elements are coalesced into arrays. This is ultimately incorrect, but may be fine as a first version, and the true array inference that is properly schema-aware can be added later.

The "real" InfosetCursor and XMLEventCursor do this correctly.

Infoset to Data

  • The Unparser's state is class UState. Unlike the early versions of the Parser and PState, the Unparser from the start mutates the UState rather than doing the "functional programming" kind of thing of copying it with changes. The unparse methods do not return a UState object; they modify the one that is passed in (which enforces this contract - see the sketch after this list). Each thread must have its own UState.
  • The Unparser has no limitations on data sizes. This problem is fundamentally easier to solve for unparsing than it is for parsing. Data buffering may still be needed (see discussion of Pending Calculations).
  • The grammar rules part of the middle of Daffodil has some universal productions - they apply whether parsing or unparsing - but some grammar productions are parser- or unparser-specific. This is done with guards on the productions that specify whether the rule applies only to parsing, only to unparsing, or to both. This implies that there are Terminal objects that are parser- or unparser-specific, which is to say they implement only the parser() method or only the unparser() method.
  • Required elements that are missing from the infoset must be added.
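
A minimal sketch of the mutate-in-place contract from the first bullet, using illustrative names rather than the actual UState/Unparser classes:

```scala
// Illustrative only; not the actual Daffodil UState/Unparser classes.
final class UStateSketch(var bitPosition: Long, var childIndex: Int) {
  def advanceBits(n: Long): Unit = { bitPosition += n }
}

trait UnparserSketch {
  // Returns Unit: the state passed in is mutated in place; no copy is returned.
  def unparse(state: UStateSketch): Unit
}

final class FixedLengthUnparserSketch(lengthInBits: Long) extends UnparserSketch {
  def unparse(state: UStateSketch): Unit = {
    // ... write the representation, then advance the single per-thread state.
    state.advanceBits(lengthInBits)
    state.childIndex += 1
  }
}
```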

...

The simplest example is for dfdl:lengthKind 'explicit'. In this case, one (or more) elements will commonly carry dfdl:outputValueCalc properties. The expressions in these properties will reference the Infoset to obtain (typically) length information, by calling the dfdl:valueLength() or dfdl:contentLength() functions.

If one considers the unparsing process as an incremental process that is called in an element-by-element manner, then the unparsing process must be able to suspend output, and the API must allow an element which carries a dfdl:outputValueCalc to be passed to the unparser without the calculation having been performed. These outputValueCalc expressions may reference other infoset elements (directly, or through variables) that are also calculated with dfdl:outputValueCalc. Calculations of the value of variables may also reference forward in Infoset order. So in general these expressions must be evaluated with an on-demand type of algorithm where computations can be suspended until the unparser is fed the non-calculated data elements needed to enable the calculations to proceed.
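
One possible shape for such an on-demand mechanism is sketched below: an expression evaluation is attempted, and if the Infoset values it needs are not yet present, the calculation is parked and retried later once more of the Infoset has been fed to the unparser. The names and the queue-based bookkeeping here are hypothetical; the real mechanism may differ.

```scala
import scala.collection.mutable

object SuspensionSketch {
  /** A pending calculation that yields Some(result) once its inputs exist. */
  final case class Suspension(description: String, tryEval: () => Option[Long])

  private val pending = mutable.Queue[Suspension]()

  /** Try to evaluate now; if the needed Infoset values aren't there yet, park it. */
  def evaluateOrSuspend(s: Suspension): Option[Long] =
    s.tryEval() match {
      case some @ Some(_) => some
      case None =>
        pending.enqueue(s) // the unparser carries on with other elements
        None
    }

  /** Called when new Infoset elements arrive: retry everything still parked. */
  def retryPending(): Unit = {
    val parked = pending.toList
    pending.clear()
    // In a real implementation a successful retry would fill in the output
    // region that was reserved for this calculation; here we only re-park
    // the suspensions that are still blocked.
    parked.foreach { s => if (s.tryEval().isEmpty) pending.enqueue(s) }
  }
}
```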

...

This interferes with the basic concept of recursively walking the Infoset and DFDL Schema, writing out elements based on the DFDL Schema component. 

Design For Test

Task: TDML runner modifications are required. These are roughly symmetric to the parser testing features. The biggest issue complexity-wise is converting an XML-expressed DFDL infoset into an actual DFDL infoset, but this is just a method call for the TDML runner.
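
A hypothetical sketch of that unparser-test flow, with the XML-to-Infoset conversion and the unparse step passed in as functions (the real TDML runner has its own API for both):

```scala
import scala.xml.Elem

object UnparserTestSketch {
  /**
   * Hypothetical flow only: convert the XML-expressed infoset (the "method
   * call" mentioned above), unparse it, and compare against the expected bytes.
   */
  def runUnparserCase(
    infosetXml: Elem,
    expectedData: Array[Byte],
    xmlToInfoset: Elem => AnyRef,
    unparse: AnyRef => Array[Byte]
  ): Boolean = {
    val infoset = xmlToInfoset(infosetXml)
    val actual = unparse(infoset)
    java.util.Arrays.equals(actual, expectedData)
  }
}
```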

Ideas:

  • Some parser tests are invertible. Having parsed data to an Infoset, one can unparse back to data and, for some DFDL schemas, get the identical data. This doesn't work for all DFDL schemas - escape schemes can parse things with, say, surrounding quotes which on unparsing are determined to be unnecessary and so are not output. Also, multiple values are allowed for delimiters, but only the first of these values is used on output, so incoming data that uses one of the other delimiters (not the first) won't unparse to the same delimiters. That said, many tests will be invertible. A flag on TDML parser tests should indicate whether the test can be inverted, and there should be some way to bulk-set this flag so it doesn't have to be done explicitly for every test.
  • Note that unparser tests are much more likely to be invertible. It is possible to create a schema that is asymmetric - what it writes out isn't the same format that it reads in - but this is an atypical corner case rather than a common thing. An example of this is nil ambiguity: data containing "nil,nil,nil" might parse as 3 nilled elements. However, an Infoset containing three strings, each of length 3 and containing "nil", could also output as "nil,nil,nil", and therefore it would not round-trip with the parser.
  • The second time around the loop is much better guaranteed to work. That is: parse data A to Infoset B; unparse B to data C; parse data C to Infoset D; unparse D to data E; parse E to Infoset F. The data E and the data C should match exactly, and the Infoset D and Infoset F should match exactly. The ambiguities are wrung out in the first cycle around this loop (see the sketch after this list).
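
A small sketch of the two-pass check from the last bullet, with parse and unparse treated as opaque functions supplied by the caller:

```scala
object TwoPassRoundTripSketch {
  /** Infoset is opaque here; parse and unparse are supplied by the caller. */
  def check[Infoset](
    dataA: Array[Byte],
    parse: Array[Byte] => Infoset,
    unparse: Infoset => Array[Byte]
  ): Boolean = {
    val infosetB = parse(dataA)   // parse data A to Infoset B
    val dataC = unparse(infosetB) // unparse B to data C
    val infosetD = parse(dataC)   // parse C to Infoset D
    val dataE = unparse(infosetD) // unparse D to data E
    val infosetF = parse(dataE)   // parse E to Infoset F
    // Ambiguities are wrung out in the first cycle, so C==E and D==F must hold.
    java.util.Arrays.equals(dataC, dataE) && infosetD == infosetF
  }
}
```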

...

Other Details

  • Streaming output for large objects - this is symmetric with a parser feature we need, which is the ability of the unparser to accept a large object not as a giant string or hexBinary blob, but as a file descriptor or other specification that can be opened and pulled separately from the Infoset elements.
  • Truncated output when length units is bytes and the encoding is variable width (e.g., utf-8). The issue is truncation that chops off part of a character's code units (see the sketch following this list).
  • Improvements in coding style: smaller Scala code files, smaller TDML files. For parsing there are some giant files and some TDML files that have hundreds of tests in them. We ought not repeat these mistakes.
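
As an illustration of the truncation concern above, here is a sketch of byte-budgeted truncation that never splits a character's UTF-8 code units; the helper name and approach are illustrative only:

```scala
import java.nio.charset.StandardCharsets

object TruncateSketch {
  /** Truncate `s` to at most `maxBytes` of UTF-8 without splitting a character. */
  def truncateToBytes(s: String, maxBytes: Int): String = {
    var end = 0   // index into the string, in UTF-16 code units
    var used = 0  // UTF-8 bytes consumed so far
    while (end < s.length) {
      val cp = s.codePointAt(end)
      val width = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length
      if (used + width > maxBytes) return s.substring(0, end)
      used += width
      end += Character.charCount(cp)
    }
    s
  }
}
```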