This page is a set of notes about the design of the Daffodil Unparser, including how to test it and how it works internally.

(Note: this page uses the future tense. Once the Unparser exists that will no longer make sense, so those passages should be fixed, or this whole page replaced.)

Terminology

Lookahead, Forward Reference and Infoset Order

...

When constructing a DFDL Infoset from XML (as opposed to constructing it programmatically), some XML may carry xsi:type attributes, e.g.,

    <foo xsi:type="xs:int">5</foo>

...

For unparsing, this term means the Infoset without any hidden elements created, and without any dfdl:outputValueCalc elements computed. Furthermore, elements have not been padded/filled or truncated to their specified length.

Current and Future Infoset

The unparser is incremental, meaning it does not require that the entire Infoset (pre-augmentation) be present at the start of unparsing. Instead, it can be fed/called with Infoset elements incrementally.

In this case, the current Infoset is the part of the Infoset already available to the unparser. The future Infoset means the Infoset that will occur in the future once additional elements have been added to the current Infoset.

Note that the Infoset grows monotonically. There is no 'taking back' of Infoset elements. (The implementation can optimize use of memory by discarding parts of the current infoset that are no longer needed.)
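
The monotonic-growth property can be illustrated with a small sketch. This is a toy model only: the class and method names (CurrentInfoset, add, discard_consumed) are hypothetical and are not part of Daffodil's API. It shows that discarding retained elements to save memory does not "take back" anything from the Infoset.

```python
from collections import deque


class CurrentInfoset:
    """Toy append-only view of the Infoset seen so far (hypothetical, not Daffodil's API)."""

    def __init__(self):
        self._retained = deque()  # elements still held in memory
        self._total = 0           # count of all elements ever added

    def add(self, name, value):
        # Growth is monotonic: elements are only ever appended.
        self._retained.append((name, value))
        self._total += 1

    def discard_consumed(self, count):
        # Memory optimization only: dropping retained elements does not
        # remove them from the Infoset, so the total never decreases.
        for _ in range(min(count, len(self._retained))):
            self._retained.popleft()

    @property
    def total(self):
        return self._total

    @property
    def retained(self):
        return list(self._retained)
```

Here the "future Infoset" is simply whatever later calls to add will supply; the sketch never needs it to exist up front.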

Basic Algorithms

Daffodil's parser performs two steps:

...

Archives of the DFDL Workgroup email contain a number of discussions about xs:choice and inferring the right choice-arm given an unassociated Infoset Element.

The unparser must look ahead by one subsequent element in these situations:

  1. to determine which alternative arm of a choice a particular element is in
  2. to determine if an element instance is the last element of an array
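
As a rough illustration of one-element lookahead, the sketch below pairs each incoming element with its successor. The function names are hypothetical, and the array rule used here (a run of adjacent same-named elements is one array) is a simplifying assumption that does not hold for every schema.

```python
def with_lookahead(events):
    """Yield (current, next_or_None) pairs over an iterable of Infoset events."""
    it = iter(events)
    try:
        current = next(it)
    except StopIteration:
        return
    for upcoming in it:
        yield current, upcoming
        current = upcoming
    yield current, None


def mark_last_of_array(names):
    """Flag each element that is the last in a run of same-named elements.

    Assumption: a run of adjacent same-named elements forms one array.
    """
    return [(cur, nxt != cur) for cur, nxt in with_lookahead(names)]
```

For the names foo, foo, bar this flags the second foo and the bar as "last", which is only decidable once the following element (or end of input) has been seen.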

... TBD discussion here ....

Inferring Arrays

This is a sub-issue of determining the DFDL schema component. Specifically, it is the issue of determining when <foo>5</foo><foo>6</foo> are two elements of the same array, or two separate elements, each either scalar or part of a different array. DFDL and XML Schema specifically allow for schemas like this:

    <element name="foo" type="int" maxOccurs="2"/>
    <element name="bar" type="string" minOccurs="0"/>
    <element name="foo" type="int"/>
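
To make the ambiguity concrete, here is a small sketch that enumerates every way a flat list of element names could be assigned to the three declarations above. It is a brute-force illustration under simplifying assumptions (names only, no values or types), and the function and constant names are hypothetical; it is not how the unparser is expected to work.

```python
def assignments(names, slots):
    """Return every way to split `names` across `slots`.

    `slots` is a list of (name, min_occurs, max_occurs) in schema order.
    Each result lists how many consecutive elements each slot consumes.
    """
    if not slots:
        return [[]] if not names else []
    name, lo, hi = slots[0]
    results = []
    for count in range(lo, hi + 1):
        taken = names[:count]
        if len(taken) < count or any(n != name for n in taken):
            break  # cannot take this many, so cannot take more either
        for rest in assignments(names[count:], slots[1:]):
            results.append([count] + rest)
    return results


# The three declarations from the schema fragment above.
SLOTS = [("foo", 1, 2), ("bar", 0, 1), ("foo", 1, 1)]
```

With two foo elements the only valid assignment is one array element followed by the scalar foo, whereas with three foo elements the second foo belongs to the array. So the role of the second foo cannot be decided without knowing what follows it.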

...

  • Some parser tests are invertible. Having parsed data to an Infoset, one can unparse back to data and, for some DFDL schemas, get the identical data. This doesn't work for all DFDL schemas. Escape schemes can parse data with, say, surrounding quotes that on unparsing are determined to be unnecessary and so are not output. Also, multiple values are allowed for delimiters, but the first of these values is used on output, so incoming data that uses one of the other delimiters (not the first) won't unparse to the same delimiters. That said, many tests will be invertible. A flag on TDML parser tests should indicate whether the test can be inverted. There should be some way to bulk-set this flag so it doesn't have to be set explicitly on each test.
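
The delimiter case can be shown with a toy round trip. The functions below are hypothetical stand-ins, not Daffodil or TDML code: parsing accepts any of the listed delimiters, while unparsing always writes the first one.

```python
import re


def toy_parse(data, delimiters):
    """Split on any of the allowed delimiters."""
    pattern = "|".join(re.escape(d) for d in delimiters)
    return re.split(pattern, data)


def toy_unparse(fields, delimiters):
    """Join using only the first delimiter, as described for output."""
    return delimiters[0].join(fields)


def is_invertible(data, delimiters):
    """True when parse followed by unparse reproduces the input exactly."""
    return toy_unparse(toy_parse(data, delimiters), delimiters) == data
```

Data that already uses the first delimiter round-trips unchanged, while data that uses one of the alternatives does not, which is the kind of test that would need its invertibility flag turned off.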

TBD

  • Streaming output - pending calculations interfere with this. For large objects, this is symmetric with a parser feature we need: the ability of the unparser to accept a large object not as a giant string or hexBinary blob, but as a file descriptor or other specification that can be opened and pulled separately from the Infoset elements.
  • Truncated output when length units is bytes and encoding is variable width (e.g., utf-8). The issue is truncation that cuts a character off part way through its code units.
  • Improvements in coding style: smaller Scala code files, smaller TDML files - for parsing there are some giant files and some TDML files that have hundreds of tests in them. We ought not repeat these mistakes.
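
The truncation item above can be illustrated with UTF-8. The sketch below shows one possible policy, backing up to the nearest character boundary; whether that is the right behavior for the unparser is still open, and the function name is hypothetical.

```python
def truncate_utf8(text, max_bytes):
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # Continuation bytes have the form 0b10xxxxxx; back up until the
    # byte at the cut position starts a new character.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]
```

A naive slice of the encoded bytes at max_bytes could end in the middle of a multi-byte character and leave bytes that no longer decode; this version returns fewer bytes than requested instead, which in turn raises the question of how the remaining length should be filled.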

...