...

  • The Unparser's state is class UState. Unlike the early versions of the Parser and PState, the Unparser mutates the UState from the start rather than taking the "functional programming" approach of copying it with changes. The unparse methods do not return a UState object, which enforces this contract; they modify the one that is passed in. Each thread must have its own UState.
  • The Unparser has no limitations on data sizes. This problem is fundamentally easier to solve for unparsing than it is for parsing. Data buffering may still be needed (see discussion of Pending Calculations).
  • The grammar rules in the middle of Daffodil include some universal productions, which apply whether parsing or unparsing, and some productions that are parser specific or unparser specific. This is done with guards on the productions that specify whether the rule applies only to parsing, only to unparsing, or to both. This implies that there are Terminal objects that are parser or unparser specific, which is to say they implement only the parser() method or only the unparser() method.
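
The two points above (a UState that is mutated in place, and guards that decide whether a production applies when unparsing) can be sketched as follows. This is only a minimal illustration under stated assumptions: the names Direction, Production, LiteralUnparser, and unparseAll are hypothetical, and the UState and Unparser shown here are simplified stand-ins rather than Daffodil's actual classes.

```scala
object UnparserSketch {
  // Simplified stand-in for the unparser state: mutable, one instance per thread.
  final class UState {
    val out = new StringBuilder
  }

  // Hypothetical guard values saying where a production applies.
  sealed trait Direction
  case object ParseOnly extends Direction
  case object UnparseOnly extends Direction
  case object Both extends Direction

  trait Unparser {
    // Returns Unit, so an implementation cannot hand back a modified copy;
    // it can only modify the state that is passed in.
    def unparse(state: UState): Unit
  }

  // A trivial unparser that writes fixed text to the output.
  final class LiteralUnparser(text: String) extends Unparser {
    def unparse(state: UState): Unit = state.out.append(text)
  }

  // A production with a guard. Parser-only productions carry no unparser.
  final case class Production(name: String, direction: Direction, unparser: Option[Unparser])

  // Run every production whose guard allows unparsing, in order.
  def unparseAll(prods: Seq[Production], state: UState): Unit =
    for (p <- prods if p.direction != ParseOnly; u <- p.unparser) u.unparse(state)

  // Usage: the parse-only production is skipped, the other two write to the state.
  def demo(): String = {
    val state = new UState
    unparseAll(
      Seq(
        Production("initiator", Both, Some(new LiteralUnparser("["))),
        Production("inputOnlyRule", ParseOnly, None),
        Production("terminator", UnparseOnly, Some(new LiteralUnparser("]")))
      ),
      state
    )
    state.out.toString
  }
}
```

Because unparse returns Unit, the only way for a caller to observe the result is through the UState it passed in, which is the contract described above.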

Incremental Unparsing - Pending Calculations and Forward Reference

...

Task: TDML runner modifications are required. These are roughly symmetric to the parser testing features. The biggest issue complexity-wise is converting an XML-expressed DFDL infoset into an actual DFDL infoset, but this is just a method call for the TDML runner.

Ideas: (some half baked?)

  • Some parser tests are invertible. Having parsed data to an Infoset, one can unparse back to data and, for some DFDL schemas, get the identical data. This doesn't work for all DFDL schemas. Escape schemes can parse things with, say, surrounding quotes which on unparsing are determined to be unnecessary and so are not output. Also, multiple values are allowed for delimiters, but only the first of these values is used on output, so incoming data that uses one of the other delimiters won't unparse to the same delimiters. That said, many tests will be invertible. A flag on TDML parser tests should indicate whether the test can be inverted. There should also be some way to bulk-set this flag so it doesn't have to be set explicitly on every test.
  • Note that unparser tests are much more likely to be invertible. It is possible to create a schema that is asymmetric - what it writes out isn't the same format that it reads in. But this is an atypical corner case rather than a common thing. An example of this is Nil ambiguity with "empty" values. Nil might be represented by the empty string, which might not have quotes, so in comma delimited data several adjacent nil fields may be ",,,," but that might parse as several empty values which take a default value such as 0. Data containing "nil,nil,nil" might parse as 3 nilled elements. However, an Infoset containing three strings, each of length 3 and containing "nil", could output as "nil,nil,nil", and therefore it would not round-trip with the parser.
  • The second loop around is better guaranteed to work. Meaning that you parse data A to Infoset B. You unparse B to data C, you parse data C to Infoset D. You unparse D to data E which you parse to Infoset F. The data E and the data C should match exactly, and the Infoset D and Infoset F should match exactly. The ambiguities are wrung out in the first cycle around this loop.
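
The second-loop check above can be sketched as follows. This is a minimal illustration using a toy format, not the TDML runner: the parse and unparse functions, the Infoset type alias, and the Result class are all hypothetical. The toy format accepts either "," or ";" as a delimiter on input but always writes ",", which mirrors the delimiter asymmetry described in the first idea.

```scala
object RoundTripSketch {
  // Toy infoset (assumption): just the sequence of field values.
  type Infoset = Vector[String]

  // Toy parser: accepts either ',' or ';' as the field delimiter.
  def parse(data: String): Infoset = data.split("[,;]", -1).toVector

  // Toy unparser: always writes the first delimiter, ','.
  def unparse(infoset: Infoset): String = infoset.mkString(",")

  final case class Result(firstPassIdentical: Boolean, secondPassStable: Boolean)

  // Parse A to B, unparse B to C, parse C to D, unparse D to E, parse E to F.
  // The first pass may change the data (A vs C); the second pass should not
  // (C vs E, and D vs F).
  def check(dataA: String): Result = {
    val infosetB = parse(dataA)
    val dataC = unparse(infosetB)
    val infosetD = parse(dataC)
    val dataE = unparse(infosetD)
    val infosetF = parse(dataE)
    Result(
      firstPassIdentical = dataA == dataC,
      secondPassStable = dataC == dataE && infosetD == infosetF
    )
  }
}
```

With this toy format, check("a;b,c") reports that the first pass is not identical (the ";" becomes ",") while the second pass is stable, whereas check("a,b,c") is identical on the first pass as well.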

TBD

  • Streaming output for large objects - the ability of the unparser to accept a large object not as a giant string or hexBinary blob, but as a file descriptor or other specification that can be opened and pulled separately from the Infoset elements. This is symmetric with a parser feature we need.
  • Truncated output when length units is bytes and encoding is variable width (e.g., utf-8). The issue is that truncation can cut a character off part way through its code units.
  • Improvements in coding style: smaller Scala code files, smaller TDML files - for parsing there are some giant files and some TDML files that have hundreds of tests in them. We ought not repeat these mistakes.
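
To make the truncation item above concrete, here is a minimal sketch of one possible policy: when a byte limit falls in the middle of a UTF-8 character, back up to the start of that character so no partial character is written. The object and function names are hypothetical, and the policy itself is an assumption. These notes leave the behavior open, including what should happen to the bytes left unused after backing up.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object TruncateSketch {
  // Encode s as UTF-8 and keep at most maxBytes bytes, never splitting a character.
  def truncateUtf8(s: String, maxBytes: Int): Array[Byte] = {
    val bytes = s.getBytes(UTF_8)
    if (bytes.length <= maxBytes) bytes
    else {
      var end = math.max(maxBytes, 0)
      // UTF-8 continuation bytes have the form 10xxxxxx. If the first dropped
      // byte is a continuation byte, the cut is inside a character, so back up
      // until the cut lands on a character boundary.
      while (end > 0 && (bytes(end) & 0xC0) == 0x80) end -= 1
      bytes.take(end)
    }
  }
}
```

For example, "h\u00e9llo" encodes to 6 bytes because the second character takes two bytes. Truncating it to 2 bytes with this sketch keeps only the 1-byte "h", since a cut at 2 would fall between the two bytes of that character. Truncating to 3 bytes keeps both characters.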

...