You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 12 Next »

This page for a set of notes about the design of the Daffodil Unparser. This includes how to test it, and how it works internally.

(Note: there is use of future tense in this page - once the Unparser exists that will no longer make sense so should be fixed or those aspects or this whole page replaced.)

 

See also the attachments to this page and

Terminology

Lookahead, Forward Reference and Infoset Order

The DFDL Infoset has an order to the elements stored in it. Normally one thinks of the elements as being unparsed in this order. This order must be consistent with the DFDL schema. When the unparser must look at an element that is later in infoset order that is called lookahead or forward reference.  When the unparse must look at elements earlier in this order that is lookbehind or backward reference.

There is a real asymmetry between parser and unparser behavior here as the parser has no notion of forward reference, only backward.

Programatic Construction of Infoset

For unparsing, a DFDL Infoset can be created programatically by a DFDL-schema-aware program using the Daffodil Infoset API.

Task(s): design and documentation of this API - includes Scala and Java versions of this API, and javadoc and scaladoc for them, and unit tests that drive these APIs in Scala and Java.

The alternative way to construct a DFDL Infoset for unparsing is to create the Infoset from XML data.

Infoset Creation Error

This is an error that occurs when converting XML into a DFDL Infoset. This includes

  1. XML not well formed
  2. XML not schema-structured
  3. XML not schema-valid

Being schema-structured means that for each XML element there is a corresponding DFDL schema element declaration, and that the value of the element (simple types) or complex content of the element (complex types) corresponds to the type given in the DFDL schema. For example, if the XML is well formed <x>foo</x>, but the corresponding element declaration for element 'x' has type xs:int, then since "foo" is not a valid xs:int, this XML is not schema-structured for this DFDL Schema.

Schema-structured is about the situations that the parser would view as a parse error.  From a parser perspective, if "foo" is in the data, and the parser encounters it while trying to parse element 'x' of type xs:int, then a parse error will occur.  Symmetric to this when unparsing is if the XML being converted into a DFDL Infoset has this same kind of type error.

General validity when unparsing corresponds to general validity checking in the Daffodil parser. This includes checking the facets and the number of occurrences.

An exception would be if that element happens to have an assertion/discriminator and calls the dfdl:checkConstraints function. This tells the parser to treat the facet compliance as a parse error. There is no unparser feature corresponding to this as assertions/discriminators are not evaluated when unparsing.

Unassociated Infoset Element

During the construction of a DFDL Infoset, the intermediate state where one knows the element name and namespace, but not the corresponding DFDL Schema Component. In this state the value, for simple types, is just a string that has not yet been converted into any typed Infoset value object.

When constructing a DFDL Infoset from XML (as opposed to programatically constructing it), some XML may carry xsi:type attributes e.g.,

    <foo xsi:type="xs:int">5</foo>

In this case, the value "5" may be converted into an xs:int type; however, whether this correctly corresponds to a DFDL schema element declaration's type or not has not been determined as the association of the element to the corresponding part of the DFDL Schema has not yet been performed.

Augmented and Unaugmented Infoset (for Unparsing)

For unparsing, the unaugmented infoset means the Infoset without any hidden elements created, and without any dfdl:outputValueCalc elements computed. Furthermore, elements have not been padded/filled or truncated to their specified length. Also, there may be required elements that are missing and these need to be created using default values.

Current and Future Infoset

The unparser is incremental meaning it does not require that the entire Infoset (pre-augmentation) is present at the start of unparsing, but operates in a manner where it can be fed/called with Infoset elements incrementally.

In this case, the current Infoset is the part of the Infoset already available to the unparser. The future Infoset means the Infoset that will occur in the future once additional elements have been added to the current Infoset.

Note that the Infoset grows monotonically. There is no 'taking back' of Infoset elements. (The implementation can optimize use of memory by discarding parts of the current infoset that are no longer needed.)

Basic Algorithms

Daffodil's parser performs two steps:

  1. Data to Infoset: data parsed into Daffodil's version of the DFDL Infoset
  2. Infoset to XML: DFDL Infoset converted into XML. (or JSON see DFDL-1210 JIRA)

Symmetric to this, the unparser performs these:

  1. XML to Infoset: XML (or JSON)  parsed into Daffodil's version of the DFDL Infoset
  2. Infoset to Data: DFDL Infoset unparsed into data.

This symmetry is not very deep, as the algorithms and actions taken during these activities are quite different. Consider: when parsing, converting data to Infoset is the hard part. Given the Infoset, producing XML from it is quite easy. Unparsing is not the same. The symmetric activity of converting XML to a DFDL Infoset is more complex, but the symmetric activity of unparsing DFDL Infoset to data is in many ways a bit simpler than parsing data to Infoset.

XML to Infoset

Converting XML into a DFDL Infoset has a few complexities.

Note: there is code that does this as part of unit testing for the new Infoset introduced with the DPath expression implementation. This provides a way to create a DFDL Infoset from XML so as to execute DPath expressions against that Infoset.

This code was created with unit testing in mind. Performance was not a consideration at that time.

The "real" runtime XML-to-Infoset conversion is done by objects implementing InfosetCursor, assisted by XMLEventCursor.

The major issues are:

  1. Determining the DFDL schema component that corresponds to a specific unassociated Infoset element, given the presence of xs:choice and optional/array elements.
  2. Inferring arrays

It is not possible to construct a correct DFDL Infoset from XML without the DFDL Schema, whether from XML or programatically.

Determining the DFDL Schema Component that Corresponds to an XML Element

One complex issue for creating a DFDL Infoset, or for unparsing one, is determining which DFDL schema component corresponds to a particular Infoset element. The same problem occurs if a relatively naive program is constructing the DFDL Infoset using the API. When a Infoset element is being created, one must identify the DFDL schema component that corresponds to it. The context for this is the enclosing parent element, and any prior sibling elements. Unfortunately, one also needs some following elements in some cases.

The unparser must look ahead by one subsequent element in these situations:

  1. to determine which alternative arm of a choice a particular element is in
  2. to determine if an element instance is the last element of an array

The algorithm for (1) is described in the DFDL Specification document (Section 15.1.3) as:

On unparsing there is the question of how one identifies the appropriate schema choice branch corresponding to the data in the infoset. This is complicated by the fact that the children may not be elements. They may themselves be sequences or choices.The selection of the choice branch is as follows: The element in the infoset is used to search the choice branches in the schema, in schema definition order, but without looking inside any complex elements. If the element occurs in a branch, then that branch is selected and if subsequently a processing error occurs, this selection is not revisited (that is, there is no backtracking).

To avoid any unintended behavior, all the children of a choice can be modeled as elements.


Daffodil pre-computes for each branch of a choice, the element name and namespace that distinguish that branch.  This is always unambiguous thanks to XML Schema's UPA rules. (It is a Schema Definition Error if there is ambiguity here.)  Hence, when the unparser encounters an element in the infoset that corresponds to a choice in the DFDL schema, the name and namespace comparison is quick and does not involve the search process described in the DFDL specification - that analysis is done at the time of schema compilation.

The Unparser corresponding to a choice contains this name and namespace lookup table (in the unparser runtime data structure), and recursively invokes the appropriate sub-unparser.

The algorithm for (2) is relatively simple. For some values of the dfdl:occursCountKind property, the array unparser must count, and when the maximum number of element occurences has been unparsed, it then ends the repeating/array unparser (successfully - there is no backtracking in unparsing) and begins populating the DFDL infoset with whatever is next after the array.

Inferring Arrays

This is a sub-issue of determining the DFDL schema component. Specifically it is the issue of determining when <foo>5</foo><foo>6</foo> are two elements of the same array, or two separate elements either scalar or of different arrays. DFDL and XML Schema specifically allow for schemas like this:

    <element name="foo" type="int" maxOccurs="2"/>
<element name="bar" type="string" minOccurs="0"/>
<element name="foo" type="int"/>

In this situation, since the element named "bar" is optional, it is ambiguous which of the two element declarations named "foo" is the one that a particular instance should be associated to.

The algorithm for this was a recent topic of conversation in the DFDL working group (January 2015). It was resolved that the nature of the repeating of the elements must be taken into account, including the dfdl:occursCountKind. Specifically, when dfdl:occursCountKind is 'parsed', then the min/maxOccurs are not considered, and any number of adjacent elements can be "matched up" to the element declaration of an array. However, when the dfdl:occursCountKind is implicit, delimited, then the unparser must count the number of elements it has output, and stop when maxOccurs (if bounded) is reached.


Note: The test-rig code for testing the DPath expressions constructs arrays of the Infoset blindly. That is, all adjacent elements are coalesced into arrays. The "real" InfosetCursor and XMLEventCursor do this correctly.

Infoset to Data

  • The Unparser's state is class UState. Unlike the early versions of the Parser & PState, the Unparser from the start mutates the UState rather than doing the "functional programming" kind of thing - copying it with changes. The unparse methods do not return a UState object. They modify the one that is passed in (which enforces this contract). Each thread must have its own UState.
  • The Unparser has no limitations on data sizes. This problem is fundamentally easier to solve for unparsing than it is for parsing. Data buffering may still be needed (see discussion of Pending Calculations).
  • The grammar rules part of the middle of Daffodil has some universal productions - they apply whether parsing or unparsing, but some grammar productions are parser or unparser specific. This is done with guards on the productions that specify whether the rule applies only to parsing, to unparsing, or both. This implies that there are Terminal objects that are parser or unparser specific, which is to say they have an implementation of only the parser() method or the unparser() method.
  • Required elements that are missing from the infoset must be added.

Incremental Unparsing - Pending Calculations and Forward Reference

There is an attachment to this page which is a powerpoint slide deck which describes the Daffodil unparser behavior for an outputValueCalc scenario.

output-value-calc-scenario.pdf or the powerpoint source: output-value-calc-scenario.pptx

When unparsing, the situation can arise where one must write out an element, but the value of that element is to be computed from one or more other elements that appear later in Infoset order.

The simplest example is for dfdl:lengthKind 'explicit'. In this case, one (or more) elements will commonly carry dfdl:outputValueCalc properties. The expressions in these properties will reference the Infoset to obtain (typically) length information, by calling the dfdl:valueLength() or dfdl:contentLength() functions.

If one considers the unparsing process as an incremental process that is called in an element-by-element manner, then the unparsing process must be able to suspend output, and the API must allow an element which carries a dfdl:outputValueCalc to be passed to the unparser, without the calculation having been performed. These outputValueCalc expressions may reference other infoset elements that are also calculated with dfdl:outputValueCalc. Calculations of the value of variables may also reference forward in Infoset order. So in general these expressions must be evaluated with a on-demand type of algorithm where computations can be suspended until the unparser is fed the non-calculated data elements needed to enable the calculations to proceed.

In addition, some calculations will need characteristics of the physical unparsed representation of data, where that data is later in Infoset order. This requires not only that the calculation is deferred until that Infoset element is available, it also requires that one buffer up output data. For example consider dfdl:lengthUnits 'bytes' for an element of type xs:string with dfdl:encoding 'utf-8'. In this case, one must unparse the string into a buffer so that one can count the number of bytes.

A simpler case would be a textual number, but with grouping separators. E.g., 1,000,000.00. The Infoset value is just the integer 1000000. The fact that this requires the insertion of two commas as grouping separators is value-dependent.

The DFDL Specification refers to the unpadded data (but having interior-things like grouping separators) as the Value Region, and the Padded/Filled data as the Content Region, which surrounds and encompasses the Value Region.

Length Expressions

The DFDL Workgroup has clarified that when dfdl:lengthKind is 'explicit' and the dfdl:length is an expression, that this expression is executed both when parsing and when unparsing. When unparsing this calculation gives the length to which the data is truncated (when truncation is allowed) or to which it is padded/filled before unparsing.

This interferes with the basic concept of recursively walking the Infoset and DFDL Schema, writing out elements based on the DFDL Schema component. 

Design For Test

Ideas:

  • Some parser tests are invertible. Having parsed data to an Infoset, one can unparse back to data and for some DFDL schemas, get the identical data. This doesn't work for all DFDL schemas - escape schemes can parse things with say, surrounding quotes which on unparsing are determined to be unnecessary and so are not output. Also multiple values are allowed for delimiters, but the first of these values is used on output, so incoming data that uses one of the other delimiters (not the first) won't unparse to the same delimiters. That said, many tests will be invertible.  A flag on TDML parser tests should indicate whether the test can be inverted.
  • Note that unparser tests are much more likely to be invertible. It is possible to create a schema that is asymmetric - what it writes out isn't the same format that it reads in. But this is an atypical corner case rather than a common thing. An example of this is Nil ambiguity. Data containing "nil,nil,nil" might parse as 3 nilled elements. However, an Infoset containing three strings each of length 3 containing "nil", could output as "nil,nil,nil", and therefor it would not round-trip with the parser.
  • The second loop around is better guaranteed to work. Meaning that you parse data A to Infoset B. You unparse B to data C, you parse data C to Infoset D. You unparse D to data E which you parse to Infoset F. The data E and the data C should match exactly, and the Infoset D and Infoset F should match exactly. The ambiguities are wrung out in the first cycle around this loop.

Other Details

  • Truncated output when length units is bytes and encoding is variable width (e.g., utf-8). The issue is truncating that chops the code units of a character off part way through.

 

  • No labels