You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 5 Next »

This page for a set of notes about the design of the Daffodil Unparser. This includes how to test it, and how it works internally.

(Note: there is use of future tense in this page - once the Unparser exists that will no longer make sense so should be fixed or those aspects or this whole page replaced.)

Terminology

Lookahead, Forward Reference and Infoset Order

The DFDL Infoset has an order to the elements stored in it. Normally one thinks of the elements as being unparsed in this order. This order must be consistent with the DFDL schema. When the unparser must look at an element that is later in infoset order that is called lookahead or forward reference.  When the unparse must look at elements earlier in this order that is lookbehind or backward reference.

There is a real asymmetry between parser and unparser behavior here as the parser has no notion of forward reference, only backward.

Programatic Construction of Infoset

For unparsing, a DFDL Infoset can be created programatically by a DFDL-schema-aware program using the Daffodil Infoset API.

Task(s): design and documentation of this API - includes Scala and Java versions of this API, and javadoc and scaladoc for them, and unit tests that drive these APIs in Scala and Java.

The alternative way to construct a DFDL Infoset for unparsing is to create the Infoset from XML data.

Infoset Creation Error

This is an error that occurs when converting XML into a DFDL Infoset. This includes

  1. XML not well formed
  2. XML not schema-structured
  3. XML not schema-valid

Being schema-structured means that for each XML element there is a corresponding DFDL schema element declaration, and that the value of the element (simple types) or complex content of the element (complex types) corresponds to the type given in the DFDL schema. For example, if the XML is well formed <x>foo</x>, but the corresponding element declaration for element 'x' has type xs:int, then since "foo" is not a valid xs:int, this XML is not schema-structured for this DFDL Schema.

Schema-structured is about the situations that the parser would view as a parse error.  From a parser perspective, if "foo" is in the data, and the parser encounters it while trying to parse element 'x' of type xs:int, then a parse error will occur.  Symmetric to this when unparsing is if the XML being converted into a DFDL Infoset has this same kind of type error.

General validity when unparsing corresponds to general validity checking in the Daffodil parser. This includes checking the facets and the number of occurrences.

An exception would be if that element happens to have an assertion/discriminator and calls the dfdl:checkConstraints function. This tells the parser to treat the facet compliance as a parse error. There is no unparser feature corresponding to this as assertions/discriminators are not evaluated when unparsing.

Unassociated Infoset Element

During the construction of a DFDL Infoset, the intermediate state where one knows the element name and namespace, but not the corresponding DFDL Schema Component. In this state the value, for simple types, is just a string that has not yet been converted into any typed Infoset value object.

When constructing a DFDL Infoset from XML (as opposed to programatically constructing it), some XML may carry xsi:type attributes e.g.,

    <foo xsi:type="xs:int">5</foo>

In this case, the value "5" may be converted into an xs:int type; however, whether this correctly corresponds to a DFDL schema element declaration's type or not has not been determined as the association of the element to the corresponding part of the DFDL Schema has not yet been performed.

Augmented Infoset (for Unparsing)

For unparsing, this term means the Infoset without any hidden elements created, and without any dfdl:outputValueCalc elements computed. Furthermore, elements have not been padded/filled or truncated to their specified length.

Current and Future Infoset

The unparser is incremental meaning it does not require that the entire Infoset (pre-augmentation) is present at the start of unparsing, but operates in a manner where it can be fed/called with Infoset elements incrementally.

In this case, the current Infoset is the part of the Infoset already available to the unparser. The future Infoset means the Infoset that will occur in the future once additional elements have been added to the current Infoset.

Note that the Infoset grows monotonically. There is no 'taking back' of Infoset elements. (The implementation can optimize use of memory by discarding parts of the current infoset that are no longer needed.)

Basic Algorithms

Daffodil's parser performs two steps:

  1. Data to Infoset: data parsed into Daffodil's version of the DFDL Infoset
  2. Infoset to XML: DFDL Infoset converted into XML. (or JSON see DFDL-1210 JIRA)

Symmetric to this, the unparser performs these:

  1. XML to Infoset: XML (or JSON)  parsed into Daffodil's version of the DFDL Infoset
  2. Infoset to Data: DFDL Infoset unparsed into data.

This symmetry is not very deep, as the algorithms and actions taken during these activities are quite different. Consider: when parsing, converting data to Infoset is the hard part. Given the Infoset, producing XML from it is quite easy. Unparsing is not the same. The symmetric activity of converting XML to a DFDL Infoset is more complex, but the symmetric activity of unparsing DFDL Infoset to data is in many ways a bit simpler than parsing data to Infoset.

XML to Infoset

Converting XML into a DFDL Infoset has a few complexities.

Note: there is code that does this already as part of unit testing for the new Infoset introduced with the DPath expression implementation. This provides a way to create a DFDL Infoset from XML so as to execute DPath expressions against that Infoset.

This code was created with unit testing in mind. Performance was not a consideration at that time.

The major issues are:

  1. Determining the DFDL schema component that corresponds to a specific unassociated Infoset element, given the presence of xs:choice and optional/array elements.
  2. Inferring arrays

It is not possible to construct a correct DFDL Infoset from XML without the DFDL Schema, whether from XML or programatically.

Determining the DFDL Schema Component that Corresponds to an XML Element

Perhaps the most complex issue for creating a DFDL Infoset, or for unparsing one, is determining which DFDL schema component corresponds to a particular Infoset element. The same problem occurs if a relatively naive program is constructing the DFDL Infoset using the API. When a Infoset element is being created, one must identify the DFDL schema component that corresponds to it. The context for this is the enclosing parent element, and any prior sibling elements. Unfortunately, one also needs some following elements in some cases.

Archives of the DFDL Workgroup email contain a number of discussions about xs:choice and inferring the right choice-arm given an unassociated Infoset Element.

The unparser must look ahead by one subsequent element in these situations:

  1. to determine which alternative arm of a choice a particular element is in
  2. to determine if an element instance is the last element of an array

... TBD discussion here ....

Inferring Arrays

This is a sub-issue of determining the DFDL schema component. Specifically it is the issue of determining when <foo>5</foo><foo>6</foo> are two elements of the same array, or two separate elements either scalar or of different arrays. DFDL and XML Schema specifically allow for schemas like this:

    <element name="foo" type="int" maxOccurs="2"/>
<element name="bar" type="string" minOccurs="0"/>
<element name="foo" type="int"/>

In this situation, since the element named "bar" is optional, it is ambiguous which of the two element declarations named "foo" is the one that a particular instance should be associated to.

The algorithm for this was a recent topic of conversation in the DFDL working group (January 2015). It was resolved that the nature of the repeating of the elements must be taken into account, including the dfdl:occursCountKind. Specifically, when dfdl:occursCountKind is 'parsed', then the min/maxOccurs are not considered, and any number of adjacent elements can be "matched up" to the element declaration of an array. However, when the dfdl:occursCountKind is implicit, delimited, then the unparser must count the number of elements it has output, and stop when maxOccurs (if bounded) is reached.


Note: The test-rig code for testing the DPath expressions constructs arrays of the Infoset blindly. That is, all adjacent elements are coalesced into arrays. This is ultimately incorrect, but may be fine as a first version, and the true array inference that is properly schema-aware can be added later.

Infoset to Data

  • The Unparser's state is class UState. Unlike the early versions of the Parser & PState, the Unparser from the start mutates the UState rather than doing the "functional programming" kind of thing - copying it with changes. The unparse methods do not return a UState object. They modify the one that is passed in. Each thread must have its own UState.
  • The Unparser has no limitations on data sizes. This problem is fundamentally easier to solve for unparsing than it is for parsing. Data buffering may still be needed (see discussion of Pending Calculations).

Incremental Unparsing - Pending Calculations and Forward Reference

When unparsing, the situation can arise where one must write out an element, but the value of that element is to be computed from one or more other elements that appear later in Infoset order.

The simplest example is for dfdl:lengthKind 'explicit'. In this case, one (or more) elements will commonly carry dfdl:outputValueCalc properties. The expressions in these properties will reference the Infoset to obtain (typically) length information.

If one considers the unparsing process as an incremental process that is called in an element-by-element manner, then the unparsing process must be able to suspend output, and the API must allow an element which carries a dfdl:outputValueCalc to be passed to the unparser, without the calculation having been performed. These outputValueCalc expressions may reference other infoset elements (directly, or through variables) that are also calculated with dfdl:outputValueCalc. Calculations of the value of variables may also reference forward in Infoset order. So in general these expressions must be evaluated with a on-demand type of algorithm where computations can be suspended until the unparser is fed the non-calculated data elements needed to enable the calculations to proceed.

In addition, some calculations will need characteristics of the physical unparsed representation of data, where that data is later in Infoset order. This requires not only that the calculation is deferred until that Infoset element is available, it also requires that one buffer up output data. For example consider dfdl:lengthUnits 'bytes' for an element of type xs:string with dfdl:encoding 'utf-8'. In this case, one must unparse the string into a buffer so that one can count the number of bytes.

A simpler case would be a textual number, but with grouping separators. E.g., 1,000,000.00. The Infoset value is just the integer 1000000. The fact that this requires the insertion of two commas as grouping separators is value-dependent.

The DFDL Specification refers to the unpadded data (but having interior-things like grouping separators) as the Value Region, and the Padded/Filled data as the Content Region, which surrounds and encompasses the Value Region.

Length Expressions

The DFDL Workgroup has clarified that when dfdl:lengthKind is 'explicit' and the dfdl:length is an expression, that this expression is executed both when parsing and when unparsing. When unparsing this calculation gives the length to which the data is truncated (when truncation is allowed) or to which it is padded/filled before unparsing.

This interferes with the basic concept of recursively walking the Infoset and DFDL Schema, writing out elements based on the DFDL Schema component. 

Design For Test

Task: TDML runner modifications are required. It is roughly symmetric to parser testing features. The biggest issue complexity-wise is converting an XML-expressed DFDL infoset into an actual DFDL infoset but this is just a method call for the TDML runner.

Ideas: (some half baked?)

  • Some parser tests are invertible. Having parsed data to an Infoset, one can unparse back to data and for some DFDL schemas, get the identical data. This doesn't work for all DFDL schemas - escape schemes can parse things with say, surrounding quotes which on unparsing are determined to be unnecessary and so are not output. Also multiple values are allowed for delimiters, but the first of these values is used on output, so incoming data that uses one of the other delimiters (not the first) won't unparse to the same delimiters. That said, many tests will be invertible.  A flag on TDML parser tests should indicate whether the test can be inverted. Some way to bulk-set this flag so it doesn't have to be done explicitly.

TBD

  • Streaming output for large objects - this is symmetric with a parser feature we need, which is the ability of the unparser to accept a large object not as a giant string or hexBinary blob, but as a file descriptor or other specification that can be opened and pulled separately from the Infoset elements.
  • Truncated output when length units is bytes and encoding is variable width (e.g., utf-8). The issue is truncating that chops the code units of a character off part way through.
  • Improvements in coding style: smaller Scala code files, smaller TDML files - for parsing there are some giant files and some TDML files that have hundreds of tests in them. We ought not repeat these mistakes.

 

  • No labels