This page is a set of notes about the design of the Daffodil Unparser, including how to test it and how it works internally.

Terminology

Programmatic Construction of Infoset

For unparsing, a DFDL Infoset can be created programmatically by a DFDL-schema-aware program using the Daffodil Infoset API.

Task(s): design and document this API. This includes Scala and Java versions of the API, scaladoc and javadoc for them, and unit tests that drive the APIs from Scala and Java.

The alternative way to construct a DFDL Infoset for unparsing is to create the Infoset from XML data.

Infoset Creation Error

This is an error that occurs when converting XML into a DFDL Infoset. The causes include:

  1. XML not well formed
  2. XML not schema-structured
  3. XML not schema-valid

Being schema-structured means that for each XML element there is a corresponding DFDL schema element declaration, and that the value of the element (simple types) or the complex content of the element (complex types) corresponds to the type given in the DFDL schema. For example, suppose the XML is the well-formed fragment <x>foo</x>, but the corresponding element declaration for element 'x' has type xs:int. Since "foo" is not a valid xs:int, this XML is not schema-structured for this DFDL schema.

Being schema-structured concerns the situations that the parser would treat as a parse error. From the parser's perspective, if "foo" is in the data and the parser encounters it while trying to parse element 'x' of type xs:int, a parse error occurs. The symmetric situation when unparsing is that the XML being converted into a DFDL Infoset has this same kind of type error.
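The distinction between "not well formed" and "not schema-structured" can be illustrated with a minimal Python sketch. It is not Daffodil code: `SCHEMA_TYPES` is a hypothetical stand-in for a compiled DFDL schema, and `classify` and `is_xs_int` are names invented for this illustration. Only a single top-level simple element of type xs:int is handled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a compiled DFDL schema: element name -> simple type.
SCHEMA_TYPES = {"x": "xs:int"}

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def is_xs_int(text):
    """True if text is a lexically valid xs:int within the 32-bit range."""
    stripped = text.strip()
    if not re.fullmatch(r"[+-]?[0-9]+", stripped):
        return False
    return INT_MIN <= int(stripped) <= INT_MAX


def classify(xml_text):
    """Classify a one-element XML document against SCHEMA_TYPES."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not well formed"
    declared = SCHEMA_TYPES.get(root.tag)
    if declared is None:
        # No element declaration corresponds to this element.
        return "not schema-structured"
    if declared == "xs:int" and not is_xs_int(root.text or ""):
        # The value does not correspond to the declared type.
        return "not schema-structured"
    return "ok"
```

With this sketch, `classify("<x>foo</x>")` reports the element as not schema-structured, while `classify("<x>foo")` is rejected earlier as not well formed.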

General validity when unparsing corresponds to general validity checking in the Daffodil parser. This includes checking the facets and the number of occurrences.

An exception arises when an element has an assertion or discriminator that calls the dfdl:checkConstraints function. This tells the parser to treat a facet violation as a parse error. There is no corresponding unparser feature, because assertions and discriminators are not evaluated when unparsing.

Unassociated Infoset Element

An Infoset element in the intermediate state, during the construction of a DFDL Infoset, where the element name and namespace are known but the corresponding DFDL schema component is not. In this state the value, for simple types, is just a string that has not yet been converted into any typed object.

When constructing a DFDL Infoset from XML (as opposed to constructing it programmatically), some XML may carry xsi:type attributes, for example:

<foo xsi:type="xs:int">5</foo>

In this case, the value "5" may be converted into an xs:int type; however, whether this correctly corresponds to a DFDL schema element declaration's type or not has not been determined as the association of the element to the corresponding part of the DFDL Schema has not yet been performed.
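As a rough illustration of this intermediate state, the following Python sketch records only what is known before association. The class `UnassociatedElement` and the function `from_xml` are hypothetical names chosen for this example and are not part of the Daffodil Infoset API. The xsi:type value is kept as an uninterpreted hint, since it has not yet been checked against any schema declaration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

XSI = "http://www.w3.org/2001/XMLSchema-instance"


@dataclass
class UnassociatedElement:
    namespace: Optional[str]      # None when the element has no namespace
    name: str
    raw_value: Optional[str]      # still a string, not a typed object
    xsi_type: Optional[str] = None          # hint only, not yet verified
    schema_component: Optional[object] = None  # unknown until association


def from_xml(elem):
    """Build an UnassociatedElement from an ElementTree element."""
    if elem.tag.startswith("{"):
        namespace, name = elem.tag[1:].split("}", 1)
    else:
        namespace, name = None, elem.tag
    return UnassociatedElement(
        namespace, name, elem.text, elem.get("{%s}type" % XSI)
    )


doc = ET.fromstring(
    '<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:type="xs:int">5</foo>'
)
element = from_xml(doc)
```

Here `element.raw_value` is still the string "5" and `element.schema_component` remains unset until the association step has been performed.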

Basic Algorithms

Daffodil's parser performs two steps:

  1. Data to Infoset: data parsed into Daffodil's version of the DFDL Infoset
  2. Infoset to XML: DFDL Infoset converted into XML.

Symmetric to this, the unparser performs these:

  1. XML to Infoset: XML parsed into Daffodil's version of the DFDL Infoset
  2. Infoset to Data: DFDL Infoset unparsed into data.

This symmetry is not very deep, as the algorithms and actions taken during these activities are quite different. Consider: when parsing, converting data to Infoset is the hard part. Given the Infoset, producing XML from it is quite easy. Unparsing is different. Converting XML to a DFDL Infoset is more complex than producing XML from an Infoset, but unparsing a DFDL Infoset to data is quite a bit simpler than parsing data to an Infoset.

XML to Infoset

Converting XML into a DFDL Infoset has a few complexities.

Note: there is code that does this already as part of unit testing for the new Infoset introduced with the DPath expression implementation. This provides a way to create a DFDL Infoset from XML so as to execute DPath expressions against that Infoset.

This code was created with unit testing in mind. Performance was not a consideration at that time.

The major issues are:

  1. Inferring arrays
  2. Determining the DFDL schema component that corresponds to a specific unassociated Infoset element, given the presence of xs:choice and optional/array elements.

It is not possible to construct a correct DFDL Infoset from XML without the DFDL Schema.

Inferring Arrays

This is the issue of determining whether <foo>5</foo><foo>6</foo> are two elements of the same array, or two separate elements that are either scalars or members of different arrays. DFDL and XML Schema specifically allow for schemas like this:

<element name="foo" type="int" maxOccurs="2"/>
<element name="bar" type="string" minOccurs="0"/>
<element name="foo" type="int"/>

In this situation, since the element named "bar" is optional, it is ambiguous which of the two element declarations named "foo" a particular instance should be associated with.

The algorithm for this was a recent topic of conversation in the DFDL working group (January 2015). It was resolved that the nature of the repetition of the elements must be taken into account, including the dfdl:occursCountKind. Specifically, when dfdl:occursCountKind is 'parsed', the min/maxOccurs are not considered, and any number of adjacent elements can be "matched up" to the element declaration of an array. However, when the dfdl:occursCountKind is implicit or delimited, the unparser must count the number of elements it has output, and stop when maxOccurs (if bounded) is reached.
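The counting rule described above can be sketched in Python as a greedy matcher. This is an illustration under stated assumptions, not the working group's algorithm or Daffodil's implementation: `Decl` and `associate` are hypothetical names, only a flat sequence of declarations is handled, and there is no backtracking, so some inputs that do have a valid association are rejected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decl:
    name: str
    min_occurs: int = 1
    max_occurs: Optional[int] = 1       # None means unbounded
    occurs_count_kind: str = "implicit"


def associate(decls, names):
    """Greedily match adjacent element names to declarations in order.

    Returns a list of (declaration index, number of elements matched).
    """
    result = []
    pos = 0
    for index, decl in enumerate(decls):
        parsed = decl.occurs_count_kind == "parsed"
        count = 0
        while pos < len(names) and names[pos] == decl.name:
            # Unless the kind is 'parsed', stop once maxOccurs is reached.
            if not parsed and decl.max_occurs is not None \
                    and count >= decl.max_occurs:
                break
            pos += 1
            count += 1
        # For 'parsed', min/maxOccurs are not considered.
        if not parsed and count < decl.min_occurs:
            raise ValueError("too few occurrences of " + decl.name)
        result.append((index, count))
    if pos != len(names):
        raise ValueError("unexpected element " + names[pos])
    return result


# The three declarations from the schema fragment above.
SCHEMA = [Decl("foo", 1, 2), Decl("bar", 0, 1), Decl("foo", 1, 1)]
```

For the names foo, foo, foo this sketch assigns two elements to the first declaration, none to "bar", and one to the final "foo". If the first declaration instead had an occurs count kind of 'parsed', it would absorb all three and the final "foo" declaration would be left unsatisfied.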


Note: The test-rig code for testing the DPath expressions constructs arrays of the Infoset blindly. That is, all adjacent elements are coalesced into arrays. This is ultimately incorrect, but may be fine as a first version, and the true array inference that is properly schema-aware can be added later.
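The blind coalescing described in this note amounts to grouping runs of adjacent elements that share a name. A minimal Python sketch follows; the function name `coalesce_adjacent` and the (name, value) pair representation are assumptions made for illustration and do not reflect the actual test-rig code.

```python
from itertools import groupby


def coalesce_adjacent(children):
    """Group adjacent (name, value) pairs with the same name into arrays.

    No schema is consulted, so every run of same-named siblings becomes
    one array, whatever the element declarations actually say.
    """
    return [
        (name, [value for _, value in run])
        for name, run in groupby(children, key=lambda child: child[0])
    ]


blind = coalesce_adjacent([("foo", "5"), ("foo", "6"), ("foo", "7")])
```

Here `blind` holds a single three-element array named "foo". Against a schema with a "foo" of maxOccurs 2, an optional "bar", and a scalar "foo", that grouping is wrong, which is why schema-aware inference is eventually needed.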

 

Determining the DFDL Schema Component that Corresponds to an XML Element

Perhaps the most complex issue for creating a DFDL Infoset, or for unparsing one, is determining which DFDL schema component corresponds to a particular Infoset element.


Archives of the DFDL Workgroup email contain a number of discussions about xs:choice and inferring the right choice-arm given an unassociated Infoset Element.

Design For Test

TDML runner modifications are required. These are roughly symmetric to the parser-testing features. The biggest source of complexity is converting an XML-expressed DFDL Infoset into an actual DFDL Infoset.
