Daffodil is an implementation of DFDL which uses JDOM and XML to represent the DFDL Infoset.
The DFDL Infoset is somewhat different from the XML Infoset.
More precisely, Daffodil approximates the DFDL Infoset by using a subset of the XML Infoset features exposed by the JDOM libraries and by embellishing JDOM Elements with distinguished attributes.
Namespaces and Prefixes
The Daffodil implementation uses attributes in a few distinct namespaces to embellish JDOM Elements.
The string "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012" is the Daffodil implementation namespace prefix. All Daffodil-specific namespaces extend this.
The URN suffix "...:int" appended to the prefix above is the URN for Daffodil internal use. By convention it is bound to the prefix 'dafint'. Attributes and elements in this namespace are for internal use by the Daffodil implementation.
The URN suffix "...:ext" appended to the same prefix is the Daffodil extension namespace, by convention bound to the prefix 'daffodil'. This is used for Daffodil extensions to the DFDL specification, such as new properties or annotations. Attributes or elements in this namespace are effectively visible parts of the Daffodil API, intended to be used and understood by DFDL schema authors using Daffodil.
We also use the standard 'xsi' prefix/namespace, and 'xs' prefix/namespace.
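As a rough illustration of how these namespace URIs compose, the following Python sketch builds them from the implementation prefix. The names DAFFODIL_NS_BASE, NAMESPACES, and qualified are hypothetical and are not part of Daffodil; the 'xsi' and 'xs' URIs are the standard W3C ones, and the "{uri}local" form is simply the Clark notation used by Python's xml.etree.ElementTree.

```python
# Hypothetical constants illustrating the namespace conventions described above.
DAFFODIL_NS_BASE = "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012"

NAMESPACES = {
    "dafint": DAFFODIL_NS_BASE + ":int",    # internal use by the implementation
    "daffodil": DAFFODIL_NS_BASE + ":ext",  # extensions visible to schema authors
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "xs": "http://www.w3.org/2001/XMLSchema",
}


def qualified(prefix, local_name):
    """Return the Clark-notation name '{uri}local' for a conventional prefix."""
    return "{%s}%s" % (NAMESPACES[prefix], local_name)
```

For instance, qualified("daffodil", "dfdlVersion") names the attribute in the extension namespace regardless of which prefix a particular document happens to bind.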
Mapping of DFDL Infoset to Daffodil JDOM Infoset
|DFDL Infoset||Daffodil's JDOM XML Infoset|
|Document Information Item||JDOM Document|
|dfdlVersion||attribute daffodil:dfdlVersion on the root element.|
|schema (reserved for future use)||(no implementation)|
|unicodeByteOrderMark||attribute daffodil:unicodeByteOrderMark on the root element.|
|Element Information Item||JDOM Element|
attribute xsi:type with value one of the set of XML Schema simple type QNames that are in the DFDL Subset of XML Schema.
For example: xsi:type='xs:string'
By convention, the prefixes 'xsi' and 'xs' denote here the usual standard namespace URIs.
For simple types other than xs:string, the canonical XML representation of the value, as returned by getText().
However, for the value nil, the representation is an element with no value having the xsi:nil='true' attribute.
For type xs:string, the DFDL Information Set allows representation of characters that are illegal in XML.
These are represented by replacing them with characters in the Unicode Private Use Area by a scheme described below.
A special attribute dafi:schemaComponentID has a value which can be used to retrieve the associated schema component.
(Not yet implemented: means to create a standard Schema Component Designator or SCD)
|valid||(Not yet implemented)|
|unionMemberSchema||(Not yet implemented)|
|"No Value"||A JDOM Element with no children, and with no dataValue is the representation of an element with "No Value".|
A JDOM Element with a special marker attribute: dafi:hidden='true' signifies that the element is part of the augmented infoset.
This attribute is used to identify and filter out elements when the un-augmented infoset is needed.
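The filtering of hidden elements can be sketched as follows. This is an illustration only: it uses Python's xml.etree.ElementTree rather than JDOM, the function name strip_hidden is hypothetical, and it assumes that the 'dafi' prefix is bound to the Daffodil internal-use namespace (the "...:int" URN).

```python
import xml.etree.ElementTree as ET

# Assumption: the marker attribute lives in the Daffodil internal-use namespace.
DAFINT_NS = "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012:int"
HIDDEN_ATTR = "{%s}hidden" % DAFINT_NS


def strip_hidden(element):
    """Remove, in place, every descendant marked hidden='true'."""
    for child in list(element):  # copy the child list: we mutate while iterating
        if child.get(HIDDEN_ATTR) == "true":
            element.remove(child)
        else:
            strip_hidden(child)
    return element


augmented = ET.fromstring(
    '<root xmlns:dafi="%s">'
    '<a>1</a><b dafi:hidden="true">2</b><c><d dafi:hidden="true"/></c>'
    '</root>' % DAFINT_NS
)
unaugmented = strip_hidden(augmented)
```

After the call, only the elements a and c remain under root, and c has no children.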
Implementation of DFDL Infoset Strings
Since DFDL strings can contain characters that are not allowed in XML at all, these characters are mapped into the Unicode Private Use Area (PUA), which spans characters #xE000 to #xF8FF.
This is similar to the scheme used by Microsoft Visio (See: http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx), but extended to handle all the XML 1.0 illegal characters including those with 16-bit codepoint values.
These are the legal XML characters (for XML v1.0):
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
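The character production above translates directly into a predicate on code points. A minimal Python sketch follows; the function name is_xml10_char is hypothetical.

```python
def is_xml10_char(cp):
    """True if the integer code point cp is a legal XML 1.0 character."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

Everything for which this predicate is false (the remaining control characters below #x20, the surrogate range, #xFFFE, and #xFFFF) is what the mapping has to handle.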
For illegal characters with values from #x00 to #x1F (that is, all of that range except #x9, #xA, and #xD), these values are mapped to the PUA by adding #xE000 to their character code.
For illegal characters #xD800 to #xDFFF, these values are mapped to the PUA by adding #x1000 to their character code. So #xD800 maps to #xE800, and #xDFFF maps to #xEFFF.
For illegal characters #xFFFE and #xFFFF, these values are mapped to the PUA by subtracting #x0F00 from their character code, yielding characters #xF0FE and #xF0FF.
This mapping is used bi-directionally. When parsing, illegal characters are replaced by their legal PUA counterparts. When unparsing, the reverse transformation is performed, which allows data containing XML-illegal characters to be created from legal XML documents that contain only the corresponding mapped PUA characters.
It is a processing error when parsing if any DFDL infoset string contains characters in the parts of the PUA used by this mapping for illegal XML codepoints.
(Possible future: toggle mechanism so you can turn on/off this mapping, allowing processing of data so long as it does not contain both PUA characters AND illegal XML characters)
It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.
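The mapping rules and the two error conditions above can be sketched in Python as follows. This is an illustration under stated assumptions, not Daffodil's actual code: the names FORWARD, REVERSE, remap_for_xml, and restore_from_xml are hypothetical, processing errors are modeled as ValueError, and the parse-side input is a sequence of integer code points so that out-of-range values can be represented at all.

```python
def _build_forward_table():
    """Map every XML 1.0 illegal code point up to #xFFFF to its PUA stand-in."""
    table = {}
    for cp in range(0x00, 0x20):
        if cp not in (0x9, 0xA, 0xD):      # these three are legal XML
            table[cp] = cp + 0xE000
    for cp in range(0xD800, 0xE000):       # #xD800 to #xDFFF
        table[cp] = cp + 0x1000
    for cp in (0xFFFE, 0xFFFF):
        table[cp] = cp - 0x0F00
    return table


FORWARD = _build_forward_table()
REVERSE = {pua: cp for cp, pua in FORWARD.items()}


def remap_for_xml(code_points):
    """Parse direction: DFDL infoset code points -> XML-safe string."""
    out = []
    for cp in code_points:
        if cp > 0x10FFFF:
            raise ValueError("code point beyond #x10FFFF: %#x" % cp)
        if cp in REVERSE:
            raise ValueError("data uses a PUA character reserved by the mapping: %#x" % cp)
        out.append(chr(FORWARD.get(cp, cp)))
    return "".join(out)


def restore_from_xml(text):
    """Unparse direction: XML-safe string -> DFDL infoset code points."""
    return [REVERSE.get(ord(ch), ord(ch)) for ch in text]
```

With these definitions, remap_for_xml([0x41, 0x00, 0xD800, 0xFFFF]) yields the four characters "A", #xE000, #xE800, and #xF0FF, and restore_from_xml returns the original code points.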
Daffodil Infoset and TDML Runner
The Daffodil TDML runner constructs the <tdml:dfdlInfoset> element contents by post-processing all strings so that the DFDL character entities notation can be used to express XML-illegal characters.
So for example:
would translate the %NUL; entity notation into character #x00, which is illegal in XML, and so it would be remapped to character #xE000. Hence, the above example is equivalent to writing:
which uses the XML numeric character entity to insert the remapped #xE000 character directly. DFDL character entities simply offer the notational convenience of the symbolic form of these entities (NUL, CR, LF, HT, VT, FF, etc.)
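The post-processing step can be sketched in Python as below. This is not the TDML runner's code: the names DFDL_ENTITIES and expand_entities are hypothetical, the entity table is partial and limited to the symbolic names listed above, the sample input string is made up, and the sketch does not handle numeric entity forms or escaped percent signs.

```python
import re

# Partial table of symbolic DFDL character entities (assumed for illustration).
DFDL_ENTITIES = {
    "NUL": 0x00, "HT": 0x09, "LF": 0x0A,
    "VT": 0x0B, "FF": 0x0C, "CR": 0x0D,
}


def _remap_control(cp):
    """Apply the PUA rule for XML-illegal characters below #x20."""
    if cp < 0x20 and cp not in (0x9, 0xA, 0xD):
        return cp + 0xE000
    return cp


def expand_entities(text):
    """Replace %NAME; with its character, remapped to the PUA when XML-illegal."""
    def replace(match):
        name = match.group(1)
        if name not in DFDL_ENTITIES:
            return match.group(0)          # leave unknown names untouched
        return chr(_remap_control(DFDL_ENTITIES[name]))
    return re.sub(r"%([A-Z]+);", replace, text)
```

Under these assumptions, expand_entities("abc%NUL;def") produces "abc", then the character #xE000, then "def", while %CR; and %LF; expand to the ordinary carriage return and line feed because those are legal in XML.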