
Daffodil is an implementation of DFDL that uses Scala's scala.xml.Elem or JDOM to represent the DFDL Infoset in XML.

The DFDL Infoset is somewhat different from the XML Infoset.

In truth, Daffodil approximates the DFDL Infoset using a subset of the features in the XML Infoset by embellishing elements with distinguished attributes.

Ultimately, the Scala API for Daffodil converts to/from Scala's native XML objects, e.g., scala.xml.Elem being the class of Element nodes.

The Java API converts to/from JDOM objects.

Namespaces and Prefixes

The Daffodil implementation uses attributes in a few distinct namespaces to embellish XML elements.

The string "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012" is the Daffodil implementation namespace prefix. All Daffodil-specific namespaces extend this.

The URN suffix "...:int" appended to the prefix above is the URN for Daffodil internal use. By convention it is bound to the prefix 'dafint'. Attributes and elements in this namespace are for internal use by the Daffodil implementation.

The URN suffix "...:ext" appended to the same prefix is the Daffodil extension namespace, by convention bound to the prefix 'daf'. This is used for Daffodil extensions to the DFDL specification, such as new properties or annotations. Attributes or elements in this namespace are effectively visible parts of the Daffodil API, intended to be used and understood by DFDL schema authors using Daffodil.

We also use the standard 'xsi' prefix/namespace, and 'xs' prefix/namespace.
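The prefix conventions above can be collected into a single prefix-to-URI table. The following is only an illustrative Python sketch, not Daffodil code: it assumes the ":int" and ":ext" suffixes are appended literally to the base URN, and the names DAFFODIL_BASE, NAMESPACES, and qname are hypothetical.

```python
# Base URN quoted in the text; every Daffodil-specific namespace extends it.
DAFFODIL_BASE = "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012"

# Conventional prefix -> namespace URI (assumes the suffix is appended as-is).
NAMESPACES = {
    "dafint": DAFFODIL_BASE + ":int",  # internal use by the implementation
    "daf": DAFFODIL_BASE + ":ext",     # extensions visible to schema authors
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "xs": "http://www.w3.org/2001/XMLSchema",
}


def qname(prefix, local):
    """Return the '{uri}local' form that ElementTree uses for namespaced names."""
    return "{%s}%s" % (NAMESPACES[prefix], local)
```

For example, qname("xsi", "nil") yields the attribute key under which an XML library such as ElementTree stores xsi:nil.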

Mapping of DFDL Infoset to Daffodil JDOM Infoset and to Scala XML Nodes

DFDL Infoset | Daffodil's JDOM XML Infoset | Scala scala.xml.Node Infoset
Document Information Item | JDOM Document | The document is represented by the root element. There is no separate document item.
root | getRootElement() | none
dfdlVersion | attribute daf:dfdlVersion on the root element. (Not yet implemented) | none
schema (reserved for future use) | daf:schema attribute. (No implementation) | none
unicodeByteOrderMark | attribute daf:unicodeByteOrderMark on the root element. (Not yet implemented) | same attribute scheme as JDOM
Element Information Item | JDOM Element | scala.xml.Elem
namespace | getNamespace(): org.jdom.Namespace | def namespace: String
name | getName(): String | def name: String
document | getDocument() | none (see parent)
datatype | attribute xsi:type with value one of the set of XML Schema simple type QNames that are in the DFDL subset of XML Schema. For example: xsi:type='xs:string'. By convention, the prefixes 'xsi' and 'xs' denote here the usual standard namespace URIs. (Not yet implemented) | same attribute scheme as JDOM
dataValue | For simple types other than xs:string, the canonical XML representation of the value, as returned by getText(). For type xs:string, the DFDL Infoset allows representation of characters that are illegal in XML. These are represented by replacing them with characters in the Unicode Private Use Area by a scheme described below. | def text: String to obtain canonical text. Values containing XML-illegal characters use the same scheme.
nilled | xsi:nil='true' attribute on element. Absence of this attribute implies 'false' | Same attribute xsi:nil
children | getChildren() | def child: Node*
parent | getParent() | none. Scala XML nodes are immutable, and do not have parent references. This allows nodes to be shared.
schema | A special attribute daf:schemaComponentID has a value which can be used to retrieve the associated schema component. (Not yet implemented. Note: requires a means to create a standard Schema Component Designator or SCD) | Same attribute scheme
valid | daf:valid='true' means the data has been tested and is valid, daf:valid='false' means the data has been tested and is invalid. The absence of the attribute means that no position is taken on the validity of the data. (Not yet implemented) | Same attribute scheme
unionMemberSchema | (Not yet implemented) | (Not yet implemented)
"No Value" | A JDOM Element with no children (not even Text node children) is the representation of an element with "No Value". | A scala.xml.Elem with no children.
Augmented Infoset | A JDOM Element with a special marker attribute: dafint:hidden='true' signifies that the element is part of the augmented infoset. This attribute is used to identify and filter out elements when the un-augmented infoset is needed. | Same attribute scheme, but on scala.xml.Elem element.
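To illustrate how a consumer might read the distinguished attributes listed above, here is a minimal sketch using Python's standard xml.etree.ElementTree rather than JDOM or scala.xml. The helper names (is_nilled, validity, unaugmented) are hypothetical, and the namespace URIs assume the ":ext" and ":int" suffixes are appended literally to the Daffodil base URN.

```python
import copy
import xml.etree.ElementTree as ET

BASE = "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012"
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"
DAF_VALID = "{%s:ext}valid" % BASE
DAFINT_HIDDEN = "{%s:int}hidden" % BASE


def is_nilled(elem):
    """xsi:nil='true' means nilled; absence of the attribute implies false."""
    return elem.get(XSI_NIL) == "true"


def validity(elem):
    """True or False if daf:valid is present, None if no position is taken."""
    value = elem.get(DAF_VALID)
    return None if value is None else value == "true"


def _drop_hidden(elem):
    # Iterate over a snapshot so removal does not disturb the loop.
    for child in list(elem):
        if child.get(DAFINT_HIDDEN) == "true":
            elem.remove(child)
        else:
            _drop_hidden(child)


def unaugmented(elem):
    """Return a copy of the tree without dafint:hidden='true' elements."""
    result = copy.deepcopy(elem)
    _drop_hidden(result)
    return result
```

Working on a deep copy keeps the original (augmented) tree intact, which loosely mirrors the immutability of Scala XML nodes noted in the table.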

Implementation of DFDL Infoset Strings

Since DFDL strings can contain characters that are not allowed in XML at all, these characters are mapped into the Unicode Private Use Area (PUA), that is, characters #xE000 to #xF8FF.

This is similar to the scheme used by Microsoft Visio (See: http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx), but extended to handle all the XML 1.0 illegal characters including those with 16-bit codepoint values.

These are the legal XML characters (for XML v1.0):

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

#xD is legal but is treated specially, as described below.

For illegal characters with values from #x00 to #x1F (everything in that range except #x9, #xA, and #xD), these values are mapped to the PUA by adding #xE000 to their character code. So #x00 maps to #xE000, and #x1F maps to #xE01F.

Character #xD (Carriage Return or CR) is mapped to #xA (Line Feed, or LF). The CR character is allowed in the textual representation of XML documents, but is always converted to LF in the XML Infoset. That is, it is read by XML processors, but CRLF is converted to just LF, and CR alone is converted to LF. Daffodil is in a sense a different 'reader' of data into the XML infoset, so to be consistent with XML we map CR to LF. 

The pair CRLF when it appears within data (i.e., is not a delimiter) is treated as regular text characters, so the CR is converted to LF, and so CRLF will become LFLF. 

For illegal characters #xD800 to #xDFFF, these values are mapped to the PUA by adding #x1000 to their character code. So #xD800 maps to #xE800, and #xDFFF maps to #xEFFF.

For illegal characters #xFFFE and #xFFFF, these values are mapped to the PUA by subtracting #x0F00 from their character code, giving characters #xF0FE and #xF0FF.

This mapping is used bi-directionally. When parsing, illegal characters are replaced by their legal PUA counterparts. When unparsing, the reverse transformation is performed, which allows data containing XML-illegal characters to be created from legal XML documents that contain only the corresponding mapped PUA characters.
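The arithmetic of the mapping can be sketched as a pair of functions. This is an illustrative Python rendering of the rules described above, not Daffodil's implementation; the names to_pua and from_pua are hypothetical. It assumes that #xE009, #xE00A, and #xE00D are never produced by the forward mapping (because #x9 and #xA are legal and CR becomes LF), so the reverse mapping leaves them alone.

```python
def to_pua(text):
    """Parse direction: replace XML-illegal characters with PUA counterparts."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x0D:
            cp = 0x0A                       # CR is converted to LF
        elif cp < 0x20 and cp not in (0x09, 0x0A):
            cp += 0xE000                    # C0 controls -> #xE000..#xE01F
        elif 0xD800 <= cp <= 0xDFFF:
            cp += 0x1000                    # surrogates -> #xE800..#xEFFF
        elif cp in (0xFFFE, 0xFFFF):
            cp -= 0x0F00                    # -> #xF0FE, #xF0FF
        out.append(chr(cp))
    return "".join(out)


def from_pua(text):
    """Unparse direction: restore the original XML-illegal characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xE000 <= cp <= 0xE01F and cp not in (0xE009, 0xE00A, 0xE00D):
            cp -= 0xE000
        elif 0xE800 <= cp <= 0xEFFF:
            cp -= 0x1000
        elif cp in (0xF0FE, 0xF0FF):
            cp += 0x0F00
        out.append(chr(cp))
    return "".join(out)
```

Note that the CR to LF conversion is the one step that is not reversible: from_pua cannot tell whether an LF in the infoset was originally a CR.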

It is a processing error when parsing if any DFDL infoset string contains characters in the parts of the PUA used by this mapping for illegal XML codepoints.
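A check for this condition could look like the following Python sketch. The function names are hypothetical, and treating the whole #xE000 to #xE01F block as reserved (including the three slots the mapping never produces) is a conservative assumption rather than something the text above specifies.

```python
# PUA ranges used by the mapping for XML-illegal code points (inclusive).
RESERVED_PUA_RANGES = ((0xE000, 0xE01F), (0xE800, 0xEFFF), (0xF0FE, 0xF0FF))


def find_reserved_pua(text):
    """Return (index, code point) of the first reserved PUA character, or None."""
    for index, ch in enumerate(text):
        cp = ord(ch)
        if any(low <= cp <= high for low, high in RESERVED_PUA_RANGES):
            return index, cp
    return None


def check_parsed_string(text):
    """Raise ValueError (standing in for a processing error) on a collision."""
    hit = find_reserved_pua(text)
    if hit is not None:
        index, cp = hit
        raise ValueError(
            "processing error: U+%04X at index %d collides with the PUA mapping"
            % (cp, index)
        )
    return text
```

PUA characters outside the reserved ranges, such as #xE100, pass through untouched under this sketch.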

(Possible future: toggle mechanism so you can turn on/off this mapping, allowing processing of data so long as it does not contain both PUA characters AND illegal XML characters)

It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.

XML Character Entity Conversion:

(Proposed)

While Daffodil proper stops with the DFDL Infoset, many applications of Daffodil will want to construct an actual XML document as a string/text representation from the DFDL Infoset.

If the output encoding is UTF-8, no special conversion is needed. However, some systems do not handle UTF-8 well.  Specifically, the Microsoft Windows Operating System, as of Version 7, when installed in the default US-English configuration, does not display UTF-8 unicode properly in the default tools such as at the command line, and the ubiquitous notepad and wordpad programs.

To better accommodate this, a special translation may be helpful when converting from the Daffodil infoset to the XML textual representation. This does not affect the content of the Daffodil infoset, but only its realization as an XML file/string.

(Note: This conversion to text is outside of normal Daffodil processing which stops when the Infoset is created. In fact this is a general capability which could be used by any XML-oriented application programs which create an actual file/string/stream of textual characters.)

The special transformation has the following characteristics:

  • The ability to specify an encoding.
    • For example, a MS-Windows user may wish to specify the windows-1252 encoding.
    • The minimum set of supported encodings would be ASCII, windows-1252, and UTF-8
      • Specifying UTF-8 turns off the numeric character entity substitution part of this special transformation.
  • Any unicode codepoint which cannot be mapped to the selected encoding can be replaced by its XML numeric character entity equivalent.
    • Example: If the user specifies the US-ASCII encoding, there is no mapping for the Euro symbol €, which is Unicode #x20AC. This would be output as &#x20AC;
    • Example: If the user specifies the windows-1252 encoding, XML-illegal code points such as 0 to 8, which become #xE000 to #xE008 in the Daffodil Infoset according to the PUA mapping described above, would become &#xE000; to &#xE008; in the output text.
  • An option allows the user to control whether an XML heading line such as <?xml version="1.0" encoding="windows-1252" ?> is generated at the start of the textual output.
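Since this transformation is only proposed, the following Python sketch is one possible reading of the characteristics listed above, not an existing Daffodil feature. The function name encode_for_xml and its parameters are hypothetical. It handles only the encoding step and leaves escaping of markup characters such as < and & to a separate pass.

```python
def encode_for_xml(text, encoding="us-ascii", declaration=True):
    """Encode text, replacing unmappable characters with &#xHHHH; references."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)              # can the target encoding hold it?
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append("&#x%X;" % ord(ch))
    body = "".join(pieces)
    if declaration:
        # Optional XML heading line naming the chosen encoding.
        body = '<?xml version="1.0" encoding="%s" ?>\n%s' % (encoding, body)
    return body.encode(encoding)
```

With UTF-8 every character that can appear in the infoset after PUA mapping is encodable, so no numeric character entities are produced, which matches the behaviour described for that encoding.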

Note that choosing the ASCII or US-ASCII encoding creates output that is universal, in that it uses only the 7-bit ASCII characters yet can accurately represent any character allowed in XML. This form, however, would be largely unreadable not only to users of East Asian scripts, but even to users of commonplace accented characters from European languages.

CDATA Escaping Option

(Proposed)

An additional option controls escaping of the special XML characters <, >, &, ", and '.

CDATA preference: When it is expected that string data contains, or could contain one or more of the characters &, ", ', <, and >, then the user can specify an option for whether they prefer use of CDATA sections, or standard escaping where the standard character entities are used: &amp; &quot; &apos; &lt; &gt;.

An additional sub-option controls whether strings greater than some tunable size should always be surrounded by CDATA sections.

If characters requiring the above described numeric character entities are encountered, then the CDATA section will be ended, the character entity inserted, and then another CDATA section begun.

Examples:

  • The string { ../x > 5 } could be rendered to text as either

    •  { ../x &gt; 5 } or as

    • <![CDATA[{ ../x > 5 }]]> depending on the CDATA preference.

  • In US-ASCII encoding, the string { the cost is '800€' } could be encoded as

    •  { the cost is &apos;800&#x20AC;&apos; } or

    • <![CDATA[{ the cost is '800]]>&#x20AC;<![CDATA[' }]]> which has two CDATA sections with the numeric character entity for the Euro symbol € between them.
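The two renderings in the examples above can be reproduced with a short Python sketch. This is only an illustration of the proposed option, with hypothetical names (render_string, prefer_cdata). It does not implement the size-threshold sub-option, and its handling of a literal "]]>" inside the data (splitting it across two CDATA sections) is an addition that the text above does not discuss.

```python
ESCAPES = {"&": "&amp;", '"': "&quot;", "'": "&apos;", "<": "&lt;", ">": "&gt;"}


def _encodable(ch, encoding):
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def render_string(text, encoding="us-ascii", prefer_cdata=False):
    """Render text using standard escapes or CDATA sections."""
    if not prefer_cdata:
        return "".join(
            ESCAPES.get(ch, ch) if _encodable(ch, encoding) else "&#x%X;" % ord(ch)
            for ch in text
        )
    parts, run = [], []

    def flush():
        if run:
            # "]]>" cannot appear inside CDATA, so split it across two sections.
            safe = "".join(run).replace("]]>", "]]]]><![CDATA[>")
            parts.append("<![CDATA[" + safe + "]]>")
            run.clear()

    for ch in text:
        if _encodable(ch, encoding):
            run.append(ch)
        else:
            flush()                           # end the current CDATA section
            parts.append("&#x%X;" % ord(ch))  # numeric entity between sections
    flush()
    return "".join(parts)
```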

DFDL Expressions and Daffodil Infoset Strings

The DFDL v1.0 specification now includes functions dfdl:decodeDFDLEntities(...) and dfdl:encodeDFDLEntities(...) which take a string value as argument, and which decode DFDL's entity syntax (such as "%LF;") into the corresponding unicode characters (decode), or the inverse of that, creating a string containing DFDL entities for characters that have entities defined (encode). By use of these functions strings can be constructed at runtime for properties that must use character entities. For example, the dfdl:terminator property's value is a list of DFDL string literals separated by whitespace. Within each string literal in the list, any whitespace must be represented using DFDL entities such as "%SP;" for the space character.  In the situation where the dfdl:terminator property is obtained from an element earlier in the data stream containing a single character, that character could be " " (a space); hence, the expression for dfdl:terminator must be

    dfdl:terminator='{ dfdl:encodeDFDLEntities(../../terminatorElement) }'

The value of the terminator property would then be "%SP;" which means the terminator is a single space character.
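To make the decode and encode directions concrete, here is a small Python sketch. It is not the DFDL functions themselves: it covers only a handful of named entities, the hexadecimal numeric form, and the "%%" escape for a literal percent sign, and the exact set of characters that dfdl:encodeDFDLEntities replaces is defined by the DFDL specification rather than by this sketch. The function and table names are hypothetical.

```python
import re

# A small illustrative subset of the DFDL named character entities.
NAMED = {"NUL": 0x00, "HT": 0x09, "LF": 0x0A, "VT": 0x0B,
         "FF": 0x0C, "CR": 0x0D, "SP": 0x20}
BY_CODE = {cp: name for name, cp in NAMED.items()}

# Matches "%%", "%#xHH;" and "%NAME;".
_ENTITY = re.compile(r"%(?:(%)|#x([0-9A-Fa-f]+);|([A-Z]+);)")


def decode_dfdl_entities(literal):
    """Replace DFDL entities with the characters they denote."""
    def repl(match):
        if match.group(1):
            return "%"
        if match.group(2):
            return chr(int(match.group(2), 16))
        return chr(NAMED[match.group(3)])  # KeyError for names outside the subset
    return _ENTITY.sub(repl, literal)


def encode_dfdl_entities(text):
    """Replace '%' and characters in the subset with entity notation."""
    out = []
    for ch in text:
        if ch == "%":
            out.append("%%")
        elif ord(ch) in BY_CODE:
            out.append("%" + BY_CODE[ord(ch)] + ";")
        else:
            out.append(ch)
    return "".join(out)
```

Under this sketch, a terminatorElement whose value is a single space encodes to "%SP;", as in the dfdl:terminator example above.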

Daffodil Infoset and TDML Runner

The Daffodil TDML runner constructs the <tdml:dfdlInfoset> element contents by post-processing all strings so that the DFDL character entities notation can be used to express XML-illegal characters.

So for example:

     <tdml:dfdlInfoset><foo>abc%NUL;</foo></tdml:dfdlInfoset>

would translate the %NUL; entity notation into character #x00, which is illegal in XML, and so it would be remapped to character #xE000. Hence, the above example is equivalent to writing:

     <tdml:dfdlInfoset><foo>abc&#xE000;</foo></tdml:dfdlInfoset>

which uses the XML numeric character entity to insert the remapped #xE000 character directly. The use of DFDL character entities simply allows the notational convenience of the symbolic form of these entities (NUL, CR, LF, HT, VT, FF, etc.), or the DFDL numeric entity form (for example "%#x02;"), for notational consistency across DFDL schema and TDML test files.

Use of the DFDL character entities is preferred, as it is portable to DFDL implementations other than Daffodil. The remapping of XML-illegal characters to the PUA is a Daffodil-specific behaviour.
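The equivalence of the two <tdml:dfdlInfoset> forms can be sketched as a two-step post-processing function: decode the DFDL entities, then remap XML-illegal characters. The Python below is a deliberately reduced illustration, not the TDML runner's code; tdml_text is a hypothetical name, only a few named entities and the hexadecimal form are decoded, and only the #x00 to #x1F part of the PUA mapping is applied.

```python
import re

NAMED = {"NUL": 0x00, "HT": 0x09, "LF": 0x0A, "VT": 0x0B, "FF": 0x0C, "CR": 0x0D}


def tdml_text(literal):
    """Decode DFDL entities, then remap XML-illegal C0 controls into the PUA."""
    def decode(match):
        if match.group(1):
            return chr(int(match.group(1), 16))
        return chr(NAMED[match.group(2)])

    decoded = re.sub(r"%(?:#x([0-9A-Fa-f]+);|([A-Z]+);)", decode, literal)

    def remap(ch):
        if ch == "\r":
            return "\n"                       # CR is converted to LF
        if ord(ch) < 0x20 and ch not in "\t\n":
            return chr(ord(ch) + 0xE000)      # e.g. #x00 -> #xE000
        return ch

    return "".join(remap(ch) for ch in decoded)
```

Here tdml_text("abc%NUL;") gives the same string an XML parser produces for abc&#xE000;, which is the equivalence described above.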

 

Other XML Output Options

XSLT has a variety of options in the xsl:output element whose names or meanings may be worth copying. It has options for encoding, for whether or not to add the XML declaration at the beginning of the output, and even a way to list the elements whose contents should be surrounded by CDATA bracketing.

See https://www.w3.org/TR/xslt#output
