Daffodil is an implementation of DFDL which uses JDOM and XML to represent the DFDL Infoset.

The DFDL Infoset is somewhat different from the XML Infoset.

In truth, Daffodil approximates the DFDL Infoset by using a subset of the XML Infoset features made visible via the JDOM libraries, and by embellishing JDOM Elements with distinguished attributes.

Use of JDOM is motivated by the ability to plug JDOM into the Saxon-B XPath implementation as a way of realizing the DFDL expression language, which is a subset of XPath 2.0.

Ultimately, the Scala API for Daffodil converts the JDOM objects into Scala's native XML objects, e.g., scala.xml.Elem being the class of Element nodes.

The Java API returns the JDOM objects directly to the caller.

Namespaces and Prefixes

The Daffodil implementation uses attributes in a few distinct namespaces to embellish JDOM Elements.

The string "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012" is the Daffodil implementation namespace prefix. All Daffodil-specific namespaces extend this.

The URN suffix "...:int", appended to the prefix above, gives the URN for Daffodil internal use. By convention it is bound to the prefix 'dafint'. Attributes and elements in this namespace are for internal use by the Daffodil implementation.

The URN suffix "...:ext" gives the Daffodil extension namespace, by convention bound to the prefix 'daf'. This is used for Daffodil extensions to the DFDL specification, such as new properties or annotations. Attributes or elements in this namespace are effectively visible parts of the Daffodil API, intended to be used and understood by DFDL schema authors using Daffodil.

We also use the standard 'xsi' prefix/namespace, and 'xs' prefix/namespace.
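
Read literally, the two suffixes are appended to the implementation prefix. The following is a minimal sketch of that reading in Python. The constant names are illustrative only and are not taken from the Daffodil source, and the exact concatenation is an assumption based on the "...:int" and "...:ext" notation above.

```python
# Implementation namespace prefix, as quoted in the text above.
DAFFODIL_NS_PREFIX = "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012"

# Assumption: the ":int" and ":ext" suffixes are appended directly to the prefix.
DAFINT_NS = DAFFODIL_NS_PREFIX + ":int"  # conventionally bound to 'dafint'
DAF_NS = DAFFODIL_NS_PREFIX + ":ext"     # conventionally bound to 'daf'

print(DAFINT_NS)
print(DAF_NS)
```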

Mapping of DFDL Infoset to Daffodil JDOM Infoset to Scala XML Nodes

DFDL Infoset | Daffodil's JDOM XML Infoset | Scala scala.xml.Node Infoset
Document Information Item | JDOM Document | The document is represented by the root element. There is no separate document item.
root | getRootElement() | none
dfdlVersion | attribute daffodil:dfdlVersion on the root element. | none
schema (reserved for future use) | (no implementation) | none
unicodeByteOrderMark | attribute daf:unicodeByteOrderMark on the root element. | same attribute scheme as JDOM
Element Information Item | JDOM Element | scala.xml.Elem
namespace | getNamespace(): org.jdom.Namespace | def namespace: String
name | getName(): String | def label: String
document | getDocument() | none (see parent)
datatype | attribute xsi:type with value one of the set of XML Schema simple type QNames that are in the DFDL Subset of XML Schema. For example: xsi:type='xs:string'. By convention, the prefixes 'xsi' and 'xs' denote here the usual standard namespace URIs. | same attribute scheme as JDOM
dataValue | For simple types other than xs:string, the canonical XML representation of the value, as returned by getText(). However, for the value nil, the representation is an element with no value having the xsi:nil='true' attribute. For type xs:string, the DFDL Infoset allows representation of characters that are illegal in XML. These are represented by replacing them with characters in the Unicode Private Use Area by a scheme described below. | def text: String to obtain canonical text. Nil representation is the same attribute scheme. Values containing XML-illegal characters use the same scheme.
children | getChildren() | def child: Seq[Node]
parent | getParent() | none. Scala XML nodes are immutable, and do not have parent references. This allows nodes to be shared.
schema | A special attribute dafi:schemaComponentID has a value which can be used to retrieve the associated schema component. (Not yet implemented: means to create a standard Schema Component Designator or SCD) | Same attribute scheme
valid | (Not yet implemented) | (not yet implemented)
unionMemberSchema | (Not yet implemented) | (not yet implemented)
"No Value" | A JDOM Element with no children and no dataValue is the representation of an element with "No Value". | A Scala Elem with no children and no dataValue.
Augmented Infoset | A JDOM Element with a special marker attribute: dafint:hidden='true' signifies that the element is part of the augmented infoset. This attribute is used to identify and filter out elements when the un-augmented infoset is needed. | Same attribute scheme, but on scala.xml.Elem element.
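
The filtering of augmented-infoset elements can be illustrated with a small sketch. It is written in Python with the standard xml.etree.ElementTree module rather than JDOM or scala.xml, purely to show the idea. The helper name strip_hidden is hypothetical, and the namespace URI assumes the ":int" suffix is appended directly to the implementation prefix.

```python
import xml.etree.ElementTree as ET

# Assumption: the 'dafint' namespace is the implementation prefix plus ":int".
DAFINT_NS = "urn:ogf:dfdl:2013:imp:opensource.ncsa.illinois.edu:2012:int"
HIDDEN_ATTR = "{" + DAFINT_NS + "}hidden"  # ElementTree's {uri}local naming


def strip_hidden(elem):
    """Remove, in place, every descendant marked dafint:hidden='true'."""
    for child in list(elem):  # copy the child list, since we mutate elem
        if child.get(HIDDEN_ATTR) == "true":
            elem.remove(child)
        else:
            strip_hidden(child)
    return elem


augmented = ET.fromstring(
    '<root xmlns:dafint="' + DAFINT_NS + '">'
    '<a>1</a><b dafint:hidden="true">2</b>'
    '<c><d dafint:hidden="true"/><e>3</e></c>'
    '</root>'
)
unaugmented = strip_hidden(augmented)
print([child.tag for child in unaugmented])
```

Because Scala XML nodes are immutable, an equivalent Scala version would have to build a new tree instead of removing children in place.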

Implementation of DFDL Infoset Strings

Since DFDL strings can contain characters that are not allowed in XML at all, these characters are mapped into the Unicode Private Use Area (PUA), which spans characters #xE000 to #xF8FF.

This is similar to the scheme used by Microsoft Visio (See: http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx), but extended to handle all the XML 1.0 illegal characters including those with 16-bit codepoint values.

These are the legal XML characters (for XML v1.0):

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

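For illegal characters with values from #x00 to #x1F (that is, every value in that range except the legal #x9, #xA, and #xD), these values are mapped to the PUA by adding #xE000 to their character code. So #x00 maps to #xE000, and #x1F maps to #xE01F.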

For illegal characters #xD800 to #xDFFF, these values are mapped to the PUA by adding #x1000 to their character code. So #xD800 maps to #xE800, and #xDFFF maps to #xEFFF.

For illegal characters #xFFFE and #xFFFF these values are mapped to the PUA by subtracting #x0F00 from their character code, so to characters #xF0FE and #xF0FF.

This mapping is used bi-directionally. When parsing, illegal characters are replaced by their legal PUA counterparts. When unparsing, the reverse transformation is performed, which allows data containing XML-illegal characters to be created from legal XML documents that contain only the corresponding mapped PUA characters.
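
The three mapping rules and their inverses can be sketched as follows. This is an illustrative Python rendering of the rules stated above, not Daffodil's actual code, and the function names are hypothetical. It assumes that the legal control characters #x9, #xA, and #xD are left untouched in both directions.

```python
def to_pua(ch: str) -> str:
    """Map one XML-illegal character to its PUA stand-in (parse direction)."""
    cp = ord(ch)
    if cp <= 0x1F and cp not in (0x09, 0x0A, 0x0D):
        return chr(cp + 0xE000)   # #x00-#x1F   -> #xE000-#xE01F
    if 0xD800 <= cp <= 0xDFFF:
        return chr(cp + 0x1000)   # #xD800-#xDFFF -> #xE800-#xEFFF
    if cp in (0xFFFE, 0xFFFF):
        return chr(cp - 0x0F00)   # #xFFFE, #xFFFF -> #xF0FE, #xF0FF
    return ch                     # already legal in XML 1.0


def from_pua(ch: str) -> str:
    """Reverse the mapping (unparse direction)."""
    cp = ord(ch)
    if 0xE000 <= cp <= 0xE01F and cp not in (0xE009, 0xE00A, 0xE00D):
        return chr(cp - 0xE000)
    if 0xE800 <= cp <= 0xEFFF:
        return chr(cp - 0x1000)
    if cp in (0xF0FE, 0xF0FF):
        return chr(cp + 0x0F00)
    return ch


def remap_for_xml(s: str) -> str:
    return "".join(to_pua(c) for c in s)


def restore_from_xml(s: str) -> str:
    return "".join(from_pua(c) for c in s)


# ascii() is used so that unprintable characters show up as escapes.
print(ascii(remap_for_xml("abc\x00")))
```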

It is a processing error when parsing if any DFDL infoset string contains characters in the parts of the PUA used by this mapping for illegal XML codepoints.

(Possible future: toggle mechanism so you can turn on/off this mapping, allowing processing of data so long as it does not contain both PUA characters AND illegal XML characters)

It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.
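
The two processing-error conditions can be expressed as a check over code points. The sketch below is a hypothetical Python helper, not part of Daffodil. It assumes that "the parts of the PUA used by this mapping" means exactly the code points the three rules can produce, and it works on integer code points because a Python string cannot hold a value above #x10FFFF.

```python
MAX_CODEPOINT = 0x10FFFF


def is_reserved_pua(cp: int) -> bool:
    """True if cp is a PUA code point that the mapping itself can produce."""
    if 0xE000 <= cp <= 0xE01F:
        # Images of #x00-#x1F, minus the images of the legal #x9, #xA, #xD.
        return cp not in (0xE009, 0xE00A, 0xE00D)
    return 0xE800 <= cp <= 0xEFFF or cp in (0xF0FE, 0xF0FF)


def check_parsed_codepoints(codepoints):
    """Raise ValueError for code points that would be processing errors."""
    for cp in codepoints:
        if cp > MAX_CODEPOINT:
            raise ValueError(f"code point #x{cp:X} exceeds #x10FFFF")
        if is_reserved_pua(cp):
            raise ValueError(f"code point #x{cp:X} collides with the PUA mapping")


check_parsed_codepoints([0x41, 0x20AC])  # ordinary characters pass silently
```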

XML Character Entity Conversion:

(Proposed)

While Daffodil proper stops with the DFDL Infoset, many applications of Daffodil will want to construct an actual XML document as a string/text representation from the DFDL Infoset.

If the output encoding is UTF-8, no special conversion is needed. However, some systems do not handle UTF-8 well. Specifically, the Microsoft Windows Operating System, as of Version 7, when installed in the default US-English configuration, does not display UTF-8 Unicode properly in the default tools, such as the command line and the ubiquitous Notepad and WordPad programs.

To better accommodate this, a special translation may be helpful when converting from the Daffodil infoset to the XML textual representation. This does not affect the content of the Daffodil infoset, but only its realization as an XML file/string.

(Note: This conversion to text is outside of normal Daffodil processing which stops when the Infoset is created. In fact this is a general capability which could be used by any XML-oriented application programs which create an actual file/string/stream of textual characters.)

The special transformation has the following characteristics:

  • The ability to specify an encoding.
    • For example, a MS-Windows user may wish to specify the windows-1252 encoding.
    • The minimum set of supported encodings would be ASCII, windows-1252, and UTF-8
      • Specifying UTF-8 turns off the numeric character entity substitution part of this special transformation.
  • Any unicode codepoint which cannot be mapped to the selected encoding can be replaced by its XML numeric character entity equivalent.
    • Example: If the user specifies the US-ASCII encoding, there is no mapping for the Euro symbol €, which is Unicode #x20AC. This would be output as &#x20AC;
    • Example: If the user specifies the windows-1252 encoding, the XML-illegal code points 0 to 8 appear in the Daffodil Infoset as #xE000 to #xE008, according to the PUA mapping described above, and would become &#xE000; to &#xE008; in the output text.
  • An option allows the user to control whether an XML heading line such as <?xml version="1.0" encoding="windows-1252" ?> is generated at the start of the textual output.

 

Note that choosing the ASCII or US-ASCII encoding creates output that is universal: it uses only the 7-bit ASCII characters, yet can accurately represent any character allowed in XML. This form, however, would be largely unreadable not only to users of East Asian scripts, but even to users of commonplace accented forms from European language scripts.
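
Since this transformation is only proposed, the following Python sketch shows one possible shape for it rather than an existing Daffodil feature. The function name to_xml_text and its parameters are hypothetical. It replaces each character the chosen encoding cannot represent with a hexadecimal numeric character entity, and it optionally emits the XML heading line. Escaping of markup characters such as < and & is deliberately left out of this sketch.

```python
def to_xml_text(infoset_text: str, encoding: str = "utf-8",
                xml_declaration: bool = False) -> str:
    """Render infoset text so that it is representable in `encoding`."""
    pieces = []
    for ch in infoset_text:
        try:
            ch.encode(encoding)      # representable: keep the character
            pieces.append(ch)
        except UnicodeEncodeError:   # not representable: numeric entity
            pieces.append(f"&#x{ord(ch):X};")
    body = "".join(pieces)
    if xml_declaration:
        body = f'<?xml version="1.0" encoding="{encoding}" ?>\n' + body
    return body


print(to_xml_text("800\u20ac", "us-ascii"))
print(to_xml_text("\ue000\ue008", "windows-1252"))
```

With UTF-8 selected, every PUA-mapped character is representable, so no substitution takes place, which matches the intent that UTF-8 turns the substitution off.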

CDATA Escaping Option

(Proposed)

An additional option controls escaping of the special XML characters <, >, &, ", and '.

CDATA preference: When string data contains, or is expected to contain, one or more of the characters &, ", ', <, and >, the user can specify whether they prefer CDATA sections or standard escaping using the standard character entities: &amp; &quot; &apos; &lt; &gt;.

An additional sub-option controls whether strings greater than some tunable size should always be surrounded by CDATA sections.

If characters requiring the above described numeric character entities are encountered, then the CDATA section will be ended, the character entity inserted, and then another CDATA section begun.

Examples:

  • The string { ../x > 5 } could be rendered to text as either

    •  { ../x &gt; 5 } or as

    • <![CDATA[{ ../x > 5 }]]> depending on the CDATA preference.

  • In US-ASCII encoding, the string { the cost is '800€' } could be encoded as

    •  { the cost is &apos;800&#x20AC;&apos; } or

    • <![CDATA[{ the cost is '800]]>&#x20AC;<![CDATA[' }]]> which has two CDATA sections with the numeric character entity for the Euro symbol € between them.
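
The examples above can be reproduced with a short Python sketch. As the option is only proposed, the function name render and its parameters are hypothetical. The sketch always wraps runs of encodable characters in CDATA when the preference is set, it ignores the size-based sub-option, and it splits any literal "]]>" so that it cannot terminate a CDATA section early.

```python
from xml.sax.saxutils import escape


def _encodable(ch: str, encoding: str) -> bool:
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def render(text: str, encoding: str = "us-ascii", prefer_cdata: bool = False) -> str:
    out = []   # finished pieces of output
    run = []   # pending characters that the encoding can represent

    def flush():
        if not run:
            return
        chunk = "".join(run)
        run.clear()
        if prefer_cdata:
            # "]]>" may not appear inside CDATA, so split it across sections.
            chunk = chunk.replace("]]>", "]]]]><![CDATA[>")
            out.append("<![CDATA[" + chunk + "]]>")
        else:
            # escape() handles &, <, >; quotes are added via the extra table.
            out.append(escape(chunk, {'"': "&quot;", "'": "&apos;"}))

    for ch in text:
        if _encodable(ch, encoding):
            run.append(ch)
        else:
            flush()                           # end the current run or CDATA section
            out.append(f"&#x{ord(ch):X};")    # numeric entity sits between sections
    flush()
    return "".join(out)


print(render("{ ../x > 5 }"))
print(render("{ the cost is '800\u20ac' }", prefer_cdata=True))
```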

DFDL Expressions and Daffodil Infoset Strings

We use Saxon-B and JDOM so as to utilize the XPath implementation to realize DFDL expressions.

DFDL Infoset strings are accommodated by way of a function daf:string(...). This function takes a single argument of type string, and it interprets the DFDL numeric entities and DFDL character entities notations, and inserts the corresponding characters into the string result.

In addition, if the DFDL character entities identify XML-illegal characters, then the PUA-replacement described above is performed.
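
The behaviour described for daf:string(...) can be approximated in a few lines. The Python sketch below is an illustration of the description above, not the Saxon extension function itself, and the name daf_string is hypothetical. It assumes only the named entities listed on this page (NUL, HT, LF, VT, FF, CR) plus the hexadecimal and decimal numeric forms. Other DFDL entity notations, such as raw byte entities or the %% escape, are left untouched.

```python
import re

# Assumption: only the named entities mentioned on this page are supported.
NAMED_ENTITIES = {"NUL": 0x00, "HT": 0x09, "LF": 0x0A,
                  "VT": 0x0B, "FF": 0x0C, "CR": 0x0D}

# %#xHH; (hex), %#DD; (decimal), or %NAME; (symbolic)
_ENTITY = re.compile(r"%(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Z]+));")


def _to_pua(cp: int) -> int:
    """Apply the PUA remapping to one code point if it is illegal in XML."""
    if cp <= 0x1F and cp not in (0x09, 0x0A, 0x0D):
        return cp + 0xE000
    if 0xD800 <= cp <= 0xDFFF:
        return cp + 0x1000
    if cp in (0xFFFE, 0xFFFF):
        return cp - 0x0F00
    return cp


def daf_string(s: str) -> str:
    def expand(match):
        hex_digits, dec_digits, name = match.groups()
        if hex_digits is not None:
            cp = int(hex_digits, 16)
        elif dec_digits is not None:
            cp = int(dec_digits)
        elif name in NAMED_ENTITIES:
            cp = NAMED_ENTITIES[name]
        else:
            return match.group(0)   # unknown name: leave the text as written
        return chr(_to_pua(cp))     # chr() rejects values above #x10FFFF

    return _ENTITY.sub(expand, s)


print(ascii(daf_string("abc%NUL;")))
```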

(Note 2012-12-05: this function is proposed to the DFDL Working Group for inclusion in the DFDL version 1.0 standard, in which case it would use the standard prefix, i.e., dfdl:string(..) )

Daffodil Infoset and TDML Runner

The Daffodil TDML runner constructs the <tdml:dfdlInfoset> element contents by post-processing all strings so that the DFDL character entities notation can be used to express XML-illegal characters.

So for example:

     <tdml:dfdlInfoset><foo>abc%NUL;</foo></tdml:dfdlInfoset>

would translate the %NUL; entity notation into character #x00, which is illegal in XML, and so it would be remapped to character #xE000. Hence, the above example is equivalent to writing:

     <tdml:dfdlInfoset><foo>abc&#xE000;</foo></tdml:dfdlInfoset>

which uses the XML numeric character entity to insert the remapped #xE000 character directly. The DFDL character entities simply offer the notational convenience of their symbolic form (NUL, CR, LF, HT, VT, FF, etc.), or of the DFDL numeric entities form (for example "%#x02;"), for notational consistency across DFDL schema and TDML test files.

Use of the DFDL character entities is preferred because it is portable to DFDL implementations other than Daffodil. The remapping of XML-illegal characters to the PUA is a Daffodil-specific behaviour.

 

 
