Comment: Added scheme for output in non-utf-8 with numeric character entities

...

It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.

XML Character Entity Conversion:

While Daffodil proper stops with the DFDL Infoset, many applications of Daffodil will want to construct an actual XML document as a string/text representation from the DFDL Infoset.

If the output encoding is UTF-8, no special conversion is needed. However, some systems do not handle UTF-8 well. Specifically, Microsoft Windows, as of Windows 7 installed in the default US-English configuration, does not display UTF-8-encoded Unicode properly in its default tools, such as the command line and the ubiquitous Notepad and WordPad programs.

To better accommodate this, a special translation may be helpful when converting from the Daffodil infoset to the XML textual representation. This does not affect the content of the Daffodil infoset, only its realization as an XML file or string. Hence, it lies outside normal Daffodil processing and is used by application programs that create an actual file, string, or stream of textual characters.

The special transformation has the following characteristics:

  • The ability to specify an encoding.
    • For example, a MS-Windows user may wish to specify the windows-1252 encoding.
    • The minimum set of supported encodings would be ASCII, windows-1252, and UTF-8.
      • Specifying UTF-8 turns off the numeric character entity substitution part of this special transformation.
  • Any Unicode code point which cannot be mapped to the selected encoding can be replaced by its XML numeric character entity equivalent.
    • Example: If the user specifies the US-ASCII encoding, there is no mapping for the Euro symbol €, which is Unicode #x20AC. This would be output as &#x20AC;
    • Example: If the user specifies the windows-1252 encoding, the XML-illegal code points 0 to 8 will already have been mapped to #xE000 to #xE008 in the Daffodil Infoset, according to the PUA mapping described above. Because windows-1252 has no mapping for these PUA characters, they would be output as &#xE000; to &#xE008;
  • An option allows the user to control whether an XML heading line such as <?xml version="1.0" encoding="windows-1252" ?> is generated at the start of the textual output.
  • CDATA preference: When a string to be translated contains no numeric character entities, per the above conversion of unmapped characters, but it does contain one or more of the characters &, ", ', <, and >, then the user can specify whether they prefer use of CDATA sections, or standard escaping where the standard character entities are used: &amp; &quot; &apos; &lt; &gt;.
    • Example: The string { ../x > 5 } could be rendered to text as either { ../x &gt; 5 } or as <![CDATA[{ ../x > 5 }]]> depending on the CDATA preference.
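
The characteristics listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Daffodil's implementation: the names render_text, is_mappable, and xml_heading are hypothetical, the encoding names are those accepted by Python's codec registry, and falling back to standard escaping when the string itself contains "]]>" is an added assumption that the text above does not address.

```python
# Hypothetical sketch: render one infoset string as XML text for a chosen encoding.
SPECIALS = {"&": "&amp;", '"': "&quot;", "'": "&apos;", "<": "&lt;", ">": "&gt;"}


def is_mappable(ch: str, encoding: str) -> bool:
    """True if the single character ch has a mapping in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def render_text(value: str, encoding: str = "us-ascii", prefer_cdata: bool = False) -> str:
    """Escape value, replacing unmappable characters with &#xHHHH; entities."""
    has_unmapped = any(not is_mappable(ch, encoding) for ch in value)
    has_special = any(ch in SPECIALS for ch in value)
    # CDATA applies only when no numeric character entities are needed.
    # Assumption: a value containing "]]>" cannot sit in one CDATA section,
    # so it falls back to standard escaping.
    if prefer_cdata and has_special and not has_unmapped and "]]>" not in value:
        return "<![CDATA[" + value + "]]>"
    out = []
    for ch in value:
        if ch in SPECIALS:
            out.append(SPECIALS[ch])
        elif is_mappable(ch, encoding):
            out.append(ch)
        else:
            out.append("&#x{:X};".format(ord(ch)))
    return "".join(out)


def xml_heading(encoding: str) -> str:
    """Optional heading line naming the selected encoding."""
    return '<?xml version="1.0" encoding="{}" ?>'.format(encoding)
```

With this sketch, render_text("\u20ac", "us-ascii") yields &#x20AC;, while render_text("\u20ac", "windows-1252") leaves the Euro symbol as is because windows-1252 can represent it. Selecting "utf-8" never triggers substitution for characters allowed in the infoset, which matches the rule that UTF-8 turns the entity substitution off.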

Note that choosing the ASCII (US-ASCII) encoding creates output that is universal: it uses only 7-bit ASCII characters, yet can accurately represent any character allowed in the Daffodil infoset. This form, however, would be largely unreadable, not only to users of East Asian scripts but even to users of the commonplace accented characters of European languages.
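
As a rough check of this claim, the following Python sketch writes a string as ASCII-only XML and parses it back. It is an illustration only: ascii_xml is a hypothetical helper, and it relies on Python's built-in xmlcharrefreplace error handler, which emits decimal references (such as &#8364;) rather than the hexadecimal form used in the examples above. Both forms are equivalent to an XML parser.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def ascii_xml(tag: str, value: str) -> bytes:
    """Wrap value in <tag>...</tag> using only 7-bit ASCII bytes (hypothetical helper)."""
    # Escape &, < and > first, then let the codec turn every non-ASCII
    # character into a decimal numeric character reference.
    body = escape(value).encode("ascii", errors="xmlcharrefreplace")
    name = tag.encode("ascii")
    return b"<" + name + b">" + body + b"</" + name + b">"


original = "caf\u00e9 \u20ac \ue000"
doc = ascii_xml("value", original)

# Every byte is 7-bit ASCII, yet a standard XML parser recovers the string.
assert all(b < 0x80 for b in doc)
assert ET.fromstring(doc).text == original
```

The round trip succeeds, but the intermediate text shows the readability cost: even the single accented letter in "café" appears as a numeric reference.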

DFDL Expressions and Daffodil Infoset Strings

...