Comment: Improved cdata discussion

...

It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.

XML Character Entity Conversion:

(Proposed)

While Daffodil proper stops with the DFDL Infoset, many applications of Daffodil will want to construct an actual XML document as a string/text representation from the DFDL Infoset.

...

To better accommodate this, a special translation may be helpful when converting from the Daffodil infoset to the XML textual representation. This affects only the realization of the infoset as an XML file/string, not the content of the Daffodil infoset itself. Hence, it is something

(Note: This conversion to text is outside of normal Daffodil processing which stops when the Infoset is created. In fact this is a general capability which could be used by any XML-oriented application programs which create an actual file/string/stream of textual characters.)

The special transformation has the following characteristics:

  • The ability to specify an encoding.
    • For example, an MS-Windows user may wish to specify the windows-1252 encoding.
    • The minimum set of supported encodings would be ASCII, windows-1252, and UTF-8.
      • Specifying UTF-8 turns off the numeric character entity substitution part of this special transformation.
  • Any Unicode code point which cannot be mapped to the selected encoding can be replaced by its XML numeric character entity equivalent.
    • Example: If the user specifies the US-ASCII encoding, there is no mapping for the Euro symbol €, which is Unicode #x20AC. This would be output as &#x20AC;
    • Example: If the user specifies the windows-1252 encoding, the PUA-mapped characters for XML-illegal code points such as 0 to 8, which become #xE000 to #xE008 in the Daffodil Infoset according to the PUA mapping described above, would become &#xE000; to &#xE008; in the output text.
  • An option allows the user to control whether an XML heading line such as <?xml version="1.0" encoding="windows-1252" ?> is generated at the start of the textual output.
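The characteristics above can be sketched in a few lines of Python. This is only an illustration of the proposed behavior, not Daffodil code: the function name `to_xml_text` and its parameters are hypothetical, and the sketch assumes a codec in which each character can be test-encoded on its own. It does not escape markup characters such as < or &, which is a separate option.

```python
def to_xml_text(s: str, encoding: str = "us-ascii",
                xml_declaration: bool = False) -> bytes:
    """Encode s, replacing unmappable characters with numeric entities.

    Hypothetical helper for illustration only; not part of Daffodil.
    """
    pieces = []
    for ch in s:
        try:
            # Probe whether the selected encoding can represent ch.
            ch.encode(encoding)
            pieces.append(ch)
        except UnicodeEncodeError:
            # No mapping: fall back to the hexadecimal character entity.
            pieces.append(f"&#x{ord(ch):X};")
    text = "".join(pieces)
    if xml_declaration:
        # Optional XML heading line naming the chosen encoding.
        text = f'<?xml version="1.0" encoding="{encoding}" ?>\n' + text
    return text.encode(encoding)


# US-ASCII has no Euro symbol, so it becomes a numeric entity.
ascii_out = to_xml_text("800\u20ac", "us-ascii")
# windows-1252 has no PUA characters, so they become numeric entities.
pua_out = to_xml_text("\ue000\ue008", "windows-1252")
# UTF-8 can encode every character here, so nothing is substituted.
utf8_out = to_xml_text("800\u20ac", "utf-8")
```

With UTF-8 every character in these inputs is encodable, so the substitution branch is never taken, which matches the statement that UTF-8 turns the substitution off.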

 

Note that choosing the ASCII or US-ASCII encoding creates an output that is universal, in that it uses only the 7-bit ASCII characters yet can accurately represent any character allowed in XML. This form, however, would be largely unreadable not only to users of oriental language scripts, but even to users of commonplace accented forms from European language scripts.

CDATA Escaping Option

(Proposed)

An additional option controls escaping of the special XML characters <, >, &, ", and '.

 

CDATA preference: When it is expected that string data contains, or could contain, one or more of the characters &, ", ', <, and >, then the user can specify an option for whether they prefer use of CDATA sections, or standard escaping where the standard character entities are used: &amp; &quot; &apos; &lt; &gt;.

An additional sub-option controls whether strings longer than some tunable length should always be surrounded by CDATA sections.

If characters requiring the numeric character entities described above are encountered, then the CDATA section will be ended, the character entity inserted, and then another CDATA section begun. This is necessary because character entities are not recognized inside a CDATA section.

Examples:

  • The string { ../x > 5 } could be rendered to text as either

    • { ../x &gt; 5 } or as

    • <![CDATA[{ ../x > 5 }]]> depending on the CDATA preference.

...

  • In US-ASCII encoding, the string { the cost is '800€' } could be encoded as

    • { the cost is &apos;800&#x20AC;&apos; } or

    • <![CDATA[{ the cost is '800]]>&#x20AC;<![CDATA[' }]]> which has two CDATA sections with the numeric character entity for the Euro symbol € between them.
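The two renderings in the examples above could be produced by something like the following Python sketch. All names here (`render_string`, `prefer_cdata`, `cdata_min_length`) are hypothetical and chosen for illustration; the proposal does not define an API. Splitting a literal `]]>` across two CDATA sections is an extra safeguard assumed here, since that sequence cannot appear inside a CDATA section; it is not discussed in the proposal.

```python
STANDARD_ENTITIES = {"&": "&amp;", '"': "&quot;", "'": "&apos;",
                     "<": "&lt;", ">": "&gt;"}


def _encodable(ch: str, encoding: str) -> bool:
    """Return True if ch has a mapping in the selected encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def render_string(s: str, encoding: str = "us-ascii",
                  prefer_cdata: bool = False,
                  cdata_min_length=None) -> str:
    """Render s as XML text (hypothetical sketch, not Daffodil code)."""
    needs_escaping = any(ch in STANDARD_ENTITIES for ch in s)
    too_long = cdata_min_length is not None and len(s) > cdata_min_length
    if not ((prefer_cdata and needs_escaping) or too_long):
        # Standard escaping: named entities for the special characters,
        # numeric entities for anything the encoding cannot represent.
        return "".join(
            STANDARD_ENTITIES.get(ch)
            or (ch if _encodable(ch, encoding) else f"&#x{ord(ch):X};")
            for ch in s)

    out, run = [], []

    def flush():
        # Emit the pending characters as one CDATA section.
        if run:
            # Assumed safeguard: break up any literal "]]>" in the data.
            body = "".join(run).replace("]]>", "]]]]><![CDATA[>")
            out.append(f"<![CDATA[{body}]]>")
            run.clear()

    for ch in s:
        if _encodable(ch, encoding):
            run.append(ch)
        else:
            # End the CDATA section, insert the entity, start a new one.
            flush()
            out.append(f"&#x{ord(ch):X};")
    flush()
    return "".join(out)


escaped = render_string("{ the cost is '800\u20ac' }")
sectioned = render_string("{ the cost is '800\u20ac' }", prefer_cdata=True)
```

In this sketch a string with no special characters is never wrapped in CDATA unless it exceeds `cdata_min_length`, which is one possible reading of how the preference and the size sub-option interact.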

DFDL Expressions and Daffodil Infoset Strings

...