When writing TDML tests, there are some tricks and techniques that make the tests more robust and ensure their portability across platforms. When code-reviewing TDML tests, these can be treated as part of the "checklist" of things we look for.

Using CDATA Regions

...

The problem is that CDATA regions do not mean "preserve exactly what is here". Rather, they are just a different way to avoid having to escape the & and < characters. XML's general treatment of whitespace as fungible still applies.

OK: To preserve textual formatting within TDML - for clarity reasons.

E.g.,

<tdml:documentPart type="byte"><![CDATA[
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f              
10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
20 21    23 24 25    27 28 29 2a 2b 2c 2d 2e 2f
30 31 32 33 34 35 36 37 38 39 3a 3b    3d    3f
40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f
60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f
70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f
80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af
b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df
e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff  
]]></tdml:documentPart>

Without the formatting, it would be hard to see exactly where the holes in the above matrix of hex are, but logically the whitespace is irrelevant. In effect, we use CDATA here so that tooling such as IDEs and XML editors will not mess with the formatting of the content.
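To make the "logically irrelevant" point concrete, here is a minimal Python sketch. It is not the TDML runner's own hex parsing, only an illustration that relies on bytes.fromhex, which skips ASCII whitespace (Python 3.7 and later). The variable names are made up for the example.

```python
# Illustration only: the TDML runner has its own parsing of type="byte" content.
formatted = """
00 01 02 03
10 11    13
"""
flat = "00010203101113"

# Spaces, runs of spaces, and newlines between byte pairs are all skipped,
# so the laid-out matrix and the flat string describe the same bytes.
assert bytes.fromhex(formatted) == bytes.fromhex(flat)
print(len(bytes.fromhex(formatted)))  # prints 7
```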

OK: As a clearer way to escape things than using &amp; &gt; &lt; &apos;.

E.g.,

<foo>abc<![CDATA[&&&]]>def&#xE000;ghi</foo>

...

<foo>abc&amp;&amp;&amp;def&#xE000;ghi</foo>

This will have only one Text node in it containing 13 characters: Text(abc&&&defghi).

We get this single text node to match the 5 text nodes above by clever comparison routines that are used when XML is compared. These special purpose routines also do things such as ignoring the namespace prefixes on element names. (Probably undesirable longer term.)
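The equivalence of the two spellings can be checked with any parser that flattens character data. The sketch below uses Python's xml.etree.ElementTree, which merges CDATA sections, character references, and ordinary text into one string. It is not the comparison routine mentioned above, just a way to see that both forms carry the same 13 characters.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same content: CDATA versus entity escapes.
with_cdata = "<foo>abc<![CDATA[&&&]]>def&#xE000;ghi</foo>"
escaped = "<foo>abc&amp;&amp;&amp;def&#xE000;ghi</foo>"

text_a = ET.fromstring(with_cdata).text
text_b = ET.fromstring(escaped).text

# ElementTree does not keep CDATA as a separate node; both parse to one string.
assert text_a == text_b == "abc&&&def\ue000ghi"
assert len(text_a) == 13
```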

OK: To avoid insertion of whitespace that would make things incorrect.

For example, here we need the document to contain exactly and only two characters:

<document><documentPart type="text"><![CDATA[a年]]></documentPart></document>

The problem is that the contents of the documentPart element are treated as literal data. Had we left off the CDATA, some XML tool might have reformatted this as:

        <document>
          <documentPart type="text">
            a年
          </documentPart>
        </document>

...

         <document><!-- WS --><documentPart type="text">
            a年
         </documentPart><!-- WS --></document>

Those all-whitespace nodes are usually unimportant. There's a feature (that we don't use) in XML called xml:space='preserve'. If this is found on an element, then it indicates that the all-whitespace nodes are important and should be preserved. But we don't care about these all-whitespace nodes. What we do care about is the whitespace surrounding the content of the documentPart element.

Now, XML itself never inserts whitespace - tools/IDEs and people editing the text of XML documents insert it. One might adopt a policy of "never mess with the formatting of the TDML files", particularly with auto-formatter tools. But XML/TDML is quite verbose, and without automatic formatting tools it can become a bit of a mess. It is better to adopt practices under which tests are written so that they are NOT sensitive to auto-reformatting.

The above is problematic because the reformatting inserted content whitespace. That is, the whitespace surrounding the 'a年' text all ends up as part of the content of the documentPart element. The inserted leading and trailing whitespace here is NOT separate all-whitespace nodes. Rather, there is a single Text node that is the content of this documentPart element. In my opinion, XML tools should NOT insert whitespace unless it is all-whitespace nodes, but tools will do the above re-indenting of XML sometimes, and it will break our tests if it does.
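A small Python sketch shows the effect. It uses xml.etree.ElementTree as a stand-in for whatever XML library reads the TDML file, and part_text is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

compact = '<document><documentPart type="text"><![CDATA[a年]]></documentPart></document>'
reindented = """<document>
  <documentPart type="text">
    a年
  </documentPart>
</document>"""

def part_text(xml_source):
    """Return the character content of the first documentPart element."""
    return ET.fromstring(xml_source).find("documentPart").text

# With CDATA on one line, the content is exactly the two intended characters.
assert part_text(compact) == "a年"

# After re-indenting, the newlines and indentation are part of the content.
assert part_text(reindented) != "a年"
assert part_text(reindented).strip() == "a年"
print(len(part_text(compact)), len(part_text(reindented)))
```

The second length is larger than 2 because the leading and trailing whitespace sits in the same text content as the two data characters.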

For the above reason, we use CDATA bracketing. The XML tools we've seen so far (Eclipse, Altova XML Spy, ...) seem to detect the CDATA bracketing and turn off any auto-indent logic for it. The tools we've used so far will not insert whitespace before or after a CDATA region.

So we care very much about the whitespace inside the documentPart element when it is type="text". (When type="byte" or "bits" we don't care, except for TDML file formatting - which is still very important for clarity reasons.)

In the above case, since we really do care about whitespace being inserted here, we use CDATA.

NOT OK: To preserve specific line endings

Using CDATA does NOT necessarily preserve line endings. Suppose you have a test containing this:

<documentPart type="text"><![CDATA[This is text followed by a CR LF
]]></documentPart>

If you edit that on a Windows machine, where CRLF is the usual text line ending, then the file will actually have a CRLF line ending in that text. If the test has, say, dfdl:terminator="%CR;%LF;", then this will (or should) fail because, no matter what, XML always normalizes line endings to a single character, LF. It replaces CRLF with LF, and an isolated CR with LF. The net result: by the time a program reads the XML data, it will (or should) see only LF line endings.
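The normalization is easy to observe. The following sketch feeds raw bytes containing CR LF, and then a lone CR, inside a CDATA region to Python's xml.etree.ElementTree. This is only one parser, used here as an example of conforming behavior.

```python
import xml.etree.ElementTree as ET

# Raw bytes as they might sit in a file saved on Windows (CR LF inside CDATA).
crlf_source = b'<documentPart type="text"><![CDATA[line one\r\n]]></documentPart>'
# The same content with an isolated CR.
cr_source = b'<documentPart type="text"><![CDATA[line one\r]]></documentPart>'

# In both cases the parser hands the application a single LF.
assert ET.fromstring(crlf_source).text == "line one\n"
assert ET.fromstring(cr_source).text == "line one\n"
```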

It is possible to get a literal CR character into XML content, but ONLY by using a numeric character reference, i.e., &#xD;. So one might try to write the above test as:

<documentPart type="text"><![CDATA[This is text followed by a CR LF]]></documentPart>
<documentPart type="text">&#xD;&#xA;</documentPart>

Even this, however, is not a sure thing, because re-indenting the XML might cause you to get:

<documentPart type="text"><![CDATA[This is text followed by a CR LF]]></documentPart>
<documentPart type="text">
   &#xD;&#xA;
</documentPart>

which would be broken because of the whitespace insertions around the &#xD;&#xA;.
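Both halves of this can be checked with a short Python sketch, again using xml.etree.ElementTree as an example parser: the character references do survive as CR and LF, but re-indentation adds content around them.

```python
import xml.etree.ElementTree as ET

compact = '<documentPart type="text">&#xD;&#xA;</documentPart>'
reindented = '<documentPart type="text">\n   &#xD;&#xA;\n</documentPart>'

# Numeric character references are not subject to line-ending normalization.
assert ET.fromstring(compact).text == "\r\n"

# After re-indenting, the CR LF is still there, but so is the added whitespace.
assert ET.fromstring(reindented).text == "\n   \r\n\n"
```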

There are two good solutions to this problem. First, one can use type="byte" document parts:

<documentPart type="text"><![CDATA[This is text followed by a CR LF]]></documentPart>
<documentPart type="byte">0D 0A</documentPart>

This will always create exactly the bytes 0D and 0A, and documentParts are concatenated together with nothing between them. However, this will break if the documentPart has an encoding where CR and LF are not represented exactly by the bytes 0D and 0A. For example, we currently support encoding="us-ascii-7-bit-packed", which is needed for MIL-STD-2045 and related formats. In that encoding, CR and LF each take up only 7 bits, for a total of 14 bits, not 2 full bytes.
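The mismatch is just arithmetic, sketched below in Python. The helper encoded_bit_length is invented for the example, and the sketch only counts bits. It does not model the bit order or packing rules of the real 7-bit encoding.

```python
def encoded_bit_length(text, bits_per_char):
    """Number of bits needed when every character occupies bits_per_char bits."""
    return len(text) * bits_per_char

terminator = "\r\n"
byte_part = bytes.fromhex("0D 0A")  # what the type="byte" part contributes

bits_from_byte_part = len(byte_part) * 8
bits_expected_packed = encoded_bit_length(terminator, 7)

# The byte part supplies 16 bits where a 7-bit packed CR LF occupies 14.
assert bits_from_byte_part == 16
assert bits_expected_packed == 14
```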

The best way to handle this problem is to use the documentPart replaceDFDLEntities attribute:

<documentPart type="text" replaceDFDLEntities="true"><![CDATA[This is text followed by a CR LF%CR;%LF;]]></documentPart>

The line gets rather long, but %CR; and %LF; are DFDL entity syntax for those Unicode characters. They are translated into whatever encoding the documentPart specifies, so this will be robust even if the encoding is, say, UTF-16 or a 7-bit packed encoding.
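The idea of "replace entities first, then encode" can be sketched in Python. This is not Daffodil's implementation: replace_dfdl_entities and encode_part are hypothetical helpers, only %CR; and %LF; are handled, and other DFDL entity forms and escapes are ignored.

```python
# Hypothetical, minimal entity table; real DFDL defines many more entities.
DFDL_ENTITIES = {"%CR;": "\r", "%LF;": "\n"}

def replace_dfdl_entities(text):
    """Replace the DFDL entities known to this sketch with their characters."""
    for entity, char in DFDL_ENTITIES.items():
        text = text.replace(entity, char)
    return text

def encode_part(text, encoding):
    """Turn entity-bearing text into bytes in the requested encoding."""
    return replace_dfdl_entities(text).encode(encoding)

part = "This is text followed by a CR LF%CR;%LF;"

# The same source text yields the right line ending for each encoding.
assert encode_part(part, "ascii").endswith(b"\r\n")
assert encode_part(part, "utf-16-be").endswith(b"\x00\r\x00\n")
```

Because the line ending is expressed as characters rather than bytes, the test data does not need to change when the encoding does.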

If you have a multi-line piece of data and need CRLFs in it, this does get a bit clumsy, because each text line has to get its own documentPart:

<documentPart type="text" replaceDFDLEntities="true"><![CDATA[Of all the gin joints%CR;%LF;]]></documentPart>
<documentPart type="text" replaceDFDLEntities="true"><![CDATA[In all the towns in the world%CR;%LF;]]></documentPart>
<documentPart type="text" replaceDFDLEntities="true"><![CDATA[She walked into mine%CR;%LF;]]></documentPart>

So the general rule is that CDATA regions cannot be used to ensure that specific kinds of line endings are preserved in a file.

Some tests, however, are not sensitive to exactly which whitespace is present. This is true of many tests for delimited text formats. In those cases you may want CDATA to preserve the formatting of text (so it won't be re-indented), and to preserve *some* line endings. If this same test example instead used dfdl:terminator="%NL;", the NL entity would match CRLF, CR, or LF, and even some other obscure Unicode line-ending characters. In that case, the original documentPart XML

<documentPart type="text"><![CDATA[Of all the gin joints
In all the towns of the world
She walked into mine
]]></documentPart>

is fine, and will work and be robust.
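Why this is robust can be seen in a rough Python sketch of %NL;-style matching. It is not Daffodil's matcher. The set of line endings in NL_PATTERN (CRLF, CR, LF, plus NEL U+0085 and LS U+2028) is an assumption about what %NL; covers, and split_records is a helper invented for the example.

```python
import re

# Assumed set of line endings for %NL;. CRLF is listed first so that it is
# consumed as one terminator rather than as CR followed by LF.
NL_PATTERN = re.compile("\r\n|\r|\n|\u0085|\u2028")

def split_records(data):
    """Split data into records, treating each NL match as a terminator."""
    parts = NL_PATTERN.split(data)
    # A trailing terminator leaves one empty string at the end; drop it.
    if parts and parts[-1] == "":
        parts.pop()
    return parts

as_parsed_from_xml = "Of all the gin joints\nIn all the towns of the world\nShe walked into mine\n"
as_saved_on_windows = as_parsed_from_xml.replace("\n", "\r\n")

# Whichever line ending the data ends up with, the records come out the same.
assert split_records(as_parsed_from_xml) == split_records(as_saved_on_windows)
assert len(split_records(as_parsed_from_xml)) == 3
```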

About Banners and Comments in XML/XSD/TDML Files.

Unfortunately, tools will wrap lines in XML comments. A nicely structured comment like

<!--
    Test name: csv_test
       Schema: csv.dfdl.xsd
         Root: file
      Purpose: This test is to exercise the csv schema.
  -->

turns into

<!-- Test name: csv_test Schema: csv.dfdl.xsd Root: file Purpose: This test is 
    to exercise the csv schema. -->

Tools like Eclipse have a setting to turn this "feature" off. See Eclipse Settings for DFDL Schema Authoring/Editing. Other tools may have similar options.

An alternative is to use XML Processing Instructions instead of comments. For example, our nicely formatted block comment can be done like so:

<?tdml test-doc
    Test name: csv_test
       Schema: csv.dfdl.xsd
         Root: file
      Purpose: This test is to exercise the csv schema.
  ?>

...

this page has moved to https://daffodil.apache.org/tdml/