When writing TDML tests, there are some tricks and techniques that make the tests more robust and ensure their portability across platforms. When code-reviewing TDML tests, these can be treated as part of the "checklist" of things we look for.

Using CDATA Regions

We use the XML <![CDATA[ ... ]]> bracketing in several different ways. Some are problematic, and as we spot such usage in our TDML tests, we should replace it with more robust usage.

We load TDML files differently from the way we load other XML files or DFDL schemas because of the way these CDATA regions are treated. This is very undesirable, as it introduces the possibility of different diagnostic behavior on validation errors, incorrect line numbers in diagnostic messages, etc.

The problem is that CDATA regions do not mean "preserve exactly what is here". Rather, they are just a different way to avoid having to escape the & and < characters. XML's general treatment of whitespace as fungible still applies.
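As a quick check of this point, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree, not Daffodil's TDML loader, and the element name foo is just for illustration. It suggests that a CDATA region and the equivalent escaped text are indistinguishable once parsed:

```python
import xml.etree.ElementTree as ET

# The same content written two ways: with CDATA and with entity escapes.
with_cdata = ET.fromstring("<foo><![CDATA[a & b < c]]></foo>")
with_escapes = ET.fromstring("<foo>a &amp; b &lt; c</foo>")

# After parsing, nothing records which notation was used.
print(with_cdata.text == with_escapes.text)
```

CDATA changes how the characters are written in the file, not what the element contains.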

OK: To preserve textual formatting within TDML - for clarity reasons.

E.g.,

<tdml:documentPart type="byte"><![CDATA[
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f              
10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
20 21    23 24 25    27 28 29 2a 2b 2c 2d 2e 2f
30 31 32 33 34 35 36 37 38 39 3a 3b    3d    3f
40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f
60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f
70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f
80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af
b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df
e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff  
]]></tdml:documentPart>

Without the formatting, the above matrix of hex would be hard to understand; specifically, it would be hard to see where the holes in it are. Logically, though, the whitespace is irrelevant. In effect, we have CDATA here so that tooling like IDEs, XML editors, etc. will not mess with the formatting of the content.
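To illustrate why the whitespace is logically irrelevant for type="byte" content, here is a small Python sketch. The function parse_hex_bytes is a hypothetical stand-in, not the TDML runner's actual code; it assumes the content is simply pairs of hex digits separated by arbitrary whitespace.

```python
def parse_hex_bytes(text: str) -> bytes:
    """Parse pairs of hex digits, ignoring all whitespace between them."""
    return bytes.fromhex("".join(text.split()))

# A nicely laid-out fragment with a visible hole, and the same data compacted.
formatted = """
00 01 02 03
10 11    13
"""
compact = "00010203101113"

print(parse_hex_bytes(formatted) == parse_hex_bytes(compact))
```

Under that assumption, the layout exists only for the human reader, which is why it is worth protecting from auto-formatters.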

OK: As a clearer way to escape things than using &amp; &gt; &lt; &apos;.

E.g.,

<foo>abc<![CDATA[&&&]]>def&#xE000;ghi</foo>

Note that &#xE000; denotes the Unicode private-use character U+E000. The above creates issues when used in the expected result infoset part of a TDML file. The above turns into a <foo> element containing 5 children which are roughly Text(abc), PCData(&&&), Text(def), Text(U+E000), Text(ghi).

We compare the actual and expected infosets by converting the actual to XML and doing an XML comparison. But a naive comparison of XML looking for the same nodes, be they elements, text, etc., will fail here because the actual infoset, converted to XML, would likely be roughly:

<foo>abc&amp;&amp;&amp;def&#xE000;ghi</foo>

This will have only one Text node in it containing 13 characters: Text(abc&&&def&#xE000;ghi), where &#xE000; stands for the single character U+E000.

We get this single text node to match the 5 text nodes above by clever comparison routines that are used when XML is compared. These special purpose routines also do things such as ignoring the namespace prefixes on element names. (Probably undesirable longer term.)
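The idea behind such a comparison can be sketched in Python with the standard library's xml.dom.minidom. This is only an illustration of the technique, not Daffodil's comparison routine, and flattened_text is a hypothetical helper name. Rather than comparing child nodes one by one, it concatenates all adjacent text and CDATA nodes and compares the resulting strings:

```python
from xml.dom import Node, minidom

def flattened_text(element) -> str:
    """Concatenate the data of all direct text and CDATA children."""
    text_types = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    return "".join(
        child.data for child in element.childNodes if child.nodeType in text_types
    )

# Expected infoset as written in the TDML file, and the actual one as serialized.
expected = minidom.parseString(
    "<foo>abc<![CDATA[&&&]]>def&#xE000;ghi</foo>"
).documentElement
actual = minidom.parseString(
    "<foo>abc&amp;&amp;&amp;def&#xE000;ghi</foo>"
).documentElement

# The child node lists may differ, but the flattened content is the same.
print(flattened_text(expected) == flattened_text(actual))
```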

OK: To avoid insertion of whitespace that would make things incorrect.

For example, here we need the document to contain exactly and only two characters:

<document><documentPart type="text"><![CDATA[a年]]></documentPart></document>

The problem is that the contents of the documentPart element are treated as literal data. Had we left off the CDATA, some XML tool might have reformatted this as:

       <document>
          <documentPart type="text">
            a年
          </documentPart>
        </document>

But this would be a documentPart containing some letters with surrounding whitespace. Our test, in this case, expects data of length exactly 2 characters.
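The difference is easy to see by parsing both forms. The following Python sketch uses xml.etree.ElementTree purely as a generic XML parser (it is not how the TDML runner reads the file), and part_text is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

compact = (
    '<document><documentPart type="text"><![CDATA[a年]]></documentPart></document>'
)
reformatted = """<document>
  <documentPart type="text">
    a年
  </documentPart>
</document>"""

def part_text(xml_text: str) -> str:
    """Return the character content of the documentPart element."""
    return ET.fromstring(xml_text).find("documentPart").text

print(len(part_text(compact)))      # length of the CDATA form
print(len(part_text(reformatted)))  # the re-indented form is longer
```

The re-indented form still contains the two characters, but they are now surrounded by newlines and spaces that are part of the element's content.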

In general, whitespace is considered fungible in XML. All-whitespace text nodes are usually meaningless. Here's that same example, but with XML comments added to replace the all-whitespace text nodes:

         <document><!-- WS --><documentPart type="text">
            a年
         </documentPart><!-- WS --></document>

Those all-whitespace nodes are usually unimportant. There's a feature (that we don't use) in XML called xml:space='preserve'. If this is found on an element, then it indicates that the all-whitespace nodes are important and should be preserved. But we don't care about these all-whitespace nodes. What we do care about is the whitespace surrounding the content of the documentPart element.

Now XML itself never inserts whitespace - tools/IDEs and people editing the text of XML documents insert it. One might have a policy that says "never mess with the formatting of the TDML files", particularly using auto-formatter tools. But XML/TDML is quite verbose, and without automatic formatting tools it can become a bit of a mess. It is better to have practices where tests are supposed to be created which are NOT sensitive to auto-reformatting.

The above is problematic because the reformatting inserted content whitespace. That is, the whitespace surrounding the 'a年' text all ends up as part of the content of the documentPart element. The inserted leading and trailing whitespace here is NOT separate all-whitespace nodes. Rather, there is a single Text node that is the content of this documentPart element. In my opinion, XML tools should NOT insert whitespace unless it is all-whitespace nodes, but tools will do the above re-indenting of XML sometimes, and it will break our tests if it does.

For the above reason, we use CDATA bracketing. XML tools (those we've seen so far: Eclipse, Altova XML Spy, ...) seem to sense the CDATA bracketing and turn off any auto-indent logic for it. The tools we've used so far will not insert whitespace before or after a CDATA region.

So we care very much about the whitespace inside the documentPart element when it is type="text". (When type="byte" or "bits" we don't care, except for TDML file formatting - which is still very important for clarity reasons.)

In the above case, since we really do care about whitespace being inserted here, we use CDATA.

NOT OK: To preserve specific line endings

Using CDATA does NOT necessarily preserve line endings. Suppose you had a test containing this:

<documentPart type="text"><![CDATA[This is text followed by a CR LF
]]></documentPart>

If you edit that on a Windows machine, where CRLF is the usual text line ending, then the file will actually have a CRLF line ending in that text. If the test has, say, dfdl:terminator="%CR;%LF;", then this will (or should) fail because, no matter what, XML always normalizes line endings to just one character, LF. It replaces CRLF with LF, and isolated CR with LF. The net result: by the time a program is reading the XML data, it will (or should) only see LF line endings.

It is possible to get a literal CR character into XML content, but ONLY by using the numeric character reference notation, i.e., &#xD;. So one might try to write the above test as:
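This normalization can be observed with any conforming XML parser. A minimal Python sketch, using the standard library's xml.etree.ElementTree rather than the TDML runner, with shortened text for brevity:

```python
import xml.etree.ElementTree as ET

# The same documentPart as saved with Windows (CRLF) and bare CR line endings.
crlf_doc = '<documentPart type="text"><![CDATA[text\r\n]]></documentPart>'
cr_doc = '<documentPart type="text"><![CDATA[text\r]]></documentPart>'

# repr() makes the line-ending characters visible.
print(repr(ET.fromstring(crlf_doc).text))
print(repr(ET.fromstring(cr_doc).text))
```

In both cases the parsed text ends in a single LF, even though the CR was inside a CDATA region.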

<documentPart type="text"><![CDATA[This is text followed by a CR LF]]></documentPart>
<documentPart type="text">&#xD;&#xA;</documentPart>

Even this, however, is not a sure thing, because re-indenting the XML might cause you to get:

<documentPart type="text"><![CDATA[This is text followed by a CR LF]]></documentPart>
<documentPart type="text">
   &#xD;&#xA;
</documentPart>

which would be broken because of the whitespace insertions around the &#xD;&#xA;.
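Both halves of this can be checked with a generic XML parser. In the Python sketch below (standard library xml.etree.ElementTree, used only for illustration), the character references survive parsing as a real CR and LF, but the re-indented version picks up extra content whitespace around them:

```python
import xml.etree.ElementTree as ET

tight = '<documentPart type="text">&#xD;&#xA;</documentPart>'
indented = '<documentPart type="text">\n   &#xD;&#xA;\n</documentPart>'

# repr() makes the CR, LF, and the inserted whitespace visible.
print(repr(ET.fromstring(tight).text))
print(repr(ET.fromstring(indented).text))
```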

There are two good solutions to this problem. First one can use type="byte" document parts:

<documentPart type="text"><![CDATA[This is text followed by a CR LF]]></documentPart>
<documentPart type="byte">0D 0A</documentPart>

This will always create exactly the bytes 0D and 0A, and documentParts are concatenated together with nothing between. However, this will break if the documentPart has an encoding where CR and LF are not exactly represented by the bytes 0D and 0A. For example, we currently support encoding="us-ascii-7-bit-packed", which is needed for MIL-STD-2045 and related formats. In that encoding, CR and LF each take up only 7 bits, resulting in 14 bits, not 2 full bytes.
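The arithmetic can be illustrated with a small Python sketch. The helper pack_7bit is hypothetical and emits the 7 bits of each character most-significant bit first; that bit order is an assumption made for readability and may not match the order the real us-ascii-7-bit-packed encoding uses. Only the bit counts matter here.

```python
def pack_7bit(text: str) -> str:
    """Return text as a string of '0'/'1' digits, 7 bits per character."""
    bits = []
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError(f"not a 7-bit ASCII character: {ch!r}")
        bits.append(format(code, "07b"))
    return "".join(bits)

packed_bits = pack_7bit("\r\n")
byte_bits = len("\r\n".encode("ascii")) * 8

print(len(packed_bits), byte_bits)
```

The two lengths differ, so a type="byte" part containing 0D 0A does not line up with what a 7-bit packed text part would produce.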

The best way to handle this problem is to use the documentPart replaceDFDLEntities attribute:

<documentPart type="text" replaceDFDLEntities="true"><![CDATA[This is text followed by a CR LF%CR;%LF;]]></documentPart>

The line gets kind of long, but %CR; and %LF; are the DFDL entity syntax for those Unicode characters. They are translated into whatever encoding the documentPart specifies, so this will be robust even if the encoding is, say, UTF-16 or the 7-bit packed encoding.

If you have a multi-line piece of data and need CRLFs in it, this does get a bit clumsy, as each text line gets its own documentPart:
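Conceptually, the replacement happens on characters first and the encoding is applied afterwards. Here is a deliberately partial Python sketch of that order of operations; replace_dfdl_entities and document_bytes are hypothetical names, only %CR; and %LF; are handled, and this is not the TDML runner's implementation.

```python
# Only the two entities used above; real DFDL defines many more.
DFDL_ENTITIES = {"%CR;": "\r", "%LF;": "\n"}

def replace_dfdl_entities(text: str) -> str:
    """Replace the supported DFDL entities with the characters they denote."""
    for entity, char in DFDL_ENTITIES.items():
        text = text.replace(entity, char)
    return text

def document_bytes(text: str, encoding: str) -> bytes:
    """Replace entities, then encode the resulting characters."""
    return replace_dfdl_entities(text).encode(encoding)

print(document_bytes("A%CR;%LF;", "ascii").hex(" "))
print(document_bytes("A%CR;%LF;", "utf-16-be").hex(" "))
```

Because the entities become characters before encoding, the same documentPart text yields the right bytes for each encoding.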

<documentPart type="text" replaceDFDLEntities="true"><![CDATA[Of all the gin joints%CR;%LF;]]></documentPart>
<documentPart type="text" replaceDFDLEntities="true"><![CDATA[In all the towns in the world%CR;%LF;]]></documentPart>
<documentPart type="text" replaceDFDLEntities="true"><![CDATA[She walked into mine%CR;%LF;]]></documentPart>
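If the clumsiness becomes a burden, the per-line documentParts could be generated rather than typed. The following Python sketch is one hypothetical way to do that; crlf_document_parts is not an existing Daffodil tool. It refuses lines containing "]]>" because that sequence would end the CDATA region early, and it does not attempt to handle lines that themselves contain % characters.

```python
TEMPLATE = (
    '<documentPart type="text" replaceDFDLEntities="true">'
    "<![CDATA[{}%CR;%LF;]]></documentPart>"
)

def crlf_document_parts(lines):
    """Emit one documentPart per line, each terminated by %CR;%LF;."""
    parts = []
    for line in lines:
        if "]]>" in line:
            raise ValueError("line would terminate the CDATA region: " + repr(line))
        parts.append(TEMPLATE.format(line))
    return "\n".join(parts)

print(crlf_document_parts(["Of all the gin joints", "She walked into mine"]))
```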

So the general rule is that CDATA regions cannot be used to ensure specific kinds of line endings will be preserved in a file.

Some tests, however, are insensitive to the presence of whitespace. This is true of many tests for delimited text formats. In those cases you may want CDATA to preserve formatting of text (so it won't be re-indented), and to preserve *some* line endings. If this same test example instead used dfdl:terminator="%NL;", the NL entity would match CRLF, CR, or LF, and even some more obscure Unicode line ending characters. In that case, the original documentPart XML

<documentPart type="text"><![CDATA[Of all the gin joints
In all the towns of the world
She walked into mine
]]></documentPart>

is fine, and will work and be robust.
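The reason this is robust can be sketched with a regular expression that plays the role of %NL; on parsing. This is an approximation written for illustration, not Daffodil's matcher; as we understand the DFDL specification, NL matches CRLF, CR, LF, NEL (U+0085), and LS (U+2028), and split_records is a hypothetical helper.

```python
import re

# CRLF must come first so it is matched as one terminator, not two.
NL = re.compile("\r\n|\r|\n|\x85|\u2028")

def split_records(data: str):
    """Split data into records terminated by any NL-style line ending."""
    pieces = NL.split(data)
    if pieces and pieces[-1] == "":
        pieces.pop()  # the final terminator leaves an empty trailing piece
    return pieces

lines = ["Of all the gin joints", "She walked into mine"]
for eol in ("\r\n", "\r", "\n"):
    data = "".join(line + eol for line in lines)
    print(split_records(data) == lines)
```

Whichever line ending the editor or the XML parser leaves in the data, the records come out the same.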

About Banners and Comments in XML/XSD/TDML Files

Unfortunately, tools will wrap lines in XML comments. So auto-formatting a whole XML file containing this banner

<!--
Copyright (c) 2012-2013 Tresys Technology, LLC. All rights reserved.

Developed by: Tresys Technology, LLC
http://www.tresys.com

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal with
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

 1. Redistributions of source code must retain the above copyright notice,
    this list of conditions and the following disclaimers.

 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimers in the
    documentation and/or other materials provided with the distribution.

 3. Neither the names of Tresys Technology, nor the names of its contributors
    may be used to endorse or promote products derived from this Software
    without specific prior written permission.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE
SOFTWARE.
-->

will result in

<!-- Copyright (c) 2012-2013 Tresys Technology, LLC. All rights reserved. Developed 
  by: Tresys Technology, LLC http://www.tresys.com Permission is hereby granted, free
  of charge, to any person obtaining a copy of this software and associated documentation
  files (the "Software"), to deal with the Software without restriction, including
  without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
  and/or sell copies of the Software, and to permit persons to whom the Software is
  furnished to do so, subject to the following conditions: 1. Redistributions of source
  code must retain the above copyright notice, this list of conditions and the following
  disclaimers. 2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimers in the documentation
  and/or other materials provided with the distribution. 3. Neither the names of Tresys
  Technology, nor the names of its contributors may be used to endorse or promote products
  derived from this Software without specific prior written permission. THE SOFTWARE
  IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
  BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE
  AND NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE
  FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
  OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
  OR OTHER DEALINGS WITH THE SOFTWARE. -->

That might be fine legally, but it's definitely unacceptable for readability.

Similarly, our nicely structured comments like

<!--
    Test name: csv_test
       Schema: csv.dfdl.xsd
         Root: file
      Purpose: This test is to exercise the csv schema.
  -->

turn into

<!-- Test name: csv_test Schema: csv.dfdl.xsd Root: file Purpose: This test is 
    to exercise the csv schema. -->

But there is a good fix for this (at least as far as Eclipse tooling is concerned). You can use XML's so-called "Processing Instructions" to hold this content. Processing Instructions are part of the XML information model, but almost nothing uses them. (The xml-stylesheet instruction is the only use I know of.) They can have anything in them except the sequence '?>' which ends them.

I tested these banners and comments: Eclipse will not reformat them when auto-formatting, and they are ignored by the TDML runner and by Daffodil. So, for example, our nicely formatted block comment can be done like so:

<?tdml test-doc
    Test name: csv_test
       Schema: csv.dfdl.xsd
         Root: file
      Purpose: This test is to exercise the csv schema.
  ?>

And then auto-reformatting won't mess it up. Our copyright banner can be similarly bracketed.
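To see how a generic XML parser treats such a processing instruction, here is a Python sketch using only the standard library. The element names testSuite and parserTestCase are placeholders for the example, and this says nothing about how Eclipse or the TDML runner is implemented. minidom exposes the instruction with its layout intact, while ElementTree's default parser simply skips it.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

doc_text = """<testSuite><?tdml test-doc
    Test name: csv_test
       Schema: csv.dfdl.xsd
  ?><parserTestCase/></testSuite>"""

# minidom keeps the processing instruction as a node, whitespace and all.
root = minidom.parseString(doc_text).documentElement
pi = root.childNodes[0]
print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE, pi.target)
print(pi.data)

# ElementTree's default parser drops it; only the element children remain.
print([child.tag for child in ET.fromstring(doc_text)])
```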
