Q: When should I use an XSD facet like maxLength, and when should I use the DFDL length property?

Here's part of a street-address example from the DFDL tutorial:

<xs:element name="houseNumber" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="6"/>

Note that the length of the house number is constrained with DFDL.  XSD can also be used to constrain lengths.

When should you use XSD to do this, and when should you use DFDL? Should you ever use both?

You must use the dfdl:length property, because the DFDL processor cannot parse the data without it. You may also use XSD facets to check the value further, and it often makes sense to use both.

Consider

<xs:element name="article" type="xs:string" dfdl:length="{ ../header/articleLength }" dfdl:lengthKind='explicit'/>

Now the length comes from a field elsewhere in the data at runtime. Validating that the value also satisfies an additional maxLength constraint might be very valuable. To do that you have to write the more verbose:

<xs:element name="article" dfdl:length="{ ../header/articleLength }" dfdl:lengthKind='explicit'>
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:maxLength value="140"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Not too bad actually. And if you can reuse some simple type definitions it's not bad at all.
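
As a rough sketch of that reuse, the restriction can be moved into a named simple type and referenced from more than one element. The names shortText, summary, and summaryLength are hypothetical, and the fragment assumes the prefix ex is bound to the schema's target namespace:

<!-- Hypothetical named type: the maxLength facet is written once -->
<xs:simpleType name="shortText">
  <xs:restriction base="xs:string">
    <xs:maxLength value="140"/>
  </xs:restriction>
</xs:simpleType>

<!-- Each element still supplies its own dfdl:length; only the facet is shared -->
<xs:element name="article" type="ex:shortText" dfdl:length="{ ../header/articleLength }" dfdl:lengthKind="explicit"/>
<xs:element name="summary" type="ex:shortText" dfdl:length="{ ../header/summaryLength }" dfdl:lengthKind="explicit"/>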

One further point. Suppose you want to parse the string using the header-supplied length, but it's flat out a parse error if the length turns out to be greater than 140. You can ask the DFDL processor to check the facet maxLength at parse time using an assertion like this:

<xs:element name="article" dfdl:length="{ ../header/articleLength }" dfdl:lengthKind='explicit'>
  <xs:simpleType>
    <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
       <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
    </xs:appinfo></xs:annotation>
    <xs:restriction base="xs:string">
      <xs:maxLength value="140"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The dfdl:assert statement annotation calls a built-in DFDL function called dfdl:checkConstraints, which tells DFDL to test the facet constraints and issue a parse error if they are not satisfied. This is particularly useful for enumeration constraints where an element value is an identifier of some sort.
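
For the enumeration case, a sketch might look like the following. The element name messageType, its three-character length, and the values ACK, NAK, and DAT are made up for illustration; the point is only that an unrecognized identifier becomes a parse error:

<xs:element name="messageType" dfdl:lengthKind="explicit" dfdl:length="3">
  <xs:simpleType>
    <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
       <!-- Fails the parse when the value is not one of the enumerated identifiers -->
       <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
    </xs:appinfo></xs:annotation>
    <xs:restriction base="xs:string">
      <!-- Hypothetical identifiers -->
      <xs:enumeration value="ACK"/>
      <xs:enumeration value="NAK"/>
      <xs:enumeration value="DAT"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>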

Q: Should I use dfdl:assert to validate while parsing?

In general, no. The dfdl:assert statement annotation should be used to guide the parser. It should test things that must be true in order to successfully parse the data and create an Infoset from it.

But it should not be used to validate the values of the data elements.
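
As a sketch of an assertion that does guide the parser, suppose (hypothetically) that the schema only describes version 1 of a format, and the data begins with a two-character version field. If the version is anything else, the rest of the schema does not describe the data, so failing the parse is appropriate. The element name, length, and message text here are assumptions for illustration:

<xs:element name="version" type="xs:int" dfdl:lengthKind="explicit" dfdl:length="2">
  <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
    <!-- Tests a condition the rest of the parse depends on, not the validity of a value -->
    <dfdl:assert message="Unsupported format version">{ . eq 1 }</dfdl:assert>
  </xs:appinfo></xs:annotation>
</xs:element>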

By way of illustrating what not to do, it is tempting to put facet constraints on simple type definitions in your schema, and then use a dfdl:assert like this:

<dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>

so that the parser will validate as it parses, and will fail to parse values that do not satisfy the facet constraints.

Don't do this. Your schema will be less useful, because some applications will not be able to use it: for example, applications that want to accept well-formed but invalid data and then analyze, act on, or report on the invalid aspects.

In some sense, embedding checks like this into a DFDL schema is second-guessing the application's needs, and assuming the application does not even want to successfully parse and create an infoset from data that does not obey the facet constraints.

Q: How do I prevent my DFDL expressions and regular expressions from being modified by my XML editor?

Use CDATA with expressions and regular expressions, and more generally to stop XML editors from messing with your DFDL schema layouts.

Most XML editors will wrap long lines. So your

<a>foobar</a>

just might get turned into

<a>foobar
</a>

Now most of the time that is fine. But sometimes the whitespace really matters. One such place is when you type a regular expression.

In DFDL this can come up in this way:

<dfdl:assert testKind="pattern"> *</dfdl:assert>

Now the content of that element is " *", i.e., a single space followed by the "*" character. In regex language that means zero or more spaces.

If you don't want your XML tooling to mess with the whitespace do this instead:

<dfdl:assert testKind="pattern"><![CDATA[ *]]></dfdl:assert>

CDATA informs XML processors that you very much care about this. Any decent XML tooling/editor will see this and decide it cannot line-wrap this or in any way mess with the whitespace.

CDATA is also useful when you write a complex DFDL expression in the expression language and want its indentation and line breaks to be respected. Here's an example:

<dfdl:discriminator><![CDATA[{
    if (fn:trace((fn:trace(../../ex:presenceBit,"presenceBit") = 0),"pbIsZero")) then false()
    else if
    (fn:trace(fn:trace(dfdl:occursIndex(),"occursIndex") = 1,"indexIsOne")) then true()
    else if
    (fn:trace(fn:trace(xs:int(fn:trace(../../ex:A1[fn:trace(dfdl:occursIndex()-1,"indexMinusOne")],
                                       "occursIndexMinusOneNode")/ex:repeatBit),
                       "priorRepeatBit") = 0,
              "priorRepeatBitIsZero")) 
    then false()
    else true()  
}]]></dfdl:discriminator> 

If you get done writing something very deeply nested like this (and XPath-style languages require this all the time), then you do NOT want anything messing with the whitespace.

About the xml:space='preserve' attribute: according to a thread on the Stack Overflow web site, xml:space is only about whitespace-only nodes, not nodes that are partly whitespace. Within element-only content, the text nodes found between the elements are whitespace-only nodes. Unless you use xml:space='preserve', those are eliminated. None of the above discussion is about whitespace-only nodes. It is about value nodes containing text strings with surrounding whitespace.