Data Format Description Language (DFDL) v1.0 Specification

 

 

Status of This Document

 

Grid Final Draft (GFD)

 

 

Obsoletes

This document obsoletes GFD-P-R.174 dated January 2011 [OBSOLETE_DFDL].

 

 

Copyright Notice

 

Copyright © Global Grid Forum (2004-2006).  Some Rights Reserved. Distribution is unlimited.

Copyright © Open Grid Forum (2006-2014).  Some Rights Reserved. Distribution is unlimited.

 

 

Abstract

 

This document provides a definition of a standard Data Format Description Language (DFDL). This language allows description of text, dense binary, and legacy data formats in a vendor-neutral declarative manner. DFDL is an extension to the XML Schema Definition Language (XSDL).

 


Contents

 

Data Format Description Language (DFDL) v1.0 Specification

1.     Introduction

1.1       Why is DFDL Needed?

1.2       What is DFDL?

1.2.1        Simple Example

1.3       What DFDL is not

1.4       Scope of version 1.0

1.5       Related standards

2.     Notational and Definitional Conventions

2.1       Failure Types

2.2       Schema Definition Error

2.3       Processing Errors

2.3.1        Ambiguity of Data Formats

2.4       Validation Errors

2.5       Recoverable Error

2.6       Specific Errors Classified

2.7       Optional Checks and Warnings

3.     Glossary

4.     The DFDL Information Set (Infoset)

4.1       Information Items

4.1.1        Document Information Item

4.1.2        Element Information Items

4.2       "No Value"

4.3       DFDL Information Item Order

4.4       DFDL Infoset Object model

4.5       DFDL Augmented Infoset

5.     DFDL Schema Component Model

5.1       DFDL Subset of XML Schema

5.2       XSD Facets, min/maxOccurs, default, and fixed

5.2.1        MinOccurs, MaxOccurs

5.2.2        MinLength, MaxLength

5.2.3        MaxInclusive, MaxExclusive, MinExclusive, MinInclusive, TotalDigits, FractionDigits

5.2.4        Pattern

5.2.5        Enumeration

5.2.6        Default

5.2.7        Fixed

6.     DFDL Syntax Basics

6.1       Namespaces

6.2       The DFDL Annotation Elements

6.3       DFDL Properties

6.3.1        DFDL String Literals

6.3.2        DFDL Expressions

6.3.3        DFDL Regular Expressions

6.3.4        Enumerations in DFDL

7.     Syntax of DFDL Annotation Elements

7.1       Component Format Annotations

7.1.1        The dfdl:ref Property

7.1.2        Property Binding Syntax

7.1.3        Empty String as a Representation Property Value

7.2       dfdl:defineFormat - Reusable Data Format Definitions

7.2.1        Inheritance for dfdl:defineFormat

7.2.2        Using/Referencing a Named Format Definition

7.3       The dfdl:assert Statement Annotation Element

7.3.1        Properties for dfdl:assert

7.3.2        Controlling the Timing of Statement Evaluation

7.4       The dfdl:discriminator Statement Annotation Element

7.4.1        Properties for dfdl:discriminator

7.5       The dfdl:defineEscapeScheme Defining Annotation Element

7.5.1        Using/Referencing a Named escapeScheme Definition

7.6       The dfdl:escapeScheme Annotation Element

7.7       The dfdl:defineVariable Annotation Element

7.7.1        Examples

7.7.2        Predefined Variables

7.8       The dfdl:newVariableInstance Statement Annotation Element

7.8.1        Examples

7.9       The dfdl:setVariable Statement Annotation Element

7.9.1        Examples

8.     Property Scoping Rules

8.1       Providing Defaults for DFDL properties

8.2       Combining DFDL Representation Properties from a dfdl:defineFormat

8.3       Combining DFDL Properties from References

9.     DFDL Processing Introduction

9.1       Parser Overview

9.2       DFDL Data Syntax Grammar

9.2.1        Nil Representation

9.2.2        Empty Representation

9.2.3        Normal Representation

9.2.4        Absent Representation

9.2.5        Zero-length Representation

9.2.6        Missing

9.2.7        Examples of Missing and Empty Representation

9.2.8        Round Trip Ambiguities

9.3       Parsing Algorithm

9.3.1        Known-to-exist and Known-not-to-exist

9.3.2        Establishing Representation

9.3.3        Points of Uncertainty

9.4       Element Defaults

9.4.1        Definition 'default value'

9.4.2        Element Defaults When Parsing

9.4.3        Element Defaults When Unparsing

9.5       Evaluation Order for Statement Annotations

9.5.1        Asserts and Discriminators with testKind 'expression'

9.5.2        Discriminators with testKind 'expression'

9.5.3        Elements and setVariable

10.       Core Representation Properties and their Format Semantics

11.       Properties Common to both Content and Framing

11.1          Unicode Byte Order Mark (BOM)

11.2          Character Encoding and Decoding Errors

11.2.1      Property dfdl:encodingErrorPolicy

11.2.2      Unicode UTF-16 Decoding/Encoding Non-Errors

11.2.3      Preserving Data Containing Decoding Errors

11.3          Byte Order and Bit Order

11.4          dfdl:bitOrder Example

11.4.1      Example Using Right-to-Left Display for 'leastSignificantBitFirst'

11.4.2      dfdl:bitOrder and Grammar Regions

12.       Framing

12.1          Aligned Data

12.1.1      Implicit Alignment

12.1.2      Mandatory Alignment for Textual Data

12.1.3      Mandatory Alignment for Packed Decimal Data

12.1.4      Example: AlignmentFill

12.2          Properties for Specifying Delimiters

12.3          Properties for Specifying Lengths

12.3.1      dfdl:lengthKind 'explicit'

12.3.2      dfdl:lengthKind 'delimited'

12.3.3      dfdl:lengthKind 'implicit'

12.3.4      dfdl:lengthKind 'prefixed'

12.3.5      dfdl:lengthKind 'pattern'

12.3.6      dfdl:lengthKind 'endOfParent'

12.3.7      Elements of Specified Length

13.       Simple Types

13.1          Properties Common to All Simple Types

13.2          Properties Common to All Simple Types with Text representation

13.2.1      The dfdl:escapeScheme Properties

13.3          Properties for Bidirectional support for All Simple Types with Text representation

13.4          Properties Specific to String

13.5          Properties Specific to Number with Text or Binary Representation

13.6          Properties Specific to Number with Text Representation

13.6.1      The dfdl:textNumberPattern Property

13.6.2      Converting logical numbers to/from text representation

13.7          Properties Specific to Number with Binary Representation

13.7.1      Converting Logical Numbers to/from Binary Representation

13.8          Properties Specific to Float/Double with Binary Representation

13.9          Properties Specific to Boolean with Text Representation

13.10       Properties Specific to Boolean with Binary Representation

13.11       Properties specific to Calendar with Text or Binary Representation

13.11.1        The dfdl:calendarPattern property

13.11.2        The dfdl:calendarCheckPolicy Property

13.12       Properties Specific to Calendar with Text Representation

13.13       Properties Specific to Calendar with Binary Representation

13.14       Properties Specific to Opaque Types (xs:hexBinary)

13.15       Nil Value Processing

13.16       Properties for Nillable Elements

14.       Sequence Groups

14.1          Empty Sequences

14.2          Sequence Groups with Separators

14.2.1      Separators and Suppression

14.2.2      Parsing Sequence Groups with Separators

14.2.3      Unparsing Sequence Groups with Separators

14.3          Unordered Sequence Groups

14.3.1      Restrictions for Unordered Sequences

14.3.2      Parsing an Unordered Sequence

14.3.3      Unparsing an Unordered Sequence

14.4          Floating Elements

14.5          Hidden Groups

15.       Choice Groups

15.1          Resolving Choices

15.1.1      Resolving Choices via Speculation

15.1.2      Resolving Choices via Direct Dispatch

15.1.3      Unparsing Choices

16.       Properties for Array Elements and Optional Elements

16.1          The dfdl:occursCountKind property

16.1.1      dfdl:occursCountKind 'fixed'

16.1.2      dfdl:occursCountKind 'implicit'

16.1.3      dfdl:occursCountKind 'parsed'

16.1.4      dfdl:occursCountKind 'expression'

16.1.5      dfdl:occursCountKind 'stopValue'

16.2          Default Values for Arrays

16.3          Arrays with DFDL Expressions

16.4          Points of Uncertainty

16.5          Arrays and Sequences

16.6          Forward Progress Requirement

16.7          Parsing Occurrences with Non-Normal Representation

16.8          Sparse Arrays

17.       Calculated Value Properties

17.1          Example: 2d Nested Array

17.2          Example: Three-Byte Date

18.       External Control of the DFDL Processor

19.       Built-in Specifications

20.       Conformance

21.       Optional DFDL Features

22.       Property Precedence

22.1          Parsing

22.1.1      dfdl:element (simple) and dfdl:simpleType

22.1.2      dfdl:element (complex)

22.1.3      dfdl:sequence and dfdl:group (when reference is to a sequence)

22.1.4      dfdl:choice and dfdl:group (when reference is to a choice)

22.2          Unparsing

22.2.1      dfdl:element (simple) and dfdl:simpleType

22.2.2      dfdl:element (complex)

22.2.3      dfdl:sequence and dfdl:group (when reference is a sequence)

22.2.4      dfdl:choice and dfdl:group (when reference is a choice)

23.       Expression language

23.1          Expression Language Data Model

23.2          Variables

23.2.1      Rewinding of Variable Memory State

23.2.2      Variable Memory State Transitions

23.3          General Syntax

23.4          DFDL Expression Syntax

23.5          Constructors, Functions and Operators

23.5.1      Constructor Functions for XML Schema Built-in Types

23.5.2      Standard XPath Functions

23.5.3      DFDL Functions

23.5.4      DFDL Constructor Functions

24.       DFDL Regular Expressions

25.       Security Considerations

26.       Authors and Contributors

27.       Intellectual Property Statement

28.       Disclaimer

29.       Full Copyright Notice

30.       References

31.       Appendix A: Escape Scheme Use Cases

31.1          Escape Character Same as dfdl:escapeEscapeCharacter

31.2          Escape Character Different from dfdl:escapeEscapeCharacter

31.3          Escape Block with Different Start and End Characters

31.4          Escape Block with Same Start and End Characters

32.       Appendix B: Rationale for Single-Assignment Variables

33.       Appendix C: Processing of DFDL String literals

33.1          Interpreting a DFDL String Literal

33.2          Recognizing a DFDL String Literal

33.3          Recognizing DFDL String Literal Part

34.       Appendix D: DFDL Standard Encodings

34.1          Purpose

34.2          Conventions

34.3          Specification Template

34.4          Encoding X-DFDL-US-ASCII-7-BIT-PACKED

34.4.1      Name

34.4.2      Translation table

34.4.3      Width

34.4.4      Alignment

34.4.5      Byte Order

34.4.6      Example 1

34.4.7      Example 2

34.5          Encoding X-DFDL-US-ASCII-6-BIT-PACKED

34.5.1      Name

34.5.2      Translation Table

34.5.3      Width

34.5.4      Alignment

34.5.5      ByteOrder

34.5.6      Example 1

34.6          References for Appendix D

1.     Introduction

Data interchange is critically important for most computing. Grid computing, Cloud computing, and all forms of distributed computing require distributed software and hardware resources to work together. Inevitably, these resources read and write data in a variety of formats. General tools for data interchange are essential to solving such problems. Scalable and High Performance Computing (HPC) applications require high-performance data handling, so data interchange standards must enable efficient representation of data. Data Format Description Language (DFDL) enables powerful data interchange and very high-performance data handling.

We envisage three dominant kinds of data in the future, as follows:

1.     Textual data defined by a format-specific schema such as XML [XML] or JSON [JSON].

2.     Binary data in standard formats.

3.     Data with DFDL descriptors.

Textual XML data is the most successful data interchange standard to date. All such data are by definition new, by which we mean created in the XML era. Because of the large overhead that XML tagging imposes, there is often a need to compress and decompress XML data. However, compression and decompression carry a high cost that is unacceptable to some applications. Standardized binary data formats are also relatively new, and are suitable for larger data because of the reduced costs of encoding and more compact size. Examples of standard binary formats are data described by modern versions of ASN.1 [ASN1], or the use of XDR [XDR]. These techniques lack the self-describing nature of XML data. Scientific formats, such as NetCDF [NetCDF] and HDF [HDF], are used by some communities to provide self-describing binary data. There are also standardized binary-encoded XML data formats such as EXI [EXI].

It is an important observation that both XML format and standardized binary formats are prescriptive in that they specify or prescribe a representation of the data. To use them your applications must be written to conform to their encodings and mechanisms of expression.

DFDL suggests an entirely different scheme. The approach is descriptive in that one chooses an appropriate data representation for an application based on its needs and one then describes the format using DFDL so that multiple programs can directly interchange the described data. DFDL descriptions can be provided by the creator of the format, or developed as needed by third parties intending to use the format. That is, DFDL is not a format for data; it is a way of describing any data format. DFDL is intended for data commonly found in scientific and numeric computations, as well as record-oriented representations found in commercial data processing.

DFDL can be used to describe legacy data files, to simplify transfer of data across domains without requiring global standard formats, or to allow third-party tools to easily access multiple formats. DFDL can also be a powerful tool for supporting backward compatibility as formats evolve.

DFDL is designed to provide flexibility and also permit implementations that achieve very high levels of performance. DFDL descriptions are separable and native applications do not need to use DFDL libraries to parse their data formats. DFDL parsers can also be highly efficient. The DFDL language is designed to permit implementations that use lazy evaluation of formats and to support seekable, random access to data. The following goals can be achieved by DFDL implementations:

·         Density. Fewest bytes to represent information (without resorting to compression). Fastest possible I/O.

·         Optimized I/O. Applications can write data aligned to byte, word, or even page boundaries and use memory-mapped I/O to ensure access to data with the smallest number of machine cycles for common use cases without sacrificing general access.

DFDL can describe the same types of abstract data that other binary or textual data formats can describe and, furthermore, it can describe almost any possible representation scheme for those data. It is the spirit of DFDL to support canonical data descriptions that correspond closely to the original in-memory representation of the data, and also to provide sufficient information to write as well as to read the given format.

1.1       Why is DFDL Needed?

The question arises of why DFDL is needed in an era when there are so many standard data formats available. Ultimately, it is because a number of social phenomena in the way software is developed have led to the situation today where DFDL is needed to standardize descriptions of diverse data formats.

First, programs are very often written speculatively, that is, without any advance understanding of how important they will become. Given this situation, little effort is expended on data formats since it remains easier to program the I/O in the most straightforward way possible with the programming tools in use. Even something as simple as using an XML-based data format is harder than just using the native I/O libraries of a programming language.

In time, however, it is realized that a software program is important because either many people are using it, or it has become important for business or organizational needs to start using it in larger scale deployments. At that point it is often too late to go back and change the data formats. For example, there may be real or perceived business costs to delaying the deployment of a program for a rewrite just to change the data formats, particularly if such rewriting will reduce the performance of the program and increase the costs of deployment. (It takes longer to program, but at least it's slower when you are done. ☺)

Additionally, the need for data format standardization for interchange with other software may not be clear at the point where a program first becomes 'important'. Eventually, however, the need for data interchange with the program becomes apparent.

The above phenomena are not something that is going away any time soon. There are, of course, efforts to smoothly integrate standardized data format handling into programming languages. Nevertheless, we see a critical role for DFDL since it allows after-the-fact description of a data format.

1.2       What is DFDL?

DFDL is a language for describing data formats. A DFDL description allows data to be read from its native format and to be presented as an instance of an information set or indeed converted to the corresponding XML document. DFDL also allows data to be taken from an instance of an information set and written out to its native format.

DFDL achieves this by leveraging W3C XML Schema Definition Language (XSDL) 1.0 [XSDL].

An XML schema is written for the logical model of the data. The schema is augmented with special DFDL annotations. These annotations are used to describe the native representation of the data. This is an established approach that is already being used today in commercial systems.

1.2.1       Simple Example

Consider the following XML data:

<w>5</w>

<x>7839372</x>

<y>8.6E-200</y>

<z>-7.1E8</z>

 

The logical model for this data can be described by the following fragment of an XML schema document that simply provides description of the name and type of each element:

  <xs:complexType name="example1">

    <xs:sequence>

      <xs:element name="w" type="xs:int"/>

      <xs:element name="x" type="xs:int"/>

      <xs:element name="y" type="xs:double"/>

      <xs:element name="z" type="xs:float"/>

    </xs:sequence>

  </xs:complexType>

Now, suppose we have the same data but represented in a non-XML format. A binary representation of the data could be visualized like this (shown as hexadecimal):

0000 0005 0077 9e8c

169a 54dd 0a1b 4a3f

ce29 46f6

To describe this in DFDL, we take our original XML schema document that described the data model and we annotate the type definition as follows:

  <xs:complexType>                   

    <xs:sequence>

      <xs:element name="w" type="xs:int">

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="binary"

                      binaryNumberRep="binary"

                      byteOrder="bigEndian"

                      lengthKind="implicit"/>                  

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

      <xs:element name="x" type="xs:int">

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="binary"

                      binaryNumberRep="binary"

                      byteOrder="bigEndian"

                      lengthKind="implicit"/>                  

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

      <xs:element name="y" type="xs:double">

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="binary"

                      binaryFloatRep="ieee"

                      byteOrder="bigEndian"

                      lengthKind="implicit"/>                  

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

      <xs:element name="z" type="xs:float" >

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="binary"

                      byteOrder="bigEndian"

                      lengthKind="implicit"

                      binaryFloatRep="ieee" />                  

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

    </xs:sequence>

  </xs:complexType>

This simple DFDL annotation expresses that the data are represented in a binary format and that the byte order will be big endian. This is all that a DFDL parser needs to read the data.

Consider if the same data are represented in a text format:

5,7839372,8.6E-200,-7.1E8

Once again, we can annotate the same data model, this time with properties that provide the character encoding, the field separator (comma) and the decimal separator (period):

  <xs:complexType>

    <xs:sequence>

      <xs:annotation>

        <xs:appinfo source="http://www.ogf.org/dfdl/">

          <dfdl:sequence encoding="UTF-8" separator="," />

        </xs:appinfo>

      </xs:annotation>

      <xs:element name="w" type="xs:int">

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="text"

                        encoding="UTF-8"

                        textNumberRep="standard"

                        textNumberPattern="####0"

                        textStandardDecimalSeparator="."

                        lengthKind="delimited"/>

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

      <xs:element name="x" type="xs:int">

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="text"

                        encoding="UTF-8"

                        textNumberRep="standard"

                        textNumberPattern="#######0"

                        textStandardDecimalSeparator="."

                        lengthKind="delimited"/>

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

      <xs:element name="y" type="xs:double">

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="text"

                        encoding="UTF-8"

                        textNumberRep="standard"

                        textNumberPattern="0.0E+000"

                        textStandardDecimalSeparator="."

                        lengthKind="delimited"/>

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

      <xs:element name="z" type="xs:float">

        <xs:annotation>

          <xs:appinfo source="http://www.ogf.org/dfdl/">

            <dfdl:element representation="text"

                        encoding="UTF-8"

                        textNumberRep="standard"

                        textNumberPattern="0.0E0"

                        textStandardDecimalSeparator="."

                        lengthKind="delimited"/>

          </xs:appinfo>

        </xs:annotation>

      </xs:element>

    </xs:sequence>

  </xs:complexType>

Many properties are repeatedly expressed in the example for the sake of simplicity. Later sections of this specification will define the mechanisms DFDL provides to avoid this repetitiveness.
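As a brief, non-normative preview of one such mechanism, the sketch below restates the binary example with the shared properties written once, in a dfdl:format annotation on the xs:schema element, where they serve as defaults for the components declared in that schema document (see section 8.1). The sketch assumes that the dfdl prefix is bound to the DFDL 1.0 namespace and, like the examples above, it does not include all the necessary DFDL properties.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <!-- Properties stated once; they act as defaults throughout this schema document -->
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format representation="binary"
                   binaryNumberRep="binary"
                   binaryFloatRep="ieee"
                   byteOrder="bigEndian"
                   lengthKind="implicit"/>
    </xs:appinfo>
  </xs:annotation>

  <!-- The logical model needs no per-element annotations in this simple case -->
  <xs:complexType name="example1">
    <xs:sequence>
      <xs:element name="w" type="xs:int"/>
      <xs:element name="x" type="xs:int"/>
      <xs:element name="y" type="xs:double"/>
      <xs:element name="z" type="xs:float"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```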

1.3       What DFDL is not

DFDL maps data from a non-XML representation to an instance of an information set. This can be thought of as a data transformation. However, DFDL is not intended to be a general transformation language and, in particular, DFDL does not intend to provide a mechanism to map data to arbitrary XML models. There are two specific limitations on the data models that DFDL can work to:

  1. DFDL uses a subset of XML Schema; in particular, you cannot use XML attributes in the data model.
  2. The order of the data in the data model must correspond to the order and structure of the data being described.

This latter point deserves some elaboration. The XML schema used must be suitable for describing the physical data format. There must be a correspondence between the XML schema's constructs and the physical data structures. For example, generally the elements in the XML schema must match the order of the physical data. DFDL does allow for certain physically unordered formats as well.

The key concept here is that when using DFDL, you do not get to design an XML schema to your preference and then populate it from data. That would involve describing the data format, and describing a transformation for mapping it to the XML schema you have designed. DFDL is only about the format part of this problem. There are other languages, such as XSLT, which are for transformation. In DFDL, you describe only the format of the data, and this format constrains the nature of the XML schema you must use in its description.

1.4       Scope of version 1.0

The goals of version 1.0 are as follows:

  1. Leverage XML technology and concepts
  2. Support very efficient parsers/formatters
  3. Avoid features that require unnecessary data copying
  4. Support round-tripping, that is, read and write data in a described format from the same description
  5. Keep simple cases simple
  6. Simple descriptions should be "human readable" to the same degree that XSDL is.

The general features of version 1.0 are as follows:

a)     Text and binary data parsing and unparsing

b)    Validate the data when parsing and unparsing using XSDL validation.

c)     Defaulted input and output for missing representations

d)    Reference – use of the value of a previously read element in subsequent expressions

e)     Choice – capability to select among format variations

f)     Hidden sequence of elements – A description of an intermediate representation whose corresponding Infoset is not exposed in the final result.

g)    Basic Math – in DFDL expressions

h)     Out-of-type value handling (e.g., the string value 'NIL' to indicate nil for an integer)

i)      Speculative parsing to resolve uncertainty.

j)      Very general parsing capability: Lookahead/Push-back
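To illustrate feature (h), the following non-normative sketch describes a text integer for which the literal string 'NIL' in the data denotes a nil value. The element name quantity, the ';' terminator and the number pattern are assumptions chosen for illustration only, and not all necessary DFDL properties are shown. The nil-related properties themselves are defined in section 13.16.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <!-- Hypothetical element: an integer in text form, or the out-of-type literal NIL -->
  <xs:element name="quantity" type="xs:int" nillable="true"
              dfdl:representation="text"
              dfdl:encoding="UTF-8"
              dfdl:lengthKind="delimited"
              dfdl:terminator=";"
              dfdl:textNumberRep="standard"
              dfdl:textNumberPattern="#0"
              dfdl:nilKind="literalValue"
              dfdl:nilValue="NIL"/>

</xs:schema>
```

Under this description, data such as 42; would be expected to yield the value 42, whereas NIL; would be expected to yield a nil element rather than a processing error.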

Version 1.0 of DFDL is a language capable of expressing a wide range of binary and text-based data formats.

DFDL is capable of describing binary data as found in the data structures of COBOL, C, PL1, Fortran, etc. In particular, it is able to describe repeating sub-arrays where the length of an array is stored in another location of the structure.
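As a non-normative sketch of this capability, the schema below describes a record in which a stored count is followed by that many binary integers. The element names record, count and item are hypothetical, and not all necessary DFDL properties are shown. The properties dfdl:occursCountKind and dfdl:occursCount are defined in section 16.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format representation="binary"
                   binaryNumberRep="binary"
                   byteOrder="bigEndian"
                   lengthKind="implicit"/>
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <!-- The array length is stored ahead of the array itself -->
        <xs:element name="count" type="xs:unsignedShort"/>
        <!-- The number of occurrences is read from the earlier element -->
        <xs:element name="item" type="xs:int"
                    minOccurs="0" maxOccurs="unbounded"
                    dfdl:occursCountKind="expression"
                    dfdl:occursCount="{ ../count }"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```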

DFDL is capable of describing a wide variety of textual data formats such as HL7, X12, and SWIFT. Textual data formats often use syntax delimiters, such as initiators, separators and terminators to delimit fields.
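The three kinds of syntax delimiter can be sketched as follows. The format shown is invented for illustration (it is not taken from HL7, X12 or SWIFT): a record such as [alpha,beta] opens with an initiator, separates its two fields with a comma and closes with a terminator. The element names are hypothetical, and not all necessary DFDL properties are shown.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format representation="text"
                   encoding="UTF-8"
                   lengthKind="delimited"/>
    </xs:appinfo>
  </xs:annotation>

  <!-- Initiator and terminator frame the whole record -->
  <xs:element name="pair" dfdl:initiator="[" dfdl:terminator="]">
    <xs:complexType>
      <!-- The separator appears between the fields of the sequence -->
      <xs:sequence dfdl:separator="," dfdl:separatorPosition="infix">
        <xs:element name="first" type="xs:string"/>
        <xs:element name="second" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```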

DFDL has certain composition properties; that is, two formats can be nested or concatenated and a working format results.

The following topics have been deferred to future versions of the standard:

-       Extensibility: There are real examples of proprietary data format description languages that we use as our base of experience from which to derive standard DFDL. However, there are no examples of extensible format description languages. Therefore, while extensibility is desirable in DFDL, there is not yet a base of experience with extensibility from which to derive a standard.

-       Rich Layering: Some formats require data to be described in multiple passes. Combining these into one DFDL schema requires very rich layering functionality. In these layers one element's value becomes the representation of another element. DFDL V1.0 allows description of only a limited kind of layering.

1.5       Related standards

1.     Prescriptive systems:

a.     JavaScript Object Notation (JSON) [JSON]

b.    EXI (binary XML) [EXI]

c.     Thrift [Thrift]

d.    Avro [AVRO]

e.     ASN.1 with any of the prescribed encoding rules: Basic Encoding Rules (BER), Distinguished Encoding Rules (DER), Canonical Encoding Rules (CER) [ASN1CER] or Packed Encoding Rules (PER) [ASN1PER]

2.     Descriptive systems:

a.     ASN.1 Encoding Control Notation (also known as ITU-T X.692) [ASN1ECN]

b.    Binary Format Description (BFD) Language [BFD]

2.     Notational and Definitional Conventions

The key words must, must not, required, shall, shall not, should, should not, recommended, may, may not and optional in this document are to be interpreted as described in [RFC2119]. Note that for reasons of clarity these words are not always capitalized in this document.

Examples are for illustration purposes only and for clarity they will often not include all the necessary DFDL properties.

2.1       Failure Types

Where the phrase "must be consistent with" is used, it is assumed that a conforming DFDL implementation must check for the consistency and issue appropriate diagnostic messages when an inconsistency is found.  

There are several kinds of failures that can occur when a DFDL processor is handling data and/or a DFDL schema.

2.2       Schema Definition Error

When the DFDL schema itself contains an error, it implies that the DFDL processor cannot process data because the DFDL schema is not meaningful. It may be ambiguous, or contain conflicting definitions. Equivalently, we can say that there is no possible data that conforms to the schema; hence, the schema cannot be meaningful. All conforming DFDL processors must detect all schema definition errors, and must issue some kind of appropriate diagnostic message. The behavior of a DFDL processor after a schema definition error is detected is out of scope for this specification.

When a schema definition error can be detected statically, that is, given only the schema, it is desirable, though not required by the DFDL standard, that such errors be detected and diagnostic messages issued before any data are processed. Of course, not all schema definition errors can be detected without reference to data, as some representation properties may obtain their values from the data (see also section 2.3.1 Ambiguity of Data Formats).

The expression language included within DFDL is strongly, statically type checkable. This means that type checking of expressions can be performed without processing data, and implementations are encouraged to perform this checking statically so that schema definition errors having to do with type inconsistencies can be detected before processing data.

Note that schema definition errors cannot be suppressed by points of uncertainty.

2.2.1.1      Schema Component Constraint: Unique Particle Attribution

A DFDL processor MUST implement the Schema Component Constraint: Unique Particle Attribution defined in XML Schema Part 1: Structures [XSDLV1] that applies to the DFDL schema subset.

Two elements overlap if

A schema will violate the unique attribution constraint if it contains two particles which overlap and which either

Or

2.3       Processing Errors

If a DFDL schema contains no schema definition errors, then there is the additional possibility of a processing error when processing data using a DFDL schema. A processing error occurs if the data does not conform to the format described by the schema, that is to say, the data is not well-formed relative to the schema.

Processing errors can be suppressed by a point of uncertainty. See section 9.3.3.  

It is expected that DFDL implementations will provide additional implementation-defined mechanisms for dealing with effective processing errors, such as the means of specifying retry points or the means of skipping some data so as to recover from the error in some way.

Exceptions that occur in the evaluation of the DFDL expression language are processing errors.

Non-conformance with the XSDL minOccurs or maxOccurs constraints is either a processing error or only a validation error depending on the settings of certain DFDL properties (see section 16 below).

2.3.1       Ambiguity of Data Formats

A data format using delimiters may be ambiguous if the delimiters are not distinct, and a data format description which has fixed data requirements (that is, where some elements have fixed values) may be ambiguous even with fixed length elements.[1]

If the delimiter string values are stored within the data, perhaps as elements of a header part of the data, then this ambiguity certainly cannot be examined until the data is available.

Given an ambiguous grammar, a DFDL implementation may successfully parse a particular input data stream. That is, the part of the schema with the ambiguity may not be exercised by a particular data stream, or the data may parse successfully anyway because the ambiguity may not cause any kind of failure or processing error.

Hence, to ensure compatible behavior, DFDL v1.0 implementations MUST NOT detect grammar ambiguities as errors. Implementations are of course free to issue warnings to help users identify these situations, but ambiguity is neither a Schema Definition Error nor a Processing Error.

2.3.1.1      Unparsing Must be Unambiguous

Usually, the behavior of the unparser is symmetric to the behavior of the parser; however, there are cases where the DFDL schema will accept several equivalent representations for the same logical data. In this case it would be ambiguous which of these equivalent representations should be produced by the unparser. The DFDL standard contains representation properties which are used to eliminate this ambiguity. It is a schema definition error if a DFDL schema is being used to unparse data and there is any ambiguity about the representation.

2.4       Validation Errors

Logical validation checks are constraints expressed in XSDL, and they apply to the logical values of the infoset. Hence, parsing must successfully construct the infoset from the representation in order for validation checks to be meaningful. This implies that validation errors cannot affect the ability of a DFDL processor to successfully parse or unparse data; that is, validation errors are independent of whether the data is well-formed with respect to the DFDL schema.
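
The separation between well-formedness and validity can be illustrated with a short, non-normative Python sketch. It is not part of DFDL: the names ProcessingError, ParsedElement, parse_int, and validate are hypothetical, and a single maxInclusive-style facet stands in for the full set of validation checks.

```python
from dataclasses import dataclass, field


class ProcessingError(Exception):
    """Raised when the data is not well-formed relative to the schema."""


@dataclass
class ParsedElement:
    name: str
    value: int
    validation_errors: list = field(default_factory=list)

    @property
    def valid(self) -> bool:
        return not self.validation_errors


def parse_int(name: str, text: str) -> ParsedElement:
    # Well-formedness: the representation must convert to the logical type.
    if not (text.isascii() and text.isdigit()):
        raise ProcessingError(f"{name}: data not convertible to type")
    return ParsedElement(name, int(text))


def validate(item: ParsedElement, max_inclusive: int) -> ParsedElement:
    # Validation: a facet check on the logical value. A failure is recorded
    # on the item; it does not undo or fail the parse.
    if item.value > max_inclusive:
        item.validation_errors.append(
            f"{item.name}: value {item.value} exceeds maxInclusive {max_inclusive}"
        )
    return item
```

In this sketch, parsing the text "150" succeeds and yields a value even when the facet limit is 120; only the recorded validity changes. Parsing "1x0" fails before validation is ever considered.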

DFDL processors may provide both validating and non-validating behaviors on either or both of parse and unparse. (A DFDL implementation could support validate on parse, but not support it on unparse and still be considered conforming.)

Validation on unparsing takes place on the augmented infoset that is created by the unparser as a side-effect of creating the output data stream.

When resolving points of uncertainty (during parsing), validation errors are ignored.

The way a validation error is presented to the execution context of a DFDL processor is not specified by the DFDL language. The validity of an element is recorded in the DFDL Infoset, see Section 4 The DFDL Information Set (Infoset).

The following DFDL schema constructs are allowed in DFDL and are checked when validating:

  1. XSDL pattern facet - (for XSD string type elements only)
  2. XSDL minLength, maxLength
  3. XSDL minInclusive, minExclusive, maxInclusive, maxExclusive
  4. XSDL enumeration
  5. XSDL maxOccurs

Note that validation is distinct from the checking of DFDL assert or discriminator predicates. When a DFDL discriminator or assert is used to discriminate a choice or other point of uncertainty when parsing, then that assert or discriminator is essential to parsing and it is evaluated irrespective of whether validation is enabled or disabled.

There is also a function dfdl:checkConstraints available in the DFDL Expression language. This can be used to explicitly include checking of the XSD facet constraints as part of parsing a specific element. Such checking is part of parsing, and does not create validation errors. See Section 23.5.3 DFDL Functions for details.

2.5       Recoverable Error

This error type is used with the dfdl:assert annotation when parsing to permit the checking of physical format constraints without terminating a parse. For example, some formats will have redundancy by having known lengths, as well as delimiters. A recoverable error can be issued, using an assert to check a physical length constraint when property lengthKind is 'delimited'.

Recoverable errors are independent of validation, and when resolving points of uncertainty, recoverable errors are ignored.

2.6       Specific Errors Classified

This section clarifies which errors are schema definition errors and which are processing errors.

The following are processing errors:

·         Arithmetic Errors

o    Division by zero

o    Integer Arithmetic Underflow

o    Integer Arithmetic Overflow

o    Note: Floating point math can produce NaN (Not a Number) values. This is not an error, nor are properly typed operations on floating point NaN values.

·         Expression Errors

o    Dynamic Type Error – unable to convert to target type

§  Example: non-digits found in string argument to xs:int(…) constructor.

§  Note: if a DFDL Implementation cannot distinguish Dynamic Type Errors from Static Type Errors, then a Dynamic Type Error should cause a Schema Definition Error

o    Index out of bounds error – index not <= number of occurrences, or is < 1.

§  Note: same error for dfdl:testBit if bitPos is not 1..8, or for character positions in a string-value

o    Indexing of non-array non-optional element

§  Example: x[1] when x is declared and has both minOccurs="1" and maxOccurs="1" explicitly, or by not stating either or both of them.

o    Illegal argument value (correct type, illegal value)

·         Parse Errors

o    Delimiter not found

o    Data not convertible to type

o    Assertion failed

o    Discriminator failed

o    Required occurrence not found

o    No choice alternative successfully parsed.

o    Character set decoding failure and dfdl:encodingErrorPolicy is 'error'

·         Unparse Errors

o    Truncation scenarios where truncation is being disallowed

o    Rounding error – rounding needed but not allowed. (Unparsing)

o    No choice alternative successfully unparsed.

o    Character set encoding failure and dfdl:encodingErrorPolicy is 'error'

·         Implementation-defined Limit Errors - Implementations can have fixed or adjustable limits that some formats and some data may exceed at processing time. This specification does not further specify what these errors are, but some possible examples are:

o    Data longer than allowed for representation of a given data type

§  Example: exceed maximum length of representation of xs:decimal when dfdl:representation is "text".

o    Expression references too far back into infoset (parsing)

o    Expression references too far forward into infoset (unparsing)

o    Number of array elements exceeds limit.

o    Regular expression exceeds time limit

The following are schema definition errors, regardless of whether they are detected in advance of processing or once processing begins:

·         Errors in XML Schema Construction and Structure

o    See XML Schema Specification Part 1, Section 5.1 [XSDLV1]

·         Use of XSD constructs outside of DFDL subset

·         Implementation-defined Limitations

o    Use of DFDL schema constructs not supported by this implementation.

§  Example: xs:choice is an optional part of the DFDL specification (see section 21). If not supported, it must be rejected as a Schema Definition Error.

§  Example: use of packed-decimal when it is not supported by the implementation.

§  Example: use of dfdl:assert when it is not supported by the implementation (See Spec section 21 on DFDL Subsets)

§  Note: Unrecognized DFDL properties or property values can produce a warning and an implementation can attempt to process data despite the warning.

o    Exceeding implementation-dependent limits for schema size/complexity

§  Example: schema too large – simply a limit on how large the schema can be, how many files, how many top-level constructs, etc.

·         Schema Not Valid

o    See XML Schema Specification Part 1, Section 5.2 [XSDLV1]

·         UPA violation (Unique Particle Attribution)

·         Reference to DFDL global definition not found

o    Format definition (dfdl:defineFormat)

o    Escape scheme definition (dfdl:defineEscapeScheme)

o    Variable Definition (dfdl:defineVariable)

·         DFDL Annotations not well-formed or not valid

·         DFDL Annotations Incompatible

o    E.g., dfdl:assert and dfdl:discriminator at same combined annotation point, or more than one format annotation at an annotation point.

·         DFDL Properties and their values

o    Property not applicable to DFDL annotation

o    Property value not suitable for property

o    Property conflict

§  Between Element Reference and Element Declaration

§  Between Element Declaration and Simple Type Definition

§  Between Simple Type Definition and Base Simple Type Definition

§  Between Group Reference and Sequence/Choice of Group Definition

o    Required property not found

·         Expressions

o    Expression syntax error

o    Named child element doesn't exist – E.g., /a/b, and there is no child b in existence.

§  Note: no child possible in the schema is a different error, but also a Schema Definition Error, as /a/b would not have a type in that case.

§  Note: This is an SDE, as schema authors are advised to use fn:exists(…) to test for existence of elements when it is possible that they not exist.

o    Variable read but not defined

o    Variable assigned after read

o    Variable assigned more than once

o    Static Type error – type is incorrect for usage

§  Note: if an implementation is unable to distinguish Static Type Errors from Dynamic Type Errors, then both should cause Schema Definition Errors.

o    Path step definition not found – e.g., /a/n:b but no definition for n:b as local or global element.

o    Not enough arguments for function

o    Expression value is not single node

§  Most DFDL expression contexts require an expression to identify a single node, not an array (aka sequence of nodes). There are a few exceptions such as the fn:count(…) function, where the path expression must be to an array or optional element.

o    Expression value is not array element or optional element.

§  Some DFDL expression contexts require an array or an optional element.

§  Example: The fn:count(...) function argument must be to an array or optional element. It is an SDE if the argument expression is otherwise.

·         Regular Expressions

o    Syntax error

2.7       Optional Checks and Warnings

A DFDL processor:

There are two exceptions to this, which must be checked:

·         Global simple types that are referenced by prefixLengthType property

·         Global elements that are the document root.

Some situations suggest likely errors, but a DFDL processor cannot be certain. In these situations, a DFDL processor may issue warnings to assist a DFDL schema author in identifying likely errors. An important case of this is when the DFDL processor encounters a schema component and annotation where there are explicitly properties that are not relevant to the component as defined. Depending on the specifics of the component and property the DFDL processor can or must take certain actions. If the:

 

3.     Glossary

Adjacent - Two parts of the input/output stream are adjacent if they are at consecutive addresses.

Addressable Unit - This is the unit of storage that makes up the input or output stream holding the representation of the data. The units are bits, bytes, or characters.

Annotation point - A location within a DFDL schema where DFDL annotation elements are allowed to appear.

Applicable properties - All the DFDL properties that apply to that type of schema construct. For example all the DFDL properties that apply to an xs:simpleType.

Array - The set of adjacent elements whose XSDL element declaration specifies the potential for it to have more than one occurrence (XSD property maxOccurs > '1' or 'unbounded'). Of course any given array can have any number of element occurrences, including zero elements or exactly 1 element as long as the occurrence constraints are met. If XSD property maxOccurs is 'unbounded' then there is no constraint to the maximum number of occurrences, though implementations may have implementation-defined maximum capabilities. An optional element (where XSD property maxOccurs is '1', minOccurs is '0') is not considered to be an array as described in this document. Note that a sequence is not to be confused with an array. A sequence is a complex tuple type for an element; the children of a sequence can be of different types. All elements of an array have the same type and have the same information item members except for the value member.

Array Element – An element declaration or reference with XSD property maxOccurs > '1' or 'unbounded'.

Augmented Infoset - When unparsing one begins with the DFDL schema and conceptually with the logical infoset. As the values of items are filled in by defaulting, and by use of the DFDL outputValueCalc property (including on hidden items), these new item values augment the infoset. The resulting infoset is called the augmented infoset.

Binary - There are two meanings for this word depending on context.

Binary Representation - Of type xs:hexBinary, or of other type with property dfdl:representation 'binary'. Note that type xs:string can never have binary representation.

Bit Order - Within a binary integer, if the most-significant bit is assigned bit position 1, then the bit order is most-significant-bit first. If the least-significant bit is assigned bit position 1, then the bit order is least-significant-bit first. When the bit order is most-significant-bit first, then the least-significant bit of byte N is considered to be adjacent to the most-significant bit of byte N+1. When the bit order is least-significant-bit first, then the most-significant bit of byte N is considered to be adjacent to the least-significant bit of byte N+1.

Bit Position - The data stream is assumed to be a collection of consecutively numbered unsigned bytes. Each byte is a numeric value from 0 to 255. The bits of a byte are referred to by their numerical significance as the 2^n bit, for n from 0 to 7. Hence, the byte value 255 = 2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0. The 2^7 bit is the most-significant bit, and the 2^0 bit is the least-significant bit. The bits within each byte are assigned numbered bit positions 1 to 8 according to the bit order. Given a bit order, every bit in the data stream has a unique bit position.
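
As a non-normative illustration, the following Python sketch maps the significance of a bit within a byte to its bit position under each bit order. The function names and the bit order strings are illustrative choices. The stream-wide numbering in stream_bit_position (bytes indexed from 0, positions starting at 1) is an assumption made for this sketch, not something this glossary entry defines.

```python
def bit_position(n: int, bit_order: str) -> int:
    """Return the 1-based position, within its byte, of the bit with place value 2**n."""
    if not 0 <= n <= 7:
        raise ValueError("n must be in 0..7")
    if bit_order == "mostSignificantBitFirst":
        return 8 - n  # the 2**7 bit is assigned position 1
    if bit_order == "leastSignificantBitFirst":
        return n + 1  # the 2**0 bit is assigned position 1
    raise ValueError(f"unknown bit order: {bit_order}")


def stream_bit_position(byte_index: int, n: int, bit_order: str) -> int:
    """Assumed stream-wide numbering: 0-based byte index, 1-based bit positions."""
    return byte_index * 8 + bit_position(n, bit_order)


# The byte value 255 is the sum of all eight place values.
all_bits_set = sum(2 ** n for n in range(8))
```

With most-significant-bit first, the least-significant bit of byte 0 (position 8) is followed directly by the most-significant bit of byte 1 (position 9), which matches the adjacency rule given under Bit Order.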

Bit String - The ordered set of bits from a first bit with bit position N through bit position N+M-1 is a bit string of length M bits.

Byte - The term "byte" refers to an 8-bit octet. Can also refer to an integer with value from 0 to 255 inclusive.

CCSID - see Coded Character Set Identifier.   

Character - An ISO10646 [ISO10646] character having a unique character code as its identifier. This concept is independent of font, typeface, size, and style, so an 'F' rendered in any font, size, or style is the same character 'F'.

Character Code - The canonical integer used to identify a character in the ISO10646 [ISO10646] standards. This number identifies the character, but can be independent of any specific character set encoding of the character. Example: The '{' character known in Unicode [Unicode] as LEFT CURLY BRACKET has character code U+007B. However, depending on the character set encoding, the value 0x7B may or may not appear in the representation of that character.   
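
The LEFT CURLY BRACKET example can be checked with a small, non-normative Python sketch. The three encodings are chosen only for illustration; cp037 is one of the EBCDIC code pages shipped with Python's standard codecs and is not named by this specification.

```python
import unicodedata

ch = "{"
code = ord(ch)                  # the character code, independent of any encoding
name = unicodedata.name(ch)     # the Unicode character name

# The same character encoded under three different character set encodings.
encodings = ["ascii", "utf-16-be", "cp037"]
representations = {enc: ch.encode(enc) for enc in encodings}

for enc, data in representations.items():
    # Report whether the byte value 0x7B appears in the representation.
    print(f"{enc:10} {data.hex()}  contains 0x7B: {0x7B in data}")
```

The character code is U+007B in every case, but the byte 0x7B appears in the ASCII and UTF-16BE representations and not in the EBCDIC one.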

Character Set - An abstract set of characters that are assigned (or mapped to) a representation by a particular character set encoding. For most character set encodings their character set is a subset of the Unicode character set.   

Character Set Encoding - Often abbreviated to just 'encoding'. A specific representation of a character set as bytes or bits of data. A character set encoding is usually identified by a standard character set encoding name or a recognized alias name, or by a coded character set identifier or CCSID [CCSID]. These identifiers are standardized. The names and aliases are standardized by the IANA [IANA] (where unfortunately, they are called character set names). CCSIDs are an industry standard. Examples of character set encoding names are UTF-8, USASCII, GB2312, ebcdic-cp-it, ISO-8859-5, UTF-16BE, Shift_JIS. There are also additional DFDL standard character set encodings, see DFDL Standard Encoding. The DFDL standard also allows for implementation-defined character set encodings to be supported.

Character Width - The number of code units or alternatively the number of bytes or bits used to represent a character in a specific character set encoding is called the character width. Encodings are either fixed width (all characters encoded using the same width), or variable-width (different characters are encoded using different widths). For example the UTF-32 character set encoding has 4-byte character width, whereas USASCII has a 1-byte character width. UTF-8 is variable width, and any specific character has width 1, 2, 3, or 4 bytes. See also Fixed-Width Character Encoding and Variable-Width Character Encoding  
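
The widths quoted above can be reproduced with a short, non-normative Python sketch. The sample characters are arbitrary choices, one for each UTF-8 width.

```python
# One sample character for each possible UTF-8 width: A, é, €, and U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Variable width: the number of bytes depends on the character.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]

# Fixed width: every character occupies the same number of bytes.
utf32_widths = [len(ch.encode("utf-32-be")) for ch in samples]
ascii_width = len("A".encode("ascii"))
```

Here utf8_widths holds a different width for each sample, while utf32_widths holds the same width for all of them.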

Code Point - The integer that identifies a character within a character set encoding. A code point is represented by one or more code units. When a character set is fixed width, then there is no distinction between a code unit and a code point. For Unicode character set encodings, there is no distinction between a character code and a code point. Examples:

Code Unit - When a character set encoding uses differing variable width representations for characters, the units making up these variable width representations are called code units. For example the UTF-8 encoding uses between 1 and 4 code units to represent characters, and for UTF-8, the individual code units are single bytes. DFDL's interpretation of the UTF-16 encoding is either fixed or variable width. When format property dfdl:utf16Width is 'variable' then UTF-16 is variable width and this encoding uses either one or two code units per character, but in this case each individual code unit is a 16-bit value. When a character set is fixed width, then there is no distinction between a code unit and a code point.

Coded Character Set Identifier (CCSID) - An alternate identifier of a character set encoding. Originally created by IBM, CCSIDs are a broadly used industry standard. See [CCSID].

Component - A construct within a DFDL schema that may contain a DFDL annotation.

Content - The content is the bits of data that are interpreted to compute a logical value.

Content Model - Used in describing the syntactic structure of XSD and DFDL annotations of it. An element of a schema can have empty, simple, or element-only content. An element declaration for an element of complex type containing an xs:sequence element is said to have a sequence in its content model.

Contiguous - An element has a contiguous representation if all parts of its representation are adjacent in the input/output stream. Most simple types have contiguous representations naturally. Groups containing elements that are themselves contiguous are also considered to have contiguous representations irrespective of alignment fill or padding of any kind that exists within the group. Similarly, arrays of elements that are themselves contiguous are also contiguous. An example of a non-contiguous representation would be a nillable element, where a flag is used to determine whether or not the element is nil, and the location of that flag is not adjacent to the value representation.

Count - The number of occurrences of an element.

Data Stream - Data where the format is being described by a DFDL schema. This use of 'stream' implies only that there is a numbering scheme that specifies a unique bit position for every bit within the data. This use of 'stream' does not imply anything about whether the data is persistently stored or not, nor does it imply anything about whether there are sequential or random access capabilities for access to the data.

DBCS - See Double-Byte Character Set

Decimal - This term is used several different ways distinguished by context:

  1. Base 10. When data has text representation, a decimal number has base-10 digits.
  2. Type xs:decimal - a logical type of number that has an integer component and an optional base-10 fractional component. This type subsumes all integer types, as they are of type xs:decimal but with the further restriction that the fractional part doesn't exist. Note that a base-10 fraction has different rounding properties than a base-2 or floating point numeric fraction; hence, xs:decimal is the type commonly used to represent currency/money in data.
  3. Packed Decimal - A binary data representation. See separate glossary entry below.

Defining Annotations - The annotation elements dfdl:defineFormat, dfdl:defineVariable, and dfdl:defineEscapeScheme

Delimiter - A character or string used to separate, or mark the start and end of, items of data. In DFDL, dfdl:lengthKind 'delimited' scans the data for initiators, separators, and terminators.

Delimiter scanning - When parsing, the process of scanning for a specific item in the input data which either marks the end of an item or the beginning of a subsequent item. Delimiter scanning also takes into account escape schemes so as to allow the delimiters to appear within data if properly escaped.
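
A minimal, non-normative Python sketch of delimiter scanning with a single escape character follows. It is a deliberate simplification: the function name and parameters are hypothetical, only one separator is scanned for, and the full behavior of DFDL escape schemes is not modeled.

```python
def scan_delimited(data: str, separator: str, escape: str = "\\") -> list:
    """Split data on separator, treating the character after escape as plain data."""
    if not separator or len(escape) != 1:
        raise ValueError("separator must be non-empty and escape a single character")
    fields, current, i = [], [], 0
    while i < len(data):
        if data[i] == escape and i + 1 < len(data):
            current.append(data[i + 1])  # escaped character is kept as data
            i += 2
        elif data.startswith(separator, i):
            fields.append("".join(current))  # the separator ends the current item
            current = []
            i += len(separator)
        else:
            current.append(data[i])
            i += 1
    fields.append("".join(current))
    return fields
```

For example, scanning the text a\,b,c with separator "," treats the escaped comma as data, so the first item is "a,b" and the second is "c".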

DFDL – Data Format Description Language

DFDL Processor - A program that uses DFDL schemas in order to process data described by them.

DFDL Schema - An XML schema containing DFDL annotations to describe data format.

DFDL Standard Encoding - A character set for which there is no IANA name or CCSID but the name and definition of which DFDL implementations must agree on. See Section 34 Appendix D: DFDL Standard Encodings.

Double-Byte Character Set (DBCS) - A character set encoding where each character code consists of one code unit which uses exactly 2 bytes.

Dynamic extent - This is a characteristic of the data stream. When parsing data corresponding to a schema component, the collection of bits within the data stream that contain any aspect of the representation of that schema component make up the component's dynamic extent.

Dynamic scope - This is a characteristic of parts of the DFDL schema. When a definition or declaration contains or references another declaration or definition, then the contained definition or declaration is said to be in the dynamic scope of the enclosing one. The important characteristic of dynamic scoping is that it traverses references. When parsing, the dynamic scope of an element declaration includes all definitions and declarations used as part of parsing that element.

Element - A part of the data described by an element declaration in the schema and presented as an element information item in the infoset.

Encoding - See Character Set Encoding.   

Explicit properties - The explicit properties are the combination of any defined locally on the annotation and any defined by a dfdl:defineFormat annotation referenced by a local dfdl:ref property.

Fixed-Width Character Encoding - A character set encoding where all characters are encoded using a single code unit for their representation. Note that a code unit is not necessarily a single byte. It may be more than one byte, or some number of bits less than a byte.  Examples of different fixed widths are:

Fixed Array Element - An array element where XSDL minOccurs is equal to XSDL maxOccurs.

Format annotations - The annotation elements dfdl:format, dfdl:element, dfdl:simpleType, dfdl:group, dfdl:sequence, dfdl:choice, and dfdl:escapeScheme.

Format property – A DFDL property carried on a DFDL format annotation.

Framing - The term used to describe the delimiters, length fields, and other parts of the data stream which are present, and may be necessary to determine the length or position of the content of an element.

Implementation-defined feature - A feature where the implementation has discretion in how it is performed, and the implementation must document how it is performed.

Implementation-dependent feature - A feature where the implementation has discretion in how it is performed, but the implementation is not required to document how the feature is performed.

Index - The position of an occurrence in a count, starting at 1.

Item - A DFDL information set consists of a number of information items; or just items for short.

Least-Significant Bit - Often abbreviated to LSB. In a binary integer the least significant bit is the bit having the least place value. Within an 8-bit unsigned byte, the bit with place value 2^0 (or 1) is the least significant bit.

Length - When discussing data items and their representations, the term 'length' is used to refer to the measure of the size of the representation of an item in units of bits, bytes, or characters. The length of an array is the number of bits, bytes, or characters making up its representation, and has nothing to do with the number of occurrences in the array. Any element occurrence has length. Only array elements and optional elements have numbers of occurrences other than 1.

Lexical scope - In a DFDL Schema document, the lexical scope of any element is the collection of schema declarations, definitions, and annotations contained within the element textually.

Local properties – Local properties are the properties defined on an annotation in either short, attribute or element form

Logical layer - A DFDL Schema with all the DFDL annotations ignored is an ordinary XSDL schema. The logical structure described by this XSDL is called the DFDL logical layer.

Most-Significant Bit - Often abbreviated to MSB. In a binary integer the most significant bit is the bit having the greatest place value. Within an 8-bit unsigned byte, the bit with place value 2^7 is the most significant bit.

Nibble - 4 bits. A single hexadecimal digit (0 to 9, A to F) is often referred to as a nibble as it can be represented in exactly 4 bits.

Node - The term Node is a shorter equivalent to Element Information Item of the DFDL Infoset described in Section 4.1.2 Element Information Items.

Non-representation property – A format property that is not a representation property, specifically dfdl:ref, dfdl:hiddenGroupRef, dfdl:choiceBranchKey, dfdl:choiceDispatchKey, dfdl:inputValueCalc, dfdl:outputValueCalc. See also representation property.

Occurrence - An instance of an element in the data, or an item in the DFDL Infoset.

Optional Element - An element declaration or reference where XSDL minOccurs is equal to zero.

Optional Occurrence - An occurrence with an index greater than XSDL minOccurs.

Packed decimal – A physical representation of decimal and integer numbers where each digit is packed into one nibble (4 bits) of a byte. There are several variants; some also include a sign nibble and some include a padding nibble. The term covers all the following enums of the dfdl:binaryNumberRep and dfdl:binaryCalendarRep properties – 'packed' (IBM 390 packed), 'bcd' (standard binary coded decimals or BCDs) and 'ibm4690Packed' (IBM 4690 packed).
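
The nibble packing can be sketched in Python as follows. This is a non-normative illustration of two of the variants only. The function names are hypothetical, and the sign nibble values used here (0xC for positive, 0xD for negative) are an assumption of the sketch, not a statement of what any particular DFDL property setting requires.

```python
def encode_bcd(digits: str) -> bytes:
    """'bcd'-style sketch: two decimal digits per byte, no sign nibble."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("digits only")
    if len(digits) % 2:
        digits = "0" + digits  # pad with a leading zero nibble
    return bytes(int(digits[i]) << 4 | int(digits[i + 1])
                 for i in range(0, len(digits), 2))


def encode_packed(value: int) -> bytes:
    """'packed'-style sketch: digit nibbles followed by a trailing sign nibble."""
    sign = 0xD if value < 0 else 0xC  # assumed sign codes for this sketch
    digits = str(abs(value))
    if len(digits) % 2 == 0:
        digits = "0" + digits  # digits plus the sign nibble must fill whole bytes
    nibbles = [int(d) for d in digits] + [sign]
    return bytes(nibbles[i] << 4 | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))
```

Under these assumptions, the digits 1234 occupy two bytes in the 'bcd' style, and the value 123 occupies two bytes in the 'packed' style with the sign in the final nibble.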

Potentially represented - An element declaration in the schema describes a potentially represented item if that element declaration does not have a dfdl:inputValueCalc property. Whether the element declaration describes an occurrence that is actually represented or not depends on whether the element declaration is for an optional element, and whether the element has a corresponding value in the augmented infoset.

Physical Layer - A DFDL Schema adds DFDL annotations onto an XSDL language schema. The annotations describe the physical representation or physical layer of the data.

Point of Uncertainty - A point of uncertainty occurs in the data stream when there is more than one schema component that might occur at that point.

Representation property - A format property that is used to describe a physical characteristic of a component. Such a property will apply to one or more grammar regions of the component. See also non-representation property.

Required Element - An element declaration or reference where XSDL minOccurs is greater than zero.

Required Occurrence - An occurrence with an index less than or equal to XSDL minOccurs.
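
The occurrence-related entries in this glossary (Array Element, Optional Element, Fixed Array Element, Required Occurrence, Optional Occurrence) reduce to simple comparisons on XSDL minOccurs and maxOccurs. The following non-normative Python sketch restates them; the function names are hypothetical, and 'unbounded' is represented as None purely for convenience.

```python
from typing import Optional


def is_array_element(max_occurs: Optional[int]) -> bool:
    """maxOccurs > 1, or 'unbounded' (represented here as None)."""
    return max_occurs is None or max_occurs > 1


def is_optional_element(min_occurs: int) -> bool:
    """minOccurs equal to zero."""
    return min_occurs == 0


def is_fixed_array_element(min_occurs: int, max_occurs: Optional[int]) -> bool:
    """An array element where minOccurs equals maxOccurs."""
    return is_array_element(max_occurs) and min_occurs == max_occurs


def is_required_occurrence(index: int, min_occurs: int) -> bool:
    """Occurrences are indexed from 1; those up to minOccurs are required."""
    if index < 1:
        raise ValueError("index starts at 1")
    return index <= min_occurs


def is_optional_occurrence(index: int, min_occurs: int) -> bool:
    """Any occurrence with an index greater than minOccurs."""
    return not is_required_occurrence(index, min_occurs)
```

For a declaration with minOccurs 2 and maxOccurs 'unbounded', occurrences 1 and 2 are required occurrences and occurrence 3 onward are optional occurrences.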

Required Property – A DFDL property that must have a value. The required properties for each xs:schema component are listed in the Property Precedence tables in section 23.

Resolved set of annotations - When DFDL annotations appear on a group reference and the sequence or choice of the referenced global group, or appear among an element reference, an element declaration, and its type definition, then they are combined together and the resulting set of annotations is referred to as the resolved set of annotations for the schema component.

SBCS - See Single Byte Character Set

Scan – Examine the input data looking for delimiters such as separators and terminators, or matches to regular expressions.

Single-Byte Character Set (SBCS) - A character set encoding where each character code consists of one code unit which is exactly a single byte (8 bits).

Schema - The set of all declarations and definitions in the schema, including all included and imported schemas taken together. This includes both the XSDL declarations and definitions, and the DFDL definitions provided in the top-level DFDL annotations.

Schema Component Designator (SCD) - A notation for referring to one of the components of a DFDL Schema. This is being standardized by W3C. See http://www.w3.org/TR/xmlschema-ref.

Schema Definition Order – The order that the schema components are defined in a schema document.

Specified length - An item has specified length when dfdl:lengthKind is "implicit" (simple type only), "explicit", or "prefixed". 

Speculative Parsing – When the parser reaches a point of uncertainty it attempts to parse each option in turn until one is known-to-exist or known-not-to-exist.
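
A non-normative Python sketch of this behavior is given below. The alternative parsers (parse_digits, parse_word) and the parse_choice driver are hypothetical names used only for illustration, and the sketch treats an alternative as known-to-exist as soon as it parses without error, which is a simplification.

```python
class ProcessingError(Exception):
    """Raised when the data does not match what an alternative expects."""


def parse_digits(data: str, pos: int):
    end = pos
    while end < len(data) and data[end].isdigit():
        end += 1
    if end == pos:
        raise ProcessingError("expected digits")
    return ("number", data[pos:end]), end


def parse_word(data: str, pos: int):
    end = pos
    while end < len(data) and data[end].isalpha():
        end += 1
    if end == pos:
        raise ProcessingError("expected letters")
    return ("word", data[pos:end]), end


def parse_choice(data: str, pos: int, alternatives):
    """Try each alternative in turn, always restarting from the same position."""
    for alternative in alternatives:
        try:
            return alternative(data, pos)
        except ProcessingError:
            continue  # the error is suppressed and the next option is tried
    raise ProcessingError("no choice alternative successfully parsed")
```

Only when every alternative has failed does the failure surface as a processing error of the enclosing construct.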

Statement annotations - The annotation elements dfdl:assert, dfdl:discriminator, dfdl:setVariable, and dfdl:newVariableInstance. Also called DFDL Statements.

Statically - A DFDL Implementation can analyze a DFDL schema and determine the presence of many kinds of errors. This is called static analysis, compilation of the schema, or determining the presence of the error statically.

Surrogate Pair - A Unicode character whose character code value is greater than 0xFFFF can be encoded into variable-width UTF-16BE or UTF-16LE (which are variable-width encodings when the DFDL property utf16Width is 'variable'). In this case the representation uses two adjacent code units each of which is called a surrogate, and the pair of which is called a surrogate pair.    
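
The two code units of a surrogate pair can be computed directly from the character code. The following non-normative Python sketch does so using the standard UTF-16 arithmetic and cross-checks the result against Python's own UTF-16BE encoder; the function names are hypothetical.

```python
import struct


def surrogate_pair(code: int) -> tuple:
    """Split a character code above 0xFFFF into its two UTF-16 surrogates."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("character code does not need a surrogate pair")
    offset = code - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units of ch as encoded in UTF-16BE."""
    data = ch.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))
```

A character with a code of 0xFFFF or below, such as 'A', yields a single code unit, while U+1F600 yields two.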

Target length - When unparsing, the length (in dfdl:lengthUnits) of an item's representation is the target length. The length of the content corresponding to a logical data value in the infoset may be shorter or longer than the target length, in which case padding or truncation may be necessary to make the logical data content conform to the target length. Rules for when padding and truncation occur, and how they are applied are specific to simple data types, and are controlled by a number of DFDL format properties.

Text - Data consisting of characters in some character set encoding. Text data is normally thought of as human readable, but many character set encodings contain special control characters that are not human readable; data containing these characters is still called text. The dfdl:encoding property is required in order to decode/encode the text.

Text Representation - Of type xs:string, or of other types (except xs:hexBinary) with property dfdl:representation 'text'. Note that type xs:hexBinary never has text representation. This term specifically refers to the representation of the SimpleContent region being textual.

Textual - See Text.

Twos-Complement - A very common scheme for representing binary integers within data. A positive integer consisting of N bits is represented as its base-2 absolute value. A negative integer is represented by inverting all the bits of its absolute value (the complement) and then adding 1.
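As a worked illustration of this definition, the Python sketch below produces the N-bit pattern for an integer by inverting the bits of the absolute value and adding 1 for negative values. The function name twos_complement_bits is hypothetical.

```python
def twos_complement_bits(value: int, n_bits: int) -> str:
    """Return the n_bits-wide twos-complement bit pattern of value."""
    if not -(1 << (n_bits - 1)) <= value < (1 << (n_bits - 1)):
        raise ValueError("value does not fit in the requested number of bits")
    mask = (1 << n_bits) - 1
    if value >= 0:
        pattern = value
    else:
        # Invert every bit of the absolute value, then add 1.
        pattern = ((abs(value) ^ mask) + 1) & mask
    return format(pattern, "0{}b".format(n_bits))


# 5 is 00000101; inverted it is 11111010; plus 1 gives 11111011.
assert twos_complement_bits(-5, 8) == "11111011"
# The result agrees with Python's built-in signed conversion.
assert (-5).to_bytes(1, "big", signed=True) == bytes([0b11111011])
```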

Unicode - A character set defined by the Unicode Consortium, and standardized by the International Organization for Standardization (ISO) as ISO/IEC 10646.

Unit - See Addressable Unit.

Unpadded length - This is the length of the content of an item of the infoset, prior to any filling or padding which might be introduced due to dfdl:lengthKind "prefixed" or dfdl:lengthKind "explicit". It is equal to or smaller than the target length.

Validity - A DFDL Infoset is said to be valid with respect to a DFDL schema if each Infoset item is valid with respect to its corresponding DFDL schema component. Validity is about the Infoset and the values it holds. It is independent of the data representation when parsing or unparsing. See Section 2.4 Validation Errors, for a list of the specific value checks that are performed when validating a DFDL Infoset against a DFDL schema.

Variable-Width Character Encoding - A character set encoding where characters are encoded using one or more code units for their representation depending on which specific character is being encoded. Examples with their ranges of varying width:

·       1 to 4 bytes: UTF-8

·       1 or 2 16-bit code units: UTF-16 when property dfdl:utf16Width is 'variable'

·       1 or 2 bytes: Shift-JIS  
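The width ranges listed above can be observed with Python's standard codecs. This is only an illustrative sketch: the helper name code_units is hypothetical, and Python's "utf-16-be" and "shift_jis" codecs are assumed to stand in for the corresponding DFDL encodings.

```python
def code_units(ch: str, encoding: str, unit_size: int = 1) -> int:
    """Number of code units of unit_size bytes needed to encode one character."""
    return len(ch.encode(encoding)) // unit_size


# UTF-8: 1 to 4 bytes per character.
utf8_widths = [code_units(c, "utf-8") for c in "A\u00e9\u20ac\U0001F600"]

# UTF-16: 1 or 2 16-bit code units per character.
utf16_widths = [code_units(c, "utf-16-be", 2) for c in "A\U0001F600"]

# Shift-JIS: 1 or 2 bytes per character.
sjis_widths = [code_units(c, "shift_jis") for c in "A\u3042"]

assert utf8_widths == [1, 2, 3, 4]
assert utf16_widths == [1, 2]
assert sjis_widths == [1, 2]
```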

Well-formed - Data is said to be well-formed with respect to a DFDL schema if a DFDL processor can parse the data into a DFDL Infoset, or a DFDL processor can unparse to that data from a DFDL Infoset. The validity of values in the infoset is not necessary for data to be well-formed.

Width - See Character Width.

4.     The DFDL Information Set (Infoset)

This section defines an abstract data set called the DFDL Information Set (Infoset). Its purpose is to define the abstract data structure that must be provided:

The DFDL Infoset contains enough information so that a DFDL schema can be defined that will unparse the infoset and reparse the resultant datastream to produce the same infoset.

There is no requirement for DFDL-described data to be valid in order to have a DFDL information set.

DFDL information sets may be created by methods (not described in this specification) other than parsing DFDL-described data.

A DFDL information set consists of a number of information items; or just items for short. The information set for any well-formed DFDL-described data will contain at least a document information item and one element information item. An information item is an abstract description of a part of some DFDL-described data: each information item has a set of associated named members. In this specification, the member names are shown in square brackets, [thus]. The types of information item are listed in Section 4.1 Information Items.

The DFDL Information Set does not require or favor a specific interface or class of interfaces. This specification presents the information set as a modified tree for the sake of clarity and simplicity, but there is no requirement that the DFDL Information Set be made available through a tree structure; other types of interfaces, including (but not limited to) event-based and query-based interfaces, are also capable of providing information conforming to the DFDL Information Set.

The terms "information set" and "information item" are similar in meaning to the generic terms "tree" and "node", as they are used in computing. However, the former terms are used in this specification to reduce possible confusion with other specific data models.

The DFDL Information Set is similar in purpose to the XML Information Set [XMLInfoset]. However, it is neither identical to it nor a strict subset of it, as there are important differences.

4.1       Information Items

An information set contains two different types of information items, as explained in the following sections. Every information item has members. For ease of reference, each member is given a name, indicated [thus].

4.1.1       Document Information Item

There is exactly one document information item in the information set, and all other information items are accessible through the [root] member of the document information item.

There is no specific DFDL schema component that corresponds to this item. It is a concrete artifact describing the information set.

The document information item has the following members:

[root] The element information item corresponding to the root element declaration of the DFDL Schema.

[dfdlVersion] String. The version of the DFDL specification to which this information set conforms. For DFDL V1.0 this is 'dfdl-1.0'.

[schema] String. This member is reserved for future use.

[unicodeByteOrderMark] Enum. When the encoding of the root element of the document is exactly UTF-8, UTF-16, or UTF-32 (or CCSID equivalent), the member value indicates whether the document starts with a byte-order mark (BOM), and what the value of the mark was. If there is a BOM at the start of the data stream, then for UTF-8 encoding the value is 'UTF-8'; for UTF-16 encoding the value is 'UTF-16LE' or 'UTF-16BE'; for UTF-32 the value is 'UTF-32LE' or 'UTF-32BE'. If there is no BOM then the member value is empty. When the encoding of the root element of the document is any other encoding, the member value is empty. When unparsing, if this member is not empty and the encoding is UTF-8, UTF-16, or UTF-32, then this member's value is used to determine the specific byte-order mark written, and for UTF-16 and UTF-32, the byte order used when characters are encoded to the output data stream.
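The parse-side rule for [unicodeByteOrderMark] can be sketched in Python as follows. The function name is hypothetical, the empty member value is modelled as an empty string, and CCSID equivalents are not handled; this is an illustration rather than a definitive implementation.

```python
import codecs

# Candidate BOMs, keyed by the exact encoding names for which the member is populated.
_BOMS = {
    "UTF-8": [(codecs.BOM_UTF8, "UTF-8")],
    "UTF-16": [(codecs.BOM_UTF16_LE, "UTF-16LE"), (codecs.BOM_UTF16_BE, "UTF-16BE")],
    "UTF-32": [(codecs.BOM_UTF32_LE, "UTF-32LE"), (codecs.BOM_UTF32_BE, "UTF-32BE")],
}


def unicode_byte_order_mark(data: bytes, root_encoding: str) -> str:
    """Return the member value for a data stream, or "" when it is empty."""
    for bom, value in _BOMS.get(root_encoding.upper(), []):
        if data.startswith(bom):
            return value
    # No BOM found, or the encoding is not exactly UTF-8, UTF-16, or UTF-32.
    return ""


assert unicode_byte_order_mark(b"\xff\xfea\x00", "UTF-16") == "UTF-16LE"
assert unicode_byte_order_mark(b"abc", "UTF-8") == ""
```

Because only the BOMs of the declared encoding are tested, the shared prefix of the UTF-16LE and UTF-32LE marks does not cause a false match.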

4.1.2       Element Information Items

There is an element information item for each value parsed from the non-hidden DFDL-described data. This corresponds to an occurrence of a non-hidden element declaration of simple type in the DFDL Schema and is known as a simple element information item.

There is an element information item for each explicitly declared structure in the DFDL-described data. This corresponds to an occurrence of an element declaration of complex type in the DFDL Schema and is known as a complex element information item.

In this information set, as in an XML document, an array is just a set of adjacent elements with the same name and namespace. (To represent the array explicitly, introduce a new complex type element to contain the array elements only.)

One of the element information items is the [root] member of the document information item, corresponding to the root element declaration of a DFDL Schema, and all other element information items are accessible by recursively following its [children] member.

An element information item has the following members:

[namespace] String. The namespace, if any, of the element. If the element does not belong to a namespace, the value is the empty string.

[name] String. The local part of the element name.

[document] The document information item representing the DFDL information set that contains this element. This member is empty except in the root element of an information set.

[datatype] String. The name of the XML Schema 1.0 built-in simple type to which the value corresponds. DFDL supports a subset of these types listed in section 5.1 DFDL Subset of XML Schema. In a complex element information item this member has no value.

[dataValue] The value in the value space (as defined by XML Schema Part 2: Datatypes [XSDLV1]) of the [datatype] member. In a complex element information item this member has no value. If the [nilled] member is true, then this member has no value.

For information items of datatype xs:string, the value is an ordered collection of unsigned 16-bit integer codepoints each having any value from 0x0000 to 0xFFFF. Where defined, these are interpreted as the ISO 10646 character codes. Codepoints disallowed by ISO 10646, such as 0xD800 to 0xDFFF, are explicitly allowed by the DFDL infoset. The codepoints of the string are stored in 'implicit' (also known as logical), left-to-right bidirectional ordering and orientation. DFDL's infoset represents Unicode characters with character codes beyond 0xFFFF by way of surrogate pairs (2 adjacent codepoints) in a manner consistent with the UTF-16 encoding of ISO 10646. The value can have length 0, in which case the value may be referred to as an 'empty string'.

For information items of datatype xs:hexBinary, the value is an ordered collection of unsigned 8-bit bytes each having value from 0 to 255. The length of this collection can be 0 in which case the value may be referred to as an 'empty hexBinary'.

[nilled] Boolean. True if the nillable item is nil. False if the nillable item is not nil. If the element is not nillable this member has no value. If this member is true then for a simple element the [dataValue] member has no value, and for a complex element the [children] member has no value. If this member is true then the Infoset item is said to be nil or nilled.

[children] An ordered set of zero or more element information items. The order they appear in the set is the order implied by the DFDL Schema. 'Ordered set' is not formally defined here, but two operations are assumed: 'count' gives the number of information items, and 'at (index)' gives the element at ordinal position 'index' starting from 1. In a simple element information item this member has no value. In a document information item this member contains exactly one element information item. If the [nilled] member is true then this member has no value.

[parent] The complex element information item which contains this information item in its [children] member. In the root element of an information set this member is empty.

[schema] String. A reference to a schema component associated with this information item, if any. If not empty, the value must be an absolute or relative Schema Component Designator [SCD].

[valid] Boolean[3]. True if the element is valid as determined by a DFDL implementation that performs validation checking. A complex element information item is not valid if any of its [children] are not valid. Empty if validation is not enabled.

[unionMemberSchema][4] String. For simple element information items, this member contains an SCD reference to the member of the union that matched the value of the element. Empty if validation is not enabled. Empty if the element's type is not a union.

On unparsing, any non-empty values for the [valid] or [unionMemberSchema] members are ignored. However, in the augmented infoset, which is built during the unparse operation, [valid] will have a value, and [unionMemberSchema] may have a value.
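The [children] and [parent] members, together with the two assumed 'ordered set' operations 'count' and 'at (index)', can be sketched as a small Python class. The class and attribute names are hypothetical. As a simplification, None stands in for "no value" and an empty list is used for the children of a simple element; only a few of the members listed above are modelled.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass(eq=False)
class ElementItem:
    name: str                         # [name]
    namespace: str = ""               # [namespace]; empty string if none
    data_value: Any = None            # [dataValue]
    children: List["ElementItem"] = field(default_factory=list)           # [children]
    parent: Optional["ElementItem"] = field(default=None, repr=False)     # [parent]

    def add(self, child: "ElementItem") -> "ElementItem":
        """Append a child in schema order and record its parent."""
        child.parent = self
        self.children.append(child)
        return child

    def count(self) -> int:
        """Number of child information items."""
        return len(self.children)

    def at(self, index: int) -> "ElementItem":
        """Child at ordinal position index, counting from 1."""
        if not 1 <= index <= len(self.children):
            raise IndexError("ordinal position out of range")
        return self.children[index - 1]


root = ElementItem("record")
root.add(ElementItem("id", data_value=42))
root.add(ElementItem("name", data_value="widget"))
assert root.count() == 2 and root.at(1).name == "id"
```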

4.2       "No Value"

Some members may sometimes have the value no value, and it is said that such a member has no value. This value is distinct from all other values. In particular it is distinct from the empty string, the empty set, and the empty list, each of which simply has no members.
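One common way to model a value that is distinct from every other value, including the empty string and empty collections, is a dedicated sentinel object. The Python sketch below is only an illustration; the names _NoValue and NO_VALUE are hypothetical and nothing in this specification requires this technique.

```python
class _NoValue:
    """Sentinel type whose single instance represents 'no value'."""

    def __repr__(self) -> str:
        return "NO_VALUE"


NO_VALUE = _NoValue()

# The sentinel compares unequal to the empty string, the empty set, and the empty list.
assert NO_VALUE != ""
assert NO_VALUE != set()
assert NO_VALUE != []


def has_value(member: object) -> bool:
    """True unless the member holds the 'no value' sentinel."""
    return member is not NO_VALUE


assert has_value("") and not has_value(NO_VALUE)
```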

4.3       DFDL Information Item Order

On parsing and unparsing, information items will be presented in the order they are defined in the DFDL Schema.

4.4       DFDL Infoset Object model

By way of illustration, the DFDL information set is presented below as an object model using a Unified Modeling Language (UML) class diagram, augmented using the Object Constraint Language (OCL) [UML].

The structure of the information set follows the Composite design pattern. In case of inconsistency or ambiguity, the preceding discussion takes precedence.

DFDL is able to describe the format of the physical representation for data whose structure conforms to this model. Note that this model allows hierarchically nested data, but does not allow representation of arbitrary connected graphs of data objects.

Figure 1 DFDL Infoset Object Model

 

4.5       DFDL Augmented Infoset

When unparsing, one begins with the DFDL schema and conceptually with the logical infoset. As the values of items are filled in by defaulting and by use of the dfdl:outputValueCalc property (see section 17 Calculated Value Properties), including on hidden items, these new item values augment the infoset. The resulting infoset is called the augmented infoset.

An element declaration in the schema describes a potentially represented item if that element declaration does not have a dfdl:inputValueCalc property (see section 17 Calculated Value Properties). Whether the element declaration describes an item that is actually represented or not depends on whether the element declaration is for an optional element and whether the element has a corresponding value in the augmented infoset.   

In expressions, the functions dfdl:contentLength() and dfdl:valueLength() can be called to determine the length of an item. If an element declaration is not potentially represented, then these functions are defined to return 0.

When unparsing, an element declaration and the infoset are considered as follows. An implementation may use any technique consistent with this algorithm:

a)       If the element declaration has a dfdl:outputValueCalc property then the expression which is the dfdl:outputValueCalc property value is evaluated and the resulting value becomes the value of the element item in the augmented infoset. Any pre-existing value for the infoset item is superseded by this new value.

References to other augmented infoset items from within the dfdl:outputValueCalc expression must obtain their values from the augmented infoset directly (when the value is already present) or by recursively using these methods (a) and (b) as needed.

b)       If the element declaration has no corresponding value in the augmented infoset, and the element declaration is for a required occurrence, and it has a default value specified, then an element item having the default value is created in the augmented infoset.

c)       If any Infoset item's value is requested recursively as a part of (a) above and (a) does not apply, and the corresponding value is not present, and (b) does not apply then it is a processing error.

Given this augmented infoset, then if the potentially represented element declaration has a corresponding infoset item then that item is converted to its representation according to its DFDL properties. If the element declaration is for a required occurrence, and there is no value in the augmented infoset then it is a processing error.

Because rule (a) above is used even if the augmented infoset item already exists and has a value, it is possible for a dfdl:outputValueCalc expression to be evaluated multiple times. DFDL implementations are free to cache values and avoid this repeated evaluation for efficiency, as the semantics of DFDL require that the dfdl:outputValueCalc expression return the same value every time it is evaluated.
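Rules (a), (b), and (c) above, together with the permitted caching, can be sketched in Python as follows. This is a deliberately simplified illustration: element declarations are a flat list identified by name, a dfdl:outputValueCalc expression is modelled as a Python callable that receives a lookup function, None means "no default specified", and circular references between expressions are not detected. The names ElementDecl, ProcessingError, and augment are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


class ProcessingError(Exception):
    pass


@dataclass
class ElementDecl:
    name: str
    required: bool = True
    default: Any = None
    output_value_calc: Optional[Callable[[Callable[[str], Any]], Any]] = None


def augment(decls: List[ElementDecl], infoset: Dict[str, Any]) -> Dict[str, Any]:
    by_name = {d.name: d for d in decls}
    augmented = dict(infoset)
    calculated: Dict[str, Any] = {}   # cache: each expression is evaluated once

    def value_of(name: str) -> Any:
        decl = by_name[name]
        if decl.output_value_calc is not None:
            # Rule (a): the calculated value supersedes any pre-existing value.
            if name not in calculated:
                calculated[name] = decl.output_value_calc(value_of)
            augmented[name] = calculated[name]
            return calculated[name]
        if name in augmented:
            return augmented[name]
        if decl.required and decl.default is not None:
            # Rule (b): a required occurrence with a default is created.
            augmented[name] = decl.default
            return decl.default
        # Rule (c): nothing can supply the value.
        raise ProcessingError("no value available for element '%s'" % name)

    for decl in decls:
        if decl.output_value_calc is not None or decl.required:
            value_of(decl.name)
    return augmented


decls = [
    ElementDecl("length", output_value_calc=lambda get: len(get("payload"))),
    ElementDecl("payload"),
    ElementDecl("version", default=1),
    ElementDecl("comment", required=False),
]
result = augment(decls, {"payload": "hello", "length": 99})
assert result == {"payload": "hello", "length": 5, "version": 1}
```

In the usage example the supplied value 99 for the hypothetical length element is replaced by the calculated value, the missing version element is defaulted, and the absent optional comment element is simply omitted.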

5.     DFDL Schema Component Model

When using DFDL, the format of data is described by means of a DFDL Schema.

The DFDL Schema Component Model is shown in conceptual UML in Figure 2. First we show the model for elements, groups and the top of the type hierarchy.

The shaded boxes have direct corresponding element syntax and therefore appear in a DFDL schema.

Figure 2 DFDL Schema UML diagram

The simple types are shown in Figure 3. The graph shows all the types defined by XML Schema version 1.0; the subset of these types supported by DFDL is shown shaded.

Figure 3 DFDL simple types

 

These types are defined as they are in XML Schema, with exceptions for:

·         String – In DFDL a string can contain any character codes. None are reserved, including the character with character code U+0000, which is not permitted in XML documents.

Each object defined by a class in the above UML is called a DFDL Schema component.

We express the DFDL Schema Model using a subset of the XML Schema Description Language (XSDL). XSDL provides a standardized schema language suitable for expressing the DFDL Schema Model.

A DFDL Schema is an XML schema containing only a restricted subset of the constructs available in full W3C XML Schema Description Language. Within this XML schema, special DFDL annotations are distributed that carry the information about the data's format or representation.

A DFDL Schema is a valid XML schema. However, the converse is not true since the DFDL Schema Model does not include many concepts that appear in XML schema.

5.1       DFDL Subset of XML Schema

The DFDL subset of XSDL is a general model for hierarchically-nested data. It avoids the XSDL features used to describe the peculiarities of XML as a syntactic textual representation of data, and features that are simply not needed by DFDL.

The following lists detail the similarities and differences between general XSDL and this subset.

DFDL Schemas consist of:

Note: xs:nonNegativeInteger is treated as an unsigned xs:integer.

The following constructs from XML Schema are not used as part of the DFDL Schema Model of DFDL v1.0 schemas; however, they are all reserved[6] for future use since the data model may be extended to use them in future versions of DFDL:

5.2       XSD Facets, min/maxOccurs, default, and fixed

XSD element declarations and references can carry several properties that express constraints on the described data. These constraints are mainly used for validation. These properties include:

The facets and the types they are applicable to are:

The facets (but not maxOccurs nor minOccurs) are also checked by the dfdl:checkConstraints DFDL expression language function.

The following sections describe these in more detail.

5.2.1       MinOccurs, MaxOccurs

The XSDL minOccurs property is used:

The XSDL maxOccurs property is used:

For some values of dfdl:occursCountKind such as 'implicit', it is a processing error when an array is found to have a number of occurrences not conforming to XSDL minOccurs in the absence of a default value specification. For other values of dfdl:occursCountKind such as 'parsed', it is only a validation error if an array is found to have fewer than XSDL minOccurs occurrences. See Section 16, Properties for Array Elements and Optional Elements, for more details.

5.2.2       MinLength, MaxLength

These facets are used:

5.2.3       MaxInclusive, MaxExclusive, MinExclusive, MinInclusive, TotalDigits, FractionDigits

·         Used for validation only

The format of numbers is not derived from these facets. Rather, DFDL properties are used to specify the format.

5.2.4       Pattern

·         Allowed only on elements of type xs:string or types derived from it in Figure 3 DFDL simple types.

·         Used for validation only

It is important to avoid confusion of the pattern facet with other uses of regular expressions that are needed in DFDL (for example, to determine the length of an element by regular expression matching).

Note: in XSD, pattern is about the lexical representation of the data, and since everything there is text, everything has a lexical representation. In DFDL only strings are guaranteed to have a lexical and logical value that is identical.

5.2.5       Enumeration

Enumerations are used to provide a list of valid values in XSD.

Note: in DFDL we do not use XSD enumeration as a means to define symbolic constants. These are captured using dfdl:defineVariable constructs so they can be referenced from expressions.

5.2.6       Default

The 'default' property is used:

Note that the 'fixed' and 'default' properties are mutually exclusive on an element declaration.

5.2.7       Fixed

The 'fixed' property is used in the same ways as the 'default' property but in addition:

Note that the 'fixed' and 'default' properties are mutually exclusive on an element declaration.  

6.     DFDL Syntax Basics

Using DFDL, a data format is described by placing special annotations at various positions within an XML schema. This XML schema conveys the basic structure of the data format, while the annotations fill in the detail. Annotations are used to describe aspects such as the file encoding and byte ordering, as well as declaring variables for reference elsewhere, and specifying properties that govern the capabilities of the DFDL processor. A DFDL processor requires these annotations, along with the structural information of the enclosing XML schema, to make sense of the physical data model.

6.1       Namespaces

The xs:appinfo source URI http://www.ogf.org/dfdl/ is used to distinguish DFDL annotations from other annotations.

The element and attribute names in the DFDL syntax are in a namespace defined by the URI http://www.ogf.org/dfdl/dfdl-1.0/. All symbols in this namespace are reserved. DFDL implementations may not provide extensions to the DFDL standard using names in this namespace. Within this specification, the namespace prefix for DFDL is "dfdl" referring to the namespace http://www.ogf.org/dfdl/dfdl-1.0/.

Attributes on DFDL annotations that are not in the DFDL namespace or in no namespace are ignored.
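As an illustration of how the xs:appinfo source URI and the DFDL namespace identify DFDL annotations, the following Python sketch uses the standard xml.etree.ElementTree module to pick them out of a small schema. The sample schema, the second appinfo source http://example.com/other, and the function name dfdl_annotations are hypothetical.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"
DFDL_NS = "http://www.ogf.org/dfdl/dfdl-1.0/"
DFDL_SOURCE = "http://www.ogf.org/dfdl/"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format encoding="UTF-8" separator=","/>
    </xs:appinfo>
    <xs:appinfo source="http://example.com/other">
      <note>not a DFDL annotation</note>
    </xs:appinfo>
  </xs:annotation>
</xs:schema>"""


def dfdl_annotations(schema_text: str) -> list:
    """Return the DFDL-namespace elements found under DFDL appinfo elements."""
    root = ET.fromstring(schema_text)
    found = []
    for appinfo in root.iter("{%s}appinfo" % XS_NS):
        if appinfo.get("source") != DFDL_SOURCE:
            continue  # some other tool's annotation
        for child in appinfo:
            if child.tag.startswith("{%s}" % DFDL_NS):
                found.append(child)
    return found


annotations = dfdl_annotations(SCHEMA)
assert [a.tag for a in annotations] == ["{%s}format" % DFDL_NS]
```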

A DFDL Schema document contains XML schema annotation elements that define and assign names to parts of the format specification. These names are defined using the target namespace of the schema document where they reside, and are referenced using QNames in the usual manner. A DFDL schema document can include or import another schema document, and namespaces work in the usual manner for XML schema documents. The schema is the schema including all additional schemas referenced through import and include. Generally, in this specification, when we refer to the DFDL Schema we mean the schema. When we refer to a specific document we will use the term DFDL Schema document.

6.2       The DFDL Annotation Elements

DFDL annotations must be positioned specifically where DFDL annotations are allowed within an XML schema document. These positions are known as annotation points. When an annotation is positioned at an annotation point, it binds some additional information to the schema component that encloses it. The description of a data format is achieved by correctly placing annotations on the structural components of the schema.

DFDL specifies a collection of annotations for different purposes. They are organized into three different annotation types: Format Annotations, Statement Annotations, and Defining Annotations.

At any single annotation point of the schema there can be only one format annotation, but there can be several statement annotations. There are also rules about which statement annotations may co-exist at the same annotation point; these are described in the sections about those specific annotation types.

| Annotation Type | Annotation Element | Description |
|---|---|---|
| Format Annotation | choice | Defines the physical data format properties of an xs:choice group. See section 7.1. |
| | element | Defines the physical data format properties of an xs:element and xs:element reference. See section 7.1. |
| | format | Defines the physical data format properties for multiple DFDL schema constructs. Used on an xs:schema and as a child of a dfdl:defineFormat annotation. This includes aspects such as the encodings, separators, and many more. See section 7.1. |
| | group | Defines the physical data format properties of an xs:group reference. See section 7.1. |
| | property | Used in the syntax of format annotations. See section 7.1.2.2. |
| | sequence | Defines the physical data format properties of an xs:sequence group. See section 7.1. |
| | simpleType | Defines the physical data format properties of an xs:simpleType. See section 7.1. |
| | escapeScheme | Defines the scheme by which quotation marks and escape characters can be specified. This is for use with delimited text formats. See section 7.6. |
| Statement Annotation | assert | Defines a test to be used to ensure the data are well formed. Assert is used only when parsing data. See section 7.3. |
| | discriminator | Defines a test to be used when resolving a point of uncertainty such as choice branches or optional elements. A dfdl:discriminator is used only when parsing data to resolve the point of uncertainty to one of the alternatives. See section 7.4. |
| | newVariableInstance | Creates a new instance of a variable. See section 7.8. |
| | setVariable | Sets the value of a variable whose declaration is in scope. See section 7.9. |
| Defining Annotation | defineEscapeScheme | Defines a named, reusable escapeScheme. See section 7.5. |
| | defineFormat | Defines a reusable data format by collecting together other annotations and associating them with a name that can be referenced from elsewhere. See section 7.2. |
| | defineVariable | Defines a variable that can be referenced elsewhere. This can be used to communicate a parameter from one part of processing to another part. See section 7.7. |

Table 1 - DFDL Annotation Elements

6.3       DFDL Properties

Properties on DFDL annotations may be one or more of the following types

Some properties accept a list or union of types

6.3.1       DFDL String Literals

DFDL String Literals represent a sequence of literal bytes or characters which appear in the data stream. This presents the following challenges

A DFDL string literal is therefore able to describe any arbitrary sequence of bytes and characters.

Empty String: The special DFDL entity %ES; is provided for describing an empty string or an empty byte sequence. The %ES; entity is the only way to do this. A DFDL string literal with value "" (the empty string) is usually invalid. There are a few properties that explicitly allow an empty DFDL String Literal, and these properties assign a property-specific meaning to the empty string value.

Whitespace: When whitespace must be used as part of a property value, the DFDL string literal must use entities (such as %WSP;) to represent the whitespace. (This allows a property to represent lists of DFDL string literals by using literal spaces to separate list elements.)

6.3.1.1      Character strings in DFDL String Literals

A literal string in a DFDL Schema is written in the character set encoding specified by the XML directive that begins all XML documents:

<?xml version="1.0" encoding="UTF-8" ?>

In this example, the DFDL schema is written in UTF-8, so any literal strings contained in it, and particularly string literals found in its representation property bindings in the format annotations, are expressed in UTF-8.

However, these strings are being used to describe features of text data that are commonly in other character set encodings. For example, we may have EBCDIC data that is comma separated. A comma in EBCDIC has a single-byte code unit of 0x6B in the data, the numeric value of which does not correspond to the Unicode character code for comma which is U+002C. However, when we indicate that an item is "," (comma) separated and we specify this using a string literal along with specifying the 'encoding' property to be 'ebcdic-cp-us' then this means that the data are separated by EBCDIC commas regardless of what character set encoding is used to write the DFDL Schema.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema ... >
    ...
    <dfdl:format encoding="ebcdic-cp-us" separator=","/>
    ...
</xs:schema>

When a DFDL processor uses the separator expressed in this manner, the string literal "," is translated into the character set encoding of the data it is separating as specified by the encoding representation property. Hence, in this case we would be searching the data for a character with codepoint 0x6B (the EBCDIC comma), not a UTF-8 or Unicode (0x2C) comma which is what exists in the DFDL schema document file.
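The translation of the separator literal into the encoding of the data can be illustrated with Python's standard codecs. This sketch assumes that Python's "cp037" codec corresponds to the 'ebcdic-cp-us' encoding named above; the function name delimiter_bytes is hypothetical.

```python
def delimiter_bytes(literal: str, data_encoding: str) -> bytes:
    """Translate a delimiter literal into the bytes to search for in the data."""
    return literal.encode(data_encoding)


# The same literal "," yields different bytes depending on the data encoding.
assert delimiter_bytes(",", "cp037") == b"\x6b"   # EBCDIC comma
assert delimiter_bytes(",", "utf-8") == b"\x2c"   # comma as written in the schema file

# Splitting EBCDIC data on the translated separator.
record = "a,b,c".encode("cp037")
fields = [f.decode("cp037") for f in record.split(delimiter_bytes(",", "cp037"))]
assert fields == ["a", "b", "c"]
```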

Character strings can include bidirectional data.

6.3.1.2      DFDL Character Entities, Character Class Entities, and Byte Values in String Literals

DFDL character entities specify a single Unicode character and provide a convenient way to specify code points that appear in the data stream but would be difficult to specify in XML strings. Examples are common non-printable characters, and code points such as 0x00 that are not valid in XML documents. DFDL entities are based on XML entities, which can also be used in a DFDL schema.

The following grammar gives the syntax of DFDL String Literals generally, including the various kinds of entities.

DfdlStringLiteral      ::=  (DfdlStringLiteralPart)+ | DfdlESEntity
DfdlStringLiteralPart  ::=  LiteralString | DfdlCharEntity | DfdlCharClass | ByteValue
LiteralString          ::=  A string of literal characters
DfdlCharEntity         ::=  DfdlEntity | DecimalCodePoint | HexadecimalCodePoint
DfdlCharClass          ::=  '%' DfdlCharClassName ';'
ByteValue              ::=  '%#r' [0-9a-fA-F]{2} ';'
DfdlEntity             ::=  '%' DfdlEntityName ';'
DecimalCodePoint       ::=  '%#' [0-9]+ ';'
HexadecimalCodePoint   ::=  '%#x' [0-9a-fA-F]+ ';'
DfdlEntityName         ::=  'NUL'|'SOH'|'STX'|'ETX'|
                            'EOT'|'ENQ'|'ACK'|'BEL'|
                            'BS'|'HT'|'LF'|'VT'|'FF'|
                            'CR'|'SO'|'SI'|'DLE'|
                            'DC1'|'DC2'|'DC3'|'DC4'|
                            'NAK'|'SYN'|'ETB'|'CAN'|
                            'EM'|'SUB'|'ESC'|'FS'|
                            'GS'|'RS'|'US'|'SP'|
                            'DEL'|'NBSP'|'NEL'|'LS'
DfdlCharClassName      ::=  DfdlNLEntity | DfdlWSPEntity | DfdlWSPStarEntity | DfdlWSPPlusEntity
DfdlNLEntity           ::=  'NL'
DfdlWSPEntity          ::=  'WSP'
DfdlWSPStarEntity      ::=  'WSP*'
DfdlWSPPlusEntity      ::=  'WSP+'
DfdlESEntity           ::=  'ES'

Table 2 DFDL Character Entity, Character Class Entity, and Byte Value Entity syntax

Using %% inserts a single literal "%" into the string literal. This "%" is subject to character set encoding translation as is any other character.

A HexadecimalCodePoint provides a hexadecimal representation of the character's code point in ISO/IEC 10646.

A DecimalCodePoint provides a decimal representation of the character's code point in ISO/IEC 10646.

A DfdlEntityName is one of the mnemonics given in the following tables.

Mnemonic

Meaning

Unicode Character Code

NUL

null

U+0000

SOH

start of heading

U+0001

STX

start of text

U+0002

ETX

end of text

U+0003

EOT

end of transmission

U+0004

ENQ

enquiry

U+0005

ACK

acknowledge

U+0006

BEL

bell

U+0007

BS

backspace

U+0008

HT

horizontal tab

U+0009

LF

line feed

U+000A

VT

vertical tab

U+000B

FF

form feed

U+000C

CR

carriage return

U+000D

SO

shift out

U+000E

SI

shift in

U+000F

DLE

data link escape

U+0010

DC1

device control 1

U+0011

DC2

device control 2

U+0012

DC3

device control 3

U+0013

DC4

device control 4

U+0014

NAK

negative acknowledge

U+0015

SYN

synchronous idle

U+0016

ETB

end of transmission block

U+0017

CAN

cancel

U+0018

EM

end of medium

U+0019

SUB

substitute

U+001A

ESC

escape

U+001B

FS

file separator

U+001C

GS

group separator

U+001D

RS

record separator

U+001E

US

unit separator

U+001F

SP

space

U+0020

DEL

delete

U+007F

NBSP

no break space

U+00A0

NEL

next line

U+0085

LS

line separator

U+2028

Table 3 DFDL Entities
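As an illustration of the entity forms described above, the following sketch writes a line feed terminator as a mnemonic, as a hexadecimal code point, and as a decimal code point, and shows the %% escape. The element names are hypothetical, and a dfdl:format annotation supplying the remaining properties is assumed to be in scope.

```xml
<xs:sequence>
  <!-- Mnemonic entity from Table 3 -->
  <xs:element name="line1" type="xs:string" dfdl:terminator="%LF;"/>
  <!-- Hexadecimal code point: U+000A -->
  <xs:element name="line2" type="xs:string" dfdl:terminator="%#x0A;"/>
  <!-- Decimal code point: 10 -->
  <xs:element name="line3" type="xs:string" dfdl:terminator="%#10;"/>
  <!-- %% stands for a single literal percent sign -->
  <xs:element name="rate" type="xs:string" dfdl:terminator="%%"/>
</xs:sequence>
```

The first three terminators are intended to denote the same character; which spelling to use is a matter of readability.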

6.3.1.3      DFDL Character Class Entities in DFDL String Literals

The following DFDL character classes are provided to specify one or more characters from a set of related characters.

Mnemonic

Meaning

Unicode Character Code(s)

NL

Newline

On parse any one of the single characters CR, LF, NEL or LS or the character combination CRLF.

On unparse the value of the dfdl:outputNewLine property is output, which must specify one of the single characters %CR;, %LF;, %NEL;, or %LS; or the character combination %CR;%LF;.

U+000A LF

U+000D CR

U+000D U+000A CRLF

U+0085 NEL

U+2028  LS

WSP

Single whitespace

On parse any whitespace character

On unparse a space (U+0020) is output

U+0009-U+000D (Control characters)

U+0020 SPACE

U+0085 NEL

U+00A0 NBSP

U+1680 OGHAM SPACE MARK

U+180E MONGOLIAN VOWEL SEPARATOR

U+2000-U+200A (different sorts of spaces)

U+2028 LSP

U+2029 PSP

U+202F NARROW NBSP

U+205F MEDIUM MATHEMATICAL SPACE

U+3000 IDEOGRAPHIC SPACE

WSP*

Optional Whitespaces

On parse whitespace characters are ignored.

On unparse nothing is output

Same as WSP

WSP+

Whitespaces

On parse one or more whitespace characters are ignored. It is a processing error if no whitespace character is found.

On unparse a space (U+0020) is output.

Same as WSP

ES

Empty String

Used in whitespace separated lists when empty string is one of the values.

 

Table 4 DFDL Character Class Entities
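A minimal sketch of how the character class entities might be combined in delimiters follows. The element names are hypothetical, and the other properties these elements need are assumed to come from a dfdl:format in scope.

```xml
<!-- On parse, the separator is a comma with optional whitespace on either side,
     and the terminator is any one of the NL alternatives.
     On unparse, WSP* outputs nothing, so a bare comma is written,
     and the terminator is written as the dfdl:outputNewLine value (LF here). -->
<xs:sequence dfdl:separator="%WSP*;,%WSP*;"
             dfdl:terminator="%NL;"
             dfdl:outputNewLine="%LF;">
  <xs:element name="first" type="xs:string"/>
  <xs:element name="second" type="xs:string"/>
</xs:sequence>
```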

6.3.1.4      DFDL Byte Value Entities in DFDL String Literals

DFDL byte value entities provide a way to specify a single byte as it appears in the data stream without any character set encoding translation. To specify a string of byte values, a sequence of two or more byte value entities must be used. The syntax is given in Table 2 above. Example:

%#rFF;

6.3.2       DFDL Expressions

Some DFDL properties allow DFDL expressions (see Section 23 Expression language) to be used so that the property can be set dynamically at processing-time.

The general syntax of expressions is "{" expression "}"

The rules for recognizing DFDL expressions are

DFDL expressions reference other items in the infoset or augmented infoset using absolute or relative paths. Relative paths are evaluated when the component containing the expression is referenced not when it is declared. For example a global element may have a DFDL property which is an expression that contains a relative path to another element. The relative path is evaluated when the global element is referenced from an element reference.

DFDL expressions that are used to provide the value of DFDL properties in the dfdl:format annotation on the top level xs:schema declaration MAY NOT contain relative paths.
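As a sketch of an expression-valued property, the fragment below takes the length of one element from the value of an earlier sibling using a relative path. The element names 'len' and 'payload' are hypothetical, and the remaining representation properties are assumed to be supplied by a dfdl:format in scope.

```xml
<xs:sequence>
  <xs:element name="len" type="xs:int"/>
  <!-- The relative path ../len is evaluated at processing time,
       after 'len' has been added to the infoset -->
  <xs:element name="payload" type="xs:string"
              dfdl:lengthKind="explicit"
              dfdl:length="{ ../len }"/>
</xs:sequence>
```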

6.3.3       DFDL Regular Expressions

The dfdl:lengthPattern property expects a regular expression to be specified. The DFDL Regular Expression language is defined in Section 24 DFDL Regular Expressions.
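For illustration only, a regular expression might be supplied as follows. The element name is hypothetical, and an encoding is assumed to be in scope.

```xml
<!-- The element's length is taken from the match of the pattern: exactly three digits -->
<xs:element name="areaCode" type="xs:string"
            dfdl:lengthKind="pattern"
            dfdl:lengthPattern="[0-9]{3}"/>
```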

6.3.4       Enumerations in DFDL

Some DFDL properties accept an enumerated list of valid values. It is a schema definition error if a value other than one of the enumerated values is specified. The case of the specified value must match the enumeration. An enumeration is of type string unless otherwise stated.

7.     Syntax of DFDL Annotation Elements

This section describes the syntax of each of the DFDL annotation elements along with discussion of their basic meanings.

The DFDL annotation elements are listed in Table 1 - DFDL Annotation Elements.

7.1       Component Format Annotations

A data format can be 'used' or put into effect for a part of the schema by use of the component format annotation elements.

There are specific annotations for each type of schema component that supports only the representation properties applicable to that component. The table below gives the specific annotation for each schema component.

Schema component

DFDL annotation

xs:choice

dfdl:choice

xs:element

dfdl:element

xs:element reference

dfdl:element

xs:group reference

dfdl:group

xs:schema

dfdl:format

xs:sequence

dfdl:sequence

xs:simpleType

dfdl:simpleType

Table 5 DFDL Component Format Annotations

In addition the dfdl:format annotation is used inside a dfdl:defineFormat annotation to define a named reusable set of representation properties that can be referenced from any component specific format annotation or from other named format definitions.

A dfdl:format annotation at the top level of a schema, that is, as an annotation child element of the xs:schema, provides a set of default properties for the lexically enclosed schema document. See Section 8.1 Providing Defaults for DFDL properties.

Example of DFDL component format annotation:

<xs:schema ...>

  ...

  <xs:element name="root">

    <xs:annotation>

      <xs:appinfo source="http://www.ogf.org/dfdl/">

        <dfdl:element ref="aBaseConfig"

                     representation="text"

                     encoding="UTF-8"/>

      </xs:appinfo>

    </xs:annotation>

  </xs:element>

  ...

</xs:schema>

 

7.1.1       The dfdl:ref Property

A named, reusable, dfdl:defineFormat definition is used by referring to its name from a format annotation using the 'ref' property. For example:

<dfdl:element ref="reusableDef" encoding="ebcdic-cp-us" />

The behavior of this dfdl:defineFormat definition is as if all representation properties defined by the named dfdl:defineFormat definition were instead written directly on this format annotation; however, these are superseded by any representation properties that are defined here such as the encoding property in the example above.

7.1.2       Property Binding Syntax

The format properties may be specified in one of three forms:

  1. Attribute form
  2. Element form
  3. Short form

A DFDL property may be specified using any form with the following exceptions

It is a schema definition error if the same property is specified in more than one form in the resolved set of annotations for an annotation point.
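The following sketch shows the kind of conflict this rule rejects. The element name is hypothetical.

```xml
<!-- Schema definition error: encoding is bound twice for the same
     annotation point, once in short form and once in attribute form -->
<xs:element name="title" type="xs:string" dfdl:encoding="UTF-8">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:element encoding="US-ASCII"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```

Binding different properties in different forms on the same component, for example dfdl:encoding in short form and initiator in attribute form, does not trigger this rule.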

7.1.2.1      Property Binding Syntax: Attribute Form

Within the format annotation elements are bindings for properties of the form:

 Property="Value"

For example:

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:format encoding="utf-8" separator="%NL;"/>

    </xs:appinfo>

  </xs:annotation>

The Property is the name of the property. The Value is an XML string literal corresponding to a value of the appropriate type.

7.1.2.2      Property Binding Syntax: Element Form

The representation properties can sometimes have complex syntax, so an element form for representation property bindings is provided as element content within the format element content model. This is provided to ease syntactic expression difficulties. The element is called dfdl:property and it has one attribute 'name' which provides the name of the property.

For example:

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:format>

        <dfdl:property name='encoding'>utf-8</dfdl:property>

        <dfdl:property name='separator'>%NL;</dfdl:property>

      </dfdl:format>

    </xs:appinfo>

  </xs:annotation>

Element form is mostly used for properties that themselves contain the quotation mark characters and escape characters so that they can be expressed without concerns about confusion with the XSDL syntax use of these same characters. CDATA encapsulation can be used so as to allow malformed XML and mismatched quotes to be easily used as representation property values:

<dfdl:property name='initiator'><![CDATA[<!-- ]]></dfdl:property>

7.1.2.3      Property Binding Syntax: Short Form

To save textual clutter, short-form syntax for format annotations is also allowed on xs:element, xs:sequence, xs:choice, xs:group (for group references only), and xs:simpleType schema elements. (The xs:schema element cannot carry short-form annotations). Attributes which are in the namespace 'http://www.ogf.org/dfdl/dfdl-1.0/' and whose local name matches one of the DFDL representation properties are assumed to be equivalent to specific DFDL attribute form annotations.

For example the two forms below are equivalent in that they describe the same data format. The first is the short form of the second:

<xs:element name="elem1">

  <xs:complexType>

     <xs:sequence dfdl:separator="%HT;" >

       ...

     </xs:sequence>

  </xs:complexType>

</xs:element>

 

<xs:element name="elem2">

  <xs:complexType>

    <xs:sequence>

      <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">

        <dfdl:sequence separator="%HT;" />

      </xs:appinfo></xs:annotation>

      ...

    </xs:sequence>

  </xs:complexType>

</xs:element>

Another example:

<xs:sequence dfdl:separator=",">

  <xs:element name="elem1" type="xs:int" maxOccurs="unbounded"

                       dfdl:representation="text"

                       dfdl:textNumberRep="standard"

                       dfdl:initiator="["

                       dfdl:terminator="]"/>

 

  <xs:element name="elem2" type="xs:int" maxOccurs="unbounded">

    <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:element representation="text"

                     textNumberRep="standard"

                     initiator="["                  

                     terminator="]"/>

    </xs:appinfo></xs:annotation>

  </xs:element>

</xs:sequence>

Because short form syntax is not allowed on the xs:schema element, an attribute form dfdl:format annotation must be used instead.

7.1.3       Empty String as a Representation Property Value

DFDL provides no mechanism to un-set a property. Setting a representation property's value to the empty string doesn't remove the value for that property, but sets it to the empty string value. This may not be appropriate as a value for certain properties.

For example, in delimited text data formats, it is sensible for the separator to be defined to be the empty string. This turns off use of separator delimiters. For many other string-valued properties, it is a schema definition error to assign them the empty string value. For example, the character set encoding property (dfdl:encoding) cannot be set to the empty string.
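A minimal sketch of the separator case follows. The element names and lengths are hypothetical, and the other properties are assumed to come from a dfdl:format in scope.

```xml
<!-- The empty string is a value in its own right: here it turns off separators.
     It does not remove the property binding. -->
<xs:sequence dfdl:separator="">
  <xs:element name="year" type="xs:string"
              dfdl:lengthKind="explicit" dfdl:length="4"/>
  <xs:element name="month" type="xs:string"
              dfdl:lengthKind="explicit" dfdl:length="2"/>
</xs:sequence>
```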

7.2       dfdl:defineFormat - Reusable Data Format Definitions

One or more dfdl:defineFormat annotation elements can appear within the annotation children of the xs:schema element. DFDL defining annotation elements may only appear as annotation children of the xs:schema element.

The order of their appearance does not matter, nor does their position relative to other non-annotation children of the xs:schema.

Each dfdl:defineFormat has a required name attribute.

The construct creates a named data format definition. The value of the name attribute is of XML type NCName. The format name will become a member of the schema's target namespace. These names must be unique within the namespace.

If multiple format definitions have the same 'name' attribute, in the same namespace, then it is a schema definition error.

Here is an example of a format definition:

<xs:schema ...>

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:defineFormat name="myConfig" >

        <dfdl:format representation="text"

                     ref="textSpecialFormat1" />

      </dfdl:defineFormat>

    </xs:appinfo>

  </xs:annotation>

  ...

</xs:schema>

A dfdl:defineFormat serves only to supply a named definition for a format for reuse from other places. It does not cause any use of the representation properties it contains to describe any actual data.

7.2.1       Inheritance for dfdl:defineFormat

A dfdl:defineFormat declaration can inherit from another named format definition by use of the dfdl:ref property of the dfdl:format annotation. This allows a single-inheritance hierarchy that reuses definitions. When one definition extends another in this way, any property definitions contained in its direct elements override those in any inherited definition.

Conceptually, the dfdl:ref inheritance chains can be flattened and removed by copying all inherited property bindings and then superseding those for which there is a local binding. Throughout this document we will assume inheritance is fully flattened. That is, all dfdl:ref inheritance is first removed by flattening before any other examination of properties occurs.
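The flattening described above can be sketched as follows. The format names 'baseText' and 'utf8Text' are hypothetical.

```xml
<xs:annotation>
  <xs:appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:defineFormat name="baseText">
      <dfdl:format representation="text" encoding="US-ASCII" separator=","/>
    </dfdl:defineFormat>
    <dfdl:defineFormat name="utf8Text">
      <!-- Inherits representation and separator; the local encoding binding wins -->
      <dfdl:format ref="baseText" encoding="UTF-8"/>
    </dfdl:defineFormat>
    <!-- After flattening, utf8Text behaves as if it had been written as:
         <dfdl:format representation="text" encoding="UTF-8" separator=","/> -->
  </xs:appinfo>
</xs:annotation>
```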

It is a schema definition error if use of the dfdl:ref property results in a circular path.

7.2.2       Using/Referencing a Named Format Definition

See Section 7.1.1 The dfdl:ref Property.

7.3       The dfdl:assert Statement Annotation Element

The dfdl:assert statement annotation element is used to assert truths about a DFDL model that are used when parsing to ensure that the data are well-formed. They are not used when unparsing. These checks are separate from validation checking and are performed even when validation is off. This distinction is needed to ensure that switching validation off does not affect parsing.

Examples of dfdl:assert elements are below:

<dfdl:assert message="Value is not zero." test="{ ../x eq 0}" />

 

<dfdl:assert message="Precondition violation." >

        <![CDATA[{../x le 0 and ../y ne "-->" and ../y ne "<!--" }]]>

</dfdl:assert>

 

 

<dfdl:assert message="Postcondition violation."  testKind='expression'>    

     {../x ne "'"}

</dfdl:assert>

7.3.1       Properties for dfdl:assert

A dfdl:assert annotation contains a test expression or a test pattern. The dfdl:assert is said to be successful if the test expression evaluates to true or the test pattern returns a non-zero length match, and unsuccessful if the test expression evaluates to false or the test pattern returns a zero length match. An unsuccessful dfdl:assert causes either a processing error or a recoverable error to be issued, as specified by the failureType property of the dfdl:assert.

The testKind property specifies whether an expression or pattern is used by the dfdl:assert. The expression or pattern can be expressed as an attribute or as a value.

<dfdl:assert  test="{test expression}" />

 

<dfdl:assert>

            {test expression}

</dfdl:assert>

It is a schema definition error if a property is specified in more than one form.

It is a schema definition error if both a test expression and a test pattern are specified.

A dfdl:assert can appear as an annotation on:

If the resolved set of statement annotations for a schema component contains multiple dfdl:assert statements, then those with testKind 'pattern' are executed before those with testKind 'expression' (the default). However, within each group the order of execution among them is not specified.

If one of the resolved set of asserts for a schema component is unsuccessful, and the failureType of the assert is 'processingError', then no further asserts in the set are executed.

 

Property Name

Description

testKind

Enum (optional)

Valid values are 'expression',  'pattern'

Default value is 'expression'

Specifies whether a DFDL expression or DFDL regular expression is used in the dfdl:assert.

Annotation: dfdl:assert

test

DFDL Expression

Applies when testKind is 'expression'

A DFDL expression that evaluates to true or false. If the expression evaluates to true then parsing continues. If the expression evaluates to false then a processing error is raised.

Any element referred to by the expression must have already been processed or must be a descendent of this element.

If a processing error occurs during the evaluation of the test expression then the dfdl:assert also fails.

It is a schema definition error if testKind is 'expression' or not specified, and an expression is not supplied by either the value of the dfdl:assert element or the value of the test attribute.

Annotation: dfdl:assert

testPattern

DFDL Regular Expression

Applies when testKind is 'pattern'

A DFDL regular expression that is applied against the data stream starting at the data position corresponding to the beginning of the representation of the component on which the dfdl:assert is positioned. Consequently the framing (including any initiator) is visible to the pattern.

If the pattern matching of the regular expression reads data that cannot be decoded into characters of the current encoding, then the behavior is controlled by the dfdl:encodingErrorPolicy property. See Section 11.2.1 Property dfdl:encodingErrorPolicy for details.

If the length of the match is zero then the dfdl:assert evaluates to false and a processing error is raised.

If the length of the match is non-zero then the dfdl:assert evaluates to true.

If a processing error occurs during the evaluation of the test regular expression then the dfdl:assert also fails.

It is a schema definition error if testKind is 'pattern', and a pattern is not supplied by either the value of the dfdl:assert element or the value of the testPattern property.

It is a schema definition error if there is no value for the dfdl:encoding property in scope.

It is a schema definition error if dfdl:leadingSkip is other than 0.

It is a schema definition error if the dfdl:alignment is not 1 or 'implicit'

Annotation: dfdl:assert

message

String or DFDL Expression

Defines text to be used as a diagnostic code or for use in an error message, when the assert is unsuccessful.

The DFDL Expression must return type xs:string. Any element referred to by the message expression must have already been processed or must be a descendent of this element. There is special treatment for errors that occur while evaluating the message expression. See below for details.

Annotation: dfdl:assert

failureType

Enum (optional)

Valid values are 'processingError', 'recoverableError'.

Default value is 'processingError'.

Specifies the type of failure that occurs when the dfdl:assert is unsuccessful.

When 'processingError', a processing error is raised.

When 'recoverableError', a recoverable error is raised.

If an error occurs while evaluating the test expression, a processing error occurs, not a recoverable error.

Recoverable errors do not cause backtracking like processing errors.

Annotation: dfdl:assert

Table 6 dfdl:assert properties


Example of a dfdl:assert with a message expression:

<dfdl:assert message="{ fn:concat('unknown case ', ../data1) }">
{  if (...pred1...) then ...expr1...
   else if (...pred2...) then ...expr2...
   else fn:false()
}

</dfdl:assert>

The message specified by the message property is issued only if the dfdl:assert is unsuccessful, that is, the test expression evaluates to false or the test pattern returns a zero-length match. If so, and the message property is an expression, the message expression is evaluated at that time.

If a processing error or schema definition error occurs while evaluating the message expression, a recoverable error is issued to record this error (containing implementation-dependent content), then processing of the assert continues as if there was no problem and in a manner consistent with the failureType property, but using an implementation-dependent substitute message.
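As a sketch of how the properties in Table 6 might combine, the element below carries one pattern assert and one expression assert. The element name, pattern, and messages are hypothetical. An encoding is assumed to be in scope, with dfdl:leadingSkip 0 and dfdl:alignment 1 or 'implicit', as the pattern form requires.

```xml
<xs:element name="recordType" type="xs:string"
            dfdl:lengthKind="explicit" dfdl:length="3">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- testKind 'pattern' asserts are executed before 'expression' asserts.
           failureType defaults to 'processingError'. -->
      <dfdl:assert testKind="pattern" testPattern="[A-Z]{3}"
                   message="Record type is not three upper-case letters."/>
      <!-- Expression assert on the parsed value; an unsuccessful test is
           reported as a recoverable error, so no backtracking occurs -->
      <dfdl:assert test="{ . ne 'XXX' }" failureType="recoverableError"
                   message="Record type XXX is reserved."/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```

If the pattern assert is unsuccessful, the expression assert is not executed, because the first failure is a processing error.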

 

7.3.2       Controlling the Timing of Statement Evaluation

Schema authors can insert xs:sequence constructs to control the timing of evaluation of statements more precisely. For example:

<xs:sequence dfdl:separator=",">

   ...

   <xs:element ref="a" .../>

   <xs:sequence>

     <xs:sequence>

       <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/" >

         <dfdl:assert  test="{test expression}" />

       </xs:appinfo></xs:annotation>

     </xs:sequence>

     <xs:element ref="b" .../>

   </xs:sequence>

   ...

</xs:sequence>

In the above, the assert test expression is evaluated after parsing element 'a', and before parsing element 'b'. The use of two nested interior sequences surrounding element 'b' in this manner ensures that the outermost sequence's separator usage is not disrupted.

 

7.4       The dfdl:discriminator Statement Annotation Element

DFDL discriminators are used during parsing to resolve points of uncertainty that cannot be resolved by speculative parsing. Discriminators are not used during unparsing.  They can also be used to force a resolution earlier during the parsing of a group so that subsequent parsing errors are treated as processing errors of a known component rather than a failure to find a component.

A discriminator determines the existence or non-existence of a component. If the discriminator is successful then the component is known to exist and any subsequent errors will not cause backtracking at points of uncertainty. If a discriminator is unsuccessful then the component is known not to exist and backtracking occurs immediately.

If the complex type of an element has a sequence group as its content model, and the sequence group is known not to exist, then the element is known not to exist.

Examples of dfdl:discriminator annotations are below:

<dfdl:discriminator>

  { ../recType eq 0 }

</dfdl:discriminator>

 

<dfdl:discriminator test="{ ../recType eq 0}" />

When the discriminator's expression evaluates to "false", then it causes a processing error, and the discriminator is said to fail.

7.4.1       Properties for dfdl:discriminator

A dfdl:discriminator annotation contains a test expression or a test pattern. The discriminator is said to be successful if the test expression evaluates to true or the test pattern returns a non-zero length match, and unsuccessful (or to fail) if the test expression evaluates to false or the test pattern returns a zero length match.

The testKind property specifies whether an expression or pattern is used by the dfdl:discriminator. The expression or pattern can be expressed as an attribute or as a value.

<dfdl:discriminator test="{test expression}" />

 

<dfdl:discriminator>

    { test expression }

</dfdl:discriminator>

It is a schema definition error if a property is specified in more than one form.

It is a schema definition error if both a test expression and a test pattern are specified.

A dfdl:discriminator can be an annotation on

The resolved set of statement annotations for a schema component can contain only a single dfdl:discriminator or one or more dfdl:assert annotations, but not both. To clarify: dfdl:assert annotations and dfdl:discriminator annotations are exclusive of each other. It is a schema definition error otherwise.

Property Name

Description

testKind

Enum

Valid values are 'expression',  'pattern'

Default value is 'expression'

Specifies whether a DFDL expression or DFDL regular expression is used in the dfdl:discriminator.

Annotation: dfdl:discriminator

test

DFDL Expression

Applies when testKind is 'expression'

A DFDL expression that evaluates to true or false. If the expression evaluates to true then the discriminator succeeds and parsing continues. If the expression evaluates to false then the discriminator fails and a processing error is raised.
If a processing error occurs during the evaluation of the test expression then the discriminator also fails.

Any element referred to by the expression must have already been processed or must be a descendent of this element.

The expression must have been evaluated by the time this element and its descendents have been processed, or when a processing error occurs while processing this element or its descendents.

It is a schema definition error if testKind is 'expression' or not specified, and an expression is not supplied by either the value of the dfdl:discriminator element or the value of the test attribute.

Annotation: dfdl:discriminator

testPattern

DFDL Regular Expression

Applies when testKind is 'pattern'

A DFDL regular expression that is applied against the data stream starting at the data position corresponding to the beginning of the representation of the component on which the dfdl:discriminator is positioned. Consequently the framing (including any initiator) is visible to the pattern.

If the pattern matching of the regular expression reads data that cannot be decoded into characters of the current encoding, then the behavior is controlled by the dfdl:encodingErrorPolicy property. See Section 11.2.1 Property dfdl:encodingErrorPolicy for details.

If the length of the match is zero then the dfdl:discriminator evaluates to false and a processing error is raised.

If the length of the match is non-zero then the dfdl:discriminator evaluates to true.

It is a schema definition error if testKind is 'pattern', and a pattern is not supplied by either the value of the dfdl:discriminator element or the value of the testPattern property.

It is a schema definition error if there is no value for the dfdl:encoding property in scope.

It is a schema definition error if dfdl:leadingSkip is other than 0.

It is a schema definition error if the dfdl:alignment is not 1 or 'implicit'

Annotation: dfdl:discriminator

message

String or DFDL Expression

Defines text to be used as a diagnostic code or for use in an error message, when the discriminator is unsuccessful.

The DFDL Expression must return type xs:string. Any element referred to by the message expression must have already been processed or must be a descendent of this element. There is special treatment for errors that occur while evaluating the message expression. See below for details.

Annotation: dfdl:discriminator

Table 7 dfdl:discriminator properties

The message specified by the message property is issued only if the discriminator is unsuccessful, that is, the test expression evaluates to false or the test pattern returns a zero-length match. If so, and the message property is an expression, the message expression is evaluated at that time.

If a processing error or schema definition error occurs while evaluating the message expression, a recoverable error is issued to record this error (containing implementation-dependent content), then processing of the discriminator continues as if there was no problem, but in the case of failure using an implementation-dependent substitute message.

Examples of dfdl:discriminator annotations:

<xs:sequence>

  <xs:choice>

    <xs:element  name='branchSimple' >

      <xs:annotation>

        <xs:appinfo source="http://www.ogf.org/dfdl/">

          <dfdl:discriminator test='{. eq "a"}'       />

        </xs:appinfo>

      </xs:annotation>

    </xs:element>

 

    <xs:element name='branchComplex' >

      <xs:annotation>

        <xs:appinfo source="http://www.ogf.org/dfdl/">

          <dfdl:discriminator test='{./identifier eq "b"}' />

        </xs:appinfo>

      </xs:annotation>

      <xs:complexType >

         <xs:sequence>

           <xs:element name='identifier'  />

           ...

         </xs:sequence>

      </xs:complexType>

    </xs:element>

 

    <xs:element name='branchNestedComplex' >

      <xs:annotation>

       <xs:appinfo source="http://www.ogf.org/dfdl/">

          <dfdl:discriminator test='{./Header/identifier eq "c"}'/>

        </xs:appinfo>

      </xs:annotation>

      <xs:complexType >

        <xs:sequence>

          <xs:element name='Header'>

            <xs:complexType >

              <xs:sequence>

                <xs:element name='identifier'  />

                ...              

              </xs:sequence>

            </xs:complexType>

          </xs:element>

        </xs:sequence>

      </xs:complexType>

    </xs:element>

  </xs:choice>

</xs:sequence>
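The examples above all use test expressions. A discriminator with a test pattern might look like the following sketch, in which the element names and the pattern are hypothetical. An encoding is assumed to be in scope, with dfdl:leadingSkip 0 and dfdl:alignment 1 or 'implicit'.

```xml
<xs:choice>
  <xs:element name="headerRecord" type="xs:string">
    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <!-- The pattern is applied to the data stream from the start of the
             element's representation. A zero-length match means this branch
             is known not to exist, and backtracking occurs immediately. -->
        <dfdl:discriminator testKind="pattern" testPattern="HDR"
                            message="Record does not begin with HDR."/>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
  <xs:element name="detailRecord" type="xs:string"/>
</xs:choice>
```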

 

7.5       The dfdl:defineEscapeScheme Defining Annotation Element

One or more dfdl:defineEscapeScheme annotation elements can appear within the annotation children of the xs:schema. The dfdl:defineEscapeScheme elements may only appear as annotation children of the xs:schema.

The order of their appearance does not matter, nor does their position relative to other annotation or non-annotation children of the xs:schema.

Each dfdl:defineEscapeScheme has a required name attribute and a required dfdl:escapeScheme child element.

The construct creates a named escape scheme definition. The value of the name attribute is of XML type NCName. The name will become a member of the schema's target namespace. These names must be unique within the namespace among escape schemes.

If multiple dfdl:defineEscapeScheme definitions have the same 'name' attribute, in the same namespace, then it is a schema definition error.

Each dfdl:defineEscapeScheme annotation element contains a dfdl:escapeScheme annotation element as detailed below.

Here is an example of an escapeScheme definition:

<xs:schema ...>

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:defineEscapeScheme name="myEscapeScheme">

        ...

        <dfdl:escapeScheme escapeKind="escapeCharacter"

                           escapeCharacter='/' />

        ...      

      </dfdl:defineEscapeScheme>

    </xs:appinfo>

  </xs:annotation>

  ...

</xs:schema>

A dfdl:defineEscapeScheme serves only to supply a named definition for a dfdl:escapeScheme for reuse from other places. It does not cause any use of the representation properties it contains to describe any actual data.

7.5.1       Using/Referencing a Named escapeScheme Definition

A named, reusable, escape scheme is used by referring to its name from a dfdl:escapeSchemeRef property on an element. For example:

<xs:element name="foo" type="xs:string" >
  <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">

    <dfdl:element representation="text" 
                  escapeSchemeRef="myEscapeScheme"/>

  </xs:appinfo></xs:annotation>
</xs:element>

7.6       The dfdl:escapeScheme Annotation Element

The dfdl:escapeScheme annotation is used within a dfdl:defineEscapeScheme annotation to group the properties of an escape scheme and allows a common set of properties to be defined that can be reused.

An escape scheme defines the properties that describe the text escaping rules in force when data such as text delimiters are present in the data. There are two variants of such schemes:

-       The use of a single escape character to cause the next character to be interpreted literally. The escape character itself is escaped by the escape escape character.

-       The use of a pair of escape strings to cause the enclosed group of characters to be interpreted literally. The ending escape string is escaped by the escape escape character.
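The two variants above can be sketched as follows. The scheme names and the chosen characters are hypothetical. Only the properties that distinguish the variants are shown, and any other dfdl:escapeScheme properties a complete definition needs are omitted.

```xml
<xs:annotation>
  <xs:appinfo source="http://www.ogf.org/dfdl/">
    <!-- Variant 1: a single escape character makes the next character literal -->
    <dfdl:defineEscapeScheme name="backslashEscape">
      <dfdl:escapeScheme escapeKind="escapeCharacter"
                         escapeCharacter="\"
                         escapeEscapeCharacter="\"/>
    </dfdl:defineEscapeScheme>
    <!-- Variant 2: a pair of escape strings makes the enclosed characters literal -->
    <dfdl:defineEscapeScheme name="quotedBlock">
      <dfdl:escapeScheme escapeKind="escapeBlock"
                         escapeBlockStart='"'
                         escapeBlockEnd='"'
                         escapeEscapeCharacter="\"/>
    </dfdl:defineEscapeScheme>
  </xs:appinfo>
</xs:annotation>
```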

On parsing, the escape scheme is applied after pad characters are trimmed and on unparsing before pad characters are added.

DFDL does not perform any substitutions for ampersand notations like &lt;.

The syntax of dfdl:escapeScheme is defined in Section 13.2.1 The dfdl:escapeScheme Properties.

7.7       The dfdl:defineVariable Annotation Element

Variables provide a means for communication within a set of DFDL schemas. They are defined as top-level elements in a schema and therefore have global scope.

A new variable is introduced using dfdl:defineVariable:

<dfdl:defineVariable

       name = NCName

       type? = QName

      defaultValue? = logical value or dfdl expression

      external? = 'false' | 'true' >

  <!-- Contains: logical value or dfdl expression (default value) -->

</dfdl:defineVariable>

The name of a newly defined variable is placed into the target namespace of the schema containing the annotation. Variable names are distinct from format and escape scheme names and so cannot conflict with them.  A variable can have any type from the DFDL subset of XML schema simple types. If no type is specified, the type is xs:string.

The defaultValue is optional. This is a literal value or an expression which evaluates to a constant, and it can be specified as an attribute or as the element value. If specified the default value must match the type of the variable (otherwise it is a schema definition error).

Note that the syntax supports both a defaultValue attribute and the default value being specified by the element value. Only one or the other may be present (otherwise it is a schema definition error). To set the default value to "" (empty string), the defaultValue attribute syntax must be used, or the expression { "" } must be used as the element value.
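The alternatives described above might be written as in the following sketch. The variable names and values are hypothetical.

```xml
<!-- Default value supplied as an attribute -->
<dfdl:defineVariable name="fieldSeparator" type="xs:string" defaultValue=","/>

<!-- Default value supplied as the element value -->
<dfdl:defineVariable name="recordCount" type="xs:int">0</dfdl:defineVariable>

<!-- Empty-string default: attribute syntax -->
<dfdl:defineVariable name="prefix" type="xs:string" defaultValue=""/>

<!-- Empty-string default: expression as the element value -->
<dfdl:defineVariable name="suffix" type="xs:string">{ "" }</dfdl:defineVariable>
```

Supplying both the defaultValue attribute and an element value on the same dfdl:defineVariable is a schema definition error.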

Note the value of the name attribute is an NCName. The name of a variable is defined in the target namespace of the schema containing the definition. If multiple dfdl:defineVariable definitions have the same 'name' attribute in the same namespace then it is a schema definition error.

A default instance of the variable is created (with global scope). Further instances of the variable may subsequently be created on schema elements. If the variable has a default value, this will be used as the default value for any instances of the variable (unless overridden when the instance is created).

The external property is optional. If not specified it takes the default value 'false'. If true the value may be provided by the DFDL processor and this external value will be used as the global default value (overriding any defaultValue specified on the dfdl:defineVariable). The mechanism by which the processor provides this value is implementation-defined.

There is no required order between dfdl:defineVariable and other schema level defining annotations or a dfdl:format annotation that may refer to the variable.

A defaultValue expression is evaluated before processing the data stream begins.

A defaultValue expression can refer to other variables but not to the infoset (so no path locations). The referenced variable must either have a defaultValue or be external. It is a schema definition error otherwise.

If a defaultValue expression references another variable then that prevents the referenced variable's value from ever changing, that is, it is considered to be a read of the variable's value.

If a defaultValue expression references another variable and this causes a circular reference, it is a schema definition error.
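The circular-reference rule can be checked with a depth-first walk over the variables that each defaultValue expression refers to. The sketch below is only an illustration in Python: the name find_cycle and the dictionary that maps each variable to the variables its defaultValue reads are hypothetical, and extracting those references from real DFDL expressions is assumed to have been done elsewhere.

```python
def find_cycle(references):
    """Return a list of variable names that form a cycle, or None.

    `references` maps a variable name to the names of the variables its
    defaultValue expression reads (a hypothetical, pre-extracted form).
    """
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {}
    path = []  # variables on the current depth-first path

    def visit(name):
        state[name] = ON_PATH
        path.append(name)
        for target in references.get(name, ()):
            status = state.get(target, UNVISITED)
            if status == ON_PATH:
                # target is already on the current path: circular reference
                return path[path.index(target):] + [target]
            if status == UNVISITED:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        state[name] = DONE
        return None

    for name in references:
        if state.get(name, UNVISITED) == UNVISITED:
            cycle = visit(name)
            if cycle:
                return cycle
    return None
```

A processor following this approach would report a schema definition error whenever find_cycle returns a list rather than None.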

It is a schema definition error if the type of the variable is a user-defined simple type restriction.

7.7.1       Examples

 <dfdl:defineVariable name="EDIFACT_DS" type="xs:string"

                     defaultValue="," />

 

<dfdl:defineVariable name="codepage" type="xs:string"

                     external="true">utf-8</dfdl:defineVariable>

7.7.2       Predefined Variables

The following variables are predefined:

Name            Namespace URI                      Type       Default value  External
encoding        http://www.ogf.org/dfdl/dfdl-1.0/  xs:string  'UTF-8'        true
byteOrder       http://www.ogf.org/dfdl/dfdl-1.0/  xs:string  'bigEndian'    true
binaryFloatRep  http://www.ogf.org/dfdl/dfdl-1.0/  xs:string  'ieee'         true
outputNewLine   http://www.ogf.org/dfdl/dfdl-1.0/  xs:string  '%LF;'         true

Table 8 Pre-defined variables

These variables are expected to be commonly set externally so are predefined for convenience.

      <xs:element name="title" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element encoding="{$dfdl:encoding}" />
          </xs:appinfo>
        </xs:annotation>
      </xs:element>

7.8       The dfdl:newVariableInstance Statement Annotation Element

Scoped instances of defined variables are created using dfdl:newVariableInstance:

<dfdl:newVariableInstance

       ref = QName

      defaultValue? = logical value or dfdl expression >

  <!-- Contains: logical value or dfdl expression (value) -->

</dfdl:newVariableInstance>

Since an initial instance is created when the variable is defined, the use of dfdl:newVariableInstance is optional. It would be used if an instance with restricted scope is needed.

The dfdl:newVariableInstance annotation can be used on a group reference, sequence or choice only. It is a schema definition error otherwise.

The scope of the instance of a variable is the dynamic scope of the schema component and its content model and so is inherited by any contained constructs or construct references.

The ref property is a QName. That is, it may be qualified with a namespace prefix.

An optional defaultValue for the instance may be specified. It can be specified as an attribute or as the element value. The expression must not contain forward references to elements which have not yet been processed nor to the current component. If specified the default value must match the type of the variable as specified by dfdl:defineVariable. If the instance is not assigned a new default value then it will inherit the default value specified by dfdl:defineVariable or externally provided by the DFDL processor. If a default value is not specified (and has not been specified by dfdl:defineVariable) then the value of this instance is undefined until explicitly set (using dfdl:setVariable).

If a default value is specified this initial value of the instance will be set when the instance is created. The value will override any (global) default value which was specified by dfdl:defineVariable or which was provided externally to the DFDL processor. A variable instance with a valid value (specified or default) can be referenced anywhere within the scope of the element on which the instance was created.
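The precedence among the possible sources of an instance's initial value can be summarized in a few lines. The following Python sketch is an illustration only: the function name, its parameters, and the UNDEFINED sentinel are hypothetical, and the caller is assumed to pass external_value only for variables declared with external="true".

```python
UNDEFINED = object()  # sentinel: no value until dfdl:setVariable is used


def initial_instance_value(instance_default=UNDEFINED,
                           external_value=UNDEFINED,
                           defined_default=UNDEFINED):
    """Pick the initial value of a new variable instance.

    Order of precedence, following the text above:
      1. defaultValue given on dfdl:newVariableInstance
      2. value supplied externally by the DFDL processor
      3. defaultValue given on dfdl:defineVariable
    If none is available the instance stays undefined.
    """
    if instance_default is not UNDEFINED:
        return instance_default
    if external_value is not UNDEFINED:
        return external_value
    return defined_default
```

A sentinel is used instead of None so that an empty string, which is a legal default value, is not mistaken for "no value".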

Note that the syntax supports both a defaultValue attribute and the default value being specified by the element value. Only one or the other may be present. (Schema definition error otherwise.)

To set the default value to "" (empty string), the defaultValue attribute syntax must be used, or the expression { "" } must be used as the element value.

The resolved set of annotations for a component may contain multiple dfdl:newVariableInstance statements. They must all be for unique variables; it is a schema definition error otherwise. The order of execution is specified in Section 9.5 Evaluation Order for Statement Annotations.

There is no short form syntax for creating variable instances.

7.8.1       Examples

<dfdl:newVariableInstance ref="EDIFACT_DS" defaultValue=","/>

 

<dfdl:newVariableInstance ref="lengthUnitBits">

    { if (../hdr/fmtCode eq "bits") then 1 else 8 }  

</dfdl:newVariableInstance>

7.9       The dfdl:setVariable Statement Annotation Element

Variable instances get their values either by default, by external definition, or by subsequent assignment using the dfdl:setVariable statement annotation.

<dfdl:setVariable

       ref = QName

       value? = logical value or dfdl expression >

  <!-- Contains: logical value or dfdl expression (value) -->

</dfdl:setVariable>

The dfdl:setVariable annotation can be used on a simpleType, group reference, sequence or choice. It may be used on an element or element reference only if the element is of simple type. It is a schema definition error if dfdl:setVariable appears on an element of complex type, or an element reference to an element of complex type.

The ref property is a QName. That is, it may be qualified with a namespace prefix.

The syntax supports both a value attribute and the 'value' being specified by the element value. Only one or the other may be present (otherwise it is a schema definition error). To set the value to "" (empty string), the value attribute syntax must be used, or the expression { "" } must be used as the element value.

The value must match the type of the variable as specified by dfdl:defineVariable.

A dfdl:setVariable value expression may refer to the value of this element using a relative path value ".". Use of relative path expressions is recommended wherever possible as this will allow the behavior of the parser to be more effectively scoped. However this practice is not enforced and there may be situations in which use of an absolute path is in fact necessary.

The declaration of a variable must be in scope at the point of the assignment, and at the point of reference.

In normal processing, the value of an instance can only be set once using dfdl:setVariable.  Attempting to set the value of the variable instance for a second time is a schema definition error. In addition, if a reference to the variable's value has already occurred and returned a default or an externally supplied value, then no assignment (even a first one) can occur. An exception to this behavior occurs whenever the DFDL processor backtracks because it is processing multiple branches of a choice or as a result of speculative parsing. In this case the variable state is also rewound.
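The set-once behavior and its interaction with backtracking can be modeled as a small state machine. The Python sketch below is a hypothetical illustration, not a description of any real DFDL processor: the class and method names are invented, and the error raised for an assignment after a read is assumed to be of the same class as the error for a second assignment, since the text above names only the latter.

```python
class SchemaDefinitionError(Exception):
    """Stands in for a DFDL schema definition error."""


_NO_VALUE = object()  # marks an instance that has no value yet


class VariableInstance:
    def __init__(self, name, default=_NO_VALUE):
        self.name = name
        self.value = default
        self.was_set = False   # a dfdl:setVariable has already succeeded
        self.was_read = False  # the value has already been referenced

    def read(self):
        if self.value is _NO_VALUE:
            # The text above does not name the error for this case;
            # LookupError is a placeholder chosen for this sketch.
            raise LookupError(f"variable {self.name} has no value")
        self.was_read = True
        return self.value

    def set(self, value):
        if self.was_set:
            raise SchemaDefinitionError(f"{self.name} is already set")
        if self.was_read:
            # Assumption: same error class as a second assignment.
            raise SchemaDefinitionError(f"{self.name} was read before being set")
        self.value = value
        self.was_set = True

    def snapshot(self):
        """Capture the state before a speculative parse attempt."""
        return (self.value, self.was_set, self.was_read)

    def restore(self, saved):
        """Rewind the state when the processor backtracks."""
        self.value, self.was_set, self.was_read = saved
```

Taking a snapshot before each speculative attempt and restoring it on failure is what allows a different branch of a choice to assign the same variable again.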

A dfdl:setVariable will override any default value specified on either dfdl:defineVariable or dfdl:newVariableInstance, or externally.

The resolved set of annotations for an annotation point may contain multiple dfdl:setVariable statements. They must all be for unique variables and it is a schema definition error otherwise. The order of execution is specified in Section 9.5 Evaluation Order for Statement Annotations.

There is no short form syntax for variable assignment.

7.9.1       Examples

<xs:element name="ds" type="xs:string">

   <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:setVariable ref="EDI:EDIFACT_DS" value="{.}" />

      <dfdl:setVariable ref="delta"> {.} </dfdl:setVariable>

   </xs:appinfo></xs:annotation>

</xs:element>

In the above example, the element named "ds" contains the string to be used as the EDI:EDIFACT_DS delimiter at other places in the data, so the above defines the value of the EDI:EDIFACT_DS variable to take on the value of this element.

8.      Property Scoping Rules

This section describes the rules that govern the scope over which DFDL representation properties apply.

The scope of the representation properties on each of the component format annotations is given in Table 9 DFDL annotation scoping.

Annotation Point        Property Scope
Schema declaration      dfdl:format representation properties apply lexically as default properties over all components in the schema
Element declaration     dfdl:element properties apply locally
Element reference       dfdl:element properties apply locally
Simple type definition  dfdl:simpleType properties apply locally
Sequence                dfdl:sequence properties apply locally
Choice                  dfdl:choice properties apply locally
Group reference         dfdl:group properties apply locally

Table 9 DFDL annotation scoping

Note: This table lists DFDL annotations on schema components. DFDL annotations can also be placed on other DFDL annotations, such as a dfdl:format within a dfdl:defineFormat, to provide a named reusable format definition. In this case the annotation applies only where the named format is referenced.

DFDL representation properties explicitly defined on annotations, other than a dfdl:format on an xs:schema declaration, apply locally to that component only. The explicitly defined properties are the combination of any defined locally on the annotation and any defined on the dfdl:defineFormat annotation referenced by a local dfdl:ref property. When a property is defined both locally and on the dfdl:defineFormat, the locally defined property takes precedence.

The dfdl:format annotation on the top level xs:schema declaration provides defaults for the DFDL representation properties at every DFDL-annotatable component contained in the schema document. They do not apply to any components in any included or imported schema document (these may have their own defaults).

8.1       Providing Defaults for DFDL properties

A dfdl:format annotation on the top level xs:schema declaration may provide defaults for some or all the DFDL representation properties at every annotation point within the schema document. The default properties may be specified in attribute or element form. (Short form is not allowed on the xs:schema element.)

The dfdl:ref property is not a representation property so no default can be set.

The dfdl:escapeSchemeRef property provides a default reference to a dfdl:defineEscapeScheme; the properties of dfdl:escapeScheme are not defaulted individually.

DFDL representation properties defined explicitly on a component apply only to that component and override the default value of that property provided by a default format specified by an xs:schema dfdl:format annotation.

The example below demonstrates the overriding of the encoding property. The value 'ASCII' is the default for the title element, but it is overridden by the locally defined value 'utf-8', which takes precedence.

<xs:schema>

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format encoding="ASCII" />
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="book">
    <xs:complexType>

      <xs:sequence>
        <xs:element name="title" type="xs:string">
          <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/">
              <dfdl:element encoding="utf-8" />
            </xs:appinfo>
          </xs:annotation>
        </xs:element>
        <xs:element name="pages" type="xs:int"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

8.2       Combining DFDL Representation Properties from a dfdl:defineFormat

The DFDL representation properties contained in a referenced dfdl:defineFormat are combined with any DFDL representation properties defined locally on a construct as if they had been defined locally. If the same property is defined both locally and in the referenced dfdl:defineFormat then the local property takes precedence. The combined set of explicit DFDL properties has precedence over any defaults set by a dfdl:format on the xs:schema.

<xs:schema>

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:defineFormat name='myFormat'>

        <dfdl:format encoding="ASCII" />

      </dfdl:defineFormat>
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="book">
    <xs:complexType>

      <xs:sequence>
        <xs:element name="title" type="xs:string">
          <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/">
              <dfdl:element ref='myFormat' encoding="UTF-8" />
            </xs:appinfo>
          </xs:annotation>
        </xs:element>
        <xs:element name="pages" type="xs:int"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

The example above demonstrates the overriding of an encoding property. The 'ASCII' format encoding from the 'myFormat' is overridden by the UTF-8 format encoding, which as a locally defined property takes precedence.

8.3       Combining DFDL Properties from References

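The DFDL properties from the following types of reference are combined using the rules below:

-       An xs:element and its referenced xs:simpleType restriction
-       An xs:element reference and its referenced global xs:element
-       An xs:group reference and an xs:sequence or xs:choice in its referenced global xs:group
-       An xs:simpleType restriction and its base xs:simpleType restriction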

Rules

  1. Create an empty working set of "explicit" properties. Create an empty working set of "default" properties.
  2. Move to the innermost schema component in the chain of references.
  3. Assemble its applicable "explicit" properties from its local dfdl:ref (if present) and its local properties (if present), the latter overriding the former (that is, local wins over referenced).
  4. Combine these with the current working set of "explicit" properties. It is a schema definition error if the same property appears twice. The result is a new working set of "explicit" properties.
  5. Obtain applicable "default" properties from a dfdl:format annotation on the xs:schema that contains the component (if such annotation is present).  Combine these with the current working set of "default" properties, the latter overriding the former (that is, inner wins). Result is a new working set of "default" properties.
  6. Move to the schema component that references the current component, and repeat starting at step 3. If there is no referencing component, carry out step 5 and then go to step 7.
  7. Combine the resultant sets of properties. The "explicit" properties take priority; a "default" property is used only when no "explicit" property of the same name is present. It is a schema definition error if a required property is in neither the "explicit" nor the "default" working sets.
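The seven rules above can be read as a single loop over the chain of references, from the innermost component outward. The Python sketch below follows one plausible reading of those rules and is not a normative implementation: the function name, the dictionary keys 'ref', 'local' and 'schema_defaults', and the error class are all hypothetical, and filtering to the properties applicable to each component is assumed to have been done by the caller.

```python
class SchemaDefinitionError(Exception):
    """Stands in for a DFDL schema definition error."""


def combine_properties(chain, required=()):
    """Combine DFDL properties along a chain of references.

    `chain` lists the components from innermost to outermost. Each entry
    is a dict with optional keys (hypothetical names for this sketch):
      'ref'             properties from the dfdl:defineFormat named by dfdl:ref
      'local'           properties written on the component itself
      'schema_defaults' properties of the dfdl:format on the enclosing xs:schema
    """
    explicit = {}   # step 1: empty working sets
    defaults = {}
    for component in chain:                      # steps 2 and 6
        # Step 3: local properties override those obtained through dfdl:ref.
        own = {**component.get("ref", {}), **component.get("local", {})}
        # Step 4: the same explicit property must not appear twice.
        clash = explicit.keys() & own.keys()
        if clash:
            raise SchemaDefinitionError(
                "property defined twice: " + ", ".join(sorted(clash)))
        explicit.update(own)
        # Step 5: defaults already collected (inner) win over outer ones.
        defaults = {**component.get("schema_defaults", {}), **defaults}
    # Step 7: explicit properties take priority over defaults.
    combined = {**defaults, **explicit}
    missing = [name for name in required if name not in combined]
    if missing:
        raise SchemaDefinitionError(
            "required property missing: " + ", ".join(missing))
    return combined
```

Under this reading, a chain consisting of a group's xs:sequence followed by the group reference that points at it would take its defaults first from the schema that declares the group and only then from the schema that contains the reference.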

"Applicable" properties are all the DFDL properties that apply to that schema component. For example all the DFDL properties that apply to a particular xs:simpleType (as defined by section 13).

<xs:simpleType name="newType">

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:simpleType alignment="16"/>

    </xs:appinfo>

  </xs:annotation>

  <xs:restriction base="xs:integer">

    <xs:maxInclusive value="10"/>

  </xs:restriction>

</xs:simpleType>

 

<xs:element name="testElement1" type="newType">

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:element representation="binary"/>

    </xs:appinfo>

  </xs:annotation>

</xs:element>

The locally defined dfdl:alignment property with value '16' from the xs:simpleType 'newType' is combined with the locally defined dfdl:representation property with value 'binary' and applied to element 'testElement1'.

<xs:simpleType name="otherNewType">

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:simpleType alignment="64"/>

    </xs:appinfo>

  </xs:annotation>

  <xs:restriction base="newType">

    <xs:maxInclusive value="5"/>

  </xs:restriction>

</xs:simpleType>

 

<xs:simpleType name="newType">

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:simpleType representation='binary'/>

    </xs:appinfo>

  </xs:annotation>

  <xs:restriction base="xs:int">

    <xs:maxInclusive value="10"/>

  </xs:restriction>

</xs:simpleType>

The locally defined dfdl:representation property with value 'binary' is combined with the locally defined dfdl:alignment property with value '64' from the xs:simpleType restriction 'otherNewType'.

<xs:sequence>

  <xs:element ref="testElement1">

    <xs:annotation>

      <xs:appinfo source="http://www.ogf.org/dfdl/">

        <dfdl:element binaryNumberRep ="binary"/>

      </xs:appinfo>

    </xs:annotation>

  </xs:element>

</xs:sequence>

 

<xs:element name="testElement1" type="newType">

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:element representation="binary"/>

    </xs:appinfo>

  </xs:annotation>

</xs:element>

 

<xs:simpleType name="newType">

  <xs:annotation>

    <xs:appinfo source="http://www.ogf.org/dfdl/">

      <dfdl:simpleType alignment="16"/>

    </xs:appinfo>

  </xs:annotation>

  <xs:restriction base="xs:int">

    <xs:maxInclusive value="10"/>

  </xs:restriction>

</xs:simpleType>

The locally defined dfdl:alignment property with value '16' from the xs:simpleType 'newType' is combined with the locally defined dfdl:representation property with value 'binary' and the locally defined dfdl:binaryNumberRep property with value 'binary'.

<!-- SCHEMA1 -->

<xs:schema targetNamespace="http://tns1" xmlns:tns1="http://tns1" xmlns:tns2="http://tns2">

 

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format encoding="ASCII" byteOrder="littleEndian"

                initiator="" terminator=""

                sequenceKind="ordered"  />
    </xs:appinfo>
  </xs:annotation>

 

  <xs:import namespace="http://tns2" schemaLocation="SCHEMA2.xsd"/>

  <xs:element name="book">
    <xs:complexType>

      <xs:group ref="tns2:ggrp1" dfdl:separator=","></xs:group>

    </xs:complexType>
  </xs:element>

 

</xs:schema>

 

<!-- SCHEMA2 -->

<xs:schema targetNamespace="http://tns2" xmlns:tns2="http://tns2">

 

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format encoding="UTF-8" byteOrder="littleEndian"

                initiator=""

                sequenceKind="ordered"  />
    </xs:appinfo>
  </xs:annotation>

  <xs:group name="ggrp1" >

    <xs:sequence dfdl:separatorPosition="infix" >

      <xs:element name="customer" type="xs:string"

              dfdl:length="8" dfdl:lengthKind="explicit" />  

    </xs:sequence>

  </xs:group>

</xs:schema>

The DFDL properties applied to the xs:sequence in xs:group "ggrp1" in SCHEMA2 when referenced from the group reference in SCHEMA1 are:

  1. dfdl:separator "," from the group reference in SCHEMA1
  2. dfdl:separatorPosition "infix" from the group declaration in SCHEMA2
  3. dfdl:encoding "UTF-8", dfdl:initiator "" from the default dfdl:format annotation in SCHEMA2
  4. dfdl:terminator "" from the default dfdl:format annotation in SCHEMA1

9.     DFDL Processing Introduction

A DFDL Parser is an application or code library that takes as input a DFDL schema and a data stream.

It is able to use the DFDL schema description to interpret the data stream and realize the DFDL Information Model. This information set could then be written out (for example it could be realized as an XML text string) or it could be accessed by an application through an API (for example, a DOM-like tree could be created in memory for access by applications).

Symmetrically, there is a notion of a DFDL Unparser. The unparser works from an instance of the DFDL Information Model, a DFDL annotated schema and writes out to a target data stream in the appropriate representation formats.

Often both parser and unparser would be implemented in the same body of software and so we do not always distinguish them. Collectively they are called a DFDL Processor. The parser and unparser may, of course, be different bodies of software. Conforming DFDL processors may implement only a parser, because the unparser is an optional feature of DFDL.

9.1       Parser Overview

The DFDL logical parser is a recursive-descent parser[7] having guided, but potentially unbounded look ahead that is used to resolve points of uncertainty. A DFDL parser reads a specification (the DFDL schema) and it recursively walks down and up the schema as it processes the data. This is done in a manner consistent with the scoping of properties and variables described in Section 8  Property Scoping Rules.

The unbounded look ahead means that there are situations where the parser must speculatively attempt to parse data. When a processing error occurs during such an attempt, the parser suppresses the error, backs out, and makes another attempt.
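The suppress-and-retry behavior can be pictured with a toy parser for a point of uncertainty. The Python sketch below is a simplified illustration under stated assumptions: the names parse_choice, literal and ProcessingError are hypothetical, alternatives are assumed to be tried in the order given, and "backing out" is modeled simply by restarting every attempt from the same saved position.

```python
class ProcessingError(Exception):
    """Stands in for a DFDL processing error."""


def literal(text):
    """Build a branch parser that accepts exactly `text` at the current position."""
    def parse(data, pos):
        if data.startswith(text, pos):
            return text, pos + len(text)
        raise ProcessingError(f"expected {text!r} at offset {pos}")
    return parse


def parse_choice(data, pos, branches):
    """Speculatively try each branch; return (value, new_pos) of the first success."""
    last_error = None
    for branch in branches:
        try:
            # Every attempt starts from the saved position `pos`, so a
            # failed attempt leaves no trace (the parser has backed out).
            return branch(data, pos)
        except ProcessingError as err:
            last_error = err  # suppress the error and try the next branch
    # Only when every alternative has failed does the error become effective.
    raise ProcessingError("no alternative matched") from last_error
```

For example, parse_choice("xyz", 0, [literal("ab"), literal("xy")]) fails on the first branch, suppresses that error, and succeeds on the second.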

Implementations of DFDL may provide control mechanisms for limiting the speculative search behavior of DFDL parsers. The nature of these mechanisms is beyond the scope of the DFDL specification which defines the behavior of conforming parsers only on correct data. That is, data that can be parsed without any effective processing errors. Any such control mechanisms must be documented by the implementation and are thus implementation-defined.

The logical parser recursively descends the DFDL schema beginning with the distinguished global element declaration (specified for the processor in an implementation-defined manner, see Section 18).  Depending on the kind of schema construct that is encountered and the DFDL annotations on it, and the pre-existing context, the parser performs specific parsing operations on the data stream. These parsing operations typically recognize and consume data from the stream and construct values in the logical model. For values of complex types and for arrays, these logical model values may incorporate values created by recursive parsing.

DFDL Implementations are free to use whatever techniques for parsing they wish so long as the semantics are equivalent to that of the speculative recursive-descent logical parser described in this specification. It is required that implementations distinguish the various kinds of errors (schema definition error, processing error, etc.) no matter what time they are detected. Some implementations may not detect certain schema definition errors until data are being parsed; however, they must still distinguish schema definition errors (which indicate that the schema itself is not meaningful), from parsing errors (which indicate that the input data doesn't satisfy the requirements of the schema), or unparsing errors (which indicate that the infoset does not satisfy the requirements of the schema).

9.2       DFDL Data Syntax Grammar

Data in a format describable via a DFDL schema obeys the grammar given here. A given DFDL schema is read by the DFDL processor to provide specific meaning to the terminals and decisions in this grammar.

The bits of the data are divided into two broad categories:

  1. Content
  2. Framing

The content is the bits of data that are interpreted to compute a logical value.

Framing is the term used to describe the delimiters, length fields, and other parts of the data stream which are present, and may be necessary to determine the length or position of the content of DFDL Infoset items.

Note that sometimes the framing is not strictly necessary for parsing, but adds useful redundancy to the data format, allowing corrupt data to be more robustly detected, and sometimes the framing adds human readability to the data format.

In the grammar tables below, the terminal symbols are shown in bold italic font.

Productions

 

Document =  UnicodeByteOrderMark DocumentElement

DocumentElement = SimpleElement | ComplexElement

 

SimpleElement = SimpleLiteralNilElementRep | SimpleEmptyElementRep |

                            SimpleNormalRep

SimpleEnclosedElement = SimpleElement | AbsentElementRep

 

ComplexElement = ComplexLiteralNilElementRep | ComplexNormalRep |

                               ComplexEmptyElementRep

ComplexEnclosedElement = ComplexElement | AbsentElementRep

 

EnclosedElement = SimpleEnclosedElement | ComplexEnclosedElement

 

 

AbsentElementRep = Absent

 

 

SimpleEmptyElementRep =  EmptyElementLeftFraming EmptyElementRightFraming

ComplexEmptyElementRep =  EmptyElementLeftFraming EmptyElementRightFraming

 

EmptyElementLeftFraming = LeadingAlignment EmptyElementInitiator PrefixLength

EmptyElementRightFraming = EmptyElementTerminator TrailingAlignment

 

 

SimpleLiteralNilElementRep = NilElementLeftFraming [NilLiteralCharacters |

                                                 NilElementLiteralContent] NilElementRightFraming

ComplexLiteralNilElementRep = NilElementLeftFraming NilLiteralValue NilElementRightFraming

 

NilElementLeftFraming = LeadingAlignment NilElementInitiator PrefixLength

NilElementRightFraming = NilElementTerminator TrailingAlignment

 

NilElementLiteralContent = LeftPadding  NilLiteralValue RightPadOrFill

 

 

SimpleNormalRep = LeftFraming PrefixLength SimpleContent RightFraming

ComplexNormalRep = LeftFraming PrefixLength ComplexContent ElementUnused

                                    RightFraming

 

LeftFraming = LeadingAlignment Initiator

RightFraming = Terminator TrailingAlignment

 

PrefixLength = SimpleContent | PrefixPrefixLength SimpleContent

PrefixPrefixLength = SimpleContent

 

SimpleContent =   LeftPadding [ NilLogicalValue | SimpleValue ]  RightPadOrFill

 

ComplexContent = Sequence | Choice

 

 

Sequence =  LeftFraming SequenceContent RightFraming

SequenceContent = [ PrefixSeparator  EnclosedContent [ Separator EnclosedContent ]*

                                   PostfixSeparator ]

 

Choice = LeftFraming ChoiceContent RightFraming

ChoiceContent = [ EnclosedContent ] ChoiceUnused

 

EnclosedContent = [ EnclosedElement | Array | Sequence | Choice ]

 

Array = [ EnclosedElement [ Separator EnclosedElement ]*  [ Separator StopValue] ]

 

StopValue = SimpleElement

 

 

LeadingAlignment = LeadingSkip AlignmentFill

TrailingAlignment = TrailingSkip

RightPadOrFill = RightPadding | RightFill | RightPadding RightFill

 

Table 10 DFDL Grammar Productions

XML Schema and DFDL properties are used to control constraints on the terminals of the above grammar, as well as repetition (the "*" operator), and alternatives (the "|" operator). For a given set of XML Schema and DFDL properties, and prior data, any terminal may be allowed to be length zero, to contain specific data, or to contain a variety of different admissible data. 

Some definitions are needed to cover the range of representations that are possible in the data stream for an element. These definitions are with respect to the grammar above.

9.2.1       Nil Representation

An element occurrence has a nil representation if the element has XSDL nillable property 'true' and the occurrence either:

The LeadingAlignment, TrailingAlignment, PrefixLength regions may be present.

9.2.2       Empty Representation

An element occurrence has an empty representation if the occurrence does not have a nil representation and it conforms to the grammar for SimpleEmptyElementRep or ComplexEmptyElementRep. Specifically, the EmptyElementInitiator and EmptyElementTerminator regions must be conformant with dfdl:emptyValueDelimiterPolicy and the occurrence's content in the data stream is of length zero. (If non-conformant it is not a processing error and the representation is not empty). LeadingAlignment, TrailingAlignment, PrefixLength regions may be present.

The empty representation is special in DFDL, because when parsing it is this condition that can trigger the creation of a default value for an element occurrence. See Section 9.4 Element Defaults below about default values.

9.2.3       Normal Representation

An element occurrence has a normal representation if the occurrence does not have the nil representation or the empty representation and it conforms to the grammar for SimpleNormalRep or ComplexNormalRep.

9.2.4       Absent Representation

An element occurrence has an absent representation if the occurrence does not have a nil or empty or normal representation, and it conforms to the grammar for AbsentElementRep. Specifically, the occurrence's representation in the data stream is of length zero. Consequently, the Initiator, Terminator, LeadingAlignment, TrailingAlignment, PrefixLength regions must not be present.

Example of an absent representation: During unparsing, if an optional element does not have an item in the infoset then nothing is output. However if a separator of an enclosing structure is subsequently output as the immediate next thing, then a subsequent parse of the element may return a representation of length zero. If this happens, and this zero-length representation does not conform to any of the nil representation, the empty representation, or the normal representation, then it is the absent representation, and it behaves as if the element occurrence is 'missing'. (The term 'missing' is defined below.)

The point of this term 'absent representation', is that often we know the location where an element or group's representation would be in the data based on the delimiters of an enclosing group. (An example: if there are adjacent delimiters of an enclosing sequence.) When this location in the data, which is of zero length, cannot be a nil, empty, or normal representation, then we say it has absent representation, or "the representation is absent".

9.2.5       Zero-length Representation

We use the term zero-length representation to describe the situations where any of the above representations turn out to be of length zero due to specific combinations of data type and format properties:

The nil representation can be a zero-length representation if dfdl:nilValue is "%ES;", and there is no framing or framing is suppressed by dfdl:nilValueDelimiterPolicy.

The empty representation can be a zero-length representation if there is no framing or framing is suppressed by dfdl:emptyValueDelimiterPolicy.

The normal representation can be a zero-length representation if the type is xs:string or xs:hexBinary and there is no framing.

The absent representation always has a zero-length representation.

If the nil representation may be zero-length, then the absent representation cannot occur because zero-length will be interpreted as nil representation.

If the nil representation may not be zero length, but the empty representation is zero-length, then the absent representation cannot occur because zero-length will be interpreted as the empty representation.

If the nil and empty representations cannot be zero-length, but the normal representation may be zero-length, then the absent representation cannot occur because zero-length will be interpreted as a normal representation.

If the nil representation may not be zero-length, the empty representation is not zero-length, and the normal representation may not be zero-length, then a zero-length representation is the absent representation, or "is absent". 
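The four rules above amount to a fixed order of precedence for interpreting a zero-length representation. The Python sketch below restates them as a decision function; the function name and its three boolean parameters are hypothetical, and deciding whether each representation may be zero-length for a given element (from its type and format properties) is assumed to have been done beforehand.

```python
def classify_zero_length(nil_may_be_zero, empty_may_be_zero, normal_may_be_zero):
    """Say how a zero-length representation of an element is interpreted.

    Each argument states whether that representation can be of length
    zero for the element, given its type and format properties.
    """
    if nil_may_be_zero:
        return "nil"      # zero-length is interpreted as the nil representation
    if empty_may_be_zero:
        return "empty"    # otherwise as the empty representation
    if normal_may_be_zero:
        return "normal"   # otherwise as a normal representation
    return "absent"       # only then is the representation absent
```

The order of the tests carries the meaning: the absent representation is reached only when none of the other three representations can be zero-length.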

9.2.6       Missing

When parsing, an element occurrence is missing if it does not have nil, empty, or normal representations, or it has the absent representation.

When parsing, the term missing really covers two situations. Firstly it subsumes absent representation. Secondly it applies when an element does not have a representation at all in the data stream, that is, when we do not even have the constructs in the data stream to determine the location of the representation of the element; hence, none of the concepts above apply. This will be made clearer in the examples below. If an element occurrence is missing when parsing, no item is ever added to the Infoset.

When unparsing, an element occurrence is missing if there is no item in the infoset. For a required element occurrence, it is this condition that can trigger the creation of a default value in the augmented infoset. See Section 9.4 Element Defaults below about default values. For an optional element occurrence, no item is ever added to the augmented Infoset nor any representation ever output in the data stream.

 

9.2.7       Examples of Missing and Empty Representation

The following examples illustrate missing and empty representation.

<xs:sequence dfdl:separator="," dfdl:terminator="@"
             dfdl:separatorSuppressionPolicy="trailingEmpty">
       <xs:element name="A" type="xs:string"
                  dfdl:lengthKind="delimited"/>
       <xs:element name="B" type="xs:string" minOccurs="0"
                  dfdl:lengthKind="delimited"/>
       <xs:element name="C" type="xs:string" minOccurs="0"
                  dfdl:lengthKind="delimited"/>
</xs:sequence>

 

In data stream aaa,@ element B has the empty representation, and element C does not have a representation so is missing.

<xs:sequence dfdl:separator=","
             dfdl:separatorSuppressionPolicy="anyEmpty">
       <xs:element name="A" type="xs:string"
                  dfdl:lengthKind="delimited" dfdl:initiator="A:"
                  dfdl:emptyValueDelimiterPolicy="initiator"/>
       <xs:element name="B" type="xs:string" minOccurs="0"
                  dfdl:lengthKind="delimited" dfdl:initiator="B:"
                  dfdl:emptyValueDelimiterPolicy="initiator"/>
       <xs:element name="C" type="xs:string" minOccurs="0"
                  dfdl:lengthKind="delimited" dfdl:initiator="C:"
                  dfdl:emptyValueDelimiterPolicy="initiator"/>
</xs:sequence>

 

In data stream A:aaaa,C:cccc  element B does not have a representation so is missing.

In data stream A:aaaa,B:,C:cccc element B has the empty representation.

In the data stream A:aaaa,,C:cccc element B has the absent representation so is missing.

 

9.2.8       Round Trip Ambiguities

The overlapping nature of the possible representations (normal, empty, nil, and absent) creates a number of ambiguities where taking an Infoset, unparsing it, and reparsing it will result in a second Infoset that is not the same as the original. However, taking the second Infoset, unparsing it, and reparsing it will result in a third Infoset which is the same as the second.

When unparsing, if a string Infoset item happens to contain a string that matches either one of the nilValues or the default value, it is not given any special treatment. The string's characters are output, or if the value is the empty string, zero-length content is output. (In both cases along with an initiator or terminator if defined.) This creates an ambiguity where one can unparse an Infoset item which has member [nilled] false, but which when reparsed will produce an Infoset item which has member [nilled] true.

These ambiguities are natural and unavoidable. If the nilValue is the 3-character string "nil", then encountering the characters "nil" in the data stream will parse to produce an Infoset item with [nilled] true in the Infoset. If you unparsed a string infoset item with contents of the 3 characters "nil", this will be output as the letters "nil", which on parse will not produce a string with the characters "nil", but rather an Infoset item with no data value and member [nilled] true.

To avoid this issue, one can use validation, along with a pattern that prevents the string from matching any of the nil values.
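The round trip described above can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: the `parse` and `unparse` functions and the dictionary layout of an Infoset item are hypothetical, and the only format rule modelled is a single literal nil value of "nil".

```python
NIL_VALUE = "nil"  # assumed dfdl:nilValue for this illustration


def unparse(item: dict) -> str:
    """Output the nil value for a nilled item, else the string as-is."""
    if item["nilled"]:
        return NIL_VALUE
    # No special treatment, even if the value happens to equal NIL_VALUE.
    return item["value"]


def parse(text: str) -> dict:
    """Text equal to the nil value always parses as a nilled item."""
    if text == NIL_VALUE:
        return {"nilled": True, "value": None}
    return {"nilled": False, "value": text}


first = {"nilled": False, "value": "nil"}   # a string that collides with the nil value
second = parse(unparse(first))              # differs from first
third = parse(unparse(second))              # same as second
```

In this model `first` and `second` differ, while `second` and `third` are identical, which is the stabilizing behavior the section describes.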

 

9.3       Parsing Algorithm

A DFDL parser proceeds by determining the existence of occurrences of schema components. It does this by examining the data and the schema, so as to:

a)     Establish representation

b)    Resolve points of uncertainty

These two activities are defined below. They are mutually recursive in the expected way as a DFDL schema is a recursive nest of schema components.

Establishing the representation of an occurrence of a schema component and resolving points of uncertainty involve the concepts of known-to-exist and known-not-to-exist.

9.3.1       Known-to-exist and Known-not-to-exist

9.3.1.1      Known-to-exist

An occurrence of a schema component is said to be known-to-exist when any of these positive discriminations hold:

  1. There is a dfdl:discriminator[8] applying to the component and its expression evaluates to true or regular expression pattern matches.
  2. The component is a direct child of an xs:sequence or xs:choice with dfdl:initiatedContent 'yes' and an initiator defined for the component is found.
  3. The component is a direct child of an xs:choice with dfdl:choiceDispatchKey and the result of the dfdl:choiceDispatchKey expression matches the dfdl:choiceBranchKey property of the child.

If none of those hold because they are not applicable then the occurrence is still known-to-exist if ALL of the following hold, and no processing error occurs during their determination:

  • There are dfdl:asserts with failureType 'processingError' on the component and all their expressions evaluate to true or their regular expression patterns match.
  • It has nil, empty, or normal representation.

When it has normal representation, this of course implies that the content of the representation is convertible to the element type without error.

Note that validation errors or recoverable errors do not prevent determination that a component is known-to-exist.

9.3.1.2      Processing Error After Determining Known-to-exist

Note that it is possible for an occurrence of a schema component to be known-to-exist due to a positive discrimination, but then subsequently a processing error occurs when evaluating a statement annotation such as a dfdl:assert or a dfdl:setVariable, or a processing error occurs when determining the representation, or in the case of normal representation and simpleType, when converting that representation's content into a value of the type. This processing error does not change the fact that the schema component was determined to be known-to-exist. This is important in the discussion of resolving Points of Uncertainty below.

9.3.1.3      Known-not-to-exist

An occurrence of a schema component is known-not-to-exist when any of these negative discriminations holds:

  1. There is a dfdl:discriminator applying to the component and its expression evaluates to false or regular expression pattern fails to match, or a processing error occurs while processing the dfdl:discriminator.
  2. The component is a direct child of an xs:sequence or xs:choice with dfdl:initiatedContent 'yes' and an initiator defined for the component is not found.
  3. The component is a direct child of an xs:choice with dfdl:choiceDispatchKey and the result of the dfdl:choiceDispatchKey expression does not match the dfdl:choiceBranchKey property of the child.

If none of those hold because they are not applicable, then a schema component is known-not-to-exist when any of the following hold:

  1. The occurrence is missing
  2. There is a dfdl:assert with failureType 'processingError' on the component and its expression evaluates to false or its regular expression pattern fails to match, or a processing error occurs while processing the dfdl:assert.
  3. A processing error occurs when parsing the component. Processing errors include, but are not limited to, inability to identify any of nil, empty, normal or absent representations, or failure to convert a value to the built-in logical type.

Note that validation errors or recoverable errors do not cause a component to be known-not-to-exist.

Note: based on the above, when processing a sequence for which a separator is defined, the presence of a match in the data for the separator is not sufficient to cause the parser to determine that an associated component is known-to-exist. See Section 14.2 Sequence Groups with Separators  for details.
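The two-stage decision in Sections 9.3.1.1 and 9.3.1.3 can be summarized in a few lines. This is a simplified sketch rather than a normative algorithm. The function and parameter names are assumptions: `discrimination` is True or False when a discriminator, initiated-content initiator, or choice dispatch applies, and None when none of them is applicable.

```python
from typing import Optional


def classify_existence(discrimination: Optional[bool],
                       asserts_hold: bool,
                       representation: str,
                       processing_error: bool = False) -> str:
    """Classify an occurrence as known-to-exist or known-not-to-exist."""
    # Stage 1: positive or negative discrimination decides immediately.
    # A later processing error does not undo a positive discrimination.
    if discrimination is True:
        return "known-to-exist"
    if discrimination is False:
        return "known-not-to-exist"
    # Stage 2: no discrimination is applicable.
    if processing_error or not asserts_hold:
        return "known-not-to-exist"
    if representation in ("nil", "empty", "normal"):
        return "known-to-exist"
    # Absent representation or no representation at all: the occurrence is missing.
    return "known-not-to-exist"
```

Validation errors and recoverable errors are deliberately not parameters here, since they play no part in the determination.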

9.3.2       Establishing Representation

Unless an element occurrence is known-not-to-exist, it must be established whether it has the nil, empty, normal, or absent representation.

The first step is to see if the content is trivially of length zero. This is dfdl:lengthKind dependent.

9.3.2.1      Simple element

If the result is length zero as described above, the representation is then established by checking, in order, for:

  1. nil representation (if %ES; is a literal nil value).
  2. empty representation.
  3. normal representation (xs:string or xs:hexBinary only)
  4. absent representation (if none of the prior representations apply).

If the result is not length zero, the representation is then established by checking, in order, for:

  1. nil representation (as a literal nil value)
  2. nil representation (as a logical nil value)
  3. normal representation
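The checking order for a simple element whose content is not of length zero can be sketched as follows. This is an illustrative simplification: the function name and parameters are assumptions, both kinds of nil value are accepted at once only to show the order of the checks, and a failed conversion simply raises the converter's own exception where a real processor would report a processing error.

```python
from typing import Any, Callable, Collection, Optional, Tuple


def establish_nonzero(text: str,
                      literal_nil_values: Collection[str],
                      logical_nil_values: Collection[Any],
                      convert: Callable[[str], Any]) -> Tuple[str, Optional[Any]]:
    """Return (representation, value) for non-zero-length simple content."""
    # 1. Literal nil: compare the raw text before any conversion.
    if text in literal_nil_values:
        return ("nil", None)
    # 2. Logical nil: convert first, then compare the logical value.
    value = convert(text)
    if value in logical_nil_values:
        return ("nil", None)
    # 3. Otherwise the content is the normal representation.
    return ("normal", value)
```

With `literal_nil_values={"NIL"}`, `logical_nil_values={-1}` and `convert=int`, the texts "NIL" and "-1" both yield a nil representation, while "42" yields the normal representation with value 42.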

9.3.2.2      Complex element

If the result is length zero as described above, the representation is then established by checking for:

  1. nil representation (if %ES; is a literal nil value).

To establish any other representation requires that the parser descends into the complex type for the element, and returns successfully (that is, no unsuppressed processing error occurs). If the result is zero bits consumed, the representation is then established by checking, in order, for:

  1. empty representation.
  2. absent representation (if none of the prior representations apply).

Otherwise the element has normal representation.

Note: The DFDL parser shall not recursively parse the schema components inside a complex element when it has already established that the element occurrence is missing[11].

9.3.3       Points of Uncertainty

A point of uncertainty occurs in the data stream when there is more than one schema component that might occur at that point. Points of uncertainty can be nested.

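Any one of the following constructs is a potential point of uncertainty:

  • An xs:choice
  • An element in an unordered xs:sequence
  • An element in a sequence with one or more floating elements
  • An optional element or an array element with a variable number of occurrences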

The parser resolves these points of uncertainty by way of a set of construct-specific rules given below along with determining whether schema components are known-to-exist or known-not-to-exist. For some of these constructs, whether there is an actual point of uncertainty depends on the representation of the constructs in the data.

An xs:choice is always a point of uncertainty. It is resolved sequentially, or by direct dispatch. Sequential choice resolution occurs by parsing each choice branch in schema definition order until one is known-to-exist. It is a processing error if none of the choice branches are known-to-exist. Direct-dispatch choice resolution occurs by matching the value of the dfdl:choiceDispatchKey property to the value of the dfdl:choiceBranchKey property of one of the choice branches. It is a processing error if none of the choice branches have a matching value in their dfdl:choiceBranchKey property.
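The two ways of resolving a choice can be contrasted in a short sketch. It is a simplified model under stated assumptions: `ProcessingError`, `resolve_sequential` and `resolve_by_dispatch` are hypothetical names, a branch is reduced to a predicate that reports whether it is known-to-exist for the data, and each branch carries a single branch key.

```python
from typing import Callable, Dict, List, Tuple


class ProcessingError(Exception):
    """Stand-in for a DFDL processing error (hypothetical name)."""


def resolve_sequential(branches: List[Tuple[str, Callable[[str], bool]]],
                       data: str) -> str:
    """Try each branch in schema definition order until one is known-to-exist."""
    for name, known_to_exist in branches:
        if known_to_exist(data):
            return name
    raise ProcessingError("no choice branch is known-to-exist")


def resolve_by_dispatch(branch_keys: Dict[str, str], dispatch_key: str) -> str:
    """Select the branch whose branch key equals the dispatch key value."""
    for name, key in branch_keys.items():
        if key == dispatch_key:
            return name
    raise ProcessingError("no branch key matches the dispatch key")


# Two branches distinguished by an assumed text prefix.
branches = [("a", lambda d: d.startswith("A:")),
            ("b", lambda d: d.startswith("B:"))]
```

Sequential resolution may attempt several branches before one succeeds, whereas direct dispatch never speculatively parses a branch: the key comparison alone selects it.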

An element in an unordered xs:sequence is always a point of uncertainty. It is resolved by parsing for the child components of the sequence in schema definition order at each point in the data stream where a component can exist until the required number of occurrences of each child component is known-to-exist or the sequence is terminated by delimiters or specified length.

An element in a sequence with one or more floating elements is always a point of uncertainty. It is resolved by parsing for the expected element at that point in the data stream. If the expected element is known-not-to-exist then an occurrence of each floating element is parsed in schema definition order.

When parsing an array, points of uncertainty only occur for certain values of occursCountKind, as follows:

occursCountKind    Details of Point of Uncertainty
fixed              No point of uncertainty (maxOccurs occurrences expected).
implicit           A point of uncertainty exists after minOccurs occurrences found and until maxOccurs found.
parsed             A point of uncertainty exists for all occurrences.
expression         No point of uncertainty (occursCount occurrences expected).
stopValue          No point of uncertainty (stopValue must always be present, even when minOccurs is 0).

Table 11: Points of Uncertainty and dfdl:occursCountKind

An optional element point of uncertainty is resolved by parsing the element until it is either known-to-exist or known-not-to-exist. Whether an optional element is an actual point of uncertainty depends on property dfdl:occursCountKind as described above. (Property dfdl:occursCountKind is defined in Section 16.1 dfdl:occursCountKind property.)

For an array element, the point of uncertainty is resolved for each occurrence separately by parsing the occurrence until it is either known-to-exist or known-not-to-exist.  
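The table of dfdl:occursCountKind values can be read as a predicate on the occurrence about to be parsed. The following is a minimal sketch of that reading, assuming a hypothetical function name, a 1-based occurrence index, and `None` to stand for an unbounded maxOccurs.

```python
from typing import Optional


def is_point_of_uncertainty(occurs_count_kind: str,
                            index: int,
                            min_occurs: int,
                            max_occurs: Optional[int]) -> bool:
    """Say whether parsing occurrence `index` (1-based) is a point of uncertainty."""
    if occurs_count_kind in ("fixed", "expression", "stopValue"):
        return False   # the number of occurrences is determined by other means
    if occurs_count_kind == "parsed":
        return True    # every occurrence is speculative
    if occurs_count_kind == "implicit":
        # Uncertain only after minOccurs occurrences and until maxOccurs.
        within_max = max_occurs is None or index <= max_occurs
        return index > min_occurs and within_max
    raise ValueError(f"unknown occursCountKind: {occurs_count_kind}")
```

For an array with minOccurs 1 and maxOccurs 3 under 'implicit', the first occurrence is required and therefore not a point of uncertainty, while the second and third are.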

9.3.3.1      Nested Points of Uncertainty

A point of uncertainty can be resolved because a schema component has been determined to be known-to-exist due to positive discrimination. In that case, if a subsequent processing error occurs when completing the parsing of that schema component, this will cause the next enclosing schema component surrounding this point of uncertainty to be determined to be known-not-to-exist.

For example, when parsing an element occurrence for an array with a variable number of occurrences, a positive discrimination tells the parser that the currently-being-parsed occurrence is known-to-exist. If a subsequent processing error occurs while completing the parsing of this occurrence, then the entire array is then known-not-to-exist.

Another example is a choice. If a discriminator resolves the choice point of uncertainty to the first of the choice's alternatives, a subsequent processing error causes the entire choice construct to be determined to be known-not-to-exist.

This will cause the next enclosing point of uncertainty to try the next possible alternative, or if there isn't one, will cause an unsuppressed processing error. 

The behavior of a DFDL processor on an unsuppressed processing error is not specified, but it is allowable for implementations to abort further parsing. Any other behavior is implementation-defined.

9.4       Element Defaults

A DFDL processor can create element defaults in the Infoset for both simple and complex elements. This happens quite differently for parsing and unparsing as will be explained in this section.

9.4.1       Definition 'default value'

A simple element has a default value if any of these are true:

  1. The XSDL default property exists. The default value is the property's value.
  2. The XSDL fixed property exists. The default value is the property's value.
  3. The element has XSDL nillable 'true' and dfdl:useNilForDefault is 'yes'. The corresponding Infoset item will have the [nilled] member true, and the [dataValue] member will have no value.

9.4.2       Element Defaults When Parsing

If empty representation is established when parsing, the possibility of applying an element default arises. Essentially, if a required occurrence of an element has empty representation, then an element default will be applied if present, though there are a couple of variations on this rule. Remember that in order to have established empty representation, the occurrence must be compliant with the dfdl:emptyValueDelimiterPolicy for the element, and for a complex element the parser must have descended into the type and returned with no unsuppressed processing error.

The rules for applying element defaults are not dependent on dfdl:occursCountKind. However, if a required occurrence does not produce an item in the Infoset after the rules have been applied, then whether it is a processing error or a validation error (if validation is enabled) does depend on dfdl:occursCountKind (see Section 16.1 dfdl:occursCountKind property).

There are three main cases to consider:

9.4.2.1      Simple element (not xs:string and not xs:hexBinary)

Required occurrence: If the element has a default value then an item is added to the Infoset using the default value, otherwise nothing is added to the Infoset.

Optional occurrence: Nothing is added to the Infoset.

9.4.2.2      Simple element (xs:string or xs:hexBinary)

Required occurrence: If the element has a default value then an item is added to the infoset using the default value, otherwise an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value.

Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none'[12] then an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value, otherwise nothing is added to the Infoset.

Note: To prevent unwanted empty strings or empty hexBinary values from being added to the Infoset, use XSD minLength > '0' and a dfdl:assert that uses the dfdl:checkConstraints() function, to raise a processing error.
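The rules for simple elements in Sections 9.4.2.1 and 9.4.2.2 can be written as one small decision function. This is a sketch under stated assumptions: the function name, its parameters, and the (added, value) return shape are hypothetical, and a default arising from dfdl:useNilForDefault is not modelled.

```python
from typing import Any, Optional, Tuple


def default_on_empty(xsd_type: str,
                     required: bool,
                     default: Optional[Any] = None,
                     empty_value_delimiter_policy: str = "none") -> Tuple[bool, Any]:
    """Return (item_added, value) for a simple element with empty representation."""
    is_string_like = xsd_type in ("xs:string", "xs:hexBinary")
    empty_value = b"" if xsd_type == "xs:hexBinary" else ""
    if required:
        if default is not None:
            return (True, default)          # the default value is applied
        if is_string_like:
            return (True, empty_value)      # empty string or empty hexBinary
        return (False, None)                # nothing is added to the Infoset
    # Optional occurrence: only string-like types can produce an item.
    if is_string_like and empty_value_delimiter_policy != "none":
        return (True, empty_value)
    return (False, None)
```

For example, a required xs:int with default 0 yields `(True, 0)`, whereas an optional xs:string with dfdl:emptyValueDelimiterPolicy 'none' yields `(False, None)`.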

9.4.2.3      Complex element

Required occurrence: An item is added to the Infoset.

Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none'[13] then an item is added to the Infoset, otherwise nothing is added to the Infoset.

For both required and optional occurrences, the Infoset item may also have a child item.

  1. If the first child element of the complex type is a required simple element, then an empty string (type xs:string), empty hexBinary (type xs:hexBinary), or default value will also be added to the Infoset.
  2. If the first child element of the complex type is a required complex element, then an item is added to the Infoset (which may itself have a child via (1))

As an example, consider a sequence S0 with a separator that contains among other content an optional non-nillable non-initiated element E1 of complex type. The content of the type is a sequence S1 with a different separator and the first child is a required non-initiated element E2 of type xs:string. The dfdl:lengthKind of both E1 and E2 is 'delimited'. The representation of E1 has zero length, that is, the data contains adjacent S0 separators. On processing E1, the parser will establish a point of uncertainty and descend into E1's complex type and process E2. It scans for in-scope delimiters and immediately encounters S0 separator. E2 has the empty representation, so E1 is added to the Infoset along with a value of empty string for E2. All other content of S1 is missing, so the parser returns from the descent. E1 is therefore known-to-exist. Because the position in the data has not changed, E1 therefore has the empty representation. Because E1 is empty and optional it is not added to the Infoset, and the Infoset items for E1 and E2 are discarded.

9.4.3       Element Defaults When Unparsing

If an element is missing from the Infoset when unparsing, the possibility of applying an element default arises.  Essentially if a required occurrence of an element is missing, then an element default will be applied if present, and the resulting item is added to the augmented Infoset.

The rules for applying element defaults are not dependent on dfdl:occursCountKind. However if a required occurrence does not produce an item in the augmented Infoset after the rules have been applied then whether it is a processing error or a validation error (if enabled) is  dependent on dfdl:occursCountKind (see Section 16.1 dfdl:occursCountKind property).

There are two main cases to consider.

9.4.3.1      Simple element

Required occurrence: If an element has a default value then an item is added to the augmented Infoset using the default value, otherwise nothing is added.

Optional occurrence: Nothing is added to the augmented Infoset.

9.4.3.2      Complex element

Required occurrence: An item is added to the augmented Infoset as specified below.

Optional occurrence: Nothing is added to the augmented Infoset.

For a required occurrence, the unparser descends into the complex type:

For a sequence, each child element is examined in schema order and the rules for simple and complex elements applied (recursively). The lack of a default may give rise to a processing error, as described above.

For a choice, each branch is examined in schema order and the above rules applied recursively to the branch. The lack of a default may give rise to a processing error, as described above, and if so the error is suppressed and the next branch is tried, otherwise that branch is selected. It is a processing error if no choice branch is ultimately selected.

9.5       Evaluation Order for Statement Annotations

Given a component of a DFDL schema, there is a resolved set of annotations for it.

Of these, some are statement annotations and the order of their evaluation relative to the actual processing of the schema component itself (parsing or unparsing via its format annotations) is as given in the ordered lists below.

For elements and element refs:

1.     dfdl:discriminator or dfdl:assert(s) with testKind 'pattern' (parsing only)

2.     dfdl:element following property scoping rules

3.     dfdl:setVariable(s) - in lexical order, innermost schema component first

4.     dfdl:discriminator or dfdl:assert(s) with testKind 'expression' (parsing only)

For sequences, choices and group refs:

  1. dfdl:discriminator or dfdl:assert(s) with testKind 'pattern' (parsing only)
  2. dfdl:newVariableInstance(s) - in lexical order, innermost schema component first
  3. dfdl:setVariable(s) - in lexical order, innermost schema component first
  4. dfdl:sequence or dfdl:choice or dfdl:group following property scoping rules
  5. dfdl:discriminator or dfdl:assert(s) with testKind 'expression' (parsing only)

The dfdl:setVariable annotations at any one annotation point of the schema are always executed in lexical order. However, dfdl:setVariable annotations can also be found in different annotation points that are combined into the resolved set of annotations for one schema component. In this case, the order of execution of the dfdl:setVariable statements from any one annotation point remains lexical. The order of execution of the dfdl:setVariable annotations from different annotation points follows the principle of innermost first, meaning that a schema component that references another schema component has its dfdl:setVariable statements executed after those of the referenced schema component. For example, if an element reference and an element declaration both have dfdl:setVariable statements, then those on the element declaration will execute before those on the element reference. Similarly, dfdl:setVariable statements on a base simple type execute before those of a simple type derived from it. The dfdl:setVariable statements on a simple type execute before those on an element having that simple type (whether by reference, or when the simple type is lexically nested within the element declaration). The dfdl:setVariable statements on the sequence or choice within a global group definition execute before those on a group reference.

The dfdl:newVariableInstance annotations at any one annotation point of the schema are always executed in lexical order. However, dfdl:newVariableInstance annotations can also be found in different annotation points that are combined into the resolved set of annotations for one schema component. In this case, the order of execution of the dfdl:newVariableInstance statements from any one annotation point remains lexical. The order of execution of the dfdl:newVariableInstance annotations from different annotation points follows the principle of innermost first, meaning that a schema component that contains or references another schema component has its dfdl:newVariableInstance statements executed after those of the contained or referenced schema component. For example, if a group reference and the sequence or choice group of a group definition both have dfdl:newVariableInstance statements, then those on the global group definition will execute before those on the group reference.
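The "innermost first, lexical within one annotation point" ordering can be illustrated with a short sketch. The function name and the input layout are assumptions: annotation points are listed from the outermost (referencing) component to the innermost (referenced) one, each with its statements already in lexical order.

```python
from typing import List, Tuple


def execution_order(annotation_points: List[Tuple[str, List[str]]]) -> List[str]:
    """Flatten statements so that inner annotation points execute first."""
    ordered: List[str] = []
    # Walk from the innermost point outwards, keeping lexical order inside each.
    for _name, statements in reversed(annotation_points):
        ordered.extend(statements)
    return ordered


# Outermost first: element reference, element declaration, derived type, base type.
points = [
    ("element reference", ["ref-1"]),
    ("element declaration", ["decl-1", "decl-2"]),
    ("derived simple type", ["derived-1"]),
    ("base simple type", ["base-1"]),
]
order = execution_order(points)
```

Here `order` starts with the base simple type's statement and ends with the element reference's statement, with "decl-1" still ahead of "decl-2".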

9.5.1       Asserts and Discriminators with testKind 'expression'

Implementations are free to optimize by recognizing and executing discriminators or asserts with testKind 'expression' earlier so long as the resulting behavior is consistent with what results from the description above.

9.5.2       Discriminators with testKind 'expression'

When parsing, an attempt to evaluate a discriminator must be made even if preceding statements or the parse of the schema component ended in a processing error.

This is because a discriminator's expression could evaluate to true thereby resolving a point of uncertainty even if the complete parsing of the construct ultimately caused a processing error.

Such discriminator evaluation has access to the DFDL Infoset of the attempted parse as it existed immediately before detecting the parse failure. Attempts to reference parts of the DFDL Infoset that do not exist are processing errors.

9.5.3       Elements and setVariable

The resolved set of dfdl:setVariable statements for an element are executed after the parsing of the element. This is in contrast to the resolved set of dfdl:setVariable statements for a group which are executed before the parsing of the group.

For elements, this implies that these variables are set after the evaluation of expressions corresponding to any computed DFDL properties for that element, and so the variables may not be referenced from expressions that compute these DFDL properties.

That is, if an expression is used to provide the value of a property (such as dfdl:terminator or dfdl:byteOrder), the evaluation of that property expression occurs before any dfdl:setVariable annotations from the resolved set of annotations for that element are executed; hence, the expression providing the value of the property may not reference the variable. Schema authors can insert sequences to provide more precise control over when variables are set.

 

10.  Core Representation Properties and their Format Semantics

The next sections specify the core set of DFDL v1.0 properties that may be used in DFDL annotations in DFDL Schemas to describe data formats.

It is a schema definition error when a DFDL schema does not contain a definition for a representation property that is needed to interpret the data. For example, a DFDL schema containing any textual data must provide a definition of the character set encoding property (dfdl:encoding) for that textual data, and if it is not part of the format properties context for that data, then it is a schema definition error.

Furthermore, no default values are provided for representation properties as built-in definitions by any DFDL processor. This requires DFDL schemas to be explicit about the representation properties of the data they describe, and avoids any possibility of DFDL schemas that are meaningful for some DFDL processors but not others.

The properties are organized as follows:

Where properties are specific to a physical representation, the property name may choose to reflect this. Where properties are related to a specific logical type grouping (defined below), the property name may choose to reflect this.

A limited number of properties can take a DFDL expression, which must return a value of the proper type for the property. Properties that take an expression state this explicitly in their description. Other properties do not take an expression.

The property description defines which schema components the property may be specified on. In addition, all the DFDL properties may be specified on a dfdl:format annotation.

11.  Properties Common to both Content and Framing

Property Name

Description

byteOrder

Enum or DFDL Expression

Valid values 'bigEndian', 'littleEndian'. 

This property can be computed by way of an expression which returns the string 'bigEndian' or 'littleEndian'. The expression must not contain forward references to elements which have not yet been processed.  

Note that there is, intentionally, no such thing as 'native' endian[14].

This property applies to all Number, Calendar, and Boolean types with representation binary. Specifically that is binary integers, binary booleans, all packed decimals, binary floats, binary seconds and binary milliseconds.

This property is never used to establish the byte order for text/strings with Unicode fixed-width encodings that do not specify the byte order (UTF-16 and UTF-32). See Section 11.1 Unicode Byte Order Mark (BOM) for details.

Annotation: dfdl:element, dfdl:simpleType
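The effect of the two byteOrder values on a binary integer can be shown with Python's built-in `int.from_bytes`. This is only an illustration: `decode_unsigned` is a hypothetical helper, and the mapping from the DFDL enum values to Python's "big" and "little" is an assumption of the sketch.

```python
def decode_unsigned(raw: bytes, byte_order: str) -> int:
    """Interpret raw bytes as an unsigned binary integer."""
    mapping = {"bigEndian": "big", "littleEndian": "little"}
    if byte_order not in mapping:
        # There is intentionally no 'native' option.
        raise ValueError(f"invalid byteOrder: {byte_order}")
    return int.from_bytes(raw, byteorder=mapping[byte_order])


raw = bytes([0x12, 0x34])
big = decode_unsigned(raw, "bigEndian")        # 0x1234
little = decode_unsigned(raw, "littleEndian")  # 0x3412
```

The same two bytes therefore yield 0x1234 or 0x3412 depending solely on the property value, which is why a schema must state it explicitly.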

bitOrder

Enum

Valid values 'mostSignificantBitFirst', 'leastSignificantBitFirst'. 

The bits of a byte each have a place value or significance of 2^n, for n from 0 to 7. Hence, the byte value 255 = 2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0. A bit can always be unambiguously identified as the 2^n-bit.

The bit order is the correspondence of a bit's numeric significance to the bit position (1 to 8) within the byte.

Value 'mostSignificantBitFirst' means:

  • The 2^7 bit is first, i.e., has bit position 1.
  • In general the 2^n bit has position 8 - n.
  • The least significant bits of byte N are considered to be adjacent to the most significant bits of byte N+1.

Value 'leastSignificantBitFirst' means:

  • The 2^0 bit is first, i.e., has bit position 1.
  • In general the 2^n bit has position n + 1.
  • The most significant bits of byte N are considered to be adjacent to the least significant bits of byte N+1.

This property applies to all content and framing since it determines which bits of a byte occupy what bit positions. Content and framing are defined in terms of regions of the data stream, and these regions are defined in terms of the starting bit position and ending bit position; hence, dfdl:bitOrder is relevant to determining the specific bits of any grammar region (see Section 9.2) when the region's starting bit position or ending bit position are not on a byte boundary. 

The bit order can only change on byte boundaries, and alignment of up to 7 bits will be inserted to ensure byte-alignment whenever the bit order changes.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group 
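The correspondence between a bit's significance and its bit position under each bitOrder value can be checked with a few lines of Python. The function name is hypothetical; the arithmetic is taken directly from the two rules in the bitOrder description (position 8 - n, and position n + 1).

```python
def bit_position(n: int, bit_order: str) -> int:
    """Return the 1-based position within a byte of the bit with significance 2**n."""
    if not 0 <= n <= 7:
        raise ValueError("n must be between 0 and 7")
    if bit_order == "mostSignificantBitFirst":
        return 8 - n
    if bit_order == "leastSignificantBitFirst":
        return n + 1
    raise ValueError(f"invalid bitOrder: {bit_order}")
```

Under 'mostSignificantBitFirst' the bit with n = 7 is at position 1 and the bit with n = 0 at position 8; under 'leastSignificantBitFirst' the positions are reversed.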

encoding

Enum or DFDL Expression

Values are one of:

·       IANA charset name

·       CCSID[15]

·       DFDL standard encoding name

·       Implementation-specific encoding name

This property can be computed by way of an expression which returns an appropriate string value. The expression must not contain forward references to elements which have not yet been processed. 

Note that there is, deliberately, no concept of 'native' encoding[16].

Conforming DFDL v1.0 processors must accept at least 'UTF-8', 'UTF-16', 'UTF-16BE', 'UTF-16LE', 'ASCII', and 'ISO-8859-1' as encoding names.

Encoding names are case-insensitive, so 'utf-8' and 'UTF-8' are equivalent.

Unicode character set encodings that do not specify a byte order (such as UTF-16 or UTF-32) can have their byte-order controlled by a document-level byte-order-mark (BOM). See Section 11.1 Unicode Byte Order Mark (BOM) for details.

The encoding name 'UTF-8' is interpreted strictly and does not include variants such as CESU-8.

DFDL standard encoding names are defined in Section 34 Appendix D: DFDL Standard Encodings. When supported, a conforming DFDL implementation must implement them in a uniform manner so that they are portable across all DFDL implementations that implement them.

Additional implementation-defined encoding names may be provided only for character set encodings for which there is no IANA name standard nor CCSID standard nor DFDL standard encoding. These implementation-defined encodings must have "X-" as a prefix to their name, as they are subject to being superseded by IANA or DFDL standard encoding names.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

utf16Width

Enum

Valid values are 'fixed', 'variable'.

Applies only when encoding is 'UTF-16', 'UTF-16BE', 'UTF-16LE' or their CCSID equivalents.

Specifies whether the encoding 'UTF-16' should be treated as a fixed or variable width encoding. 'UTF-16' can contain characters which require two codepoints (called a surrogate pair) to represent. When utf16Width is 'fixed', these surrogate code points are treated as separate characters. When utf16Width is 'variable', then surrogate pairs are converted into a single character on parsing, and such a character is split into two characters on unparsing.

When utf16Width is 'variable', then on parsing an un-paired surrogate codepoint causes a decode error, which can be controlled via dfdl:encodingErrorPolicy described below.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

ignoreCase

Enum

Valid values are 'yes', 'no'.

Specifies whether mixed-case data is accepted when matching delimiters and data values on input.

This affects the behavior of matching for these properties: dfdl:initiator, dfdl:terminator, dfdl:separator, dfdl:nilValue, dfdl:textStandardExponentRep, dfdl:textStandardInfinityRep, dfdl:textStandardNaNRep, dfdl:textStandardZeroRep, dfdl:textBooleanTrueRep, and dfdl:textBooleanFalseRep.

Property ignoreCase plays no part when comparing an element value with an XSDL enum facet, matching an element value to an XSDL pattern facet, or comparing an element value with the XSDL fixed property. It is therefore not used by validation (when validation is enabled), nor by the dfdl:checkConstraints function.

On unparsing, the delimiters or values are always output as specified.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

encodingErrorPolicy

Enum

Valid values are 'error' or 'replace'.

This property applies whenever dfdl:encoding is applicable.

This property provides control of how decoding and encoding errors are handled when converting the data to text, or text to data. This includes converting when scanning for delimiters, matching regular expression length or test patterns, matching textual data type representation patterns against the data, and of course isolating the text content that will become the value of an element (parsing) or constructing the content from the value (unparsing).

When parsing, an error can occur when decoding characters from their encoded form into the DFDL Infoset character set (ISO 10646). This can occur due to invalid byte sequences, or not enough bytes found to make up the full encoding of a character.

If 'replace', then the Unicode replacement character (U+FFFD) is substituted for the offending data, one replacement character for any incorrect fragment of an encoding.

If 'error' then a processing error occurs.

When unparsing, the errors that can occur when encoding characters from Unicode/ISO 10646 into the specified encoding include when no mapping is provided by the encoding character set specification and when there is not enough space to output the entire encoding of the character (e.g., need 2 bytes for a 2-byte character codepoint, but only 1 byte remains in the available length.)

If 'replace' then encoding-specific replacement/substitution character is output. It is a processing error if no such character is defined, and it is a processing error if there is any error when attempting to output the replacement (such as not enough room for the representation of the entire encoding of the replacement character).

If 'error' then a processing error occurs.

See Section 11.2 Character Encoding and Decoding Errors for further details.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

Table 12 Properties Common to both Content and Framing
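The effect of dfdl:utf16Width on character counting, as described in Table 12, can be illustrated with a minimal Python sketch. The function name utf16_character_count is hypothetical and not part of DFDL, the data is assumed to be big-endian UTF-16, and Python's strict decoder stands in for a DFDL processor that raises a decode error on an unpaired surrogate.

```python
def utf16_character_count(data: bytes, utf16_width: str) -> int:
    """Count characters in big-endian UTF-16 data for a given dfdl:utf16Width.

    Hypothetical helper for illustration only.
    """
    if utf16_width == "fixed":
        # Every 16-bit code unit is one character, surrogates included.
        return len(data) // 2
    if utf16_width == "variable":
        # The strict decoder combines each surrogate pair into one character
        # and raises UnicodeDecodeError on an unpaired surrogate.
        return len(data.decode("utf-16-be"))
    raise ValueError("utf16Width must be 'fixed' or 'variable'")


# 'A' followed by U+1F600, which needs a surrogate pair in UTF-16.
sample = "A\U0001F600".encode("utf-16-be")
fixed_count = utf16_character_count(sample, "fixed")        # 3 code units
variable_count = utf16_character_count(sample, "variable")  # 2 characters
```

The same six bytes therefore count as three characters under 'fixed' and two under 'variable', which matters whenever dfdl:lengthUnits is 'characters'.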

11.1     Unicode Byte Order Mark (BOM)

DFDL provides automatic detection and generation of a Unicode BOM at the document level and saves (for parsing), or retrieves (for unparsing) the BOM information from the DFDL Infoset [unicodeByteOrderMark] member.

Parsing behaviour: When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-8, UTF-16, or UTF-32 (or CCSID equivalents), then a DFDL parser will look for the appropriate BOM as the very first bytes in the data stream. 

UTF-8.  If a BOM is found[17] then this is used to set the document information item [unicodeByteOrderMark] member. If no BOM is found the parser takes no action. There is no need to model the BOM explicitly.

UTF-16.  If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.

UTF-32.  If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.

When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID equivalents), then a DFDL parser will not look for the appropriate BOM. The byte order to use is implicit in the encoding. If a BOM does appear at the start of the data stream, then it will simply be treated as a Unicode Zero-Width Non-Breaking Space (ZWNBS) character, because this shares the same codepoint as a BOM.

The dfdl:byteOrder property is never used to establish the byte order for Unicode encodings.

The parser never looks for a BOM at any other point in the data stream, so if a BOM appears elsewhere it will be treated as a Unicode ZWNBS character as described above[18].
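The parsing rules above can be sketched in Python as follows. This is an illustrative sketch only: the names BOMS, detect_bom and effective_encoding are hypothetical, CCSID equivalents are not handled, and the BOM byte sequences are the standard Unicode ones.

```python
# BOM byte sequences, keyed by the root element's dfdl:encoding. The second
# item of each pair is the value recorded in [unicodeByteOrderMark].
BOMS = {
    "UTF-8": [(b"\xef\xbb\xbf", "UTF-8")],
    "UTF-16": [(b"\xfe\xff", "UTF-16BE"), (b"\xff\xfe", "UTF-16LE")],
    "UTF-32": [(b"\x00\x00\xfe\xff", "UTF-32BE"),
               (b"\xff\xfe\x00\x00", "UTF-32LE")],
}


def detect_bom(data: bytes, root_encoding: str):
    """Return (bom_member, bom_length) for the very first bytes of the stream.

    Encodings with an explicit byte order (for example UTF-16LE) are not in
    the table, so no BOM is ever looked for and (None, 0) is returned.
    """
    for bom, member in BOMS.get(root_encoding.upper(), []):
        if data.startswith(bom):
            return member, len(bom)
    return None, 0


def effective_encoding(root_encoding: str, bom_member):
    """Byte-order-specific encoding used for the rest of the stream."""
    name = root_encoding.upper()
    if name in ("UTF-16", "UTF-32"):
        # Without a BOM, big-endian byte order is assumed.
        return bom_member if bom_member else name + "BE"
    return name
```

For example, detect_bom(b"\xff\xfeA\x00", "UTF-16") yields ("UTF-16LE", 2), while the same bytes with a root encoding of UTF-16LE yield (None, 0) and the leading code unit is treated as a ZWNBS character.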

Unparsing behaviour: When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-8, UTF-16 or UTF-32 (or CCSID equivalents), then a DFDL unparser will look in the infoset document information item for a BOM. 

UTF-8.  If the document information item [unicodeByteOrderMark] member is 'UTF-8', the UTF-8 BOM is output as the very first bytes in the data stream. If the property is empty then no BOM is output.  If the property has any other value, it is a processing error. There is no need to model the BOM explicitly.

UTF-16.  If the document information item [unicodeByteOrderMark] member is 'UTF-16LE' or 'UTF-16BE', the corresponding UTF-16 BOM is output as the very first bytes in the data stream, and all data with dfdl:encoding UTF-16 throughout the rest of the document will be output with the implied byte order. If the property is empty then no BOM is output, and all data with dfdl:encoding UTF-16 throughout the rest of the document are assumed to have big-endian byte order. If the property has any other value, it is a processing error. There is no need to model the BOM explicitly.

UTF-32.  If the document information item [unicodeByteOrderMark] member is 'UTF-32LE' or 'UTF-32BE', the corresponding UTF-32 BOM is output as the very first bytes in the data stream, and all data with dfdl:encoding UTF-32 throughout the rest of the document will be output with the implied byte order. If the property is empty then no BOM is output, and all data with dfdl:encoding UTF-32 throughout the rest of the document are assumed to have big-endian byte order. If the property has any other value, it is a processing error. There is no need to model the BOM explicitly.

When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID equivalents), then a DFDL unparser will not look at the document information item [unicodeByteOrderMark] member and will not output a BOM. The byte order to use is implicit in the encoding. If a BOM does need to be output at the start of the data stream, then it must be explicitly modelled as such.

The dfdl:byteOrder property is never used to establish the byte order for Unicode encodings.

The unparser never outputs a BOM at any other point in the data stream. If a BOM needs to appear, then it must be explicitly modelled as such.

11.2     Character Encoding and Decoding Errors

When parsing, these are the errors that can occur when decoding characters into Unicode/ISO 10646.

  1. The data is broken - invalid bit/byte sequences are found which do not match the definition of a character for the encoding.
  2. Not enough data is found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.

When unparsing, these are the errors that can occur when encoding characters from Unicode/ISO 10646 into the specified encoding.

  1. No mapping provided by the encoding specification.
  2. Not enough room to output the entire encoding of the character (e.g., need 3 bytes for a character encoding that uses 3 bytes for that character, but only 1 byte remains in the available length).

The subsections below describe how these errors are handled.

11.2.1    Property dfdl:encodingErrorPolicy

The property dfdl:encodingErrorPolicy has two possible values: 'error' and 'replace'.

11.2.1.1    dfdl:encodingErrorPolicy 'error'

If 'error', then any error when decoding characters while parsing causes a processing error. For unparsing, any error when encoding characters causes a processing error.

When parsing, it does not matter if this happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element.

There is one exception. When dfdl:lengthUnits is 'bytes', the 'not enough data' decoding error is ignored, and the data making up the fragment character is skipped over. Symmetrically, when unparsing the 'not enough room' encoding error is ignored and the left-over bytes are filled with the dfdl:fillByte.

11.2.1.2    dfdl:encodingErrorPolicy 'replace' for parsing

If 'replace' then any error when decoding characters results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error.

It does not matter if this error and replacement happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element.

There is one exception. When dfdl:lengthUnits is 'bytes', the 'not enough data' decoding error is ignored and no replacement character is created. The data making up the fragment character is skipped over. (It will be filled with the dfdl:fillByte when unparsing.)

Note that the "." wildcard in regular expressions will match the Unicode Replacement Character, so ".*" and ".+" regular expressions can potentially cause very large matches (up to the entire data stream) to occur when data contains errors and dfdl:encodingErrorPolicy is 'replace'. DFDL Schema authors are advised that bounded length negated regular expressions can help in this case. E.g., "[^\uFFFD]{0,50}" says to match any character (excluding the Unicode Replacement Character), but only up to length 50.
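The difference between the two policies, and the effect of the bounded negated pattern, can be seen in a short Python sketch. Python's codec error handlers 'strict' and 'replace' are used here only as stand-ins for dfdl:encodingErrorPolicy 'error' and 'replace', and the helper name decode_with_policy is hypothetical.

```python
import re


def decode_with_policy(data: bytes, encoding: str, policy: str) -> str:
    """Decode data, mapping the DFDL policy names onto Python codec handlers."""
    if policy == "error":
        # Raises UnicodeDecodeError, standing in for a processing error.
        return data.decode(encoding, errors="strict")
    if policy == "replace":
        # Substitutes U+FFFD for the undecodable data.
        return data.decode(encoding, errors="replace")
    raise ValueError("policy must be 'error' or 'replace'")


# 0xFF can never appear in well-formed UTF-8.
text = decode_with_policy(b"ab\xffcd", "utf-8", "replace")

# The "." wildcard matches U+FFFD, so ".*" runs across the decode error.
unbounded = re.match(r".*", text).group()

# The bounded negated class stops at the first replacement character.
bounded = re.match(r"[^\uFFFD]{0,50}", text).group()
```

Here text is "ab" followed by U+FFFD and "cd", the unbounded match covers all five characters, and the bounded match is just "ab".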

It is also worth noting that the Unicode Replacement Character can appear in data as an ordinary character, and this cannot be distinguished from the insertion of the Unicode Replacement Character due to a decoding error. This is likely to happen for data that is (a) initially parsed by a DFDL parser with dfdl:encodingErrorPolicy 'replace', and (b) which contains some decoding errors, but (c) is nevertheless successfully parsed, (d) is written back out to a file or other data repository, and (e) is parsed again. The written data will have replaced data errors with the Unicode Replacement Character, and so if the data is parsed again, it will no longer have errors, but will have the Unicode Replacement Character as a regular character in the data.

If dfdl:lengthUnits is 'characters', then a Unicode Replacement Character counts as contributing a single character to the length.

If the data contains more than one adjacent decode error, then the specific number of Unicode Replacement Characters that are inserted as the replacement of these errors is implementation-dependent. That is, some implementations may view, for example, three consecutive erroneous bytes as three separate decode errors, while others may view them as one or two decode errors. All implementations MUST, however, insert some number of Unicode Replacement Characters, and then continue to decode characters following the erroneous data.

The trimming of pad characters always happens after Unicode Replacement Characters have been inserted into the data.

11.2.1.3    dfdl:encodingErrorPolicy 'replace' for unparsing

For unparsing, each encoding has a replacement/substitution character specified by the ICU. This character is substituted for the unmapped character or the character that has too large an encoding to fit in the available space. 

There is one exception. When dfdl:lengthUnits is 'bytes', the 'not enough room' encoding error is ignored. The left-over bytes are filled with the dfdl:fillByte (they are skipped when parsing.)

The definitions of these substitution characters can be conveniently found for many encodings in the ICU Converter Explorer (http://demo.icu-project.org/icu-bin/convexp). 

An encoding error is a processing error if the encoding does not provide a substitution/replacement character definition. (This would be rare, but could occur if a DFDL implementation allows many encodings beyond the minimum set.)

11.2.2    Unicode UTF-16 Decoding/Encoding Non-Errors

The following specific situations involve the encodings UTF-16, UTF-16LE, and UTF-16BE when dfdl:utf16Width is 'fixed'. They do not cause a decoding or encoding error.

In all these cases the code-point(s) becomes a character code in the DFDL Information Item for the string.

11.2.3    Preserving Data Containing Decoding Errors

There can be situations where data needs to be preserved exactly even if it contains errors.

A DFDL schema author who wants to preserve data whose encodings contain these kinds of errors is advised to model such data as xs:hexBinary, or as xs:string with an encoding such as iso-8859-1, which preserves all bytes.
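The byte-preserving property of iso-8859-1 can be checked with a few lines of Python. This is only an illustration using Python's own codec, under the assumption that a DFDL processor's iso-8859-1 behaves the same way: every one of the 256 byte values maps to exactly one character and back.

```python
# Every possible byte value, including bytes that are invalid in UTF-8.
all_bytes = bytes(range(256))

# Decoding never fails, and each byte becomes exactly one character.
as_text = all_bytes.decode("iso-8859-1")

# Encoding the text again restores the original bytes exactly.
round_tripped = as_text.encode("iso-8859-1")
```

Because no byte sequence is invalid in this encoding, no decode error and therefore no replacement can occur, whatever the dfdl:encodingErrorPolicy.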

11.3     Byte Order and Bit Order

Byte order and bit order are separate concepts. However, of the possible combinations, only the following are allowed:

  1. ‘bigEndian’ with ‘mostSignificantBitFirst’
  2. ‘littleEndian’ with ‘mostSignificantBitFirst’
  3. ‘littleEndian’ with ‘leastSignificantBitFirst’ [19]

Other combinations must produce schema definition errors.
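A processor's check for the permitted combinations might look like the following Python sketch. The names ALLOWED_ORDER_COMBINATIONS and check_byte_and_bit_order are hypothetical, and a ValueError is used here merely to stand in for a schema definition error.

```python
# The three permitted (dfdl:byteOrder, dfdl:bitOrder) pairs listed above.
ALLOWED_ORDER_COMBINATIONS = {
    ("bigEndian", "mostSignificantBitFirst"),
    ("littleEndian", "mostSignificantBitFirst"),
    ("littleEndian", "leastSignificantBitFirst"),
}


def check_byte_and_bit_order(byte_order: str, bit_order: str) -> None:
    """Raise ValueError (standing in for a schema definition error) if the
    combination of byte order and bit order is not permitted."""
    if (byte_order, bit_order) not in ALLOWED_ORDER_COMBINATIONS:
        raise ValueError(
            f"schema definition error: dfdl:byteOrder '{byte_order}' "
            f"cannot be combined with dfdl:bitOrder '{bit_order}'"
        )
```

The only pairing of the two enumerations that is rejected is 'bigEndian' with 'leastSignificantBitFirst'.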

11.4     dfdl:bitOrder Example

Consider a structure of 4 logical elements. The total length is 16 bits. Assume dfdl:lengthUnits is 'bits', dfdl:representation is 'binary', dfdl:binaryNumberRep is 'binary':

<element name="A" type="xs:int" dfdl:length="3"/> <!-- having value 3 -->

<element name="B" type="xs:int" dfdl:length="7"/> <!-- having value 9 -->

<element name="C" type="xs:int" dfdl:length="4"/> <!-- having value 5 -->

<element name="D" type="xs:int" dfdl:length="2"/> <!-- having value 1 -->

The above are colorized so as to highlight the corresponding bits in the data below.

In a format where dfdl:bitOrder is 'mostSignificantBitFirst':

              01100010 01010101

              AAABBBBB BBCCCCDD

Significance  M      L M      L

Bit Position  12345678 12345678

Byte Position ----1--- ----2---

As presented here, the bits corresponding to each element appear left to right, and all bits for an individual element are adjacent. Within the bits of an individual element the most significant bit is on the left, least significant on the right, consistent with the way the bytes themselves are presented.

In contrast, in a format where dfdl:bitOrder is 'leastSignificantBitFirst':

              01001011 01010100

              BBBBBAAA DDCCCCBB

Significance  M      L M      L

Bit Position  87654321 87654321

Byte Position ----1--- ----2---

In the above presentation note how the bits of the element 'B' do not appear adjacent to each other. The most significant bits of byte N are adjacent to the least significant bits of byte N+1.
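The two layouts above can be reproduced with a small Python sketch. The function pack_fields is hypothetical and for illustration only; it takes (value, length-in-bits) pairs in schema order and assumes unsigned values.

```python
def pack_fields(fields, bit_order):
    """Pack (value, length_in_bits) pairs into bytes for a given dfdl:bitOrder."""
    total_bits = sum(length for _, length in fields)
    n_bytes = (total_bits + 7) // 8
    acc = 0
    if bit_order == "mostSignificantBitFirst":
        # Earlier fields occupy more-significant bits.
        for value, length in fields:
            acc = (acc << length) | (value & ((1 << length) - 1))
        # Left-justify when the total is not a whole number of bytes.
        acc <<= n_bytes * 8 - total_bits
        return acc.to_bytes(n_bytes, "big")
    if bit_order == "leastSignificantBitFirst":
        # Earlier fields occupy less-significant bits, starting in byte 1.
        offset = 0
        for value, length in fields:
            acc |= (value & ((1 << length) - 1)) << offset
            offset += length
        return acc.to_bytes(n_bytes, "little")
    raise ValueError("unknown bit order")


# Elements A, B, C, D from the example: values 3, 9, 5, 1.
fields = [(3, 3), (9, 7), (5, 4), (1, 2)]
msbf = pack_fields(fields, "mostSignificantBitFirst")
lsbf = pack_fields(fields, "leastSignificantBitFirst")
```

With these inputs msbf is the byte pair 01100010 01010101 and lsbf is 01001011 01010100, matching the two presentations above.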

11.4.1    Example Using Right-to-Left Display for 'leastSignificantBitFirst'

When working exclusively with data having dfdl:bitOrder 'leastSignificantBitFirst', it is useful to present data with bytes Right to Left. That is, with the bytes starting at byte 1 on the right, and increasing to the left.

              01010100 01001011

              DDCCCCBB BBBBBAAA

Significance  M      L M      L

Bit Position  87654321 87654321

Byte Position ----2--- ----1---

With this reorientation, the bits of the element 'B' are once again displayed adjacently. Within the bits of an individual element the most significant bit is on the left, least significant on the right, consistent with the way the bytes themselves are presented.
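The right-to-left presentation, and the reason it makes least-significant-bit-first fields contiguous, can be sketched in Python. The helper names show_right_to_left and split_fields_lsb_first are hypothetical.

```python
def show_right_to_left(data: bytes) -> str:
    """Render bytes as binary, with byte 1 on the right."""
    return " ".join(f"{byte:08b}" for byte in reversed(data))


def split_fields_lsb_first(data: bytes, lengths):
    """Extract unsigned fields of the given bit lengths, in schema order,
    from data whose dfdl:bitOrder is 'leastSignificantBitFirst'."""
    # Reading the bytes as one little-endian integer is the numeric
    # equivalent of the right-to-left display: the first field is the
    # lowest-order group of bits.
    remaining = int.from_bytes(data, "little")
    values = []
    for length in lengths:
        values.append(remaining & ((1 << length) - 1))
        remaining >>= length
    return values


data = bytes([0b01001011, 0b01010100])
display = show_right_to_left(data)
values = split_fields_lsb_first(data, [3, 7, 4, 2])
```

For the example data, display is "01010100 01001011" and values is [3, 9, 5, 1], the values of elements A, B, C and D.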

Often the specification documents for data formats with least-significant-bit-first bit order will describe data using this Right-to-Left presentation style.

 

11.4.2    dfdl:bitOrder and Grammar Regions

When any grammar region appears before (to the left of) or after (to the right of) another grammar region in the grammar rules of Section 9.2, and the boundary between the two falls within a byte rather than on a byte boundary, then the dfdl:bitOrder determines which bits are occupied by the regions.

In general, the notion of before means occupying lower-numbered bit positions, and the bit positions are numbered according to dfdl:bitOrder. Hence, when dfdl:bitOrder is 'mostSignificantBitFirst', grammar regions that are before, will occupy more-significant bits, and when dfdl:bitOrder is 'leastSignificantBitFirst', grammar regions that are before will occupy less-significant bits.

 

12.  Framing

Several properties are common across the various framing styles or are used to distinguish them. Generally these have to do with position and length for text, bit fields, or opaque data.

12.1     Aligned Data

Alignment properties control the leading alignment and trailing alignment regions.

When the alignment properties are applied to an array element, the properties are applied to each occurrence of the element; that is, not only to the first occurrence.

The following properties are used to define alignment rules.

 

Property Name

Description

alignment

Non-negative Integer or 'implicit'

A non-negative number that gives the alignment required for the beginning of the item. If the item must be aligned to a boundary and does not already begin on one, then the size of the AlignmentFill grammar region will be non-zero.

'implicit' specifies that the natural alignment for the representation type is used. See the table of implicit alignments Table 14 Implicit Alignment in bits for simple elements. The 'implicit' alignment of a complex element is the alignment of its model group. The 'implicit' alignment of a model group is always 1. If alignment is 'implicit' then dfdl:alignmentUnits is ignored.

For textual data, minimum alignment is mandated by the character-set encoding, and this property must be 'implicit' or set to a multiple of the character-set's mandatory alignment. See Section 12.1.2.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

alignmentUnits

Enum

Valid values are 'bits' or 'bytes'

Scales the alignment so alignment can be specified in either units of bits or units of bytes.

Only used when dfdl:alignment is not 'implicit'.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

fillByte

DFDL String Literal

A single byte specified as a DFDL byte value entity or a single character. If a character is specified, it must be a single-byte character in the applicable encoding.

Used on unparsing to fill empty space such as between two aligned elements.

Used to fill these regions specified in the grammar: RightFill, ElementUnused, ChoiceUnused, LeadingSkip, AlignmentFill, and TrailingSkip.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group  

leadingSkip

Non-negative Integer

A non-negative number of bytes or bits, depending on dfdl:alignmentUnits, to skip before alignment is applied. Gives the size of the grammar region having the same name.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

trailingSkip

Non-negative Integer

A non-negative number of bytes or bits, depending on dfdl:alignmentUnits, to skip after the element, but before considering the alignment of the next element. Gives the size of the grammar region having the same name.

If dfdl:trailingSkip is specified when dfdl:lengthKind is 'delimited' then a dfdl:terminator must be specified.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

Table 13 Aligned Data Properties

Two properties, dfdl:alignment and dfdl:alignmentUnits, control the data alignment by controlling the length of the AlignmentFill region.

An element's representation is aligned to N units if P is the first position in the representation and P mod N = 1.  When parsing, the position of the first unit of the data stream is 1. 

For example, if dfdl:alignment is 4, and dfdl:alignmentUnits is 'bytes', then the element's representation must begin at 1 or 1 plus a multiple of 4 bytes.  That is, 1, 5, 9, 13, 17 and so on.

The length of the AlignmentFill region is measured in bits. If alignmentUnits is 'bytes' then we multiply the alignment value by 8 to get the bit alignment, B; otherwise B is the alignment value itself. If the position in the data stream of the start of the AlignmentFill region is bit position N, then the length of the AlignmentFill region is the smallest non-negative integer L such that (L + N) mod B = 1.  The position of the first bit of the aligned component is P = L + N.
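The AlignmentFill length calculation can be expressed as a short Python sketch. The function name alignment_fill_length is hypothetical. The code uses the equivalent condition (L + N - 1) mod B = 0, which gives the same result for B greater than 1 and also yields 0 for a bit alignment of 1; rejecting an alignment of less than 1 is an assumption made for this sketch.

```python
def alignment_fill_length(start_bit: int, alignment: int,
                          alignment_units: str = "bits") -> int:
    """Length in bits of the AlignmentFill region.

    start_bit is the 1-based bit position N at which the region starts.
    """
    if alignment < 1:
        # Assumption for this sketch: a zero alignment is not meaningful.
        raise ValueError("alignment must be at least 1")
    # Scale to a bit alignment B when the units are bytes.
    bit_alignment = alignment * 8 if alignment_units == "bytes" else alignment
    # Smallest non-negative L such that (L + N - 1) is a multiple of B.
    return (1 - start_bit) % bit_alignment
```

For example, with dfdl:alignment 4 and dfdl:alignmentUnits 'bytes', a region starting at bit 1 or bit 33 needs no fill, while one starting at bit 2 needs 31 bits so that the element begins at bit 33, the first bit of byte 5.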

The lengths of the LeadingSkip and TrailingSkip regions are controlled by the two properties of corresponding names together with the dfdl:alignmentUnits property.

12.1.1    Implicit Alignment

When dfdl:alignment is 'implicit' the following alignment values are applied for each logical type.

Type | Alignment (text) | Alignment (binary)
String | Encoding Specific (usually 8 bits, with exceptions: See Section 12.1.2) | Not applicable
Float | | 32
Double | | 64
Decimal, Integer, nonNegativeInteger | | Packed decimals: 8; binary: 8
Long, UnsignedLong | | binary: 64
Int, UnsignedInt | | binary: 32
Short, UnsignedShort | | binary: 16
Byte, UnsignedByte | | binary: 8
DateTime | | binarySeconds: 32, binaryMilliseconds: 64
Date | | binarySeconds: 32, binaryMilliseconds: 64
Time | | binarySeconds: 32, binaryMilliseconds: 64
Boolean | | 32
HexBinary | Not applicable | 8

Table 14 Implicit Alignment in bits

Note: The above table specifies the implicit alignment in bits, but this does not imply that dfdl:alignmentUnits 'bits' can be specified for all simple types. Rather, dfdl:alignmentUnits and dfdl:lengthUnits are independent and have their own rules for when they are applicable.

12.1.2    Mandatory Alignment for Textual Data

The term textual data is used to describe data of type xs:string, data with dfdl:representation "text", as well as data being matched to delimiters (parsing) or output as delimiters (unparsing), and data being matched to regular expressions (parsing only - as in a dfdl:assert with testKind 'pattern', or an element with dfdl:lengthKind 'pattern').

Textual data has mandatory alignment that is character-set-encoding dependent. That is, these mandates come from the character set encoding specified by the dfdl:encoding property.

When processing textual data, it is a schema definition error if the dfdl:alignment and dfdl:alignmentUnits properties are used to specify alignment that is not a multiple of the encoding-specified mandatory alignment.

If the data is not aligned to the proper boundary for the encoding when textual data is processed, then bits are skipped (parsing) or filled from dfdl:fillByte (unparsing) to achieve the mandatory alignment.

All required character set encodings in DFDL have 8-bit/1-byte alignment.

DFDL standard encodings specify their alignment. See Section 34 Appendix D: DFDL Standard Encodings.

Some implementations may include additional implementation-defined encodings which have other alignments.

Note the 16-bit and 32-bit Unicode character set encodings UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, all have 8-bit/1-byte alignment.

12.1.3    Mandatory Alignment for Packed Decimal Data

Packed decimal data must have a multiple of 4-bit alignment.  It is a schema definition error otherwise.

12.1.4    Example: AlignmentFill

When dfdl:alignmentUnits is 'bits', and the dfdl:alignment is not a multiple of 8, then the dfdl:bitOrder property affects the alignment by controlling which bits are skipped as part of the grammar AlignmentFill region.

In general, the AlignmentFill region is before the regions it is aligning, and within a byte, the meaning of 'before' is interpreted with respect to the dfdl:bitOrder.

When dfdl:bitOrder is 'mostSignificantBitFirst', then bits with more significance are before bits with less significance, so the AlignmentFill region occupies the most significant bits of the byte.

When dfdl:bitOrder is 'leastSignificantBitFirst', then bits with less significance are before bits with more significance, so the AlignmentFill region occupies the least significant bits of the byte.

Consider a structure of 2 logical elements. Assume dfdl:lengthUnits='bits', dfdl:representation='binary', dfdl:binaryNumberRep='binary', dfdl:alignmentUnits='bits', and assume the data is at the beginning of the data stream.

<element name="A" type="xs:int" dfdl:length="2" dfdl:alignment='8'/>

<!-- having value 1 -->

<element name="B" type="xs:int" dfdl:length="4" dfdl:alignment='4'/>

<!-- having value 5 -->

The above are colorized so as to highlight the corresponding bits in the data below. The total length due to the alignment region appearing before element 'B' will be 8 bits.

In a format where dfdl:bitOrder is 'mostSignificantBitFirst' the data can be visualized as:

              01000101

              AAxxBBBB

Significance  M      L

Bit Position  12345678

In the above, the AlignmentFill region is marked with 'x' characters, and contains all 0 bit values.

In a format where dfdl:bitOrder is 'leastSignificantBitFirst' the presentation is different:

              01010001

              BBBBxxAA

Significance  M      L

Bit Position  87654321

In the above the AlignmentFill region still appears before element 'B', and in this case that is in less significant bits of the byte than the bits of content of element 'B', and these bits are displayed to the right of the bits of element 'B'.
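Both presentations of this example can be reproduced with a Python sketch that combines the AlignmentFill length rule of Section 12.1 with bit-order-aware packing. The function layout_bits is hypothetical; it assumes unsigned values, alignments given in bits, and fill bits of 0 as in the example (in practice fill bits come from dfdl:fillByte).

```python
def layout_bits(fields, bit_order):
    """Lay out (value, length_bits, alignment_bits) triples from bit
    position 1, inserting zero-valued AlignmentFill bits where needed."""
    placed = []   # (value, length) pairs, including fill regions
    position = 1  # 1-based bit position of the next region
    for value, length, alignment in fields:
        fill = (1 - position) % alignment
        if fill:
            placed.append((0, fill))
        placed.append((value & ((1 << length) - 1), length))
        position += fill + length
    total_bits = position - 1
    n_bytes = (total_bits + 7) // 8
    acc = 0
    if bit_order == "mostSignificantBitFirst":
        # Regions that come before occupy more-significant bits.
        for value, length in placed:
            acc = (acc << length) | value
        acc <<= n_bytes * 8 - total_bits
        return acc.to_bytes(n_bytes, "big")
    if bit_order == "leastSignificantBitFirst":
        # Regions that come before occupy less-significant bits.
        offset = 0
        for value, length in placed:
            acc |= value << offset
            offset += length
        return acc.to_bytes(n_bytes, "little")
    raise ValueError("unknown bit order")


# Element A: value 1, length 2, alignment 8. Element B: value 5, length 4,
# alignment 4.
example = [(1, 2, 8), (5, 4, 4)]
msbf_byte = layout_bits(example, "mostSignificantBitFirst")
lsbf_byte = layout_bits(example, "leastSignificantBitFirst")
```

The results are the single bytes 01000101 and 01010001 respectively, with the two fill bits lying between the bits of 'A' and the bits of 'B' in both cases.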

12.2     Properties for Specifying Delimiters

The following properties apply to all objects that use text delimiters to delimit, that is, to initiate and/or terminate data. Delimiters can apply to binary data; however they are most often called 'text' delimiters because the concept is much more commonly used for textual data formats.

 

Property Name

Description

initiator

List of DFDL String Literals or DFDL Expression

Specifies a whitespace separated list of alternative literal strings one of which marks the beginning of the element or group of elements.

This property can be computed by way of an expression which returns a string containing a whitespace separated list of DFDL String Literals.  The expression must not contain forward references to elements which have not yet been processed.

Each string literal in the list, whether apparent in the schema, or returned as the value of an expression, is restricted to allow only certain kinds of syntax:

·         DFDL character entities are allowed.

·         DFDL Byte Value entities ( %#r ) are allowed.

·         DFDL Character Classes NL, WSP, WSP+, WSP*, and ES are allowed.

·         ES must not appear as the only DFDL string literal in the property. It can only appear as a member of a list.

·         If the ES entity or the WSP* entity appear alone as one of the string literals in the list, then dfdl:initiatedContent must be "no".

If the above rules are not followed it is a schema definition error.

The Initiator region contains one of the initiator strings defined by dfdl:initiator.

When parsing, the list of values is processed in a greedy manner, meaning it takes all the initiators, that is, each of the string literals in the whitespace separated list, and matches them each against the data. The initiator with the longest match is the one that is selected as having been 'found'. Once a matching initiator is found, no other matches will be subsequently attempted (i.e., there is no backtracking).

When an initiator is specified, it is a processing error if the component is required and one of the values is not found.

If dfdl:initiator is "" (the empty string), then the Initiator region is of length zero, and no initiator is expected.  It is not permitted for an expression to return an empty string. That is a schema definition error.

On unparsing the first initiator in the list is automatically inserted into the Initiator region.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

terminator

List of DFDL String Literals or DFDL Expression

Specifies a whitespace separated list of alternative literal strings, one of which marks the end of an element or group of elements. The strings MUST be searched for in the longest first order.

This property can be computed by way of an expression which returns a string containing a whitespace separated list of DFDL String Literals.  The expression must not contain forward references to elements which have not yet been processed.

This property can be used to determine the length of an element as described in Section 12.3.2 dfdl:lengthKind 'delimited'.

Each string literal in the list, whether apparent in the schema, or returned as the value of an expression, is restricted to allow only certain kinds of syntax:

·         DFDL character entities are allowed.

·         DFDL Byte Value entities ( %#r ) are allowed.

·         DFDL Character Classes NL, WSP, WSP+, WSP*, and ES are allowed.

·         ES must not appear as the only DFDL string literal in the property. It can only appear as a member of a list.

·         Neither the ES entity nor the WSP* entity may appear on their own as one of the string literals in the list when the parser is determining the length of a component by scanning for delimiters.

If the above rules are not followed it is a schema definition error.

The Terminator region contains the terminator string.

If dfdl:terminator is "" (the empty string), then the Terminator region is of length zero, and no terminator is expected. It is not permitted for an expression to return an empty string. That is a schema definition error.

When parsing, the list of values is processed in a greedy manner, meaning it takes all the terminators, that is, each of the string literals in the whitespace separated list, and matches them each against the data. The terminator with the longest match is the one that is selected as having been 'found'. Once a matching terminator is found, no other matches will be subsequently attempted (i.e., there is no backtracking).

When a terminator is expected it is a processing error if no matching terminator is found. However, if dfdl:documentFinalTerminatorCanBeMissing is 'yes' then it is not an error if the last terminator in the data stream is not found.

On unparsing the first terminator in the list is automatically inserted in the Terminator region.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

emptyValueDelimiterPolicy

Enum

Valid values are 'none', 'initiator', 'terminator' or 'both'

Indicates that when an element in the data stream is empty, an initiator (if one is defined), a terminator (if one is defined), both an initiator and a terminator (if defined) or neither must be present.

Ignored if both dfdl:initiator and dfdl:terminator are "" (empty string).

'initiator' indicates that, on parsing, if the content region (which can be either the SimpleContent region or the ComplexContent region defined in Section 9.2)  is empty then the dfdl:initiator must be present. It also indicates that on unparsing when the content region is empty that the dfdl:initiator will be output.

'terminator' indicates that, on parsing, if the content region is empty then the dfdl:terminator must be present. It also indicates that on unparsing when the content region is empty the dfdl:terminator will be output.

'both' indicates  that, on parsing, if the content region is empty both the dfdl:initiator and dfdl:terminator must be present. On unparsing when the content region is empty the dfdl:initiator followed by the dfdl:terminator will be output.

'none' indicates that if the content region is empty neither the dfdl:initiator nor the dfdl:terminator must be present. On unparsing when the content region is empty nothing will be output.

It is a schema definition error if dfdl:emptyValueDelimiterPolicy is set to 'none' or 'terminator' when the parent xs:sequence has dfdl:initiatedContent 'yes'.

This property plays an important role in establishing empty representation. See 9.2.2 Empty Representation for details.

Annotation: dfdl:element, dfdl:simpleType

documentFinalTerminatorCanBeMissing

Enum

Valid values are 'yes', 'no'

When dfdl:documentFinalTerminatorCanBeMissing is 'yes' and an element is the last element in the data stream, it is not an error on parsing if the terminator is not found.

For example, if the data are in a file and the format specifies lines terminated by the newline character (typically LF or CRLF), a last line that is missing its newline would normally be a processing error. If dfdl:documentFinalTerminatorCanBeMissing is 'yes', it is not a processing error.

On unparsing the terminator is always written out regardless of the state of this property.

Annotation: dfdl:format (but applies to elements only)

outputNewLine

DFDL String Literal or DFDL Expression

Specifies the character or characters that will be used to replace the %NL; character class entity during unparse.

It is a schema definition error if any of the characters are not in the set of characters allowed by the DFDL entity %NL;. Only individual characters or the %CR;%LF; combination are allowed.

It is a schema definition error if the DFDL entity %NL; is specified.

This property can be computed by way of an expression which returns a DFDL string literal. The expression must not contain forward references to elements which have not yet been processed.

Annotation: dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group

Table 15 Properties for Specifying Delimiters
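The longest-match behaviour of dfdl:terminator and the effect of dfdl:documentFinalTerminatorCanBeMissing can be illustrated with a minimal Python sketch. It is not taken from any DFDL processor: the function names are hypothetical, and it assumes that every terminator is a plain literal containing no whitespace (no DFDL entities or character classes) and that the data has already been decoded to a string.

```python
from typing import List, Optional


def match_terminator(data: str, pos: int, terminators: str) -> Optional[str]:
    """Return the longest terminator literal that matches data at pos, or None.

    `terminators` is the whitespace-separated list from dfdl:terminator.
    Assumption: each entry is a plain literal, not an entity or character class.
    """
    best = None
    for candidate in terminators.split():
        if data.startswith(candidate, pos):
            # Keep the longest match; the earlier entry wins a tie.
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


def split_records(data: str, terminators: str,
                  final_can_be_missing: bool) -> List[str]:
    """Split data into terminated records, scanning left to right."""
    records = []
    start = pos = 0
    while pos < len(data):
        found = match_terminator(data, pos, terminators)
        if found is None:
            pos += 1
            continue
        records.append(data[start:pos])
        pos += len(found)
        start = pos
    if start < len(data):
        # Trailing data that is not followed by any terminator.
        if not final_can_be_missing:
            raise ValueError("processing error: final terminator not found")
        records.append(data[start:])
    return records
```

For instance, with the terminator list `"; ;;"`, the call `split_records("a;b;;c", "; ;;", True)` treats `;;` as a single terminator after `b` and accepts the unterminated final record `c`.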

12.3     Properties for Specifying Lengths

These properties are used to determine the content length of an element and apply to elements of all types (simple and complex).

Property Name

Description

lengthKind

Enum

Controls how the content length of the component is determined.

Valid values are: 'explicit', 'delimited', 'prefixed', 'implicit', 'pattern', 'endOfParent'

A full description of each enumeration is given in the later sections.

'explicit' means the length of the element is given by the dfdl:length property.

'delimited' means the element length is determined by scanning for a terminator or separator.

'prefixed' means the length of the element is given by an immediately preceding PrefixLength data region the format of which is specified using dfdl:prefixLengthType.

'implicit' means the length is to be determined in terms of the type of the element and its schema-specified properties if any.

'pattern' means the length of the element is given by scanning for a regular expression specified using the dfdl:lengthPattern property.

'endOfParent' means that the length extends to the end of the containing (parent) construct.

Annotation: dfdl:element, dfdl:simpleType

lengthUnits

Enum

Valid values are 'bytes', 'characters', 'bits'.

Specifies the units to be used whenever a length is being used to extract or write data. Applicable when dfdl:lengthKind is 'explicit', 'implicit' (for xs:string and xs:hexBinary) or 'prefixed'.

Usage is restricted as follows:

  • 'characters' may only be used for complex elements and simple elements with text representation.
  • 'bits' may only be used for xs:boolean, xs:byte, xs:short, xs:int, xs:long, xs:unsignedByte, xs:unsignedShort, xs:unsignedInt, and xs:unsignedLong simple types with binary representation.
  • 'bytes' must be used for type xs:hexBinary.
  • 'bytes' must be used for types xs:float and xs:double with binary representation.

 Annotation: dfdl:element, dfdl:simpleType

Table 16 Properties for Specifying Length
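The usage restrictions on dfdl:lengthUnits listed in Table 16 can be read as a simple validity check. The following Python sketch is only an illustration under stated assumptions: the function name and its parameters are hypothetical, type names are given without the xs: prefix, and xs:hexBinary is passed with representation "binary".

```python
# Simple types for which dfdl:lengthUnits 'bits' is permitted (binary representation only).
BIT_LENGTH_TYPES = {
    "boolean", "byte", "short", "int", "long",
    "unsignedByte", "unsignedShort", "unsignedInt", "unsignedLong",
}


def length_units_allowed(units: str, xs_type: str, representation: str,
                         is_complex: bool = False) -> bool:
    """Check a dfdl:lengthUnits value against the usage restrictions.

    xs_type is the local name of the simple type (for example "int");
    representation is "text" or "binary" and is ignored for complex elements.
    """
    if units == "bytes":
        return True
    if units == "characters":
        # Complex elements, or simple elements with text representation.
        return is_complex or representation == "text"
    if units == "bits":
        return (not is_complex and representation == "binary"
                and xs_type in BIT_LENGTH_TYPES)
    raise ValueError(f"unknown dfdl:lengthUnits value: {units!r}")
```

Under this reading, xs:hexBinary and binary xs:float or xs:double are left with 'bytes' as the only acceptable value, which matches the last two restrictions.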

12.3.1    dfdl:lengthKind 'explicit'

When dfdl:lengthKind is 'explicit' the length of the item is given by the dfdl:length property.

When the value of the dfdl:length property is a constant, it is used both when parsing and unparsing.

When unparsing an element with dfdl:lengthKind 'explicit' and where dfdl:length is an expression, then the data in the Infoset is treated as variable length and not fixed length. The behaviour is the same as dfdl:lengthKind 'prefixed'. See Section 12.3.4.

When parsing and dfdl:lengthKind is 'explicit', delimiter scanning is turned off and in-scope delimiters are not looked for within or between elements.

 

Property Name

Description

length

Non-negative Integer or DFDL Expression. 

Only used when lengthKind is 'explicit'.

Specifies the length of this element in units that are specified by the dfdl:lengthUnits property.

This property can be computed by way of an expression which returns a non-negative integer. The expression must not contain forward references to elements which have not yet been processed.

Annotation: dfdl:element, dfdl:simpleType

Table 17 The dfdl:length Property

When dfdl:lengthKind is 'explicit', the method of extracting data is described in Section 12.3.7 Elements of Specified Length.

12.3.2    dfdl:lengthKind 'delimited'

On parsing, the length of an element with dfdl:lengthKind 'delimited' is determined by scanning the data stream for the delimiter.

The data stream is scanned for any of

·         the element's terminator (if specified)

·         an enclosing construct's separator or terminator

·         the end of an enclosing element designated by its known length

·         the end of the data stream

dfdl:lengthKind 'delimited' may be specified for

·         elements of simple type with text representation

·         elements of number or calendar simple type with dfdl:representation 'binary' that have a packed decimal representation

·         elements of type xs:hexBinary

·         elements of complex type.

The rules for resolving ambiguity between delimiters are:

  1. When two delimiters have a common prefix, the longest delimiter is tried first.
  2. When two delimiters have exactly the same length, but on different schema components, the innermost (most deeply nested) delimiter is tried first.
  3. When the separator and terminator on a group have the same value, then at a point in the data where either the separator or terminator could be found, the separator is tried first. (Speculative execution may try the terminator subsequently).
  4. If the length of the delimiters cannot be determined because character class entities (which are variable length) are being used then the delimiters must each be matched against the data, and the longest matching delimiter is taken as the match for the delimiter.
  5. Ties (same matched length) are broken by giving a separator priority over a terminator of a sequence, or by choosing the innermost, or first in schema order.
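One possible reading of these rules is as a sort order over the in-scope delimiters that match at the current position. The Python sketch below is an interpretation, not a normative algorithm: the `Delimiter` record, its field names, and the exact ordering of the tie-breakers (innermost before separator-over-terminator before schema order) are assumptions, and delimiters are treated as plain literals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Delimiter:
    text: str          # literal delimiter text (assumption: no character classes)
    kind: str          # "separator" or "terminator"
    depth: int         # nesting depth of the owning component; larger is more deeply nested
    schema_order: int  # position in schema order; smaller is earlier


def pick_delimiter(data: str, pos: int,
                   in_scope: List[Delimiter]) -> Optional[Delimiter]:
    """Choose which matching in-scope delimiter is tried first at pos."""
    matches = [d for d in in_scope if data.startswith(d.text, pos)]
    if not matches:
        return None
    return min(
        matches,
        key=lambda d: (
            -len(d.text),                       # longest match first
            -d.depth,                           # then the innermost component
            0 if d.kind == "separator" else 1,  # then separator before terminator
            d.schema_order,                     # then first in schema order
        ),
    )
```

Because the matched length is compared after matching, the same ordering also covers rule 4 once variable-length character classes are expanded into the text they actually matched.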

When unparsing a simple element with text representation, the length in the data stream is the length of the content region, padded to dfdl:textOutputMinLength or the XSD minLength facet if dfdl:textPadKind is 'padChar'.

When unparsing a simple element with binary representation, then for hexBinary the length is the number of bytes in the infoset value padded to the XSD minLength facet value using dfdl:fillByte, and for the other types the length is the minimum number of bytes to represent the value and any sign.

When unparsing a complex element, the length is that of the ComplexContent region.

12.3.2.1    Non-Delimited Elements within Delimited Constructs

When a simple or complex element has a specified length, dfdl:lengthKind 'pattern', or dfdl:lengthKind 'endOfParent' then delimiter scanning is suspended for the duration of the processing of that element.

This allows formats to be parsed which are delimited, but have nested elements which contain non-character data so long as that nested data can be isolated from the delimited data context surrounding it.

12.3.2.2    Delimited Binary Data

Formats involving binary data, most notably packed decimals, can use delimiter scanning but care must be taken that the delimiters cannot match data represented in these formats. In particular, the delimiters must be chosen with knowledge that BCD data can contain any byte both of whose nibbles are 0 to 9 (that is, excluding A to F). Packed data adds bytes with a sign indicator, that is, a nibble in the range A to F.

General binary data can contain any bit pattern whatsoever, so delimiter scanning for numbers and calendars with dfdl:representation 'binary' is disallowed, with the specific exception of packed decimals. Delimiter scanning is also allowed for type xs:hexBinary.
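The caution about choosing delimiters for BCD data can be made concrete with a small Python sketch. It is an illustration only: the function names are hypothetical, and the check considers BCD data alone, not the sign nibble that packed data adds.

```python
def is_bcd_byte(value: int) -> bool:
    """True if both nibbles of the byte are decimal digits (0 to 9)."""
    return (value >> 4) <= 9 and (value & 0x0F) <= 9


def delimiter_can_match_bcd(delimiter: bytes) -> bool:
    """True if every byte of the delimiter could also occur in BCD data."""
    return all(is_bcd_byte(b) for b in delimiter)
```

For example, the ASCII digit `1` (0x31) could be mistaken for BCD content, whereas the ASCII comma (0x2C) could not. Under the same reading, 0x2C has a digit nibble followed by a nibble in the range A to F, so it could still coincide with a byte of packed data that carries a sign indicator.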

12.3.3    dfdl:lengthKind 'implicit'

When dfdl:lengthKind is 'implicit', the length is determined in terms of the type of the element and its schema-specified properties.

For complex elements, 'implicit' means the length is determined by the combined lengths of the contained children, that is the ComplexContent region. However, note that alignment regions inside the ComplexContent region may be of different lengths depending on the ComplexContent's starting position alignment.

For simple elements the length is fixed and is given in Table 18 Length in Bits for SimpleTypes when dfdl:lengthKind is 'implicit'.

| Type | Length (text) | Length (binary) |
|---|---|---|
| String | The XSD maxLength facet gives the length in characters, but this is also the length in bytes. (See note below: character set encoding must be single-byte.) Multiply by 8 to get the number of bits. | Not applicable |
| Float | Not allowed | 32 bits |
| Double | Not allowed | 64 bits |
| Decimal, Integer, nonNegativeInteger | Not allowed | packed decimal: Not allowed; binary: Not allowed |
| Long, UnsignedLong | Not allowed | binary: 64 bits |
| Int, UnsignedInt | Not allowed | binary: 32 bits |
| Short, UnsignedShort | Not allowed | binary: 16 bits |
| Byte, UnsignedByte | Not allowed | binary: 8 bits |
| DateTime | Not allowed | binarySeconds: 32 bits, binaryMilliseconds: 64 bits |
| Date | Not allowed | binarySeconds: 32 bits, binaryMilliseconds: 64 bits |
| Time | Not allowed | binarySeconds: 32 bits, binaryMilliseconds: 64 bits |
| Boolean | Length of the longest of the dfdl:textBooleanTrueRep and dfdl:textBooleanFalseRep values | 32 bits |
| HexBinary | Not applicable | The XSD maxLength facet gives the length in bytes. Multiply by 8 to convert to number of bits. |

Table 18 Length in Bits for SimpleTypes when dfdl:lengthKind is 'implicit'

When dfdl:lengthKind is 'implicit', the method of extracting data is described in section: 12.3.7 Elements of Specified Length.
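Part of Table 18 can be expressed as a lookup. The Python sketch below is a partial illustration under stated assumptions: the names are hypothetical, type names are given without the xs: prefix, only the string, hexBinary, and fixed-width binary rows are covered, and a single-byte character set encoding is assumed for strings.

```python
from typing import Optional

# Fixed implicit lengths in bits for simple types with binary representation.
IMPLICIT_BINARY_BITS = {
    "float": 32, "double": 64,
    "long": 64, "unsignedLong": 64,
    "int": 32, "unsignedInt": 32,
    "short": 16, "unsignedShort": 16,
    "byte": 8, "unsignedByte": 8,
    "boolean": 32,
}


def implicit_length_bits(xs_type: str, representation: str,
                         max_length: Optional[int] = None) -> int:
    """Return the implicit length in bits for the rows covered by this sketch."""
    if xs_type == "hexBinary" or (xs_type == "string" and representation == "text"):
        if max_length is None:
            raise ValueError("the XSD maxLength facet is required")
        # One byte per unit: bytes for hexBinary, single-byte characters for string.
        return max_length * 8
    if representation == "binary" and xs_type in IMPLICIT_BINARY_BITS:
        return IMPLICIT_BINARY_BITS[xs_type]
    raise ValueError(f"'implicit' is not covered or not allowed for {xs_type!r}")
```

Calendar types and text booleans are left out because their lengths depend on further properties (dfdl:binaryCalendarRep, dfdl:textBooleanTrueRep, and dfdl:textBooleanFalseRep).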

It is a schema definition error if type is xs:string and dfdl:lengthKind is 'implicit' and dfdl:lengthUnits is 'bytes' and encoding is not an SBCS (exactly 1 byte per character code) encoding. This prevents a scenario where validation against the XSD maxLength facet is in characters but parsing and unparsing using the XSD maxLength facet is in bytes.

12.3.4    dfdl:lengthKind 'prefixed'

When dfdl:lengthKind is 'prefixed' the length of the element is given by the integer value of the PrefixLength region specified using dfdl:prefixLengthType. The property dfdl:prefixIncludesPrefixLength also can be used to adjust the length appropriately.

When dfdl:lengthKind is 'prefixed', the method of extracting data is described in Section 12.3.7 Elements of Specified Length.

When dfdl:lengthKind is 'prefixed', delimiter scanning is turned off and in-scope delimiters are not looked for within or between elements.

 

Property Name

Description

prefixIncludesPrefixLength

Enum

Valid values are 'yes', 'no'

Whether the length given by a prefix includes the length of the prefix as well as the length of the content region (which can be either the SimpleContent region or the ComplexContent region defined in Section 9.2 DFDL Data Syntax Grammar).

Used only when dfdl:lengthKind 'prefixed'.

Annotation: dfdl:element, dfdl:simpleType

prefixLengthType

QName

Name of a simple type derived from xs:integer or any subtype of it.

This type, with its DFDL annotations specifies the representation of the length prefix, which is in the PrefixLength region.

It is a schema definition error if the xs:simpleType specifies any of:

  • dfdl:lengthKind 'delimited', 'endOfParent', or 'pattern'
  • dfdl:lengthKind 'explicit' where length is an expression
  • dfdl:outputValueCalc
  • dfdl:initiator or dfdl:terminator other than empty string
  • dfdl:alignment other than '1'
  • dfdl:leadingSkip or dfdl:trailingSkip other than '0'.

Annotation: dfdl:element, dfdl:simpleType

Table 19 Properties for dfdl:lengthKind 'prefixed'

The representation of the element is in two parts.

  1. The 'prefix length' is an integer which specifies the length of the element's content. The representation of the length prefix is described by a simple type which is identified using the dfdl:prefixLengthType property.
  2. The content of the element.

When parsing, the length of the element's content is obtained by parsing the simple type specified by dfdl:prefixLengthType to obtain an integer value. Note that all required properties must be present on the specified simple type or defaulted because there is no element declaration to supply any missing required properties.

If the dfdl:prefixIncludesPrefixLength property is 'yes' then the length of the element's content is the value of the prefix length minus the length of the content of the prefix length.

If the prefix type is dfdl:lengthKind 'implicit' or 'explicit' then the dfdl:lengthUnits properties of both the prefix type and the element must be the same.

The DFDL properties that specify the format of the prefix come from annotations directly on the dfdl:prefixLengthType's type definition, and from the default format annotation for the schema document containing the definition of that type. If the using element resides in a separate schema, the simple type does not pick up values from the element's schema's default dfdl:format annotation.

When unparsing, the length of the element's content region must be determined first as described below. Then the value of the prefix length must be adjusted using dfdl:prefixIncludesPrefixLength.

Then the prefix length can be written to the data stream using the properties on the dfdl:prefixLengthType, and finally the element's content can be written to the data stream.
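The parse and unparse arithmetic for a prefixed element can be sketched in Python as follows. This is a minimal illustration, not a description of any implementation: the function names are hypothetical, and it assumes a 2-byte unsigned big-endian binary prefix with dfdl:lengthUnits 'bytes' on both the prefix type and the element.

```python
import struct
from typing import Tuple

PREFIX_SIZE = 2  # assumption: 2-byte unsigned big-endian prefix, lengths in bytes


def parse_prefixed(data: bytes, pos: int,
                   includes_prefix: bool) -> Tuple[bytes, int]:
    """Return (content, position after the content) for a prefixed element."""
    if pos + PREFIX_SIZE > len(data):
        raise ValueError("processing error: not enough data for the prefix")
    (value,) = struct.unpack_from(">H", data, pos)
    # With prefixIncludesPrefixLength 'yes' the prefix counts its own length.
    content_len = value - PREFIX_SIZE if includes_prefix else value
    start = pos + PREFIX_SIZE
    end = start + content_len
    if content_len < 0 or end > len(data):
        raise ValueError("processing error: invalid or truncated content length")
    return data[start:end], end


def unparse_prefixed(content: bytes, includes_prefix: bool) -> bytes:
    """Determine the content length first, adjust it, then write prefix and content."""
    value = len(content) + (PREFIX_SIZE if includes_prefix else 0)
    return struct.pack(">H", value) + content
```

With these assumptions, the five-byte content `hello` is written with a prefix value of 5 when dfdl:prefixIncludesPrefixLength is 'no' and 7 when it is 'yes'.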

Consider this example:

<xs:element name="myString" type="xs:string"
            dfdl:lengthKind="prefixed"
            dfdl:prefixIncludesPrefixLength="no"
            dfdl:prefixLengthType="packed3"/>

<xs:simpleType name="packed3"
               dfdl:representation="binary"
               dfdl:binaryNumberRep="packed"
               dfdl:lengthKind="explicit"
               dfdl:length="2">
  <xs:restriction base="xs:integer"/>
</xs:simpleType>

In the above, the string has a prefix length of type 'packed3' containing 3 packed decimal digits.
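To show what a two-byte prefix of this kind might hold, the following Python sketch decodes a packed decimal value of three digit nibbles followed by one sign nibble. It is illustrative only: the function name is hypothetical, and the sign nibble convention (C or F for positive, D for negative) is an assumption made for this sketch.

```python
def decode_packed(data: bytes) -> int:
    """Decode packed decimal bytes: digit nibbles followed by one sign nibble."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble")
    # Assumed sign codes: 0xC and 0xF positive, 0xD negative.
    if sign not in (0xC, 0xD, 0xF):
        raise ValueError("invalid sign nibble")
    value = int("".join(str(d) for d in digits))
    return -value if sign == 0xD else value
```

Under these assumptions, the bytes 0x12 0x3C decode to a prefix length of 123.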

The property dfdl:prefixIncludesPrefixLength is an enumeration which allows the length computation to be varied to include or exclude the length of the prefix element itself.

The prefix length's value contains the length measured in units given by dfdl:lengthUnits.

When parsing, if the dfdl:lengthUnits are bits, then any number of bits can be in the representation. However, the same is not true when unparsing. The DFDL Infoset does not store the number of bits in a number, so the number of bits will always be a multiple of 8 bits.

When unparsing, the value of the prefix is computed automatically by obtaining the length of the element's content.

For a simple element with text representation, the length is computed as for dfdl:lengthKind 'delimited'.

For a simple element with binary representation, the length is given in the table below.

For a complex element, the length is that of the ComplexContent region.

| Type | Length |
|---|---|
| String | Not applicable |
| Float | 32 |
| Double | 64 |
| Decimal, Integer, NonNegativeInteger | Compute the minimum number of bytes to represent the value (per dfdl:binaryNumberRep) and sign (if applicable). Multiply by 8 for number of bits. |
| Long, UnsignedLong | packed decimal: as Decimal; binary: 64 |
| Int, UnsignedInt | packed decimal: as Decimal; binary: 32 |
| Short, UnsignedShort | packed decimal: as Decimal; binary: 16 |
| Byte, UnsignedByte | packed decimal: as Decimal; binary: 8 |
| DateTime | binarySeconds: 32, binaryMilliseconds: 64 |
| Date | binarySeconds: 32, binaryMilliseconds: 64 |
| Time | binarySeconds: 32, binaryMilliseconds: 64 |
| Boolean | 32 |
| HexBinary | Compute the number of bytes in the infoset value padded to the value of the XSD minLength facet (which gives minimum length in bytes) using dfdl:fillByte if necessary. This gives the unparse length in bytes. Multiply by 8 for the number of bits. |

Table 20 Unparse Lengths (in Bits) for Binary Data with dfdl:lengthKind 'prefixed'
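Two of the computed entries in Table 20 can be sketched in Python. This is a hedged illustration: the function names are hypothetical, and the packed decimal case assumes one nibble per decimal digit plus a single sign nibble, rounded up to whole bytes.

```python
def hex_binary_unparse_bits(value: bytes, min_length: int = 0) -> int:
    """Length in bits of an xs:hexBinary value padded to the XSD minLength facet."""
    return max(len(value), min_length) * 8


def packed_unparse_bits(value: int) -> int:
    """Minimum length in bits of a packed decimal integer.

    Assumption: one nibble per decimal digit plus one sign nibble.
    """
    digit_count = len(str(abs(value)))
    nibbles = digit_count + 1
    return ((nibbles + 1) // 2) * 8  # round up to whole bytes
```

For example, the value 123 needs three digit nibbles and a sign nibble, which is two bytes or 16 bits.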

12.3.4.1    Nested Prefix Lengths[20]

It is possible for a prefix length, as specified by dfdl:prefixLengthType, to itself have a prefix length.

It is a schema definition error if this nesting exceeds 1 deep. That is, an element can have a prefix length, which defines a PrefixLength region (see Section 9.2 DFDL Data Syntax Grammar). The PrefixLength region can itself have a type which also specifies a prefix length, thereby defining a PrefixPrefixLength region. It is a schema definition error unless the type associated with the PrefixPrefixLength is different from the type associated with the PrefixLength.

12.3.5    dfdl:lengthKind  'pattern'

The dfdl:lengthKind 'pattern' means the length of the element is given by a regular expression specified using the dfdl:lengthPattern property. The DFDL processor scans the data stream to determine a string value that is the match to a regular expression. The pattern is only used on parsing.

When dfdl:lengthKind is 'pattern', delimiter scanning is turned off and in-scope delimiters are not looked for within or between elements.

Property Name

Description

lengthPattern

DFDL Regular Expression. 

Only used when lengthKind is 'pattern'.

Specifies a regular expression that, on parsing, is executed against the data stream to determine the length of the element.

The data stream beginning at the starting offset of the content region (which can be either the SimpleContent region or the ComplexContent region defined in Section 9.2 DFDL Data Syntax Grammar) of the element is interpreted as a stream of characters in the encoding of the element, and the regular expression contained in the dfdl:lengthPattern property is executed against that stream of characters. When the element is complex the encoding used is the dfdl:encoding of the complex element itself.

It is a schema definition error if there is no value for the dfdl:encoding property in scope.

DFDL Escape Schemes (per dfdl:escapeSchemeRef) are not used when executing the regular expression.

If the pattern matching of the regular expression reads data that cannot be decoded into characters of the current encoding, then the behavior is controlled by the dfdl:encodingErrorPolicy property. See dfdl:encodingErrorPolicy in Section 11 Properties Common to both Content and Framing.

Annotation: dfdl:element, dfdl:simpleType

Table 21 The dfdl:lengthPattern Property

On unparsing the behavior is the same as for dfdl:lengthKind 'prefixed'.

When the DFDL regular expression is matched against data:

·         The data is considered to be text in the character set encoding specified by the dfdl:encoding property, regardless of the actual representation of the element.

·         The data is decoded from the specified encoding into Unicode before the actual matching takes place.

·         If there is no match (ie, the length of the data found to match the pattern is zero) it is not a processing error but instead it means the length is zero.
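The matching steps listed above can be sketched in Python. This is a minimal illustration with a hypothetical function name. It uses Python's `re` module as a stand-in for DFDL regular expressions, whose syntax may differ, and it assumes the data decodes cleanly, so dfdl:encodingErrorPolicy is out of scope.

```python
import re


def pattern_length(data: bytes, pos: int, pattern: str, encoding: str) -> int:
    """Return the matched length in characters, or 0 if there is no match."""
    # Treat the data as text in the given encoding, whatever its real representation.
    text = data[pos:].decode(encoding)
    # The pattern must match starting at the beginning of the content region.
    match = re.compile(pattern).match(text)
    return len(match.group(0)) if match else 0
```

A pattern such as `[0-9]+` applied to the data `12345abc` gives a length of 5, and applied to `abc` gives a length of zero rather than an error.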

 

12.3.6    dfdl:lengthKind 'endOfParent'

The dfdl:lengthKind 'endOfParent' means that the element is terminated either by the end of the data stream, or the end of an enclosing complex element with dfdl:lengthKind ‘explicit’, ‘pattern’, ‘prefixed’ or ‘endOfParent’, or the end of an enclosing choice with dfdl:choiceLengthKind ‘explicit’. The ‘parent’ element or choice does not have to be the immediate enclosing component of the element, but there must be no other components defined between the element specifying dfdl:lengthKind 'endOfParent' and the end of the parent.

A convenient way of describing the parent is as a 'box', being defined as a portion of the data stream that has an established content length prior to the parsing of its children. If the parent is such a ‘box’ then the element specifying dfdl:lengthKind ‘endOfParent’ is the last element in the ‘box’ and its content extends to the end of the ‘box’.

A dfdl:lengthKind of  'endOfParent' can only be used on simple and complex elements in the following locations:

It is a schema definition error if:

The effective length units of the parent are:

·        dfdl:lengthUnits if parent is an element with dfdl:lengthKind ‘explicit’ or ‘prefixed’;

·        ‘characters’ if parent is an element with dfdl:lengthKind ‘pattern’;

·        ‘bytes’ if parent is a choice with dfdl:choiceLengthKind ‘explicit’;

·        ‘characters’ if the element is the document root;

·        the effective length units of the parent’s parent if parent is an element with dfdl:lengthKind ‘endOfParent’
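The definition of effective length units is recursive, which a short Python sketch can make explicit. The `Parent` record and its field names are hypothetical, and the document root case is modelled here as a parent of kind "root".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Parent:
    kind: str                           # "element", "choice" or "root"
    length_kind: Optional[str] = None   # dfdl:lengthKind, for elements
    length_units: Optional[str] = None  # dfdl:lengthUnits, for elements
    parent: Optional["Parent"] = None   # enclosing parent, for 'endOfParent'


def effective_length_units(p: Parent) -> str:
    """Return the effective length units of a parent 'box'."""
    if p.kind == "root":
        return "characters"
    if p.kind == "choice":
        # A choice with dfdl:choiceLengthKind 'explicit'.
        return "bytes"
    if p.length_kind in ("explicit", "prefixed"):
        if p.length_units is None:
            raise ValueError("dfdl:lengthUnits is required")
        return p.length_units
    if p.length_kind == "pattern":
        return "characters"
    if p.length_kind == "endOfParent" and p.parent is not None:
        # Defer to the parent's parent.
        return effective_length_units(p.parent)
    raise ValueError("not a valid parent for dfdl:lengthKind 'endOfParent'")
```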

If the element is in a sequence then it is a schema definition error if:

If the element is in a choice where dfdl:choiceLengthKind is 'implicit' then it is a schema definition error if:

A simple element must have one of:

·         type xs:string

·         dfdl:representation 'text'

·         type xs:hexBinary

·         dfdl:representation 'binary' and a packed decimal representation

A complex element can have dfdl:lengthKind 'endOfParent'. If so then its last child element can be any dfdl:lengthKind including 'endOfParent'.

The dfdl:lengthKind 'endOfParent' can also be used on the document root to allow the last element to consume the data up to the end of the data stream.

The use of dfdl:lengthKind ‘endOfParent’ is distinct from the situation where the length of the last element in the parent is known but is not sufficient to fill the parent. In the latter case the remaining data are ignored on parsing and filled with dfdl:fillByte on unparsing.

When parsing an element with dfdl:lengthKind ‘endOfParent’, delimiter scanning is turned off and in-scope terminating delimiters are not looked for within the element.

When unparsing an element with dfdl:lengthKind ‘endOfParent’, if the parent is a complex element with dfdl:lengthKind 'explicit' where dfdl:length is not an expression, or a choice with dfdl:choiceLengthKind 'explicit', then the element with dfdl:lengthKind 'endOfParent' is padded or filled in the usual manner to the required length, by completing the LeftPadding, RightPadOrFill, ElementUnused, or ChoiceUnused regions of the data as appropriate. 

 

12.3.7    Elements of Specified Length

An element has a specified length when dfdl:lengthKind is 'explicit', 'implicit' (simple type only)  or 'prefixed'. The units that the length represents are specified by the dfdl:lengthUnits property except where noted in Section 12.3.3.

Using specified length, it is possible for an element to have content length longer than needed to represent just the data value. For example, a simple text element may be padded in the RightPadding region if the data is not long enough.

When an element has specified length, but appears inside a complex type element having delimited length kind, delimiter scanning is turned off and in-scope delimiters are not looked for within or between elements.

An element of specified length with dfdl:lengthKind 'implicit' or 'explicit' where dfdl:length is not an expression has a known length when unparsing.  However, an element of specified length with dfdl:lengthKind 'prefixed' or 'explicit' where dfdl:length is an expression is considered to have a variable length when unparsing. Specifically:

When parsing, if the data stream ends without enough data to parse an element, that is, N bits are needed based on the dfdl:length, but only M < N bits are available, then it is a processing error. 

If dfdl:lengthUnits is 'characters' then the length (in bits) of the content region  (i.e., SimpleContent or ComplexContent defined in Section 9.2 DFDL Data Syntax Grammar) will depend on the encoding of the characters.

For a simple element, dfdl:lengthUnits 'characters' may only be used for textual elements; it is a schema definition error otherwise.

Some DFDL implementations may support character set encodings where the characters are not a multiple of 8-bits wide. Encodings which are 5, 6, 7, and 9 bits wide are rare, but do exist, so the overall length of the content region may not be a multiple of 8-bits wide.

12.3.7.1    Length of Simple Elements with Textual Representation

Textual data is defined to mean either data of type string or data where the dfdl:representation property is 'text'.

For a textual element, the dfdl:lengthUnits property can be either 'bytes' or 'characters'.

12.3.7.1.1   Text Length Specified in Bytes

If a textual element has dfdl:lengthUnits of 'bytes', and the dfdl:encoding is not SBCS, then it is possible for a partial character encoding to appear after the code units of the characters. In this case, the following rules apply:

It is a schema definition error if type is xs:string and dfdl:textPadKind is not 'none' and dfdl:lengthUnits is 'bytes' and dfdl:encoding is not an SBCS encoding and the XSD minLength facet is not zero. This prevents a scenario where validation against the XSD minLength facet is in characters but padding would be performed in bytes.

12.3.7.2    Length of Simple Elements with Binary Representation

This section discusses the dfdl:lengthKind 'explicit' and 'prefixed' specified lengths for the different binary representations. When dfdl:lengthKind is 'implicit', see Section 12.3.3 dfdl:lengthKind 'implicit'.

The dfdl:lengthUnits can be 'bytes' or 'bits' unless otherwise stated. It is a schema definition error if dfdl:lengthUnits is 'characters'.

It is a schema definition error if the specified dfdl:length for an element of dfdl:lengthKind 'explicit' is a string literal integer such that the length of the data exceeds the capacity of the simple type.

It is a processing error if the specified length for an element of dfdl:lengthKind 'prefixed' or 'explicit' (with dfdl:length an expression) is an integer such that the length of the data exceeds the capacity of the simple type.

12.3.7.2.1   Length of Base-2 Binary Number Elements

Non-floating point numbers with binary representation and dfdl:binaryNumberRep 'binary' are represented as a bit string which contains a base-2 representation.

The value of the specified length is constrained per the table below. The lengths are expressed in bits and are inclusive.

 

| Type | Minimum value of length | Maximum value of length |
|---|---|---|
| xs:byte | 2 | 8 |
| xs:short | 2 | 16 |
| xs:int | 2 | 32 |
| xs:long | 2 | 64 |
| xs:unsignedByte | 1 | 8 |
| xs:unsignedShort | 1 | 16 |
| xs:unsignedInt | 1 | 32 |
| xs:unsignedLong | 1 | 64 |
| xs:nonNegativeInteger | 1 | Implementation-dependent (but not less than 64) |
| xs:integer | 2 | Implementation-dependent (but not less than 64) |
| xs:decimal | 2 | Implementation-dependent (but not less than 64) |

Table 22: Allowable Specified Lengths in Bits for Base-2 Binary Number Elements

See Section 13.7.1.1 Converting Base-2 Binary Numbers for details of the conversion to/from numeric values.
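The constraints in Table 22 amount to a range check on the specified length. The Python sketch below is an illustration with hypothetical names. The `implementation_max` parameter stands for the implementation-dependent upper bound and defaults to 64, the smallest value the table permits.

```python
# (minimum, maximum) specified length in bits, inclusive, per Table 22.
FIXED_RANGES = {
    "byte": (2, 8), "short": (2, 16), "int": (2, 32), "long": (2, 64),
    "unsignedByte": (1, 8), "unsignedShort": (1, 16),
    "unsignedInt": (1, 32), "unsignedLong": (1, 64),
}

# Types whose maximum length is implementation-dependent.
OPEN_ENDED_MINIMUMS = {"nonNegativeInteger": 1, "integer": 2, "decimal": 2}


def base2_length_ok(xs_type: str, length_bits: int,
                    implementation_max: int = 64) -> bool:
    """Check a specified length in bits for a base-2 binary number element."""
    if xs_type in FIXED_RANGES:
        low, high = FIXED_RANGES[xs_type]
    elif xs_type in OPEN_ENDED_MINIMUMS:
        low, high = OPEN_ENDED_MINIMUMS[xs_type], implementation_max
    else:
        raise ValueError(f"not a base-2 binary number type: {xs_type!r}")
    return low <= length_bits <= high
```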

12.3.7.2.2   Length of Floating Point Binary Number Elements

For binary elements of types xs:float or xs:double, a specified length must be either exactly 4 bytes or exactly 8 bytes respectively.

The dfdl:lengthUnits property must be 'bytes'. It is a schema definition error otherwise.

See Section 13.8 Properties Specific to Float/Double with Binary Representation.

12.3.7.2.3   Length of Packed Decimal Number Elements

Non-floating point numbers with binary representation and dfdl:binaryNumberRep 'packed', 'bcd', or 'ibm4690Packed', are represented as a bit string of 4 bit nibbles. The term packed decimal is used to describe such numbers.

It is a schema definition error if the specified length is not a multiple of 4 bits.

The maximum specified length of a packed decimal number is implementation-defined.

See Section 13.7 Properties Specific to Number with Binary Representation for details of the conversion of the packed decimal bit string to/from a numeric value.

12.3.7.2.4   Length of Binary Boolean Elements

The specified length of a binary element of type xs:boolean is as for type xs:unsignedInt described in 12.3.7.2.1 Length of Base-2 Binary Number Elements.

See also Section 13.10 Properties Specific to Boolean with Binary Representation for details of how the data is converted to/from a Boolean value.

12.3.7.2.5   Length of Base-2 Binary Calendar Elements

Calendars with binary representation and dfdl:binaryCalendarRep ‘binarySeconds’ or ‘binaryMilliseconds’ are represented as a bit string which contains a base-2 representation. The specified length must be either exactly 4 bytes or exactly 8 bytes respectively.

The dfdl:lengthUnits property must be 'bytes'. It is a schema definition error otherwise.

See Section 13.13 Properties Specific to Calendar with Binary Representation for details of how the data is converted to/from the calendar type.

12.3.7.2.6   Length of Packed Decimal Calendar Elements

Calendars with binary representation and dfdl:binaryCalendarRep 'packed', 'bcd', or 'ibm4690Packed', are represented as a bit string of 4 bit nibbles. The term packed decimal is used to describe such calendars.

It is a schema definition error if the specified length is not a multiple of 4 bits.

The maximum specified length of a packed decimal calendar is implementation-dependent (but not less than 9 bytes, which corresponds to calendar pattern 'yyyyMMddhhmmssSSS')[21].

See Section 13.13 Properties Specific to Calendar with Binary Representation for details of how the data is converted to/from the calendar type.

12.3.7.2.7   Length of Binary Opaque Elements

The dfdl:lengthUnits property must be 'bytes'. It is a schema definition error otherwise.

When unparsing a specified length element of type xs:hexBinary, and the simple content region is larger than the length of the element in the Infoset, then the remaining bytes are filled using the dfdl:fillByte property.

The dfdl:fillByte is not used to trim an element of type xs:hexBinary when parsing.
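The asymmetry between unparsing and parsing of xs:hexBinary can be sketched in Python. The function names are hypothetical, lengths are in bytes, and raising an error when the value is longer than the specified length is an assumption of this sketch.

```python
def unparse_hex_binary(value: bytes, length_bytes: int, fill_byte: int) -> bytes:
    """Write the value, then fill the rest of the region with the fill byte."""
    if len(value) > length_bytes:
        # Assumption: a value that does not fit is treated as an error here.
        raise ValueError("value does not fit in the specified length")
    return value + bytes([fill_byte]) * (length_bytes - len(value))


def parse_hex_binary(data: bytes, pos: int, length_bytes: int) -> bytes:
    """Extract the specified number of bytes; fill bytes are kept, not trimmed."""
    end = pos + length_bytes
    if end > len(data):
        raise ValueError("processing error: not enough data")
    return data[pos:end]
```

A consequence of this asymmetry is that a value unparsed with fill bytes and then parsed again comes back at the full specified length, fill bytes included.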

12.3.7.3    Length of Complex Elements

A complex element of specified length is defining a 'box' in which its child elements exist. An example of this would be a fixed length record element with a variable number of children elements. The dfdl:lengthUnits may be 'bytes' or 'characters' and it is a schema definition error otherwise.

It is possible that the children may not entirely fill the full length of the complex element. An example is a complex element with a specified length of 100 characters, which contains a sequence of child elements that use up less than 100 characters of data, perhaps because an optional element is not present. In this case the remaining unused data is called the ElementUnused region in the data syntax grammar of section 9.2. Another example is a complex element with a specified length of 100 bytes, which contains a sequence of child elements the last of which has dfdl:lengthKind 'endOfParent', dfdl:representation 'text' and a multi-byte dfdl:encoding such that the element does not use up all the bytes of data. In this case the remaining unused bytes comprise the child element's RightFill region in the data syntax grammar of section 9.2. In both examples, the unused area is skipped when parsing, and is filled with the dfdl:fillByte on unparsing.  

Note that a poorly chosen value for dfdl:fillByte may fill the region with data that cannot be decoded in the character set encoding, resulting in a decode error when this data is subsequently parsed again. When dfdl:lengthUnits is 'characters' the value for dfdl:fillByte should be chosen so as to avoid this error.
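
The following Python sketch is a non-normative illustration of how the unused region of a specified-length complex element might be filled on unparsing, and of why a poorly chosen fill byte can cause a later decode error. The function name and the treatment of oversized content as an error are assumptions of the sketch.

```python
def unparse_fixed_box(children: list[bytes], box_length: int, fill_byte: int) -> bytes:
    """Concatenate already-unparsed children and fill the rest of the box.

    children   -- byte representation of each child element, in order
    box_length -- specified length of the complex element, in bytes
    fill_byte  -- value of dfdl:fillByte (0..255)
    """
    content = b"".join(children)
    if len(content) > box_length:
        # Assumption of this sketch: content that does not fit is an error.
        raise ValueError("children exceed the specified length")
    unused = box_length - len(content)
    return content + bytes([fill_byte]) * unused

# A space fill byte leaves data that still decodes as text.
record = unparse_fixed_box([b"AB", b"CD"], 8, 0x20)

# 0xFF never appears in well-formed UTF-8, so re-reading this record as
# UTF-8 characters fails with a decode error.
bad_record = unparse_fixed_box(["é".encode("utf-8")], 4, 0xFF)
```

When dfdl:lengthUnits is 'characters', choosing a fill byte that is a valid single-byte character in the encoding (such as 0x20 for UTF-8) avoids the failure shown by `bad_record`.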

13.  Simple Types

The 'representation' property identifies the physical representation of the element. The DFDL logical types are grouped to illustrate which physical representations apply to each logical type.

These properties provide the correct interpretation of the data found in the SimpleContent grammar region.

The allowable physical representations for each logical type grouping are also shown, where the logical type groupings are defined as:

Logical Type Group

Types

Number

xs:double, xs:float, xs:decimal, xs:integer and its restrictions (xs:int, xs:unsignedLong, etc.)

String

xs:string

Calendar

xs:dateTime, xs:date, xs:time

Opaque

xs:hexBinary

Boolean

xs:boolean

Table 23 Logical type groups

13.1     Properties Common to All Simple Types

Property Name

Description

representation

Enum

Valid values are dependent on logical type.

Number: 'text', 'binary'

String: representation is assumed to be 'text' and the dfdl:representation property is not examined.

Calendar: 'text', 'binary'

Boolean: 'text', 'binary'

Opaque:  representation is assumed to be 'binary' and the dfdl:representation property is not examined.

Annotation: dfdl:element, dfdl:simpleType

Table 24 Properties Common to All Simple Types

The permitted representation properties for each logical type are shown in Table 25: Logical Type to Representation properties

Logical  type

dfdl:representation

Additional representation property

String

Assumed to be text

 

Float, Double

text

dfdl:textNumberRep:
standard

binary

dfdl:binaryFloatRep:
ieee, ibm390Hex

Decimal, Integer, nonNegativeInteger

text

dfdl:textNumberRep:
standard, zoned

binary

dfdl:binaryNumberRep:
packed, bcd, ibm4690Packed, binary

Long, Int, Short, Byte, UnsignedLong, UnsignedInt, UnsignedShort, UnsignedByte

text

dfdl:textNumberRep:
standard, zoned

binary

dfdl:binaryNumberRep:
packed, bcd, ibm4690Packed, binary

DateTime, Date, Time

text

 

 

binary

dfdl:binaryCalendarRep:
packed, bcd, ibm4690Packed, binarySeconds, binaryMilliseconds

Boolean

text

 

binary

 

HexBinary

Assumed to be binary

 

Table 25: Logical Type to Representation properties

13.2     Properties Common to All Simple Types with Text representation

Property Name

Description

textPadKind

Enum

Valid values 'none', 'padChar'.

Indicates whether to pad the data value on unparsing. This controls the contents of the LeftPadding and RightPadding regions of the data syntax grammar in section 9.2.

'none': No padding occurs. When dfdl:lengthKind is 'implicit' or  'explicit' (and dfdl:length is not an expression) the unparsed data value must match the expected length otherwise it is a processing error.

'padChar': The data value is padded using the dfdl:textStringPadCharacter, dfdl:textNumberPadCharacter, dfdl:textBooleanPadCharacter or dfdl:textCalendarPadCharacter  depending on the type of the element. The padding characters populate the LeftPadding and/or RightPadding regions depending on dfdl:textStringJustification, dfdl:textNumberJustification, or dfdl:textCalendarJustification, depending on the type of the element.

When dfdl:lengthKind is 'implicit' the data value is padded to the implicit length for the type.

When dfdl:lengthKind is 'explicit' (and dfdl:length is not an expression) the data value is padded to the length given by the dfdl:length property.

When dfdl:lengthKind is 'explicit' (and dfdl:length is an expression), 'delimited', 'prefixed', 'pattern' the data value is padded to the length given by the XSD minLength facet for type 'xs:string' or dfdl:textOutputMinLength  property for other types.

When dfdl:lengthKind is 'endOfParent' the data value is padded to the available length.

Annotation: dfdl:element, dfdl:simpleType

textTrimKind

Enum

Valid values 'none', 'padChar'

Indicates whether to trim data on parsing. This controls the expected contents of the LeftPadding and RightPadding regions of the data syntax grammar in section 9.2.

When 'none' no trimming takes place. 

When 'padChar' the element is trimmed of the dfdl:textStringPadCharacter, dfdl:textNumberPadCharacter, dfdl:textBooleanPadCharacter or dfdl:textCalendarPadCharacter  depending on the type of the element.  The padding characters populate the LeftPadding and/or RightPadding regions depending on dfdl:textStringJustification, dfdl:textNumberJustification, or dfdl:textCalendarJustification, depending on the type of the element.

Annotation: dfdl:element , dfdl:simpleType

textOutputMinLength

Non-negative Integer.  

Only used when dfdl:textPadKind is 'padChar' and dfdl:lengthKind is 'delimited', 'prefixed', 'pattern', 'explicit' (when dfdl:length is an expression) or 'endOfParent', and type is not xs:string

Specifies the minimum content length during unparsing for simple types that do not allow the XSD minLength facet to be specified.

For dfdl:lengthKind 'delimited', 'pattern' and 'endOfParent' the length units are always characters, for other dfdl:lengthKinds the length units are specified by the dfdl:lengthUnits property.

If dfdl:textOutputMinLength is zero or less than the length of the representation text then no padding occurs.

Annotation: dfdl:element, dfdl:simpleType

escapeSchemeRef

QName or empty String

The name of the dfdl:defineEscapeScheme annotation that provides the additional properties used to describe the escape scheme. If the value is the empty string then escaping is explicitly turned off.

See: Section 7.6 The dfdl:escapeScheme Annotation Element, and Section 7.5 The dfdl:defineEscapeScheme Defining Annotation Element.

Annotation: dfdl:element, dfdl:simpleType

Table 26 Properties Common to All Simple Types with Text Representation
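
As a non-normative illustration of the dfdl:textPadKind 'padChar' rules in Table 26, the following Python sketch selects the length to which a data value is padded on unparsing. The function and parameter names are hypothetical; `min_length` stands for the XSD minLength facet (for xs:string) or dfdl:textOutputMinLength (for other types).

```python
from typing import Optional

# lengthKind values whose pad target comes from minLength / textOutputMinLength.
MIN_LENGTH_KINDS = {"delimited", "prefixed", "pattern"}

def pad_target_length(length_kind: str,
                      implicit_length: Optional[int] = None,
                      explicit_length: Optional[int] = None,
                      length_is_expression: bool = False,
                      min_length: int = 0,
                      available_length: Optional[int] = None) -> Optional[int]:
    """Return the length a value is padded to when textPadKind is 'padChar'."""
    if length_kind == "implicit":
        return implicit_length
    if length_kind == "explicit" and not length_is_expression:
        return explicit_length
    if length_kind == "explicit" or length_kind in MIN_LENGTH_KINDS:
        # Expression-valued dfdl:length, or a lengthKind with no fixed length.
        return min_length
    if length_kind == "endOfParent":
        return available_length
    raise ValueError("unsupported dfdl:lengthKind in this sketch: " + length_kind)
```

For example, `pad_target_length("delimited", min_length=5)` selects 5, so a shorter value would be padded to 5 and a longer value would be left unpadded.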

13.2.1    The dfdl:escapeScheme Properties

The dfdl:escapeScheme annotation is used within a dfdl:defineEscapeScheme annotation to group the properties of an escape scheme and allows a common set of properties to be defined that can be reused.

An escape scheme is needed when the content of a text element contains sequences of characters that are the same as an in-scope separator or terminator. If the characters are not escaped, a parser scanning for a separator or terminator would erroneously find the character sequence in the content.

An escape scheme defines the properties that describe the text escaping rules. There are two variants on such schemes:

·         The use of a single escape character to cause the next character to be interpreted literally. The escape character itself is escaped by the escape escape character.

·         The use of a pair of escape strings to cause the enclosed group of characters to be interpreted literally. The ending escape string is escaped by the escape escape character.

On parsing, the escape scheme is applied after pad characters are trimmed and on unparsing before pad characters are added. A pad character is not escaped by an escape character. When parsing, pad characters are trimmed without reference to an escape scheme. When unparsing, pad characters are added without reference to an escape scheme.

On unparsing, the application of escape scheme processing takes place before the application of the dfdl:emptyValueDelimiterPolicy property.
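
The first variant, a single escape character, can be illustrated with the following non-normative Python sketch. All function and parameter names are hypothetical. The sketch works on already-trimmed text and does not attempt to handle a literal escape-escape character that happens to immediately precede a character needing escape.

```python
def escape_unparse(text, delimiters, esc, esc_esc, extra=()):
    """Escape text so that a scanning parser does not see false delimiters.

    delimiters -- in-scope terminating delimiters (strings)
    esc        -- the escape character
    esc_esc    -- the escape-escape character ('' disables escaping of esc)
    extra      -- additional single characters that must be escaped
    """
    out = []
    for i, ch in enumerate(text):
        if ch == esc:
            # The escape character itself is escaped by the escape-escape character.
            out.append(esc_esc + esc)
        elif any(text.startswith(d, i) for d in delimiters) or ch in extra:
            # Only the first character of a delimiter is escaped.
            out.append(esc + ch)
        else:
            out.append(ch)
    return "".join(out)

def escape_parse(data, delimiters, esc, esc_esc):
    """Return (value, index) where index is the position of the first
    unescaped delimiter, or len(data) if none is found."""
    out = []
    i = 0
    while i < len(data):
        if esc_esc and data.startswith(esc_esc + esc, i):
            out.append(esc)           # escaped escape character
            i += 2
        elif data[i] == esc and i + 1 < len(data):
            out.append(data[i + 1])   # next character is taken literally
            i += 2
        elif any(data.startswith(d, i) for d in delimiters):
            break
        else:
            out.append(data[i])
            i += 1
    return "".join(out), i
```

With escape character '/', escape-escape character '!' and separator ',', the value `a,b/c` unparses to `a/,b!/c`, and parsing that text recovers the original value.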

 

Property Name

Description

escapeKind

Enum

Valid values 'escapeCharacter', 'escapeBlock'

The type of escape mechanism defined in the escape scheme

When 'escapeCharacter': On unparsing a single character of the data is escaped by adding a dfdl:escapeCharacter before it. The following are escaped if they are in the data:

  • Any in-scope terminating delimiter by escaping its first character.
  • dfdl:escapeCharacter (escaped by dfdl:escapeEscapeCharacter)
  • any dfdl:extraEscapedCharacters

On parsing any in-scope terminating delimiter encountered in the data is not interpreted as such when it is immediately preceded by the dfdl:escapeCharacter (when not itself preceded by the dfdl:escapeEscapeCharacter). Occurrences of the dfdl:escapeCharacter and dfdl:escapeEscapeCharacter are removed from the data, unless the dfdl:escapeCharacter is preceded by the dfdl:escapeEscapeCharacter, or the dfdl:escapeEscapeCharacter does not precede the dfdl:escapeCharacter.

When 'escapeBlock': On unparsing the entire data is escaped by adding dfdl:escapeBlockStart to the beginning and dfdl:escapeBlockEnd to the end of the data. The data is either always escaped or escaped when needed as specified by dfdl:generateEscapeBlock. If the data is escaped and contains the dfdl:escapeBlockEnd then the first character of each appearance of the dfdl:escapeBlockEnd is escaped by the dfdl:escapeEscapeCharacter.

On parsing the dfdl:escapeBlockStart string must be the first characters in the (trimmed) data in order to activate the escape scheme. The dfdl:escapeBlockStart string is removed from the beginning of the data. Until a matching dfdl:escapeBlockEnd string (that is, one not preceded by the dfdl:escapeEscapeCharacter) is found in the data, any in-scope terminating delimiter encountered in the data is not interpreted as such, and any dfdl:escapeEscapeCharacters are removed when they precede a dfdl:escapeBlockEnd string. The matching dfdl:escapeBlockEnd string is removed from the data. The matching dfdl:escapeBlockEnd does not have to be the last character(s) in the (trimmed) data in order to de-activate the escape scheme. A dfdl:escapeBlockStart occurring anywhere in the data other than the first characters has no significance.

Annotation: dfdl:escapeScheme

escapeCharacter

DFDL String Literal or DFDL Expression

Specifies one character that escapes the subsequent character.

Used when dfdl:escapeKind is 'escapeCharacter'

It is a schema definition error if dfdl:escapeCharacter is empty when dfdl:escapeKind is 'escapeCharacter'

This property can be computed by way of an expression which returns a character. The expression must not contain forward references to elements which have not yet been processed.

Escape and Quoting Character Restrictions: The string literal is restricted to allow only certain kinds of syntax:

  • DFDL character entities are allowed
  • The DFDL byte value entity ( %#r ) is not allowed
  • DFDL Character classes  NL, WSP, WSP+, WSP*, and ES are not allowed

It is a schema definition error if the string literal contains any of the disallowed constructs.

Escape characters contribute to the content length of the field

Annotation: dfdl:escapeScheme

escapeBlockStart

DFDL String Literal

The string of characters that denotes the beginning of a sequence of characters escaped by a pair of escape strings.

Used when dfdl:escapeKind is 'escapeBlock'

It is a schema definition error if dfdl:escapeBlockStart is empty when dfdl:escapeKind is 'escapeBlock'

The string literal value is restricted in the same way as described in "Escape and Quoting Character Restrictions" in the description of the dfdl:escapeCharacter property.

A dfdl:escapeBlockStart string contributes to the content length of the field

Annotation: dfdl:escapeScheme

escapeBlockEnd

DFDL String Literal

The string of characters that denotes the end of a sequence of characters escaped by a pair of escape strings.

Used when dfdl:escapeKind is 'escapeBlock' .

It is a schema definition error if dfdl:escapeBlockEnd is empty when dfdl:escapeKind is 'escapeBlock'

The string literal value is restricted in the same way as described in "Escape and Quoting Character Restrictions" in the description of the escapeCharacter property.

A dfdl:escapeBlockEnd string contributes to the content length of the field

Annotation: dfdl:escapeScheme

escapeEscapeCharacter

DFDL String Literal or DFDL Expression

Specifies one character that escapes an immediately following dfdl:escapeCharacter or first character of dfdl:escapeBlockEnd.

Used when dfdl:escapeKind is 'escapeCharacter' or 'escapeBlock'.

This property can be computed by way of an expression which returns a character. The expression must not contain forward references to elements which have not yet been processed.

The string literal value is restricted in the same way as described in "Escape and Quoting Character Restrictions" in the description of the escapeCharacter property.

If the empty string is specified then no escaping of escape characters occurs.

It is explicitly allowed for both the dfdl:escapeCharacter and the dfdl:escapeEscapeCharacter to be the same character. In that case processing functions as if the dfdl:escapeCharacter escapes itself.

Annotation: dfdl:escapeScheme

extraEscapedCharacters

List of DFDL String Literals

A whitespace separated list of single characters that must be escaped in addition to the in-scope delimiters. If there are no extra characters to escape the property should be set to "".

The string literal values are restricted in the same way as described in "Escape and Quoting Character Restrictions" in the description of the dfdl:escapeCharacter property.

This property only applies on unparsing.

Annotation: dfdl:escapeScheme

generateEscapeBlock

Enum

Valid values 'always',  'whenNeeded'

Controls when escaping is used on unparsing when dfdl:escapeKind is 'escapeBlock'.

If 'always' then escaping always occurs as described in dfdl:escapeKind.

If 'whenNeeded' then escaping occurs as described in dfdl:escapeKind when the data contains any of the following:

  • any in-scope terminating delimiter
  • dfdl:escapeBlockStart at the start of the data
  • any dfdl:extraEscapedCharacters

Annotation: dfdl:escapeScheme

Table 27 Escape Scheme Properties
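
The dfdl:escapeKind 'escapeBlock' behavior in Table 27 can be illustrated with the following non-normative Python sketch. The function and parameter names are hypothetical, and the sketch operates on already-trimmed text.

```python
def block_unparse(text, delimiters, start, end, esc_esc,
                  generate="whenNeeded", extra=()):
    """Wrap text in an escape block according to generateEscapeBlock."""
    needed = (generate == "always"
              or any(d in text for d in delimiters)
              or text.startswith(start)
              or any(c in text for c in extra))
    if not needed:
        return text
    # Each appearance of the block end inside the data is protected by
    # placing the escape-escape character before its first character.
    body = text.replace(end, esc_esc + end) if esc_esc else text
    return start + body + end

def block_parse(data, delimiters, start, end, esc_esc):
    """Return (value, index) where index is the position of the first
    delimiter found outside the escape block, or len(data) if none."""
    out = []
    i = 0
    in_block = data.startswith(start)   # only significant at the very start
    if in_block:
        i = len(start)
    while i < len(data):
        if in_block and esc_esc and data.startswith(esc_esc + end, i):
            out.append(end)             # escaped block end: keep it as data
            i += len(esc_esc) + len(end)
        elif in_block and data.startswith(end, i):
            in_block = False            # matching block end: drop it
            i += len(end)
        elif not in_block and any(data.startswith(d, i) for d in delimiters):
            break
        else:
            out.append(data[i])
            i += 1
    return "".join(out), i
```

Using a double quote for both block strings, a backslash as the escape-escape character and ',' as the separator, a value containing both quotes and a comma round-trips through `block_unparse` and `block_parse`, while a value needing no protection is written unchanged under 'whenNeeded'.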


13.3     Properties for Bidirectional support for All Simple Types with Text representation

Bidirectional text consists of mainly right-to-left text with some left-to-right nested segments (such as an Arabic text with some information in English), or vice versa (such as an English letter with a Hebrew address nested within it.)

Note: the bidirectional properties apply to the content of the element and not to the initiator, terminator or separator if defined.

Property name

Description

textBidi

Enum

Valid values are 'yes', 'no'

Indicates the text content of the element is bidirectional.

Annotation: dfdl:element, dfdl:simpleType (representation text)

textBidiOrdering

Enum

Valid values 'implicit', 'visual'.

Defines how bidirectional text is stored in memory.

'implicit' means that the characters are stored in the order they are read or typed, that is, with the first character in the first position in the data. (This is also called logical.) 'visual' means that the characters are stored in the order they would be printed or displayed. That is, the last character of a right-to-left sequence is in the first position in the data and the first character of a left-to-right sequence is in the first position in the data.

Annotation: dfdl:element, dfdl:simpleType (representation text)

textBidiOrientation

Enum

Valid values 'LTR', 'RTL', 'contextual_LTR', 'contextual_RTL'.

Indicates how the text should be displayed.

'LTR' means left-to-right.

'RTL' means right-to-left.

'contextual_LTR' and 'contextual_RTL' mean that the orientation should be taken from the context of the data. The data may contain 'strong' characters that are either orientation left or orientation right. The term following contextual (LTR or RTL) specifies what should be the default orientation when the data are orientation-neutral (i.e. there are no strong characters).

Annotation: dfdl:element, dfdl:simpleType (representation text)

textBidiSymmetric

Enum

Valid values are 'yes', 'no'

Defines whether characters such as < ( [ { that have a symmetric character with an opposite directional meaning, namely > ) ] }, should be swapped.

Annotation: dfdl:element, dfdl:simpleType (representation text)

textBidiShaped

Enum

Valid values are 'yes', 'no'

Defines whether characters should be shaped on unparsing. Character shaping occurs when the shape of a character is dependent on its position in a word.

Annotation: dfdl:element, dfdl:simpleType (representation text)

textBidiNumeralShapes

Enum

Valid values 'nominal', 'national'.

Defines on unparsing whether logical numbers with text representation  should have Arabic shapes (0123456789) or Arabic-Indic ( ٠١٢٣٤٥٦٧٨٩ )

When 'nominal': All numbers are presented using Arabic shapes

When 'national': All numbers are presented using  Arabic-Indic shapes.

Annotation: dfdl:element, dfdl:simpleType (number with representation text)

Table 28 Properties for Bidirectional support for All Simple Types with Text representation

13.4     Properties Specific to String

 

Property Name

Description

textStringJustification

Enum

Valid values 'left', 'right',  'center'

Unparsing:

'left': Justifies to the left and adds padding chars to the string contents if the string is too short, to the length determined by the dfdl:textPadKind property.

'right': Justifies to the right and adds padding chars to the string contents if the string is too short, to the length determined by the dfdl:textPadKind property.

'center': Adds equal padding chars left and right of the string contents if the string is too short, to the length determined by the dfdl:textPadKind property. It adds one extra padding char on the left if needed.

Parsing:

'left': Trims any pad characters from the right of the string, according to dfdl:textTrimKind property.

'right': Trims any pad characters from the left of the string, according to dfdl:textTrimKind property.

'center': Trims any pad characters from the left and right of the string, according to dfdl:textTrimKind property.

Annotation: dfdl:element, dfdl:simpleType

textStringPadCharacter

DFDL String Literal

The value that is used when padding or trimming string elements.
The value can be a single character or a single byte.

If a character, then it can be specified using a literal character or using DFDL entities.

If a byte, then it must be specified using a single byte value entity otherwise it is a schema definition error

If a pad character is specified when dfdl:lengthUnits is 'bytes' then the pad character must be a single-byte character.

If a pad byte is specified when dfdl:lengthUnits is 'characters' then

  • the encoding must be a fixed-width encoding
  • padding and trimming must be applied using a sequence of N pad bytes, where N is the width of a character in the fixed-width encoding.

Padding Character Restrictions: The string literal is restricted to allow only certain kinds of syntax:

  • DFDL character entities are allowed
  • The DFDL byte value entity ( %#r ) is allowed.
  • DFDL Character classes NL, WSP, WSP+, WSP*, and ES are not allowed

It is a schema definition error if the string literal contains any of the disallowed syntax.

Annotation: dfdl:element, dfdl:simpleType

truncateSpecifiedLengthString

 

Enum

Valid values are 'yes', 'no'

Used on unparsing only

'yes' means if the logical type is xs:string and the value is longer than the specified length, the string is truncated to this length. (See section 12.3.7 Elements of Specified Length.) No processing error is raised.

The position from which data is truncated is determined by the value of the dfdl:textStringJustification property. If the value of the dfdl:textStringJustification property is 'left', data is truncated from the right; if the value of the dfdl:textStringJustification property is 'right', data is truncated from the left. However if the value of the dfdl:textStringJustification property is 'center', truncation does not occur and a processing error occurs if the value is too long.

When unparsing, validation errors cannot be prevented by truncation as validation takes place on the augmented infoset, before any truncation has occurred.

Annotation: dfdl:element, dfdl:simpleType

Table 29 Properties Specific to String
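
The justification, trimming and truncation rules in Table 29 can be illustrated with the following non-normative Python sketch. The function names are hypothetical, lengths are counted in characters, and a ValueError stands in for a processing error.

```python
def pad(value: str, length: int, pad_char: str, justification: str) -> str:
    """Pad value to length on unparsing (textPadKind 'padChar')."""
    missing = length - len(value)
    if missing <= 0:
        return value
    if justification == "left":
        return value + pad_char * missing
    if justification == "right":
        return pad_char * missing + value
    # 'center': equal padding on both sides, one extra on the left if needed.
    right = missing // 2
    left = missing - right
    return pad_char * left + value + pad_char * right

def trim(data: str, pad_char: str, justification: str) -> str:
    """Trim pad characters on parsing (textTrimKind 'padChar')."""
    if justification == "left":
        return data.rstrip(pad_char)
    if justification == "right":
        return data.lstrip(pad_char)
    return data.strip(pad_char)

def truncate(value: str, length: int, justification: str) -> str:
    """Apply truncateSpecifiedLengthString 'yes' on unparsing."""
    if len(value) <= length:
        return value
    if justification == "left":
        return value[:length]                 # truncate from the right
    if justification == "right":
        return value[len(value) - length:]    # truncate from the left
    raise ValueError("processing error: 'center' values are not truncated")
```

For example, padding 'ab' to 5 characters with '*' and 'center' justification yields '**ab*', and trimming that text with the same settings recovers 'ab'.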


 

13.5     Properties Specific to Number with Text or Binary Representation

Property Name

Description

decimalSigned

Enum

Valid values are 'yes', 'no'

Indicates whether an xs:decimal element is signed. See 13.6.2 Converting logical numbers to/from text representation and 13.7.1 Converting Logical Numbers to/from Binary  to see how this affects the presence of the sign in the data stream.

'yes' means that the xs:decimal element is signed

'no' means that the xs:decimal element is not signed

Annotation: dfdl:element, dfdl:simpleType

Table 30 Properties Specific to Number with Text or Binary Representation

13.6     Properties Specific to Number with Text Representation

Property Name

Description

textNumberRep

Enum

Valid values are 'standard', 'zoned'

'standard' means represented as characters in the character set encoding specified by the dfdl:encoding property.

'zoned' means represented as a zoned decimal in the character set encoding specified by the dfdl:encoding property. Zoned is not supported for float and double numbers. Base 10 is assumed, and the encoding must be for an EBCDIC or ASCII compatible encoding. It is a schema definition error if any of these requirements are not met.

Annotation: dfdl:element, dfdl:simpleType

textNumberJustification

Enum

Valid values 'left', 'right', 'center'

Controls how the data is padded or trimmed on parsing and unparsing.

Behavior as for dfdl:textStringJustification.

Annotation: dfdl:element, dfdl:simpleType

textNumberPadCharacter

DFDL String Literal

The value that is used when padding or trimming number elements.

The value can be a single character or a single byte.

If a character, then it can be specified using a literal character or using DFDL entities.
If a byte, then it must be specified using a single byte value entity

If a pad character is specified when dfdl:lengthUnits is 'bytes' then the pad character must be a single-byte character.

If a pad byte is specified when dfdl:lengthUnits is 'characters' then

·         the encoding must be a fixed-width encoding

·         padding and trimming must be applied using a sequence of N pad bytes, where N is the width of a character in the fixed-width encoding.

When parsing, if the pad character is '0' and the SimpleContent region consists entirely of '0' characters, then the last remaining '0' is not trimmed and a single '0' is the result of the trimming.  This rule also applies when the pad character is a DFDL character entity equivalent to '0'. This rule does not apply when the pad character is any other character nor when a pad byte is specified.  

The string literal value is restricted in the same way as described in "Padding Character Restrictions" in the description of the dfdl:textStringPadCharacter property.

Annotation: dfdl:element, dfdl:simpleType
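
The special rule for a pad character of '0' in dfdl:textNumberPadCharacter can be illustrated with the following non-normative Python sketch. The function name is hypothetical, and pad bytes are not covered.

```python
def trim_number(data: str, pad_char: str, justification: str) -> str:
    """Trim pad characters from a text number on parsing."""
    if justification == "left":
        trimmed = data.rstrip(pad_char)
    elif justification == "right":
        trimmed = data.lstrip(pad_char)
    else:  # 'center'
        trimmed = data.strip(pad_char)
    # If the content was entirely '0' pad characters, keep a single '0'
    # so that the value zero is not trimmed away to nothing.
    if trimmed == "" and data != "" and pad_char == "0":
        return "0"
    return trimmed
```

With a pad character of '0' and 'right' justification, '0042' trims to '42' while '0000' trims to '0'. With any other pad character, content consisting only of pad characters trims to the empty string.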

textNumberPattern

String

Defines the ICU-like pattern that describes the format of the text number. The pattern defines where grouping separators, decimal separators, implied decimal points, exponents, positive signs and negative signs appear. It permits definition by either digits/fractions or significant digits. Allows rounding.

When dfdl:textNumberRep is 'standard' this property only applies when  dfdl:textStandardBase is 10. When dfdl:textNumberRep is 'standard' and dfdl:textStandardBase is not 10 the number is represented as the  minimum number of characters to represent the digits. There is no sign or virtual decimal point.

The syntax of dfdl:textNumberPattern is described in section 13.6.1 The dfdl:textNumberPattern Property

Annotation: dfdl:element, dfdl:simpleType

textNumberRounding

Enum

Specifies how rounding is controlled during unparsing.

Valid values 'pattern', 'explicit'

When dfdl:textNumberRep is 'standard' this property only applies when dfdl:textStandardBase is 10.

If 'pattern' then rounding takes place according to the pattern. A rounding increment may be specified in the dfdl:textNumberPattern using digits '1' through '9', otherwise rounding is to the width of the pattern. The rounding mode is always 'roundHalfEven'.

If 'explicit' then the rounding increment is specified by the dfdl:textNumberRoundingIncrement property, and any digits '1' through '9' in the dfdl:textNumberPattern are treated as digit '0'. The rounding mode is specified by the dfdl:textNumberRoundingMode property.

To disable rounding, use 'explicit' in conjunction with 'roundUnnecessary' for the dfdl:textNumberRoundingMode. If rounding is disabled then any excess precision is treated as a processing error.

Annotation: dfdl:element, dfdl:simpleType

textNumberRoundingMode

Enum

Specifies how rounding occurs during unparsing, when dfdl:textNumberRounding is 'explicit'.

When dfdl:textNumberRep is 'standard' this property only applies when  dfdl:textStandardBase is 10.

To switch off rounding, use 'roundUnnecessary'.

Valid values 'roundCeiling',  'roundFloor', 'roundDown', 'roundUp', 'roundHalfEven',  'roundHalfDown', 'roundHalfUp', 'roundUnnecessary'

Annotation: dfdl:element, dfdl:simpleType

textNumberRoundingIncrement

Double

Specifies the rounding increment to use during unparsing, when dfdl:textNumberRounding is 'explicit'.

When dfdl:textNumberRep is 'standard' this property only applies when  dfdl:textStandardBase is 10.

A negative value is a schema definition error.

Annotation: dfdl:element, dfdl:simpleType
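
Rounding with dfdl:textNumberRounding 'explicit' can be illustrated with the following non-normative Python sketch built on the standard decimal module. The function name is hypothetical. The mapping of the DFDL rounding mode names onto the decimal module's constants is an assumption of the sketch, as is the rejection of a zero increment.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN, ROUND_UP,
                     ROUND_HALF_EVEN, ROUND_HALF_DOWN, ROUND_HALF_UP)

# Assumed correspondence between DFDL mode names and decimal module modes.
MODES = {
    "roundCeiling": ROUND_CEILING,
    "roundFloor": ROUND_FLOOR,
    "roundDown": ROUND_DOWN,
    "roundUp": ROUND_UP,
    "roundHalfEven": ROUND_HALF_EVEN,
    "roundHalfDown": ROUND_HALF_DOWN,
    "roundHalfUp": ROUND_HALF_UP,
}

def round_to_increment(value: Decimal, increment: Decimal, mode: str) -> Decimal:
    """Round value to a multiple of increment using the named mode."""
    if increment < 0:
        raise ValueError("schema definition error: negative rounding increment")
    if increment == 0:
        raise ValueError("a zero increment is not handled by this sketch")
    quotient = value / increment
    if mode == "roundUnnecessary":
        # Rounding is disabled: any excess precision is a processing error.
        if quotient != quotient.to_integral_value():
            raise ValueError("processing error: excess precision")
        return value
    return quotient.quantize(Decimal(1), rounding=MODES[mode]) * increment
```

For example, rounding 1.23 to an increment of 0.05 gives 1.25 under 'roundHalfEven' and 1.20 under 'roundFloor', while 'roundUnnecessary' rejects 1.23 and accepts 1.25.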

textNumberCheckPolicy

Enum

Values are 'strict' and 'lax'.

Indicates how lenient to be when parsing against the pattern.

When dfdl:textNumberRep is 'standard' this property only applies when  dfdl:textStandardBase is 10.

If 'lax' and dfdl:textNumberRep is 'standard' then grouping separators are ignored, leading and trailing whitespace  is ignored, leading zeros are ignored and quoted characters may be omitted.

If 'lax' and dfdl:textNumberRep is 'zoned' then positive punched data is accepted when parsing an unsigned type, and unpunched data is accepted when parsing a signed type

If 'strict' and dfdl:textNumberRep is 'standard' then the data must follow the pattern with the exceptions that digits 0-9, decimal separator and exponent separator are always recognised and parsed.

If 'strict' and dfdl:textNumberRep is 'zoned' then the data must follow the pattern.

On unparsing the pattern is always followed, in accordance with the rules in 13.6.2 Converting logical numbers to/from text representation.

Annotation: dfdl:element, dfdl:simpleType

textStandardDecimalSeparator

List of DFDL String Literals  or DFDL Expression

Defines a whitespace-separated list of single characters that will appear (individually) in the data as the decimal separator.

This property is applicable when dfdl:textNumberRep is 'standard' and dfdl:textStandardBase is 10. It must be set if dfdl:textNumberPattern contains a decimal separator symbol ("."), or the E or @ symbols; it is a schema definition error otherwise. Empty string is not an allowable value.

This property can be computed by way of an expression which returns a character. The expression must not contain forward references to elements which have not yet been processed.

Text Number Character Restrictions: The string literal is restricted to allow only certain kinds of syntax:

·         DFDL character entities are allowed

·         The DFDL byte value entity ( %#r ) is not allowed.

·         DFDL Character classes NL, WSP, WSP+, WSP*, and ES are not allowed

It is a schema definition error if the string literal contains any of the disallowed syntax constructs.

Annotation: dfdl:element, dfdl:simpleType

textStandardGroupingSeparator

DFDL String Literal or DFDL Expression

Defines the single character that will appear in the data as the grouping separator.

This property is applicable when dfdl:textNumberRep is 'standard' and dfdl:textStandardBase is 10. It must be set if dfdl:textNumberPattern contains a grouping separator symbol; it is a schema definition error otherwise. Empty string is not an allowable value.

This property can be computed by way of an expression which returns a character. The expression must not contain forward references to elements which have not yet been processed.

The string literal value is restricted in the same way as described in "Text Number Character Restrictions" in the description of the dfdl:textStandardDecimalSeparator property.

Annotation: dfdl:element, dfdl:simpleType

textStandardExponentRep

DFDL String Literal or DFDL Expression

Defines the actual character(s) that will appear in the data as the exponent indicator. If the empty string is specified then no exponent character will be used.

This property is applicable when dfdl:textNumberRep is 'standard' and dfdl:textStandardBase is 10. Empty string is an allowable value, so that formats like NNN+M (meaning NNN x 10 raised to the power M) can be expressed.

This property must be set even if the dfdl:textNumberPattern does not contain an 'E' (exponent) character. It is a schema definition error if this property is not set or in scope for any number with dfdl:representation 'text'.

This property can be computed by way of an expression which returns a DFDL String Literal character. The expression must not contain forward references to elements which have not yet been processed.

The string literal value is restricted in the same way as described in "Text Number Character Restrictions" in the description of the dfdl:textStandardDecimalSeparator property.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

Annotation: dfdl:element, dfdl:simpleType

textStandardInfinityRep

DFDL String Literal

The value used to represent infinity.

Infinity is represented as a string with the positive or negative prefixes and suffixes from the dfdl:textNumberPattern applied.

This property is applicable when dfdl:textNumberRep is 'standard', dfdl:textStandardBase is 10 and the simple type is float or double.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

The string literal value is restricted in the same way as described in "Text Number Character Restrictions" in the description of the dfdl:textStandardDecimalSeparator property.

Annotation: dfdl:element, dfdl:simpleType

textStandardNaNRep

DFDL String Literal

The value used to represent NaN.

NaN is represented as a string and the positive or negative prefixes and suffixes from the dfdl:textNumberPattern are not used.

This property is applicable when dfdl:textNumberRep is 'standard', dfdl:textStandardBase is 10 and the simple type is float or double.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

The string literal value is restricted in the same way as described in "Text Number Character Restrictions" in the description of the dfdl:textStandardDecimalSeparator property.

Annotation: dfdl:element, dfdl:simpleType

textStandardZeroRep

List of DFDL String Literals

Valid values: empty string, any character string

The whitespace separated list of alternative literal strings that are equivalent to zero, for example the characters 'zero'.

The representation is examined for a match to one of the values of this property after padding has been trimmed away.

On unparsing the first value is used.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

The empty string means that there is no special literal string for zero. 

This property is applicable when dfdl:textNumberRep is 'standard' and dfdl:textStandardBase is 10.

Each string literal in the list is restricted to allow only certain kinds of syntax:

·         DFDL character entities are allowed.

·         DFDL Byte Value entities ( %#r ) are not allowed.

·         DFDL Character class entities NL and ES are not allowed.

·         DFDL Character class entities WSP, WSP+, and WSP* are allowed.

However, the WSP* entity cannot appear on its own as one of the string literals in the list. It must be used in combination with other text characters or entities so as to describe a representation that cannot ever be an empty string.

It is a schema definition error if the string literal contains any of the disallowed syntax constructs.

Annotation: dfdl:element, dfdl:simpleType

textStandardBase

Non-negative Integer

Valid Values 2, 8, 10, 16

Indicates the number base.

Only used when dfdl:textNumberRep is 'standard'.

When base is not 10, xs:decimal, xs:float and xs:double are not supported.

When dfdl:textNumberRep is 'zoned', dfdl:textStandardBase is not used and base 10 is assumed.

Annotation: dfdl:element, dfdl:simpleType

textZonedSignStyle

Enum

Specifies the code points that are used to overpunch the sign nibble when the dfdl:encoding is an ASCII-derived character set encoding. The location of this sign nibble is indicated in the dfdl:textNumberPattern.

This property is applicable when dfdl:textNumberRep is 'zoned'.

Used only when dfdl:encoding is an ASCII-derived character set encoding. The encoding must provide the character to single byte code point mapping used by the specified value of dfdl:textZonedSignStyle, as stated below.

Valid values 'asciiStandard', 'asciiTranslatedEBCDIC', 'asciiCARealiaModified', and 'asciiTandemModified'

The characters used to represent 'overpunched' (included) positive and negative signs vary by encoding, COBOL compiler and system. The code points are fixed for EBCDIC systems but not for ASCII.

In EBCDIC-based encodings, code points 0xC0 to 0xC9 or 0xF0 to 0xF9 represent a positive sign and digits 0 to 9 (typically characters '{ABCDEFGHI' or '0123456789'), and code points 0xD0 to 0xD9 or 0xB0 to 0xB9 represent a negative sign and digits 0 to 9 (typically characters '}JKLMNOPQR' or  '^£¥·©§¶¼½¾ ' ). On parsing both ranges will be accepted. On unparsing the range 0xC0 to 0xC9 will be produced for positive signs and the range 0xD0 to 0xD9 will be produced for negative signs.

asciiStandard: ASCII characters '0123456789' represent a positive sign and the corresponding digit. (Sign nibble for '+' is 0x3, which is the high nibble of these code points unmodified.) ASCII characters 'pqrstuvwxy' represent negative sign and digits 0 to 9. (Code points 0x70 to 0x79)

asciiTranslatedEBCDIC: The overpunched character is the ASCII equivalent of the typical EBCDIC above. So the characters '{ABCDEFGHI' still represent a positive sign and digits 0 to 9. (These are code points 0x7B, 0x41 through 0x49.) The characters '}JKLMNOPQR' still represent negative sign and digits 0 to 9. (These are code points 0x7D, 0x4A through 0x52.) This case comes up if EBCDIC zoned decimal data is translated to ASCII as if it were textual data.

asciiCARealiaModified[22]:  In this style, the ASCII characters '0123456789' represent positive sign and digits 0 to 9 as in standard. However, ASCII characters from code points 0x20 to 0x29 are used for negative sign and the corresponding decimal digit. This doesn't translate well into printing characters. These characters include the space (' ') for zero, characters '!"#$%&' for 1 through 6, the single quote character "'" for 7, and the parenthesis '()' for 8 and 9.

asciiTandemModified: In this style the ASCII characters '0123456789' represent positive sign and digits 0 to 9, but code points 0x80 to 0x89 are used to represent negative sign and a digit. There are no corresponding code points in the standard ASCII encoding since these values are all 128 (decimal) or greater. This means the resultant bytes are not code points in standard ASCII, so the modeller must specify an encoding like ISO-8859-1 in order for such zoned decimals to parse without an encoding error. (Note that neither ISO-8859-1 encoding nor Unicode have assigned glyphs for these code points. They are considered control characters.)

Annotation: dfdl:element, dfdl:simpleType

Table 31 Properties Specific to Number with Text Representation
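
The four ASCII sign styles described for dfdl:textZonedSignStyle can be sketched as a small lookup. This is an illustrative sketch only: the function name decode_overpunch and the (sign, digit) return shape are assumptions made for the example, and only the code points stated in the property description are accepted.

```python
# Start of the negative overpunch range for the styles that keep
# unmodified ASCII digits 0x30..0x39 for positive values.
NEGATIVE_BASE = {
    "asciiStandard": 0x70,          # 'p'..'y'
    "asciiCARealiaModified": 0x20,  # ' ' .. ')'
    "asciiTandemModified": 0x80,    # outside 7-bit ASCII
}


def decode_overpunch(code_point: int, style: str) -> tuple[int, int]:
    """Hypothetical helper: return (sign, digit), sign being +1 or -1."""
    if style == "asciiTranslatedEBCDIC":
        if code_point == 0x7B:                 # '{'
            return (1, 0)
        if 0x41 <= code_point <= 0x49:         # 'A'..'I' are digits 1..9
            return (1, code_point - 0x40)
        if code_point == 0x7D:                 # '}'
            return (-1, 0)
        if 0x4A <= code_point <= 0x52:         # 'J'..'R' are digits 1..9
            return (-1, code_point - 0x49)
    elif style in NEGATIVE_BASE:
        if 0x30 <= code_point <= 0x39:         # '0'..'9', positive
            return (1, code_point - 0x30)
        base = NEGATIVE_BASE[style]
        if base <= code_point <= base + 9:     # negative range for the style
            return (-1, code_point - base)
    else:
        raise ValueError(f"unknown sign style {style!r}")
    raise ValueError(f"0x{code_point:02X} is not an overpunch for {style}")
```

For example, decode_overpunch(ord('p'), "asciiStandard") yields a negative sign with digit 0, and decode_overpunch(ord('R'), "asciiTranslatedEBCDIC") yields a negative sign with digit 9.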

The dfdl:textStandardDecimalSeparator, dfdl:textStandardGroupingSeparator, dfdl:textStandardExponentRep, dfdl:textStandardInfinityRep, dfdl:textStandardNaNRep, and dfdl:textStandardZeroRep must all be distinct, and it is a schema definition error otherwise. Note that if dfdl:textStandardDecimalSeparator, dfdl:textStandardGroupingSeparator, or dfdl:textStandardExponentRep are expressions, this checking can only be carried out during processing (parsing or unparsing.)

Implementation note: This rule is in the interests of clarity, and is an extra constraint compared to ICU.
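
The distinctness rule can be illustrated with a short check. This is a sketch under assumptions: the exception class and function name are hypothetical, property values are compared as plain strings (no entity expansion or dfdl:ignoreCase handling), and an empty dfdl:textStandardZeroRep entry is skipped because it denotes "no special literal".

```python
class SchemaDefinitionError(Exception):
    """Hypothetical exception type, not an API of any DFDL processor."""


def check_distinct(properties: dict[str, list[str]]) -> None:
    """Raise if two different properties share a literal value."""
    owner: dict[str, str] = {}
    for name, values in properties.items():
        for value in values:
            if value == "":
                continue  # empty string: no literal defined
            if value in owner and owner[value] != name:
                raise SchemaDefinitionError(
                    f"{name} and {owner[value]} both use {value!r}"
                )
            owner[value] = name
```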

 

13.6.1    The dfdl:textNumberPattern Property

The dfdl:textNumberPattern describes how to parse and unparse text representations of number logical types with base 10.

The length of the representation of the number is determined first, and the number pattern is used only for conversion of the content text to and from a numeric logical infoset value.

The pattern described below is derived from the ICU DecimalFormat class described here: [ICUDecimal]

The pattern is an ICU-like syntax that defines where grouping separators, decimal separators, implied decimal points, exponents, positive signs and negative signs appear. It permits definition by either digits/fractions or significant digits.

13.6.1.1    dfdl:textNumberPattern for dfdl:textNumberRep 'standard'

When dfdl:textNumberRep is 'standard' this property only applies when  dfdl:textStandardBase is 10.

The pattern comes in two parts separated by a semi-colon. The first is mandatory and applies to positive numbers, the second is optional and applies to negative numbers.

Examples: The first shows digits/fractions and positive/negative signs, the second shows exponent, the third shows virtual decimal point, the fourth shows scaling position.

+###,##0.00;(###,##0.00)

 

##0.0#E0

 

000V00

 

PPP0000

The 'V' symbol is used to indicate the location of an implied decimal point for fixed point number representations. (This is an extension to the ICU pattern language.)

The 'P' symbol is used to indicate that a decimal scaling factor needs to be applied. (This is an extension to the ICU pattern language.)

The actual grouping separator, decimal separator and exponent characters are defined independently of the pattern.

The actual positive sign and negative sign are defined within the pattern itself.

Many characters in a pattern are taken literally; they are matched during parsing and output unchanged during unparsing. Special characters, on the other hand, stand for other characters, strings, or classes of characters. For example, the '#' character is replaced by a digit.

To insert a special character in a pattern as a literal, that is, without any special meaning, the character must be quoted. There are some exceptions to this which are noted below.

 

Symbol   Location                    Meaning
0        Number                      Digit
1-9      Number                      '1' through '9' indicates rounding.
#        Number                      Digit, zero shows as absent
.        Number                      Decimal separator or monetary decimal separator
-        Number                      Minus sign
,        Number                      Grouping separator
E        Number                      Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
+        Exponent                    Prefix positive exponents with plus sign. Need not be quoted in prefix or suffix.
;        Subpattern boundary         Separates positive and negative subpatterns
'        Prefix or suffix            Used to quote special characters in a prefix or suffix, for example, "'#'#" formats 123 to "#123". To create a single quote itself, use two in a row: "# o''clock".
*        Prefix or suffix boundary   Pad escape, precedes pad character
V        Number                      Virtual decimal point marker. Only used with decimal, float and double simple types.
P        Number                      Decimal scaling position. Only used with decimal, float and double simple types.
@        Number                      Significant digits specifier. Only used with decimal simple type. Controls number of significant digits when used alone or in conjunction with the # character.

Table 32 dfdl:textNumberPattern Special Characters

A pattern contains a positive and negative subpattern, for example, "#,##0.00;(#,##0.00)". Each subpattern has a prefix, a numeric part, and a suffix. If there is no explicit negative subpattern, the negative subpattern is the minus sign prefixed to the positive subpattern. That is, "0.00" alone is equivalent to "0.00;-0.00". If there is an explicit negative subpattern, it serves only to specify the negative prefix and suffix; the number of digits, minimal digits, and other characteristics are ignored in the negative subpattern. That means that "#,##0.0#;(#)" has precisely the same result as "#,##0.0#;(#,##0.0#)".
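
The defaulting of the negative subpattern can be sketched as follows. This is a simplified illustration, not a full pattern parser: it ignores quoting and pad escapes, treats everything between the first and last of the characters "#0123456789@,.VP" as the numeric part, and the function names are hypothetical.

```python
NUMERIC_CHARS = set("#0123456789@,.VP")


def split_affixes(subpattern: str) -> tuple[str, str, str]:
    """Split a subpattern into (prefix, numeric part, suffix)."""
    positions = [i for i, ch in enumerate(subpattern) if ch in NUMERIC_CHARS]
    if not positions:
        raise ValueError("subpattern has no numeric part")
    first, last = positions[0], positions[-1]
    return subpattern[:first], subpattern[first:last + 1], subpattern[last + 1:]


def effective_subpatterns(pattern: str) -> tuple[str, str]:
    """Return the (positive, negative) subpatterns actually in effect."""
    positive, separator, negative = pattern.partition(";")
    pos_prefix, numeric, pos_suffix = split_affixes(positive)
    if separator:
        # Only the prefix and suffix of an explicit negative subpattern count.
        neg_prefix, _, neg_suffix = split_affixes(negative)
    else:
        # No explicit negative subpattern: minus sign prefixed to the positive.
        neg_prefix, neg_suffix = "-" + pos_prefix, pos_suffix
    return (pos_prefix + numeric + pos_suffix,
            neg_prefix + numeric + neg_suffix)
```

Under these assumptions, "0.00" resolves to the pair ("0.00", "-0.00"), and "#,##0.0#;(#)" resolves to the same pair as "#,##0.0#;(#,##0.0#)".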

The prefixes, suffixes, and various symbols used for infinity, digits, grouping separators, decimal separators, etc. may be set to arbitrary values, and they will appear properly during unparsing. However, care must be taken that the symbols and strings do not conflict, or parsing will be unreliable. For example, either the positive and negative prefixes or the suffixes must be distinct for parse to be able to distinguish positive from negative values.

The grouping separator is a character that separates clusters of integer digits to make large numbers more legible. It is commonly used for thousands, but in some locales it separates ten-thousands. The grouping size is the number of digits between the grouping separators, such as 3 for "100,000,000" or 4 for "1 0000 0000". There are actually two different grouping sizes: one used for the least significant integer digits, the primary grouping size, and one used for all others, the secondary grouping size. In most locales these are the same, but sometimes they are different. For example, if the primary grouping interval is 3, and the secondary is 2, then this corresponds to the pattern "#,##,##0", and the number 123456789 is formatted as "12,34,56,789". If a pattern contains multiple grouping separators, the interval between the last one and the end of the integer defines the primary grouping size, and the interval between the last two defines the secondary grouping size. All others are ignored, so "#,##,###,####" == "###,###,####" == "##,#,###,####".
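
The derivation of the two grouping sizes can be sketched in a few lines. The function names are hypothetical, the input is assumed to be only the integer region of a pattern, and the separator actually emitted would come from dfdl:textStandardGroupingSeparator rather than the ',' used here.

```python
def grouping_sizes(integer_pattern: str) -> tuple[int, int]:
    """Return (primary, secondary) grouping sizes, or (0, 0) if ungrouped."""
    parts = integer_pattern.split(",")
    if len(parts) == 1:
        return (0, 0)
    primary = len(parts[-1])            # last separator to end of integer
    secondary = len(parts[-2]) if len(parts) > 2 else primary
    return (primary, secondary or primary)


def group_digits(digits: str, primary: int, secondary: int, sep: str = ",") -> str:
    """Insert separators into a string of integer digits."""
    if primary <= 0 or len(digits) <= primary:
        return digits
    head, groups = digits[:-primary], [digits[-primary:]]
    while len(head) > secondary:
        groups.insert(0, head[-secondary:])
        head = head[:-secondary]
    if head:
        groups.insert(0, head)
    return sep.join(groups)


primary, secondary = grouping_sizes("#,##,##0")
print(group_digits("123456789", primary, secondary))  # 12,34,56,789
```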

The P symbol is used to derive the location of an assumed decimal point when the point is not within the number that appears in the data. It acts as a decimal scaling factor.

The symbol P can be specified only as a continuous string of Ps in the leftmost or rightmost digit positions in the vpinteger region of the pattern.

It is a schema definition error if any symbols other than "0", "1" through "9" or # are used in the vpinteger region of the pattern.

Examples

Data representation   Pattern   Value
123                   PP000     0.00123
123                   000PP     12300

Table 33 Examples of P Symbol in the dfdl:textNumberPattern Property
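
The scaling shown in Table 33 can be reproduced with a short sketch. The function name is hypothetical, and the sketch assumes the pattern consists only of 'P' symbols and digit positions, with the data supplying digits for those positions.

```python
from decimal import Decimal


def apply_p_scaling(digits: str, pattern: str) -> Decimal:
    """Scale the digits according to leading or trailing 'P' symbols."""
    leading = len(pattern) - len(pattern.lstrip("P"))
    trailing = len(pattern) - len(pattern.rstrip("P"))
    if leading and trailing:
        raise ValueError("'P' may appear at only one end of the pattern")
    digit_positions = len(pattern) - leading - trailing
    if leading:
        # Assumed point lies 'leading' positions left of the leftmost digit.
        return Decimal(int(digits)).scaleb(-(digit_positions + leading))
    # Assumed point lies 'trailing' positions right of the rightmost digit.
    return Decimal(int(digits)).scaleb(trailing)


print(apply_p_scaling("123", "PP000"))  # 0.00123
```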

 pattern    := subpattern (';' subpattern)?

 subpattern := prefix? ((number exponent?)| vpinteger) suffix?

 number     := (integer ('.' fraction)?) | sigDigits

 

 vpinteger  := pinteger | (vinteger exponent?)

 pinteger   := ('P'* integer) | (integer 'P'* ) 

 vinteger   := ('V'? integer) |

               ('#'* 'V'? integer)|

               ('#'* '0'* 'V'? '0'* '0')|

               (integer 'V'?)

 

 prefix     := '\u0000'..'\uFFFD' - specialCharacters

 suffix     := '\u0000'..'\uFFFD' - specialCharacters

 integer    := '#'* '0'* '0'

 fraction   := '0'* '#'*

 sigDigits  := '#'* '@' '@'* '#'*

 exponent   := 'E'? '+'? '0'* '0'

 padSpec    := '*' padChar

 padChar    := '\u0000'..'\uFFFD' - quote

  

 Notation:

   X*       0 or more instances of X

   X?       0 or 1 instances of X

   X|Y      either X or Y

   C..D     any character from C up to D, inclusive

   S-T      characters in S, except those in T

 Figure 4 dfdl:textNumberPattern BNF syntax

The first subpattern is for positive numbers. The second (optional) subpattern is for negative numbers.

Not indicated in the BNF syntax above:

The grouping separator ',' can occur inside the integer region, between any two pattern characters of that region, as long as the number region is not followed by an exponent region.

Two grouping intervals are recognized: That between the decimal point and the first grouping symbol, and that between the first and second grouping symbols. These intervals are identical in most locales, but in some locales they differ. For example, the pattern "#,##,###" formats the number 123456789 as "12,34,56,789".

The pad specifier padSpec may appear before the prefix, after the prefix, before the suffix, after the suffix, or not at all.

In place of '0', the digits '1' through '9' in the number or vpinteger region may be used to indicate a rounding increment.

The term maximum fraction digits is the total number of '0' and '#' characters in the fraction sub-pattern above.

The term minimum fraction digits is the total number of '0' characters (only) in the fraction sub-pattern above.

The term maximum integer digits is a limit that is implementation-dependent, but must be at least 20 (which is the number of digits in a base 10 unsigned long).[23]

The term minimum integer digits is the total number of '0' characters (only) in the integer sub-pattern above.

 

Parsing

During parsing, grouping separators are removed from the data.

Unparsing

Unparsing is guided by several parameters all of which can be specified using a pattern. The following description applies to formats that do not use scientific notation.

If the number of actual integer digits exceeds the maximum integer digits, then only the least significant digits are shown. For example, 1997 is formatted as "97" if the maximum integer digits is 2.

If the number of actual integer digits is less than the minimum integer digits, then leading zeros are added. For example, 1997 is formatted as "01997" if the minimum integer digits is 5.

If the number of actual fraction digits exceeds the maximum fraction digits, then half-even rounding is performed to the maximum fraction digits. For example, 0.125 is formatted as "0.12" if the maximum fraction digits is 2. This behavior can be changed by specifying a rounding increment and a rounding mode.

If the number of actual fraction digits is less than the minimum fraction digits, then trailing zeros are added. For example, 0.125 is formatted as "0.1250" if the minimum fraction digits is 4.

Trailing fractional zeros are not displayed if they occur j positions after the decimal, where j is less than the maximum fraction digits. For example, 0.10004 is formatted as "0.1" if the maximum fraction digits is four or less.
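
The digit-count rules above can be sketched with Python's decimal module. This is an illustration under assumptions: the function name and parameter names are hypothetical, sign handling (prefixes and suffixes) and grouping are omitted, '.' stands in for the decimal separator, and max_int is assumed to be at least 1.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_digits(value, min_int=1, max_int=20, min_frac=0, max_frac=3) -> str:
    """Apply minimum/maximum integer and fraction digit counts."""
    magnitude = abs(Decimal(value))
    # Half-even rounding to the maximum number of fraction digits.
    quantum = Decimal(1).scaleb(-max_frac)
    rounded = magnitude.quantize(quantum, rounding=ROUND_HALF_EVEN)
    int_part, _, frac_part = format(rounded, "f").partition(".")
    # Drop trailing fractional zeros, then restore the required minimum.
    frac_part = frac_part.rstrip("0").ljust(min_frac, "0")
    # Keep only the least significant integer digits, then add leading zeros.
    int_part = int_part[-max_int:].rjust(min_int, "0")
    return int_part + ("." + frac_part if frac_part else "")


print(format_digits("0.125", max_frac=2))  # 0.12
```

Values are passed as strings so that they convert to Decimal exactly, which avoids binary floating point artifacts in the examples.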

Special Values

NaN is represented as a string determined by the dfdl:textStandardNaNRep property. This is the only value for which the prefixes and suffixes are not used.

Infinity is represented as a string with the positive or negative prefixes and suffixes applied. The infinity string is determined by the dfdl:textStandardInfinityRep property.

Scientific Notation

Numbers in scientific notation are expressed as the product of a mantissa and a power of ten, for example, 1234 can be expressed as 1.234 x 10^3. The mantissa is typically in the half-open interval [1.0, 10.0) or sometimes [0.0, 1.0), but it need not be. In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation. Example: "0.###E0" formats the number 1234 as "1.234E3".

The number of digit characters after the exponent character gives the minimum exponent digit count. There is no maximum. Negative exponents are formatted using the minus sign, not the prefix and suffix from the pattern. This allows patterns such as "0.###E0 m/s". To prefix positive exponents with a plus sign, specify '+' between the exponent and the digits: "0.###E+0" will produce formats such as "1E+1", "1E+0", "1E-1", etc.

The minimum number of integer digits is achieved by adjusting the exponent. Example: 0.00123 formatted with "00.###E0" yields "12.3E-4". This only happens if there is no maximum number of integer digits. If there is a maximum, then the minimum number of integer digits is fixed at one.

The maximum number of integer digits, if present, specifies the exponent grouping. The most common use of this is to generate engineering notation, in which the exponent is a multiple of three, e.g., "##0.###E0". The number 12345 is formatted using "##0.####E0" as "12.345E3".

When using scientific notation, the formatter controls the digit counts using significant digits logic. The maximum number of significant digits limits the total number of integer and fraction digits that will be shown in the mantissa; it does not affect parsing. For example, 12345 formatted with "##0.##E0" is "12.3E3".

Exponential patterns may not contain grouping separators.
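
The exponent selection described in this subsection can be sketched as below. This is one reading of the rules, stated as an assumption: a pattern is taken to have a "maximum number of integer digits" when its integer part contains more digit positions than required ones (max_int greater than min_int and greater than 1), in which case the exponent is a multiple of max_int. The function name is hypothetical, only positive values are handled, and the minimum exponent digit count and '+' sign are omitted.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_scientific(value, min_int: int, max_int: int, max_frac: int,
                      exp_char: str = "E") -> str:
    """Format a positive value as mantissa, exponent character, exponent."""
    value = Decimal(value)
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    adjusted = value.adjusted()  # power of ten of the leading digit
    if max_int > min_int and max_int > 1:
        # Exponent grouping: the exponent is a multiple of max_int and the
        # minimum number of integer digits is fixed at one.
        exponent = (adjusted // max_int) * max_int
        significant = 1 + max_frac
    else:
        # Minimum integer digits achieved by adjusting the exponent.
        exponent = adjusted - (min_int - 1)
        significant = min_int + max_frac
    quantum = Decimal(1).scaleb(adjusted - significant + 1)
    rounded = value.quantize(quantum, rounding=ROUND_HALF_EVEN)
    mantissa = rounded.scaleb(-exponent).normalize()
    return f"{mantissa:f}{exp_char}{exponent}"


print(format_scientific("12345", min_int=1, max_int=3, max_frac=2))  # 12.3E3
```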

Significant Digits

The '@' pattern character can be used with the '#' to control how many integer and fraction digits are needed to display the specified number of significant digits. The '@' only affects unparsing behavior. Examples:

Pattern   Minimum significant digits   Maximum significant digits   Number    Formatted Output
@@@       3                            3                            12345     12300
@@@       3                            3                            0.12345   0.123
@@##      2                            4                            3.14159   3.142
@@##      2                            4                            1.23004   1.23

Table 34 Significant Digits '@' Symbol in the dfdl:textNumberPattern Property
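
The rows of Table 34 can be reproduced with a short sketch. The function names are hypothetical; half-even rounding is assumed, and prefixes, suffixes and separators are left out.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def significant_digit_limits(pattern: str) -> tuple[int, int]:
    """Return (minimum, maximum) significant digits for an '@' pattern."""
    minimum = pattern.count("@")
    trailing = pattern[pattern.rindex("@") + 1:]
    return (minimum, minimum + trailing.count("#"))


def format_significant(value, pattern: str) -> str:
    minimum, maximum = significant_digit_limits(pattern)
    value = Decimal(value)
    # Round to the maximum number of significant digits.
    quantum = Decimal(1).scaleb(value.adjusted() - maximum + 1)
    rounded = value.quantize(quantum, rounding=ROUND_HALF_EVEN).normalize()
    # normalize() removed trailing zeros; restore any needed for the minimum.
    if len(rounded.as_tuple().digits) < minimum:
        rounded = rounded.quantize(
            Decimal(1).scaleb(rounded.adjusted() - minimum + 1))
    return format(rounded, "f")


print(format_significant("3.14159", "@@##"))  # 3.142
```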

Padding

Padding may be specified through the pattern syntax. In a pattern the pad escape character, followed by a single pad character, causes padding to be parsed and formatted. The pad escape character is '*'. For example, "*x#,##0.00" formats 123 to "xx123.00", and 1234 to "1,234.00".

When padding is in effect, the width of the positive subpattern, including prefix and suffix, determines the format width. For example, in the pattern "* #0 o''clock", the format width is 10.

The width is counted in 16-bit code units.

Some parameters which usually do not matter have meaning when padding is used, because the pattern width is significant with padding. In the pattern "* ##,##,#,##0.##", the format width is 14. The initial characters "##,##," do not affect the grouping size or maximum integer digits, but they do affect the format width.
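
The format width examples above can be checked with a small sketch. It is deliberately simplified and its names are hypothetical: the first '*' is taken to be the pad escape, a doubled quote counts as one character, other quoting is not handled, and only the "before the prefix" pad position is implemented.

```python
def split_pad_spec(pattern: str):
    """Return (pad character or None, pattern with the pad spec removed)."""
    index = pattern.find("*")
    if index < 0:
        return None, pattern
    if index + 1 >= len(pattern):
        raise ValueError("pad escape must be followed by a pad character")
    return pattern[index + 1], pattern[:index] + pattern[index + 2:]


def format_width(positive_subpattern: str) -> int:
    """Width of the positive subpattern, excluding the pad specifier."""
    _, rest = split_pad_spec(positive_subpattern)
    return len(rest.replace("''", "'"))


def pad_before_prefix(text: str, positive_subpattern: str) -> str:
    """Left-pad already formatted text up to the format width."""
    pad_char, _ = split_pad_spec(positive_subpattern)
    if pad_char is None:
        return text
    return text.rjust(format_width(positive_subpattern), pad_char)


print(pad_before_prefix("123.00", "*x#,##0.00"))  # xx123.00
```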

Padding may be inserted at one of four locations: before the prefix, after the prefix, before the suffix, or after the suffix. If there is no prefix, before the prefix and after the prefix are equivalent, likewise for the suffix.

When specified in a pattern, the 32-bit codepoint immediately following the pad escape is the pad character. This may be any character, including a special pattern character. That is, the pad escape escapes the following character. If there is no character after the pad escape, then the pattern is illegal.

Note: Padding specified through the pattern syntax is distinct from, and in addition to, padding specified using dfdl:textPadKind.

Rounding

How rounding is controlled is given by dfdl:textNumberRounding.  The rounding increment may be specified in the dfdl:textNumberPattern itself using digits '1' through '9' or using an explicit increment in dfdl:textNumberRoundingIncrement. For example, 1230 rounded to the nearest 50 is 1250. 1.234 rounded to the nearest 0.65 is 1.3.

Using an explicit rounding increment, dfdl:textNumberRoundingMode determines how values are rounded.
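
Rounding to an increment can be sketched as follows. The function name is hypothetical, and the mapping from dfdl:textNumberRoundingMode values to Python's decimal rounding constants is not shown; half-even is used as the default here.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to_increment(value, increment, rounding=ROUND_HALF_EVEN) -> Decimal:
    """Round value to the nearest multiple of increment."""
    value, increment = Decimal(value), Decimal(increment)
    steps = (value / increment).quantize(Decimal(1), rounding=rounding)
    return steps * increment


print(round_to_increment("1230", "50"))     # 1250
print(round_to_increment("1.234", "0.65"))  # 1.30
```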

13.6.1.2    dfdl:textNumberPattern for dfdl:textNumberRep 'zoned'

When dfdl:textNumberRep is 'zoned' a subset of the number pattern language described in Section 13.6.1.1 dfdl:textNumberPattern for dfdl:textNumberRep 'standard' is used.

Only the pattern for positive numbers is used. It is a schema definition error if the negative pattern is specified.

In addition, only the following pattern characters may be used:

Rounding occurs as described under Rounding in 13.6.1.1 dfdl:textNumberPattern for dfdl:textNumberRep 'standard'

13.6.2    Converting logical numbers to/from text representation

·         Signed numbers with dfdl:textNumberRep 'standard' and dfdl:textStandardBase 10 are mapped using the dfdl:textNumberPattern.

·         Signed numbers with dfdl:textNumberRep 'standard' and dfdl:textStandardBase not 10 are mapped to an unsigned representation. On unparsing the minimum number of characters to represent the digits is output and it is a processing error if the value is negative.


13.7     Properties Specific to Number with Binary Representation

These properties are applicable to simple type xs:decimal and its derived types, which include all the signed and unsigned integer types. These properties are not applicable to types xs:float and xs:double. See section 13.8. Note that simple types derived from xs:decimal do not imply base-10 representations in the data stream.

 Property Name

Description

binaryNumberRep

Enum

Valid values are  'packed', 'bcd', 'binary', 'ibm4690Packed'

Allowable values for each number type are:

Logical Type                                             Permitted Value
Decimal, Integer, NonNegativeInteger                     packed, bcd, binary, ibm4690Packed
Long, Int, Short, Byte                                   packed, binary, ibm4690Packed (but not bcd)
UnsignedLong, UnsignedInt, UnsignedShort, UnsignedByte   packed, bcd, binary, ibm4690Packed

'packed' means represented as an IBM 390 packed decimal. Each byte contains two decimal digits, except for the least significant byte, which contains a sign in the least significant nibble.

'bcd' means represented as a binary coded decimal with two digits per byte.

'binary' means represented as twos complement for signed types and unsigned base-2 binary for unsigned types.

Note that the maximum allowed value for twos-complement and unsigned base-2 binary integers is implementation-dependent, but must be at least that of a xs:long type, which is the equivalent of an 8 byte/64-bit signed integer.

'ibm4690Packed' is a variant of a packed decimal having the following characteristics:

  • Nibbles represent digits 0 - 9 in the usual BCD manner.
  • A positive value is simply indicated by digits.
  • A negative number is indicated by digits with the most significant nibble being xD.
  • If a positive or negative value packs to an odd number of nibbles, an extra xF nibble is added as the most significant nibble.

For all values, the dfdl:byteOrder property is used to determine the numeric significance of the bytes making up the representation, and the dfdl:bitOrder property is used to determine the numeric significance of the bits within a byte.

Annotation: dfdl:element, dfdl:simpleType 

binaryDecimalVirtualPoint

Integer.

Used when base simpleType is xs:decimal.

An integer that represents the position of an implied decimal point within a number. A value of 0 means there is no virtual decimal point.

If you specify a positive integer, the position of the decimal point is moved from the least-significant side of the number toward the most-significant side of the number. For example, if 3 is specified, then the integer value 1234 represents 1.234. This is equivalent to dividing by 10^3.

If you specify a negative integer, the position of the decimal point is moved from the least significant side of the number further in the less-significant direction. For example, if you specify -3, the integer value 1234 represents 1 234 000. This is equivalent to multiplying by 10^3.

Annotation: dfdl:element, dfdl:simpleType

binaryPackedSignCodes

List of Characters

Used only when dfdl:binaryNumberRep or dfdl:binaryCalendarRep is 'packed'.

A whitespace separated string giving the hex sign nibbles to use for a positive value, a negative value, an unsigned value, and zero.

Valid values for positive nibble: A, C, E, F

Valid values for negative nibble: B, D

Valid values for unsigned nibble: F

Valid values for zero sign: A C E F 0

Example: 'C D F C' – typical S/390 usage

Example: 'C D F 0' – handle special case for zero

On parsing, whether to accept all valid values for a positive, negative or unsigned number, and for zero, is governed by the dfdl:binaryNumberCheckPolicy property. On unparsing, the specified values are always used.

Annotation: dfdl:element, dfdl:simpleType

binaryNumberCheckPolicy

Enum

Values are 'strict' and 'lax'.

Indicates how lenient to be when parsing binary numbers.

If 'lax' then the parser tolerates all valid alternatives where such alternatives exist. Specifically, for dfdl:binaryNumberRep 'packed' the sign nibble for positive, negative, unsigned and zero is allowed to be any of the valid respective values.

On unparsing, the specified value is always used.

Annotation: dfdl:element, dfdl:simpleType

Table 35 Properties Specific to Number with Binary Representation
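
The three decimal-digit representations described for dfdl:binaryNumberRep can be sketched as decoders. This is an illustration under assumptions: function names are hypothetical, bytes are taken most significant first, the sign nibble sets are the 'lax' sets listed for dfdl:binaryPackedSignCodes, and the special zero sign is not handled.

```python
POSITIVE_NIBBLES = {0xA, 0xC, 0xE, 0xF}  # 0xF also serves as unsigned
NEGATIVE_NIBBLES = {0xB, 0xD}


def nibbles(data: bytes) -> list[int]:
    """Split each byte into its high and low nibble."""
    result = []
    for byte in data:
        result.extend((byte >> 4, byte & 0x0F))
    return result


def digits_to_int(digits: list[int]) -> int:
    value = 0
    for digit in digits:
        if digit > 9:
            raise ValueError(f"invalid digit nibble 0x{digit:X}")
        value = value * 10 + digit
    return value


def decode_packed(data: bytes) -> int:
    """'packed': sign in the least significant nibble."""
    *digits, sign = nibbles(data)
    if sign in NEGATIVE_NIBBLES:
        return -digits_to_int(digits)
    if sign in POSITIVE_NIBBLES:
        return digits_to_int(digits)
    raise ValueError(f"invalid sign nibble 0x{sign:X}")


def decode_bcd(data: bytes) -> int:
    """'bcd': two digits per byte, no sign."""
    return digits_to_int(nibbles(data))


def decode_ibm4690_packed(data: bytes) -> int:
    """'ibm4690Packed': optional 0xF pad nibble, then 0xD if negative."""
    remaining = nibbles(data)
    if remaining and remaining[0] == 0xF:
        remaining = remaining[1:]
    negative = bool(remaining) and remaining[0] == 0xD
    if negative:
        remaining = remaining[1:]
    value = digits_to_int(remaining)
    return -value if negative else value
```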

13.7.1    Converting Logical Numbers to/from Binary Representation

When unparsing a binary number (packed decimal or twos-complement), if the Infoset value has more precision than the representation can hold, no rounding occurs. It is a processing error.

13.7.1.1    Converting Base-2 Binary Numbers

For both parsing and unparsing, the bit string that represents the content region for a base-2 binary number is converted to/from an Infoset value by a calculation that involves the length and the dfdl:byteOrder property.

For unparsing, the dfdl:fillByte property can also be involved.

When parsing, DFDL specifies how an unsigned integer of unbounded magnitude is computed from a bit string based on its length and the dfdl:byteOrder property. For signed types, this unbounded integer is converted into a signed value by way of the well-known twos-complement scheme. For the xs:decimal type, the dfdl:binaryDecimalVirtualPoint property can be used to convert this integer into a decimal value with an integer and a fractional component, or to scale up the integer by some scale factor.

A DFDL implementation can use any conversion technique consistent with this description.
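
As one conversion technique consistent with this description, the two steps applied after the unsigned integer has been computed can be sketched as follows. The function names are hypothetical.

```python
from decimal import Decimal


def to_signed(unsigned: int, bit_length: int) -> int:
    """Reinterpret an unsigned integer of bit_length bits as twos-complement."""
    if unsigned >= 1 << (bit_length - 1):
        return unsigned - (1 << bit_length)
    return unsigned


def apply_virtual_point(integer: int, virtual_point: int) -> Decimal:
    """Apply dfdl:binaryDecimalVirtualPoint: divide by 10^virtual_point."""
    return Decimal(integer).scaleb(-virtual_point)


print(to_signed(0xFF, 8))            # -1
print(apply_virtual_point(1234, 3))  # 1.234
```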

13.7.1.2    Bit strings, Alignment, and dfdl:fillByte

The dfdl:alignmentUnits of 'bits', and dfdl:alignment of '1' can be used to position a bit string anywhere in the data stream without regard for any other grouping of bits into bytes. 

The numeric value of the unsigned integer represented by a bit string is unaffected by alignment. 

When unparsing a bit string, alignment may cause the bits within the bit string to occupy only some of the bits within a byte of the data stream. The bits of data in the alignment fill region are unspecified by the elements of the DFDL schema, and when parsing, neither they, nor any data computed from them are put into the DFDL infoset. During unparsing, such unspecified bits are filled in using the value of the dfdl:fillByte property. Corresponding bits from the dfdl:fillByte value are used to fill in unspecified bits of the data stream. That is, if bit K (K will be 1 or greater, but less than or equal to 8) of a data stream byte is unspecified, its value will be taken from bit K of the dfdl:fillByte property value. 
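
The bit-for-bit use of dfdl:fillByte can be sketched as below. The function name and the list-of-bits input shape are hypothetical, and bit position 1 is assumed to be the most significant bit of the byte.

```python
def fill_unspecified_bits(specified, fill_byte: int) -> int:
    """Build one output byte.

    specified has 8 entries for bit positions 1..8; each is 0, 1, or
    None when the schema does not specify that bit.
    """
    result = 0
    for k in range(1, 9):
        bit = specified[k - 1]
        if bit is None:
            # Take bit K of the fill byte (position 1 assumed to be the MSB).
            bit = (fill_byte >> (8 - k)) & 1
        result = (result << 1) | bit
    return result


# Bits 2..7 are specified, bits 1 and 8 come from the fill byte.
print(hex(fill_unspecified_bits([None, 1, 0, 1, 1, 0, 1, None], 0xFF)))  # 0xdb
```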

Since the value of any bit string element is unaffected by alignment, the logical unsigned integer value for a bit-string is always computed as if the first bit were at position 1 of the bit stream. If the dfdl:length for the bit-string evaluates to M, then the bit-string conceptually occupies bits 1 to M of a data stream for purposes of computing its value.

13.7.1.3    Bits within Bit Strings of Length <= 8

Whenever the length in bits M is <= 8, a set bit at position Z contributes the value 2^(M-Z), and the value of the bit string as an integer is the sum of these values for each of its set bits.

13.7.1.4    Bits within Bit Strings of Length > 8

Call M the length of the bit string element in bits. In general, when M > 8 the contribution of a bit in position i to the numeric value of a bit string is given by a formula specific to the dfdl:byteOrder.

For dfdl:byteOrder of 'bigEndian' the value of bit i is given by 2^(M - i).

For dfdl:byteOrder of 'littleEndian' the value of bit i is given by a more complex formula. The following pseudo code computes the value of a bit in a littleEndian bit string. It is just a very big expression, but is spread out over many local variables to illustrate the various sub-calculations clearly. DFDL implementations may use any way of converting bit strings to the corresponding integer values that is consistent with this:

In the pseudo code below:

·         '%' is modular division (division where remainder is returned)

·         '/' is regular division (quotient is returned)

·         the expression 'a ? b : c' means 'if a is true, then the value is b, otherwise the value is c'

    littleEndianBitValue(bitPosition, bitStringLength)

        assert bitPosition >= 1;

        assert bitStringLength >= 1;

        assert bitStringLength >= bitPosition;

        numBitsInFinalPartialByte = bitStringLength % 8;

        numBitsInWholeBytes = bitStringLength -

                              numBitsInFinalPartialByte;

        bitPosInByte = ((bitPosition - 1) % 8) + 1;

        widthOfActiveBitsInByte = (bitPosition <= numBitsInWholeBytes)

             ? 8 : numBitsInFinalPartialByte;

        placeValueExponentOfBitInByte = widthOfActiveBitsInByte -

                                        bitPosInByte;

        bitValueInByte = 2^placeValueExponentOfBitInByte;

        byteNumZeroBased = (bitPosition - 1)/8;

        scaleFactorForBytePosition = 2^(8 * byteNumZeroBased);

        bitValue = bitValueInByte * scaleFactorForBytePosition;

        return bitValue;

Figure 5  Little Endian bit position and value
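
The pseudo code of Figure 5 translates directly into a runnable form. The sketch below follows it step by step and adds the bigEndian formula 2^(M - i); the function names and the use of a string of '0' and '1' characters as the bit string are choices made for the example.

```python
def little_endian_bit_value(bit_position: int, bit_string_length: int) -> int:
    """Place value of the bit at a 1-based position, littleEndian."""
    assert 1 <= bit_position <= bit_string_length
    bits_in_final_partial_byte = bit_string_length % 8
    bits_in_whole_bytes = bit_string_length - bits_in_final_partial_byte
    bit_pos_in_byte = ((bit_position - 1) % 8) + 1
    width_of_active_bits = (8 if bit_position <= bits_in_whole_bytes
                            else bits_in_final_partial_byte)
    bit_value_in_byte = 2 ** (width_of_active_bits - bit_pos_in_byte)
    byte_num_zero_based = (bit_position - 1) // 8  # quotient only
    return bit_value_in_byte * 2 ** (8 * byte_num_zero_based)


def big_endian_bit_value(bit_position: int, bit_string_length: int) -> int:
    """Place value of the bit at a 1-based position, bigEndian."""
    return 2 ** (bit_string_length - bit_position)


def bit_string_value(bits: str, byte_order: str) -> int:
    """Sum the place values of the set bits of a realigned bit string."""
    place_value = (little_endian_bit_value if byte_order == "littleEndian"
                   else big_endian_bit_value)
    length = len(bits)
    return sum(place_value(i, length)
               for i, bit in enumerate(bits, start=1) if bit == "1")


print(hex(bit_string_value("1011010100100", "littleEndian")))  # 0x4b5
```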

13.7.1.4.1   Examples of Unsigned Integer Conversion

Consider the first three bytes of the data stream. Imagine their numeric values as 0x5A 0x92 0x00.

Positions:
00000000 01111111 11122222
12345678 90123456 78901234
Bits:
01011010 10010010 00000000

Hex values
   5   A    9   2    0   0

Beginning at bit position 1 (the very first bit), if we consider the first two bytes as a bigEndian short, the value will be 0x5A92.

  <xs:element name="num" type="unsignedShort"

        dfdl:alignment="1"

        dfdl:alignmentUnits="bytes" 

        dfdl:byteOrder="bigEndian"

        dfdl:representation="binary"

        dfdl:binaryNumberRep="binary"/>

As a littleEndian short, the value will be 0x925A.

  <xs:element name="num" type="unsignedShort"

        dfdl:alignment="1"

        dfdl:alignmentUnits="bytes" 

        dfdl:byteOrder="littleEndian"

        dfdl:representation="binary"

        dfdl:binaryNumberRep="binary"/>

Now let us examine a bit string of length 13, beginning at position 2.

<xs:sequence>

  <xs:element name="ignored" type="unsignedByte"

        dfdl:alignment="1" 

        dfdl:alignmentUnits="bits" 

        dfdl:lengthUnits="bits" 

        dfdl:length="1" 

        dfdl:representation="binary"

        dfdl:binaryNumberRep="binary"/>

  <xs:element name="x" type="unsignedShort" 

        dfdl:alignment="1" 

        dfdl:alignmentUnits="bits" 

        dfdl:byteOrder="bigEndian"

        dfdl:lengthUnits="bits" 

        dfdl:length="13" 

        dfdl:representation="binary"

        dfdl:binaryNumberRep="binary"/>

   ...

</xs:sequence>

Let's examine the same data stream and consider the bit positions that make up element 'x', which are the bits at positions 2 through 14 inclusive.

Positions:
00000000 01111111 11122222
12345678 90123456 78901234
Bits:
 1011010 100100

Since alignment does not affect logical value, we will obtain the same logical value as if we realigned the bits. That is, the value is the same as if we began the bits of the element's representation with bit position 1.

Realigned Positions:
00000000 01111111 11122222
12345678 90123456 78901234
Bits:
10110101 00100

The DFDL schema fragment above gives element 'x' the dfdl:byteOrder 'bigEndian' property. In this case the place value of each position is given by 2^(M - i), where M is the length of the bit string in bits (here 13) and i is the realigned bit position.

PlaceValue positions 2^(M - i)

...11110 00000000
...21098 76543210

Bit values

...10110 10100100
Hex values
   1   6    A   4

The value of element 'x' is 0x16A4. Notice how it is the most-significant byte -- which is the first byte when big endian -- that becomes the partial byte (having fewer than 8 bits) in the case where the length of the bit string is not a multiple of 8 bits. 

For dfdl:byteOrder of 'littleEndian', the place values of the individual bits are not as easily visualized. However, there is still a basic formula (given in the pseudo code in 13.7.1.4 Bits within Bit Strings of Length > 8) that gives the place value of each bit.

Looking again at our realigned positions:

Realigned Positions:
00000000 01111111 11122222
12345678 90123456 78901234
Bits:
10110101 00100

The place values of each of these bits, for little endian byte order can be seen to be:

PlaceValue positions

00000000 ...11100
76543210 ...21098

Bit values

10110101 ...00100
Hex values
   B   5    0   4   

We must reorder the bytes for little endian byte order. The value of element 'x' is 0x04B5. In little endian form, the first 8 bits make up the first byte, which contains the least-significant byte of the logical unsignedShort value. The additional bits of the partial byte are once again the most-significant byte; however, for little endian form, this is the second byte. The second byte contains only 5 bits; these make up the least-significant 5 bits of that byte, but that logical 5-bit value forms the most-significant byte of the unsignedShort integer.
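The big endian and little endian calculations for element 'x' can be sketched in Python as follows. This is an illustration under the assumption of most-significant-bit-first bit order, not a normative algorithm; the function name bit_string_value and its parameters are hypothetical.

```python
def bit_string_value(data: bytes, start_bit: int, length: int, byte_order: str) -> int:
    """Unsigned value of `length` bits starting at 1-based `start_bit`,
    assuming most-significant-bit-first bit order within each byte."""
    # Collect the bits of interest in position order. This is the same as
    # realigning them so that the first bit sits at position 1.
    bits = []
    for pos in range(start_bit, start_bit + length):
        byte_index, bit_index = divmod(pos - 1, 8)
        bits.append((data[byte_index] >> (7 - bit_index)) & 1)

    if byte_order == "bigEndian":
        # Realigned position i has place value 2^(M - i), M being the length.
        value = 0
        for bit in bits:
            value = (value << 1) | bit
        return value

    if byte_order == "littleEndian":
        # Split into bytes (the last one may be partial) and scale the
        # byte at bit offset k by 2^k, where k is a multiple of 8.
        value = 0
        for k in range(0, length, 8):
            chunk = bits[k:k + 8]
            byte_value = int("".join(str(b) for b in chunk), 2)
            value |= byte_value << k
        return value

    raise ValueError(f"unknown byte order: {byte_order}")


stream = bytes([0x5A, 0x92, 0x00])
print(hex(bit_string_value(stream, 2, 13, "bigEndian")))     # 0x16a4
print(hex(bit_string_value(stream, 2, 13, "littleEndian")))  # 0x4b5
```

The partial byte is handled by the same loop as the full bytes: its bits are read as the low-order bits of a byte value, which is then scaled to the most-significant position.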

Now let us examine the 13 bits beginning at position 2, in the context where dfdl:byteOrder is 'littleEndian' and dfdl:bitOrder is 'leastSignificantBitFirst' and dfdl:binaryNumberRep is 'binary'.

In this case, the bit positions are assigned differently. Below the bytes are shown left-to-right:

Positions:
00000000 11111110 22222111
87654321 65432109 43210987
Bits:
01011010 10010010 00000000

Hex values
   5   A    9   2    0   0

The bits of interest are those at positions 2 through 14. If we redisplay this same data, but reverse the order of the bytes so that they read right-to-left, then we get:

Positions:
22222111 11111110 00000000
43210987 65432109 87654321
Bits:
00000000 10010010 01011010

Hex values
   0   0    9   2    5   A

The above shows more clearly that we are looking at a contiguous region of bits containing

0 1001 0010 1101

or the value 0x092D.
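The same result can be reproduced with a short Python sketch. It assumes dfdl:byteOrder 'littleEndian' together with dfdl:bitOrder 'leastSignificantBitFirst', in which case bit position p (1-based) of the stream has place value 2^(p - 1); the function name lsb_first_value is hypothetical.

```python
def lsb_first_value(data: bytes, start_bit: int, length: int) -> int:
    """Unsigned value of `length` bits starting at 1-based `start_bit`, for
    littleEndian byte order with leastSignificantBitFirst bit order."""
    # Treat the whole stream as one little-endian integer, so that
    # bit position p carries place value 2^(p - 1).
    stream = int.from_bytes(data, byteorder="little")
    # Drop the bits before the start position, then keep `length` bits.
    return (stream >> (start_bit - 1)) & ((1 << length) - 1)


print(hex(lsb_first_value(bytes([0x5A, 0x92, 0x00]), 2, 13)))  # 0x92d
```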

13.7.1.5    Converting Packed Decimal Numbers

Signed numbers with dfdl:binaryNumberRep 'packed' are parsed using a nibble to indicate the sign. The unsigned nibble is treated as positive. On unparsing the sign nibble is written according to dfdl:binaryPackedSignCodes. The unsigned nibble is never written.

Signed numbers with dfdl:binaryNumberRep 'bcd' are always positive. On unparsing it is a processing error if the Infoset data is negative.

Signed numbers with dfdl:binaryNumberRep 'ibm4690Packed' are parsed using the sign nibble to identify negative values. There is no sign nibble for positive values. On unparsing the nibble 0xD is written for negative values.

Unsigned numbers with dfdl:binaryNumberRep 'packed' are parsed if the nibble is positive or unsigned. It is a processing error if the data is negative. On unparsing the unsigned nibble is used.

Unsigned numbers with dfdl:binaryNumberRep 'bcd' are readily parsed as BCD data is always positive.

Unsigned numbers with dfdl:binaryNumberRep 'ibm4690Packed' are parsed if there is no sign nibble of 0xD to identify a negative value. It is a processing error if the data is negative. On unparsing no sign nibble is written.
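The parsing rules for dfdl:binaryNumberRep 'packed' can be illustrated with a minimal Python sketch. Several things here are assumptions rather than statements from this section: the sign nibble is taken to be the last nibble of the data, and the sign nibble sets (0xA, 0xC, 0xE positive; 0xB, 0xD negative; 0xF unsigned) follow common packed decimal practice. The function name parse_packed is hypothetical.

```python
# Assumed sign nibble sets (common packed decimal practice, not taken
# from this section).
POSITIVE = {0xA, 0xC, 0xE}
NEGATIVE = {0xB, 0xD}
UNSIGNED = {0xF}


def parse_packed(data: bytes, signed: bool) -> int:
    """Parse packed decimal data whose final nibble is the sign nibble."""
    # Split every byte into its high and low nibble.
    nibbles = [n for b in data for n in (b >> 4, b & 0x0F)]
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("processing error: non-decimal digit nibble")
    magnitude = int("".join(str(d) for d in digits) or "0")

    # The unsigned nibble is treated as positive.
    if sign in POSITIVE or sign in UNSIGNED:
        return magnitude
    if sign in NEGATIVE:
        if not signed:
            # Unsigned logical types reject negative data.
            raise ValueError("processing error: negative data for unsigned type")
        return -magnitude
    raise ValueError("processing error: invalid sign nibble")


print(parse_packed(bytes([0x12, 0x3D]), signed=True))   # -123
print(parse_packed(bytes([0x12, 0x3F]), signed=False))  # 123
```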


13.8     Properties Specific to Float/Double with Binary Representation

Property Name

Description

binaryFloatRep

Enum or DFDL Expression

This specifies the encoding method for the float and double. 

Valid values are 'ieee', 'ibm390Hex'. This property can be computed by way of an expression which returns a string of 'ieee' or 'ibm390Hex'. The expression must not contain forward references to elements which have not yet been processed.

The enumeration value 'ieee' refers to the IEEE 754-1985 specification.

For both 'ieee' and 'ibm390Hex', an xs:float must have a physical length of 4 bytes. It is a schema definition error if there is a specified length not equivalent to 4 bytes.

Similarly, for both 'ieee' and 'ibm390Hex', an xs:double must have a physical length of 8 bytes. It is a schema definition error if there is a specified length not equivalent to 8 bytes.

The dfdl:byteOrder property is used to construct a value from the bytes in the binary representation.

Note: The DFDL Infoset float and double data types match the precision of the IEEE specification. There may be precision/rounding issues when converting IBM float/double to/from the DFDL infoset float/double types.

Half-precision IEEE and quad-precision IEEE/IBM are not supported.[24]

Annotation: dfdl:element, dfdl:simpleType 

Table 36 Properties Specific to Float/Double with Binary Representation
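For dfdl:binaryFloatRep 'ieee', the interaction between the fixed physical lengths and dfdl:byteOrder can be sketched with Python's standard struct module. This is an illustration only; the function name parse_ieee is hypothetical, and 'ibm390Hex' is not covered by the sketch.

```python
import struct

# Physical length in bytes and struct format code for each logical type.
IEEE_FORMATS = {"float": (4, "f"), "double": (8, "d")}


def parse_ieee(data: bytes, logical_type: str, byte_order: str) -> float:
    """Build an IEEE float or double from its binary representation."""
    length, code = IEEE_FORMATS[logical_type]
    if len(data) != length:
        # Corresponds to the length rule: 4 bytes for float, 8 for double.
        raise ValueError(f"{logical_type} must be exactly {length} bytes")
    if byte_order == "bigEndian":
        prefix = ">"
    elif byte_order == "littleEndian":
        prefix = "<"
    else:
        raise ValueError(f"unknown byte order: {byte_order}")
    return struct.unpack(prefix + code, data)[0]


print(parse_ieee(bytes.fromhex("3FC00000"), "float", "bigEndian"))     # 1.5
print(parse_ieee(bytes.fromhex("0000C03F"), "float", "littleEndian"))  # 1.5
```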

13.9     Properties Specific to Boolean with Text Representation

Property Name

Description

textBooleanTrueRep

List of DFDL String Literals or DFDL Expression

A whitespace separated list of representations to be used for 'true'. These are compared after trimming when parsing, and before padding when unparsing.

If dfdl:lengthKind is 'explicit' or 'implicit' and either dfdl:textPadKind or dfdl:textTrimKind  is 'none' then both dfdl:textBooleanTrueRep and dfdl:textBooleanFalseRep must have the same length else it is a schema definition error.

This property can be computed by way of an expression which returns a string containing a whitespace separated list of values. The expression must not contain forward references to elements which have not yet been processed.

On unparsing the first value is used.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

Text Boolean Character Restrictions: The string literal is restricted to allow only certain kinds of syntax:

·         DFDL character entities are allowed

·         The DFDL byte value entity ( %#r ) is not allowed.

·         DFDL Character classes  NL, WSP, WSP+, WSP*, and ES are not allowed

It is a schema definition error if the string literal contains any of the disallowed constructs.

Annotation: dfdl:element, dfdl:simpleType

textBooleanFalseRep

List of DFDL String Literals or DFDL Expression

A whitespace separated list of representations to be used for 'false'. These are compared after trimming when parsing, and before padding when unparsing.

If dfdl:lengthKind is 'explicit' or 'implicit' and either dfdl:textPadKind or dfdl:textTrimKind  is 'none' then both dfdl:textBooleanTrueRep and dfdl:textBooleanFalseRep must have the same length else it is a schema definition error.

This property can be computed by way of an expression which returns a string containing a whitespace separated list of values. The expression must not contain forward references to elements which have not yet been processed.

On unparsing the first value is used.

If dfdl:ignoreCase is 'yes' then the case of the string is ignored by the parser.

The string literal value is restricted in the same way as described in "Text Boolean Character Restrictions" in the description of the dfdl:textBooleanTrueRep property.

Annotation: dfdl:element, dfdl:simpleType

textBooleanJustification

Enum

Valid values are 'left', 'right', 'center'.

Controls how the data is padded or trimmed on parsing and unparsing.

Behavior as for dfdl:textStringJustification.

Annotation: dfdl:element, dfdl:simpleType

textBooleanPadCharacter

DFDL String Literal

The value that is used when padding or trimming boolean elements. The value can be a single character or a single byte.
If a character, then it can be specified using a literal character or using DFDL entities.

If a byte, then it must be specified using a single byte value entity.

If a pad character is specified when dfdl:lengthUnits is 'bytes' then the pad character must be a single-byte character.

If a pad byte is specified when dfdl:lengthUnits is 'characters' then

  • the dfdl:encoding must be a fixed-width encoding
  • padding and trimming must be applied using a sequence of N pad bytes, where N is the width of a character in the fixed-width encoding.

The string literal value is restricted in the same way as described in "Pad Character Restrictions" in the description of the dfdl:textStringPadCharacter property.

Annotation: dfdl:element, dfdl:simpleType

Table 37 Properties Specific to Boolean with Text Representation
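The way the text boolean properties combine on parsing and unparsing can be sketched in Python. The sketch assumes that trimming of the pad character is enabled and that the representations are plain strings without DFDL entities; the function names and parameters are hypothetical, and 'center' justification is only handled on parsing.

```python
def parse_text_boolean(text, true_reps, false_reps, pad_char=" ",
                       justification="left", ignore_case=False):
    """Trim the pad character, then compare against both lists."""
    if justification == "left":
        trimmed = text.rstrip(pad_char)   # left-justified data is padded on the right
    elif justification == "right":
        trimmed = text.lstrip(pad_char)
    else:
        trimmed = text.strip(pad_char)    # 'center'

    def normalize(s):
        return s.lower() if ignore_case else s

    # Each property is a whitespace separated list of representations.
    if normalize(trimmed) in [normalize(r) for r in true_reps.split()]:
        return True
    if normalize(trimmed) in [normalize(r) for r in false_reps.split()]:
        return False
    raise ValueError("processing error: not a boolean representation")


def unparse_text_boolean(value, true_reps, false_reps, length=None,
                         pad_char=" ", justification="left"):
    """Use the first listed representation, then pad it if a length is given."""
    text = (true_reps if value else false_reps).split()[0]
    if length is None:
        return text
    if justification == "left":
        return text.ljust(length, pad_char)
    if justification == "right":
        return text.rjust(length, pad_char)
    raise NotImplementedError("'center' padding is not covered by this sketch")


print(parse_text_boolean("yes   ", "true yes", "false no"))         # True
print(unparse_text_boolean(False, "true yes", "false no", 7, "*"))  # false**
```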

13.10  Properties Specific to Boolean with Binary Representation

Property Name

Description

binaryBooleanTrueRep

Non-negative Integer

This value, treated as a binary xs:unsignedInt (see Section 13.7.1 Converting Logical Numbers to/from Binary Representation), gives the representation to be used for 'true'.

If this property value is the empty string, when parsing it means dfdl:binaryBooleanTrueRep is any value other than dfdl:binaryBooleanFalseRep; when unparsing, the one's complement of the dfdl:binaryBooleanFalseRep will be used.

The length of the data value of the element must be between 1 bit and 32 bits (4 bytes) as described in Section 12.3.7.2. It is a schema definition error if the value (when provided) of dfdl:binaryBooleanTrueRep cannot fit as an unsigned binary integer in the specified length.

Annotation: dfdl:element, dfdl:simpleType

binaryBooleanFalseRep

Non-negative Integer

This value, treated as a binary xs:unsignedInt (see Section 13.7.1 Converting Logical Numbers to/from Binary Representation), gives the representation to be used for 'false'.

The length of the data value of the element must be between 1 bit and 32 bits (4 bytes) as described in Section 12.3.7.2. It is a schema definition error if the value of dfdl:binaryBooleanFalseRep cannot fit as an unsigned binary integer in the specified length.

Annotation: dfdl:element, dfdl:simpleType

Table 38 Properties Specific to Boolean with Binary Representation
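The rules for an empty dfdl:binaryBooleanTrueRep can be sketched in Python as follows. The function names are hypothetical, an empty property value is modelled as None, and the one's complement is assumed to be taken within the element's length in bits.

```python
def parse_binary_boolean(raw, false_rep, true_rep=None):
    """Map an unsigned integer read from the data to a boolean."""
    if raw == false_rep:
        return False
    # An empty true representation means "any value other than false".
    if true_rep is None or raw == true_rep:
        return True
    raise ValueError("processing error: neither true nor false representation")


def unparse_binary_boolean(value, length_bits, false_rep, true_rep=None):
    """Choose the unsigned integer to write for a boolean."""
    if not 1 <= length_bits <= 32:
        raise ValueError("length must be between 1 and 32 bits")
    mask = (1 << length_bits) - 1
    if not value:
        rep = false_rep
    elif true_rep is not None:
        rep = true_rep
    else:
        # One's complement of the false representation, within the length.
        rep = ~false_rep & mask
    if rep > mask:
        raise ValueError("schema definition error: value does not fit in length")
    return rep


print(parse_binary_boolean(7, false_rep=0))              # True
print(hex(unparse_binary_boolean(True, 8, false_rep=0))) # 0xff
```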

13.11  Properties specific to Calendar with Text or Binary Representation

The properties describe how a calendar is to be interpreted, including an unparsing pattern property plus properties that qualify the pattern.

These properties can be used when a calendar has dfdl:representation 'text' or dfdl:representation 'binary' and a packed decimal representation.

Property Name

Description

calendarPattern

String

Defines the ICU pattern that describes the format of the calendar. The pattern defines where the year, month, day, hour, minute, second, fractional second and time zone components appear. See Section 13.11.1 The dfdl:calendarPattern property below.

When the dfdl:representation is 'binary' and the representation is a packed decimal then the pattern can contain only characters and symbols that always result in the presentation of digits.

Annotation: dfdl:element, dfdl:simpleType

calendarPatternKind

Enum

Valid values 'explicit', 'implicit'

'explicit' means the pattern is given by dfdl:calendarPattern,

'implicit' means the pattern is derived from the XML schema date/time type.

Logical Type    Default Pattern
xs:date         yyyy-MM-dd
xs:dateTime     yyyy-MM-dd'T'HH:mm:ss
xs:time         HH:mm:ssZ

Annotation: dfdl:element, dfdl:simpleType

calendarCheckPolicy

Enum

Valid values are 'strict', 'lax'

Indicates how lenient to be when parsing against the pattern.

See Section 13.11.2 The dfdl:calendarCheckPolicy Property below for details of the specific behaviors for 'strict' and 'lax'.

Annotation: dfdl:element, dfdl:simpleType

calendarTimeZone

String

This property provides the time zone that will be assumed if no time zone explicitly occurs in the data.

Valid values specify a UTC time zone offset by matching the regular expression:

(UTC)([+\-]([01]\d|\d)((([:][0-5]\d){1,2})?))?

In addition, empty string can be specified to indicate "no time zone", or the IANA time zone format (also known as the Olson time zone format) may be used (e.g., America/New_York). See [IANATimeZone].

Note that this property is used when parsing only.

Annotation: dfdl:element, dfdl:simpleType

calendarObserveDST

Enum

Valid values are 'yes', 'no'

Whether the time zone given in dfdl:calendarTimeZone observes daylight saving time.

Ignored if dfdl:calendarTimeZone is specified in UTC format, or if dfdl:calendarTimeZone is empty string. That is, this property is used only if the dfdl:calendarTimeZone is in IANA (also known as Olson) format [IANATimeZone].

This property applies to parsing only.

Annotation: dfdl:element, dfdl:simpleType

calendarFirstDayOfWeek

Enum

Valid values 'Monday' … 'Sunday'

The day of the week upon which a new week is considered to start.

Annotation: dfdl:element, dfdl:simpleType

calendarDaysInFirstWeek

Non-negative Integer

Valid values 1 to 7

Specify the number of days of the new year that must fall within the first week.

The start of a year usually falls in the middle of a week. If the number of days in that week is less than the value specified here, the week is considered to be the last week of the previous year; hence week 1 starts some days into the new year. Otherwise it is considered to be the first week of the new year; hence week 1 starts some days before the new year.

Annotation: dfdl:element, dfdl:simpleType

calendarCenturyStart

Non-negative Integer

Valid values 0 to 99.

This property determines on parsing how two-digit years are interpreted. Specify the two digits that start a 100-year window that contains the current year. For example, if you specify 89, and the current year is 2006, all two-digit years are interpreted as being in the range 1989 to 2088. A two-digit year less than 89 will be interpreted as 20nn and a two-digit year more than or equal to 89 will be treated as 19nn.

Annotation: dfdl:element, dfdl:simpleType

calendarLanguage

String or DFDL Expression

The language that is used when the pattern produces a presentation in text.

The value must match the regular expression:

([A-Za-z]{1,8}([\-_][A-Za-z0-9]{1,8})*)

It is a schema definition error otherwise.

All DFDL Implementations must support dfdl:calendarLanguage value "en".

DFDL implementations may support additional values; however, the value of the dfdl:calendarLanguage property is always interpreted as a Unicode Language Identifier as defined by [LDML] and [CLDR].

Annotation: dfdl:element, dfdl:simpleType

Table 39 Properties specific to Calendar with Text or Binary Representation
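Two of the properties in the table above lend themselves to a short Python sketch: the UTC offset form of dfdl:calendarTimeZone and the 100-year window of dfdl:calendarCenturyStart. The sketch assumes the UTC offset pattern is written with balanced parentheses and must match the whole value; the function names are hypothetical, and the current year is passed in explicitly so that the example is repeatable.

```python
import re

# UTC offset form of dfdl:calendarTimeZone, e.g. "UTC", "UTC+5", "UTC-08:00".
UTC_OFFSET = re.compile(r"(UTC)([+\-]([01]\d|\d)((([:][0-5]\d){1,2})?))?")


def is_utc_time_zone(value: str) -> bool:
    """True if the whole value matches the UTC offset pattern."""
    return UTC_OFFSET.fullmatch(value) is not None


def expand_two_digit_year(yy: int, century_start: int, current_year: int) -> int:
    """Place a two-digit year in the 100-year window that starts at a year
    ending in `century_start` and contains `current_year`."""
    # Latest year ending in `century_start` that is not after the current year.
    window_start = current_year - (current_year % 100) + century_start
    if window_start > current_year:
        window_start -= 100
    # First candidate in the window's starting century, moved up if too early.
    year = window_start - (window_start % 100) + yy
    if year < window_start:
        year += 100
    return year


print(is_utc_time_zone("UTC-08:00"))        # True
print(expand_two_digit_year(5, 89, 2006))   # 2005
print(expand_two_digit_year(89, 89, 2006))  # 1989
```

With a century start of 89 and a current year of 2006, the window runs from 1989 to 2088, so 88 expands to 2088 and 89 to 1989.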

13.11.1  The dfdl:calendarPattern property

The dfdl:calendarPattern describes how to parse and unparse text and binary representations of dateTime, date and time logical types. The pattern is primarily used on unparsing to define the format but is also used to aid parsing.

The pattern is derived from the ICU SimpleDateFormat class described here: [ICUDateTime], which uses symbols defined by [LDML].

An extension is the formatting symbol I, which means accept a subset of ISO 8601 [ISO8601] compliant calendars.

Symbol

Meaning

Presentation

Example

G       

era designator

Text             

G

AD

y       

year

Number           

y, yyyy

yy

1996

96

u

year (allows negative years)

Number          

u

1900, 0, -500

Y       

year (of the week of year)

Number           

Y

1997

M       

month in year

Text & Number  

M, MM

MMM

MMMM

MMMMM

 09

Sept

September

S

d       

day in month

Number           

d

dd

2

02

h       

hour in am/pm (1~12)

Number           

h

hh

7

07

H       

hour in day (0~23)

Number          

H

HH

0

00

m       

minute in hour

Number           

m

mm

4

04

s       

second in minute

Number          

s

ss

5

05

S       

fractional second (see note 1)

Number        

S

SS

SSS

2

24

235

E       

day of week

Text             

E, EE, EEE

EEEE

EEEEE

EEEEEE

Tues

Tuesday

T

Tu

e       

day of week (local)

Text & Number

e, ee

eee

eeee

eeeee

eeeeee

2

Tues

Tuesday

T

Tu

D       

day in year            

Number           

D

189

F       

day of week in month  

Number           

F

2 (2nd Wed in July)

w       

week in year

Number           

w, ww

27

W       

week in month          

Number           

W

2

a       

am/pm marker          

Text            

a

pm

k       

hour in day (0~24 )     

Number           

k

kk

2, 24

02, 24

K       

hour in am/pm (0~11)  

Number           

K

KK

0

00

z

time zone: specific non-location

Text

z, zz, zzz

zzzz

PDT

Pacific Daylight Time

Z

time zone: ISO8601 basic format

time zone: long localized GMT

Text

Z, ZZ, ZZZ

ZZZZ

-0800, +0000

GMT-08:00, GMT+00:00

O

time zone: localized GMT

Text

O

OOOO

GMT-8

GMT-08:00

v

time zone: generic non-location

Text

v

vvvv

PT

Pacific Time

V

time zone: short time zone ID

time zone: long time zone ID

time zone: exemplar city

time zone: generic location.

Text

V

VV

VVV

VVVV

uslax

America/Los_Angeles

Los Angeles

Los Angeles Time

x

time zone: ISO8601 basic or extended format

Text

x

xx

xxx

-08, +0530, +0000

-0800, +0000

-08:00, +00:00

X

time zone: ISO8601 basic or extended format. The UTC indicator "Z" is used when local time offset is 0.

Text

X

XX

XXX

-08, +0530, Z

-0800, Z

-08:00, Z

I

ISO8601 date/time

Text