Notes on an Extensible DFDL

The process of standardizing DFDL has been long and difficult because of the sheer complexity and number of features required to handle both modern and legacy data formats.

Early on, there was interest in defining a small kernel or core DFDL that was upward extensible, so that features omitted from the standard could be built by users themselves.

If such user-built features could be wed seamlessly to the language, DFDL v1.0 would not need so many features. Libraries of extensions would emerge, and those that proved useful could later be standardized.

Examples of features that some constituency wanted as extensions rather than as part of the core:

- Everything about IBM mainframe COBOL data: zoned decimal, BCD, EBCDIC encoding
- Unordered sequences
- lengthKind="pattern"
- Bidirectional text
- lengthUnits="bits"
- Alignment

Suffice it to say, people would rather keep any feature they don't expect to need out of the core of the language, because that makes the language easier for them to learn. Irrelevant features are not free for someone trying to use DFDL: one must, after all, understand them well enough to know that one does not need them.

Another primary goal of extensibility is to overcome the flat, one-level nature of DFDL by putting multiple passes of transformation into one schema. This is sometimes called layering.

While the core/kernel and extensibility idea is very attractive, it has one key flaw: there are no existing extensible data format description systems from which to derive the standard. Standardization is not supposed to be a research project. It is supposed to derive a standard from existing practice. All known data format description systems are closed, in that extending them requires modifying the software of the system.

DFDL extensibility remains a valuable research goal.

Areas of extensibility

The following are suggested ways one could extend DFDL, ordered from what seems simplest to understand, and perhaps to implement, to the more complicated kinds of extension:

- Add a new representation for an existing primitive data type. This is like adding zoned decimal as a supported data format, or adding ones-complement binary integers. It involves adding:
  - A new enum value for an existing property (e.g., textNumberRep is currently either 'standard' or 'zoned'; one could add a further value to this enum to denote a different text number representation).
  - Additional properties that control the new representation.
- Define a new construct by macro expansion into combinations of pre-existing constructs. This has been suggested for implementing sequenceKind="unordered", by macro expansion into an array of choices.
- Define a new separator suppression policy, controlling when unnecessary separators are tolerated and when they are not.
- Define a new character encoding. This is hard because character encodings are embedded very deeply in the algorithms (for obvious efficiency reasons).
- Add a new primitive data type. Quad-precision floating point is the typical example used here. This is challenging because it requires modifying XSD itself and the underlying data primitives (e.g., Scala has no quad float data type, so one would have to be built).
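The macro-expansion idea for sequenceKind="unordered" can be illustrated with a rough sketch. It assumes a schema is held as nested Python dicts with "kind", "props", and "children" keys; that representation and the function name expand_unordered are hypothetical, chosen only for illustration, and are not how Daffodil or any DFDL processor represents schemas.

```python
def expand_unordered(node):
    """Return a copy of the schema tree in which every unordered sequence
    is rewritten as an ordered sequence holding one repeating choice.

    The dict layout ("kind", "props", "children") is an assumption made
    for this sketch only.
    """
    # Expand nested constructs first, so inner unordered sequences are handled.
    children = [expand_unordered(child) for child in node.get("children", [])]
    props = dict(node.get("props", {}))  # copy, so the input is not mutated

    if node["kind"] == "sequence" and props.get("sequenceKind") == "unordered":
        props["sequenceKind"] = "ordered"
        # The "array of choices": one choice among the original children,
        # repeated once per child.
        choice = {
            "kind": "choice",
            "props": {"minOccurs": len(children), "maxOccurs": len(children)},
            "children": children,
        }
        return {**node, "props": props, "children": [choice]}

    return {**node, "props": props, "children": children}
```

Note that the expansion on its own is looser than a true unordered sequence: an array of choices would also accept the same child twice. A real extension would presumably need a follow-up check that each child occurs the permitted number of times.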

Now that the Daffodil implementation of DFDL has progressed to its current state, it is possible to comment on what adding extensibility to Daffodil would require.

- Ability to add attributes to the attribute groups of the DFDL XSD schemas, thereby enabling validation of these new attributes. These would not be in the dfdl namespace, but in other extension namespaces.
  - One would then re-run the code generator that reads these schemas to generate all the property mixins.
  - The code generator would have to be robust, provide good diagnostics, etc. However, users could test their new attributes in, say, Eclipse or another XSD validation tool before expecting Daffodil to accept them.
- Ability to add
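The code-generation step described in the first item above might look roughly like the following sketch. It reads attribute declarations from an xs:attributeGroup and emits one mixin stub per group. Everything specific here is an assumption made for illustration: the extension namespace urn:example:dfdl-ext, the group and attribute names, and the shape of the emitted text, which is a made-up stand-in and not what Daffodil's actual generator produces.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical extension schema; namespace and names are illustrative only.
EXTENSION_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:dfdl-ext">
  <xs:attributeGroup name="ExtNumberAG">
    <xs:attribute name="onesComplement" type="xs:boolean"/>
    <xs:attribute name="quadFloatRep" type="xs:string"/>
  </xs:attributeGroup>
</xs:schema>
"""


def read_attribute_groups(xsd_text):
    """Map each top-level attribute group name to its attribute names."""
    root = ET.fromstring(xsd_text.strip())
    groups = {}
    for group in root.findall(f"{{{XS}}}attributeGroup"):
        group_name = group.get("name")
        if group_name is None:
            # Diagnostics matter: say what is wrong rather than failing later.
            raise ValueError("attributeGroup is missing its 'name' attribute")
        names = []
        for attr in group.findall(f"{{{XS}}}attribute"):
            attr_name = attr.get("name")
            if attr_name is None:
                raise ValueError(
                    f"attribute in group '{group_name}' is missing its 'name'"
                )
            names.append(attr_name)
        groups[group_name] = names
    return groups


def generate_mixin_stubs(groups):
    """Emit Scala-like trait stubs as text (invented output format)."""
    lines = []
    for group_name, attr_names in groups.items():
        lines.append(f"trait {group_name}Mixin {{")
        for attr_name in attr_names:
            lines.append(f"  def {attr_name}: String")
        lines.append("}")
    return "\n".join(lines)
```

Calling generate_mixin_stubs(read_attribute_groups(EXTENSION_XSD)) yields a single trait stub for ExtNumberAG. The point of the sketch is the division of labor: the XSD remains the single source of truth for the new attributes, so an ordinary XSD validator can check them before the generator ever runs.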
