Notes on an Extensible DFDL

The process of standardizing DFDL has been long and difficult because of the sheer complexity and number of features required to handle both modern and legacy data formats.

Early on, there was interest in defining an extensible kernel or core DFDL, so that features omitted from the standard could be built by the users themselves.

If such extensions could be wedded seamlessly to the language, DFDL v1.0 would not need so many features. Libraries would emerge, and those that proved useful could later be standardized.

Examples of features that some constituency wanted as extensions rather than core:

  • Everything about IBM mainframe COBOL data: zoned decimal, BCD, EBCDIC encoding.
  • Unordered sequences
  • lengthKind="pattern"
  • bi-directional text
  • lengthUnits='bits'
  • alignment

Suffice it to say, people would rather keep any feature they do not expect to need out of the core of the language, because that makes the language easier for them to learn. Irrelevant features are certainly not of zero cost to someone trying to use DFDL: one must, after all, understand them well enough to know that one does not need them.

Another primary goal of extensibility is to overcome the one-level, flat nature of DFDL by putting multiple passes of transformation into one schema. This is sometimes called layering.

While the core/kernel and extensibility idea is very attractive, it has one key flaw: there are no existing extensible data format description systems from which to derive the standard. Standardization is not supposed to be a research project; it is supposed to derive a standard from existing practice. All known data format description systems are closed, in that extending them requires modifying the software of the system.

DFDL extensibility remains a valuable research goal. Given an implementation of DFDL like Daffodil, one can examine how various kinds of extensibility might be achieved.

Areas of extensibility

The following are suggested ways one could extend DFDL, ordered from what seems simplest to understand (and perhaps implement) to the more complicated kinds of extension:

  • Add a new representation for an existing primitive data type - this is like adding zoned decimal as a supported data format, or adding ones'-complement binary integers. It involves adding:
    • A new enum to an existing keyword (e.g., textNumberRep is currently either 'standard' or 'zoned'. One could add an additional value to this enum to denote a different text number representation.)
    • Additional properties that control the representation of the new type.
  • Define a new construct by macro-expanding it into combinations of pre-existing constructs. This has been suggested for implementing sequenceKind="unordered", by macro expansion into an array of choices.
  • Embed a new representation of an existing data type that uses a complex type to describe the representation - that is, break down the flat notion of simple types, and allow a complex type to be used to define the parts that contribute to computation of a simple value.
    • An example of this would be a very small floating-point format (as used in some sensors), such as a 3-bit two's-complement exponent and a 5-bit mantissa. This can be converted into type float via simple computations, but one must be able to describe the pieces as a sequence of child elements.
    • Note that this is a limited kind of layering.
  • Define a new occursCountKind - control when another array element is created and when the end of an array is detected.
  • Define a new separator suppression policy - control when unnecessary separators are tolerated and when they are not.
  • Define a new character encoding - this is hard because character encodings are embedded very deeply in the algorithms (for obvious efficiency reasons).
  • Add a new primitive data type - quad-precision floating point is the typical example used here. This is challenging because it requires modifying XSD itself and the underlying data primitives (e.g., Scala has no quad float data type, so one would have to be built).
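The small floating-point example above can be made concrete with a short sketch. The text only gives the field widths, so everything else here is an assumption made for illustration: the exponent is taken to occupy the high 3 bits of a single byte, the mantissa is taken to be an unsigned 5-bit integer with no implicit leading bit, and the value is taken to be mantissa × 2^exponent. The function name `decode_tiny_float` is hypothetical and the code is written in Python purely for brevity; it is not part of Daffodil.

```python
def decode_tiny_float(byte: int) -> float:
    """Decode one byte laid out as [3-bit exponent][5-bit mantissa].

    Assumed layout (not taken from any real sensor specification):
    - exponent: high 3 bits, two's complement, range -4..3
    - mantissa: low 5 bits, unsigned, range 0..31
    - value:    mantissa * 2 ** exponent
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte in the range 0..255")

    # These two fields correspond to the "sequence of child elements"
    # that a complex type would describe.
    exp_bits = (byte >> 5) & 0b111
    mantissa = byte & 0b11111

    # Reinterpret the 3-bit field as two's complement:
    # 0b100..0b111 map to -4..-1, 0b000..0b011 map to 0..3.
    exponent = exp_bits - 8 if exp_bits & 0b100 else exp_bits

    # The "simple computation" that yields the simple-typed value.
    return float(mantissa) * 2.0 ** exponent
```

The point of the sketch is that the two bit fields are ordinary child elements, while the final line is the computation that a simple-type extension would have to be allowed to express.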

Adding Extensibility to Daffodil

Now that the Daffodil implementation of DFDL has progressed to its current state, it is possible to comment on what would be required to add extensibility to Daffodil.

  • Ability to add attributes to attribute groups of the DFDL XSD schemas - thereby enabling validation of these new attributes. These would not be in the dfdl namespace, but in other extension namespaces.

This is challenging. The current code-generation approach assumes the DFDL XSD schemas are complete. If they were modified, then one would need to re-run the code-generator and re-link the Daffodil library jar(s) via some sort of reflective automatic rebuild. Making such a thing very robust and transparent to a user is difficult.

An alternative is to add an attribute extension hook to the places where properties and their values are accessed. If a property is not found, then the extension list would be checked. If a property value is not found, then the property value extension list for that property would be checked. This creates some difficulty with the very strong typing discipline used in Daffodil to make the code more robust. Example: The lengthKind property is a strongly typed Scala enumeration with specific values LengthKind.Delimited and LengthKind.Expression, and so forth. A LengthKind.Extension value could be added which would trigger an extension mechanism, and then cause untyped code to run to deal with a user extension. Such code would need to be highly defensive, although if the DFDL XSD schemas were properly modified by the user, then DFDL schema validation would at least ensure that the new property usage passed validation.
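A minimal sketch of the property-value hook described above might look like the following. It is written in Python for brevity and does not reflect Daffodil's actual Scala code: the `EXTENSION` member, the `VALUE_EXTENSIONS` registry, and the functions `register_value_extension` and `resolve_length_kind` are all hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class LengthKind(Enum):
    """Stand-in for a strongly typed property enumeration."""
    DELIMITED = "delimited"
    EXPRESSION = "expression"
    EXTENSION = "<extension>"  # marker: the value is handled by user code


# Hypothetical registry: property name -> {raw value -> user handler}.
Handler = Callable[[], str]
VALUE_EXTENSIONS: Dict[str, Dict[str, Handler]] = {}


def register_value_extension(prop: str, raw: str, handler: Handler) -> None:
    """Record a user-supplied handler for a new value of an existing property."""
    VALUE_EXTENSIONS.setdefault(prop, {})[raw] = handler


def resolve_length_kind(raw: str) -> Tuple[LengthKind, Optional[Handler]]:
    """Resolve a raw lengthKind string to a typed value.

    Built-in values win. Only when no built-in value matches is the
    extension list for this property consulted; the error is raised
    only if that also fails.
    """
    for kind in LengthKind:
        if kind is not LengthKind.EXTENSION and kind.value == raw:
            return kind, None
    handler = VALUE_EXTENSIONS.get("lengthKind", {}).get(raw)
    if handler is None:
        raise ValueError(f"unknown lengthKind value: {raw!r}")
    return LengthKind.EXTENSION, handler
```

The typed code path is unchanged for built-in values. Everything behind `EXTENSION` is untyped user code, which is why the text above says it would need to be highly defensive.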

Other extension capabilities might include:

  • Ability to add new grammar productions that reuse existing grammar productions and terminals
  • Ability to inject new constructs into existing productions
    • Modify a production with a begin and/or end "wrapper" that is composed with the contents of the production by sequential composition or by alternative composition
    • Note: This is like aspect-oriented programming
  • Ability to add new terminals (aka primitives) which generate parsers/unparsers
  • Ability to add new parsers/unparsers
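The "wrapper" idea in the list above can be sketched with a toy parser-combinator model. This is only an illustration in Python, not Daffodil's grammar machinery: a parser here is assumed to be a function from an input string and a position to a new position (or `None` on failure), and `literal`, `seq`, `alt`, `wrap_sequential`, and `wrap_alternative` are hypothetical names.

```python
from typing import Callable, Optional

# A toy parser: takes (data, position), returns the new position or None.
Parser = Callable[[str, int], Optional[int]]


def literal(text: str) -> Parser:
    """Terminal that matches a fixed string."""
    def parse(data: str, pos: int) -> Optional[int]:
        return pos + len(text) if data.startswith(text, pos) else None
    return parse


def seq(*parsers: Parser) -> Parser:
    """Sequential composition: every parser must succeed, in order."""
    def parse(data: str, pos: int) -> Optional[int]:
        current: Optional[int] = pos
        for p in parsers:
            current = p(data, current)
            if current is None:
                return None
        return current
    return parse


def alt(*parsers: Parser) -> Parser:
    """Alternative composition: the first parser that succeeds wins."""
    def parse(data: str, pos: int) -> Optional[int]:
        for p in parsers:
            result = p(data, pos)
            if result is not None:
                return result
        return None
    return parse


def wrap_sequential(production: Parser,
                    begin: Optional[Parser] = None,
                    end: Optional[Parser] = None) -> Parser:
    """Inject begin/end wrappers around an existing production."""
    parts = [p for p in (begin, production, end) if p is not None]
    return seq(*parts)


def wrap_alternative(production: Parser, extension: Parser) -> Parser:
    """Offer an extension as a fallback alternative to a production."""
    return alt(production, extension)
```

For example, `wrap_sequential(literal("abc"), begin=literal("["), end=literal("]"))` accepts `[abc]` but no longer accepts a bare `abc`, which shows how easily a wrapper changes what an existing production will parse.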

Most of the above could be achieved by various kinds of indirect linkage, that is, the old "one more indirection solves the problem" fix. Example: instead of primitives being code objects referenced directly as Scala class members from productions, we allow them to be expressed as strings, look the strings up in a registry of registered primitives, and dynamically create an instance of one, initializing it with the current context. Many match-case statements in Daffodil would change from exhausting all options to having a case for each pre-defined option and then calling an extensibility hook of some kind in place of the error case. The error would be signaled only if no extension was detected.

However, that really understates the big issue, which is that editing the productions this way is very likely to break many things in Daffodil. A user is very likely to create productions which are meaningful so long as the new properties that control them are used as intended, but which generate broken grammars that can't parse anything if any mistakes are made in the use of the properties. So providing very precise control over when new extensions are to be considered, and when they are not, is crucial to making debugging possible.

Very Difficult/Impossible Extensions

These things are really not in the cards in Daffodil:

  • Defining new character encodings - we depend on the ICU libraries for everything about character decode/encode. Extensibility here would have to pass through to an extensibility hook in ICU.
  • Adding new error messages for existing constructs - translations of existing messages will certainly be possible, and the wording of the base English messages could be improved by supplying an improved English 'translation'. But adding entirely new error detection, where certain pre-existing constructs would generate error/warning messages that were not previously anticipated, is hard because it means weaving new code paths into existing code paths.
  • Changing lookahead/lookbehind buffer management - the buffer sizes can be tunable, but if a user wanted, say, a property that indicates a point in a schema where a lookback buffer can be discarded to save memory, that is hard.
  • Changing a pre-existing static property into one computed at run-time via an expression.
  • Adding new primitive simple types which do not exist in XSD - this requires extending the set of types XSD provides. It runs counter to the notion that DFDL is an add-on annotation language on top of XSD.
  • Recursion - the ability to define types/elements recursively. Recursion is not a feature of DFDL v1.0 as it stands today, and stripping DFDL v1.0 down to a core to which recursion could then be added as an extension is unlikely.

A Practical Core

There are some constructs in DFDL that seem very core to any implementation, are core to the Daffodil implementation, and form the basis for any kind of extensibility, rather than being candidates to be extensions.

Note on optional features of DFDL v1.0: these suggested core features do not line up at all well with the 'optional' features of the DFDL specification. The optional features do not strip DFDL down to an extensible minimal core, because DFDL as currently specified is not extensible. Rather, the goal of the optional features is to allow the creation of conformant subsets of DFDL v1.0 that are easily implemented by parties who want to be consistent with DFDL v1.0 but are unable, unwilling, or uninterested in creating a full DFDL v1.0 implementation.

Core features might be:

  • ordered sequences
  • choices - with discriminators, with direct-dispatch (a feature added to DFDL v1.0 via an erratum)
  • multiple occurrences (including optionals)
  • expressions
  • variables
  • primitive types - byte, string
  • hidden groups
  • calculated values - inputValueCalc, outputValueCalc
  • ignoreCase behavior
  • lengthUnits='bits'
  • lengthKind - delimited, endOfData, explicit, pattern, implicit
  • occursCountKind - expression, parsed
  • recursion (if needed in the future, it would have to be a core capability.)

DFDL would also need new mechanisms for specifying extensions. These would need to include:

...

 Page content removed. See https://cwiki.apache.org/confluence/display/DAFFODIL/For+Contributors.