
DFDL extensibility remains a valuable research goal. Given an implementation of DFDL like Daffodil, one can examine how various kinds of extensibility might be achieved.

Areas of extensibility

The following are suggested ways one could extend DFDL, ordered from what seems simplest to understand and implement to the more complicated kinds of extension:

  • Add a new representation for an existing primitive data type - this is like adding zoned decimal as a supported data format, or adding ones-complement binary integers. It involves adding:
    • A new enum to an existing keyword (e.g., textNumberRep is currently either 'standard' or 'zoned'. One could add an additional value to this enum to denote a different text number representation.)
    • Additional properties that control the representation of the new type.
  • Define a new construct by way of a macro-expansion of it into combinations of other pre-existing constructs. This has been suggested for implementing sequenceKind="unordered", by macro expansion into an array of choices.
  • Embed a new representation of an existing data type that uses a complex type to describe the representation - that is, break down the flat notion of simple types, and allow a complex type to be used to define the parts that contribute to computation of a simple value.
    • An example of this would be a very small floating point format (as used in some sensors) such as a 3-bit two's-complement exponent and a 5-bit mantissa. This can be converted into type float via simple computations, but one must be able to describe the pieces as a sequence of child elements.
    • Note that this is a limited kind of layering.
  • Define a new occursCountKind - control when another array element is created and when the end of an array is detected.
  • Define a new separator suppression policy - control when unnecessary separators are tolerated and when they are not.
  • Define a new character encoding - hard because character encodings are embedded very deeply algorithmically (for obvious efficiency reasons)
  • Add a new primitive data type - quad-precision floating point is the typical example used here. This is challenging because it requires modifying XSD itself and the underlying data primitives (e.g., Scala has no quad float data type, so one would have to be built)
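
The small floating point example in the list above can be sketched in Scala as follows. This is only an illustration under stated assumptions: the note does not say how the bits are laid out or scaled, so the sketch assumes one byte whose top 3 bits are the two's-complement exponent and whose low 5 bits are an unsigned integer mantissa, with value = mantissa × 2^exponent. The names TinyFloat, Parts, split, and toFloat are hypothetical and are not Daffodil APIs.

```scala
object TinyFloat {
  /** Sign-extend the low `bits` bits of `raw` as a two's-complement integer. */
  def signExtend(raw: Int, bits: Int): Int = {
    val shift = 32 - bits
    (raw << shift) >> shift // arithmetic right shift restores the sign
  }

  /** The two "child elements" a complex type would describe (hypothetical). */
  final case class Parts(exponent: Int, mantissa: Int)

  /** Split one byte: top 3 bits = exponent, low 5 bits = mantissa (assumed layout). */
  def split(b: Int): Parts = {
    val exponentBits = (b >> 5) & 0x7
    val mantissaBits = b & 0x1F
    Parts(signExtend(exponentBits, 3), mantissaBits)
  }

  /** The "simple computation" that yields the simple value (assumed scaling). */
  def toFloat(p: Parts): Float =
    (p.mantissa * math.pow(2.0, p.exponent.toDouble)).toFloat

  def decode(b: Int): Float = toFloat(split(b))
}
```

Under these assumptions, TinyFloat.decode(0x25) splits into exponent 1 and mantissa 5 and yields 10.0. The point is that the parts are ordinary child elements, and only the final step produces the simple float value.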

Adding Extensibility to Daffodil

Having progressed the Daffodil implementation of DFDL to its current state, it is now possible to comment on what would be required to add extensibility to Daffodil.

  • Ability to add attributes to attribute groups of the DFDL XSD schemas - thereby enabling validation of these new attributes. These would not be in the dfdl namespace, but in other extension namespaces. One would then

This is challenging. The current code-generation approach assumes the DFDL XSD schemas are complete. If they were modified, one would need to re-run the code generator and re-link the Daffodil library jar(s) via some sort of reflective automatic rebuild. Making such a thing robust and transparent to a user is difficult.

An alternative is to add an attribute extension hook at the places where properties and their values are accessed. If a property is not found, then the extension list would be checked. If a property value is not found, then the property value extension list for that property would be checked. This creates some difficulty with the very strong typing discipline used in Daffodil to make the code more robust. Example: the lengthKind property is a strongly typed Scala enumeration with specific values LengthKind.Delimited, LengthKind.Expression, and so forth. A LengthKind.Extension value could be added which would trigger an extension mechanism and cause untyped code to run to deal with a user extension. Such code would need to be highly defensive, although if the DFDL XSD schemas were properly modified by the user, DFDL schema validation would at least ensure that the new property usage passed validation.
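
A minimal sketch of such a hook, assuming a simplified stand-in for the real enumeration: the built-in values are tried first, and only then is a user-supplied extension list consulted. The object PropertyLookup, the function parseLengthKind, and the shape of LengthKind.Extension are hypothetical illustrations, not Daffodil's actual types.

```scala
object PropertyLookup {
  // Simplified stand-in for a strongly typed property enumeration.
  sealed trait LengthKind
  object LengthKind {
    case object Delimited extends LengthKind
    case object Expression extends LengthKind
    // Hypothetical escape hatch: carries the raw, untyped extension value.
    final case class Extension(name: String) extends LengthKind
  }

  private val builtIn: Map[String, LengthKind] = Map(
    "delimited" -> LengthKind.Delimited,
    "expression" -> LengthKind.Expression
  )

  /** Check the built-in values first, then the property value extension list. */
  def parseLengthKind(value: String, extensions: Set[String]): Either[String, LengthKind] =
    builtIn.get(value) match {
      case Some(kind) => Right(kind)
      case None if extensions.contains(value) => Right(LengthKind.Extension(value))
      case None => Left(s"Unknown lengthKind value: $value")
    }

  /** Typed code handles built-ins; the Extension case hands off to untyped code. */
  def describe(kind: LengthKind): String = kind match {
    case LengthKind.Delimited => "built-in delimited length"
    case LengthKind.Expression => "built-in expression length"
    case LengthKind.Extension(name) => s"user extension '$name' (untyped handling)"
  }
}
```

Because the trait stays sealed, the compiler still checks that every match handles the Extension case, so the loss of type safety is confined to whatever code runs on behalf of the extension.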

Other extension capabilities might include:

  • Ability to add new grammar productions that reuse existing grammar productions and terminals
  • Ability to inject new constructs into existing productions
    • Modify a production with a begin and/or end "wrapper" that are composed with the contents of the production by sequential composition or by alternative composition
    • Note: This is like aspect-oriented programming
  • Ability to add new terminals (aka primitives) which generate parsers/unparsers
  • Ability to add new parsers/unparsers
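
The idea of wrapping an existing production can be illustrated with a toy combinator sketch. It assumes a production is just a function that consumes a prefix of a string and returns the remainder, which is far simpler than Daffodil's real grammar machinery; all names here (ProductionWrappers, wrapSeq, wrapAlt) are hypothetical.

```scala
object ProductionWrappers {
  // Toy production: consume a prefix of the input and return the rest, or fail.
  type Production = String => Option[String]

  def literal(text: String): Production =
    in => if (in.startsWith(text)) Some(in.drop(text.length)) else None

  /** Sequential composition: run `a`, then run `b` on what is left. */
  def seq(a: Production, b: Production): Production =
    in => a(in).flatMap(b)

  /** Alternative composition: try `a`; if it fails, try `b` on the same input. */
  def alt(a: Production, b: Production): Production =
    in => a(in).orElse(b(in))

  /** Wrap an existing production with begin and end parts, sequentially. */
  def wrapSeq(begin: Production, body: Production, end: Production): Production =
    seq(begin, seq(body, end))

  /** Wrap an existing production with an extension tried as an alternative. */
  def wrapAlt(body: Production, extension: Production): Production =
    alt(body, extension)
}
```

For example, wrapSeq(literal("<"), literal("42"), literal(">")) accepts "<42>" but no longer accepts a bare "42", which shows how a wrapper changes what the original production matches.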

Most of these capabilities could be achieved by various kinds of indirect linkage, that is, the old "one more indirection solves the problem" fix. Example: instead of primitives being code objects referenced directly as Scala class members from productions, we allow them to be expressed as strings, look up those strings in a registry of registered primitives, and dynamically create an instance of one, initializing it with the current context. Many match-case statements in Daffodil would change from exhausting all options to having a case for each predefined option and then calling an extensibility hook of some kind in place of the error case. The error would be signaled only if no extension was detected.
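
A registry of this kind might look like the following sketch. It assumes primitives can be built from a small context object by a factory function; Context, Primitive, register, and create are hypothetical names chosen for illustration and do not correspond to Daffodil's real classes.

```scala
import scala.collection.mutable

object PrimitiveRegistry {
  // Hypothetical stand-ins for the compiler context and a grammar primitive.
  final case class Context(elementName: String)
  trait Primitive { def describe: String }

  // Name -> factory that builds a primitive for the current context.
  private val factories = mutable.Map.empty[String, Context => Primitive]

  def register(name: String)(factory: Context => Primitive): Unit =
    factories(name) = factory

  /** Look the name up and instantiate; signal an error only if nothing is registered. */
  def create(name: String, ctx: Context): Either[String, Primitive] =
    factories.get(name) match {
      case Some(factory) => Right(factory(ctx))
      case None => Left(s"No primitive registered under '$name'")
    }
}
```

A user extension would call register once at load time, after which productions can refer to the primitive by its string name. The cost of the extra indirection is that a misspelled name is detected only at lookup time rather than at compile time.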

However, that understates the big issue, which is that editing the productions this way is very likely to break many things in Daffodil. A user is very likely to create productions which are meaningful so long as the new properties that control them are used as intended, but which generate broken grammars that can't parse anything if any mistakes are made in the use of the properties. So providing very precise control over when new extensions are to be considered, and when they are not, is crucial to making debugging possible.

Very Difficult/Impossible Extensions

These extensions are really not in the cards for Daffodil:

  • Defining new character encodings - we depend on the ICU libraries for everything about character decode/encode. Extensibility here would have to pass through to an extensibility hook in ICU.
  • Adding new error messages for existing constructs - international translations of existing messages will certainly be possible, and improving wording in the base English messages could be done by having an improved English 'translation'. But adding entirely new error detections, where certain pre-existing constructs would generate new error/warning messages that weren't previously anticipated, is hard, because it is attempting to weave new code paths into existing code paths.
  • Changing lookahead/lookbehind buffer management - these can be tunable sizes, but if a user wanted, say, a property that indicates a point in a schema where a lookback buffer can be discarded to save memory, that is hard.
  • Changing a pre-existing static property into one computed at run-time via an expression.
  • Adding new primitive simple types which do not exist in XSD - this requires extending the set of types XSD provides. It runs counter to the notion that DFDL is an add-on annotation language on top of XSD.
  • Recursion - ability to define types/elements recursively. (Recursion is not a feature of DFDL v1.0 as it stands today. Stripping DFDL v1.0 down to a core to which one could then add recursion as an extension is unlikely.)

A Practical Core

There are some constructs in DFDL that seem very core to any implementation, are core to the Daffodil implementation, and form the basis for any kind of extensibility, rather than being candidates to be extensions.

Note on optional features of DFDL v1.0: These suggested core features don't line up at all well with the 'optional' features of the DFDL specification. The optional features do not strip DFDL down to an extensible minimal core, because DFDL as currently specified is not extensible. Rather, the goal of the optional features is to allow creation of conformant subsets of DFDL v1.0 which are easily implemented by parties who are interested in being consistent with DFDL v1.0, but who are unable, unwilling, or uninterested in undertaking creation of a full DFDL v1.0 implementation.

Core features might be:

  • ordered sequences
  • choices - with discriminators, with direct-dispatch (a feature added to DFDL v1.0 via an erratum)
  • multiple occurrences (including optionals)
  • expressions
  • variables
  • primitive types - byte, string
  • hidden groups
  • calculated values - inputValueCalc, outputValueCalc
  • ignoreCase behavior
  • lengthUnits='bits'
  • lengthKind - delimited, endOfData, explicit, pattern, implicit
  • occursCountKind - expression, parsed
  • recursion (if needed in the future, it would have to be a core capability.)

Mechanisms would need to be added to DFDL for specifying extensions. These would need to include:

  • Ability to define new types
  • Ability to compute on arrays of values
  • Ability to use the value of a string or hex binary or array of bytes, as a source of data for parsing. This has been called 'Data Source Indirection', and it is a form of layering.
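
The last item, Data Source Indirection, can be pictured with a small sketch: an outer parse extracts a byte-array value, and that value then becomes the data source for an inner parse. The data format here (a one-byte length, then an ASCII payload of comma-separated fields) and every name in the code are assumptions made up for illustration.

```scala
import java.nio.charset.StandardCharsets

object DataSourceIndirection {
  /** Outer parse: a one-byte length followed by that many payload bytes. */
  def parsePayload(data: Array[Byte]): Option[Array[Byte]] =
    data.headOption.flatMap { len =>
      val n = len & 0xFF // treat the length byte as unsigned
      if (data.length >= 1 + n) Some(data.slice(1, 1 + n)) else None
    }

  /** Inner parse: the payload value itself is the data source for this step. */
  def parseFields(payload: Array[Byte]): List[String] =
    new String(payload, StandardCharsets.US_ASCII).split(",", -1).toList

  /** Layering: the result of the outer parse feeds the inner parse. */
  def parse(data: Array[Byte]): Option[List[String]] =
    parsePayload(data).map(parseFields)
}
```

The inner step never sees the original stream, only the extracted value, which is what makes this a form of layering.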