
Error/Diagnostics and Tracing/Logging for Daffodil

Design notes and Requirements Analysis

Compilation

Sometimes we return

  1. a value
  2. a set of errors/diagnostics/warnings
  3. both

This suggests that Scala's convention of using an Either object is not suitable for us, because its very name implies one or the other, not both. Both will certainly occur whenever compilation succeeds but produces warnings.

A functional-programming approach to errors/diagnostics that is also consistent with Daffodil's DSOM model and the OOLAG (object-oriented lazy attribute grammar) approach works like this: you call a function or evaluate an expression to create a value. The returned object is always of the correct type, but it represents a value, a set of errors/diagnostics/warnings, or both. Its attributes include the set of errors/diagnostics/warnings/information items, and a status indicating whether the value is OK or whether there was an error severe enough to prevent creation of a value.

Example: an XPath expression is compiled into an object of type CompiledExpression. This type should contain a boolean member named hasValue or isRunnable (or perhaps wasCompilationSuccessful?), which is true if compilation of the XPath expression succeeded. A member 'diagnostics' will be Nil if there are no errors/diagnostics/warnings/info, and otherwise will contain a Set of error/diagnostic/warning/info objects. If hasValue is true, then any such objects are not ones that prevent the CompiledExpression from being used (i.e., they are probably all warnings/info objects).

In general, this is the idiom for any sort of compilation step. So instead of a compilation action returning an Option[T], where None indicates failure, we always get a T, and its isRunnable member will be true or false.
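The idiom above can be sketched in Scala as follows. This is illustrative only; the names (Diagnostic, CompiledExpression, compile) are assumptions for the sketch, not Daffodil's actual API.

```scala
// A diagnostic knows whether it is severe enough to prevent a usable value.
trait Diagnostic {
  def isError: Boolean
  def getMessage: String
}
case class Warning(getMessage: String) extends Diagnostic { val isError = false }
case class CompileError(getMessage: String) extends Diagnostic { val isError = true }

// The compilation result is always of this type, never Option/Either.
class CompiledExpression(val diagnostics: Seq[Diagnostic]) {
  // Runnable as long as no diagnostic blocks creation of a value;
  // warnings alone do not prevent use.
  def isRunnable: Boolean = !diagnostics.exists(_.isError)
}

// Compilation always returns a CompiledExpression, carrying any diagnostics.
def compile(xpath: String): CompiledExpression =
  if (xpath.trim.isEmpty) new CompiledExpression(Seq(CompileError("empty expression")))
  else new CompiledExpression(Nil)
```

Note that a caller can always inspect diagnostics, whether or not isRunnable is true, which is exactly the "value, diagnostics, or both" behavior Either cannot express.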

Runtime

The parser runtime has a return object that captures success/failure. Whether any given runtime failure is actually a failure, or just part of speculative parsing, is not something we know in advance. This means that even when we are just backtracking as part of parsing, we must construct diagnostic objects in case those errors propagate up to top level.

For example: suppose we have a choice with 3 alternatives. Suppose the data doesn't match any of the 3. The diagnostic we issue doesn't want to say that the 3rd alternative didn't match, it wants to say that no alternative matched the data, and specifically may want to say that the first alternative didn't match because of reason(s) X, the second due to reason(s) Y, and the third due to reason(s) Z.  That is, imagine we are parsing along attempting the first alternative of this choice. We encounter an error. Now this error might be suppressed because one of the subsequent alternatives might succeed, and in that case we can discard the failure reason for the first alternative. Or, this error might still be needed because if all 3 alternatives result in errors, it might be helpful as a diagnostic to see why each alternative failed.
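The accumulate-then-discard behavior described above can be sketched like this (illustrative names, not Daffodil's runtime API): each failing alternative's reason is retained, and the whole collection is either discarded on a later success or reported together when every alternative fails.

```scala
sealed trait PResult
case class PSuccess(value: Any) extends PResult
case class PFailure(reason: String) extends PResult

// Try alternatives in schema order; keep each failure's reason until
// some alternative succeeds, in case all of them fail.
def parseChoice(alternatives: Seq[() => PResult]): Either[Seq[String], Any] = {
  val failures = scala.collection.mutable.ListBuffer[String]()
  var success: Option[Any] = None
  val it = alternatives.iterator
  while (success.isEmpty && it.hasNext) {
    it.next()() match {
      case PSuccess(v)      => success = Some(v)    // discard accumulated reasons
      case PFailure(reason) => failures += reason   // keep for the combined diagnostic
    }
  }
  // On total failure, report "no alternative matched" with per-alternative reasons.
  success.toRight(failures.toList)
}
```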

We'll want to design this so that most of the work associated with diagnostics (constructing message strings, for example, and substituting toString representations of various pieces of errant data into them) happens at the time the diagnostic is actually issued at top level, and not at the time the diagnostic object is created. In other words,

  • we will want to do lazy message construction.
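Lazy message construction falls out naturally from Scala's lazy vals and by-name parameters. A minimal sketch (LazyDiagnostic is a hypothetical name):

```scala
// The args expression is passed by name and evaluated at most once,
// and only if getMessage is actually demanded at top level.
class LazyDiagnostic(msgId: String, args: => Seq[Any]) {
  lazy val getMessage: String =
    msgId + ": " + args.map(_.toString).mkString(", ")
}
```

Diagnostics created and then discarded during backtracking never pay the cost of formatting.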

Conveniently, this goal lines up perfectly with internationalization requirements, where the software should not contain message strings at all, but should just construct objects of the right type; message strings are created at the point where messages are displayed to a user or printed out. While log files may want to contain English, they also want to contain all the components needed to enable internationalized presentation. Hence they should not contain just the formatted English messages, but rather a message identifier, the string representations of the things to be substituted into the message, and possibly the English text (in case someone is trying to make sense of the log outside an environment where the internationalization is available, in which case they must understand English).

In addition, since the JVM will throw some exceptions (like divide by zero), we will surround the runtime with a try-catch, and catch exceptions coming from the JVM to include in our returned list of errors/diagnostics/warnings.
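A sketch of that try-catch wrapper: JVM-thrown exceptions such as ArithmeticException are converted into diagnostics in the returned result instead of escaping to the caller (runParser is a hypothetical name).

```scala
// Run the parse body, converting any JVM exception into a diagnostic
// so the caller always receives the errors/diagnostics list.
def runParser[T](body: => T): Either[Seq[String], T] =
  try Right(body)
  catch {
    case e: Exception => Left(Seq("runtime exception: " + e))
  }
```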

Error Types


The DFDL spec has added, as an erratum to draft v1.0.3, a behavior that is effectively a warning mechanism, called a 'recoverable error'.

We will generally use the term "error" to mean "error, warning, or information item". We have these error/warning types:

  • Schema Definition Error (a.k.a., SDE) - detected at compile time
  • Schema Definition Error - detected at run time
  • Schema Definition Warning - detected at compile time - not called out explicitly in the spec, though there are many places where it says implementations may want to warn.
  • Schema Definition Warning - detected at run time
  • Processing Error (a.k.a., PE)  - always a run-time thing - causes backtracking when the schema expresses alternatives
  • Recoverable Error - always a run-time thing - never causes backtracking; really this is just a run-time "warning". We may want a threshold that determines how many of these are allowed before they escalate to either a Processing Error (which can cause backtracking) or a fatal error.
  • Information - either at compile or run time, we may want to simply inform the user. Probably this is under control of some flags to control whether one wants these or not. Example: an information object might inform the user that the format is not efficiently streamable due to forward/backward reference issues, such as a header record that contains the length of the entire data object. One cannot stream this when unparsing, as one must hold-back the header until the full length is known. In some cases users may want to escalate these information items to warnings or errors, such as if it is their intention to stream the data, then they may want an error from schema compilation for non-streamable formats.
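The taxonomy above could be encoded as a sealed hierarchy, which lets the compiler check that every error kind is handled. This is a hypothetical encoding, not Daffodil's actual class names:

```scala
// Sealed so that match expressions over diagnostic kinds are exhaustive.
sealed trait DaffodilDiagnostic {
  def msg: String
  def isFatal: Boolean // SDEs always stop processing, even at run time
}
case class SchemaDefinitionError(msg: String)   extends DaffodilDiagnostic { val isFatal = true  }
case class SchemaDefinitionWarning(msg: String) extends DaffodilDiagnostic { val isFatal = false }
case class ProcessingError(msg: String)         extends DaffodilDiagnostic { val isFatal = false } // may cause backtracking
case class RecoverableError(msg: String)        extends DaffodilDiagnostic { val isFatal = false } // run-time "warning"
case class Information(msg: String)             extends DaffodilDiagnostic { val isFatal = false }
```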

In all cases we need to capture information about the schema location(s) relevant to understanding the cause of the error, and in the case of errors/warnings at run-time, the data location(s) that are relevant. A schema location is a schema file name, and a schema component within that schema file. Any given issue may require more than one schema location to be provided as part of its diagnostic information.

  • SchemaComponent(s) will contain their schema location.
  • Every sub-structure within a schema component will contain its schema location.
  • At runtime, a data location is either
    • an offset (in bits, bytes, characters, or combination thereof) from the beginning of the data stream
    • a relative offset from another data location (recursively, this bottoms out at the beginning of the data stream)
  • Any data location can always be converted into an absolute location in the data stream, but relative offsets to other locations in the data (e.g., the beginning of the current record) are often more useful for diagnostics.
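The absolute/relative data-location model above can be sketched as a small recursive structure (illustrative names; bits chosen as the unit for the sketch):

```scala
sealed trait DataLocation {
  def absoluteBitPos: Long // any location is convertible to an absolute position
}
// An offset from the beginning of the data stream.
case class AbsoluteLoc(bitPos: Long) extends DataLocation {
  def absoluteBitPos: Long = bitPos
}
// An offset relative to another location (e.g., the start of the current record);
// the recursion bottoms out at an AbsoluteLoc.
case class RelativeLoc(base: DataLocation, bitOffset: Long) extends DataLocation {
  def absoluteBitPos: Long = base.absoluteBitPos + bitOffset
}
```

For diagnostics, the RelativeLoc itself ("12 bits past the start of record 3") is often more useful than the computed absolute position.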

Continuing Execution After Fatal Error

SDEs are always fatal - processing stops, even if they are detected at run-time. This is consistent with the notion that recovering from an error involves backtracking to try another alternative the schema allows. If the schema itself is not meaningful, then this isn't a legitimate thing to do.

However, consider: at run time the data is supposed to contain an indicator byte that provides the byte order. A value of 1 means bigEndian, 0 means littleEndian. Suppose at run time this byte is found to contain 2. Unless the author of the DFDL schema has included default logic in the expression that computes the byte order to handle even an errant value like this, we will not be able to decide what the byte order is. This is a schema definition error, which is fatal.

An application may, however, be parsing buffers of data over and over, and this application loop may simply view the failure as some broken data, and may want to proceed. So in this case, the application will want to behave as if this were any other kind of runtime error that propagated itself all the way up to top level. It may want to save the message buffer along with the diagnostic information about what caused the failure, and then continue execution. This has implications for the API of Daffodil.

Recovery from a runtime error involves a point of uncertainty expressed in the schema. Consider: the root element for the parse could be placed as an element reference inside a choice as the first alternative. The second alternative could be a single element of type hexBinary with lengthKind='endOfParent', meaning, in this case, to the end of the data stream. This second alternative will always parse successfully, so it provides a natural way to move forward if parsing based on the schema and its root element ultimately fails.

However, this is not quite enough, as the user is likely to want to route this hexBinary data blob somewhere, and capture the diagnostic information from the parse failure to keep with it. The normal behavior of a choice where a second alternative succeeds would be to discard error/diagnostic information from any prior failing alternative. Hence, a top-level API is needed which provides access to the failure diagnostic information for the overall parse.
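The shape of such a top-level API might look like the following sketch. ParseResult and its members are hypothetical names illustrating the requirement, not Daffodil's actual API:

```scala
// Top-level parse result: even when the catch-all hexBinary alternative
// succeeded, the diagnostics explaining why the real root failed are retained.
case class ParseResult(result: Option[String], diagnostics: Seq[String]) {
  def isError: Boolean = result.isEmpty
  // Accessible regardless of success/failure, so the application can route
  // the blob together with the reasons the intended parse did not match.
  def getDiagnostics: Seq[String] = diagnostics
}
```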

Gathering Multiple Compile-Time Errors

At compile time, we want to gather a set of SDEs to issue to the user. So compilation wants to continue to process more of the schema where possible, gather together the results, and then return that full set of diagnostics from all the compilation results.

Given a schema with many top-level elements, we can easily just compile each of the top-level elements regardless of any errors on the other top-level elements, and then present the complete set of errors to the user. This is not desirable by itself, however, because a large schema with many top-level element declarations may still have only one that is intended to be the document root.

  • The API for compilation should allow compilation of one or more global element declarations as the potential document root(s).

It is harder to gather a set of diagnostics from the compilation of a single root element rather than stopping on the first issue. The thing to depend on is this:

  • Compilation of each of the children of a sequence or choice can be isolated, and the errors from those compilations concatenated (in schema order) to form the set for the whole compilation.
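A sketch of that principle (illustrative names): each child compiles in isolation, so an error in one child does not prevent gathering diagnostics from its siblings, and the results are concatenated in schema order.

```scala
case class CompileResult(diagnostics: Seq[String])

// Compile each child of a sequence/choice independently and concatenate
// the diagnostics in schema order to form the set for the whole compilation.
def compileChildren(children: Seq[() => CompileResult]): CompileResult = {
  val results = children.map { compileChild =>
    try compileChild()
    catch { case e: Exception => CompileResult(Seq("internal error: " + e)) }
  }
  CompileResult(results.flatMap(_.diagnostics))
}
```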

Tracing/Logging

Applications that embed Daffodil are very likely to be servers, so a target logger to which Daffodil writes out logging/tracing needs to be something that the application can provide to Daffodil via an API.

In the case of an interactive DFDL Schema authoring environment, trace information would normally be displayed to the user/author. A runtime server that embeds Daffodil would more likely want to log to a file-system-based logger, and possibly trigger alerts flowing to a monitoring system of some kind.

Tracing and logging overlap, in the sense that tracing may need to be activated on a pure-runtime embedded system for diagnostic purposes, in which case trace output becomes just a specialized kind of log output. An example of this would be when a DFDL schema author believes the schema is correct, but when deployed at runtime inside some server, data arrives that contains things unanticipated by the schema author. The resulting failure to parse may result in wanting to turn on debug/trace features within the server's Daffodil runtime.

Purposes of tracing include:

  1. helping Daffodil developers find and isolate bugs in the Daffodil code base.
  2. helping DFDL schema authors write correct schemas by tracing/logging compiler behavior. These traces/logs can be about identifying problems, or simply building confidence that a schema is correct. In the latter case, the trace/log is effectively useful redundant information.
  3. helping DFDL schema authors write correct schemas by tracing/logging runtime behavior.
  4. helping DFDL processor users (who are running applications that embed Daffodil) identify problems in either the data or schemas that purport to describe that data.
  5. helping Daffodil developers find and isolate performance problems in the Daffodil code base.
  6. helping DFDL schema authors understand the performance of Daffodil when processing data for their DFDL Schema. (When there is more than one way to model data in DFDL, sometimes DFDL Schemas can be tuned to improve performance by choosing alternative modeling techniques.)

Purposes of logging include all of the above, but also include:

  1. monitoring (over extended time periods) performance of compilation
  2. monitoring (over extended time periods) performance of runtime behavior
  3. generating alerts that flow to an overarching systems monitoring environment

Purpose (1) here is much like assertion checking. It wants to be something that has low/zero overhead when turned off, but can be turned on without recompiling the application, and preferably without even restarting it. Keep in mind, however, that traces should not be used as a substitute for breakpoint debugging.

Purpose (2) is an end-user feature likely to be turned on/off as part of some tooling/environment used by a DFDL schema author. The Daffodil processor must provide APIs for controlling this from the tooling. In a functional program like Daffodil, these kinds of traces/logs are not really things that print to an error stream so much as they are additional attributes (lazy vals) computed for purposes of illustrating the decisions made by the compiler.

Purpose (3) is much like (4), except that we can perhaps assume something about the development environment, i.e., that the trace/logging information will be displayed to a schema author for purposes of informing them.

Purpose (4) cannot depend on any specific tooling. Daffodil is designed to be embedded, and purpose (4) is one where the context is that of an application which just uses Daffodil code inside it somewhere. The application must be able to

  • turn on/off this tracing, without having to restart, and similarly control the verbosity of detail in the traces/log, and control any selectivity features of the tracing.
  • supply the streams to which the trace/logs are written. These may or may not be streams leading to a file system.
  • avoid full-disk situations by being notified about the volume of data written to the streams and being able to change the streams without loss of any traces/log records.
  • run forever with tracing/logging turned on, albeit with some performance degradation proportional to the amount of trace/log information being generated.
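Requirements like these suggest an application-supplied logger hook along the following lines. All names here (LogWriter, setLogWriter, setLoggingLevel) are assumptions for the sketch:

```scala
object LogLevel extends Enumeration { val Error, Warning, Info, Debug = Value }

// Supplied by the embedding application; the streams behind it may or may
// not lead to a file system, and can be swapped at any time.
trait LogWriter {
  def write(level: LogLevel.Value, msgId: String, args: => Seq[Any]): Unit
}

object Logging {
  @volatile private var writer: Option[LogWriter] = None
  @volatile private var level: LogLevel.Value = LogLevel.Error

  def setLogWriter(w: LogWriter): Unit = writer = Some(w)   // no restart needed
  def setLoggingLevel(l: LogLevel.Value): Unit = level = l  // verbosity control

  // args is by-name, so suppressed messages cost almost nothing to "log".
  def log(l: LogLevel.Value, msgId: String, args: => Seq[Any]): Unit =
    if (l <= level) writer.foreach(_.write(l, msgId, args))
}
```

Passing a message identifier plus raw arguments, rather than a formatted string, is what keeps this consistent with the lazy-message and internationalization goals above.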

Coding Style Requirements

In order to encourage testing for error situations, good diagnostics, and clear code with good "algorithmic density" (by this I mean that the code is not so spread out you can't follow what's going on in a screenful), it is important that issuing a diagnostic/error message take exactly one line of code 80% of the time, and not more than a couple of lines the rest of the time. Complex decision making about how the error message should best be phrased should be deferrable. Navigating to other files to create message identifiers should be deferrable, or optional.
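A mixin is one way to make issuing a diagnostic a one-liner at the point of detection while deferring the formatting work. This sketch uses hypothetical names (DiagnosticsMixin, SDE):

```scala
trait DiagnosticsMixin {
  // Stored as thunks: formatting is deferred until diagnostics are demanded.
  private val diags = scala.collection.mutable.ListBuffer[() => String]()

  // Exactly one line at the call site, e.g.:
  //   SDE("Byte order indicator was %s, expected 0 or 1.", value)
  def SDE(msgFmt: String, args: Any*): Unit =
    diags += (() => msgFmt.format(args: _*))

  def getDiagnostics: Seq[String] = diags.map(f => f()).toList
}
```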
