You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 4 Next »

One goal for Daffodil is efficiency. This goal is second to correctness, because performance can also be obtained by just scaling up hardware. However, a major reason people use non-XML data formats is for data density in high-performance or low-power (aka mobile) applications. So good performance is quite essential to success.

This page is for discussion of various ideas associated with achieving good runtime performance.

Discussion of parallelism/concurrency of processing is also appropriate.

Performance of the daffodil compiler itself is a separate topic.

Minimizing Allocation

A major overhead in any JVM-based system is minimizing unnecessary allocation.

This is in tension with parallelism. Often, to operate in parallel requires more storage, not a central reusable data structure.

Abstractly, the runtime is a bunch of functions from PState to PState, and this means every parse operation involves allocating a new PState object. Effectively, the parser behaves like an inner-loop calling the allocator for PState objects. This has the advantage that all code is inherently thread-safe.

Note: except the implementation currently uses JDOM trees as the Infoset representation, and these are mutated. These would have to be replaced by an inverted child-points-to-parent-only pure-functional Infoset representation.

However, this also limits performance to that of the garbage collector, and parallel performance to the ability of the allocator and garbage collector to run in parallel.

The alternative to this is to use a mutable state, and a parser mutates that state. Parallelism then requires the ability to create or obtain exclusive access to a separate state for use by the concurrent parse activity.

The runtime generally uses lots of Lists, mostly as stacks of variable bindings, backtrack locations, etc. Instead of these, one can use mutable arrays with current indexes in the state to indicate what is in use. For parallelism, allocating a new state then involves also allocating these mutable arrays.

These state objects can be kept in a pool and reused. The pool will need to be protected for thread safety.

Alternation and Variables

All variables are declared in DFDL, so each gets an array within the execution state. The DFDL newVariableInstance statement just increments a currentLocation index for its variable.

Input Stream Position

Each backtrack point saves the pos (stack again). Successful advancing pops the stack, but then reassigns top to hold the new position value.

Infoset

A JDOM tree is inefficient as a representation no matter what we do. It holds even bit data as strings.

Introduce a variable corresponding to each xpath expression. Use DFDL setVariable at the place in the schema that the path references. The point where the xpath expression was used is replaced by a variable reference.

This eliminates any use of a xpath processor except to evaluate flat expressions like ($x + $y). This eliminates the need for JDOM, which is required due to the xpath implementation that Daffodil tries to reuse.

That enables the infoset to be built up in the inverted direction (that is, children point to parents, not the other way round, or doubly-linked as in DOM/JDOM-style trees.)

Or... children of the current infoset node can be added to an array. If the node is successfully parsed, then the array is handed off for inclusion in the infoset. Otherwise we recursively free the array and its contained children back to a infoset node array pool, which enables reuse of these node arrays.

Diagnostics

A great deal of runtime overhead is required if Daffodil is to provide good diagnostic messages when things go wrong.

The runtime could have a "no diagnostics" speed-oriented mode. In this mode, no Diagnostic objects would be allocated.

This suggests that such objects must be created by an inlined call-by-name factory, such that the factory can be turned off, and the arguments passed to it will not be evaluated.

  • No labels