Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

 

Compilers are among the more complex kinds of code we write. They translate something with rich complex structure, into something else very different, but also with its own rich and complex structure.

Compilers need to be designed so that the inherent complexity of the task can be controlled, decomposed, and solved in pieces, and this requires a number of patterns that are not used often in other kinds of software.

DFDL/Daffodil is a compiler that constructs a parser or unparser from a DFDL schema, which is the source language being compiled.

AST - Abstract Syntax Tree

All compilers start with a front end, which performs the parsing of the source language. In our case this is the parsing of the DFDL schema into the nodes of an AST.

An AST is generally doubly-linked. That is, children nodes have back-pointers up to their parents, and so children are not shared by multiple parents. This allows calculations centered on child nodes to reference their contextual setting (from their parents).

Our AST is a tree of DFDL Schema Components in the dsom (DFDL Schema Object Model or DSOM) package. The base class SchemaComponent is the common root.

Along with this is the core class DFDLAnnotation, which is the base for all DFDL annotations.

The DFDL DSOM model is not quite the same as the XML Schema Object Model. One important difference is that we require a component that corresponds to the schema document (i.e., the file) itself.

In the code, because DSOM is an AST, the front end proceeds by parsing the syntax of the DFDL Schema, and a parent node is responsible to construct its child nodes. Common to this pattern is that 'this', i.e., the parent, is always passed down to the constructor of the child node in order to give it the context in which it resides. That is, the back-pointer to the parent. So code like:

class ParentNode(context : ASTNode,...) {
...
lazy val c1 = new ChildNode(this, ...other args)
...
}

This is the style one would commonly see as the pattern for the way constructors are called.

Attribute Grammars

An old compiler technique is something called attribute grammars. Attribute here is used in a general sense. Nothing to do with XML attributes.

The idea is the AST is decorated with a number of 'attributes'. These represent computed characteristics of that AST node. These can be as simple as 'elementName' or as complex as 'isFollowedByNonOptionalElement' or more.

The attribute grammar literature (I'd really have to dig to find citations - I'll let you web search it) calls attributes 'inherited' if they are computed based on context (looking at parent/ancestor nodes of the AST), and 'synthesized' if they are more bottom-up based on the attributes of children of an AST node. Note that this use of 'inherited' has nothing to do with object-oriented programming use of the word inherited.

...

this

...

Lazy Evaluation versus Passes

Attribute calculations can be complicated. Knowing exactly when all the prerequisite calculations have been completed, such that one knows when one can compute any give attribute is hard. Attribute grammars carefully divide the world into top-down and bottom-up calculations to facilitate organizing the calculations into passes. Compilers generally have many passes, each of which computes a few more attributes, or constructs a new representation of the data (which is typically, another attributed-tree of slightly different shape/design).

In a (mostly) functional programming language like Scala, if we adopt a functional style, we can take advantage of lazy evaluation to avoid the complexity of passes altogether. Each attribute calculation is computed on demand when it is first needed. You get the ball rolling by asking for the ultimate answer. In our case, you ask for the ultimate parser/unparser to be produced. Everything else is an attribute, which is evaluated on demand. This makes the entire calculation self-organizing. No pass-structure is needed.

That's the theory anyway. There are a few caveats having to do with errors generally.

Lazy evaluation requires that one stick to the functional programming discipline. Ordinary imperative programming styles are crucially dependent on knowing when any given computation is going to take place. With lazy evaluation, you give all that up. That is, the calculations are pure-functions of other attributes of the AST nodes. This leads to a very declarative style of coding, which is good for helping control the complexity of the whole endeavor.

OOLAG - putting the techniques together

If the AST nodes are organized into a class & mixin taxonomy, and use lazy evaluation to compute their attribute-grammar attributes, you get OOLAG, which is an effective discipline for writing compilers.

...

page

...

Above we pointed out one must stick with a pure functional programming style to make this all hang together.  This means we need a technique for dealing with the fact that things can go wrong in a compiler. The input source may have errors in it, and good diagnostic messages must be presented back to the author of that DFDL schema.

To manage this, we generalize the notion of what the 'answer' can be to any attribute calculation.

Some attributes are simple. They compute an ordinary Scala built-in type like a boolean, and nothing is expected to go wrong when computing them. Those are ordinary Scala code.

But, what if in the course of computing an attribute one might encounter errors in the DFDL schema, or one might want to issue warnings about suspect constructs in the DFDL schema that might be errors?  The pure functional style we use requires that these things are returned as part of the value.

OOLAG Values

So we generalize the notion of an attribute value. A class OOLAGValue is a base used to implement attribute calculations where not only is an answer being computed, but one might get an answer, a set of error/diagnostic/warnings (which are instances of trait Diagnostic), or both.

At the API, an object that can have diagnostics like this exhibits the Daffodil API trait WithDiagnostics, which describes an API that includes two important attributes: 'isError' and 'getDiagnostics'. (These OOLAG-specific implementation classes aren't visible at the level of the API to Daffodil. Only Diagnostic, and WithDiagnostic are visible at the Daffodil API for users of Daffodil. The implementation trait for implementations of WithDiagnostics is called DiagnosticsProviding.)

The key functionality of OOLAGValue is to wrap all its calculations in a try/catch, so that if any part of the calculation throws an exception, that is captured, turned into a diagnostic object, and becomes part of the "answer".

An OOLAGValue has a critical attribute 'isError', which is true if the object is in error state, and so has no ordinary value, but has a value containing diagnostic object(s) about the error.

You can ask an OOLAGValue for its 'value', to get the ordinary value, but this will throw if there is no such value. Often one wants to call 'isError' first to decide if you want the value, but sometimes the throw is fine. E.g., if computing another attribute in terms of an attribute that has no value due to an error, the right behavior may be for this new attribute to also be in error state. If so, then no test for isError is required. The throw will cause the new attribute to also be in error. The OOLAG framework makes sure that there is no additional error/diagnostic recorded in this case.

OOLAG Hosts and Multiple Diagnostic Messages - RequiredEvaluations

When compiling, we very often want to get more than one error message out. We don't want our compiler to fail on the very first problem as this leads to long debug cycles, and sometimes one can only figure out what is really wrong by looking at several errors at once.

The OOLAG framework/pattern deals with this by a notion of OOLAG Hosts. An OOLAG Host (base class OOLAGHost), is an AST node which can have multiple children, where it is feasible to compile the children separately and combine their errors/diagnostics together to form the final set. Good examples: A SchemaSet is an OOLAGHost, as each Schema can be compiled for diagnostics. A SchemaDocument is an OOLAGHost as each top level element declaration can be compiled separately, as well as all top-level DFDL annotations can be scrutinized for correctness.

The framework tolerates that sometimes the children are internally coupled, so an error in the first can, for example, cause all the subsequent children to also error. However, sometimes there is independence, and we can benefit from accumulating errors across these children.

The way this is done is that an OOLAGHost object carries required evaluations. Each one of these method calls (which are lazy - the args aren't evaluated right away) indicates that for this OOLAGHost object to be considered to be without error, all these evaluations must take place, and not cause errors. The calculation is structured so that it does not stop on the first one that throws or causes an error, but evaluates all of them.

Example:

...

}

In the above you see that def of 'typeDef' which defines it using this LV function. The LV function takes a symbol, and an expression, which is not evaluated. The computation is done exactly once, and cached internally to the OOLAG system. Hence, calling typeDef repeatedly just accesses the cache.

Use of 'def' above saves a per-object slot to hold a lazy val. However, for debug reasons, you may want the value to be apparent in the object. In that case the right idiom is:

  lazy val typeDef = LV('typeDef){ .... }.value

This will allow more convenient inspection of the typeDef value when debugging code.

Scala Note: I have no idea how one can do this with less syntax baggage in Scala. In Lisp this would be (defAttribute typeDef ... computation...), one line, one statement. This is my primary argument for why Scala needs Lisp-like macros, as defining forms like this cannot be created with Scala's supposedly powerful Domain-Specific-Language (DSL) capabilities.

I'm open to suggestions though. If some very very clever Scala technique can be used to improve this coding idiom, I'm all for it.

A Few Details

About warnings. The test 'isError' can be false, and an ordinary value be produced, but warnings may still have been accumulated. Hence, one can have diagnostic objects which represent warnings even when there is no error that stops processing.

Summary

  • OOLAG is a compiler-writers pattern
  • It requires functional programming to avoid the need for organizing code into passes
  • It requires use of special value objects that allow an ordinary value, a set of error/diagnostic objects, or both to be the "answer" of a calculation.
    • Every such value object can answer the 'isError' test.
  • Code is structured into OOLAGValue calculations (using the LV idiom) and OOLAGHost objects

  • OOLAGHost objects carry required evaluations that must be evaluated to insure they are 'done'

     

 

 

 

 

...

has moved to https://cwiki.apache.org/confluence/display/DAFFODIL/OOLAG+-+Object-Oriented+Lazy+Attribute+Grammars