You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 2 Next »

 

Compilers are among the more complex kinds of code we write. They translate something with rich complex structure, into something else very different, but also with its own rich and complex structure.

Compilers need to be structured so that the inherent complexity of the task can be controlled, decomposed, and solved in pieces, and this requires a number of patterns that are not used often in other kinds of software.

DFDL/Daffodil is a compiler that constructs a parser or unparser from a DFDL schema, which is the source language being compiled.

AST - Abstract Syntax Tree

All compilers start with a front end, which performs the parsing of the source language. In our case this is the parsing of the DFDL schema into the nodes of an AST.

An AST is generally doubly-linked. That is, children nodes have back-pointers up to their parents, and so children are not shared by multiple parents. This allows calculations centered on child nodes to reference their contextual setting (from their parents).

Our AST is a tree of DFDL Schema Components in the dsom (DFDL Schema Object Model or DSOM) package. The base class SchemaComponent is the common root.

Along with this is the core class DFDLAnnotation, which is the base for all DFDL annotations.

The DFDL DSOM model is not quite the same as the XML Schema Object Model. One important difference is that we require a component that corresponds to the schema document (i.e., the file) itself.

In the code, because DSOM is an AST, the front end proceeds by parsing the syntax of the DFDL Schema, and a parent node is responsible to construct its child nodes. Common to this pattern is that 'this', i.e., the parent, is always passed down to the constructor of the child node in order to give it the context in which it resides. That is, the back-pointer to the parent. So code like:

class ParentNode(context : ASTNode,...) {

...

lazy val c1 = new ChildNode(this, ...other args)

...

}

This is the style one would commonly see as the pattern for the way constructors are called.

Attribute Grammars

An old compiler technique is something called attribute grammars. Attribute here is used in a general sense. Nothing to do with XML attributes.

The idea is the AST is decorated with a number of 'attributes'. These represent computed characteristics of that AST node. These can be as simple as 'elementName' or as complex as 'isFollowedByNonOptionalElement' or more.

The attribute grammar literature (I'd really have to dig to find citations - I'll let you web search it) calls attributes 'inherited' if they are computed based on context (looking at parent/ancestor nodes of the AST), and 'synthesized' if they are more bottom-up based on the attributes of children of an AST node. Note that this use of 'inherited' has nothing to do with object-oriented programming use of the word inherited.

We don't really need to care much about this inherited versus synthesized characterization. It's just worth mentioning that the power of the technique comes from the fact that there are these top-down/ancestor as well as bottom-up attribute calculations.

Lazy Evaluation versus Passes

Attribute calculations can be complicated. Knowing exactly when all the prerequisite calculations have been completed, such that one knows when one can compute any give attribute is hard. Attribute grammars carefully divide the world into top-down and bottom-up calculations to facilitate organizing the calculations into passes. Compilers generally have many passes, each of which computes a few more attributes, or constructs a new representation of the data (which is typically, another attributed-tree of slightly different shape/design).

In a (mostly) functional programming language like Scala, if we adopt a functional style, we can take advantage of lazy evaluation to avoid the complexity of passes altogether. Each attribute calculation is computed on demand when it is first needed. You get the ball rolling by asking for the ultimate answer. In our case, you ask for the ultimate parser/unparser to be produced. Everything else is an attribute, which is evaluated on demand. This makes the entire calculation self-organizing. No pass-structure is needed.

That's the theory anyway. There are a few caveats having to do with errors generally.

Lazy evaluation requires that one stick to the functional programming discipline. Ordinary imperative programming styles are crucially dependent on knowing when any given computation is going to take place. With lazy evaluation, you give all that up. That is, the calculations are pure-functions of other attributes of the AST nodes. This leads to a very declarative style of coding, which is good for helping control the complexity of the whole endeavor.

OOLAG - putting the techniques together

If the AST nodes are organized into a class & mixin taxonomy, and use lazy evaluation to compute their attribute-grammar attributes, you get OOLAG, which is an effective discipline for writing compilers.

There are a few more details required to deal with errors though. There's a separate wiki page on Error, Diagnostics, Tracing, Logging, which discusses some of this also.

Above we pointed out one must stick with a pure functional programming style to make this all hang together.  This means we need a technique for dealing with the fact that things can go wrong in a compiler. The input source may have errors in it, and good diagnostic messages must be presented back to the author of that DFDL schema.

To manage this, we generalize the notion of what the 'answer' can be to any attribute calculation.

Some attributes are simple. They compute an ordinary Scala built-in type like a boolean, and nothing is expected to go wrong when computing them. Those are ordinary Scala code.

But, what if in the course of computing an attribute one might encounter errors in the DFDL schema, or one might want to issue warnings about suspect constructs in the DFDL schema that might be errors?  The pure functional style we use requires that these things are returned as part of the value.

OOLAG Values

So we generalize the notion of an attribute value. A class OOLAGValue is a base used to implement attribute calculations where not only is an answer being computed, but one might get an answer, a set of error/diagnostic/warnings (which are instances of trait Diagnostic), or both.

At the API, an object that can have diagnostics like this exhibits the Daffodil API trait WithDiagnostics, which describes an API that includes two important attributes: 'isError' and 'getDiagnostics'. (These OOLAG-specific implementation classes aren't visible at the level of the API to Daffodil. Only Diagnostic, and WithDiagnostic are visible at the Daffodil API for users of Daffodil.)

The key functionality of OOLAGValue is to wrap all its calculations in a try/catch, so that if any part of the calculation throws an exception, that is captured, turned into a diagnostic object, and becomes part of the "answer".

An OOLAGValue has a critical attribute 'isError', which is true if the object is in error state, and so has no ordinary value, but has a value containing diagnostic object(s) about the error.

You can ask an OOLAGValue for its 'value', to get the ordinary value, but this will throw if there is no such value. Often one wants to call 'isError' first to decide if you want the value, but sometimes the throw is fine. E.g., if computing another attribute in terms of an attribute that has no value due to an error, the right behavior may be for this new attribute to also be in error state. If so, then no test for isError is required. The throw will cause the new attribute to also be in error. The OOLAG framework makes sure that there is no additional error/diagnostic recorded in this case.

OOLAG Hosts and Multiple Diagnostic Messages - Diagnostic Children

When compiling, we very often want to get more than one error message out. We don't want our compiler to fail on the very first problem as this leads to long debug cycles, and sometimes one can only figure out what is really wrong by looking at several errors at once.

The OOLAG framework/pattern deals with this by a notion of OOLAG Hosts. An OOLAG Host (base class OOLAGHost), is an AST node which can have multiple children, where it is feasible to compile the children separately and combine their errors/diagnostics together to form the final set. Good examples: A SchemaSet is an OOLAGHost, as each Schema can be compiled for diagnostics. A SchemaDocument is an OOLAGHost as each top level element declaration can be compiled separately, as well as all top-level DFDL annotations can be scrutinized for correctness.

The framework tolerates that sometimes the children are internally coupled, so an error in the first can, for example, cause all the subsequent children to also error. However, sometimes there is independence, and we can benefit from accumulating errors across these children.

The way this is done is that an OOLAGHost object is required to create a very critical member named diagnosticChildren. An OOLAGHost object answers isError with true if any of its diagnosticChildren answer isError with true. The calculation is structured so that it does not stop on the first one that answers isError with true, but evaluates isError for all of them. The diagnostic objects for an OOLAGHost are the union of all the diagnostic objects for all its diagnosticChildren.

One further detail. An OOLAGHost can also have local calculations. These also must be without error if isError is to return true, and any diagnostic objects created must be captured. The idiom here is that this is done using OOLAGValue calculations to compute all local attributes, and adding these OOLAGValue objects to the list of diagnosticChildren.

Example:

abstract class ElementBase(...) extends OOLAGHost { 

lazy val typeDef = typeDef_.value
private lazy val typeDef_ = LV { // object LV constructs an OOLAGValue
.... compute the type definition ....
}
  lazy val diagnosticChildren = List(child1, child2, myAttribute1_ )

}

In the above, you see the two lines associated with the typeDef attribute of ElementBase. The first obtains the value, if possible from the LV (lazy value) created by the associated private typeDef_. This is the common idiom, to have two lines together like this.

Scala Note: I have no idea how one can do this with less syntax baggage in Scala. In Lisp this would be (defAttribute typeDef ... computation...), one line, one statement. This is my primary argument for why Scala needs Lisp-like macros, as defining forms like this cannot be created with Scala's supposedly powerful Domain-Specific-Language (DSL) capabilities.

I'm open to suggestions though. If some very very clever Scala technique can be used to improve this coding idiom, I'm all for it.


In addition, you see that the private typeDef_ is the last child in the diagnosticChildren member. An object of type LV can answer the question 'isError', which would tell us if an error occurred during the calculation of the typeDef's value. This is reasonable because every Element must have a type, which must have a definition of some sort. If something goes wrong while computing the typeDef for the element, then the element 'isError' is true, and any diagnostics captured during the attempt to compute the typeDef become diagnostics of the element itself. Recursively, the child1, child2 are either OOLAGValue objects, or OOLAGHost objects. They are examined separately for whether they are 'isError', and their diagnostics are contributed to those of the ElementBase object.

A Few Details

About warnings. The test 'isError' can be false, and an ordinary value be produced, but warnings may still have been accumulated. Hence, one can have diagnostic objects which represent warnings even when there is no error that stops processing.

Summary

  • OOLAG is a compiler-writers pattern
  • It requires functional programming to avoid the need for organizing code into passes
  • It requires use of special value objects that allow an ordinary value, a set of error/diagnostic objects, or both to be the "answer" of a calculation.
    • Every such value object can answer the 'isError' test.
  • Code is structured into OOLAGValue calculations (using the LV idiom) and OOLAGHost objects

  • The list diagnosticChildren is central to the idiom

     

 

 

 

 

 

 

  • No labels