This page discusses coding style guidelines for the Daffodil code base.

Much of the code does not follow these guidelines. As the code base evolves, the goal is to make new code follow these guidelines, and to move existing code toward them.

64-bit vs. 32-bit

Our goal is all-64-bit capabilities. Unfortunately, many Java and Scala libraries do not allow offsets or positions larger than a signed 32-bit Int can hold.

Someday, maybe those libraries will be updated so that, for example, a byte array can hold an entire video, which is bigger than 2GBytes. For now we're stuck.

Our code should be written to use type Long for all offsets and positions, and should cast to Int only when we must deal with an underlying library that has only an Int-based API. The cast should be guarded by an explicit check, because x.toInt silently truncates instead of raising an overflow error. E.g., Long.MaxValue.toInt produces -1. No error is thrown.

So, when you must have an Int to cope with a library, code should do this:

def myFunc(param: Long) = {
   Assert.usage(param <= Int.MaxValue, "Maximum is 32-bit limited currently. %s".format(param))
   val intParam = param.toInt
   ...
   callLameAPI(intParam)
   ...
}
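
The guarded narrowing above can be packaged as a small helper. This is a minimal sketch, not Daffodil code: the names CheckedNarrowing and toIntChecked are hypothetical, and Scala's built-in require stands in for Assert.usage.

```scala
object CheckedNarrowing {
  // Hypothetical helper: `require` stands in for Daffodil's Assert.usage.
  def toIntChecked(value: Long): Int = {
    require(
      value >= Int.MinValue && value <= Int.MaxValue,
      "Value is outside the 32-bit Int range: %s".format(value))
    value.toInt
  }

  // An unchecked cast silently keeps only the low 32 bits.
  val truncated: Int = Long.MaxValue.toInt // -1, no error is thrown

  // The checked version passes in-range values through unchanged.
  val ok: Int = toIntChecked(1024L)
}
```

Calling CheckedNarrowing.toIntChecked(Int.MaxValue + 1L) throws an IllegalArgumentException rather than returning a wrapped value. On Java 8 and later, java.lang.Math.toIntExact offers a similar check and throws ArithmeticException on overflow.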

Scala, not Java

We are committed to using Scala for Daffodil long term. Do not add Java code to this code base except in a few special circumstances.

We use many Java-based libraries, of course.
Code snippets found online that are being used largely unmodified can be pasted wholesale into Java files.

If you find online examples of how to use an API from Java, then generally these should be rewritten into Scala. Often there are nicer Scala idioms. Be sure to Web-search for the same API with the keyword "Scala" added to your search. Often you will find idiomatic Scala to accomplish the same thing.

Use Scala's built-in XML capabilities to reduce the quoting hell that otherwise results when you try to type XML as string content.

We are committed to tracking Scala as it evolves. It is too early to try to freeze the Scala language. There are improvements, particularly in the XML support, which are needed, and which we will want to take advantage of. So expect some disruption when major releases of Scala emerge.

Similarly, we expect to track new versions of the libraries we depend on. Please use a robust naming discipline for libraries so that their versions are clear.

* Except perhaps Saxon which is still the no-longer-progressing Saxon-B, which is fine for now. (Note: No longer using Saxon as of late 2014.)

Use Smaller Files

Scala's compiler is quite slow, and this must be taken into account to ensure a reasonable edit-compile-debug cycle for developers. A compilation unit is an entire file. Incremental compilation is more efficient when the files are smaller. So avoid huge files that blend multiple concepts together. Do not, however, go so far as to break apart things that really are best understood if kept in the same file.

Test-Driven Development & Design-for-Test (DFT)

Our code is organized under src/main and src/test directories, with test-only source code going in the latter directory. The package structure under these is identical; the separation is just so that we can package distributions of Daffodil that do not contain test code, should we so desire.

Unit Testing

Everything should have unit tests, though there is always debate about what a "unit" really means. For our purposes, unit tests are tests that the developer can easily run, inside or outside the IDE, that very quickly tell you the status of the code (what's still working, what is broken), and that are intended to help isolate a problem to smaller units of code.

Unit tests must run quickly, i.e., in just a second or so, though the whole suite of them, if run en masse, can take 15 to 30 seconds to run.

Larger test suites can also be written using JUnit, so not everything using unit testing tools is strictly speaking a "unit" test.

A couple of specifics:

JUnit predicates, not Scalatest - That is, use assertEquals(expected, actual), not "actual should be equal to expected" (from scalatest's ShouldMatchers classes) because the IDE supports JUnit well, and doesn't support scalatest.
- We do use Scalatest, but mostly for the bridge to JUnit, and the convenient intercept construct for catching expected exceptions.
- Someone needs to make an argument in favor of Scalatest's ShouldMatchers stuff because it seems its biggest attraction is nice English-language readable sentences of test output, and this is not very compelling as an advantage.
JUnit4, because that is what TypeSafe (a Scala company) seems to be supporting.

Test Suites and TDML (Test Definition Markup Language)

DFDL is a large specification. There's no way to be successful implementing it without a very extensive emphasis on test.

IBM has contributed a set of tests they use for their commercial DFDL implementation, which are expressed in a Test-Definition-Markup-Language (TDML).

We have adopted TDML as our standard for expressing tests as well.

TDML enables creation and interchange of very self-contained tests.

IDE vs. Command-line and REPL

Many Scala fans really like the Read-Eval-Print Loop (REPL) paradigm. Many languages, starting with LISP, have had a REPL as a core development tool. However, the REPL style can be a big disadvantage. It encourages ad-hoc testing where the tests are run once by the developer in the REPL, and are not captured for repeated use as unit tests. It discourages giving real thought to design-for-test and regression testing. The REPL is great for learning how to call something, reminding yourself how a function works, etc., i.e., for trying things out. It is *not* a good way to do testing of your own code.

An IDE with explicit support for building up a library of unit tests beside the code is really greatly superior.

An important theme is converting the code base so that it is easy to work on and can get the benefits of an IDE.

Coding Style

Unless specified below, all code should follow the Scala Style Guide.

Bits, Bytes, 1-based, and zero-based indexing

DFDL and XML use 1-based indexing. Java, Scala, and all their libraries (except XML libraries?) are zero-based.

Also, DFDL allows expression of lengths and alignments in bits, bytes, or characters.

Some naming conventions make code maintenance much easier even if they make identifiers bigger.

Using these identifiers can eliminate the need for many code comments to clarify this stuff.

The convention is to begin the identifier with bit, byte, or char (or another unit, such as 'child'), and to suffix it with 1b or 0b to indicate one-based or zero-based.

Examples:

bitPosition0b : ULong - means position, measured in bits, first bit is at position 0, type unsigned long.
mCharWidthInBits: MaybeInt - measured in bits, but note that sizes, lengths, and widths carry no 0b or 1b suffix, because they are not positions. Note also use of MaybeInt type.
childIndex1b - child index, first child is at index 1.

Exercise for Reader!

Create a Scala ZeroBased and OneBased AnyVal wrapper type with explicit (or some implicit) conversions.

The point is to let the Scala compiler give you an error when you mix zero-based and one-based things, or pass a zero-based thing to an argument that wants a one-based thing.

So the type of bitPosition0b (which is ULong currently) would be

var bitPosition0b: ZeroBased[ULong]

var bitPosInByte1b: OneBased[UInt]

For examples on how to do number types along these lines, look at UInt which is an AnyVal type.
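
One possible starting point for this exercise is sketched below. It is an illustration only, not existing Daffodil code: the names IndexBases and describeChild are hypothetical, the wrappers are not generic, and plain Long is used in place of ULong so the sketch stays self-contained.

```scala
object IndexBases {
  // Hypothetical wrappers. Long stands in for Daffodil's ULong.
  final case class ZeroBased(value: Long) extends AnyVal {
    // Explicit conversion: position 0 (zero-based) is position 1 (one-based).
    def toOneBased: OneBased = OneBased(value + 1)
  }

  final case class OneBased(value: Long) extends AnyVal {
    def toZeroBased: ZeroBased = ZeroBased(value - 1)
  }

  // Accepts only a one-based index.
  def describeChild(childIndex1b: OneBased): String =
    "child " + childIndex1b.value

  val bitPosition0b: ZeroBased = ZeroBased(0L)
  val firstChild: String = describeChild(OneBased(1L))

  // describeChild(bitPosition0b) // does not compile: type mismatch
}
```

Because the two wrappers are distinct types, mixing them up is reported by the compiler, and the only way across is an explicit toOneBased or toZeroBased call.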

Identifier Naming Conventions

Choose identifiers for positions, lengths, and limits wisely. Here are some conventions to follow:

position/pos = A one- or zero-based position from the start. Java Buffer uses position in this sense (Java positions are always 0-based).

limit = limit position = the position one past the last valid position. Java Buffer uses limit in this sense.

offset = a relative position, commonly within a byte or word. E.g., val bitOffset0b = bitPos0b % 8.

length/len = limit - position

width = a length that is usually smaller. We commonly use width for characters, e.g., 7-bit width, 8-bit width, 16-bit width, etc.

length limit = a length that bounds the maximum length

size = same as length.
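
These terms can be seen together on a java.nio.ByteBuffer. The sketch below is illustrative only: the object name BufferTerms and the specific numbers are made up for the example.

```scala
import java.nio.ByteBuffer

object BufferTerms {
  val buffer: ByteBuffer = ByteBuffer.allocate(16)
  buffer.position(4) // zero-based position from the start
  buffer.limit(12)   // one past the last valid position

  val position0b: Long = buffer.position().toLong
  val limit0b: Long = buffer.limit().toLong

  // length = limit - position, which is what buffer.remaining() reports.
  val length: Long = limit0b - position0b

  // An offset is a relative position, here a bit offset within a byte.
  val bitPos0b: Long = 19L
  val bytePos0b: Long = bitPos0b / 8
  val bitOffset0b: Long = bitPos0b % 8
}
```

Here length is 8, and bit position 19 falls in byte 2 at bit offset 3.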

Line Endings

All files should use Unix line endings (i.e. \n).

Avoid Functionals - Do Not Over-Use the apply() method.

FP advocates like to make objects which take action when applied to another object. Sometimes this is a useful style, but more often when an object is going to take some action, the method should be named using the verb.

The apply() idiom breaks down when there are more than a couple of arguments, as the code gets pretty hard to read.

In addition, the IDE provides much better support for a named method with named arguments.

So, eliminate/avoid uses of class derivations from FunctionN (e.g., Function6, Function5, Function4, which have generic 6, 5, and 4-argument apply function signatures) because they have generic argument names. Instead these classes should either

have their own explicit apply functions which have descriptive argument names. These argument types and names are then visible to the IDE.
have verb-named methods
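
The contrast can be sketched as follows. The classes PadFunction and Padder are hypothetical and exist only to show the two styles side by side.

```scala
object ApplyStyle {
  // Discouraged: the inherited signature is apply(v1, v2, v3),
  // so callers and the IDE see only generic argument names.
  class PadFunction extends Function3[String, Int, Char, String] {
    def apply(v1: String, v2: Int, v3: Char): String = v1.padTo(v2, v3)
  }

  // Preferred: a verb-named method with descriptive argument names.
  class Padder {
    def padRight(text: String, targetLength: Int, padChar: Char): String =
      text.padTo(targetLength, padChar)
  }

  val pad = new PadFunction
  val viaApply: String = pad("ab", 4, '*')

  val padder = new Padder
  val viaVerb: String =
    padder.padRight(text = "ab", targetLength = 4, padChar = '*')
}
```

Both calls produce the same string, but only the second one tells the reader what each argument means at the call site.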

Careful with the Catches

You should always limit the scope of try/catch blocks to the smallest region of code that needs to be in the scope of the try.

You should always catch the most specific type of throwable thing possible.

It is almost always wrong to catch Exception, RuntimeException, or Error, and especially wrong to catch Throwable.

We have a specific class UnsuppressableException, which you should never catch. To be sure you are not, you should write:

catch {
   case u : UnsuppressableException => throw u
   ...
    your other catch cases here
   ...
}

This ensures you are not accidentally suppressing things like Assert.invariantFailed() or Assert.notYetImplemented().
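
Put together, these rules look roughly like the sketch below. It is an illustration under stated assumptions: the UnsuppressableException class here is a local stand-in for the Daffodil class of the same name, and parsePort is a hypothetical function.

```scala
object CatchStyle {
  // Local stand-in for Daffodil's class, so the sketch is self-contained.
  class UnsuppressableException(msg: String) extends Exception(msg)

  def parsePort(text: String): Option[Int] = {
    // Work that cannot throw the expected exception stays outside the try.
    val trimmed = text.trim
    try {
      // Smallest possible region: only the call that can fail.
      Some(trimmed.toInt)
    } catch {
      // Never suppress these; rethrow first.
      case u: UnsuppressableException => throw u
      // Catch the most specific type, not Exception or Throwable.
      case _: NumberFormatException => None
    }
  }
}
```

With this shape, parsePort(" 8080 ") yields Some(8080) and parsePort("http") yields None, while any other kind of failure still propagates to the caller.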

IDE Support and Coding for Debug

Use a coding style supported by the IDE. E.g., notationally, Scala supports both these styles as equivalent:

object method argument // non-punctuated style
object.method(argument) // punctuated style

Without the IDE, one might be indifferent, or in some cases prefer the less-punctuated style. With the Eclipse IDE, the latter style is clearly preferable, as when you type that ".", a menu pops up of available methods and members to choose from. This greatly accelerates one's work, and helps immensely when trying to learn a large code base. As I have been editing and debugging the code, I've found myself rewriting in the punctuated style to gain this advantage.

In Scala, the non-punctuated style becomes important if one has constructed a domain-specific language (DSL) and the various program objects are verbs and nouns of that language. But when you are dealing with object and method, the punctuated style is clearer.

(Update: Emacs Ensime mode is a very good Scala IDE, and Ensime is also usable with other text editors. Anyway it does not have this "." notation restriction. It will happily give you suggested completions regardless of your notational preference. However, until this comes to the Scala Eclipse IDE, I still suggest use of "." notation.)

Coding for Debug - Spread Out the Code

Use a coding style motivated by the availability of breakpoint debugging in the IDE. A coding style called "coding for debug" is important here. Breakpoint debuggers are line-oriented, and so it is much easier to navigate code that is spread out so that there is one function/procedure/method call per line. Hence, expressions like:

f(g(a), h(b))

get rewritten as

val x = g(a)
val y = h(b)
val res = f(x, y)
res // good place for a breakpoint

If you really want that code to stay spread out like that, use a comment like the one above about wanting a place for a breakpoint. Otherwise someone might reorganize the code for clarity.

Another example that comes up a lot in Daffodil is

processor(a, b, c, d, e) match {
 case A(x) => f(x)
 case B(y) => g(y)
}

Which gets rewritten as

val p = processor(a, b, c, d, e)
p match {
  case A(x) => {
    val r = f(x)
    r
  }
  case B(y) => {
    val r = g(y)
    r
  }
}

This has many good places to put line-oriented breakpoints where you can observe at a glance what the value of the variables is.

All this reduces code density somewhat, but if the variable names and function names are well chosen this can counter-balance by improving the self-documenting nature of the code thereby reducing the number of lines of comments required to make the code clear. This helps especially when dealing with highly polymorphic code, as in Daffodil.

The discipline this coding style supports is very much Test-Driven Development, that is, writing unit tests, and walking through them when they fail by just using the IDE "Debug As JUnit Test" feature, and watching the variables change, because the variables give observability to what is going on.

Uniform Return Type Principle

Suppose you want to write:

def myFunction(arg : Seq[T]) : Seq[T]

This is some function which takes a sequence, and returns a sequence. Since a sequence is a generic type that is a supertype of lists, NodeSeq, etc., you can pass it many things. What the caller gets back is of type Seq[T].

So if you pass a Vector[Node] you get back a Seq[Node]: the return type doesn't match the argument type. You can perhaps down-cast it to some concrete type if you want. It turns out a better way to write this is:

def myBetterFunc[S <: Seq[T]](arg : S) : S

The notation S <: Seq[T] can be read as "S is a subtype of Seq[T]".

This function signature says myBetterFunc takes an arg of type S, returns that same type S, and S must be a subtype of type Seq[T].

So, when you call myBetterFunc, passing it a Vector[Node], you will get back a Vector[Node]. This is a general principle of Scala library design called the "uniform return type principle" that makes libraries easier to use, and avoids many error-prone downcasts.
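
A small runnable illustration of the difference follows. The function names are hypothetical, and Int is used as the element type to keep the sketch self-contained.

```scala
object UniformReturn {
  // Loose signature: whatever goes in, the caller sees only Seq[Int].
  def requireNonEmptyLoose(arg: Seq[Int]): Seq[Int] = {
    require(arg.nonEmpty, "sequence must not be empty")
    arg
  }

  // Uniform return type: the caller gets back the type it passed in.
  def requireNonEmpty[S <: Seq[Int]](arg: S): S = {
    require(arg.nonEmpty, "sequence must not be empty")
    arg
  }

  // No downcast needed: the static types are preserved.
  val vector: Vector[Int] = requireNonEmpty(Vector(1, 2, 3))
  val list: List[Int] = requireNonEmpty(List(4, 5))

  // With the loose signature the static type is only Seq[Int].
  val loose: Seq[Int] = requireNonEmptyLoose(Vector(1, 2, 3))
}
```

Assigning the result of requireNonEmptyLoose to a Vector[Int] would not compile without a cast, which is exactly the downcast the principle avoids.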

Use 'def' for abstract members

When you create an interface in a base class for a derived class to implement, you always want to use 'def'.

That is, you always want to put a 'def' on a base class that defines an abstract member.

Then each derived class provides an implementation using 'def', 'val', or 'lazy val'.

So 'def' both means "define function" and "deferred". That is:

def f = a + b

evaluates a + b every time that f is called/used, exactly as if you wrote:

def f() = a + b

In contrast to that,

val v = a + b

evaluates a + b exactly once, when the object containing val v is constructed.

lazy val lv = a + b

evaluates a + b when lv is first called/used, and saves the value, so that it is only computed once.
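
The three evaluation behaviors can be observed with a counter. This is a minimal sketch with hypothetical names: Base declares the abstract member with def, and each subclass implements it a different way.

```scala
object DefValLazy {
  abstract class Base {
    // Counts how many times the computation actually runs.
    var evaluations = 0
    protected def compute(): Int = { evaluations += 1; 42 }

    // Abstract member declared with def, as recommended.
    def value: Int
  }

  // Recomputed on every use.
  class ViaDef extends Base { def value: Int = compute() }

  // Computed exactly once, during construction.
  class ViaVal extends Base { val value: Int = compute() }

  // Computed once, on first use.
  class ViaLazyVal extends Base { lazy val value: Int = compute() }
}
```

After two uses of value, a ViaDef instance has run the computation twice, while ViaVal and ViaLazyVal instances have each run it once. The difference between the last two is when: ViaVal runs it before any use, ViaLazyVal not until the first use.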

Use 'lazy val' and 'def' to Avoid Object Initialization Headaches

In many situations when an object is being created and initialized, if anything goes wrong the error is hard to figure out because, well, the object isn't an object yet.

In general it is bad style to do anything at object initialization (in 'val' members) that might throw an exception or otherwise fail.

Furthermore, members that are computed at initialization time can't depend on other methods of the class (especially members of parent classes and traits).

If your 'val' members are changed to lazy val, then they're not computed until the object is fully constructed, so they can do things like throw exceptions, etc.

So instead of

class myClass {
 val foo : FooType = ....complex calculation.....
}

Just use lazy val

class myClass {
 lazy val foo : FooType = ....complex calculation....not done until after object is created....
}
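
The initialization-order problem is easy to reproduce. In this hypothetical sketch, a trait computes one member eagerly and one lazily from a member that the subclass supplies.

```scala
object InitOrder {
  trait Named {
    def name: String

    // Runs while Named is being initialized, before the subclass
    // has assigned its own fields.
    val eagerGreeting: String = "Hello, " + name

    // Runs on first use, after the object is fully constructed.
    lazy val lazyGreeting: String = "Hello, " + name
  }

  class Person extends Named {
    // Assigned after the body of Named has already run.
    val name: String = "Ada"
  }

  val person = new Person
}
```

Here person.eagerGreeting is "Hello, null" because name had not been assigned yet when it was computed, while person.lazyGreeting is "Hello, Ada".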

There is a small amount of overhead for lazy val, so in the most performance critical situations you may want to just use an explicit initialization. Here's the baggage that implies though....

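class myClass {

  // Must be a var, since init() sets it.
  private var isInitialized = false

  @inline private def checkInitialized(): Unit = {
    if (!isInitialized) initErr()
  }

  private var foo_ : FooType = null

  // Braces are needed so that the check and the result form one method body.
  @inline def foo : FooType = { checkInitialized() ; foo_ }

  def init(): Unit = {
    foo_ = ....complex calculation...
    isInitialized = true
  }

  def initErr(): Nothing = {
    throw new IllegalStateException("not initialized")
  }
}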

Use Typed Equality

Subtle bugs can arise when comparing a == b, when a and b turn out to be different types. The == operator is "natural" equality, which simply returns false if a and b are different types.

There are arguments for why this is good and useful and such, see: http://www.artima.com/pins1ed/object-equality.html.

However, very often this masks a programmer error. If a and b are of types that it makes no sense to compare, then a == b should, ideally, be a Scala compile-time type error.

To achieve this, we have two forms of strongly typed equality in the daffodil.equality package. To use them you must

import edu.illinois.ncsa.daffodil.equality._

There are two operators:

a =:= b provides "Type Equality" for use when a and b have a subtype relationship, including the most obvious case of a and b both being of the same type.
a =#= b provides "View Equality" for use when a and b are convertible, that is, a can be implicitly converted to b's type, or b can be implicitly converted to a's type. Numbers and number-like types are common cases for this, so the "#" in the operator is suggestive of "number".

Of these, the a =:= b is the more important and more commonly used. The ":" in the name is supposed to suggest the concept of "typed".
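
To give a feel for how such an operator can be built, here is a simplified sketch. It is not the daffodil.equality implementation: TypedEquality and TypedEqualOps are hypothetical names, and this version is stricter than the description above in that it only accepts a right operand whose type is a subtype of the left operand's type.

```scala
object TypedEquality {
  // Hypothetical sketch, not Daffodil's implementation.
  implicit class TypedEqualOps[A](val left: A) {
    // Compiles only when there is evidence that B is a subtype of A.
    def =:=[B](right: B)(implicit ev: B <:< A): Boolean = left == right
  }

  val sameNumbers: Boolean = 1 =:= 1
  val differentStrings: Boolean = "abc" =:= "abd"
  val seqAndList: Boolean = Seq(1, 2) =:= List(1, 2)

  // val mistake = "1" =:= 1 // does not compile: Int is not a subtype of String
}
```

The commented-out line is the case this section is about: with ordinary ==, it would compile and simply evaluate to false.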

(To be determined - what is the performance of these relative to ordinary a == b.)

Attach the Source Code

Our build system will obtain the source code for libraries when sbt is able to retrieve them. If a library is not sbt-managed the library itself goes in the lib sub-directory, and the source code and documentation go into libsrc.

Having the source code to walk into from the debugger helps immensely with debugging, and makes up for some of the deficiencies of the Scala IDE support versus the more mature Java IDE support. E.g., Scala mode today doesn't pop up Javadoc strings, but if you can quickly jump over to the corresponding piece of source code, you can read the javadoc/scaladoc there.

Specifics on Libraries

Some libraries we use (or don't use) deserve specific commentary.

Library Licenses

We are committed to the Univ. of Illinois/NCSA open-source licensing terms for the Daffodil code. This restricts the licenses of libraries we use to those compatible with this license.

Generally speaking, this means we cannot use libraries licensed under the GPL (v2 or v3), but there are variations of these licenses (e.g., "classpath exception", and LGPL) which may be acceptable. These need to be examined on a case-by-case basis.

Problematic Libraries

There are some "supposedly" standard libraries that we're not using, basically because we tried and they didn't work out. Details on these efforts are below. Some day in the future this may be worth revisiting, but only if either the libraries have improved or we have someone with maintenance-level experience with them join the Daffodil project, that is, someone who knows how to make them work properly.

Apache XML Schema

This library has been tried and is inadequate to our needs currently (2012-02-24). It lacks support for non-native attributes, the support for appinfo and annotations in general is difficult to use (if it works at all), and it has no escape-mechanism by which one can bypass, get back to the XML objects themselves, and overcome its limitations.

XSOM - XML Schema Object Model

This library has been tried and we may still use it to assemble lists of schema files for us, so that it will handle the namespace resolution and include/import. But we have tried and found it unusable as far as abstract access to the DFDL Schema objects. Specifically, it does not have a first-class notion of a Schema Document. DFDL depends heavily on the notion of a Schema Document in that these are the units where lexically-scoped annotations are used. XSOM provides no way to even ask for the annotations on a schema document, so one cannot implement DFDL's lexical scoping feature using XSOM.