This page discusses coding style guidelines for the Daffodil code base.
Much of the code does not follow these guidelines. As the code base evolves, the goal is for new code to follow these guidelines and for existing code to move toward them.
Our goal is all-64-bit capabilities. Unfortunately, many Java and Scala libraries do not allow offsets or positions larger than a signed 32-bit Int can hold.
Someday, those libraries will be updated so that, for example, a byte array can hold an entire video, which is bigger than 2GBytes. For now we're stuck.
Our code should use type Long for all offsets and positions, and cast to Int only when we must deal with an underlying library that has only an Int-based API. The cast should be guarded by an explicit check, because x.toInt does not raise overflow errors. E.g., Long.MaxValue.toInt produces -1. No error is thrown.
So, when you must have an Int to cope with a library, code should do this:
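A minimal sketch of such a checked conversion follows. The object Offsets and the helper longToIntChecked are hypothetical names chosen for illustration; they are not part of the Daffodil code base.

```scala
object Offsets {
  // Hypothetical helper: narrow a Long offset to Int, failing loudly
  // instead of silently wrapping the way Long.toInt does.
  def longToIntChecked(offset: Long): Int = {
    if (offset < Int.MinValue || offset > Int.MaxValue)
      throw new ArithmeticException("Offset " + offset + " does not fit in an Int")
    offset.toInt
  }
}
```

On Java 8 and later, java.lang.Math.toIntExact performs the same range check and also throws ArithmeticException, so it may serve as an alternative.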
We are committed to using Scala for Daffodil long term. Do not add Java code to this code base except in a few special circumstances.
- We use many Java-based libraries, of course.
- Code snippets found online that are used largely unmodified can be pasted wholesale into Java files.
If you find online examples of how to use an API from Java, then most likely these should be rewritten in Scala. Often there are nicer Scala idioms. Be sure to Web-search for the same API with the keyword "Scala" added to your search. Often you will find idiomatic Scala to accomplish the same thing.
Use Scala's built-in XML capabilities to reduce the quoting hell that otherwise results when you try to type XML as string content.
We are committed to tracking Scala as it evolves. It is too early to try to freeze the Scala language. There are improvements, particularly in the XML support, which are needed, and which we will want to take advantage of. So expect some disruption when major releases of Scala emerge.
Similarly, we expect to track new versions of the libraries we depend on. Please name library files so that their versions are clear.
- Except perhaps Saxon which is still the no-longer-progressing Saxon-B, which is fine for now.
Our code is organized under src/main and src/test directories, with test-only source code going in the latter directory. The package structure under these is identical; the separation is just so that we can package distributions of Daffodil that do not contain test code, should we so desire.
Everything should have unit tests, though there is always debate about what a "unit" really means. For our purposes, unit tests are tests that are easily run by the developer, inside and outside the IDE, that very quickly tell you the status of the code (what's still working, what is broken), and that are intended to help isolate a problem to smaller units of code.
Unit tests must run quickly, i.e., in just a second or so, though the whole suite of them, if run en masse, can take 15 to 30 seconds to run.
Larger test suites can also be written using JUnit, so not everything using unit testing tools is strictly speaking a "unit" test.
A couple of specifics:
- JUnit predicates, not ScalaTest - That is, use assertEquals(expected, actual), not "actual should be equal to expected" (from ScalaTest's ShouldMatchers classes), because the IDE supports JUnit well and doesn't support ScalaTest.
- We do use ScalaTest, but mostly for the bridge to JUnit, and the convenient intercept construct for catching expected exceptions.
- Someone needs to make an argument in favor of ScalaTest's ShouldMatchers, because its biggest attraction seems to be nice English-language readable sentences of test output, which is not a very compelling advantage.
- JUnit4, because that is what Typesafe (a Scala company) seems to be supporting.
DFDL is a large specification. There's no way to be successful implementing it without a very extensive emphasis on testing.
IBM has contributed a set of tests they use for their commercial DFDL implementation, which are expressed in a Test-Definition-Markup-Language (TDML).
We have adopted TDML as our standard for expressing tests as well.
TDML enables creation and interchange of very self-contained tests.
Many Scala fans really like the Read-Eval-Print loop (REPL) paradigm. Many languages, starting with LISP, have had REPLs as a core development tool. However, the REPL style can be a big disadvantage. It encourages ad-hoc testing where the tests are run once by the developer in the REPL and are not captured for repeated use as unit tests. It discourages giving real thought to design-for-test and regression testing. The REPL is great for learning how to call something, reminding yourself how a function works, etc., i.e., for trying things out. It is not a good way to test your own code.
An IDE with explicit support for building up a library of unit tests beside the code is greatly superior.
An important theme is converting the code base so that it is easy to work on and can get the benefits of an IDE.
FP advocates like to make objects which take action when applied to another object. Sometimes this is a useful style, but more often when an object is going to take some action, the method should be named using the verb.
The apply() idiom breaks down when there are more than a couple of arguments, as the code gets pretty hard to read.
In addition, the IDE provides much better support for a named method with named arguments.
So, eliminate/avoid uses of class derivations from FunctionN (e.g., Function6, Function5, Function4, which have generic 6, 5, and 4-argument apply function signatures) because they have generic argument names. Instead these classes should either
- have their own explicit apply functions which have descriptive argument names. These argument types and names are then visible to the IDE.
- have verb-named methods
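A small sketch contrasting the two approaches, using hypothetical PadderFn and Padder classes that are not from Daffodil. With the verb-named method, call sites can use named arguments and the IDE can show meaningful parameter hints.

```scala
object PaddingExample {
  // Discouraged: deriving from Function3 with generic argument names
  // tells the reader nothing about what each argument means.
  class PadderFn extends Function3[String, Int, Char, String] {
    def apply(v1: String, v2: Int, v3: Char): String =
      v1 + (v3.toString * math.max(0, v2 - v1.length))
  }

  // Preferred: a verb-named method with descriptive argument names.
  class Padder {
    def padRight(text: String, minLength: Int, padChar: Char): String =
      text + (padChar.toString * math.max(0, minLength - text.length))
  }
}
```

A call such as new PaddingExample.Padder().padRight(text = "ab", minLength = 4, padChar = '*') reads clearly without consulting the class definition.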
You should always limit the scope of try/catch blocks to the smallest region of code that needs to be in the scope of the try.
You should always catch the most specific type of throwable thing possible.
It is almost always wrong to catch Exception.
We have a specific class UnsuppressableException, which you should never catch. To be sure you are not, you should write
This ensures you are not accidentally suppressing things like Assert.invariantFailed() or Assert.notYetImplemented().
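One way such a catch might be written is sketched below. The UnsuppressableException class defined here is only a stand-in so that the sketch compiles on its own (Daffodil has its own class of that name), and the attempt helper is hypothetical. The broad catch appears only to show why the rethrow case must come first.

```scala
object CatchExample {
  // Stand-in for Daffodil's class, defined here only for self-containment.
  class UnsuppressableException(msg: String) extends RuntimeException(msg)

  // Hypothetical helper: runs body and turns ordinary failures into Left.
  def attempt[A](body: => A): Either[String, A] =
    try {
      Right(body)
    } catch {
      // Listed first, so these are rethrown and never swallowed below.
      case u: UnsuppressableException => throw u
      // Broad catch, shown only to demonstrate the guard above.
      case e: RuntimeException => Left(e.getMessage)
    }
}
```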
Use a coding style supported by the IDE. E.g., notationally, Scala supports both these styles as equivalent:
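For illustration, here is the same computation in the less-punctuated style and then in the punctuated style. The word list is made up for the example.

```scala
object NotationStyles {
  val words = List("parse", "unparse", "compile")

  // Less-punctuated (infix) style.
  val lengthsInfix = words map (_.length)

  // Punctuated (dot) style: typing the "." is what triggers completion.
  val lengthsDotted = words.map(_.length)
}
```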
Without the IDE, one might be indifferent, or in some cases prefer the less-punctuated style. With the Eclipse IDE, the latter style is clearly preferable, as when you type that ".", a menu pops up of available methods and members to choose from. This greatly accelerates one's work, and helps immensely when trying to learn a large code base. As I have been editing and debugging the code, I've found myself rewriting in the punctuated style to gain this advantage.
In Scala, the non-punctuated style becomes important if one has constructed a domain-specific language (DSL) and the various program objects are verbs and nouns of that language. But when you are dealing with objects and methods, the punctuated style is clearer.
(Update: Emacs Ensime mode is a very good Scala IDE, and Ensime is also usable with other text editors. Anyway it does not have this "." notation restriction. It will happily give you suggested completions regardless of your notational preference. However, until this comes to the Scala Eclipse IDE, I still suggest use of "." notation.)
Use a coding style motivated by the availability of breakpoint debugging in the IDE. A coding style called "coding for debug" is important here. Breakpoint debuggers are line-oriented, and so it is much easier to navigate code that is spread out so that there is one function/procedure/method call per line. Hence, expressions like:
get rewritten as
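As an illustration of this kind of rewrite, the sketch below shows a dense nested expression and its one-call-per-line equivalent. All function names are hypothetical stand-ins for real processing steps.

```scala
object CodingForDebug {
  // Hypothetical helpers standing in for real processing steps.
  def readLine(raw: String): String = raw.trim
  def splitFields(line: String): List[String] = line.split(",").toList
  def countNonEmpty(fields: List[String]): Int = fields.count(_.nonEmpty)

  // Dense: three calls on one line, so a single breakpoint covers all of them.
  def countDense(raw: String): Int = countNonEmpty(splitFields(readLine(raw)))

  // Spread out: one call per line, and each intermediate value is a
  // named variable that can be inspected in the debugger.
  def countSpread(raw: String): Int = {
    val line = readLine(raw)
    val fields = splitFields(line)
    val count = countNonEmpty(fields)
    count
  }
}
```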
Another example that comes up a lot in Daffodil is
which gets rewritten as
This has many good places to put line-oriented breakpoints where you can observe at a glance what the value of the variables is.
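The same idea applies to chained collection operations. The sketch below is a made-up pipeline, not taken from the Daffodil sources, showing a chain split into named intermediate values that each offer a breakpoint location.

```scala
object PipelineForDebug {
  // Dense: the whole chain sits on one line.
  def totalLengthDense(names: Seq[String]): Int =
    names.filter(_.nonEmpty).map(_.length).sum

  // Spread out: each stage gets its own line and its own named value.
  def totalLengthSpread(names: Seq[String]): Int = {
    val nonEmptyNames = names.filter(_.nonEmpty)
    val lengths = nonEmptyNames.map(_.length)
    val total = lengths.sum
    total
  }
}
```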
All this reduces code density somewhat, but if the variable names and function names are well chosen, this is counterbalanced by the improved self-documenting nature of the code, which reduces the number of lines of comments required to make the code clear. This helps especially when dealing with highly polymorphic code, as in Daffodil.
The discipline this coding style supports is very much Test-Driven Development, that is, writing unit tests, and walking through them when they fail by just using the IDE "Debug As JUnit Test" feature, and watching the variables change, because the variables give observability to what is going on.
Suppose you want to write:
This is some function which takes a sequence and returns a sequence. Since a sequence is a generic type that is a supertype of List, NodeSeq, etc., you can pass it many things. What the caller gets back is of type Seq[T].
So if you pass a Vector[Node] you get back a Seq[Node]; the return type doesn't match the argument type. You can perhaps down-cast it to some concrete type if you want. It turns out a better way to write this is:
S <: Seq[T] can be read as: S is a subtype of Seq[T].
This function signature says myBetterFunc takes an arg of type S, returns that same type S, and S must be a subtype of type Seq[T].
So, when you call myBetterFunc, passing it a Vector[Node] you will get back a Vector[Node]. This is a general principle of Scala library design called the "uniform return type principle" that makes libraries easier to use, and avoids many error-prone downcasts.
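A runnable sketch of the two signatures follows. The function bodies are invented for illustration (they only check that the sequence is non-empty). The bound is written here as S <: Seq[Any] rather than S <: Seq[T]; this is an assumption made for the sketch, relying on Seq being covariant, because inferring both T and S from a single argument can fail in some compiler versions.

```scala
object ReturnTypes {
  // Returns its argument after a check, but typed only as Seq[T].
  def myFunction[T](s: Seq[T]): Seq[T] = {
    require(s.nonEmpty, "sequence must not be empty")
    s
  }

  // The bounded type parameter lets the result keep the caller's own
  // collection type S instead of widening it to Seq.
  def myBetterFunc[S <: Seq[Any]](s: S): S = {
    require(s.nonEmpty, "sequence must not be empty")
    s
  }
}
```

With these definitions, ReturnTypes.myFunction(Vector(1, 2, 3)) has static type Seq[Int], while ReturnTypes.myBetterFunc(Vector(1, 2, 3)) has static type Vector[Int], so no down-cast is needed.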
When you create an interface in a base class for a derived class to implement, you always want to use 'def'.
That is, you always want to put a 'def' on a base class that defines an abstract member.
Then each derived class provides an implementation using 'def', 'val', or 'lazy val'.
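A minimal sketch of this arrangement, with hypothetical Shape classes: the base trait declares the abstract member with def, and each derived class chooses def, val, or lazy val for its implementation.

```scala
object AbstractMembers {
  // The abstract member is declared with def in the base trait.
  trait Shape {
    def area: Double
  }

  // Implemented with val: computed once, at construction.
  class Square(side: Double) extends Shape {
    val area: Double = side * side
  }

  // Implemented with lazy val: computed once, on first use.
  class Rectangle(width: Double, height: Double) extends Shape {
    lazy val area: Double = width * height
  }

  // Implemented with def: recomputed on every use, so it tracks changes.
  class Growing(var side: Double) extends Shape {
    def area: Double = side * side
  }
}
```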
So 'def' both means "define function" and "deferred". That is:
A member declared as def f = a + b evaluates a + b every time that f is called/used, exactly as if you had written a + b at each use.
In contrast to that, val v = a + b evaluates a + b exactly once, when the object containing val v is constructed.
Finally, lazy val lv = a + b evaluates a + b when lv is first called/used, and saves the value, so that it is only computed once.
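The difference in evaluation timing can be observed with a small sketch. The Sums class and its counters are hypothetical, added only to count how often each right-hand side runs.

```scala
object EvaluationDemo {
  class Sums(a: Int, b: Int) {
    // Counters are declared first so they are initialized before v runs.
    var defCount = 0
    var valCount = 0
    var lazyCount = 0

    def f: Int = { defCount += 1; a + b }        // runs on every use
    val v: Int = { valCount += 1; a + b }        // runs once, at construction
    lazy val lv: Int = { lazyCount += 1; a + b } // runs once, on first use
  }
}
```

After constructing a Sums and reading each member twice, defCount is 2 while valCount and lazyCount are both 1; before lv is first read, lazyCount is still 0.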
Our build system will obtain the source code for libraries when sbt is able to retrieve them. If a library is not sbt-managed the library itself goes in the lib sub-directory, and the source code and documentation go into libsrc.
Having the source code to walk into from the debugger helps immensely with debugging, and makes up for some of the deficiencies of the Scala IDE support versus the more mature Java IDE support. E.g., Scala mode today doesn't pop up Javadoc strings, but if you can quickly jump over to the corresponding piece of source code, you can read the javadoc/scaladoc there.
Some libraries we use (or don't use) deserve specific commentary.
We are committed to the Univ. of Illinois/NCSA open-source licensing terms for the Daffodil code. This restricts the licenses of libraries we use to those compatible with this license.
Generally speaking, this means we cannot use libraries licensed under the GPL (v2 or v3), but there are variations of these licenses (e.g., "classpath exception", and LGPL) which may be acceptable. These need to be examined on a case-by-case basis.
There are some "supposedly" standard libraries that we're not using, basically because we tried and they didn't work out. Details on these efforts are below. Some day in the future this may be worth revisiting, but only if either the libraries have improved or we have someone with maintenance-level experience with them join the Daffodil project, that is, someone who knows how to make them work properly.
This library has been tried and is inadequate to our needs currently (2012-02-24). It lacks support for non-native attributes, the support for appinfo and annotations in general is difficult to use (if it works at all), and it has no escape-mechanism by which one can bypass, get back to the XML objects themselves, and overcome its limitations.
XSOM has been tried, and we may still use it to assemble lists of schema files for us, so that it handles the namespace resolution and include/import. But we have found it unusable for abstract access to the DFDL schema objects. Specifically, it does not have a first-class notion of a Schema Document. DFDL depends heavily on the notion of a Schema Document, in that these are the units where lexically-scoped annotations are used. XSOM provides no way to even ask for the annotations on a schema document, so one cannot implement DFDL's lexical scoping feature using XSOM.