This page is about the requirements driving the continuing work on the Daffodil open-source project.

Without repeating the motivations for DFDL from the DFDL spec, this page is intended to capture goals and requirements from the various constituencies interested in Daffodil.

Right now, this is an UNORDERED list. An important revision will be to prioritize this list, and then for the interested contributors to self-organize around it.

  • Make the DFDL specification successful 
    • effectively this is an economic requirement: we all spend too much time and money on data format stuff. DFDL will help. The sooner the better.
  • Conformance with the standard
    • Build up compactly expressed test cases that are readily exchanged, and which ensure Daffodil comes into, and remains in, conformance.
    • Worth mentioning: the standard has ambiguities and places where clarifications will be necessary, and tests to drive this are also crucial.
    • Interchange of test cases with other DFDL implementors (notably right now, IBM) will be an advantage to all parties.
  • Performance & Memory utilization:
    • enable use of DFDL for applications that require high-performance streaming access to data (both parsing and unparsing).
      • To make that a bit more concrete: 40,000 1 Kbyte messages per second on a 12-core commodity computer.
      • I presume this is a performance requirement for both reading/parsing and serializing/unparsing the messages.
      • A key requirement here is that you must be able to avoid maintaining the whole data stream in memory, but this may require some restrictions on the generality of the specific format as well. (Some formats just don't stream well.)
    • enable use of DFDL for access to large data file-based structures in memory (DOM-tree style)
      • Daffodil today (January 2012) is closest to this goal.
    • true random access - i.e., without retrieving/constructing the entire tree.
  • Features - we need to prioritize certain features so as to decide when to bite the bullets needed to implement them.
    • Bits - Daffodil is byte-centric today. If dense bit-packed formats are critical (and I suspect they are), then some back-end rework to deal with bits cleanly is required soon; otherwise that rework could be postponed.
    • Encodings - which are important?
  • Robustness & Code Quality - it's pretty critical that the implementation be robust (not too many bugs) given the diverse constituencies it is expected to serve.
    • maintainability of the code-base as it grows and comes into conformance with the spec is very important. 
    • this is an open source project - there's a coolness factor here - having the code base remain cleanly organized and crafted is key to attracting talented people over time to keep it moving forward. This is one of the reasons why creating Daffodil in Scala is a good idea - a very cool new language attracts talented individuals, etc. It's good to be at least somewhat cutting edge here.
  • Timeliness - the usual caveats apply here, i.e., feature coverage at what level of performance? The above performance/memory requirements need to be phased in over time. Nevertheless, the desirable timelines that some have mentioned, without discussing the accompanying performance, can be summarized as:
    • 60-70% feature coverage by the end of September 2012
    • 100% feature coverage for parsing by early spring 2013
    • Unparsing with 100% feature coverage by later in 2013
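
To put the streaming throughput target from the Performance & Memory utilization item in per-core terms, here is a back-of-the-envelope calculation. It is only a sketch: it assumes 1 Kbyte means 1024 bytes and that the work divides evenly across all 12 cores with no coordination overhead, neither of which the requirement states.

```python
# Back-of-the-envelope budget for the streaming target above.
# Assumptions (not part of the stated requirement): 1 Kbyte = 1024 bytes,
# and work divides evenly across all cores with no coordination overhead.
MESSAGES_PER_SECOND = 40_000
MESSAGE_BYTES = 1024
CORES = 12

# Total bytes that must flow through the machine each second.
aggregate_bytes_per_second = MESSAGES_PER_SECOND * MESSAGE_BYTES

# Time one core may spend on one message, in microseconds.
per_message_budget_us = CORES * 1_000_000 / MESSAGES_PER_SECOND

# The same budget expressed per input byte, in nanoseconds.
per_byte_budget_ns = per_message_budget_us * 1000 / MESSAGE_BYTES

print(f"aggregate throughput: {aggregate_bytes_per_second / 1e6:.2f} MB/s")
print(f"per-message budget per core: {per_message_budget_us:.0f} microseconds")
print(f"per-byte budget per core: {per_byte_budget_ns:.0f} nanoseconds")
```

Under these assumptions the aggregate rate is modest (about 41 MB/s), but each core has only about 300 microseconds per message. If both parsing and unparsing must happen within the same second, that budget is shared between them.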

These requirements suggest some tasks/issues which should go into Jira once they have been made a bit more concrete.

  • Design streaming system for DFDL - an API that makes sense, and subset of features that make sense for streamable formats.
    • keep in mind both parsing and unparsing when designing for streaming. (e.g., there's a standard Java API for pull-parsing of XML, but no corresponding API for streaming push-unparsing of XML.)
    • there are other kinds of APIs (cursor-oriented for example), which may be appropriate here as well. The XML-event style isn't the only way to go.
  • Daffodil/DFDL API - streaming is one API aspect. It raises the question of what the overall APIs for DFDL should be.
  • Scala training, and code-quality/review provisions - how do we want to do this?
    • Everyone working on this project will be new to Scala for a period of time, so there's an inevitable learning curve. How do we deal with this, other than by rewriting stuff a few times until we believe we have it written the right way? Maybe that's all we can do - but we can learn from other organizations using Scala and from the community of Scala developers.
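
As a rough illustration of the pull-parsing and push-unparsing styles mentioned in the streaming task above, here is a minimal sketch in Python. It is not a Daffodil or DFDL API: the record format (a 2-byte big-endian length followed by UTF-8 text), the event tuples, and the names `pull_events`, `PushUnparser`, and `round_trip` are all hypothetical, chosen only to show the shape of the two interfaces.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple

# Hypothetical toy format (not from DFDL or Daffodil): each record is a
# 2-byte big-endian length followed by that many bytes of UTF-8 text.
Event = Tuple[str, str]  # (event kind, payload)


def pull_events(stream: BinaryIO) -> Iterator[Event]:
    """Pull-style parser: the caller asks for the next event, and only
    one record is held in memory at a time."""
    yield ("start", "")
    while True:
        header = stream.read(2)
        if not header:
            break  # clean end of input
        if len(header) < 2:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">H", header)
        body = stream.read(length)
        if len(body) < length:
            raise ValueError("truncated record body")
        yield ("record", body.decode("utf-8"))
    yield ("end", "")


class PushUnparser:
    """Push-style unparser: the caller hands over one event at a time
    and the corresponding bytes are written out immediately."""

    def __init__(self, sink: BinaryIO) -> None:
        self.sink = sink

    def push(self, event: Event) -> None:
        kind, payload = event
        if kind == "record":
            data = payload.encode("utf-8")
            self.sink.write(struct.pack(">H", len(data)) + data)
        # "start" and "end" carry no bytes in this toy format.


def round_trip(data: bytes) -> bytes:
    """Parse with the pull API and re-serialize with the push API."""
    out = io.BytesIO()
    unparser = PushUnparser(out)
    for event in pull_events(io.BytesIO(data)):
        unparser.push(event)
    return out.getvalue()
```

Neither side ever needs the whole data stream in memory, which is the property the streaming requirement asks for. A cursor-oriented API would expose the same information differently, for example as a position object the caller advances explicitly.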