Blog from June, 2012

In examining this issue, I've found that while under-allocation may be problematic, over-allocation of resources can also cause problems.

In the original, under-allocating definition of InStreamFromByteChannel, the bb buffer would read at most 4 times the sizeHint the class received as a parameter, which defaults to half a mebibyte.  Because there is no code to determine whether more bytes remain in the channel, this configuration runs the risk of underflow errors when parsing fields.

The following code corrects this by allocating a buffer of sufficient size to store everything available in the input stream in the bb buffer:

if (count == bb.capacity()) {
  // Buffer not big enough; allocate one 4 times larger and fill at offset
  var tooSmall = scala.collection.mutable.ListBuffer.empty[ByteBuffer]
  var lastWrite = 0
  while (count == bb.capacity()) {
    // Remember where we started
    bb.flip()
    bb.position(lastWrite)
    // Save old buffer and allocate anew
    tooSmall += bb
    bb = ByteBuffer.allocate(count * 4)
    // Leave space to copy the old buffers back to this one
    bb.position(count)
    lastWrite = count
    // Read in as much as possible
    count += in.read(bb)
  }
  // bb now holds enough space for the entire buffer, starting from a position
  // at the end of the previous buffer's size, so copy over the other buffers
  // in tooSmall to fill in the gap
  bb.flip()
  tooSmall.foreach(b => bb.put(b))
  bb.position(0)
} else {
  // Buffer is sufficiently sized
  bb.flip()
}

The problem with this solution is that it has the potential to grossly over-allocate a buffer just to store the entire contents of the stream.

What's ideally desired is something more akin to a sliding window that loads only enough data at any given time to satisfy the current field request.  In other words, it should allocate more space only in a lazy, just-in-time fashion.
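A rough sketch of that idea follows. This is illustrative only: the names (LazyWindow, ensure) and the doubling policy are my own, not Daffodil's current API. The point is that the buffer grows just-in-time, reading from the channel only because a caller's field request needs more bytes than are currently buffered, rather than slurping the whole stream up front.

```scala
import java.io.ByteArrayInputStream
import java.nio.ByteBuffer
import java.nio.channels.{Channels, ReadableByteChannel}

object LazyWindow {
  // Ensure at least `needed` bytes are buffered (or the channel is
  // exhausted), doubling the buffer lazily. Returns the (possibly
  // reallocated) buffer, flipped and ready for reading.
  def ensure(in: ReadableByteChannel, buf: ByteBuffer, needed: Int): ByteBuffer = {
    var bb = buf
    var eof = false
    while (bb.position() < needed && !eof) {
      if (!bb.hasRemaining()) {
        // Out of room: allocate double the capacity and copy what we have
        val bigger = ByteBuffer.allocate(bb.capacity() * 2)
        bb.flip()
        bigger.put(bb)
        bb = bigger
      }
      eof = in.read(bb) == -1 // read only because the caller asked for more
    }
    bb.flip()
    bb
  }
}

// A 4-byte hint but a 10-byte field request: the buffer doubles only as far
// as the request demands, not to the size of the whole stream.
val ch = Channels.newChannel(new ByteArrayInputStream("abcdefghijkl".getBytes("UTF-8")))
val bb = LazyWindow.ensure(ch, ByteBuffer.allocate(4), 10)
// bb.limit() >= 10, so the current field can now be parsed from bb
```

This still leaves open how the window slides forward after a field is consumed (compaction versus reallocation), which is part of the analysis mentioned below.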

Unfortunately, this change would invalidate the TestBufferAllocations test case in sub-projects/core/srcTest/daffodil/dsom/TestBufferAllocation.scala, and it's not yet clear how to inject a custom sizeHint into InStreamFromByteChannel in order to properly test lazy evaluation with a reasonably sized buffer.

The correct solution, with a test for whether the buffer contains the end of the file and lazy allocations to fetch more data only when needed, will take some time, as more analysis of the code is required.

However, a similar problem exists in the buffer allocations of ICU, IBM's Unicode internationalization libraries.  Also, the code in fillCharBufferUntilDelimiterOrEnd and fillCharBufferWithPatternMatch (sub-projects/core/src/daffodil/grammar/Parser.scala) is about 90% similar, so code reuse and consolidation would be a goal of the allocation cleanups.

The current implementation of lengthKind="pattern" assumes the lengthPattern expression stores a value to be matched.  That is to say, if one wanted a hexadecimal string, one would set lengthPattern="[0-9A-Fa-f]+".  This would still require a separator attribute, though, because the pattern says nothing about the delimiter that would normally separate fields.
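Concretely, under this reading, matching the field's content is decoupled from finding the separator. A small sketch using java.util.regex directly (not Daffodil's internals):

```scala
import java.util.regex.Pattern

// lengthPattern as "value to be matched": the regex describes the
// field's own content, here a hexadecimal string.
val hexField = Pattern.compile("[0-9A-Fa-f]+")
val m = hexField.matcher("DEADbeef,next")
// lookingAt() anchors at the start of input but, unlike matches(),
// need not consume all of it, so the match stops before the comma.
val field = if (m.lookingAt()) m.group() else ""
// field == "DEADbeef"; consuming the "," still needs a separator property
```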

However, the AO and AV tests from Tresys assume the opposite: that the lengthPattern is another form of separator/delimiter and doesn't say anything about the data format for the given field.

MikeB said: The correct definition is 'value to be matched', not a regex way to express delimiters.

In the current implementation, the separator might also become part of the pattern, since it's possible to define a positive look-ahead which specifies the end of the field.  For example, if the separator is a comma (,), the pattern might be "[A-Z][a-z]*(?=,|$)", meaning the field must consist of a capital (majuscule) letter followed by zero or more lower-case (minuscule) letters, terminating at the first comma or end of string.
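That look-ahead behavior can be demonstrated directly with java.util.regex (an illustration, not Daffodil code):

```scala
import java.util.regex.Pattern

// The look-ahead asserts that a comma (or end of input) follows the
// field without consuming it, so the separator remains available to
// whatever processes it next.
val p = Pattern.compile("[A-Z][a-z]*(?=,|$)")
val m = p.matcher("Hello,World")
val first = if (m.find()) m.group() else ""
// first == "Hello"; the match ends at index 5, leaving the comma unconsumed
```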

The testLengthKindPattern test case in sub-projects/core-srcTest/daffodil/api/TestAPI1.scala has another example of how the current implementation might be used:

lengthPattern=".*?[^\\](?=,|$)"

The above expression matches any character, zero or more times though no more than necessary, followed by any character that isn't a backslash (\).  It then asserts that whatever follows this (the first instance of a possible match, thanks to the reluctant "though no more than necessary" quantifier) is either a comma or the end of the string.
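To make that concrete, here is the pattern applied to data containing an escaped comma (a small illustration, not taken from the test case itself):

```scala
// The reluctant .*? plus the trailing non-backslash character step past
// the escaped comma "\," and stop at the first unescaped comma.
val r = """.*?[^\\](?=,|$)""".r
val field = r.findFirstIn("""a\,b,c""").getOrElse("")
// field == "a\,b": the escaped comma belongs to the field, and the match
// ends just before the first comma not preceded by a backslash
```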

If the code were changed to mean that lengthPattern is just another word for a regex separator, this would have to be changed to:

lengthPattern="(?<!\\),"

This uses a negative look-behind to ensure that the given comma delimiter is not immediately preceded by a backslash.  Look-behinds, though, have to be fixed length, unlike look-aheads, due to limitations in most regex engines, so for instance (?<=a{1,3}) is not a valid positive look-behind.
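For comparison, the delimiter reading applied to the same escaped-comma data would split on unescaped commas (again, just an illustration):

```scala
// Split on commas not immediately preceded by a backslash; the
// look-behind here is fixed-length (a single character), as required.
val parts = """a\,b,c""".split("""(?<!\\),""")
// parts == Array("a\,b", "c"): the look-behind rejects the escaped
// comma as a split point
```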

So the open question is: which is better?  Should lengthPattern subsume separator, or should it specify the pattern the field must match?

Note: changing the code to support pattern-as-delimiter would not be much work, so implementation time should not be a consideration.

I have made the necessary local changes to reverse the logic, making it a search by regex delimiter, and found that the AO000, AO001, AO002, AO003, AO004, AV000, AV001 and AV002 tests still fail due to the parser's inability to find the initiator property.

I have since reverted to the original implementation and tests, per Mike's comment below; we will treat lengthKind="pattern" as a regex match against the field.  What has yet to be decided is how to handle field delimiters and separators.

MikeB said: Best to go to the current draft of the DFDL spec. The AA-BG tests were part of the original daffodil code base, and that was developed based on a very early draft of the DFDL standard, which has since changed a great deal as it has converged toward an agreed standard. In particular, way back somebody suggested "why not let delimiters be regular expressions?" For a while this was entertained, and that's when the original daffodil code base was created along with tests AA-BG. But this was later viewed as too much generality (hence complexity, especially when debugging "why doesn't my DFDL schema work???"), for not enough benefit.

The lengthKind="pattern" was added to handle the cases which truly need a complex match.