Uploaded image for project: 'Daffodil'
  1. Daffodil
  2. DFDL-1597

Too many ways that encoding, byteOrder, etc. are being setup

XMLWordPrintableJSON

    • Icon: Improvement Improvement
    • Resolution: Unresolved
    • Icon: Normal Normal
    • deferred
    • 2.0.0
    • Back End, clean ups
    • None

      This issue affects both parser and unparser.

      It is a code-cleanup/maintainability, and perhaps a bit of a performance issue.

      I'll use encoding as the property for discussion here, but the point applies to byteOrder and possibly bitOrder and maybe a few other things like fillByte also.

      We have this idea that since encoding often doesn't change at all in a data format, that we insert an encoding-change processor (parser or unparser) which just sets the new setting, which is then stored in the DataInputStream/DataOutputStream object, and no per-element overhead is encountered for setting up this property.

      For some formats, encoding changes. This becomes quite tricky if the change occurs inside of a repeating element, as in that case, each repeat might need to begin by setting the encoding to the right one for the start of the repeating element, but then inside that element another set might change it, and that might last until the end of the repeating element, hence, at the start of the next repeat, we must set it back to the proper encoding for the start of the element.

      So, unless the DFDL compiler can optimize it out, any repeating element containing any text must begin with an encodingChange processor. Right now the optimization looks for whether the entire schema has uniform encoding (which is common), but any format that has a runtime-valued expression for encoding will be deemed "unknowable", and the encoding will be assumed to be variable. This is very pessimistic, but a better analysis depends on determining that a runtime-valued expression for encoding is still defined at a high-enough scope (such as top-level) that encoding is not subject-to-change during the element in question. The analysis being done is not that sophisticated currently.

      The above might not matter much for encoding, but for byteOrder, where we know there are formats that begin with a byte-order indication (such as PCAP), this matters.

      Now, there are also some parsers (and unparser) primitive processors that expliclity set encoding, or byteOrder. before they carry out their processing.

      It is possible that encoding change processors are not being inserted everywhere they are needed; hence, if the various setEncoding calls are removed, it may break things.

      Now, setting the encoding can check if the encoding is the same one it can do little work other than an equality check. So rather than having lots of encoding-change processors that are being inserted due to unsophisticated compile-time optimization, it may actually be better performance to either simply call setEncoding before every textual primitive operation, or it may be better to use Evaluatables, and have the data stream layer access encoding information via the Evaluatables mechanism. Then at least we would have only one code path to focus on for performance improvement.

      For unparsing:

      slightly more complicated - any suspended operation needs to "freeze" the value of encoding that is used to unparse, so that subsequent non-suspended unparser operations that change the encoding will not result in the suspended operation using the wrong one.

      However, since the encoding change unparser will have been run before the suspension is created, and since the data output stream of a suspension is cloned for the suspension, the encoding should be correctly set for unparsers that are suspended. Should being the key operative word here. This needs to be verified. (Also, the cloning of the data output stream state is itself a big performance worry, so the work done there needs to be minimized.)

      Whatever mechanism is chosen to resolve the parser issue, the unparser should work the same way.

              Unassigned Unassigned
              mbeckerle.dfdl Mike Beckerle
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated: