Daffodil / DFDL-1386

single utf-8 4-byte character becomes surrogate character pairs in scala/java string

    Details

    • Type: Wish
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: never
    • Component/s: Back End
    • Labels:
      None

      Description

      Recent changes in 1.2.0 to the data input layers removed the ability to treat a surrogate pair as a single character.

      See test_encodingNoError.

      This test has a TDML representation in which a single character with a 4-byte UTF-8 encoding has to become a surrogate pair (two UTF-16 code units, i.e. two Char values) in a Java/Scala string, but a call to next() on the data input stream's char iterator returns only one Char. The data input stream layers make no accommodation for a single character needing two Char values.
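
      The underlying JVM behavior can be illustrated with a minimal sketch. This is not Daffodil code: the class name SurrogatePairDemo and the helper decodeUtf8 are hypothetical, and only standard java.lang / java.nio calls are used. It decodes U+10000, whose UTF-8 encoding is 4 bytes, and shows that the resulting string holds two char values but one code point.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    // Hypothetical helper: decode UTF-8 bytes into a Java String (UTF-16 internally).
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+10000 encoded in UTF-8 takes 4 bytes: F0 90 80 80.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80};
        String s = decodeUtf8(utf8);

        // One character in the data, but two char (UTF-16 code unit) values in the string.
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1

        // The two chars are a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true

        // Reading by code point recovers the single character and
        // reports how many chars it occupies.
        int cp = s.codePointAt(0);
        System.out.println(Integer.toHexString(cp)); // 10000
        System.out.println(Character.charCount(cp)); // 2
    }
}
```

      An iterator that hands back one char per call to next() would need two calls to deliver this character, which is the gap described above.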

              People

              Assignee:
              Unassigned
              Reporter:
              mbeckerle.dfdl Mike Beckerle