  1. Daffodil
  2. DFDL-1386

Single 4-byte UTF-8 character becomes a surrogate pair in a Scala/Java string


    • Type: Wish
    • Resolution: Unresolved
    • Priority: Normal
    • never
    • 2.0.0
    • Back End
    • None

      Recent changes in 1.2.0 to the data input layers removed the ability to treat a surrogate pair as a single character.

      See test_encodingNoError.

      This test has a TDML representation in which a single character with a 4-byte UTF-8 encoding must become a surrogate pair (two UTF-16 code units, i.e. two chars, for one code point) in a Java/Scala string. However, a call to next() on the data input stream's char iterator returns only one char. The data input stream layers make no accommodation for a single character needing two chars.
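
      The string-level behavior can be reproduced outside Daffodil with a minimal Java sketch. The class name SurrogatePairDemo and the choice of U+1F600 are assumptions made for illustration; they are not taken from test_encodingNoError or from the Daffodil code base. The sketch only shows how the JVM represents a 4-byte UTF-8 character, not how the data input stream is implemented.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    // UTF-8 encoding of U+1F600, a code point outside the Basic Multilingual Plane.
    // U+1F600 is an illustrative choice; it is not taken from the TDML test.
    static final byte[] UTF8_BYTES = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};

    // Decode the four bytes into a Java String.
    static String decode() {
        return new String(UTF8_BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode();
        System.out.println(s.length());                             // 2 chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.printf("U+%X%n", s.codePointAt(0));              // U+1F600
    }
}
```

      An iterator that yields one char per decoded character cannot deliver this string in a single next() call, which appears to be the gap described above.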

              Unassigned Unassigned
              mbeckerle.dfdl Mike Beckerle