Recent changes in 1.2.0 to the data input layers removed a feature which is the ability to treat surrogate pair characters as single characters.
See test_encodingNoError.
This test has a TDML representation where a single character in utf-8 that has a 4-byte encoding has to become a surrogate-pair (two codepoints) in a java/scala string, but the data input stream's char iterator on a call to next() returns only 1 codepoint. There is no accomodation in the data input stream layers for the possibility of a single character needing 2 codepoints.