Profiling has shown that on unparse, remapping from PUA to XML is actually fairly expensive. Removing this remap (for a schema that doesn't need to remap anything) improves performance by about 30%.
Fortunately, there are probably a lot of cases where we know we don't need to remap. For example, xs:hexBinary, integer types, and date/time types should never require mapping to/from PUA, since they are always represented in the infoset with ASCII chars. Really, the only type that might need it is xs:string.
So here are a few potential ideas for performance improvements:
- Never remap types that we know will never have XML illegal characters
- For types that could potentially have XML illegal characters, first check if there are any illegal characters before remapping the string. In most cases, we won't need to remap, so this will save us the costs associated with string builders. This does mean things might be a little slower for strings that contain illegal XML characters, but that's not the common case.
- When we do find a string containing XML illegal characters, let's put an attribute in the infoset that indicates that the data was mapped to PUA. This way, when we unparse, we only ever have to remap strings to XML if that attribute is set. This may also be helpful for users, since this could be a notice that they need to remap the string before using it. Note, however, that we might want a tunable that says to always remap xs:strings when unparsing, even if the attribute doesn't exist, since the infoset may not have come from a Daffodil parse or may have been sanitized and had attributes removed.
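A rough sketch of the check-before-remap idea, for illustration only. It assumes a simplified mapping in which only the C0 control characters that XML 1.0 forbids are shifted into the PUA by adding 0xE000; the real mapping likely covers more cases (surrogates, U+FFFE/U+FFFF, and so on). All names here (`PuaRemapSketch`, `needsUnparseRemap`, the `markedAsMapped` attribute flag, the `alwaysRemapStrings` tunable) are hypothetical and not existing Daffodil APIs.

```scala
object PuaRemapSketch {
  // Assumption: illegal chars are mapped into the PUA by a fixed offset.
  private val PuaOffset = 0xE000

  // Simplified notion of "XML illegal": C0 controls other than tab, LF, CR.
  private def isXmlIllegal(c: Char): Boolean =
    c < 0x20 && c != '\t' && c != '\n' && c != '\r'

  // True if c is a PUA char that the simplified mapping could have produced.
  private def isMappedPua(c: Char): Boolean = {
    val orig = c - PuaOffset
    orig >= 0 && orig < 0x20 && isXmlIllegal(orig.toChar)
  }

  // Scan first; only allocate a builder once a char that needs remapping is found.
  private def remapIfNeeded(s: String, needs: Char => Boolean, f: Char => Char): String = {
    val len = s.length
    var i = 0
    while (i < len && !needs(s.charAt(i))) i += 1
    if (i == len) s // fast path: nothing to remap, return the same instance
    else {
      val sb = new java.lang.StringBuilder(len)
      sb.append(s, 0, i) // copy the clean prefix unchanged
      while (i < len) {
        val c = s.charAt(i)
        sb.append(if (needs(c)) f(c) else c)
        i += 1
      }
      sb.toString
    }
  }

  // Parse direction: XML-illegal chars -> PUA.
  def remapXmlIllegalToPua(s: String): String =
    remapIfNeeded(s, isXmlIllegal, c => (c + PuaOffset).toChar)

  // Unparse direction: PUA -> original chars.
  def remapPuaToXmlIllegal(s: String): String =
    remapIfNeeded(s, isMappedPua, c => (c - PuaOffset).toChar)

  // Hypothetical decision for unparse: only strings are candidates, and only
  // when the infoset attribute is present or the tunable forces a remap.
  def needsUnparseRemap(isString: Boolean, markedAsMapped: Boolean, alwaysRemapStrings: Boolean): Boolean =
    isString && (markedAsMapped || alwaysRemapStrings)
}
```

In the common case the scan finds nothing and the original string is returned without any allocation. Strings that do contain characters to remap pay for the scan of the clean prefix plus the copy, which matches the trade-off described in the second bullet.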