Example: XML Cyrillic ISO8859-5 to UTF-8 translation in the system Unicode

Consider the XML Cyrillic ISO8859-5 to UTF-8 translation example when it runs in the system Unicode. The data is in Unicode both before and after the translation, so an encoding conversion is meaningless and none takes place. Everything else is the same: the data is bulk copied and the encoding declaration is changed to UTF-8.

If you convert a site that uses this translation to the system Unicode, both threads default to binary encoding. Binary encoding does not change the data, so the output message contains ISO8859-5 encoded data with UTF-8 in the XML declaration.
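The mislabeled output can be reproduced with a minimal Python sketch. It assumes that the binary encoding behaves like a Latin-1 round trip (one byte to one code point and back) and that the translation only rewrites the encoding declaration. The function names and the sample message are hypothetical and are not part of the product.

```python
# Sketch only: binary threads modeled as a Latin-1 round trip (an assumption).
def binary_in(raw: bytes) -> str:
    # Inbound thread: every byte becomes the code point with the same value.
    return raw.decode("latin-1")

def binary_out(text: str) -> bytes:
    # Outbound thread: every code point U+0000..U+00FF becomes one byte again.
    return text.encode("latin-1")

# Hypothetical inbound message, encoded in ISO8859-5 as declared.
source = '<?xml version="1.0" encoding="ISO-8859-5"?><msg>Привет</msg>'.encode("iso8859_5")

text = binary_in(source)
# Stand-in for the translation: only the encoding declaration changes.
text = text.replace('encoding="ISO-8859-5"', 'encoding="UTF-8"', 1)
output = binary_out(text)

# The payload bytes are still ISO8859-5, although the declaration says UTF-8.
payload_unchanged = output.endswith("<msg>Привет</msg>".encode("iso8859_5"))
try:
    output.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

In this sketch the payload bytes survive untouched and the output is not even valid UTF-8, which is the mismatch described above.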

There is another problem with using this site in the system Unicode. The default binary encoding does not convert the inbound data. It maps each byte value 00h to FFh to the code point with the same value in the Unicode zero page (U+0000 to U+00FF).

Therefore, the inbound message still contains ISO8859-5 byte values. The XML parser interprets these values as Unicode code points and essentially sees garbage. Byte data mapped to Unicode in this way can produce characters that are invalid in an XML file. The translation would then send the message to the error database.
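The following Python sketch illustrates what the parser receives. It models the binary mapping as Latin-1 decoding, which is an assumption made for illustration. The sample word is hypothetical.

```python
# Sketch only: the binary mapping modeled as Latin-1 decoding (an assumption).
def binary_to_unicode(raw: bytes) -> str:
    # Byte value n becomes code point U+00nn; no character conversion happens.
    return raw.decode("latin-1")

cyrillic = "Привет"
raw = cyrillic.encode("iso8859_5")      # the bytes that arrive on the inbound thread

as_binary = binary_to_unicode(raw)      # what the parser sees with binary encoding
as_text = raw.decode("iso8859_5")       # what a real ISO8859-5 conversion yields

# Every code point stays in the zero page, so no Cyrillic letter can appear.
code_points = [ord(ch) for ch in as_binary]
```

Here `code_points` is identical to the list of byte values, so `as_binary` contains Latin-1 symbols instead of the Cyrillic text that `as_text` recovers.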

An example of this type of failure would be an input file in UTF-16BE. If the XML file starts with an XML declaration <?xml …, then the inbound byte data (in UTF-16BE) begins with 00 3C 00 3F 00 78 00 6D 00 6C. The default binary encoding converts these ten bytes into ten Unicode characters, the first of which is U+0000. The XML parser rejects this because U+0000 followed by U+003C is not a valid beginning for an XML message.
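The UTF-16BE case can be checked with a short Python sketch. The binary mapping is again modeled as Latin-1 decoding (an assumption), and `is_xml_char` is a hypothetical helper that follows the Char production of XML 1.0. The sample declaration is illustrative.

```python
# Sketch only: binary mapping modeled as Latin-1 decoding (an assumption).
def is_xml_char(ch: str) -> bool:
    # Char production from the XML 1.0 specification.
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

declaration = '<?xml version="1.0" encoding="UTF-16BE"?>'
raw = declaration.encode("utf-16-be")

# First ten bytes of the message: 00 3C 00 3F 00 78 00 6D 00 6C
first_bytes = raw[:10].hex(" ").upper()

# Binary encoding turns every byte, including each 00, into its own character.
seen_by_parser = raw.decode("latin-1")
first_is_valid = is_xml_char(seen_by_parser[0])
```

Because the first character is U+0000, `first_is_valid` is false, and the text is twice as long as the original declaration.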

Both of these problems can be corrected by selecting XML Encoding for the inbound and outbound threads on the Network Configurator. This was the primary motivation for adding the XML encoding mode to the protocol threads.

When XML encoding is used for this example, the inbound thread extracts the ISO8859-5 encoding from the XML declaration and properly converts the message to Unicode. The XML parser treats the data correctly and the translation changes the encoding declaration to UTF-8. The outbound thread extracts the UTF-8 encoding from the message data and correctly converts the message to that encoding. This gives the same result as previous versions.
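The flow described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: the regular expression, the function names, and the sample message are hypothetical, and the sketch handles only ASCII-compatible encodings whose declaration can be read directly from the bytes.

```python
import re

# Hypothetical pattern for the encoding pseudo-attribute of an XML declaration.
DECL_RE = re.compile(r'(<\?xml[^>]*?encoding=["\'])([A-Za-z0-9._-]+)(["\'])')

def declared_encoding(head: str, default: str = "utf-8") -> str:
    # Return the encoding named in the declaration, or the default if none.
    match = DECL_RE.search(head)
    return match.group(2) if match else default

def inbound(raw: bytes) -> str:
    # Read the declaration from the ASCII-compatible head, then decode for real.
    head = raw[:200].decode("ascii", errors="replace")
    return raw.decode(declared_encoding(head))

def translate(text: str, new_encoding: str = "UTF-8") -> str:
    # Stand-in for the translation: only the declaration is rewritten.
    return DECL_RE.sub(lambda m: m.group(1) + new_encoding + m.group(3), text, count=1)

def outbound(text: str) -> bytes:
    # Encode the message with whatever encoding it now declares.
    return text.encode(declared_encoding(text[:200]))

source = '<?xml version="1.0" encoding="ISO-8859-5"?><msg>Привет</msg>'.encode("iso8859_5")
result = outbound(translate(inbound(source)))
```

In this sketch `result` is genuine UTF-8 and its declaration matches its content, unlike the binary-encoding case.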

The system cannot automatically detect which threads should or should not have XML encoding enabled, so you must review your threads and make the changes yourself. A utility is available that checks for XML-to-XML translations. You can also identify the threads that use these translations or are their destination.

For example, an XML-to-XML translation can be identified from its prologue:

prologue
    xlt_infile: xml text test
    who:       
    date:       
    xlt_outfile:  xml text test
    type:  xlt
    version:  6.0
end_prologue
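
A prologue check of this kind can be sketched in Python. This is not the utility mentioned above. It assumes that a translation is XML-to-XML when both the xlt_infile and xlt_outfile values begin with "xml", which is inferred only from the sample prologue. The function names are hypothetical.

```python
def parse_prologue(text: str) -> dict:
    # Collect "key: value" pairs found between "prologue" and "end_prologue".
    fields = {}
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "prologue":
            inside = True
            continue
        if stripped == "end_prologue":
            break
        if inside and ":" in stripped:
            key, value = stripped.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

def looks_xml_to_xml(fields: dict) -> bool:
    # Assumption: an XML format is marked by a value that starts with "xml".
    return all(
        fields.get(key, "").lower().startswith("xml")
        for key in ("xlt_infile", "xlt_outfile")
    )

sample = """prologue
    xlt_infile: xml text test
    who:
    date:
    xlt_outfile:  xml text test
    type:  xlt
    version:  6.0
end_prologue
"""

fields = parse_prologue(sample)
is_xml_translation = looks_xml_to_xml(fields)
```

Threads that route to a translation flagged this way are candidates for XML encoding, subject to your own review.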