Cloverleaf multi-byte encoding
Multi-byte encoding lets the system work with Unicode character encoding. With multi-byte, data streams are treated as sequences of Unicode code points. Because Unicode is a superset of all other character sets, the system can handle any external character set. Conversion between the external character representation and Unicode takes place in the protocol threads.
Because both Java and Tcl have Unicode support, the approach is uniform throughout the system.
In the engine, the Unicode representation is UTF-8, because UTF-8 represents ASCII characters as unmodified single bytes. This provides a high level of backward compatibility with existing versions. Message data arriving at or sent from the engine can use any codepage, including UTF-16. (The engine translates the user-selected encoding to UTF-8 for internal processing.)
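The ASCII compatibility of UTF-8 can be illustrated with a short sketch. This is not engine code: Python's built-in codecs stand in for the engine's converters, and the function name `to_internal_utf8` is hypothetical.

```python
# Illustrative sketch only; Python codecs stand in for the engine's converters.

def to_internal_utf8(raw: bytes, external_encoding: str) -> bytes:
    """Decode bytes from the external encoding, then re-encode them as UTF-8."""
    return raw.decode(external_encoding).encode("utf-8")

# ASCII input is byte-for-byte identical after conversion to UTF-8.
ascii_message = b"Hello, world"
ascii_internal = to_internal_utf8(ascii_message, "ascii")

# A UTF-16 message changes size and layout when converted.
utf16_message = "Müller".encode("utf-16-le")   # 2 bytes per character here
utf16_internal = to_internal_utf8(utf16_message, "utf-16-le")
```

In this sketch `ascii_internal` equals `ascii_message`, which is the property that keeps ASCII-only data unchanged inside a UTF-8 engine.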
Multi-byte is not intended to provide a full multilingual user interface, because the command-line utilities are based on the locale codepage. The codepage usually does not support mixing languages beyond ASCII and the locale language. Multi-byte transports and transforms data messages in different languages, such as English, French, Arabic, and Chinese, in their respective OS environments. The GUI for development, administration, and monitoring of the system can be in English. All data strings in the GUI are displayed properly in the different languages.
Within a single language, different messages can use different encodings. A variety of codepages is supported for both incoming and outgoing data messages.
You can specify an external encoding for each thread. The specified encoding applies to both inbound and outbound messages for that thread.
You can choose to use a byte order mark. If a byte order mark is found in the inbound data stream, then the encoding is determined automatically.
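Byte order mark detection can be sketched as follows. The table and the function name `detect_bom` are illustrative assumptions, not the engine's implementation; the mark values come from Python's standard `codecs` module.

```python
import codecs

# Longer marks are listed first, because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(raw: bytes):
    """Return (encoding, payload_without_mark), or (None, raw) if no mark is found."""
    for bom, name in BOM_TABLE:
        if raw.startswith(bom):
            return name, raw[len(bom):]
    return None, raw
```

When no mark is present, the sketch returns `None` so that the caller can fall back to the encoding configured for the thread.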
You can choose the XML encoding mode. The XML mode determines the encoding dynamically from an XML declaration. XML encoding works for inbound and outbound messages. XML encoding and byte order mark detection can both be specified for inbound messages. In this case, a byte order mark takes precedence if found. If no byte order mark or XML declaration is found, then the XML mode assumes UTF-8 encoding.
You can choose to generate a byte order mark. This applies to outbound messages. Byte order marks are generated for Unicode UTF encodings, that is, UTF-8, UTF-16, and so on.
If no encoding options are specified, then binary encoding is assumed. Binary encoding is an identity mapping; that is, byte values 00-FF hex translate into Unicode code points 00-FF. Pre-existing sites that are upgraded use the default binary encoding to maintain existing behavior.
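The inbound precedence described above (byte order mark, then XML declaration, then UTF-8) and the binary identity mapping can be sketched as follows. The names `resolve_inbound_encoding` and `binary_decode` are hypothetical, and the regular expression is a simplification that only recognizes declarations written in an ASCII-compatible encoding.

```python
import codecs
import re

# Longer marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches encoding="..." inside an XML declaration at the start of the data.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def resolve_inbound_encoding(raw: bytes) -> str:
    """Byte order mark first, then XML declaration, then the UTF-8 default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    match = _XML_DECL.match(raw)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"

def binary_decode(raw: bytes) -> str:
    """Identity mapping: byte value n becomes code point n (Latin-1 in Python)."""
    return raw.decode("latin-1")
```

For example, `resolve_inbound_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')` returns `"iso-8859-1"`, while the same data preceded by a UTF-8 byte order mark resolves to `"utf-8"`.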
Backward compatibility includes existing configuration files, engine behavior, and system resource consumption.
The operation of existing sites is unchanged. Any incompatibilities with existing files are documented with conversion procedures for migration of existing sites.
Some performance degradation is to be expected because there is more data to process for each character. In some cases, increased internal efficiency provides a performance increase instead.
Unicode is supported in the user interface, permitting display of foreign characters and entry of foreign characters from the keyboard. Foreign characters are permitted in object names where practical. This includes file names, directory names, and so on. Foreign characters are permitted in file names used for data transfer.
Process names must be ASCII.
Encoding conversions can be bypassed for inbound and outbound messages as a performance enhancement. The engine stores data internally as UTF-8. Therefore, threads that use UTF-8 or ASCII data can use bypass encoding to eliminate the performance penalty of encoding conversion. A negative effect of using bypass encoding is that the input data is not validated, so invalid byte streams can get into the engine. This does not cause any serious problems, but invalid messages may error out inside the engine.
Using bypass encoding is your guarantee that the data is UTF-8 or ASCII, so that encoding conversions can be skipped for increased performance. Selecting UTF-8 or ASCII as the encoding, by contrast, performs a conversion. Even though a UTF-8 or ASCII encoding conversion does not change any data, it invokes the associated validation processes.
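The difference between bypass and a validating UTF-8 conversion can be sketched as follows. The function `convert_inbound` and its `bypass` parameter are hypothetical names for illustration; Python's strict decoder stands in for the engine's validation.

```python
def convert_inbound(raw: bytes, encoding: str = "utf-8", bypass: bool = False) -> bytes:
    """Return bytes for internal UTF-8 processing."""
    if bypass:
        # The caller guarantees UTF-8 or ASCII; nothing is checked.
        return raw
    # A strict decode raises UnicodeDecodeError on invalid byte sequences.
    return raw.decode(encoding, errors="strict").encode("utf-8")

good = "café".encode("utf-8")
bad = b"caf\xe9"  # a Latin-1 byte sequence that is not valid UTF-8

validated = convert_inbound(good)            # same bytes, but checked
passed_through = convert_inbound(bad, bypass=True)  # invalid bytes enter unchecked
```

Calling `convert_inbound(bad)` without bypass raises `UnicodeDecodeError`, which corresponds to the validation that a UTF-8 or ASCII conversion performs even though it leaves valid data unchanged.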
Inbound encoding conversion errors can make it impossible to convert the inbound data to Unicode. It is therefore necessary for inbound SMAT and recovery state 1 to be raw data. Then messages with encoding errors can be examined, corrected, and resent in a consistent fashion.
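The reason for keeping raw data can be shown with a small sketch: the original bytes are recorded before any conversion is attempted, so a message that fails conversion is still available for inspection and resend. The function `receive` and the record layout are assumptions for illustration, not the SMAT or recovery database format.

```python
def receive(raw: bytes, encoding: str) -> dict:
    """Keep the raw bytes in every case; attach decoded text only if conversion succeeds."""
    record = {"raw": raw, "text": None, "error": None}
    try:
        record["text"] = raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Conversion failed, but the untouched input is still in record["raw"].
        record["error"] = str(exc)
    return record

ok_record = receive(b"abc", "ascii")
failed_record = receive(b"\xff\xfeabc", "utf-8")
```

Here `failed_record["text"]` is `None`, while `failed_record["raw"]` still holds exactly the bytes that arrived.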
An on-screen keyboard is available with most operating systems to assist with entering file names and text that use different characters.