Legacy TSS versus UTF-T
The old TSS is called legacy TSS. The embedding of Unicode in TSS is called UTF-T, analogously to the names UTF-8, UTF-16, and UTF-32 for the three standard Unicode Encoding Forms.
TSS consists of two types of characters: single byte and multibyte. Single byte characters use the hexadecimal values 00 to FF, except 9B. Multibyte TSS characters use a sequence of 4 bytes, the first of which has the hexadecimal value 9B.
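Because 9B never occurs as a single byte character, a TSS byte string can be split into characters with a simple scan. The following is a minimal sketch of that rule only; the function name `split_tss` is hypothetical and not part of any product API.

```python
LEAD_BYTE = 0x9B  # lead byte of every 4-byte TSS character


def split_tss(data: bytes) -> list:
    """Split a TSS byte string into 1-byte and 4-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte announces a 4-byte character; anything else stands alone.
        width = 4 if data[i] == LEAD_BYTE else 1
        if i + width > len(data):
            raise ValueError(f"truncated 4-byte TSS character at offset {i}")
        chars.append(data[i:i + width])
        i += width
    return chars


# Usage: one ASCII byte, one 4-byte character, one ASCII byte.
print(split_tss(b"A\x9b\x21\xb0\xa1B"))
```

The sketch does not validate the three bytes that follow the lead byte; it only applies the width rule stated above.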
This table describes how all supported native character sets are mapped into the available TSS space.
TSS range (hexadecimal) | Meaning |
---|---|
00 - 7F | ASCII character. |
80 - 8A | Line drawing character. |
8B - 9A | Code feature. |
9B | Lead byte for 4-byte TSS characters |
9C-9E | Reserved for future use. |
9F | Used to represent the Euro Symbol in a Cyrillic context. In ISO8859-5 there is no room available in the normally used range A0 - FF. |
A0 - FF | Is ambiguous. It corresponds to a 'high ASCII' character in one of the ISO8859-n character sets. Can be converted (without actual conversion) to the correct ISO8859-n character set, or (often with some offset) to the corresponding Windows Code Page. |
9B 21 pp qq | Japanese (Kanji). Can be converted to Kanji EUC, Shift JIS, or Windows Code Page 932 |
9B 23 21 pp | Single width Japanese. Can be converted to Kanji EUC, Shift JIS, or Windows Code Page 932 |
9B 25 pp qq | Simplified Chinese. Can be converted to GB2312-80 or Windows Code Page 936 |
9B 27 pp qq | Traditional Chinese. Can be converted to Big 5 or Windows Code Page 950 |
9B 31 pp qq | Korean (Wansung). Can be converted to Wansung or Code Page 949. |
9B 32 pp qq | Korean (Johab). Can be converted to Johab or Code Page 1361. |
9B 9C 9D nn with: 40 ≤ nn ≤ BF | Is ambiguous. It corresponds to Microsoft extension character nn + 0x40 in one of the Windows Code Pages corresponding to an ISO8859-n character set. Can be converted to the correct Windows Code Page. The value nn + 0x40 is in the high ASCII range 0x80 - 0xFF, so nn is in the range 0x40 - 0xBF. Only 55 positions are actually used, and only 3 of them are actually ambiguous. |
The ambiguity that is shown in the table for TSS characters in the ranges A0 – FF and 9B 9C 9D 40 – 9B 9C 9D BF is resolved by interpreting these characters in the context of the character set of the current locale.
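The legacy TSS ranges can be expressed as a small classifier. This is an illustrative sketch built only from the table; the names `classify_legacy_tss` and `windows_extension_byte` are hypothetical, and the returned labels are shortened forms of the table's "Meaning" column. It does not resolve the locale-dependent ambiguity, it only reports it.

```python
# Second byte of a 4-byte legacy TSS character -> meaning, per the table.
MULTIBYTE_MEANING = {
    0x21: "Japanese (Kanji)",
    0x25: "Simplified Chinese",
    0x27: "Traditional Chinese",
    0x31: "Korean (Wansung)",
    0x32: "Korean (Johab)",
}


def classify_legacy_tss(ch: bytes) -> str:
    """Return the table meaning of one legacy TSS character."""
    if len(ch) == 1:
        b = ch[0]
        if b <= 0x7F:
            return "ASCII"
        if b <= 0x8A:
            return "line drawing"
        if b <= 0x9A:
            return "code feature"
        if b == 0x9B:
            raise ValueError("9B is a lead byte, not a complete character")
        if b <= 0x9E:
            return "reserved"
        if b == 0x9F:
            return "Euro symbol (Cyrillic context)"
        return "high ASCII (locale dependent)"
    if len(ch) == 4 and ch[0] == 0x9B:
        if ch[1:3] == b"\x23\x21":
            return "single width Japanese"
        if ch[1:3] == b"\x9c\x9d" and 0x40 <= ch[3] <= 0xBF:
            return "Microsoft extension (locale dependent)"
        # Anything not listed in the table is reported, not guessed at.
        return MULTIBYTE_MEANING.get(ch[1], "not a legacy TSS character")
    raise ValueError("a TSS character is 1 byte, or 4 bytes starting with 9B")


def windows_extension_byte(ch: bytes) -> int:
    """For 9B 9C 9D nn, return nn + 0x40 (a value in 0x80 - 0xFF)."""
    return ch[3] + 0x40
```

Which Windows Code Page the resulting byte belongs to still depends on the character set of the current locale.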
The embedding of Unicode in TSS uses the single byte ASCII range and a major part of the remaining TSS space, as shown in the following table. Notice that this embedding does not increase the existing ambiguity described earlier.
TSS range (hexadecimal) | Meaning |
---|---|
00 - 7F | ASCII character. Corresponds to the first 128 Unicode characters U+0000 - U+007F. Can be converted (without actual conversion) to single byte UTF-8 or to single word UTF-16. |
9B pp qq rr with: BC ≤ pp ≤ BF, 80 ≤ qq ≤ FF, 80 ≤ rr ≤ FF | UTF-T corresponding to the first 2^16 Unicode characters U+0000 - U+FFFF, the so-called Basic Multilingual Plane (BMP), except for the first 128 Unicode characters U+0000 - U+007F (corresponding to the ASCII character set, and mapped to single byte TSS). Can be converted algorithmically (bit shuffling) to 2-byte or 3-byte UTF-8 or single word UTF-16. |
9B pp qq rr with: C0 ≤ pp ≤ FF, 80 ≤ qq ≤ FF, 80 ≤ rr ≤ FF | UTF-T corresponding to the 2^20 so-called Supplementary Unicode characters U+010000 - U+10FFFF. Can be converted algorithmically (bit shuffling) to 4-byte UTF-8 or double word UTF-16. |
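The byte ranges in the table offer exactly 2 + 7 + 7 = 16 free bits for the BMP (pp has 4 values, qq and rr have 128 each) and 6 + 7 + 7 = 20 free bits for the supplementary characters. This text does not specify how the code point bits are distributed over pp, qq, and rr, so the sketch below assumes the most straightforward packing: most significant bits in pp, then qq, then rr. Treat the layout and the names `encode_utf_t` and `decode_utf_t` as assumptions for illustration, not as the product's actual algorithm.

```python
def encode_utf_t(cp: int) -> bytes:
    """Pack a code point into TSS bytes (ASSUMED bit layout, see lead-in)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp <= 0x7F:
        return bytes([cp])          # ASCII stays single byte TSS
    if cp <= 0xFFFF:
        value = cp                  # 16 bits
        pp = 0xBC | (value >> 14)   # top 2 bits -> BC..BF
    else:
        value = cp - 0x10000        # 20 bits
        pp = 0xC0 | (value >> 14)   # top 6 bits -> C0..FF
    qq = 0x80 | ((value >> 7) & 0x7F)
    rr = 0x80 | (value & 0x7F)
    return bytes([0x9B, pp, qq, rr])


def decode_utf_t(ch: bytes) -> int:
    """Inverse of encode_utf_t under the same ASSUMED layout."""
    if len(ch) == 1 and ch[0] <= 0x7F:
        return ch[0]
    if (len(ch) != 4 or ch[0] != 0x9B or ch[1] < 0xBC
            or ch[2] < 0x80 or ch[3] < 0x80):
        raise ValueError("not a UTF-T character")
    low14 = ((ch[2] & 0x7F) << 7) | (ch[3] & 0x7F)
    if ch[1] <= 0xBF:               # BMP row of the table
        return ((ch[1] & 0x03) << 14) | low14
    return 0x10000 + (((ch[1] & 0x3F) << 14) | low14)
```

Whatever the real layout is, the range arithmetic holds: every pp value from BC to FF, combined with qq and rr in 80 to FF, is accounted for by either the BMP or the supplementary characters.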
When converting from TSS to some other character set, the Enterprise Server porting set can interpret both legacy TSS and UTF-T. In the other direction, when converting from some other character set to TSS, the so-called TssMode determines whether legacy TSS or UTF-T is produced. The TssMode is determined by the content of the $BSE/lib/tss_mbstore6.2 file. If the first line of this file consists of exactly the text “UTF-T”, then the mode is called ‘UTF-T mode’ and the conversion produces UTF-T. Otherwise the mode is called ‘legacy mode’ and the conversion produces legacy TSS.
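The TssMode rule can be checked from a script. The following is a minimal sketch, not the porting set's own logic: the function name `tss_mode` is hypothetical, and treating a missing file as legacy mode is an assumption based on the "otherwise" clause above.

```python
import os
from pathlib import Path


def tss_mode(bse_dir=None) -> str:
    """Return "UTF-T" or "legacy" based on $BSE/lib/tss_mbstore6.2."""
    root = Path(bse_dir if bse_dir is not None else os.environ["BSE"])
    path = root / "lib" / "tss_mbstore6.2"
    try:
        # latin-1 accepts any byte, so unexpected content cannot raise here.
        with open(path, encoding="latin-1") as f:
            first_line = f.readline().rstrip("\r\n")
    except FileNotFoundError:
        return "legacy"  # ASSUMPTION: no file means legacy mode
    # "Exactly" is taken literally: no case folding, no whitespace trimming.
    return "UTF-T" if first_line == "UTF-T" else "legacy"
```

Calling `tss_mode()` without an argument reads the `BSE` environment variable; passing a directory explicitly is convenient for testing.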
The following table summarizes the differences between legacy mode and UTF-T mode, including the conversion behavior controlled by the TssMode. These differences are so fundamental that the mode cannot be switched at arbitrary moments. Switching from legacy mode to UTF-T mode is allowed, at the price of a complete database conversion. Switching back from UTF-T mode to legacy mode is not allowed.
Legacy mode | UTF-T mode |
---|---|
Conversion from any character set to TSS produces legacy TSS | Conversion from any character set to TSS produces UTF-T |
Conversion from any Unicode encoding (UTF-8 or UTF-16) to TSS uses conversion tables and fails for characters which do not exist in the character set of the current locale. | Conversion from any Unicode encoding (UTF-8 or UTF-16) to TSS does not need conversion tables and will not fail. |
Multi Byte Data in the database is stored in the native character set of the database | Multi Byte Data in the database is stored in Unicode (probably UTF-16, possibly UTF-8); Single Byte Data is stored in a native locale |
Each user must use a language and corresponding locale of which the character set corresponds to the character set used in the database | Each user can choose a language and locale, independent of the character set used in the database. |
Single Byte Data in the database is sorted using binary sort, e.g. A Z a z À Ý à ý | Data in the database is sorted according to the Unicode Collation Algorithm, e.g. a A à À ý Ý z Z |
For single byte character sets, the non-ASCII characters are mapped to single byte TSS (with some exceptions which are mapped to 9B 9C 9D nn) | For single byte character sets, the non-ASCII characters are mapped to 4-byte UTF-T |
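The legacy-mode sort order quoted in the table can be reproduced by sorting on byte values. This sketch assumes ISO 8859-1 as the single byte character set; the function name `binary_sort` is hypothetical. The Unicode Collation Algorithm order needs a collation library and is not shown here.

```python
def binary_sort(items):
    """Sort strings by their ISO 8859-1 byte values (binary sort)."""
    return sorted(items, key=lambda s: s.encode("latin-1"))


words = ["a", "A", "à", "À", "ý", "Ý", "z", "Z"]
print(" ".join(binary_sort(words)))  # A Z a z À Ý à ý
```

Uppercase sorts before lowercase, and all accented letters sort after "z", because the order follows the code values rather than the alphabet.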