Character sets
A character set is a set of alphabetic or other characters used to construct the words and other elementary units of one or more native languages.
During the installation of the LN application you must choose a character set. Only one character set applies to the whole LN environment, so you can store only those languages that the chosen character set supports.
You can choose these character set types:
- single byte character sets
- multibyte character sets
- Unicode character set
Single byte character sets
Single byte character sets require one byte to store each character. Therefore, a maximum of 256 characters is available. The ISO 8859 standard defines several character sets, also called locales, that cover the characters of mainly the European languages.
Examples of single byte character sets are:
- ISO 8859-1: mainly Western European languages, such as French, German, and Italian
- ISO 8859-5: languages written in the Cyrillic script, such as Russian
The lower range, characters 0 to 127, is the same for all ISO 8859 character sets. The upper range, characters 128 to 255, is specific to each locale.
The basic Latin alphabet is encoded in the lower range. English does not require any additional characters, so it is supported by every ISO 8859 locale.
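The shared lower range and the locale-specific upper range can be illustrated with Python's built-in ISO 8859 codecs. This is a minimal sketch that uses Python codec names, not LN locale names, and it only demonstrates the encoding standard itself.

```python
# Bytes 0..127 decode to the same characters in every ISO 8859 variant.
lower_range = bytes(range(128))
codec_names = ["iso8859_1", "iso8859_5", "iso8859_7"]

decoded_lower = {name: lower_range.decode(name) for name in codec_names}
lower_is_shared = len(set(decoded_lower.values())) == 1

# A byte from the upper range (here 0xE9) means something different per variant.
decoded_upper = {name: bytes([0xE9]).decode(name) for name in codec_names}

print(lower_is_shared)
print(decoded_upper)
```

The first value printed is True because the lower range is identical in all three codecs, while the dictionary shows three different characters for the same upper-range byte.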
Sorting is binary based, that is, it follows the order in which the characters are defined in the encoding. As a result, all uppercase alphabetical characters are sorted before the lowercase alphabetical characters. For example, ‘Z’ is sorted before ‘a’.
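A minimal sketch of binary sorting, assuming the strings are compared by their ISO 8859-1 byte values. The word list is made up for illustration.

```python
# Sort by encoded byte values, which is what "binary sorting" means here.
words = ["apple", "Zebra", "banana", "Apple"]
binary_sorted = sorted(words, key=lambda w: w.encode("iso8859_1"))

print(binary_sorted)  # ['Apple', 'Zebra', 'apple', 'banana']
```

Every word that starts with an uppercase letter comes first, because the uppercase letters have lower code points (65 to 90) than the lowercase letters (97 to 122).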
Multibyte character sets
Multibyte character sets are typically required for languages that have more than 256 characters. A typical example is Chinese. In the context of LN, the multibyte character sets require 4 bytes per character.
Examples of multibyte character sets are:
- BIG5: Chinese
- Wansung: Korean
Sorting is binary based.
Unicode character set
The Unicode character set is a standardized character set supporting all (modern) languages. This takes away the limitation of supporting a small set of languages within one LN environment. When you choose Unicode as character set, you can have for example Chinese, English and French in one LN environment.
Another advantage of the Unicode character set is that it comes with linguistic sorting rules. When the data must be visualized in a sorted form, the data is shown based on the sorting rules as defined by the ICU standard.
A drawback of Unicode is that the database of a Unicode based LN environment is bigger, and the CPU and memory load on the system are higher than for a multibyte or single byte character set. Unicode is therefore typically chosen when multiple languages must be supported or when linguistic sorting is preferred.
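The difference between binary sorting and linguistic sorting can be sketched as follows. The key function below is only a crude stand-in for linguistic rules: it ignores case and accents by using the standard `unicodedata` module. It is not ICU and does not reproduce the ICU collation rules that LN uses. The function name `rough_linguistic_key` and the word list are hypothetical.

```python
import unicodedata

def rough_linguistic_key(word: str) -> str:
    """Approximate a linguistic sort key: strip accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()

words = ["Zebra", "apple", "\u00c9clair", "banana"]

by_code_point = sorted(words)                          # binary order
by_rough_rules = sorted(words, key=rough_linguistic_key)

print(by_code_point)   # ['Zebra', 'apple', 'banana', 'Éclair']
print(by_rough_rules)  # ['apple', 'banana', 'Éclair', 'Zebra']
```

In binary order the accented word ends up last and the capitalized word first, which is rarely what a reader expects. A real collation library handles many more cases, such as language-specific letter orders.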
High Ascii Tolerance
You must set the high_ascii_tolerance resource to 0 in these situations:
- If your LN environment is a production environment and you plan to convert the environment to Unicode.
- If your LN environment is a development environment and you plan to deliver software components or translations. If high_ascii_tolerance is not set to 0, problems occur when the delivered components are imported in Unicode environments.
To set high_ascii_tolerance to 0, add this line in the $BSE/lib/defaults/all file:
high_ascii_tolerance:0
The role of the user locale
This section describes the role of the user locale in these types of installations:
- single-byte
- multibyte
- Unicode
The role of the user locale in a single-byte installation
In a single-byte installation the user locale defines the character set that can be used throughout the application.
- Use a user locale that matches the character set used in the database. In this way third-party database tools can access the data.
- Use a locale that defines a binary sorting order, also known as collation.
- Choose a binary sorting order in the database as well. For example, in West-European countries you can use the ISO_BIN1 locale. The character set of this locale is the same as the ISO-8859-1 locale, but the sorting order is binary.
- Ensure all users have the same user locale. Note that this is not enforced by the tools or porting set.
The user locale affects these processes:
- The way data is stored in the database. The user locale determines the code points that are used to store single-byte characters.
- Some bshell (3GL) functions:
  - mb.locale.info(), including the TSS_GET_IFACTOR and TSS_GET_EFACTOR aspects
  - set.min()
  - set.max()
  - set.fmin()
  - set.fmax()
- The non-Unicode version of BWPrint, which must convert to the correct Windows code page. Note: to convert data from the ISO locales to Windows code pages, BWPrint uses the _WIN32 versions of the user locale.
The role of the user locale in a multibyte installation
In a multibyte installation the user locale defines the character set that can be used throughout the application.
- Use a user locale that matches the character set used in the database. This character set must use binary sorting.
- Data is stored in the database using the character set of the user locale. Ensure the correct character set is specified for the database, so the database treats the characters in the correct way. In this way third-party database tools can access the data. Otherwise the data could show up garbled.
- Ensure all users have the same user locale. Note that this is not enforced by the tools or porting set. If users have different user locales, conversion errors occur when a user processes data of another user who has a different user locale. This mainly impacts processes that write to and read from the database. Consequently, the user locale impacts any integration that interacts with the database.
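The effect of mismatched user locales can be sketched with Python codecs. This is an illustration of the general problem, not of LN's conversion routines: one hypothetical user writes text with a BIG5 locale, and another reads the same bytes with a Korean (Wansung-style, `euc_kr` in Python) locale.

```python
# Text written by a user whose locale is BIG5.
original = "\u4e2d\u6587"               # two Chinese characters
stored_bytes = original.encode("big5")

# A user with the same locale reads the data back unchanged.
read_same_locale = stored_bytes.decode("big5")

# A user with a different locale interprets the same bytes differently.
read_other_locale = stored_bytes.decode("euc_kr", errors="replace")

print(read_same_locale == original)    # True
print(read_other_locale == original)   # False
```

The bytes in the database do not change, but their meaning depends on the locale of the reader, which is why all users should share one user locale.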
The user locale affects these processes:
- The conversion of data from “native” format (MBCS) to TSS and vice versa. The user locale determines the meaning of the term “native”.
- The way data is stored in the database. The user locale determines the code points that are used to store multibyte characters.
- Some bshell (3GL) functions:
  - mb.import$(). This function converts data from “native” format to TSS.
  - mb.export$(). This function converts data from TSS to “native”.
  - utf8.export$(). The Baan IV porting set uses the “native” format as intermediate format to convert from TSS to UTF-8. If the user locale does not match the data, this can result in conversion errors.
  - mb.width(). This function returns the width of a string, where width is defined in “number of display positions”. For example, in the ISO8859n character sets, the LATIN SMALL LETTER E WITH ACUTE character takes one display position. In the GB2312 character set, it takes two display positions.
  - mb.locale.info(), including the TSS_GET_IFACTOR and TSS_GET_EFACTOR aspects.
  - set.min()
  - set.max()
  - set.fmin()
  - set.fmax()
- The appearance of log messages. The text in log messages is converted from the TSS character set to the “native” format.
- The non-Unicode version of BWPrint, which must convert TSS data to “native” format.
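The notion of “number of display positions” used by mb.width() can be approximated with the East Asian Width property from Python's `unicodedata` module. This is a sketch under assumptions: `display_width` and its `east_asian` flag are hypothetical names, and the real mb.width() may use different rules. Characters with ambiguous width, such as é, are counted as two positions only in an East Asian context such as GB2312.

```python
import unicodedata

def display_width(text: str, east_asian: bool) -> int:
    """Count display positions; ambiguous-width characters are wide
    only when the locale is East Asian."""
    width = 0
    for ch in text:
        kind = unicodedata.east_asian_width(ch)
        if kind in ("W", "F") or (kind == "A" and east_asian):
            width += 2
        else:
            width += 1
    return width

e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
width_iso = display_width(e_acute, east_asian=False)
width_gb = display_width(e_acute, east_asian=True)

print(width_iso, width_gb)  # 1 2
```

The same character therefore occupies one position under an ISO 8859 style locale and two under an East Asian one, which matches the behavior described for mb.width().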
The role of the user locale in a Unicode installation
Because of the introduction of the Unicode character set, the role of the user locale has become less important. In a pure Unicode environment all characters are represented by unique code points, and every code point has a unique interpretation. However, there are still some areas where conversions from and to Unicode occur.
Example
You work in a Unicode environment, but your personal user locale is ISO8859. You want to exchange data between the Unicode environment and another environment. When you perform an export from the Unicode environment, for example through LN Data Director or EDI, the export files are in ISO8859 format.
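What such an export can mean for the data is sketched below, assuming that the ISO8859 user locale corresponds to ISO 8859-1 and that characters outside that set are replaced. Both are assumptions for illustration. How LN actually handles characters that cannot be converted is not described here.

```python
# Unicode text that contains a Latin-1 character and a Chinese character.
unicode_text = "caf\u00e9 \u4e2d"

# Export to ISO 8859-1; characters that do not exist there become '?'.
exported = unicode_text.encode("iso8859_1", errors="replace")

print(exported)  # b'caf\xe9 ?'
```

The é survives because it exists in ISO 8859-1, but the Chinese character cannot be represented in the export file.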
The user locale has no effect on these processes:
- The way data is stored in the database. All data in multibyte columns is stored in Unicode. The data in single-byte columns is stored “as is” in Unicode. Effectively it is interpreted in the ISO-8859-1 locale. For example, the LATIN CAPITAL LETTER A is stored as Unicode code point 0x41. The code point 0xe9 is stored as the Unicode code point 0xe9, which represents the LATIN SMALL LETTER E WITH ACUTE character (é).
- The normal operation of the bshell. This excludes these functions:
  - Conversion functions, such as mb.export$() and mb.import$().
  - Functions to acquire information about the current user locale, such as mb.locale.info().
  These functions are not affected by the user locale:
  - set.min()
  - set.max()
  - set.fmin()
  - set.fmax()
- The appearance of log messages. The text that is in log messages is converted from UTF-T to UTF-8 format.
- The Unicode version of BWPrint.
The user locale has a small effect on the dump files as created by the bdbpre utility. Data in the bdbpre dump files is in the UTF-8 character set. If the database contains “high ascii” characters, these characters are converted in the context of the current user locale. Note that the high_ascii_tolerance resource has no effect on this process. See the comment on the conversion of “high ascii” characters.
The user locale affects these processes:
- The conversion of data from “native” format (MBCS) to Unicode (UTF-T) and vice versa. The user locale determines the meaning of the term “native”.
- The conversion of so-called “high ascii” characters. See below.
- Some bshell (3GL) functions:
  - mb.import$(). This function converts data from “native” format to UTF-T.
  - mb.export$(). This function converts data from UTF-T to “native”.
  - mb.width(). This function returns the width of a string, where width is defined in “number of display positions”. For example, in the ISO8859n character sets, the LATIN SMALL LETTER E WITH ACUTE character takes one display position. In the GB2312 character set, it takes two display positions.
  - mb.locale.info(), excluding the TSS_GET_IFACTOR and TSS_GET_EFACTOR aspects.
- The non-Unicode version of BWPrint, which must convert UTF-T data to “native” format. This can result in conversion errors because the “native” character set supports only a limited subset of UTF-T. Therefore, we recommend that you use the Unicode version of BWPrint in a Unicode installation.
Conversion of “high ascii” characters
The occurrence of “high ascii” characters poses a problem, because one code point can have different meanings in different character sets.
Example
In the ISO-8859-1 locale, the code point 0xe9, decimal 233, is interpreted as the LATIN SMALL LETTER E WITH ACUTE character (é).
In the ISO-8859-7 locale, this code point is interpreted as the GREEK SMALL LETTER IOTA character (ι).
To determine the meaning of a “high ascii” character, LN uses the current user locale. If the user locale is an ISO8859n variant, then this character set is used to determine the correct meaning. Otherwise the ISO88591 character set is used.
Example
The user locale is ISO88597. A string, which contains the 0xe9 code point, must be converted to UTF-T. The code point is interpreted as the GREEK SMALL LETTER IOTA character. The resulting UTF-T code point is 0x9bbc87b9.
The user locale is ISO88591 and the same string must be converted. The code point is interpreted as the LATIN SMALL LETTER E WITH ACUTE character. The resulting UTF-T code point is 0x9bbc81e9.
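The selection rule described above can be sketched in Python. The function `interpret_high_ascii` is a hypothetical helper, not an LN function, and the mapping from LN locale names such as ISO88597 to Python codec names is an assumption. The sketch yields standard Unicode code points, not the UTF-T values shown in the example.

```python
import re

def interpret_high_ascii(byte_value: int, user_locale: str) -> str:
    """Decode one byte using the ISO 8859 variant named by the user
    locale, or ISO 8859-1 if the locale is not an ISO8859n variant."""
    match = re.fullmatch(r"ISO8859(\d+)", user_locale)
    codec = f"iso8859_{match.group(1)}" if match else "iso8859_1"
    return bytes([byte_value]).decode(codec)

greek = interpret_high_ascii(0xE9, "ISO88597")     # GREEK SMALL LETTER IOTA
latin = interpret_high_ascii(0xE9, "ISO88591")     # LATIN SMALL LETTER E WITH ACUTE
fallback = interpret_high_ascii(0xE9, "GB2312")    # not ISO8859n: falls back

print(hex(ord(greek)), hex(ord(latin)), hex(ord(fallback)))  # 0x3b9 0xe9 0xe9
```

The same byte produces two different Unicode characters depending only on the user locale, which is the ambiguity that a clean installation avoids.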
We recommend that you keep the installation clean from “high ascii” characters. To achieve this, set the high_ascii_tolerance resource to 0.
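To see whether a piece of single-byte data contains “high ascii” characters at all, you can scan it for bytes above 127. This is a minimal sketch with a hypothetical helper name, `find_high_ascii`. It is not part of LN and does not replace the high_ascii_tolerance resource.

```python
def find_high_ascii(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte above 0x7F."""
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]

sample = "caf\u00e9".encode("iso8859_1")
hits = find_high_ascii(sample)

print(hits)  # [(3, 233)]
```

An empty result means the data uses only the lower range, which has the same meaning in every ISO 8859 character set.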