Character sets

A character set is a set of alphabetic and other characters used to construct the words and other elementary units of one or more native languages.

During the installation of the LN application, you must choose a character set. Only one character set applies to the whole LN environment. Therefore, only languages that the chosen character set supports can be stored.

You can choose from the following character set types:

  • single-byte character sets
  • multi-byte character sets
  • Unicode character set

Single-byte character sets

Single-byte character sets need only one byte to store each character. As a consequence, at most 256 characters are available. The ISO 8859 standard defines several character sets, also called locales, that mainly cover the characters of the European languages.

Examples of single-byte character sets are:

  • ISO 8859-1: mainly West European languages, such as French, German, and Italian
  • ISO 8859-5: languages written in the Cyrillic script, such as Russian

The lower range, characters 0-127, is the same for all ISO 8859 character sets. The upper range, characters 128-255, is specific to each locale.

The basic Latin alphabet is encoded in the lower range. Therefore, every ISO 8859 locale supports the English language, which does not need any additional characters.
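
The shared lower range can be checked with a short sketch. This is an illustration only: it uses Python's built-in ISO 8859 codecs, not LN tooling, and the codec names (`iso8859_1`, `iso8859_5`) are Python's names, not LN locale names.

```python
# Illustration only: Python's built-in codecs stand in for the ISO 8859 locales.
lower = bytes(range(0, 128))     # lower range, code points 0-127
upper = bytes(range(160, 256))   # printable part of the upper range

# The lower range decodes to the same characters in both locales.
same_lower = lower.decode("iso8859_1") == lower.decode("iso8859_5")

# The upper range is specific to each locale.
same_upper = upper.decode("iso8859_1") == upper.decode("iso8859_5")

# English text uses only the lower range, so its bytes are identical in both.
text = "Hello, world"
same_english = text.encode("iso8859_1") == text.encode("iso8859_5")

print(same_lower, same_upper, same_english)
```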

Sorting is binary: characters are sorted in the order in which they are defined in the encoding. For instance, all uppercase alphabetical characters are sorted before the lowercase alphabetical characters, so ‘Z’ is sorted before ‘a’.
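
A minimal sketch of this ordering, assuming Python's default string comparison (which compares code points) as a stand-in for a binary sort; the word list is made up for illustration:

```python
# Python compares strings by code point, which mirrors a binary sort order.
words = ["apple", "Zebra", "banana", "Apple"]

binary_sorted = sorted(words)
print(binary_sorted)  # ['Apple', 'Zebra', 'apple', 'banana']
```

Every word that starts with an uppercase letter comes first, because the code points of 'A'-'Z' (65-90) are lower than those of 'a'-'z' (97-122).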

Multi-byte character sets

Multi-byte character sets are typically required for languages that have more than 256 characters. A typical example is Chinese. In the context of LN, multi-byte character sets require 4 bytes per character.

Examples of multi-byte character sets are:

  • BIG5: Chinese
  • Wansung: Korean

Sorting is binary based.

Unicode character set

The Unicode character set is a standardized character set that supports all modern languages. This removes the limitation of supporting only a small set of languages within one LN environment. When you choose Unicode as the character set, you can have, for example, Chinese, English, and French in one LN environment.

Another advantage of the Unicode character set is that it comes with linguistic sorting rules. When the data must be visualized in a sorted form, the data will be shown based on the sorting rules as defined by the ICU standard.

Note

The ICU standard also defines ‘tailoring’, that is, fine-tuning the sorting rules for a specific language. The LN tools do not support tailoring.

The drawback of Unicode is that the database of a Unicode-based LN environment is larger, and the CPU and memory load on the system is higher, than with a multi-byte or single-byte character set. Unicode is typically chosen when multiple languages must be supported or when linguistic sorting is preferred.

High ASCII tolerance

Caution

The following applies only to LN environments that do not run in Unicode mode.

You must set the high_ascii_tolerance resource to 0 in the following situations:

  • If your LN environment is a production environment and you plan to convert the environment to Unicode.
  • If your LN environment is a development environment and you plan to deliver software components or translations. If high_ascii_tolerance is not set to 0, problems occur when the delivered components are imported into Unicode environments.

To set high_ascii_tolerance to 0, add the following line to the $BSE/lib/defaults/all file:

high_ascii_tolerance:0
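
To verify the setting, the file contents could be checked with a small script. The sketch below assumes that the defaults file holds one `name:value` resource per line, as in the line above; that format, the helper name `read_resource`, and the resource name `some_other_resource` in the sample are assumptions for illustration, not documented LN behavior.

```python
def read_resource(defaults_text, name):
    """Return the value of the first 'name:value' line for name, or None."""
    for line in defaults_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return value.strip()
    return None

# Sample content; in practice this would be read from $BSE/lib/defaults/all.
sample = "some_other_resource:1\nhigh_ascii_tolerance:0\n"

tolerance = read_resource(sample, "high_ascii_tolerance")
print(tolerance == "0")
```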

The role of the user locale

This section describes the role of the user locale in the following types of installations:

  • single-byte
  • multi-byte
  • Unicode

Caution

It is technically possible to define a different locale for each user (User Data Template (ttams1110m000) session). However, this can cause problems. Therefore, Infor does NOT support the use of multiple user locales. Consequently, all users in an LN environment must have the same user locale.

The role of the user locale in a single-byte installation

In a single-byte installation the user locale defines the character set that can be used throughout the application.

Caution

Infor strongly advises the following:

  • Use a user locale that matches the character set used in the database. In this way “3rd party” database tools can access the data.
  • Use a locale that defines a binary sorting order (also known as collation).
  • Choose a binary sorting order in the database as well. For example, in West-European countries you can use the ISO_BIN1 locale. The character set of this locale is the same as the ISO-8859-1 locale, but the sorting order is binary.
  • Ensure all users have the same user locale. Note that this is not enforced by the tools or porting set.

The user locale has an impact on the following:

  • The way data is stored in the database. The user locale determines the code points that are used to store single-byte characters.
  • Some bshell (3GL) functions:
    • mb.locale.info(), including the TSS_GET_IFACTOR and TSS_GET_EFACTOR aspects
    • set.min()
    • set.max()
    • set.fmin()
    • set.fmax()
  • The non-Unicode version of BWPrint, which must convert to the proper Windows code page. Note: to convert data from the ISO locales to Windows code pages, BWPrint uses the _WIN32 versions of the user locale.

The role of the user locale in a multi-byte installation

In a multi-byte installation the user locale defines the character set that can be used throughout the application.

Caution

Infor strongly advises the following:

  • Use a user locale that matches the character set used in the database. This character set must use binary sorting.
  • Data is stored in the database using the character set of the user locale. Ensure the correct character set is specified for the database, so the database treats the characters in the correct way. In this way “3rd party” database tools can access the data. Otherwise the data could show up garbled.
  • Ensure all users have the same user locale. Note that this is not enforced by the tools or porting set. If users have different user locales, conversion errors occur when a user processes data of another user, who has a different user locale. This mainly impacts processes that write to and read from the database. Consequently the user locale impacts any integration that interacts with the database.

The user locale has an impact on the following:

  • The conversion of data from “native” format (MBCS) to TSS and vice versa. The user locale determines the meaning of the term “native”.
  • The way data is stored in the database. The user locale determines the code points that are used to store multi-byte characters.
  • Some bshell (3GL) functions:
    • mb.import$(). This function converts data from “native” format to TSS.
    • mb.export$(). This function converts data from TSS to “native”.
    • utf8.export$(). The Baan IVc porting set uses the “native” format as an intermediate format to convert from TSS to UTF-8. If the user locale does not match the data, this can result in conversion errors.
    • mb.width(). This function returns the width of a string, where width is defined in “number of display positions”. For example, in the ISO8859n character sets, the “LATIN SMALL LETTER E WITH ACUTE” character takes 1 display position. However, in the GB2312 character set, it takes 2 display positions.
    • mb.locale.info(), including the TSS_GET_IFACTOR and TSS_GET_EFACTOR aspects.
    • set.min()
    • set.max()
    • set.fmin()
    • set.fmax()
  • The appearance of log messages. The text in log messages is converted from the TSS character set to “native” format.
  • The non-Unicode version of BWPrint, which must convert TSS data to “native” format.

The role of the user locale in a Unicode installation

Since the introduction of the Unicode character set, the role of the user locale has become less important. In a pure Unicode environment all characters are represented by unique code points. All code points have a unique interpretation. However, there are still some areas where conversions from and to Unicode occur.

Example

You work in a Unicode environment, but your personal user locale is ISO8859. You want to exchange data between the Unicode environment and another environment. When you perform an export from the Unicode environment, for example through LN Data Director or EDI, the export files are in ISO8859 format.
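
The effect of such an export on the data can be sketched as follows. This is an illustration with Python's built-in `iso8859_1` codec, not the actual LN export logic; the helper `can_export` and the sample strings are hypothetical.

```python
def can_export(text, codec):
    """Return True if every character of text exists in the target codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Characters covered by ISO 8859-1 are exported as one byte each.
exported = "Café in Zürich".encode("iso8859_1")
print(exported)  # b'Caf\xe9 in Z\xfcrich'

# Characters outside ISO 8859-1, such as Chinese, cannot be represented.
print(can_export("Café in Zürich", "iso8859_1"))  # True
print(can_export("中文", "iso8859_1"))             # False
```

Data that falls outside the character set of the user locale therefore cannot survive such an export unchanged.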

The user locale has no impact on:

  • The way data is stored in the database. All data in multi-byte columns is stored in Unicode. The data in single-byte columns is stored “as is” in Unicode; effectively it is interpreted in the ISO-8859-1 locale. For example, the “LATIN CAPITAL LETTER A” is stored as Unicode code point 0x41. The code point 0xe9 is stored as the Unicode code point 0xe9, which represents the “LATIN SMALL LETTER E WITH ACUTE” character (é).
  • The normal operation of the bshell. This excludes these functions:
    • Conversion functions, such as mb.export$() and mb.import$().
    • Functions to acquire information about the current user locale, such as mb.locale.info().
    Note: These functions are not impacted by the user locale:
    • set.min()
    • set.max()
    • set.fmin()
    • set.fmax()
  • The appearance of log messages. The text in log messages is converted from UTF-T to UTF-8 format.
  • The Unicode version of BWPrint.
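
The “as is” storage of single-byte columns mentioned in the first item corresponds to an ISO-8859-1 interpretation, in which every byte value maps to the Unicode code point with the same number. A minimal sketch, using Python's `iso8859_1` codec as a stand-in for the LN behavior:

```python
# Under ISO-8859-1, each byte value equals the resulting Unicode code point.
raw = bytes([0x41, 0xE9])
stored = raw.decode("iso8859_1")

code_points = [hex(ord(ch)) for ch in stored]
print(code_points)  # ['0x41', '0xe9']
print(stored)       # Aé
```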

The user locale has a small impact on the dump files as created by the bdbpre utility. Data in the bdbpre-dump files is in the UTF-8 character set. If the database contains “high ascii” characters, these characters are converted in the context of the current user locale. Note that the high_ascii_tolerance resource has no effect on this process. For details, refer to the comment on the conversion of “high ascii” characters below.

The user locale has an impact on:

  • The conversion of data from “native” format (MBCS) to Unicode (UTF-T) and vice versa. The user locale determines the meaning of the term “native”.
  • The conversion of so-called “high ascii” characters. See below.
  • Some bshell (3GL) functions:
    • mb.import$(). This function converts data from “native” format to UTF-T.
    • mb.export$(). This function converts data from UTF-T to “native”.
    • mb.width(). This function returns the width of a string, where width is defined in “number of display positions”. For example, in the ISO8859n character sets, the “LATIN SMALL LETTER E WITH ACUTE” character takes 1 display position. However, in the GB2312 character set, it takes 2 display positions.
    • mb.locale.info(), excluding the TSS_GET_IFACTOR and TSS_GET_EFACTOR aspects.
  • The non-Unicode version of BWPrint, which must convert UTF-T data to “native” format. This can result in conversion errors because the “native” character set supports only a limited subset of UTF-T. Therefore, Infor strongly advises using the Unicode version of BWPrint in a Unicode installation.

Conversion of “high ascii” characters

The occurrence of “high ascii” characters poses a problem, because one code point can have different meanings in different character sets.

Example

In the ISO-8859-1 locale, the code point 0xe9 (decimal 233) is interpreted as the “LATIN SMALL LETTER E WITH ACUTE” character (é).

In the ISO-8859-7 locale, this code point is interpreted as the “GREEK SMALL LETTER IOTA” character (ι).

To determine the meaning of a “high ascii” character, LN uses the current user locale. If the user locale is an ISO8859n variant, then this character set is used to determine the correct meaning; otherwise the ISO88591 character set is used.

Example

The user locale is ISO88597. A string, which contains the 0xe9 code point, must be converted to UTF-T. The code point is interpreted as the “GREEK SMALL LETTER IOTA” character. The resulting UTF-T code point is 0x9bbc87b9.

The user locale is ISO88591 and the same string must be converted. The code point is interpreted as the “LATIN SMALL LETTER E WITH ACUTE” character. The resulting UTF-T code point is 0x9bbc81e9.
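
The interpretation rule can be sketched in a few lines. The helper `interpret` is hypothetical, and the locale names are Python codec names rather than LN locale names. The sketch yields the Unicode character only; it does not reproduce the UTF-T encoding.

```python
import unicodedata

def interpret(byte_value, user_locale):
    """Decode one "high ascii" byte in the context of the user locale.

    ISO 8859 variants are used as is; any other locale falls back to
    ISO 8859-1, following the rule described above.
    """
    codec = user_locale if user_locale.startswith("iso8859") else "iso8859_1"
    return bytes([byte_value]).decode(codec)

for locale_name in ("iso8859_7", "iso8859_1", "gb2312"):
    ch = interpret(0xE9, locale_name)
    print(locale_name, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

With the Greek locale the byte becomes U+03B9; with the Latin-1 locale, and with the non-ISO 8859 locale that falls back to it, the byte becomes U+00E9.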

We recommend that you keep the installation clean from “high ascii” characters. To achieve this, set the high_ascii_tolerance resource to 0.
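
To get a first impression of whether exported data contains such characters, the bytes can be scanned for values in the upper range. The sketch below assumes that “high ascii” means byte values 128-255 in single-byte data; the helper `find_high_ascii` is hypothetical and is not part of the LN tools.

```python
def find_high_ascii(data):
    """Return (offset, byte value) pairs for bytes in the range 128-255."""
    return [(offset, value) for offset, value in enumerate(data) if value >= 0x80]

print(find_high_ascii(b"Caf\xe9"))  # [(3, 233)]
print(find_high_ascii(b"Cafe"))     # []
```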