Extensions for the support for the Unicode® standard


To represent and process strings that use international characters, GT.M processes can use the UTF-8 encoding defined by the Unicode® standard.

If the environment variable gtm_chset has a value of UTF-8 and either LC_ALL or LC_CTYPE is set to a locale with UTF-8 support (for example, zh_CN.utf8), a GT.M process interprets strings as containing characters encoded in the UTF-8 representation. In UTF-8 mode, GT.M no longer assumes that one character is one byte, or that the glyph display width of a character is one. Depending on how ICU is built on a computer system, a GT.M process may also need a third environment variable, gtm_icu_version, set appropriately in order to operate in UTF-8 mode.

If the environment variable gtm_chset has no value, the string "M", or any value other than "UTF-8", GT.M treats each 8-bit byte as a character, which suffices for English, and many single-language applications.

All GT.M components related to the M mode reside in the top level directory in which a GT.M release is installed and the environment variable gtm_dist should point to that directory for M mode processes. All GT.M components related to the UTF-8 mode reside in the utf8 subdirectory and the environment variable gtm_dist should point to that subdirectory for UTF-8 mode processes. So, in addition to the values of the environment variables gtm_chset and LC_ALL/LC_CTYPE, gtm_dist for a UTF-8 process should also point to the utf8 subdirectory.
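The settings described above can be gathered in one place. The following Python sketch builds an environment for launching a UTF-8 mode process. The helper name utf8_mode_env, the default locale name, and the installation path used in the usage note are illustrative assumptions, not values taken from a GT.M distribution.

```python
import os
from typing import Dict, Optional


def utf8_mode_env(install_dir: str,
                  locale: str = "en_US.utf8",
                  icu_version: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the current environment adjusted for UTF-8 mode.

    install_dir is the top-level directory of a GT.M release (hypothetical
    here); UTF-8 mode components are expected in its utf8 subdirectory.
    """
    env = dict(os.environ)
    env["gtm_chset"] = "UTF-8"
    # Either LC_ALL or LC_CTYPE must name a locale with UTF-8 support.
    # Under POSIX locale rules an inherited LC_ALL takes precedence over
    # LC_CTYPE, so check that it does not name a non-UTF-8 locale.
    env["LC_CTYPE"] = locale
    # UTF-8 mode processes point gtm_dist at the utf8 subdirectory.
    env["gtm_dist"] = os.path.join(install_dir, "utf8")
    if icu_version is not None:
        # Only needed on systems where ICU is built in a way that requires it.
        env["gtm_icu_version"] = icu_version
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run() when starting the process, for example utf8_mode_env("/opt/gtm", icu_version="3.6").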

M mode and UTF-8 mode are set for the process, not for the database. As a subset of UTF-8 characters, ASCII characters ($CHAR() values 0 through 127) are interpreted identically by processes in M and UTF-8 modes. The indexes and values in the database are simply sequences of bytes, and therefore it is possible for one process to interpret a global node as encoded in UTF-8 and for another to interpret the same node as a sequence of bytes. Note that such an application configuration would be extremely unusual, except perhaps during a transition phase or in connection with data import/export.

In UTF-8 mode, string processing functions (such as $EXTRACT()) operate on strings of multi-byte characters, and can therefore produce different results in M and UTF-8 modes, depending on the actual data processed. Each function has a "Z" alter ego (for example, $ZEXTRACT()) that can be used to operate on sequences of bytes identically in M and UTF-8 modes (that is, in M mode, $EXTRACT() and $ZEXTRACT() behave identically).
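The difference between character-oriented and byte-oriented extraction can be illustrated outside GT.M. The Python sketch below is only an analogy: extract_chars and extract_bytes are hypothetical helpers that mimic the 1-based, inclusive positions of $EXTRACT() and $ZEXTRACT() on a UTF-8 string; they are not part of GT.M.

```python
def extract_chars(s: str, start: int, end: int) -> str:
    """Character-oriented extraction (1-based, inclusive positions)."""
    return s[start - 1:end]


def extract_bytes(s: str, start: int, end: int) -> bytes:
    """Byte-oriented extraction on the UTF-8 encoding of s."""
    return s.encode("utf-8")[start - 1:end]


text = "主要"                          # two characters, three bytes each
print(len(text))                       # 2 characters
print(len(text.encode("utf-8")))       # 6 bytes
print(extract_chars(text, 1, 1))       # the first character, 主
print(extract_bytes(text, 1, 1))       # only its first byte, b'\xe4'
```

For pure ASCII data the two helpers return the same content, which mirrors the statement that the character and byte forms behave identically when every character is one byte.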

In M mode, the concept of an illegal character does not exist. In UTF-8 mode, a sequence of bytes may not represent a valid character, and generates an error when encountered by functions that expect and process UTF-8 strings. During a migration of an application to add the support for UTF-8 mode, illegal character errors may be frequent and indicative of application code that is yet to be modified. VIEW "NOBADCHAR" suppresses these errors at times when their presence impedes development.

In UTF-8 mode, GT.M also supports IO encoded in UTF-16 variants as well as in the traditional one byte per character encoding from devices other than $PRINCIPAL.

The following table summarizes GT.M's support for the Unicode® standard.

EXTENSION	EXPLANATION
$ASCII()	In UTF-8 mode, the $ASCII() function returns the integer UTF-8 code-point value of a character in the given string. Note that the name $ASCII() is somewhat anomalous for UTF-8 data but that name is the logical extension of the function from M mode to UTF-8 mode. For more information and usage examples, refer to “$ASCII()”.
$Char()	In UTF-8 mode, $CHAR() returns a string composed of characters represented by the integer equivalents of the UTF-8 code-points specified in its argument(s). For more information and usage examples, refer to “$Char()”.
$Extract()	The $EXTRACT() function returns a substring of a given string. For more information and usage examples, refer to “$Extract()”.
$Find()	The $FIND() function returns an integer character position that locates the occurrence of a substring within a string. For more information and usage examples, refer to “$Find()”.
$Justify()	The $JUSTIFY() function returns a formatted string. For more information and usage examples, refer to “$Justify()”.
$Length()	The $LENGTH() function returns the length of a string measured in characters, or in "pieces" separated by a delimiter specified by its optional second argument. For more information and usage examples, refer to “$Length()”.
$Piece()	The $PIECE() function returns a substring delimited by a specified string delimiter made up of one or more characters. For more information and usage examples, refer to “$Piece()”.
$TRanslate()	The $TRANSLATE() function returns a string that results from replacing or dropping characters in the first of its arguments as specified by the patterns of its other arguments. For more information and usage examples, refer to “$TRanslate()”.
$X	For UTF-8 mode and TRM and SD output, $X increases by the display-columns (width in glyphs) of a given string that is written to the current device. For more information and usage examples, refer to “$X”.
$ZASCII()	The $ZASCII() function returns the numeric byte value (0 through 255) of a given sequence of octets (8-bit bytes). For more information and usage examples, refer to “$ZAscii()”.
$ZCHset	The read-only intrinsic special variable $ZCHSET takes its value from the environment variable gtm_chset. An application can obtain the character set used by a GT.M process from the value of $ZCHSET. $ZCHSET can have only two values, "M" or "UTF-8", and it cannot appear on the left of an equal sign in the SET command. For more information and usage examples, refer to “$ZCHset”.
$ZCHar()	The $ZCHAR() function returns a byte sequence of one or more bytes corresponding to numeric byte value (0 through 255) specified in its argument(s). For more information and usage examples, refer to “$ZCHar()”.
$ZCOnvert()	The $ZCONVERT() function returns its first argument as a string converted to a different encoding. The two argument form changes the encoding for case within a character set. The three argument form changes the encoding scheme. For more information and usage examples, refer to “$ZCOnvert()”.
$ZExtract()	The $ZEXTRACT() function returns a byte sequence of a given sequence of octets (8-bit bytes). For more information and usage examples, refer to “$ZExtract()”.
$ZFind()	The $ZFIND() function returns an integer byte position that locates the occurrence of a byte sequence within a sequence of octets(8-bit bytes). For more information and usage examples, refer to “$ZFind()”.
$ZJustify()	The $ZJUSTIFY() function returns a formatted and fixed length byte sequence. For more information and usage examples, refer to “$ZJustify()”.
$ZLength()	The $ZLENGTH() function returns the length of a sequence of octets measured in bytes, or in "pieces" separated by a delimiter specified by its optional second argument. For more information and usage examples, refer to “$ZLength()”.
$ZPATNumeric	$ZPATN[UMERIC] is a read-only intrinsic special variable that determines how GT.M interprets the patcode N used in the pattern match operator. With $ZPATNUMERIC="UTF-8", the patcode N matches any numeric character as defined by the Unicode standard. By default, patcode N matches only the ASCII digits, which are the only digits that M actually treats as numerics. For more information and usage examples, refer to “$ZPATNumeric”.
$ZPIece()	The $ZPIECE() function returns a sequence of bytes delimited by a specified byte sequence made up of one or more bytes. In M, $ZPIECE() typically returns a logical field from a logical record. For more information and usage examples, refer to “$ZPIece()”.
$ZPROMpt	$ZPROM[PT] contains a string value specifying the current Direct Mode prompt. By default, GTM> is the Direct Mode prompt. M routines can modify $ZPROMPT by means of a SET command. $ZPROMPT cannot exceed 31 bytes. If an attempt is made to assign $ZPROMPT to a longer string, GT.M takes only the first 31 bytes and truncates the rest. With character set UTF-8 specified, if the 31st byte is not the end of a valid UTF-8 character, GT.M truncates the $ZPROMPT value at the end of last character that completely fits within the 31 byte limit. For more information and usage examples, refer to “$ZPROMpt”.
$ZSUBstr()	The $ZSUBSTR() function returns a properly encoded string from a sequence of bytes. For more information and usage examples, refer to “$ZSUBstr()”.
$ZTRanslate()	The $ZTRANSLATE() function returns a byte sequence that results from replacing or dropping bytes in the first of its arguments as specified by the patterns of its other arguments. $ZTRANSLATE() provides a tool for tasks such as encryption. For more information and usage examples, refer to “$ZTRanslate()”.
$ZWidth()	The $ZWIDTH() function returns the numbers of columns required to display a given string on the screen or printer. For more information and usage examples, refer to “$ZWidth()”.
%HEX2UTF	The GT.M %HEX2UTF utility returns the GT.M encoded character string from the given bytestream in hexadecimal notation. This routine has entry points for both interactive and non-interactive use. For more information and usage examples, refer to “%HEX2UTF”.
%UTF2HEX	The GT.M %UTF2HEX utility returns the hexadecimal notation of the internal byte encoding of a UTF-8 encoded GT.M character string. This routine has entry points for both interactive and non-interactive use. For more information and usage examples, refer to “%UTF2HEX”.
[NO]WRAP (USE)	Enables or disables automatic record termination. When the current record size ($X) reaches the maximum WIDTH and the device has WRAP enabled, GT.M starts a new record, as if the routine had issued a WRITE ! command. For more information and usage examples, refer to “WRAP”.
DSE and LKE	In UTF-8 mode, DSE and LKE accept Unicode characters in all their command qualifiers that require file names, keys, or data (such as DSE -KEY, DSE -DATA and LKE -LOCK qualifiers). For more information and usage examples, refer to the LKE and DSE chapters in GT.M Administration and Operations Guide.
GDE Objects	GDE allows the name of a file to include UTF-8 characters. In UTF-8 mode, GDE considers a text file to be encoded in UTF-8 when it is executed via the "@" command. For more information, refer to the GDE chapter in GT.M Administration and Operations Guide.
FILTER[=expr]	Specifies character filtering for specified cursor movement sequences on devices where FILTER applies. In UTF-8 mode, the usual Unicode line terminators (U+000A (LF), U+000D (CR), U+000D followed by U+000A (CRLF), U+0085 (NEL), U+000C (FF), U+2028 (LS) and U+2029 (PS)) are recognized. If FILTER=CHARACTER is enabled, all of the terminators are recognized to maintain the values of $X and $Y. For more information, refer to “FILTER”.
Job	The Job command spawns a background process with the same environment as the M process doing the spawning. Therefore, if the parent process is operating in UTF-8 mode, the Job'd process also operates in UTF-8 mode. In the event that a background process must have a different mode from the parent, create a shell script to alter the environment as needed, and spawn it with a ZSYstem command, for example, ZSYstem "/path/to/shell/script &", or start it as a PIPE device. For more information and UTF-8 mode examples, refer to “Job”.
MUPIP	MUPIP EXTRACT: In UTF-8 mode, MUPIP EXTRACT, MUPIP JOURNAL -EXTRACT and MUPIP JOURNAL -LOSTTRANS write sequential output files in the UTF-8 character encoding form. For example, in UTF-8 mode, if ^A has the value 主要雨在西班牙停留在平原, the sequential output file of the MUPIP EXTRACT command is: 09-OCT-2006 04:27:53 ZWR GT.M MUPIP EXTRACT UTF-8 ^A="主要雨在西班牙停留在平原" MUPIP LOAD: The MUPIP LOAD command considers a sequential file as encoded in UTF-8 if the environment variable gtm_chset is set to UTF-8. Ensure that MUPIP EXTRACT commands and corresponding MUPIP LOAD commands execute with the same setting for the environment variable gtm_chset. The M utility programs %GO and %GI have the same requirement for mode matching. For more information on MUPIP EXTRACT and MUPIP LOAD, refer to the General Database Management chapter in GT.M Administration and Operations Guide.
Open	In UTF-8 mode, the OPEN command recognizes ICHSET, OCHSET, and CHSET as three additional deviceparameters to determine the encoding of the input / output devices. For more information and usage examples, refer to “Open”.
Pattern Match Operator (?)	GT.M allows the pattern string literals to contain UTF-8 characters. Additionally, GT.M extends the M standard pattern codes (patcodes) A, C, N, U, L, P and E to the UTF-8 character set. For more information, refer to “Pattern Match Operator” and “$ZPATNumeric”.
Read	In UTF-8 mode, the READ command uses the character set value specified on the device OPEN as the character encoding of the input device. If character set "M" or "UTF-8" is specified, the data is read with no transformation. If character set is "UTF-16", "UTF-16LE", or "UTF-16BE", the data is read with the specified encoding and transformed to UTF-8. If the READ command encounters an illegal character or a character outside the selected representation, it triggers a run-time error. The READ command recognizes all Unicode line terminators for non-FIXED devices. For more information and usage examples, refer to “Read”.
Read #	When a number sign (#) and a non-zero integer expression immediately follow the variable name, the integer expression determines the maximum number of characters accepted as the input to the READ command. In UTF-8 or UTF-16 modes, this can occur in the middle of a sequence of combining code-points (some of which are typically non-spacing). When this happens, any display on the input device, may not represent the characters returned by the fixed-length READ (READ #). For more information and usage examples, refer to “Read”.
Read *	In UTF-8 or UTF-16 modes, the READ * command accepts one Unicode character of input and puts the numeric UTF-8 code-point value for that character into the variable. For more information and usage examples, refer to “Read”.
View "[NO]BADCHAR"	As an aid to migrating applications to using the Unicode standard, this UTF-8 mode VIEW command determines whether UTF-8 enabled functions trigger errors when they encounter illegal strings. For more information and usage examples, refer to “View”.
User-defined Collation	For some languages (such as Chinese), the ordering of character strings encoded with UTF-8 may not be the linguistically or culturally correct ordering. Supporting applications in such languages requires development of collation modules - GT.M natively supports M collation, but does not include pre-built collation modules for any specific natural language. Therefore, applications that use UTF-8 characters may need to implement their own collation functions. For more information on developing a collation module for the Unicode® standard, refer to “Implementing an Alternative Collation Sequence for Unicode® characters”.
Unicode® Byte Order Marker (BOM)	When ICHSET is UTF-16, GT.M uses the BOM (U+FEFF) to automatically determine the endianness. For this to happen, the BOM must appear at the beginning of the file or data stream. If the BOM is not present, GT.M assumes big endianness. SEEK or APPEND operations require specifying the endianness (UTF-16LE or UTF-16BE) because they do not go to the beginning of the file or data stream to automatically determine the endianness. When endianness is not specified, SEEK or APPEND assume big endianness. If the character set of a device is UTF-8, GT.M checks for and ignores a BOM on input. If the BOM does not match the character set specified at device OPEN, GT.M produces an error. READ does not return the BOM to the application and the BOM is not counted as part of the first record. If the output character set for a device is UTF-16 (but not UTF-16BE or UTF-16LE), GT.M writes a BOM before the initial output. The application code does not need to explicitly write the BOM.
WIDTH=intexpr (USE)	In UTF-8 mode and TRM and SD output, the WIDTH deviceparameter specifies the display-columns and is used with $X to control truncation and WRAPing of the visual representation of the stream. For more information and usage examples, refer to “WIDTH”.
Write	In UTF-8 mode, the WRITE command uses the character set specified on the device OPEN as the character encoding of the output device. If character set specifies "M" or "UTF-8", GT.M WRITEs the data with no transformation. If character set specifies "UTF-16", "UTF-16LE" or "UTF-16BE", the data is assumed to be encoded in UTF-8 and WRITE transforms it to the character encoding specified by character set device parameter. For more information and usage examples, refer to “Write”.
Write *	When the argument of a WRITE command consists of a leading asterisk (*) followed by an integer expression, the WRITE command outputs the character represented by the code-point value of that integer expression. For more information and usage examples, refer to “Write”.
ZSHow	In UTF-8 mode, the ZSHOW command exhibits byte-oriented and display-oriented behavior as follows: ZSHOW targeted to a device (ZSHOW "*") aligns the output according to the number of display columns specified by the WIDTH deviceparameter. ZSHOW targeted to a local (ZSHOW "*":lcl) truncates data exceeding 2048KB at the last character that fully fits within the 2048KB limit. ZSHOW targeted to a global (ZSHOW "*":^CC) truncates data exceeding the maximum record size for the target global at the last character that fully fits within that record size. For more information and usage examples, refer to “ZSHOW Destination Variables”.
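The byte order marker rules for UTF-16 input in the table above (use the BOM when present, assume big endian otherwise, and do not hand the BOM to the application) can be sketched in Python as follows. This is an illustration of the stated rules only; the helper names detect_utf16 and read_utf16 are hypothetical and the code does not reflect how GT.M implements them.

```python
from typing import Tuple

BOM_BE = b"\xfe\xff"   # U+FEFF encoded big endian
BOM_LE = b"\xff\xfe"   # U+FEFF encoded little endian


def detect_utf16(data: bytes) -> Tuple[str, bytes]:
    """Return (codec name, payload without BOM) for a UTF-16 stream."""
    if data.startswith(BOM_BE):
        return "utf-16-be", data[2:]
    if data.startswith(BOM_LE):
        return "utf-16-le", data[2:]
    # No BOM at the start of the stream: assume big endian.
    return "utf-16-be", data


def read_utf16(data: bytes) -> str:
    """Decode a UTF-16 stream; the BOM is never part of the result."""
    codec, payload = detect_utf16(data)
    return payload.decode(codec)
```

A stream opened at an arbitrary offset cannot see a leading BOM, which is why the table notes that SEEK and APPEND require the endianness to be stated explicitly.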

Philosophy of GT.M's support for the Unicode® standard

With the support of the Unicode® standard, there is no change to the GT.M database engine or to the way that data is stored and manipulated. GT.M has always allowed indexes and values of M global and local variables to be either canonical numbers or any arbitrary sequence of bytes. There is also no change to the character set used for M source programs. M source programs have always been in ASCII (standard ASCII - $C(0) through $C(127) - is a proper subset of the UTF-8 encoding specified by the Unicode standard). GT.M accepts some non-ASCII characters in comments and string literals.

The changes in GT.M to support the Unicode standard are principally enhancements to M language features. Although conceptually simple, these changes fundamentally alter certain previously ingrained assumptions. For example:

The length of a string in characters is not the same as the length of a string in bytes. The length of a UTF-8 string in characters is always less than or equal to its length in bytes.
The display width of a string on a terminal is different from the length of a string in characters - for example, with UTF-8, a complex glyph may actually be composed of a series of glyphs or component symbols, each in turn a UTF-8 encoded character in a Unicode string.
As a glyph may be composed of multiple characters, a UTF-8 string can have canonical and non-canonical forms. The forms may be conceptually equivalent, but they are different strings of encoded characters.

	Important
	GT.M treats canonical and non-canonical versions of the same string as different and unequal. FIS recommends that applications be written to use canonical forms. Where conformance to a canonical representation of input strings cannot be assured, application logic linguistically and culturally correct for each language should convert non-canonical strings to canonical strings.

Applications may operate on a combination of character and binary data - for example, some strings in the database may be digitized images of signatures and others may include escape sequences for laboratory instruments. Furthermore, since M applications have traditionally overloaded strings by storing different data items as pieces of the same string, the same string may contain both UTF-8 data and binary data. GT.M has functionality to allow a process to manipulate UTF-8 strings as well as binary data.

The GT.M design philosophy is to keep things simple, but no simpler than they need to be. The complexities that remain typically arise where interpretations of lengths and interpretations of characters interact. For example:

A sequence of bytes is never illegal when considered as binary data, but can be illegal when treated as a UTF-8 string. The detection and handling of illegal UTF-8 strings adds complexity, especially when binary and UTF-8 data reside in different pieces of the same string.
Since binary data may not map to graphic UTF-8 characters, the ZWRite format must represent such characters differently. A sequence of bytes that is output by a process interpreting it as UTF-8 data may require processing to form correct input for a process that is interpreting that sequence as binary, and vice versa. Therefore, when performing IO operations, including MUPIP EXTRACT and MUPIP LOAD operations in ZWR format, ensure that processes have compatible environment variables and/or logic to generate the desired output and correctly read and process the input.
Application logic managing input / output that interacts with human beings or non-GT.M applications requires even closer scrutiny. For example, fixed length records in files are always defined in terms of bytes. In Unicode support related operations, an application may output data such that a character would cross a record boundary (for example, a record may have two bytes of space left, and the next UTF-8 character may be three bytes long), in which case GT.M fills the record with one or more pad bytes. When a padded record is read as UTF-8, trailing pad bytes are stripped by GT.M and not provided to the application code.
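The record-boundary case in the last item can be made concrete with a short Python sketch. It packs text into fixed-length records measured in bytes, never splits a character across records, and fills leftover space with pad bytes. The helper names and the choice of a space as the pad byte are assumptions made for the illustration; consult the device documentation for the pad value GT.M actually uses.

```python
from typing import List


def pack_fixed(text: str, record_size: int, pad: bytes = b" ") -> List[bytes]:
    """Split text into records of exactly record_size bytes."""
    records: List[bytes] = []
    current = b""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > record_size:
            raise ValueError("record too small for a single character")
        if len(current) + len(encoded) > record_size:
            # The next character would cross the boundary: pad and start anew.
            records.append(current.ljust(record_size, pad))
            current = b""
        current += encoded
    if current:
        records.append(current.ljust(record_size, pad))
    return records


def unpack_fixed(record: bytes, pad: bytes = b" ") -> str:
    """Strip trailing pad bytes and decode the record as UTF-8."""
    return record.rstrip(pad).decode("utf-8")


# "ab" fills two of four bytes; the three-byte character does not fit,
# so the first record receives two pad bytes.
records = pack_fixed("ab主", 4)
print(records)
```

Note that stripping trailing pad bytes on input cannot distinguish padding from data that legitimately ends in the pad value, which is one reason such interfaces deserve close scrutiny.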

For some languages (such as Chinese), the ordering of strings according to UTF-8 code-points (character values) may not be the linguistically or culturally correct ordering. Supporting applications in such languages requires development of collation modules - GT.M natively supports M collation, but does not include pre-built collation modules for any specific natural language.

Glyphs and Unicode® characters

Glyphs are the visual representation of text elements in writing systems and UTF-8 code-points are the underlying data. Internally, GT.M stores UTF-8 encoded strings as sequences of bytes. A Unicode® compatible output device - terminal, printer or application - renders the characters as sequences of glyphs that depict the sequence of code-points, but there may not be a one-to-one correspondence between characters and glyphs.

For example, consider the following word from the Devanagari writing system.

अच्छी

On a screen or a printer, it is displayed in 4 columns. Internally GT.M stores it as a sequence of 5 UTF-8 code-points:

#	Character	UTF-8 code-point	Name
1	अ	U+0905	DEVANAGARI LETTER A
2	च	U+091A	DEVANAGARI LETTER CA
3	्	U+094D	DEVANAGARI SIGN VIRAMA
4	छ	U+091B	DEVANAGARI LETTER CHA
5	ी	U+0940	DEVANAGARI VOWEL SIGN II

The Devanagari writing system (U+0900 to U+097F) is based on the representation of syllables as contrasted with the use of an alphabet in English. Therefore, it uses the half-form of a consonant to represent certain syllables. The above example uses the half-form of the consonant (U+091A).

Although the half-form consonant is a valid text element in the context of the Devanagari writing system, it does not map directly to a character in the Unicode® standard. It is obtained by combining DEVANAGARI LETTER CA with DEVANAGARI SIGN VIRAMA and DEVANAGARI LETTER CHA.

च + ् + छ = च्छ

On a screen or a printer, the terminal font detects the glyph image of the half-consonant and displays it at the next display position. Internally GT.M uses ICU's glyph-related conventions for the Devanagari writing system to calculate the number of columns needed to display it. As a result, GT.M advances $X by 1 when it encounters the combination of the 3 UTF-8 code-points that represent the half-form consonant.

To view this example at GT.M prompt, type the following command sequence:

GTM>write $ZCHSET
UTF-8
GTM>set DS=$char($$FUNC^%HD("0905"))_$char($$FUNC^%HD("091A"))_$char($$FUNC^%HD("094D"))
GTM>set DS=DS_$char($$FUNC^%HD("091B"))_$char($$FUNC^%HD("0940"))
GTM>write $zwidth(DS) ; 4 columns are required to display local variable DS on the screen.
4
GTM>write $length(DS) ; DS contains 5 characters or UTF-8 code-points.
5
GTM>

For all writing systems supported by the Unicode standard, a character is a code-point for string processing, network transmission, storage, and retrieval of Unicode data whereas a character is a glyph for displaying on the screen or printer. This holds true for many other popular programming languages. Keep this distinction in mind throughout the application development life-cycle.
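The same word can be inspected in Python, which also treats a string as a sequence of code-points. This sketch lists the name and general category of each code-point using the standard unicodedata module; it does not compute display width, which GT.M obtains from ICU.

```python
import unicodedata

# The five code-points of the Devanagari example word.
word = "\u0905\u091A\u094D\u091B\u0940"

for ch in word:
    # Category Mn (non-spacing mark) and Mc (spacing combining mark)
    # identify code-points that combine with a preceding base letter.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

print(len(word))                    # number of code-points
print(len(word.encode("utf-8")))    # number of bytes in UTF-8
```

Each of these code-points occupies three bytes in UTF-8, so the five characters are stored as fifteen bytes even though they display in four columns.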

ICU

ICU is a widely used, de facto standard package (see http://icu-project.org for more information) that GT.M relies on for most operations that require knowledge of the Unicode® character sets, such as text boundary detection, character string conversion between UTF-8 and UTF-16, and calculating glyph display widths.

	Important
	Unless the support for the Unicode standard is sought for a process (that is, unless the environment variable gtm_chset is "UTF-8"), GT.M processes do not need ICU. In other words, existing applications that are not based on the Unicode standard continue to work on supported platforms without ICU.

An ICU version number is of the form major.minor.milli.micro where major, minor, milli and micro are integers. Two versions that have different major and/or minor version numbers can differ in functionality and API compatibility is not guaranteed. Differences in milli or micro versions are maintenance releases that preserve functionality and API compatibility. ICU reference releases are defined by major and minor version numbers. Note that display widths for some characters changed in ICU 4.0 and may change again in the future, as both languages and ICU evolve.

An operating system's distribution generally includes an ICU library tailored to the OS and hardware, therefore FIS does not provide any ICU library. In order to support UTF-8 functionality, GT.M requires an appropriate version of ICU to be installed on the system - check the release notes for your GT.M release for supported ICU versions.

GT.M expects ICU to be compiled with symbol renaming disabled and issues an error at startup if the available version of ICU is built with symbol renaming enabled. To use a version of ICU built with symbol renaming enabled, set the gtm_icu_version environment variable to the major and minor version numbers of the desired ICU, formatted as MajorVersion.MinorVersion (for example "3.6" to denote ICU-3.6). When $gtm_icu_version is so defined, GT.M attempts to open that specific version of ICU, and works regardless of whether or not symbols in that ICU have been renamed. A missing or ill-formed value for this environment variable causes GT.M to look only for non-renamed ICU symbols. The release notes for each GT.M release identify the required reference release version number as well as the milli and micro version numbers that were used to test GT.M prior to release. In general, it should be safe to use any version of ICU with the specific ICU reference version number required and milli and micro version numbers greater than those identified in the release notes for that GT.M version.
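As a small illustration of the MajorVersion.MinorVersion format, the following Python sketch reduces a full ICU version string of the form major.minor.milli.micro to the value described above for gtm_icu_version. The function name icu_reference_version is hypothetical, and how the full version string is obtained on a given system is outside the scope of the sketch.

```python
def icu_reference_version(full_version: str) -> str:
    """Reduce 'major.minor[.milli[.micro]]' to 'major.minor'."""
    parts = full_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected ICU version string: {full_version!r}")
    return f"{int(parts[0])}.{int(parts[1])}"


print(icu_reference_version("3.6.1.0"))   # 3.6
```

Rejecting malformed input early is a deliberate choice here, since an ill-formed gtm_icu_version value causes GT.M to fall back to looking only for non-renamed ICU symbols.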

ICU supports multiple threads within a process, and an ICU binary library can be compiled from source code to either support or not support multiple threads. In contrast, GT.M does not support multiple threads within a GT.M process. On some platforms, the stock ICU library, which is usually compiled to support multiple threads, may work unaltered with GT.M. On other platforms, it may be necessary to rebuild ICU from its source files with support for multiple threads turned off. Refer to the release notes for each GT.M release for details about the specific configuration tested and supported. In general, the GT.M team's preferences for the ICU binaries used with each GT.M version are, in decreasing order of preference:

The stock ICU binary provided with the operating system distribution.
A binary distribution of ICU from the download section of the ICU project page.
A version of ICU locally compiled from source code provided by the operating system distribution with a configuration disabling multi-threading.
A version of ICU locally compiled from the source code from the ICU project page with a configuration disabling multi-threading.

GT.M uses the POSIX function dlopen() to dynamically link to ICU. In the event you have other applications that require ICU compiled with threads, place the different builds of ICU in different locations, and use the dlopen() search path feature (for example, the LD_LIBRARY_PATH environment variable on Linux) to enable each application to link with its appropriate ICU.

Discussion and Best Practices

Data interchange

GT.M's support for the Unicode® standard only affects the interpretation of data in databases, and not databases themselves. A simple way to convert from a ZWR format extract in one mode to an extract in the other is to load it in the database using a process in the mode in which it was generated, and to once more extract it from the database using a process in the other mode.

If a sequence of 8-bit octets contains bytes other than those in the ASCII range (0 through 127), an extract in ZWR format for the same sequence of bytes is different in M and UTF-8 modes. In M mode, the $[Z]CHAR() values in a ZWR format extract are always equal to or less than 255. In UTF-8 mode, they can have larger values - the code-points of UTF-8 characters can be far greater than 255.

Note that the characters written to the output device are subject to the OCHSET transformation of the controlling output device. If OCHSET is "M", the multi-byte characters are written in raw bytes without any transformation.

Each multi-byte graphic character (as classified by $ZCHSET) is written directly to the device converted to the encoding form specified by the OCHSET of the output device.
Each multi-byte non-graphic character (as classified by $ZCHSET) is written in $CHAR(nnnn) notation, where nnnn is the decimal character code (that is, code-point up to 1114111 if $ZCHSET="UTF-8" or up to 255 if $ZCHSET="M").
If $ZCHSET="UTF-8" and a subscript or data contains a malformed UTF-8 byte sequence, ZWRITE treats each byte in the sequence as a separate malformed character. Each such byte is written in $ZCHAR(nn[,...]) notation, where each nn is the corresponding byte in the illegal UTF-8 byte sequence.
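The three rules above can be approximated in a few lines of Python. This is a rough sketch of the UTF-8 mode behavior as described, not GT.M's implementation: the function name zwr_sketch is hypothetical, Python's str.isprintable() stands in for GT.M's graphic classification, and actual ZWRITE output may differ in detail.

```python
def zwr_sketch(raw: bytes) -> str:
    """Render a byte string in a ZWR-like notation (UTF-8 mode rules)."""
    # surrogateescape maps each malformed byte 0xNN to U+DCNN so that
    # invalid bytes survive decoding and can be reported one at a time.
    text = raw.decode("utf-8", errors="surrogateescape")
    groups = []  # list of (kind, values); adjacent items of a kind are merged
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            kind, value = "zchar", cp - 0xDC00     # malformed byte
        elif ch.isprintable():
            kind, value = "lit", ch                # graphic character
        else:
            kind, value = "char", cp               # non-graphic character
        if groups and groups[-1][0] == kind:
            groups[-1][1].append(value)
        else:
            groups.append((kind, [value]))
    out = []
    for kind, values in groups:
        if kind == "lit":
            # Quotes inside a literal are doubled, as in M string literals.
            out.append('"' + "".join(values).replace('"', '""') + '"')
        elif kind == "char":
            out.append("$CHAR(" + ",".join(str(v) for v in values) + ")")
        else:
            out.append("$ZCHAR(" + ",".join(str(v) for v in values) + ")")
    return "_".join(out) if out else '""'


# A graphic character, a control character, and a byte that is not valid UTF-8.
print(zwr_sketch("主".encode("utf-8") + b"\x01\xff"))
```

Because the classification of a character as graphic depends on the mode and pattern tables in effect, output produced under one set of assumptions is not guaranteed to load identically under another.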

Note that attempts to use ZWRITE output from a system as input to another system using a different character set may result in errors or not yield the same state as existed on the source system. Application developers can deal with this by defining and using one or more pattern tables that declare all non-ASCII characters (or any useful subset thereof) to be non-graphic. For more details on defining pattern tables, refer to the "Pattern Code Definition" section of Chapter 12: “Internationalization”.

Limitations

User-defined pattern codes are not supported

Although the M standard patcodes (A,C,L,U,N,P,E) are extended to work in UTF-8 mode, application developers can neither change their default classification nor define the non-standard patcodes (B,D,F-K,M,O,Q-T,V-X) beyond the ASCII subset. This means that the pattern tables cannot contain characters with codes greater than the maximum ASCII code 127.

String Normalization

In GT.M, strings are not implicitly normalized. Unicode® normalization is a method of computing a canonical representation of character strings. Normalization is required if strings can contain both combining character sequences (such as an accented character consisting of a base character followed by a combining accent) and the equivalent precomposed characters. The Unicode standard assigned code-points to such precomposed characters for backward compatibility with legacy code sets. For applications whose data can contain both representations of the same character, the Unicode standard recommends using one of the normal forms. Because GT.M does not normalize strings, application developers must implement string normalization, as needed, in order for string matching and string collation to behave as expected. In such a case, edit checks can be used that accept only a single representation when multiple representations are possible.
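The effect of the two representations can be seen in a minimal Python sketch using the standard `unicodedata` module. Python is used only for illustration; a GT.M application would have to supply equivalent logic itself, and the choice of NFC here is an assumption rather than a recommendation from this manual.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code-point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def canonical(text: str, form: str = "NFC") -> str:
    """Return text in the requested Unicode normal form."""
    return unicodedata.normalize(form, text)


# Without normalization the strings differ, so matching and key lookup fail.
print(precomposed == combining)                        # False
# After normalizing both to the same form they compare equal.
print(canonical(precomposed) == canonical(combining))  # True
```

Applying one normal form consistently before strings are stored or used as keys gives every character a single representation.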

UTF-16 is not supported for $PRINCIPAL device

GT.M does not support UTF-16, UTF-16LE and UTF-16BE encodings for $PRINCIPAL I/O devices (including Terminal, Sequential and Socket devices). In order to perform Unicode®-related I/O with the $PRINCIPAL device, application developers must use "UTF-8" for the ICHSET or OCHSET deviceparameters.

UTF-16 is not supported for Terminal Devices

Because UNIX terminals and terminal emulators rarely use or support UTF-16, GT.M does not support UTF-16, UTF-16LE and UTF-16BE encodings for Terminal I/O devices. Note that UNIX platforms use UTF-8 as the de facto character encoding for the Unicode® standard. Terminal connections from remote hosts (such as Windows) must communicate with GT.M in UTF-8 encoding.

Error messages are in [American] English

GT.M has no facility for translating product error messages or on-line help into languages other than [American] English. All error message text (except message arguments, which could include Unicode® data) is in the [American] English language.

Performance and Capacity

With the use of UTF-8 as GT.M's internal character encoding, the additional requirements for CPU cycles, excluding collation algorithms, should not increase significantly compared with the identical application using the M character set. Additional memory requirements for UTF-8 vary depending on the application as well as the actual character set used. For example, applications based on Latin-1 (2-byte encoded) characters may require up to twice the memory and those based on Chinese/Japanese (3-byte encoded) characters may require up to three times the memory compared to an identical application using "M" (ASCII) characters. The additional disk-space and I/O performance trade-offs for UTF-8 also vary based on the application and the characters used.
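The "up to twice" and "up to three times" figures follow from the number of bytes UTF-8 needs per character. The Python sketch below (illustrative only; `utf8_expansion` is a hypothetical helper) computes the ratio of UTF-8 bytes to characters for a given string.

```python
def utf8_expansion(text: str) -> float:
    """Ratio of UTF-8 bytes to characters; 1.0 for pure ASCII text."""
    if not text:
        return 1.0
    return len(text.encode("utf-8")) / len(text)


print(utf8_expansion("hello"))               # ASCII only
print(utf8_expansion("\u00e9\u00e0\u00fc"))  # accented Latin-1 letters
print(utf8_expansion("\u4e2d\u6587"))        # Chinese characters
```

Text that mixes ASCII with multi-byte characters falls between these extremes, which is why the actual overhead depends on the application's data.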

Characters in arguments exchanged with external routines must be validated by the external routines

GT.M does not check for illegal characters in a string before passing it to an external routine or in a returned value before assigning it to a GT.M variable. This is because such checks add parameter-processing overhead. The application must ensure that the strings are in the encoding form expected by the respective routines. More robustly, external routines must interpret passed strings based on the value of the intrinsic variable $ZCHSET or the environment variable gtm_chset. The external routines can perform validation if needed.

Maximums

In older versions of GT.M, which did not support the Unicode® standard, the restrictions on certain objects were put in place with the assumption that a character is represented by a single byte. With support for the Unicode standard in GT.M, the following restrictions are in terms of bytes, not characters.

M Name Length

The maximum length of an M identifier is restricted to 31 bytes. Since identifier names are restricted to be in ASCII, programmers can define M names up to 31 characters long.

M String Length

The maximum length of an M string is restricted to 1,048,576 bytes (1MiB). Therefore, depending on the characters used, the maximum number of characters could be reduced from 1,048,576 characters to as few as 262,144 (256K) characters.
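The arithmetic behind these figures is a simple division of the byte limit by the encoded size of each character, as the following Python sketch shows. It is an illustration only; `max_characters` is a hypothetical helper and assumes every character in the string has the same encoded size.

```python
MAX_STRING_BYTES = 1_048_576  # 1 MiB limit stated above


def max_characters(bytes_per_char: int) -> int:
    """Whole characters that fit in the limit at a fixed UTF-8 width."""
    if not 1 <= bytes_per_char <= 4:
        raise ValueError("UTF-8 encodes a character in 1 to 4 bytes")
    return MAX_STRING_BYTES // bytes_per_char


for width in (1, 2, 3, 4):
    print(width, max_characters(width))
```

With four-byte characters the result is 262,144, the lower bound given above.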

M Source Line Length

The maximum length of a program or indirect source line is restricted to 2,048 bytes. Application developers must be aware of this byte limit if they consider using multi-byte source comments or string literals in a source line.

Database Key and Record Sizes

The maximum allowed size for database keys (both global and nref keys) is 1019 bytes and for database records is 1MiB. Application developers must be aware that keys or data containing multi-byte UTF-8 characters are limited to fewer characters than the number of available bytes.

Ten Golden Rules

Adhere to the following rules of thumb to design and develop applications based on the Unicode® standard for deployment on GT.M.

GT.M functionality related to the Unicode standard and characters becomes available only in UTF-8 mode.
[At least] in UTF-8 mode, byte manipulation must use Z* equivalent functions.
In M mode, standard functions are always identical to their Z equivalents.
Use the same character set for all global names and subscripts in an instance.
Define a collation system according to the linguistic and cultural tenets of the language used.
Create the application logic to ensure strings used as keys are canonical.
Specify CHSET="M" or otherwise handle illegal characters during the I/O operations.
Communicate with any external routines using a compatible character encoding form.
Compile and run programs in the same setting of $ZCHSET and "BADCHAR".
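The rule about byte manipulation can be illustrated with a Python sketch that contrasts character-oriented and byte-oriented extraction. The two helpers are hypothetical and only loosely mirror the 1-based, inclusive positions of $EXTRACT() and $ZEXTRACT(); they ignore M's edge-case handling and are not GT.M code.

```python
def extract_chars(text: str, start: int, end: int) -> str:
    """Characters start..end (1-based, inclusive), as character positions."""
    return text[start - 1:end]


def extract_bytes(text: str, start: int, end: int) -> bytes:
    """Bytes start..end (1-based, inclusive) of the UTF-8 encoding."""
    return text.encode("utf-8")[start - 1:end]


s = "\u65e5\u672c\u8a9e"         # three characters, nine UTF-8 bytes
print(len(s), len(s.encode("utf-8")))
print(extract_chars(s, 1, 2))    # first two whole characters
print(extract_bytes(s, 1, 2))    # first two bytes: an incomplete character
```

Mixing the two views is what produces malformed sequences: the byte-oriented result here is not valid UTF-8 on its own, so code that really needs bytes should say so explicitly.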
