
Teradata Parallel Transporter Unicode Usage
This article offers tips on loading and unloading Unicode data with the UTF8 and UTF16 Teradata client session character sets using Teradata Parallel Transporter (TPT).
As of this writing, Teradata Parallel Transporter supports Unicode only on network-attached platforms.
What is Unicode?
Unicode is an industry standard designed to allow text and symbols from all languages to be consistently represented. Unicode characters, each identified by an unambiguous name and an integer number called its code point, can be encoded using any of several schemes termed Unicode Transformation Formats (UTF).
Unicode encodings include:
• UTF-8 – an 8-bit, variable character-width encoding, compatible with 7-bit ASCII
• UCS-2 – a 16-bit, fixed character-width encoding
• UTF-16 – a variable character-width encoding that uses one 16-bit unit, or a surrogate pair of two 16-bit units (32 bits), per character
• UTF-32 – a 32-bit fixed character-width encoding
With the exception of UCS-2, which cannot represent characters beyond the first 2^16 code points, all Unicode encoding forms cover the same character repertoire; only the encodings differ between the Unicode Transformation Formats.
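As a rough illustration of how the same code point occupies a different number of bytes in each encoding form, the following Python sketch encodes three sample characters. The function name encoded_sizes and the choice of sample characters are illustrative only and have nothing to do with TPT itself.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        # An explicit byte order is used so that no byte order mark is added.
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


# U+0041 (ASCII letter), U+20AC (euro sign), U+1F600 (outside the first 2^16 code points)
for ch in ("\u0041", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The last sample needs a surrogate pair in UTF-16, which is why it takes four bytes there and cannot be represented in UCS-2 at all.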
Character Set Encodings
ASCII
7-bit ASCII characters are one-byte characters using only 7 bits per character.
1-byte character: 0xxxxxxx
8-bit ASCII, also called “extended ASCII” or “high ASCII”, describes eight-bit character encodings that include the standard 7-bit ASCII characters as well as others.
1-byte character: xxxxxxxx
ANSI
ANSI is a general definition for code pages. These can be one byte per character (example: Windows 1252) or multiple bytes per character (example: Shift JIS).
1-byte character: xxxxxxxx
2-byte character: xxxxxxxx xxxxxxxx
UTF-8
UTF-8 is a variable-length encoding for Unicode. It is able to represent any universal character in the Unicode standard, yet is also backwards compatible with 7-bit ASCII. In other words, UTF-8 is a superset of 7-bit ASCII. A plain 7-bit ASCII string is also a valid UTF-8 string. This backwards-compatibility means that no conversion needs to be done for 7-bit ASCII text and existing software based on 7-bit ASCII and its extensions can handle UTF-8.
The default Unicode character encoding form on UNIX platforms is UTF-8. UTF-8 works on ANSI single-byte character systems without modification. 7-bit ASCII characters use one byte and all other characters use two or more bytes.
This encoding is also widely used on the Internet for transmitting Unicode text.
UTF-8 uses one to four bytes per character, depending on the Unicode symbol. The first byte of a multi-byte character begins with one 1-bit for each byte used by the character, followed by a 0-bit. Each of the following bytes of that character starts with one 1-bit and one 0-bit.
1-byte character: 0xxxxxxx
2-byte character: 110xxxxx 10xxxxxx
3-byte character: 1110xxxx 10xxxxxx 10xxxxxx
4-byte character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
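The bit patterns above can be sketched as a small hand-written encoder. This is a minimal Python illustration, not how TPT or any particular library implements UTF-8; the function name utf8_encode is hypothetical, and in practice the built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 using the bit patterns above."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Show the resulting bit patterns for one character of each length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    bits = " ".join(f"{b:08b}" for b in utf8_encode(cp))
    print(f"U+{cp:04X}: {bits}")
```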
UCS-2
UCS-2 is the standard Unicode encoding format used in Win32 environments up to Windows NT. Characters are stored as fixed-length 2-byte characters, where the high-order byte contains zeros if the character is an ASCII (7-bit) or ANSI (8-bit) character.
2-byte character: 00000000 xxxxxxxx
2-byte character: xxxxxxxx xxxxxxxx
UTF-16
UTF-16 is an extension of the UCS-2 encoding format. This is the default encoding format on Microsoft Windows 2000 and XP. UTF-16 includes the UCS-2 character repertoire, but has been extended so that two 16-bit values (called a surrogate pair) can form one character.
Surrogate pairs are a mechanism for encoding more than the 2^16 characters available in UCS-2 and UTF-16 before Unicode 3.1. This extension mechanism allows for more than one million additional characters.
This is accomplished by using two 16-bit values (surrogates) to represent one character. Each of the two surrogates can have one of 1024 different values, which gives 1024^2 = 1,048,576 new character values.
The first 16-bit value lies in the D800-DBFF range and is called the high surrogate. The second 16-bit value lies in the DC00-DFFF range and is called the low surrogate. In UCS-2, these 2*1024 code positions are reserved for surrogates and are not assigned to characters; they are separate from the private use area (E000-F8FF).
2-byte character: 00000000 xxxxxxxx
2-byte character: xxxxxxxx xxxxxxxx
4-byte character: 110110xx xxxxxxxx (high surrogate) 110111xx xxxxxxxx (low surrogate)
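The surrogate arithmetic described above can be sketched as follows. This is a minimal Python illustration under the standard UTF-16 rules; the function names to_surrogate_pair and from_surrogate_pair are hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high surrogate, low surrogate)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 | (offset >> 10)      # top 10 bits    -> D800..DBFF
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits -> DC00..DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high and low surrogate into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(unit) for unit in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

Each surrogate carries 10 bits of the offset, which is where the 1024 values per surrogate come from.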
Specifying Character Sets
Prior to Unicode support in TPT, the architecture for specifying character sets required that all of the following be in the same character set:
• TPT job script
• Client session character set
• Data
For example, to load KANJISJIS_0S data:
• Job script must be encoded in KANJISJIS_0S
• Job script must specify the USING CHARACTER SET KANJISJIS_0S client session character set clause
• The data must be in KANJISJIS_0S
If the job script does not contain extended characters, however, it can also be encoded in ASCII, which makes sense because the lower 7 bits are the same in both character sets.
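One way to check whether a job script falls into this ASCII-only case is to test that every byte is below 0x80. The Python sketch below is only an illustration: the function name script_is_ascii_only and the script fragment are hypothetical, and Python's shift_jis codec is used as a stand-in for KANJISJIS_0S, which is an assumption.

```python
def script_is_ascii_only(script_bytes: bytes) -> bool:
    """Return True when every byte is in the 7-bit ASCII range 0x00-0x7F."""
    return all(byte < 0x80 for byte in script_bytes)


# Hypothetical script fragments, used only for illustration.
plain = b"USING CHARACTER SET KANJISJIS_0S\n"
# Append two Japanese characters encoded with Python's shift_jis codec
# (assumed here to approximate KANJISJIS_0S).
with_extended = plain + "\u65e5\u672c".encode("shift_jis")

print(script_is_ascii_only(plain))          # True
print(script_is_ascii_only(with_extended))  # False
```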
With support for UTF-16, however, there may be situations where users want their job script encoded in UTF-8 and the data in UTF-16 (along with the client session character set UTF16); or vice-versa.
To accommodate this, TPT will adopt the following architecture for specifying job script encoding and for specifying the client session character set when using UTF-16.
Client Session Character Set, SQL Request Text, & Data
TPT will maintain the Teradata DBS requirement that the SQL request text and all character data must be in the same client session character set.
Job Script Encoding
When a job script is encoded in UTF-16, that encoding must be specified via a command line argument. This is necessary because TPT (by default) expects job scripts to be encoded in a 7-bit ASCII-compatible character set.
Job Variables / INCLUDE Directive
TPT allows job variables and INCLUDE directives to be located in an external file. These job variables and directives get substituted into the TPT script at compile time by the TPT Preprocessor. TPT will maintain the requirement that these external files must be in a character set that is compatible with the character set in which the job script is encoded.
Unicode Job Scenarios in TPT
There are four scenarios for UTF-8 & UTF-16 job script & data encoding. They are outlined below.
Scenario 1: UTF-8 Job Script w/ UTF-8 Data
The following must be specified:
• Job script must be encoded in UTF-8
• Job script must specify the USING CHARACTER SET UTF8 client session character set clause
• Data must be in UTF-8
Scenario 2: UTF-8 Job Script w/ UTF-16 Data
The following must be specified:
• Job script must be encoded in UTF-8
• Job script must specify the USING CHARACTER SET UTF16 client session character set clause
• Data must be in UTF-16
The endianness of the UTF-16 data must be the native endianness for the hardware platform on which TPT is running.
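The native-endianness requirement can be checked or satisfied before a job runs. The following Python sketch shows one possible approach: it reports the byte order indicated by a leading byte order mark (BOM), if any, and re-encodes UTF-16 data of known byte order into the native order of the machine running the script. The function names are hypothetical, and whether TPT expects or tolerates a BOM in the data is not covered here.

```python
import codecs
import sys

# Codec name for UTF-16 in this machine's native byte order, without a BOM.
NATIVE_CODEC = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def bom_byte_order(data: bytes):
    """Return 'little' or 'big' if data starts with a UTF-16 BOM, else None.

    Assumes the input is UTF-16; other encodings are not distinguished.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return None


def to_native_utf16(data: bytes, source_order: str) -> bytes:
    """Re-encode BOM-less UTF-16 data of known byte order into native order."""
    source_codec = "utf-16-le" if source_order == "little" else "utf-16-be"
    return data.decode(source_codec).encode(NATIVE_CODEC)


big_endian_data = "Ab".encode("utf-16-be")
print(sys.byteorder, to_native_utf16(big_endian_data, "big").hex())
```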
Scenario 3: UTF-16 Job Script w/ UTF-8 Data
The following must be specified:
• Job script must be encoded in UTF-16
• Command line argument must specify -e UTF16
• Job script must specify the USING CHARACTER SET UTF8 client session character set clause
• The data must be in UTF-8
The endianness of the UTF-16 job script must be the native endianness for the hardware platform on which TPT is running if -e UTF16 is specified.
Scenario 4: UTF-16 Job Script w/ UTF-16 Data
The following must be specified:
• Job script must be encoded in UTF-16
• Command line argument must specify -e UTF16
• Job script must specify the USING CHARACTER SET UTF16 client session character set clause
• The data must be in UTF-16
The endianness of the UTF-16 job script must be the native endianness for the hardware platform on which TPT is running if -e UTF16 is specified.
The endianness of the UTF-16 data must be the native endianness for the hardware platform on which TPT is running.
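The four scenarios follow one pattern: the job script encoding decides whether the command line option is needed, and the data encoding decides the USING CHARACTER SET clause. The Python sketch below summarizes that pattern as a lookup. The class and function names are hypothetical, and the sketch only restates the rules listed above; it does not invoke TPT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequirements:
    command_line_option: str            # empty string when no option is needed
    using_clause: str                   # clause the job script must contain
    script_must_be_native_endian: bool
    data_must_be_native_endian: bool


# Maps an encoding name to the Teradata client session character set name.
SESSION_CHARSETS = {"UTF-8": "UTF8", "UTF-16": "UTF16"}


def requirements(script_encoding: str, data_encoding: str) -> JobRequirements:
    """Return the settings required for a given script/data encoding pair."""
    if script_encoding not in SESSION_CHARSETS or data_encoding not in SESSION_CHARSETS:
        raise ValueError("encodings must be 'UTF-8' or 'UTF-16'")
    script_is_utf16 = script_encoding == "UTF-16"
    data_is_utf16 = data_encoding == "UTF-16"
    return JobRequirements(
        command_line_option="-e UTF16" if script_is_utf16 else "",
        using_clause=f"USING CHARACTER SET {SESSION_CHARSETS[data_encoding]}",
        script_must_be_native_endian=script_is_utf16,
        data_must_be_native_endian=data_is_utf16,
    )


# Scenario 3: UTF-16 job script with UTF-8 data.
print(requirements("UTF-16", "UTF-8"))
```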
