Character Sets and Encoding Schemes | Oracle 9i Fundamentals I Exam Cram 2

Oracle supports not only 8-bit but also 7-bit schemes for single-byte character sets. It also supports fixed-width multibyte character sets as well as varying-width multibyte character sets.

An encoding scheme for a character set specifies the numeric code assigned to each character that the computer or terminal can display and receive. The scheme is what allows stored data to be interpreted as symbols that are meaningful to the users reading it at a terminal or other machine.
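To make the idea of "numeric codes that correspond to characters" concrete, the following is a minimal sketch in Python. It is an illustration only: the codec names (ascii, latin-1, cp500) are Python's own and are assumed here as rough stand-ins for the Oracle character sets covered in this section, and decode_byte is a hypothetical helper, not an Oracle function.

```python
# Illustration only: Python codec names, not Oracle character set names.
def decode_byte(value: int, codec: str) -> str:
    """Interpret one byte value as a character under the given codec."""
    return bytes([value]).decode(codec)

if __name__ == "__main__":
    # The letter "A" has code 0x41 in ASCII but code 0xC1 in EBCDIC code page 500.
    print(repr(decode_byte(0x41, "ascii")))
    print(repr(decode_byte(0xC1, "cp500")))
    # The same code 0x41 is a different character under cp500.
    print(repr(decode_byte(0x41, "cp500")))
```

The same stored byte therefore reads as a different symbol depending on which encoding scheme is used to interpret it.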

To this end, Oracle offers several classes of encoding schemes: single-byte schemes, varying-width multibyte schemes, fixed-width multibyte schemes, and Unicode schemes.

Single-Byte Character Sets

In single-byte character sets, each character occupies exactly one byte of storage.

Single-byte, 7-bit encoding schemes can be used to define up to 128 characters (2 to the 7th power characters). The US7ASCII 7-bit character set corresponds to the ASCII 7-bit American character set.

Single-byte, 8-bit encoding schemes can be used to define up to 256 characters (2 to the 8th power characters). Several 8-bit character sets are available. ISO 8859-1 Western European is represented by the WE8ISO8859P1 character set in Oracle. EBCDIC Code Page 500 8-bit Western European is represented by the WE8EBCDIC500 character set. DEC 8-bit Western European is represented by the WE8DEC character set.
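The 7-bit and 8-bit limits can be checked with a short Python sketch. This is an illustration under stated assumptions: Python's ascii, latin-1, and cp500 codecs are used as approximations of US7ASCII, WE8ISO8859P1, and the EBCDIC Code Page 500 character set, and fits_single_byte is a hypothetical helper name.

```python
# Illustration only: Python codecs approximate the Oracle character sets.
def fits_single_byte(text: str, codec: str) -> bool:
    """Return True if every character in text encodes to exactly one byte."""
    try:
        return len(text.encode(codec)) == len(text)
    except UnicodeEncodeError:
        # At least one character has no code in this character set.
        return False

if __name__ == "__main__":
    print(2 ** 7, 2 ** 8)  # capacity of 7-bit and 8-bit schemes: 128 256
    word = "caf\u00e9"     # "cafe" with an accented final e
    for codec in ("ascii", "latin-1", "cp500"):
        print(codec, fits_single_byte(word, codec))
```

The accented character has no code among the 128 available in a 7-bit set, but it does fit in either of the 8-bit Western European sets.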

Varying-Width Multibyte Character Sets

Varying-width multibyte character sets represent each character with one or more bytes. These multibyte character sets are typically used to store Asian languages and to support operations based on those languages. Many of them use the most significant bit of each byte to indicate whether the byte represents an entire character or is part of a multibyte character; others differentiate single-byte from multibyte characters in other ways.

Examples of varying-width multibyte schemes include the following:

Japanese Extended Unix Code
Chinese GB2312-80
AL32UTF8

AL32UTF8 is a varying-width multibyte character set.
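The varying widths described in this section can be observed with a small Python sketch. It assumes Python's euc_jp, gb2312, and utf-8 codecs as stand-ins for the Oracle character sets listed above; byte_widths and has_high_bit are hypothetical helper names used only for illustration.

```python
from typing import List

# Illustration only: Python codecs stand in for the Oracle character sets.
def byte_widths(text: str, codec: str) -> List[int]:
    """Return the number of bytes each character occupies in the codec."""
    return [len(ch.encode(codec)) for ch in text]

def has_high_bit(byte: int) -> bool:
    """Return True if the most significant bit of the byte is set."""
    return bool(byte & 0x80)

if __name__ == "__main__":
    print(byte_widths("A\u3042", "euc_jp"))  # Latin A, then Hiragana A
    print(byte_widths("A\u4e2d", "gb2312"))  # Latin A, then a Chinese character
    print(byte_widths("A\u4e2d", "utf-8"))
    # In EUC-JP, both bytes of the Hiragana character have the high bit set.
    print([has_high_bit(b) for b in "\u3042".encode("euc_jp")])
```

In each case the ASCII letter occupies a single byte whose high bit is clear, while the Asian character occupies two or three bytes.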

Fixed-Width Multibyte Character Sets

Fixed-width multibyte character sets provide database support similar to that of the varying-width multibyte character sets in that they store characters in more than one byte. The difference is that every character takes up 2 bytes of storage, regardless of how few bytes the character would otherwise need. The uniform character size makes it easier to calculate storage sizes exactly. Oracle supports only one fixed-width multibyte character set, AL16UTF16.

AL16UTF16 is a fixed-width multibyte character set.
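The size calculation that a fixed width makes possible can be sketched in Python. This is an illustration, not Oracle behavior: it assumes Python's utf-16-be codec (big-endian, no byte order mark) as a stand-in for AL16UTF16, and it assumes every character is in Unicode's Basic Multilingual Plane; supplementary characters are not covered by this sketch. The name utf16_storage_bytes is hypothetical.

```python
# Illustration only: assumes Basic Multilingual Plane characters.
def utf16_storage_bytes(text: str) -> int:
    """Return the bytes needed to store text as big-endian UTF-16, no BOM."""
    return len(text.encode("utf-16-be"))

if __name__ == "__main__":
    sample = "A\u3042\u4e2d"  # Latin, Hiragana, and Chinese characters
    # With a fixed 2-byte width, storage is simply 2 * number of characters.
    print(utf16_storage_bytes(sample), 2 * len(sample))
```

Under these assumptions the storage size depends only on the character count, not on which characters are stored.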

Unicode Character Set

Unicode is a global character encoding standard that represents all characters for computer usage, including technical symbols and characters used in publishing. The current standard, version 3.0 of the Unicode standard, contains 49,194 characters and can be extended to more than a million different characters.

Unicode characters can be represented in several different encoding formats. UTF-16, or 16-bit Universal Character Set Transformation Format, is a 2-byte, fixed-width format. UTF-8 is a multibyte, varying-width format. To take advantage of these two standard formats, Oracle provides AL32UTF8, UTF8, and UTFE as database character sets and AL16UTF16 and UTF8 as national character sets. The advantage of using UTF-8 based character sets is that they include the standard ASCII characters, implemented using the same single-byte encoding. Because the UTF8 character set is a superset of ASCII, migration is easier when you upgrade an ASCII-based database character set to a Unicode character set.
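The superset relationship that makes migration from ASCII easier can be illustrated with a short Python sketch. It assumes Python's ascii and utf-8 codecs as stand-ins for US7ASCII and the Oracle UTF-8 based character sets; is_ascii_compatible is a hypothetical helper, not part of any Oracle tool.

```python
# Illustration only: checks whether bytes are unchanged when read as UTF-8.
def is_ascii_compatible(data: bytes) -> bool:
    """Return True if data is pure ASCII and byte-identical in UTF-8."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        # Some byte is above 127, so the data is not 7-bit ASCII.
        return False
    return text.encode("utf-8") == data

if __name__ == "__main__":
    print(is_ascii_compatible(b"SELECT 1 FROM dual"))
    print(is_ascii_compatible("caf\u00e9".encode("latin-1")))
```

Pure ASCII data needs no conversion because its bytes are already valid UTF-8, whereas 8-bit data such as the second example would have to be converted.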

Be careful with the difference between UTF-8 (with a hyphen) and UTF8 (without a hyphen). UTF-8 refers to the Unicode standard, and UTF8 refers to the Oracle character set based on the Unicode standard.

The Unicode standard provides a unique code value for every character, regardless of the platform on which the system is running, the program that is running, or the language in which the data is stored or accessed. Many software and hardware vendors have adopted the Unicode standard. Many operating systems and browsers now support Unicode, which is required by many other current standards (such as XML, Java, JavaScript, and LDAP) and is also compatible with the ISO/IEC 10646 standard. Oracle has supported Unicode character sets as far back as Oracle 7, and Oracle 9i greatly expands that support.

UTF-8 is the 8-bit encoding of Unicode. It is a variable-width multibyte encoding scheme in which the ASCII characters keep the same single-byte codes that they have in ASCII.

UTF-16 is the 16-bit encoding of Unicode. It is a 2-byte, fixed-width encoding scheme in which the ASCII characters keep the same code values that they have in ASCII, although each one is stored in 2 bytes rather than 1.
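The relationship between ASCII codes and the two Unicode encodings can be compared side by side in Python. This sketch assumes Python's utf-8 and utf-16-be codecs (big-endian, no byte order mark) as stand-ins for the Oracle character sets, and utf16_code_units is a hypothetical helper name.

```python
from typing import List

# Illustration only: splits big-endian UTF-16 bytes into 16-bit code values.
def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code values of text in big-endian UTF-16."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

if __name__ == "__main__":
    text = "Az"
    print(text.encode("utf-8"))       # same bytes as ASCII
    print(text.encode("utf-16-be"))   # each code padded to 2 bytes
    print(utf16_code_units(text), [ord(ch) for ch in text])
```

The code values match ASCII in both encodings, but only the UTF-8 bytes are identical to the ASCII bytes.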

Two different types of character sets are of concern to DBAs. The first is the database character set, and the second is the national character set. The following sections discuss these character sets and their differences.