E.2. Unicode Transformation Formats

Although Unicode incorporates the limited ASCII character set (i.e., a collection of characters), it encompasses a more comprehensive character set. In ASCII each character is represented by a byte containing 0s and 1s. One byte is capable of storing the binary numbers from 0 to 255. Each character is assigned a number between 0 and 255; thus, ASCII-based systems can support only 256 characters, a tiny fraction of world's characters. Unicode extends the ASCII character set by encoding the vast majority of the world's characters. The Unicode Standard encodes all of those characters in a uniform numerical space from 0 to 10FFFF hexadecimal. An implementation will express these numbers in one of several transformation formats, choosing the one that best fits the particular application at hand.

Three such formats are in use, called UTF-8, UTF-16 and UTF-32, depending on the size of the unitsin bitsbeing used. UTF-8, a variable-width encoding form, requires one to four bytes to express each Unicode character. UTF-8 data consists of 8-bit bytes (sequences of one, two, three or four bytes depending on the character being encoded) and is well suited for ASCII-based systems, where there is a predominance of one-byte characters (ASCII represents characters as one byte). Currently, UTF-8 is widely implemented in UNIX systems and in databases.

The variable-width UTF-16 encoding form expresses Unicode characters in units of 16 bits (i.e., as two adjacent bytes, or a short integer in many machines). Most Unicode characters are expressed in a single 16-bit unit. However, characters with values above FFFF hexadecimal are expressed with an ordered pair of 16-bit units called surrogates. Surrogates are 16-bit integers in the range D800 through DFFF, which are used solely for the purpose of "escaping" into higher numbered characters. Approximately one million characters can be expressed in this manner. Although a surrogate pair requires 32 bits to represent characters, it is space-efficient to use these 16-bit units. Surrogates are rare characters in current implementations. Many string-handling implementations are written in terms of UTF-16. [Note: Details and sample code for UTF-16 handling are available on the Unicode Consortium Web site at www.unicode.org.]

Implementations that require significant use of rare characters or entire scripts encoded above FFFF hexadecimal should use UTF-32, a 32-bit, fixed-width encoding form that usually requires twice as much memory as UTF-16 encoded characters. The major advantage of the fixed-width UTF-32 encoding form is that it expresses all characters uniformly, so it is easy to handle in arrays.

Figure E.1 shows the different ways in which the three encoding forms handle character encoding. There are few guidelines that state when to use a particular encoding form. The best encoding form to use depends on computer systems and business protocols, not on the data itself. Typically, the UTF-8 encoding form should be used where computer systems and business protocols require data to be handled in 8-bit units, particularly in legacy systems being upgraded, because it often simplifies changes to existing programs. For this reason, UTF-8 has become the encoding form of choice on the Internet. Likewise, UTF-16 is the encoding form of choice on Microsoft Windows applications. UTF-32 is likely to become more widely used in the future as more characters are encoded with values above FFFF hexadecimal. Also, UTF-32 requires less sophisticated handling than UTF-16 in the presence of surrogate pairs.

Figure E.1. Correlation between the three encoding forms.

Character

UTF-8

UTF-16

UTF-32

Latin capital letter A

0x41

0x0041

0x00000041

Greek capital letter ALPHA

0xCD 0x91

0x0391

0x00000391

CJK unified ideograph-4E95

0xE4 0xBA 0x95

0x4E95

0x00004E95

Old italic letter A

0xF0 0x80 0x83 0x80

0xDC00 0xDF00

0x00010300


Preface

Index

    Introduction to Computers, the Internet and Visual C#

    Introduction to the Visual C# 2005 Express Edition IDE

    Introduction to C# Applications

    Introduction to Classes and Objects

    Control Statements: Part 1

    Control Statements: Part 2

    Methods: A Deeper Look

    Arrays

    Classes and Objects: A Deeper Look

    Object-Oriented Programming: Inheritance

    Polymorphism, Interfaces & Operator Overloading

    Exception Handling

    Graphical User Interface Concepts: Part 1

    Graphical User Interface Concepts: Part 2

    Multithreading

    Strings, Characters and Regular Expressions

    Graphics and Multimedia

    Files and Streams

    Extensible Markup Language (XML)

    Database, SQL and ADO.NET

    ASP.NET 2.0, Web Forms and Web Controls

    Web Services

    Networking: Streams-Based Sockets and Datagrams

    Searching and Sorting

    Data Structures

    Generics

    Collections

    Appendix A. Operator Precedence Chart

    Appendix B. Number Systems

    Appendix C. Using the Visual Studio 2005 Debugger

    Appendix D. ASCII Character Set

    Appendix E. Unicode®

    Appendix F. Introduction to XHTML: Part 1

    Appendix G. Introduction to XHTML: Part 2

    Appendix H. HTML/XHTML Special Characters

    Appendix I. HTML/XHTML Colors

    Appendix J. ATM Case Study Code

    Appendix K. UML 2: Additional Diagram Types

    Appendix L. Simple Types

    Index



    Visual C# How to Program
    Visual C# 2005 How to Program (2nd Edition)
    ISBN: 0131525239
    EAN: 2147483647
    Year: 2004
    Pages: 600

    Flylib.com © 2008-2020.
    If you may any questions please contact us: flylib@qtcs.net