4.3. Coding Space

Coding space, or "codespace" to use the Unicode standard terminology, is the range of integers that can be used as numbers for characters. In an 8-bit encoding, the coding space is the range from 0 to 255. In Unicode, the coding space ranges from 0 to 10FFFF in hexadecimal, 1,114,111 in decimal. Some numbers in the range correspond to characters, some have been excluded from such usage, and some are currently unassigned.

A code point, also called a code position, is simply a value in the coding space. It may or may not have a character assigned to it.

The way Unicode uses the coding space is, strictly speaking, a technicality that does not affect the identity or properties of any character. In that sense, the allocation is independent of other design decisions. It is surely important to people who develop the Unicode standard, since the sheer number of characters makes some logical planning and allocation principles necessary. But does it interest others?

Understanding the principles of using the coding space helps in locating characters. Many tables and utilities present Unicode characters as organized according to the coding space structure and usage. Typically, you see blocks of characters, so you need to know what a block is. It also helps to know how blocks are organized internally, though we can list only rather general principles.

4.3.1. Planes

For practical reasons, the coding space has been divided into parts called planes. You can visualize a plane as a huge sheet of paper with 65,536 (256 times 256) squares, each of which might contain a character. Then imagine a pile of 17 such sheets. There you have the Unicode coding space.

Originally, Unicode was designed to be a 16-bit code and ISO 10646 a 32-bit code, divided into 16-bit planes. When they were harmonized, it was decided to use the ISO 10646 approach as the basis. However, an agreement between ISO and the Unicode Consortium guarantees that only the first 17 planes will ever be used. This effectively means that the coding space consists of the numbers that can be expressed in 21 bits, with the first 5 bits specifying the plane and the rest the position inside a plane.

Until recently, the use of Unicode has mostly been limited to the BMP, which consists of the range 0..FFFF and corresponds to the original design of Unicode. The other planes are 10000..1FFFF, 20000..2FFFF, etc., up to 100000..10FFFF.
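The split into plane and position can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `split_code_point` is made up for this example and is not part of any library.

```python
def split_code_point(cp: int) -> tuple[int, int]:
    """Return (plane, position inside the plane) for a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp:#x} is outside the Unicode coding space")
    # The upper 5 of the 21 bits select the plane,
    # the lower 16 bits select the position inside the plane.
    return cp >> 16, cp & 0xFFFF


for cp in (0x0041, 0x1D400, 0x10FFFF):
    plane, position = split_code_point(cp)
    print(f"U+{cp:04X}: plane {plane:X}, position {position:04X}")
```

Note that 21 bits could express values up to 1FFFFF; the range check is needed because only planes 0 through 10 (hexadecimal) belong to the coding space.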

Nowadays, there are many characters allocated on other planes as well, and rarely used characters (such as characters used in extinct writing systems, appearing in historical documents only) are being added to Unicode that way. Thus, Unicode was first theoretically, and then practically extended beyond a limitation to 16 bits (i.e., to code numbers that can be expressed as 16-bit integers).

Currently, and in the foreseeable future, only the first three planes are used for assigning characters in the standard. The big picture is the following, using hexadecimal numbers for the planes (with decimal numbers in parentheses):

  • Plane 0, Basic Multilingual Plane (BMP), contains most characters used in modern writing systems (and many from historical systems).

  • Plane 1, Supplementary Multilingual Plane (SMP), contains characters used in archaic writing systems as well as various collections of special symbols, including many mathematical symbols.

  • Plane 2, Supplementary Ideographic Plane (SIP), contains less-common Chinese-Japanese-Korean (CJK) characters that do not fit into BMP for practical reasons.

  • Planes 3 through D (= 13) are currently unassigned, i.e., reserved for possible future assignments.

  • Plane E (= 14) is called Supplementary Special-Purpose Plane (SSPP) and reserved for purposes such as code points for control functions.

  • Planes F and 10 (= 15 and 16) are designated for use as Private Use Planes. This means that the standard does not and will not define their use, beyond saying that they can be used by private agreement.

4.3.2. Allocation Areas

Between the plane level and the next formally defined levels of allocation (rows and blocks) there is an auxiliary and informal structuring level, allocation areas. The areas are mainly an organizational device for Unicode development, but they may also help you to get an overview of the use of coding space. An area may contain a set of writing systems of similar type or some other large set. The current allocation areas are:

  • On plane 0 (BMP):

    • General Scripts (Latin, Greek, Cyrillic, Armenian, and many others)

    • CJK Miscellaneous (different characters used in East Asian scripts)

    • Asian Scripts (Yi script and Korean Hangul)

    • Surrogates (reserved)

    • Private Use (for use by agreements outside the standard)

    • Compatibility and Specials (presentation forms etc., and a few formatting characters and special code points)

  • On plane 1 (SMP):

    • General Scripts (various small archaic scripts)

    • Notational Systems (musical, mathematical, and divination symbols)

  • On plane 2 (SIP):

    • CJK Unified Ideographs Extension B

    • CJK Compatibility Ideographs Supplement

4.3.3. Rows and Blocks

Each plane contains 65,536 (2^16) code points, which can be divided into 256 (2^8) parts called rows. The term can be misleading, since such a row is often presented visually as an array (matrix) with 16 rows and 16 columns.

The division of a plane into rows corresponds to splitting the last four hexadecimal digits in a code number into two parts consisting of two hexadecimal digits each. For example, U+1234 belongs to row 12 (hexadecimal), where it occupies the relative position 34 (hexadecimal). We can say that for characters in the BMP, the first two of the four hexadecimal digits select the row, and the last two select the position within a row.
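As a minimal sketch of this digit splitting, the row and the position within the row can be computed with a shift and a mask. The helper name `row_and_position` is hypothetical.

```python
def row_and_position(cp: int) -> tuple[int, int]:
    """Return (row, position within row), both relative to the code point's plane."""
    # Discard the plane, then split the remaining 16 bits into two octets.
    return (cp >> 8) & 0xFF, cp & 0xFF


row, position = row_and_position(0x1234)
print(f"row {row:02X}, position {position:02X}")
```

For a code point outside the BMP, such as U+1D400, the same function gives the row inside its own plane (here row D4 of plane 1).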

We will not use such a row concept much, and it is not very common in the Unicode context. There is a more important concept of a block. A block is a contiguous range of code points that have similar characteristics in some sense; the range has a name assigned to it in the Unicode standard. A block may contain code points that are unassigned or designated as noncharacters.

Rows and blocks are two different ways of dividing a plane into parts: a technical (or mathematical) way and a logical way. A block may be just part of a row, and vice versa.

The first block is called "Basic Latin," and it occupies the range U+0000 to U+007F. It has been formed simply because it contains the ASCII characters, with code numbers equaling those in ASCII. The block "Arrows," U+2190 to U+21FF, is much more homogeneous: it contains different arrow characters (←, ↑, etc.) and nothing else. The block "Mathematical Operators," U+2200 to U+22FF, contains a mixed collection of operator symbols used in mathematics, but it does not contain all such symbols. Thus, the names of blocks should be understood by implying the word "Some" rather than "The" at the start; for example, the block Currency Symbols is not the block for the currency symbols but a block for some currency symbols. Many currency symbols appear in other blocks, including $ in Basic Latin.

Although a block may consist of a collection of characters of the same kind, blocks cannot be meaningfully used for classification of characters. Instead, use the General Category property and other formally defined properties (see Chapter 5).


In many cases, a block corresponds to a row in the sense described in the previous paragraphs. For example, the block "Cyrillic" is U+0400 to U+04FF, i.e., row 04 (of plane 0). As the other examples show, however, a block may correspond to a part of a row only. On the other hand, a block may extend over several rows. For example, the block "Mathematical Alphanumeric Symbols," which is a relatively recent addition to Unicode, occupies the range U+1D400 to U+1D7FF, therefore spanning rows D4 to D7 of plane 1.
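Since blocks are just named ranges, finding the block of a code point is a range lookup. The following sketch uses a small hand-picked excerpt of blocks, not the complete list; the names `BLOCKS` and `block_of` are invented for this illustration, and a real program would load the full list from the Unicode character database.

```python
from bisect import bisect_right

# A small excerpt of Unicode blocks, sorted by first code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2070, 0x209F, "Superscripts and Subscripts"),
    (0x20A0, 0x20CF, "Currency Symbols"),
    (0x2190, 0x21FF, "Arrows"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0x1D400, 0x1D7FF, "Mathematical Alphanumeric Symbols"),
]
STARTS = [first for first, _, _ in BLOCKS]


def block_of(cp: int):
    """Return the block name for cp, or None if cp is not in the excerpt."""
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0:
        first, last, name = BLOCKS[i]
        if first <= cp <= last:
            return name
    return None


print(block_of(ord("$")))   # the dollar sign is in Basic Latin
print(block_of(0x20AC))     # the euro sign is in Currency Symbols
print(block_of(ord("+")))   # the plus sign is in Basic Latin, too
```

The lookup tells where a character is located, not what kind of character it is, which is why blocks are a poor basis for classification.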

4.3.4. Unicode as Extension of ISO-8859-1

Unicode can be regarded as an extension of practically any character code, in the sense that the Unicode character repertoire contains all characters that appear in at least one character code. However, the code numbers are generally different, of course.

Unicode is an extension of ISO-8859-1 (ISO Latin 1), and thereby an extension of ASCII, in a different, much stronger sense. The code numbers of ISO-8859-1 characters are exactly the same in Unicode as in ISO-8859-1. The range U+0000 to U+00FF has thus been directly copied from ISO-8859-1, although it has been divided into blocks: Basic Latin (U+0000 to U+007F) and Latin-1 Supplement (U+0080 to U+00FF).

Beware that Unicode is not an extension of Windows Latin 1 (windows-1252, often misleadingly called "ANSI") in the same sense. Unicode contains all Windows Latin 1 characters, of course, but characters with numbers 80 to 9F (hexadecimal) in Windows Latin 1 have quite different numbers in Unicode. They have in fact been scattered around in different blocks, although many of them appear in the General Punctuation block. The reason for this is that in Unicode, the range U+0080 to U+009F is reserved for control characters, as in ISO-8859-1.
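The difference between the two codes can be checked with Python's built-in codecs. This is a sketch that relies on Python's `latin-1` and `cp1252` codec tables; the variable names are arbitrary.

```python
# ISO-8859-1: every octet value equals the Unicode code number it decodes to.
latin1_is_identity = all(
    ord(bytes([octet]).decode("latin-1")) == octet for octet in range(256)
)
print(latin1_is_identity)

# windows-1252: a few octets from the range 80..9F and where they end up.
for octet in (0x80, 0x93, 0x94, 0x96):
    cp = ord(bytes([octet]).decode("cp1252"))
    print(f"windows-1252 {octet:02X} -> U+{cp:04X}")
```

Here octet 80 maps to U+20AC (the euro sign, in the Currency Symbols block), whereas 93, 94, and 96 map to U+201C, U+201D, and U+2013 in the General Punctuation block.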

The special role of ISO-8859-1 of course makes many things technically simpler to people and applications for which ISO-8859-1 has been suitable. If they need some additional characters, they can switch to Unicode smoothly, to some extent.

Conversion from ISO-8859-1 to Unicode requires a change in data representation, though. A file of ISO-8859-1 characters consists of 8-bit units, octets, in a manner that is different from Unicode encoding forms. If the data contains ASCII characters only, no change in representation is needed: a file of ASCII characters can be treated as a file of Unicode characters (in the Basic Latin block) in the UTF-8 encoding.
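A short sketch of this point, using an arbitrary sample word: the code numbers stay the same, but the octets in the file change as soon as a character outside ASCII is involved.

```python
text = "caf\u00e9"  # "café"; U+00E9 is in the Latin-1 Supplement block

as_latin1 = text.encode("latin-1")  # one octet per character
as_utf8 = text.encode("utf-8")      # U+00E9 takes two octets in UTF-8
print(as_latin1.hex(), as_utf8.hex())

# Pure ASCII data has the same octets in ASCII, ISO-8859-1, and UTF-8.
ascii_text = "cafe"
same_octets = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)
print(same_octets)
```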

Since ISO-8859-1 is a mixture of rather different characters, the decision to use it as the model for the first two blocks in Unicode has implications for other blocks. The ISO-8859-1 characters do not appear as duplicates in other blocks, even though they would semantically belong there. For example, the plus sign +, once included in ASCII, does not appear in the Mathematical Operators block.

4.3.5. Internal Structure of Blocks

The internal structure of a block is not something that you need to know to use Unicode. Just as numbers of characters are in principle just labels permanently attached to characters, the mutual order and position of characters (by their code numbers) in a block is "arbitrary" in a sense. For example, letters might or might not appear in alphabetic order. Although the standard guarantees that assigned code numbers will never change, it is usually not a good idea to base processing of characters on the mutual relationships of their code numbers.

Unicode blocks are usually shown as arrays with 8 or 16 columns. The code charts in the Unicode standard organize the arrays so that they need to be read by column, if you wish to follow the code number order. For example, Figure 4-1 shows the start of the code chart for the Cyrillic block. Characters U+0400, U+0401, etc., appear in the first column, under the column heading "040." The order looks rather random if you read the array by row, but if you read by column, and hence by code number order, it has parts where the order corresponds to the alphabetic order in Russian: А, Б, В, and so on.

Each block is meant to contain a collection of characters that belong together in an essential way. Often the collection and its internal order have been taken from an older, 8-bit character code designed for some language or purpose, though with modifications. In this context, official international (ISO) standards have been preferred to vendor-specific codes, even when the latter have been more common in actual use.

Figure 4-1. Excerpt from the code chart for the Cyrillic block

For example, the Cyrillic block is based on the ISO 8859-5 code, which we discussed in Chapter 3. Characters in the block have the same relative positions as in ISO 8859-5. However, ISO 8859-5 characters, such as Latin letters, that already exist in Unicode in other blocks were not included in the Cyrillic block. (Many characters in Figure 4-1 look like Latin letters, but they are Cyrillic letters.) In other words, the characters in ISO 8859-5 with code numbers A1 to FF were directly copied to the Unicode range U+0401 to U+045F, except that characters that exist in other blocks (such as Basic Latin and Latin-1 Supplement) were omitted. The rest of the range U+0400 to U+04FF (U+0400 and columns 046 through 04F in the code chart illustrated in Figure 4-1) was used for Cyrillic characters not present in ISO 8859-5.
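The correspondence between ISO 8859-5 and the Cyrillic block can be examined with Python's `iso8859_5` codec. The sketch below assumes that this codec reflects the standard mapping; it lists the octets in the range A1 to FF that do not follow the simple "same relative position" rule.

```python
exceptions = {}
for octet in range(0xA1, 0x100):
    cp = ord(bytes([octet]).decode("iso8859_5"))
    same_relative_position = 0x0400 + (octet - 0xA0)
    if cp != same_relative_position:
        exceptions[octet] = cp

for octet, cp in sorted(exceptions.items()):
    print(f"ISO 8859-5 {octet:02X} -> U+{cp:04X} (not in the Cyrillic block)")
```

Only three octets deviate: AD (soft hyphen, U+00AD), F0 (numero sign, U+2116), and FD (section sign, U+00A7). These are characters that are coded in other blocks, so the corresponding positions in the Cyrillic block were not used for them.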

The omission of already coded characters follows the principle of not coding the same character twice, even though this prevents simple correspondence between other character codes and Unicode. If the Cyrillic block were just a copy of the ISO 8859-5 code table, shifted to a different range, transcoding between ISO 8859-5 and Unicode would be trivial. However, many other things would have become more complex, if such an approach had been taken. For example, all ASCII characters would appear in many copies in different blocks. This would waste coding space and make even simple tests like "is this character 'X'?" more complicated: the data being tested would need to be tested against all the appearances of "X" in different blocks.

This explanation was meant to emphasize that Unicode blocks are not similar or comparable to 8-bit codes, even in the relatively common case where a block consists of one "row" of 256 code points and has been defined with some 8-bit code in mind. Using Unicode, you don't switch between blocks by selecting (in some special way) first some block, then another; you just use characters from different blocks.

Unicode blocks are not "code pages."


Some blocks contain characters from one script (writing system; see Chapter 7) only, and might be named according to the script, such as "Devanagari." However, in general there is no one-to-one mapping between blocks and scripts. Blocks may contain characters from several scripts, and many scripts have been divided into several blocks.

The block concept and the principle of not coding the same character twice can be illustrated by looking at the block Superscripts and Subscripts. It contains the following code points:

  • U+2070 superscript zero

  • U+2071 superscript Latin small letter "i"

  • U+2072 (reserved, with a cross reference note to U+00B2 superscript two)

  • U+2073 (reserved, with a cross reference note to U+00B3 superscript three)

  • U+2074 superscript four

  • etc., up to U+2079 superscript nine

This looks odd, since we would expect that the superscript digits appear in consecutive code positions. There is a "hole" where we would expect superscript two and superscript three to appear, but the code points are reserved. The reason is that those characters, ² and ³, already exist in the Latin-1 Supplement block. The positions where they would otherwise appear were intentionally left unassigned. This made it explicit that those superscripts do not appear in the block where one might expect to find them, but elsewhere.

So why isn't U+2071 reserved analogously, with reference to U+00B9 superscript one? You don't want to hear the full story, but originally it was reserved, and then allocated to superscript "i" in Unicode Version 3.2 after a long debate. In Unicode terminology, "reserved" means "unassigned (for now)," instead of guaranteeing that the code point will remain unassigned.
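The practical consequence is that code that needs superscript digits should not compute them as an offset from U+2070. A minimal sketch, assuming Python's `unicodedata` module (whose contents depend on the Unicode version bundled with the interpreter), is to look the characters up by name instead; the helper `superscript_digit` is invented for this example.

```python
import unicodedata

DIGIT_NAMES = ["ZERO", "ONE", "TWO", "THREE", "FOUR",
               "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"]


def superscript_digit(digit: int) -> str:
    """Return the superscript character for a digit 0..9, found by name."""
    return unicodedata.lookup(f"SUPERSCRIPT {DIGIT_NAMES[digit]}")


for digit in range(10):
    ch = superscript_digit(digit)
    print(f"{digit}: U+{ord(ch):04X}")

# What actually sits at U+2071, the position after superscript zero:
print(unicodedata.name("\u2071"))
```

Superscript one, two, and three are found at U+00B9, U+00B2, and U+00B3, and the rest at U+2070 and U+2074 to U+2079.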

4.3.6. Noncharacter Code Points

The last two code points of the BMP, namely U+FFFE and U+FFFF, as well as the corresponding points on other planes, have been explicitly defined as forbidden in Unicode data. By definition, they do not denote any character or control function, and their occurrence in character data may be treated as an error. However, they may appear in a data stream that contains character data; they would then indicate noncharacter data.

The reason for disallowing U+FFFE in any Unicode data is that such a convention helps to detect common errors caused by different byte orders. If a Unicode text file begins with a byte order mark (BOM, U+FEFF), then an attempt to read the file on a system or application that assumes the opposite byte order will result in an immediate error. The byte order mark will be read with its octets swapped, as U+FFFE, and some error recovery can be applied. Byte order is a matter of encoding, to be discussed in Chapter 6. Briefly, byte order specifies the order of octets within a multi-octet unit of data (two octets in UTF-16, four in UTF-32).
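The detection described here can be sketched as follows for data in 16-bit units. The function name `sniff_utf16_byte_order` is hypothetical, and the sketch deliberately ignores other encodings (for example, a UTF-32 file in little-endian order also starts with the octets FF FE).

```python
def sniff_utf16_byte_order(data: bytes):
    """Guess the byte order of 16-bit Unicode data from a leading BOM.

    Returns "big-endian", "little-endian", or None if there is no BOM.
    """
    first_unit = int.from_bytes(data[:2], "big")  # read as big-endian
    if first_unit == 0xFEFF:
        return "big-endian"       # the BOM reads correctly
    if first_unit == 0xFFFE:
        return "little-endian"    # octets swapped: the noncharacter U+FFFE
    return None


print(sniff_utf16_byte_order("\ufeffhello".encode("utf-16-be")))
print(sniff_utf16_byte_order("\ufeffhello".encode("utf-16-le")))
```

Because U+FFFE can never be a real character, seeing it at the start of the data is an unambiguous sign of swapped octets.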

In a sense, this might be seen as assigning U+FFFE a meaning: it could be interpreted as a "reversed byte order mark," so that an application can simply reverse the order when reading the data. Such things happen when error processing is defined exactly or is obvious from context. An error becomes a feature then.

The code point U+FFFF corresponds to the number -1 when interpreted as a 16-bit signed integer in two's complement notation. Making it a noncharacter continues an old tradition. Even in the ASCII world, the corresponding code point FF is often treated the same way. Programs that were written to process ASCII data only, but using at least 8-bit storage units, were often made to treat an octet with the first bit set as indicating the absence of character data, e.g., the end of an array of characters or the end of an input stream. It was most natural to use an octet with all bits set, FF, for this purpose. In particular, an input routine that returns a character can use U+FFFF as its return value, to indicate that no character was received.

Moreover, code points U+FDD0..U+FDEF have been defined as noncharacters, and applications may use them for different sentinel or indicator purposes. Similarly to U+FFFE and U+FFFF, they should not appear in character data. However, in a program, a function that normally returns a character may return one of these values to signal "no character" and some additional information. These code points can also appear in a data file as long as they are not interpreted as characters.
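A test for noncharacter code points follows directly from these definitions. The function name `is_noncharacter` is chosen for this sketch only.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


in_bmp = sum(is_noncharacter(cp) for cp in range(0x10000))
elsewhere = sum(is_noncharacter(cp) for cp in range(0x10000, 0x110000))
print(in_bmp, elsewhere)

# U+FFFF read as a 16-bit two's complement integer:
print(int.from_bytes(b"\xff\xff", "big", signed=True))
```

The counts are 34 in the BMP (32 in the range U+FDD0..U+FDEF plus U+FFFE and U+FFFF) and 32 on the 16 other planes, 66 in all.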

When a program encounters a noncharacter code point in character data, the Unicode standard allows several options:

  • It may be treated internally as an indicator or sentinel.

  • An error may be signaled.

  • It may be ignored.

  • It may be removed from the data stream (that the program passes forward).

  • It may be treated as an unassigned code point, e.g., so that if a function for getting the value of a property for a character is called with a noncharacter argument, the function would return the same value as for an unassigned code point.

4.3.7. Classification of Code Points

Not all code points in the Unicode coding space correspond to characters. There are the following possibilities:


Assigned

The code point is assigned to a character. Such an allocation will never be removed or changed, though the properties of the character may be changed. The character might be declared as deprecated, but it will remain a Unicode character. The word "character" is to be interpreted in a broad, Unicode sense: it covers normal characters with graphic appearance, combining diacritic marks (which are normally shown as small marks on a base character), different spaces, formatting characters such as line break indicators, and control characters, to be defined in other standards.


Private use

The code point is reserved for "private" use, i.e., for use by a specific agreement between interested parties. This, too, is a permanent allocation. Applications may use the code point for their own purposes, such as representing a character that has not been included in Unicode.


Noncharacter

The code point is designated as not corresponding to any character ever. This is permanent: the code point will never be assigned to a character. For historical reasons, some such code points are called "surrogate" code points.


Unassigned

The code point is currently unassigned. It may be allocated in the future. Use of the code point for any purpose is unwise: if you use it for private purposes now, it may later become assigned to a character in the Unicode standard. To emphasize this, unassigned code points are called "reserved."

For example, code point U+0021 is assigned to a character, the exclamation mark. Code point U+E000 is reserved for private use; it is the first code point in a large range of private use characters. Code point U+D800 does not correspond to any character; it corresponds to a "high-surrogate" value but does not represent any character. Code point U+0380 is unallocated in Unicode 4.1; it might be assigned to a character later, probably to a Greek character, since it belongs to a block of Greek characters.

Previously, the situation was more complicated due to so-called surrogates. Terminology and concepts around them were confusing, but the surrogate concept has now been moved from the code point level to the encoding level, to be discussed in Chapter 6. The old approach is still reflected in the names "high-surrogate code point" and "low-surrogate code point."

This probably sounds rather confusing. Table 4-3 is meant to illustrate the classification. The column "Category" lists the short symbols of General Category values (to be explained in Chapter 5) for code points that belong to the type.

Table 4-3. Classification of code points

Type          Description                      Example                Category
------------  -------------------------------  ---------------------  -----------------
Graphic       A visible character              "A" U+0041             L, M, N, P, S, Zs
Format        Invisible, formatting            Line separator U+2028  Cf, Zl, Zp
Control       Control code, defined elsewhere  Backspace U+0008       Cc
Private use   Use by "private" agreement       U+E000                 Co
Surrogate     Reserved, should not appear      U+D800                 Cs
Noncharacter  Reserved for noncharacter use    U+FFFF                 Cn
Reserved      Unassigned (for now)             U+05FF                 Cn


Depending on your viewpoint, you might say that only code points of type "graphic" correspond to characters proper. You might take a broader view and regard "format," "control," and "private use" code points as representing characters, too. Other code points do not correspond to characters, although reserved code points may do so in future versions.
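The classification in Table 4-3 can be approximated from the General Category value. The sketch below uses Python's `unicodedata` module, so the result for a given code point depends on the Unicode version bundled with the interpreter; the function name `code_point_type` and the returned labels are chosen for this example. Since both noncharacters and reserved code points have category Cn, they are told apart by range.

```python
import unicodedata


def code_point_type(cp: int) -> str:
    """Map a code point to one of the types of Table 4-3."""
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        noncharacter = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        return "noncharacter" if noncharacter else "reserved"
    if category == "Cs":
        return "surrogate"
    if category == "Co":
        return "private use"
    if category == "Cc":
        return "control"
    if category in ("Cf", "Zl", "Zp"):
        return "format"
    return "graphic"  # L, M, N, P, S, Zs


for cp in (0x0041, 0x2028, 0x0008, 0xE000, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}: {code_point_type(cp)}")
```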

To illustrate the use of the coding space, Table 4-4 shows the number of code points as defined in the Unicode 4.1 standard and as planned for the Unicode 5.0 standard. The counts are given separately for the Basic Multilingual Plane and other planes.

Table 4-4. Number of different code points in Unicode

 Ver. 4.1  5.0 (plan)  Type of code points
---------  ----------  --------------------------------------------------------
   51,640      51,980  Assigned graphic characters (BMP)
       35          35  Assigned format characters (BMP)
       65          65  Assigned control characters (BMP)
    6,400       6,400  Private use code points (BMP)
    2,048       2,048  Surrogate code points (BMP)
       34          34  Noncharacters, other (BMP)
    5,314       4,974  Unassigned (reserved) code points (BMP)
   45,875      46,904  Assigned graphic characters (supplementary planes)
      105         105  Assigned format characters (supplementary planes)
  131,068     131,068  Private use code points (supplementary planes)
       32          32  Noncharacters (supplementary planes)
  871,496     870,467  Unassigned (reserved) code points (supplementary planes)
1,114,112   1,114,112  All code points together
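As a consistency check on Table 4-4, the rows of each column should add up to the size of the coding space, 17 planes of 65,536 code points. The following sketch simply copies the figures from the table; the variable names are arbitrary.

```python
# (Unicode 4.1, Unicode 5.0 plan, type of code points), copied from Table 4-4.
TABLE_4_4 = [
    (51_640, 51_980, "Assigned graphic characters (BMP)"),
    (35, 35, "Assigned format characters (BMP)"),
    (65, 65, "Assigned control characters (BMP)"),
    (6_400, 6_400, "Private use code points (BMP)"),
    (2_048, 2_048, "Surrogate code points (BMP)"),
    (34, 34, "Noncharacters, other (BMP)"),
    (5_314, 4_974, "Unassigned (reserved) code points (BMP)"),
    (45_875, 46_904, "Assigned graphic characters (supplementary planes)"),
    (105, 105, "Assigned format characters (supplementary planes)"),
    (131_068, 131_068, "Private use code points (supplementary planes)"),
    (32, 32, "Noncharacters (supplementary planes)"),
    (871_496, 870_467, "Unassigned (reserved) code points (supplementary planes)"),
]

total_4_1 = sum(row[0] for row in TABLE_4_4)
total_5_0 = sum(row[1] for row in TABLE_4_4)
bmp_4_1 = sum(row[0] for row in TABLE_4_4 if row[2].endswith("(BMP)"))
bmp_5_0 = sum(row[1] for row in TABLE_4_4 if row[2].endswith("(BMP)"))

print(total_4_1, total_5_0, 17 * 65_536)
print(bmp_4_1, bmp_5_0, 65_536)
```

Both columns add up to 1,114,112, and the BMP rows alone add up to 65,536 in both versions: new characters are taken from the unassigned code points, so the totals never change.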


4.3.8. Surrogates

Unicode uses the word "surrogate" in a particular technical meaning. To avoid confusion, it is best to avoid this word in its loose everyday meaning; use words like "replacement" instead if you just want to write about using a character in the role of another character.

Originally, surrogates were invented as a method of overcoming the limitations of the 16-bit coding space. To represent characters outside that space, you would reserve some ranges of 16-bit values, called high and low surrogates, and represent a character as a pair of such values. Naturally, the number of characters that you can represent in the 16-bit coding space itself was decreased, since the high and low surrogates must not be used to represent characters, except in pairs as defined.

In Unicode as defined now, surrogates are not to be used as code points. The ranges allocated for high and low surrogates exist in the coding space, as U+D800..U+DBFF and U+DC00..U+DFFF, but code points in those ranges are not supposed to appear in Unicode data as such. Instead, one particular encoding, UTF-16, uses code units (16-bit quantities) with values in the surrogate ranges as a method of encoding characters outside the Basic Multilingual Plane.

Thus, when a program reads data in UTF-16 encoding, it needs to interpret any pair of surrogate code units as a single Unicode character. After this interpretation, the data contains the character with its designated code number (> FFFF hexadecimal), with no trace of any surrogates.

If a code point in a surrogate range is encountered in processing Unicode data (assuming it has been decoded from whatever encoding was used, such as UTF-16), the situation should be handled as an error. If it's not a high surrogate immediately followed by a low surrogate, there might be no way to handle the situation meaningfully, since we cannot know what happened. But if there is a surrogate pair, odds are that the data was in fact UTF-16 encoded and it was not decoded properly, so you might interpret the data according to UTF-16.

When using UTF-8 (8-bit code units) or UTF-32 (32-bit code units), there is no use for surrogates in any sense.
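The surrogate mechanism of UTF-16 is simple arithmetic, sketched below. The function names are invented for this illustration; in practice you would use a library's UTF-16 codec rather than code like this. The offset from 10000 (hexadecimal) is a 20-bit number, and each surrogate carries 10 of those bits.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point outside the BMP into UTF-16 code units (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogate pairs")
    offset = cp - 0x10000                      # a 20-bit number
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate code unit into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D400)
print(f"U+1D400 -> {high:04X} {low:04X}")
print(f"back: U+{from_surrogate_pair(high, low):04X}")
```

For U+1D400, the first code point of the Mathematical Alphanumeric Symbols block, the pair is D835 DC00, which agrees with what Python's own `utf-16-be` codec produces.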

4.3.9. Unassigned Code Points and Private Use

Unassigned code points are simply points that have neither been allocated for any use nor declared as noncharacters or private use points. You might visualize them as white areas on a map, or as unoccupied rooms in the coding space. Programmers often use such "free" positions for their own purposes, but that would be wrong here; the unassigned code points are not free at all. They are reserved for possible future extensions.

By using unassigned code points, you would violate the Unicode standard. On the practical side, you would take an unnecessary risk. It is quite possible that a future version of Unicode will assign a specific meaning to the code point. This would involve properties that you cannot anticipate.

Even if the characters you need have some planned or proposed area where they might be placed in a future version of Unicode, it would be a serious mistake to use code points in such an area. The Roadmaps to Unicode at http://www.unicode.org/roadmaps/ show some possible allocations of areas, but they exist for the purposes of planning future versions of Unicode. If you need to use hieroglyphs, for example, you might naively look at the roadmaps and see that the code range U+14000 to U+16BFF has been tentatively allocated for Egyptian and Mayan hieroglyphs, with some more detailed ideas on its internal structure. Using any code point there would be even worse than picking an unassigned code point at random, since it is probable that some hieroglyphs will be allocated there, and this would almost certainly conflict with the way you would assign characters to code points.

The Unicode standard reserves 6,400 code points in the BMP for so-called private use, for "user-defined characters." This should be more than enough in most cases, but there are 131,068 additional private use code points in other planes. More exactly, the Private Use Area (PUA) consists of the following code points:

  • U+E000 to U+F8FF (on plane 0, i.e., the BMP)

  • U+F0000 to U+FFFFD (on plane F hexadecimal)

  • U+100000 to U+10FFFD (on plane 10 hexadecimal, i.e., the last plane)
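The three ranges above are easy to test for in code, and their sizes match the counts given in the text. The names `PRIVATE_USE_RANGES` and `is_private_use` are made up for this sketch.

```python
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # plane 0 (BMP)
    (0xF0000, 0xFFFFD),    # plane F
    (0x100000, 0x10FFFD),  # plane 10 (hexadecimal)
]


def is_private_use(cp: int) -> bool:
    """True if cp is a private use code point."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)


sizes = [last - first + 1 for first, last in PRIVATE_USE_RANGES]
print(sizes[0], sizes[1] + sizes[2])
```

The last two code points of planes F and 10 are excluded from the private use ranges because they are noncharacters, which is why each of those planes contributes 65,534 rather than 65,536 code points.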

Here the word "private" has a wider meaning than in common language. For example, two large public institutions could agree on the use of some private use code points for their information interchange. You could use private use code points even in data that you distribute in public, as long as you make it clear that the interpretation and processing of the data requires knowledge about special definitions you have made.

You should not use unassigned code points even for internal purposes like bookkeeping or "sentinels" such as indicators of end of character data or separators between blocks of character data. For such purposes, you can often use code points assigned to control characters or declared as noncharacters.

Do not use unassigned code points for anything. If you need a code point for a character that cannot be expressed in Unicode (yet), use private use code points.




Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
