1.2. What's in a Character?

We use characters daily: we type them, and we read them on screen or on paper. We use text-processing programs routinely, much like people used to use typewriters, pens, or other writing tools. How could characters create problems?

1.2.1. Why Do We Need to Know About Characters?

If English is your native language, you are accustomed to using a small set of characters, consisting of the letters A–Z and a–z, the digits 0–9, and a few punctuation characters. Most novels, newspaper articles, and memos contain no other characters. Since you seem to be able to type these characters directly on a keyboard, why should you learn more about characters and get confused? To be honest, character issues are confusing.

Suppose you use a computer only to write and edit texts in English, perhaps as a secretary or a technical editor. You still have reasons to know about characters:

  • Computer technology has caused a decline in typography, and you can make a positive impression by using correct punctuation instead of typewriter-style punctuation. If you use a text-processing program, it probably takes care of using "smart" quotation marks instead of "straight" quotes, but you need to learn how to produce dashes – like this – and how to prevent bad line breaks.

  • Normal English texts may contain special characters occasionally. Someone may spell Caesar as Cæsar, or use a word like fiancé, rôle, or garçon the French way, or use the per mille sign ‰ or the euro sign €. Michael Everson writes: "Despite unfounded but widespread belief to the contrary (based doubtless on the prevalence of ASCII), diacritics (usually French ones) are often found in naturalized English words. Examples are: à la carte, abbé, Ægean, archæology, belovèd, café, décor, détente, éclair, façade, fête, naïve, naïvety (but cf. non-naturalized naïveté), noël, œsophagus, résumé, vicuña" (http://www.evertype.com/alphabets/english.pdf). You may regard some of these spellings as foreign or obsolete, but people may still use them in English. There are often good reasons to change the spelling to something simpler, but not knowing how to produce the characters is not a good reason.

  • Your text may contain foreign names with some strange characters. Although it is common to simplify the spelling, you can stand out positively by doing things correctly. Suppose that someone's surname is Hämäläinen and she works in an important international position. She is probably accustomed to seeing her name written as Hamalainen or Haemaelaeinen. But wouldn't she be delighted if someone were polite enough and competent enough to spell her name right, just for a change? However, she might not like it if someone tried to do so and failed, producing Hmlinen or H{m{l{inen.

  • You might even be asked to include quotations in a foreign language. You might even need to work with a document in a foreign language, because someone has to do that and this is your day for being that someone. In that case, you may need to use foreign punctuation as well and to find a way to enter foreign characters efficiently, in addition to just knowing a universal clumsy way of entering any character.

  • Texts increasingly contain technical and scientific special notations. Even casual memos and messages may need to mention µm (micrometer) or to use the almost equals sign ≈ or the male sign ♂. In scientific or technical texts, mathematical formulas are often quite crucial and need to be exactly right, down to the choice of each special symbol. The world is getting more technical and symbolic. Even nontechnical texts like bridge columns contain special symbols, such as ♠, ♥, ♦, and ♣.

In multilingual applications, characters and their codes are a major issue. Even a web site with two or more languages or a bilingual dictionary can be regarded as multilingual applications, and they create the problem of representing the characters of both or all languages. For example, people using French and people using Russian on computers probably work with their own tools, settings, and conventions, but if you need to create a document that is bilingual in French and Russian, you need to make sure you can work with both Latin letters with diacritic marks and Cyrillic letters. In effect, you would need to use Unicode, one way or another.

If you are a computer professional, you need to be prepared to handle data-processing problems that may involve characters of any kind. Someday someone will ask you to work with a system for processing data in a strange language or with strange symbols in it, perhaps even in a writing system where text runs right to left. It will be very difficult if you have no background in working with such issues. Most people need quite some time to digest character problems and techniques. You may find that you have completely misunderstood some basics of something you thought you had known for years.

Even if you process only "normal" text, character code standards and specifications are more important than they used to be. Modularity of software requires that you isolate character-level processing from other levels. You should not test for a character variable's value being equal to 32 to test whether it is a space character. Often, even a more sensible test, against the character constant ' ', is suboptimal, and using a built-in function like isspace is better, since it takes care of other space-like characters as well. Tools developed for such operations are increasingly based on general specifications in character standards, especially the Unicode standard. They are supposed to define, in a systematic and all-encompassing way, the fundamental properties of characters, like being space-like, or being a letter, or allowing a line break before or after a character. To use such definitions and software modules that implement them, you don't need to know every detail, but you need to know the principles and the ways to get at the details when needed.
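The difference between the three kinds of tests can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and Python's str.isspace stands in for the isspace function mentioned above.

```python
import unicodedata

def is_space_by_number(ch: str) -> bool:
    """Fragile test: true only for code number 32 (the ordinary space)."""
    return ord(ch) == 32

def is_space_by_property(ch: str) -> bool:
    """Property-based test: delegates to the built-in str.isspace()."""
    return ch.isspace()

# A few space-like characters plus one letter for contrast:
# space, tab, no-break space, em space, and the letter A.
samples = [" ", "\t", "\u00a0", "\u2003", "A"]

for ch in samples:
    # Control characters such as the tab have no Unicode name,
    # so a fallback label is supplied.
    label = unicodedata.name(ch, "<control character>")
    print(f"U+{ord(ch):04X} {label}: "
          f"by number={is_space_by_number(ch)}, "
          f"by property={is_space_by_property(ch)}")
```

The hard-coded test recognizes only the first sample, whereas the property-based test also accepts the tab, the no-break space, and the em space.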

In addition, if you design or develop programs, databases, or systems, you will find that it is extremely difficult to adapt them to processing different character sets, if they were not designed to work that way. If the software is full of code that relies on using 1 byte (octet, 8-bit entity) for one character, it may need an almost complete rewrite if it needs to be modified to process Chinese text as well.
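To see why the assumption of one byte per character breaks down, consider the following small Python sketch. The variable names and the choice of sample text are assumptions made for illustration only.

```python
# Two Chinese characters (中文), written with escapes to keep the
# source independent of the file's own encoding.
text = "\u4e2d\u6587"

char_count = len(text)                       # number of characters
utf8_bytes = len(text.encode("utf-8"))       # octets needed in UTF-8
utf16_bytes = len(text.encode("utf-16-le"))  # octets needed in UTF-16

print(char_count, utf8_bytes, utf16_bytes)

# Code that cuts a buffer at a fixed byte count can split a character.
# Here the first 4 octets hold one whole character and a fragment
# of the next one, which the decoder can only replace with U+FFFD.
truncated = text.encode("utf-8")[:4].decode("utf-8", errors="replace")
print(truncated)
```

Any routine that indexes, truncates, or counts by bytes would need to be revisited in software that makes this assumption.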

1.2.2. Characters as Units of Text

A character is a basic (or "atomic") unit of written text. A piece of text is a sequence of characters, also called a string. This does not necessarily mean that text is always displayed so that its characters appear linearly one after another, although this is what happens for English text, if we ignore the issue of division into lines. In other writing systems, consecutive characters may be combined into one glyph in complex ways. However, the text is still logically a sequence of characters.

1.2.2.1. Characters as abstractions

To store, process, and transfer data in digital form, we need an abstract concept of a character. It would not be feasible to store the specific appearance of each written character. Instead, we store information that tells which character it is, independent of the specific visual shape it has. If we wish to affect the way in which our characters are displayed and printed, we use special formatting commands or other tools.

The abstract concept of character is essential in Unicode, in all digital processing of character data, and even in writing itself. The meaning of a piece of text does not change if you change its font, the specific design of its characters. To put it a bit differently, the style and taste, and even the effect, of text might change, but we have an intuitive understanding of something invariant behind such variation. For example, the letter "A" set in a serif font, in a sans-serif font, in bold, and in italics are all instances of the same character. Since you know the Latin alphabet, you should have no difficulty with this. You might find it more difficult to know whether two differently styled renderings of א are instances of the same Hebrew character, but people who read Hebrew are able to recognize that.

Different attempts have been made to describe what characters are. They have even been compared to Platonic forms. The point is that there is so much negative in the concept: it is largely defined by saying what a character is not. In a sense, we strip away properties and concrete features until there's very little left: something that could be called the idea of a particular character. Dan Connolly has written in his classic treatise "'Character Set' Considered Harmful": "Note that by the term character, we do not mean a glyph, a name, a phoneme, nor a bit combination. A character is simply an atomic unit of communication. It is typically a symbol whose various representations are understood to mean the same thing by a community of people."

This raises the question of what to do if different people recognize things differently. In some languages, "v" and "w" have been treated as typographic variants of a single character; other languages treat them as completely distinct letters. In such situations, Unicode normally defines separate characters.

To clarify the abstract nature of characters, a Unicode character, or a character defined by some other standard:

  • Normally has no particular stylistic appearance but may vary between broad limits, as long as the designs can be recognized as the same character

  • Is essentially black and white, though a character as a whole could be colored with any other two colors (making, for example, the character appear in red), using methods external to character standards

  • Has no fixed pronunciation, except for some specifically phonetic characters; however, there are of course correspondences between letters and sounds, even across languages that use the same basic writing system

  • May have very specific usage as a special symbol (e.g., © is just a copyright symbol) or a broad range of different uses (e.g., / can be a separator of a kind, a mathematical operator, or something else)

1.2.2.2. Variation of appearance or different characters?

Problems arise when the concept of an abstract character has to be applied to concrete situations. We know what the letter "A" is, but is it the same as the lowercase letter "a"? That is, is the difference between them just variation in appearance, the same way as the letter "A" in the Times font differs from the letter "A" in the Arial font? In fact, the lowercase letters are a medieval invention, created by people who wrote text by hand and needed forms that are more convenient for that.

We could have defined "A" and "a" as just visual variants of the same abstract character, but we didn't. Quite early in the history of computers, this decision was made. It has far-reaching implications. If you wish to process input data so that upper- and lowercase letters are equivalent, to make things easier to people who type the data, you need to do something special to take care of that.
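A minimal sketch of that "something special" in Python follows. The helper name same_ignoring_case is hypothetical; str.casefold is the standard library's method for caseless comparison.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings so that upper- and lowercase letters match."""
    return a.casefold() == b.casefold()

# The characters themselves are distinct...
print("A" == "a")

# ...so equivalence has to be requested explicitly.
print(same_ignoring_case("Unicode", "UNICODE"))
```

Without the explicit folding step, input such as "UNICODE" would not match stored data such as "Unicode".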

To take things a bit further, consider the Latin letter "A" and its relationship to the corresponding Cyrillic letter А and the corresponding Greek letter, capital alpha Α. All three letters look the same in most fonts, and they share a common origin. Yet they belong to different alphabets: the Latin alphabet A, B, C, D..., which we use in English and many other languages, the Cyrillic alphabet А, Б, В, Г..., which is used in Russian and many Eastern European languages, and the Greek alphabet Α, Β, Γ, Δ... (alpha, beta, gamma, delta...).

It would have been possible to identify the Latin "A" and its Cyrillic and Greek counterparts. However, it was decided to keep them separate. Generally, Unicode (like character standards in general) does not unify characters across writing system boundaries. We might take this just as a fact of life and live with it. But we might also look at its reasonableness. Consider the operation of converting text from upper- to lowercase. The Latin letter "A" should become "a," whereas the Greek letter alpha "Α" should become α. It would be impossible to do this automatically if it were impossible to tell, from the internal digital representation, whether the original data contains the Latin "A" or the Greek "Α."
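The following Python sketch illustrates the point with the three look-alike capital letters, written as escapes so that they cannot be confused in the source. The variable names are illustrative only.

```python
import unicodedata

# Three letters that look the same in most fonts.
latin_a = "\u0041"
greek_alpha = "\u0391"
cyrillic_a = "\u0410"

for ch in (latin_a, greek_alpha, cyrillic_a):
    lower = ch.lower()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"-> U+{ord(lower):04X} {unicodedata.name(lower)}")
```

Because each letter has its own number, lowercasing can produce the right result for each writing system.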

Writing systems were invented by people, and characters are creations of mankind, not nature. Thus, the identity of abstract characters is in a sense just a decision made by some people. However, it is usually an informed decision.

1.2.2.3. Variation in shape turned into a character difference

In many cases, stylistic variation in drawing or printing a character has been "frozen" so that a variant obtains a specific shape and meaning. The ancient Romans used the letter "V" both as a consonant and as a vowel. Later, it appeared in different variants, such as a rounded one, like our "U." People started using the original version and different curved variants in different contexts. As such usage became systematic, consistent, and common, the letter "U" was born.

Therefore, we now have the independent characters "V" and "U." They are, in turn, written with stylistic variation, though now the general idea is that the variation should not obscure the difference between these two characters. Yet, you might still see "V" used for "U" for stylistic reasons, especially to imitate ancient inscriptions (SENATVS POPVLVSQVE ROMANVS).

The letters "U" and "V" later gave birth to new characters that were originally formed as their typographic variants, as well as to the letter "W," originally a digraph (VV). Special forms of this letter have been recognized as separate characters, such as the modifier letter small w, ʷ. The story goes on. In different areas that need new symbols, characters are created as variants or modifications of old characters. This seems to suit the human mind better than the invention of new character shapes from scratch.

1.2.2.4. Characters and "abstract characters"

The Unicode standard defines different meanings for the term character. The first one is: "The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding." The second meaning is that "character" is a synonym for "abstract character," which is defined as "a unit of information used for the organization, control, or representation of textual data."

Thus, the difference seems to be that an abstract character may have a control purpose only. Control purposes include line breaks, for example. In more common terminology, "character" in Unicode often means a printable (graphic) character, whereas "abstract character" means what is commonly called just "character," which includes printable and control characters.

On the other hand, the Unicode standard also uses the expression "abstract character" to refer to a symbol that may be perceived by users as a character ("user character"), although it cannot be represented as a single Unicode character (also known as encoded character or coded character). In particular, a symbol with special marks (diacritic marks) on it, such as ó, cannot always be represented as one character in Unicode but may be a sequence of two or more characters.

The expression "semantic value" is somewhat misleading in this context. A character such as a letter can hardly be described as having a meaning (semantic value) in itself. It would be better to say that a character has a recognized identity and it may be sometimes used as meaningful in itself (as a symbol or as a one-letter word) but more often as a component of a string that has a meaning. Moreover, the "smallest component" part is somewhat vague. A character such as ú (letter u with an acute accent), which belongs to Unicode, can often be regarded as consisting of smaller components: a letter and a diacritic (acute accent). In fact, in Unicode, the character ú may be regarded either as a character on its own or as a combination: as two successive characters, letter "u" and a combining acute accent.
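The two representations of ú can be compared in Python with the standard unicodedata module. This is a small sketch; the variable names are chosen for this example.

```python
import unicodedata

precomposed = "\u00fa"   # ú as one character
decomposed = "u\u0301"   # letter u followed by a combining acute accent

# The two strings differ as sequences of characters...
print(precomposed == decomposed, len(precomposed), len(decomposed))

# ...but normalization converts one form into the other.
composed_again = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed, split_again == decomposed)
```

Software that compares strings usually needs to normalize both sides to the same form first.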

The intuitive concept of character varies by language and cultural background. If you know the letter ä mainly from J. R. R. Tolkien's books, you might regard it just as letter "a" with a special mark that indicates that it is to be pronounced separately. You might even regard the two dots just as optional decoration, as in "naïve" if spelled in the French way. If your native language were Finnish, you would certainly treat ä as a completely separate character, and you would have learned at school that it has its own position in alphabetic order (a, b, c,...x, y, z, å, ä, ö). Similarly, in Swedish, the words "här" ("here"), "har" ("has"), and "hår" ("hair") must be kept clearly separate. To a German, ä is different from "a," but it is treated as primarily equivalent to "a" in alphabetic order and is in a sense a variant of "a" ("a Umlaut").

Unicode, aiming at universality, generally recognizes written forms as separate characters, if at least one language or commonly used notation system makes a difference. Thus, "a" and ä are treated as distinct. If you wish to handle them as equivalent, you need to program code that treats them that way.
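One possible way to write such code is sketched below in Python: decompose each character and drop the combining marks. The function name strip_marks is hypothetical, and this folding is deliberately crude; as noted above, it would be wrong for Finnish or Swedish text, where the distinction matters.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Return text with combining diacritic marks removed."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is nonzero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Written with escapes: "här", "har", and "hår".
words = ["h\u00e4r", "har", "h\u00e5r"]
print([strip_marks(w) for w in words])
```

After folding, all three Swedish words collapse into the same string, which shows both the convenience and the danger of treating such letters as equivalent.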

1.2.2.5. Characters and other units of text

Although a character is a natural "atom" of text in data processing, it does not always correspond to people's intuitive idea of the basic constituents of text. Looking at text in English, we might occasionally ask ourselves whether the ligature ﬁ is two characters or one. In other writing systems, similar questions arise more often. Unicode takes a liberal approach to identifying a complex character in many cases. You can represent ﬁ as one character or (more often) as two characters, "f" and "i." As mentioned above, similar principles apply to letters with diacritic marks.
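As an illustration, the ligature of "f" and "i" exists in Unicode as a single character (U+FB01), and compatibility normalization maps it to the two-letter sequence. The Python sketch below uses illustrative variable names.

```python
import unicodedata

ligature = "\ufb01"          # the fi ligature as one character
word_with_ligature = "\ufb01nd"

print(unicodedata.name(ligature), len(word_with_ligature))

# NFKC normalization replaces the ligature with the letters f and i.
expanded = unicodedata.normalize("NFKC", word_with_ligature)
print(expanded, len(expanded))
```

A search for "find" would miss the ligature form unless the text is normalized this way first.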

People who speak languages with many diacritic marks or ligatures may regard a symbol like ﬁ or ú as a single character, even though they are often coded as sequences of characters. In some cases, it would not even be possible to code the symbol as a single character in Unicode, since Unicode does not contain all the combinations and ligatures that can be formed.

Moreover, although characters might be written separately, as in "ch," their combination might be understood as a single entity by some people. In English, "ch" denotes a particular sound and has thus some identity of its own. Some other languages treat the combination as an inseparable unit even in alphabetic order: in a dictionary, words would appear in an order like car, czar, char. Such treatment has become less common, though, since it is somewhat more difficult to implement in automated processing. Unicode treats "ch" as two characters but recognizes that it might constitute a unit in ordering.

Partly for such reasons, the ordering of characters is rather complex. Unicode does not prescribe a single ordering of characters and strings. Rather, it defines a basic (default) ordering that can be used as a basis for defining language-dependent and even application-specific orderings.
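To make the idea of "ch" as a unit in ordering concrete, here is a toy Python sketch. It is not the Unicode default ordering or any library's collation; the names UNITS, RANK, and ch_as_unit_key are invented for this example, and it handles only lowercase basic Latin letters.

```python
# An alphabet in which "ch" is a unit of its own, placed after "c".
UNITS = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")
RANK = {unit: position for position, unit in enumerate(UNITS)}

def ch_as_unit_key(word: str) -> list[int]:
    """Build a sort key that treats the pair "ch" as one ordering unit."""
    key = []
    i = 0
    while i < len(word):
        unit = "ch" if word.startswith("ch", i) else word[i]
        # Unknown characters sort after all known units.
        key.append(RANK.get(unit, len(UNITS)))
        i += len(unit)
    return key

words = ["char", "czar", "car"]
plain_order = sorted(words)
unit_order = sorted(words, key=ch_as_unit_key)
print(plain_order)
print(unit_order)
```

The plain character-by-character sort puts "char" before "czar," whereas the unit-aware key yields the dictionary order described above: car, czar, char.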

1.2.3. Characters Versus Images

Characters are normally represented in graphic form, as something that can be called an image. However, there is a fundamental difference between an image and a character. An image can be a particular rendering of a character, much like a spoken word is a particular presentation of an element of a language. Moreover, most images are not renderings of characters at all.

Character code standards mostly identify a symbol as a character only if it is actually used in texts, e.g., in books, magazines, newspapers, and electronic documents. Characters that are normally used only in product labels and other specialized contexts are often borderline cases. However, they are often identified as characters if they are used in conjunction with symbols that are undeniably characters.

A typical example is the estimated symbol ℮, a stylized variant of the letter "e." It is not used in normal texts, but only in European packaging to claim conformance to certain standards in specifying a quantity. However, it is identified as a character, partly because it is used in packages in relation to text characters, e.g., in "℮ 200 g" (indicating that the mass of the product is 200 grams, within tolerances defined in specific regulation).

On the other hand, logos and identifying symbols are not treated as characters, even though they might be accompanied by texts. By its nature, a logo consists of a name or abbreviation in a particular graphic style. Hence, it would be unnatural to encode it as a character or sequence of characters, although we might use a string of characters as a replacement for a logo (e.g., when a document containing a logo needs to be converted to plain text and the logo conveys essential information).

Similarly, most of the various political, ideological, or religious symbols are treated as graphic symbols that are not characters. They are not normally used in texts. Their shape may vary, but not as part of font variation. However, for various reasons, some graphic symbols have been defined as characters in some character codes, contrary to these principles. Unicode therefore contains them as characters, so that existing texts using such characters can be encoded.

Generally, a graphic symbol is encoded as a character in Unicode, if there is need for exchanging it in digital form in plain text. Decisions on this are sometimes difficult and may be affected by tradition.


The distinction between a character and an image is often a practical decision to be made by the author or editor of a document. In many cases, you have a choice between a character and an image. For example, suppose that you are designing a user interface for a document, program, or web page and you need graphic symbols for "Next" and "Previous." It may often be best to use words, but let us assume that you want to use arrows pointing to the left and to the right. Beware that even at this fairly abstract level, the decision is not culturally neutral: it implies left-to-right writing direction.

In Unicode, there is a largish block of arrow characters. Among them, a few like ← and → are widely available in commonly used fonts. However, they are not very prominent graphically, even if shown in bold, in large font, and in color. Their graphic design is character-like, not iconic. Some other characters in the Arrows block of Unicode look more solid, but they are not as common in fonts. For buttons or links, specially designed images may thus work better. On the other hand, in running texts, the arrow characters often work well. If you wish to make references to other entries in an encyclopedia by using arrows, then "→foobar" works better than a word preceded by a distinctive graphic.

Generally, when deciding between the use of characters and the use of an image for presenting a graphic symbol, the following items should be considered:

  • Are there some Unicode characters that could be used, and are they suitable both by their defined semantics and by their typical graphic appearance?

  • Is it possible that the document will be rendered so that images are not displayed? If yes, is it possible to specify a textual alternative to the image (such as the alt attribute in HTML markup)?

  • How safely would the character work, given all the possible problems with encodings, fonts, etc.?

  • Is it acceptable, and perhaps desirable, that the symbol changes size, shape, or color when text font size, face, or color is changed?

  • Is it possible that the data will be processed as a character string, e.g., stored in a database or used in a search string?

For example, suppose we write about music and wish to refer to F-sharp and B-flat using the conventional musical symbols: F♯, B♭. The Unicode approach would use the special characters music sharp sign ♯ and music flat sign ♭. However, these characters, although part of Unicode since Version 1.1, are poorly supported in fonts. Even though you could find them in some fonts at your disposal, their appearance might not fit into your typographic design. You might end up using the number sign # and the letter "b" as replacements. In web authoring, for example, you might decide that although B&#x266d; would be technically quite correct (using a so-called character reference to include the flat sign), it is safer to create a small image, say flat.gif, and embed it with markup like B<img src="flat.gif" alt="-flat">. This means that the flat symbol remains in constant size if the text size is changed, but this is usually tolerable.
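A character reference such as &#x266d; is just another way of writing the number of a character. The Python sketch below decodes such references with the standard html module; the sample markup string is made up for this illustration.

```python
import html
import unicodedata

# Markup using hexadecimal character references for the two music signs.
markup = "B&#x266d; and F&#x266f;"

decoded = html.unescape(markup)

# Show what each non-ASCII character in the result is.
for ch in decoded:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Whether the decoded characters are then displayed properly still depends on the fonts available, which is the practical concern discussed above.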

Sometimes character-looking symbols are not characters. Microsoft Word by default changes the three-character sequence "-->" into a kind of arrow symbol that resembles →. However, this arrow is different from any Unicode character: it is just a glyph in the Wingdings font. It is therefore something between a character and an image; like so many compromises, it combines the drawbacks of the alternatives.

1.2.4. Processing of Characters

The previous discussion mentioned that characters can be processed and used in many ways that are not possible (or practical), if information is represented as images, sounds, or in another nontext format. This includes:

  • Searching for occurrences of a word or other fragment of text, using either a simple search string or a text pattern

  • Performing automatic replacements, such as substituting a string for another in all occurrences

  • Indexing the data for efficiency of searching and for creating an alphabetic index or concordance (list of occurrences of words)

  • Sorting text data, e.g., for presentation in alphabetic order

  • Copying text from an application or data format to another, often via a clipboard

  • Modifying text as in a text editor or text-processing application, by deleting, inserting, and replacing characters

  • Selecting parts of text by user actions, such as painting or keyboard commands

  • Recognizing constructs like words, syllables, morphemes (components of a word with a meaning), and sentences

  • Computing statistics on the use of characters, words, phrases, etc.

  • Spelling and grammar checks

  • Automatic or computer-aided translation

  • Presenting texts in audible form, via speech synthesis, which is more natural these days than you might expect from many science fiction films

Even the display of characters on screen or paper involves processing:

  • Choice of font, which can be a complex process

  • Application of bolding, italics, and other features, if requested

  • Selection of contextual forms for characters

  • Recognition of character sequences that should or could be rendered using ligatures or other special methods

  • Formation of characters with diacritic marks, often requiring complex algorithms

  • Adjusting spacing between characters and words, perhaps for justification of lines

  • Breaking text into lines, perhaps using hyphenation

In particular, suppose that some document exists on paper only, or as a scanned image only. The above lists of possibilities can be consulted when estimating whether the text should be converted into text format. The conversion may require quite a bit of work, including the identification of special characters occurring in the documents.

Sometimes the benefits of text format turn into drawbacks, or they are regarded as problems. If you send a contract by email and ask the recipient to print, sign, and send it, can you be sure that the recipient does not edit the text before printing, without your noticing? Ease of copying text can be a problem, if it is used to violate your copyright. For such reasons, plain text and even other text forms are sometimes avoided. Perhaps even a printing possibility is undesirable. Some data formats, such as PDF, can be locked, or protected against copying, modification, and printing, though only in a relative sense.

1.2.5. Giving Identity to Characters

To represent characters in digital form, we need to encode them using bits, but first we need something to encode. We need a collection of characters that are distinguishable from each other. We do not define characters individually but as parts of a collection. The Latin letter "A" is defined, among other things, by designating it as distinct from lowercase "a" or from any Greek or Cyrillic letter.

A character is also described by its meaning, or semantics. However, we must be careful about this. A character is usually just an atom of text and normally lacks a meaning in the sense that words or some parts of words have meanings. In the word "singing," the stem "sing" and the suffix "-ing" have meanings, but it would not be natural to say that the letter "g" has a meaning, in any comparable sense.

The meaning of the letter "g" is basically that it is one of the (lowercase) Latin letters, used to write words in some writing systems. Its pronunciation may vary (even within one language: compare "get" with "gem"), although it might be possible to indicate some typical phonetic values. Generally, definitions of letters in character standards are independent of pronunciation issues, except for some characters specifically designed for such usage (e.g., characters in the International Phonetic Alphabet, IPA).

As we get to more technical characters, such as the plus sign + or the copyright symbol © or the smiling face ☺, we find characters that can be described as having a meaning of their own. They might even correspond to words, such as "plus" and "copyright."

1.2.5.1. Definitions of characters in standards

The definition of a character in a standard needs to be unambiguous and definitive, not just loose prose. Old character standards tried avoiding the problem of definition by simply showing the character, assigning a number to it, and possibly naming it. This has turned out to be insufficient for many purposes. How could you tell from just seeing an "A" whether it is meant to be the Latin letter only, or also the Greek or Cyrillic letter?

The most important character standard in the modern world is Unicode, so let us take a look at its way of defining characters. Unicode identifies a character by:

  • Showing a representative glyph for the character, i.e., one specific but typical visual form that the character may have

  • Assigning a unique number to it; this number will never be changed

  • Assigning a unique Unicode name for it; this will never be changed either, even if it is found misleading or originally mistyped, and it is best to regard it as a mnemonic identifier rather than a name in a normal sense

  • Specifying a set of properties for it in a rigorous, formalized manner; they describe, for example, the general class (letter, digit, punctuation, etc.) of the character, its uppercase equivalent when applicable, etc.

  • Making annotations, i.e., prose descriptions that clarify the meaning, often comparing the character with other characters, presenting alternate names for it, and sometimes even describing possible variation in the visual appearance

For example, the plus sign is defined in Unicode as follows:

  • The representative glyph looks much like +.

  • The number is 2B, often written as 002B for uniformity, in hexadecimal (base 16) notation, which means 43 in decimal (base 10).

  • The name is PLUS SIGN.

  • The general category is "Sm," which is short for "Symbol, Math." Line breaking is permitted after the character. There are several other formalized properties as well; we will discuss the various properties in detail in Chapter 5.

  • There are no annotations for this character.
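Most of the items in this definition can be looked up programmatically. The following Python sketch uses the standard unicodedata module, which exposes the name and the general category (but not, for instance, the line breaking property); the variable names are illustrative.

```python
import unicodedata

plus = "+"

number = ord(plus)                      # the number assigned to the character
name = unicodedata.name(plus)           # the Unicode name
category = unicodedata.category(plus)   # the general category

print(f"U+{number:04X} (decimal {number}), {name}, category {category}")
```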

1.2.5.2. Annotations used to emphasize differences

The plus sign is not easily confused with any other character, and it has no widely used alternate names in English. Therefore, no annotations were deemed necessary. For the comma character ",", character number 002C, for example, there is an annotation that says that the character has the alternative name "decimal separator." This does not mean that the decimal separator should be a comma (although most languages in fact use a comma for that). It just means that in some contexts some people call the comma "decimal separator." This effectively identifies a comma used as a decimal separator with the character number 002C, as opposed to treating it as a separate though similar character. On the other hand, the annotations related to the comma character also contain notes that refer to "Arabic comma," "single low-9 quotation mark," and "ideographic comma" as separate characters. This can be read as a warning against confusing the comma with those visually similar characters. For example, some languages use a single low-9 quotation mark as an opening quote in some contexts (e.g., in German: ‚gut‘); without a warning, you might be inclined to think that it's just a special use for the comma.
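The comma and the look-alike characters mentioned in its annotations can be told apart by their numbers and names, as this small Python sketch shows. The list name lookalikes is chosen for this example.

```python
import unicodedata

# The comma and three characters it should not be confused with.
lookalikes = ["\u002c", "\u060c", "\u201a", "\u3001"]

for ch in lookalikes:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Visually similar, yet all four are distinct characters.
all_distinct = len(set(lookalikes)) == len(lookalikes)
print(all_distinct)
```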

1.2.5.3. The representative glyphs

The definitions of characters in Unicode are logical and do not imply any particular presentation of a character, either internally (in digital form, as bits) or visibly on paper or screen. However, a representative glyph is given to clarify the identity of a character.

The Unicode standard explicitly says that the representative glyph is not a prescriptive form of the character, but it lets a "knowledgeable user" recognize the character.

The glyphs used in Unicode code charts tend to be neutral and generic rather than typographically well-designed. They typically lack artistic ambitions, and they have been designed so that differences with other characters have been emphasized. That is, glyphs for characters that are often rather similar in practice, especially if we consider variation across fonts, have usually been designed to be sufficiently different from each other.

1.2.5.4. The number and the Unicode name as identifiers

The number assigned can be regarded as identification only, although in practice, it is used as a basis for the digital representation. The Unicode name is an alternative, more mnemonic identifier. As a mental exercise, consider the possibility of sending information by telephone so that you utter the names of Unicode characters, in order to express something complicated like a foreign word or a formula. If both participants have access to information about Unicode characters, the communication can be completely successful even though no visible characters are sent and no digital encoding is used.

Thus, when characters are represented in digital form, each character is internally a number, an integer. Numbers in turn are represented as sequences of bits, but this is a different level. When a file contains the string "Hello" (without the quotation marks), it really contains five numbers corresponding to the characters. In most character codes, this is the sequence 72, 101, 108, 108, 111.
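The correspondence between characters and numbers can be seen directly in most programming languages. A small sketch in Python (our choice of language for illustration), using the built-in functions ord and chr:

```python
text = "Hello"

# Each character corresponds to an integer, its code number.
numbers = [ord(ch) for ch in text]

# The mapping is reversible: the numbers alone are enough to restore the text.
restored = "".join(chr(n) for n in numbers)

print(numbers)
print(restored)
```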

A character code can assign numbers to characters arbitrarily, but once assigned in a specification, they should not be changed. In practice, the assignments have been made in a partly systematic way, so that related characters often have consecutive numbers.

Many modern standards, specifications, and instructions identify characters by their Unicode numbers to achieve unambiguity. Previously, documents on matters like mathematical or technical notations or transliteration of texts used to specify the symbols to be used just by showing them as visual forms, as ink on paper. This turned out to be particularly problematic in the computer era, when different people interpreted such signs differently, resulting in incompatible encoding of data.

Suppose that you specify, for example, that in some notation, the double prime character (″), with Unicode number 2033 in hexadecimal, be used (say, to denote seconds as a subdivision of a degree when expressing angles). Actually, the Unicode number alone would suffice, but mentioning the name makes the specification more readable. In principle, you do not even need to write the character itself, though usually it helps. By identifying the Unicode number, you have achieved several things:

  • You have unambiguously identified the character you mean. People may still decide to use some similar character instead, if they have difficulty typing the right character. Yet, it is clear which is the right character; others are various replacements.

  • You have given a number that can be used as an index to large collections of information about the character, such as varying visual shapes for it, its defined properties, fonts containing it, definitions of meaning, and comments on scope of usage.

  • The number can be used for typing the character by anyone who knows a general input method for Unicode characters in a particular environment. Typical word processors have at least one mechanism that produces a specific character, if you just specify its Unicode number.
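Programming languages typically offer similar mechanisms. As an illustration only, the sketch below assumes Python and its standard unicodedata module and shows three ways of getting from the identification "2033, double prime" to the character itself:

```python
import unicodedata

number = 0x2033                       # the Unicode number, in hexadecimal

character = chr(number)               # from the number to the character
name = unicodedata.name(character)    # from the character to its Unicode name

# The name identifies the same character as the number does.
same_character = unicodedata.lookup("DOUBLE PRIME")

# In source code, an escape notation based on the number can be used as well.
escaped = "\u2033"

print(name, f"U+{number:04X}")
```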

Thus, anyone who participates in creating or clarifying notational specifications should know the principles of Unicode and should promote the use of Unicode numbers for characters. You should probably expect resistance, since it is not quite easy to see the benefits.

1.2.5.5. Unicode is more explicit

Older character standards, such as ASCII and the ISO 8859 family of standards, contain substantially less information about characters. They rely on the names of characters and the representative glyphs, and on intuitive understanding related to the traditions of using characters. The same applies to the ISO 10646 standard, which is the official international standard that corresponds to Unicode. This means that we have two standards that are fully in accordance, ISO 10646 and the Unicode standard, but the latter contains a lot of additional information. Moreover, the Unicode standard is freely available on the World Wide Web, which is why people speak about Unicode and not ISO 10646, except in official standards and related documents.

The collection of all Unicode (or ISO 10646) characters is sometimes called the Universal Character Set (UCS). This expression is used especially in formal contexts, when one needs to refer to ISO 10646 and does not want to mention Unicode. In normal prose, we usually refer just to Unicode characters.

1.2.5.6. Spelling of names and the U+nnnn convention

The Unicode names of characters are written in all uppercase in the Unicode standard, but this is just a convention. In fact, the standard itself spells the names in all lowercase in some contexts. Uppercasing is often used to indicate (or hint) that a character is referred to by its Unicode name. However, in this book, we use normal (mixed) case for the names, except in some quotations.

We will use the conventional style of mentioning a Unicode character by its code number in hexadecimal (base 16) and prefixed with U+, e.g., U+002B. We could use just the number, but then you might not always know whether we use a number for such identification or just as a number.

This notation is used with at least four hexadecimal digits, so there are often leading zeros. All characters in the so-called Basic Multilingual Plane (BMP) can be expressed in four digits, but some newer characters need more.
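The padding rule is easy to express in code. The function below is a hypothetical helper written for this illustration in Python (the name u_plus is ours): it pads to four hexadecimal digits and lets longer numbers extend naturally.

```python
def u_plus(ch: str) -> str:
    """Return the U+nnnn notation for a single character."""
    # 04X: uppercase hexadecimal, padded with leading zeros to at least four digits.
    return f"U+{ord(ch):04X}"

bmp_example = u_plus("+")                   # a character inside the BMP
beyond_bmp_example = u_plus("\U0001D11E")   # musical symbol G clef, outside the BMP

print(bmp_example, beyond_bmp_example)
```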

We will normally mention first the Unicode name, then the code, often with a glyph between them. Thus, while you might see a Unicode character mentioned as U+002B PLUS SIGN in many sources, we will mostly say: the plus sign + U+002B.

1.2.6. Unicode Definitions of Characters

The definition of a character in Unicode is given partly in code charts, partly in the Unicode Database, which contains large tables of data on characters, by property, to be discussed in Chapter 5. Here we concentrate on the information in the code charts, which are available via http://www.Unicode.org/charts/. Each code chart begins with a table of glyphs, followed by notes on each character. The notes vary greatly in length and nature, but they should always be consulted when in doubt about the identity of a character. Note that the code charts have been divided into two major groups, "Scripts" (which contains letters, ideographs, and other characters to write different human languages) and "Symbols and Punctuation." There is some overlap, since some blocks of characters belong to both groups.

The description of a character in a code chart consists of the following, where the first three items are given for every character (on one line), and others may or may not be present:

Figure 1-2. Sample description of a character in a Unicode code chart


  • Unicode number

  • Representative glyph (in normal text size)

  • Unicode name, in uppercase; this name is fixed

  • Old (Unicode 1.0) name, in uppercase on a line of its own

  • Other name(s), preceded by an equals sign = and written in lowercase; these names may be changed

  • Comment(s) on usage, preceded by a bullet •

  • Cross reference(s) to other characters, preceded by an arrow →; these references often warn against confusing a character with another, similar-looking character

  • Decomposition information, preceded by the symbol ≡ (indicating so-called canonical equivalence) or by the symbol ≈ (indicating weaker correspondence)

Figure 1-2 shows the description of the full stop (period) character in a code chart.

1.2.7. Definitions of Characters Elsewhere

Characters were defined and used long before Unicode. Even in our times, characters are often used without identifying them with a reference to any character code standards. This creates ambiguity and potential diversity when text data is represented in computer-readable form.

For example, the standards that define the SI, the International System of Units (an extension of the metric system), use several special characters such as µ, x, and . The authoritative formats of the standards are printed documents, and since they do not specify code numbers or Unicode names for the characters, we are left in some uncertainty. Some characters can be identified rather unambiguously, but it is unclear what the "raised dot" character is, for example. This character, used in notations like N·m (for newton meter), is usually interpreted as the middle dot U+00B7, but it can be argued that a more appropriate interpretation is the dot operator U+22C5.
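The practical consequence of such uncertainty can be sketched in a few lines. The example assumes Python with its standard unicodedata module; it shows that the two interpretations of the raised dot produce different data, however similar they look in print.

```python
import unicodedata

MIDDLE_DOT = "\u00B7"
DOT_OPERATOR = "\u22C5"

with_middle_dot = "N" + MIDDLE_DOT + "m"
with_dot_operator = "N" + DOT_OPERATOR + "m"

# The strings look alike on paper, but they are different as data.
strings_equal = with_middle_dot == with_dot_operator

names = [unicodedata.name(MIDDLE_DOT), unicodedata.name(DOT_OPERATOR)]
categories = [unicodedata.category(MIDDLE_DOT), unicodedata.category(DOT_OPERATOR)]

print(strings_equal, names, categories)
```

The general categories differ too: one character is classified as punctuation, the other as a mathematical symbol.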

Similarly, the International Phonetic Alphabet (IPA) was originally defined about a century ago. When it later became relevant to use it on computers, the characters had to be identified as Unicode characters. This was far from trivial, since many IPA characters can be regarded as normal Latin letters, or treated as separate symbols.

Even relatively new standards on transliteration or transcription (i.e., on conversions between writing systems) fail to identify all characters unambiguously. For example, many standards and tables for writing Russian words in Latin letters specify that the so-called hard sign, ъ, is to be transliterated using a special character, but this character is just shown as a glyph on paper. This is subject to different interpretations including the ASCII quotation mark ", the right double quotation mark ”, and the double prime ″ (U+2033). The Unicode standard makes, in a code chart, the following note about the modifier letter double prime ʺ (U+02BA): "transliteration of tverdyj znak (Cyrillic hard sign: no palatalization)." This might seem to resolve the issue in principle, but in practice, that character is not present in most fonts, and we can also ask whether the Unicode standard is authoritative in transliteration issues. Problems similar to this also exist for some apostrophe-like characters in transliteration systems for Arabic, for example.
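To see how much the interpretations differ as data, the candidate characters can be listed with their numbers, names, and general categories. This is only a sketch, assuming Python and its standard unicodedata module; the list of candidates is the one mentioned above.

```python
import unicodedata

# Candidate interpretations of the glyph used for the Cyrillic hard sign.
candidates = ["\u0022", "\u201D", "\u2033", "\u02BA"]

table = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
    for ch in candidates
]

for row in table:
    print(*row)
```

Only the last candidate has a letter category (Lm, "Letter, Modifier"); the others are punctuation, so programs that work on words may treat them differently.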

1.2.8. What's in a Name?

The names of characters in character standards are assigned identifiers rather than definitions. This is particularly true for Unicode, which now has an absolute principle of name stability. A Unicode name will not be changed even if proved wrong.

Typically, the names are selected so that they contain only letters A–Z, spaces, and hyphens; often the uppercase variant is the reference spelling of a character name.

The same character may have different names in different definitions of character repertoires. Generally, the name is intended to suggest a generic meaning and scope of use. However, the Unicode standard warns (mentioning full stop "." as an example of a character with varying usage):

A character may have a broader range of use than the most literal interpretation of its name might indicate; the coded representation, name, and representative glyph need to be taken in context when establishing the semantics of a character.

Although the Unicode names can be misleading (a price that we pay for their absolute stability), most of them aren't. The great majority of Unicode names describe the character, and the name is often the only description that the Unicode standard gives about a character individually. Thus, the name should be taken as describing the character, unless there is an annotation that says otherwise.

The Unicode name is in English, in a sense. In many cases, it is normal English, but often the name contains elements from other languages, such as the name in another language but as (somehow) adapted to English spelling.

For many purposes, it would be desirable to refer to characters by some widely understood names, in different languages. There will probably be a registry of such names, though mostly only for those characters that are widely used in each language. It will naturally contain English names as well, partly different for U.S. English and British English. They will of course have much similarity to the Unicode names. The naming is expected to take place in the context of Common Locale Data Repository (CLDR), discussed in Chapter 11.

Names of characters vary a lot, even within a language. This applies particularly to characters that are widely used in modern notations, but without much tradition, such as the tilde ~ or the commercial at @. Do not assume that people know from the name alone what you mean, even if you speak the same language.


The Unicode standard mentions some colloquial names for characters, even in languages other than English. For the @ character, it mentions that the "common, humorous German slang name" is "Klammeraffe," which means "clinging monkey." Undoubtedly, in some environments, the character might be better known under that name than under any official name. However, you need to be careful in using the alternate names mentioned in the standard. It is better to look for information on actual usage in a language and a subculture. Slang, by its nature, varies by time and people.

When you need to refer to a character and cannot just show it, try to mention commonly known synonyms for it. It is not constructive to say just "use the reverse solidus." Instead, you can say "use the forward slash (that is, solidus), not the backslash (reverse solidus)." Unicode names alone are often rather useless in difficult situations for identifying characters to people who are not familiar with Unicode. The same applies even more to Unicode numbers.

Thus, you are not supposed to use the Unicode names for all characters in all contexts. If you are used to calling the "." character "period," you need not start calling it "full stop." You need not spell out "Latin capital letter A" every time you mention capital (uppercase) "A." However, the Unicode names appear in many contexts, like in character selection menus in editors, so you need to know the idea.

You may wonder why Unicode assigns two immutable identifiers for a character: a number and a name. If each of them is unique and guaranteed to remain unchanged, what do you need the other one for? The short answer is that numbers are the basic identifiers but names are needed too, since they have been used in programs and data to uniquely identify characters. Although it might not be wise to write code that operates on character names that way, it would be unwise to intentionally break all such code now.
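Software that identifies characters by name looks roughly like the following sketch. It assumes Python and its standard unicodedata module, and describe is a hypothetical helper defined only for this illustration.

```python
import unicodedata

def describe(ch: str) -> tuple[int, str]:
    """Return both immutable identifiers of a character: number and name."""
    return ord(ch), unicodedata.name(ch)

number, name = describe(".")

# Either identifier alone leads back to the same character.
by_number = chr(number)
by_name = unicodedata.lookup(name)

print(f"U+{number:04X}", name)
```

Code like the lookup call would stop working if names were ever changed, which is one way to see why the names are kept stable.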

Originally, names of characters were meant to act as identifiers across character codes. Different codes may assign different numbers to the character ±, but they can be expected to assign the same name, "plus-minus sign," to it, or at least use names that can be recognized as essentially the same. However, this idea never worked well, since the names were in practice not always the same, or even essentially the same. Moreover, Unicode has made the original idea unnecessary, since nowadays the Unicode numbers are widely used to refer to characters across character codes, even when Unicode is not otherwise used for representing characters.
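As a concrete illustration of different codes assigning different numbers, the sketch below compares two older 8-bit codes. It assumes Python, whose standard codecs happen to include ISO 8859-1 (as "latin-1") and the old PC code page 437 (as "cp437"); these two codes are our choice of examples.

```python
import unicodedata

PLUS_MINUS = "\u00B1"

# The number assigned to the plus-minus sign in two older 8-bit character codes.
codes = {enc: PLUS_MINUS.encode(enc)[0] for enc in ("latin-1", "cp437")}

# The Unicode number and name identify the character regardless of the code used.
unicode_number = f"U+{ord(PLUS_MINUS):04X}"
unicode_name = unicodedata.name(PLUS_MINUS)

print(codes, unicode_number, unicode_name)
```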

1.2.9. Should We Be Strict About the Meanings of Characters?

People tend to use characters on the basis of their visual appearance. You see a character like ß in some repertoire, and you start using it for the Greek letter beta, if you need it. You see the character ø and you take it as the diameter sign, so you use it in a technical context like "ø = 0.12 m" (saying that the diameter of something is 0.12 meters).

Unicode has strengthened such tendencies. People browse tables or menus of Unicode characters and pick up the first one that looks right for the purpose they have in their mind. Since Unicode has so many more characters than most old standards, there are far more opportunities for getting lost: it is easy to find a Unicode character that more or less looks like the one you need.

Then comes a purist and says that ß is a letter (sharp s) used in German, not any Greek letter, and that ø is a vowel used in some Nordic languages, not a mathematical symbol. Should we care?

Although you might realize the importance of using the right character, not just a right-looking character, you may need to explain the issue to others. Moreover, we often need to make compromises, and then it becomes essential to consider their impact. Reasons for using the right character translate into risks that you need to prepare for, when you cannot use the right character. So here are some basic reasons for being strict:


Some people see the difference

Although the character looks right to you, a specialist may well see a difference between ß and β (sharp s versus small beta) or between ø and ⌀ (letter "o" with stroke versus diameter sign). When you write a foreign word, anyone who speaks that language as her native language is a specialist compared to you.


Font changes make differences noticeable

When the font is changed, the difference can become clearly visible. A typical example is that the difference between degree sign ° (as in "50 °F" or "10 °C") and masculine ordinal indicator º (superscript letter "o," used in Spanish) is very small or nonexistent in many fonts, but very clear in many other fonts (e.g., ° versus º). Your text might be rendered in different fonts even though you have carefully selected a particular font. This is particularly true in web authoring and in cooperative authoring.


Conversions operate on characters, not appearance

Automated editing of text is based on defined properties of characters, not on their appearance. For example, text-editing commands that operate on words will (or at least should) treat ø as a letter, not as a technical symbol. Converting text to uppercase would turn "ß-carotene" into "SS-CAROTENE," since "SS" is the defined uppercase version of ß.
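Both effects can be reproduced with ordinary string operations. The sketch below uses Python as an example; other environments may implement case conversion differently.

```python
wrong = "\u00DF-carotene"    # written with sharp s (U+00DF)
right = "\u03B2-carotene"    # written with Greek small letter beta (U+03B2)

# Case conversion follows the defined properties of the characters.
wrong_upper = wrong.upper()
right_upper = right.upper()

# Letter "o" with stroke counts as a letter; the diameter sign does not.
o_with_stroke_is_letter = "\u00F8".isalpha()
diameter_sign_is_letter = "\u2300".isalpha()

print(wrong_upper, right_upper)
```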


Searching looks for characters, not appearance

A search function in a program, as well as a database search, works on characters. When asked to find the string "β-carotene" (with beta), they will not find "ß-carotene" (with sharp s). The same applies to pattern matching and replace functions. Search routines may use some heuristics in their attempt to help users with common errors in using wrong characters, just as they may help with misspellings, as when Google asks "did you mean pseudonym?" after you have typed "psuedonym." But don't rely on such features.
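A plain substring search shows the effect. This is a minimal sketch in Python, with a made-up sample sentence:

```python
# Sample text in which the word was typed with sharp s instead of beta.
text = "Carrots are rich in \u00DF-carotene."

query = "\u03B2-carotene"    # the correct spelling, with Greek beta

found = query in text
position = text.find(query)  # -1 means "not found"

print(found, position)
```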


Automated processing generally ignores appearance

For example, automatic speech synthesis and automatic translation work on characters as abstract entities, not on their visual appearance. If your text contains "1º", meant to mean "one degree" but incorrectly using a masculine ordinal indicator, it might be spelled out as "primero" (Spanish word for "first" in masculine gender). Similarly, it might be translated incorrectly.
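One automated operation where the difference shows up is compatibility normalization, which replaces the ordinal indicator with a plain letter "o" but leaves the degree sign alone. The sketch assumes Python and its standard unicodedata module; normalization is our choice of example and is not discussed above.

```python
import unicodedata

DEGREE = "\u00B0"     # degree sign
ORDINAL = "\u00BA"    # masculine ordinal indicator

names = [unicodedata.name(DEGREE), unicodedata.name(ORDINAL)]

# NFKC normalization treats the ordinal indicator as a variant of the letter "o".
degree_normalized = unicodedata.normalize("NFKC", "1" + DEGREE)
ordinal_normalized = unicodedata.normalize("NFKC", "1" + ORDINAL)

print(names, degree_normalized, ordinal_normalized)
```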

Sometimes these considerations do not matter, or, more often, they need to be suppressed in favor of other needs. If you only aim at producing a document to be distributed on paper and you have full control up to and including the print operation, then the appearance is all that matters. But more often than not, documents are stored and sent in digital form. Then you may need to take precautions against wrong processing, perhaps document what you have done, and check things after various conversions and other operations.

Characters differ in the definiteness of their meaning. Some well-known characters like the hyphen - (known formally as hyphen-minus in Unicode) have a wide range of uses, and you may need to use them liberally. Computer programs need to be prepared for handling them accordingly. But other characters have specific semantics. The letter ø and the technical symbol ⌀ have limited uses. They should not be confused with each other or used for other purposes without careful consideration.

1.2.10. Ambiguity Among Characters

The identity of characters is defined by the definition of a character repertoire. Thus, it is not an absolute concept but relative to the repertoire; some repertoire might contain a character with mixed usage while another defines distinct characters for the different uses. For instance, the ASCII repertoire has a character called "hyphen." It is also used as a minus sign, as well as a substitute for a dash, since ASCII contains no dashes. Thus, that ASCII character is a generic, multipurpose character, and one can say that in ASCII, hyphen and minus are identical. But in Unicode, there are distinct characters named "hyphen" and "minus sign" (as well as different dash characters). For compatibility, the old ASCII character is preserved in Unicode, too (in the old code position, with the name hyphen-minus).
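The relativity to the repertoire can be illustrated as follows. The sketch assumes Python and its standard unicodedata module; it lists the Unicode names of four hyphen-like characters and checks which of them belong to the ASCII repertoire.

```python
import unicodedata

# Hyphen-minus, hyphen, minus sign, and en dash.
candidates = ["\u002D", "\u2010", "\u2212", "\u2013"]

names = [unicodedata.name(ch) for ch in candidates]

# Only the multipurpose hyphen-minus exists in the ASCII repertoire.
in_ascii = [ch.isascii() for ch in candidates]

print(list(zip(names, in_ascii)))
```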

Similarly, as a matter of definition, Unicode defines characters for micro sign, n-ary product, etc., as distinct from the Greek letters (small mu, capital pi, etc.) from which they originate. This is a logical distinction and does not necessarily imply that different glyphs are used. The distinction is important, for example, when textual data in digital form is processed by a program (which "sees" the code values, through some encoding, and not the glyphs at all). Note that Unicode does not make any distinction, for example, between the Greek small letter pi (π), and the mathematical symbol pi denoting the well-known constant 3.14159... (i.e., there is no separate symbol for the latter). For the ohm sign (Ω), there is a specific character (in the Symbols Area), but it is defined as being canonically equivalent to Greek capital letter omega (Ω), i.e., there are two separate characters but they are equivalent. On the other hand, Unicode makes a distinction between Greek capital letter pi (Π) and the mathematical symbol n-ary product (∏), so that they are not equivalent.

If you think this doesn't sound quite logical, you are not the only one to think so. The point is that for symbols resembling Greek letters and used in various contexts, there are three possibilities in Unicode:

  • The symbol is regarded as identical to the Greek letter (it is just one particular use of that letter).

  • The symbol is included as a separate character, but it is defined as equivalent to the Greek letter. There are two kinds of equivalence: canonical and compatibility.

  • The symbol is regarded as a completely separate character.

You need to check the Unicode references for information about each individual symbol. As a rough rule of thumb about symbols looking like Greek letters, mathematical operators (like summation) exist as independent characters whereas symbols of quantities and units (like pi and ohm) are identical to Greek letters or equivalent to them.
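Instead of reading the references by hand, you can query the decomposition data for the last two possibilities (in the first one there is no separate character to query). The following sketch assumes Python and its standard unicodedata module; relation is a hypothetical helper, and the three symbols are chosen from the examples above.

```python
import unicodedata

symbols = {
    "ohm sign": "\u2126",
    "micro sign": "\u00B5",
    "n-ary product": "\u220F",
}

def relation(ch: str) -> str:
    """Classify a symbol by its decomposition mapping, if it has one."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "separate"          # no equivalence to any other character
    if mapping.startswith("<"):
        return "compatibility"     # tagged mapping such as "<compat> 03BC"
    return "canonical"             # untagged mapping such as "03A9"

relations = {label: relation(ch) for label, ch in symbols.items()}

# Canonical normalization replaces the ohm sign with Greek capital omega.
ohm_normalized = unicodedata.normalize("NFC", symbols["ohm sign"])

print(relations)
```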

1.2.11. How Do I Find My Character?

Suppose you have been requested to convert some printed or handwritten text into a digital format. (At the end of this chapter, we have such an exercise.) For English text with no special characters, you might be able to use a scanner. But what would you do with characters that the scanner does not recognize reliably?

Such problems are fairly common. For example, you might need to check the spelling of a foreign name from a printed reference book, or you might need to quote some printed material. Even standards on various notations often fail to specify the characters unambiguously: the authoritative format of a standard is usually a printed publication, and all you have got there is ink on paper, glyphs.

The recognition of a character from its glyph can be quite difficult, and it may require both factual and cultural knowledge about the subject area and the text. You also need technical information on character standards, since you ultimately need to identify glyphs as appearances of characters defined in the standards.

Looking for characters through lists or code charts is a rather hopeless task. The number of characters is huge, and many characters look very similar to each other. For example, how can you know whether a glyph on paper is letter "a" with a caron (ǎ) or letter "a" with a breve (ă)? Thus, you first need some information or guess on the nature of a character. If you know or suspect that the character appears in a Romanian name, you have a good starting point, since the character repertoire used in Romanian can be found in a suitable reference. Similarly, if you know that a glyph like ₣ is a currency symbol, you have almost identified it.
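Once you have candidate characters in digital form, telling them apart is easy even when the glyphs are nearly identical. A minimal sketch, assuming Python and its standard unicodedata module (parts is a hypothetical helper): it decomposes each letter into its base letter and diacritic mark and names the components.

```python
import unicodedata

def parts(ch: str) -> list[str]:
    """Name the components of a character after canonical decomposition."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

a_caron = parts("\u01CE")
a_breve = parts("\u0103")

# The general category "Sc" means "Symbol, Currency".
franc_category = unicodedata.category("\u20A3")

print(a_caron, a_breve, franc_category)
```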

The following list suggests some general online resources for identifying characters:


"Where is my Character?" (http://www.Unicode.org/standard/where/)

An explanatory document by the Unicode Consortium. It explains some problems caused by the variation of shapes of characters.


Unicode Code Charts (http://www.Unicode.org/charts/)

This is official information and covers all Unicode characters. It is organized first by division into "Scripts" (writing systems for human languages, containing letters, syllables, and word signs) and "Symbols and Punctuation." These parts are further divided into large categories such as "European Alphabets." Figure 1-3 illustrates the appearance of the main page of the Code Charts.


Fileformat.info, section Unicode (http://www.fileformat.info/info/Unicode/)

This contains data taken from the Unicode site and organized for viewing in different ways. It also contains information on Unicode support in different fonts. As you get down to information on individual characters, their properties are displayed in a compact format, which is great when you are ready to use it.


Database of characters at the EKI (http://www.eki.ee/letter/)

Although not as exhaustive in character repertoire as the above, this database lets you search for characters in a few ways and shows some essential extra information on usage: it lists languages that use a character and character encodings (charsets) that contain it. Although these lists are not complete, they are often helpful. For example, they tell that letter "a" with a caron (ǎ, U+01CE) is used in Yoruba and in Romanization of Bulgarian and Chinese, whereas the letter "a" with a breve (ă, U+0103) is used in Romanian and Vietnamese and Romanization of Khmer, as shown in Figure 1-4. However, the information is not always completely reliable; in particular, the character used when writing Bulgarian as Romanized (i.e., in Latin letters) is not "a" with a caron but "a" with a breve, according to standards.

1.2.12. Which Characters Does Each Language Use?

For details on the use of characters in different languages, you need to consult grammar guides and textbooks on the languages themselves. However, there is an extensive compilation of basic information in The World's Writing Systems by Peter T. Daniels and William Bright (Oxford University Press). There is a brief description of character usage in a few languages in The Chicago Manual of Style, 15th Edition (The University of Chicago Press). Online, you can find "The Alphabets of Europe," by Michael Everson, at http://www.evertype.com/alphabets/. It is extensive and based on detailed research, although it partly applies different criteria to different languages: for some languages, it includes only the basic modern alphabet; for others, it lists historical characters and other characters that are not used in normal writing. The CLDR database, discussed in Chapter 11, contains information on the use of letters in different languages.

Figure 1-3. Part of the interface to online Unicode code charts



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
