1.4. Glyphs and Fonts

It is important to distinguish the character concept from the glyph concept. A glyph is a presentation of a particular shape a character may have when rendered or displayed. It has even been said that any character is an abstract idea, whereas glyphs for the character are its different visible manifestations.

Each character we use in English normally has the same basic shape, and glyphs for it differ in typographic design only. It is obvious that "T" in the Times font represents the same character as "T" in the Arial font, for example. However, the letter "a" has two rather different shapes (compare "a" in normal Times font and "a" in Times italic). When you write by hand, you may draw characters differently in different positions of a word. For example, a word-final "s" may be quite different from a word-initial "s." In typewritten or typeset text, or in text displayed or printed on computers, such distinctions are not made, even in so-called handwriting-style fonts.

In Greek writing, a word-final sigma (ς) is rather different from a normal small sigma (σ), although they are logically the same character. The first and last letters of the word σοφός (sophos, "wise") are the same but are written differently. However, since this is a special case, character codes usually handle it by encoding the two forms as separate characters, and Unicode follows suit, even without defining any equivalence between them.
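The treatment of the two sigmas can be checked against the Unicode character database. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe and the constant names are illustrative only, and "equivalence" is taken here to mean canonical or compatibility equivalence as tested by normalization.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"  # word-final form
SMALL_SIGMA = "\u03c3"  # form used elsewhere in a word

def describe(ch):
    """Return (code point, Unicode name, decomposition mapping) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.decomposition(ch))

for ch in (FINAL_SIGMA, SMALL_SIGMA):
    print(describe(ch))

# No normalization form maps one sigma to the other, so a plain
# comparison of normalized strings still treats them as different.
sigmas_stay_distinct = all(
    unicodedata.normalize(form, FINAL_SIGMA) != unicodedata.normalize(form, SMALL_SIGMA)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(sigmas_stay_distinct)
```

An empty decomposition mapping for both characters is what you would expect if neither is defined as a variant of the other.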

In other writing systems, the variation can be much bigger, especially if the writing systems imitate handwriting. In Arabic, letters have two or four contextual forms, which can be quite different from each other. Figure 1-5 shows the four forms of an Arabic letter, usually called "ba" or more exactly bāʾ, though the Unicode name is Arabic letter beh (U+0628). The forms are (from right to left!) for use as isolated, at the start of a word, in the middle of a word, and at the end of a word. As you can see, for example, the word-final form (on the left) has a part that helps in joining the character with the previous character. Each of these forms, in turn, can appear differently in different fonts.

In the ISO-8859-6 character code (Latin/Arabic), for example, each Arabic letter has one code position only. This leaves it to rendering engines to determine the context (position within a word) and to use the correct contextual form. Unicode, on the other hand, contains both such characters (effectively, taken from ISO-8859-6) and each of the contextual forms as a separately coded character. This lets you write Arabic so that the rendering process can be very simple, at the cost of extra work in writing. However, even using Unicode, you are normally supposed to use the more abstract Arabic letters.
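The relationship between the abstract letter and its separately coded contextual forms can be inspected with Python's standard unicodedata module. This is a sketch only: the constant names are illustrative, and the four code points listed are the contextual forms of beh in the Arabic Presentation Forms-B block.

```python
import unicodedata

BEH = "\u0628"  # the abstract letter, one code position as in ISO-8859-6

# Separately coded contextual forms: isolated, initial, medial, final.
BEH_FORMS = ["\ufe8f", "\ufe91", "\ufe92", "\ufe90"]

for form in BEH_FORMS:
    # The decomposition mapping names the context and points back to U+0628.
    print(f"U+{ord(form):04X}", unicodedata.name(form), "|", unicodedata.decomposition(form))

# Compatibility normalization (NFKC) folds every contextual form
# back to the abstract letter.
folded = [unicodedata.normalize("NFKC", form) for form in BEH_FORMS]
print(all(ch == BEH for ch in folded))
```

Text stored with the abstract letters therefore compares and searches consistently, while the choice of contextual form is left to the rendering engine.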

It is ultimately a matter of definition whether two graphic presentations are glyphs for the same character or distinct characters. However, it is normally not an individual's decision but a collective agreement. The definition of a character repertoire specifies the "identity" of characters, among other things. One could define a repertoire where uppercase "Z" and lowercase "z" are just two glyphs for the same character. On the other hand, one could define that italic "Z" is a character different from normal "Z," not just a different glyph for it.

In fact, Unicode, for example, contains several characters that could be regarded as mere typographic variants of letters, but for various reasons, Unicode defines them as separate characters. For example, mathematicians use a variant of the letter "N" to denote the set of natural numbers (0, 1, 2,...), and this variant is defined as a separate character (double-struck capital N, ℕ, U+2115) in Unicode.

The design of glyphs has several aspects, both practical and esthetic. For a review of a major company's description of its principles and practices, see Microsoft's "Character design standards" on its typography pages at http://www.microsoft.com/typography/.

Some discussions, such as ISO 9541-1 and ISO/IEC TR 15285, make a further distinction between "glyph image," which is an actual appearance of a glyph, and "glyph," which is a more abstract notion. In such an approach, "glyph" is close to the concept of "character," except that a glyph may present a combination of several characters. Thus, in that approach, the characters "f" and "i" might be represented using an abstract glyph that combines the two characters into a ligature, which itself might have different physical manifestations. Such approaches need to be treated as different from the issue of treating ligatures as (compatibility) characters.

1.4.1. Allowed Variation of Glyphs

When a character repertoire is defined (e.g., in a standard), some particular glyph is often used to describe the appearance of each character, but this should be taken as an example only. The Unicode standard specifically says the glyphs used for a character can be quite different from the "representative glyph," but within cultural conventions:

Consistency with the representative glyph does not require that the images be identical or even graphically similar; rather, it means that both images are generally recognized to be representations of the same character. Representing the character U+0061 Latin small letter a by the glyph "X" would violate its character identity.

Thus, the definition of a character repertoire is not a matter of just listing glyphs. In fact, it's the exception rather than the rule that a character repertoire definition explicitly says something about the meaning and use of a character. For example, the description of the dollar sign $ says that the character may have one or two vertical bars, to make it clear that such variation does not change the character's identity. On the other hand, the pound sign £ has one crossbar, in contrast with the lira sign ₤, which is identified as a separate character.

1.4.2. Fonts and Their Properties

A font contains a repertoire of glyphs. In a more technical sense, as an implementation, a font is an organized set of glyphs. The glyphs may have names that identify them; this is the method used in PostScript fonts. More often, glyphs are identified by their numbers, which typically correspond to code positions of the characters (presented by the glyphs). Thus, a font in that sense is character-code dependent. An expression like Unicode font refers to such issues of basic structure and does not imply that the font contains glyphs for all Unicode characters. In fact, such comprehensive fonts are very rare at present.

A font may contain the same glyph for distinct characters. For example, although characters such as Latin uppercase "A," Cyrillic uppercase "A," and Greek uppercase alpha are regarded as distinct characters (with distinct code values) in Unicode, a font might contain just one "A" that is used to present all of them. In fact, this applies to most fonts. On the other hand, a font may contain alternative glyphs for a character, for use in different contexts.
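The difference between distinct characters and a shared glyph can be sketched in a few lines of Python. The character data comes from the standard unicodedata module; the glyph table is purely hypothetical and does not reflect the layout of any real font format.

```python
import unicodedata

LOOKALIKES = ["\u0041", "\u0410", "\u0391"]  # Latin A, Cyrillic A, Greek Alpha

for ch in LOOKALIKES:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Even compatibility normalization keeps the three characters apart.
distinct_after_nfkc = len({unicodedata.normalize("NFKC", ch) for ch in LOOKALIKES})

# Hypothetical font table: three code positions share one glyph,
# here identified by an invented glyph name.
glyph_for = {0x0041: "A.shared", 0x0410: "A.shared", 0x0391: "A.shared"}
glyphs_used = {glyph_for[ord(ch)] for ch in LOOKALIKES}

print(distinct_after_nfkc, len(glyphs_used))
```

The characters remain three separate items for searching and sorting, even though a reader of the rendered text could not tell them apart.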

Fonts have names, which are often trademarks. The name of a font can be a single word like "Times" or it may consist of two or more words, such as "Times New Roman." It is not uncommon to see fonts that are very similar to each other but have completely different names such as "Helvetica" and "Arial."

Fonts can be classified in many ways, and this belongs to typography rather than our topic. However, some basic classifications as indicated in Table 1-1 are relevant for our purposes, since they appear in program settings for selecting fonts for displaying characters. For example, a program may have one choice of a font for serif font, another choice for sans serif font. These font classes are distinguished by the presence or absence (in French, "sans" means "without") of short strokes that terminate the lines of many letters. Usually there is also the difference that in a sans serif font, the lines of letters have (almost) equal thickness, whereas in a serif font, the thickness varies (e.g., the vertical line of "T" is thicker than the horizontal line).

Table 1-1. Some basic classes of fonts
Class of fonts	Characteristics	Sample font(s)
Serif	Widely used for copy text in books	Times, Georgia
Sans serif	Often used on screen and for small print	Arial, Verdana
Monospace	Equal-width characters, often used for code	Courier New
Cursive	Letters join to each other as in handwriting	Cooper BlkIt BT
Fantasy	Exotic, artistic (font)	Comic Sans MS

The attribute "proportional" refers to any font where the width of character varies, as opposed to monospace fonts. Monospace fonts are often used for computer code, and sometimes to imitate old typewriter text. To be exact, there can be variation in width even in a monospace font: some Unicode characters are defined to be invisible, so they need to have a width of zero, and some characters such as fixed-width spaces have a specific width by definition.

There are many online services for viewing samples of fonts and for identifying the font of some text you have seen. They often tell how to download or buy the fonts, too. See, for example, http://www.identifont.com and http://www.linotype.com.

Typographers often use the term typeface to denote the basic design of glyphs, reserving the word "font" for particular implementations and variants. For example, the Times typeface is available as normal (regular), as bold, as italic, and as bold italic, as well as in different sizes. Variants of a typeface in different sizes may differ in their details; that is, they are not just formed from a basic size by simple scaling.

1.4.3. Font Variation Versus Characters

As mentioned above, variants such as normal, bold, and italic do not normally constitute a character difference. That is, a normal "A" is the same character as a bold "A" or an italic "A." Neither does changing the typeface change the identity of a character, as a rule. However, some Unicode characters have been defined essentially as variants of other characters, although this difference could have been made at the font level only. Such characters are defined in the Unicode standard as having compatibility decompositions, using notations as in Figure 1-6. The symbol ≈ stands for compatibility equivalence, and <font> indicates font variation, similar to what you could achieve using the font element in HTML, but here <font> is just a general notation, not markup. The <font> notation in the Unicode standard does not specify what kind of a font is to be used, and as you can see from the descriptions of U+210C and U+210D, <font> can mean quite different things for different characters. For example, U+210E is essentially "h" in italics, but in the Unicode standard, this is just implicit in its representative glyph.
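The same compatibility decompositions are available programmatically. The sketch below uses Python's standard unicodedata module, which reports the decomposition mapping as a string such as "<font> 0048"; the helper name font_variant_base is illustrative only.

```python
import unicodedata

# A few characters from the Letterlike Symbols block.
LETTERLIKE = ["\u210c", "\u210d", "\u210e", "\u2115"]

def font_variant_base(ch):
    """Return the base character(s) if ch has a <font> compatibility
    decomposition, or None otherwise."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] == "<font>":
        return "".join(chr(int(cp, 16)) for cp in parts[1:])
    return None

for ch in LETTERLIKE:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", font_variant_base(ch))
```

Note that the data gives only the tag and the base character. U+210C and U+210D both map to "H" with the same <font> tag, so nothing in the mapping says which kind of font variation is meant.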

1.4.4. Fonts in Implementations

The implementation of fonts is relevant to our topic, since it affects the practical availability of characters. If a character is only available in a font that is poorly implemented, we may need to look for other approaches. For example, high-quality printing may require the use of certain font technologies.

Figure 1-6. Some descriptions of characters in the Letterlike Symbols block in the Unicode standard

The most important font technologies at present are:

Bitmap fonts: Also known as raster fonts, system fonts, or screen fonts, these fonts essentially present a character as a matrix or raster of pixels, or bits indicating the presence or absence of a pixel. Bitmap fonts are more or less obsolete, though they are still used as "system fonts," often in window titles and dialog boxes.
PostScript Type 1: This technology, developed by Adobe, is widely used in the print industry and in desktop publishing. On your PC, you may find Type 1 fonts, too.
TrueType: This technology was developed by Apple, and then licensed to Microsoft. Probably most fonts on your PC are TrueType fonts (with filenames ending in .ttf).
OpenType: This is a new technology developed jointly by Microsoft and Adobe. It is Unicode oriented and more platform-independent than older technologies.

Fonts other than bitmap fonts are effectively computer programs of a kind, controlling the drawing of lines that constitute a glyph. Fonts are generally protected by copyright laws, although the scope and terms of protection vary by country.

If you use Windows, you will probably benefit from downloading and installing the software from http://www.microsoft.com/typography/TrueTypeProperty21.mspx, "Font properties extension." It enhances the functionality of Windows so that when you open the Fonts folder (via Start → Control Panel), you can right-click on the icon of a font file and select Properties to get rather detailed information on the font. However, the amount of information depends on the technology of the font. Figure 1-7 shows some properties of a TrueType font. The properties include the ranges of Unicode characters that the font supports. Beware, however, that such support is not always exhaustive; the font may lack some characters of a range, especially if the Unicode standard has been extended since the creation of the font. (The figure contains some Finnish words, too. Such things may happen if you install a program that uses English on an operating system that uses a different language in its interface.)

Figure 1-7. Properties of a font (Garamond), as viewed with the Font properties extension

1.4.5. Failures to Display a Character

In addition to the fact that the appearance of a character may vary, it is quite possible that some program fails to display a character at all. Perhaps the program cannot interpret the character encoding of the data, either because it was not properly informed about the encoding or because it has not been programmed to handle the particular encoding.
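The effect of a misinterpreted encoding is easy to reproduce. The following sketch uses only Python's built-in codecs; the sample word and the variable names are chosen for illustration.

```python
# The Greek word for "wise", written with escapes so the example does not
# depend on the encoding of this source text.
GREEK_WORD = "\u03c3\u03bf\u03c6\u03cc\u03c2"

# Bytes as a sender might produce them using ISO-8859-7 (Latin/Greek).
data = GREEK_WORD.encode("iso-8859-7")

# A receiver that was properly informed about the encoding recovers the text.
right = data.decode("iso-8859-7")

# A receiver that wrongly assumes ISO-8859-1 gets five characters as well,
# but they are accented Latin letters, not Greek ones.
wrong = data.decode("iso-8859-1")

# A receiver that handles ASCII only cannot interpret the bytes at all.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(right == GREEK_WORD, wrong == GREEK_WORD, ascii_ok)
```

In the second case nothing fails visibly: the program displays characters, just not the ones the author wrote.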

Even if a program recognizes some data as denoting a character, it may well be unable to display it since it lacks a glyph for it. Often it will help if the user manually checks the font settings, perhaps trying to find a rich enough font. Advanced programs could be expected to do this automatically and even to pick up glyphs from different fonts, but such expectations are often unrealistic at present. However, it is quite possible that no such font can be found. As an important detail, the possibility of seeing, for example, Greek characters on some Windows systems depends on whether "multilingual support" has been installed.

A well-designed program will in some appropriate way indicate its inability to display a character. For example, a small rectangular box, the size of a character, could be used to indicate that there is a character that was recognized but cannot be displayed. Some programs use a question mark, but this is risky: how is the reader expected to distinguish such usage from the real "?" character? Advanced browsers may display a symbol that indicates the general class (e.g., Latin letter or mathematical symbol) of the character.
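The box convention can be sketched as follows. Everything here is hypothetical: a real program would query the font itself for its coverage, whereas this example models a font simply as a set of characters, and it uses U+25A1 (white square) to stand in for the box.

```python
MISSING_GLYPH_BOX = "\u25a1"  # WHITE SQUARE, standing in for a "no glyph" box

def render_with_fallback(text, covered):
    """Replace each character the (hypothetical) font cannot show with a box."""
    return "".join(ch if ch in covered else MISSING_GLYPH_BOX for ch in text)

# Hypothetical font that covers printable ASCII only.
ascii_only_font = {chr(cp) for cp in range(0x20, 0x7F)}

# U+2115 (double-struck capital N) is outside the font's coverage.
shown = render_with_fallback("Is N = \u2115?", ascii_only_font)
print(shown)
```

Because the substitute is a box rather than a question mark, the genuine "?" at the end of the sample text remains unambiguous.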

1.4.6. Font Embedding

To overcome a situation in which a recipient of a document might not have a font needed for the characters in it, techniques have been developed for embedding fonts into documents themselves. This is quite different from what word processors normally do with fonts: they include information about fonts (by font name), not fonts themselves.

Font embedding does not normally mean the inclusion of an entire font but only an extract from the font data, as needed for a particular document. The technique may prevent the recipient from using the embedded font for anything but viewing the particular document. This makes font designers more willing to allow embedding.

Another reason for font embedding is the desire to have a document presented exactly as designed. If you create a document using fonts that you like and send it, the recipient's program may well be capable of displaying all the characters but by using different fonts, in part. Usually if you specify a font that is not present in the recipient's system, the program used for viewing the document will use its default font instead. This might be regarded as a serious problem especially by visual designers.

The Font properties extension that was illustrated in Figure 1-7 gives access to information about font embedding possibilities, in the Embedding pane. If embedding is allowed for a TrueType font, you can, for example, set Microsoft Word to embed the font. For this, you would select Tools → Settings → Save, and then check the box about font embedding. Remember to reset this setting after saving the document, since otherwise Word will keep embedding all TrueType fonts, which is generally unnecessary.

For the Web, Microsoft has developed the Web Embedding Fonts Tool (WEFT) for use with HTML and CSS. However, it has not gained much popularity, partly due to its relative complexity. Instead, the usual approach is to use the PDF format, since common PDF creation tools allow easy font embedding. In addition to commercial products such as Adobe Acrobat, there are free tools like PDFCreator, which adds a "virtual printer" to your system. You can then use the Print command in various programs to generate a PDF version of a document, and in this context, you can check settings that make the tool embed the fonts you have used.

Font embedding has its drawbacks, too. Often it would be desirable for the user to change the font for legibility, but font embedding has more or less been designed to prevent this. A special character may look odd to a user who might well recognize it if it were shown in a familiar font. The PDF format does not allow easy font resizing, which would be crucial to many people. Therefore, it is best to distribute your material in alternative formats in accordance with recipients' choices, such as Microsoft Word, RTF, HTML, or PDF.