8.6. Diacritic Marks

Diacritic marks are small signs added to letters or other characters, such as an acute accent added to the letter "e" to produce é or a tilde added to "a" to produce ã. Usually the mark is placed above the letter, but it can also appear below the letter, as in ç, or in another position. If your native language does not use diacritic marks, you might regard them as decorations only. However, they may fundamentally affect the meanings of words.

8.6.1. Why Diacritic Marks?

Diacritic marks are used to create variants of letters, often because a language that uses Latin letters has more sounds than can be expressed using the basic letters.

Diacritic marks often originate from letters that were written above another letter. For example, the tilde was originally a small "n," so that, for example, "an" was first written with a small "n" above the "a," and then the "n" was simplified, producing ã. When, for example, the sound combination "an" had changed to a nasalized "a" (i.e., the vowel "a" pronounced through the nose, with no consonant "n"), it was natural to denote this sound with a single letter, ã.

People who have designed writing systems for previously unwritten languages often find the basic Latin alphabet insufficient. If there are more essentially different sounds (phonemes) in the language than there are basic letters, you could invent new letters or take them from other alphabets. However, the most common solution is to add diacritic marks to letters, often imitating the orthographies of other languages.

The meanings of diacritic marks vary greatly by language. For example, in French, the acute on é affects the quality of the vowel in pronunciation; in Hungarian, the acute indicates that the vowel is long; in Spanish, that the vowel has stress. It is not a matter of small nuances only; the differences can be crucial to making distinctions in meaning. The French verbs "pêcher" (to fish) and "pécher" (to sin) are quite different.

Sometimes diacritic marks are used just to make a distinction between words that are pronounced the same way and otherwise written the same way, but have different meanings. The Italian words "e" (and) and è (is) are pronounced similarly, but the diacritic marks help readers to see the difference in meaning from the word itself, without context analysis.

In many languages, diacritic marks have an essential role. Speakers of such languages often regard characters created with diacritic marks as completely independent letters. For example, in Swedish, ö is a separate letter, placed at the end of the alphabet. From the Unicode perspective, however, it can also be regarded as letter "o" with a diacritic, the dieresis.

Diacritic marks can also be combined. Letters with two diacritic marks are rare in European languages but common in Vietnamese, for example.

In special notations, such as phonetic writing (e.g., IPA notation) and mathematical formalisms, diacritic marks are often deployed extensively. Their use could not be covered with a reasonable number of combinations of a base letter and a diacritic mark. For example, the Uralic Phonetic Alphabet (UPA) rather routinely uses three or four diacritic marks on a letter to describe various nuances of pronunciation.

Diacritic marks are often omitted, though, especially by people who are not familiar with the rules of a language that uses diacritic marks. People might not know how to write the diacritic marks in a particular program, or they might fear, not without reason, that diacritic marks cause problems in data transfer.

Publishers' policies differ on the use of diacritic marks. The most logical and polite approach is to preserve all diacritic marks in foreign words, excluding those that have been specifically adapted to another language. Thus, in English you should preserve the diacritic in "Rhône" (name of a river in France) but may drop it in a loanword like "rôle," for which the spelling "role" is more common in English. Some names have been adapted in a form without diacritics, e.g., "Aland" (Swedish "Åland"). Similarly, the unit name "angstrom" is often written without diacritics, but the scientist's name must have them: "Ångström."

8.6.2. Early Approaches

In the early days of character data processing on computers, diacritic marks were not used. Later, attempts were made to produce them in a coarse manner similar to those used on typewriters. To produce ô, for example, you typed "o" followed by a control character that moves the writing position backward (to the left), then the character ^ (i.e., the circumflex as a separate character). The control character used was normally the ASCII backspace, BS.
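As a rough illustration of the overstriking technique described above, the following sketch builds the three-character sequence that such a system would have used for ô. The helper name `overstrike` is made up for this example.

```python
BACKSPACE = "\x08"  # ASCII BS (U+0008)

def overstrike(base, mark):
    """Hypothetical helper: base character, backspace, then the spacing mark."""
    return base + BACKSPACE + mark

seq = overstrike("o", "^")
codes = [f"U+{ord(ch):04X}" for ch in seq]
print(codes)  # ['U+006F', 'U+0008', 'U+005E']
```

Whether the result looks like ô depends entirely on the output device; a modern terminal typically just replaces the "o" with "^".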

The results were of course esthetically poor, since the same diacritic was used for all letters, lowercase and uppercase. Moreover, for economic reasons (like saving keyboard keys and coding space), the characters used as diacritic marks were often not designed for the purpose. Instead, existing characters were overloaded with new meanings and uses. For example, ASCII does not contain an acute accent, but the ASCII apostrophe was meant to be used as an accent too. Since the ASCII apostrophe had to serve so many different purposes, its appearance had to be neutral, hence not really suitable for any of the uses.

Once some characters had been introduced for use as overprinting diacritic marks, new uses were invented for them. After all, there was a very limited character repertoire available. Thus, for example, since the circumflex ^ looks like an upward-pointing arrow head, it was taken into special usage such as exponentiation: x^y is often used to denote x to the power y. This in turn implied that the glyph for the character had to be clearly visible, even in low-quality rendering that was common at that time. That way, the circumflex became rather big in shape. It then became rather unsuitable as a diacritic mark, but it mostly wasn't used for that purpose anyway.

8.6.3. Coded Combinations

In Latin-1 and other 8-bit character sets, some character positions were assigned to letters with diacritic marks as needed for writing particular languages. For example, Latin-1 contains characters such as é and ü for the needs of Western European languages. Due to the limitations of the coding space and the practical nature of the character sets, the assignments do not follow very regular patterns. For example, Latin-1 contains the letter ÿ but not the corresponding uppercase letter; the letter is rare in itself, and its uppercase variant is very rare.

Figure 8-2. Sample glyphs for combining diacritic marks

Although Unicode contains "precomposed" characters as well, it turned out to be unsatisfactory to define all the possible combinations as separate characters. The concept of "combining diacritic marks" was introduced to allow, in principle, free combinations. You can use almost any character as a base character and attach any diacritic marks to it. Some of the combinations result in characters that already exist in Unicode as precomposed characters. This raises the problem of dual representations of the same character, which is addressed by normalization.
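The dual-representation problem can be seen in a few lines of Python. This is a minimal sketch using the standard `unicodedata` module; the variable names are chosen for this example only.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look alike but differ as code point sequences.
same_code_points = precomposed == decomposed

# After normalization to the same form, they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))

print(len(precomposed), len(decomposed))  # 1 2
print(same_code_points, same_after_nfc)   # False True
```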

The general idea is that new precomposed characters, consisting of a Unicode character and a Unicode diacritic, will normally not be added to Unicode anymore. This has caused some controversy for obvious reasons, since precomposed characters, with their own code positions, are often regarded as "more real" than the combinations. Partly for such reasons, the concept of a Unicode Sequence Identifier (USI), described in Chapter 4, was introduced.

8.6.4. Combining Diacritic Marks

A combining diacritic mark is a character that is meant to be presented in conjunction with a base character, not as such. For example, when the combining acute accent U+0301 appears after the letter "a," this character pair is to be rendered as á. Should you wish to render the combining acute accent itself, you could put it after the space (or no-break space) character, i.e., combine it with a graphically empty character. This would normally create the same rendering as the acute accent ´ (U+00B4), which is treated as the "spacing clone" of the combining acute accent. In code charts, combining diacritic marks are often shown using a dotted circle to symbolize a generic base character, as in Figure 8-2.
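One way to check that a character is a combining mark, and to build the sequences just described, is sketched below with Python's `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

acute = "\u0301"
print(unicodedata.name(acute))      # COMBINING ACUTE ACCENT
print(unicodedata.category(acute))  # Mn (nonspacing mark)

# A nonzero canonical combining class marks a character that attaches to a base.
is_combining = unicodedata.combining(acute) != 0

a_acute = "a" + acute          # rendered as á but stored as two code points
standalone = "\u00A0" + acute  # the mark displayed on a no-break space
print(len(a_acute), is_combining)   # 2 True
```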

You might think of a combining diacritic mark as corresponding to backspace followed by the corresponding spacing (noncombining) character. That is, you might regard U+0301 as resembling backspace U+0008 followed by acute accent U+00B4. Although such thinking paints a picture that is useful up to a point, it easily becomes misleading after that.

Programs that support combining diacritic marks in rendering are really supposed to do much more elaborate operations than backspacing and overprinting. A program is supposed to analyze the base character and the combining diacritic and pick up a suitable glyph (designed, as an element of a font, by a typographer), such as á, if possible. As a second option, a program should construct a visual rendering that places the diacritic on the base character intelligently. For example, to produce á and Á that way, the program should at least pay attention to the different heights of "a" and "A."

Existing software is often deficient in supporting combining diacritic marks. It might get simple cases right, but it might also use simplistic methods that correspond to overprinting. This might result in a rendering where the diacritic is barely visible, or not visible at all. It is currently much safer to use precomposed characters when possible. The Unicode Normalization Form C (see Chapter 5) is suitable for such purposes.
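A minimal sketch of this advice in Python: normalize to Form C, then report any combining marks that remain because no precomposed character exists for them. The function names `compose` and `leftover_marks` are invented for this example, and the check only catches marks with a nonzero combining class.

```python
import unicodedata

def compose(text):
    """Normalize to NFC so that precomposed characters are used where they exist."""
    return unicodedata.normalize("NFC", text)

def leftover_marks(text):
    """List combining marks (nonzero combining class) that survive NFC."""
    return [f"U+{ord(ch):04X}" for ch in compose(text)
            if unicodedata.combining(ch) != 0]

print(len(compose("a\u0301")))    # 1: composed into á (U+00E1)
print(leftover_marks("a\u0301"))  # []
print(leftover_marks("x\u0301"))  # ['U+0301']: no precomposed "x with acute"
```

Text for which `leftover_marks` returns a nonempty list still depends on the rendering software and font handling combining marks properly.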

There is a particular danger when a program has been instructed to use one font as the primary font and another font, or other fonts, as fallback for characters that do not have glyphs in the primary font. The data might contain combining diacritic marks that do not appear in the primary font. Consider what would happen if a program, when presenting the data U+0061 U+0301 (small letter a, combining acute accent), used the Times font for the first character and Arial Unicode MS for the latter. Since the proportions of glyphs are different, the diacritic will not be placed well on the letter. This would result in á, which is typographically inferior; compare it with the precomposed character in the two fonts: á and á. A program can avoid this particular case by using the precomposed character, but in the general case, such a character may not exist, or the basic font used might lack it.

If you use combining diacritic marks, be aware that not many fonts contain them. Select a suitable font, and make sure it is used for the base characters, too.


The combining marks used for Latin letters, as well as many other marks, are in the block "Combining Diacritical Marks" ranging from U+0300 to U+036F. The grouping of these characters is shown in Table 8-6. The attribute "combining" has been omitted from the character names here for brevity. The "Ordinary diacritics" group is by far the most common. Note that there are combining marks outside this block, too.

Table 8-6. Classification of combining diacritic marks

| Range | Name of group | Sample diacritic name |
|---|---|---|
| U+0300..U+0333 | Ordinary diacritics | Grave accent |
| U+0334..U+0338 | Overstruck diacritics | Tilde overlay |
| U+0339..U+033F | Additions | Right half below |
| U+0340..U+0341 | Vietnamese tone marks (deprecated) | Grave tone mark |
| U+0342..U+0345 | Additions for Greek | Greek perispomeni |
| U+0346..U+034A | Additions for IPA | Bridge above |
| U+034B..U+034E | IPA diacritics for disordered speech | Homothetic above |
| U+034F | Grapheme joiner | Grapheme joiner |
| U+0350..U+0357 | Additions for UPA | Right arrowhead above |
| U+035D..U+0362 | Double diacritics | Double breve |
| U+0363..U+036F | Medieval superscript letter diacritics | Latin small letter a |
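The grouping in Table 8-6 can be turned into a simple range lookup. The sketch below copies the ranges from the table; the names `GROUPS` and `group_of` are made up for this example.

```python
# Ranges and group names as listed in Table 8-6.
GROUPS = [
    (0x0300, 0x0333, "Ordinary diacritics"),
    (0x0334, 0x0338, "Overstruck diacritics"),
    (0x0339, 0x033F, "Additions"),
    (0x0340, 0x0341, "Vietnamese tone marks (deprecated)"),
    (0x0342, 0x0345, "Additions for Greek"),
    (0x0346, 0x034A, "Additions for IPA"),
    (0x034B, 0x034E, "IPA diacritics for disordered speech"),
    (0x034F, 0x034F, "Grapheme joiner"),
    (0x0350, 0x0357, "Additions for UPA"),
    (0x035D, 0x0362, "Double diacritics"),
    (0x0363, 0x036F, "Medieval superscript letter diacritics"),
]

def group_of(ch):
    """Return the Table 8-6 group for a character, or None if no range matches."""
    cp = ord(ch)
    for first, last, name in GROUPS:
        if first <= cp <= last:
            return name
    return None

print(group_of("\u0301"))  # Ordinary diacritics
print(group_of("\u0361"))  # Double diacritics
print(group_of("\u0358"))  # None: not covered by the ranges in the table
```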


The "ordinary" diacritic marks in the block are listed in Table 8-7, in alphabetic order by name, omitting the attribute "combining." The first column shows the character as combined with the letter "a."

Table 8-7. Ordinary combining diacritic marks

| Character | Code | Diacritic mark |
|---|---|---|
| á | U+0301 | Acute accent |
| a̗ | U+0317 | Acute accent below |
| ă | U+0306 | Breve |
| a̮ | U+032E | Breve below |
| a̪ | U+032A | Bridge below |
| a̐ | U+0310 | Candrabindu |
| ǎ | U+030C | Caron |
| a̬ | U+032C | Caron below |
| a̧ | U+0327 | Cedilla |
| â | U+0302 | Circumflex accent |
| a̭ | U+032D | Circumflex accent below |
| a̓ | U+0313 | Comma above |
| a̕ | U+0315 | Comma above right |
| a̦ | U+0326 | Comma below |
| ä | U+0308 | Dieresis |
| a̤ | U+0324 | Dieresis below |
| ȧ | U+0307 | Dot above |
| ạ | U+0323 | Dot below |
| a̋ | U+030B | Double acute accent |
| ȁ | U+030F | Double grave accent |
| a̳ | U+0333 | Double low line |
| a̎ | U+030E | Double vertical line above |
| a̞ | U+031E | Down tack below |
| à | U+0300 | Grave accent |
| a̖ | U+0316 | Grave accent below |
| ả | U+0309 | Hook above |
| a̛ | U+031B | Horn |
| ȃ | U+0311 | Inverted breve |
| a̯ | U+032F | Inverted breve below |
| a̫ | U+032B | Inverted double arch below |
| a̚ | U+031A | Left angle above |
| a̜ | U+031C | Left half ring below |
| a̘ | U+0318 | Left tack below |
| a̲ | U+0332 | Low line |
| ā | U+0304 | Macron |
| a̱ | U+0331 | Macron below |
| a̠ | U+0320 | Minus sign below |
| ą | U+0328 | Ogonek |
| a̅ | U+0305 | Overline |
| a̡ | U+0321 | Palatalized hook below |
| a̟ | U+031F | Plus sign below |
| a̢ | U+0322 | Retroflex hook below |
| a̔ | U+0314 | Reversed comma above |
| a̙ | U+0319 | Right tack below |
| å | U+030A | Ring above |
| ḁ | U+0325 | Ring below |
| ã | U+0303 | Tilde |
| a̰ | U+0330 | Tilde below |
| a̒ | U+0312 | Turned comma above |
| a̝ | U+031D | Up tack below |
| a̍ | U+030D | Vertical line above |
| a̩ | U+0329 | Vertical line below |
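A listing like Table 8-7 can be generated from the character database. The sketch below assumes Python's `unicodedata` module and uses a made-up helper, `table_row`. Note that the official Unicode names spell "diaeresis" where the table uses "dieresis."

```python
import unicodedata

def table_row(cp):
    """Return (sample with "a", code, name without the "combining" attribute)."""
    mark = chr(cp)
    label = unicodedata.name(mark).replace("COMBINING ", "", 1).capitalize()
    return ("a" + mark, f"U+{cp:04X}", label)

# The "ordinary" diacritics occupy U+0300..U+0333; sort by name as in the table.
rows = sorted((table_row(cp) for cp in range(0x0300, 0x0334)),
              key=lambda row: row[2])

for sample, code, label in rows[:3]:
    print(sample, code, label)
```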


Some diacritic marks are often confused with each other. In particular, the caron (hacek) is often confused with the breve, which typically indicates that a vowel is short. The marks may look rather similar, but the caron is angular and v-like in shape, often characterized as an inverted circumflex, whereas the breve is at least mildly curved, a little u-like. Although the visual differences can be very small, there is a fundamental difference in the coded representations of the characters. Nobody knows where the name "caron" (used mostly in character standards only) comes from, and the common name for this diacritic is "hacek" (from the Czech word "háček").
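When two letters look nearly identical on screen, their coded representations can be inspected directly. The following sketch, with an illustrative helper named `describe`, lists the character names in the canonical decomposition.

```python
import unicodedata

def describe(ch):
    """Names of the code points in the canonical decomposition (NFD) of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

print(describe("\u01CE"))  # ['LATIN SMALL LETTER A', 'COMBINING CARON']
print(describe("\u0103"))  # ['LATIN SMALL LETTER A', 'COMBINING BREVE']
```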

Combining macron below (U+0331) and combining low line (U+0332) both indicate underlining of a kind, but the latter is supposed to join on both sides. That is, for two consecutive characters with combining low line, you would expect the underlining to be continuous. These combining marks should only be used when underlining is part of a writing system, e.g., when the orthography of a language uses an underlined letter to indicate something different from the base letter. For underlining used, for example, for emphasis or decoration, it is much better to use markup, word processor commands, or other tools.

The double diacritics U+035D to U+0362 are special in the sense that such a diacritic applies to the two characters around it. This is an exception to the rule that in Unicode, a combining diacritic appears after its base character. For example, to write an underlined "ts" so that there is just one long underline that applies to both characters, you would use U+0074 U+035F U+0073 ("t," combining double macron below, "s"). The character U+035F is poorly supported, but you might have better success, for example, with combining double inverted breve U+0361: U+0074 U+0361 U+0073 might produce t͡s.
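The sequence for the tied "ts" can be built as follows. This is a small sketch with illustrative variable names; how the result is displayed depends on the font and rendering software.

```python
import unicodedata

tie = "\u0361"                # COMBINING DOUBLE INVERTED BREVE
affricate = "t" + tie + "s"   # the double diacritic sits between the two letters

codes = [f"U+{ord(c):04X}" for c in affricate]
print(codes)                  # ['U+0074', 'U+0361', 'U+0073']
print(unicodedata.name(tie))  # COMBINING DOUBLE INVERTED BREVE

# No precomposed character exists for this sequence, so NFC leaves it unchanged.
unchanged = unicodedata.normalize("NFC", affricate) == affricate
print(unchanged)              # True
```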

The double diacritics are meant to be used in special cases where they belong to a script or notation (such as IPA). Note that the word "double" occurs in names of diacritics somewhat confusingly. For example, combining double low line U+0333 is not a double diacritic as discussed here, just a doubled low line under one character (a̳).

There are additional diacritic marks in the block "Combining Diacritical Marks for Symbols," U+20D0..U+20FF. As the name suggests, they are mainly meant for use with mathematical and other symbols. They have rather limited support in software and fonts. For example, to write letter "x" with a rightward arrow above it, you could in theory use "x" followed by combining right arrow above U+20D7. However, few fonts contain it, and it might even be incorrectly marked as a normal graphic character ⃗, not a combining diacritic mark. Thus, for formulas and texts containing such symbols, it is probably better to use special software like formula editors, instead of trying to represent them as plain text.

8.6.5. Variation in Appearance

The visual appearance of a diacritic mark may vary greatly by font. In handwriting, there is even more variation; for example, a handwritten ä may look like ã or ā.

The Latin small letter "a" with breve, ă, is often written using a tilde as the diacritic: ã. Such appearances can often be regarded as substitutions of one character for another, and they may have technical reasons. For example, ã (being an ISO Latin 1 character) might be available where ă is not. Whether the variation causes problems depends on the character repertoire of the language and the context. Although people can use ã instead of ă in Romanian, since this has become common and will probably not cause confusion, it would be risky to use an ambiguous form such as ā, since the reader could not immediately see whether it stands for ă or for â, which are both used in Romanian.

Some of the variation is language-dependent. For example, the acute accent in French typically looks different from the acute in Polish. This can be reflected in the dislike of fonts and even in labeling some fonts as "foreign."

Unicode has even unified the modern Greek stress mark, tonos, with the acute accent, although the name "tonos" appears in the names of characters. For example, the Greek small letter alpha with tonos (U+03AC) is defined as canonically equivalent to a normal alpha followed by the combining acute accent, U+0301. Despite this, the shape of the diacritic in such characters differs from the acute in, for example, é. The difference is more striking in, for example, the Greek capital letter alpha with tonos (U+0386), since in Greek typography the tonos is positioned to the left of the base character in such a case: Ά.
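The decompositions of the alpha-with-tonos characters can be looked up in the character database. The sketch below assumes Python's `unicodedata` module; the dictionary name `decomps` is chosen for this example.

```python
import unicodedata

decomps = {}
for ch in ("\u03AC", "\u0386"):
    code = f"U+{ord(ch):04X}"
    decomps[code] = unicodedata.decomposition(ch)
    print(code, unicodedata.name(ch), "->", decomps[code])

# Neither mapping carries a <compat> tag, so both are canonical decompositions
# into alpha followed by U+0301, the combining acute accent.
```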

Some diacritic marks have a regular appearance that deviates from what you might expect from their name. The Latin capital letter "T" with caron, used in Czech and Slovak, looks as you'd expect: Ť. However, its lowercase counterpart, Latin small letter "t" with caron U+0165, has a comma-like diacritic in most fonts: ť. This means that the diacritic mark looks like a comma or an apostrophe, but it is called caron and treated as caron in Unicode (e.g., in the canonical decomposition). Although this sounds unnatural, it would also be unnatural to have "T" with caron mapped to, say, "t" with comma above right in an uppercase-to-lowercase mapping.
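That the lowercase letter is a "t" with caron in the coded data, whatever its glyph looks like, can be checked as in this short sketch using Python's `unicodedata` module and ordinary case mapping.

```python
import unicodedata

small_t_caron = "\u0165"
decomposition = unicodedata.decomposition(small_t_caron)

print(unicodedata.name(small_t_caron))  # LATIN SMALL LETTER T WITH CARON
print(decomposition)                    # 0074 030C, i.e. "t" + combining caron

# Case mapping pairs it with the capital letter "T" with caron, U+0164.
maps_to_capital = small_t_caron.upper() == "\u0164"
print(maps_to_capital)                  # True
```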

8.6.6. Spacing Diacritic Marks

When a combining diacritic mark is applied to a space character, we get the diacritic itself as a visible character. Alternatively, we might use a character that itself represents a spacing diacritic mark. Such characters, often called "spacing clones" of diacritic marks, appear, for historical reasons, in different blocks, such as Latin-1 Supplement and Spacing Modifier Letters.

Starting with Unicode 4.1, the recommendation is to apply a combining diacritic mark to a no-break space U+00A0 rather than a space U+0020. The reason is "potential conflicts with the handling of sequences of U+0020 space characters in contexts like XML." However, the formal definitions still define decompositions using the space. For example, the acute accent ´ (U+00B4) is by definition compatibility equivalent to a two-character sequence consisting of a space U+0020 and a combining acute accent U+0301.

Spacing diacritic marks do not have much use. Sometimes we might wish to mention a diacritic in text, such as "the acute ´ has varying shapes." More often, the spacing diacritic marks are used mistakenly (or questionably) as replacements for more appropriate characters (e.g., the acute as an apostrophe).

Some Basic Latin (ASCII) characters are historically derived from diacritic marks but are now treated as characters on their own. For example, the tilde ~ (U+007E) is not treated as a spacing clone of the combining tilde U+0303; that would in fact be odd, since the tilde has a rather different appearance. Instead, there is a separate character, small tilde (U+02DC), which is by definition compatibility equivalent to U+0020 U+0303.
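The compatibility decompositions of the spacing clones, and the absence of one for the ASCII tilde, can be verified with a short sketch like the following. It uses Python's `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

spacing = {f"U+{ord(ch):04X}": unicodedata.decomposition(ch)
           for ch in ("\u00B4", "\u02DC", "~")}
for code, mapping in spacing.items():
    print(code, repr(mapping))
# U+00B4 '<compat> 0020 0301'
# U+02DC '<compat> 0020 0303'
# U+007E ''   (the ASCII tilde has no decomposition)

# Compatibility normalization replaces the spacing clone with space + mark.
nfkd_acute = unicodedata.normalize("NFKD", "\u00B4")
print([f"U+{ord(c):04X}" for c in nfkd_acute])  # ['U+0020', 'U+0301']
```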



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
