8.8. General Punctuation

The General Punctuation block (U+2000..U+206F) is very important, since many characters in it are used frequently. It is however a mixed set, and only under a very liberal interpretation can we regard all characters there as punctuation. For example, the per mille sign (U+2030) is comparable to a unit symbol rather than the comma or the colon.

On the other hand, there are important punctuation characters elsewhere. The Basic Latin and Latin 1 Supplement blocks contain many very common punctuation characters like the comma. Moreover, characters that are used in only one script have usually been placed in the same block as the letters or other characters of the script.

8.8.1. Space Characters

In ASCII, there is only one space character, space. The Latin 1 Supplement block adds the no-break space, which is meant to be used instead of a space between words and expressions when line breaking should be disallowed there. There are several other space characters in Unicode, but they are of rather limited usefulness and use.

8.8.1.1. Space

The space character U+0020 normally creates horizontal empty space. Depending on the rendering software, the spacing could be of fixed width (for any particular font), or it could vary, especially in typesetting when the text is justified on both sides. The spacing might also be affected by commands of the typesetting program or other means, such as a stylesheet (e.g., using the word-spacing property in CSS) when authoring in HTML.

Often texts can be reformatted so that spaces are replaced by line breaks or vice versa. In technical terms, Unicode describes this so that a line break is normally permitted after a space character. The space that is left at the end of a line is then ignored in formatting.

It is common to omit spaces in situations where orthography rules would require a space but both the width adjustments and the breakability would cause undesired effects. For example, the rules of the SI, the International System of Units, require a space between a number and a unit, as in "5 m" (five meters), but people often write "5m." Of course we don't want a line break between "5" and "m" or even a wide gap as in "5   m," when text justification requires increased spacing between words. Usually, however, we can prevent such effects and still comply with orthography rules, by using a no-break space.

8.8.1.2. No-break space: use it!

The no-break space character U+00A0 is similar to a normal space but does not allow a line break after it. That is, if you have "foo bar" with a no-break space between the words, then the words are kept on the same line when the text is rendered or reformatted. Note that you use a no-break space instead of a normal space, not in addition to it. The no-break space is also called a "hard space" or "required space," though these unofficial names may also allude to other meanings, which are often coupled with the non-breaking behavior.

In addition to its basic meaning, the no-break space usually has the property of being of fixed width, for any given font. That is, it is neither expanded nor shrunk in text justification. This behavior is not defined in the Unicode standard, but it is very common. It is probably often caused by the way programs deal with the no-break space: they treat it as a printable character, just with an empty glyph (of a particular width), not as a character that controls spacing. It's like an alphabetic character, just empty.

Some programs, such as web browsers, by default collapse consecutive spaces. That is, any sequence of space characters might be treated as equivalent to a single space. The programs usually treat no-break space characters as non-collapsing. This is natural, since no-break space is usually treated as a fixed width character, as just explained.
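As a rough illustration of this behavior, the following Python sketch collapses runs of ordinary spaces while leaving no-break spaces alone. The function name collapse_spaces is made up for this example, and real browsers also treat tabs and line breaks as collapsible white space, which the sketch ignores.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE


def collapse_spaces(text: str) -> str:
    """Collapse each run of U+0020 into a single space.

    U+00A0 does not match the pattern, so no-break spaces survive.
    """
    return re.sub(r" {2,}", " ", text)


print(ascii(collapse_spaces("a   b")))               # ordinary spaces are merged
print(ascii(collapse_spaces("a" + NBSP * 3 + "b")))  # no-break spaces are kept
```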

The no-break space has some special uses. In the HTML source code of web pages, you might find table cells that contain nothing but a no-break space, usually written as an HTML entity, &nbsp;. The reason is that web browsers commonly treat empty cells differently from nonempty cells (e.g., empty cells may lack borders), and they typically treat a cell with a normal space as empty, a cell with a no-break space as nonempty.

The no-break space belongs to all ISO-8859 encodings, so it is widely available. However, it is not used very widely yet, partly because people do not know about it or how to type it simply. When using MS Word, for example, you can type a no-break space almost as easily as a normal space: just keep the Ctrl and Shift keys pressed down when you hit the spacebar. You can make no-break spaces visible in MS Word by selecting the Show ¶ mode (often by clicking on the ¶ button); Word then shows a no-break space as a degree sign, °. In other programs, things can be different, but often you can define a keyboard shortcut you can use.

The difficult part is to adopt the habit of using no-break spaces. The following list suggests some common cases where you might routinely use a no-break space:

  • Between a number and a unit, as in "5 m"

  • Between a word and a closely associated number or symbol, as in "section 1" or "letter x"

  • Within a number or a code that contains spaces, as in "1 000 000" in languages that use a space as thousands separator, or in phone numbers like "+358 9 888 2675"

  • In short expressions like "U = V" or "a < 0"

  • Before the last word of a paragraph, if that word is very short
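The first case in the list can be automated to some extent. The following Python sketch replaces the ordinary space between a digit and a unit symbol with U+00A0. The function name and the short list of units are assumptions made for illustration; a real tool would need a fuller unit list and some care with false matches.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

# Hypothetical, deliberately short list of unit symbols for this sketch.
UNITS = ("mm", "cm", "km", "kg", "m", "g", "s")

# A digit, one ordinary space, and a unit symbol that ends at a word boundary.
_NUMBER_UNIT = re.compile(r"(\d) (" + "|".join(UNITS) + r")\b")


def bind_number_to_unit(text: str) -> str:
    """Replace the space between a number and a unit with a no-break space."""
    return _NUMBER_UNIT.sub(lambda m: m.group(1) + NBSP + m.group(2), text)


print(ascii(bind_number_to_unit("The pole is 5 m long.")))
```

The word boundary after the unit keeps ordinary words such as "minutes" from being treated as the unit "m".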

If you find this too difficult, you might decide to use no-break space only when you notice a particularly bad line break in your text. However, texts are very often edited and reformatted so that you cannot predict line breaks well.

On the other hand, when the formatting is important (e.g., in headings and headlines), you might use no-break spaces even more extensively. For example, you might wish to prevent a short word that starts or ends a sentence from being separated from the rest of the sentence. Remember, however, that preventing line breaks increases the odds for bad formatting in other parts of a paragraph.

8.8.1.3. Fixed-width spaces: rarely used

Unicode contains a set of space characters, shown in Table 8-8, that are similar to the common space but have a fixed width. This means that they are normally not adjusted by typesetting programs. On the other hand, such programs may contain commands for inserting something such as a thin space, which might not be the Unicode thin space character but an internal code that affects spacing. In that case, the spacing effect is often controllable via the program's commands in a detailed manner.

Table 8-8. Fixed-width space characters in Unicode

| Code | Name | Width |
| --- | --- | --- |
| U+200B | Zero width space (ZWSP) | Nominally no width, but may expand |
| U+200A | Hair space | Defined as "narrower than thin space" |
| U+2006 | Six-per-em space | 1/6 em (0.166... em) |
| U+2009 | Thin space | 1/5 em (0.2 em) or sometimes 1/6 em |
| U+205F | Medium mathematical space | 4/18 em (0.222... em) |
| U+2005 | Four-per-em space | 1/4 em (0.25 em) |
| U+2004 | Three-per-em space | 1/3 em (0.333... em) |
| U+2002 | En space | 1 en (0.5 em) |
| U+2000 | En quad | 1 en (0.5 em) |
| U+2003 | Em space | 1 em (the size of the font in use) |
| U+2001 | Em quad | 1 em |
| U+2008 | Punctuation space | The width of a period (full stop) "." |
| U+2007 | Figure space | The width of a digit (tabular width) |
| U+3000 | Ideographic space | The width of ideographic (CJK) characters |

The fixed-width characters have been included in Unicode mostly for compatibility reasons. They are rarely used in practice. They may have some special uses, however. For example, figure space could be used for alignment purposes in numerical tables. If you have, say, a column with values like 1.2, 1.151, and 1.41, you could right-pad the values with figure spaces so that they have the same number of characters to the right of the decimal point. Then aligning the column to the right would make the values aligned to the decimal point. This is useful in contexts where you have no direct method for such alignment (e.g., in HTML authoring). The Unicode line breaking rules in UAX #14 (see Chapter 5) specify that the figure space is non-breaking and even recommend it: "This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking." In practice, it is seldom a good choice, due to lack of support.
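The padding idea can be sketched in Python as follows. The function name pad_decimals is invented for this example, and the sketch assumes that every value is a string containing a decimal point; whether the result actually lines up still depends on the font having a correct figure space glyph.

```python
FIGURE_SPACE = "\u2007"


def pad_decimals(values: list[str]) -> list[str]:
    """Right-pad values with figure spaces to equalize the decimal places."""

    def decimals(value: str) -> int:
        # Number of characters after the decimal point.
        return len(value.partition(".")[2])

    width = max(decimals(v) for v in values)
    return [v + FIGURE_SPACE * (width - decimals(v)) for v in values]


for padded in pad_decimals(["1.2", "1.151", "1.41"]):
    print(ascii(padded))
```

After padding, all three strings have the same number of characters after the decimal point, so right-aligning the column puts the decimal points under each other.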

In particular, the zero-width space (ZWSP) can be used to suggest line breaking possibilities inside a string that could otherwise cause problems in typesetting. The ZWSP character is basically invisible, yet allows a line break after it. Do not confuse this with discretionary hyphens; when a string is broken after a ZWSP, no hyphen is added at the end of a line. For example, a long URL like http://www.cs.tut.fi/~jkorpela/unicode/spaces.html (when used in text) might be modified to contain ZWSP after some slash (/) characters. The ZWSP does not prevent increased spacing between the characters around it, if such spacing is applied (e.g., in order to justify text).
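A minimal Python sketch of the URL example follows. The function name add_break_hints is hypothetical, and the choice to skip the double slash after the scheme is an assumption of this sketch, not a rule from the standard. The modified string is meant for display only; a link target must keep the original URL.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def add_break_hints(url: str) -> str:
    """Insert ZWSP after each single slash, skipping the "//" after the scheme."""
    return re.sub(r"(?<!/)/(?!/)", "/" + ZWSP, url)


url = "http://www.cs.tut.fi/~jkorpela/unicode/spaces.html"
hinted = add_break_hints(url)
print(ascii(hinted))
```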

Beware that implementations may fail to implement fixed-width spaces according to the Unicode descriptions. Programs may lack any particular support for fixed-width space characters in the sense that they would adjust spacing. Instead, programs might just insert a glyph for the fixed-width character if one is available (and most fonts lack them, so the result is often a symbol for an unrepresentable character). To make things worse, the glyphs are often incorrect. For example, the thin space can be narrower or much wider than it should be, and most fonts that contain a punctuation space have far too wide a glyph for it.

Among commonly used fonts, only a few, such as Arial Unicode MS, Lucida Sans Unicode, and Code2000, contain glyphs for all or most fixed-width spaces.

Fixed-width spaces should be used only after checking the appearance in the particular font used and only when you can be reasonably sure that the text will always be rendered using that font.


The fixed-width spaces just listed (all except the figure space) have the basic semantics of a space in the sense that a line break is permitted. This is often a problem. For example, French orthography rules require "fine spaces" around some punctuation characters, as in « Voilà ! ». Although thin spaces would give roughly the correct spacing, they would also permit highly undesirable line breaks. Thus, no-break spaces are safer, though this would mean that the amount of spacing should be controlled elsewhere, above the character level.

8.8.1.4. Adjusting spacing in other ways

As mentioned earlier, fixed-width characters are not used very much. In fact, even if a typesetting program may have a command for inserting "thin space" for example, this need not mean that the Unicode thin space character is actually used. Instead, the program might internally adjust spacing between characters, using tools above the character level. This explains why such programs often let you modify the width of the "thin space" you insert.

In MS Word, you can use the Format Font command to enter a dialog where you can adjust character spacing. If you select a string with the mouse and then set character spacing for it that way, you actually add the specified spacing after each selected character. For example, to produce letter "a" with a line (macron) above it, you could try writing "a" and a macron ¯ (U+00AF), then adjusting the spacing for the "a" suitably so that the macron appears above it. Such tuning would however depend on the font. Usually better tools exist. You could use the small letter "a" with macron (ā, U+0101), or you could use "a" followed by a combining macron, or you could use a formula editor.
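The relationship between the precomposed letter and the combining sequence can be checked with Python's standard unicodedata module. The sketch only shows that the two representations are canonically equivalent; how either one is rendered still depends on the font and the program.

```python
import unicodedata

decomposed = "a\u0304"   # "a" followed by COMBINING MACRON
precomposed = "\u0101"   # LATIN SMALL LETTER A WITH MACRON

# Normalization Form C composes the sequence into the single character.
assert unicodedata.normalize("NFC", decomposed) == precomposed
# Normalization Form D splits the precomposed letter again.
assert unicodedata.normalize("NFD", precomposed) == decomposed

print(unicodedata.name(precomposed))
```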

8.8.1.5. Additional no-break space characters

The character U+202F, narrow no-break space, would appear to address some common problems in spacing, since it is both narrow and nonbreaking. However, support for it in programs and fonts is still rather limited. It was included in Unicode (in Version 3.0) for special purposes: for use in the Mongolian script. It has been defined just as being narrower than a no-break space, without specifying the width, so it cannot give any precise control even in principle.

Finally, there is U+FEFF, zero-width no-break space (ZWNBSP). As its name suggests, it is really an invisible connector. It would prevent a line break inside a string even if a break would otherwise be permitted. The recommended character for such usage is now U+2060, word joiner (WJ). The reason is that ZWNBSP also has a different usage: it is used as a byte order mark (see Chapter 6). However, in practice, ZWNBSP is more widely supported in software at present.
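The double role of U+FEFF can be seen in a short Python sketch that uses only the standard library. The sample string is arbitrary. A leading U+FEFF survives plain UTF-8 decoding as a character, whereas the utf-8-sig codec treats it as a byte order mark and drops it.

```python
import codecs
import unicodedata

ZWNBSP = "\ufeff"
WORD_JOINER = "\u2060"

print(unicodedata.name(ZWNBSP))
print(unicodedata.name(WORD_JOINER))

# Encoded as UTF-8, U+FEFF is exactly the UTF-8 byte order mark.
assert ZWNBSP.encode("utf-8") == codecs.BOM_UTF8

data = codecs.BOM_UTF8 + "foo".encode("utf-8")
as_character = data.decode("utf-8")   # keeps U+FEFF as the first character
as_bom = data.decode("utf-8-sig")     # interprets it as a BOM and removes it
print(ascii(as_character), ascii(as_bom))
```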

In theory, you could use a "nonbreakable thin space" (e.g., between numbers) by using a thin space followed by a word joiner, U+2009 U+2060. In addition to being clumsy, this would be unreliable, since it uses two characters that are not widely supported. Far too often, U+2060 displays as a box or as a question mark. You would get better results with U+FEFF instead of U+2060, but even then the method would work with some fonts only.

8.8.1.6. A practical approach to thin spaces

In contexts like French punctuation or the use of a space as a thousands separator (as in 500 000), we would like to use a thin space character that is non-breaking. Since this is almost impossible at present at the character level, we have two options, illustrated here with implementations in HTML and CSS:

  • Use no-break space characters and adjust the amount of spacing (e.g., in a stylesheet); for example:

<span style="word-spacing: -0.08em">500&nbsp;000</span>

or:
<span style="margin-right: -0.08em">500</span>&nbsp;000

  • Use thin space characters and prevent line breaking using a stylesheet or markup; for example:

<span style="white-space: nowrap">500&thinsp;000</span>

or:
<nobr>500&thinsp;000</nobr>

The first method, where non-breakability is expressed at the character level and spacing adjustment is handled otherwise, is usually more practical. The no-break space character is far more widely supported than the thin space. As a variation of this method, you could use HTML markup rather than CSS for affecting the amount of spacing, for example, using 500<small>&nbsp;</small>000.
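If such markup is generated by a program, the first method might be sketched in Python as follows. The function name thousands_html and its shrink_em parameter are made up for this illustration; the default of 0.08 em simply repeats the value used in the example above, and the sketch assumes a non-negative integer.

```python
def thousands_html(n: int, shrink_em: float = 0.08) -> str:
    """Group digits with no-break spaces and narrow them via word-spacing."""
    # Python groups with commas; swap each comma for the &nbsp; entity.
    grouped = f"{n:,}".replace(",", "&nbsp;")
    return f'<span style="word-spacing: -{shrink_em}em">{grouped}</span>'


print(thousands_html(500000))
print(thousands_html(1000000))
```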

8.8.1.7. Disallowing and allowing line breaks

The Unicode standard recommends the use of WJ when you wish to prevent line breaks and ZWSP when you wish to allow line breaks, overriding normal line break rules. However, at present such line break control at the character level does not work very widely and should not be expected to be portable across text-processing applications. It is often better to use other methods, such as markup, stylesheets, or typesetting commands. For example, in HTML authoring, people even use nonstandard but widely supported markup such as <nobr>...</nobr> (prevents line breaks inside) and <wbr> (allows a line break; corresponds to ZWSP).

8.8.2. Quotation Marks

In Unicode, there are several pairs of asymmetric quotation marks, but of them, only the double angle quotation marks « and » belong to ISO Latin 1. Notice in particular that the normal quotation marks in U.S. English, namely left and right double quotation marks (U+201C, U+201D), do not belong to ISO Latin 1 (although they belong to Windows Latin 1). In Unicode, most quotation marks belong to the General Punctuation block.

The quotation marks vary greatly from one language to another and even within a language. When ISO Latin 1 has to be used, there are not many choices: you have to live with ", ', «, and ». It is better to use these typographically inferior characters for quotations than to try to ''construct´´ smart quotes from characters that are not quotes.

8.8.2.1. Language-specific quotation marks

In Chapter 2, we described how word processors can automatically generate language-dependent quotation marks. Beware, however, that the applicable rules are somewhat debatable, especially regarding nested punctuation. This means that the automatically generated marks do not always comply with official rules. Even versions of the Unicode standard have contained erroneous examples of the use of quotes. See "Using Common Locale Data Repository" in Chapter 11 for information about language-specific rules.

The most common quotation marks are listed in Table 8-9. The names are partly misleading, since a "left" quote does not always appear to the left of the quoted text.

Table 8-9. Quotation marks

| Code | Character | Name |
| --- | --- | --- |
| U+00AB | « | Left-pointing double angle quotation mark |
| U+00BB | » | Right-pointing double angle quotation mark |
| U+2018 | ‘ | Left single quotation mark |
| U+2019 | ’ | Right single quotation mark |
| U+201A | ‚ | Single low-9 quotation mark |
| U+201B | ‛ | Single high-reversed-9 quotation mark |
| U+201C | “ | Left double quotation mark |
| U+201D | ” | Right double quotation mark |
| U+201E | „ | Double low-9 quotation mark |
| U+201F | ‟ | Double high-reversed-9 quotation mark |
| U+2039 | ‹ | Single left-pointing angle quotation mark |
| U+203A | › | Single right-pointing angle quotation mark |

8.8.2.2. The apostrophe versus the single quotation mark

People often ask how to distinguish the apostrophe, as in "can't," from the right single quotation mark, as the closing quote in 'hello' (using British-style quotation marks). The short answer is that in Unicode, you don't. The answer often makes people uneasy, but we cannot really change this anymore.

Version 2.0 of the Unicode standard said that the preferred character for apostrophe is the modifier letter apostrophe U+02BC, but this was changed in Version 2.1. The modifier letter apostrophe is preferred where the character is to represent a modifier letter (for example, in transliterations to indicate a glottal stop). But as a punctuation apostrophe, as in "We've been here before," the right single quotation mark (U+2019) is preferred.

This means that in processing text data, you cannot tell a punctuation apostrophe (used as part of a word) from a right single quote without considering the context. This is practically not very serious, since there is in any case some variation in the ways that a punctuation apostrophe might be represented in data. The person who typed the data in the first place may have used the ASCII apostrophe, or the acute accent.
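What "considering the context" might mean in the simplest case is sketched below in Python. The function name and the two labels are invented for this example. The heuristic only recognizes a U+2019 that sits between two letters as an apostrophe; a word-final apostrophe cannot be told from a closing quote this way, so it is reported as ambiguous.

```python
RIGHT_SINGLE_QUOTE = "\u2019"


def classify_u2019(text: str) -> list[tuple[int, str]]:
    """Label each U+2019 by position: 'apostrophe' or 'ambiguous'."""
    labels = []
    for i, ch in enumerate(text):
        if ch != RIGHT_SINGLE_QUOTE:
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        # A letter on both sides means the mark is inside a word.
        inside_word = before.isalpha() and after.isalpha()
        labels.append((i, "apostrophe" if inside_word else "ambiguous"))
    return labels


print(classify_u2019("\u2018can\u2019t\u2019"))
```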

8.8.3. Hyphens and Dashes

It has become common to use the hyphen-minus character for a wide range of purposes, simply because it is the only hyphen-like character in ASCII. This is detrimental to typography, since different hyphen-like characters need different appearance. Sometimes two consecutive hyphens "--" are used to emulate an em dash, but this results in poor appearance, since the hyphens do not connect.

In Unicode, there is a rather large collection of hyphen-like or dash-like characters. Specifically, there is an official list (in Chapter 6 of the Unicode standard, Table 6-3), which is presented in Table 8-10 as amended with additional reference information. This table also contains the soft hyphen, which belonged to the corresponding table in Unicode 3 but is just mentioned after the table in the current version of the standard.

Table 8-10. Hyphens and dashes in Unicode

| Glyph | Code | Name | Notes on meaning and usage |
| --- | --- | --- | --- |
| - | U+002D | Hyphen-minus | The well-known ASCII hyphen, with multiple usage, or "ambiguous semantic value"; the width should be "average" |
| ~ | U+007E | Tilde | The ASCII tilde, with multiple usage; "swung dash" |
| - | U+00AD | Soft hyphen | "Discretionary hyphen" |
| ֊ | U+058A | Armenian hyphen | As soft hyphen, but different in shape |
| ᠆ | U+1806 | Mongolian todo hyphen | As soft hyphen, but displayed at the beginning of the second line |
| ‐ | U+2010 | Hyphen | Unambiguously a hyphen character, as in "left-to-right"; narrow width |
| ‑ | U+2011 | Non-breaking hyphen | As hyphen (U+2010), but not an allowed line break point |
| ‒ | U+2012 | Figure dash | As hyphen-minus, but has the same width as digits |
| – | U+2013 | En dash | Used, for example, to indicate a range of values |
| — | U+2014 | Em dash | Used, for example, to make a break in the flow of a sentence |
| ― | U+2015 | Horizontal bar | Used to introduce quoted text in some typographic styles; "quotation dash"; often (e.g., in the representative glyph in the Unicode standard) longer than em dash |
| ⁓ | U+2053 | Swung dash | Like a large tilde; often missing in fonts |
| ⁻ | U+207B | Superscript minus | A compatibility character, equivalent to minus sign U+2212 in superscript style |
| ₋ | U+208B | Subscript minus | A compatibility character, equivalent to minus sign U+2212 in subscript style |
| − | U+2212 | Minus sign | An arithmetic operator; the glyph may look the same as the glyph for a hyphen-minus, or may be longer |
| 〜 | U+301C | Wave dash | A Chinese/Japanese/Korean character |
| 〰 | U+3030 | Wavy dash | A Chinese/Japanese/Korean character |

The hyphen bullet U+2043 is not listed among the hyphen dash characters, despite its name. There is no cross-reference in the description of the hyphen bullet in the code chart. Apparently, the hyphen bullet is really meant to be a bullet character that looks like a hyphen (of a kind), rather than comparable to hyphens and dashes. Note that in ASCII text, the hyphen-minus is often used in the role of a bullet in a bulleted list. Some typographic conventions favor the use of a hyphen-like bullet even when a rich character repertoire is available, though the bullet • and dashes like the en dash "–" are more common in such usage. Typically, list bullets are generated by word processors or other programs, rather than written explicitly into documents.

8.8.3.1. Use of hyphens and dashes

When a sufficient character repertoire is available, the following usage rules are suitable, since they comply with old typographic and orthographic principles and the defined Unicode meanings of characters:

  • The hyphen-minus character should be used only in computer languages and other contexts where this ASCII character belongs to the language syntax. Thus, for example, the C language statement a = b - c; must be written using the hyphen-minus character, despite the fact that it there denotes mathematical subtraction; the reason is that the C language has been defined to use hyphen-minus as such an operator. Similar considerations apply to most programming, scripting, command, and markup languages, since they generally use only ASCII characters, at least in the core language.

  • The hyphen character should be used as a normal hyphen in natural languages.

  • The non-breaking hyphen should be used instead of a normal hyphen when a line break is undesirable, as in the string "Latin-1."

  • The minus sign should be used as mathematical minus sign, both as a binary operator and as a unary operator (or simply as the sign of a number).

  • The en dash is used to indicate a range of values, such as 2000–2500. However, there are often other possible notations, like "2000 to 2500" or "2000...2500."

  • The em dash can be used to make a break—like this—in the flow of a sentence, or to make a parenthetic remark.

The en dash and em dash especially have language-dependent uses. The uses mentioned in this list (as taken from the Unicode standard) should primarily be taken as typical uses in American English. For example, in Europe, it is much more common to use an en dash with spaces around it – like this – for parenthetic remarks. Historically, the spaces compensate for the shortness of the en dash.
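For text that was typed with the hyphen-minus only, the range rule can be applied mechanically, at least as a first pass. The Python sketch below replaces a hyphen-minus that stands directly between two digits with an en dash. The function name is invented, and the heuristic is an assumption of this sketch: it would also change things like ISO dates and phone numbers, so the input is assumed not to contain them.

```python
import re

EN_DASH = "\u2013"


def en_dash_ranges(text: str) -> str:
    """Replace a hyphen-minus between two digits with an en dash."""
    return re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)


print(ascii(en_dash_ranges("See pages 2000-2500.")))
print(ascii(en_dash_ranges("left-to-right")))  # no digits, so unchanged
```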

8.8.3.2. The soft hyphen

The soft hyphen is defined as "discretionary hyphen" in Unicode. This means that it is normally not displayed at all but indicates a permissible hyphenation point. For texts in a Latin script, hyphenation means that a word may be broken so that the first part appears at the end of a line, with a hyphen after it.

Hyphenation hints are useful for words that would not be properly hyphenated by a program's normal algorithms (e.g., for foreign words or for words like "record" that have different hyphenations depending on meaning: verb "re-cord," noun "rec-ord"). In many programs, the occurrence of a soft hyphen prevents automatic hyphenation in the word; i.e., the word can only be hyphenated at a soft hyphen. Thus, for long words, it might be advisable to indicate all hyphenation points.
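In program code, hyphenation hints of this kind amount to joining word parts with U+00AD. The Python sketch below shows this, together with the reverse operation, which is often needed before searching or comparing text. Both function names are made up for the example, and the split points are supplied by the caller rather than computed.

```python
SOFT_HYPHEN = "\u00ad"


def with_soft_hyphens(parts: list[str]) -> str:
    """Join the given word parts with soft hyphens as hyphenation hints."""
    return SOFT_HYPHEN.join(parts)


def strip_soft_hyphens(text: str) -> str:
    """Remove all soft hyphens, e.g., before comparing or indexing text."""
    return text.replace(SOFT_HYPHEN, "")


noun = with_soft_hyphens(["rec", "ord"])
verb = with_soft_hyphens(["re", "cord"])
print(ascii(noun), ascii(verb))
print(strip_soft_hyphens(noun) == strip_soft_hyphens(verb))
```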

The reason why Unicode 4 does not list the soft hyphen as a hyphen is that the standard tries to clarify its meaning: "it marks a position for hyphenation, rather than being itself a hyphen character."

Though supported by some software, the soft hyphen does not work reliably across programs. In addition to the MS Word specialty discussed below, the soft hyphen is treated as a normal hyphen by various programs, including some web browsers.

8.8.3.3. MS Word specialties

Microsoft Word has an Insert Symbol function, which was described in Chapter 2. It contains a quick menu for some commonly used characters: "Special Characters." Some entries there are rather misleading:

  • "Nonbreaking Hyphen" (often with shortcut Ctrl-Shift--) does not insert the Unicode character non-breaking hyphen U+2011 but instead the control character U+001E. Word displays it as a hyphen and does not break a line after it. If the document is saved as plain text, Word turns the control character to a hyphen-minus. If you cut and paste text, the character turns into a question mark, ?.

  • "Optional Hyphen" (often with shortcut Ctrl--) does not insert the Unicode character soft hyphen U+00AD. Instead, it inserts the control character U+001F, which is interpreted by Word as indicating a possible hyphenation point. This information is usually lost when saving in other formats or when cutting and pasting.

However, when saving data in HTML format, Word 2002 generates &#8209; (character reference that means U+2011) from its internal "Nonbreaking Hyphen" and the U+00AD soft hyphen character from its internal "Optional Hyphen."
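If you end up with text in which Word's internal control characters are still present, you could map them to the Unicode characters that Word itself generates in HTML output. The Python sketch below assumes such input; the function name fix_word_hyphens is hypothetical, and the mapping simply follows the description above.

```python
NONBREAKING_HYPHEN = "\u2011"
SOFT_HYPHEN = "\u00ad"

# Word's internal control characters mapped to their Unicode counterparts.
WORD_TO_UNICODE = {
    0x001E: NONBREAKING_HYPHEN,  # Word's "Nonbreaking Hyphen"
    0x001F: SOFT_HYPHEN,         # Word's "Optional Hyphen"
}


def fix_word_hyphens(text: str) -> str:
    """Replace U+001E and U+001F with U+2011 and U+00AD, respectively."""
    return text.translate(WORD_TO_UNICODE)


print(ascii(fix_word_hyphens("Latin\x1e1 rec\x1ford")))
```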

It is possible to insert U+2011 or U+00AD, e.g., using the "Symbols" pane or, in sufficiently new systems, by typing 2011 Alt-x or ad Alt-x, respectively. The non-breaking hyphen U+2011 then works properly, assuming the font in use contains a glyph for it. The soft hyphen U+00AD however is displayed as a visible hyphen. Thus, MS Word does not support the soft hyphen as defined in Unicode. Internet Explorer, on the other hand, supports the soft hyphen, but some other web browsers do not.

8.8.4. Ellipsis

In English, three spaced dots are often used to indicate omission. The notation can be identified with the horizontal ellipsis "…" (U+2026), which belongs to windows-1252, too. This character is compatibility equivalent to a sequence of three period (full stop) characters ("...") with a presentation that has more spacing between the periods. MS Word automatically converts three periods to horizontal ellipsis (by default).
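The compatibility equivalence can be verified with Python's standard unicodedata module. The sketch shows that only the compatibility normalization forms replace the ellipsis with three periods; canonical normalization leaves it alone.

```python
import unicodedata

ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS

# The decomposition mapping is tagged as a compatibility mapping.
print(unicodedata.decomposition(ELLIPSIS))

# Compatibility normalization yields three periods.
assert unicodedata.normalize("NFKC", ELLIPSIS) == "..."
# Canonical normalization does not change the character.
assert unicodedata.normalize("NFC", ELLIPSIS) == ELLIPSIS
```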

In some other languages, recommendations or practices may favor the use of unspaced periods. There is no Unicode character for such a combination, so it is naturally written as three periods. MS Word obeys such conventions: if it has recognized the language, for example, as French or Spanish (by inference or from an explicit setting of language), it leaves "..." intact.

In mathematics, other ellipsis characters are used, too. The most common of them is midline horizontal ellipsis "⋯" U+22EF. It is used, for example, in sums like a₁ + a₂ + ⋯ + aₙ.

8.8.5. Angular brackets

There is great confusion about various characters called angle brackets. Here we will refer to them collectively with the name "angular brackets," since the words "angle bracket" appear in the names of specific Unicode characters. Quite often, when someone says "angle bracket," he does not mean any of those characters but the less-than sign < and the greater-than sign >.

In mathematics and some other special notations, angular brackets are used for special purposes. Sometimes they are used as an additional type of brackets when you have run out of other types, i.e., normal parentheses ( ), square brackets [ ], and curly braces { }. More often, angular brackets are used to denote other things, such as the following:

  • Pairs, triplets, or n-tuples, instead of the more common use of normal parentheses. For example, 〈x,y,z〉 might mean an ordered triplet of coordinates, more commonly denoted as (x,y,z). This is potentially misleading, due to the other uses.

  • An inner product of two functions or vectors, often denoted as 〈f | g〉.

  • Specifically the L²-inner product, also called bracket product.

  • An expectation value: 〈X〉 is the expectation value of a variable X.

In any case, the identity of angular brackets in terms of Unicode characters usually remains unspecified. In many references, the less-than sign and the greater-than sign are described as being angle brackets or as identical in shape to them. Yet, there is considerable difference between those signs and the usual shapes of angular brackets in good mathematical typography. Usually angular brackets have a rather obtuse angle.

Further confusion is caused by the fact that the less-than sign and the greater-than sign, being ASCII characters, have been taken into many computer languages for use as delimiters. We can say that they are used as (i.e., in the role of) angular brackets, but it would be incorrect to say that they are angular brackets. This includes the well-known use in HTML and XML tags like <body>. Of course, in such notations you must use the less-than sign and the greater-than sign, since they are part of the defined syntax. Partly imitating such usage, they are also used as delimiters in Unicode notations like <small> in compatibility mappings, in writing URLs in text (e.g., as <http://www.w3.org>), in handwritten typesetting instructions like <sc> for small caps, and in pseudo-markup like <joke> on Internet discussion forums.

There is some established use of the less-than sign and the greater-than sign as delimiters. There are also rare cases where you need typographically correct angular brackets, e.g., in mathematics. Apart from such usage, angular brackets are best avoided.


The main reason for avoiding angular brackets is that the widely available less-than sign and the greater-than sign are typographically unsuitable for such use, and they are also heavily loaded with other meanings and uses. Other characters that might be considered for use as angular brackets are less widely available; some of them exist in a few fonts only. Moreover, they are easily confused with each other both by writers and by readers.

Table 8-11 lists several Unicode characters that might be understood as angular brackets in some sense. For simplicity, only "left-pointing" (or "opening") characters are considered. The corresponding "right-pointing" character usually appears in the next code position or otherwise close. The glyphs (in the second column) for the characters are shown in the Arial Unicode MS font; as you can see, some of the characters are missing even in this relatively large font.

Table 8-11. Unicode angular brackets

| Code | Glyph | Name | Block |
| --- | --- | --- | --- |
| U+003C | < | Less-than sign | Basic Latin |
| U+2039 | ‹ | Single left-pointing angle quotation mark | General Punctuation |
| U+2329 | 〈 | Left-pointing angle bracket | Miscellaneous Technical |
| U+276C | (no glyph) | Medium left-pointing angle bracket ornament | Dingbats |
| U+27E8 | (no glyph) | Mathematical left angle bracket | Misc. Math. Symbols-A |
| U+29FC | (no glyph) | Left-pointing curved angle bracket | Misc. Math. Symbols-B |
| U+3008 | 〈 | Left angle bracket | CJK Symbols and Punct. |

Although angle quotation marks (guillemets, chevrons) have occasionally been used as angular brackets, as in ‹foo›, such usage is very problematic. Their size and shape differ from those of typographic angular brackets, and they might be incorrectly taken as quotation marks, not only by human readers but also by software, since they are quotation marks by Unicode definitions. Thus, they may confuse, for example, the automatic processing of quotations.

In the Dingbats block, there are also some other ornamental brackets in addition to U+276C. Generally, Dingbats characters are unsuitable for normal text and should be considered as decorations only, unless used by some special convention.

The characters in the blocks Miscellaneous Mathematical Symbols-A and Symbols-B are relatively new additions to Unicode (added in Version 3.2), and therefore poorly supported. Although U+27E8 (also known as "bra," matching "ket," which is a synonym for mathematical right angle bracket U+27E9) would theoretically be most adequate for use as an angular bracket, U+2329 is usually a much more practical choice.

Yet, the Unicode standard says about U+2329 and the right-pointing angle bracket U+232A that they are "discouraged for mathematical use because of their canonical equivalence to CJK punctuation." They have indeed been defined as canonically equivalent to U+3008 and U+3009, though displayed as visually different. The Unicode names of the latter characters, "left angle bracket" and "right angle bracket," are misleading, since they give no hint of their nature. They are meant for use in East Asian writing along with Chinese-Japanese-Korean ideographs. Consequently, they have some surprising properties.

A glyph for the left angle bracket (U+3008) has to suit its use with ideographs designed to fit into a square, such as 懌. Therefore, the left-pointing angle bracket 〈 (U+2329) is much more suitable, for example, for mathematical texts in English. However, the canonical equivalence means that software conforming to the Unicode standard may effectively treat them as identical, and mapping to any Unicode normalization form will replace U+2329 with U+3008.
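The effect of normalization described here can be observed with Python's standard unicodedata module. The sketch checks all four normalization forms and, for comparison, the mathematical left angle bracket, which has no decomposition and therefore stays as it is.

```python
import unicodedata

LEFT_POINTING = "\u2329"  # LEFT-POINTING ANGLE BRACKET
CJK_LEFT = "\u3008"       # LEFT ANGLE BRACKET
MATH_LEFT = "\u27e8"      # MATHEMATICAL LEFT ANGLE BRACKET

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # Every normalization form replaces U+2329 with U+3008 ...
    assert unicodedata.normalize(form, LEFT_POINTING) == CJK_LEFT
    # ... but leaves U+27E8 unchanged.
    assert unicodedata.normalize(form, MATH_LEFT) == MATH_LEFT

print(unicodedata.decomposition(LEFT_POINTING))
```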

Thus, if you really need angular brackets (in mathematics, for example):

  1. Use the mathematical brackets U+27E8 and U+27E9, if you can be reasonably sure that these rarely available characters will be displayed and printed correctly.

  2. Otherwise, use the left-pointing and right-pointing angle brackets U+2329 and U+232A (which are available in a few fonts), if you can guarantee that no problems will arise from normalization or other operations based on canonical equivalence.

  3. Both of the above failing, use the less-than sign and the greater-than sign, and give appropriate explanations so that readers will understand them as delimiters.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139