9.2. Characters and Markup

XML (and SGML) markup has often been characterized as "semantic" or "logical," instead of presentational or physical. However, markup can be used for many purposes, including formatting and typography. If you need to store text data in a way that contains formatting information, nothing prevents you from using XML markup for that. To mention examples from HTML, <i> markup is used to indicate italics, and markup like <font face="Arial">, though deprecated, is still used to specify the font family.

On the other hand, many Unicode characters are typographic variants of other characters, coded as separate characters for different reasons. Many of them are compatibility characters and have been included only due to their existence in other character codes. However, there are other cases as well. The difference between normal (upright) and italics style may indicate a semantic distinction in mathematics or other special notations.

Most things that are expressed in markup have no character-level counterpart. For example, designating some text as a heading, in the logical structure, cannot be done at the character level. What comes closest to that is writing the heading text in all uppercase with line breaks before and after, as we often do in plain text. If you consider markup for indicating the structure of a price list that is never meant to be displayed as such, only through special formatting processes and rules, it should be obvious that you cannot do anything at the character level to indicate some text as "product name" and some other text as "unit price" for the product.

Thus, the question of whether information should be expressed at the character level or in markup primarily deals with presentational distinctions, or distinctions that might at least arguably be regarded as presentational. It is an important special case of the problem of selecting an appropriate level of expression in such cases. There are specific guidelines on it suggested in a joint report by the Unicode Consortium and the World Wide Web Consortium (W3C). Note that text processors increasingly use XML markup in their data formats, so the principles apply to using their tools, too, at least indirectly.

9.2.1. Markup and Styling

In this context, "markup" really means "markup and styling," in most cases. Modern markup tends to be logical rather than presentational, and therefore markup alone does not usually imply any particular rendering style. In particular, if you use generic XML, inventing tags as you need and with the intention of specifying rendering in a stylesheet, then no markup has any rendering style as such, only through the stylesheet.

Although the i markup in HTML, for example, specifically means "italics," we cannot say the same about the (rarely used) var markup, which means "variable, placeholder." Yet, in contexts where it is suitable to replace a compatibility character with a normal character and i markup, it can be just as suitable, if not more so, to replace it with the normal character and var markup, provided that two conditions are met. First, the intended meaning should correspond to the defined meaning of the markup. Second, we should have reasonable expectations of having it rendered in italics. The expectations might be based on information about typical browser defaults, or on the use of a stylesheet that explicitly suggests such rendering (var { font-style: italic; }).

9.2.2. Document-wide Versus Local Decisions

The question "character level or markup?" has two levels:

  • Should you use plain text or marked-up text as the format of some information? Choosing plain text excludes all markup. Choosing marked-up text does not exclude the possibility of expressing information at the character level rather than in markup.

  • Inside marked-up text, should you express some information (say, the use of italics for a character) in markup, or at the character level, or perhaps both ways?

The alternative "both ways" can usually be excluded. Although it may sound ideal to get the best of both worlds, you easily end up getting the worst of both. Besides, you might get nasty cumulative effects. Consider the simple example of writing the expression x⁴ (x to the power of 4). You could use the superscript four (U+2074), or you could use a higher-level protocol, such as markup (e.g., the sup element in HTML) or superscript style in a word processor. Trying to use both (e.g., using x<sup>&#x2074;</sup> in HTML) will probably combine the drawbacks of both alternatives: in rendering, it fails whenever U+2074 does not belong to the fonts in use, and it tends to mess up line spacing the way sup markup often does. Besides, it's illogical. It means double superscripting, and this will probably make the superscript appear very small, if at all, since a browser uses the superscript 4 and also reduces its size.

Moreover, as noted earlier in this chapter, if you express some superscripts as superscript characters and other superscripts using markup, they easily look disturbingly different. Any automated processing of the data, such as conversion to another format, would need to deal with two representations of superscripts instead of one.

You can mix markup with formatting information expressed at the character level, but you should normally not use both ways for the same information in a document.


There are a few exceptions, though. Sometimes there are two ways to present the same formatting information so that no harm arises from the duplication. For example, in HTML authoring, you could write a table cell as <td nowrap>42 m</td> so that the space between "42" and "m" is a no-break space U+00A0 (which you could write as &nbsp; in HTML). That would mean expressing both at the character level and in markup, with the nowrap attribute, that the cell content must remain on one line. Both ways are "safe" in practice. Although there is no particular benefit from using both, no harm is caused either, so such duplication need not be avoided, for example, when generating table markup automatically.

9.2.3. Unicode Versus Markup

The document "Unicode in XML and other Markup Languages" has been published as Unicode Technical Report UTR #20 at http://www.unicode.org/reports/tr20/ as well as a W3C Note at http://www.w3.org/TR/unicode-xml/. It is not part of the Unicode standard. It has been approved just as a Technical Report, though other documents may make normative references to it. In the W3C terminology, a Note is a document that has been endorsed by a working group but not reviewed or endorsed by the W3C as a whole (or by "W3C Members").

Nevertheless, UTR #20 is the best available general guideline on whether information should be expressed at the character level or in markup. We need to use it with discretion, partly because the report considers markup in general and not the specific features of various markup languages and systems. For example, in cases where the report recommends markup, the markup language we use might lack elements that could be used for the purpose, or their implementation in software might be wanting.

In practice, the report revolves around XML, which covers both generic XML (where you can invent tags as you go) and specific XML-based markup languages such as XHTML (the XML-ized version of HTML), MathML, MusicML, or SVG (Scalable Vector Graphics, a language for two-dimensional graphics in XML, with a possibility of including text as character data). However, the markup concept is more general and covers SGML too, including classic HTML, which is nominally SGML-based. Even notations such as RTF and other rich text systems can be regarded as markup, even though their general syntax is different.

9.2.3.1. Differences between markup and plain text

Plain text is linear: a character follows another in a sequence. Although the visual rendering can be more complicated (e.g., due to combining diacritic marks and to alterations in writing direction), plain text is still processed linearly. Markup, on the other hand, expresses tree-like structures, even if it is written linearly. As any good book on markup will tell you, a notation like <x><y>foo</y><z>bar</z></x> describes a tree structure, with elements y and z as "children" (subtrees) of x. The marked-up text needs to be processed (parsed) in order to construct the tree structure, which in turn can be linearized into text. A markup element can be very large. In XML, the entire document is treated as one element, with subelements, which contain subelements, etc.
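The parse-then-linearize cycle described above can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.etree.ElementTree module; the helper name outline is invented for this example.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one line per element, indented by its nesting depth."""
    label = element.tag
    if element.text and element.text.strip():
        label += ": " + element.text.strip()
    lines = ["  " * depth + label]
    for child in element:          # children are the subtrees of this element
        lines.extend(outline(child, depth + 1))
    return lines

# Parsing turns the linear notation into a tree...
root = ET.fromstring("<x><y>foo</y><z>bar</z></x>")
print("\n".join(outline(root)))

# ...and the tree can be linearized back into text.
print(ET.tostring(root, encoding="unicode"))
```

The outline shows y and z as children of x, which is information that no character-by-character scan of the text exposes directly.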

Information expressed at character level works on different grounds. Either the difference between characters as such carries some information (e.g., using ² instead of 2 expresses that we have a superscript), or a character affects the interpretation or processing of the preceding or sometimes the following character. Some characters may set some internal state in interpretation or processing, such as writing direction. They might be compared to start tags in markup. Even then, such characters usually affect a state in a simple way, setting it to a specific value. There is normally no nesting involved as in markup.

Thus, character level is normally useful for rather local information only. On the other hand, it is generally simple and compact to use when it applies. Compare the simplicity of using a character (code point) for ² as opposed to markup like <sup>2</sup>, where you need start and end tags even though you are saying something about a single character only.

Obviously, information at the character level is suitable for linear processing where you read a stream of characters and process them in succession. Similarly, marked-up text, once parsed, is suitable for structured processing where you start from a tree and process it by the structure.

9.2.3.2. Characters that should not be used in marked-up text

UTR #20 declares some characters, listed in Table 9-2, as "unsuitable for use with markup." Some of them might have use in plain text or other formats, but not in XML, for example. Most of these characters would rarely come to mind anyway when using markup. Note, however, that U+FEFF has been used to some extent in marked-up text as an invisible joiner (to prevent undesired line breaks), and in practice, it still does the job more reliably than the suggested replacement, U+2060.

Table 9-2. Characters not suitable for use with markup

Character(s) | Description | Reason for avoiding
------------ | ----------- | -------------------
U+2028..U+2029 | Line and paragraph separator | Use markup (like p and br in HTML)
U+202A..U+202E | Bidi embedding controls | Use only markup to avoid conflicts; however, see notes after the table
U+206A..U+206B | Activate or Inhibit Symmetric swapping | Deprecated in Unicode
U+206C..U+206D | Activate or Inhibit Arabic form shaping | Deprecated in Unicode
U+206E..U+206F | Activate or Inhibit National digit shapes | Deprecated in Unicode
U+FFF9..U+FFFB | Interlinear annotation characters | Use Ruby markup (see Chapter 8)
U+FEFF | Zero width no-break space (ZWNBSP) | Use only as byte order mark (see, however, the note at the beginning of this section)
U+FFFC | Object replacement character | Use markup for embedding, e.g., img or object in HTML
U+1D173..U+1D17A | Scoping for Musical Notation | Use an appropriate markup language, as it becomes available
U+E0000..U+E007F | Language tag characters | Use language markup, e.g., lang or xml:lang attribute (see Chapter 7)


As described in Chapter 5, the Line Separator (LS) U+2028 and the Paragraph Separator (PS) U+2029 were introduced to provide unambiguous means to denote line breaks and paragraph delimiters in plain text. This was meant to avoid the ambiguity caused by different uses of ASCII control characters like Line Feed. In practice, LS and PS have not been used much. If they appear in plain text being converted to marked-up text, they should be replaced by appropriate markup. In HTML, you use <br> (or <br /> in XHTML) for a forced line break, and you surround each paragraph with the tags <p> and </p>. UTR #20 recommends that an occurrence of LS or PS in marked-up text be treated as whitespace, i.e., as equivalent to a space.
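As a rough illustration of that conversion, the following Python sketch turns plain text that uses LS and PS into HTML paragraphs and forced line breaks. The function name separators_to_html is hypothetical, and the sketch assumes that every PS-delimited piece is an ordinary paragraph; real input might call for other block elements.

```python
import html

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

def separators_to_html(text):
    """Replace PS by <p>...</p> wrapping and LS by <br> (HTML syntax)."""
    paragraphs = []
    for paragraph in text.split(PS):
        if not paragraph:
            continue  # skip the empty piece after a trailing PS
        # Escape markup-significant characters before adding tags.
        lines = [html.escape(line) for line in paragraph.split(LS)]
        paragraphs.append("<p>" + "<br>".join(lines) + "</p>")
    return "\n".join(paragraphs)

print(separators_to_html("a < b\u2028second line\u2029next paragraph"))
```

Escaping happens before the tags are added, so a "<" that was plain data in the source does not turn into markup by accident.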

According to UTR #20, the Bidi embedding controls U+202A..U+202E (see Chapter 5) are "strongly discouraged" in the HTML 4 specification, which however actually just warns about possible conflicts between those controls and equivalent markup. It recommends that preferably one or the other should be used exclusively, and adds:

The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the UNICODE characters. If both methods are used, great care should be exercised to insure proper nesting of markup and directional embedding or override, otherwise, rendering results are undefined.

UTR #20 suggests that markup be used instead of the controls on the following grounds:

The embedding controls introduce a state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF [= Pop Directional Formatting].

Although this recommendation is usually adequate, there are situations where markup cannot be used for Bidi embedding. Attributes of elements cannot contain markup, only text, and some elements may contain only text. Thus, if Bidi control is needed in such a context (e.g., a <title> element or an alt attribute of an <img> element), the control characters are the only possibility.
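When text is converted to a marked-up format, it can be useful to detect the characters of Table 9-2 mechanically before deciding what to do with each of them. The following Python sketch is one possible way to do that; the names UNSUITABLE_RANGES and find_unsuitable are invented for this example, and the musical scoping range is written as U+1D173..U+1D17A, the range of the musical format characters in Unicode.

```python
# Code point ranges from Table 9-2, with a short reason for each.
UNSUITABLE_RANGES = [
    (0x2028, 0x2029, "line/paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated format character"),
    (0xFFF9, 0xFFFB, "interlinear annotation"),
    (0xFEFF, 0xFEFF, "zero width no-break space"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical notation scoping"),
    (0xE0000, 0xE007F, "language tag character"),
]

def find_unsuitable(text):
    """Return (index, code point, reason) for each Table 9-2 character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for low, high, reason in UNSUITABLE_RANGES:
            if low <= cp <= high:
                hits.append((index, f"U+{cp:04X}", reason))
                break
    return hits

print(find_unsuitable("a\u2028b\u202Ec"))
```

The function only reports; it does not remove anything. As the discussion of Bidi controls in attribute values shows, a reported character may still be the only available way to express something, so the decision is best left to a later step.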

9.2.3.3. Formatting characters that may be used in marked-up text

According to UTR #20, the characters listed in Table 9-3 may be used in XML documents or other marked-up text, even though they are invisible formatting characters or characters with formatting information. This does not mean that they should be used, or that it would always be appropriate and best to use them. Rather, they are regarded in principle as compatible with the ideas and practices of markup. This means that the potential risks of mixing character-level information and markup are not relevant, or they can be controllable enough. On the practical side, many of the characters listed are poorly supported or could be replaced by markup. The no-break space U+00A0 is often used and useful, whereas most of the other characters have little use in texts written in Latin letters, except for the soft hyphen U+00AD in some word processors.

Table 9-3. Formatting characters acceptable for use with markup

Character(s) | Name(s) | Notes
------------ | ------- | -----
U+00A0 | No-break space | Latin-1 character
U+00AD | Soft hyphen | Hyphenation hint
U+034F | Combining grapheme joiner | See explanation below
U+0600 | Arabic number sign | Subtending mark
U+0601 | Arabic sign sanah | Subtending mark
U+0602 | Arabic footnote marker | Subtending mark
U+0603 | Arabic sign safha | Subtending mark
U+06DD | Arabic end of ayah | Enclosing mark
U+070F | Syriac Abbreviation Mark (SAM) | Supertending mark
U+0F0C | Tibetan mark delimiter tsheg bstar | <noBreak> U+0F0B
U+180B..U+180E | Mongolian variation selectors and vowel separator | Required for Mongolian
U+200C..U+200D | Zero-width non-joiner and joiner (ZWNJ and ZWJ) | For ligature behavior; see Chapter 5
U+200E..U+200F | Directional marks (LRM and RLM) | See Chapter 5
U+2011 | Non-breaking hyphen | <noBreak> U+2010
U+202F | Narrow No-Break Space | Narrow form of U+00A0
U+2044 | Fraction slash | Or use markup (MathML)
U+2060 | Word Joiner | Prevents line break
U+2061 | Function application | Mathematical use
U+2062 | Invisible times | Mathematical use
U+2063 | Invisible comma | Mathematical use
U+2FF0..U+2FFB | Ideographic character description | Graphic characters
U+303E | Ideographic variation indicator | Graphic character
U+FE00..U+FE0F | Variation selectors | Glyph selection indicators
U+E0100..U+E01EF | Variation selectors | Glyph selection indicators


The combining grapheme joiner (U+034F) is a combining mark rather than a formatting character. It does not affect cursive joining or ligation (as ZWJ and ZWNJ do). Neither does it combine or join graphemes, so its Unicode name is misleading. It has two uses, related to collation (sorting) of strings and to canonical reordering of combining marks. See the Unicode FAQ, http://www.unicode.org/faq/char_combmark.html.

Subtending marks are used in the Arabic and Syriac scripts to indicate that a mark be placed below a string of characters (e.g., below a sequence of digits, to indicate a year). The Syriac abbreviation mark is used similarly but placed above a string, as a supertending mark, and the Arabic end of ayah is a similar but enclosing mark. In character data, a subtending mark precedes the affected characters; the end of the affected range is defined implicitly, usually by the first non-alphanumeric character. There is currently no markup that can replace these subtending, supertending, or enclosing marks.

Variation selectors were discussed in the section "Unicode and Fonts" in Chapter 4. They are used to select a glyph variant of the preceding character. Although they could in principle be replaced by markup and styling (glyph selection), this cannot be done in practice now. UTR #20 comments on them as "Not graphic characters," which is technically correct: they are not visible characters but meant to affect the rendering of another character.

9.2.3.4. Characters with compatibility mappings

The characters listed in Table 9-2 and Table 9-3 usually do not cause much of a problem when deciding what characters to use in marked-up text. Most of them would not be used anyway, and the rules for them are rather straightforward, though the practical considerations (would this formatting character work?) might require some study.

The third and last group of characters discussed in UTR #20, those with compatibility mappings, is more problematic, and more important, e.g., in texts in English. As we noted in Chapter 5, compatibility mappings exist for different reasons and have varying meanings. The difference between a character and its compatibility mapping can vary from practically ignorable to substantial difference in meaning or appearance or both. The expression "characters with compatibility mappings" is admittedly clumsy, but the equivalent term "compatibility decomposable character" is also clumsy, and the simpler term "compatibility character" does not mean quite the same thing. (There are compatibility characters that have no compatibility mapping.)

The recommendations on using characters with compatibility mappings in marked-up text may appear to conflict with general Unicode principles on avoiding such characters in new data (see Chapter 5). The main reason is that these recommendations mainly deal with marking up existing character data rather than creation of completely new data. For example, the use of characters for ligatures (such as "ﬂ" as one character) in new data should normally be avoided. However, if such data exists in plain text, it should not be indiscriminately replaced by its decomposition (such as the letters "f" and "l"), especially if we have no idea of how the ligature behavior could be expressed in markup or otherwise.

The recommendations of UTR #20 are summarized in Table 9-4 and commented on (and criticized) after the table. The report presents them primarily as applicable when XML markup is first added to text that has no markup. It does not necessarily mean that existing marked-up text should be modified. The first column in the table specifies a "compatibility tag" as defined in the Unicode database. As explained in Chapter 5, such tags are metasymbols used to indicate the nature of the compatibility mapping, and they should not be confused with markup tags. For two tags, the recommended treatment is different for different characters, and this is indicated by specifying the applicability by code range in column 2. (For compactness, the "U+" prefix is omitted here.)

Table 9-4. What to do with characters with compatibility mappings

Tag | Code range | What to do | Description of characters and/or notes
--- | ---------- | ---------- | --------------------------------------
<circled> | | Retain, or use list markup | Circled letters and digits
<compat> | 2002..200A | Retain | Fixed-width spaces; see comments
 | 2100..2101 | Retain | ℀ and ℁; used as symbols
 | 2105..2106 | Retain | ℅ and ℆; used as symbols
 | 2121, 213B | Retain | ℡ and facsimile sign
 | 2160..217F | Retain, or use list markup | Roman numerals, usually used as list item markers
 | 2474..249B | Retain, or use list markup | Parenthesized or dotted number, usually used as list item marker
 | 249C..24B5 | Retain, or use list markup | Parenthesized letters, usually used as list item markers
 | 3131..318E | Retain | Compatibility Hangul Jamo
 | 3200..3229 | Retain, or use list markup | Parenthesized Korean characters and ideographic numbers
 | 322A..3243 | Retain, or use list markup | Parenthesized ideographs
 | 32C0..32CB | Retain | Ideographic telegraph symbols for months
 | all other | Retain | Maintain, semantic distinctions apply
<final> | | Normalize | Arabic presentation forms
<font> | | Retain | Variant letter forms used as symbols
<fraction> | | Normalize | "As long as fraction slash is supported!"
<initial> | | Normalize | Arabic presentation forms
<isolated> | | Normalize | Arabic presentation forms
<medial> | | Normalize | Arabic presentation forms
<narrow> | | Retain | Half-width characters
<noBreak> | | Retain | Non-breaking variants; see notes below
<small> | | Retain | Small forms of characters; see notes
<square> | 3300..3357 | Retain | Single display cell cluster containing multiple lines of kana for vertical layout
 | 3358..337D | Retain | Ideographic symbols
 | 33E0..33FE | Retain | Ideographic telegraph symbols for days
 | all other | Retain | Symbols used in vertical layout
<sub> | | Use markup, or retain | Subscript characters
<super> | | Use markup, or retain | Superscript characters
<vertical> | | Normalize | East Asian Presentation forms
<wide> | | Retain | Fullwidth characters


In this context, "normalize" means conversion to Normalization Form KC. As described in Chapter 5, this means compatibility ("K") decomposition followed by canonical composition ("C"). Only a few types of characters are normalized, according to UTR #20. Most compatibility characters are retained.
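To make the idea of selective normalization concrete, here is a minimal Python sketch based on the standard unicodedata module. The names NORMALIZE_TAGS, compat_tag, and normalize_selectively are invented for this illustration, and the set of tags is simply the set of rows marked "Normalize" in Table 9-4. The sketch normalizes one character at a time, which is a simplification: it does not recompose the result with neighboring characters the way normalizing a whole string would.

```python
import unicodedata

# Tags whose rows in Table 9-4 say "Normalize".
NORMALIZE_TAGS = {"<final>", "<initial>", "<isolated>", "<medial>",
                  "<fraction>", "<vertical>"}

def compat_tag(ch):
    """Return the compatibility tag of ch (e.g. '<super>'), or None."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<"):
        return decomposition.split()[0]
    return None  # no mapping, or a canonical (untagged) decomposition

def normalize_selectively(text):
    """Apply NFKC only to characters whose tag is in NORMALIZE_TAGS."""
    out = []
    for ch in text:
        if compat_tag(ch) in NORMALIZE_TAGS:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)  # retained, as for most compatibility characters
    return "".join(out)

print(compat_tag("\u210E"))   # the Planck constant character
print(compat_tag("\uFE8E"))   # Arabic letter alef, final form
```

With this policy, an Arabic presentation form such as U+FE8E is replaced by the generic letter U+0627, whereas the Planck constant U+210E passes through unchanged.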

The treatment of characters with compatibility mappings needs to be more complicated than expressed in the summary table. In fact, there are internal inconsistencies in UTR #20 between the summary table and the prose explanations. Here it is interpreted according to the prose, which is more detailed, and some apparent errors have been corrected.

Most important, we need to consider the intended meaning of using a character with a compatibility mapping. If the purpose is just visual formatting, it should be replaced by the use of normal characters and markup (and a stylesheet). If there is a semantic difference involved, the character should be retained. The report illustrates this with a simple example of italicized characters:

  • It would be inappropriate to use compatibility characters like ℎ (U+210E), ℯ (U+212F), etc., to write the word hello in italics. This should be rather obvious on several accounts: the names of the characters, the variation in their glyphs (which are not based on any uniform "italics design"), and the rather practical fact that these characters are poorly supported.

  • On the other hand, the character ℎ (U+210E) is adequate for denoting the Planck constant, used in physics. In fact, "Planck constant" is its name and suggests its meaning. The report says that we should not use just an italicized "h," or specifically the HTML markup <i>h</i>, to denote the Planck constant. In practice, we often don't have a choice, due to character repertoire limitations. But the principle is clear: in cases like this, the compatibility character is to be preferred. The principle can be criticized, though: why would the Planck constant be an exception, when we use, for example, just an italicized c to denote the speed of light?

In the following, we present additional rules, explanations, and comments related to Table 9-4, organized in the same order as the table.


Characters with <circled> mapping

These are circled letters and digits such as ① (U+2460). They are most often used as list item markers, as footnote markers, or in text when referring to items in a numbered list. Although the report suggests in its summary that such characters be retained, the detailed rules rather suggest that when used as list markers, they should be replaced by list markup. On the other hand, this might be impractical if you wish to preserve the circled appearance of the markers. As the report warns, such formatting can be difficult or impossible. (In MS Word, for example, you can set up a numbered list, and then change its appearance to use circled numbers, up to the value of 20. In HTML or CSS, on the other hand, you cannot format a numbered list that way in practice.) In any case, if the characters are used both as list markers and in text as referencing list items, any replacements of the characters should preserve reasonable visual similarity between the markers and the references.


Characters with <compat> mapping in general

The report vaguely describes that "the <compat> label was given to a set of compatibility characters whose further classification was not settled at the time the standard was created." This seems to ignore the possibility of simply replacing the character with its compatibility mapping, e.g., writing ℅ as "c/o." Perhaps the idea is to say that if the formatting or the special meaning is to be preserved, there is usually no other way than to retain the character. In some situations, such as vertical layout, it is necessary to keep the symbol as a single character, and vertical layout is one of the reasons why the characters have been used in the first place. Besides, due to relatively poor support in fonts, most characters in this category are rarely used for purely typographic reasons. Therefore, it might be safest to assume, at least in automatic conversions, that if these characters appear, there is a particular reason for that, so they should be retained as such. On the other hand, if you know that, say, the character Roman numeral seven Ⅶ (U+2166) has been used just for typography or by mistake, it's hard to see why you could not replace it with the three-character string "VII," optionally with some styling.


Fixed-width spaces

The report recommends that these characters be retained. However, as described in Chapter 8, most fixed-width spaces work unreliably and could often be replaced by the use of normal spaces and formatting commands or stylesheets.


Roman numerals

These characters each represent a Roman numeral, such as "VII," as a single character. Similarly to characters with <circled> mapping, they are often used as list item markers in a numbered list, and they could be similarly replaced by list markup. List styling tools (e.g., in HTML and CSS) usually handle the formatting of numbers as Roman numerals well. In other usage, these characters should be retained; see earlier notes on <compat> mapping in general.


Parenthesized numbers

These characters, U+2474 to U+2487, have <compat> mappings like "(1)" and consist of a character in parentheses, such as ⑴ (U+2474). They are used much the same way as circled characters, and the report recommends, in its prose, the same approach for them. The feasibility of replacing these numbers with list markup and styling varies; for example, in CSS, it is currently not possible to make list markers appear as parenthesized numbers.


Dotted numbers

These characters, U+2488 to U+249B, are similar to parenthesized numbers but have <compat> mappings like "1." (i.e., a number followed by a full stop). Similar considerations apply. Note that many default renderings of numbered lists have a dot after the number; it can actually be difficult to get rid of it!


Parenthesized letters

These characters are similar to parenthesized numbers. The summary in the report says "use list item marker style or normalize," but this is probably an oversight. Instead, if it is infeasible to use list markup and marker styling, it is best to treat them the same way as other characters with <compat> mappings: retain them as such, unless you know that they can safely be replaced by their mappings (i.e., normalized).


Other parenthesized symbols

Characters U+3200..U+3229 and U+322A..U+3243 are parenthesized symbols, often used as list markers. Due to their scope of use, they are usually best retained.


Ideographic telegraph symbols for months

These characters have <compat> mappings consisting of a number (of month) followed by an ideograph. Due to their use in vertical layout, they are retained.


Arabic presentation forms

Characters with <final>, <initial>, <isolated>, or <medial> mapping are compatibility characters that represent specific contextual forms of Arabic writing. The report recommends that these be normalized, i.e., replaced with the corresponding generic characters. Note that text using contextual forms is difficult to edit, since the forms would need to be changed, and search operations are difficult, too. However, some rendering software might be able to display the contextual forms but unable to select appropriate glyphs when normalized text is used. If you decide to retain contextual forms for such reasons, beware that there are many pitfalls. For example, you may need to specify directionality explicitly even for purely Arabic text.


Characters with <font> mapping

The report recommends that these be retained. This is cautious policy, based on the fact that the use of these characters may involve semantic distinctions. For example, the Planck constant "ℎ" (U+210E) belongs to this category. If it has been used properly, it should be retained, in principle, though you may have good reasons to deviate from this. If, on the other hand, we can know that this character has been mistakenly used just to produce the letter "h" in italics style, with no specific semantics, the letter "h" and suitable markup and styling should be used instead.


Fractions

Characters with <fraction> mapping are "vulgar fractions." The report somewhat oddly recommends that they be normalized "as long as fraction slash is supported!" In reality, the fraction formatting as requested by the use of the fraction slash is poorly supported. When converting to mathematical markup, fractions should apparently be replaced by the use of constructs like the general mfrac element in MathML. However, the real choice is usually between retaining these characters and replacing them with linearized fractions (e.g., mapping ½ to "1/2" so that / is the normal slash, or solidus) or maybe using a different notation instead of a fraction (e.g., "0.5"). See suggestions on writing fractions in Chapter 8.
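The difference between normalizing a vulgar fraction and linearizing it can be seen in a short Python sketch that uses the standard unicodedata module. The function name linearize_fractions is hypothetical; the sketch simply takes the compatibility mapping and then swaps the fraction slash U+2044 for the solidus.

```python
import unicodedata

FRACTION_SLASH = "\u2044"

def linearize_fractions(text):
    """Replace vulgar fraction characters with digits and a solidus."""
    out = []
    for ch in text:
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            mapped = unicodedata.normalize("NFKC", ch)  # e.g. 1, U+2044, 2
            out.append(mapped.replace(FRACTION_SLASH, "/"))
        else:
            out.append(ch)
    return "".join(out)

# Normalization alone keeps the (poorly supported) fraction slash.
print(unicodedata.normalize("NFKC", "\u00BD") == "1\u20442")
print(linearize_fractions("\u00BD cup"))
```

Note that a blind replacement like this is unsafe after an integer: "1½" would become "11/2". A real conversion would need to insert a space or a hyphen in such cases, or switch to decimal notation.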


Half-width (narrow) characters

Characters with <narrow> mapping are half-width forms of characters, for use in East Asian writing that normally uses glyphs designed to fit into a full square. There is no equivalent markup in general.


Non-breaking characters

Characters with <noBreak> mapping are non-breaking variants of characters. Currently this means Tibetan mark delimiter tsheg bstar (U+0F0C), figure space (U+2007), non-breaking hyphen (U+2011), and narrow no-break space (U+202F). (In fact, all of these except the figure space already appear in Table 9-3.) Otherwise, prevention of line breaking needs to be handled using invisible characters or at a higher protocol level, as explained in Chapter 5. The report says enigmatically: "The compatibility mapping is merely a way to indicate the equivalent character that is not non-breaking. The distinction must be preserved." In reality, there are several alternate ways to express non-breakability, in markup or in a stylesheet. But non-breakability information should surely not just be dropped.


Small forms

Characters with <small> mapping are versions of some ASCII characters and a few other characters, for use in East Asian writing. The report says: "Precise usage unknown. Maintain, but do not generate."


Square forms

Characters with <square> mapping are presentational forms of characters and strings, for use in vertical layout. Although this category contains different types of characters, the report recommends that they all be retained. Typically, the characters are symbols composed of Latin or Japanese kana letters, digits, and slash, designed to fit into a square that can be used as a single cell. For many simple implementations, this is the only way to present, for example, metric units (say, "km") and common abbreviations in a manner suitable for vertical text.


Subscript and superscript characters

Characters with <sub> or <super> mapping are subscript or superscript variants of characters, such as ² (superscript two, U+00B2). The summary in the report recommends replacing them by the use of sub and sup markup, respectively, apparently referring to HTML markup or similar markup. (Of course, there is no guarantee that an arbitrary XML-based markup language contains such elements, or that they have these names; in MathML, the names are msub and msup.) As discussed previously, the situation is rather complicated, and the text of the report acknowledges many of the problems. In the absence of information about the intended meaning, it is generally best to retain these characters. The report explicitly says that when subscripts and superscripts are to reflect semantic distinctions, "it is easier to work with these meanings encoded in text rather than markup, for example, in phonetic or phonemic transcription" and that especially for letters, the distinction can be essential (in phonetic notations, the meaning of "kʰ" is different from the meaning of "kh").
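If you do decide to convert, a cautious approach is to convert only non-letter subscripts and superscripts and to retain letters such as the modifier letter in "kʰ." The following Python sketch illustrates this under the assumption that the target language has HTML-like sub and sup elements; the function name to_markup and the "letters are retained" rule are choices made for this illustration, not something the report prescribes.

```python
import unicodedata
from html import escape

# Compatibility tags mapped to (assumed) HTML-like element names.
TAGS = {"<super>": "sup", "<sub>": "sub"}

def to_markup(text: str) -> str:
    """Convert non-letter subscript/superscript characters to sub/sup markup."""
    out = []
    for ch in text:
        parts = unicodedata.decomposition(ch).split()
        tag = TAGS.get(parts[0]) if parts else None
        base = "".join(chr(int(cp, 16)) for cp in parts[1:]) if tag else ""
        if tag and not base.isalpha():
            out.append(f"<{tag}>{escape(base)}</{tag}>")
        else:
            # Letters (and everything else) are retained as characters.
            out.append(escape(ch))
    return "".join(out)

print(to_markup("m²"))   # m<sup>2</sup>
print(to_markup("kʰ"))   # kʰ
```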


Vertical forms

Characters with <vertical> mapping are presentational forms of characters, for use in East Asian writing when it runs vertically and not horizontally. The report recommends that they be normalized (replaced by the mapping). This is feasible if the rendering software can be assumed to select vertical forms automatically as needed.


Fullwidth (wide) characters

Characters with <wide> mapping are fullwidth forms of characters, for use in East Asian writing that normally uses glyphs designed to fit into a full square. There is no equivalent markup in general.

Figure 9-5. An extract of a table with highly undesirable line breaks


9.2.4. Preventing Line Breaks

We return to the issue of preventing line breaks, discussed in this chapter as well as in Chapter 5 and Chapter 8. The reason is that it is so common to have poorly formatted data, especially tables, just because no method for preventing undesired line breaks has been used. Here we summarize the different methods and present some examples.

To illustrate the problem, consider the extract of tabular data presented in Figure 9-5. It is localization data from the CLDR (discussed in Chapter 11) and somewhat complicated in itself, but undesirable line breaks make things much worse. The first row shown in the figure is meant to specify that for the Farsi (Persian) language (language code fa), a positive monetary amount is expressed in the format #,##0.00 ¤ and a negative monetary amount in the format "-#,##0.00" ¤. Here you can see the currency symbol ¤ in actual use: it is a placeholder for a code, name, or symbol for a currency. The problem here, apart from the difficulty of understanding the notations of the formats, is that a web browser has broken the string #,##0.00 ¤;"-#,##0.00" ¤ (where the semicolon is just a separator between the formats) in a disturbing manner. Breaking at the space obfuscates the data. Similar things happen on the two other rows.

Especially in tables, horizontal space is often a scarce resource. When rendering software tries to fit a multicolumn table within some limited space, it may squeeze some columns so that even cell content like "5 m" is broken into two lines. Breaking it to "5" and "m" can be confusing, and it surely makes the appearance bad. In HTML authoring, specifically, there are many ways of preventing such breaks. They are presented in Table 9-5. Note that some of the ways are just theoretical, though they may illustrate techniques that are useful in other contexts.

Table 9-5. Methods of preventing line breaks in an HTML table cell

Description         Sample markup                               Notes
No-Break Space      <td>5&nbsp;m</td>                           Could use U+00A0 itself, too
Word Joiner         <td>5 &#x2060;m</td>                        Theoretical alternative
Markup attribute    <td nowrap>5 m</td>                         Deprecated markup in HTML
Markup element      <td><nobr>5 m</nobr></td>                   Nonstandard, widely supported
Style sheet (CSS)   <td style="white-space: nowrap">5 m</td>    Better done with external CSS


When using a stylesheet, it is usually better to put CSS code into a separate file, rather than embed it into HTML markup using the style attribute, as in the example. Normally you would use just a <td> tag without attributes, or such a tag with a class attribute, and the styling would be done outside the HTML document.

Although all the methods mentioned in Table 9-5 might be expected to have the same effect, the Word Joiner (WJ) method, which might be regarded as theoretically the most adequate, fails on almost any browser. The other methods mostly have the same effect, but if there is an explicit width set for the table cell, both the markup attribute method and the stylesheet method fail to prevent the line break. This is just one example of the practically important oddities that you may encounter. Using the character-level method, no-break space, is usually the simplest and most effective method here. Note that the use of the entity reference &nbsp; is equivalent to using the no-break space character itself as data, and we use it here just for clarity.

Things change if you need to consider potential line breaking points other than spaces. In that case, you usually don't have a character like the no-break space that you could use. In particular, to prevent a line break after a hyphen, as in <td>555-123</td>, the character-level methods (using the nonbreaking hyphen or the Word Joiner) hardly work in practice. You would thus use one of the last three methods mentioned, i.e., the nowrap attribute, the nobr element, or a stylesheet (or maybe a combination of these).
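When table cells are generated by a program, the choice between these methods can be made automatically. The following Python sketch uses the no-break space when spaces are the only potential break points and falls back to the stylesheet method when the content contains a hyphen. The function name table_cell and the simple hyphen test are assumptions made for illustration; in real use, a class attribute with external CSS would usually be preferable to the style attribute.

```python
from html import escape

def table_cell(content: str) -> str:
    """Return a <td> element whose content should stay on one line."""
    escaped = escape(content)
    if "-" in content:
        # Character-level methods hardly work after a hyphen,
        # so use the stylesheet method for the whole cell.
        return f'<td style="white-space: nowrap">{escaped}</td>'
    # Only spaces to protect: the no-break space is the simplest method.
    return "<td>" + escaped.replace(" ", "&nbsp;") + "</td>"

print(table_cell("5 m"))      # <td>5&nbsp;m</td>
print(table_cell("555-123"))  # <td style="white-space: nowrap">555-123</td>
```

As noted above, the stylesheet method may fail if an explicit width has been set for the cell, so the generated markup should still be tested in the browsers that matter.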

Finally, as a practical observation that often makes things easier, note that it is often sufficient to prevent line breaks in one cell in a column. Typically, you would work on the cell with the largest width requirement when written on one line. If you prevent a line break in a cell containing "1 000 000 $," then surely a cell with "42 $" in the same column won't be broken either.
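This observation is easy to apply when a column is generated programmatically. The following Python sketch protects only the widest cell of a column, replacing its spaces with no-break spaces. It uses the number of characters as a rough stand-in for the width requirement, which is an assumption that holds only approximately for proportional fonts; the function name protect_widest is hypothetical.

```python
NBSP = "\u00a0"

def protect_widest(cells: list) -> list:
    """Replace spaces with no-break spaces in the longest cell only."""
    if not cells:
        return []
    # Character count approximates the width needed on one line.
    widest = max(range(len(cells)), key=lambda i: len(cells[i]))
    return [cell.replace(" ", NBSP) if i == widest else cell
            for i, cell in enumerate(cells)]

column = ["42 $", "1 000 000 $"]
protected = protect_widest(column)
print([cell.count(NBSP) for cell in protected])  # [0, 3]
```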

9.2.5. Breaking the Flow of Text

Markup can be used even for parts of words. Should it affect the way in which the textual content is processed, such as recognition of words? Consider the (old-fashioned) HTML markup <b>F</b>oo, intended to make the word Foo appear so that the first letter is bold. Could search engines, for example, treat it as two words, "F" and "oo"?

Search engines generally parse HTML in a manner that effectively ignores most tags. It is however possible that some programs do otherwise, either because they have poorly written parsers or because they have intentionally been programmed to honor markup, in a way. The latter would be quite natural for markup like <p>xxx</p><p>yyy</p>, where the two elements should be treated as paragraphs and the strings xxx and yyy as separate, not as xxxyyy.

In practice, search engines differ. Google treats <b>F</b>oo as "Foo," whereas AltaVista treats it as two words, "F oo." Moreover, search engine behavior may vary by situation and version. It is thus best to avoid using markup that breaks words, unless you have real need for it.

For a markup language like HTML, it would be natural to think that inline (text-level) markup (like b for bold face font) does not separate characters in any way, whereas block-level markup (like p for paragraph) acts as a separator. However, neither HTML specifications nor the Unicode standard discuss this issue, and search engines can hardly be expected to make such distinctions.

In a more general setting, such as XML, things become even more complicated. There is no division into inline and block-level elements in XML itself, though in XML-based languages, such a division might be made.

Thus, we should be prepared for both alternatives. In some situations, inline markup could separate strings. It might also fail to do that even when we would expect that, so markup like <p>xxx</p><p>yyy</p> is not safe; it is better to insert a space or a line break between the elements.
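If you write your own text extraction for HTML, you can implement the "natural" interpretation explicitly. The following Python sketch, based on the standard html.parser module, treats a small set of block-level elements as word separators and ignores all other tags. The set BLOCK_TAGS is an illustrative assumption and far from complete, and the sketch says nothing about how any actual search engine works.

```python
from html.parser import HTMLParser

# A deliberately small, assumed set of elements that separate text.
BLOCK_TAGS = {"p", "div", "li", "td", "th", "br", "h1", "h2", "h3"}

class TextExtractor(HTMLParser):
    """Collect text, inserting a space at block-level boundaries only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def extract_words(html: str) -> list:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).split()

print(extract_words("<b>F</b>oo"))            # ['Foo']
print(extract_words("<p>xxx</p><p>yyy</p>"))  # ['xxx', 'yyy']
```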

Similarly, if we write <font color="red">f</font>i, it may happen that the string "fi" is not presented as a ligature even if a browser would use a ligature when the font markup is not there. If we use e&#x304; in HTML (or XML), we may expect to see ē (letter "e" and a combining macron, U+0304), and this may well happen. But if we write e<font color="red">&#x304;</font>, the situation may change, depending on the browser. The font tag might act as an invisible barrier between a character and a combining diacritic mark. Different browsers could render this as ē in normal color, as ē with a red macron, or as e¯ with a red macron, or even (incorrectly) as just "e." The example may sound contrived, but people really want to use such markup at times, e.g., in linguistic contexts when drawing attention to a diacritic mark.
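A program that processes marked-up text can at least detect such situations. The following Python sketch takes the text runs of a document in order (for example, the text nodes produced by a parser) and reports the runs that begin with a combining mark, which means that markup separates the mark from its base character. The function name and the representation of the document as a list of strings are assumptions made for illustration.

```python
import unicodedata

def runs_starting_with_mark(runs: list) -> list:
    """Return the indexes of text runs that begin with a combining mark."""
    return [i for i, run in enumerate(runs)
            # General categories Mn, Mc, and Me all start with "M".
            if run and unicodedata.category(run[0]).startswith("M")]

# e<font color="red">&#x304;</font> yields the runs "e" and U+0304.
print(runs_starting_with_mark(["e", "\u0304"]))  # [1]
# e&#x304; without markup is a single run.
print(runs_starting_with_mark(["e\u0304"]))      # []
```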

In any case, markup used inside words, even for individual characters, tends to make the markup hard to read. This is one of the reasons why UTR #20 allows several formatting characters that could in principle be replaced by markup. Compare, for example, the string foo-bar‑1 (where the second hyphen is the nonbreaking hyphen, U+2011) or even foo-bar&#x2011;1 (using a character reference for U+2011) with the markup foo-<span style="white-space:nowrap">bar-1</span>.

9.2.6. Why Not Markup in Unicode?

Unicode contains a large number of characters that are, more or less, typographic variants of more basic characters. This, and reasons for it, were discussed in Chapter 4. To some extent, such characters can be explained by the universality principle: they have been taken into Unicode, since they exist in other character code standards. However, this does not explain the addition of more and more characters of this kind, especially for the needs of mathematics (e.g., mathematical bold capital "A," mathematical bold italic capital "A," mathematical sans-serif capital "A," and many, many others).

Since most of such characters can be described in terms of basic characters and a number of features such as "bold," "italic," and "sans-serif," it is natural to ask whether a more systematic approach could have been used. In fact, they could have been implemented more efficiently by adding a limited number of formatting characters into the Basic Multilingual Plane (BMP). That way, you would use such a formatting character before or after a normal letter to create a special variant. This would give much more flexibility, and it would be in accordance with the principles applied to characters with diacritic marks. The Unicode FAQ answers:

It would have provided too much flexibility, and would have tempted people to use such characters to create "poor man's markup" schemes rather than using proper markup such as SGML/HTML/XML. The mathematical letters and digits are meant to be used only in mathematics, where the distinction between a plain and a bold letter is fundamentally semantic rather than stylistic.

This means that there are two distinct points:

  • The Unicode standard intentionally excludes anything resembling general font markup. The expressed reason is that people should use "proper markup" instead.

  • The Unicode characters that can be classified as font variants are usually not just typographic variants but have specific meaning. However, Unicode defines the meaning rather abstractly by designating characters as "mathematical," for example.

In practice, if you decided to use, say, mathematical bold capital "A" (U+1D400) just to produce a bold A, you would not break any formal rule of Unicode. But in addition to breaking the spirit, it would almost always be unwise. The character U+1D400 has very limited support in fonts and in automatic processing in programs. Besides, programs that recognize it may treat it in a way that corresponds to its role as a mathematical symbol, rather than just a variant of the common letter "A."
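The formal status of such a character can be inspected with the Python standard library. The following sketch, in which the function name describe_font_variant is just an illustrative choice, shows that U+1D400 is a distinct character with a <font> compatibility mapping to the common letter "A," so that compatibility normalization (NFKC) removes the distinction altogether.

```python
import unicodedata

def describe_font_variant(ch: str) -> tuple:
    """Return (name, compatibility mapping, NFKC form) of a character."""
    return (unicodedata.name(ch),
            unicodedata.decomposition(ch),
            unicodedata.normalize("NFKC", ch))

name, mapping, folded = describe_font_variant("\U0001D400")
print(name)     # MATHEMATICAL BOLD CAPITAL A
print(mapping)  # <font> 0041
print(folded)   # A
```

Text that uses U+1D400 as "poor man's bold" would thus lose its boldness in any process that applies NFKC, which is one more practical reason to use markup or styling for that purpose.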



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
