9.2. Characters and Markup

XML (and SGML) markup has often been characterized as "semantic" or "logical," instead of presentational or physical. However, markup can be used for many purposes, including formatting and typography. If you need to store text data in a way that contains formatting information, nothing prevents you from using XML markup for that. To mention examples from HTML, <i> markup is used to indicate italics, and markup like <font face="Arial">, though deprecated, is still used to specify the font family. On the other hand, many Unicode characters are typographic variants of other characters, coded as separate characters for different reasons. Many of them are compatibility characters and have been included only due to their existence in other character codes. However, there are other cases as well. The difference between normal (upright) and italics style may indicate a semantic distinction in mathematics or other special notations. Most things that are expressed in markup have no character-level counterpart. For example, designating some text as a heading, in the logical structure, cannot be done at the character level. What comes closest to that is writing the heading text in all uppercase with line breaks before and after, as we often do in plain text. If you consider markup for indicating the structure of a price list that is never meant to be displayed as such, only through special formatting processes and rules, it should be obvious that you cannot do anything at the character level to indicate some text as "product name" and some other text as "unit price" for the product. Thus, the question of whether information should be expressed at the character level or in markup primarily deals with presentational distinctions, or distinctions that might at least arguably be regarded as presentational. It is an important special case of the problem of selecting an appropriate level of expression.
There are specific guidelines on it suggested in a joint report by the Unicode Consortium and the World Wide Web Consortium (W3C). Note that text processors increasingly use XML markup in their data formats, so the principles apply to using such tools, too, at least indirectly.

9.2.1. Markup and Styling

In this context, "markup" really means "markup and styling," in most cases. Modern markup tends to be logical rather than presentational, and therefore markup alone does not usually imply any particular rendering style. In particular, if you use generic XML, inventing tags as you need, with the intention of specifying rendering in a stylesheet, then no markup has any rendering style as such, only through the stylesheet. Although the i markup in HTML, for example, specifically means "italics," we cannot say the same about the (rarely used) var markup, which means "variable, placeholder." Yet, in contexts where it is suitable to replace a compatibility character with a normal character and i markup, it can be just as suitable, if not more so, to replace it with the normal character and var markup, provided that two conditions are met. First, the intended meaning should correspond to the defined meaning of the markup. Second, we should have reasonable expectations of having it rendered in italics. The expectations might be based on information about typical browser defaults, or on the use of a stylesheet that explicitly suggests such rendering (var { font-style: italic; }).

9.2.2. Document-wide Versus Local Decisions

The question "character level or markup?" has two levels:
The alternative "both ways" can usually be excluded. Although it may sound ideal to get the best of both possible worlds, you easily end up with getting the worst of them. Besides, you might get nasty cumulative effects. Consider the simple example of writing the expression x⁴ (x to the power of 4). You could use the superscript four (U+2074), or you could use a higher-level protocol, such as markup (e.g., sup element in HTML) or superscript style in a word processor. Trying to use bothe.g., using x<sup>⁴</sup> in HTMLwill probably combine the drawbacks of both alternatives: in rendering, it fails whenever U+2074 does not belong to the fonts in use, and it tends to mess up line spacing the way sup markup often does. Besides, it's illogical. It means double superscripting, and this will probably make the superscript appear as very small if at all, since a browser uses superscript 4 and reduces its size. Moreover, as noted earlier in this chapter, if you express some superscripts as superscript characters and other superscripts using markup, they easily look disturbingly different. Any automated processing of the data, such as conversion to another format, would need to deal with two representations of superscripts instead of one.
There are a few exceptions, though. Sometimes there are two ways to present the same formatting information so that no harm arises from the duplication. For example, in HTML authoring, you could write a table cell as <td nowrap>42 m</td> so that the space between "42" and "m" is a no-break space U+00A0 (which you could write as &nbsp; in HTML). That would mean expressing both at the character level and in markup, with the nowrap attribute, that the cell content must remain on one line. Both ways are "safe" in practice. Although there is no particular benefit from using both, no harm is caused either, so such duplication need not be avoided, for example, when generating table markup automatically.

9.2.3. Unicode Versus Markup

The document "Unicode in XML and other Markup Languages" has been published as Unicode Technical Report UTR #20 at http://www.unicode.org/reports/tr20/ as well as a W3C Note at http://www.w3.org/TR/unicode-xml/. It is not part of the Unicode standard; it has been approved just as a Technical Report, though other documents may make normative references to it. In the W3C terminology, a Note is a document that has been endorsed by a working group but not reviewed or endorsed by the W3C as a whole (or by "W3C Members"). Nevertheless, UTR #20 is the best available general guideline on whether information should be expressed at the character level or in markup. We need to use it with discretion, partly because the report considers markup in general and not the specific features of various markup languages and systems. For example, in cases where the report recommends markup, the markup language we use might lack elements that could be used for the purpose, or their implementation in software might be wanting.
In practice, the report revolves around XML, which covers both generic XML (where you can invent tags as you go) and specific XML-based markup languages such as XHTML (the XML-ized version of HTML), MathML, MusicML, or SVG (Scalable Vector Graphics, a language for two-dimensional graphics in XML, with a possibility of including text as character data). However, the markup concept is more general and covers SGML too, including classic HTML, which is nominally SGML-based. Even notations such as RTF and other rich text systems can be regarded as markup, even though their general syntax is different.

9.2.3.1. Differences between markup and plain text

Plain text is linear: one character follows another in a sequence. Although the visual rendering can be more complicated (e.g., due to combining diacritic marks and to alterations in writing direction), plain text is still processed linearly. Markup, on the other hand, expresses tree-like structures, even if it is written linearly. As any good book on markup will tell you, a notation like <x><y>foo</y><z>bar</z></x> describes a tree structure, with elements y and z as "children" (subtrees) of x. The marked-up text needs to be processed (parsed) in order to construct the tree structure, which in turn can be linearized into text. A markup element can be very large. In XML, the entire document is treated as one element, with subelements, which contain subelements, etc. Information expressed at the character level works on different grounds. Either the difference between characters as such carries some information (e.g., using ² instead of 2 expresses that we have a superscript), or a character affects the interpretation or processing of the preceding or sometimes the following character. Some characters may set some internal state in interpretation or processing, such as writing direction. They might be compared to start tags in markup. Even then, such characters usually affect a state in a simple way, setting it to a specific value.
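The parse-then-linearize cycle described above can be made concrete with a short sketch using Python's standard xml.etree.ElementTree parser (my own illustration, not anything prescribed by UTR #20):

```python
import xml.etree.ElementTree as ET

# Parse the linear marked-up text into a tree...
root = ET.fromstring("<x><y>foo</y><z>bar</z></x>")
print(root.tag)                       # the root element x
print([child.tag for child in root])  # its children y and z

# ...and linearize the tree back into marked-up text.
print(ET.tostring(root, encoding="unicode"))
```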
There is normally no nesting involved as in markup. Thus, the character level is normally useful for rather local information only. On the other hand, it is generally simple and compact to use when it applies. Compare the simplicity of using a character (code point) for ² as opposed to markup like <sup>2</sup>, where you need start and end tags even though you are saying something about a single character only. Obviously, information at the character level is suitable for linear processing, where you read a stream of characters and process them in succession. Similarly, marked-up text, once parsed, is suitable for structured processing, where you start from a tree and process it by the structure.

9.2.3.2. Characters that should not be used in marked-up text

UTR #20 declares some characters, listed in Table 9-2, as "unsuitable for use with markup." Some of them might have use in plain text or other formats, but not in XML, for example. Most of these characters would rarely come into your mind anyway when using markup. Note, however, that U+FEFF has been used to some extent in marked-up text as an invisible joiner (to prevent undesired line breaks), and in practice, it still does the job more reliably than the suggested replacement, U+2060.
As described in Chapter 5, the Line Separator (LS) U+2028 and the Paragraph Separator (PS) U+2029 were introduced to provide unambiguous means to denote line breaks and paragraph delimiters in plain text. This was meant to avoid the ambiguity caused by different uses of ASCII control characters like Line Feed. In practice, LS and PS have not been used much. If they appear in plain text being converted to marked-up text, they should be replaced by appropriate markup. In HTML, you use <br> (or <br /> in XHTML) for a forced line break, and you surround each paragraph with the tags <p> and </p>. UTR #20 recommends that an occurrence of LS or PS in marked-up text be treated as whitespace (i.e., as equivalent to a space). According to UTR #20, the Bidi embedding controls U+202A..U+202E (see Chapter 5) are "strongly discouraged" in the HTML 4 specification, which, however, actually just warns about possible conflicts between those controls and equivalent markup. It recommends that preferably one or the other should be used exclusively, and adds:
UTR #20 suggests that markup be used instead of the controls on the following grounds:
Although this recommendation is usually adequate, there are situations where markup cannot be used for Bidi embedding. Attributes of elements cannot contain markup, only text, and some elements may contain only text. Thus, if Bidi control is needed (e.g., in a <title> element or in an alt attribute of an <img> element), the control characters are the only possibility.

9.2.3.3. Formatting characters that may be used in marked-up text

According to UTR #20, the characters listed in Table 9-3 may be used in XML documents or other marked-up text, even though they are invisible formatting characters or characters with formatting information. This does not mean that they should be used, or that it would always be appropriate and best to use them. Rather, they are regarded in principle as compatible with the ideas and practices of markup. This means that the potential risks of mixing character-level information and markup are not relevant, or can be controlled well enough. On the practical side, many of the characters listed are poorly supported or could be replaced by markup. The no-break space U+00A0 is often used and useful, whereas most of the other characters have little use in texts written in Latin letters, except for the soft hyphen U+00AD in some word processors.
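The character properties that make these characters "invisible formatting characters" are easy to inspect with Python's standard unicodedata module (a sketch of my own, just to illustrate what kind of characters are involved):

```python
import unicodedata

# NO-BREAK SPACE, SOFT HYPHEN, ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER
for cp in ("\u00A0", "\u00AD", "\u200C", "\u200D"):
    # General category: Zs = space separator, Cf = invisible format character
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)} ({unicodedata.category(cp)})")
```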
The combining grapheme joiner (U+034F) is a combining mark rather than a formatting character. It does not affect cursive joining or ligation (as ZWJ and ZWNJ do). Neither does it combine or join graphemes, so its Unicode name is misleading. It has two uses, related to collation (sorting) of strings and to canonical reordering of combining marks. See the Unicode FAQ, http://www.unicode.org/faq/char_combmark.html. Subtending marks are used in the Arabic and Syriac scripts to indicate that a mark be placed below a string of characters (e.g., below a sequence of digits, to indicate a year). The Syriac abbreviation mark is used similarly but placed above a string, as a supertending mark, and the Arabic end of ayah is a similar but enclosing mark. In character data, a subtending mark precedes the affected characters; the end of the affected range is defined implicitly, usually by the first non-alphanumeric character. There is currently no markup that can replace these subtending, supertending, or enclosing marks. Variation selectors were discussed in the section "Unicode and Fonts" in Chapter 4. They are used to select a glyph variant of the preceding character. Although they could in principle be replaced by markup and styling (glyph selection), this cannot be done in practice now. UTR #20 comments on them as "Not graphic characters," which is technically correct: they are not visible characters but are meant to affect the rendering of another character.

9.2.3.4. Characters with compatibility mappings

The characters listed in Table 9-2 and Table 9-3 usually do not cause much of a problem when deciding what characters to use in marked-up text. Most of them would not be used anyway, and the rules for them are rather straightforward, though the practical considerations (would this formatting character work?) might require some study.
The third and last group of characters discussed in UTR #20, those with compatibility mappings, is more problematic, and more important (e.g., in texts in English). As we noted in Chapter 5, compatibility mappings exist for different reasons and have varying meanings. The difference between a character and its compatibility mapping can vary from practically ignorable to a substantial difference in meaning or appearance or both. The expression "characters with compatibility mappings" is admittedly clumsy, but the equivalent term "compatibility decomposable character" is also clumsy, and the simpler term "compatibility character" does not mean quite the same thing. (There are compatibility characters that have no compatibility mapping.) The recommendations on using characters with compatibility mappings in marked-up text may appear to conflict with general Unicode principles on avoiding such characters in new data (see Chapter 5). The main reason is that these recommendations mainly deal with marking up existing character data rather than the creation of completely new data. For example, the use of characters for ligatures (such as "ﬂ" as one character) in new data should normally be avoided. However, if such data exists in plain text, it should not be indiscriminately replaced by its decomposition (such as the letters "f" and "l"), especially if we have no idea of how the ligature behavior could be expressed in markup or otherwise. The recommendations of UTR #20 are summarized in Table 9-4 and commented on (and criticized) after the table. The report presents them primarily as applicable when XML markup is first added to text that has no markup. It does not necessarily mean that existing marked-up text should be modified. The first column in the table specifies a "compatibility tag" as defined in the Unicode database. As explained in Chapter 5, such tags are metasymbols used to indicate the nature of the compatibility mapping, and they should not be confused with markup tags.
For two tags, the recommended treatment is different for different characters, and this is indicated by specifying the applicability by code range in column 2. (For compactness, the "U+" prefix is omitted here.)
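To make compatibility mappings concrete, here's a small sketch of my own in Python: NFKC normalization applies the compatibility decomposition and then composes canonically, and decomposing a ligature this way is lossy in the sense that the ligature intent disappears from the data:

```python
import unicodedata

lig = "\uFB02"  # LATIN SMALL LIGATURE FL
# The compatibility mapping leads to the plain letters "f" and "l":
print(unicodedata.normalize("NFKC", lig))

# NFKC is equivalent to compatibility decomposition (NFKD)
# followed by canonical composition (NFC):
s = "\u00B2\u00E9"  # SUPERSCRIPT TWO followed by e with acute accent
assert unicodedata.normalize("NFKC", s) == unicodedata.normalize(
    "NFC", unicodedata.normalize("NFKD", s))
```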
In this context, "normalize" means conversion to Normalization Form KC. As described in Chapter 5, this means compatibility ("K") decomposition followed by canonical composition ("C"). Only a few types of characters are normalized, according to UTR #20. Most compatibility characters are retained. The treatment of characters with compatibility mappings needs to be more complicated than expressed in the summary table. In fact, there are internal inconsistencies in UTR #20 between the summary table and the prose explanations. Here it is interpreted according to the prose, which is more detailed, and some apparent errors have been corrected. Most important, we need to consider the intended meaning of using a character with a compatibility mapping. If the purpose is just visual formatting, it should be replaced by the use of normal characters and markup (and a stylesheet). If there is a semantic difference involved, the character should be retained. The report illustrates this with a simple example of italicized characters:
In the following, we present additional rules, explanations, and comments related to Table 9-4, organized in the same order as the table.
Figure 9-5. An extract of a table with highly undesirable line breaks

9.2.4. Preventing Line Breaks

We return to the issue of preventing line breaks, discussed in this chapter as well as in Chapter 5 and Chapter 8. The reason is that it is so common to have poorly formatted data, especially tables, just because no method for preventing undesired line breaks has been used. Here we summarize the different methods and present some examples. To illustrate the problem, consider the extract of tabular data presented in Figure 9-5. It is localization data from the CLDR (discussed in Chapter 11) and somewhat complicated in itself, but undesirable line breaks make things much worse. The first row shown in the figure is meant to specify that for the Farsi (Persian) language (language code fa), a positive monetary amount is expressed in the format #,##90.00 ¤ and a negative monetary amount in the format "-#,##0.00" ¤. Here you can see the currency symbol ¤ in actual use: it is a placeholder for a code, name, or symbol for a currency. The problem here, apart from the difficulty of understanding the notations of the formats, is that a web browser has broken the string #,##90.00 ¤;"-#,##0.00" ¤ (where the semicolon is just a separator between the formats) in a disturbing manner. Breaking at the space obfuscates the data. Similar things happen on the two other rows. Especially in tables, horizontal space is often a scarce resource. When rendering software tries to fit a multicolumn table within some limited space, it may squeeze some columns so that even cell content like "5 m" is broken into two lines. Breaking it into "5" and "m" can be confusing, and it surely makes the appearance bad. In HTML authoring, specifically, there are many ways of preventing such breaks. They are presented in Table 9-5. Note that some of the ways are just theoretical, though they may illustrate techniques that are useful in other contexts.
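The character-level approach can be sketched as a small preprocessing step, for example when generating table markup from data. The helper below is a hypothetical illustration of mine, not from the text:

```python
def make_unbreakable(cell_text: str) -> str:
    """Replace ordinary spaces with NO-BREAK SPACE (U+00A0) so that
    a browser will not break the cell content at those spaces."""
    return cell_text.replace(" ", "\u00A0")

# The generated cell looks like <td>42 m</td>, but the space is U+00A0.
print(f"<td>{make_unbreakable('42 m')}</td>")
```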
When using a stylesheet, it is usually better to put CSS code into a separate file, rather than embed it into HTML markup using the style attribute, as in the example. Normally you would use just a <td> tag without attributes, or such a tag with a class attribute, and the styling would be done outside the HTML document. Although all the methods mentioned in Table 9-5 might be expected to have the same effect, the Word Joiner (WJ) method, which might be regarded as theoretically the most adequate, fails on almost any browser. The other methods mostly have the same effect, but if there is an explicit width set for the table cell, both the markup attribute method and the stylesheet method fail to prevent the line break. This is just one example of the practically important oddities that you may encounter. Using the character-level method, the no-break space, is usually the simplest and most effective method here. Note that the use of the entity reference &nbsp; is equivalent to using the no-break space character itself as data, and we use it here just for clarity. Things change if you need to consider potential line breaking points other than spaces. In that case, you usually don't have a character like the no-break space that you could use. In particular, to prevent a line break after a hyphen, as in <td>555-123</td>, the character-level methods (using the nonbreaking hyphen or the Word Joiner) hardly work in practice. You would thus use one of the last three methods mentioned (i.e., the nowrap attribute, the nobr element, or a stylesheet), or maybe a combination of these. Finally, as a practical observation that often makes things easier, note that it is often sufficient to prevent line breaks in one cell in a column. Typically, you would work on the cell with the largest width requirement when written on one line. If you prevent a line break in a cell containing "1 000 000 $," then surely a cell with "42 $" in the same column won't be broken either.

9.2.5.
Breaking the Flow of Text

Markup can be used even for parts of words. Should it affect the way in which the textual content is processed, such as recognition of words? Consider the (old-fashioned) HTML markup <b>F</b>oo, intended to make the word Foo appear so that the first letter is bold. Could search engines, for example, treat it as two words, "F" and "oo"? Search engines generally parse HTML in a manner that effectively ignores most tags. It is, however, possible that some programs do otherwise, either because they have poorly written parsers or because they have intentionally been programmed to honor markup in some way. The latter would be quite natural for markup like <p>xxx</p><p>yyy</p>, where the two elements should be treated as paragraphs and the strings xxx and yyy as separate, not as xxxyyy. In practice, search engines differ. Google treats <b>F</b>oo as "Foo," whereas AltaVista treats it as two words, "F oo." Moreover, search engine behavior may vary by situation and version. It is thus best to avoid using markup that breaks words, unless you have a real need for it. For a markup language like HTML, it would be natural to think that inline (text-level) markup (like b for boldface font) does not separate characters in any way, whereas block-level markup (like p for paragraph) acts as a separator. However, neither HTML specifications nor the Unicode standard discuss this issue, and search engines can hardly be expected to make such distinctions. In a more general setting, such as XML, things become even more complicated. There is no division into inline and block-level elements in XML itself, though in XML-based languages, such a division might be made. Thus, we should be prepared for both alternatives. In some situations, inline markup could separate strings. It might also fail to do that even when we would expect it to, so markup like <p>xxx</p><p>yyy</p> is not safe; it is better to insert a space or a line break between the elements.
Similarly, if we write <font color="red">f</font>i, it may happen that the string "fi" is not presented as a ligature even if a browser would use a ligature when the font markup is not there. If we use the letter "e" followed by a combining macron (U+0304) in HTML (or XML), we may expect to see "ē," and this may well happen. But if we write e<font color="red">&#x304;</font>, the situation may change, depending on the browser. The font tag might act as an invisible barrier between a character and a combining diacritic mark. Different browsers could render this as "ē" in normal color, as "ē" with a red macron, or as "e¯" (with the macron as a separate spacing glyph) with a red macron, or even (incorrectly) as just "e." The example may sound contrived, but people really want to use such markup at times (e.g., in linguistic contexts when drawing attention to a diacritic mark). In any case, markup used inside words, even for individual characters, tends to make the markup hard to read. This is one of the reasons why UTR #20 allows several formatting characters that could in principle be replaced by markup. Compare, for example, the string foo-bar‑1 (where the second hyphen is the nonbreaking hyphen, U+2011) or even foo-bar&#x2011;1 (using a character reference for U+2011) with the markup foo-<span style="white-space:nowrap">bar-1</span>.

9.2.6. Why Not Markup in Unicode?

Unicode contains a large number of characters that are, more or less, typographic variants of more basic characters. This, and reasons for it, were discussed in Chapter 4. To some extent, such characters can be explained by the universality principle: they have been taken into Unicode, since they exist in other character code standards. However, this does not explain the addition of more and more characters of this kind, especially for the needs of mathematics (e.g., mathematical bold capital "A," mathematical bold italic capital "A," mathematical sans-serif capital "A," and many, many others).
Since most such characters can be described in terms of basic characters and a number of features such as "bold," "italic," and "sans-serif," it is natural to ask whether a more systematic approach could have been used. In fact, they could have been implemented more efficiently by adding a limited number of formatting characters into the Basic Multilingual Plane (BMP). That way, you would use such a formatting character before or after a normal letter to create a special variant. This would give much more flexibility, and it would be in accordance with the principles applied to characters with diacritic marks. The Unicode FAQ answers:
This means that there are two distinct points:
In practice, if you decided to use, say, mathematical bold capital "A" (U+1D400) just to produce a bold A, you would not break any formal rule of Unicode. But in addition to breaking the spirit of the standard, it would almost always be unwise. The character U+1D400 has very limited support in fonts and in automatic processing in programs. Besides, programs that recognize it may treat it in a way that corresponds to its role as a mathematical symbol, rather than as just a variant of the common letter "A."
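The relationship between such a mathematical variant and the basic letter is recorded in its compatibility mapping, as this Python sketch (my own illustration) shows:

```python
import unicodedata

bold_a = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A
print(unicodedata.name(bold_a))

# Its compatibility mapping leads back to the plain letter "A";
# the "bold" information is lost in NFKC normalization, which is why
# such styling is better expressed in markup when it is only styling.
print(unicodedata.normalize("NFKC", bold_a))
```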