5.11. Effects on Choosing Characters

In Chapter 1, we discussed some reasons to be strict (or picky) in choosing the right character, as opposed to using a character that only looks right. At this point, we can return to such issues with a better technical background: what is the impact of the formally defined properties of characters?

We discussed the simple example of using the letter sharp "s" ß (U+00DF) as a replacement for the Greek small letter beta β (U+03B2). The idea might be tempting, and it has been applied, in situations where you can safely use the ISO Latin 1 repertoire of characters (as you can very often) but not the Greek letters, and you have just a casual need for a beta. Yet, even ignoring all the other arguments, if you compare the formally defined properties of the characters, you can see that they are fundamentally different. They are both letters, but from different scripts, and they have quite different uppercase mappings. Of course, in many contexts, you might get away with this, if no program processes the text in any manner where the differences in properties matter.
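The difference in properties can be checked programmatically. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is only illustrative. The standard library does not expose the Script property, so the sketch relies on the character name to hint at the script.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few formally defined properties of a single character."""
    return {
        "code": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        # str.upper() applies the full case mapping, which may yield
        # more than one character.
        "upper": ch.upper(),
    }

sharp_s = describe("\u00DF")  # LATIN SMALL LETTER SHARP S
beta = describe("\u03B2")     # GREEK SMALL LETTER BETA

for info in (sharp_s, beta):
    print(info)
```

Both characters report the general category Ll (Letter, lowercase), but the sharp s maps to the two-letter string "SS" in uppercase, whereas the beta maps to the single Greek capital letter beta (U+0392).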

Even if you try to avoid any tricks based on visual similarity, trying to use the right character, you may find yourself puzzled. Unicode has about 100,000 characters, and even though very large portions thereof can be classified as belonging to particular scripts, there is still quite a lot to choose from. This applies in particular to characters that can be classified as "symbols" in the sense that they do not belong to any particular script, though perhaps to a very specialized area of application like some branch of science or technology.

5.11.1. Example: Some Mathematical Operators

As a simple example, consider scalar and vector products in mathematics (and physics). They are operations on entities called vectors, which are conventionally denoted by boldface letters. Normally the scalar (or dot) product of vectors a and b is denoted as a · b, and their vector (or cross) product is denoted as a × b. There are two obvious candidates for the symbols: the middle dot · (U+00B7) and the multiplication sign × (U+00D7). This is a convenient choice, since both characters belong to the ISO Latin 1 repertoire. In fact, it is very often the best choice, given the existing limitations. It would be theoretically more adequate to use the dot operator ⋅ (U+22C5) and the vector or cross product ⨯ (U+2A2F), respectively, since they have more specific meanings. In practice, these characters (especially the latter) are available far less often, due to font limitations and other problems. We might regard the more common, mixed-usage characters as just as good. However, if we look at the properties of characters, we see many differences, as illustrated in Table 5-13. Generally, the property values of the dot operator correspond to its specific meaning as a mathematical operator, whereas the middle dot, with multiple semantics (described in Chapter 8 to some extent), has less informative, more neutral property values.

Table 5-13. Comparison of properties of two "dot-like" characters
Property	Value for middle dot	Value for dot operator
General Category (gc)	Po (Punctuation, other)	Sm (Symbol, math)
Line Break (lb)	AI (= ambiguous: alphabetic or ideographic)	AL (= alphabetic/symbol)
Math (= mathematical)	No	Yes
Word Break (WB)	MidLetter	Other
East Asian Width (ea)	A (= ambiguous)	N (= neutral)
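Some of the values in Table 5-13 can be reproduced with a short script. This is a sketch only: Python's standard `unicodedata` module exposes the general category and East Asian width, but not the Line Break, Word Break, or Math properties, which would need the Unicode data files or a third-party library. The function name `properties` is an illustrative choice.

```python
import unicodedata

MIDDLE_DOT = "\u00B7"
DOT_OPERATOR = "\u22C5"

def properties(ch: str) -> tuple:
    """Return (name, general category, East Asian width) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )

for ch in (MIDDLE_DOT, DOT_OPERATOR):
    name, gc, ea = properties(ch)
    print(f"U+{ord(ch):04X} {name}: gc={gc}, ea={ea}")
```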

Most people probably don't even consider using the dot operator, since they've never heard of the character. After careful analysis, you might decide to do the same, since the practical benefits of using the middle dot are more important than the considerations of semantics and properties. The line-breaking property values, for example, are not essentially different: default line-breaking rules will not break before or after a middle dot or a dot operator, unless a space intervenes.

Consequently, people often use characters that are not "quite right" and do not have the properties that the theoretically most adequate characters would have. When processing texts, you need to be prepared to deal with input where "wrong" characters are used. For example, if you edit a mathematical journal, you should expect that authors use different characters as symbols for scalar and vector product. Authors might use almost anything that looks close enough to them, for example a bullet • (U+2022) or even the normal period (full stop) "." as a symbol of dot product. It would be your responsibility to unify the notations (or to make a conscious decision not to do that), and this might mean that you need to use a program that lets you check the codes of the characters in the text easily.
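A tool for checking character codes need not be elaborate. The following is a minimal sketch, with the hypothetical helper `list_non_ascii`, that reports the position, code point, and Unicode name of every non-ASCII character in a string. It would not flag an ASCII period used as a dot product symbol; finding that requires looking at the context.

```python
import unicodedata

def list_non_ascii(text: str) -> list:
    """Return (index, code point, name) for each non-ASCII character."""
    rows = []
    for index, ch in enumerate(text):
        if ord(ch) > 0x7F:
            # Fall back to a placeholder for characters without a name.
            name = unicodedata.name(ch, "<unnamed>")
            rows.append((index, f"U+{ord(ch):04X}", name))
    return rows

# Three visually similar ways of writing a dot product.
sample = "a \u00B7 b, a \u2022 b, a \u22C5 b"
for row in list_non_ascii(sample):
    print(row)
```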

Similarly, any attempt to process mathematical texts should not assume much about the use of characters, unless it has been carefully verified. You cannot expect, for example, that characters used to denote mathematical operators have the appropriate formal properties. Their general category might be something quite different from "Sm" (Symbol, math), for example "Po" (Punctuation, other).

Any processing by the formal properties of characters should be made with care. It might be suitable as a fallback, after you have dealt with all "expected" characters, including characters commonly used as replacements for newer, semantically exact characters.
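The "expected characters first, formal properties as a fallback" order might be sketched as follows. The sets of substitutes are assumptions chosen for illustration, not an authoritative list, and the function name `classify_operator` is hypothetical.

```python
import unicodedata

# Assumed, illustrative sets of characters seen as product symbols.
DOT_PRODUCT_CHARS = {"\u00B7", "\u2022", "\u2219", "\u22C5"}
CROSS_PRODUCT_CHARS = {"\u00D7", "\u2A2F"}

def classify_operator(ch: str) -> str:
    """Classify a character, consulting known usage before properties."""
    # Step 1: explicitly expected characters, including common substitutes.
    if ch in DOT_PRODUCT_CHARS:
        return "dot product"
    if ch in CROSS_PRODUCT_CHARS:
        return "cross product"
    # Step 2: fall back on the formally defined general category.
    if unicodedata.category(ch) == "Sm":
        return "other math symbol"
    return "unknown"

for ch in ("\u00B7", "\u00D7", "+", "a"):
    print(f"U+{ord(ch):04X}: {classify_operator(ch)}")
```

With this ordering, the middle dot is recognized as a dot product symbol even though its general category is Po rather than Sm.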
