8.10. Mathematical and Technical Symbols

There is a large and growing number of characters that are used as special symbols in mathematical and technical texts, often with highly specialized meanings and in specialized contexts. The use of mathematical notations is increasingly common even in the social sciences and humanities. Rules for usage are generally well established, though with some typographic and other variation. See, for example, the extensive international standard ISO 31-11, "Quantities and Units. Part 11: Mathematical signs and symbols for use in the physical sciences and technology." The MathWorld web site http://mathworld.wolfram.com illustrates and explains the conventional mathematical notations.

In Unicode, digits and other numeric symbols appear in different script-specific blocks, including Basic Latin, of course. There are also some very commonly used mathematical operators and other symbols in blocks like Basic Latin, Latin-1 Supplement, and General Punctuation. In addition to these, there are several blocks for mathematical and technical symbols, allocated in a rather confusing way for historical reasons. An overview of this situation is given in Table 8-13. For more information, consult the Unicode Technical Report 25, "Unicode Support for Mathematics," http://www.unicode.org/reports/tr25/.

Table 8-13. Blocks containing mathematical and technical symbols
Code range	Name of block	Notes
0000..007F	Basic Latin	E.g., 0, 1, +, %, =, <
FF00..FFEF	Halfwidth and Fullwidth Forms	Clones of symbols, for CJK
0080..00FF	Latin-1 Supplement	E.g., ¬, ±, ², ×, ½
0370..03FF	Greek and Coptic	Used as symbols, e.g., π
2000..206F	General Punctuation	E.g., fraction slash, ⁄
2150..218F	Number Forms	Fractions, Roman numerals
2070..209F	Superscripts and Subscripts	Digits, parentheses, etc.
2100..214F	Letterlike Symbols	E.g., ℇ, ℕ
1D400..1D7FF	Mathematical Alphanumeric Symbols	Bold, italic, etc., variants
2190..21FF	Arrows	Basic arrows in various directions
2200..22FF	Mathematical Operators	E.g., ∆
2A00..2AFF	Supplemental Mathematical Operators	Variants of operators, etc.
27C0..27EF	Miscellaneous Mathematical Symbols-A	Modal logic, etc.
2980..29FF	Miscellaneous Mathematical Symbols-B	Brackets, fences, angles, etc.
27F0..27FF	Supplemental Arrows-A	Long arrows, etc.
2900..297F	Supplemental Arrows-B	Arrows with strokes, etc.
2B00..2BFF	Miscellaneous Symbols and Arrows	White and black arrows, etc.
25A0..25FF	Geometric Shapes	E.g., ▲, ◙
2500..257F	Box Drawing	Lines and corners for drawing boxes
2580..259F	Block Elements	Solid and shaded blocks
2400..243F	Control Pictures	Names of controls, e.g., ␀
2300..23FF	Miscellaneous Technical	E.g., ⌂, ⌇, ⌚
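
The scattered allocation shown in Table 8-13 can be made concrete with a small lookup. The following is a minimal Python sketch, not a complete block database: the `BLOCKS` list and the `block_of` function are illustrative names, and the list covers only some of the ranges in the table.

```python
# A subset of the block ranges listed in Table 8-13.
# This is an illustrative list, not a complete catalog of Unicode blocks.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2070, 0x209F, "Superscripts and Subscripts"),
    (0x2100, 0x214F, "Letterlike Symbols"),
    (0x2150, 0x218F, "Number Forms"),
    (0x2190, 0x21FF, "Arrows"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0x1D400, 0x1D7FF, "Mathematical Alphanumeric Symbols"),
]


def block_of(ch):
    """Return the name of the listed block that contains ch, or None."""
    code_point = ord(ch)
    for first, last, name in BLOCKS:
        if first <= code_point <= last:
            return name
    return None


# Plus sign, plus-minus sign, fraction slash, superscript two,
# superscript four, and minus sign all land in different places.
for ch in "+\u00b1\u2044\u00b2\u2074\u2212":
    print(f"U+{ord(ch):04X}  {block_of(ch)}")
```

Running the loop shows that six symbols a mathematician would regard as closely related are spread over five blocks.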

8.10.1. Superscripts and Subscripts

Superscripts are used partly as stylistic variation, as in writing "first" as "1st" with the letters "st" raised, rather than as plain "1st." On the other hand, superscripting is used to indicate exponentiation and other semantic relations; for example, "2³" is certainly not just a stylistic variant of "23." Subscripting is mostly a matter of established notational convention, as in "H₂O."

Both superscripting and subscripting are mostly something applied to character data, rather than part of the data itself. However, largely reflecting the practices of older character codes, Unicode contains some characters that are superscript or subscript variants of other characters, usually defined as compatibility equivalents. Many of them are letters, such as masculine ordinal indicator º (U+00BA), which is a superscript letter "o," and modifier letter small "h" ʰ (U+02B0), which is a phonetic symbol.

Superscript variants that can be used for mathematical purposes exist in Unicode for digits 0 to 9, letters "i" and "n," plus and minus sign, equals sign, and normal parentheses. For historical reasons, superscript variants of 1, 2, and 3 are not in the Superscripts and Subscripts block but in the Latin-1 Supplement. Subscript variants exist for digits 0 to 9, plus and minus sign, equals sign, and normal parentheses.

Thus, you could write relatively complicated superscripts or subscripts. However, this is not very common and it would not take you very far. You would inevitably meet restrictions in writing superscript or subscript expressions. Normally other methods are used, such as markup languages or special formatting, as discussed in Chapter 9.
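
To illustrate both the possibility and its limits, here is a minimal Python sketch that maps plain characters to their superscript and subscript variants. The table and function names are illustrative assumptions; the code points are the ones described above, with superscript 1, 2, and 3 taken from the Latin-1 Supplement.

```python
import unicodedata

PLAIN = "0123456789+-=()"

# Superscript 1, 2, 3 come from Latin-1 Supplement (U+00B9, U+00B2, U+00B3);
# the rest come from the Superscripts and Subscripts block.
SUPERSCRIPTS = str.maketrans(
    PLAIN,
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
    "\u207a\u207b\u207c\u207d\u207e",
)
SUBSCRIPTS = str.maketrans(
    PLAIN,
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
    "\u208a\u208b\u208c\u208d\u208e",
)


def to_superscript(text):
    """Replace digits, signs, and parentheses by superscript variants."""
    return text.translate(SUPERSCRIPTS)


def to_subscript(text):
    """Replace digits, signs, and parentheses by subscript variants."""
    return text.translate(SUBSCRIPTS)


print("2" + to_superscript("3"))       # 2³
print("H" + to_subscript("2") + "O")   # H₂O

# Compatibility normalization removes the distinction again,
# which is why "2³" must not be treated as a styled "23".
print(unicodedata.normalize("NFKC", "2" + to_superscript("3")))  # 23
```

Characters outside the small repertoire, such as most letters, pass through unchanged, which is exactly the restriction mentioned above.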

8.10.2. The Number Forms Block

The Number Forms block covers the range from U+2150 to U+218F and contains some relatively uninteresting characters, which are special presentations of some numerals. Almost all of them are compatibility characters. Currently the block contains only characters for Roman numerals and for some vulgar (common) fractions.

8.10.2.1. Roman numerals

The characters for Roman numerals are not meant to be used in normal text. Instead of U+2162 Roman numeral three, Ⅲ, you normally use a sequence of capital letters, "III." The special characters for Roman numerals have been included in Unicode for compatibility with other character codes.

It has been argued, though, that the special characters for Roman numerals might be preferable due to their more specific semantics. The character U+2160 Roman numeral one unambiguously denotes a number, while the Latin capital letter "I" has multiple uses. A speech generator, for example, would in principle be in a much better position to decide how to pronounce the notation. But this will probably remain just theory.
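
The difference in semantics can be observed with the Unicode character database. The following minimal Python sketch uses the standard `unicodedata` module; the helper name `describe` is an illustrative assumption.

```python
import unicodedata


def describe(ch):
    """Return (name, numeric value or None, compatibility form) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.numeric(ch, None),
        unicodedata.normalize("NFKC", ch),
    )


# The dedicated character carries a numeric value of its own,
# and compatibility normalization turns it into plain letters.
print(describe("\u2162"))  # ('ROMAN NUMERAL THREE', 3.0, 'III')

# The ordinary letter has no numeric value at all.
print(describe("I"))       # ('LATIN CAPITAL LETTER I', None, 'I')
```

A program that normalizes its input to NFKC therefore loses the distinction between the two notations.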

8.10.2.2. Fractions

Fractional numbers such as 1/4 (one fourth) are commonly written in linearized notation, using normal digits and a normal solidus (slash) character. However, in typesetting traditions, fractions are often presented in a different style, perhaps using special glyphs, like ¼. There are two basic variants of the style: "shilling" fractions, where the numerator and denominator are separated by a slanted slash, and "vertical" fractions, where the numerator is right above the denominator and there is a horizontal line between them.

Some frequently used fractions have been included in Unicode as separate characters. For example, there is the character U+00BC, vulgar fraction one quarter (¼), which is compatibility equivalent to the three-character sequence 1⁄4 (digit one, fraction slash U+2044, digit four). In most fonts, the appearance is that of a "shilling" fraction.

The only such fractions in ISO Latin 1 are ½, ¼, and ¾. They appeared in some typewriter keyboards and may still appear in some computer keyboards. Moreover, when you type, say, the characters 1/4 in succession, your word processor might convert the sequence to ¼, as described in Chapter 2. This can be undesirable especially if your document contains other fractions, like 1/3, which would appear in a quite different style.

In Unicode, the Number Forms block contains a few more fraction characters, namely for 1/3, 2/3, 1/5, 2/5, 3/5, 4/5, 1/6, 5/6, 1/8, 3/8, 5/8, 7/8, as well as for numerator one (1/). However, only a few fonts contain glyphs for them.

As a different approach, you could use the U+2044 fraction slash character. This character, absent in many fonts, has an appearance similar to that of the common solidus, though it is often more slanted, even at a 45° angle, as in ⁄. More important, it has special semantics, as suggested by its name. It unambiguously separates the numerator and the denominator of a fraction and never has any other meaning. Moreover, a program that is capable of rendering fractions in a classic typographic style should do that automatically. However, such behavior is not common in programs. In MS Word, you probably get just something like the following: 1⁄4 (i.e., normally rendered 1 and 4 separated with the fraction slash).
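
The relationship between the precomposed fraction characters and the fraction slash can be inspected programmatically. This is a minimal Python sketch using the standard `unicodedata` module; the function name `linearize_fractions` is an illustrative assumption, and the function is deliberately naive.

```python
import unicodedata

FRACTION_SLASH = "\u2044"


def linearize_fractions(text):
    """Rewrite precomposed fraction characters in linear form, e.g. 1/4.

    Naive on purpose: a mixed number such as "1" followed by U+00BC
    would become the ambiguous "11/4", so real text needs more care.
    """
    out = []
    for ch in text:
        # Fraction characters have a decomposition tagged <fraction>.
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            expanded = unicodedata.normalize("NFKC", ch)
            out.append(expanded.replace(FRACTION_SLASH, "/"))
        else:
            out.append(ch)
    return "".join(out)


# The compatibility decomposition of U+00BC uses the fraction slash,
# not the common solidus.
print(unicodedata.normalize("NFKC", "\u00bc") == "1" + FRACTION_SLASH + "4")  # True
print(unicodedata.numeric("\u00bc"))                 # 0.25
print(linearize_fractions("\u00bc and \u2153"))      # 1/4 and 1/3
```

Converting every fraction in a document to the same linear form is one way to avoid the mixed styles described above.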

Thus, if you wish to produce typographically formatted fractions, you mostly need tools above the character level, such as typesetting commands. The web page "How to create fractions in Word," http://word.mvps.org/FAQs/Formatting/CreateFraction.htm, illustrates some techniques in producing both "vertical" fractions and "shilling" fractions.

8.10.3. Characters in SI Notations

This subsection discusses the character-level issues of presenting values of physical quantities according to the SI, the International System of Units (Système international). For general information on the SI, please refer to the Metric System FAQ http://www.cl.cam.ac.uk/~mgk25/metric-system-faq.txt. Note especially its item 1.12, "What is the correct way of writing metric units?," which also mentions some practical typing methods not discussed here.

The organization responsible for the definition of SI units is the General Conference on Weights and Measures (CGPM), http://www.bipm.org/en/convention/cgpm/. Official information is also available from the Bureau International des Poids et Mesures (BIPM), see http://www.bipm.org/en/si/, and the National Institute of Standards and Technology (NIST), see http://physics.nist.gov/cuu/Units/. There are also international ISO standards and national standards on the use of the SI.

8.10.3.1. Conceptual levels of SI notations

The use of the SI can be considered at different levels, which are defined by different standards, conventions, and other norms:

Physical definitions of units, established by international conventions; the definitions are often complicated in order to be exact; and they need to name the units somehow, but the different language-dependent names are not defined in this context; example: "The meter is the length of the path travelled by light in vacuum during a time interval of 1/299 792 458 of a second."
Names of units, such as "metre" (British English), "meter" (U.S. English), "Meter" (German), "metri" (Finnish), etc.; these are defined by various language authorities, or just by common usage in a language community.
Symbols of units, such as "m" for the meter; these symbols are defined by international conventions and are intended for international use as such; however, in some cultures, otherwise applying the SI, language-dependent abbreviations are used as symbols, such as "кг" for kilogram in Russian.
Expression of quantities using a numeric value and a unit, perhaps with a prefix, such as "1,5 km" or "1.5 km," depending on language, or maybe, for example, "1.5×10³ m"; this is defined by international conventions, with additional recommendations from other sources, including national standards and publishers' rules.
The exact identification of characters used to write the expressions. Since the conventions generally do not identify characters except by showing them, this is a somewhat gray area; but it is the level that we are mostly interested in here.
Typography, such as the width of a space used to separate a number from a unit, or the use of a particular font to render a character like "m," such as Times New Roman "m" or Arial "m"; this is generally not standardized but left to typographers, except that there is a strong recommendation to use "upright" letters and not an italics font.

Here we mostly consider the last but one level, characters, or abstract characters to be more exact.

8.10.3.2. Notes on individual characters

Most characters used in SI notations can easily be identified as abstract characters, or more specifically, as Unicode characters. For example, the symbol of the meter, "m," is apparently the character named Latin small letter "m" in Unicode, with the code position 6D in hexadecimal; it is therefore denoted by U+006D in Unicode contexts. But the following characters need to be considered:

The multiplication symbols, which are used in numeric expressions like the alternative notations "1,5·10³" and "1.5×10³." They can apparently be identified with the Unicode characters middle dot (U+00B7) and multiplication sign (U+00D7). The former is also used in symbols for compound units such as "N·m" (newton meter; often written less suitably as "N m" or questionably as "Nm"). However, it can be argued that middle dot is a punctuation character and that the dot used for multiplication (called "half-high dot" in the ISO 31-0 standard) should be identified with U+22C5 dot operator, which is classified as a mathematical operator. A practical argument in favor of this is that the representative glyph for dot operator in the Unicode code chart is a larger dot than that of the middle dot, hence more noticeable and more suitable for use as an operator. In the Arial Unicode MS font, one of the few fonts that have a fairly good repertoire of mathematical symbols, the situation is the same, and the dot operator also sits at a somewhat higher position, which corresponds better to the notion of a multiplication operator. You can see this from the following samples, which contain (in Arial Unicode MS) the expression for pascal second, first using the middle dot, then using the dot operator: Pa·s Pa⋅s
The minus sign used before a number (in an exponent, too) is logically to be identified with the minus sign, U+2212. However, instead of this character, the en dash, U+2013, or (far more often) the ASCII hyphen-minus U+002D is used. A problem with these is that Unicode line breaking rules permit a line break after these characters. This creates the risk of having the sign appear at the end of a line and the number at the start of the next line. (This should not happen for the real minus sign.) There are various ways to try to avoid this problem, e.g., by using the nonstandard nobr markup in HTML authoring.
The space between a numeric value and a unit (or between unit symbols when multiplication of units is indicated in this less satisfactory way). It is difficult to say how the space is to be interpreted in Unicode, considering the multitude of space characters in Unicode. Presumably, any space character, excluding those with zero width, is acceptable. Using the no-break space U+00A0 character would help in preventing undesired line breaks between the number and the unit. Using the thin space U+2009 character would help in making the space narrower than a normal space between words. The problem is that these two cannot be combined in a single Unicode character, in the present repertoire of Unicode. There are different possible approaches:
The exponents used in some numeric values (such as "1.5×10³") as well as in many compound unit symbols (such as "m²" or "s⁻¹"). The numbers 2 and 3 as exponents can easily be represented using the characters for them, superscript two U+00B2 and superscript three U+00B3. Unicode also contains other digits and the minus sign as superscripts, but these characters have very limited support in programs and fonts. Hence, it is better to use the tools of text-processing systems or other methods (such as sup markup in HTML) to superscript them. For typographic reasons, it is best to represent all superscripts that way if you need anything other than just 2 or 3. Otherwise, the visual difference in superscripting of, for example, 2 and 1 is too disturbing.
The symbol of micro prefix, corresponding to multiplication by 10⁻⁶. An apparent candidate is the micro sign (U+00B5), µ, which is widely available in fonts. However, Unicode defines micro sign as a compatibility character that has Greek small letter mu U+03BC as its compatibility decomposition. This means that the two are distinct characters but the micro sign has been included for legacy reasons only, and the two are equivalent except perhaps for formatting information. In practice, the characters are very often similar in appearance. Since the micro sign is more widely available, it is probably to be preferred. It might also be argued that it has unambiguous semantics, whereas Greek small letter mu is primarily a letter and has varying other uses as well.
The symbol for ohm can be identified with the ohm sign (U+2126, in the Symbols Area). It has a specific meaning, but it is defined as canonical equivalent to Greek capital letter omega Ω (U+03A9), and the Unicode standard recommends using the latter. The ohm sign has somewhat wider support in fonts. If a font contains both, they may look somewhat different.
The symbols for minutes and seconds in expressions for angles should be identified with the prime (U+2032) and the double prime (U+2033). However, these characters are rarely available, so it is common to use the ASCII apostrophe (U+0027) and the ASCII quotation mark (U+0022) as surrogates. In visual appearance, prime and double prime are clearly slanted, whereas apostrophe and quotation mark should have straight (vertical) glyphs according to Unicode, and they often have.
Several letterlike symbols in Unicode denote characters used in the SI context, in a sense. However, this is mostly an illusion, and a misleading one. For example, the script small "l" (U+2113), is often used as a symbol for liter. However, the NIST Guide to SI units explicitly says: "The script letter is not an approved symbol for the liter." Such confusions will be separately discussed in the next section.
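
Several of the character choices in the list above can be expressed in a few lines of code. The following Python sketch is one possible reading of those notes, not an established convention: the function names `format_quantity` and `format_angle` are illustrative assumptions, and the no-break space is only one of the possible choices for the space between number and unit.

```python
import unicodedata

MINUS_SIGN = "\u2212"       # real minus sign, instead of hyphen-minus
NO_BREAK_SPACE = "\u00a0"   # keeps the number and the unit on one line
DEGREE_SIGN = "\u00b0"
PRIME = "\u2032"            # minutes of angle
DOUBLE_PRIME = "\u2033"     # seconds of angle


def format_quantity(number, unit):
    """Join a numeric string such as "-1.5" and a unit symbol."""
    return number.replace("-", MINUS_SIGN) + NO_BREAK_SPACE + unit


def format_angle(degrees, minutes, seconds):
    """Write an angle with prime and double prime, not ASCII quotes."""
    return f"{degrees}{DEGREE_SIGN}{minutes}{PRIME}{seconds}{DOUBLE_PRIME}"


print(format_quantity("-1.5", "km"))
print(format_angle(12, 30, 15))

# Micro sign: only compatibility equivalent to Greek small letter mu,
# so NFC keeps it and NFKC replaces it.
print(unicodedata.normalize("NFC", "\u00b5") == "\u00b5")    # True
print(unicodedata.normalize("NFKC", "\u00b5") == "\u03bc")   # True

# Ohm sign: canonically equivalent to Greek capital letter omega,
# so even NFC replaces it.
print(unicodedata.normalize("NFC", "\u2126") == "\u03a9")    # True
```

The normalization checks show why the micro sign survives ordinary (canonical) normalization while the ohm sign does not.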

8.10.3.3. Letterlike symbols and the SI

People interested in unit symbols and Unicode are often surprised to find that, for example, the unit "degree Celsius" has a symbol of its own, U+2103, presenting °C as a single character. Similarly, in the Letterlike Symbols block, there is U+2109 for degree Fahrenheit (a completely non-SI unit, of course), U+2127 for siemens, and U+212A for kelvin. Educated people may well think that it is better to use such specific characters, with limited semantics, especially if dealing with documents that might be read by a text-to-speech converter later on, or otherwise processed by software that might use semantic information about characters. They might also be seen as typographically suitable, since they allow detailed formatting that corresponds to the specific meanings.

But in addition to being poorly supported in most fonts, such characters are inadequate in principle, by Unicode rules. For example, degree Celsius U+2103 is compatibility equivalent to U+00B0 U+0043 (i.e., degree sign followed by letter C). It has little to do with typographic correctness. Rather, it is a matter of compatibility, so that data containing that character in some non-Unicode encoding can be encoded in Unicode without losing the distinction between that character and the U+00B0 U+0043 pair, should someone wish to retain that distinction. This means that the data can also be converted back to the original encoding, yielding the original data exactly. The character is not recommended for use in new, originally Unicode data. The Unicode standard says, in the discussion of unit symbols:

Unit Symbols. Several letterlike symbols are used to indicate units. In most cases, however, such as for SI units (Système International), the use of regular letters or other symbols is preferred. U+2113 SCRIPT SMALL L is commonly used as a non-SI symbol for the liter. Official SI usage prefers the regular lowercase letter l.

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances the regular letter should be used. In normal use, it is better to represent degrees Celsius "°C" with a sequence of U+00B0 DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than U+2103 DEGREE CELSIUS. For searching, treat these two sequences as identical.

Unfortunately, the Unicode standard has wrong information about the symbol for the liter. The official position in the SI system is that both "l" and "L" are allowed, with no expressed preference (although in the U.S., "L" is preferred by national authorities).

The special letterlike characters discussed here were taken into Unicode due to their presence in some character codes used in East Asia, such as the Japanese JIS X 0212. These characters do their job in allowing conversions between character codes without losing information. Problems arise when people use utilities like the Character Map (described in Chapter 2) without knowing the background and looking just at the characters and their names.

To conclude, it is both acceptable and advisable to use normal Latin letters as SI unit symbols, such as "K" for kelvin.
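
As a practical footnote, the following minimal Python sketch shows how the letterlike unit characters discussed in this section behave under normalization, and how existing data might be cleaned up. The mapping `REGULAR_FORMS` and the function `use_regular_letters` are illustrative assumptions, covering only the characters named in this section.

```python
import unicodedata

# Letterlike unit characters and the regular characters preferred instead.
REGULAR_FORMS = {
    0x2103: "\u00b0C",  # degree Celsius  -> degree sign + C
    0x2109: "\u00b0F",  # degree Fahrenheit -> degree sign + F
    0x2126: "\u03a9",   # ohm sign        -> Greek capital letter omega
    0x212A: "K",        # kelvin sign     -> Latin capital letter K
    0x212B: "\u00c5",   # angstrom sign   -> Latin capital letter A with ring
}


def use_regular_letters(text):
    """Replace the letterlike unit characters listed above."""
    return text.translate(REGULAR_FORMS)


# Canonical equivalents are replaced even by NFC...
print(unicodedata.normalize("NFC", "\u212a") == "K")            # True
print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")       # True

# ...whereas degree Celsius is only compatibility equivalent:
# NFC keeps it, NFKC replaces it with the two-character sequence.
print(unicodedata.normalize("NFC", "\u2103") == "\u2103")       # True
print(unicodedata.normalize("NFKC", "\u2103") == "\u00b0C")     # True

print(use_regular_letters("25\u00a0\u2103"))
```

An explicit mapping of this kind touches only the listed characters, whereas applying NFKC to a whole document would also fold superscripts, fractions, and other compatibility characters.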