8.2. ASCII (Basic Latin)

In Chapter 3, we briefly described the ASCII characters and their encoding. Here we go into the details of the meanings of these characters. For technical reasons, ASCII characters are widely used even when more appropriate characters exist in Unicode. This is partly for historical reasons and partly because ASCII characters are well known, easy to type and process, and work reliably across platforms. As a result, many of the characters have multiple uses or, in other words, multiple semantics.

In the Unicode framework, ASCII characters constitute the very first block of Unicode, called Basic Latin and ranging from U+0000 to U+007F.

8.2.1. Names of ASCII Characters

The names of ASCII characters have a long history, and they can be rather misleading. For example, " (U+0022) is called "quotation mark," although it is not a correct quotation mark in English or human languages in general. The name "grave accent" for ` (U+0060) reflects one of the original intended uses, rather than actual practice. Many of the special characters in ASCII have a large variety of names in common usage, and the name used in Unicode usually corresponds to the choice made in ASCII.

Generally, the Unicode name of an ASCII character might be suitable in some official contexts, but not necessarily in more normal usage. As an example of the differences, Table 8-1 presents some ASCII characters for which the Unicode name and the name normally used in O'Reilly books are different.

Table 8-1. Some variation in names of ASCII characters

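| Character | Code   | Unicode name      | Name(s) used in O'Reilly books |
|-----------|--------|-------------------|--------------------------------|
| #         | U+0023 | Number sign       | Hash sign, sharp sign          |
| .         | U+002E | Full stop         | Period, dot                    |
| /         | U+002F | Solidus           | Slash                          |
| @         | U+0040 | Commercial at     | At sign                        |
| \         | U+005C | Reverse solidus   | Backslash                      |
| ^         | U+005E | Circumflex accent | Caret, circumflex              |
| `         | U+0060 | Grave accent      | Backquote, backtick            |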
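The official names in the table can be looked up programmatically. The following is a minimal sketch in Python, using the standard unicodedata module; the helper name unicode_names is made up for this illustration.

```python
import unicodedata

def unicode_names(chars):
    """Map each character to its code point label and official Unicode name."""
    return {ch: (f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars}

# The seven characters of Table 8-1 (the backslash is escaped in the literal).
for ch, (code, name) in unicode_names("#./@\\^`").items():
    print(code, ch, name)
```

The names come out in uppercase (for example, NUMBER SIGN), which is how the character database spells them.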


8.2.2. Alphanumeric Characters

Figure 8-1. Different renderings of common digits

The ASCII set contains the uppercase characters A–Z and the lowercase characters a–z as well as the common digits 0–9. The letters are often called basic Latin letters, though Latin has no "w" letter. It is more accurate to refer to the ASCII letters as letters of the English alphabet. The digits 0–9 are often called Arabic digits, but they differ from the original Arabic digits (٠,١,٢,٣, etc., also called Arabic-Indic digits), which are still in wide use in the Arabic world.
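The ASCII digits and the Arabic-Indic digits are distinct characters that carry the same numeric values. A minimal sketch in Python illustrates this with the standard unicodedata module; the function name digit_values is an assumption made for this example.

```python
import unicodedata

def digit_values(text):
    """Return the decimal digit value of each character in text."""
    return [unicodedata.digit(ch) for ch in text]

ascii_digits = "0123"
arabic_indic_digits = "\u0660\u0661\u0662\u0663"  # Arabic-Indic digits zero to three

print(digit_values(ascii_digits))
print(digit_values(arabic_indic_digits))
print(unicodedata.name(arabic_indic_digits[3]))
```

Both calls yield the same list of values, although the strings share no characters.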

In many computer languages, the ASCII alphanumeric characters are what you can use in names (identifiers), usually with the added requirement that the first character must be a letter. However, quite often a computer language allows more latitude, such as the use of the underline character "_" and possibly other characters as well in identifiers.

Typographically, there are different presentations of digits, illustrated in Figure 8-1:


Uppercase versus lowercase digits

Uppercase digits all have the same height, usually the same as the height of uppercase letters. They are also called modern style or lining digits (or figures or numbers). Lowercase digits vary in height and have ascenders and descenders. They have been traditionally used in print typography in running texts. They are also known as old style or non-lining digits. According to typographic rules, lowercase digits should not be used in an expression formed from digits and uppercase letters, like "ABC-123."


Equal-width (tabular) versus varying-width digits

In tabulated data, digits normally need to be of equal width to produce good appearance where numbers line up. Inside text, the widths of digits may vary. Often only the digit "1" has a width different from other digits.

Unicode, like character standards in general, makes neither of these distinctions. Thus, in plain text you cannot have both lowercase and uppercase digits. The distinction can be made at font level, when suitable fonts are available. Expert fonts may contain two sets of digits, lowercase and uppercase. Some techniques use Private Use characters for this; that is, they allocate, for example, lowercase digits to code positions that have been reserved for use by private agreements only. This is risky because the data becomes cryptic if information about the particular agreement is lost in data transfer and processing.

Most fonts commonly used in computers have equal-width uppercase digits. However, the Georgia font has lowercase varying-width digits. If you use such a font in the text of your document, you should use a different font for tabulated numbers.

8.2.3. Parentheses

ASCII contains three sets of paired parentheses:


Common parentheses, ( and )

Called left parenthesis and right parenthesis in Unicode. They are widely used both in natural languages and in computer languages. In natural languages, they usually enclose a parenthetic (less important) remark. In computer languages, they have different uses that might not have anything to do with importance. For example, arguments of a function are usually written in parentheses, as in mathematics, e.g., f(42, x+y).


Square brackets, [ and ]

Called left square bracket and right square bracket in Unicode. They are sometimes used in natural languages in special contexts (such as in a parenthetic remark inside a parenthetic remark [like here], or to indicate an addition or change in quoted text). In phonetics, brackets are used to denote that pronunciation is specified. In mathematics, square brackets are sometimes used as outer parentheses when parentheses are nested, as in 2[(a + b)/c]. In computer languages, brackets have a wide range of uses, often including the use in subscripted variables or array component selectors like a[i].


Braces, { and }

Called left curly bracket and right curly bracket in Unicode. They are rare in natural languages but relatively common in computer languages. In mathematics, they are sometimes used when parentheses are nested several levels. More commonly, they are used to denote sets; e.g., {5, 42, 83} is a set of three numbers.

The Unicode names of these characters have the attributes "left" and "right," although they are logically treated as opening and closing parentheses (and called that way in some standards). This is relevant because the physical appearance adapts to the writing direction of the text (see Chapter 7). Thus, if you have Arabic or Hebrew text containing a parenthetic expression, then the "left parenthesis" is located to the right of the enclosed expression and the "right parenthesis" is on the left side, since the text generally runs right to left. On the other hand, the parentheses are displayed as mirror images, so that the opening "left" parenthesis looks like a right parenthesis.
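The mirroring behavior is recorded as a character property, which can be inspected in code. The following is a minimal sketch in Python using the standard unicodedata module; the helper name bracket_info is hypothetical.

```python
import unicodedata

def bracket_info(ch):
    """Return the Unicode name of ch and whether it is mirrored in right-to-left text."""
    return unicodedata.name(ch), bool(unicodedata.mirrored(ch))

for ch in "()[]{}A":
    name, mirrored = bracket_info(ch)
    print(ch, name, "mirrored" if mirrored else "not mirrored")
```

All six ASCII parentheses report the mirrored property, whereas an ordinary letter such as "A" does not.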

The characters < and > are often used in a parenthesis-like manner and referred to as left and right angle bracket. Such usage, as well as "real" angle brackets, is discussed in the section "General Punctuation" later in this chapter.

8.2.4. Other Graphic Characters

The other graphic characters in ASCII will be described here in alphabetic order by their Unicode name, which deviates from the common name in some cases.

8.2.4.1. Ampersand & (U+0026)

In natural languages, this character normally means just "and." In some programming and command languages, it has a comparable meaning, as logical AND operator, as bitwise AND operator, as string concatenation operator, or as a sequential operator. But the ampersand also appears in many technical uses that have nothing to do with the meaning "and." For example, in the C programming language, &x denotes the address of x.

The visual appearance of this character varies a lot. In some designs, the character's origin as a ligature of "ET" (the Latin word for "and") can readily be seen.

8.2.4.2. Apostrophe ' (U+0027)

This character has mixed usage, usually as a punctuation character. In normal text, it is used either as an apostrophe as in the English word don't or as a single quotation mark. (In Unicode Version 1.0, this character was named "apostrophe-quote" to reflect this.) In both types of usage, the apostrophe is just a replacement used to overcome character repertoire limitations. It should not be confused with the typographically correct apostrophe, and it can be called "ASCII apostrophe" to emphasize this.

With regard to use as a single quote, compare the notes below on the use of the quotation mark. Analogously with the quotation mark, the apostrophe is defined (in Unicode) as having a "neutral (vertical)" glyph. This reflects its use as both an opening single quote and a closing single quote. However, in practice it may get displayed as slanted or even curved. As with the quotation mark, it is sometimes difficult to find out what really happens, since word processors may convert an apostrophe to a different character, often to a language-specific quotation mark.

Unicode defines modifier letter prime ʹ (U+02B9) and prime ′ (U+2032) as distinct characters. The former is used mainly in linguistics to denote primary stress or palatalization (e.g., when transliterating Cyrillic soft sign). The latter is used to denote minutes or feet, and in mathematics, to denote a derivative (differentiation). When only ASCII (or only ISO Latin 1) is available, the apostrophe can be used as a surrogate for those characters. It might look natural to use the acute accent ´ (which is slanted) for some such purposes, but since the whole idea is to use a replacement due to character repertoire restrictions, it is best to use a replacement that works most widely (due to being an ASCII character).

In ASCII, the apostrophe was intended to have secondary usage as acute accent, to be overprinted on a letter. This explains, in part, why the glyph is often slanted.

8.2.4.3. Asterisk * (U+002A)

The asterisk has a wide range of uses, including the following:

  • In natural languages, an asterisk or a sequence of asterisks is sometimes used as a reference to a footnote or a margin note.* Several other symbols, such as daggers and (superscript-style) digits and letters, are used for such purposes too. Due to glyph problems discussed below, it is probably best to avoid the use of asterisks for such purposes and use some other notation.

    * The footnote or margin note itself begins with the asterisk or sequence of asterisks.

  • The asterisk is sometimes used when indicating the year or date of birth, e.g., * 1952.

  • Especially in command languages, the asterisk is often used as a wildcard character that matches any string of characters. For example, *.txt as a command argument might refer to all filenames ending with .txt.

  • In regular expressions, the asterisk often denotes possible repetition. For example, depending on the particular regexp syntax, xy* might denote the set of strings consisting of an x followed by any number (including zero) of y's, i.e., x, xy, xyy, xyyy, etc.

  • In mathematics, the asterisk has several uses as an operator symbol of some kind. Generally, such uses are surrogate notations for various star-like symbols with more specific semantics. A double asterisk ** sometimes indicates exponentiation.

  • In linguistics, a leading asterisk before a word can be used to indicate a reconstructed form (e.g., "the word king probably derives from old Germanic *kuningaz"); it may also indicate an ungrammatical expression.

  • In Usenet postings and some other plain text contexts, the asterisk may also be used for *emphasis* (though using _underlines_ is more common).

  • One of the early uses was to make a series of asterisks a "check protector," to flank the amount of a check so one could not kite or change the value. That method was applied in punch cards and printers too, and it's still often used, for example, in password input, to help the user count characters but protect the password from prying eyes.

  • The asterisk is sometimes used to indicate a "masked out" character, as in "G*d."

  • In several programming languages, asterisk is the multiplication symbol, but it may also have other uses. For example, int *p; declares p as a pointer to int in C.
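The wildcard and regular-expression uses listed above give the asterisk quite different meanings. A minimal sketch in Python contrasts them, using the standard fnmatch and re modules; the list of sample names is made up for this illustration.

```python
import fnmatch
import re

names = ["notes.txt", "x", "xy", "xyy", "data.csv"]

# Wildcard use: * matches any string of characters.
txt_files = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# Regular-expression use: * means "zero or more of the preceding item".
xy_strings = [n for n in names if re.fullmatch(r"xy*", n)]

print(txt_files)
print(xy_strings)
```

The same character thus selects filenames by suffix in one notation and expresses repetition of the preceding "y" in the other.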

When writing or quoting expressions in computer languages that have the asterisk as part of the language syntax, the asterisk must of course be preserved. On the other hand, such usage should not be extended to other contexts, unless the limitations of the character repertoire prevent the use of better symbols. Specifically, ISO Latin 1 has a separate multiplication sign, × (U+00D7). In some contexts the middle dot (·) is, somewhat arguably, an adequate multiplication symbol.

The glyphs for the asterisk vary, but generally it appears in a more or less superscript style, perhaps in a rather small size. It is difficult to say what an asterisk should look like, given its mixed usage. When used as an operator of some kind, it should be vertically positioned the same way as, for example, the plus sign. When used as a reference sign, and perhaps in some other uses too, it should appear in superscript style. It seems that most font designs reflect the latter style, making expressions like a*b look somewhat odd. If you cannot use a symbol with less ambiguous meaning, you might try to help things by using a font where the asterisk looks more operator-like, such as the Courier font, though even the Courier * is somewhat raised. Quite often it might be better to use a monospace font for all expressions (like a*b) quoted from programming and command languages, etc.

The Unicode standard mentions that the asterisk is called "star" on phone keypads. It also mentions that the asterisk is distinct from Arabic five-pointed star ٭ (U+066D), asterisk operator ∗ (U+2217), and heavy asterisk ✱ (U+2731). Note that this list of Unicode characters resembling the asterisk in appearance is far from complete; there are many more, especially in the Dingbats block.

8.2.4.4. Circumflex accent ^ (U+005E)

This character, often called just "circumflex" or "arrow," is used for a variety of technical purposes, e.g., in programming and command languages. It might, for example, be used as an exponentiation operator in linear notation (a^b meaning a raised to the power b). In regular expression syntax (see Chapter 11), the circumflex typically matches the start of a string.
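How the circumflex is interpreted depends entirely on the language. As one illustration, the following minimal sketch uses Python (chosen here only as an example language), where ^ is the bitwise exclusive-or operator rather than exponentiation, while in regular expressions it anchors a match to the start of the string.

```python
import re

xor_result = 2 ^ 3      # bitwise XOR in Python, not a power
power_result = 2 ** 3   # Python writes exponentiation with a double asterisk

# In a regular expression, ^ anchors the pattern to the start of the string.
starts_with_foo = bool(re.search(r"^foo", "foobar"))
foo_elsewhere = bool(re.search(r"^foo", "barfoo"))

print(xor_result, power_result, starts_with_foo, foo_elsewhere)
```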

This character was introduced into ASCII for several purposes, including the use as a diacritic mark, with overprinting techniques. This never became common, and the usual shape of the character reflects much more the technical use: it is operator-like, relatively large, and rather different from a circumflex accent as used in a character like â. The name of the character is thus rather misleading.

In ASCII, this character has the primary name "upward arrow head," and "circumflex accent" appears there as a secondary name only.

8.2.4.5. Colon : (U+003A)

This character is used as a punctuation symbol in natural and other languages. The rules for using it vary from one language to another, and even from one authority to another.

The colon is also used when presenting ratios (proportions) as in "2:3," but in Unicode, you can use a more specific character, ratio ∶ (U+2236).

8.2.4.6. Comma , (U+002C)

Primarily this character is a punctuation symbol in natural languages. The rules for using it vary from one language to another and even from one authority to another.

In numbers, some languages (mainly English) use comma as thousands separator (e.g., "1,234" means one thousand two hundred thirty-four) whereas in many other languages it is used as a decimal point (e.g., "1,234" means the same as "1.234" in English). The Unicode standard mentions "decimal separator" as another name for the comma.
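Because the same string "1,234" can mean two different numbers, software has to know which convention applies. The following minimal sketch in Python formats one value under both conventions; the function name format_number and the style labels "english" and "swapped" are assumptions made for this example, not established terms.

```python
def format_number(value, style="english"):
    """Format value with two decimals, using either separator convention."""
    text = f"{value:,.2f}"  # comma as thousands separator, full stop as decimal point
    if style == "swapped":
        # Exchange the roles of comma and full stop.
        text = text.translate(str.maketrans(",.", ".,"))
    return text

print(format_number(1234.5))
print(format_number(1234.5, style="swapped"))
```

A real application would normally rely on locale data rather than a hand-written swap; the sketch only makes the ambiguity visible.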

In ASCII, the comma was intended to have secondary usage as cedilla.

The comma should not be confused with the Unicode character single low-9 quotation mark "‚" (U+201A), which is used in quotations in some usages.

8.2.4.7. Dollar sign $ (U+0024)

This character is a famous currency symbol, but its exact meaning is not quite clear. The Unicode standard explicitly says that this character is unambiguously dollar sign, not a generic currency symbol. On the other hand, this is not meant to limit the use to only those currencies that are named "dollar," still less the U.S. dollar only. The Unicode standard mentions "milreis" and "escudo" as alternative names for dollar sign, so obviously the symbol can be used to denote those currencies, too.

According to the Unicode standard, a glyph for the dollar sign may have one or two vertical bars. That is, the number of bars is a glyph difference, not character difference.

In computing, the dollar sign has secondary uses that may have nothing to do with any currency. For example, it can be a character that is allowed in identifiers, perhaps used to signal a reserved or otherwise special identifier.

8.2.4.8. Commercial at @ (U+0040)

This character was originally used in English in conjunction with unit prices in the meaning "each." Its name still reflects such usage, which is relatively rare, and often unknown in other languages.

This character has become most widely known as a separator in Internet email addresses, where it can be read as "at" rather naturally, as in jkorpela@cs.tut.fi. It has many other special uses, too, for example, in Perl to indicate that a symbol denotes an array.

There are many names in use in different languages for this character. Many of the names use words that try to describe the visual appearance or connotations, such as a monkey, or a sitting cat with a long tail.

8.2.4.9. Equals sign = (U+003D)

This character is used to denote equality both in mathematics (as in 2 + 2 = 4) and in other areas. It is distinct from the Unicode character ≡ (identical to, U+2261).

In programming languages, the equals sign very often means assignment, e.g., a = b + c means that the sum b + c is computed and the result is assigned to the variable a. This means that usually some other operator (such as ==, eq, or .EQ.) is used in a logical expression to test for equality.

8.2.4.10. Exclamation mark ! (U+0021)

This character is basically used as a punctuation character at the end of an exclamation. It is also used in mathematics to denote a factorial (as in "5!," which denotes 1 × 2 × 3 × 4 × 5). Many other special usages exist; e.g., in the C programming language, the exclamation mark denotes a "not" operator (negation)! The Unicode standard mentions the alternate names "factorial" and "bang."

This character is also used as a substitute for a similar-looking character, Latin letter retroflex click ǃ (U+01C3), used in the orthography of some African languages to denote a click sound, e.g., in the name "!Kung" (denoting a people in southern Africa). In principle, the two characters are distinct, despite similarity in glyph appearance.
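The distinction between the two look-alike characters shows up clearly in their properties. A minimal sketch in Python, using the standard unicodedata module, follows; the helper name describe is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return code point label, Unicode name, and general category of ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)

exclamation = "!"
click = "\u01c3"  # Latin letter retroflex click

print(describe(exclamation))
print(describe(click))
```

The exclamation mark is classified as punctuation (category Po), whereas the click character is classified as a letter (category Lo), so the two behave differently in, say, word segmentation.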

8.2.4.11. Full stop "." (U+002E)

In U.S. English, this character is known as "period" (which was the name used for it in Unicode Version 1.0). It is commonly used as a punctuation character but also for other purposes. The Unicode standard mentions the alternative names "dot" and "decimal point."

The Unicode standard uses this character to illustrate the principle that "a character may have a broader range of use than the most literal interpretation of its name might indicate" and admits that the name of a character can be misleading. It says: "U+002E full stop can represent a sentence period, an abbreviation period, a decimal number separator in English, a thousands number separator in German, and so on." Note that the use of the full stop as a thousands separator is discouraged in several standards, which recommend the use of some space character instead.

In addition to such usages, programming languages and other notations often use the full stop for purposes that do not correspond to natural-language punctuation (or the name "full stop"!) at all. In particular, it is often used as a separator between components of a hierarchic name, so that foo.bar could denote the bar component of a structure named foo (which might be read as "foo's bar").

The Unicode standard mentions that this character "may be rendered as a raised decimal point in old style numbers." This is to be taken as a warning against interpreting such a character as a middle dot (·).

8.2.4.12. Grave accent ` (U+0060)

This character, often called just "grave," is used for a variety of technical purposes, e.g., in programming and command languages. For example, in many Unix shells, the grave accent is a quoting character with a special meaning, "command substitution" (sometimes even called "grave command"!). In such a case, the value of the expression `foo` is the output from executing the command foo.

This character was introduced into ASCII for several purposes, including the use as a diacritic mark, to produce characters like è with overprinting techniques. This never became common. The technical uses of the character also remained relatively limited because the character is not very visible and because it is easily confused with some other characters.

Sometimes the grave accent is used in normal text as a single quote, especially to create the appearance of "smart" (asymmetric) quotes. In such style, people use the grave accent instead of an opening single quote and either the apostrophe or (less often) the acute accent ´ instead of a closing single quote, as in `this' or `this´. In some fonts, this looks relatively correct because the glyphs for the grave accent and the acute accent are (rather questionably) curly, quote-like. In processing natural language texts, it is usually reasonable to assume that a grave accent is meant to act as a quotation mark of some kind, since there is not much other usage for it in normal text. However, sometimes, for example, e` might be used to mean è.

When the American National Standards Institute adopted ASCII as a national standard, it added a provision for overloading the code positions 60 and 27 (hexadecimal) with the typographic characters left and right single quotation mark. This practice became widely used in some communities in the United States and is found in numerous older, and even some contemporary, English-language ASCII files. Naturally, unless output routines specifically handle the issue, this means that text meant to display as ‘foo’ will appear as `foo'. The design of the grave accent and the ASCII apostrophe in fonts may reflect attempts to make things less distracting by making them resemble single quotes.
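An output routine that handles this legacy convention could convert such pairs to real single quotation marks. The following is a minimal, heuristic sketch in Python; the function name curl_legacy_quotes is made up, and the pattern is an assumption that will misfire on quoted text containing apostrophes (as in don't) and on shell command substitution.

```python
import re

# A grave accent, then text without quotes or line breaks, then an apostrophe.
LEGACY_QUOTE = re.compile(r"`([^`'\n]*)'")

def curl_legacy_quotes(text):
    """Replace `text' pairs with left and right single quotation marks."""
    return LEGACY_QUOTE.sub("\u2018\\1\u2019", text)

print(curl_legacy_quotes("The command is called `foo' here."))
```

Text without a matching pair is left unchanged, so the function is safe to apply to lines that contain only a stray apostrophe.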

8.2.4.13. Greater-than sign > (U+003E)

This character primarily denotes a mathematical relation. It is widely used for some secondary purposes as well, such as in the role of a closing angle bracket, as described earlier.

Some programming languages avoid using > as an operator, or use it for some data types. A language might even have, say, > and "gt" as different "greater than" operators.

The character pair >= has often been used to mean "greater than or equal to." In Unicode, you can use the character greater-than or equal to ≥ (U+2265) instead.

8.2.4.14. Hyphen-minus "-" (U+002D)

This is a dual-purpose character: it can be used as a hyphen (punctuation character) or as a minus sign (mathematical symbol). It is usually called "hyphen" or "minus" depending on the context and meaning. The term "hyphen-minus" is used mostly in character standard contexts only. The Unicode standard mentions "hyphen or minus sign" as a synonym, but it is best avoided, since it often makes statements ambiguous.

Unicode contains two characters that can be used instead of the hyphen-minus character to resolve the ambiguity at character level: hyphen ‐ (U+2010) and minus sign − (U+2212). This may help to produce a better visual appearance, too. Usually the hyphen is relatively short and the minus sign is rather long, comparable to an en dash. One of the problems with hyphen-minus is that its glyph is usually so short that it does not look good and prominent enough in expressions like "-1" (for "minus one").

It is common to use a hyphen or two hyphens "--" as a replacement for an en dash "–" or em dash "—" when the dashes cannot be used. There are other hyphen-like characters in Unicode as well, to be discussed later in the Punctuation section.
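When text that uses the unambiguous hyphen and minus sign characters is passed to software expecting ASCII, the characters can be mapped back to hyphen-minus. A minimal sketch in Python follows; the table name HYPHEN_LIKE and the function name to_hyphen_minus are assumptions made for this illustration, and the mapping deliberately discards the distinction between hyphen and minus.

```python
# Map hyphen (U+2010) and minus sign (U+2212) to hyphen-minus (U+002D).
HYPHEN_LIKE = {0x2010: "-", 0x2212: "-"}

def to_hyphen_minus(text):
    """Replace hyphen and minus sign characters with the ASCII hyphen-minus."""
    return text.translate(HYPHEN_LIKE)

typographic = "\u221242"          # minus sign followed by the digits 4 and 2
ascii_form = to_hyphen_minus(typographic)

print(ascii_form, int(ascii_form))
```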

8.2.4.15. Less-than sign < (U+003C)

This character primarily denotes a mathematical relation. It is widely used for some secondary purposes as well, such as in the role of an angle bracket, as described earlier.

Some programming languages avoid using < as an operator, or use it for some data types. A language might even have, say, < and "lt" as different "less than" operators.

The character pair <= has often been used to mean "less than or equal to." In Unicode, you can use the character less-than or equal to ≤ (U+2264) instead.

8.2.4.16. Low line _ (U+005F)

This character is usually known as "underline" or "underscore."

Probably the most typical use of this character is to make long identifiers more readable in programming languages. Due to their general syntax, such languages generally do not allow spaces in identifiers; but several programming languages allow underscores in identifiers. For example, one could write number_of_events in such languages.

In plain text (e.g., in Usenet discussions) it is customary to use a low line before and after a word or phrase to indicate underlining of enclosed text, usually to denote emphasis (e.g., "this is _very_ important") due to lack of better methods. Some software automatically recognizes the notation and renders the expression in a more advanced way (e.g., with the word "very" underlined or italicized).

One of the original ideas was to use the low line for underlining text using overprinting. This is irrelevant these days, but the character might be used to create a horizontal line in plain text. It depends on the font whether successive underline characters are joined (____) or not (_ _ _ _).

8.2.4.17. Number sign # (U+0023)

The name of this character reflects its use to mean "number," as in "item #42" (meaning "item number 42, the 42nd item"). Such usage is mostly limited to U.S. English. More often, the word "number" is abbreviated as nr., no., n., or No. In U.S. English, the character is sometimes used to denote pound as a unit of weight (mass), e.g., in the paper industry "70#" means "70 lb."

In computer languages, this character has many different uses, and it is usually called a hash. In some of these uses, it relates to ordinal numbers. For example in HTML and XML, &#n; denotes the character that occupies code position n in Unicode. Mostly the # character is just a separator (e.g., indicating the rest of the line as comment) or has some special meaning assigned to it more or less arbitrarily, with no connection with numbering. It is used in web addresses (URL references), and the URL syntax specification calls it "crosshatch" character. Many other names are used as well, such as "octothorpe."
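The &#n; notation can be tried out with an HTML-aware library. The following is a minimal sketch in Python using the standard html module; the helper name numeric_reference is hypothetical.

```python
import html

def numeric_reference(ch):
    """Build a decimal numeric character reference for a single character."""
    return f"&#{ord(ch)};"

# Decode references: decimal 35 and hexadecimal 23 both denote the number sign.
decoded_decimal = html.unescape("&#35;")
decoded_hex = html.unescape("&#x23;")

print(decoded_decimal, decoded_hex, numeric_reference("\u00a3"))
```

Here the number sign serves purely as syntax, yet the reference it introduces does refer to a code position number.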

The number sign character unambiguously occupies code position 23 hexadecimal in ISO Latin 1 and in Unicode. The Unicode standard mentions "pound sign" as an alternative name, but here "pound" means the unit of weight, not currency. Further confusion has been caused by the varying definitions of ASCII and ISO 646, since some definitions allow the position 23 to be used either for # or for £ (the pound sterling sign), as "agreed between interested parties." Some programs and devices might still reflect this in their behavior (displaying £ when the data contains #).

In Unicode (and ISO Latin 1), the pound sign £ (as a currency symbol) is a completely independent symbol in its own code position, U+00A3.

The number sign has also been used as a surrogate for music sharp sign ♯ (U+266F), due to some similarity in appearance.

8.2.4.18. Percent sign % (U+0025)

This character is used after numbers, in the meaning "in the hundred" or "of each hundred." It is commonly used immediately after a number (e.g., 50%), but quite often, the official spelling requires a space (e.g., 50 %), although this depends on authority. In computer language notations, a space is often disallowed. For example, in a CSS stylesheet, width: 50% is correct, whereas width: 50 % would be incorrect. On the other hand, in natural languages, as well as in notations related to the International System of Units (SI), the official recommendations often require a space. If a space is used, it should be a no-break space, for obvious reasons.

In some situations, expressions like "o/o" are used instead of the percent sign. This might be a practical choice in a context where the per mille sign ‰ (U+2030) would be needed too but cannot be used due to technical restrictions. You might then use "o/oo" as a replacement, and therefore "o/o" too, for uniformity. However, contrary to popular belief, the percent sign has not evolved from "o/o" or "0/0" but from an abbreviation of the Latin words "pro cento," which mean "for a hundred."

In computer languages, the percent sign has very different uses, which might have nothing to do with percentages. For example, % is a modulus operator in C, and it indicates an identifier as a hash in Perl.

8.2.4.19. Plus sign + (U+002B)

This is the well-known plus sign, primarily used to denote addition and as a unary plus. It has many technical uses that have little or nothing to do with addition. It may indicate string concatenation, for example.

8.2.4.20. Question mark ? (U+003F)

This character is basically used as a punctuation character at the end of a direct question. The detailed rules for using it vary from one language to another and even from one authority to another. In some languages, some space is left before the question mark. In formal notations, the question mark has special meanings. In regular expressions, it typically marks the preceding item as optional; in filename patterns, it is often a wildcard character that represents any single character.

8.2.4.21. Quotation mark " (U+0022)

This punctuation character is a "symmetric" quotation mark as opposed to "smart" or "asymmetric" quotation marks. That is, when this character is used to mark quotations, the opening quote is identical with the closing quote. Its glyph should be "neutral" (vertical) to reflect this. The Unicode standard explicitly says about it: "neutral (vertical), used as opening or closing quotation mark." However, in practice, the appearance varies, and some fonts have a slightly slanted glyph for the quotation mark.

It is sometimes difficult to find out what really happens, since text-processing programs (word processors) like MS Word typically convert a quotation mark to a different character, as described in Chapter 2. Pressing the " key often inserts a language-specific quotation mark, perhaps a "smart" (curved) quotation mark in English text, a chevron (« or ») in French text, etc. Note that this means a replacement at the character level: the different quotation marks are different characters, not just different glyphs.

The name "quotation mark" is a historical relic: this character was the only double quotation mark used in computers when ASCII was developed. It was natural to call it just "quotation mark," and this name was kept even in Unicode. This creates problems, since often we need to talk about quotation marks in general (as we will do later in the section "General Punctuation"), but there is no official name for U+0022 that would let us identify it in such contexts. Thus, we may need to identify it by its code, or use an unofficial name like "ASCII quotation mark" or "machine quotation mark." Typographers may call the character an inch symbol, but this is actually incorrect: although the ASCII quotation mark is often used as a substitute for an inch symbol, the appropriate Unicode character for inch is the double prime ″ (U+2033).

When typewriters were designed, several simplifications were made to the use of characters. For physical and economic reasons, the character repertoire was kept small. Early typewriters often lacked even the digits 0 and 1, on the grounds that you could use letters "O" and "l" instead! Similarly, only one double quotation mark was included. The key cap might actually have a curved glyph like ”, to confuse us more. This approach was copied to early computer keyboards, and that's what we still mostly live with.

At the character level, this means that there is a huge amount of text data (both plain text and other formats) that uses the ASCII quotation mark for normal quotations. The use of ASCII quotation marks has become so common that you often find it even in printed matter and in other contexts where the author had no compelling technical reason to do so.

Why would you use the ASCII quotation mark in text processing? Well, if your text discusses C or JavaScript code or Unix commands, then the ASCII quotation mark is the correct character, e.g., in an assignment like str = "foo". Using a "smart" (curved) quotation mark would not be smart at all in such cases.

The Unicode standard explicitly says that "APL quote" is identical with the quotation mark. In addition to that, the quotation mark is used in many other programming and command languages, typically to delimit string constants. In some of such languages, a string can be delimited using either quotation marks or apostrophes with no change in meaning, whereas in some others there is a definite difference. For example, in the C language, quotation marks delimit string constants whereas apostrophes delimit character constants; in Perl, quotation marks allow variable substitution within the string, whereas apostrophes indicate a pure literal.
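Python, which is not mentioned above, is one example of a language in which the two delimiters are interchangeable. The following minimal sketch (with arbitrary variable names) shows this, and also why having two delimiters is convenient:

```python
# In Python, either delimiter produces the same string value.
with_quotation_marks = "foo"
with_apostrophes = 'foo'
same_value = with_quotation_marks == with_apostrophes

# Choosing one delimiter lets the other appear as ordinary data.
contains_quote = 'say "hi"'
contains_apostrophe = "it's"
```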

The quotation mark is often used instead of different symbols such as the inch sign, due to similarity in appearance. Table 8-2 shows some of them.

Table 8-2. Symbols that are often replaced by a quotation mark

Name                          Code    Character  Proper use
Double prime                  U+2033  ″          Inches or (in angles and times) seconds
Ditto mark                    U+3003  〃         Repetition of information, "the same"
Modifier letter double prime  U+02BA  ʺ          E.g., transliteration of Cyrillic "hard sign"


In ASCII, the quotation mark was intended to have secondary usage as dieresis (see section "Diacritic Marks" later in this chapter). That is, you were supposed to overprint, say, the letter "a" with a quotation mark to produce something that looks like ä. This was an odd idea, but it may have affected the design of some fonts.

8.2.4.22. Reverse solidus \ (U+005C)

This character is best known under the name "backslash." It has a wide range of uses in technical contexts, e.g., as a separator in hierarchical filenames in Windows and in several "escape notations," such as '\n', which denotes a line break character in many programming languages (see Chapter 11 for more examples). The reverse solidus was taken into character repertoires for special usage, such as to allow the construction of the symbols /\ and \/ for logical and and logical or from the reverse solidus and the solidus. This never became common, but quite different other uses were invented.

The reverse solidus is especially suitable for use in "escape notations" just because it is, in a sense, an artificial creation. Since it is not used in normal text, it will less likely be confused with normal data characters than other characters that might be used for "escaping." However, confusion may still arise when different notational systems that use the reverse solidus (for different purposes) are combined.
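The kind of confusion meant here can be sketched in Python, where both string literals and regular expressions use the reverse solidus for escaping. The example is only an illustration; the sample path is made up.

```python
import re

newline = "\n"        # escape notation: a single line break character
two_chars = r"\n"     # raw string: a reverse solidus followed by the letter n

# Two notational systems combined: the string literal syntax and the regular
# expression syntax both treat the reverse solidus as an escape, so matching
# one literal reverse solidus takes four of them in a non-raw literal.
pattern_plain = "\\\\"
pattern_raw = r"\\"

match = re.search(pattern_raw, "C:\\temp")
```

Raw string literals exist largely to remove one of the two escaping levels, which is why they are commonly used for regular expressions.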

In Unicode, the reverse solidus is regarded as distinct from set minus U+2216, which is used in mathematics as an operator on sets (meaning set difference), but conceivably, \ can be used as a surrogate for that character.

Rather often, the reverse solidus is confused with the solidus (slash) character, /. They are similar in shape, just slanted differently. But they are quite distinct characters and have different uses. They are rarely interchangeable. However, Internet Explorer treats the reverse solidus in a URL (where it is not permitted by the URL syntax) as the solidus.

8.2.4.23. Semicolon ; (U+003B)

This character is used as a punctuation symbol in natural and other languages. It is often used as a separator in lists of numbers with commas as the decimal separator (for example, "1,2; 1,3; 1,5," corresponding to "1.2, 1.3, 1.5" in common English notation). In many programming languages, semicolon is the statement separator or terminator.

8.2.4.24. Solidus / (U+002F)

The name "solidus" was taken from British English. This character is much more widely known as "slash" (which was its name in Unicode Version 1.0). It is sometimes called "virgule" or even "shilling" (which are alternative names mentioned in the Unicode standard) or "diagonal." Do not confuse it with the reverse solidus (backslash, \). Sometimes the solidus is called "forward slash" to distinguish it from the backslash.

The solidus is used for many different purposes, typically as a separator of some kind. Ambiguities easily arise. For example, a date notation like 3/4 might mean the 3rd of April, or the 4th of March. In the ISO 8601 notation for dates, the solidus is used when expressing a time interval (e.g., 1998-03-04/04-03 unambiguously means "from 4th of March to 3rd of April in 1998").
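The abbreviated interval notation can be expanded mechanically: components omitted from the end date are taken from the start date. The following is a minimal sketch under that assumption, handling only calendar dates written with hyphens; the function name parse_interval is hypothetical.

```python
from datetime import date

def parse_interval(text):
    """Parse 'YYYY-MM-DD/...' where the end may omit leading components."""
    start_text, end_text = text.split("/")
    start_parts = start_text.split("-")
    end_parts = end_text.split("-")
    # Components missing from the end are copied from the start.
    missing = len(start_parts) - len(end_parts)
    end_parts = start_parts[:missing] + end_parts
    start = date(*map(int, start_parts))
    end = date(*map(int, end_parts))
    return start, end

start, end = parse_interval("1998-03-04/04-03")
print(start.isoformat(), end.isoformat())
```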

Sometimes the solidus separates alternatives, e.g., on a form, with the suggestion to strike out the inapplicable alternative(s). In natural languages, the solidus is often used in a very confusing way, so that "foo/bar" might mean "foo or bar" or "foo alias bar" or "foo and bar," or something else. The ambiguity created that way might be intentional.

In HTML (and in other SGML- or XML-based markup languages), start and end tags are distinguished from each other by the presence of a solidus in the end tag, so that, for example, </cite> means "end of cite element."

In web addresses and other URLs, the solidus is a separator between hierarchic components. This usage is historically based on similar usage in pathnames in hierarchic filesystems.

Unicode defines fraction slash U+2044 and division slash U+2215 as characters distinct from solidus and from each other. The fraction slash is meant for use in fractional numbers, whereas the division slash is a division operator. In Unicode encoded data, you do not need to use these characters with more specific semantics; Unicode just allows you to make a distinction. The fraction slash may have a special visual effect, creating a vulgar fraction, as discussed in the section "The Number Forms block" later in this chapter.

8.2.4.25. Space " " (U+0020)

This is the well-known space character, also known as "blank." The abbreviation SP is often used for the name of the character. Sometimes the character symbol for space ␠ (U+2420) is used in instructions and descriptions referring to the use of a space. The ISO 8859-1 standard defines the space character formally as follows:

This character may be interpreted as a graphic character, a control character or as both. As a graphic character it has the visual representation consisting of the absence of a graphic symbol.

Usually a font contains a glyph for a space, but the glyph is empty (blank): it just takes some space. The width of a space varies considerably. Programs might also interpret a space as a control character, e.g., so that instead of using a particular glyph, the program just leaves some empty space. The width of this spacing may vary by circumstances. In particular, the inter-word gaps can be of different widths in visual presentation, especially when text is justified on both sides. Thus, spaces might be "stretchable" as well as "shrinkable." This will be discussed in "General Punctuation" later in the chapter.

The term whitespace character is often used in programming and markup contexts. It is a generalization of the space character and denotes a set of characters, typically including at least the space, some line break characters, and horizontal tab. The vertical tab is often included, too. For example, in the C programming language, the standard function isspace() tests for its argument being a whitespace character, not just a space.
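The difference between testing for a space and testing for whitespace can be sketched as follows. Python's str.isspace() is used here as a rough analogue of the C function; it is not identical to it, since it also accepts whitespace characters outside ASCII.

```python
candidates = {
    "space": " ",
    "horizontal tab": "\t",
    "line feed": "\n",
    "vertical tab": "\x0b",
    "form feed": "\x0c",
    "carriage return": "\r",
    "letter a": "a",
}

# str.isspace() tests for whitespace in general.
is_whitespace = {name: ch.isspace() for name, ch in candidates.items()}

# Comparing with " " tests for the space character only.
is_space = {name: ch == " " for name, ch in candidates.items()}
```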

8.2.4.26. Tilde ~ (U+007E)

This character has mixed usage. The word "tilde" is of Spanish origin and refers to a wavy diacritic mark, as in Spanish ñ (although in Spanish, the word "tilde" often denotes the acute accent, too!). The name of this character thus reflects one of the originally intended uses. Currently such use has little to do with tilde as an ASCII and Unicode character. In jargon, names like "squiggle" and "twiddle" are used.

In practice, tilde is used for a variety of technical purposes according to specific rules, e.g., in programming and command languages. For example, in many Unix shells, ~ denotes the user's home directory. Reflecting this tradition, on many web servers, people's web pages are named in a manner that involves the tilde character. In Windows systems, the mapping of Windows filenames to DOS-compatible filenames ("8+3 characters") uses tilde; e.g., LONGFILENAME.TXT may get mapped to LONGFI~1.TXT. In the C language, the tilde denotes a bitwise operator that complements each bit. In Perl, the tilde is used in matching operators.
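Two of these uses can be tried out in Python, which borrows the complement operator from C and offers a library function that mimics the shell convention. This is a sketch for illustration only; the result of the home directory expansion depends on the system.

```python
import os.path

# Bitwise complement: every bit is inverted. Python integers are unbounded,
# so a mask is applied to show the result for an 8-bit value.
value = 0b00001111
complement = ~value & 0xFF      # 0b11110000

# Home directory expansion, mirroring what many Unix shells do with ~.
home = os.path.expanduser("~")
```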

The glyph for tilde has varying shapes. Sometimes it looks like a diacritic tilde, but much more often it looks like an operator, placed vertically at the same level as a hyphen "-" or a little higher. The different uses of the tilde make it impossible to design a glyph that would be suitable for all, or even most, of the uses.

The overall tone in the Unicode standard is that the tilde character could and should often be replaced by characters with more specific semantics and more appropriate visual appearance. Care must be taken, however, since many computer languages explicitly define the tilde as the character to be used. Thus, the following recommendations apply basically to other contexts, such as prose texts, and only with caution:

  • For a symbol for negation in formal logic, use the not sign ¬ (U+00AC).

  • As a symbol for approximate value, use the almost equal to sign ≈ (U+2248).

  • For mathematical meanings like "varies with," "is proportional to," "is similar to," etc., use the tilde operator ∼ (U+223C).

  • As punctuation to denote alternation as well as in dictionary usage to indicate repetition of the defined term in examples, the visually wider character swung dash U+2053 is preferred in principle. However, almost all fonts lack this character, which was added to Unicode in Version 4.

  • As a spacing clone of a diacritic tilde (i.e., spacing counterpart of combining tilde U+0303), use the small tilde ˜ (U+02DC).
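The recommendations above can be collected into a small lookup table. The following Python sketch does that and uses the standard unicodedata module to confirm the names behind the code points; the dictionary keys are informal labels chosen for this example.

```python
import unicodedata

# Intended meaning -> a character more specific than the ASCII tilde.
tilde_alternatives = {
    "negation": "\u00AC",
    "approximate value": "\u2248",
    "similarity or proportionality": "\u223C",
    "alternation or repetition": "\u2053",
    "spacing diacritic": "\u02DC",
}

names = {use: unicodedata.name(ch) for use, ch in tilde_alternatives.items()}
for use, name in names.items():
    print(f"{use}: {name}")
```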

In ASCII, the tilde character has the primary name "overline" and a corresponding appearance; "tilde" was a secondary name only.

8.2.4.27. Vertical line | (U+007C)

This character is commonly known as "vertical bar" or just "bar." It is most typically used in formal languages (such as Backus-Naur Form, BNF) between alternatives, corresponding to the word "or." In mathematics, vertical lines are used around an expression to denote its absolute value, e.g., |-42| = 42. In some dictionaries, a vertical line is used to indicate a possible hyphenation point; there is also a quite different dictionary usage: to separate the invariable part of a word from the rest in a paragraph that describes several words that begin the same way (e.g., imitat|e ... -ion ... -ive). Several other usages exist, too, especially in technical contexts. In Unix shells, for example, this character is used to denote "piping," and the character itself is then often known as "pipe." Thus, ls | more means "execute the ls program directing its output to the more program as input."

When discussing characters in general, the name "vertical line" is preferable to "vertical bar," since in Unicode, there are several other characters named as vertical bar symbols. Among them, even light vertical bar U+2758 is intended to be thicker than vertical line!

In some old fonts and keyboards, this character appears as a broken vertical line. However, in Unicode (and Latin 1), the broken bar ¦ (U+00A6) is a completely distinct character, though very little used.

8.2.5. ASCII Control Characters (C0 Controls)

Character codes often contain code positions that are not assigned to any visible character but might be used for control purposes. For example, in communication between a terminal and a computer using the ASCII code, the computer could regard octet 3 as a request for terminating the currently running process. Some older character code standards contain explicit descriptions of such conventions. Newer standards just reserve some positions for such usage, to be defined in separate standards or agreements such as "C0 controls" (discussed below) and "C1 controls," or specifically ISO 6429, which is equivalent to ECMA-48, available from http://www.ecma-international.com.

ASCII, Unicode, and other standards reserve some code positions for eventual use for control purposes. Usually only a few of them, mainly those for line breaks, are defined in the standard itself. Somewhat confusingly, a standard may assign a name to such a code position. Such names (as in Table 8-3) may relate to actual or proposed usage, but they must not be taken as defining the meaning, or even as describing the most common usage.

Unicode does not assign official names to control codes, but in practice, various names and abbreviations taken from other standards are used. For example, U+000A is commonly called "line feed" (or "linefeed") or briefly "LF."
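The absence of an official name is visible in software that exposes the Unicode character database. As a small illustration, Python's standard unicodedata module reports the general category of U+000A but has no character name to return for it:

```python
import unicodedata

line_feed = "\u000A"

# Control codes belong to general category Cc ("Other, control").
category = unicodedata.category(line_feed)

# No character name is assigned, so the supplied default (None) is returned.
name = unicodedata.name(line_feed, None)
```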

8.2.5.1. Control characters or control codes?

It is a matter of rather arbitrary definition whether you regard "control characters" as characters or just codes (code positions reserved for control purposes). In character code standards, they are usually called characters. It is, however, important to realize that a "control character" has no visual appearance as such (not even emptiness). Instead, its control effects may include visual formatting.

When people read or write about characters, their idea of character may or may not include control characters. Usually the context and content will help in resolving this. For example, if someone says that a font has the same width for all characters, he is clearly excluding control characters, since they normally have no width.

8.2.5.2. Types of control characters

Control codes can be used for device control such as cursor movement, page eject, or changing colors. Quite often, they are used in combination with codes for graphic characters, so that a device driver is expected to interpret the combination as a specific command and not display the graphic character(s) contained in it. For example, in the classical VT100 controls, ESC followed by the code corresponding to the letter "A" or something more complicated (depending on mode settings) moves the cursor up. To take a different example, the Emacs editor treats ESC a as a request to move to the beginning of a sentence. Note that the ESC control code is logically distinct from the ESC key in a keyboard, and many other things than pressing ESC might cause the ESC control code to be sent. Also note that phrases like "escape sequence" are often used to refer to things that do not involve ESC at all and operate at a quite different level, such as writing \" to include the character " as data, instead of having it interpreted as a delimiter.
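The distinction between sequences that start with the ESC control code and "escape sequences" in the looser sense can be made concrete. The sketch below builds the cursor-up commands as they are commonly documented for VT100-compatible terminals; the constant and variable names are arbitrary, and nothing is sent to a terminal.

```python
ESC = "\x1b"                  # the ESC control code, U+001B

# Which form applies depends on the terminal's mode setting.
cursor_up_vt52 = ESC + "A"    # VT52-compatible mode
cursor_up_ansi = ESC + "[A"   # ANSI mode

# An "escape sequence" at a quite different level: the notation \" inside a
# string literal yields the character " and involves no ESC code at all.
quoted = "\""
```

Writing cursor_up_ansi to a compatible terminal would move the cursor; writing quoted would simply display a quotation mark.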

One possible form of device control is changing the way a device interprets the data (octets) that it receives. For example, a control code followed by some data in a specific format might be interpreted so that any subsequent octets are to be interpreted according to a table identified in some specific way. This is often called "code page switching," and it means that control codes could be used to change the character encoding. It is then more logical to consider the control codes and associated data at the level of fundamental interpretation of data rather than direct device control. The international standard ISO 2022 defines powerful facilities for using different 8-bit character codes in a document. However, such approaches did not gain popularity, and nowadays, Unicode has made them rather unimportant.

Widely used formatting control codes include carriage return (CR), linefeed (LF), and horizontal tab (HT), which in ASCII occupy code positions 13, 10, and 9. The names (or abbreviations) suggest generic meanings, but the actual meanings are defined partly in each character code definition, partly (and more importantly) by various other conventions above the character level. The formatting codes were previously often seen as a special case of device control, but nowadays, they are rather treated as indicating the line structure of text; see the section "Line Structure Control" later in this chapter.
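Both points, the fixed code positions and the varying conventions for line structure, can be illustrated with a few lines of Python. The sample strings are made up for this sketch.

```python
CR, LF, HT = "\r", "\n", "\t"
positions = (ord(CR), ord(LF), ord(HT))

# The same three lines under three different line break conventions.
samples = ["a\r\nb\r\nc", "a\nb\nc", "a\rb\rc"]

# str.splitlines() recognizes CR LF, LF, and CR alike.
lines = [sample.splitlines() for sample in samples]
```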

The horizontal tabulation HT (TAB) character, or tab for short, was previously used for real "tabbing" to some predefined writing position (tab stop), as on typewriters. The tab character is nowadays not used much for such purposes, partly because tab stop settings may vary, partly because more advanced tools (such as tables) exist. However, the tab is often used to indicate data boundaries, without implying any particular presentational effect. In particular, the "tab separated values" (TSV) data format is used to transfer data between spreadsheet applications, using line breaks to separate records (rows) and tabs to separate fields (cells) within records.
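Reading such data is straightforward. As a minimal sketch, Python's standard csv module can be told to use the tab as the field delimiter; the sample data is invented for this example.

```python
import csv
import io

tsv_text = "name\tcode\nspace\tU+0020\ntilde\tU+007E\n"

# Line breaks separate records; tabs separate fields within a record.
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)
```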

8.2.5.3. Visible symbols for control characters

Although a control character cannot have a graphic presentation (a glyph) in the same way as normal characters have, we sometimes use visual symbols to indicate the presence of control characters in a data stream. In Unicode, there is a separate block, Control Pictures, for such purposes. These characters have different shapes in different fonts (the symbol ␛, for instance, is drawn in more than one style). They are of course quite distinct from the control codes they symbolize. The symbol for escape ␛ (U+241B) is not the same as the escape U+001B.
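Because the Control Pictures block mirrors the layout of the control codes, a program can make control characters visible with simple arithmetic. The following sketch relies on that layout; the function name show_controls is hypothetical.

```python
def show_controls(text):
    """Replace C0 controls, space, and DEL with Control Pictures symbols."""
    out = []
    for ch in text:
        code = ord(ch)
        if code <= 0x20:
            # U+2400..U+2420 mirror U+0000..U+0020 (space included).
            out.append(chr(0x2400 + code))
        elif code == 0x7F:
            out.append("\u2421")      # symbol for delete
        else:
            out.append(ch)
    return "".join(out)

visible = show_controls("a\x1bb c\n")
print(visible)
```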

In manuals and instructions where you need to explicitly indicate the use of spaces, you might use the blank symbol ␢ (U+2422) or the open box ␣ (U+2423). The latter is probably more common and more easily recognizable. There is no specific character for indicating the Enter or Return key; a small image probably works best. Sometimes the symbol for newline ␤ (U+2424) is used. Beware that glyphs for it vary considerably, though they generally contain the letters "NL" in some style.

If you display a text file containing octets in the C0 Controls range on MS-DOS or in the DOS-like mode in Windows, you may get graphic characters like ☺. This is because in some DOS code pages (such as CP 437), octets in that range are treated as graphic characters. For a list, see http://czyborra.com/charsets/codepages.html.

On the other hand, a control code might occasionally be displayed, by some programs, in a visible form, perhaps describing the control action rather than the code. For example, upon receiving octet 3 in the terminal example at the start of this section, a program might echo back (onto the terminal) *** or INTERRUPT or ^C. All such notations are program-specific conventions.

Some control codes are sometimes named in a manner that seems to bind them to characters. In particular, control codes 1, 2, 3,... are often called control-A, control-B, control-C, etc. (or CTRL-A or Ctrl-A or C-A). This is associated with the fact that on many keyboards, control codes can be sent to a computer by using a special key labeled "Control" or "Ctrl" or something like that together with letter keys "A," "B," "C," etc. This in turn is related to the fact that the code numbers of characters and control codes have been assigned so that the code of "Control-X" is obtained from the code of the uppercase letter "X" by a simple operation (subtracting 64 decimal). However, such things imply no real relationships between letters and control codes. The control code 3, or "Control-C," is not a variant of letter C at all, and its meaning is not associated with the meaning of C.
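The arithmetic behind the Control-X naming can be written out as a short sketch. The function names are hypothetical, and the ^? form for DEL is a widespread convention that is added here as an assumption, not something stated above.

```python
def control_code(letter):
    """Return the control code obtained by subtracting 64 from a letter."""
    return ord(letter.upper()) - 64

def caret_notation(code):
    """Return the ^X notation for a C0 control code (0..31) or DEL (127)."""
    if not (0 <= code <= 31 or code == 127):
        raise ValueError("not a C0 control or DEL")
    # XOR with 64 adds 64 to codes 0..31 and maps 127 to 63 ("?").
    return "^" + chr(code ^ 0x40)

print(control_code("C"), caret_notation(3))
```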

8.2.5.4. Summary of C0 Controls

Although the meanings of control characters depend on specific agreements and often vary greatly, many of them have typical uses, which are reflected in their commonly used names. The following table contains additional notes on the usage, especially in text data. If you design an application or data format that uses C0 Controls, it is up to you to assign meanings to them. It is however advisable to use assignments that correspond to common usage, partly because this helps to avoid clashes with assignments in software that might interact with your system.

The C0 Controls consist of the first 32 code positions (U+0000..U+001F) in Unicode and ASCII as well as the last position in ASCII, U+007F. Table 8-3 lists their ASCII names. The primary Unicode names are somewhat different: U+0009 is character tabulation, U+000B is line tabulation, and U+001C..U+001F are information separator four, three, two, and one.

Table 8-3. C0 Controls

Code  Abbr.  Name                       Ctrl-x  Typical usage
0000  NUL    Null                       Ctrl-@  Data or time fill, or terminator
0001  SOH    Start of heading           Ctrl-A  Starts a message header
0002  STX    Start of text              Ctrl-B  Starts a message body
0003  ETX    End of text                Ctrl-C  End of text entity
0004  EOT    End of transmission        Ctrl-D  End of sending one or more texts
0005  ENQ    Enquiry                    Ctrl-E  Asks for identification
0006  ACK    Acknowledge                Ctrl-F  Affirmative response
0007  BEL    Bell                       Ctrl-G  Alarm, often audible (beep)
0008  BS     Backspace                  Ctrl-H  One character position backward
0009  HT     Horizontal tabulation      Ctrl-I  Move to next tab stop; separator
000A  LF     Line feed                  Ctrl-J  One line downward; line break
000B  VT     Vertical tabulation        Ctrl-K  Move downward
000C  FF     Form feed                  Ctrl-L  Page eject; page separator
000D  CR     Carriage return            Ctrl-M  Move to start of line; line break
000E  SO     Shift out                  Ctrl-N  Switch to alternate code page
000F  SI     Shift in                   Ctrl-O  Return from alternate code page
0010  DLE    Data link escape           Ctrl-P  Data transmission control
0011  DC1    Device control one         Ctrl-Q  Resume data transmission
0012  DC2    Device control two         Ctrl-R  Special mode of device operation
0013  DC3    Device control three       Ctrl-S  Pause data transmission
0014  DC4    Device control four        Ctrl-T  Deactivate ancillary device
0015  NAK    Negative acknowledge       Ctrl-U  Negative response to sender
0016  SYN    Synchronous idle           Ctrl-V  Synchronization of transmission
0017  ETB    End of transmission block  Ctrl-W  Transmission of data in blocks
0018  CAN    Cancel                     Ctrl-X  Ignore preceding data
0019  EM     End of medium              Ctrl-Y  End of medium or recorded data
001A  SUB    Substitute                 Ctrl-Z  Indicates invalid/erroneous data
001B  ESC    Escape                     Ctrl-[  Starts a control command
001C  FS     File separator             Ctrl-\  Delimits a set of data (file)
001D  GS     Group separator            Ctrl-]  Delimits a data group
001E  RS     Record separator           Ctrl-^  Delimits a line or other record
001F  US     Unit separator             Ctrl-_  Delimits a unit (field) of data
007F  DEL    Delete                             Data or time fill


The DEL character was originally used on punched tapes to delete a character by punching all seven hole positions, that is, by setting all seven bits to one. This explains its code position. Later, it was used as fill in a data stream. Do not confuse it with the effect of a Delete (or Del or Rubout) key, which often sends the code for backspace (BS, Ctrl-H).

Normal plain text data seldom contains C0 Controls except CR and LF to indicate line breaks, sometimes HT to indicate tabbing, and rarely VT or FF for vertical spacing. When reading text data in a program, occurrences of other C0 Controls can typically be treated as symptoms of data errors, unless there is a special agreement to use them.
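A check along these lines might look as follows. This is a minimal sketch that treats CR, LF, HT, VT, and FF as acceptable and flags every other C0 Control (including DEL); the names ALLOWED and unexpected_controls are hypothetical, and an application with special agreements would adjust the allowed set.

```python
ALLOWED = {"\r", "\n", "\t", "\x0b", "\x0c"}   # CR, LF, HT, VT, FF

def unexpected_controls(text):
    """Return (index, code) pairs for C0 controls and DEL outside ALLOWED."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if (ord(ch) < 0x20 or ord(ch) == 0x7F) and ch not in ALLOWED
    ]

found = unexpected_controls("ok\tline\r\nbad\x00data\x1b")
```

Reporting the position and code, instead of silently dropping the character, makes it easier to trace where the erroneous data came from.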

C1 Controls include, loosely speaking, the corresponding set of control characters in the upper half of 8-bit character codes, Unicode range U+0080..U+009F. However, there are different assignments for those positions; see http://www.itscj.ipsj.or.jp/ISO-IR/2-6.htm. Note that in Windows and Macintosh character sets, many of these positions have been assigned to graphic characters.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
