3.1. Good Old ASCII

ASCII is still the set of characters that work safely in most text applications and on the Internet. Almost all programming languages, command languages, markup languages, Internet protocol headers, and many other notation systems still exclusively use ASCII in their basic syntax. They may allow other characters in contexts like quoted strings, but the commands, reserved words, and operators are written using good old ASCII. Moreover, most character codes currently in use can be regarded as extensions of ASCII: they preserve the meaning of code numbers 0 through 127 and add some more.

On the other hand, ASCII has a very small character repertoire. Historically, it was a big improvement over even more restricted character codes, but it was created at a time when bits were very expensive. ASCII was designed to be represented in 7 bits, and many character positions were reserved for control codes such as linefeed (LF) and escape (ESC). Only about a hundred character positions were assigned to printable characters.

Moreover, since the needs of programming were more important than those of text processing, the assignments use positions for many technical characters. Even "smart" quotation marks were omitted; the idea was that the ASCII quotation mark, ", was to be used as a neutral quotation mark.

3.1.1. American Origin

The name ASCII is originally an acronym for "American Standard Code for Information Interchange." The ASCII code was developed in the United States and standardized by ANSI, the American National Standards Institute. The standard is often referred to as ANSI X3.4-1986, but the current version is ANSI INCITS 4-1986 (R2002).

The creation of ASCII started in the late 1950s, and several additions and modifications were made in the 1960s. The 1963 version had several unassigned code positions. The ANSI standard, where those positions were assigned, mainly to accommodate lowercase letters, was approved in 1967/1968, and later modified slightly.

The name US-ASCII is also used, and is even the preferred name in some recommendations, to distinguish ASCII proper from different "national variants of ASCII." In principle, the name ASCII is unambiguous, since the "variants" are just different codes with more or less resemblance to ASCII and with names of their own.

Contrary to popular belief, the designers of ASCII did not limit the scope to the English language only. Some characters were included for the purpose of writing accented letters. For example, the tilde character ~ was meant to be overprinted on a letter (e.g., writing "n," Backspace, and ~ on paper to produce a character that looks like ñ). This never became popular, and the characters introduced for the purpose were used for other purposes as well, creating a conflict of interests in font design. But ASCII surely tried to address the needs of other languages as well.

3.1.2. The ASCII Repertoire

The following presentation contains the printable ASCII characters in the order of their code numbers (32 through 126), in rows of 16 characters, except for the last row, which has only 15 characters. The first character is the space, which is graphically empty, of course; it is often classified as a graphic character. The font used here is the monospace font used for computer code in this book:

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~

Thus, there are 6 x 16 - 2 = 94 graphic characters, if we do not count the space as graphic. They include 26 uppercase letters, 26 lowercase letters, and 10 digits, leaving only 32 code positions for other characters.
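The arithmetic above is easy to check mechanically. The following is a minimal Python sketch (Python is our choice for illustration; the variable names are hypothetical) that rebuilds the printable range from the code numbers and counts the groups mentioned in the text:

```python
import string

# Printable ASCII occupies code numbers 32 through 126.
printable = [chr(code) for code in range(32, 127)]

# Graphic characters: everything printable except the space (code 32).
graphic = [ch for ch in printable if ch != " "]

# Characters that are neither letters nor digits.
letters_and_digits = set(string.ascii_uppercase + string.ascii_lowercase + string.digits)
others = [ch for ch in graphic if ch not in letters_and_digits]

print(len(printable))  # 95 = 6 * 16 - 1
print(len(graphic))    # 94 = 6 * 16 - 2
print(len(others))     # 32 = 94 - 26 - 26 - 10
```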

The repertoire corresponds rather closely to the characters that can be written on old typewriters and similar devices. This is no coincidence, but intentional design. Only a few extra characters were added, such as the backslash \ (reverse solidus).

3.1.3. The ASCII Encoding

By design, ASCII is a 7-bit character code; i.e., each code number can be represented as an integer in binary notation using 7 bits. In the early days, ASCII data was sometimes packed into 7-bit bytes, e.g., putting 5 bytes into a 36-bit computer "word" (storage unit).
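To make the packing idea concrete, here is a small Python sketch of how five 7-bit characters fit into 36 bits. The bit layout (first character in the highest bits, lowest bit left unused) is an assumption made for illustration, since real machines differed in such details, and the function names are hypothetical:

```python
def pack_word(chars):
    """Pack exactly five ASCII characters into one 36-bit word (assumed layout)."""
    if len(chars) != 5:
        raise ValueError("expected exactly five characters")
    word = 0
    for ch in chars:
        code = ord(ch)
        if code > 127:
            raise ValueError(f"not an ASCII character: {ch!r}")
        word = (word << 7) | code   # append the next 7 bits
    return word << 1                # one bit is left over: 5 * 7 + 1 = 36


def unpack_word(word):
    """Recover the five characters from a word built by pack_word."""
    word >>= 1                      # drop the unused lowest bit
    codes = [(word >> shift) & 0x7F for shift in (28, 21, 14, 7, 0)]
    return "".join(chr(code) for code in codes)


word = pack_word("ASCII")
print(word < 2**36)        # True: the result fits in 36 bits
print(unpack_word(word))   # ASCII
```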

Nowadays, we almost always use an 8-bit byte, or octet, to represent an ASCII character. This leaves 1 bit (normally the most significant bit) unused. It has been used for various purposes, e.g., as a parity check bit, which helps to detect errors in data. In modern protocols and applications, the most significant bit is usually kept as zero. This, too, allows checks of a kind: if a text file purported to contain ASCII data has any octet with the most significant bit set, there is an error of some kind somewhere.

This makes the character encoding used for ASCII really simple: each code number, and hence each character, is represented as an octet with that number as its value, when interpreted as an integer.
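The encoding and the check on the most significant bit can be sketched in a few lines of Python. The function names are hypothetical, and even parity is only one possible convention for the parity bit, assumed here for the sake of the example:

```python
def encode_ascii(text):
    """Represent each character as one octet whose value is its code number."""
    codes = [ord(ch) for ch in text]
    if any(code > 127 for code in codes):
        raise ValueError("text contains non-ASCII characters")
    return bytes(codes)


def first_non_ascii(data):
    """Return the index of the first octet with the high bit set, or -1 if none."""
    for index, octet in enumerate(data):
        if octet & 0x80:
            return index
    return -1


def add_even_parity(octet):
    """Use the eighth bit as an even-parity bit over the seven data bits (assumed convention)."""
    ones = bin(octet & 0x7F).count("1")
    return (octet & 0x7F) | (0x80 if ones % 2 else 0)


print(encode_ascii("Hi"))                # b'Hi', i.e., octets 0x48 and 0x69
print(first_non_ascii(b"Sch\xfcler"))    # 3: the fourth octet is not ASCII
print(hex(add_even_parity(ord("C"))))    # 0xc3: "C" has three 1 bits, so the parity bit is set
```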

3.1.4. ISO 646 and National Variants of ASCII

There are several national variants of ASCII. Technically, each variant is a character code that is defined separately and has its own name. In such variants, some special characters have been replaced by national letters and other symbols. There is great variation here, and even within one country and for one language, there might be different variants. The variants have lost much of their significance because of more modern approaches to encoding characters, but they can still be in use for legacy data and legacy applications.

A large number of the variants have been defined on the basis of the international standard ISO 646, issued by the International Organization for Standardization (ISO). ISO 646 has a so-called International Reference Version (IRV) which is equivalent to ASCII; thus, in this context, "International" effectively means "English"! In some contexts, it might be politically correct to refer to ISO 646 IRV instead of ASCII.

ISO 646 defines a character set similar to US-ASCII but with code positions corresponding to US-ASCII characters @, [, \, ], {, |, and } as "national use positions." It also gives some liberties with characters #, $, ^, `, and ~. Ecma International has issued the ECMA-6 standard, which is equivalent in content to ISO 646 and is freely available on the Web via http://www.ecma-international.org. Ecma was originally the European Computer Manufacturers Association, but it is now a worldwide association for standardization, though with some European emphasis.

The ISO 646 standard is cited more officially as ISO/IEC 646, since it is a joint standard approved by ISO and the International Electrotechnical Commission (IEC). A similar note applies to many ISO standards mentioned in this book.

Within the framework of ISO 646, and also outside of it, several "national variants of ASCII" have been defined, assigning different letters and symbols to the "national use" positions. Thus, the characters that appear in those positions, including those in US-ASCII, are more or less "unsafe" in international data transfer, although this problem is losing significance. The trend is towards using the corresponding codes strictly for US-ASCII meanings; national characters are handled otherwise, giving them their own unique and universal code positions in character codes larger than ASCII. However, old software and devices as well as legacy data may still reflect various "national variants of ASCII."

In principle, the phrase "national variant of ASCII" is incorrect. The so-called variants are character codes that are defined independently, although they have been derived from ASCII. These codes are often registered and named in a manner that reflects the geographic scope. For example, a variant designed for use in Sweden and Finland has the primary name "SEN_850200_B" but also more understandable alias names like "ISO646-SE."

Table 3-1 lists ASCII characters that have been replaced by other characters in some "national variant of ASCII." That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use. The list of replacement characters is not intended to be exhaustive; it just shows some typical examples. The "Code" column specifies the ASCII (and Unicode) code number in hexadecimal.

Table 3-1. ASCII characters that vary in "national variants"
Code	Character	Unicode name	National variants
23	#	Number sign	£ Ù
24	$	Dollar sign	¤
40	@	Commercial at	É § Ä à ³
5B	[	Left square bracket	Ä Æ ° â ¡ ÿ é
5C	\	Reverse solidus	Ö Ø ç Ñ ½ ¥
5D	]	Right square bracket	Å Ü § ê é ¿ |
5E	^	Circumflex accent	Ü î
5F	_	Low line	è
60	`	Grave accent	é ä µ ô ù '
7B	{	Left curly bracket	ä æ é à ° ¨
7C	|	Vertical line	ö ø ù ò ñ f
7D	}	Right curly bracket	å ü è ç ¼
7E	~	Tilde	ü ¯ ß ¨ û ì ´ _

Thus, for example, text containing "foo[1]" might be displayed as "fooÄ1Å" when processed by software that assumes that the input is in a national variant of ASCII. Such software has become rare as the use of ISO 8859 and other wider character codes has become common, since almost all characters used in the national variants have been incorporated into an ISO 8859 character code. However, legacy data may contain characters that need to be interpreted according to some national variant. If you see a text containing the string "Sch}ler," odds are that } actually means ü. You need information on the legacy codes used in a cultural environment in order to make educated guesses in such situations. For a quick reference to such codes, presented graphically, consult the page http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html.
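Once you have guessed the variant, reinterpreting legacy text is a simple character-by-character substitution. The Python sketch below uses two partial, hand-written tables, one for a German variant and one for the Swedish and Finnish variant mentioned above. The tables and the function name are assumptions for illustration only, covering just a few positions; check the registered definition of a variant before relying on such a table:

```python
# Partial, hand-written tables (illustrative assumptions, not complete
# definitions): ASCII character on the left, national character on the right.
GERMAN = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")
SWEDISH_FINNISH = str.maketrans("[\\]{|}", "ÄÖÅäöå")


def reinterpret(text, table):
    """Replace ASCII characters by the characters a national variant puts in their positions."""
    return text.translate(table)


print(reinterpret("Sch}ler", GERMAN))            # Schüler
print(reinterpret("bl}b{r", SWEDISH_FINNISH))    # blåbär
```

Note that the substitution is blind: it would also turn a genuine } in program source into a letter, which is why the choice of table remains an educated guess.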

3.1.5. Subsets of ASCII for Safety

Mainly due to the national variants discussed in the previous section, some characters are less "safe" than others; i.e., they are more often transferred or interpreted incorrectly. In addition to the letters of the English alphabet (A-Z, a-z), the digits (0-9), and the space ( ), only the following characters can be regarded as really "safe" in data transmission:

! " % & ' ( ) * + , - . / : ; < = > ?

Even these characters might eventually be interpreted wrongly by the recipient. For example, a human reader could see a glyph for & as something other than what it is intended to denote. A program could interpret < as starting some special markup, or ? as a so-called wildcard character, etc.

When you need to name things (e.g., files, variables, or data fields), it is often best to use only the characters listed above, even if a wider character repertoire is possible. Naturally, you need to take into account any additional restrictions imposed by the applicable syntax. For example, the rules of a programming language might restrict the character repertoire in identifier names to letters, digits, and one or two other characters. On the other hand, the underscore (low line) character _ is often usable in names, and it normally works reliably.
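A check against the safe subset is easy to automate. The following Python sketch reports the characters of a proposed name that fall outside the letters, digits, space, and punctuation listed in this section. The function name and the optional `extra` parameter (for characters such as the underscore that you decide to allow) are hypothetical:

```python
import string

# The punctuation characters listed as safe in this section.
SAFE_PUNCTUATION = "!\"%&'()*+,-./:;<=>?"
SAFE_CHARACTERS = set(string.ascii_letters + string.digits + " " + SAFE_PUNCTUATION)


def unsafe_characters(name, extra=""):
    """Return the characters of name outside the safe subset, in order, without repeats."""
    allowed = SAFE_CHARACTERS | set(extra)
    found = []
    for ch in name:
        if ch not in allowed and ch not in found:
            found.append(ch)
    return found


print(unsafe_characters("report-2024.txt"))     # []
print(unsafe_characters("data[1]~.bak"))        # ['[', ']', '~']
print(unsafe_characters("my_file", extra="_"))  # []
```

Such a check says nothing about the syntax rules of a particular system, which may forbid some of the safe characters as well; it only flags characters that are risky in transmission.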

3.1.6. The Misnomer "8-bit ASCII"

The phrase "8-bit ASCII" is used surprisingly often. It follows from the earlier discussion of the ASCII encoding that in reality ASCII is strictly and unambiguously a 7-bit code in the sense that all code positions are in the range 0 through 127. It can be, and it usually is, represented using 8-bit bytes, but with the most significant bit always zero, or used for other purposes so that it is not part of the encoded form of a character.

The misnomer "8-bit ASCII" most often denotes windows-1252, the 8-bit code defined by Microsoft for use in the Western world. More generally, "8-bit ASCII" is used to refer to various character codes that are extensions of ASCII and mutually more or less incompatible. The character repertoire in such a code contains ASCII as a subset, the code numbers are in the range 0 through 255, and the code numbers of ASCII characters equal their ASCII codes.
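The difference between the shared ASCII part and the incompatible upper half can be observed directly. The Python sketch below decodes the same octets with four 8-bit codes for which Python's standard library happens to provide codecs; the selection of codes is ours, chosen for illustration:

```python
# Four 8-bit extensions of ASCII, by their Python codec names.
CODECS = ["cp1252", "latin-1", "cp437", "mac_roman"]

ascii_octets = bytes(range(128))
ascii_text = ascii_octets.decode("ascii")

# Code numbers 0 through 127 mean the same thing in every one of these codes.
same_lower_half = all(ascii_octets.decode(name) == ascii_text for name in CODECS)

# Code number 0x80 lies outside ASCII, and each code assigns it differently.
upper_half_sample = {name: b"\x80".decode(name) for name in CODECS}

print(same_lower_half)
for name, ch in upper_half_sample.items():
    print(f"{name}: U+{ord(ch):04X}")
```

A file labeled "8-bit ASCII" therefore tells you how to read octets 0 through 127, but not which of these codes to use for the rest.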