11.1. Characters in Computer Languages

What do we really mean when we say that a particular programming language has such-and-such a character repertoire? Ranging from the narrowest to the broadest, the interpretations are:

1. The characters you need for writing the basic constructs of the language, such as operators and punctuation. This is almost always a subset of ASCII.
2. The characters used in the basic constructs and identifiers that the programmer chooses to use as names for variables, arrays, functions, etc. This too is usually a subset of ASCII, but the modern trend is to allow a larger repertoire of letters as identifiers. This lets a programmer name her variables in her native language. The repertoire might even consist of almost all Unicode characters, with special arrangements to make it possible to parse source code unambiguously. After all, we need to know the start and end of an identifier and distinguish identifiers from other symbols.
3. The characters that can be used in the above constructs or in character and string literals. Most programming languages do not let you use, for example, an accented letter like é in an identifier, but they may well allow it in literals like 'é' or "égalité".
4. All the characters that are allowed in source programs. This includes, in addition to the characters discussed above, characters that can be used in comments. Usually you can write anything into comments, but there might be some limitations.
5. All the characters that are expressible in source programs. This can be a larger repertoire than the characters that are allowed as such, due to various "escape" mechanisms. Even if a language might not allow you to enter, say, a Cyrillic letter into a source program (even in a literal), it may well let you write a character constant that has a Cyrillic letter as its value, such as '\u042f' (which denotes Я).

Consider a Perl statement such as the following:

$msg = "See § 5 \x{2665}";   # “Just a demo”

In this example, the dollar symbol, the quotation marks, and the semicolon are basic symbols of Perl (item 1). The string msg contains characters allowed in a name (item 2). Inside a string constant, where the rules can be more permissive (item 3), the non-ASCII character § might be allowed, depending on implementation. Anything following the character # is a comment, so anything goes (item 4), including "smart" quotation marks, which would not be allowed as string delimiters. The string constant contains a notation that refers to U+2665 (black heart suit, ♥), but not that character as such (item 5). Such a reference might work even in circumstances where that character cannot appear as such even in a comment, due to restrictions imposed by the character encoding.
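
The same distinctions can be observed in other languages. The following is a minimal sketch in Python 3 (chosen here only for illustration), assuming the source file is saved as UTF-8; the variable names are hypothetical.

```python
# Item 2: Python 3 accepts many non-ASCII letters in identifiers.
égalité = "égalité"        # item 3: a non-ASCII letter inside a string literal

# Item 5: an escape notation expresses a character without containing it.
ya = "\u042f"              # CYRILLIC CAPITAL LETTER YA, written in ASCII only
heart = "\u2665"           # BLACK HEART SUIT, written in ASCII only

# Item 4: comments may contain “smart” quotation marks and other characters.
print(ya == "Я", heart == "♥")
```

Whether the directly entered characters survive depends on the encoding in which the file is saved and read; the escaped forms do not have that dependency.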

In a markup language, the interpretations are similar, except for the last one, which does not exist. You don't process data in markup. The same applies to various descriptive languages, metalanguages, etc.

Thus, whenever you see a statement like "language X supports Unicode," you should ask what it means. Usually it means, at most, that Unicode characters are allowed in the sense expressed in items 3 to 5, but sometimes also item 2, though with limitations.

Only a few programming languages have been designed to allow (and require) the use of non-ASCII characters in the basic constructs (item 1). In the early days of computing, some language definitions used special characters like ∧ (logical and, U+2227) as operators. Actual implementations used various replacement notations. Later, even specifications were written to use ASCII only.

The APL language is an exception. It is oriented toward processing of arrays and matrices, and it uses a collection of special symbols, all of which have been included in Unicode, some only due to their use in APL. The use of APL has always been relatively small, partly due to the special techniques (a special keyboard or special software) needed for writing it.

Sun Microsystems has started work on Fortress, a programming language that allows the use of symbols and notations in the tradition of mathematics and logic (e.g., a² ∈ A B). However, the language also defines ways of using symbols constructed from ASCII characters instead of the special characters. Information about Fortress is available at http://research.sun.com/projects/plrg/.

By definition, comments are ignored (skipped) by programming language compilers and interpreters or, in the case of a markup language, by parsers and browsers. Thus, it is natural to expect that we can use any characters inside comments, as long as we don't try to use a comment terminator inside a comment.

char ch = 'X'; /* A comment: I   C ☺  */

However, special characters could cause problems if they are in an encoding that is not recognized by compilers and interpreters. Interpreted wrongly, they might mess up the processing. This should not be a problem if you use an ISO 8859 encoding or UTF-8 and the compiler effectively processes it as ASCII, treating octets outside the ASCII range as unknown characters. It should then simply ignore such octets in comments.

Some old compilers are known to get confused by octets outside the ASCII range even if they occur only inside comments. If this occurs, try to get a better compiler.

11.1.1. Common Escape Notations

Many modern computer languages use "backslash escape" notations for characters inside character and string constants, and possibly in other contexts as well. Escape notations in general were discussed in Chapter 2.

A rather common set of conventions, historically largely based on the C programming language, is presented in Table 11-1. Various languages have deviations from and additions to these notations. Some of the notations, such as \b, are rarely used nowadays but often preserved in the repertoire for historical continuity. The notations are typically allowed in character and string constants that have enclosing quotation marks, but depending on the language definition, they might be allowed, for example, in unquoted values and identifiers, too.

Table 11-1. Widely available escape notations for characters
Notation	Unicode value	Explanation
`\a`	U+0007	(Audible) alert, BEL
`\b`	U+0008	Backspace (move one position backwards)
`\f`	U+000C	Form feed (page eject)
`\n`	Implementation-dependent	Newline; see "Line structure control," Chapter 9
`\r`	U+000D	Carriage return (move to start of line)
`\t`	U+0009	Horizontal tabulation, tab
`\v`	U+000B	Vertical tab
`\\`	U+005C	Reverse solidus (backslash) itself
`\"`	U+0022	ASCII quotation mark
`\'`	U+0027	ASCII apostrophe

The use of \" and \' is relevant in contexts where the quotation mark or apostrophe could otherwise be taken as terminating a character or string constant. For example, in order to write a string constant that means the three-character string a"b, you may need to write "a\"b".
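
As a quick check of Table 11-1, the following sketch uses Python, which happens to support all ten notations. Note that Python fixes \n as U+000A, whereas the table correctly describes it as implementation-dependent across languages in general.

```python
# Map each notation from Table 11-1 to the character Python assigns to it.
escapes = {
    r"\a": "\a", r"\b": "\b", r"\f": "\f", r"\n": "\n", r"\r": "\r",
    r"\t": "\t", r"\v": "\v", r"\\": "\\", r"\"": "\"", r"\'": "\'",
}

# Format each code point in the usual U+XXXX style.
code_points = {name: "U+{:04X}".format(ord(ch)) for name, ch in escapes.items()}

for name, cp in code_points.items():
    print(name, cp)

# The three-character string a"b written with an escaped quotation mark.
quoted = "a\"b"
print(len(quoted))
```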

Usually "backslash escapes" can also be used to specify characters by their code numbers. For example, \0 might be used to denote the null character U+0000. However, great care is needed when you change from one language to another, since there are essential differences. Usually the implied numbering is according to Unicode, but the range of permitted numbers varies, and might cover only the ASCII range (0–127) or ASCII and Latin 1 Supplement (0–255).

Moreover, the notations for numbers vary. For example, in C, a backslash followed by one to three digits is interpreted as octal (base 8), and \x followed by hexadecimal digits is interpreted as hexadecimal; there is no decimal form. In Java, the number is interpreted as octal, unless preceded by the letter u, in which case it is interpreted as hexadecimal and must consist of exactly four digits. Besides, there can be special rules for the number of digits.
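
To illustrate how one language settles these questions, here is a short sketch of the numeric escapes in Python string literals. The rules shown are Python's own and should not be assumed to carry over to C, Java, or other languages.

```python
null_char = "\0"          # octal escape: one to three octal digits
a_octal = "\101"          # octal 101 = decimal 65, the letter A
a_hex = "\x41"            # \x takes exactly two hexadecimal digits
a_u4 = "\u0041"           # \u takes exactly four hexadecimal digits
smiley = "\U0001F600"     # \U takes exactly eight hexadecimal digits

# Digit-count rules matter: "\1014" is the octal escape \101 followed by "4".
tricky = "\1014"

print(a_octal, a_hex, a_u4, tricky, hex(ord(smiley)))
```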

11.1.2. Characters in Markup Languages and CSS

Although markup languages (such as HTML and XML) and CSS, the stylesheet language, are not programming languages, we will discuss them to some extent in this chapter. One reason is that in dealing with characters, they resemble programming languages in many ways. Moreover, they are also used in conjunction with programming languages in a manner that often confuses people. Think about the following attempt at Perl code, meant to generate a piece of HTML code, the tag <p style="em">:

print "<p style="em">\n";   # This won't work!

This will fail, with an unfriendly error message, because the Perl interpreter treats the second quotation mark as terminating the quoted string. The problem is that we have a quoted string that needs to contain a quoted string in another language. In this particular case, there are many simple solutions, such as using single quotation marks in the HTML code. There are, however, more difficult situations.

11.1.2.1. Characters in HTML and XML

The methods of using characters, including entity and character references, in HTML and XML were explained in Chapter 2 and Chapter 10. There are some finer points to be discussed here. What exactly is the repertoire of characters that you can use? How do HTML notations interfere with those of programming languages?

HTML and XML derive their escape notations, character references and entity references, from SGML (Standard Generalized Markup Language), which is far less known to most people than its descendants. The escape mechanisms of SGML are rather different from those of programming languages and include characters that cause some clashes. In a character reference like &#123; in SGML, the &# and ; parts are just particular instances (though the default instances) of the general concepts of Character Reference Open (CRO) and Reference Close (REFC). They can be changed to other symbols if desired for the needs of a particular markup system based on SGML. However, both HTML and XML have made such things fixed.

This has some implications, especially regarding the ampersand &, which is widely used for special purposes in programming languages and other notations. In particular, the ampersand is used as a separator between fields (name = value pairs) in the format of data generated from form submission. This means that URLs often contain ampersands. When you include a URL in an HTML document, you must therefore escape the ampersand. For example, to refer to http://www.google.com/search?hl=en&q=rosebud in a link in HTML, you should write:

<a href="http://www.google.com/search?hl=en&amp;q=rosebud">...

Contrary to popular belief, entity references are recognized in attribute values, too. This has often caused confusion, since people have failed to see the difference between a URL (which contains just & here) and the way of writing a URL in an HTML document.
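
When links are generated by a program, the escaping can be left to a library routine. The following sketch uses html.escape from the Python standard library; the link text "search" is a placeholder.

```python
from html import escape

# The URL as such contains a plain ampersand.
url = "http://www.google.com/search?hl=en&q=rosebud"

# Written into an HTML attribute value, the ampersand must become &amp;.
link = '<a href="{}">search</a>'.format(escape(url, quote=True))
print(link)
```

The variable url still holds the real URL; only its written form inside the HTML document changes.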

Luckily, the backslash character \ has no special role in HTML or XML. It is just a normal data character.

The HTML and XML specifications define that the document character set is ISO 10646. As explained in Chapter 4, this is effectively the same as saying that it is Unicode. However, the document character set relates only to the repertoire of characters that may appear in documents and specifically to the interpretation of character references, i.e., notations of the form &#n; or &#xn;. The document character set is the character code (mapping of integers to characters) according to which the n in such notations is to be interpreted.
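
The interpretation of the number in a character reference can be tried out with html.unescape from the Python standard library. This is only a sketch: that function follows HTML5 parsing rules, which is an assumption beyond the specifications discussed here, but for ordinary references the result is the same.

```python
from html import unescape

decimal_ref = unescape("&#123;")      # decimal code number 123
hex_ref = unescape("&#x7B;")          # the same number in hexadecimal
named_ref = unescape("&eacute;")      # an entity reference, for comparison

# Both numeric forms denote the character whose Unicode number is 123.
print(decimal_ref, hex_ref, named_ref, chr(123))
```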

In particular, HTML and XML specifications do not impose Unicode semantics on characters, for two reasons: they formally refer to ISO 10646, not the Unicode standard, and even if they referred to Unicode, this would not constitute a requirement on conformance to the standard. Of course, software that processes HTML or XML documents may apply Unicode semantics and rules, such as line breaking rules, but this is not a requirement. Only for some features related to directionality do HTML specifications refer to Unicode rules normatively.

The HTML specifications contain some special restrictions on the use of control characters, as listed in Table 11-2. There is usually little reason why control characters other than line breaks and sometimes horizontal tabs would appear in HTML documents. They may, however, appear due to conversions. The rules for them are somewhat different in HTML up to and including HTML 4.01 and in XHTML. (Technically, the SGML declaration for HTML 4.01 disallows U+000C, but the prose discusses it as an allowed character. It would anyway be whitespace and not a page eject character.)

Table 11-2. C0 and C1 Control characters in HTML
Character(s)	Explanation	Use in HTML
U+0000..U+0008	C0 Controls (part)	Forbidden
U+0009	Horizontal Tab	A whitespace character, may tabulate
U+000A	Line Feed	Line break; a whitespace character
U+000B	Vertical Tab	Forbidden
U+000C	Form Feed	Obscure in HTML, forbidden in XHTML
U+000D	Carriage Return	Line break; a whitespace character
U+000E..U+001F	C0 Controls (part)	Forbidden
U+007F	DEL (= Delete)	Disallowed in HTML, discouraged in XHTML
U+0080..U+0084	C1 Controls (part)	Disallowed in HTML, discouraged in XHTML
U+0085	NEL (= Next Line)	Disallowed in HTML, line break in XHTML
U+0086..U+009F	C1 Controls (part)	Disallowed in HTML, discouraged in XHTML

The specific restrictions in XHTML are derived from the XML 1.0 specification, which has a rigorous definition of allowed characters, or rather code points. By the specification, an XML processor must accept any code point (including unassigned code points) except certain control characters, the surrogate blocks, and two noncharacters, as shown in Table 11-3. On the other hand, the XML 1.0 specification declares some characters as discouraged. Discouraged characters are allowed and must be accepted by an XML processor, but authors are advised to avoid using them. They are:

All compatibility characters as defined in the Unicode standard.
The ranges U+1FFFE..U+1FFFF, U+2FFFE..U+2FFFF, etc., i.e., the last two code points of all planes except the BMP. They are noncharacters.
Some other specific ranges of code points; these are indicated in the table as "Discouraged."

Table 11-3. Characters and other code points in XML 1.0
Code point(s)	Explanation	Status in XML
U+0000..U+0008	C0 Controls (part)	Forbidden
U+0009	Horizontal Tab	OK
U+000A	Line Feed	OK (line break)
U+000B..U+000C	VT, FF	Forbidden
U+000D	Carriage Return	OK (line break)
U+000E..U+001F	C0 Controls (part)	Forbidden
U+0020..U+007E	Basic Latin (printable)	OK
U+007F..U+0084	Control characters	Discouraged
U+0085	NEL (= Next Line)	OK (line break)
U+0086..U+009F	C1 Controls (part)	Discouraged
U+00A0..U+D7FF	Various BMP characters	OK
U+D800..U+DFFF	Surrogates	Forbidden
U+E000..U+FDCF	Various BMP characters	OK
U+FDD0..U+FDDF	Noncharacters	Discouraged
U+FDE0..U+FFFD	Various BMP characters	OK
U+FFFE..U+FFFF	Noncharacters	Forbidden
U+10000..U+10FFFF	Non-BMP characters	OK with exceptions (see above)
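
The classification in Table 11-3 can be expressed compactly in code. The following sketch uses a hypothetical function name, xml10_status; it covers the ranges in the table and the plane-final noncharacters, but it does not attempt to detect compatibility characters, which are also discouraged.

```python
def xml10_status(cp: int) -> str:
    """Classify a code point per Table 11-3: 'ok', 'discouraged', or 'forbidden'."""
    if cp in (0x9, 0xA, 0xD, 0x85):
        return "ok"                       # tab and the line break characters
    if cp < 0x20 or 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF) or cp > 0x10FFFF:
        return "forbidden"                # other C0 controls, surrogates, U+FFFE/U+FFFF
    if 0x7F <= cp <= 0x9F or 0xFDD0 <= cp <= 0xFDDF:
        return "discouraged"              # DEL, C1 controls, BMP noncharacters
    if cp >= 0x10000 and (cp & 0xFFFF) >= 0xFFFE:
        return "discouraged"              # last two code points of planes 1 to 16
    return "ok"


for cp in (0x0, 0x9, 0x41, 0x7F, 0x85, 0xD800, 0xFDD0, 0xFFFE, 0x1FFFE):
    print("U+{:04X}".format(cp), xml10_status(cp))
```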

In XML 1.1, which has few implementations and less use than XML 1.0, the character concept is somewhat broader: all characters in the range U+0001..U+D7FF (i.e., including most control characters forbidden in XML 1.0) are permitted. The NUL character U+0000 is forbidden even in XML 1.1, to avoid problems with applications that may treat it as a string terminator. On the other hand, XML 1.1 allows C0 and C1 Controls (excluding the line break characters and the horizontal tab) only as character references such as &#x1; or &#x80;, not directly as data characters.

11.1.2.2. Problems in generating markup programmatically

When you write a program that generates markup, you often encounter the problem that the programming language and the markup language have different escape notations. This is, however, mostly a conceptual problem: you need to remember the conventions of both notations and not mix them with each other. Consider the following simple statement in the Perl language:

print "<p style=\"em\">\n";

Here we have solved the previously mentioned problem with quotation marks by escaping the inner quotation marks. In Perl strings, \" is an escape notation for the quotation mark as a data character. Usually there are many alternative ways of solving such problems.
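
The same choice between alternatives exists in other languages. As a sketch in Python (not Perl), the following three literals all denote the same string, so the choice is a matter of readability.

```python
# Escape the inner quotation marks with backslashes.
escaped = "<p style=\"em\">\n"

# Use the other kind of quotation mark as the string delimiter.
single = '<p style="em">\n'

# Use triple-quoted delimiters, inside which a lone " is ordinary data.
triple = """<p style="em">\n"""

print(escaped == single == triple)
```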

Here is a perhaps trickier example:

print "<p>The price is $100.</p>";    # Will print wrong data

The problem is that the dollar sign $, which is just an ordinary data character in HTML, has special meanings in Perl; for example, it starts a scalar variable, and $100 is a special variable. The program is in error, and it probably prints "<p>The price is .</p>" (without any error message or warning, unless you use the -w switch when invoking the Perl interpreter).

When you mix two languages, check your strings for problems with syntactically special characters and notations in either language.

Problems discussed here can usually be solved by modifying the code in either of the languages, usually with some kind of an escape notation. Moreover, there are typically two or more ways of doing that. In the last example, it would be simplest to solve the problem at the Perl level, either by using single quotation marks (since inside them, the dollar sign loses its special meaning) or by escaping the dollar sign with backslash:

print '<p>The price is $100.</p>';     # OK, but implies limitations
print "<p>The price is \$100.</p>";    # A better solution
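
A comparable clash can be shown in Python, whose string.Template class also gives the dollar sign a special meaning. This is only an analogy to the Perl case, since the behavior differs: Python rejects the string with an error instead of silently printing wrong data, and the escape is $$ rather than a backslash.

```python
from string import Template

text = "<p>The price is $100.</p>"

# "$100" is not a valid placeholder name, so substitution is refused.
try:
    Template(text).substitute()
    outcome = "substituted"
except ValueError:
    outcome = "rejected"

# Doubling the dollar sign makes it an ordinary data character.
escaped = Template("<p>The price is $$100.</p>").substitute()

print(outcome)
print(escaped)
```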

11.1.2.3. Problems in using scripts inside HTML

There is another way to "nest" HTML and a programming language: putting a program inside an HTML document. You might wish to show program source code in an HTML page if you are writing about programming or documenting a program. This would mean that the program source code is normal textual content, so the usual rules for escaping < and & in HTML will apply. Here is an example of HTML markup, for text containing the (somewhat artificial) C language expression &x<y (note that the code markup does not affect the interpretation of <, &, etc.):

<p>Consider the statement <code>&amp;x&lt;y</code>.</p>

A more difficult question arises if you wish to use program code to be executed by the browser, i.e., client-side scripting. You would typically use JavaScript, and you can attach a program (script) to an HTML document in three ways:

Write the program into an external file, say zap.js, and refer to it in HTML using an element like <script type="text/javascript" src="zap.js"></script>. This will avoid all problems discussed here, since in the external file, no HTML rules apply. The file could contain, for example, the JavaScript code alert("Hello&bye").
Write the program inside a script element, e.g., <script type="text/javascript">alert('Hello&bye');</script>. According to HTML specifications up to and including HTML 4.01, the HTML escape rules are not applied inside a script element, so you could and would have to write, for example, the ampersand as such, as in the example. In XHTML, the escapes are recognized inside a script element, so you would need to use them there. This makes things so complicated that it is much easier to write the code in an external script file (i.e., use the first way).
Write the program inside an event attribute such as onload or onclick, e.g., <body onload="alert('Hello&amp;bye')">. In this case, all HTML escape conventions apply. Moreover, you cannot use the same quotation mark in the JavaScript code as you have used as the attribute value delimiter in HTML. The common style is to use the double quote " in HTML, the single quote ' in JavaScript.
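
For the third way, the HTML escaping of the script can be done mechanically when the page is generated by a program. The following sketch uses html.escape from the Python standard library; note that it also escapes the single quotes, which is harmless but more than the minimum required.

```python
from html import escape, unescape

# The JavaScript code as it should reach the JavaScript interpreter.
script = "alert('Hello&bye')"

# The same code as it must be written inside an HTML attribute value.
attribute_value = escape(script, quote=True)
tag = '<body onload="{}">'.format(attribute_value)
print(tag)

# An HTML parser undoes the escaping before the script is run.
print(unescape(attribute_value) == script)
```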

Things can be even more complicated, and that's not even rare. You might have HTML markup that contains JavaScript that generates HTML markup. For example, consider the following HTML element:

<script type="text/javascript">
  document.write('<div>Hello world<\/div>');
</script>

In the example, we have written \/, which is a JavaScript escape for the / character. Without such escaping, the browser would see </div> as an end tag, causing a syntax error in HTML.
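
If such a script is itself produced by a server-side program, there are three levels to keep apart. The sketch below, with the hypothetical helper name js_string_for_inline_script, assumes that the string syntax produced by json.dumps is acceptable as a JavaScript string literal, and then applies the \/ escape discussed above.

```python
import json


def js_string_for_inline_script(text: str) -> str:
    """Return a quoted JavaScript string literal safe to place in a script element."""
    literal = json.dumps(text)            # adds quotes, escapes " and backslash
    return literal.replace("</", "<\\/")  # keep "</" out of the HTML source


literal = js_string_for_inline_script("<div>Hello world</div>")
print("document.write({});".format(literal))
```

The escaped form is still read back as the original text, since \/ denotes the plain / character.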

You might now carry out a simple exercise: write an HTML document so that when it is opened, the message "Hello—world" appears in a pop-up window created by the JavaScript function alert. The basic code has been presented above, and you just need to find out how to express the em dash character, U+2014. In JavaScript, you can use the escape notation \u2014 for it. But could you also use an HTML character reference, and how would you do that, in the three ways discussed above?

11.1.2.4. Characters in CSS

A stylesheet written in CSS (Cascading Style Sheets) can use any encoding recognized by a browser on which it will be used. Usually only ASCII characters are used in CSS, so the encoding is not a big issue. However, you might wish to use non-ASCII characters in some special cases:

In identifiers such as element or class names, e.g., p.Einführung {...}
In property values such as font names, e.g., font-family: Lübeck
In strings, e.g., quotes: "\201d" "\201d";

CSS code may appear in a separate file or as embedded into an HTML document. In the latter case, it of course shares the encoding of the HTML document. In the former case, the web server (HTTP server) should announce the encoding, as for HTML documents (see Chapter 10). This is problematic, and for casual use of non-ASCII characters, it might be best to use escape notations.

The basic escape mechanism for characters in CSS is simple and similar to the general mechanisms in programming languages. You start an escape with a backslash (reverse solidus) \ and then you write the Unicode code number in hexadecimal. Recognizing the end of the notation is somewhat problematic. The rules were briefly described in Chapter 2, but here we will present them in more detail and also list some alternative notations.

CSS has three kinds of uses for the backslash:

A backslash immediately followed by a line break is ignored together with the line break. Thus, a \ at the end of a line is used for continuation lines. In practice, it is used inside a string that must not contain a line break, when we wish to keep the physical line length reasonably small.
Any single character except a hexadecimal digit can be escaped by prefixing it with a backslash. This notation is useful when the character itself would not be syntactically permitted or would have a special meaning. Thus, \\ means the backslash itself as data character, \" means the ASCII quotation mark, etc.
A backslash followed by one to six hexadecimal digits denotes the character with that code number. For example, \2013 means U+2013, the en dash "–". If the notation is followed by a character that is a hexadecimal digit (e.g., you would like to express "1–4"), the end of the notation needs to be indicated. There are two ways to do this: use exactly six digits, e.g., 1\0020134; or put a whitespace character after the last hexadecimal digit, e.g., 1\2013 4. The whitespace character will be ignored by a program that processes the CSS code, and a CR LF pair will be counted as one character in this context. This is a convenient method, and you could use an extra space routinely, even when not needed. However, the convention implies that if the escape notation should be followed by a real space character, the space needs to be doubled or escaped. For example, "1 – 4" would be written as 1 \2013  4 (with two spaces before "4") or as 1 \2013\ 4.
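
The three rules can be summarized as a small decoder. The following is a sketch in Python with hypothetical function names, css_unescape and css_escape_char; it follows the rules as described above and replaces out-of-range code numbers with U+FFFD, which is a simplification rather than a statement of what CSS implementations do.

```python
import re

_CSS_ESCAPE = re.compile(
    r"\\(?:"
    r"(\r\n|[\n\r\f])"                           # 1: backslash + line break
    r"|([0-9a-fA-F]{1,6})(\r\n|[ \t\r\n\f])?"    # 2: hex digits, 3: one optional whitespace
    r"|(.)"                                      # 4: any other single character
    r")",
    re.DOTALL,
)


def css_unescape(text: str) -> str:
    """Resolve CSS backslash notations in text."""
    def repl(m):
        if m.group(1) is not None:
            return ""                            # continuation line: drop both
        if m.group(2) is not None:
            cp = int(m.group(2), 16)
            return chr(cp) if 0 < cp <= 0x10FFFF else "\ufffd"
        return m.group(4)                        # the escaped character itself
    return _CSS_ESCAPE.sub(repl, text)


def css_escape_char(ch: str) -> str:
    """Escape one character as hexadecimal, always adding the terminating space."""
    return "\\{:x} ".format(ord(ch))


print(css_unescape(r"1\2013 4"), css_unescape(r"1\0020134"))
```

Always emitting the terminating space, as css_escape_char does, corresponds to the routine use of an extra space suggested above.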

Let us suppose that your HTML document contains <p class="Einführung"> or, equivalently, uses an entity reference, <p class="Einf&uuml;hrung">. If you write CSS code in a suitable encoding, you can enter the character ü (U+00FC) directly, but you can alternatively use the escape notation \fc for it, for example:

p.Einf\fc hrung { font-size: 120%; }

The point is that although HTML and CSS have quite different escape mechanisms, you can escape a character in both languages and have it interpreted the same way. You can also escape a character in one of the languages and use it as such in the other.

If your CSS code is embedded inside an HTML document, it is better to use CSS escapes rather than HTML escapes. One reason for this is that the latter are not always recognized:

In a style attribute, as in <p style="font-family: Lübeck">, HTML escapes are recognized. You could write L&uuml;beck or L&#252;beck there, but the CSS escaped form L\fc beck works, too.
In a style element, as in <style type="text/css">p { font-family: Lübeck }</style>, HTML escapes are not recognized according to HTML specifications up to and including HTML 4.01. The CSS escapes work, of course. (In XHTML, the processing of the content of style elements has been defined differently, so that HTML escapes are recognized.)

11.1.2.5. Identifiers in CSS

The HTML specifications do not prescribe the syntax of class names. It is left to stylesheet languages, and CSS is rather permissive. You don't often see non-ASCII characters in class names, though, because people are afraid of using them, and partly with good reason.

In practice, it is safest to use only ASCII letters, digits, and hyphen-minus characters in class names in HTML and CSS. However, a much wider range of characters is permitted in principle. In CSS, class names are identifiers, and CSS identifiers may include:

Letters "A" to "Z" and "a" to "z"
Digits "0" to "9"
ASCII hyphen (hyphen-minus) "-"
Underscore (low line) "_"
Any Unicode character from U+00A1 up
Any Unicode character in an escaped form, such as "\0000A0"

There are limitations on the first character of an identifier in CSS: it must not be a digit, and identifiers starting with the ASCII hyphen are allowed in some contexts only.

The rules for CSS identifiers are important when you use CSS in conjunction with XML, where non-ASCII characters may appear in element and attribute names. Even some ASCII characters may cause problems. For example, using the colon, :, in an attribute name is common in XML (e.g., in the attribute name xml:lang), but the colon is not permitted as such in a CSS identifier. The reason is that it has a special meaning in CSS syntax. It thus needs to be escaped, if the name is used in CSS (e.g., xml\:lang).
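
The identifier rules listed above can be turned into a small escaping routine. The following Python sketch uses a hypothetical function name, css_escape_identifier, and follows the list as given in this section (including the U+00A1 limit); it escapes a leading digit but does not deal with the context-dependent restriction on a leading hyphen.

```python
def css_escape_identifier(name: str) -> str:
    """Write name as a CSS identifier, escaping characters that are not allowed as such."""
    out = []
    for i, ch in enumerate(name):
        cp = ord(ch)
        is_digit = "0" <= ch <= "9"
        is_letter = "a" <= ch <= "z" or "A" <= ch <= "Z"
        if is_digit and i == 0:
            out.append("\\{:x} ".format(cp))     # a leading digit needs a hex escape
        elif is_letter or is_digit or ch in "-_" or cp >= 0xA1:
            out.append(ch)                       # allowed as such
        elif 0x21 <= cp <= 0x7E:
            out.append("\\" + ch)                # ASCII punctuation: backslash + character
        else:
            out.append("\\{:x} ".format(cp))     # space, controls, U+0080..U+00A0
    return "".join(out)


print(css_escape_identifier("xml:lang"))
print(css_escape_identifier("Einführung"))
```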