2.8. Escape Sequences

Characters can often be written using various "escape" notations. This rather vague term means notations that are later converted to (or just displayed as) characters according to some specific rules. The rules are applied by a program like a text formatter or web browser, and the rules depend on the context. They may belong to a markup, programming, or other computer language. (Programming language-related issues will be discussed in Chapter 11.) If different computer languages have similar conventions in this respect, a language designer may have picked up a notation from another language, or it might be a coincidence.

The phrase "escape sequence," or even "escape" for short, is rather widespread, and it reflects the general idea of escaping from the limitations of a character repertoire or device or protocol or something else. These notations should not be confused with the use of the ESC (escape) control code in ASCII and other character codes. Especially in old text, "escape sequence" may mean a sequence of characters starting with ESC and typically used for controlling a device. The "escape sequences" discussed here are strings of printable characters used in text, and we will emphasize this by using the term "escape notation."

2.8.1. Examples of Escape Notations

Table 2-4 illustrates the use of escape notations in some markup and other computer languages. It shows examples of notations for the character Ä (A with dieresis, U+00C4) and the string "-8 °C" (minus sign, digit eight, no-break space, degree sign, letter C). Often a computer language has several alternative escape notations for a character; only one of the possibilities is shown. The principles of the notations will be explained in some detail after the table. As you see, the notations are partly similar, partly quite different. Once you know a few of them, learning new ones will be easy, as long as you manage to keep the different systems separate in your mind.

Table 2-4. Escape notations in computer languages

Language or notation    Code for Ä    Code for -8 °C
CSS                     \c4           \2013 8\a0 \b0 C
HTML                    &Auml;        &minus;8&nbsp;&deg;C
PostScript              \304          (-8 \260C)
RTF                     \'c4          \u8722\'2d8\~\'b0C
TeX                     \"A           $-$8~\char'0260 C
XML (and HTML)          &#xC4;        &#x2013;8&#xA0;&#xB0;C


As you can see, the notations typically involve some (semi-)mnemonic name or the code number of the character, in some number system. The ISO 8859-1 code number for our example character Ä is 196 in decimal, 304 in octal, and C4 in hexadecimal. The notations contain some method of indicating that the letters or digits are not to be taken as such but as part of a special notation denoting a character. Often some specific character such as the backslash \ is used as an "escape character." This implies that such a character cannot be used as such in the language or format but must itself be "escaped." For example, to include the backslash itself in a string constant in C, you need to write it twice (\\).
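The relationship between the three number systems can be checked with a few lines of Python. This is only an illustrative sketch; Python is not one of the languages in Table 2-4, but its string constants happen to use the backslash as an escape character in the same spirit.

```python
# Code number of the letter A with dieresis in ISO 8859-1
# (identical to its Unicode number).
code = "\u00c4".encode("iso-8859-1")[0]

decimal = str(code)
octal = format(code, "o")        # the digits used in PostScript's \304
hexadecimal = format(code, "X")  # the digits used in CSS's \c4
print(decimal, octal, hexadecimal)  # prints: 196 304 C4

# Python's own escape notations: hexadecimal, octal, and by Unicode name.
variants = ["\xc4", "\304", "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}"]

# The escape character must itself be escaped: "\\" is one backslash.
single_backslash = "\\"
```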

In cases like these, the character itself does not occur in a file (such as an HTML document or a TeX source file). Instead, the file contains the "escape" notation as a character sequence, which will then be interpreted in a specific way by programs like a web browser or a TeX program. We can in a sense regard the "escape notations" as encodings used in specific contexts upon specific agreements.

2.8.1.1. CSS

CSS (Cascading Style Sheets) is a language for suggesting presentational features for an HTML or XML document. It does not have much use for string constants, but in principle, so-called generated content strings may contain arbitrary Unicode data. The convention is that within a string constant, \n means the character with hexadecimal Unicode number n. This works well when the number is followed by a character that cannot be part of a hexadecimal number. There are special conventions that help in other cases: a \n construct is treated as terminated after six consecutive hexadecimal digits, and a space immediately following a \n construct is ignored. By these rules, the string 1–5 (containing an en dash, U+2013) can be written as a CSS constant in two ways: "1\0020135" or "1\2013 5".
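The two termination rules can be sketched in Python as follows. The function names are made up for this illustration, and the sketch assumes that the ASCII part of the text contains no quotation marks or backslashes, which would need escaping of their own.

```python
def css_escape_padded(ch: str) -> str:
    """Escape with exactly six hexadecimal digits, which ends the construct."""
    return "\\" + format(ord(ch), "06x")


def css_escape_spaced(ch: str) -> str:
    """Escape with minimal digits plus a terminating space, which CSS ignores."""
    return "\\" + format(ord(ch), "x") + " "


def css_string(text: str, escape) -> str:
    """Build a double-quoted CSS string constant with non-ASCII characters escaped."""
    body = "".join(c if ord(c) < 128 else escape(c) for c in text)
    return '"' + body + '"'


sample = "1\u20135"  # digit one, en dash, digit five
print(css_string(sample, css_escape_padded))  # prints: "1\0020135"
print(css_string(sample, css_escape_spaced))  # prints: "1\2013 5"
```

Always emitting the space (or always padding to six digits) is more verbose than strictly necessary, but it avoids having to inspect the character that follows.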

2.8.1.2. PostScript

PostScript is a page description language defined by Adobe. PostScript format can be viewed on screen, too, using tools like GhostScript software. More than markup, PostScript is a powerful (and complex) programming language. Normally PostScript code is generated from other formats with automatic tools, but sometimes people edit the resulting code for fine-tuning of visual appearance or small changes to the content.

PostScript contains a large collection of names for glyphs. The names are mnemonic and relatively long, such as Adieresis for Ä. Some character databases mention, along with other information about characters, the PostScript names, also known as "Adobe names." However, the names refer to glyphs, not characters. You would not use Adieresis in a PostScript file to get Ä printed; instead, you would use \304.

Adobe's information on PostScript, including the PostScript reference manual, can be found at http://www.adobe.com/products/postscript/resources.html.

2.8.1.3. RTF

Rich Text Format (RTF) was designed for information interchange between text-processing programs, preserving much of the formatting of text. Such programs typically have a "Save As RTF" function, and they can open RTF files and automatically convert them to the program's internal format. RTF contains much more than plain text, but often conversion to RTF format loses some information, if advanced or specialized tools have been used in text processing. RTF is favored by some organizations for security reasons: RTF does not contain macros, so RTF files, unlike MS Word files, cannot contain macro viruses.

RTF should not be confused with the general concept of "rich text," which may mean almost any data format that allows some formatting of texts, such as italics and bolding.

RTF markup is verbose and confusing to a human reader, since it is meant to be read by programs primarily. In addition to notations for characters as discussed here, RTF contains quite a lot of commands merged with text content. The RTF format is, however, a text format, defined as the Internet media type text/rtf and usually containing ASCII characters only. (The media type application/rtf is used, too.)

The meanings of notations like \'c4 depend on the encoding used. However, an RTF file may contain commands that specify the encoding, making the document more portable.

In the example \u8722\'2d8\~\'b0C, the notation \u8722 refers to the minus sign character (U+2212) by its code number in decimal. The notations \'2d and \'b0 refer to the hyphen-minus (U+002D) and the degree sign (U+00B0) by their two-digit hexadecimal codes, whereas \~ is a special notation that denotes the no-break space. The hyphen-minus character appears in the notation for fallback behavior: it is the character to be rendered if the preceding character cannot be displayed.
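A minimal Python sketch of how such a notation could be generated is shown below. The function name is hypothetical. Two details are assumptions not stated in the text above: the fallback character is taken to be encoded in windows-1252, and code numbers above 32767 are written as negative numbers, as the RTF specification requires for its signed 16-bit parameter.

```python
def rtf_unicode_escape(ch: str, fallback: str = "?") -> str:
    """Return \\uN followed by a \\'xx fallback for a single BMP character."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("this sketch handles only characters up to U+FFFF")
    if code > 32767:
        code -= 65536  # assumption: RTF expects a signed 16-bit decimal number
    fallback_bytes = fallback.encode("cp1252")  # assumption: windows-1252 fallback
    if len(fallback_bytes) != 1:
        raise ValueError("fallback must be a single one-byte character")
    return "\\u%d\\'%02x" % (code, fallback_bytes[0])


minus = rtf_unicode_escape("\u2212", "-")  # minus sign with hyphen-minus fallback
example = minus + "8" + "\\~" + "\\'b0" + "C"
print(example)  # prints: \u8722\'2d8\~\'b0C
```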

For more information on RTF, consult RTF Pocket Guide by O'Reilly or the extensive web site, http://interglacial.com/rtf/.

2.8.1.4. TeX

In TeX typesetting systems (including LaTeX, AMS-TeX, etc.), there are different ways of producing characters, possibly depending on the "packages" used. Examples of ways to produce Ä include: \"A, \symbol{196}, \char'0304, and \capitaldieresis{A}. For a large list of such notations, consult The Comprehensive LaTeX Symbol List, http://www.ctan.org/tex-archive/info/symbols/comprehensive.

2.8.2. Notations for Human Readers

There are also "escape notations" that are to be interpreted by human readers directly. For example, when sending email, you might use A" (letter A followed by a quotation mark) as a surrogate for Ä (letter A with dieresis), or you might use AE instead of Ä. The reader is assumed to understand that, for example, A" on a display actually means Ä. Quite often, the purpose is to use ASCII characters only, so that the typing, transmission, and display of the characters is "safe."

However, such notations typically make texts rather messy. The name Hämäläinen does not look too good or readable when written as Ha"ma"la"inen or Haemaelaeinen. Such usage is based on special (though often implicit) conventions and can cause a lot of confusion when there is no mutual agreement on the conventions. Many different and mutually incompatible conventions are used. For example, to denote letter "a" with an acute accent, á, a convention might use the apostrophe, a', or the solidus, a/, or the acute accent, a´, or something else.

Some notations are rather evident, such as using a^ to denote â. The character ^ has no normal use in words, so the most plausible explanation is that the writer meant to indicate that a circumflex should appear above the preceding letter. But quotation marks, apostrophes, and even acute and grave accents could sometimes be mistaken for punctuation marks.

There is an old (1992) proposal by K. Simonsen, "Character Mnemonics & Character Sets," published as RFC 1345, that lists a large number of "escape notations" for characters. They are very short, typically two characters, e.g., "Co" for ©, "A:" for Ä, and "th" for þ (thorn). Naturally, there's the problem that the reader must know whether, for example, "th" is to be understood that way or as two letters t and h. So the system is primarily for referring to characters, but under suitable circumstances, it could also be used for actually writing texts, when the ambiguities can somehow be removed by additional conventions or by context. RFC 1345 is old and not approved by any authority, but if you need, for some applications, an "escape scheme," you might consider using those notations instead of reinventing the wheel. RFCs are available via http://www.rfc-editor.org/.

2.8.3. Explanations to Human Readers

Extending the meaning of "escape sequence" even further (and probably beyond what many experts find reasonable), let us consider the common problem of explaining verbally which character you mean. This may happen when you cannot show the character (e.g., when spelling out a foreign name over the phone) or when showing the character is not sufficient. Here we are not primarily interested in using characters in running text but in specifying which character is being discussed.

As an example, consider a situation where you need to mention the Cyrillic letter Я in a context where you can safely use only ASCII (e.g., email, or a Usenet discussion). There are various ways to try to describe the character:


The Russian letter that looks like a mirrored "R"

Such descriptions of the shape of a character might do their job in some cases, but they don't work well in general. The shape of a character may vary, and different people interpret shapes differently. For example, the letter Ч looks like a chair to some people, while some might describe it as digit 4, etc.


U+042F

This is the other extreme: a unique code-like notation, which is just fine when understood, but rather useless to most readers.


&#1071;

This is code-like too. It might, however, be understood by people who know HTML authoring but have never heard of the U+... notation.


Cyrillic capital letter ya

This is better, and when understood as a Unicode name, it is unique and immutable. However, such names are not always intuitively understandable, even to people who know the character itself. For example, due to differences in English and French transliteration of Russian, the phrase "Cyrillic capital letter che" might be understood either as meaning Ч or as meaning Ш.


The character you can see at http://www.fileformat.info/info/unicode/char/042f/

This would mean a reference to a web page that contains information on the character, including a glyph of it as a largish image. If you use it in an email message, the recipient is usually able to just click on the address to visit the page. Unfortunately, for many purposes, the content of online services that could be used for such references tends to be rather technical in nature. The formal information might even confuse the reader.


A combination of some of the above

This is usually the best strategy. The methods used will vary by the audience and by the character. An explanation such as "Cyrillic capital letter ya (in Unicode: U+042F)" might work reasonably well.
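Such a combined description can be produced mechanically. The following Python sketch uses the standard unicodedata module; the helper name describe is made up for this illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Combine the Unicode name and the U+ notation of a single character."""
    name = unicodedata.name(ch).capitalize()
    return f"{name} (in Unicode: U+{ord(ch):04X})"


print(describe("\u042f"))  # prints: Cyrillic capital letter ya (in Unicode: U+042F)
```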

Sometimes you need to avoid common phrases in order to be unambiguous. It is common to say "double slash" in English, when you mean two consecutive slash characters, //. However, such wording is potentially ambiguous, since Unicode contains the double solidus operator ⫽ (U+2AFD) as a separate, independent character. Unicode contains hundreds of characters with the word "double" in their name. Thus, a wording like "two slashes" is safer. Since even this might be misunderstood as referring to one character, the expression "two slash characters" is even safer.

2.8.4. HTML, SGML, and XML Notations for Characters

HTML is the markup language in which web pages are usually written. It is formally a special case of SGML or XML, which are generic markup languages. These languages have special notations that you can use for writing characters, if it is for some reason difficult or impossible to use the characters themselves.

2.8.4.1. Character and entity references in web authoring

If you use a "Save as HTML" or "Save as Web page" command or something similar in a word processor, it is quite possible that some characters in your text get stored as entity references or as character references. For example, you have typed é but the program stores it as &eacute; or as &#233;. Web-authoring programs often do the same.

There is nothing wrong with this per se (in most cases; some programs generate incorrect character references, though). Web browsers can deal with such references. Many web page editors can interpret them as well. But if you wish to edit the document later using a program like Notepad, you will see the references, and things can get really awkward if you need to work with data like r&eacute;sum&eacute; a lot. Depending on the software you use, the references might appear as such, or interpreted and displayed as the characters they denote.

Some programs have options that control whether and how Unicode characters are replaced by entity or character references. Moreover, they may have options for setting the encoding of the HTML document, and this may affect the situation.

For example, suppose you use OpenOffice to create a document with é and a Chinese character, and then use File → Save and select HTML format. With default settings, the program saves é as &eacute; and the Chinese character as a character reference like &#19981;. The latter part is understandable, since the default encoding in HTML documents created with OpenOffice is windows-1252, which contains no Chinese characters. There is no good explanation for using &eacute;, though. If you set the encoding to utf-8 (via Tools → Options → Load and Save → HTML Compatibility), then the Chinese character is saved as such, UTF-8 encoded. The &eacute; entity reference still appears. It is of course a correct notation, but it makes HTML source harder to read.

Thus, one of the reasons for using references is that the document's encoding might not allow all characters to be represented as such, and character references offer a universal way to overcome such limitations. But programs might also use such output form for no good reason.

If you represent all non-ASCII characters using entity references or character references, you can use ASCII only in an HTML document. The data will be "7-bit safe"; i.e., it can even be sent over a connection that does something nasty to octets with the most significant bit set. This is seldom relevant these days, but many tutorials have taught that entity and character references are safer than using the actual characters, and people tend to believe such things.

Exceptionally, such issues might still be relevant, if you work, say, in a Mac environment and upload your documents to a server that runs Unix. It might then happen that the software you use for uploading performs a wrong character encoding conversion, or doesn't do a conversion when it should. But if you have used only ASCII characters (and wrote, for example, accented letters using entity references), then no such conversion is needed, and no conceivable conversion will harm you either, since conversions would leave ASCII characters intact.
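As an illustration of the ASCII-only approach, Python's built-in xmlcharrefreplace error handler replaces every character that cannot be encoded with a decimal character reference. The function name below is made up for this sketch, and the sketch does not touch markup-significant characters such as & and <, which need separate treatment.

```python
import html


def to_ascii_html(text: str) -> bytes:
    """Encode text as ASCII, turning other characters into character references."""
    return text.encode("ascii", "xmlcharrefreplace")


sample = "r\u00e9sum\u00e9 \u4e0d"
encoded = to_ascii_html(sample)
print(encoded)  # prints: b'r&#233;sum&#233; &#19981;'

# Every octet has its most significant bit clear, so the data is 7-bit safe,
# and interpreting the references restores the original text.
seven_bit_safe = all(byte < 128 for byte in encoded)
restored = html.unescape(encoded.decode("ascii"))
```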

The Free Recode program available from http://recode.progiciels-bpi.ca can perform an impressive number of code conversions, including conversions that replace references by characters or vice versa. Beware that it uses rather odd terminology: it refers to "HTML charsets" when it actually means HTML format. Normally "charset" means character encoding, at the character level, without any notion of entity references or character references.

2.8.4.2. The role and use of character and entity references

Entity references like &eacute; and character references like &#233; are actually quite distinct concepts, though commonly confused with each other in HTML tutorials, and even in specifications! What they share is that they relate to markup languages, namely SGML, XML, and languages defined with them, such as HTML. The references do not belong to Unicode at all, though they usually make use of Unicode code numbers. Rather, they are at a "higher level."

Thus, the references make sense only in contexts where markup is used and interpreted. For example, they do not work in normal email, though they may work if email is sent and interpreted in HTML format. However, references might at times appear otherwise too, due to programming errors, or sometimes intentionally. For example, in some situations, Internet Explorer represents characters in user input in forms as character references. However, by the specifications, a browser should send form data as plain text, not in HTML format in any way. On some web-based discussion forums, you might be able to type a character reference and have it displayed to your readers as the character you mean. Technically, this is easily achieved in the design of forum software: it just needs to pass the reference through as such.

2.8.4.3. Definition: character reference

Generally, in any SGML-based system, or SGML application as the jargon goes, a character reference of the form &#number; can be used. It refers to the character that occupies the code position with that number in the character code defined for the SGML application in question. This is actually very simple: you specify a character by its index (position, number). In SGML terminology, the character code that determines the interpretation of a character reference is called, quite confusingly, the document character set. It need not have anything to do with the character encoding in which the document is written.

Originally, SGML used decimal numbers in character references. Later, the hexadecimal alternative was added, and it uses letter "x" (or "X") in front of the digits: &#xnumber;. Thus, &#196; is equivalent to &#xC4;.
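The equivalence of the decimal and hexadecimal forms is easy to verify. In this Python sketch, the helper char_refs is a made-up name, and html.unescape from the standard library plays the role of the program that interprets the references.

```python
import html


def char_refs(ch: str) -> tuple:
    """Return the decimal and hexadecimal character references for ch."""
    number = ord(ch)
    return f"&#{number};", f"&#x{number:X};"


decimal_ref, hex_ref = char_refs("\u00c4")
print(decimal_ref, hex_ref)  # prints: &#196; &#xC4;

# Both notations denote the same character.
same = html.unescape(decimal_ref) == html.unescape(hex_ref) == "\u00c4"
```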

For HTML, the document character set is Unicode (or, to be exact, a subset thereof, depending on HTML version). A most essential point is that for HTML, the document character set is completely independent of the encoding of the document! Some early browsers (Netscape 4) got this wrong.

XML, which can be regarded as a lightweight derivative of SGML, has a very similar character reference concept. XML fixes the document character set to Unicode. It also simplifies the syntax by making the trailing semicolon (which is optional in some situations in SGML) an obligatory part of a character reference.

2.8.4.4. Definition: entity reference

Entity references such as &copy; in HTML can be regarded as symbolic names defined for some characters. Contrary to popular belief, entity references are not less system-dependent than character references like &#169;. It's rather the opposite. The entity references in HTML are defined by equating them with character references, using XML declarations like:

<!ENTITY copy  "&#169;">

Entity references in SGML and XML correspond to macro invocations in many programming and command languages. You define an entity with a declaration (e.g., the sample above) and you use ("call") the entity by prefixing its name with an ampersand and ending the reference with a semicolon. An SGML or XML processor, including web browsers, simply substitutes internally the defining string &#169; for a reference (in this case, &copy;). In the general case, the definition of an entity could be a long string, even the content of an external file.
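The macro-like substitution can be observed with any XML parser that processes declarations in the internal DTD subset. The sketch below uses Python's xml.etree.ElementTree; the element name note, the entity owner, and the text "Example Press" are invented for the illustration.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE note [
  <!ENTITY copy "&#169;">
  <!ENTITY owner "&copy; Example Press">
]>
<note>&owner;</note>"""

# The parser replaces &owner; by its definition, which in turn contains
# &copy;, defined as a character reference to U+00A9.
root = ET.fromstring(document)
print(root.text)  # prints: © Example Press
```

Because entity definitions can refer to other entities, parsers that expand them should not be fed untrusted documents without safeguards.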

2.8.4.5. Entity references in HTML

The HTML language (including XHTML) has a finite set of predefined entities, and they are all defined in terms of character references. This is a special case, but it has made people understand entity references just as names for characters. Even HTML specifications call them character entity references as opposed to numeric character references.

Moreover, although HTML was formally defined as an application of SGML, web browsers never supported the general mechanisms for declaring and using entities. Thus, in practice, entities exist only in the sense that you can use the predefined entities. To the extent that web browsers support XHTML (i.e., XML-based versions of HTML), the situation is different: new entities can be declared.

By SGML rules, the trailing semicolon in entity references may be omitted, if the next character is non-alphanumeric (e.g., a space). However, popular browsers often get this wrong, so &euro; is much safer than &euro without the semicolon. Moreover, in XML, and therefore in XHTML, the semicolon is required.

The entity references in HTML are officially defined in the HTML specifications; for example, see http://www.w3.org/TR/xhtml1/dtds.html#h-A2. There are also some more readable presentations, such as http://www.htmlhelp.com/reference/html40/entities/. However, there are several reasons why the entity references are not that useful:

  • In modern authoring with Unicode-enabled tools, you don't need the entities. You simply write characters themselves and see them as such even in HTML source, and you store and serve your page in UTF-8 encoding, for example.

  • Entities exist for a rather haphazard collection of characters.

  • The entity names are often just half-mnemonic, or not mnemonic at all. Who could guess that &lang; means left-pointing angle bracket? What would be your guess on &ni;? Part of the quasi-mnemonic nature is caused by the fact that the names have been taken from the SGML standard, which uses entity names with a maximum length of six characters.

2.8.4.6. Character entities in XML

People often assume that the character entity references known from HTML are automatically available in XML. However, in XML, only a very small set of predefined entities exist, as shown in Table 2-5. Entities have been defined for markup-significant characters, i.e., characters that might otherwise be understood as constituting part of markup. If you use the < character in document content, as in the expression "a<b", you need to escape it as &lt; or, equivalently, as &#60; or &#x3c;. Otherwise, "<b" would be taken as starting a tag. There is actually no need to escape the > character, but an entity has been defined for it for symmetry. The & character, on the other hand, must always be escaped in XML, when it is not meant to start a character reference or an entity reference. The apostrophe and the quotation mark need not be escaped in document content but only in attribute values, where the character would otherwise terminate the value.

Table 2-5. Predefined entities (denoting characters) in XML

Entity reference    Expansion    Character    Unicode name         Need for the entity
&lt;                &#60;        <            Less-than sign       < normally starts a tag
&gt;                &#62;        >            Greater-than sign    For symmetry with &lt;
&amp;               &#38;        &            Ampersand            & normally starts a reference
&apos;              &#39;        '            Apostrophe           Within an attribute value
&quot;              &#34;        "            Quotation mark       Within an attribute value

In XML, any other entities must be defined before use, though you can write the definitions into an external file and refer to the file in an entity declaration.
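Both points can be tried out in Python with the standard library: xml.sax.saxutils.escape applies the predefined entities to document content, and an XML parser rejects an HTML entity that has not been declared. This is a sketch only; the sample strings and the element name p are arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# In document content, &, < and > are replaced by predefined entities.
content = escape("a<b & c>d")
print(content)  # prints: a&lt;b &amp; c&gt;d

# For an attribute value delimited by quotation marks, the quotation mark
# must be escaped as well; escape() accepts extra replacements for that.
attribute = escape('say "hi"', {'"': "&quot;"})
print(attribute)  # prints: say &quot;hi&quot;

# An HTML entity such as &eacute; is not predefined in XML, so an XML
# parser reports an error unless the entity has been declared.
try:
    ET.fromstring("<p>&eacute;</p>")
    undefined_entity_rejected = False
except ET.ParseError:
    undefined_entity_rejected = True
```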


