As you learned earlier in the week, HTML files are ASCII text and should contain no formatting or fancy characters. In fact, the only characters you should put in your HTML files are the characters that are actually printed on your keyboard. If you have to hold down any key other than Shift, or type an arcane combination of keys to produce a single character, you can't use that character in your HTML file. This includes characters you might use every day, such as em dashes and curly quotes (if your word processor is set up to do automatic curly quotes, you should turn them off when you write your HTML files).
"But wait a minute," you say. "If I can type a character like a bullet or an accented a on my keyboard using a special key sequence, and I can include it in an HTML file, and my browser can display it just fine when I look at that file, what's the problem?"
The problem is that the internal encoding your computer does to produce that character (which enables it to show up properly in your HTML file and in your browser's display) probably won't translate to other computers. Someone on the Internet who's reading your HTML file with that funny character in it might end up with some other character or just plain garbage. Or, depending on how your page is sent over the Internet, the character might be lost before it ever gets to the computer where the file is being viewed.
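To see how this can go wrong, here is a small sketch in Python (not HTML, and not part of this lesson) that stores an accented e using one platform's encoding and reads it back using another's. The choice of Mac OS Roman and Windows-1252 is an assumption made purely for illustration; other mismatched pairs of encodings can produce the same kind of surprise.

```python
# Illustrative only: the two encodings below are assumptions chosen for the demo.
text = "\u00e9"  # the accented character é

# The author's machine stores the character as a single byte in Mac OS Roman.
raw = text.encode("mac_roman")

# A reader's machine interprets that very same byte as Windows-1252.
seen = raw.decode("cp1252")

# The byte never changed, but the character the reader sees is a different one.
print(raw, seen, seen == text)
```

The byte that travels over the network is identical on both ends; only the interpretation differs, which is why the reader ends up with "some other character."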
So, what can you do? HTML provides a reasonable solution. It defines a special set of codes, called character entities, that you can include in your HTML files to represent the characters you want to use. When interpreted by a browser, these character entities are displayed as the appropriate special characters for the given platform and font.
Not every character that can be written as an entity comes from the set of extended ASCII characters. For example, quotation marks and ampersands can be presented on a page using character entities even though they're found within the standard ASCII character set. These characters have a special meaning in HTML documents within certain contexts, so they can be represented with character entities to avoid confusing web browsers. Modern browsers generally don't have a problem with these characters, but it's not a bad idea to use the entities anyway.
Character Entities for Special Characters
Character entities take one of two forms: named entities and numbered entities.
Named entities begin with an ampersand (&) and end with a semicolon (;). In between is the name of the character (or, more likely, a shorthand version of that name, such as agrave for an a with a grave accent, or reg for a registered trademark sign). Unlike HTML tag names, entity names are case sensitive, so you should make sure to type them in exactly. Named entities look something like the following:
&agrave; &quot; &laquo; &copy;
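If you want to check what a named entity stands for without opening a browser, one option is Python's standard `html` module. The following is only a sketch for experimenting, not part of HTML itself; it decodes a few named entities and shows that changing the case of a name selects a different character.

```python
import html

# Entity names are case sensitive: the case of the name selects the character.
lower = html.unescape("&agrave;")  # lowercase a with a grave accent
upper = html.unescape("&Agrave;")  # uppercase A with a grave accent

# A few named entities, decoded to the characters they stand for.
samples = ["&agrave;", "&quot;", "&laquo;", "&copy;"]
decoded = [html.unescape(s) for s in samples]

print(lower, upper, decoded)
```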
The numbered entities also begin with an ampersand and end with a semicolon, but rather than a name, they have a pound sign (#) and a number. The numbers correspond to character positions in the ISO-Latin-1 (ISO 8859-1) character set. Every character you can type or for which you can use a named entity also has a numbered entity. Numbered entities look like the following:
&#224; &#34; &#171; &#169;
You can use either numbers or named entities in your HTML file by including them in the same place that the character they represent would go. So, to place the word résumé in your HTML file, you would use either r&eacute;sum&eacute; or r&#233;sum&#233;.
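The equivalence of the two forms can be checked with a short Python sketch. The helper `to_numbered` is a hypothetical name introduced here for illustration; it relies on Python's `ord`, which returns a character's Unicode code point, and for characters in ISO-Latin-1 that value is the same as the character's position in that set.

```python
import html

named = "r&eacute;sum&eacute;"   # the word written with a named entity
numbered = "r&#233;sum&#233;"    # the same word written with a numbered entity

# Both forms decode to exactly the same text.
same = html.unescape(named) == html.unescape(numbered)


def to_numbered(text):
    """Hypothetical helper: replace every non-ASCII character with a numbered entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


word = "r\u00e9sum\u00e9"  # résumé, spelled with escapes so this file stays ASCII
print(same, to_numbered(word))
```

Which form you choose is mostly a matter of readability: the named form is easier to recognize in the source, whereas the numbered form works even for characters that have no name.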
In Appendix B, "HTML 4.01 Quick Reference," I've included a table that lists the named entities currently supported by HTML. See that table for specific characters.
Character Entities for Reserved Characters
For the most part, character entities exist so that you can include special characters that aren't part of the standard ASCII character set. However, there are several exceptions for the few characters that have special meaning in HTML itself. You must use entities for these characters also.
Suppose that you want to include a line of code that looks something like the following in an HTML file:
<p><code>if x < 0 do print i</code></p>
Doesn't look unusual, does it? Unfortunately, HTML cannot display this line as written. Why? The problem is with the < (less-than) character. To an HTML browser, the less-than character means "this is the start of a tag." Because the less-than character isn't actually the start of a tag in this context, your browser might get confused. You'll have the same problem with the greater-than character (>) because it means the end of a tag in HTML, and with the ampersand (&) because it signals the beginning of a character escape. Written correctly for HTML, the preceding line of code would look like the following instead:
<p><code>if x &lt; 0 do print i</code></p>
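If you generate pages from a script rather than typing them by hand, you don't have to do this replacement yourself. As a minimal sketch, assuming the page is being assembled in Python, the standard `html.escape` function converts the less-than, greater-than, and ampersand characters into their escape codes before the text is wrapped in tags.

```python
import html

# The raw line of code we want readers to see on the page.
line = "if x < 0 do print i"

# Escape the reserved characters first, then wrap the result in markup.
escaped = html.escape(line)
markup = f"<p><code>{escaped}</code></p>"

print(markup)
```

The order matters: escaping after adding the tags would also escape the angle brackets of the `<p>` and `<code>` tags themselves.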
HTML provides named escape codes for each of these characters, and one for the double quotation mark as well, as shown in Table 6.1.
The double quotation mark escape is the mysterious one. Technically, if you want to include a double quotation mark in text, you should use the escape sequence and you shouldn't type the quotation mark character. However, I haven't noticed any browsers having problems displaying the double quotation mark character when it's typed literally in an HTML file, nor have I seen many HTML files that use it. For the most part, you're probably safe using plain old quotes (") in your HTML files rather than the escape code.
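The one place where a literal double quotation mark does cause trouble is inside an attribute value that is itself delimited by double quotation marks. Python's `html.escape` reflects this distinction through its `quote` parameter; the sketch below is only an illustration of that behavior, using a made-up sentence as input.

```python
import html

# A made-up sample containing quotes and two reserved characters.
text = 'She said "x < y" & left'

# For ordinary body text, the quotes can be left as typed.
in_text = html.escape(text, quote=False)

# For use inside a double-quoted attribute value, the quotes are escaped too.
in_attribute = html.escape(text, quote=True)

print(in_text)
print(in_attribute)
```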