Special Characters


As you learned earlier in the week, HTML files are ASCII text and should contain no formatting or fancy characters. In fact, the only characters you should put in your HTML files are the characters that are actually printed on your keyboard. If you have to hold down any key other than Shift, or type an arcane combination of keys to produce a single character, you can't use that character in your HTML file. This includes characters you might use every day, such as em dashes and curly quotes (if your word processor is set up to do automatic curly quotes, you should turn them off when you write your HTML files).

"But wait a minute," you say. "If I can type a character like a bullet or an accented a on my keyboard using a special key sequence, and I can include it in an HTML file, and my browser can display it just fine when I look at that file, what's the problem?"

The problem is that the internal encoding your computer does to produce that character (which enables it to show up properly in your HTML file and in your browser's display) probably won't translate to other computers. Someone on the Internet who's reading your HTML file with that funny character in it might end up with some other character or just plain garbage. Or, depending on how your page is sent over the Internet, the character might be lost before it ever gets to the computer where the file is being viewed.

So, what can you do? HTML provides a reasonable solution. It defines a special set of codes, called character entities, that you can include in your HTML files to represent the characters you want to use. When interpreted by a browser, these character entities are displayed as the appropriate special characters for the given platform and font.

Some special characters don't come from the set of extended ASCII characters. For example, quotation marks and ampersands can be presented on a page using character entities even though they're found within the standard ASCII character set. These characters have a special meaning in HTML documents within certain contexts, so they can be represented with character entities in order to avoid confusing the web browsers. Modern browsers generally don't have a problem with these characters, but it's not a bad idea to use the entities anyway.

Character Entities for Special Characters

Character entities take one of two forms: named entities and numbered entities.

Named entities begin with an ampersand (&) and end with a semicolon (;). In between is the name of the character (or, more likely, a shorthand version of that name, such as agrave for an a with a grave accent, or reg for a registered trademark sign). Unlike HTML tag names, entity names are case sensitive, so you should make sure to type them in exactly. Named entities look something like the following:

&agrave; &quot; &laquo; &copy;


The numbered entities also begin with an ampersand and end with a semicolon, but rather than a name, they have a pound sign (#) and a number. The numbers correspond to character positions in the ISO-Latin-1 (ISO 8859-1) character set. Every character you can type or for which you can use a named entity also has a numbered entity. Numbered entities look like the following:

&#130; &#245;


You can use either numbers or named entities in your HTML file by including them in the same place that the character they represent would go. So, to place the word résumé in your HTML file, you would use either

r&eacute;sum&eacute;


or

r&#233;sum&#233;
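
As a slightly fuller illustration, here is a minimal sketch of a complete page that uses both forms side by side. The page title and the sample sentences are made up for this example; only the entities themselves come from the preceding discussion. The last paragraph shows why case matters: &Eacute; and &eacute; are two different entities.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Entity Example</title>
</head>
<body>
<!-- Named entity: eacute is a lowercase e with an acute accent -->
<p>Please send your r&eacute;sum&eacute; by Friday.</p>
<!-- Numbered entity: 233 is the position of that same character -->
<p>Please send your r&#233;sum&#233; by Friday.</p>
<!-- Case matters: Eacute is the capital letter, eacute the lowercase one -->
<p>&Eacute;cole and &eacute;cole</p>
</body>
</html>
```

The first two paragraphs should look identical in a browser, because both forms stand for the same character.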


In Appendix B, "HTML 4.01 Quick Reference," I've included a table that lists the named entities currently supported by HTML. See that table for specific characters.

Character Set: ISO-Latin-1 Versus Unicode

HTML's use of the ISO-Latin-1 character set allows it to display most accented characters on most platforms, but it has limitations. For example, common characters such as bullets, em dashes, and curly quotes simply aren't available in the ISO-Latin-1 character set. Therefore, you can't use these characters at all in your HTML files. (If they're absolutely necessary, you can create images representing those characters and use them on your pages. I don't recommend that option, though, because it can interfere with the layout of your page. Also, it can look odd if the user's browser is set to a nonstandard text size.) Also, many ISO-Latin-1 characters might be entirely unavailable in some browsers, depending on whether those characters exist on that platform and in the current font.

HTML 4.01 takes things a huge leap further by proposing that Unicode should be available as a character set for HTML documents. Unicode is a standard character encoding system that, although backward-compatible with our familiar ASCII encoding, offers the capability to encode characters in almost any of the world's languages, including Chinese and Japanese. This means that documents can be created easily in any language, and they also can contain multiple languages. Both Internet Explorer and Netscape support Unicode, and they can render documents in many of the scripts provided by Unicode as long as the necessary fonts are available.

This is an important step because Unicode is emerging as a new de facto standard for character encoding. Java uses Unicode as its default character encoding, for example, and Windows supports Unicode character encoding.
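
To make this concrete, here is a minimal sketch of a page that declares a Unicode encoding and uses numbered entities above the ISO-Latin-1 range. The choice of UTF-8, the page title, and the sample text are assumptions made for illustration. The meta element only helps if the file is really saved in that encoding and the server doesn't announce a different one, and the characters appear only in browsers and fonts that support them.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<!-- Tells the browser which character encoding the file uses -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Unicode Example</title>
</head>
<body>
<!-- 8220 and 8221 are the curly double quotes, 8212 is the em dash -->
<p>&#8220;Wait&#8212;what?&#8221;</p>
<!-- 8226 is the bullet -->
<p>&#8226; First point</p>
</body>
</html>
```

Because the numbered entities are written in plain ASCII, the file itself still contains only characters you can type directly on your keyboard.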


Character Entities for Reserved Characters

For the most part, character entities exist so that you can include special characters that aren't part of the standard ASCII character set. However, there are several exceptions for the few characters that have special meaning in HTML itself. You must use entities for these characters also.

Suppose that you want to include a line of code that looks something like the following in an HTML file:

<p><code>if x < 0 do print i</code></p>


Doesn't look unusual, does it? Unfortunately, HTML cannot display this line as written. Why? The problem is with the < (less-than) character. To an HTML browser, the less-than character means "this is the start of a tag." Because the less-than character isn't actually the start of a tag in this context, your browser might get confused. You'll have the same problem with the greater-than character (>) because it means the end of a tag in HTML, and with the ampersand (&) because it signals the beginning of a character escape. Written correctly for HTML, the preceding line of code would look like the following instead:

<p><code>if x &lt; 0 do print i</code></p>


HTML provides named escape codes for each of these characters, and one for the double quotation mark as well, as shown in Table 6.1.

Table 6.1. Escape Codes for Characters Used by Tags

Entity      Result
&lt;        <
&gt;        >
&amp;       &
&quot;      "


The double quotation mark escape is the mysterious one. Technically, if you want to include a double quotation mark in text, you should use the escape sequence and you shouldn't type the quotation mark character. However, I haven't noticed any browsers having problems displaying the double quotation mark character when it's typed literally in an HTML file, nor have I seen many HTML files that use it. For the most part, you're probably safe using plain old quotes (") in your HTML files rather than the escape code.
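
One place where the double quotation mark escape does earn its keep is inside an attribute value that is itself wrapped in double quotes. The following sketch shows the difference; the image filename and the sample text are made up for illustration.

```html
<!-- In body text, a literal double quote is harmless, but the ampersand still needs its escape -->
<p>She said "hello" &amp; left.</p>
<!-- Inside a double-quoted attribute value, a literal double quote would end the value early -->
<p><img src="cat.gif" alt="The &quot;best&quot; cat, 3 &lt; 4 years old"></p>
```

In the second paragraph, the browser reads the whole alt text as one value because the inner quotes are written as entities.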




Sams Teach Yourself Web Publishing with HTML and CSS in One Hour a Day (5th Edition)
ISBN: 0672328860
Year: 2007
Pages: 305
