The Rules of HTML | HTML & XHTML: The Complete Reference (Osborne Complete Reference Series)

HTML has rules, even in its standard form. Unfortunately, these "rules" aren't really rules, but more like suggestions, because most browsers will render just about anything. Under XHTML, however, these rules are enforced, and incorrectly structured documents may have significant downsides, often exposed only once other technologies such as CSS or JavaScript are intermixed with the markup. The reality is that most HTML, whether created by hand or by a tool, lies somewhere between strict conformance and no conformance to the specification. Let's take a brief tour of some of the more important aspects of HTML syntax.

HTML Is Not Case Sensitive, XHTML Is

These markup examples

  <B>Go boldly!</B>
  <B>Go boldly!</b>
  <b>Go boldly!</B>
  <b>Go boldly!</b>

are all equivalent under traditional HTML. In the past, developers were highly opinionated on how to case elements. Some designers pointed to the ease of typing lowercase tags as well as to XHTML's preference for lowercase elements as reasons to go all lowercase. Other designers pointed out that, statistically, lowercase letters were more common than uppercase ones, so keeping tags in all uppercase made them easier to pick out in a document, thus making its structure clearer to someone doing hand markup. However, because XHTML requires lowercase element names, you should always use lowercase tags.
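The difference can be sketched with two parsers from Python's standard library, used here purely as stand-ins for a browser and an XHTML processor: html.parser, which follows HTML's lenient conventions and lowercases tag names, and xml.etree.ElementTree, which applies the XML rules that XHTML inherits. The helper names html_tags and is_well_formed_xml are illustrative assumptions, not part of any specification.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Records start and end tags in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over the tag name already lowercased.
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))


def html_tags(markup):
    """Return the tags an HTML-style parser sees in the markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed_xml(markup):
    """Return True if the markup parses under XML (and so XHTML) rules."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Mixed case is the same element to the HTML parser...
print(html_tags("<B>Go boldly!</b>"))
# ...but a mismatched open and close tag to the XML parser.
print(is_well_formed_xml("<B>Go boldly!</b>"))
print(is_well_formed_xml("<b>Go boldly!</b>"))
```

Note that the XML parser only checks that the open and close tags match exactly; the rule that XHTML names must be lowercase comes from the XHTML specification itself and is not checked by this sketch.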

HTML/XHTML Attribute Values May Be Case Sensitive

One interesting aspect of HTML's case sensitivity is that although HTML element names and attribute names are not case sensitive, we can't assume everything is case insensitive. For example, consider <IMG SRC="test.gif"> and <img src="test.gif">. Under traditional HTML, these are equivalent because the <img> tag and the src attribute are not case sensitive. However, for compatibility with XHTML, they should be lowercase. Yet, regardless of the use of XHTML or HTML, the actual attribute values in some tags may be case sensitive, particularly where URLs are concerned. So <img src="test.gif"> and <img src="TEST.GIF"> are not necessarily referencing the same image. When referenced from a UNIX-based Web server where filenames are case sensitive, test.gif and TEST.GIF would be two different files, while on a Windows Web server where filenames are not case sensitive, they would reference the same file. This is a common problem and can keep a site from being moved easily from one server to another.
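As a rough illustration, Python's html.parser (again only a stand-in for a browser) normalizes the case of element and attribute names but leaves attribute values exactly as written. The helper name parse_tags is an assumption made for this sketch.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records each start tag together with its attribute list."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values are untouched.
        self.seen.append((tag, attrs))


def parse_tags(markup):
    """Return (tag, attributes) pairs for every start tag in the markup."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


print(parse_tags('<IMG SRC="TEST.GIF">'))
print(parse_tags('<img src="test.gif">'))
```

The two results share the same element and attribute names but carry different src values, and whether those values name the same file is decided by the server, not by the markup.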

HTML/XHTML Is Sensitive to a Single White Space Character

Browsers collapse white space between characters down to a space. This includes all tabs, line breaks, and carriage returns.

Consider the markup

  <b>T e s t o f s p a c e s</b><br />
  <b>T   e   s   t   o f   s p a c e s</b><br />
  <b>T e s t o f s p            a c e s</b><br />

When rendered, each run of spaces, tabs, and returns collapses to a single space, so all three lines display identically.

Note that in some situations, HTML does treat white space characters differently. In the case of the pre element, which defines a preformatted block of text, white space is not ignored and is preserved because the content is considered preformatted. Also, white space is preserved within the textarea element when setting default text for a multiline text entry field.
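The collapsing behavior for ordinary text can be approximated in a few lines of Python. This is only a sketch of the idea, not how any browser implements it: the function name collapse_white_space is hypothetical, and the sketch deliberately ignores the pre and textarea exceptions just described.

```python
import re

# Runs of space, tab, line feed, form feed, and carriage return characters.
_WHITE_SPACE_RUN = re.compile(r"[ \t\n\f\r]+")


def collapse_white_space(text):
    """Replace every run of white space characters with a single space."""
    return _WHITE_SPACE_RUN.sub(" ", text)


samples = [
    "T e s t o f s p a c e s",
    "T   e   s   t   o f   s p a c e s",
    "T e s t o f s p \t\n      a c e s",
]
for sample in samples:
    print(repr(collapse_white_space(sample)))
```

All three samples collapse to the same string, which is why the differently spaced lines of markup render alike.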

Because browsers will ignore most white space, HTML authors often format their HTML documents for readability. However, the reality is that browsers really don't care one way or another, and neither do end users. Because of this, some sites have adopted an idea called "HTML crunching" to save bandwidth, which is discussed in Chapter 2.

Subtle errors tend to creep into HTML files where white space is concerned; be especially careful with spacing around <img> and <a> tags. For example, consider the markup here:

  <a href="http://www.democompany.com">
  <img src="democompany.gif" width="221" height="64"
  border="0" alt="Demo Company" />
  </a>

Notice the line return after the <img> tag, just before the </a> tag that closes the link. Under some browsers, this will result in a small "tail" to the image, often termed a tick.

Some browsers will fix the tick problem, others won't. What's interesting is that the browsers showing the tick actually are interpreting the specification properly.

The final aspect of spacing to consider is the use of the nonbreaking space entity, or &nbsp;. Some might consider this the duct tape of the Web: useful in a bind when a little bit of formatting is needed or an element has to be kept open. While the &nbsp; entity has many useful applications, such as keeping empty table cells from collapsing, designers should avoid relying on it for significant formatting. While it is true that markup such as

  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  Look, I'm spaced out!

would indent some spaces in the text, the question is, exactly how far? In print, using spaces to format is dangerous and things rarely line up. It is no different on the Web.

XHTML/HTML Follows a Content Model

Both HTML and XHTML support a strict content model that says that certain elements are supposed to occur only within other elements. For example, markup like this,

  <ul>   <p>  What a simple way to break the content model!  </p>   </ul>

which often is used for simple indentation, actually doesn't strictly follow the XHTML content model. The <ul> tag is only supposed to contain <li> tags. The <p> tag is not really appropriate in this context. Much of the time, HTML page authors are able to get away with this, but sometimes they can't. For example, in a strictly conformant browser, the <input> tag found outside a <form> tag is simply not displayed. HTML documents should follow a structured content model.
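A single content-model rule, that a ul element may directly contain only li elements, can be checked with a short Python sketch. It assumes the fragment is already well-formed XHTML so that xml.etree.ElementTree can parse it; the function name invalid_ul_children is hypothetical, and real validation would be done against the full DTD rather than one hand-coded rule.

```python
import xml.etree.ElementTree as ET


def invalid_ul_children(xhtml):
    """Return the tag names of direct children of <ul> that are not <li>."""
    root = ET.fromstring(xhtml)
    offenders = []
    # iter() visits the root element itself as well as its descendants.
    for ul in root.iter("ul"):
        for child in ul:
            if child.tag != "li":
                offenders.append(child.tag)
    return offenders


broken = "<ul><p>What a simple way to break the content model!</p></ul>"
fixed = "<ul><li>This list follows the content model.</li></ul>"

print(invalid_ul_children(broken))
print(invalid_ul_children(fixed))
```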

Elements Should Have Close Tags Unless Empty

Under traditional HTML, some elements have optional close tags. For example, both of the paragraphs here are allowed, although the second one is better:

  <p>  This isn't closed.  <p>  This is.  </p>

A few tags, such as the horizontal rule <hr> or line break <br>, do not have close tags because they do not enclose any content. These are considered empty elements and can be used as is in traditional HTML. However, under XHTML you must always close tags, so you would have to write <br></br> or, more commonly, use a self-closing tag format with a final "/" character, like so: <br />.
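Because XHTML follows XML syntax, an XML parser gives a quick way to see the effect of this rule. The sketch below uses Python's xml.etree.ElementTree as a stand-in for an XHTML processor; the function name parses_as_xml is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if the markup is well-formed under XML rules."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A bare <br> is never closed, so the fragment is rejected.
print(parses_as_xml("<p>line one<br>line two</p>"))
# Either closed spelling is accepted.
print(parses_as_xml("<p>line one<br></br>line two</p>"))
print(parses_as_xml("<p>line one<br />line two</p>"))

# Both spellings describe the same empty element once parsed.
print(ET.tostring(ET.fromstring("<br></br>")) == ET.tostring(ET.fromstring("<br />")))
```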

Unused Elements May Minimize

Sometimes tags may not appear to have any effect in a document. Consider, for example, the <p> tag, which specifies a paragraph. As a block tag it induces a return by default, but when used repeatedly, like so,

  <p></p><p></p><p></p>

it does not produce numerous blank lines because the browser minimizes the empty p elements. Some HTML editors output nonsense markup such as

  <p>&nbsp;</p><p>&nbsp;</p><p>&nbsp;</p>

to deal with this. This is a misuse of HTML. Multiple <br /> tags should have been used instead to achieve line spacing.

Elements Should Nest

A simple rule states that tags should nest, not cross, thus

  <b><i>  is in error as tags cross  </b></i>

whereas

  <b><i>  is not since tags nest  </i></b>  .

Breaking this rule seems harmless enough, but it does introduce some ambiguity if tags are automatically manipulated using a script; under XHTML, proper nesting is mandatory.
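A script that manipulates markup usually tracks open elements on a stack, which is where crossed tags cause trouble. The following Python sketch, built on the standard html.parser module, reports close tags that do not match the most recently opened element. The class name NestingChecker and the helper crossed_tags are hypothetical, and the sketch assumes every element in the input has an explicit close tag, so empty elements such as a bare <br> are outside its scope.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Flags close tags that cross instead of nesting."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.errors = []      # close tags that arrived out of order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # The close tag does not match the innermost open element.
            self.errors.append(tag)
            if tag in self.open_tags:
                self.open_tags.remove(tag)


def crossed_tags(markup):
    """Return the close tags that break proper nesting."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


print(crossed_tags("<b><i>is in error as tags cross</b></i>"))
print(crossed_tags("<b><i>is not since tags nest</i></b>"))
```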

Attributes Should Be Quoted

Although under traditional HTML simple attribute values did not need to be quoted, not doing so can lead to trouble with scripting. For example,

  <img src=robot.gif height=10 width=10 alt=robot>

would work fine in most browsers. Not quoting the src attribute is troublesome but should work. But what would happen if the src attribute were manipulated by JavaScript and changed to "robot 2.gif" complete with a space? This could cause a problem. Furthermore, XHTML does enforce quoting, so all attributes should be quoted like so

  <img src="robot.gif" height="10" width="10" alt="robot" />

and the empty img element would close itself with a trailing slash. Generally, it doesn't matter if single or double quotes are used, unless JavaScript is found in an attribute value. Stylistically, double quotes tend to be favored, but either way you should be consistent.
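The "robot 2.gif" problem can be reproduced with Python's html.parser, used here only as an approximation of how a lenient HTML parser reads attributes. The class and function names are assumptions for this sketch.

```python
from html.parser import HTMLParser


class FirstTagAttributes(HTMLParser):
    """Captures the attributes of the first start tag encountered."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = dict(attrs)


def first_tag_attributes(markup):
    """Return the first tag's attributes as a name-to-value dictionary."""
    parser = FirstTagAttributes()
    parser.feed(markup)
    parser.close()
    return parser.attrs


unquoted = first_tag_attributes("<img src=robot 2.gif alt=robot>")
quoted = first_tag_attributes('<img src="robot 2.gif" alt="robot" />')

# Without quotes the value stops at the first space.
print(unquoted["src"])
# With quotes the whole filename survives.
print(quoted["src"])
```

In the unquoted case the value of src ends at the space, so the image reference is silently truncated rather than reported as an error.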

Browsers Ignore Unknown Attributes and Elements

For better or worse, browsers will ignore unknown elements and attributes, so

  <bogus>  this text will display on screen  </bogus>

and markup such as

  <p id="myPara" obviouslybadattribute="TRUE">will also render fine.</p>

Browsers make best guesses at structuring malformed content and tend to ignore code that obviously is wrong. The permissive nature of browsers has resulted in a massive number of malformed HTML documents on the Web. Oddly, from many people's perspective it hasn't hurt as much as you might expect because the browsers do make sense out of the "tag soup" they find. However, such a cavalier use of the language creates documents with shaky foundations at best. Once we add other technologies such as CSS and JavaScript to the mix, brazen flouting of the rules can have repercussions and may result in broken pages. Furthermore, to automate the exchange of information on the Web we need to enforce stricter structure of our documents. The introduction of XHTML brings some hope for stability and structure of Web documents.
