HTML: The Bad Stuff


The primary complaint about HTML is that it can't perform certain formatting functions that page designers need, such as snaked columns, tab alignment (without resorting to complex table elements), and precise white-space control. Cascading style sheets (CSS) were developed to work around some of the formatting limitations of HTML, but they are add-ons to the basic functionality of HTML, and each browser handles style attributes differently.

HTML is governed by a fairly loose set of recommendations. There are very few specific rules regarding the processing of page data, and most browsers enforce the rules that do exist poorly. For example, the end tag </P> for the paragraph element <P> is optional, and because it is optional, most people don't bother to mark the ends of their paragraphs. HTML coders get comfortable with omitting end tags, the sloppiness carries over to other elements, and they end up leaving out end tags that actually are required. I once created the following tagging on an HTML page:

 <P>To find out more, <A HREF="more.htm">Go here<A>. 

The <A> tag is the element used to create a hyperlink. (A stands for anchor. Go figure.) The words "Go here" should be underlined in blue by default, and clicking the link should load the file in the browser. The tag following "here" is meant to be the end tag, so I should have written </A>. This is an example of poor coding. There is an official specification for HTML that, if enforced by Web browsers, would cause them to issue errors in cases like this and refuse to load such documents.

That doesn't happen. When I looked at the page in Microsoft Internet Explorer 4, the link looked fine, so I deployed the page to a Web site. Not long afterward, I got an e-mail addressed to the Webmaster saying that everything on the page after "Go here" appeared as a blue underlined link. The reader was using Netscape Navigator. The HTML parser in my Microsoft browser (the parser is the part that reads the document and determines its structure) recognized that a new link was starting inside an open link and ended the first link for me. My guess is that this mistake had come up before, and the browser programmers put in an exception to let this type of bad coding through. Netscape's HTML parser didn't compensate for this particular problem, so the link was never ended.
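To see how a parser can catch this mistake, here is a minimal sketch in Python (not part of any browser) that tracks open anchor elements and flags a new <A> start tag appearing inside one, which is exactly the condition my faulty markup produced:

```python
from html.parser import HTMLParser

class AnchorChecker(HTMLParser):
    """Flags an <a> start tag that appears while another <a> is still open."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0      # how many <a> elements are currently open
        self.nested_anchor = False # set when a second <a> starts inside the first

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <A> arrives here as "a"
        if tag == "a":
            if self.anchor_depth > 0:
                self.nested_anchor = True
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

def has_nested_anchor(markup):
    checker = AnchorChecker()
    checker.feed(markup)
    return checker.nested_anchor

# The bad tagging: the second <A> is read as a new start tag, not an end tag.
print(has_nested_anchor('<P>To find out more, <A HREF="more.htm">Go here<A>.'))   # True
# The corrected tagging, with a proper </A> end tag.
print(has_nested_anchor('<P>To find out more, <A HREF="more.htm">Go here</A>.'))  # False
```

A lenient browser that detects this condition can silently close the first link, as Internet Explorer did; a parser that doesn't detect it simply leaves the link open, as Navigator did.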

Why do browsers go through all this processing to accommodate bad code? Because no manufacturer wants its browser to get the reputation that it doesn't read everything on the Web. So browser manufacturers compete to see whose product can read the worst code! In this type of environment, programmers will continue to write bad code because there is no penalty for taking shortcuts. Therefore, bad code begets more bad code, and there is no end.

HTML is defined as a fixed set of tags optimized for delivering electronic documents. HTML has element names such as p (paragraph), li (list), and table, which are formatting directives. HTML does not have elements with names such as invoice number, policy type, and blood pressure. If you want to express that type of information, HTML is not the best choice. HTML can't be adapted directly to suit your needs.
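For contrast, here is a hypothetical fragment of the kind of descriptive markup HTML can't provide; the element names and values are invented for illustration, not taken from any real vocabulary:

```xml
<!-- Hypothetical markup: the element names describe the data itself -->
<Invoice>
  <InvoiceNumber>100-2345</InvoiceNumber>
  <PolicyType>Term Life</PolicyType>
</Invoice>
```

Markup like this says nothing about fonts or indents; it says what the information *is*, which is precisely what a fixed formatting vocabulary cannot do.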

Because of the limited number of elements available in earlier versions of the language, HTML coders often use tags not to describe the information but simply to achieve a certain visual effect. For example, I've used the dl (definition list) element in places where no definitions could be found, just because I know that using this element creates a left indent. I've seen the address tag used not because the coder wants to indicate the page's author (the tag's intended purpose), but because the address tag creates a line break and italicizes the contained text.
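The definition-list indent trick looks like this; nothing in the fragment is actually a term or a definition:

```html
<!-- dl/dd used purely for layout: dd is indented from the left margin -->
<DL>
  <DD>This text is indented, even though nothing here is a definition.</DD>
</DL>
```

The markup "works" visually, but any software trying to extract real definition lists from the page would be misled.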

The World Wide Web Consortium (W3C), a group of companies and organizations interested in developing and maintaining core Internet technologies, is working on a standardization of HTML that will be both rigorously enforced by client software and extensible. The specification, XHTML, will solve the problem of bad code because compliant parsers will refuse to read poorly formed documents. That's fine for new documents, but a massive amount of legacy code needs to be fixed before the entire Web is properly structured. This restructuring is unlikely to happen, so parsers will have bad-code processors for the foreseeable future. Maybe this is a job for newly unemployed Y2K programmers!
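As a sketch of what the stricter rules require, the anchor example from earlier would look like this as well-formed XHTML:

```html
<!-- Well-formed XHTML: lowercase tag names, quoted attributes,
     and every element explicitly closed -->
<p>To find out more, <a href="more.htm">Go here</a>.</p>
```

A compliant XHTML parser would reject the original version outright instead of guessing at my intent, which is the whole point of the tighter specification.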



XML and SOAP Programming for BizTalk Servers
ISBN: 0735611262
Year: 2000
Pages: 150
