HTML Syntax Basics | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

An HTML document consists of ordinary text interspersed with HTML tags. The browser uses the tags to help it format the document for display. A tag consists of text (called a directive) enclosed in angle brackets (< and >).

Depending on the function, tags are used singly or in pairs. A pair of tags indicates a region of the document that should be displayed in a particular way, for example as a header or in a distinctive type style. Most tags are used in pairs, enclosing text between a starting and ending tag. The ending tag looks just the same as the starting tag except that a forward slash (/) precedes the directive within the angle brackets. A single tag tells the browser to do something at a particular point in the document, for example, to start a new paragraph or insert a horizontal rule.

 <h1>This is the text of a header</h1>
 <p>This is text of a paragraph.

HTML is not case sensitive; that is, "<TITLE>", "<title>", and "<TiTlE>" are all equivalent. However, XHTML is case sensitive because XML is case sensitive. The XHTML standard mandates that tags be lowercase. Accordingly, it is recommended that you use lowercase for all tags.

A Minimum Document

The following HTML document shows the simplicity of the language and how easy it is to get started. Strictly speaking, the document is not legal because a couple of directives were omitted, but it will work fine with most browsers.

 <title>A minimum home page title</title>
 Hello, World. This is my first home page.

You can quickly view this page by invoking a browser and passing the filename as a command-line argument, like this:

 $ mozilla file.html

The phrase “A minimum home page title” surrounded by the <title></title> tags is the title of the document and is displayed in the top border of the window. The text “Hello, World. This is my first home page.” is the content and is displayed in the content region of the browser.

Every document should have a title. The title is displayed separately from the document and is used for identification in other contexts, for example in a browser bookmark file or personal menu bar. Some web search services archive the titles of all web pages and search for keywords contained in the titles. By choosing a title carefully, you can make it easier for others to find your page.

A Proper Minimum Document

Next, let’s make the preceding document legal by adding a little window dressing. Although browsers will not usually complain if this is omitted, you should include it to comply with the HTML specifications and for compatibility with future browsers that may not be so permissive. The first bit of window dressing that we need to look at is the Document Type Declaration.

The W3C recommends that web sites “validate”; that is, the HTML code used in a site’s documents should conform to a written W3C standard. One of the conditions for a web page to validate is that it contain the correct Document Type Declaration (also called a “Doctype”) for the kind of page that it is presenting. The Doctype names the Document Type Definition (DTD) that the page conforms to, and it essentially tells the web browser which rules to follow when rendering the page on the screen. Doctypes are considered essential to the proper rendering and functioning of web documents in modern, standards-compliant web browsers. To specify which HTML standard they conform to, all HTML documents should start with a Document Type Declaration. For example,

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

Doctype Declarations have three parts as shown here:

Start: <!DOCTYPE HTML PUBLIC
Public identifier: "-//W3C//DTD HTML 4.01//EN"
System identifier: "http://www.w3.org/TR/html4/strict.dtd">

The preceding example Doctype declares that this document conforms to the Strict DTD of the HTML 4.01 standard. The W3C’s own list of valid Doctypes is at http://www.w3.org/QA/2002/04/valid-dtd-list.html. The presence or absence of a Doctype in an HTML document may influence how a web browser will display that document.

Generally, modern web browsers have two rendering modes, “standards” mode and “quirks” mode. When a web browser loads an HTML document that is missing a Doctype, that begins with an invalid Doctype, or that begins with an HTML 3.2–4.01 “Transitional” flavor Doctype, it will attempt to render that HTML document in “quirks” mode, emulating the parsing, page rendering, and bugs of older browsers from the mid-to-late ’90s. If the HTML document begins with a valid Doctype, a modern browser will use “standards” mode and do its best to render the document according to the W3C recommendations, up to and including XHTML 1.0.

Here is the proper minimum document with the Doctype at the top:

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
 <html>
 <head>
 <title>A minimum home page title</title>
 </head>
 <body>
 Hello, World. This is my first home page.
 </body>
 </html>

The <html> directive indicates that all text up to </html> is an HTML document. The text between <head> and </head> is header information, and the text between <body> and </body> is the body or content of the document. Figure 27–1 shows how the proper minimal document is rendered in a browser.

Figure 27–1: A proper minimal HTML document

Headings

Six levels of headings are supported by HTML, numbered 1 through 6, with 1 being the most prominent. Headings are displayed larger and/or bolder than normal body text. Headings are important in a document to enhance appearance and readability. The syntax for the heading tags follows, and Figure 27–2 shows how the headings would be rendered.

Figure 27–2: Six levels of the heading tag

 <html>
 <head>
 <title>Header Examples</title>
 </head>
 <body>
 <h1>Heading level 1</h1>
 <h2>Heading level 2</h2>
 <h3>Heading level 3</h3>
 <h4>Heading level 4</h4>
 <h5>Heading level 5</h5>
 <h6>Heading level 6</h6>
 </body>
 </html>

The header level does not tell the browser how big or how bold to make the header text on an absolute scale, but only in relationship to the other header levels. This is an important concept that illustrates a basic principle of HTML. For the most part, tags in HTML describe the function that a particular text serves in the document, but they do not indicate exactly how the text should be displayed. That decision is left to the browser, perhaps with consideration for user preferences. In contrast, a typesetting language like that read by troff describes the appearance of the page down to the last detail, leaving nothing up to the typesetting program.

Paragraphs

Unlike documents in most word processors, HTML documents accord no significance to carriage returns and white space. Word wrapping can occur at any point in the document, and multiple spaces are collapsed into a single space. This means that the formatting suggested by the appearance of the HTML source file is completely ignored by the browser (with the exception of text tagged as preformatted). A nicely formatted source file, with extra space between paragraphs, indents, and line breaks, will be collapsed into a hopelessly unreadable solid block of text. Instead, you have to mark paragraph breaks with the <p> tag. The following sample document is rendered in a browser window in Figure 27–3.

 <html>
 <head>
 <title>Paragraph Break Example</title>
 </head>
 <body>
 peeped into the book her sister was reading, but it had no
 conversations in it, 'and what is the use of a book,' thought
 Alice 'without pictures or conversation?'
 <p>
 So she was considering in her own mind (as well as she could,
 for the hot day made her feel very sleepy and stupid), whether
 the pleasure of making a daisy chain would be worth the trouble.
 </body>
 </html>

Figure 27–3: The paragraph break tag in action

The <p> tag is one of the few that is not required to be used in open-close pairs.
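
Although the closing tag is optional in HTML, a paragraph can also be written as an explicit open-close pair, which is the form XHTML requires. A minimal sketch, using placeholder text rather than the sample document above:

```html
<!-- Each paragraph is explicitly opened and closed. -->
<p>This is the first paragraph.</p>
<p>This is the second paragraph.</p>
```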

Hypertext Links

The capability to link one document to another, anywhere in the world, is what sets HTML and the web apart from all predecessors. Hypertext links are the single most important factor in the incredible success of the web. (The other single most important factor is the integration of dissimilar services into one consistent user interface.) Links are described this way:

 <a href="target_url">link text</a>

The address of the document that is being linked to is indicated by “target_url”. The phrase “link text” is displayed in a distinctive style, such as in a contrasting color or underlined, indicating that it is a hyperlink.

The browser follows the hyperlink to “target_url” when this link text is clicked with the mouse or otherwise selected. The tag name comes from the notion of an “anchor” for the hyperlink. Here is an example of a hyperlink:

 <a href="http://www.foobar.com">Visit the FooBar home page.</a>

You can specify an image for a hyperlink instead of text with the following:

 <a href="http://www.foobar.com"><img src="logo.gif"></a>

Here the image described by the file logo.gif will be displayed with a distinctive border that indicates it is a hyperlink. Clicking anywhere in the image will follow the link.

Inline Images

Inline images are indicated in HTML with the <img> tag as follows:

 <img src=file_path>

where file_path is the name of the image file relative to the root of the server’s directory hierarchy.

If the image reference appears on a page accessed with a user’s URL (i.e., a URL including ~user in the document path), file_path is relative to the user’s web directory hierarchy.

By default the bottom of an image is aligned with the adjacent text. Include “align=top” if you want the top of the image aligned with the adjacent text, like this:

 <img align=top src=logo.gif>

Several image formats are in common use, for example, .gif and .jpg. However, not all browsers support all formats. Unless you know that your target audience uses only one type of browser, you may be better off using only .gif- or .jpg-format inline images. Like everything else about the web, the image formats supported by specific browsers are likely to change by the time you read this, so look for up-to-the-minute information before committing to a particular format.

Images can add a lot to the visual appeal of a document, but on slow links such as serial modems (yes, they are still used) they can also be frustrating because of the amount of information that has to be sent to describe the image. There are a few things that you can do to improve performance when using images. Modems have the capability to compress the data they transfer. The amount of compression attained depends on the degree of randomness in the data; completely random data cannot be compressed. A simple image with a small number of colors will transfer significantly faster than a complex image with many colors and a lot of detail (such as photos). Of course the size is a factor as well but less so than image detail. Most browsers cache images on the local disk drive. This means that an image only has to be transferred on the first reference; thereafter, the browser obtains it from the local cache. You can take advantage of caching by keeping the number of different images to a minimum. For example, if your documents include navigation icons (e.g., home, next, previous) on each page, use the same ones on all pages. In other words, don’t use different images for the “next” icon on each of your pages.

Image Maps

The coordinates of the mouse position within a hyperlink image are sent to the server if the “ismap” directive is included in the <img> tag:

 <a href="http://page1.html"><img src=logo.gif ismap></a>

The coordinates are sent along with the hypertext reference when the mouse is clicked. This is a powerful feature that makes it possible for the server to customize the response according to the position in an image where the mouse is clicked. For example, the mouse coordinates in the image of a control panel would indicate which control button was selected. In an image of a geographic map, the mouse coordinates might indicate a region of interest to the user.

Processing “ismap” requests at the server may require system administrator access to the web server’s designated CGI-BIN directory and server configuration files.
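
As a rough illustration of what the server receives: under HTML 4.01, the browser appends a question mark and the x,y pixel coordinates of the click, measured from the top left corner of the image, to the URL of the link. The handler path /cgi-bin/panel and the image file panel.gif below are hypothetical names chosen only for this sketch.

```html
<!-- Hypothetical server-side handler and image file. -->
<a href="/cgi-bin/panel"><img src="panel.gif" ismap></a>

<!-- A click 10 pixels from the left edge and 27 pixels from the top
     of the image requests: /cgi-bin/panel?10,27 -->
```

The program behind the link can then compare the two numbers against the known positions of the buttons or regions in the image.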

Named Anchors

A hyperlink ordinarily takes you to the top of the page of the new document. You can also link to a specific section within a document so that the section is displayed when the link is followed. This can be useful when linking from one document to a section within a large document or from a table of contents or index to other sections within the same document. First, define the points within the document that you are linking to, like this:

 <a name=anchor_name>Associated Text</a>

“Associated Text” will appear at or near the top of the document when the link is followed to it. However, it is not displayed in a distinctive style because it is the destination of a link, not the origin of a link. Next, create a link to the target document and section as shown here:

 <a href="http://www.foobar.com/big_page.html#anchor_name">HyperLink Text</a>

The term “anchor_name” is the binding text and appears in the URL separated from the pathname with a “#” symbol. If the origin and destination of a named anchor hyperlink are within the same document, only the anchor name is needed in the link, as shown here:

 <a href="#anchor_name">HyperLink Text</a>
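
Putting the two halves together, a small table of contents that jumps to sections of the same document might look like the following sketch. The anchor names intro and details, and all of the text, are made up for illustration.

```html
<html>
<head>
<title>Named Anchor Example</title>
</head>
<body>
<!-- Origins: links that jump to sections of this same document -->
<p>
<a href="#intro">Introduction</a><br>
<a href="#details">Details</a>

<!-- Destinations: named anchors marking the start of each section -->
<h2><a name="intro">Introduction</a></h2>
<p>Text of the introduction.

<h2><a name="details">Details</a></h2>
<p>Text of the details section.
</body>
</html>
```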

Lists

Several types of lists are supported by the HTML language. All lists start with an opening tag and end with a closing tag, and all elements in the list are marked with an item tag. Lists can be arbitrarily nested. A list item can contain a list. A single list item can also include a number of paragraphs, each containing additional lists. List presentation varies from browser to browser. Some may provide successive levels of indent for nested lists or vary the bullets used with unnumbered lists.

Unordered Lists

The exact presentation of an unordered list is browser-specific and might include bullets, dashes, or some other distinctive icon. Start the list with <ul>, precede each list item with <li>, and end the list with </ul>. Figure 27–4 shows how the list is rendered.

 <html>
 <head>
 <title>An Unordered List</title>
 </head>
 <body>
 <ul>
 <li>Alice
 <li>Rabbit
 <li>Dinah
 </ul>
 </body>
 </html>

Figure 27–4: An unordered list

Ordered Lists

Items in an ordered list are preceded by a number indicating the position of the item. The browser chooses the numbers, so you never have to maintain them as you modify the list. Numbers start at 1 at the beginning of each list.

Start the list with <ol>, precede each list item with <li>, and end the list with </ol>. The following HTML incorporates an ordered list (with the results shown in Figure 27–5):

 <html>
 <head>
 <title>An Ordered List</title>
 </head>
 <body>
 <ol>
 <li>Alice
 <li>Rabbit
 <li>Dinah
 </ol>
 </body>
 </html>

Figure 27–5: An ordered list
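
Because a list item can itself contain a list, the two list types can be combined. The following is a minimal sketch of an ordered list nested inside an item of an unordered list; the item text is invented for illustration, and the exact indentation and bullet style will vary by browser.

```html
<ul>
<li>Characters
    <!-- The nested list belongs to the "Characters" item. -->
    <ol>
    <li>Alice
    <li>Rabbit
    </ol>
<li>Places
</ul>
```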

Descriptive Lists

A descriptive list (called a definition list in the HTML 4.01 specification) consists of an item name followed by a definition or description. Start the list with <dl>, precede the item name with <dt> and the item definition with <dd>, and end the list with </dl>. The following HTML incorporates a descriptive list (with the results shown in Figure 27–6):

 <html>
 <head>
 <title>A Descriptive List</title>
 </head>
 <body>
 <dl>
 <dt>Alice
 <dd>Alice is the main character in the book.
 <dt>Rabbit
 <dd>The Rabbit led Alice down the rabbit hole.
 <dt>Dinah
 <dd>Dinah was Alice's cat.
 </dl>
 </body>
 </html>

Figure 27–6: A descriptive list

Phrase Markup

In page layout it is common to use a distinctive style of type, border, indent, and other typographic features to convey the logical function of document sections and to provide visual discrimination between sections. HTML includes definitions for many logical styles likely to be found in technical documentation, including source code, sample text, keyboard phrases (i.e., something you type), variable phrases (i.e., a generic prototype for information you supply), citation phrases, and typewriter text.

Although HTML includes the definitions for many logical styles, it is up to the browser to display each in a distinctive way. Some do, some don’t, and what they do depends on the browser. Certain browsers display source code, sample text, keyboard phrases, and typewriter text all in the same typeface, and other phrases in different typefaces. So the text here,

 <html>
 <head>
 <title>Phrase Markup Examples</title>
 </head>
 <body>
 <code>code - Source code phrase</code><br>
 <samp>samp - Sample text or characters</samp><br>
 <kbd>kbd - Keyboard phrase</kbd><br>
 <var>var - Variable phrase</var><br>
 <cite>cite - Citation phrase</cite><br>
 <em>em - Emphasized Phrase</em><br>
 <strong>strong - Strong Emphasis</strong><br>
 </body>
 </html>

displays as shown in Figure 27–7.

Figure 27–7: Phrase markup

You may also indicate certain typographic features by physical style, such as bold, italic, or typewriter text:

 <html>
 <head>
 <title>Physical Style Examples</title>
 </head>
 <body>
 <b>b - Bold Text</b><br>
 <i>i - Italic Text</i><br>
 <tt>tt - Typewriter Text</tt><br>
 </body>
 </html>

as shown in Figure 27–8.

Figure 27–8: Physical style markup

Preformatted Text

Sometimes you may want to prevent the browser from mangling your document and instead display it just as it appears in your source file. For example, a section of C code, carefully indented and commented, would ordinarily be rendered unreadable by the browser.

The browser will preserve the layout of text enclosed between <pre> and </pre>, including all spaces, tabs, and newlines:

 <html>
 <head>
 <title>Preformatted Text Example</title>
 </head>
 <body>
 <pre>
     main()
     {
             printf( "Hello, world\n" );
     }
 </pre>
 </body>
 </html>

as shown in Figure 27–9.

Figure 27–9: Preformatted text

Comments

Comments begin with “<!--” and end with “-->”. They are useful for including nondisplayed annotations in HTML source and for temporarily suppressing the display of a section of source.

 <!-- This is an HTML comment. -->

Line Breaks

Because the browser ignores the format or layout of the HTML source file, you must specify line breaks explicitly with the <br> tag. Unlike a paragraph tag (<p>), the line break tag does not add any extra space. The following code,

 <html>
 <head>
 <title>Line Break Example</title>
 </head>
 <body>
 peeped into the book her sister was reading, but it had no
 conversations in it, 'and what is the use of a book,' thought
 Alice 'without pictures or conversation?'
 <br>
 So she was considering in her own mind (as well as she could,
 for the hot day made her feel very sleepy and stupid), whether
 the pleasure of making a daisy-chain would be worth the trouble.
 </body>
 </html>

produces the page shown in Figure 27–10.

Figure 27–10: A line break

Horizontal Rules

The <hr> tag produces a break in the text and a horizontal rule the width of the browser’s window. Use it to separate document sections.
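
Like <br>, the <hr> tag is used singly. A minimal sketch of a rule separating two sections, with made-up headings and text:

```html
<h2>Section One</h2>
<p>Text of the first section.

<!-- The rule is drawn across the window between the two sections. -->
<hr>

<h2>Section Two</h2>
<p>Text of the second section.
```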

Forms

A form provides a mechanism to collect data from a user viewing your web page. Using a variety of devices such as text boxes, menus, check boxes, and radio buttons, a user can enter data into the form and click a Submit button to send the data back to a server for processing. Here is an example of an HTML form, and the resulting page is shown in Figure 27–11:

 <html>
 <head>
 <title>Forms Example</title>
 </head>
 <body>
 <form>
 <input name=name10 type=text value="initial value"> text 1
 <input name=name11 type=text> text 2
 <input name=name12 type=text> text 3
 <hr>
 <input name=name2 type=checkbox> checkbox 1
 <input name=name2 type=checkbox> checkbox 2
 <input name=name2 type=checkbox> checkbox 3
 <hr>
 <input name=name3 type=radio> radio 1
 <input name=name3 type=radio> radio 2
 <input name=name3 type=radio> radio 3
 <hr>
 <select name=sel>
 <option value=sel1> selection 1
 <option value=sel2> selection 2
 <option value=sel3> selection 3
 </select>
 <hr>
 <textarea name=txt1 rows=5 cols=40> This is default textarea input </textarea>
 <hr>
 <input name=sub1 type=submit>
 <input name=sub2 type=reset>
 </form>
 </body>
 </html>

Figure 27–11: Example of an HTML form
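
The example form does not say where the data should go. In practice, the <form> tag usually carries an action attribute naming the server program that receives the data and a method attribute selecting how it is sent. The following is only a minimal sketch; the handler /cgi-bin/register and the field name visitor are hypothetical.

```html
<!-- /cgi-bin/register is a hypothetical server-side program. -->
<form action="/cgi-bin/register" method="post">
Name: <input name="visitor" type="text">
<input type="submit" value="Send">
<input type="reset" value="Clear">
</form>
```

With method="post" the field values travel in the body of the request; with method="get" they are appended to the action URL after a question mark.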