Section 3.1. How Web Hosting Works | Creating Web Sites: The Missing Manual

3.1. How Web Hosting Works

As you learned in Chapter 1, the Web isn't stored on any single computer, and no company owns the Web. Instead, the individual pieces (Web sites) are scattered across millions of computers (Web servers). Only a subtle illusion makes all these Web sites seem to be part of a single environment. In reality, the Internet is just a set of standards that let independent computers talk to each other.

So how does your favorite browser navigate this tangled network of computers to find the Web page you want? It's all in the URL, the Web site address you type into your browser.

3.1.1. Understanding the URL

A URL (Uniform Resource Locator) consists of several pieces. Some of these pieces are optional, because they can be filled in by the browser or Web server automatically. Others are always required. Figure 3-1 dissects the URL http://www.SellMyJunkForMillions.com/Buyers/listings.htm.

Figure 3-1. The average URL consists of four pieces. The first part (the protocol) indicates how the page is going to be retrieved. The second part (the domain) indicates the Web server you're accessing. The third and fourth parts indicate the path and file on the Web server where the Web page is located.

Altogether, the URL packs a lot of information into one place, including:

The protocol is the way you communicate over the Web. Technically, it's the way that request and response messages are transmitted across your Internet connection. Web pages always use HTTP (Hypertext Transfer Protocol), which means the protocol is always http:// or https://. (The latter sends HTTP over an encrypted connection, which protects sensitive information you type in, like credit card numbers or passwords.) In most browsers, you can get away without typing this part of the URL. For example, when you type www.google.com, your browser automatically converts it to the full URL http://www.google.com.

Tip: Although http:// is the way to go when surfing the Web, depending on your browser you may also use other protocols for other tasks. Common examples include ftp:// (File Transfer Protocol) for uploading and downloading files and file:/// for retrieving a file directly from your own computer's hard drive.
The domain identifies the Web server, the computer that hosts the Web site you want to see. As a convention, these computers usually have names that start with www to identify them as Web servers, although this isn't always the case. As you'll discover in this chapter, the friendly-seeming domain name is really just a façade hiding a numeric address.
The path identifies the location on the Web server where the Web page is stored. This part of the URL can have as many levels as needed. For example, the path /MyFiles/Sales/2005/ refers to a MyFiles folder that contains a Sales folder that, in turn, contains a folder named 2005. Windows fans, take note: the slashes in the path portion of the URL are ordinary forward slashes, not the backward slashes used in Windows file paths (like c:\MyFiles\Current). This convention is designed to match the file paths used by Unix-based computers, which were the first machines to host Web sites. It's also the convention used in modern Macintosh operating systems (OS X and later).

Tip: Some browsers are smart enough to correct the common mistake of typing the wrong type of slash. However, you shouldn't rely on this happening, because similar laziness can break the Web pages you create. For example, if you use the <img> tag to link to an image (as demonstrated in Section 2.3.3) and you use the wrong type of slash, your picture won't appear.
The file name is the last part of the path. Often, you can recognize it by the file extension .htm or .html, both of which stand for HTML.

Tip: Web pages often end with .htm or .html, but they don't need to. Even if you look in the URL and see the extension .blackpudding, odds are you're still looking at an HTML document. In most cases, the browser ignores the extension as long as the file contains information that the browser can interpret. However, just to keep yourself sane, this is one convention that you shouldn't break.
The bookmark is an optional part of a URL that identifies a specific position in a page. You can recognize a bookmark because it always starts with the hash character (#), and is placed after the file name. For example, the URL http://www.LousyDeals.com/index.html#New includes the bookmark #New. When clicked, it takes the visitor to the section of the index.html page where the New bookmark is placed. You'll learn about bookmarks in Chapter 8.
The query string is an optional part of the URL that some Web sites use to send extra instructions from one Web page to another. You can identify the query string because it starts with a question mark (?) character, and is placed after the file name. To see a query string in action, surf to www.google.com and perform a search for "pet platypus." When you click the Search button, you're directed to a URL like http://www.google.ca/search?hl=en&q=pet+platypus&meta=. This URL is a little tricky to analyze, but if you search for the question mark in the URL you'll discover that you're on a page named "search." The information to the right of the question mark indicates that you're performing an English language search for pages that match both the "pet" and "platypus" keywords. When you request this URL, a specialized Google Web application analyzes the query string to determine what type of search it needs to perform.

Note: You won't use the query string in your own Web pages, because it's designed for heavy-duty Web applications like the one that powers Google. However, by understanding the query string, you get a bit of insight into how other Web sites work.
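
If you'd like to see these pieces pulled apart by a program, here's a minimal sketch in Python. It leans on the standard urllib.parse module, and the dissect() function and its dictionary labels are made up for this illustration (they simply reuse the names from the list above). It also assumes the last segment of the path is the file name, which is a simplification.

```python
from urllib.parse import urlsplit, parse_qs


def dissect(url):
    """Break a URL into the pieces described in this section.

    dissect() is a hypothetical helper for illustration only. It treats
    everything after the last slash in the path as the file name.
    """
    parts = urlsplit(url)
    folder, _, file_name = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "domain": parts.netloc,
        "path": folder + "/",
        "file name": file_name,
        # keep_blank_values=True keeps empty settings such as "meta="
        "query string": parse_qs(parts.query, keep_blank_values=True),
        "bookmark": parts.fragment,
    }


for example in (
    "http://www.SellMyJunkForMillions.com/Buyers/listings.htm",
    "http://www.LousyDeals.com/index.html#New",
    "http://www.google.ca/search?hl=en&q=pet+platypus&meta=",
):
    print(example)
    for name, value in dissect(example).items():
        print("   ", name, "=", repr(value))
```

In the Google example, the plus sign in pet+platypus is decoded to a space, which is how the two keywords travel inside a single query string value.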

3.1.2. How Browsers Analyze the URL

Clearly, the URL packs a lot of useful information into one place. But how does a browser actually use the URL to request the Web page you want? To understand how this works, it helps to take a peek behind the scenes (see Figure 3-2).

Figure 3-2. A simple Web request usually involves a bevy of computers contacting each other. The first computer (the DNS server) gives you the all-important IP address, allowing you to track down the second computer (the Web server), which gets you the Web page you want.

The following list of steps shows a breakdown of what the browser needs to do when you type http://www.SellMyJunkForMillions.com/Buyers/listings.htm into the address bar and hit Enter:

First, the browser needs to figure out what Web server to contact. It does this by extracting the domain from the URL.

In this example, the domain is www.SellMyJunkForMillions.com.
In order to find the Web server named www.SellMyJunkForMillions.com, the browser needs to convert the domain name into a more computer-friendly number, which is called the IP address. Every computer on the Web (Web servers and regular PCs alike) has an IP address. To find the IP address for the Web server, the browser looks up the Web server's domain name in a giant catalog called the DNS (Domain Name System).

An IP address looks like a set of four numbers separated by periods (or, in techy speak, dots). For example, the www.SellMyJunkForMillions.com Web site may have the IP address 17.202.99.125.

Note: The DNS catalog isn't stored on your computer, so your browser actually needs to grab this information from the Internet. You can see the advantage that this approach provides. In ordinary circumstances, a company's domain name will never change, because that's what customers use and remember. But an IP address may change, because the company may need to move their Web site from one Web server to another. As long as the company remembers to update the DNS, this won't cause any disruption. Fortunately, you won't need to worry about managing the DNS yourself, because that process is automatically handled for you by the company that hosts your Web site.
Using the IP address, the browser sends the request to the Web server.

The actual route that the message takes is difficult to predict. It may pass through a number of other computers (such as routers) on the way.
When the Web server receives the request, it looks at the path and file name in the URL.

In this case, the Web server sees that the request is for a file named listings.htm in a folder named Buyers. It looks up that file, and then sends it back to the Web browser. If the file doesn't exist, it sends back an error message instead.
The browser gets the HTML page it's been waiting for (the listings.htm file), and renders it for your viewing pleasure.
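
The first few steps above can be sketched in Python. This is only a rough illustration, not how any real browser is written: plan_request() and look_up_ip() are hypothetical names invented for this example, and the request text is a bare-bones HTTP/1.1 message with far fewer headers than a real browser sends. The sketch shows which part of the URL goes to the DNS (the domain) and which part goes to the Web server (the path and file name).

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def plan_request(url):
    """Return the domain to look up and the request text to send.

    Hypothetical helper: a simplified stand-in for what a browser does.
    """
    parts = urlsplit(url)
    host = parts.hostname      # step 1: extract the domain (returned in lowercase)
    path = parts.path or "/"   # with no path at all, the request is for "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, request


def look_up_ip(host):
    """Step 2: ask the DNS for an IPv4 address (needs a live Internet connection)."""
    return socket.gethostbyname(host)


host, request = plan_request(
    "http://www.SellMyJunkForMillions.com/Buyers/listings.htm"
)
print("Domain to look up:", host)
print("Request to send:")
print(request)

# The sample IP address from the text is four numbers from 0 to 255,
# which together make up a single 32-bit number.
address = ipaddress.IPv4Address("17.202.99.125")
print(address, "is the same as the number", int(address))
```

The look_up_ip() function isn't called here, because the answer depends on your Internet connection and on whether the domain really exists. Notice that the domain comes back in lowercase; that's harmless, because domain names aren't case-sensitive.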

The URL http://www.SellMyJunkForMillions.com/Buyers/listings.htm is a typical example. However, in the wild, you'll sometimes come across URLs that seem a lot simpler. For instance, consider http://www.amazon.com. It clearly specifies the domain name (www.amazon.com), but it doesn't include any information about the path or file name. So what's a Web browser to do?

When your URL doesn't include a file name, the browser just sends the request as is, and lets the Web server decide what to do. The Web server sees that you aren't requesting a specific file, and so it sends you the site's default Web page, which is often named index.htm or index.html. However, the Web administrator can configure the Web server to use any Web page file name as the default.
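
To make the Web server's side of the bargain concrete, here's a toy sketch in Python. Everything in it is an assumption made for illustration: SITE_FILES stands in for the files on a Web server's hard drive, DEFAULT_PAGE plays the role of the administrator's setting, and handle_request() is a hypothetical function, not part of any real Web server. The numbers 200 and 404 are the standard HTTP status codes for "OK" and "Not Found."

```python
# Hypothetical site contents: request path -> the HTML stored there.
SITE_FILES = {
    "/index.htm": "<html>Welcome!</html>",
    "/Buyers/listings.htm": "<html>Listings</html>",
}

# The default page name; an administrator could configure a different one.
DEFAULT_PAGE = "index.htm"


def handle_request(path):
    """Return (status code, body) for the requested path."""
    # No file name in the request? Fall back to the default page.
    if path == "" or path.endswith("/"):
        path = (path or "/") + DEFAULT_PAGE
    body = SITE_FILES.get(path)
    if body is None:
        # The file doesn't exist, so send back an error message instead.
        return 404, "Not Found"
    return 200, body


for requested in ("/", "/Buyers/listings.htm", "/Buyers/", "/nothing.htm"):
    print(repr(requested), "->", handle_request(requested))
```

A request for / gets the default page, while a request for /Buyers/ comes back with an error in this sketch, because the made-up site has no index.htm file in its Buyers folder.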

Now that you understand how URLs work, you're ready to integrate your own pages into the fabric of the Web.