2.2 A Packet's Tour of the Web

The easiest way to explain the functioning of the Web today is to explore what happens when you start a web browser and attempt to view a page on the Internet.

2.2.1 Booting Up Your PC

Every computer[5] manufactured today is equipped with a small memory chip that holds its information even when the computer is turned off. When you turn on your computer, the computer's microprocessor starts executing a small program that is stored on this memory chip. The program is called the computer's Basic Input Output System, or BIOS. The BIOS has the ability to display simple information on the computer's screen, to read keystrokes from the keyboard, to determine how much memory is in the computer, and to copy the first few blocks of the computer's disk drive into memory and execute them.

[5] Most people on the Web today are using Windows-based computers, so this example will use them as well. However, much of what is said here about Windows-based computers applies in approximately the same way to computers running Unix or Linux operating systems, Macintosh computers, and a variety of other systems.

The first few blocks of the computer's hard drive contain a program called the bootstrap loader.[6] The bootstrap loader reads in the first part of the computer's operating system from storage (on a disk or CD-ROM), which loads in the rest of the computer's operating system, which starts a multitude of individual programs running. Some of these programs configure the computer's hardware to run the operating system, others perform basic housekeeping, and still others are run for historical reasons, whether to assure compatibility with previous generations of operating systems or because developers at the company that created the operating system forgot to take the programs out of the system before it was shipped.

[6] The phrase bootstrap loader comes from the expression "to pull oneself up by one's bootstraps." Starting a computer operating system is a tricky thing to do, because the only way to load a program into a computer's memory is by using another program, and when the computer starts operating it doesn't have a program in memory.

The computer may finally prompt you for a username and password. This information is used to "log in" (authenticate) to the computer; that is, to set up the computer for a particular user's preferences and personal information, and possibly to gain access to network resources shared with other computers local to your organization. Finally, the computer starts up a graphical user interface, which displays the computer's desktop. Simson's desktop computer's screen is shown in Figure 2-4.

Figure 2-4. When your operating system starts up, the computer doesn't know anything about the Internet
figs/wsc2_0204.gif

2.2.2 PC to LAN to Internet

What your computer knows about the Internet after it boots depends on how your computer is connected to the Internet. A computer network is a collection of computers that are physically and logically connected together to exchange information. To exchange information with other computers in an orderly fashion, each computer has to have a unique address on each network where it has a direct connection. Addresses on a computer network are conceptually similar to telephone numbers on the telephone network.

A large number of people who use the Internet today do so using a dial-up connection and the Point-to-Point Protocol (PPP). Their computers use a device called a modem for dialing the telephone and communicating with a remote system. Other forms of connection, including DSL[7] and ISDN,[8] also use a modem to connect to remote systems using telephone lines.

[7] Digital Subscriber Line (originally Digital Subscriber Loop), a technology for providing high-speed data connections on ordinary telephone lines.

[8] Integrated Services Digital Network, another technology for providing data connections on telephone lines.

If your computer accesses the Internet by dialing a modem, your computer probably does not have an address assigned until it actually connects to the network.[9] If your computer is connected to a local area network (LAN) at an office or in many homes, then your computer probably has an IP address and the ability to transmit and receive packets on the LAN at any time.

[9] These simple generalizations are frequently more complex in practice. For example, some Internet service providers assign permanent addresses to dialup users. Others are using dialup protocols such as PPP over local area networks, a technique called Point-to-Point Protocol over Ethernet, or PPPoE. The users have the appearance of "dialing" the Internet, but their computers are in fact always connected.

2.2.2.1 Dialing up the Internet

Computers that use dial-up Internet connections have two distinct modes of operation. When they start up, they are not connected to the Internet. To connect to the Internet, the computer runs a program called a dialer. The dialer sends codes to the computer's modem that cause the modem to go "off hook," dial the phone, and initiate a data connection with the remote system.

When the modem of the local computer can exchange information with the remote system, the dialer starts the PPP subsystem. PPP is a sophisticated protocol that lets two computers that are on either end of a communications link exchange packets. First, PPP determines the capabilities of the system on each end. When dialing up an Internet service provider, the next thing that typically happens is that the called system asks the calling system to provide a valid username and password. If the username and password provided are correct, the called system will assign the caller an IP address. From this point on, the two computers can exchange IP packets as if they were on the same network.
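The sequence just described (negotiate capabilities, check the username and password, then assign an address) can be sketched as a small state machine. This is a toy model, not an implementation of PPP: the phase names, the `ACCOUNTS` table, and the address pool are all assumptions invented for illustration.

```python
from enum import Enum, auto


class Phase(Enum):
    DEAD = auto()          # no link: the modem is not connected
    NEGOTIATE = auto()     # both ends determine each other's capabilities
    AUTHENTICATE = auto()  # the called system checks username and password
    NETWORK = auto()       # the caller has an IP address and can exchange packets


# Hypothetical account database and address pool held by the called system.
ACCOUNTS = {"simson": "secret"}
ADDRESS_POOL = ["192.168.0.50", "192.168.0.51"]


def dial_up(username, password):
    """Return (list of phases visited, assigned IP address or None)."""
    trail = [Phase.DEAD, Phase.NEGOTIATE, Phase.AUTHENTICATE]
    if ACCOUNTS.get(username) != password:
        trail.append(Phase.DEAD)       # bad credentials: the link is dropped
        return trail, None
    trail.append(Phase.NETWORK)        # credentials accepted: hand out an address
    return trail, ADDRESS_POOL[0]


trail, address = dial_up("simson", "secret")
print([phase.name for phase in trail], address)
```

A failed login ends back in the `DEAD` phase with no address, which mirrors the point in the text that the caller only receives an IP address after the username and password are accepted.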

2.2.2.2 Connected by LAN

Computers that are connected to the Internet by a LAN typically do not need to "dial up the Internet," because they are already connected. Instead, computers with permanent connections usually have preassigned addresses; when they wish to send a packet over the Internet, they simply send the packet over the LAN to the nearest gateway.[10] When the gateway receives a packet, it retransmits the packet along the desired path.

[10] Some computers on a LAN use the Dynamic Host Configuration Protocol (DHCP) to get a dynamically assigned IP address when they start up or before they need to use the network.
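As a rough illustration of the "send it to the gateway" decision, the sketch below uses Python's standard `ipaddress` module. The network `192.168.0.0/24` and the gateway address `192.168.0.1` are assumptions chosen for the example, not values taken from any real configuration.

```python
import ipaddress

# Hypothetical LAN configuration for a desktop machine.
LOCAL_NETWORK = ipaddress.ip_network("192.168.0.0/24")
GATEWAY = ipaddress.ip_address("192.168.0.1")


def next_hop(destination):
    """Return the address that the packet is handed to on the LAN."""
    dest = ipaddress.ip_address(destination)
    if dest in LOCAL_NETWORK:
        return dest        # same LAN: deliver directly to the destination
    return GATEWAY         # anywhere else: hand the packet to the gateway


print(next_hop("192.168.0.3"))   # a neighbor on the LAN
print(next_hop("192.0.34.65"))   # an address out on the Internet
```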

2.2.2.3 The Walden Network

Most of the examples in this book are based on the Walden Network, a small local area network that Simson created for his house on Walden Street in Cambridge.

The Walden Street network was built with several computers connected to a 10/100 Ethernet hub. Every computer in the house has an Ethernet interface card, and each card has an IP address on the home LAN. The computers include Simson's desktop computer, his wife's computer, and a small server in their living room that they also call Walden.[11]

[11] Gene also has an Ethernet LAN in his house, along with a wireless network. However, his wife won't let him put a computer in the living room. Yet.

Walden the computer is a rack-mounted PC running FreeBSD, a Unix-like operating system that's similar to Linux and Sun's Solaris. Walden has a pair of 75-gigabyte hard drives on which Simson stores all of his files. The computer actually has three Ethernet cards; one has an IP address on the network that communicates with the cable modem, one has an IP address on the network that communicates with the DSL bridge, and the third has an IP address for the internal LAN. The computer sends and receives email on the Internet. Walden also runs the web server that houses Simson's home page, http://www.simson.net/. Finally, Walden is a firewall: it isolates the home network from the Internet at large. Figure 2-5 shows a schematic picture of the Walden network.

Figure 2-5. Simson's home network
figs/wsc2_0205.gif

As we hinted previously, the Walden network actually has two connections to the Internet. The primary connection is a cable modem. The cable modem is pretty fast: in Simson's testing, he got roughly 600 kilobits per second (kbps) in both directions. This is nearly 20 times faster than a traditional dial-up modem. The second connection is a DSL line. The DSL line is slower (it runs at only 200 kbps) and it costs more than three times as much ($150/month, versus $41.50/month).

So why have the DSL line? The most important reason has to do with Internet addressing. Simson's cable company gives him what's called a dynamically assigned IP address. This means that the IP address that his computer gets from the cable modem can change at any time. (In fact, it seems to change about once every two or three months.) For the DSL line, he pays extra money to get a static IP address, meaning that the IP address does not change.

But there is another reason why Simson has two Internet connections. That reason is redundancy. In the two years since he has had his cable modem, he's lost service on more than a dozen occasions. Usually lost service was no big deal, but sometimes it was a major inconvenience. And when he started running his own mail server, losing service resulted in mail being bounced. This caused a huge problem for him and, to make matters worse, the cable modem company would never answer his questions or tell him why the service was down or when service would be restored. In the end, Simson decided that if he wanted to have a reliable service, he would need to pay for that reliability the same way that big companies do: by obtaining redundant Internet connections from separate companies, and then designing his network so that if one connection went down, the other connection would be used in its place.[12]

[12] One complication of having multiple Internet service providers is dealing with multiple IP addresses. Each ISP will typically give their customers IP addresses particular to the ISP. (Large organizations handle the complexity by getting their own IP address space and having this space "announced" by their upstream providers.) On the Walden Street network, this issue is resolved by putting both sets of IP addresses on the Walden computer and using DNS techniques so that incoming connections arrive through the appropriate network.

We've gone into this level of detail because the Walden network is actually quite similar to the networks operated by many small businesses. When the computer on Simson's desktop creates a packet bound for the outside network, it puts the packet on the home LAN. This packet is picked up by the gateway, which retransmits the packet out to the Internet. When a packet comes back from the Internet, the cable modem or the DSL modem transmits the packet to the gateway, and from there to the LAN, and from the LAN to his desktop.

2.2.3 The Domain Name System

There's a very important part of the Internet's infrastructure that hasn't been discussed until now: the machinery that converts Internet domain names that people use (e.g., www.aol.com or www.simson.net) into the IP addresses that the Internet's underlying transport network can use (e.g., 205.188.160.121 or 64.7.15.234). The machinery that does this is called the Domain Name System (DNS), sometimes referred to as the Domain Name Service.

Think of the DNS as a huge phone book that lists every computer that resides on the network. But whereas a traditional phone book has a list of names and phone numbers, the DNS has a list of computer names and IP addresses.

For example, the internal phone book for a small organization might look like this:

    Front desk                     617-555-1212
    Fax                            x101
    Mail Room                      x102
    Publications                   x103

The DNS for that same organization might look like this:

    192.168.0.1                    router.company.com
    192.168.0.2                    faxserver.company.com
    192.168.0.3                    mail.company.com
    192.168.0.4                    www.company.com

In this example, the organization's domain is company.com, and the names router, faxserver, mail and www represent computers operating within that domain.[13]

[13] This example, and most of the other examples in this book, draws its IP addresses from the set of unroutable IP addresses that were established by the Internet Engineering Task Force in RFC 1918. These IP addresses are designed for test networks. Increasingly, they are also used by organizations for internal addressing behind a firewall or NAT (Network Address Translation) converter. The three blocks of unroutable IP addresses are 10.0.0.0-10.255.255.255, 172.16.0.0-172.31.255.255, and 192.168.0.0-192.168.255.255.
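A quick way to check whether an address falls inside one of the three RFC 1918 blocks listed in the footnote is sketched below with Python's standard `ipaddress` module. The function name `is_rfc1918` is our own; the prefix lengths (/8, /12, /16) are simply the CIDR form of the ranges given above.

```python
import ipaddress

RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),       # 10.0.0.0-10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),    # 172.16.0.0-172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),   # 192.168.0.0-192.168.255.255
]


def is_rfc1918(address):
    """Return True if the address lies in one of the three private blocks."""
    addr = ipaddress.ip_address(address)
    return any(addr in block for block in RFC1918_BLOCKS)


for candidate in ("192.168.0.4", "172.16.10.100", "172.32.0.1", "205.188.160.121"):
    print(candidate, is_rfc1918(candidate))
```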

It's easy to get confused between local area networks and domains, because computers on the same local area network are usually in the same domain, and vice versa.

But this doesn't have to be the case. For example, the company in the previous example might pay for a large ISP to run its web server. In this case, the IP address for the web server would probably be on a different local area network, and the DNS table for the organization might look like this instead:

    192.168.0.1                    router.company.com
    192.168.0.2                    faxserver.company.com
    192.168.0.3                    mail.company.com
    172.16.10.100                  www.company.com
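Viewed as data, a table like this one is just a mapping from names to addresses. The sketch below models it as a Python dictionary; the `lookup` helper and its handling of case and a trailing dot are illustrative assumptions, not a description of how any particular nameserver stores its records.

```python
# The table from the text, keyed by host name.
HOSTS = {
    "router.company.com": "192.168.0.1",
    "faxserver.company.com": "192.168.0.2",
    "mail.company.com": "192.168.0.3",
    "www.company.com": "172.16.10.100",
}


def lookup(name):
    """Return the IP address for a host name, or None if it is not listed."""
    # Host names are not case sensitive, and a trailing dot is optional.
    return HOSTS.get(name.lower().rstrip("."))


print(lookup("www.company.com"))
print(lookup("nosuch.company.com"))
```

Note that nothing in the mapping requires the four hosts to share a network: the web server's address sits in a different block from the other three, just as the text describes.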

DNS wasn't part of the original Internet design. For the first ten years of operation, the Internet was small enough that all of the host names and IP addresses could fit in a single file, aptly named HOSTS.TXT. But by the mid-1980s this file had grown to be tens of thousands of lines long, and it needed to be changed on a daily basis. The Internet's creators realized that instead of having all of the hosts located in a single file, the network itself needed to provide some sort of host name resolution service.

Instead of a single centralized host service, the Internet uses a distributed database. This system allows each organization to be responsible for its own piece of the global namespace. The system is based upon programs called nameservers that translate Internet host names (e.g., www.company.com) to IP addresses (e.g., 172.16.10.100).

2.2.3.1 How DNS works

Host name resolution is the process by which computers on the Internet translate Internet host names to IP addresses. When your computer needs to translate a host name to an IP address, it creates a special kind of packet called a DNS request and sends the packet to a nearby nameserver; your system is normally configured (manually or dynamically during dialup) to know of several nameservers.

The nameserver that your computer contacts will sometimes know the answer to the request. If so, the DNS response is sent back to your computer. If the nameserver doesn't have the information, it will usually attempt to discover the answer and then report it back to your system.

Let's see how this works in practice. To do this, we'll use the ping program on a Windows desktop PC and the tcpdump network monitoring tool on a Unix server. The ping program sends out an ICMP ECHO packet to the remote system. The Unix tcpdump program displays packets as they move across the local area network.

For example, let's say we wish to ping the web server at the Internet Corporation for Assigned Names and Numbers (ICANN). (We'll discuss ICANN in some detail in the last section of this chapter.) From the Windows desktop, we might type:

    C:\>ping www.icann.org
    Pinging www.icann.org [192.0.34.65] with 32 bytes of data:
    Reply from 192.0.34.65: bytes=32 time=101ms TTL=235
    Reply from 192.0.34.65: bytes=32 time=99ms TTL=235
    Reply from 192.0.34.65: bytes=32 time=100ms TTL=235
    Reply from 192.0.34.65: bytes=32 time=99ms TTL=235
    Ping statistics for 192.0.34.65:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 96ms, Maximum =  104ms, Average =  100ms
    C:\>

We can watch the packets as they flow across the local area network using the program tcpdump:

    21:07:38.483377 desk.nitroba.com.2001 > walden.domain:  1+ A? www.icann.org. (31)
    21:07:39.986319 desk.nitroba.com.2001 > walden.domain:  1+ A? www.icann.org. (31)
    21:07:40.196762 b.domain > desk.nitroba.com.2001:  1* 1/6/9 A www.icann.org (365)
    21:07:40.202501 desk.nitroba.com > www.icann.org: icmp: echo request
    21:07:40.303782 www.icann.org > desk.nitroba.com: icmp: echo reply
    21:07:41.206051 desk.nitroba.com > www.icann.org: icmp: echo request
    21:07:41.304950 www.icann.org > desk.nitroba.com: icmp: echo reply
    21:07:42.210855 desk.nitroba.com > www.icann.org: icmp: echo request
    21:07:42.310785 www.icann.org > desk.nitroba.com: icmp: echo reply
    21:07:43.215661 desk.nitroba.com > www.icann.org: icmp: echo request
    21:07:43.315358 www.icann.org > desk.nitroba.com: icmp: echo reply

Each line of this printout corresponds to a packet sent across the local area network. Each column contains specific information about each packet. Let's explore the contents of the first line to understand the columns. Here it is:

21:07:38.483377 desk.nitroba.com.2001 > walden.domain:  1+ A? www.icann.org. (31)

The columns represent:

  • The time (21:07:38.483377 on the first line corresponds to almost half a second past 9:07:38 p.m.).

  • The computer that transmitted the packet (desk.nitroba.com is the desktop computer).

  • The port from which the packet was sent. (In this case, port 2001. This port was randomly chosen by the desktop computer and has no special significance; its sole purpose is to allow replies from the domain server to be sent to the correct program.)

  • The computer to which the packet was sent (walden).

  • The port on the receiving computer (in this case, the port is domain, the port of the DNS nameserver).

  • The contents of the packet (1+ A? indicates a DNS request for an A record, which is the type of record that relates host names to IP addresses).
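The columns listed above can be pulled apart mechanically. The following sketch uses a regular expression of our own devising; it assumes the line has the "host.port > host.port: contents" shape of the first line of the printout, so it would not match the ICMP lines, which carry no port numbers.

```python
import re

LINE = "21:07:38.483377 desk.nitroba.com.2001 > walden.domain:  1+ A? www.icann.org. (31)"

# The last dot-separated component of each endpoint is taken to be the port.
PATTERN = re.compile(
    r"(?P<time>\S+) "
    r"(?P<src_host>\S+)\.(?P<src_port>[^.\s]+) > "
    r"(?P<dst_host>\S+)\.(?P<dst_port>[^.\s:]+): +"
    r"(?P<contents>.*)"
)

fields = PATTERN.match(LINE).groupdict()
for name, value in fields.items():
    print(f"{name:10} {value}")
```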

With this information, we can now decode the output from the tcpdump command. In the first line, a DNS request is sent from the desktop to walden. Walden doesn't respond to this request, and 1.5 seconds later another request is sent. This time Walden responds with an A record containing the IP address of www.icann.org. After this response from the nameserver, Walden is no longer directly involved in the communications. Instead, the desktop computer sends www.icann.org four ICMP ECHO REQUEST packets, and www.icann.org sends back four ICMP ECHO REPLY packets.
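To make the DNS request less abstract, here is a minimal sketch that assembles a query for an A record by hand, following the standard DNS message layout (a 12-byte header followed by one question). The function name `build_query` is ours. We read the "1+" in the trace as query ID 1 with the "recursion desired" flag set, and the "(31)" as the size of the query in bytes; the packet built here comes out to the same length.

```python
import struct


def build_query(name, query_id=1):
    """Build a DNS query for an A record with 'recursion desired' set."""
    header = struct.pack(
        "!HHHHHH",
        query_id,   # identifier, echoed back in the response
        0x0100,     # flags: standard query, recursion desired
        1,          # one question follows
        0, 0, 0,    # no answer, authority, or additional records
    )
    question = b""
    for label in name.rstrip(".").split("."):
        # Each label is stored as a length byte followed by its characters.
        question += bytes([len(label)]) + label.encode("ascii")
    question += b"\x00"                     # zero-length label ends the name
    question += struct.pack("!HH", 1, 1)    # type A, class IN
    return header + question


packet = build_query("www.icann.org")
print(len(packet), packet.hex())
```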

Although this seems complicated, what is actually going on behind the scenes is even more complicated. Look again at the first two lines. When desk asks walden for the IP address of www.icann.org, walden doesn't know the answer. But rather than replying "I don't know," walden attempts to find the answer. When the second request comes in, walden is deep in a conversation with a number of other computers on the Internet to find out precisely where www.icann.org resides. A few tenths of a second later, the conversation is finished and Walden knows the answer, which it then reports back to desk.

When walden's nameserver starts up, all it knows is the addresses of the root nameservers. When the request comes in for www.icann.org, walden's nameserver goes to the root nameserver and asks for the address of a nameserver that serves the org domain. walden then contacts that nameserver, asking for a nameserver that knows how to resolve the icann.org address. Finally, walden contacts the icann.org nameserver, asking for the address of www.icann.org.
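The chain of referrals just described can be mimicked with a few dictionaries. This is only a simulation: the server names `ns.org.example` and `ns.icann.example` are invented placeholders, and a real nameserver exchanges DNS packets over the network rather than consulting local tables.

```python
# Hypothetical delegation data; only the final address comes from the text.
ROOT = {"org": "ns.org.example"}
SERVERS = {
    "ns.org.example": {"icann.org": "ns.icann.example"},
    "ns.icann.example": {"www.icann.org": "192.0.34.65"},
}


def resolve(name):
    """Follow referrals from the root down to the host's own domain."""
    labels = name.split(".")
    steps = []
    # Step 1: ask a root nameserver who serves the top-level domain.
    server = ROOT[labels[-1]]
    steps.append(("root", labels[-1], server))
    # Remaining steps: ask each server in turn about a longer suffix.
    for i in range(len(labels) - 2, -1, -1):
        suffix = ".".join(labels[i:])
        answer = SERVERS[server][suffix]
        steps.append((server, suffix, answer))
        server = answer
    return server, steps


address, steps = resolve("www.icann.org")
for asked, question, answer in steps:
    print(f"asked {asked} about {question}: {answer}")
print("address:", address)
```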

2.2.4 Engaging the Web

When we type the address http://www.icann.org/ into the web browser, we instruct the web browser to fetch the home page of the ICANN web server. As shown in the previous example, the first thing that the computer needs to do to reach a remote web server is to learn the IP address of the web server.

Once the IP address of the remote machine is known, the desktop computer attempts to open up a TCP/IP connection to the web server. This TCP/IP connection can be thought of as a two-way pipe: the connection allows the computer to send information to the remote system and to receive a response.
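The "two-way pipe" can be tried out without leaving your own machine. The sketch below opens a connection over the loopback interface (127.0.0.1) and sends a few bytes in each direction; the messages are arbitrary, and the operating system chooses both port numbers, so they will differ from run to run.

```python
import socket

# Listen on the loopback interface; port 0 asks the OS to pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_port = server.getsockname()[1]

# Open the connection; the OS also chooses the client's source port.
client = socket.create_connection(("127.0.0.1", server_port), timeout=5)
conn, peer = server.accept()
conn.settimeout(5)
client_port = client.getsockname()[1]

# Data flows in both directions over the same connection.
client.sendall(b"hello")
received_by_server = conn.recv(1024)
conn.sendall(b"world")
received_by_client = client.recv(1024)

print("client port:", client_port, "as seen by server:", peer[1])
print(received_by_server, received_by_client)

for s in (conn, client, server):
    s.close()
```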

Opening a TCP/IP connection is a three-step process. First, the desktop computer sends a special packet (packet #1) called a SYN packet to the remote system. This SYN packet requests that a TCP/IP connection be opened. Once again, we can eavesdrop on the communications with the tcpdump packet monitor:[14]

[14] To simplify reading the output from the tcpdump program, we have inserted a blank line between each packet. To simplify the discussion, we've labeled each packet with a packet number.

    packet #1:
    21:54:28.956695 desk.nitroba.com.6636 > www.icann.org.http: S 2897261633:2897261633(0) win 16384 <mss 1460,nop,nop,sackOK> (DF)

The (DF) at the end of the line indicates that this packet has the don't fragment option set. If the remote system is able to open the TCP/IP connection, it responds with a SYN/ACK packet (packet #2). When the computer receives this SYN/ACK packet, it sends an ACK packet (packet #3). This three-packet exchange is known as a three-way handshake, and it is the way that every TCP/IP connection is started.

    packet #2:
    21:54:29.039502 www.icann.org.http > desk.nitroba.com.6636: S 3348123210:3348123210(0) ack 2897261634 win 32120 <mss 1460,nop,nop,sackOK> (DF)

    packet #3:
    21:54:29.039711 desk.nitroba.com.6636 > www.icann.org.http: . ack 1 win 17520 (DF)

In the first line, packet #1 is sent from the computer's port 6636 to the remote system's "http" port. (Once again, the fact that the packet came from port 6636 has no actual significance; when a computer initiates a TCP/IP connection to a remote system, it uses a randomly-selected port, usually between 1024 and 65535.) This first packet contains a randomly-generated number, in this case 2897261633, which is known as the TCP/IP sequence number. The remote system responds with packet #2, a SYN/ACK that carries a sequence number of its own (3348123210) and acknowledges the desktop's sequence number plus one (2897261634). Finally, the desktop system sends packet #3, the ACK packet.
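The numbers in the handshake are related by simple arithmetic, which the short sketch below reproduces. The helper name `ack_for_syn` is ours; the two sequence numbers are copied from packets #1 and #2. The "ack 1" in packet #3 is tcpdump's relative form of the same idea, counted from the server's starting number.

```python
MODULUS = 2 ** 32   # TCP sequence numbers are 32-bit values that wrap around


def ack_for_syn(sequence_number):
    """A SYN uses up one sequence number, so the reply acknowledges seq + 1."""
    return (sequence_number + 1) % MODULUS


client_isn = 2897261633   # sequence number in packet #1
server_isn = 3348123210   # sequence number in packet #2

# The acknowledgment number carried by packet #2.
print(ack_for_syn(client_isn))
# The acknowledgment in packet #3, expressed relative to the server's number.
print(ack_for_syn(server_isn) - server_isn)
```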

After the TCP/IP connection is set up, the desktop computer sends a single packet that contains the HTTP request:

    packet #4:
    21:54:29.041008 desk.nitroba.com.6636 > www.icann.org.http: P 1:304(303) ack 1 win 17520 (DF)

This packet carries 303 bytes of data. The "P" indicates that the TCP/IP push option is set, which has the effect of telling the destination system (www.icann.org) that the data should be immediately transmitted to the receiving program on the remote system.

By using another Unix tool (the strings(1) command), it's possible to look inside the packet and display the text that it contains:[15]

[15] If you count up the letters in the request shown here, you will see that there are only 289 characters. But there are six lines of text and one blank line, and each line is terminated by a carriage return/line feed pair. Adding 289+7+7 gives 303, which is the number of bytes in the packet.

    GET / HTTP/1.0
    Host: www.icann.org
    Accept-Encoding: gzip, deflate
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*
    User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
    Accept-Language: en-us

The first line of this packet indicates that it is requesting the root HTML page from the remote web server using HTTP version 1.0. The second line indicates that we wish to connect to the computer www.icann.org (the Host: line is useful when multiple domains are served from a single IP address). Explanation of the remaining lines is beyond the scope of this discussion.
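The byte count reported by tcpdump can be checked by rebuilding the request. The sketch below assumes that the Accept: header travels as a single line (it is only wrapped for display) and that every line, including the closing blank one, ends with a carriage return/line feed pair.

```python
REQUEST_LINES = [
    "GET / HTTP/1.0",
    "Host: www.icann.org",
    "Accept-Encoding: gzip, deflate",
    # One header line, split here only to keep the source readable.
    "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, "
    "application/vnd.ms-powerpoint, application/vnd.ms-excel, "
    "application/msword, */*",
    "User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)",
    "Accept-Language: en-us",
]

# Characters of text, not counting any line terminators.
text_chars = sum(len(line) for line in REQUEST_LINES)

# Six CRLF-terminated lines plus one blank line that ends the headers.
request = ("\r\n".join(REQUEST_LINES) + "\r\n\r\n").encode("ascii")

print(text_chars, len(request))
```

Under those assumptions the text comes to 289 characters and the full request to 303 bytes, which agrees with the `1:304(303)` shown for packet #4.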

After this packet requests a page from the web server, the web server sends a stream of packets containing the web page back to the local computer. As the data is received, it is acknowledged by a series of ACK packets:

    packet #5:
    21:54:29.124031 www.icann.org.http > desk.nitroba.com.6636: . ack 304 win 31817 (DF)

    packet #6:
    21:54:29.132202 www.icann.org.http > desk.nitroba.com.6636: . 1:1461(1460) ack 304 win 32120 (DF)

    packet #7:
    21:54:29.258989 desk.nitroba.com.6636 > www.icann.org.http: . ack 1461 win 17520 (DF)

    packet #8:
    21:54:29.348034 www.icann.org.http > desk.nitroba.com.6636: . 1461:2921(1460) ack 304 win 32120 (DF)

    packet #9:
    21:54:29.348414 desk.nitroba.com.6636 > www.icann.org.http: . ack 2921 win 17520 (DF)

    packet #10:
    21:54:29.349575 www.icann.org.http > desk.nitroba.com.6636: . 2921:4381(1460) ack 304 win 32120 (DF)

    packet #11:
    21:54:29.438365 www.icann.org.http > desk.nitroba.com.6636: . 4381:5841(1460) ack 304 win 32120 (DF)

    packet #12:
    21:54:29.438760 desk.nitroba.com.6636 > www.icann.org.http: . ack 5841 win 17520 (DF)

    packet #13:
    21:54:29.442387 www.icann.org.http > desk.nitroba.com.6636: . 5841:7301(1460) ack 304 win 32120 (DF)

    packet #14:
    21:54:29.442743 desk.nitroba.com.6636 > www.icann.org.http: . ack 7301 win 17520 (DF)

    packet #15:
    21:54:29.528037 www.icann.org.http > desk.nitroba.com.6636: . 7301:8761(1460) ack 304 win 32120 (DF)

    packet #16:
    21:54:29.529545 www.icann.org.http > desk.nitroba.com.6636: . 8761:10221(1460) ack 304 win 32120 (DF)

    packet #17:
    21:54:29.529918 www.icann.org.http > desk.nitroba.com.6636: FP 10221:10703(482) ack 304 win 32120 (DF)

    packet #18:
    21:54:29.529958 desk.nitroba.com.6636 > www.icann.org.http: . ack 10221 win 17520 (DF)

    packet #19:
    21:54:29.530133 desk.nitroba.com.6636 > www.icann.org.http: . ack 10704 win 17038 (DF)

    packet #20:
    21:54:29.550608 desk.nitroba.com.6636 > www.icann.org.http: F 304:304(0) ack 10704 win 17038 (DF)

    packet #21:
    21:54:29.630469 www.icann.org.http > dhcp103.walden.vineyard.net.3l-l1: . ack 305 win 32120 (DF)

If a portion of the transmitted data is not acknowledged within a certain period of time, the sending system automatically retransmits that data. Acknowledging data sent over the network in this fashion assures that all of the data arrives at the receiving system even if the Internet is overloaded and dropping packets.

Notice that packets #17 and #20 both have the "F" bit set. Here the "F" means FIN. When a packet has its FIN bit set, this tells the remote system that no more data will be sent in that direction along the connection. Packet #17, sent from ICANN to desk, says that the last byte sent is 10703 bytes into the data stream; desk then responds with packet #18, saying that it has received everything before byte #10221, and packet #19, saying that it is ready for byte #10704. Of course, there will be no byte #10704, because the FIN has been sent.

This process repeats on packets #20 and #21. In packet #20, desk sends a packet to the remote system saying that 304 bytes have been sent and no more are coming. In packet #21, the remote system says that all of these bytes have been processed and it is ready for byte 305. Of course, that byte will not be coming either.

After packet #21 is sent, both sides have sent their FIN packets and both FIN packets have been acknowledged. The TCP/IP connection is presumed to be closed.
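The sequence numbers in the trace can be cross-checked with a few lines of arithmetic. The pairs below are copied from the eight packets in which the server sends data (#6, #8, #10, #11, #13, #15, #16, and #17); the variable names are ours.

```python
# (first sequence number, next sequence number) for each data-carrying packet.
SEGMENTS = [
    (1, 1461), (1461, 2921), (2921, 4381), (4381, 5841),
    (5841, 7301), (7301, 8761), (8761, 10221), (10221, 10703),
]

# Total number of data bytes sent by the server.
total_bytes = sum(end - start for start, end in SEGMENTS)

# Each segment should begin exactly where the previous one ended.
contiguous = all(a[1] == b[0] for a, b in zip(SEGMENTS, SEGMENTS[1:]))

# The FIN flag on the last segment uses up one more sequence number,
# so the final acknowledgment is one past the end of the data.
final_ack = SEGMENTS[-1][1] + 1

print(total_bytes, contiguous, final_ack)
```

The final value matches the `ack 10704` that desk sends in packet #19, just as `ack 305` in packet #21 is one past the 304 carried by desk's own FIN.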

We can also look inside the contents of the packets that were sent from the ICANN web server to our desktop system. The first 30 lines look like this:

    HTTP/1.1 200 OK
    Date: Wed, 07 Feb 2001 02:54:29 GMT
    Server: Apache/1.3.6 (Unix)  (Red Hat/Linux)
    Last-Modified: Mon, 22 Jan 2001 01:10:54 GMT
    ETag: "d183c-28c4-3a6b889e"
    Accept-Ranges: bytes
    Content-Length: 10436
    Connection: close
    Content-Type: text/html

    <HTML>
    <HEAD>
      <META NAME="GENERATOR" CONTENT="Adobe PageMill 3.0 Win">
      <TITLE>ICANN | Home Page </TITLE>
      <META CONTENT="text/html; charset=windows-1252" HTTP-EQUIV="Content-Type">
    </HEAD>
    <BODY BGCOLOR="#ffffff">
    <P><CENTER><TABLE BORDER="0" CELLPADDING="0" CELLSPACING="2"
     WIDTH="95%">
      <TBODY>
       <TR>
        <TD WIDTH="12%">&nbsp;</TD>
         <TD WIDTH="21%"><IMG src="/books/2/513/1/html/2//logos/icann-logo.gif" ALIGN="BOTTOM"
          ALT="ICANN Logo" HEIGHT="145" WIDTH="188" NATURALSIZEFLAG="0"></TD>
         <TD WIDTH="67%">
          <P><CENTER><STRONG><FONT SIZE="+2" FACE="Arial">The Internet
          Corporation <BR>
          for Assigned Names and Numbers</FONT></STRONG></CENTER></TD>
      </TR></TBODY>
     </TABLE></CENTER></P>
    <P><CENTER><HR NOSHADE SIZE="1" WIDTH="95%"><TABLE BORDER="0"
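The header block at the top of the response is easy to take apart. The sketch below is a simplified parser of our own, not a complete HTTP implementation; it assumes CRLF line endings and that the Server: line contains two spaces exactly as printed. With those assumptions, the size of the header block plus the Content-Length value equals the 10702 data bytes that the server's packets carry in the tcpdump trace.

```python
RAW_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Wed, 07 Feb 2001 02:54:29 GMT\r\n"
    "Server: Apache/1.3.6 (Unix)  (Red Hat/Linux)\r\n"
    "Last-Modified: Mon, 22 Jan 2001 01:10:54 GMT\r\n"
    'ETag: "d183c-28c4-3a6b889e"\r\n'
    "Accept-Ranges: bytes\r\n"
    "Content-Length: 10436\r\n"
    "Connection: close\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
)


def parse_response_head(raw):
    """Return (status code, dict of lower-cased header names to values)."""
    head = raw.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers


status, headers = parse_response_head(RAW_HEAD)
body_length = int(headers["content-length"])
print(status, len(RAW_HEAD), body_length, len(RAW_HEAD) + body_length)
```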

Finally, Figure 2-6 shows how the web page itself appears.

Figure 2-6. http://www.icann.org/
figs/wsc2_0206.gif


Web Security, Privacy and Commerce, 2nd Edition
ISBN: 0596000456