Chapter 7 - The Transmission Control Protocol


Summary
The Transmission Control Protocol provides a reliable, connection-oriented transport protocol for transaction-oriented applications to use. TCP is used by almost all of the application protocols found on the Internet today, as most of them require a reliable, error-correcting transport layer in order to ensure that data does not get lost or corrupted.

Protocol ID
6

Relevant STDs
2 (http://www.iana.org/); 3 (includes RFCs 1122 and 1123); 7 (RFC 793, republished)

Relevant RFCs
793 (Transmission Control Protocol); 896 (The Nagle Algorithm); 1122 (Host Network Requirements); 1323 (Window Scale and Timestamp); 2018 (Selective Acknowledgments); 2581 (TCP Congestion Control)

Related RFCs
1072 (Extensions for High Delay); 1106 (Negative Acknowledgments); 1146 (Alternate Checksums); 1337 (Observations on RFC 1323); 1644 (Transaction Extensions); 1948 (Defending Against Sequence Number Attacks); 2414 (Increasing the Initial Window); 2525 (Known TCP Implementation Problems); 2582 (Experimental New Reno Modifications to Fast Recovery)

On an IP network, applications use two standard transport protocols to communicate with each other. These are the User Datagram Protocol (UDP), which provides a lightweight and unreliable transport service, and the Transmission Control Protocol (TCP), which provides a reliable and controlled transport service. The majority of Internet applications use TCP, since its built-in reliability and flow control services ensure that data does not get lost or corrupted.

TCP is probably the most important protocol in use on the Internet today. Although IP does the majority of the legwork, moving datagrams and packets around the Internet as needed, TCP makes sure that the data inside of the IP datagrams is correct. Without this reliability service, the Internet would not work nearly as well as it does, if it worked at all.

It is also interesting to note that the first versions of TCP were designed before IP, with IP being extracted from TCP later. In fact, TCP is now designed to work with any packet-switched network, whether this be raw Ethernet or a distributed IP-based network like the Internet. This flexible design has resulted in TCP being adopted by other network architectures, including OSI's Transport Protocol 4 (TP4) and Apple Computer Corporation's AppleTalk Data Stream Protocol (ADSP).

The TCP Standard

TCP is defined in RFC 793, which has been republished as STD 7 (TCP is an Internet Standard protocol). However, RFC 793 contained some ambiguities, which were clarified in RFC 1122 (Host Network Requirements). In addition, RFC 2001 introduced a variety of congestion-related elements to TCP that have since been incorporated into the standard specification; that RFC has itself been superseded by RFC 2581 (a.k.a. RFC 2001 bis). As such, TCP implementations need to incorporate RFC 793, RFC 1122, and RFC 2581 in order to work reliably and consistently with other implementations.

RFC 793 states that the Protocol ID for TCP is 6. When a system receives an IP datagram that is marked as containing Protocol 6, it should pass the contents of the datagram to TCP for further processing.

TCP Is a Reliable, Connection-Centric Transport Protocol

Remember that all of the transport-layer protocols (including TCP and UDP) use IP for their basic delivery services, and that IP is an unreliable protocol, providing no guarantees that datagrams or packets will reach their destination intact. It is quite possible for IP packets to get lost entirely (due to an untimely link failure on the network somewhere), or for packets to become corrupted (due to an overworked or buggy router), or for packets to get reordered as they cross different networks en route to the destination system, or for a myriad of other problems to crop up while packets are being bounced around the Internet.

For applications that need some sort of guarantee that data will arrive at its destination intact, this uncertainty is simply unacceptable. Electronic mail, TELNET, and other network applications are the basis of many mission-critical efforts, and as such they need some sort of guarantee that the data they transmit will arrive in its original form.

This reliability is achieved through the use of a virtual circuit that TCP builds whenever two applications need to communicate. As we discussed in Chapter 1, An Introduction to TCP/IP, a TCP session is somewhat analogous to a telephone conversation in that it provides a managed, full-duplex, point-to-point communications circuit for application protocols to use. Whenever data needs to be sent between two TCP-based applications, a virtual circuit is established between the two TCP providers, and a highly monitored exchange of application data occurs. Once all of the data has been successfully sent and received, the connection gets torn down.

Building and monitoring these virtual circuits incurs a fair amount of overhead, making TCP somewhat slower than UDP. However, UDP does not provide any reliability services whatsoever, which is an unacceptable trade-off for many applications.

Services Provided by TCP

Although it is possible for applications to provide their own reliability and flow control services, it is impractical for them to do so. Rather than developing (and debugging) these kinds of services, it is much more efficient for applications to leverage them as part of a transport-layer protocol, where every application has access to them. This arrangement allows shorter development cycles, better interoperability, and fewer headaches for everybody.

TCP provides five key services to higher-layer applications:

Virtual circuits
Whenever two applications need to communicate with each other using TCP, a virtual circuit is established between the two TCP endpoints. The virtual circuit is at the heart of TCP's design, providing the reliability, flow control, and I/O management features that distinguish it from UDP.

Application I/O management
Applications communicate with each other by sending data to the local TCP provider, which then transmits the data across a virtual circuit to the other side, where it is eventually delivered to the destination application. TCP provides an I/O buffer for applications to use, allowing them to send and receive data as contiguous streams, with TCP converting the data into individually monitored segments that are sent over IP.

Network I/O management
When TCP needs to send data to another system, it uses IP for the actual delivery service. Thus, TCP also has to provide network I/O management services to IP, building segments that can travel efficiently over the IP network, and turning individual segments back into a data-stream appropriate for the applications.

Flow control
Different hosts on a network will have different characteristics, including processing capabilities, memory, network bandwidth, and other resources. For this reason, not all hosts are able to send and receive data at the same rate, and TCP must be able to deal with these variations. Furthermore, TCP has to do all of this seamlessly, without any action being required from the applications in use.

Reliability
TCP provides a reliable transport service by monitoring the data that it sends. TCP uses sequence numbers to monitor individual bytes of data, acknowledgment flags to tell if some of those bytes have been lost somewhere, and checksums to validate the data itself. Taken together, these mechanisms make TCP extremely reliable.

All told, these services make TCP an extremely robust transport protocol.
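The checksum mentioned above is the standard Internet checksum, defined in RFC 1071: the ones' complement of the ones'-complement sum of the data taken as 16-bit words. As a rough sketch (in Python, which is not part of the original text; the real TCP checksum also covers a pseudo-header of IP addresses, protocol ID, and length, omitted here for brevity), it can be computed like this:

```python
# A minimal sketch of the RFC 1071 Internet checksum: the ones' complement
# of the ones'-complement sum of the data, treated as 16-bit words.
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:                           # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # next big-endian 16-bit word
        total = (total & 0xFFFF) + (total >> 16)  # fold any carry back in
    return ~total & 0xFFFF

# Example data from RFC 1071's worked example:
checksum = internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))

# A receiver verifies by summing the segment *including* the checksum
# field; a valid segment then checksums to zero.
verified = internet_checksum(bytes.fromhex("0001f203f4f5f6f7") +
                             checksum.to_bytes(2, "big")) == 0
```

One useful property of this design is that the receiver does not need to know where the checksum field is zeroed out; re-running the same computation over the received bytes (checksum included) yields zero for intact data.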

Virtual Circuits

In order for TCP to provide a reliable transport service, it has to overcome IP's own inherent weaknesses, possibly the greatest of which is the inability to track data as it gets sent across the network. IP only moves packets around the network, and makes no pretense towards offering any sort of reliability whatsoever. Although this lack of reliability is actually a designed-in feature of IP that allows it to move data across multiple paths quickly, it is also an inherent weakness that must be overcome in order for applications to communicate with each other reliably and efficiently.

TCP does this by building a virtual circuit on top of IP's packet-centric network layer, and then tracking data as it is sent through the virtual circuit. This concept is illustrated in Figure 7-1. Whenever a connection is made between two TCP endpoints, all of the data gets passed through the virtual circuit.

By using this virtual circuit layer, TCP accomplishes several things. It allows IP to do what it does best (which is moving individual packets around the network), while also allowing applications to send and receive data without them having to worry about the condition of the underlying network. And since each byte of data is monitored individually by TCP, it's easy to take corrective actions whenever required, providing reliability and flow control services on top of the chaotic Internet.

Figure 7-1.
An overview of TCP's virtual circuits

These virtual circuits are somewhat analogous to the way that telephone calls work. It is easy to see this analogy if you think of the two TCP endpoints as being the telephones, and the applications as being the users of those telephones.

When an application wants to exchange data with another system, it first requests that TCP establish a workable session between the local and remote applications. This process is similar to you calling another person on the phone. When the other party answers ("Hello?"), they are acknowledging that the call went through. You then acknowledge the other party's acknowledgment ("Hi Joe, this is Eric"), and begin exchanging information ("The reason I'm calling is...").

Likewise, data travelling over a TCP virtual circuit is monitored throughout the session, just as a telephone call is. If at any time parts of the data are lost ("What did you say?"), the sending system will retransmit the lost data ("I said..."). If the connection degrades to a point where communications are no longer possible, then sooner or later both parties will drop the call. Assuming that things don't deteriorate to that point, the parties will agree to disconnect ("See ya") once all of the data has been exchanged successfully, and the call will be gracefully terminated.

This concept is illustrated in Figure 7-2. When a TCP connection needs to be established, one of the two endpoint systems will try to connect with the other endpoint. If the call goes through successfully, then the TCP stack on the remote system will acknowledge the connection request, which will then be followed by an acknowledgment from the sender. This three-way handshake ensures that the connection is sufficiently reliable for data to be exchanged.

Likewise, each clump of data that is sent is explicitly acknowledged, providing constant feedback that everything is going okay. Once all of the data has been sent, either endpoint can close the virtual circuit. However, the disconnect process also uses acknowledgments in order to ensure that both parties are ready to terminate the call. If one of the systems still had data to send, then they might not agree to drop the circuit.

Figure 7-2.
TCP virtual circuits versus telephone calls

The virtual circuit metaphor has other similarities with traditional telephone calls. For example, TCP is a full-duplex transport that allows each party to send and receive data over the same virtual circuit simultaneously, just like a telephone call does. This allows for a web browser to request an object and for the web server to send the requested data back to the client using a single virtual circuit, rather than requiring that each end establish its own communication channel.

Every TCP virtual circuit is dedicated to one pair of endpoints, also like a telephone call. If an application needs to communicate with multiple endpoints simultaneously, then it must establish unique circuits for each endpoint pair, just as telephone calls do. This is true even if the same applications are in use at both ends of the connection. For example, if a web browser were to simultaneously request four GIF images from the same server using four simultaneous HTTP GET commands, then four separate TCP circuits would be needed in order for the operations to complete, even though the same applications and hosts were being used with all of the requests.

For all of these reasons, it is easy to think of TCP's virtual circuits as being very similar to the familiar concept of telephone calls.

Application I/O Management

The primary benefit of the virtual circuit metaphor is the reliability that it allows. However, another set of key benefits is the I/O management services that this design provides.

One of the main features that comes from this design is that applications can send and receive information as streams of data, rather than having to deal with packet-sizing and management issues directly. This allows a web server to send a very large graphic image as a single stream of data, rather than as a bunch of individual packets, leaving the task of packaging and tracking the data to TCP.

This design helps to keep application code simple and straightforward, resulting in lower complexity, higher reliability, and better interoperability. Application developers don't have to build flow control, circuit-management, and packaging services into their applications, but can instead use the services provided by TCP, without having to do anything special. All an application has to do is read and write data; TCP does everything else.

TCP provides four distinct application I/O management services to applications:

Internal Addressing. TCP assigns unique port numbers to every instance of every application that is using a TCP virtual circuit. Essentially, these port numbers act as extension numbers, allowing TCP to route incoming data directly to the appropriate destination application.

Opening Circuits. Applications inform TCP when they need to open a connection to a remote application, and leave it to TCP to get the job done.

Data Transfer. Whenever an application needs to send data, it just hands it off to TCP, and assumes that TCP will do everything it can to make sure that the data is delivered intact to the destination system.

Destroying Circuits. Once applications have finished exchanging data, they inform TCP that they are finished, and TCP closes the virtual circuit.

Application addressing with TCP ports

Applications communicate with TCP through the use of ports, which are practically identical to the ports found in UDP. Applications are assigned 16-bit port numbers when they register with TCP, and TCP uses these port numbers for all incoming and outgoing traffic.

Conceptually, port numbers provide extensions for the individual applications in use on a system, with the IP address of the local system acting as the main phone number. Remote applications call the host system (using the IP address), and also provide the extension number (port number) of the destination application that they want to communicate with. TCP uses this information to identify the sending and receiving applications, and to deliver data to the correct application.

Technically, this procedure is a bit more complex than it is being described here. When an application wishes to communicate with another application, it will give the data to TCP through its assigned port number, telling TCP the port number and IP address of the destination application. TCP will then create the necessary TCP message (called a segment), marking the source and destination port numbers in the message headers, and storing whatever data is being sent in the payload portion of the message. A complete TCP segment will then get passed off to the local IP software for delivery to the remote system (which will create the necessary IP datagram and shoot it off).

Once the IP datagram is received by the destination system, the remote IP software will see that the data portion of the datagram contains a TCP segment (as can be seen by the Protocol Identifier field in the IP header), and will hand the contents of the segment to TCP for further processing. TCP will then look at the TCP header, see the destination port number, and hand the payload portion of the segment off to whatever application is using the specified destination port number.

This concept is illustrated in Figure 7-3. In that example, an HTTP client is sending data to the HTTP server running on port 80 of the destination system. When the data arrives at the destination system, TCP will examine the destination port number for that segment, and then deliver the contents of the segment to the application it finds there (which should be the HTTP server).

Figure 7-3.
Application-level multiplexing with port numbers

Technically, a port identifies only a single instance of an application on a single system. The term socket is used to identify the port number and IP address concatenated together (i.e., port 80 on host 192.168.10.10 could also be referred to as socket 192.168.10.10:80). A socket pair consists of both endpoints on a virtual circuit, including the IP addresses and port numbers of both applications on both systems.

All TCP virtual circuits work on the concept of socket pairs. Multiple connections between two systems must have unique socket pairs, with at least one of the two endpoints having a different port number.
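The socket-pair rule is easy to demonstrate with the standard Berkeley sockets API. The sketch below (in Python, over the loopback interface; not part of the original text) opens two connections to the same server port and shows that each circuit is still unique, because the operating system assigns each client a different ephemeral port:

```python
import socket

# Two connections to the same server socket: unique socket pairs anyway.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0 = let the OS pick a free port
server.listen(5)
addr = server.getsockname()          # one (IP, port) socket for the server

c1 = socket.create_connection(addr)  # first virtual circuit
c2 = socket.create_connection(addr)  # second virtual circuit
s1, _ = server.accept()
s2, _ = server.accept()

# The server side of both circuits is the same socket...
same_server_socket = s1.getsockname() == s2.getsockname()
# ...but the client side of each circuit has a different ephemeral port,
# so the socket pairs (and therefore the circuits) are distinct.
distinct_client_sockets = c1.getsockname() != c2.getsockname()

for s in (c1, c2, s1, s2, server):
    s.close()
```

This is exactly the situation described above for a browser opening several simultaneous connections: one server port, many circuits, each distinguished by the client-side half of the socket pair.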

TCP port numbers are not necessarily linked with applications on a one-to-one basis. It is quite common for some applications to open multiple connections simultaneously, and these connections would all require unique socket pairs, even if there was only one application in use. For example, if an HTTP 1.0 client were to simultaneously download multiple graphic objects from an HTTP server, then each instance of the HTTP client would require a unique and separate port number in order for TCP to route the data correctly. In this case, there would be only one application, but there would be multiple bindings to the network, with each binding having a unique port number.

It is important to realize that circuits and ports are entirely separate entities, although they are tightly interwoven. The virtual circuit provides a managed transport between two endpoint TCP providers, while port numbers provide only an address for the applications to use when talking to their local TCP provider. For this reason, it is entirely possible for a server application to support several different client connections through a single port number (although each unique virtual circuit will have a unique socket pair, with the client-side address and/or socket being the unique element).

For example, Figure 7-4 shows a single HTTP server running on Arachnid, with two active virtual circuits (one for Ferret, and another for Greywolf). Although both connections use the same IP address and port number on Arachnid, the socket pairs themselves are unique, due to the different IP addresses and port numbers in use by the two client systems. In this regard, virtual circuits are different from the port number in use by the HTTP server, although these elements are also tightly related.

Most of the server-based IP applications that are used on the Internet today use what are referred to as well-known port numbers, as we discussed in the previous chapter. For example, an HTTP server will listen on TCP port 80 by default, which is the well-known port number associated with HTTP servers. This way, any HTTP client that needs to connect to any HTTP server can use the default destination of TCP port 80. Otherwise, the client would have to specify the port number of the server that it wanted to connect with (you've seen this in some URLs that use http://www.somehost.com:8080/ or the like; 8080 is the port number of the HTTP server on www.somehost.com).

Figure 7-4.
An HTTP server with two connections, using two distinct socket pairs

Most servers let you use any port number and are not restricted to the well-known port number. However, if you run your servers on non-standard ports, then you would have to tell every user that the server was not accessible on the default port. This would be hard to manage at best. By sticking with the defaults, all users can connect to your server using the default port number, which is likely to cause the least amount of trouble.

Some network administrators purposefully run application servers on nonstandard ports, hoping to add an extra layer of security to their network. However, it is my opinion that security through obscurity is no security at all, and this method should not be relied upon by itself.

Historically, only server-based applications have been allowed to run on ports below 1024, as these ports could be used only by privileged accounts. By limiting access to these port numbers, it was more difficult for a hacker to install a rogue application server. However, this restriction is based on Unix-specific architectures and is not easily enforced on all of the systems that run IP today. Many application servers now run on operating systems that have little or no concept of privileged users, making this historical restriction somewhat irrelevant.

There are a number of predefined port numbers that are registered with the Internet Assigned Numbers Authority (IANA). All of the port numbers below 1024 are reserved for use with well-known applications, although there are also many applications that use port numbers outside of this range. Some of the more common port numbers are shown in Table 7-1. For a detailed listing of all of the port numbers that are currently registered, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/port-numbers).

Table 7-1. Some of the Port Numbers Reserved for Well-Known TCP Servers

Port Number   Description
20            File Transfer Protocol, Data Channel (FTP-Data)
21            File Transfer Protocol, Control Channel (FTP)
23            TELNET
25            Simple Mail Transfer Protocol (SMTP)
80            Hypertext Transfer Protocol (HTTP)
110           Post Office Protocol, v3 (POP3)
119           Network News Transfer Protocol (NNTP)

Besides the reserved port numbers that are managed by the IANA, there are also unreserved port numbers that can be used by any application for any purpose, although conflicts may occur with other users who are also using those port numbers. Any port number that is frequently used should be registered with the IANA.

To see the well-known ports used on your system, examine the /etc/services file on a Unix host, or the C:\WinNT\System32\Drivers\Etc\SERVICES file on a Windows NT host.
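Most socket libraries can also look these mappings up programmatically from the local services database. As a brief sketch (in Python, not part of the original text; results depend on the contents of the local services file), the standard well-known ports can be queried like this:

```python
import socket

# Look up well-known TCP port numbers from the local services database
# (/etc/services on Unix, the SERVICES file on Windows NT).
http_port = socket.getservbyname("http", "tcp")
smtp_port = socket.getservbyname("smtp", "tcp")

# The reverse lookup also works: map a port number back to a service name.
service_at_23 = socket.getservbyport(23, "tcp")
```

On a typical system these return 80 for HTTP, 25 for SMTP, and "telnet" for port 23, matching the entries in Table 7-1.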

Opening a circuit

Applications communicate with each other using the virtual circuits provided by TCP. These circuits are established on an as-needed basis, getting created and destroyed as requested by the applications in use. Whenever an application needs to communicate with another application somewhere on the network, it will ask the local TCP provider to establish a virtual circuit on its behalf.

There are two methods for requesting that a virtual circuit be opened: either a client will request an open so that data can be sent immediately, or a server will open a port in listen mode, waiting for a connection request to arrive from a client.

The simplest of the two methods is the passive open, which is the form used by servers that want to listen for incoming connections. A passive open indicates that the server is willing to accept incoming connection requests from other systems, and that it does not want to initiate an outbound connection. Typically, a passive open is unqualified, meaning the server can accept an incoming connection from anybody. However, some security-sensitive applications will accept connections only from predefined entities, a condition known as a qualified passive open. This type is most often seen with corporate web servers, ISP news servers, and other restricted-access systems.

When a publicly accessible server first gets started, it will request that TCP open a well-known port in passive mode, offering connectivity to any node that sends in a connection request. Any TCP connection requests that come into the system destined for that port number will result in a new virtual circuit being established.

Client applications (such as a web browser) use active opens when making these connection requests. An active open is the opposite of a passive open, in that it is a specific request to establish a virtual circuit with a specific destination socket (typically the well-known port number of the server associated with that particular client application).

This process is illustrated in Figure 7-5. When an HTTP client needs to get a document from a remote HTTP server, it issues an active open to the local TCP software, providing it with the IP address and TCP port number of the destination HTTP server. The client's TCP provider then allocates a random port number for the application and attempts to establish a virtual circuit with the destination system's TCP software. The server's TCP software verifies that the connection can be opened (Is the port available? Are there security filters in place that would prevent the connection?), and then responds with an acknowledgment.

If the destination port is unavailable (perhaps the web server is down), then the TCP provider on the server system rejects the connection request. This is in contrast to UDP, which has to rely on ICMP Destination Unreachable: Port Unreachable Error Messages for this service. TCP is able to reject connections explicitly and can therefore abort connection requests without having to involve ICMP.

If the connection request is accepted, then the TCP provider on the server system acknowledges the request, and the client would then acknowledge the server's acknowledgment. At this point, the virtual circuit would be established and operational, and the two applications could begin exchanging data, as illustrated in Figure 7-5.
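In Berkeley sockets terms, the passive open corresponds to bind() and listen(), and the active open corresponds to connect(); the three-way handshake itself is carried out by the two TCP providers. A minimal sketch (in Python over the loopback interface; not part of the original text) of both sides looks like this:

```python
import socket

# Passive open: the server binds a port and listens, accepting connection
# requests from anybody (an unqualified passive open).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0 = any free port (for the demo)
server.listen(1)
host, port = server.getsockname()

# Active open: the client names the destination socket explicitly; its own
# port number is assigned automatically by TCP (a random ephemeral port).
# connect() returns once the three-way handshake has completed.
client = socket.create_connection((host, port))
conn, client_addr = server.accept()  # the server's end of the new circuit

# The address reported by accept() is the client's half of the socket pair.
connected = client_addr == client.getsockname()

client.close(); conn.close(); server.close()
```

Note that a qualified passive open (accepting connections only from predefined peers) is not part of the basic sockets model; applications typically implement it by checking the address returned by accept() and dropping unwanted callers.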

The segments used for the handshake process do not normally contain data, but instead are zero-length command segments that have special connection-management flags in their headers, signifying that a new virtual circuit is being established. In this context, the most important of these flags is the Synchronize flag, used by two endpoints to signify that a virtual circuit is being established.

For example, the first command segment sent by the client in Figure 7-5 would have the Synchronize flag enabled. This flag tells the server's TCP software that this is a new connection request. In addition, this command segment will also

Figure 7-5.
A TCP virtual circuit being established

provide the starting byte number (called the sequence number) that the client will use when sending data to the server, with this data being provided in the Sequence Identifier field of the TCP header.

If the server is willing to establish a virtual circuit with the client, then it will respond with its own command segment that also contains the Synchronize flag and that also gives the starting sequence number that the server will use when sending data back to the client. This command segment will also have the Acknowledgment flag enabled, with the Acknowledgment Identifier field pointing to the client's next-expected sequence number.

The client will then return a command segment with the Acknowledge flag enabled to the server, and with its Acknowledgment Identifier field pointing to the server's next-expected sequence number. Note that this segment does not have the Synchronize flag enabled, since the virtual circuit is now considered up and operational, with both systems now being able to exchange data as needed.

It is entirely possible for two systems to issue active opens to each other simultaneously, although this scenario is extremely rare (I know of no applications that do this purposefully). In theory, such an event is possible, although it probably happens only on very slow networks where the circuit-setup messages pass each other on the wire.

For more information on the Synchronize and Acknowledgment flags, refer to Control Flags later in this chapter. For more information on the sequence and acknowledgment numbers, refer to Reliability also later in this chapter.

Exchanging data

Once a virtual circuit has been established, the applications in use can begin exchanging data with each other. However, it is important to note that applications do not exchange data directly. Rather, each application hands data to its local TCP provider, identifying the specific destination socket that the data is for, and TCP does the rest.

Applications can pass data to TCP in chunks or as a contiguous byte-stream. Most TCP implementations provide a write service that is restricted in size, forcing applications to write data in blocks, just as if they were writing data to a file on the local hard drive. However, TCP's buffering design also supports application writes that are contiguous, and this design is used in a handful of implementations.

TCP stores the data that it receives into a local send buffer. Periodically, a chunk of data will get sent to the destination system. The recipient TCP software will then store this data into a receive buffer, where it will be eventually passed to the destination application.

For example, whenever a web browser issues an HTTP GET request, the request is passed to TCP as application data. TCP stores the data into a send buffer, packaging it up with any other data that is bound for the destination socket. The data then gets bundled into an IP datagram and sent to the destination system. The recipient's TCP provider then takes the data and passes it up to the web server, which fetches the requested document and hands it off to TCP. TCP sends chunks of the document data back to the client in multiple IP packets, where it is queued up and then handed to the application.

This concept is outlined in Figure 7-6, which shows an HTTP client asking for a document from a remote HTTP server. Once the TCP virtual circuit is established, the HTTP client writes GET document into the local send buffer associated with the virtual circuit in use by the client. TCP then puts this data into a TCP segment (creating the appropriate TCP headers), and sends it on to the specified destination system via IP. The HTTP server at the other end of the connection would then take the same series of steps when returning the requested document back to the client.

The important thing to remember here is that application data is transmitted as independent TCP segments, each of which requires acknowledgments. It is at this layer that TCP's reliability and flow control services are most visible.
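The stream-oriented nature of this exchange is worth seeing in code. In the sketch below (Python over the loopback interface; not part of the original text), the sender writes one large byte-stream and the receiver reads it back in arbitrarily sized chunks; TCP preserves the byte order and content, but not the boundaries of the individual writes:

```python
import socket

# Set up a local virtual circuit for the demonstration.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()

# The application writes one contiguous stream; TCP buffers it, breaks it
# into segments, and tracks each one across the circuit.
message = b"GET /document HTTP/1.0\r\n\r\n" * 100   # a few KB of request data
client.sendall(message)
client.shutdown(socket.SHUT_WR)      # signal that no more data is coming

# The receiver reads in whatever chunk sizes it likes; the chunk boundaries
# have no relationship to the sender's writes or to segment boundaries.
received = b""
while chunk := conn.recv(512):
    received += chunk

stream_intact = received == message

client.close(); conn.close(); server.close()
```

The application on each side sees only the buffers it reads and writes; the segmenting, sequencing, and acknowledgments all happen inside the two TCP providers.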

Figure 7-6.
Data being exchanged over a TCP virtual circuit

For more information on how TCP converts application data into IP datagrams, refer ahead to Network I/O Management.

Closing a circuit

Once the applications have exchanged all of their data, the circuit can be closed. Closing a circuit is similar to opening one, in that an application must request the action (except in those cases where the connection has collapsed, and TCP is forced to terminate it).

Either end of the connection may close the circuit at any time, using a variety of different means. The two common ways to close are active closes that initiate a shutdown sequence and passive closes that respond to an active close request.

Just as building a circuit requires a bidirectional exchange of special command segments, so does closing it. One end of the connection requests that the circuit be closed (the active close at work). The remote system then acknowledges the termination request and responds with its own termination request (the passive close). The terminating system then acknowledges the acknowledgment, and both endpoints drop the circuit. At this point, neither system is able to send any more data over the virtual circuit.

Figure 7-7 shows this process in detail. Once the HTTP client has received all of the data, it requests that the virtual circuit be closed. The HTTP server then returns an acknowledgment for the shutdown request, and also sends its own termination request. When the server's shutdown request is received by the client, the client issues a final acknowledgment, and begins closing its end of the circuit. Once the final acknowledgment is received by the server, the server shuts down whatever is left of the circuit. By this point, the connection is completely closed.
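The shutdown sequence in Figure 7-7 maps directly onto the sockets API. The following is a minimal sketch in Python (the request and the host/port values are placeholders): calling shutdown() issues the active close, and a zero-length read signals that the peer's Finish segment has arrived.

```python
import socket

# Hypothetical sketch of an orderly close, using an HTTP/1.0-style
# exchange. shutdown(SHUT_WR) issues the active close (TCP sends a
# Finish segment); a zero-length read means the peer's Finish arrived.
def fetch_and_close(host: str, port: int = 80) -> bytes:
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        sock.shutdown(socket.SHUT_WR)   # active close; we can still receive
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                # peer's close completed
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()                    # the stack handles the final acknowledgment
```

Note that after shutdown(SHUT_WR) the client can no longer send, but can still drain whatever data the server returns before the server closes its own end.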

Figure 7-7. A TCP virtual circuit being torn down

Just as TCP uses special Synchronize flags in the circuit-setup command segments, TCP also has special Finish flags that it uses when terminating a virtual circuit. The side issuing the active close sends a command segment with the Finish flag enabled and with a sequence number that is one byte higher than the last sequence number used by that side during the connection. The destination system responds with a command segment that also has the Finish flag enabled and with its sequence number also incremented by one. In addition, the Acknowledgment Identifier field for this response will still point to the next-expected sequence number, even though no other data should be forthcoming. In this regard, the Finish flag is considered to be one byte of data (just like the Synchronize flag), and as such must be acknowledged explicitly.

Once the Finish segments have been exchanged, the terminating system must respond with a final acknowledgment for the last Finish segment, with the Acknowledgment Identifier also pointing to the next-expected sequence number (even though there should not be another segment coming). However, this last segment will not have the Finish flag enabled, since the circuit is considered to be down and out of action by this point.

It's important to note that either endpoint can initiate the circuit-termination process, and there are no hard and fast rules for which end should do it, although typically it is left to the client to perform this service since it may make multiple requests over a single connection. POP3 is a good example of this process, as POP3 allows a client to submit multiple commands during a single session. The client would need to dictate when the circuit should be closed with this type of application. However, sometimes a server issues the active close. For example, Gopher servers close the virtual circuit after sending whatever data has been requested, as do HTTP 1.0 servers.

It's also important to note that a server application keeps its listening port open until the application itself is terminated, allowing other clients to continue connecting to that server. However, the individual circuits will be torn down on a per-connection basis, according to the process described above.

Sometimes, the two systems do not close their ends of the circuit simultaneously. This results in a staggered close (also known as a half-close), with each end closing its side of the connection at a different time. One example of this type can be found in the rsh utility, which is used to submit shell commands to rsh servers. On some systems, once an rsh command has been sent, the client will close its end of the connection, effectively switching the virtual circuit into half-duplex mode. The server will then process the shell command, send the results back to the client (for display or further processing), and then close its end of the connection. Once both ends have been closed, the circuit is dropped.

Another option for closing a circuit is to simply drop it without going through an orderly shutdown. Although this method will likely cause unnecessary traffic, it is not uncommon. Typically, this method should only happen if an application is abruptly terminated. If an application needs to immediately close a circuit without going through the normal shutdown sequence, then it will request an immediate termination, and TCP will issue a segment with the Reset flag set, informing the other end that the connection is being killed immediately.

For more information on the Finish, Reset, and Acknowledgment flags, refer ahead to Control Flags. For more information on the sequence and acknowledgment numbers, refer ahead to Reliability.

Application design issues

Some applications open a connection and keep it open for long periods of time, while others open and close connections rapidly, using many circuits for a single operation.

For example, if you instruct your web browser to open a document from an HTTP 1.0 Web server, the HTTP client issues an active open to the destination HTTP 1.0 server, which then sends the document to the client and closes the TCP connection. If there are any graphic objects on that document, the HTTP client has to open a separate connection for each of those objects. Thus, opening a single web page could easily result in twenty or more circuits being established and destroyed, depending on the number of objects embedded in the requested web page.

Since this model generates a lot of traffic (and uses a lot of network resources on the server), this process was changed with HTTP 1.1, which now allows a single circuit to be used for multiple operations. With HTTP 1.1, a client may request a page and then reuse the existing circuit to download objects embedded within that page. This model results in significantly fewer virtual circuits being used, although it also makes the download process synchronous rather than asynchronous.

Most applications use a single circuit for everything, keeping that circuit open even when there may not be any noticeable activity. TELNET is one example of this, where the TELNET client will issue an active open during the initial connection, and then use that virtual circuit for everything until the connection is terminated. After logging in, the user may get up and walk away from the client system, and thus no activity may occur for an extended period of time, although the TCP connection between the two systems would remain active.

Whether the circuits are torn down immediately or kept open for extended periods of time is really a function of the application's design goal, rather than anything mandated by TCP. It is entirely possible for clients to open and close connections rapidly (as seen with web browsers that use individual circuits for every element in a downloaded document), or to open a single connection and maintain it in perpetuity (as seen with TELNET).

Keep-alives

Although RFC 793 does not make any provision for a keep-alive mechanism, some TCP implementations provide one anyway. There are good reasons for doing this, and bad ones as well.

By design, TCP keep-alives are supposed to be used to detect when one of the TCP endpoints has disappeared without closing the connection. This feature is particularly useful for applications where the client may be inactive for long periods of time (such as TELNET), and there's no way to tell whether the connection is still valid.

For example, if a PC running a TELNET client were powered off, the client would not close the virtual circuit gracefully. Unfortunately, when that happened the TELNET server would never know that the other end had disappeared. Long periods of inactivity are common with TELNET, so not getting any data from the client for an extended period would not cause any alarms on the TELNET server itself. Furthermore, since the TELNET server wouldn't normally send unsolicited data to the client, it would never detect a failure from a lack of acknowledgments either. Thus, the connection might stay open indefinitely, consuming system resources for no good purpose.

TCP keep-alives allow servers to check up on clients periodically. If no response is received from the remote endpoint, then the circuit is considered invalid and will be released.

RFC 1122 states that keep-alives are entirely optional, should be user-configurable, and should be implemented only within server-side applications that will suffer real harm if the client were to disappear. Although implementations vary, RFC 1122 also states that keep-alive segments should not contain any data, but may be configured to send one byte of data if required for compatibility with noncompliant implementations.

Most systems use an unsolicited command segment for this task, with the sequence number of the command segment set to one byte less than the sequence number of the next byte of data to be sent, effectively reusing the sequence number of the last byte of data sent over the virtual circuit. This design effectively forces the remote endpoint to issue a duplicate acknowledgment for the last byte of data that was sent over that connection. When the acknowledgment arrives, the server knows that the client is still there and operational. If no response comes back after a few such tests, then the server can drop the circuit.
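On systems that expose them, the keep-alive timers can be tuned per socket. The following is a hedged sketch: SO_KEEPALIVE is a standard sockets-API option, while the TCP_KEEP* timer names are Linux-specific (other platforms use different names, or only system-wide settings).

```python
import socket

# A sketch of enabling and tuning TCP keep-alives on a server-side socket.
# SO_KEEPALIVE is standard; the TCP_KEEP* options are Linux names, so they
# are applied only where the platform defines them.
def enable_keepalive(sock: socket.socket, idle: int = 7200,
                     interval: int = 75, probes: int = 9) -> None:
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # idle seconds before the first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between unanswered probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before dropping the circuit
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

The defaults shown here mirror the traditional two-hour idle timer; per RFC 1122, any such values should be user-configurable.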

Network I/O Management

When an application needs to send data to another application over TCP, it writes the data to the local TCP provider, which queues the data into a send buffer. Periodically, TCP packages portions of the data into bundles (called segments ), and passes them off to IP for delivery to the destination system, as illustrated in Figure 7-8.

Although this process sounds simple, it involves a lot of work, primarily due to segment-sizing issues that TCP has to deal with. For every segment that gets created, TCP has to determine the most efficient segment size to use at that particular moment, which is an extremely complex affair involving many different factors.

Figure 7-8. An overview of TCP's data-encapsulation process

However, this is also an extremely important service, since accurately determining the size of a segment dictates many of the performance characteristics of a virtual circuit.

For example, making the segment too small wastes network bandwidth. Every TCP segment contains at least 40 bytes of overhead for the IP and TCP headers, so if a segment only contained one byte of data, then the byte ratio of headers-to-data for that segment would be 40:1, a miserable level of throughput by anybody's standard. Conversely, sending 400 bytes of data would change this ratio to 1:10, which is better, although still not very good. Sending four kilobytes of data would change this ratio to 1:100, which would provide excellent utilization of the network's capacity.
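The ratios above are easy to verify. A small sketch, using the 40-byte minimum for the combined IP and TCP headers (the text's "four kilobytes" example is rounded to 4,000 bytes here so the 1:100 ratio comes out exactly):

```python
# Header-to-data ratios for TCP/IP segments, using the 40-byte minimum
# of combined IP and TCP header overhead.
def overhead_ratio(payload_bytes: int, header_bytes: int = 40) -> float:
    return header_bytes / payload_bytes

assert overhead_ratio(1) == 40.0      # 40:1 -- one byte of data, almost all overhead
assert overhead_ratio(400) == 0.1     # 1:10 -- better, but still not very good
assert overhead_ratio(4000) == 0.01   # 1:100 -- excellent utilization
```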

On the other hand, sending too much data in a segment can cripple performance as well. If a segment were too big for the IP datagram to travel across the network (due to topology-specific restrictions), then the IP datagram itself would have to be fragmented in order for it to get across that network. This situation would not only require additional processing time on the router that had to fragment the packet, but it would also introduce delays to the destination system, as the receiving IP stack would have to wait for all of the IP fragments to arrive and be reassembled before the TCP segment could be passed to TCP for processing.

In addition, fragmentation also introduces reliability concerns. If a fragment is lost, then the recipient's fragmentation timers have to expire, and an ICMP Time Exceeded error message has to be issued. Then the sender has to resend the entire datagram, which is likely to result in even more fragmentation occurring. Furthermore, on networks that experience known levels of packet loss, fragmentation increases the network's exposure to damage, since a single lost fragment will destroy a large block of data. But if the data were sent as discrete packets to begin with, the same lost packet would result in only that one small segment being lost, which would take less time to recover from. For all of these reasons, avoiding fragmentation is also a critical function of accurately determining the most effective segment size for any given virtual circuit.

Determining the most effective segment size involves the following factors:

Send buffer size
The most obvious part of this equation involves the size of the send buffer on the local system. If the send buffer fills up, then a segment must be sent in order to make space in the queue for more data, regardless of any other factors.

Receive buffer size
Similarly, the size of the receive buffer on the destination system is also a concern, as sending more data than the recipient can handle would cause overruns, resulting in the retransmission of lost segments.

MTU and MRU sizes
TCP also has to take into consideration the maximum amount of data that an IP datagram can handle, as determined by the Maximum Transfer Unit (MTU) size of the physical medium in use on the local network, the Maximum Receive Unit (MRU) size of the destination system's network connection, and the MTU/MRU sizes of all the intermediary networks in between the two endpoint systems. If a datagram is generated that is too large for the end-to-end network to handle, then fragmentation would definitely occur, penalizing performance and reliability.

Header size
IP datagrams have headers, which will steal anywhere from 20 to 60 bytes of data from the segment. Likewise, TCP also has variable-length headers which will steal another 20 to 60 bytes of space. TCP has to leave room for the IP and TCP headers in the segments that get created, otherwise the datagram would be too large for the network to handle, and fragmentation would occur.

Data size and timeliness
The frequency at which queued data is sent is determined by the rate at which data is being generated. Obviously, if lots of data is being generated by an application, then lots of TCP segments will need to be sent quickly. Conversely, small trickles of data will still need to be sent in a timely manner, although this would result in very small segments. In addition, sometimes an application will request that data be sent immediately, bypassing the queue entirely.

Taking all of these variables into consideration, the formula for determining the most efficient segment size can be stated as follows:

MESS = (lesser of send buffer, receive buffer, MTU, or MRU) - headers, or (data + headers)

Simply put, the most efficient segment size is determined by finding the lowest available unit of storage (send buffers, receive buffers, or the MTU/MRU values in use) minus the required number of bytes for the IP and TCP headers, except in those situations where there is only a little bit of data to send. In that case, the size of the data (plus the required headers) will determine the size of the segment that is being sent.

By limiting the segment size to the smallest available unit of storage, the segment can be sent from one endpoint to another without having to worry about fragmentation. In turn, this allows TCP to use the largest possible segment for sending data that can be sent end-to-end, which allows the most amount of data to be sent in the least amount of time.
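The rule above can be sketched as a function. This illustrates the logic described in the text, not any particular stack's implementation; the function returns the payload portion of the segment, with the headers then added on top, as the formula's "(data + headers)" term indicates.

```python
IP_TCP_HEADERS = 40   # minimum combined IP and TCP header size, in bytes

# A sketch of the segment-sizing rule: use the smallest available unit of
# storage minus header space, unless only a small amount of data is queued,
# in which case the data itself sets the segment size.
def most_efficient_segment_size(send_buf: int, recv_buf: int,
                                mtu: int, mru: int, queued_data: int) -> int:
    ceiling = min(send_buf, recv_buf, mtu, mru) - IP_TCP_HEADERS
    return min(ceiling, queued_data)

# Ethernet-style circuit: the 1500-byte MTU/MRU is the bottleneck.
assert most_efficient_segment_size(8192, 8192, 1500, 1500, 64000) == 1460
# Only 200 bytes queued: the data itself limits the segment size.
assert most_efficient_segment_size(8192, 8192, 1500, 1500, 200) == 200
```

The 1,460-byte result in the first case is the familiar maximum segment size for Ethernet (1,500 bytes minus 40 bytes of headers).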

Buffer size considerations

Part of determining the most efficient segment size is derived from the size of the send and receive buffers in use on the two systems. If the send buffer is very small, then the sender cannot build a very large segment. Similarly, if the receive buffer is small, then the sender cannot transmit a large segment (even if it could build one), as that would cause overruns at the destination system, which would eventually require the data to be retransmitted.

Every system has a different default buffer size, depending upon its configuration. Most PC-based client systems have eight kilobyte send and receive buffers, while many server-class systems have buffers of 16 kilobytes or more. It is not uncommon for high-end servers to have 32 or 48 kilobyte send and receive buffers. However, most systems will let you specify the default size for the receive buffers on your system, and they will also let the application developer configure specific settings for their particular application.

Sometimes the size of the local system's send buffer is the bottleneck. If the send buffer is very small, then the sending device just won't be able to generate large segments, regardless of the amount of data being written, the size of the receive buffer, or the MTU/MRU sizes in use on the two networks. Typically, this is not the case, although it can be in some situations, particularly with small hand-held computers that have very limited system resources.

Similarly, sometimes the size of the receive buffers in use at the destination system will be the limiting factor. If the receive buffer on the destination system is very small, then the sender must restrict the amount of data that it pushes to the receiving endpoint. This is also uncommon, but it is not unheard of. High-speed Token Ring networks are capable of supporting MTUs of 16 kilobytes and more, while the PCs attached to those networks may only have TCP receive buffers of eight kilobytes. In this situation, the segment size would be restricted to the available buffer space (eight kilobytes), rather than the MTU/MRU capabilities of the network (16 kilobytes).

Obviously, a sender already knows the size of its send buffers, but the sender also has to determine the size of the recipient's receive buffer before it can use that information in its segment-sizing calculations. This is achieved through the use of a 16-bit Window field that is stored in the header of every TCP segment that gets sent across a virtual circuit. Whenever a TCP segment is created, the sending endpoint stores the current size of its receive buffer into the Window field, and the recipient then reads this information once the segment arrives. This allows each system to constantly monitor the size of the remote system's receive buffer, thereby allowing it to determine the maximum amount of data that can be sent at any given time.

However, the Window field is only 16 bits long, which limits the size of a receive buffer to a maximum of 64 kilobytes. RFC 1323 defines a TCP option called the Window Scale option that allows two endpoints to negotiate 30-bit window sizes, allowing buffers of up to one gigabyte to be advertised.
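From an application's perspective, the advertised window is influenced by the receive buffer requested through the sockets API. A sketch follows; note that many kernels round, double, or cap the requested value, and that any Window Scale negotiation happens automatically inside the stack during circuit setup.

```python
import socket

# A sketch: the receive buffer requested via SO_RCVBUF feeds the window
# advertised in the Window field (subject to OS policy). Windows larger
# than 64 KB require the RFC 1323 Window Scale option, which the kernel
# negotiates on its own. Read back what was actually granted, since the
# kernel may adjust the requested size.
def set_receive_buffer(sock: socket.socket, nbytes: int) -> int:
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```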

For more information on the Window field, refer to Window. For information on how to calculate the optimal default receive window size for your system, refer to Notes on Determining the Optimal Receive Window Size. For more information on how the window value affects flow control, refer to Receive window size adjustments. For more information on the TCP Window Scale option, refer to Window Scale. All of these sections appear later in this chapter.

MTU and MRU size considerations

Although buffer sizing issues can have an impact on the size of any given segment at any given time, most of the time the deciding factor for segment sizes is based on the size of the MTU and MRU in use by the end-to-end network connection.

For example, even the weakest of systems will have a TCP receive buffer of two or more kilobytes, while the MTU/MRU for Ethernet networks is only 1.5 kilobytes. In this case (and almost all others), the MTU/MRU of the Ethernet segment will determine the maximum segment size for that system, since it indicates the largest amount of data that can be sent in a single datagram without causing fragmentation to occur.

Typically, the MTU and MRU sizes for a particular network are the same values. For example, Ethernet networks have an MTU/MRU of 1500 bytes, and both of these values are fixed. However, many dial-up networks allow an endpoint system to define different MTU and MRU sizes. In particular, many dial-up systems set the MTU to be quite small, while also setting the MRU to be quite large. This imbalance can actually help to improve the overall performance of the client, making it snappier than a fixed, medium-sized MTU/MRU pair would allow.

To understand why this is so, you have to understand that most dial-up systems are clients, using applications such as POP3 and TELNET to retrieve large amounts of data from remote servers. Having a small MTU size forces the client to send segments quickly, since the MTU is the bottleneck in the segment-sizing calculations. Conversely, having a large MRU on a dial-up circuit allows the client to advertise a larger receive value, thereby letting the server send larger blocks of data down to the client. Taken together, the combination of a small MTU and a large MRU allows a dial-up client to send data quickly while also allowing it to download data in large chunks.

For example, one endpoint may be connected via a dial-up modem using a 1500-byte MRU, while the other node may be connected to a Token Ring network with a four-kilobyte MTU, as shown in Figure 7-9. In this example, the 1500-byte MRU would be the limiting factor when data was being sent to the dial-up client, since it represented the bottleneck. Furthermore, if the dial-up client had a 576 byte MTU (regardless of the 1500-byte MRU), then that value would be the limiting factor when data was being sent from the dial-up client up to the Token Ring-attached device.

Figure 7-9. An overview of the segment-sizing process, using MTU and MRU values

Regardless of whether or not the client has a large or small MTU, it should be obvious that senders have to take the remote system's MRU into consideration when determining the most efficient segment size for a virtual circuit. At the same time, however, the sender also has to worry about the size of its local MTU. Both of these factors will determine the largest possible segment allowable on any given virtual circuit.

In order for all of this to work, both systems have to be able to determine each other's MRU sizes (they already know their own MTU sizes), and then independently calculate the maximum segment sizes that are allowed for the virtual circuit.

This determination is achieved by each system advertising its local MRU during the circuit-setup sequence. When each system sends its TCP start segments, it also includes its local MRU size (minus forty bytes for the IP and TCP headers) in those segments, using a TCP option called the Maximum Segment Size option. Since each system advertises its MRU in the start segments, it is a simple procedure for each of the systems to read the advertised value and compare it with its own MTU value.

In truth, the MSS value advertised in the MSS option field tends to be based on the sender's MTU, rather than the MRU. Only a handful of systems actually use the MRU for their MSS advertisements. Although RFC 879 states that the MSS should be derived from the MRU, RFC 1122 clarified this position, stating that the MSS should be derived from the largest segment size that could be reassembled, which could be just about any value (although most implementations set this to the MTU size). Also, since most networks have fixed MTU/MRU pairs, most vendors set this value to the MTU size, knowing that it is the largest segment they can send. While this probably isn't the most technically accurate approach, it is what most implementations have chosen.

Note that RFC 793 states that the use of the MSS option is entirely optional, and therefore not required. If a system does not include an MSS option in its start segments, then a default value of 536 bytes (which is 576 bytes minus 40 bytes for the TCP and IP headers) should be assumed. However, RFC 1122 reversed this position, stating that the MSS option is mandatory and must be implemented by all TCP providers.
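The 536-byte default is simply the 576-byte minimum datagram minus the 40 bytes of headers. Where the platform supports it (most BSD-derived and Linux stacks do), the TCP_MAXSEG socket option reports the segment size in effect on a connected circuit; this option is a sockets-API convention, not part of RFC 793 itself.

```python
import socket

# The default MSS assumed when no MSS option is received:
DEFAULT_MSS = 576 - 40          # 576-byte datagram minus IP and TCP headers
assert DEFAULT_MSS == 536

# A sketch of reading the effective segment size on a connected socket
# via TCP_MAXSEG (availability and exact semantics vary by platform).
def segment_size_in_effect(sock: socket.socket) -> int:
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
```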

Also note that some BSD-based systems can send only segments with lengths that are multiples of 512 bytes. So, even if an MTU of 576 bytes were available, the segments generated by these systems would be only 512 bytes long. Similarly, circuits capable of supporting MTU sizes of 1.5 kilobytes would use segments of only 1,024 bytes in length.

For a list of the default MTU sizes used with the most-common network topologies, refer to Table 2-5 in Chapter 2, The Internet Protocol. For more information on the MSS option, refer to Maximum Segment Size.

Path MTU discovery

Even though TCP systems are able to determine the MTU values in use by the endpoints on a virtual circuit, they are not able to determine the MTU sizes of the networks in between the two endpoints, which may be smaller than the MTU/MRU values in use at either of the endpoint networks. In this scenario, fragmentation would still occur, since the MTU of the intermediary network would require that the IP datagrams be fragmented.

For example, if two systems are both on Token Ring networks using four-kilobyte MTUs, but there is an Ethernet network between them with a 1.5 kilobyte MTU, then fragmentation will occur when the four-kilobyte IP datagrams are sent over the 1.5 kilobyte Ethernet network. This process will lower the overall performance of the virtual circuit and may introduce some reliability problems.

By itself, TCP does not provide any means for determining the MTU of an intermediate network, and must rely on external means to discover the problem. One solution to this problem is to use a technique called Path MTU Discovery, which incorporates the IP Don't Fragment bit and the ICMP Destination Unreachable: Fragmentation Required error message to determine the MTU of the end-to-end IP network.

Essentially, Path MTU Discovery works by having one system create an IP packet of the largest possible size (as determined by the MTU/MRU pair for the virtual circuit), and then setting the Don't Fragment flag on the first IP packet. If the packet is rejected by an intermediary device (due to the packet being too large to forward without being fragmented), then the sender will try to resend the packet using a smaller segment size.

This procedure is repeated until ICMP errors stop coming back. At this point, the sender could use the size of the last-tested packet as the MTU for the entire network. Unfortunately, some systems assume that the absence of error messages means that the packet was delivered successfully, without conducting any further testing to verify the theory. However, some routers and firewalls do not return ICMP errors (due to security concerns or configuration errors), so the sender may never be told that its packets are being rejected.

This unreliability can cause a situation known as Path MTU Black Hole, where the sender has chosen to use an MTU that is too large for the end-to-end network, but the network is unable or unwilling to inform the sender of the problem. In this scenario, the sender continues sending data with an MTU that is too large for the intermediary network to forward without being fragmented (which is prohibited by the sender). Some implementations are aware of this problem, and if it appears that packets are not getting through then they reduce the size of the segments that they generate until acknowledgments are returned, or they clear the Don't Fragment flag, allowing fragmentation to occur.
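On Linux, the Don't Fragment behavior behind Path MTU Discovery can be controlled per socket. The following is a Linux-specific sketch; the numeric option values are Linux's, hard-coded here because Python does not expose all of them by name on every platform, and other operating systems use entirely different interfaces.

```python
import socket

# Linux values for the per-socket Path MTU Discovery options; these are
# assumptions about the platform, not portable constants.
IP_MTU_DISCOVER = 10   # option: set the per-socket PMTU discovery policy
IP_PMTUDISC_DO = 2     # policy: always set Don't Fragment, never fragment locally
IP_MTU = 14            # option: read the path MTU the kernel has learned

def force_pmtu_discovery(sock: socket.socket) -> None:
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)

def learned_path_mtu(sock: socket.socket) -> int:
    # Only meaningful on a connected socket.
    return sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
```

With IP_PMTUDISC_DO set, oversized packets fail locally with an error rather than being fragmented, which is the same signal an ICMP Fragmentation Required message would otherwise provide.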

For a complete discussion on this subject, refer to Notes on Path MTU Discovery in Chapter 5, The Internet Control Message Protocol.

Header size considerations

As we discussed in Chapter 2, The Internet Protocol, most IP packets have a 20-byte header, with a maximum of 60 bytes being used for this data. TCP segments also have their own header information, with a minimum value of 20 bytes (the most common), and a maximum size of 60 bytes. Taken together, most TCP/IP datagrams have 40 bytes of header data (20 from IP and 20 from TCP), with the maximum amount of header data being limited to 120 bytes (60 bytes each from IP and TCP).

Whenever TCP creates a segment, it must leave room for these headers. Otherwise, the IP packet that was generated would exceed the MTU/MRU pair in use on that virtual circuit, resulting in fragmentation.

Although RFC 1122 states that TCP implementations must set aside 40 bytes for headers when a segment is created, this isn't always enough. For example, some of the newer TCP options utilize an additional 10 or more bytes. If this information isn't taken into consideration, then fragmentation will likely occur.

TCP is able to determine much of this information, but not always. If the underlying IP stack also utilizes IP options that TCP is not aware of, then TCP will not make room for them when segments are created. This will also likely result in fragmentation.

For more information on IP header sizes, refer to The IP Header in Chapter 2, The Internet Protocol. For more information on TCP header sizes, refer to The TCP Header later in this chapter.

Data considerations

Remember that applications write data to TCP, which then stores the data into a local send buffer, generating a new segment whenever it is convenient or prudent to do so. Although segment sizes are typically calculated based on the available buffer space and MTU/MRU values associated with a given virtual circuit, sometimes the nature of the data itself mandates that a segment be generated, even if that segment won't be the most efficient size.

For example, if an application writes only a little bit of data, then TCP will not be able to create a large segment, since there just isn't much data to send to the remote endpoint. This is true regardless of the inefficiency of sending small amounts of data: if there isn't a lot of data to send, TCP can't send large segments.

The decision process that TCP goes through to figure out when to send small amounts of data incorporates many different factors. If an application is able to tell TCP how much data is being written—and if TCP isn't busy doing other stuff—then TCP could choose to send the data immediately. Conversely, TCP could choose to just sit on the data, waiting for more data to arrive.

Sometimes, an application knows that it will be sending only a little bit of data, and can explicitly tell TCP to immediately send whatever data is being written. This service is provided through the use of a push service within TCP, allowing an application to tell TCP to go ahead and immediately send whatever data it gets.

The push service is required whenever an application needs to tell TCP that only a small amount of data is being written to the send buffer. This is most often seen with client applications such as POP3 or HTTP that send only a few bytes of data to a server, but it can also be seen from servers that write a lot of data. For example, if an HTTP server needed to send more data than would fit within a segment, the balance of the data would have to be sent in a separate (small) segment. Once the HTTP server got to the end of the data, it would tell TCP that it was finished and to go ahead and send the data without waiting for more. This step would be achieved by the application setting the Push flag during the final write.

Some applications cause the Push flag to be set quite frequently. For example, some TELNET clients will set the Push flag on every keystroke, causing the client to send the keystroke quickly, thereby causing the server to echo the text back to the user's display quickly.

Once TCP gets data that has been pushed, it stores the data in a regular TCP segment, but it also sets a Push flag within that segment's TCP header. This allows the remote endpoint to also see that the data is being pushed. This is an important service, since the Push flag also affects the receiving system's segment-handling process. Just as a sending TCP will wait for more data to arrive from an application before generating a segment, a receiving TCP will sometimes wait for more segments to arrive before passing the data to the destination application. But if a receiver gets a segment with the Push flag set, then it is supposed to go ahead and send the data to the application without waiting for any more segments to arrive.

An interesting (but somewhat irrelevant) detail about the Push flag is that the practical usage is quite a bit different from the behavior defined in the standards. Although RFC 793 states that "a sending TCP is allowed to collect data" until the push function is signaled, and "then it must send all unsent data," most TCP implementations do not allow applications to set the Push flag directly. Instead, most TCP implementations simply send data as they receive it (most of the time, applications write data to TCP in chunks rather than in continuous streams), and TCP will set the Push flag in the last segment that it sends. Some implementations will even set the Push flag on every segment that they send.
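Since the Push flag usually cannot be set directly, the practical knob for forcing small writes onto the wire immediately is the TCP_NODELAY socket option, which disables the Nagle algorithm (RFC 896). A sketch, using a hypothetical helper function:

```python
import socket

# With TCP_NODELAY set, small writes are transmitted immediately rather
# than being coalesced while waiting for outstanding acknowledgments.
# This is the closest widely available equivalent to the push service.
def send_without_coalescing(sock: socket.socket, data: bytes) -> None:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.sendall(data)
```

Interactive applications such as TELNET clients are the classic users of this option, for exactly the keystroke-echo reasons described above.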

Similarly, many implementations ignore the Push flag on data they receive, immediately notifying the listening application of all new data, regardless of whether the Push flag is set on those segments.
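The control flags themselves are easy to examine in a captured segment, since they all live in a single byte of the TCP header. The sketch below decodes that byte from a raw 20-byte header; the hand-built header used to exercise it is purely illustrative.

```python
import struct

# Flag bits carried in byte 13 of the TCP header (RFC 793 bit layout)
FLAG_BITS = {'FIN': 0x01, 'SYN': 0x02, 'RST': 0x04,
             'PSH': 0x08, 'ACK': 0x10, 'URG': 0x20}

def tcp_flags(header: bytes) -> dict:
    """Return which control flags are set in a raw 20-byte TCP header."""
    flags = header[13]
    return {name: bool(flags & bit) for name, bit in FLAG_BITS.items()}

# A hand-built example header: source port 80, destination port 1025,
# data offset of 5 words, and the PSH and ACK flags set (0x08 | 0x10 = 0x18)
hdr = struct.pack('!HHIIBBHHH', 80, 1025, 0, 0, 5 << 4, 0x18, 8192, 0, 0)
print(tcp_flags(hdr))
```

A segment carrying the tail end of an application write would typically show exactly this PSH-plus-ACK combination.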

Another interesting flag within the TCP header is the Urgent flag. The Urgent flag can be used by an application whenever it needs to send data that must be dealt with immediately. If an application requests that a segment be sent using the Urgent flag, then TCP is supposed to place that segment at the front of the send queue, sending it out as soon as possible. In addition, the recipient is supposed to read that segment ahead of any other segments that may be waiting to be processed in the receive buffer.

Urgent data is often seen with TELNET, which has some standardized elements that rely on the use of the TCP Urgent flag. Some of the standardized control characters used with TELNET (such as interrupt process and abort output) have specific behavioral requirements that benefit greatly from the out-of-stream processing that the Urgent flag defines. For example, if a user were to send an interrupt process signal to the remote host and flag this data for Urgent handling, then the control character would be passed to the front of the queue and acted upon immediately, allowing the output to be flushed faster than would otherwise happen.

However, the use of the Urgent flag has been plagued by incompatibility problems ever since RFC 793 was first published. The original wording of that document did not clarify where the urgent data should be placed in the segment, so some systems put it in one place while other systems put it in another. The wording was clarified in RFC 1122, which stated that the urgent pointer points to the last byte of urgent data in the stream. Also of interest is the fact that the urgent pointer can refer to a byte location somewhere up ahead in the stream, in a future segment. All of the data up to and including the byte position specified by the urgent pointer is to be treated as a part of the urgent block. Unfortunately, some systems (such as BSD and its derivatives) still do not follow this model, resulting in an ongoing set of interoperability problems with this flag in particular.
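Most socket APIs expose urgent data through the "out-of-band" interface rather than through any explicit Urgent flag. The hedged sketch below uses Python's `MSG_OOB` option over a loopback connection: the client's out-of-band byte travels in a segment with the URG flag set, and the server reads it separately from the normal stream. The retry loop is there because the urgent byte can lag slightly behind the in-band data.

```python
import socket
import threading
import time

def server(srv, out):
    conn, _ = srv.accept()
    data = b''
    while len(data) < 5:                       # the ordinary, in-band bytes
        data += conn.recv(5 - len(data))
    urgent = None
    for _ in range(200):                       # the OOB byte may arrive a bit later
        try:
            urgent = conn.recv(1, socket.MSG_OOB)
            break
        except OSError:
            time.sleep(0.01)
    out['data'], out['urgent'] = data, urgent
    conn.close()

srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)
out = {}
t = threading.Thread(target=server, args=(srv, out))
t.start()

cli = socket.socket()
cli.connect(srv.getsockname())
cli.send(b'hello')                   # ordinary stream data
cli.send(b'!', socket.MSG_OOB)       # sets URG; the urgent pointer covers this byte
t.join()
cli.close()
srv.close()
print(out)
```

Note that only a single byte of urgent data is reliably deliverable this way on most stacks, which is part of the interoperability mess described above.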

For more information on the Push and Urgent flags, refer to Control Flags later in this chapter.

Flow Control

When an application needs to send data to another application over TCP, it writes the data to the local TCP provider, which queues the data into a send buffer. Periodically, TCP will package portions of the data into segments and pass them off to IP for delivery to the destination system.

One of the key elements to this process is flow control, where a sending system will adjust the rate at which it tries to send data to the destination system. A change in rate may be required due to a variety of reasons, including the available buffer space on the destination system and the packet-handling characteristics of the network. For this reason, TCP incorporates a variety of flow control mechanisms, allowing the sending system to react to these changes easily.

Originally, RFC 793 proposed only a handful of flow control mechanisms, most of which were focused on the receiving end of the connection. Of these services, the two most important were:

Receive window sizing
TCP can send only as much data as a receiver will allow, based on the amount of space available in the remote system's receive buffer, the frequency at which the buffers are drained, and other related factors. Therefore, one way for a receiver to adjust the transfer rate is to increase or decrease the size of the buffer being advertised. This in turn controls how much data a sender can transmit at once.

Sliding receive windows
In addition to the Window size being advertised by a receiver, the concept of a sliding window allows the sender to transmit segments on credit, before acknowledgments have arrived for segments that were already sent. This lets an endpoint send data even though the preceding data has not yet been acknowledged, trusting that an acknowledgment will arrive for that data shortly.

These mechanisms put the destination system in charge of controlling the rate at which the sender transmits data. As the original theory went, the receiver was likely to be the point of congestion in any transfer operation, and as such needed to have the last word on the rate at which data was being sent.

Over time, however, the need for sender-based flow control mechanisms has been proven, particularly since network outages may occur that require the sender to reduce its rate of transmission, even though the receiving system may be running smoothly. For this reason, RFC 1122 mandated that a variety of network-related flow control services also be implemented. Among these services are:

Congestion window sizing
In order to deal with congestion-related issues, the use of a congestion window is required at the sending system. The congestion window is similar in concept to the receive window in that it is expanded and contracted, although these actions are taken according to the underlying IP network's ability to handle the quantity of data being sent, rather than the recipient's ability to process the data.

Slow start
In an effort to keep congestion from occurring in the first place, a sender must first determine the capabilities of the IP network before it starts sending mass quantities of data over a newly established virtual circuit. This is the purpose of slow start, which works by setting the congestion window to a small size and gradually increasing its size, until the network's saturation point is found.

Congestion avoidance
Whenever network congestion is detected, the congestion window is reduced, and a technique called congestion avoidance is used to gradually rebuild the size of the congestion window, eventually returning it to its maximum size. When used in conjunction with slow start, this helps the sender to determine the optimal transfer rate of a virtual circuit.

Taken together, the use of the receive and congestion windows gives a sending system a fairly complete view of the state of the network, including the state of both the recipient and the congestion on the network.

A note on local blocking

Although there are a variety of flow control mechanisms found with TCP, the simplest form of flow control is local blocking, whereby a sending system refuses to accept data from a local application. This feature is needed whenever TCP knows that it cannot deliver any data to a specific destination system—perhaps due to problems with the receiver or the network—and the local send buffer is already full. Having nowhere to send the data, TCP must refuse to accept any new data from the sending application.

Note that TCP cannot block incoming network traffic (coming from IP). Since TCP is unable to tell which application a segment is destined for until its contents have been examined, TCP must accept every segment that it gets from IP. However, TCP may be unable to deliver the data to the destination application, due to a full queue or some other temporary condition. If this happens, TCP could choose to discard the segment, thereby causing the sender to retry the operation later (an effort which may or may not succeed).

Receive window size adjustments

In the section entitled Network I/O Management, I first mentioned the TCP header's Window field, suggesting that it provided an insight into the size of the receive buffer in use on a destination system. Although this is an accurate assessment when looking at TCP's segment sizing process, the primary purpose of the Window field is to provide the receiving system with flow control management services. The Window field is used to tell a sender how much data a recipient can handle. In this model, the recipient dictates flow control.

According to RFC 793, the window field specifies the number of octets that the receiving TCP is currently prepared to receive. In this scenario, a sending system can transmit only as much data as will fit within the recipient's receive buffer (as specified by the Window field) before an acknowledgment is required. Once the sender has transmitted enough data to fill the receive buffer, it must stop sending data and wait for an acknowledgment from the recipient before sending any more data.

Therefore, one way to speed up and slow down the data transfer rate between the two endpoint systems is for the receiving system to change the buffer size being advertised in the Window field. If a system that had been advertising an eight-kilobyte window suddenly started advertising a 16-kilobyte window, the sender could pump twice as much data through the circuit before having to wait for an acknowledgment.

Conversely, if the recipient started advertising a four-kilobyte window, then the sender could transmit only half as much data before requiring an acknowledgment (this would be enforced by the sender's TCP stack, which would start blocking writes from the sending application when this occurred).

An important consideration here is that recipients are not allowed to arbitrarily reduce their window size, but instead are only supposed to shrink the advertised window when they have received data which has not yet been processed by the destination application. Arbitrarily reducing the size of the receive window can result in a situation where the sender has already sent a bunch of data in accordance with the window size that was last advertised. If the recipient were to suddenly reduce the window size, then some of the segments would probably get rejected, requiring the sender to retransmit the lost data.

What happens when the receive buffer goes to zero, effectively preventing the sender from sending any data whatsoever? The answer varies by implementation, but generally speaking the sender will simply stop sending data until the receiver is ready to take data again. Any segments that were already sent may be rejected (or may get accepted), and as such the sender will have to deal with this issue when the window opens again.
Also, many systems implement an incremental fall-back timer, where they will probe the receiver for a window update periodically whenever this situation occurs. In this scenario, the sender will probe the receiver, and if the size of the window is still zero, then the sender will double the size of its probe timer. Once the timer expires, the sender will probe the receiver again, and if the receive window is still zero, the timer will get doubled again. This process will continue as long as the probe results in an acknowledgment (even if the window remains at zero), up to an implementation-specific maximum (such as 64 or 128 seconds).
As soon as the stalled system is able to begin accepting more data, it is supposed to send an unsolicited acknowledgment to the remote system, advising it that the window is open again.

Since the Window field is included in the header of every TCP segment, advertising a different buffer size is a very straightforward affair. If the recipient is willing to speed up or if it needs to slow down, it simply changes the value being advertised in the Window field of any acknowledgment segment that is being returned, and the sender will notice the change as soon as the segment containing the new value is received. Note that there may be some delay in this process, as it may take a while for that segment to arrive.

The size of the buffer also affects the number of segments that can be received, in that the maximum number of available segments is the Window size divided by the maximum segment size. Typically, systems will set their window size to four times the segment size (or larger), so if a system is using one kilobyte segments, then the smallest window size you would want to use on that system would be four kilobytes.

Unfortunately, since the Window field is only 16 bits long, the maximum size that can be advertised is 65,535 bytes. Although this is plenty of buffer space for most applications, there are times when it just isn't enough (such as when the MTU of the local network is also 64 kilobytes, resulting in a Window that is equal to only a single segment). One way around this limitation is the Window Scale option, as defined in RFC 1323. The Window Scale option allows two endpoints to negotiate 30-bit window sizes, allowing up to one gigabyte of buffer space to be advertised.
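Applications do not usually manipulate the advertised window directly; instead they size the socket's receive buffer, and the TCP stack derives the window (and any Window Scale negotiation) from that. A minimal sketch, assuming a Berkeley-sockets stack: requesting a buffer larger than 64 KB before connecting is what typically triggers the RFC 1323 Window Scale option in the SYN segment. The kernel is free to round, double, or cap the requested value.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request a 1 MB receive buffer before the connection is established; a
# buffer beyond 64 KB is what leads a modern stack to negotiate the
# Window Scale option at SYN time. The kernel may adjust the value.
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(actual)   # Linux, for example, reports an internally adjusted value
s.close()
```

The important detail is that the option must be set before `connect()` or `listen()`, since the scale factor can only be exchanged during the three-way handshake.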

While it may seem best to use very large window sizes, it is not always feasible or economical to do so. Each segment that is sent must be kept in memory until it has been acknowledged. A hand-held system may not have sufficient resources to cache many segments, and thus would have to use small window sizes in order to limit the amount of data being sent.

In addition, there is a point at which the size of the receive window no longer has any effect on throughput, but instead the bandwidth and delay characteristics of the virtual circuit become the limiting factors. Setting a value larger than necessary is simply a waste of resources and can also result in slower recovery. For example, if a sender sees a large receive window being advertised then it might try to fill that window, even though a router in between the two endpoints may not be able to forward the data very quickly. This delay can result in a substantial queue building up in the router, and if a segment ever does get lost, then it will take a long time for the recipient to notice the problem and the sender to correct it. This would result in extremely long gaps between retransmissions, and may also result in some of the queued data getting discarded (requiring even more retransmissions).
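The point at which a larger window stops helping is the bandwidth-delay product of the path: the number of bytes that must be in flight to keep the pipe full. A quick illustrative calculation (the link figures are just examples):

```python
def optimal_window(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return int(bandwidth_bps / 8 * rtt_seconds)

# A T1 circuit (1.544 Mbps) with a 100 ms round-trip time only needs about
# 19 KB in flight; advertising a much larger window than this just lets
# queues build up in the intermediate routers.
print(optimal_window(1_544_000, 0.100))   # 19300
```

A window sized well above this figure wastes memory on the endpoints and, as described above, slows recovery when a queued segment is eventually lost.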

For more information on the Window field, refer to Window. For more information on the TCP Window Scale option, refer to Window Scale. For detailed instructions on how to calculate the most optimal window size for a particular connection, refer to Notes on Determining the Optimal Receive Window Size. All of these sections appear later in this chapter.

Sliding receive windows

Even though large window sizes can help to increase overall throughput, they do not by themselves provide for sustained levels of throughput. In particular, a synchronous send-and-wait design that required a system to send data and then stop to wait for an acknowledgment would make the network quite jerky, with bursts of writes followed by long pauses. This problem is most noticeable on networks with high levels of latency, where there are extended periods of delay between the two endpoints.

In an effort to avoid this type of scenario, RFC 1122 states that a recipient should issue an acknowledgment for every two segments that it receives, if not more often. This design causes the receiver to issue acknowledgments quickly, and in turn those acknowledgments arrive back at the sender quickly.

Once an acknowledgment has arrived back at the sending system, the outstanding data is cleared from the send queue, thereby letting the sender transmit more data. In effect, the sending system can slide the window over by the number of segments that have been successfully acknowledged, allowing it to transmit more data, even though not all of the segments have been acknowledged yet.

As long as a sender continues receiving acknowledgments, it is able to continue sending data, with the maximum amount of outstanding segments being determined by the size of the recipient's receive buffer. This concept is illustrated in Figure 7-10, which shows how a sender can increment the sliding window whenever it receives an acknowledgment for previously sent data. For example, as the sender is transmitting segment number three, it receives an acknowledgment for segment number one, allowing the sender to move the send buffer forward by one segment.

The key element here is that the sender can transfer only as many bytes of data as the receiver can handle, as advertised in the Window field of the TCP headers sent by the recipient. If the recipient's receive window is set to eight kilobytes and the sender transmits eight one-kilobyte segments without having received an acknowledgment, then it must stop and wait for an acknowledgment before sending any more data.

However, if the sender receives an acknowledgment for the first two segments after having sent eight of them, then it can go ahead and send two more, since the window allows up to eight kilobytes to be in transit at any time. On networks with low levels of latency (such as Ethernet), this feature can have a dramatic impact on

Figure 7-10.
An overview of TCP's sliding window mechanism

overall performance, providing for sustained levels of high utilization. On networks with very high levels of latency (such as those that use satellite links), the effect is less pronounced, although it is still better than the send-and-wait effect that would otherwise be felt.
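The byte-counting behind the sliding window can be sketched in a few lines. This is a simplified model, not a real TCP sender: it tracks only how many unacknowledged bytes are outstanding against the advertised window.

```python
class SlidingWindowSender:
    """Byte-counting sketch of the sender side of TCP's sliding window."""

    def __init__(self, window):
        self.window = window        # receiver's advertised window, in bytes
        self.in_flight = 0          # bytes sent but not yet acknowledged

    def can_send(self, nbytes):
        return self.in_flight + nbytes <= self.window

    def send(self, nbytes):
        if not self.can_send(nbytes):
            raise BlockingIOError('window full; must wait for an acknowledgment')
        self.in_flight += nbytes

    def ack(self, nbytes):
        self.in_flight -= nbytes    # the window slides; that much more may go out

sender = SlidingWindowSender(8 * 1024)
for _ in range(8):
    sender.send(1024)               # eight 1 KB segments fill the 8 KB window
print(sender.can_send(1024))        # False: the sender must stop and wait
sender.ack(2048)                    # an ACK for the first two segments arrives
print(sender.can_send(2048))        # True: the window slid forward by 2 KB
```

This mirrors the example above: eight outstanding kilobytes stall the sender, and acknowledging the first two segments immediately opens room for two more.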

In situations where the window size is smaller than the MTU, a sliding window is harder to implement. Some systems will write only a single segment (up to the maximum allowed by the advertised receive buffer), and then stop to wait for an acknowledgment. Other systems will reduce the size of the local send buffers to half (or less) of the advertised receive window, thereby forcing multiple small segments to be written in an effort to increase the number of acknowledgments that are generated.

Another problem can occur with some TCP implementations that do not issue acknowledgments for every two segments that are received, but instead issue acknowledgments when they have received enough data to fill two maximum-sized segments. For example, if the system has a local MTU of 1500 bytes, but is receiving data in 500-byte chunks, then such a system would only issue acknowledgments for every six segments that arrive (6 × 500 = 3000, which is MTU times two). This process would result in a substantially slower acknowledgment cycle that could cause problems if the sender had a small send window. Although this problem is somewhat rare, it does happen.

For systems that implement this procedure correctly (sending acknowledgments for every two segments that are received, regardless of the maximum segment size), this design can substantially improve overall performance. By using the sliding window technique—and by using large windows—it is quite possible for two fast systems on a fast network to practically saturate the connection with data.

For more information on how frequent acknowledgments can impact performance, refer ahead to Delayed acknowledgments.

The Silly Window Syndrome

The amount of buffer space that a system advertises depends on how much buffer space it has available at that given moment, which is dependent upon how quickly applications can pull data out of the receive buffer. This in turn is driven by many factors, such as the complexity of the application in use, amount of CPU time available, the design of the TCP stack in use, and other elements.

Unfortunately, many of the first-generation TCP-based applications did a very poor job of cleaning out the receive buffers, taking only a few bytes at a time. As a result, the receiving system would advertise a receive window of only a few bytes. In turn, the sender would transmit only a very small segment, since that was all the recipient was advertising. This process would repeat incessantly, with the recipient taking another few bytes out of the receive queue, advertising another small window, and then receiving yet another very small segment.

To prevent this scenario (known affectionately as the Silly Window Syndrome), RFC 1122 clarified the amount of buffer space that could be advertised, stating that systems could only advertise a non-zero window if the amount of buffer space available could hold a complete segment (as defined by the value shown in the MSS option), or if the buffer space was at least half of the normal window size. If neither of these conditions is met, then the receiver should advertise a zero-length window, effectively forcing the sender to stop transmitting.
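The receiver-side rule from RFC 1122 reduces to a small decision function. A sketch, with illustrative numbers:

```python
def advertised_window(free_space, mss, total_buffer):
    # Receiver-side Silly Window Syndrome avoidance, per RFC 1122:
    # advertise a non-zero window only when a full segment would fit,
    # or when at least half of the buffer is free.
    if free_space >= mss or free_space >= total_buffer // 2:
        return free_space
    return 0

print(advertised_window(200, 1460, 8192))    # 0: too small, sender must wait
print(advertised_window(1460, 1460, 8192))   # 1460: a full segment fits
print(advertised_window(4096, 1460, 8192))   # 4096: half the buffer is free
```

Holding the window at zero until a useful amount of space opens up is what breaks the cycle of tiny advertisements and tiny segments.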

The Nagle algorithm

The Silly Window Syndrome is indicative of a problem at the receiver's end of the virtual circuit. Data is not being read from the receive buffers quickly, resulting in small window sizes being advertised, which in turn causes the sender to transmit small segments. The result is that lots of network traffic gets generated for very small amounts of data.

However, a sending system can also cause these kinds of problems, although for totally different reasons. Some applications (such as TELNET) are designed to send many small segments in a constant barrage, which causes high levels of network utilization for small amounts of data. This is also a problem for applications that write data only in small chunks, such as writing 10 megabytes of data in 512-byte blocks. The number of packets generated in that model is extremely wasteful of bandwidth, particularly when the same transfer could be done using larger writes.

One solution proposed to this kind of problem is the Nagle algorithm, which was originally described in RFC 896. Simply put, the Nagle algorithm suggests that segments that are smaller than the maximum size allowed (as defined by the MSS option of the recipient or the discovered MTU of the end-to-end path) should be delayed until all prior segments have been acknowledged or until a full-sized segment can be sent. This rule forces TCP stacks to merge multiple small writes into a single write, which is then sent as a single segment.
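The rule itself fits in one line. A sketch of the send-or-buffer decision (the function and parameter names are illustrative, not from any real stack):

```python
def nagle_allows_send(pending_bytes, mss, unacked_data):
    # RFC 896: transmit immediately only when a full-sized segment is ready,
    # or when nothing is outstanding; otherwise keep coalescing small writes.
    return pending_bytes >= mss or not unacked_data

print(nagle_allows_send(512, 1460, unacked_data=False))  # True: nothing in flight
print(nagle_allows_send(512, 1460, unacked_data=True))   # False: wait and coalesce
print(nagle_allows_send(1460, 1460, unacked_data=True))  # True: a full segment
```

The elegance of the rule is that it is self-clocking: on a fast network acknowledgments return quickly and small writes go out almost immediately, while on a slow network the wait for an acknowledgment automatically batches the small writes together.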

On a low-latency LAN, the Nagle algorithm rarely comes into play, since a small segment will be sent and acknowledged very quickly, allowing another small segment to be sent immediately (effectively eliminating the use of the Nagle algorithm). On slow WAN links though, the Nagle algorithm comes into play quite often, since acknowledgments take a long time to be returned to the sender. This results in the next batch of small segments getting bundled together, providing a substantial increase in overall network efficiency.

For these reasons, use of the Nagle algorithm is encouraged by RFC 1122, although its usage is not mandatory. Some applications (such as X Windows) react poorly when small segments are clumped together. In those cases, users must have the option of disabling the Nagle algorithm on a per-circuit basis. However, most TCP implementations do not provide this capability, instead allowing users to enable or disable its use only on a global scale, or leaving it up to the application developer to decide when it is needed.

This limitation can be somewhat of a problem, since some developers have written programs that generate inefficient segment sizes frequently, and have then gone and disabled the use of the Nagle algorithm on those connections in an effort to improve performance, even though doing so results in much higher levels of network utilization (and doesn't do much to improve performance in the end). If those developers had just written their applications to use large writes instead of multiple small writes, then the Nagle algorithm would never come into effect, and the applications would perform better anyway.
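On systems that do expose per-circuit control, disabling Nagle is done with the standard `TCP_NODELAY` socket option. A minimal sketch using Python's socket API:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_NODELAY disables the Nagle algorithm for this one connection only;
# every small write will then be sent as its own segment immediately.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay)
s.close()
```

As the text above argues, this option is appropriate for genuinely latency-sensitive traffic (such as X Windows events), not as a blanket "performance" setting for applications that simply issue inefficiently small writes.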

Another interesting side effect that appears when the Nagle algorithm is disabled is that the delayed acknowledgment mechanism (as described later in Delayed acknowledgments) does not tend to work well when small segments are being generated, since it waits for two full-sized segments to arrive before returning an acknowledgment for those segments. If it does not receive full-sized segments because a developer has turned off the Nagle algorithm, then the delayed acknowledgment mechanism will not kick in until a timer expires or until data is being returned to the sender (which the acknowledgments can piggyback onto).

This can be a particular problem when just a little bit of data needs to be sent. The sender will transmit the data, but the recipient will not acknowledge it until the timer expires, resulting in a very jerky session.

This situation can also happen when a small amount of data is being generated at the tail-end of a bulk transfer. However, the chances are good that in this situation the remote endpoint is going to generate some sort of data (such as a confirmation status code or a circuit-shutdown request). In that case, the delayed acknowledgment will piggyback onto whatever data is being returned, and the user will not notice any excess delays.

For all of these reasons, application developers are encouraged to write data in large, even multiples of the most-efficient segment size for any given connection, whenever that information is available. For example, if a virtual circuit has a maximum segment size of 1460 bytes (the norm for Ethernet), the application should write data in even multiples of 1460 (such as 2,920 byte blocks, or 5,840 byte blocks, and so forth). This way, TCP will generate an even number of efficiently sized segments, resulting in the Nagle algorithm never causing any delay whatsoever, and also preventing the delayed acknowledgment mechanism from holding up any acknowledgments.
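The block-sizing advice above can be sketched as a small helper. This is an illustrative utility, not a standard API: it splits an application buffer into writes that are even multiples of the segment size, so every write except possibly the last produces full-sized segments.

```python
MSS = 1460   # typical maximum segment size on Ethernet

def efficient_blocks(data, mss=MSS, multiple=4):
    """Yield writes in even multiples of the segment size (here, 4 x MSS)."""
    block = mss * multiple
    for i in range(0, len(data), block):
        yield data[i:i + block]

payload = bytes(10_000)
sizes = [len(b) for b in efficient_blocks(payload)]
print(sizes)   # [5840, 4160]: only the final write is an odd size
```

Written this way, only the last write of a transfer can produce a runt segment, so neither the Nagle algorithm nor the delayed acknowledgment mechanism introduces any mid-transfer stalls.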

Congestion window sizing

TCP's use of variable-length, sliding windows provides good flow control services to the receiving end of a virtual circuit. If the receiver starts having problems, it can slow down the rate at which data is being sent simply by scaling back the amount of buffer space being advertised. But if things are going well, the window can be scaled up, and traffic can flow as fast as the network will allow.

Sometimes, however, the network itself is the bottleneck. Remember that TCP segments are transmitted within IP packets, and that these packets can have their own problems outside of the virtual circuit. In particular, a forwarding device in between the two endpoints could be suffering from congestion problems, whereby it was receiving more data than it could forward, as is common with dial-up servers and application gateways.

When this occurs, the TCP segments will not arrive at their destination in a timely manner (if they make it there at all). In this scenario, the receiving system (and the virtual circuit) may be operating just fine, but problems with the underlying IP network are preventing segments from reaching their destination.

This problem is illustrated in Figure 7-11, which shows a device trying to send data to the remote endpoint, although another device on the network path is suffering from congestion problems, and has sent an ICMP Source Quench error message back to the sender, asking it to slow down the rate of data transfer.

Figure 7-11.
Detecting congestion with ICMP Source Quench error messages

Congestion problems can be recognized by the presence of an ICMP Source Quench error message, or by the recipient sending a series of duplicate acknowledgments (suggesting that a segment has been lost), or by the sender's acknowledgment timer reaching zero. When any of these problems occur, the sender must recognize them as being congestion-related, and take counter-measures that deal with them appropriately. Otherwise, if a sender were to simply retransmit segments that were lost due to congestion, the result would be even more congestion. Orderly congestion recovery is therefore required in order for TCP to maintain high performance levels, but without causing more congestion to occur.

At the heart of the congestion management process is a secondary variable called the congestion window that resides on the sender's system. Like the receive window, the congestion window dictates how much data a sender can transmit without stopping to wait for an acknowledgment, although rather than being set by the receiver, the congestion window is set by the sender, according to the congestion characteristics of the IP network.

During normal operation, the congestion window is the same size as the receive window. Thus, the maximum transfer rate of a smooth-flowing network is still restricted by the amount of data that a receiver can handle. If congestion-related problems occur, however, then the size of the congestion window is reduced, thereby making the limiting factor the sender's capability to transmit, rather than the receiver's capability to read.

How aggressively the congestion window is reduced depends upon the event that triggered the resizing action:

If congestion is detected by the presence of a series of duplicate acknowledgments, then the size of the congestion window is cut in half, severely restricting the sender's ability to transmit segments. TCP then utilizes a technique known as congestion avoidance to slowly increment the size of the congestion window, cautiously ramping up the rate at which it can send data, until it returns to the full throttle state.

If congestion is detected by the TCP acknowledgment timer reaching zero or by the presence of an ICMP Source Quench error message, then the congestion window is shrunk so small that only one segment can be sent. TCP uses a technique known as slow start to begin incrementing the size of the congestion window until it is half of its original size, at which point the congestion avoidance technique is called into action to complete the ramp-up process.

Slow start and congestion avoidance are similar in their recovery techniques. However, they are also somewhat different and are used at different times. Slow start is used on every new connection (even those that haven't yet experienced any congestion) and whenever the congestion window has been dropped to just one segment. Conversely, congestion avoidance is used both to recover from non-fatal congestion-related events and to slow down the rate at which the congestion window is being expanded, allowing for smoother, more sensitive recovery procedures.
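The two reactions described above can be sketched as a single decision function. The event names here are illustrative labels, not protocol fields:

```python
def react_to_congestion(cwnd, mss, event):
    """Resize the congestion window in response to a congestion signal."""
    if event == 'duplicate_acks':
        # Non-fatal signal: cut the window in half, then grow it back
        # linearly via congestion avoidance.
        return max(cwnd // 2, mss)
    if event in ('timeout', 'source_quench'):
        # Near-fatal signal: collapse to a single segment and re-enter
        # slow start.
        return mss
    return cwnd

print(react_to_congestion(64 * 1460, 1460, 'duplicate_acks'))  # 46720
print(react_to_congestion(64 * 1460, 1460, 'timeout'))         # 1460
```

The asymmetry is deliberate: duplicate acknowledgments prove that segments are still getting through, so a halved window suffices, while a timeout or Source Quench suggests the path is badly congested and warrants starting over from one segment.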

Slow start

One of the most common problems related to congestion is that senders attempt to transmit data as fast as they can, as soon as they can. When a user asks for a big file, the server gleefully tries to send it at full speed immediately.

While this might seem like it would help to complete the transfer quickly, in reality it tends to cause problems. If there are any bottlenecks between the sender and receiver, then this burst-mode form of delivery will find them very quickly, causing congestion problems immediately (most likely resulting in a dropped segment). The user may experience a sudden burst of data, followed by a sudden stop as their system attempts to recover one or more lost segments, followed by another sudden burst. Slow start is the technique used to avoid this particular scenario.

In addition, slow start is used to recover from near-fatal congestion errors, where the congestion window has been reset to one segment, due to an acknowledgment timer reaching zero, or from an ICMP Source Quench error message being received.

Slow start works by exponentially increasing the size of the congestion window. Every time a segment is sent and acknowledged, the size of the congestion window is increased by one segment's worth of data (as determined by the discovered MTU/MRU sizes of the virtual circuit), allowing for more and more data to be sent.

For example, if the congestion window is set to one segment (with the segment size being set to whatever value was determined during the setup process), a single segment will be transmitted. If this segment is acknowledged, then the congestion window is incremented by one, now allowing two segments to be transmitted simultaneously. The next two segments in the send buffer then get sent. If they are both acknowledged, they will each cause the congestion window to be incremented by one again, thus adding room for two more segments (with the congestion window being set to four segments total). All of the segments do not have to be acknowledged before the congestion window is incremented, as shown in Figure 7-12.

Figure 7-12.
An overview of the slow start algorithm

If a connection is new, then the process is repeated until congestion is detected or the size of the congestion window is equal to the size of the receive window, as advertised by the receiver's Window field. If the ramping process is successful, then the virtual circuit will eventually be running at full speed, with the flow control being dictated by the size of the recipient's receive buffer. But if congestion is detected during the incrementing process, the congestion window will be locked to the last successful size. Any further congestion problems will result in the congestion window being reduced (as per the process described earlier in Congestion window sizing ).
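The exponential ramp just described can be sketched in a few lines of Python. This is purely illustrative (the function name and the round-based accounting are assumptions, not part of any real stack): every acknowledged segment adds one segment to the congestion window, so the window doubles each round trip until it reaches the receiver's advertised window.

```python
def slow_start_rounds(rcv_window, rounds):
    """Return the congestion window size (in segments) after each round
    trip of slow start, capped at the receiver's advertised window."""
    cwnd = 1                                 # start with a single segment
    sizes = []
    for _ in range(rounds):
        # Each segment in flight is acknowledged, and each acknowledgment
        # grows the window by one segment -- so the window doubles.
        cwnd = min(cwnd * 2, rcv_window)
        sizes.append(cwnd)
    return sizes

print(slow_start_rounds(rcv_window=64, rounds=6))   # [2, 4, 8, 16, 32, 64]
```

Note how quickly the window opens: six round trips are enough to go from one segment to a 64-segment window, which is exactly why any bottleneck along the path is found so fast.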

However, if the slow start routines are being used to recover from a congestion event, then the slow start procedure is used only until the congestion window reaches half of the size it had when the congestion was detected. At this point, the congestion avoidance technique is called upon to continue increasing the size of the congestion window (as described soon in Congestion avoidance ). Since congestion is likely to occur again very quickly, TCP takes the more cautious, linear-growth approach outlined with congestion avoidance, as opposed to the ambitious, exponential growth provided by slow start.

Note that although RFC 1122 mandates the use of slow start with TCP, the procedure was not fully documented until RFC 2001 was published. Therefore, many of the earlier systems do not incorporate the slow start routines described here.

In addition, RFC 2414 advocates the use of up to four segments as the seed value for slow start, rather than the one segment proposed in RFC 2581 (TCP Congestion Control), which is arguably an improvement for applications that send more than one segment of data. For example, if an application needed to send two segments of data but the initial congestion window was locked at one segment, then the application could send only one of those segments. As such, the remote endpoint would not receive all of the application data, and the delayed acknowledgment mechanisms would force a long pause before an acknowledgment was returned. But by setting the initial congestion window to two segments, the sender can issue two full-sized segments, which will result in the recipient issuing an acknowledgment immediately. Although this allows connections to ramp up faster, note that RFC 2414 is only experimental and is not required to be implemented in any shipping TCP implementations.
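As a sketch, the larger seed value can be computed directly from the formula given in RFC 2414; the function name and the MSS values below are merely examples.

```python
def initial_window(mss):
    """Upper bound on the initial congestion window, in bytes, using the
    formula from RFC 2414: min(4 * MSS, max(2 * MSS, 4380 bytes))."""
    return min(4 * mss, max(2 * mss, 4380))

print(initial_window(536))    # 2144 bytes: four 536-byte segments
print(initial_window(1460))   # 4380 bytes: three 1460-byte segments
```

With a small MSS the sender may open with four segments, while Ethernet-sized segments yield three; either way, the first acknowledgment is not held hostage to the delayed acknowledgment timer.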

Congestion avoidance

The congestion avoidance routines are called whenever a system needs a slower, more cautious form of congestion recovery than the exponential growth offered by the slow start procedure. Such a mechanism may be required when a system detects congestion from the presence of multiple duplicate acknowledgments, or as part of the recovery mechanisms that are utilized when an acknowledgment timer reaches zero.

Although duplicate acknowledgments are not uncommon (and are allowed for by TCP's error-recovery mechanisms), the presence of many such acknowledgments tends to indicate that an IP datagram has been lost somewhere, most likely due to congestion occurring on the network. As such, RFC 2581 states that if three or more duplicate acknowledgments are received, then the size of the congestion window should be cut in half, and the congestion avoidance technique is to be used in an effort to return the network to full throttle.

Another scenario where congestion avoidance is used is if the sender's acknowledgment timer has expired, which means that no acknowledgments are coming back from the other end. This signifies that there are serious congestion problems, or that the other system has left the network. In an effort to recover from this event, the congestion window is shrunk so small that only one segment can be sent. The slow start mechanism is then called upon and used until the congestion window reaches half of its original size. Congestion avoidance is then used to return the network to full speed, albeit at a slower, more cautious rate.

Congestion avoidance is very similar to slow start in that the size of the congestion window is expanded whenever acknowledgments arrive for segments that have been sent. However, rather than incrementing the congestion window on a one-for-one basis (as is done with slow start), the congestion window is incremented by only one segment when all of the segments sent within a single window are acknowledged.

For example, assume that a system's congestion window is set to allow four segments, although the recipient's receive window is advertising a maximum capacity of eight segments. Using congestion avoidance, a system would send four segments and then wait for all of them to be acknowledged before incrementing the size of the congestion window by one (now being set to five segments).

If this effort was a success, then the next five segments would be sent, and if all of them were acknowledged, then the congestion window would be increased to six. This process would continue until either congestion occurred again or the congestion window equaled the size of the receive window being advertised by the recipient (eight segments here).

Note that it doesn't matter if the remote system sends back a single acknowledgment for all of the segments previously sent, or if individual acknowledgments are returned for each of the segments. With congestion avoidance, all of the segments must be acknowledged before the size of the congestion window will be incremented.
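The linear growth described above can be sketched as follows (the function name and the segment-based accounting are illustrative assumptions): the window grows by one segment only after an entire window's worth of data has been acknowledged.

```python
def congestion_avoidance(cwnd, rcv_window, windows_acked):
    """Grow the congestion window (in segments) by one for every full
    window of data acknowledged, never exceeding the receive window."""
    for _ in range(windows_acked):
        cwnd = min(cwnd + 1, rcv_window)
    return cwnd

# Starting at four segments with an eight-segment receive window, four
# fully acknowledged windows bring the circuit back up to full speed.
print(congestion_avoidance(cwnd=4, rcv_window=8, windows_acked=4))   # 8
```

Compare this with the slow start sketch earlier: there the window doubled each round trip, while here it creeps up by a single segment per window.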

Also note that although RFC 1122 mandates the use of congestion avoidance with TCP, the procedure was not fully documented until RFC 2001 was published. Therefore, many of the earlier systems do not incorporate the congestion avoidance routines described here.

Reliability

The most often touted TCP service is reliability, with TCP's virtual circuit design practically guaranteeing that data will get delivered intact. Using this design, TCP will do everything it can to get data to the proper destination, if at all possible. If this is not possible—perhaps due to a failure in the network or some other catastrophic event—then naturally TCP won't be able to deliver the data. However, as long as the network and hosts are operational, TCP will make sure that the data is delivered intact.

TCP's reliability service takes many forms, employing many different technologies and techniques. Indeed, RFC 793 states that TCP must be able to recover from data that is damaged, lost, duplicated, or delivered out of order. This is a broad range of service, and as such TCP's reliability mechanisms tend to be somewhat complex.

The most basic form of reliability comes from the use of checksums. TCP checksums are used to validate segments (including the TCP headers and any associated data). Furthermore, checksums are mandatory with TCP (as opposed to being optional as they are with UDP), requiring that the sender compute them, and that the recipient compare them to segments received. This provides a simple validation mechanism that lets a receiver test for corrupt data before handing the data off to the destination application.

Although checksums are useful for validating data, they aren't of any use if they never arrive. Therefore, TCP also has to provide delivery services that will ensure that data arrives in the first place. This service is provided by TCP's use of sequence numbers and acknowledgments, both of which work together to make TCP a reliable transport. Once a segment has been sent, the sender must wait for an acknowledgment to be returned stating that all of the data has been successfully received. If a segment is not acknowledged within a certain amount of time, the sender will eventually try to send it again. This design allows TCP to recover from segments that get lost in transit.

Furthermore, the use of unique sequence numbers allows a receiver to reorder any segments that may have come in out of sequence. Since IP is unpredictable, it is entirely possible that some datagrams will be routed over a slower link than the rest, causing some of them to arrive in a different order than they were sent. The receiving TCP system can use the sequence numbers to reorder segments into their correct sequence—as well as eliminate any duplicates—before passing the data off to the destination application.

Taken together, these services make TCP an extremely reliable transport protocol, which is why it is the transport of choice for most Internet applications.

In summary, the key elements of TCP's reliability service are:

Checksums
TCP uses checksums for every segment that is sent, allowing the destination system to verify that the data within the segment is valid.

Sequence numbers
Every byte of data that gets sent across a virtual circuit is assigned a sequence number. These sequence numbers allow the sender and receiver to refer to a range of data explicitly, and also allows the recipient to reorder segments that come in out of order, as well as eliminate any duplicates.

Acknowledgments
Every byte of data sent across a virtual circuit must be acknowledged. This task is achieved through the use of an acknowledgment number, which is used to state that a receiver has received all of the data within a segment (as opposed to receiving the segment itself), and is ready for more data.

Timers
Since TCP uses IP for delivery, some segments can get lost or corrupted on their way to the destination. When this happens, no acknowledgment will be received by the sender, and the questionable data must be retransmitted. In order to detect this situation, TCP also incorporates an acknowledgment timer, allowing the sender to retransmit lost data that does not get acknowledged.

In practice, these mechanisms are tightly interwoven, with each of them relying on the others in order to provide a totally reliable implementation. They are discussed in detail in the following sections.

TCP checksums

TCP checksums are identical to UDP checksums, except that their usage is mandatory with TCP (instead of optional, as with UDP), for both the sending and the receiving systems. RFC 1122 clearly states that the receiver must validate every segment received, using the checksum to verify that the contents of the segment are correct before delivering it to the destination application.

Checksums provide a valuable service in that they verify that data has not been corrupted in transit. All of the other reliability services provided by TCP—the sequence numbers, acknowledgments, and timers—serve only to ensure that segments arrive at their destination; checksums make sure the data inside the segments arrives intact.

Checksums are calculated by performing ones-complement math against the header and data of the TCP segment. Also included in this calculation is a pseudo-header that contains the source and destination IP addresses, the Protocol Identifier (6 for TCP), and the size of the TCP segment (including the TCP headers and data). By including the pseudo-header in the calculations, the destination system is able to validate that the sender and receiver information is also correct, in case the IP datagram that delivered the TCP segment got mixed up on the way to its final destination.
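As an illustrative sketch (the addresses and the all-zero header below are fabricated for the example), the calculation might look like this, with the pseudo-header built from the source address, the destination address, a zero byte, the protocol number 6, and the TCP length:

```python
import struct
import socket

def ones_complement_sum(data):
    """Fold the 16-bit words of `data` together using ones-complement math."""
    if len(data) % 2:
        data += b"\x00"                     # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)    # wrap carries back in
    return total

def tcp_checksum(src_ip, dst_ip, segment):
    """Checksum the pseudo-header plus the TCP segment (header and data)."""
    pseudo = struct.pack("!4s4sBBH",
                         socket.inet_aton(src_ip),
                         socket.inet_aton(dst_ip),
                         0, 6, len(segment))
    return (~ones_complement_sum(pseudo + segment)) & 0xFFFF

# A fabricated 20-byte header (all zeros, checksum field included) plus data.
segment = b"\x00" * 20 + b"hello"
print(hex(tcp_checksum("192.0.2.1", "192.0.2.2", segment)))
```

A receiver performs the same calculation over the segment as received; if the stored checksum is valid, the grand total comes out to all ones (0xFFFF).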

TCP must validate the checksum before issuing an acknowledgment for the segment. If a segment is received with an invalid checksum, then the segment must be discarded. Discarding the segment is a silent event, with no notification of the failure being generated or sent.

This is required behavior, since the recipient has no way of determining which circuit the segment belongs to if the checksum is deemed invalid (the header could be the corrupt part of the segment). In such a situation, an error message could be sent to the wrong source, thereby causing additional (and unrelated) problems to ensue. Instead, the segment is thrown away, and the original sending system would eventually notice that the data was not successfully received (due to the acknowledgment timer expiring), and the segment would eventually be reissued.

Since each virtual circuit consists of a pair of sockets, the receiver has to know the IP address of the sender in order to deliver the data to the correct destination application. If there are multiple connections to port 80 on the local server (as would be found with an HTTP server), TCP has to know which system sent the data in order to deliver it to the right instance of the local server. Although this information is available from IP, TCP verifies the information using the checksum's pseudo-header.

Note that RFC 1146 introduced a TCP option for alternative checksum mechanisms. However, the Alternative Checksum option was classified as experimental, and RFC 1146 has since expired. Therefore, the Alternative Checksum option should not be used with any production TCP implementations.

Sequence numbers

A key part of TCP's reliability service is the use of sequence numbers and acknowledgements, allowing the sender and receiver to constantly inform each other of the data that has been sent and received. These two mechanisms work hand-in-glove to ensure that data arrives at the destination system.

RFC 793 states that each [byte] of data is assigned a sequence number. The sequence number for the first byte of data within a particular segment is then published in the Sequence Identifier field of the TCP header. Thus, when a segment is sent, the Sequence Identifier field shows the starting byte number for the data within that particular segment. Note that sequence numbers do not refer to segments, but instead refer to the starting byte of a segment's data block.

Once a segment is received, the data is verified by the recipient (using the checksum), and if it's okay, then the recipient will send an acknowledgment back to the sender. The acknowledgment is also contained within a TCP segment, with the Acknowledgment Identifier field in the TCP header pointing to the next sequence number that the recipient is willing to accept. The acknowledgment effectively says I received all of the data up to this point and am ready for the next byte of data, starting at sequence number n.

When the acknowledgment arrives, the sender knows that the receiver has successfully received all of the data contained within the segment, and the sender is then able to transmit more data (up to the maximum amount of data that will fit within either the receiver's current receive window or the sender's current congestion window). This process is illustrated in Figure 7-13. In that example, the sender has identified the first byte of data being sent as 1, while the acknowledgment for that segment points to the first byte of data from the next segment that the receiver expects to get (101).

Figure 7-13.
Using sequence numbers to track data

In practice, sequence numbers should rarely be numbered 1. Sequence numbers are 32-bit integers, with a possible range in values from zero through 4,294,967,295. RFC 1122 states that systems must seed the sequence number value on all new circuits using a value derived from the local system's clock. Therefore, the first byte of data being sent across a virtual circuit should not be numbered 1, but instead should be numbered according to a value derived from the current time. Some systems violate this principle, starting at 1 even though they're not supposed to.

The main reason for seeding the sequence number with the system clock is for safety. If sequence numbers always start at a fixed integer (like 1), there is an increased opportunity for overlapping to occur, particularly on circuits that are opened and closed rapidly. For example, if two systems used the same port numbers for multiple connections, and a segment from the first connection got lost in transit, that segment may arrive at the destination during the next connection, thereby appearing to be a valid sequence number. For this reason, all TCP implementations should always seed the sequence number for all new connections using a value derived from the local system's clock.
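A minimal sketch of such a clock-driven generator, assuming RFC 793's suggestion of a counter that ticks roughly every four microseconds and wraps at 2**32 (the function name is illustrative, and as discussed next, production stacks should also make the value unpredictable):

```python
import time

def initial_sequence_number(now=None):
    """Derive an ISN from the system clock: one tick every 4 microseconds,
    wrapping around at 2**32 (roughly every 4.8 hours)."""
    now = time.time() if now is None else now
    microseconds = int(now * 1_000_000)
    return (microseconds // 4) % 2**32

print(initial_sequence_number())     # a clock-dependent 32-bit value
```

Two connections opened in quick succession thus start from nearby, but different, points in the sequence space, so a straggler segment from the first circuit is unlikely to fall inside the second circuit's window.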

In addition, RFC 1948 discussed how this information could be used to launch a variety of attacks against a system, and that using predictable sequence numbers was not only a technical problem but a security risk as well. Essentially, predictable sequence numbers also mean that acknowledgment numbers can be predicted. Given that information, it is easy for a remote hacker to send fake packets to your servers, providing valid IP addresses and acknowledgment numbers. This loophole lets the bad guy compromise your systems without ever seeing a single packet. Unfortunately, some systems still use highly predictable sequence numbers today, and this problem has not gone away entirely.

Another concern with sequence numbers is that they can wrap around during long data transfers. Although there are more than four billion possible sequence numbers, this is not an infinite amount, so reusing sequence numbers will certainly happen on some circuits, particularly those that are kept open for extended periods of time. For example, if a 10-gigabyte file was transferred between two hosts, then the sequence numbers used on that virtual circuit would have to wrap around twice, with some (if not many) of the sequence numbers getting reused at some point. When this occurs, a segment that got lost or redirected in transit could show up late and appear to be a valid segment.

In order to keep reused sequence numbers from causing these kinds of problems, the recipient must limit the active sequence numbers to a size that will fit within the local receive buffer. Since the receive buffer limits the amount of data (in bytes) that can be outstanding and unacknowledged at any given time, a recipient system can simply ignore any segments with sequence numbers that are outside the boundaries of the current window range.

For example, if a recipient has an eight-kilobyte receive buffer, it can set an eight-kilobyte limit on the data that it receives. If it is currently processing segments with sequence numbers around 100,000, then it can safely ignore any segments that arrive with sequence numbers less than 92,000 or greater than 108,000.
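That acceptance test can be sketched with modular 32-bit arithmetic, so the comparison keeps working when sequence numbers wrap past 4,294,967,295; the figures below mirror the example above, and the function name is an assumption.

```python
WINDOW = 8 * 1024          # an eight-kilobyte receive buffer

def in_window(seq, expected, window=WINDOW):
    """Accept sequence numbers within `window` bytes on either side of
    the next expected byte, computed modulo 2**32."""
    distance = (seq - expected) % 2**32
    return distance < window or distance > 2**32 - window

print(in_window(100_000, 100_000))       # True
print(in_window(95_000, 100_000))        # True  (within 8 KB behind)
print(in_window(50_000, 100_000))        # False (far outside the window)
print(in_window(4_000, 2**32 - 2_000))   # True  (wrapped past the top)
```

The last call shows why the modular form matters: byte 4,000 of the second lap through the sequence space is only 6,000 bytes ahead of a sender that has nearly exhausted the first lap.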

In addition, IP's built-in Time-To-Live mechanism also helps to keep older segments from showing up unexpectedly and wreaking havoc. If an IP datagram has a medium-sized Time-To-Live value, then the datagram may be destroyed before it ever reaches the destination. However, most TCP implementations set the Time-To-Live value at 60 seconds (a value recommended in RFC 1122). Since this value tends to be greater than the acknowledgment timer, it is quite possible that a sender will reissue a segment that has not been acknowledged, and that the old datagram will show up unexpectedly. Since the two segments would have the same sequence number, the recipient should be able to detect that they are duplicates, and simply discard the duplicate segment.

Another way to deal with this problem is to use the TCP Timestamp option to identify when a particular segment was sent. On extremely fast networks (such as those using Gigabit Ethernet), it takes only about 34 seconds to completely cycle through all four billion sequence numbers. Since the Time-To-Live value on most IP datagrams is larger than this, there is a high probability of an old datagram showing up with a recently used sequence number. RFC 1323 provides a solution to this problem, using the TCP Timestamp option as a secondary reference for each unique segment. When used together, these two mechanisms keep old segments from wreaking havoc when sequence numbers are being reused.

Since the Sequence Identifier field is a standard part of the TCP header, every segment that is sent must have a Sequence Identifier field, even if the segment doesn't contain any data (as is the case with acknowledgments). There's an obvious problem here: if a segment does not contain any data, what should it use for the Sequence Identifier? After all, the sequence number is supposed to refer to the first byte of data.

If a segment does not contain any data, then the next byte of data expected to be sent is used in the Sequence Identifier field of the TCP header. This sequence number would continue to be used until some data was actually sent, forcing the sequence number to be incremented.

Figure 7-14 illustrates how zero-length segments reuse sequence numbers. As the sender pumps data down to the recipient, the latter has to periodically acknowledge the data that it has received. These acknowledgments are sent as individual segments, with each segment having a Sequence Identifier in the TCP header. Since the client isn't sending any data, it is using the next byte it expects to send as the sequence number. This sequence number will be reused until data actually does get sent, at which point the client's sequence number will be incremented.

One drawback of this approach is that these acknowledgment segments are nonrecoverable if they get lost or become corrupted. Since each of these segments contains a duplicate sequence number, there's no way for the other end of the connection to uniquely identify them. The remote endpoint cannot ask that sequence number n be resent, because there are lots of segments with that sequence number. In addition, sequence number n has not yet been sent anyway, since the Sequence Identifier field is referring to the next byte of data expected to be sent.

However, this does not mean that the connection will collapse if a zero-length segment is lost. Since zero-length segments typically contain acknowledgments, if one of them is lost then the acknowledgment will be lost as well. But if the sender has sent more data beyond that segment, then the recipient will likely return an acknowledgment for a higher-numbered sequence anyway, obviating the need for that particular acknowledgment to be resent. If the sender is not sending any more data, then it will eventually notice the missing acknowledgment, and resend the questionable data. This action should result in the recipient re-issuing the lost acknowledgment, providing full recovery.

Figure 7-14.
Reusing sequence numbers with command segments

Some of the other zero-length command segments that get used—such as the command segments used to open and close virtual circuits—have their own special sequence number considerations. For example, start segments that use the Synchronize bit use a sequence number that is one lower than the sequence numbers used for data, while close segments that use the Finish or Reset bits use sequence numbers that are one greater than the sequence numbers used for data. By using sequence numbers outside the range of the sequence numbers used for data, these particular segments will not interfere with actual data delivery, and can be tracked individually if necessary.

For more information on the Synchronize, Finish, and Reset flags, refer to Control Flags later in this chapter.

Acknowledgment numbers

Acknowledgment numbers and sequence numbers are closely tied, working together to make sure that segments arrive at their destination.

Just as sequence numbers are used to identify the individual bytes of data being sent in a segment, acknowledgment numbers are used to verify that all of the data in that segment was successfully received. However, rather than pointing to the first byte of data in a segment that has just arrived, the acknowledgment number points to the next byte of data that a recipient expects to receive in the next segment.

This process is illustrated earlier in Figure 7-13. In that example, the sender transmits 100 bytes of data, using a sequence number of 1 to identify the first byte of data in the segment. The receiver returns an acknowledgment for the segment, indicating that it's ready to accept the next segment (starting at sequence number 101). Notice that the acknowledgment does not point to bytes 1 or 100, but instead points to 101, the next byte that the receiver expects to get.
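The arithmetic behind this is simply the starting sequence number plus the number of bytes received, wrapping at 2**32; a small sketch (with an illustrative function name):

```python
def acknowledgment_number(seq, data_length):
    """The next byte the receiver expects: the first byte's sequence number
    plus the bytes received, modulo 2**32 so the sequence space wraps."""
    return (seq + data_length) % 2**32

print(acknowledgment_number(1, 100))            # 101, as in Figure 7-13
print(acknowledgment_number(2**32 - 50, 100))   # 50, wrapping past the top
```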

In truth, acknowledgment numbers are closer in concept to a flow control throttle than to explicit acknowledgments. Rather than saying I got the data, they say I'm ready for more data, starting at byte number n.

This design is commonly referred to as being cumulatively implicit, indicating that all of the data up to (but not including) the acknowledgment number has been received successfully, rather than explicitly acknowledging that a particular byte has been received. Implicit acknowledgments work well when data is flowing smoothly, as a receiver can continually request more data. However, when things go bad, implicit acknowledgments are not very robust. If a segment gets lost or corrupted, then the recipient has no way of informing the sender of the specific problem. Instead, it must re-request the next expected byte of data, since that's all the cumulatively implicit acknowledgment scheme allows for.

Remember that the sliding window mechanism allows a sending system to transmit as many segments as can fit within the recipient's advertised receive buffer. If a system is advertising an eight-kilobyte window, and the sender is using one-kilobyte segments, then as many as eight segments may be issued and in transit at any given moment. If the first segment is lost or damaged, the recipient may still get the remaining seven segments. Furthermore, it should hold those other segments in memory until the missing data arrives, preventing the need for all of the other segments to get resent.

However, the recipient must put the segments back into their original order before passing the data up to the destination application. Therefore, it has to notify the sender of the missing segment before it can process the remaining seven segments.

Most network protocols use either negative acknowledgments or selective acknowledgments for this service. Using a negative acknowledgment, the recipient can send a message back to the sender stating segment n is missing, please resend. A selective acknowledgment can be used to notify the sender that bytes a through g and bytes s through z were received, please resend bytes h through r.

However, TCP does not use negative or selective acknowledgments by default. Instead, a recipient system has to implement recovery using the implicit acknowledgment mechanism, simply stating all bytes up to n have been received. When a segment is lost, the recipient has to resend the acknowledgment, thereby informing the sender that it is still waiting for a particular sequence number. The original sender then has to recognize the duplicate acknowledgment as a cry for help, stop transmitting new data, and resend the missing data.
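The sender's side of this recovery can be sketched as a duplicate-acknowledgment counter. The threshold of three duplicates comes from the fast retransmit rule discussed later in this section; the function name and the rest of the code are illustrative assumptions.

```python
DUP_ACK_THRESHOLD = 3           # duplicates needed to trigger a retransmit

def segments_to_retransmit(acks):
    """Scan a stream of acknowledgment numbers and report the sequence
    numbers that would be retransmitted early."""
    retransmits = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                retransmits.append(ack)     # resend data starting at this byte
        else:
            last_ack, dup_count = ack, 0
    return retransmits

# The segment carrying bytes 101-200 was lost; every later arrival causes
# the receiver to repeat its acknowledgment for byte 101.
print(segments_to_retransmit([101, 101, 101, 101]))   # [101]
```

A smoothly advancing stream of acknowledgments produces no retransmissions at all; only a repeated value signals the cry for help.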

Note that RFC 1106 introduced an experimental TCP option that allowed for the use of negative acknowledgments. However, the Negative Acknowledgment option was never widely used, and RFC 1106 has since expired. Therefore, the Negative Acknowledgment option should not be used with any production TCP implementations.

In addition, RFC 1072 introduced selective acknowledgments to TCP, by way of a set of TCP options. However, this work has since been clarified in RFC 2018. Using the selective acknowledgment options described therein, a TCP segment can precisely state the data it has received—and thus the data that's missing—even if those blocks of data are non-contiguous. In this model, a receiver uses the normal acknowledgment scheme to state that it is looking for sequence number n, and then supplements this information with the Selective Acknowledgment Data option, stating that it also has bytes y through z in the receive queue. The sender would then resend bytes n up to (but not including) byte y, filling the hole in the receiver's queue. For a more detailed discussion of Selective Acknowledgments, refer to Selective Acknowledgments Permitted and Selective Acknowledgment Data, both later in this chapter.

The cumulatively implicit acknowledgment scheme used by TCP is illustrated in Figure 7-15. In that example, each segment contains 100 bytes. The first segment is received successfully, so the recipient sends an implicit acknowledgment for the first 100 bytes back to the sender. The second segment, however, is lost in transit, so the recipient doesn't see (or acknowledge) it.

When the third segment arrives, the recipient recognizes that it is missing bytes 101 through 200, yet having no way to issue a negative acknowledgment, it repeats the previous implicit acknowledgment, indicating that it is still waiting for byte 101.

Figure 7-15.
Detecting data loss from multiple duplicate acknowledgments

What happens next depends on a variety of implementation issues. In the original specification, the sender could wait until an acknowledgment timer for sequence number 101 had expired before resending the segment. However, RFC 2581 states that if three duplicate acknowledgments are received for a segment—and if no other acknowledgments have been received for any subsequent segments—then the sender should assume that the segment was probably lost in transit. In this case, the sender should just retransmit the questionable segment rather than waiting for the acknowledgment timer for that segment to expire. This process is known as fast retransmit, which is documented in RFC 2581.

It is important to note that fast retransmit does not work when the data has been lost from the tail-end of the stream. Since no other segments would have been sent after the lost segment, there would not be any duplicate acknowledgments, and as such fast retransmit would never come into play. In those situations, the missing data will be discovered only when the acknowledgment timer for that segment expires.

Regardless of the retransmission strategy used, once the sender had resent the lost segment, it would have to decide whether it needed to resend all of the segments following the lost segment, or could simply resume sending from the point it left off when the missing segment was discovered. The most common mechanism used for this is called fast recovery, which is also described in RFC 2581. The fast recovery algorithm states that if data was retransmitted due to the presence of multiple duplicate acknowledgments, then the sender should just resume transmitting segments on the assumption that none of the subsequent segments were lost. If in fact there were other lost segments, then that information would be discovered in the acknowledgments for the retransmitted segment. This position assumes that multiple segments are not likely to have been lost, and is accurate most of the time.

Of course, this also depends on whether or not the recipient actually kept any other segments that may have been sent. Although RFC 1122 states that the recipient should keep the other segments in memory (thereby allowing for reassembly to occur locally rather than requiring a total retransmission), not all systems conform to this recommendation.

Another related issue is partial acknowledgment, whereby a recipient has lost multiple segments. When that happens, the sender may discover and resend the first lost segment through the use of the fast retransmission algorithm. However, rather than getting an acknowledgment back for all of the segments sent afterwards, an acknowledgment is returned that only points to some of the data sent afterwards. Although there aren't any standards-track RFCs that dictate how this situation should be handled, the prevailing logic is to have the sender retransmit the next-specified missing segment and then continue resending from where it left off.

This process illustrates the importance of the recipient's receive buffer, particularly as it pertains to the sender. Every time the recipient received another segment that was out of order, it would have to store this data into the receive buffer until the missing segment arrived. This in turn would take up space in the receive buffer, and so the recipient would have to advertise a smaller receive buffer every time it sent a duplicate acknowledgment for the missing segment. This would in turn cause the sender to slow down (as described earlier in Receive window size adjustments ), with the sender eventually being unable to send any additional segments. Once the recipient got the missing segment, it would reorder the segments and then pass the data off to the destination application. Then it could advertise a large receive buffer again, and the sender could resume sending data.

Depending on the size and condition of the receive buffer, the sender may be able to resume sending data from where it left off (in our example, sequence number 401), without waiting for an acknowledgment for the next segment. This really depends on the number of unacknowledged segments currently outstanding and the maximum amount of unacknowledged data allowed by the recipient.

For example, if the size of the receive buffer was 800 bytes (using 100-byte segments)—and if only two segments were currently unacknowledged—then once the sender had resent the missing data, it could go ahead and resume transmitting additional segments without waiting for an acknowledgment for those other segments. But if the receive buffer had been cut down to just two hundred bytes, then the sender could not send any more data until the two outstanding segments had been acknowledged.
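The arithmetic in this example can be spelled out directly (all numbers are the assumed values from the scenario above):

```python
# Illustrative arithmetic for the scenario above: the usable window is the
# advertised receive window minus the unacknowledged bytes in flight.
segment_size = 100
outstanding = 2 * segment_size     # two unacknowledged 100-byte segments

large_window = 800                 # full 800-byte receive buffer advertised
print(large_window - outstanding)  # 600 bytes may still be sent

small_window = 200                 # buffer cut down to 200 bytes
print(small_window - outstanding)  # 0 bytes -- the sender must wait
```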

For more details on how the receive buffer controls flow control in general, refer back to Flow Control. For more information on the selective acknowledgment option, refer ahead to Selective Acknowledgment Data.

Acknowledgment timers

Most of the time, sporadic packet loss is dealt with by using the fast retransmit and fast recovery algorithms, as defined in RFC 2581. However, those algorithms are not always usable. For example, if the link has failed completely, then multiple duplicate acknowledgments will not be received. Also, if the last segment from a transmission were the one that got lost, then there would not be any additional segments that would cause multiple duplicate acknowledgments to be generated. In these situations, TCP has to rely on an acknowledgment timer (also known as a retransmission timer) in order to detect when a segment has been lost in transit.

Whenever a sender transmits a segment, it has to wait for an acknowledgment to arrive before it can clear the segment from the outbound queue. On well-behaved networks, these acknowledgments come in quickly, allowing the sender to clear the data immediately, increment the sliding window, and move on to the next waiting segment. But if a segment is lost or corrupted and no acknowledgment ever arrives, then the sender has to rely on the timer to tell it when to resend unacknowledged data.

Determining the most efficient size for the acknowledgment timer is a complex process that must be handled carefully. Setting the timer too short would result in frequent and unnecessary retransmissions, while setting the timer too long would result in unproductive delays whenever loss actually occurred.

For example, the acknowledgment timer for two systems connected together on a high-speed LAN should be substantially shorter than the timer used for a slow connection over the open Internet. Using a short timer allows failure to be recognized quickly, which is desirable on a high-speed LAN where latency is not much of an issue. However, setting a long timer would be more practical when many slow networks were involved, as it would not be efficient to continually generate duplicate segments when the problem is slow delivery (rather than packet loss).

Most systems start with a default acknowledgment timer, and then adjust this timer on a per-circuit basis according to the round-trip delivery times encountered on that specific connection. However, even this approach can get complicated, because the default timer is likely to be inappropriate for many of the virtual circuits, since some of them will be used for slow, long-haul circuits while others will be used for local and fast connections.

For example, most modern systems use a default timer of 3000 milliseconds, which is really too large for local area networks that have a round-trip time less than 10 milliseconds (even though this is the recommended default in RFC 1122). Conversely, many earlier implementations had a default timer of 200 milliseconds, which is far too short for many dial-up and satellite links, resulting in frequent and totally unnecessary retransmissions.

Also, the round-trip delivery times of most networks change throughout the day, due to changes in network utilization, congestion, and routing updates that affect the path that segments take on the way to their destination. For these reasons, the default setting is only accurate some of the time, and must be modified to reflect the specific latency characteristics of each virtual circuit throughout the connection's lifetime.

The two algorithms used for determining round-trip delivery times are Van Jacobson's algorithm and Karn's algorithm. Van Jacobson's algorithm is useful for determining a smoothed round-trip time across a network, while Karn's algorithm offers techniques for adjusting the smoothed round-trip time whenever network congestion is detected. Although a detailed treatment of these two algorithms is outside the scope of this book, understanding their basic principles makes it much easier to see how they affect TCP acknowledgment timers.

The basis of Van Jacobson's algorithm is for a sender to watch the delay encountered by acknowledgment segments as they cross the network, constantly adjusting its timer variables according to the specific amount of time it takes to send a segment and then receive an acknowledgment for that segment.
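As a rough sketch of this idea (the constants are the commonly cited smoothing gains of 1/8 for the round-trip average and 1/4 for its variance; this is an illustration, not a faithful reproduction of any particular implementation):

```python
# Sketch of smoothed round-trip time bookkeeping in the spirit of
# Van Jacobson's algorithm. Constants and class name are illustrative.

class RttEstimator:
    def __init__(self, first_sample):
        # Seed the estimator from the first measured round-trip time (ms).
        self.srtt = first_sample
        self.rttvar = first_sample / 2

    def update(self, sample):
        # Exponentially weighted moving averages of the RTT and its deviation.
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample)
        self.srtt = 0.875 * self.srtt + 0.125 * sample

    def rto(self):
        # Retransmission timeout: smoothed RTT plus four times the deviation.
        return self.srtt + 4 * self.rttvar
```

Seeding the estimator with a 100 ms sample yields an initial timeout of 300 ms; a second identical sample shrinks the deviation term, pulling the timeout down toward the stable round-trip time.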

Although Van Jacobson's original algorithm used acknowledgments to determine round-trip times for specific segments, this model did not provide guaranteed accuracy, since an acknowledgment cannot always be matched to a specific transmission (a retransmitted segment shares the same sequence numbers as the original, so an acknowledgment could apply to either copy). In order to provide a more accurate monitoring tool, RFC 1072 introduced a pair of TCP options that could be used for measuring the round-trip time of any given circuit, called the Echo and Echo Reply options. However, this work was abandoned in favor of a generic Timestamp option, as defined in RFC 1323.

RFC 1323 uses two fields in a single Timestamp option, allowing both systems to monitor the precise round-trip delivery time of every segment that they send. Whenever a system needs to send a segment, it places the current time into the Timestamp Value field of the Timestamp option in the TCP header of the outgoing segment. When the remote system receives the segment, it copies that data into the Timestamp Echo Reply field of the response segment, and places its own timestamp into the Timestamp Value field. Upon receipt of the response, the original sender can compare the echoed timestamp to the current time, allowing it to determine the exact latency of the network. Since the field-swapping operation is repeated in both directions, the remote end can determine the same information. For more information on the Timestamp option, refer to Timestamp later in this chapter.

Karn's algorithm builds on the basic round-trip time formula, focusing on how to deal with packet loss and congestion. For example, Karn's algorithm suggests that it is best to ignore the round-trip times of segments that have been retransmitted (since the acknowledgment cannot reliably be matched to either the original or the retransmitted copy), in order to prevent one failure from unnecessarily skewing the smoothed round-trip time determined by Van Jacobson's algorithm. Karn's algorithm also suggests that the value of the acknowledgment timer should be doubled whenever questionable data has been retransmitted due to the acknowledgment timer expiring, in case the problem is a temporary link failure or congestion.

In this model, if the retransmitted segments also go unacknowledged, then the acknowledgment timer will be doubled yet again, with the process repeating until a system-specific maximum has been reached. This could be a maximum number of retransmissions, or a maximum timer value, or a combination of the two.
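The doubling behavior can be sketched as follows (the five-attempt limit and 60-second ceiling here are assumptions for illustration; as noted below, real implementations vary):

```python
# Hypothetical exponential-backoff schedule in the spirit of Karn's
# algorithm: each timer expiry doubles the retransmission timeout,
# up to a system-specific ceiling. All limits here are invented.

def backoff_schedule(initial_rto, max_attempts=5, ceiling=60_000):
    """Return the successive timeout values (in ms) a sender would use."""
    timeouts = []
    rto = initial_rto
    for _ in range(max_attempts):
        timeouts.append(min(rto, ceiling))
        rto *= 2  # double after every unanswered retransmission
    return timeouts

print(backoff_schedule(3000))  # [3000, 6000, 12000, 24000, 48000]
```

Starting from the common 3000 ms default, five unanswered retransmissions would stretch the timer from 3 seconds out to 48 seconds before the connection attempt is abandoned.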

Systems based on BSD typically limit the length of the retransmission timer to either five minutes or a maximum of twelve attempts, whichever comes first. Windows-based systems limit retransmissions to five attempts, with each retransmission doubling the acknowledgment timer. Other implementations do not double the retransmission timer, but instead use a percentage-based formula or a fixed table, hoping to recover faster than blind-doubling would allow. Regardless, remember that the value that is being incremented or doubled is based on the smoothed round-trip time for that connection, so the maximum acknowledgment timer value could be either quite large or quite small.

Some systems have shown problems in this area, failing to double the size of their retransmission timers whenever the timer expired. As such, these systems would send a retransmission, and then continue resending the data in short fixed intervals. Since these systems had low timers anyway (200 milliseconds was the default), a dial-up user connecting to this system would tend to get at least two or three retransmissions of the very first segment, until the round-trip smoothing started to kick in.

Also, some systems will cache the learned round-trip time for future use, allowing any subsequent connections to the same remote system (or network) to use the previously learned round-trip latency values. This feature allows the new connection to start with a default that should be appropriate for the specific endpoint system, instead of starting at the system default value (which is almost always wrong).

RFC 1122 mandates that both Van Jacobson's algorithm and Karn's algorithm be used in all TCP implementations so that acknowledgment timers converge on appropriate values quickly. Subsequent experimentation has shown that these algorithms do in fact help to improve overall throughput and performance, regardless of the networks in use.

However, there are also times when these algorithms can actually cause problems to occur, such as when Karn's algorithm results in an overly slow reaction to a sudden change in the network's characteristics. For example, if the round-trip time suddenly goes through the roof due to a change in the end-to-end network path, the acknowledgment timer on the sender will most likely get triggered before the data ever reaches the destination. When that happens, the sender will resend the unacknowledged data, the size of the retransmission timer will get doubled, and the acknowledgments for the questionable data may also be ignored (since retransmissions aren't supposed to interfere with the smoothed round-trip time). It will take several attempts before the smoothed round-trip time will get updated to reflect the true round-trip latency of the new network path.

Delayed acknowledgments

Figure 7-15 earlier in this chapter shows the receiver sending an acknowledgment every time it receives a segment from the sender. However, this is not necessarily an effective use of resources. For one thing, the receiver has to spend CPU cycles on calculating the acknowledgment, as does the sender when it gets the acknowledgment. Furthermore, the use of frequent acknowledgments also generates excessive amounts of network traffic, thereby consuming bandwidth that could otherwise be used by the sender to transmit data.

Rather than acknowledging every segment, it is better for the receiver to send acknowledgments on a periodic basis. A mechanism called Delayed Acknowledgment is used for this purpose, allowing multiple segments to be acknowledged simultaneously. Remember that acknowledgments are implicit, stating that all data up to n has been received. It is therefore possible for a recipient to acknowledge multiple segments at once by simply setting the Acknowledgment Identifier to a higher inclusive value, rather than sending multiple distinct acknowledgments. Not only does this consume fewer network resources, but it also requires less computational effort on the part of the two endpoints.

This concept is illustrated in Figure 7-16. In that example, the recipient only sends an acknowledgment after receiving two segments. This approach not only generates less traffic, but it also allows the sender to increment their sliding window by two segment sizes, thereby helping to keep traffic flowing smoothly.

RFC 1122 states that all TCP implementations should utilize the delayed acknowledgment algorithm. However, RFC 1122 also states that TCP implementations that do so must not delay an acknowledgment for more than 500 milliseconds (to keep the sender's acknowledgment timers from expiring).
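A toy model of these two rules (acknowledge every second segment, as shown in Figure 7-16, but never delay longer than the 500-millisecond cap) might look like the following sketch; the class and method names are invented, and the timer is simulated rather than real:

```python
# Illustrative receiver-side delayed-acknowledgment logic, following the
# RFC 1122 recommendations: ACK every second segment, and never hold an
# acknowledgment past the (simulated) 500 ms timer. Names are invented.

class DelayedAcker:
    def __init__(self):
        self.unacked_segments = 0
        self.acks_sent = 0

    def on_segment(self):
        self.unacked_segments += 1
        if self.unacked_segments >= 2:   # ACK every two segments received
            self._send_ack()

    def on_timer_expiry(self):           # the 500 ms cap from RFC 1122
        if self.unacked_segments:
            self._send_ack()

    def _send_ack(self):
        self.acks_sent += 1              # one cumulative acknowledgment
        self.unacked_segments = 0
```

Five incoming segments would produce two acknowledgments (after the second and fourth), with the dangling fifth segment acknowledged when the timer fires.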

RFC 1122 also states that an acknowledgment should be sent for every two full-sized segments received. However, this depends upon the ability of the recipient to clear the buffer quickly, and also depends upon the latency of the network in

[Figure 7-16. An overview of the delayed acknowledgment algorithm]

use. If it takes a long time for cumulative acknowledgments to reach the sender, this design could negatively impact the sender's ability to transmit more data. Instead of helping, this behavior would cause traffic to become bursty, with the sender transmitting lots of segments and then stopping to wait for an acknowledgment. Once the acknowledgment arrived, then the sender would send several more segments, and then stop again.

Furthermore, some applications (such as TELNET) are chatty by design, with the client and server both sending data to each other on a regular basis. With these applications, both systems would want to combine their acknowledgments with whatever data had to be sent, reducing the number of segments actively crossing the network at any given moment. Delayed acknowledgments are also helpful here, as the two systems can simply combine their acknowledgments with whatever data is being returned.

For example, assume that a TELNET client is sending keystrokes to the server, which must be echoed back to the client. This would generate lots of small segments from both systems, not only for the keystroke data, but also for the acknowledgments generated for each segment containing keystroke data, as well as for the segments containing the data being echoed back to the client.

By delaying the acknowledgment until the segment containing the echoed data is generated, the amount of network traffic can be reduced dramatically. Effectively, rather than sending an acknowledgment as soon as the client's keystroke data segment had been verified, the server would delay the acknowledgment for a little while. Then, if any data was being returned to the client (such as the echoed keystroke), the server would just set the Acknowledgment Identifier in that segment's TCP header, eliminating the need for a separate acknowledgment segment. When combined with the Nagle algorithm, delayed acknowledgments can really help to cut down the amount of network bandwidth being consumed by small segments.

Unfortunately, there are also some potential problems with this design, although they are typically only seen when used in conjunction with Path MTU Discovery. These problems occur whenever a system chooses to delay an acknowledgment until two full-sized segments have been received, but the system is receiving segments that are not fully sized. This can happen when two devices announce large MTUs using the MSS option, but then the sender determines that a smaller MTU is required (as detected with Path MTU Discovery). When this happens, the recipient will receive many segments, but will not return an acknowledgment until it has received enough data to fill two full-sized segments (as determined by the MSS option).

In this case, the sender will send as much data as it can (according to the current limitations defined by the congestion window and the local sliding window), and then stop transmitting until an acknowledgment for that data is received. However, the recipient will not return an acknowledgment until the 500-millisecond maximum for delayed acknowledgments has been reached, and then will send one acknowledgment for all of the segments that have been received. The sender will increment its sliding window and resume sending data, only to stop again a moment later, resulting in bursty traffic.

This scenario happens only when Path MTU Discovery detects a smaller MTU than the size announced by the MSS option, which should be a fairly rare occurrence, although it does happen often enough to be a problem. This is particularly problematic with sites that use Token Ring, FDDI, or some other technology that allows for large MTU sizes, with an intermediary network that allows for only 1500-byte MTU sizes. For a more detailed discussion of this problem, refer to Partially Filled Segments or Long Gaps Between Sends later in this chapter.

The TCP Header

TCP segments consist of header and body parts, just like IP datagrams. The body part contains whatever data was provided by the application that generated it, while the header contains the fields that tell the destination TCP software what to do with the data.

A TCP segment is made up of at least ten fields. Unlike the other core protocols, some TCP segments do not contain data. In addition, there are a variety of supplementary header fields that may show up as options in the header. The total size of the segment will vary according to the size of the data and any options that may be in use.

Table 7-2 lists all of the mandatory fields in a TCP header, along with their size (in bits) and some usage notes. For more detailed descriptions of these fields, refer to the individual sections throughout this chapter.

Table 7-2. The Fields in a TCP Segment
FieldBitsUsage Notes
Source Port16Identifies the 16-bit port number in use by the application that is sending the data.
Destination Port16Identifies the 16-bit target port number of the application that is to receive this data.
Sequence Identifier32Each byte of data sent across a virtual circuit is assigned a somewhat unique number. The Sequence Identifier field is used to identify the number associated with the first byte of data in this segment.
Acknowledgment Identifier32Each byte of data sent across a virtual circuit is assigned a somewhat unique number. The Acknowledgment Identifier field is used to identify the next byte of data that a recipient is expecting to receive.
Header Length4Specifies the size of the TCP header, including any options.
Reserved6Reserved. Must be zero.
Flags6Used to relay control information about the virtual circuit between the two endpoint systems.
Window16Identifies the amount of receive buffer space available on the system that sent this segment, in bytes. Used for flow control.
Checksum16Used to store a checksum of the entire TCP segment.
Urgent Pointer16Identifies the last byte of any urgent data that must be dealt with immediately.
Options (optional)variesAdditional special-handling options can also be defined using the options field. These options are the only thing that can cause a TCP header to exceed 20 bytes in length.
Padding (if required)variesA TCP segment's header length must be a multiple of 32 bits. If any options have been introduced to the header, the header must be padded so that it is divisible by 32 bits.
Data (optional)variesThe data portion of the TCP segment. Not all TCP segments have data, since some of them are used only to relay control information about the virtual circuit.
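As a rough sketch, the mandatory fields in Table 7-2 can be unpacked with a few lines of Python (the function name is invented, and the sample field values are borrowed from the capture samples discussed later in this chapter):

```python
# Minimal sketch of unpacking the mandatory 20-byte TCP header using the
# field layout from Table 7-2. Illustrative only; no options handling.
import struct

def parse_tcp_header(segment: bytes):
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack(
        '!HHIIHHHH', segment[:20])
    header_len = (offset_flags >> 12) * 4  # Header Length counts 32-bit words
    flags = offset_flags & 0x3F            # low six bits are the control flags
    return {'src': src_port, 'dst': dst_port, 'seq': seq, 'ack': ack,
            'header_len': header_len, 'flags': flags, 'window': window}

# A hand-built example segment: port 80 -> 1037, no options (length 5 words),
# Acknowledgment flag (0x10) set; checksum and urgent pointer left at zero.
hdr = struct.pack('!HHIIHHHH', 80, 1037, 138452, 119657,
                  (5 << 12) | 0x10, 8760, 0, 0)
print(parse_tcp_header(hdr)['header_len'])  # 20
```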

Notice that the TCP header does not provide any fields for source or destination IP address, or any other services that are not specifically related to TCP. This is because those services are provided by the IP header or by the application-specific protocols (and thus contained within the data portion of the segment).

As can be seen, the minimum size of a TCP header is 20 bytes. If any options are defined, then the header's size will increase (up to a maximum of 60 bytes). RFC 793 states that a header must be divisible by 32 bits, so if an option has been defined, but it only uses 16 bits, then another 16 bits must be added using the Padding field.
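The length and padding rules can be illustrated with a little assumed arithmetic (the helper name is invented):

```python
# Sketch of TCP header sizing: the Header Length field counts 32-bit
# words, so any options must be padded out to a 4-byte boundary.

def padded_header_bytes(option_bytes):
    """Minimum 20-byte header plus options, rounded up to 32-bit words."""
    total = 20 + option_bytes
    return (total + 3) // 4 * 4        # pad up to a multiple of 4 bytes

print(padded_header_bytes(0))   # 20 -> Header Length field value of 5
print(padded_header_bytes(2))   # 22 pads up to 24 -> field value of 6
print(padded_header_bytes(40))  # 60, the maximum (field value of 15)
```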

Figure 7-17 shows a TCP segment being sent from Arachnid (an HTTP 1.1 server) to Bacteria (an HTTP 1.1 client). This segment will be used for further discussion of the TCP header fields throughout the remainder of this chapter.

[Figure 7-17. A typical TCP segment]

Source Port

Identifies the application that generated the segment, as referenced by the 16-bit TCP port number in use by the application.

Size
Sixteen bits.

Notes
This field identifies the port number used by the application that created the data.

Capture Sample
In the capture shown in Figure 7-18, the Source Port field is set to hexadecimal 00 50, which is decimal 80 (the well-known port number for HTTP). From this information, we can tell that this segment is a reply, since HTTP servers only send data in response to a request.

[Figure 7-18. The Source Port field]

See Also
Destination Port

Application addressing with TCP ports

Destination Port

Identifies the application that this segment is for, as referenced by the 16-bit TCP port number in use by the application on the destination system.

Size
Sixteen bits.

Notes
This field identifies the port number used by the destination application.

Capture Sample
In the capture shown in Figure 7-19, the Destination Port field is set to hexadecimal 04 0d, which is decimal 1037. This is the port number that the HTTP client is using.

[Figure 7-19. The Destination Port Number field]

See Also
Source Port

Application addressing with TCP ports

Sequence Identifier

Identifies the first byte of data that is stored in the data portion of the segment.

Size
Thirty-two bits.

Notes
Every byte of data sent across a virtual circuit is given a somewhat unique sequence number. As the data is stored into segments for delivery, the sequence number of the first byte of data in the segment is placed into the Sequence Identifier field.

The primary purpose of the Sequence Identifier field is to allow the recipient to sort data into its proper order. This step is required since the underlying IP network is unreliable and may destroy, delay, or duplicate some IP packets, causing the TCP segments within them to arrive out of order (if they arrive at all). By numbering the bytes inside of the segment, the recipient is able to put the data back into its proper order, eliminate duplicates, and recognize missing blocks of data.
Sequence numbers are supposed to be seeded by the system clock of the sending system. Each byte of data sent over a connection is then given a unique number, starting from the seed value.
Since the Sequence Identifier field is only 32 bits long, there are only enough values for four gigabytes of data. With very large data transfers, some sequence numbers may get reused. Typically this is not a problem, however, as the recipient will accept only data that has been sent recently, using the Window field's value as a boundary constraint. In addition, some systems utilize the Timestamp option to further separate old segments from new ones.
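Because the field wraps at 2^32, implementations compare sequence numbers with modular arithmetic rather than plain integer comparison. A common form of that comparison can be sketched as follows (the function name is invented):

```python
# Sequence numbers wrap at 2**32, so ordering is decided with modular
# ("serial number") arithmetic rather than a plain less-than test.

def seq_lt(a, b):
    """True if sequence number a precedes b, allowing for 32-bit wraparound."""
    return a != b and ((b - a) & 0xFFFFFFFF) < 0x80000000

print(seq_lt(100, 200))                 # True
print(seq_lt(0xFFFFFFF0, 0x00000010))   # True -- b is just past the wrap
```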
Since TCP virtual circuits are full-duplex, each endpoint is capable of sending and receiving data simultaneously. As commands and data are passed between the applications in use on the two systems, the sequence numbers and acknowledgment numbers for each endpoint will increment according to the amount of data that each of them has sent.

Capture Sample
In the capture shown in Figure 7-20, the Sequence Identifier field shows the sequence number of the first byte of data in this segment as 138452.

See Also
Acknowledgment Identifier

Sequence numbers

Acknowledgment Identifier

Identifies the next byte of data that a system is expecting to receive from the remote endpoint.

Size
Thirty-two bits.

[Figure 7-20. The Sequence Identifier field]

Notes
Every byte of data sent across a virtual circuit is given a somewhat unique sequence number. As data is received, the recipient states that it is ready to receive the next segment, starting at the sequence number provided in the Acknowledgment Identifier field.

The Acknowledgment Identifier acts as an implicit, cumulative acknowledgment, saying that all data up to (but not including) this sequence number has been received. In truth, the Acknowledgment Identifier acts like a flow-control throttle rather than an acknowledgment, since it identifies the data it is ready to receive, rather than the data it has already received.
If a segment gets lost or corrupted, then the recipient will not receive it and will continue to publish the same acknowledgment number in each segment that it sends. In this way, a sender can recognize that data has been lost and recover from the error by resending the questionable data.
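A toy receiver sketch (names invented, not from any real stack) shows why a lost segment produces a run of identical acknowledgment numbers, and why the cumulative acknowledgment jumps forward once the gap is filled:

```python
# Sketch of how a receiver derives its cumulative Acknowledgment
# Identifier: it always advertises the next byte it expects, so a missing
# segment makes it repeat the same number (the duplicate acknowledgments
# that drive fast retransmit). Out-of-order data is buffered, per the
# RFC 1122 recommendation discussed earlier.

class Receiver:
    def __init__(self, isn):
        self.next_expected = isn
        self.out_of_order = {}           # seq -> length of buffered segments

    def on_segment(self, seq, length):
        if seq == self.next_expected:
            self.next_expected += length
            # Pull in any buffered segments that are now contiguous.
            while self.next_expected in self.out_of_order:
                self.next_expected += self.out_of_order.pop(self.next_expected)
        elif seq > self.next_expected:
            self.out_of_order[seq] = length
        return self.next_expected        # the Acknowledgment Identifier

r = Receiver(1)
print(r.on_segment(1, 100))    # 101
print(r.on_segment(201, 100))  # still 101 -- the segment at 101 was lost
print(r.on_segment(101, 100))  # 301 -- gap filled, cumulative ACK jumps
```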
Since TCP virtual circuits are full-duplex, each endpoint is capable of sending and receiving data simultaneously. As commands and data are passed between the applications in use on the two systems, the sequence numbers and acknowledgment numbers for each endpoint will increment according to the amount of data that each of them has sent.

Capture Sample
In the capture shown in Figure 7-21, the Acknowledgment Identifier field is set to 119657, which is the sequence number of the next byte of data that Arachnid expects to receive from Bacteria.

[Figure 7-21. The Acknowledgment Identifier field]

See Also
Sequence Identifier

Acknowledgment numbers

Header Length

Identifies the size of the TCP header, in 32-bit multiples.

Size
Four bits.

Notes
The primary purpose of this field is to inform the recipient where the data portion of the TCP segment starts. Due to space constraints, the value of this field uses 32-bit multiples. Thus, 20 bytes is the same as 160 bits, which would be shown here as 5 (5 × 32 bits = 160 bits = 20 bytes). Since each of the header's mandatory fields is fixed in size, the smallest this value can be is 5.

If all of the bits in this field were on, the maximum value would be 15. Thus, a TCP header can be no larger than 60 bytes (15 × 32 bits = 480 bits = 60 bytes).
Note that TCP does not define total length like UDP does, but rather only defines header length, like IP. In order to determine the amount of data contained in a segment, the destination system must calculate the entire length of the IP datagram (as described in Total Packet Length in Chapter 2, The Internet Protocol), and then subtract the size of the TCP header from that value. The resulting value provides both the number of bytes of data stored in this segment and the starting position for the data.
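A worked example of this calculation, using assumed sizes:

```python
# Worked example of the data-length calculation described above (all
# numbers are illustrative): TCP carries no length field of its own, so
# the data length falls out of the IP and TCP header lengths.

ip_total_length = 576     # from the IP header's Total Packet Length field
ip_header_length = 20     # a 20-byte IP header (no IP options)
tcp_header_length = 24    # Header Length field of 6 (6 x 4 bytes)

data_bytes = ip_total_length - ip_header_length - tcp_header_length
print(data_bytes)         # 532 bytes of application data in this segment
```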

Capture Sample
In the capture shown in Figure 7-22, the Header Length field shows the size of the TCP header as hexadecimal 6, which indicates that the TCP header is 24 bytes long. Although the default size is only 20 bytes, this segment contains some TCP options, which make it a little larger than normal.

[Figure 7-22. The Header Length field]

See Also
TCP Options

Padding

Reserved

These bits are currently unused, and must be set to zero.

Size
Six bits.

Capture Sample
In the capture shown in Figure 7-23, the Reserved field contains zeroes.

[Figure 7-23. The Reserved field]

Control Flags

Since TCP uses a point-to-point virtual circuit for all communications, it needs to be able to manage the flow of information across the virtual circuit at different times. The Control Flags are used to provide circuit-management services to the TCP endpoints.

Size
Six bits.

Notes
There are six different flags in the Control Flags field, with each flag being represented by an on or off condition.

Each of the Control Flags provides a different service to TCP. Some of the flags provide circuit-management services, while others provide data-management services. The flags and their meanings are listed in Table 7-3.
Table 7-3. The Control Flags and Their Meanings
Control FlagUsage
UrgentIf the Urgent flag is set, then this segment contains urgent data, up through the sequence number referenced by the Urgent Pointer field. If this flag is not set, then the Urgent Pointer field should be ignored.
AcknowledgmentIf the Acknowledgment flag is set, then this segment contains an acknowledgment. Every segment (except for the very first segment that was used to initialize the circuit, and the Reset segments that are used to abort connections) should have this flag set.
PushIf the Push flag is set, then this segment contains data that is being pushed by the sending application. Typically, the Push flag is used to indicate that all of the data has been transferred, and is conceptually similar to an end-of-record marker.
ResetThe Reset flag should be seen only when a virtual circuit must be aborted because it cannot be torn down in an orderly fashion (due to errors), or when an incoming connection request is for an invalid socket and therefore must be rejected.
SynchronizeWhen a virtual circuit is first established, the two endpoint TCP providers must synchronize their sequence numbers before they can issue meaningful acknowledgments. During the handshake process, each endpoint sends a segment with the Synchronize flag set, providing the sequence numbers that will be used for the virtual circuit. Both systems then have the information needed to proceed in an orderly fashion, and can begin exchanging data as needed.
FinishWhen all of the data has been exchanged between the two endpoint systems, the virtual circuit can be closed. This is achieved by one system issuing a segment with the Finish flag set. If the other end is ready to close the virtual circuit, it will return a segment with the Finish flag set as well. The connection is then terminated.
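The six flags occupy the low six bits of the 16-bit word that also holds the Header Length field. A small illustrative decoder (the bit values follow the RFC 793 layout; the function name is invented):

```python
# Illustrative bit positions for the six control flags, as laid out in
# RFC 793 (low bits first: FIN, SYN, RST, PSH, ACK, URG).
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def flag_names(flags):
    """Return the names of the control flags set in the given bitfield."""
    return [name for bit, name in ((URG, 'URG'), (ACK, 'ACK'), (PSH, 'PSH'),
                                   (RST, 'RST'), (SYN, 'SYN'), (FIN, 'FIN'))
            if flags & bit]

print(flag_names(0x12))  # ['ACK', 'SYN'] -- a SYN/ACK, as in Figure 7-24
```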

The Urgent and Push flags are both used by applications to identify certain characteristics of the data being exchanged between the two endpoints:

Urgent is used in conjunction with the Urgent Pointer field to identify that the data contained within the segment (up through the sequence number published in the Urgent Pointer field) needs to be processed immediately. All segments that arrive with the Urgent flag set on are to be treated as top-priority. Segments that have sequence numbers in the Urgent Pointer field but that do not have the Urgent flag set on should be treated as normal.

As written in RFC 793, the Push flag is supposed to be used by applications whenever they wish to have the send buffer flushed immediately.

This may be necessary if the application needs to send only a little bit of data, which would normally require the sending TCP software to wait for additional data before generating a segment. However, many TCP providers do not provide applications with access to the Push flag directly, and will instead set the Push flag whenever a write operation is completed. Some systems will even set the Push flag on every segment that they generate, regardless of any other factors.

For more information on how the Urgent and Push flags work with applications, refer back to Data considerations, earlier in this chapter.

The Acknowledgment, Reset, Synchronize, and Finish flags are used by the TCP endpoints directly, allowing them to control certain characteristics of the virtual circuit:

The Synchronize flag is used whenever a virtual circuit needs to be established. When an application issues an open request to the local TCP provider, TCP will issue a segment with the Synchronize flag on. The segment will contain all of the normal headers (including a Sequence Identifier field that identifies the starting sequence number that will be used by the local TCP provider for that virtual circuit). The recipient will then respond with a TCP segment of its own, which will also have the Synchronize bit on, with its local sequence number in the Sequence Identifier field.

The Acknowledgment flag is used with every segment that is sent across a virtual circuit, except for the very first segment and some Reset segments. Since the very first segment is a connection request, there cannot have been any data received that could be acknowledged. However, the response to the connection request—and every other segment sent thereafter—will be a response of some sort, and therefore must have the Acknowledgment flag set. The other exception to this rule is with Resets, which may be issued without any accompanying acknowledgment data.

Once all of the data has been exchanged between the two endpoint systems, one system may wish to close the connection. This is achieved through the use of the Finish flag, which is pretty much the opposite of the Synchronize flag. When an application issues a close request to the local TCP provider, the latter will issue a segment with the Finish flag set. If the recipient is willing to close the connection, then it will respond with a segment that will also have the Finish flag set. The virtual circuit will then be torn down.

If the virtual circuit must be terminated abruptly—possibly due to a loss of communication with the other endpoint system—or if an incoming connection request does not map to an active listener, then the terminating system will issue a segment with the Reset flag set. This acts as an informational notice that the circuit is being terminated, rather than acting as an agreement to terminate like the Finish flag.

For more information on the circuit-management flags, refer in this chapter to the sections entitled Opening a circuit, Closing a circuit, and TCP in Action.

Capture Sample
In the capture shown in Figure 7-24, the Acknowledgment and Synchronize flags are both enabled, indicating that this segment is a part of the circuit-setup exchange.

0339-01.gif
Figure 7-24.
The control flags

See Also
Data considerations

Opening a circuit
Closing a circuit
TCP in Action

Window

Identifies the amount of receive buffer space available on the sending system, in bytes.

Size
Sixteen bits.

Notes
A key part of TCP's flow control service involves having each system keep track of the other's available receive buffer space. By advertising a small receive buffer, one end of the virtual circuit can effectively force the other end to stop sending data temporarily, while increasing the size of the buffer allows the other end to send larger amounts of data without having to wait for acknowledgments. The size of the receive buffer is advertised in the Window field of every TCP segment's header, allowing for the constant exchange of buffer status messages.

Since this field is limited to 16 bits, the maximum amount of buffer space that can be advertised is 64 kilobytes. This amount has proven to be too small for many networks, particularly high-bandwidth, high-delay paths such as satellite links, which can easily have more than 64 kilobytes of data in flight at any given moment. To get around this limitation, the TCP Window Scale option has been defined in RFC 1323, allowing up to 30 bits to be used in the advertisement, letting systems advertise buffers as large as one gigabyte.
Since TCP virtual circuits are full-duplex, each endpoint is capable of sending and receiving data simultaneously. As commands and data are passed between the applications in use on the two systems, each system will have to adjust the amount of space being advertised. Since the Window field is a mandatory part of the TCP header, this process is easily achieved.

Capture Sample
In the capture shown in Figure 7-25, the Window field is set to hexadecimal C1 E8, which is decimal 49,640. On an Ethernet network with a Maximum Segment Size of 1460 bytes, that value would allow 34 segments to be in-flight at any given time.
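The arithmetic from the capture can be checked with a short sketch (the helper function is illustrative, not part of any TCP implementation):

```python
def segments_in_flight(window_bytes, mss):
    """Number of full MSS-sized segments an advertised window permits."""
    return window_bytes // mss

window = 0xC1E8  # Window field from the Figure 7-25 capture: 49,640 bytes
mss = 1460       # typical Maximum Segment Size on Ethernet

print(window)                           # 49640
print(segments_in_flight(window, mss))  # 34
```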

See Also
Buffer size considerations

Notes on Determining the Optimal Receive Window Size
Partially Filled Segments or Long Gaps Between Sends
Window Scale
0341-01.gif
Figure 7-25.
The Window field

Checksum

Used to store a checksum of the TCP segment, including both the header and body parts. The checksum allows the destination system to validate the contents of the TCP segment and to test for possible corruption.

Size
Sixteen bits.

Notes
Unlike UDP, the TCP checksum is mandatory. The sending system must calculate a checksum, and the receiving system must validate the contents of the TCP segment using the checksum.

If the checksum is deemed to be invalid, then the segment is discarded before it is processed. Discarding segments is a silent error, and no notification of the event is generated or sent. Therefore, the sender will not be made aware of the failure until the next acknowledgment is generated by the recipient, informing the sender that the recipient is still waiting for the next byte.
When the TCP checksum is calculated, a pseudo-header is included in the calculations. The pseudo-header includes the source and destination IP addresses, as well as the Protocol Identifier (6 for TCP) and the size of the TCP segment (including both the header and the data). The pseudo-header is combined with the TCP header and data, and a checksum is calculated using
ones-complement arithmetic. By including the pseudo-header in the calculations, the destination system is able to validate that the sending and receiving hosts are correct, in case the IP datagram that delivered the TCP segment got mixed up en route to the final destination.

Capture Sample
In the capture shown in Figure 7-26, the Checksum field is calculated as hexadecimal 42 08, which is correct.

0342-01.gif
Figure 7-26.
The Checksum field

See Also
Reliability

Urgent Pointer

Identifies the sequence number of the last byte of any urgent data that may be in this segment.

Size
Sixteen bits.

Notes
TCP offers the ability to flag certain bytes of data as urgent. This feature allows an application to process and forward any data that must be dealt with immediately, without the data having to sit in the send queue for processing. Instead, the data is packaged into a segment, the Urgent flag is set in the TCP header, and a byte offset marking the end of urgent data is specified in the Urgent Pointer field.
It is important to note that the Urgent Pointer does not use a sequence number to specify the end of urgent data, but instead uses an offset into the current stream, indicating the location where urgent data ends. A recipient of a TCP segment with the Urgent flag enabled must add the value provided in the Urgent Pointer to the Sequence Number field of the current segment, and use the resulting value to determine the ending sequence number. This means that the Urgent Pointer offset can refer to a byte location in another TCP segment, allowing urgent data to span multiple segments if needed.
This mechanism reflects the specific design mandated in RFC 1122. Unfortunately, this design was not always clear, and many TCP implementations take a different interpretation. Most notably, many BSD-derived systems treat the Urgent Pointer as pointing to the byte following the last byte of urgent data, rather than to the last byte itself. Other systems have other problems, and many implementations simply do not support urgent data at all.
If urgent data is supported correctly on your system, then any segment that arrives with the Urgent flag enabled should be treated as containing urgent data. Conversely, any segment that arrives with a value in the Urgent Pointer field—but with the Urgent flag off—should be treated as normal data.
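The offset arithmetic described above amounts to a single addition, modulo the 32-bit sequence space (the function name is ours, for illustration):

```python
def urgent_end(sequence_number, urgent_pointer):
    """Sequence number of the last byte of urgent data (RFC 1122 reading)."""
    return (sequence_number + urgent_pointer) % 2**32

print(urgent_end(1000, 150))     # 1150, which may fall in a later segment
print(urgent_end(2**32 - 1, 2))  # 1: the sequence space wraps around
```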

Capture Sample
In the capture shown in Figure 7-27, the Urgent Pointer field is set to 0, indicating that the first byte of data in this segment would be urgent, if the Urgent Flag were enabled (which it is not).

See Also
Control Flags

Data considerations

TCP Options

The TCP header provides everything needed for two endpoint systems to establish a connection, exchange data, and tear down the virtual circuit. However, some additional functionality beyond what is provided for in the TCP header is needed at times. When this is the case, TCP options must be used.

Size
Varies as needed. The default is zero bits, while the maximum is 40 bytes (the four-bit Header Length field can describe at most 60 bytes of header, 20 of which are consumed by the mandatory fields).

0344-01.gif
Figure 7-27.
The Urgent Pointer field

Notes
By default, no options are defined within a TCP header, meaning that this field does not exist. The TCP header can have as many options as will fit within the space available (up to 40 bytes), if any are required.

Each option has unique characteristics. For more information on the various options and their ramifications, refer to Notes on TCP Options later in this chapter.

Capture Sample

In the capture shown in Figure 7-28, the options field contains the Maximum Segment Size option, which is normal for circuit-setup segments (this option is explained in detail in Maximum Segment Size later in this chapter).

See Also
Header Length

Padding
Notes on TCP Options

Padding

Used to make a TCP segment end on a 32-bit boundary.

Size
Varies as needed.

0345-01.gif
Figure 7-28.
The options field

Notes
The length of a TCP header must be divisible by 32 bits in order for its length to fit within the Header Length field. Most TCP headers are 160 bits long, since that's the size of a normal header when all of the mandatory fields are used. However, if any options have been defined, then the TCP header may need to be padded in order to make it divisible by 32 again.

See Also
Header Length

TCP Options

Notes on TCP Options

TCP options are conceptually similar to IP options, although their usages are quite different. For one thing, IP options are mostly used to define special-handling services for IP datagrams being sent across the Internet, while TCP options are mostly used to extend TCP's native circuit-management services.

RFC 793 originally defined three option types, only one of which is required (the Maximum Segment Size option, used by the two endpoint systems during circuit setup to exchange information about their local MTU/MRU sizes). Over the years however, the list of options has grown dramatically.

RFC 1072 introduced the Window Scale, Selective Acknowledgment and Echo/Echo Reply options. However, the Window Scale option was redefined in RFC 1323, and the Echo/Echo Reply options were replaced with a single Timestamp option in RFC 1323 as well. Selective Acknowledgments were also redefined, although not until RFC 2018.

In addition, RFC 1106 defined an alternative to the Window Scale option called the Big Windows option, and also defined a Negative Acknowledgment option as a possible enhancement to TCP's native cumulative, implicit acknowledgment scheme. However, RFC 1106 was never ratified, nor was it implemented on many systems.

Table 7-4 lists the current options that are commonly used and some notes on their usage. For a detailed listing of all of the TCP options that are currently registered, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/tcp-parameters).

Table 7-4. The Current Most-Used Options
Type | Name | Description
0 | End of Option List | Used to mark the end of all the options.
1 | No Operation | Used for internal padding when multiple options are present, or when an option needs to start on a 32-bit boundary.
2 | Maximum Segment Size | Used to exchange MRU sizes during the circuit-setup handshake.
3 | Window Scale | Allows TCP to use and publish window sizes larger than the 64-kilobyte maximum allowed by the Window field of the TCP header.
4 | Selective Acknowledgment Permitted | Used to allow the use of selective acknowledgments on a virtual circuit.
5 | Selective Acknowledgment Data | The selective acknowledgment itself; allowed only if the Selective Acknowledgment Permitted option (option-type 4) was exchanged during circuit setup.
8 | Timestamp | Used to determine the round-trip delivery time for a segment, allowing the two endpoints to determine reasonable acknowledgment timers.

Except for the single-byte End of Option List and No Operation options, each option is specified using three fields: an eight-bit field for the option's type, an eight-bit field for the option's length, and a variable-length field for the option's data. The option-type field identifies the specific option in use, while the option-length field indicates the amount of space used by the entire option (including the option-type, option-length, and option-data fields). Since each option contains different types of information, the option-length field is required in order for the TCP software to determine where the option-data field ends (and where the next option-type field begins).
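The walk through the options field can be sketched as below. This is an illustrative parser; real stacks validate the length fields far more defensively:

```python
END_OF_OPTIONS, NO_OPERATION = 0, 1  # the two single-byte option types

def parse_options(options):
    """Return (option-type, option-data) pairs from a TCP options field."""
    parsed = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == END_OF_OPTIONS:   # marks the end of all options
            break
        if kind == NO_OPERATION:     # internal padding, no length field
            i += 1
            continue
        length = options[i + 1]      # covers the type, length, and data fields
        parsed.append((kind, options[i + 2:i + length]))
        i += length
    return parsed

# A Maximum Segment Size option (type 2, length 4, value 1460), followed
# by a No Operation pad and an End of Option List marker:
sample = bytes([2, 4, 5, 180, 1, 0])
kind, value = parse_options(sample)[0]
print(kind, int.from_bytes(value, "big"))  # 2 1460
```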

Figure 7-29 shows a TCP circuit-setup message between Bacteria and Arachnid, with Bacteria providing several different TCP options. This capture will be used to further discuss some of the more common TCP options throughout the remainder of this section.

0347-01.gif
Figure 7-29.
A TCP segment with some common TCP options

End of Option List

Used to mark the end of all the options in a TCP header.

Type
0

Size
Eight bits.

Defined In
RFC 793.

Status
Standard.

Notes
This option comes at the end of all the options, and not after every option. The End of Option List option does not have option-length or option-data fields. It is simply a one-byte option used to mark the end of all the options in a TCP header.

If this option does not happen to end on a 32-bit boundary, then the TCP header will need to be padded.
As you may have noticed from Figure 7-29, this option is not required.

No Operation

Used to internally pad the options field if multiple options are provided.

Type
1

Size
Eight bits.

Defined In
RFC 793.

Status
Standard.

Notes
If a TCP header has multiple options defined, then sometimes it makes sense to have them start on an even 32-bit boundary. When this is the case, multiple No Operation options can be chained together, filling out the space in between the other (real) options.

The No Operation option does not have option-length or option-data fields. It is simply a one-byte option used to internally pad the TCP header.

Capture Sample
In the capture shown in Figure 7-30, the No Operation option is used to internally pad the options field.

Maximum Segment Size

Used by both endpoints to publish their local MTU or MRU values during the circuit-setup process.

Type
2

0349-01.gif
Figure 7-30.
The No Operation option

Size
Thirty-two bits total (sixteen bits are used for the option-data field).

Option Fields
Table 7-5 lists the format of the Maximum Segment Size option.

Table 7-5. The Format of the Maximum Segment Size Option
Field | Size (Bytes) | Purpose
Option Type | 1 | Identifies this option as the Maximum Segment Size option
Option Size | 1 | Identifies the total length of the option (including all of the fields and data)
MRU | 2 | Used to publish the MRU (or, more often, the MTU) of the local network

Defined In
RFC 793.

Status
Standard.

Notes
Before two endpoints can begin exchanging data, they must first understand the segment sizing restrictions that will be imposed upon them by the other end of the connection. Part of this process involves discovering the remote system's Maximum Transfer Unit or Maximum Receive Unit size. The Maximum Segment Size option provides this service, allowing the two endpoints to publish their local MTU/MRU sizes during the synchronization process.

When an application issues an open request to TCP, TCP will issue a startup segment with the Synchronize flag set, and with the local network's MRU or MTU information placed inside of the Maximum Segment Size option data field. The remote system will respond with a segment that also has the Synchronize flag set, and with its MRU or MTU size stored in the Maximum Segment Size option data field of the response segment.
The value advertised in the option-data field for the Maximum Segment Size option is typically the local system's MTU minus 40 bytes (20 bytes each for the TCP and IP headers). Although RFC 793 clearly states that the MSS option communicates the maximum receive segment size, RFC 1122 changed this to be the largest size that can be reassembled. In practice, most implementations simply derive this value from the local MTU.
Also, note that some systems do not take into consideration any extra IP options or TCP options that would reduce the maximum segment size being advertised, and only subtract 40 bytes from the MTU when sending this option. This setup may cause problems down the road, depending on the options that are being used on that circuit. For example, if the first segment has only this option defined, then none of the remaining segments will have any extra TCP options (so TCP does not need to leave room for any additional TCP options). However, if the systems agree to use the TCP Timestamp option and the Selective Acknowledgments option, then it is very likely that some segments will have those options, so the advertised value should leave room for the extra space. Otherwise, later segments may end up being larger than the maximum size advertised, and thus will be fragmented.
The Maximum Segment Size option must not appear in any segment other than the first two (the ones that contain the Synchronization flag), and must be ignored if it shows up anywhere else.
If a system does not specify a value with the Maximum Segment Size option, then RFC 1122 states that a default value of 536 (IP's default of 576 bytes, minus 40 bytes for the IP and TCP headers) should be used.
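The arithmetic above is simple enough to sketch (the helper function is illustrative, not part of any TCP implementation):

```python
IP_HEADER = TCP_HEADER = 20  # bytes, assuming no IP or TCP options

def advertised_mss(mtu=None):
    """MSS to advertise: the local MTU minus 40, or RFC 1122's 536 default."""
    if mtu is None:
        return 576 - IP_HEADER - TCP_HEADER  # IP's 576-byte default datagram
    return mtu - IP_HEADER - TCP_HEADER

print(advertised_mss(1500))  # 1460 on Ethernet
print(advertised_mss())      # 536 when no MSS option is received
```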

Capture Sample
In the capture shown in Figure 7-31, the Maximum Segment Size option shows 1460 bytes, which is the MTU of the local Ethernet network minus 40 bytes for the TCP and IP headers. Although the TCP header in this example is more than 20 bytes, the normal TCP header size (for the duration of the session) will only be 20 bytes long, so that is the value set aside by the Maximum Segment Size option.

0351-01.gif
Figure 7-31.
The Maximum Segment Size option

See Also
MTU and MRU size considerations

Window Scale

Used to publish receive buffer sizes that are larger than the 64 kilobytes allowed by the Window field of the TCP header.

Type
3

Size
Twenty-four bits total (eight bits for the option-data field).

Option Fields
Table 7-6 lists the format of the Window Scale option.

Table 7-6. The Format of the Window Scale Option
Field | Size (Bytes) | Purpose
Option Type | 1 | Identifies this option as the Window Scale option
Option Size | 1 | Identifies the total length of the option (including all of the fields and data)
Scale | 1 | Identifies the scale factor of the Window being advertised

Defined In
RFC 1323.

Status
Proposed standard.

Notes
Before two endpoints can begin exchanging data, they must first understand the buffer size restrictions that will be imposed upon them by the other end of the connection. Systems can only transmit as much data as the recipient can handle, and must wait for data to be acknowledged before attempting to send any more data. This is a key component of TCP's flow control and reliability services.

Although every TCP header contains a Window field that is used to advertise the current status of the sending system's receive buffers, this field is only 16 bits long. Thus, the maximum receive buffer size that can be advertised in the Window field is 64 kilobytes, which isn't large enough for many applications.
The Window Scale option provides a way around this hard limit, allowing the value advertised in the Window field to be scaled up to a maximum of one gigabyte. The Window Scale option does this by publishing the number of bit positions that the binary value advertised in the Window field should be shifted to the left.
If the Window field were advertising 16 bits in the on position (which would normally be interpreted as 64 kilobytes of buffer space), and the Window Scale option's option-data field showed a value of 4, then the 16 original bits would be shifted to the left by four places, becoming the 16 most-significant bits of a new 20-bit window value (now equal to one megabyte).
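The shift described above can be verified with a short sketch; the 14-bit cap on the shift count comes from RFC 1323, while the function name is ours:

```python
def scaled_window(window_field, scale):
    """Effective window: the 16-bit field shifted left by the scale factor."""
    return window_field << scale

print(scaled_window(0xFFFF, 0))   # 65535: option understood, but no scaling
print(scaled_window(0xFFFF, 4))   # 1048560, roughly one megabyte
print(scaled_window(0xFFFF, 14))  # 1073725440, just under one gigabyte
```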
Both endpoints on a virtual circuit must agree to use the Window Scale option in order for it to be valid. If both systems do not agree to use the Window
Scale option, then both systems should assume that the other endpoint does not understand the option, and the Window field must be interpreted literally.
In order for both systems to use the Window Scale option, they must pass the option during the circuit-setup negotiation. The Window Scale option should not appear in any segment other than the first two segments (those containing the Synchronization flag), and should be ignored if it shows up anywhere else.
Note that the Window Scale option was originally defined in RFC 1072, with clarifications being made in RFC 1323. An alternative Big Windows option was defined in RFC 1106, although this option was experimental, and RFC 1106 has since expired. All production TCP implementations should use the Window Scale option as defined in RFC 1323, and none of them should use the Big Windows option defined in RFC 1106.

Capture Sample
In the capture shown in Figure 7-32, the Window Scale option shows a shift count of 0, indicating that this system understands the Window Scale option but is not currently scaling its own advertised window.

0353-01.gif
Figure 7-32.
The Window Scale option

See Also
Buffer size considerations

Receive window size adjustments

Selective Acknowledgments Permitted

Indicates that selective acknowledgments are allowed on this virtual circuit.

Type
4

Size
Sixteen bits (the Selective Acknowledgments Permitted option does not use an option-data field).

Defined In
RFC 2018.

Status
Proposed standard.

Notes
TCP uses cumulative, implicit acknowledgments by default, whereby a recipient indicates that it has received all data up to (but not including) the sequence number represented in the Acknowledgment Identifier field. Although the use of cumulative, implicit acknowledgments works well when the flow of data is uninterrupted, it is a clumsy mechanism to use when data has not been received. Rather than being able to request that a particular segment be resent, the receiver must instead continue to use the last byte of consecutive data received in its acknowledgments. This may force the sender to retransmit all of the data from that point on, even though some of the data may have arrived safely.

The Selective Acknowledgment Data option (described later in this chapter) allows recipients to issue acknowledgments for specific ranges of data. Any missing segments can then be resent as discrete entities. However, before the Selective Acknowledgment Data option can be used, both systems must agree to support and use it.
This is achieved through the use of the Selective Acknowledgments Permitted option. During the initial handshake period, both systems must place the Selective Acknowledgments Permitted option in the TCP headers of the segments that contain the Synchronize flags. The Selective Acknowledgments Permitted option should not appear in any segment other than the first two segments containing the Synchronization flag, and should be ignored if it shows up anywhere else.
The Selective Acknowledgments Permitted option does not contain an option-data field. It contains only the option-type field (storing the option-type of 4) and an option-length field (storing the total length of two bytes).

Capture Sample
In the capture shown in Figure 7-33, the Selective Acknowledgments Permitted option is being passed, indicating that this system understands how to deal with Selective Acknowledgments, should they be required.

0355-01.gif
Figure 7-33.
The Selective Acknowledgments Permitted option

See Also
Selective Acknowledgment Data

Acknowledgment Identifier
Acknowledgment numbers

Selective Acknowledgment Data

Used to report noncontiguous blocks of data received by a system.

Type
5

Size
Varies according to the number of blocks being acknowledged, but at least 10 bytes (two bytes for the option-type and option-length fields, plus eight bytes for the two sequence numbers that bound a single block of data). The size of this field can expand to fill all available space if multiple blocks of data are reported simultaneously.

Option Fields
Table 7-7 lists the format of the Selective Acknowledgment Data option.

Table 7-7. The Format of the Selective Acknowledgment Data Option
Field | Size (Bytes) | Purpose
Option Type | 1 | Identifies this option as the Selective Acknowledgment Data option.
Option Size | 1 | Identifies the total length of the option (including all of the fields and data).
Start of Block | 4 | Marks the beginning of a higher, noncontiguous block of data that the sender has received successfully.
End of Block | 4 | Marks the end of a higher, noncontiguous block of data that the sender has received successfully.

Defined In
RFC 2018.

Status
Proposed standard.

Notes
TCP uses cumulative, implicit acknowledgments by default, whereby a recipient indicates that it has received all data up to (but not including) the sequence number represented in the Acknowledgment Identifier field. Although the use of cumulative, implicit acknowledgments works well when the flow of data is uninterrupted, it is a clumsy mechanism to use when data has not been received. Rather than being able to request that a particular segment be resent, the receiver must instead continue to use the last byte of consecutive data received in its acknowledgments. This may force the sender to retransmit all of the data from that point on, even though some of the data may have arrived safely.

The Selective Acknowledgment Data option allows for the use of selective acknowledgments, thereby allowing recipients to issue acknowledgments for specific ranges of data. Any missing data can then be resent. Selective acknowledgments work by allowing a recipient to specify any blocks of data that have been received which are higher than the sequence number referenced in the Acknowledgment Identifier field, and which are stored in the
recipient's receive buffer. A sending system could then resend the missing blocks of data as discrete segments.
Each block of non-contiguous data that the recipient has received is specified with two fields in the Selective Acknowledgment Data option. The start-of-block field points to the sequence number of the first byte of non-contiguous data received, and the end-of-block field points to the sequence number of the last byte of non-contiguous data received. These fields can be repeated as many times as necessary, up to the maximum amount of free space available in the TCP header.
The Selective Acknowledgment Data option includes normal acknowledgment data that is combined with the Selective Acknowledgment Data option to identify the exact problem. If the recipient has received segments containing the byte ranges of one through 100 and 201 through 300, but is missing bytes 101 through 200, then a normal acknowledgment will be generated with the Acknowledgment Identifier pointing to sequence number 101. However, the TCP header of that acknowledgment segment would also contain a Selective Acknowledgment Data option that pointed to the byte range of 201 through 300, since those segments had been received. The original data sender would then recognize that sequence numbers 101 through 200 were missing, and retransmit just that data.
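The worked example above can be sketched as follows. The function is an illustration of the sender's bookkeeping, not an implementation taken from any particular stack:

```python
def missing_ranges(ack, sack_blocks):
    """Byte ranges not covered by the cumulative ack or any SACK block."""
    gaps = []
    next_expected = ack
    for start, end in sorted(sack_blocks):
        if start > next_expected:
            gaps.append((next_expected, start - 1))
        next_expected = max(next_expected, end + 1)
    return gaps

# The receiver holds bytes 1-100 and 201-300; its ack points at 101, and
# its Selective Acknowledgment Data option reports the block 201-300:
print(missing_ranges(101, [(201, 300)]))  # [(101, 200)]
```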
The Selective Acknowledgment Data option can appear in the header of any TCP segment that is acknowledging data, but may only be used on virtual circuits where both endpoints have agreed to use selective acknowledgments by way of the Selective Acknowledgments Permitted option.
Although the use of selective acknowledgments is still quite rare, it is becoming more common.

See Also
Selective Acknowledgments Permitted

Acknowledgment Identifier
Acknowledgment numbers

Timestamp

The Timestamp option allows each endpoint on a virtual circuit to continually measure the latency between itself and the other end. Since the two endpoints may experience different levels of delay, both systems have to test the network independently, which the two fields of this single option make possible.

Type
8

Size
Ten bytes total (including two four-byte fields for timestamp data).

Option Fields
Table 7-8 lists the format of the Timestamp option.

Table 7-8. The Format of the Timestamp Option
Field | Size (Bytes) | Purpose
Option Type | 1 | Identifies this option as the Timestamp option.
Option Size | 1 | Identifies the total length of the option (including all of the fields and data).
Timestamp Value | 4 | Used by the sender of this particular segment to place a timestamp into, for comparison upon receipt.
Timestamp Echo Reply | 4 | Holds the original value of the Timestamp Value field, prior to the Timestamp Value field being overwritten.

Defined In
RFC 1323.

Status
Proposed standard.

Notes
Before two endpoints can begin exchanging data, they must first understand the characteristics of the underlying network, particularly in terms of how much latency and delay a particular virtual circuit is experiencing. This is required, since understanding the round-trip delivery time is necessary in order to establish an appropriate threshold for the acknowledgment timers. Setting a low acknowledgment timer threshold on a slow network would result in an excessive amount of retransmissions, while setting a high acknowledgment timer threshold on a fast network would result in errors going unnoticed.

The Timestamp option provides a mechanism by which the round-trip delivery times for a specific virtual circuit can be detected. Whenever data is being sent, one of the two TCP endpoints places a timestamp into the Timestamp Value field. The receiver then responds by placing its own timestamp into the Timestamp Value field, and moving the original data into the Timestamp Echo Reply field. This process is repeated continually, with each system acknowledging the timestamp they have most-recently received. By continually repeating the process, both endpoints can constantly monitor the network for
changing latency, thereby constantly updating their acknowledgment timer thresholds appropriately.
Since this model allows both systems to monitor the network—and because it relies on multiple exchanges—the Timestamp option can (and should) appear in most of the segments sent by any system that supports it. However, both systems also have to agree to use the Timestamp option, and as such it is required to be in the circuit-startup segments (those that contain the Synchronize flag). Note that many systems do not support this option, and its usage is still quite rare.
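A sketch of the measurement loop described above; the smoothing constant and helper names are illustrative choices, not values mandated by RFC 1323:

```python
def rtt_sample(now, echoed_timestamp):
    """Round-trip time: the local clock minus the value the peer echoed back."""
    return (now - echoed_timestamp) % 2**32  # timestamps wrap at 32 bits

def smoothed_rtt(srtt, sample, alpha=0.875):
    """Classic low-pass filter used to keep the acknowledgment timer current."""
    return alpha * srtt + (1 - alpha) * sample

print(rtt_sample(10_500, 10_320))  # 180 clock ticks for this round trip
print(smoothed_rtt(200.0, 180))    # 197.5: the estimate drifts toward 180
```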

Capture Sample
In the capture shown in Figure 7-34, the Timestamp option shows a Timestamp Value of 0 (which is wrong; the specifications clearly state that the seed value should be derived from the system clock, using the current time), and a Timestamp Reply of 0 (which is correct, since this system has not yet received a Timestamp Value from the remote endpoint).

0359-01.gif
Figure 7-34.
The Timestamp option

See Also
Acknowledgment timers

TCP in Action

Since TCP offers such a wide variety of services, it is easy to understand why so many applications choose to use it as a transport. Among the core services that TCP provides are application and network management, flow control, and reliability, all of which are implemented using TCP's virtual circuit architecture.

The thing to remember is that TCP's virtual circuit design is what makes all of the services offered by TCP possible. Without virtual circuits, TCP's buffering, flow control and reliability services would be much more difficult to implement. But by using this design, these services come for free, at least to the applications that use them.

In order to illustrate how these services work, a variety of diagrams and captures are shown throughout the remainder of this section. Figure 7-35 shows what happens during a TCP session, from the moment that an application is loaded until a circuit is terminated, while Figure 7-38 shows what happens on the wire when TCP opens and closes virtual circuits. Figure 7-39 shows what happens when many small segments are sent using an interactive application such as Echo, while Bulk Data Transfer and Error Recovery shows what happens when large blocks of data are sent using applications such as Chargen.

A Complete Session

Figure 7-35 shows a complete TCP session between an HTTP 1.0 client and server, taken from the examples given in Opening a circuit, Exchanging data, and Closing a circuit. Although this example shows an overview of the steps, it does not provide detailed insight into each of the segments being sent between the two endpoints. This information will be shown later in this section.

The order of events is as follows:

1. Before anything else can happen, applications have to register with their local TCP provider, allocating a port number for use. In this example, the HTTP server has requested that port 80 (the well-known port number for HTTP) be opened in passive mode, allowing incoming requests from HTTP clients to be satisfied. For more information on the TCP port-numbering mechanism, refer to Application addressing with TCP ports.

2. In this example, a user wishes to retrieve a document from the HTTP server, so he enters the URL of the document into the HTTP client. The HTTP client issues an active open request to the local TCP provider, which then begins the

0361-01.gif
Figure 7-35.
A complete TCP session

process of establishing a virtual circuit between the local and remote TCP systems. For more information on active and passive opens, refer to Opening a circuit.

3. During the circuit-setup process, the TCP stack in use at the HTTP client sends a command segment to the TCP stack in use at the destination system. Since the HTTP server is operational and willing to accept the connection, the remote TCP provider responds with a similar command segment, allowing the virtual circuit to be established. Since this is a new virtual circuit, both segments must have the Synchronize flag enabled. In addition, the segment shown in this step must have the Acknowledgment flag enabled, since it is a response to the original segment that was sent by the HTTP client's TCP software. For more information on the Synchronize and Acknowledgment flags, refer to Control Flags.

4. In order for the virtual circuit to get established completely, the HTTP client's TCP stack must also acknowledge the Synchronize segment sent by the HTTP server. Once the server's TCP stack receives this acknowledgment, the circuit will be operational, and the two endpoints can begin exchanging data.

5. Now that the virtual circuit is up, the HTTP client issues a GET document command. This data is not sent directly to the HTTP server however, but instead is passed to the local TCP software. In addition, the HTTP client will likely have set the Push flag for this data, informing TCP that no additional data is coming. TCP will then create a TCP segment that contains this data, with the Push flag enabled in that segment's TCP header. For more information on the Push flag, refer to Control Flags.

6. Upon receiving the segment issued in step five, the HTTP server's TCP provider validates the contents of the segment using the TCP checksum routines. If the segment is valid, then TCP passes the data off to the HTTP server, and then issues an acknowledgment back to the HTTP client's TCP provider. For more information on TCP's checksum characteristics, refer to TCP checksums.

7. Once the HTTP server has located the document and read its contents, it writes the data to the local TCP stack, setting the Push flag once it is finished with the write. In this example, TCP has determined that all of the data will fit within a single segment (a process described in Network I/O Management ), so it creates a segment containing the data, and also enables the Push flag on that segment.

8. Upon receiving the segment issued in step seven, the HTTP client's TCP provider validates the contents of the segment using the TCP checksum routines, and then issues an acknowledgment after passing the validated data off to the HTTP client application.

9. Once the HTTP server receives the acknowledgment issued in step eight, it requests that TCP tear down the virtual circuit. TCP then issues a command segment with the Finish flag set to the remote HTTP client's TCP provider. Note that if this connection were using HTTP 1.1 instead of HTTP 1.0, the client would request that the virtual circuit be closed, rather than the server. For more information on the Finish flag, refer to Control Flags.

10. The HTTP client's TCP stack recognizes the request to tear down the circuit, and since the HTTP client does not object, TCP responds with a command segment that also has the Finish flag set. For more information on how virtual circuits are terminated, refer to Closing a circuit.

11. Upon receiving the confirmation issued in step 10, the HTTP server's TCP provider issues an acknowledgment, and the virtual circuit is terminated.

All told, ten unique segments are required in order to satisfy two simple operations (the get document request, and the sending of the document back to the client). Although this seems like a lot of overhead, this all happens fairly quickly. Furthermore, it only seems like a lot of wasted effort because everything went smoothly. If there had been any problems—such as a segment getting lost or corrupted—then it would be easier to appreciate the value that this overhead provides.

Also note that the session shown in Figure 7-35 is just an example, and will not necessarily mirror the exact behavior that you will see in all cases. For example, some HTTP 1.0 servers do not gracefully close their virtual circuits, but instead simply force the connection closed immediately after sending the requested document. This causes huge problems whenever loss occurs, since the client is unable to request lost data (the server isn't listening, and therefore won't process the retransmission request).

In addition, some servers will return an HTTP header before the document itself, using a separate application write. This would result in a single small segment being written before the document contents were returned to the client, which may or may not result in another acknowledgment. The point is that implementations vary widely, and the example shown here is just an example of how one implementation may do it, which is not to say that all implementations will do it in the same way.
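The passive open, active open, and data exchange described above map directly onto the Berkeley sockets API. The following sketch is purely illustrative: it runs both endpoints on the loopback interface, and the port, request text, and response text are invented stand-ins for the HTTP exchange in Figure 7-35.

```python
import socket
import threading

def start_server():
    # Passive open: the server binds a port and LISTENs for connections.
    # Port 0 asks the OS for any free port; a real HTTP server would
    # request the well-known port 80 instead.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve_one():
        conn, _ = srv.accept()          # three-way handshake completes here
        conn.recv(1024)                 # the client's request segment
        conn.sendall(b"document body")  # the response segment
        conn.close()                    # active close: FIN-WAIT-1, and so on
        srv.close()

    threading.Thread(target=serve_one, daemon=True).start()
    return port

port = start_server()

# Active open: connect() triggers the SYN / SYN-ACK / ACK exchange.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
cli.sendall(b"GET /index.html")
reply = cli.recv(1024)
cli.close()
print(reply)
```

Note that the application never sees the segments, flags, or acknowledgments; the socket calls simply trigger the circuit-management behavior that TCP performs on the application's behalf.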

Notes on Virtual Circuit State Changes

Whenever a TCP-based application is loaded, the application has to register with TCP. Server applications do this whenever they issue a passive open, while clients do this whenever they issue an active open to a server.

Once registered, applications are linked with virtual circuits, with the virtual circuits going through a variety of different changes throughout the circuit's lifetime. For example, when a server is first loaded it enters the LISTEN state, where it sits and waits for a connection request. When a connection request arrives, the virtual circuit goes into the SYN-RECEIVED state (signifying that a connection request has arrived). Once the handshake is finished—but before any data is exchanged across the virtual circuit—the virtual circuit changes to the ESTABLISHED state, and when the applications are finished, the virtual circuit enters one of many different ENDING states, depending upon who terminated the circuit and how.

Most operating systems provide tools for monitoring the connectivity state of the virtual circuits currently in use on a particular system. This information is useful for debugging TCP, and also for monitoring network activity on the system. On most Unix and derivative systems, this information can be gleaned by issuing the netstat command. The output of this command will be similar to what is shown in Figure 7-36.

0364-01.gif
Figure 7-36.
Using netstat to view the state of active virtual circuits

Each virtual circuit will be in different states at different times, depending upon its current situation. Table 7-9 lists the common states that TCP virtual circuits go through, and their meanings.

Table 7-9. The Various Circuit States and Their Meanings

LISTEN
    A server application has been loaded, and has opened a port in passive mode. TCP is now waiting for incoming connection requests.

SYN-SENT
    A client application on the local system has issued an active open to a remote host. TCP has sent a startup segment with the Synchronize flag enabled, and is waiting for the remote system to respond with a startup segment that also has the Synchronize flag enabled.

SYN-RECEIVED
    TCP on the server has received a startup segment with the Synchronize flag enabled from a remote client, has responded with its own startup segment, and is now waiting on an acknowledgment for that segment.

ESTABLISHED
    The virtual circuit is operational. This state occurs on both endpoints once the three-way handshake has completed.

FIN-WAIT-1
    The local application has issued an active close for the virtual circuit, and TCP has sent a shutdown segment with the Finish flag enabled. However, TCP is still waiting for the remote system to acknowledge the segment and respond with a shutdown segment of its own. No additional data will be sent from this system, although data will be accepted from the remote system until the circuit is completely terminated.

CLOSE-WAIT
    A shutdown segment with the Finish flag enabled has been received (as discussed in FIN-WAIT-1), and the local TCP has returned an acknowledgment for that segment back to the sender. However, the local TCP cannot respond with its own shutdown segment until the local application issues its own close operation, which has not yet occurred.

FIN-WAIT-2
    The local TCP has sent a shutdown segment with the Finish flag enabled (as described in FIN-WAIT-1), and an acknowledgment has arrived from the remote endpoint for that segment (as described in CLOSE-WAIT). However, the remote application has not yet performed a close operation, preventing the remote TCP from issuing its own shutdown segment.

LAST-ACK
    A shutdown segment with the Finish flag enabled has been received (as discussed in FIN-WAIT-1), and the local application has approved the shutdown request by issuing its own close operation. This results in the local TCP sending its own shutdown segment with the Finish flag enabled, although the circuit will not be destroyed until an acknowledgment for this shutdown segment is received.

CLOSING
    This state is somewhat rare, and typically indicates that a segment has been lost on the network. In this case, the local TCP has sent a shutdown segment with the Finish flag enabled (as described in FIN-WAIT-1), and a shutdown segment has been received from the remote endpoint (as described in LAST-ACK), but the remote system has not yet acknowledged the shutdown segment that the local system sent. This normally indicates that the acknowledgment was lost in transit.

TIME-WAIT
    The circuit-shutdown operation has completed, but TCP is keeping the socket open for a while to allow for any laggard segments that might have gotten lost en route. This prevents any new connection to that port number from accidentally reusing sequence numbers that may have been used in the previous connection. Note that this state occurs only on the host that issued the active close, since the remote system is not likely to receive any more data from the terminating host.

CLOSED
    Nothing is happening; the circuit is closed and TCP has released all of the resources it had been using for the virtual circuit. This state should never appear in practice, since there would no longer be a virtual circuit to show.

Notice that there is a fairly straightforward series of events that is followed when a virtual circuit is established, while an endpoint can pass through several different states when a virtual circuit is terminated. That is because the model for establishing connections is almost always the same (as described in Opening a circuit), while the shutdown sequence can take a variety of different paths, with either the client or the server initiating the action, and doing so in a number of different ways at different junctures in the data-exchange process.

Figure 7-37 shows a simple operation, with the circuit being terminated in a clean, orderly sequence by an HTTP server.

0366-01.gif
Figure 7-37.
A simple circuit-shutdown sequence of state-changes

In the example shown in Figure 7-37, the following state-changes have occurred:

1. The HTTP server issued a close-for-send, resulting in the TCP stack sending a shutdown request to the remote endpoint (a segment with the Finish and Acknowledgment flags enabled, although the acknowledgment is for the last-accepted byte of data). This is the FIN-WAIT-1 state.

2. The client's TCP stack receives the request, and issues an acknowledgment for the segment. However, it has to inform the client of the shutdown request—and wait for the client's approval—before responding with its own Finish flag. At this point, the client's TCP stack is in the CLOSE-WAIT state.

3. The server's TCP stack goes into the FIN-WAIT-2 state, while it waits for the client to respond with a Finish flag.

4. The HTTP client approves the shutdown request, so the TCP stack on that system sends a shutdown segment with the Finish flag enabled (and repeats the acknowledgment returned in step 2). The client has entered the LAST-ACK state, and as soon as the server's TCP returns an acknowledgment it will close the circuit completely and enter the CLOSED state.

5. Once the server's TCP stack receives the Finish flag sent in step 4, it will return an acknowledgment and completely close the circuit. The server's TCP stack will then go into the TIME-WAIT state, where it monitors the local socket for any spurious segments that may arrive late. After waiting for two times the maximum segment lifetime, the socket will enter the CLOSED state. (The server is the only one to enter TIME-WAIT since it issued the active close.)

6. Upon receiving the acknowledgment sent in step 5, the client's TCP stack will close the connection completely, and will either enter the CLOSED state or disappear entirely.

7. After the server's TIME-WAIT timer expires, the virtual circuit will enter the CLOSED state or disappear entirely. However, there will likely be at least one occurrence of the HTTP server still sitting in the LISTEN state, waiting for new connection requests.

This is a simple example of how things are supposed to work. However, that does not mean that things always go this way. Many times segments get lost at the end of the connection, and when that occurs the two endpoints can choose to close the circuit down anyway. This may result in a lot of Reset segments, since the other endpoint may not be fully aware of the circuit shutdown. For more information on how to tell when this has happened, refer to Lots of Reset Command Segments.
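The state changes described above can be summarized as a transition table. The sketch below covers only the common paths from Table 7-9, not the complete state diagram from RFC 793, and the event names ("close", "recv-fin", and so on) are invented labels for the application actions and segment arrivals discussed in the text.

```python
# A simplified transition table for the circuit states in Table 7-9.
# Keys are (current state, event); values are the next state.
TRANSITIONS = {
    ("CLOSED",       "passive-open"): "LISTEN",
    ("CLOSED",       "active-open"):  "SYN-SENT",
    ("LISTEN",       "recv-syn"):     "SYN-RECEIVED",
    ("SYN-SENT",     "recv-syn-ack"): "ESTABLISHED",
    ("SYN-RECEIVED", "recv-ack"):     "ESTABLISHED",
    ("ESTABLISHED",  "close"):        "FIN-WAIT-1",   # active close
    ("ESTABLISHED",  "recv-fin"):     "CLOSE-WAIT",   # passive close
    ("FIN-WAIT-1",   "recv-ack"):     "FIN-WAIT-2",
    ("FIN-WAIT-1",   "recv-fin"):     "CLOSING",      # simultaneous close
    ("FIN-WAIT-2",   "recv-fin"):     "TIME-WAIT",
    ("CLOSE-WAIT",   "close"):        "LAST-ACK",
    ("LAST-ACK",     "recv-ack"):     "CLOSED",
    ("TIME-WAIT",    "2msl-timeout"): "CLOSED",
}

def walk(start, events):
    """Apply a sequence of events and return the list of states visited."""
    state, path = start, [start]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path

# The server's active close from Figure 7-37:
server_path = walk("ESTABLISHED",
                   ["close", "recv-ack", "recv-fin", "2msl-timeout"])
print(server_path)  # ESTABLISHED -> FIN-WAIT-1 -> FIN-WAIT-2 -> TIME-WAIT -> CLOSED

# The client's side of the same shutdown:
client_path = walk("ESTABLISHED", ["recv-fin", "close", "recv-ack"])
print(client_path)  # ESTABLISHED -> CLOSE-WAIT -> LAST-ACK -> CLOSED
```

Walking both sides of the shutdown this way makes it clear why only the active closer ever reaches TIME-WAIT.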

Opening and Closing Virtual Circuits

Figure 7-38 shows a dataless TCP session between a Discard client and server. The Discard client on Arachnid connects to the Discard server on Greywolf, and then disconnects. No data is sent between the two systems whatsoever.

Discard is a simple application that takes characters from a client and then throws them away. Since Discard does so little, it is really only useful for testing basic TCP connectivity between two systems.

The order of events is as follows:

1. The first segment sent in the exchange is the circuit-setup request, being sent from port 4854 on Arachnid to port 9 (the well-known port number for Discard servers) on Greywolf. This segment is being used to initialize the virtual circuit, and as such has the Synchronize flag set. Furthermore, this segment is advertising a beginning sequence number of 138,806. In addition, notice that the Acknowledgment flag is not set and that the Acknowledgment Identifier field is set to zero. Since this is the first segment to be sent, there is nothing to

0368-01.gif
Figure 7-38.
A Discard client exchanging data with a Discard server

acknowledge, so these fields are cleared. For more information on these flags, refer to Control Flags.

Also, since this is the first segment, the Maximum Segment Size option is being used to advertise Arachnid's MRU (minus 40 bytes for the IP and TCP headers). For more information on the MSS option, refer to Maximum Segment Size.
At this point, Arachnid is in the SYN-SENT state.

2. Segment two is a response from Greywolf back to Arachnid, agreeing to open the virtual circuit. This is identifiable by the Synchronize flag also being set. Also, notice that Greywolf is advertising its own sequence number (2,338,561,385), and is also using the Maximum Segment Size option to advertise the local MRU.

Furthermore, since this is a response to segment one, this segment has the Acknowledgment flag enabled, with the Acknowledgment Identifier pointing to the next byte of data that Greywolf expects to receive from Arachnid (138,807).
At this point, Greywolf is in the SYN-RECEIVED state.

3. In order to complete the handshake process, Arachnid's TCP stack returns an acknowledgment for segment two in segment three. Since the circuit-setup is now completed, the Synchronize flag is not set in this segment. However, the Acknowledgment flag is enabled, with the Acknowledgment Identifier pointing to the next byte of data that Arachnid expects to receive from Greywolf (2,338,561,386).

In addition, notice that the Sequence Identifier of this segment has been incremented (now being set at 138,807). In order for TCP to track circuit-startup and shutdown messages as distinct entities, it uses special sequence numbers for these segments that are outside the range of sequence numbers that may be used by any data sent across the virtual circuit. Although no data is being sent over this connection, TCP is still incrementing the local sequence number in case any data does get sent, thereby keeping the startup sequence number unique. For more information on sequence numbers and their usage with command segments, refer to Sequence numbers.
Once this acknowledgment is received by Greywolf, both systems will be in the ESTABLISHED state.

4. The Discard client on Arachnid requests that the virtual circuit be terminated, resulting in the local TCP stack issuing a segment with the Finish bit set. Notice that the Sequence Identifier of this segment is still set at 138,807. Since no data was sent, the sequence number remains unchanged. If any data had been sent however, then the sequence number would have been incremented.

At this point, Arachnid is in the FIN-WAIT-1 state.

5. Greywolf issues an acknowledgment for the segment it just received. However, notice that this response does not have the Finish flag enabled. Remember that in order for a close to complete, the applications in use on the virtual circuit have to agree to the request. While Greywolf is off getting permission from the Discard server to terminate the connection, it goes ahead and returns an acknowledgment for the request, so that Arachnid will know that the request was received. Once the request has been approved by the Discard server, Greywolf will respond to the circuit-termination request with a segment that also has the Finish flag set.

At this point, Greywolf is in the CLOSE-WAIT state. Once Arachnid receives this acknowledgment, it will be in the FIN-WAIT-2 state.

6. Greywolf issues another segment almost immediately, this time setting the Finish flag, and repeating the acknowledgment number issued in segment five.

At this point, Greywolf is in the LAST-ACK state.

7. In order to terminate the circuit, Arachnid must issue an acknowledgment to Greywolf, indicating that it received the circuit-termination confirmation sent in segment six. In addition, notice that the Sequence Identifier of this segment has been incremented (now being set at 2,351,319,118). This allows the acknowledgment to the shutdown request to be tracked independently of the request itself.

Arachnid is the only host in the TIME-WAIT state. Greywolf goes into the CLOSED state, or the circuit disappears from the state table.

What's most interesting about this example is that no end-user data was sent at all. Rather, the Discard client on Arachnid connected to the Discard server on Greywolf, and then disconnected almost immediately. Remember that TCP does not care about data whatsoever, and instead focuses strictly on managing virtual circuits. Whether or not an application chooses to send data has no effect on TCP's circuit-management duties.
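The sequence-number arithmetic at work in this dataless session can be sketched as follows. The initial sequence numbers come from the Discard capture above; the helper function is purely illustrative.

```python
def syn_consumes(isn):
    """A SYN occupies one sequence number, so the peer acknowledges
    isn + 1, and the first byte of data would also carry isn + 1."""
    return isn + 1

# Arachnid's side of the Discard session (ISN from Figure 7-38):
assert syn_consumes(138806) == 138807          # Greywolf's SYN-ACK acks this

# Greywolf's side:
assert syn_consumes(2338561385) == 2338561386  # Arachnid's handshake ACK

# No data was sent, so Arachnid's FIN still carries sequence 138,807.
# The FIN itself also consumes one sequence number, so Greywolf
# acknowledges the number one past it:
fin_seq = 138807
print(fin_seq + 1)  # 138808
```

This is why the acknowledgment numbers in the capture always point one past the Synchronize and Finish segments even though those segments carry no data.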

Interactive Data Exchange

Figure 7-39 shows a simple TCP session between an Echo client and server. The Echo client on Arachnid connects to the Echo server on Greywolf, sends the characters t, e, s, and t, and then disconnects.

Echo is a simple application that takes characters from a client and then echoes the characters back. As such, Echo is a good example of how TCP handles interactive applications that send and receive many small segments.

The order of events is as follows:

1. [Segments 0-2] The Echo client on Arachnid allocates a TCP client port (4868), and establishes a TCP connection with the Echo server's well-known port (TCP port 7) on Greywolf.

2. [3] The Echo client sends the letter t (hex 74). In addition, notice that the Sequence Identifier of this segment has been incremented by one (now being set at 138,921), allowing data that is sent after the circuit has been established to be tracked separately from the segments used to establish the connection. Also, notice that the Push flag is set on this segment, indicating that the application requested that TCP clear the send buffer immediately as well. For more information on the Push function, refer to Data considerations.

3. [4] Greywolf's Echo server responds to the segment by returning a segment that contains the same data that it received. Notice that this segment also has the Acknowledgment flag enabled, with the Acknowledgment Identifier field pointing to sequence number 138,922. Because one byte of data was received, Greywolf has added one byte to the Sequence Identifier of the segment that it received from Arachnid, and is using the resulting value to identify the next

0371-01.gif
Figure 7-39.
An Echo client exchanging data with an Echo server

byte of data it is expecting to receive from Arachnid. For more information on how acknowledgment numbers are chosen, refer to Acknowledgment numbers.

The important detail here is that Greywolf has consolidated the acknowledgment for segment three with the outbound data (the echo reply), thereby eliminating the need to send two distinct segments. For more information on this subject, refer to Delayed acknowledgments.

4. [5] When Arachnid receives the echoed data back from Greywolf, it must acknowledge it. However, unlike segment four, there is no additional data being returned to Greywolf right now, so the acknowledgment is sent on its own. Also, notice that the acknowledgment number used in this segment points to the next byte of data that Arachnid expects to receive from Greywolf, which is one byte plus the sequence number used by Greywolf when the echoed data was sent to Arachnid. This process really illustrates the full-duplex nature of TCP virtual circuits.

5. [6-14] The process described in steps 2 through 4 is repeated for the letters e, s, and t.

6. [15-18] The virtual circuit is torn down.

The most important aspect of this session is the way in which the sequence numbers and acknowledgments work hand-in-glove on a fully bidirectional basis. Since TCP is a true full-duplex transport, each endpoint is allowed to send as much data as required by the applications in use. As such, each endpoint must provide its own sequence numbers, which must be acknowledged by the recipient.

Another interesting aspect of this capture is the way in which the Echo server combines its acknowledgments with the echoed data being sent back to the Echo client. By delaying the acknowledgment until there is data to be sent, Greywolf substantially reduces the number of segments required.
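The per-endpoint bookkeeping behind this exchange can be sketched with two counters per side: the next byte to send and the next byte expected (which is also the value placed in the Acknowledgment Identifier). The client's initial sequence number is taken from the capture; the server's ISN (5000) and the segment representation are invented for illustration.

```python
class Endpoint:
    """Minimal sequence bookkeeping for one side of a TCP circuit."""
    def __init__(self, isn, peer_isn):
        self.snd_nxt = isn + 1        # the SYN consumed one sequence number
        self.rcv_nxt = peer_isn + 1   # next byte expected from the peer

    def send(self, data):
        # Every outbound segment carries a piggybacked acknowledgment.
        seg = {"seq": self.snd_nxt, "ack": self.rcv_nxt, "data": data}
        self.snd_nxt += len(data)
        return seg

    def receive(self, seg):
        assert seg["seq"] == self.rcv_nxt   # in-order delivery assumed
        self.rcv_nxt += len(seg["data"])

# Client ISN loosely modeled on Figure 7-39; server ISN is hypothetical.
client = Endpoint(isn=138920, peer_isn=5000)
server = Endpoint(isn=5000, peer_isn=138920)

for ch in b"test":
    seg = client.send(bytes([ch]))    # client sends one character
    server.receive(seg)
    echo = server.send(seg["data"])   # echo plus piggybacked ACK, one segment
    client.receive(echo)

print(client.snd_nxt, server.rcv_nxt)   # both advanced by the 4 echoed bytes
```

Because the server's echo segment carries both the data and the acknowledgment, each character costs three segments (send, echo-with-ack, ack) rather than four.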

0372-01.gif
Figure 7-40.
A Chargen client exchanging data with a Chargen server

Bulk Data Transfer and Error Recovery

Figure 7-40 shows a more complex TCP session between Greywolf (the Chargen client) and Arachnid (a Chargen server). Greywolf connects to Arachnid and receives several segments of data. Due to a variety of events, this session gets fairly complicated. As such, the discussion is continued in Figure 7-41, Figure 7-42, and Figure 7-43.

Chargen is a simple application that generates streams of character data as soon as a connection is established. Chargen sends the data as a continuous stream, with the stream being broken into segments according to the segment-sizing calculations performed during circuit setup (as described in Network I/O Management ). As such, Chargen is a good example of how TCP handles applications that send large quantities of data.

0373-01.gif
Figure 7-41.
Chargen data exchange, continued

The order of events is as follows:

1. [Segments 0-2] The Chargen client on Greywolf allocates a TCP client port (1048), and establishes a TCP connection with the Chargen server's well-known port (TCP port 19) on Arachnid.

2. [3-4] Once the circuit is established, the Chargen server on Arachnid immediately starts sending data to the Chargen client on Greywolf. Notice that Arachnid sends two segments, and then waits for Greywolf to return an acknowledgment for that data. This will be shown to be an important detail in a moment.

3. [5] Greywolf acknowledges the data sent in segments three and four, and increments the byte number used in the Sequence Number field to show that these segments are different from the ones used in the circuit-setup procedure.

4. [6-8] Arachnid sends three more segments back-to-back, and then pauses to wait for another acknowledgment.

0374-01.gif
Figure 7-42.
Chargen data exchange, continued

5. [9] Greywolf acknowledges the data.

6. [10-13] Arachnid sends four more segments and then stops. This linear growth pattern would seem to indicate that Arachnid is using the congestion avoidance algorithm to enlarge the size of its congestion window, rather than using slow start. Congestion avoidance increments the size of the sender's congestion window by one segment whenever an acknowledgment is received for all of the segments already issued (linear growth), while slow start increments the sender's congestion window whenever any segment is acknowledged (exponential growth). Since Arachnid is only sending one additional segment before stopping to wait for an acknowledgment, it appears that congestion avoidance is being used instead of slow start. For more information on these mechanisms, refer to Slow start and Congestion avoidance.

0375-01.gif
Figure 7-43.
Chargen data exchange, continued

7. [14] Greywolf acknowledges all of the data sent. However, notice that the size of Greywolf's receive window (as advertised by the Window field of the TCP

header) is starting to shrink a little bit, suggesting that it is not able to process the data as fast as it is coming in.

8. [15-19] Arachnid sends five more segments.

9. [20] Greywolf acknowledges all of the segments, but its Window is still shrinking.

This process continues through segment 44, with Arachnid sending one extra segment before stopping to wait for an acknowledgment, while Greywolf continues to acknowledge all of the data that it has received (although its Window continues to shrink).
One other pattern that is beginning to emerge is that Arachnid is setting the Push flag on every fifth segment. This pattern indicates that the Chargen server on Arachnid is handing data to TCP in chunks of 7,300 bytes each. Whenever this write occurs, the data is pushed to TCP, which sets the Push flag on the outgoing segment. This does not affect the delivery of the data, since the application writes are a multiple of the maximum segment size. Otherwise, the segments containing the Push flag would probably be less than full-sized, since all of the data up to the Push would be sent immediately, resulting in smaller-than-normal segments. For more information on how the Push flag affects delivery, refer to Data considerations.

10. [Segment 44] Greywolf has been given eight segments, and it is acknowledging them all. However, by this point the size of its receive window has shrunk to the point where it will start to affect Arachnid's ability to send more data. In this case, the Window advertisement only shows 7,300 bytes of available storage in Greywolf's receive queue.

11. [45-49] Arachnid sees the small Window. Instead of sending nine segments as would be expected from the congestion avoidance algorithm, Arachnid sends only five segments, since that is all that will fit within the smaller receive queue (5 segments of 1,460 bytes = 7,300 bytes, the size of Greywolf's receive queue).

12. [50] Greywolf acknowledges the data, although it is unable to process all of the data right away, and the Window field only shows 2,920 bytes.

13. [51-52] Arachnid sends two segments, which is all that will fit.

14. [53] Greywolf acknowledges the data, but only advertises a 1,460-byte Window.

15. [54] Arachnid sends one segment.

16. [55-56] The process described in steps 14 and 15 is repeated.

17. [Segment 57] This time, Greywolf returns a zero-length Window advertisement along with the acknowledgment, suggesting that it cannot accept any more data at this time.

18. [58] Arachnid sends a window probe to Greywolf, checking to see if the receive queue has opened up any. Looking at the length of this segment, we can tell that this is a one-byte probe, and based on the Sequence number that is being provided, we can tell that Arachnid is trying to send the next byte of data (but just the next byte) as its probing mechanism. For more information on window probing, refer to Receive window size adjustments.

19. [59] Greywolf responds to the probe with another zero-length window. Notice that the acknowledgment number still points to 200,580, which is the same as it was in the zero-length window advertisement sent in step 17.

20. [60] A few moments later, Greywolf sends an unsolicited acknowledgment to Arachnid with a Window size of 5,840 bytes. This would indicate that Greywolf has processed some of the data in its receive queue, and is ready to receive more data. This use of an unsolicited acknowledgment is called a window update, and is documented in Receive window size adjustments.

21. [61-64] Arachnid sends four segments to fill the window.

22. [65-68] Greywolf responds with a 1,460-byte Window advertisement, which Arachnid fills with another segment, and these steps are repeated.

23. [69] Greywolf sends another zero-length Window advertisement.

24. [70-75] Arachnid sends another window probe, which Greywolf responds to with another zero-length Window advertisement. After a while, Greywolf advises that it can handle 4,380 bytes of data, which Arachnid sends as three segments.

25. [76-115] Greywolf advises a 2,920-byte window, which Arachnid fills. Things never really get off the ground again, and Greywolf continues to alternate between small and zero-length Window advertisements, while Arachnid continues trying to send whatever data it can.

26. [Segment 116] The Chargen client on Greywolf informs its local TCP stack that it is going to close down. This results in Greywolf sending a circuit-shutdown request segment to Arachnid. Notice however that Greywolf is still advertising a zero-length window. Although Greywolf cannot currently accept any data, it can still send its shutdown segment, which Arachnid must be able to accept. At this point, Greywolf is in the FIN-WAIT-1 state.

27. [117] Arachnid returns an acknowledgment for the shutdown request. However, it does not return its own shutdown-request segment (the Chargen server must approve the request first). At this point, Arachnid is in the CLOSE-WAIT

state, although it may also try to send any data that is waiting in the outbound queue.

28. [118] Arachnid sends another window probe to Greywolf. Apparently, there is still a substantial amount of data in Arachnid's send queue, which must be dealt with before the virtual circuit can be terminated.

29. [119] Greywolf returns a fairly large Window size advertisement in response to the window probe sent in step 28. Greywolf has also received the acknowledgment (sent in step 27) for the shutdown request it issued in step 26, so it is now in the FIN-WAIT-2 state.

30. [120-126] Arachnid sees the large window, and sends five full-sized segments plus one small segment to Greywolf in a burst (this is probably data that Arachnid had in the send queue). Once this data was acknowledged, Arachnid would likely send a circuit-shutdown segment of its own, in response to the one sent by Greywolf in step 26.

31. [127-133] However, Greywolf is not waiting around for Arachnid to issue a shutdown segment, and instead has closed the connection permanently. Since the TCP stack on Greywolf has nowhere to send the data that it just received, it rejects each of the segments (although this information is not visible from this capture, each of the segments sent during this step has the Reset flag enabled). As a result, Arachnid closes its end of the connection immediately as well, while Greywolf, having issued the active close in step 26, enters the TIME-WAIT state.

A variety of TCP's concepts are illustrated in this example. For one thing, notice that bulk data transfers act quite a bit differently from the interactive applications shown earlier in Interactive Data Exchange. Rather than each endpoint sending small data segments, only one end is transferring data in this example. As such, issues such as flow control and congestion management are much more important.

The congestion management is most easily seen in this example, in which the amount of data that Arachnid could send was severely constrained by the size of its local congestion window, as well as the size of Greywolf's receive window. Another interesting aspect of this exchange is the fact that Arachnid did not use the slow start mechanisms at any time, although it should have done so at the very beginning. Instead, Arachnid used the congestion avoidance algorithm to ramp up the size of its congestion window.

Note that Greywolf was not sending acknowledgments for every other segment, as recommended by RFC 1122. Instead, it was acknowledging data whenever it noticed a problem, or whenever it was convenient to do so. Would sending acknowledgments for every other segment have made much of a difference? Probably not, although it would have made the transfer go more smoothly, instead of being as jerky as it was at the very beginning (during the congestion avoidance ramp-up).

Finally, note that the size of the receive window in use by Greywolf was too large for this particular network session. Because Greywolf advertised a large Window size, Arachnid was able to queue up more data than the network could handle. This resulted in problems when Greywolf tried to end the connection, since Arachnid had to clear its buffers before it could do anything else. If Greywolf had consistently advertised a small buffer, then data still would have flowed across the network, but Arachnid would not have queued up a lot of data for sending, resulting in faster recovery times. For more information on sizing the receive window, refer to Notes on Determining the Optimal Receive Window Size.

Notes on Determining the Optimal Receive Window Size

Almost all of TCP's flow control services depend upon the size of the Window field that is advertised by a recipient, since this value dictates the maximum amount of data that can be sent during steady-state operations (i.e., after slow start has fully opened the initial congestion window) without the sender having to stop and wait for an acknowledgment. As such, this value determines the smooth flow of data more than just about any other element, so correctly determining the appropriate size of a system's receive window is central to achieving efficient throughput on a virtual circuit.

Although an application can set the size of the receive window in use with that application, most applications just use the system-wide default, which is only sometimes appropriate for the typical usage of that system. Therefore, one way to improve performance for any given system is to optimize the system-wide default for the receive window in use on that system, so that it more accurately reflects the typical usage.

In fact, setting this value accurately is crucial to achieving optimal performance. Setting the receive window too small results in an artificial bottleneck, where the receiver's window is smaller than the amount of data that the network can handle. In this model, the sender has to wait for the recipient to acknowledge data before it can send any more data, even though the network may be idle and have plenty of excess capacity. This results in connections running slower than they could, which is something that nobody wants.

Conversely, setting the window too large results in a waste of system resources (some systems allocate memory according to the size of the receive buffers), and can also introduce problems where the sender and receiver attempt to put more data into the network than the network can handle. In that situation, unnecessary delays are added to the exchange, network throughput becomes a noticeable bottleneck, and recovery times get very high.

For example, if a remote system is advertising a large receive window, then the sender may actually try to fill that window, even though the network path is unable to forward that much data in a timely fashion. If any of that data gets lost, then the sender will not discover this until the next segment has made it through the network to the recipient (which could take a long time) and an acknowledgment has been returned (more time). Then, when the sender resends the lost data, that segment has to sit in the queue on a router somewhere until all of the other segments that had already been processed were delivered (even more time). If the receive window were better sized, then none of these delays would have occurred, and recovery could have happened much more quickly. Therefore, setting the window too large is not just ineffective; it actually causes recovery problems whenever the network suffers from congestion.

Determining the most-efficient receive window size is a function of the specific virtual circuit in use with the particular applications on that system. For example, the optimal settings will be quite a bit different for a system that is acting as a Web server for a site that gets many dial-up users over a relatively slow leased line, versus a system that is acting as a mail server for a highly distributed enterprise.

The factors that affect the optimal receive window size are the bandwidth of the typical virtual circuit (represented in bits-per-second), and the amount of round-trip latency (expressed in seconds) found on that circuit. The product of the bandwidth multiplied by the round-trip latency represents the maximum amount of data that can be in-flight on the network at any given time, and therefore represents the optimal window size for that virtual circuit. Setting a value that is larger than this can cause recovery problems, while setting a value that is smaller than this works against effective network utilization.

Determining these basic values can be somewhat complicated. For one thing, you must remember that the maximum throughput is a function of the entire end-to-end link, which may not be nearly as fast as your local connection. Just because you have a 100 MB/s Ethernet card in your web server does not mean that you should perform your calculations based on 100 MB/s throughput. Instead, you must determine the throughput of the entire connection, which is not always easy to do. If you know that all of the connections to that server are originating on your LAN, then you can just use the defaults, but if the system is also serving resources for users on many different remote networks, then you have to determine the end-to-end bandwidth based on the destination network and all of the intermediary networks in between.

You also have to determine the latency of the typical connections, using the round-trip latency rather than the one-way delay. Although it may seem that latency should only be measured one-way (from the sender to the recipient), remember that the acknowledgments also have to come back across the network, and that the sender can continue to transmit data while those acknowledgments are in flight.

Fortunately, measuring latency is somewhat easier, since tools such as ping or traceroute can be used. However, you have to be somewhat cautious with these tools, ensuring that they do not send large messages in their tests, as the time it takes to deliver all of the bits from a large packet will obviously be longer than the time it takes to deliver a small one. After all, you will not get a response until the remote end has received all of the message that you sent, and until your system has received all of the response data. When you are testing for latency with these tools, make sure to use a message size that is typical of the segment size for the application data that is predominantly in use on that system. For applications like TELNET, this should be small messages, while applications like HTTP and SMTP will typically use a mix of small and large segments.
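If ping is not available, a rough round-trip figure can also be taken by timing a TCP handshake, which exchanges only tiny segments and therefore reflects latency rather than bandwidth. The sketch below is a hypothetical helper (not a standard tool), demonstrated against a throwaway local listener; real measurements should be repeated and averaged:

```python
import socket
import time

def tcp_rtt(host, port):
    """Approximate round-trip latency by timing the TCP three-way
    handshake.  The handshake segments are tiny, so this measures
    latency rather than throughput."""
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=5)
    rtt = time.monotonic() - start
    sock.close()
    return rtt

# Demonstrated against a throwaway listener on the loopback interface;
# in practice you would point this at the remote system of interest.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
rtt = tcp_rtt("127.0.0.1", listener.getsockname()[1])
listener.close()
print("measured RTT: %.6f seconds" % rtt)
```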

Remember that the calculations in this chapter treat kilobits and megabits as multiples of 1,024 rather than 1,000. To figure out how many bits-per-second a 19.2 KB/s modem can send, you have to multiply 19.2 by 1,024, and to figure out how much throughput a 1.5 MB/s circuit offers, you have to multiply 1.5 by 1,024, and then by 1,024 again. Also remember that some topologies will have additional throughput constraints, such as the use of parity bits.

Taken together, bandwidth times latency tells us how much data a particular virtual circuit can handle. For example, multiplying the throughput of a 19.2 KB/s modem (19.2 × 1,024) by an average round-trip time of 60 milliseconds (.06) tells us that the maximum amount of data in-flight over that virtual circuit at any given time is 1,180 bits. Actually, the calculation returns 1,179.648, but since circuits cannot send fractions of bits, we have rounded up.
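This arithmetic is easy to capture in a short function. The sketch below follows this chapter's 1,024-per-kilobit convention; the function name and parameters are our own:

```python
import math

def bits_in_flight(rate_kbps, rtt_seconds, kilo=1024):
    """Maximum bits in flight: bandwidth (bits per second) multiplied
    by the round-trip latency.  Follows this chapter's convention of
    1,024 bits per kilobit; pass kilo=1000 for decimal link ratings."""
    bits = rate_kbps * kilo * rtt_seconds
    return math.ceil(bits)  # circuits cannot send fractions of bits

# The 19.2 KB/s modem with 60 ms of round-trip latency from the text:
print(bits_in_flight(19.2, 0.060))  # 1180 (rounded up from 1179.648)
```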

Some examples of this formula at work are shown in Table 7-10.

Table 7-10. Maximum Bits-in-Flight from Calculating Bandwidth Times Delay

Bits-per-Second        Round-Trip Latency    Maximum Bits-in-Flight
19.2 KB/s (modem)      .060 seconds          1,180
19.2 KB/s (modem)      .120                  2,360
53 KB/s (modem)        .060                  3,257
53 KB/s (modem)        .120                  6,513
64 KB/s (satellite)    .200 (fast)           13,108
64 KB/s (satellite)    .400 (normal)         26,215
384 KB/s (DSL)         .060                  23,593
384 KB/s (DSL)         .120                  47,186
1.5 MB/s (T-1)         .060                  94,372
1.5 MB/s (T-1)         .120                  188,744
10 MB/s (Ethernet)     .002 (local)          20,972
10 MB/s (Ethernet)     .010 (multi-hop)      104,858
100 MB/s (Ethernet)    .002                  209,716
100 MB/s (Ethernet)    .010                  1,048,576

As can be seen, the number of bits-in-flight changes dramatically according to the bandwidth and latency characteristics of the virtual circuit. Circuits that have high levels of latency (referred to as "long pipes") take more time to propagate data to the destination than short pipes do, allowing more bits to be stuffed into the pipe at any given time, regardless of the bandwidth ("size") of that pipe. For example, the first two entries from Table 7-10 show a 19.2 KB/s modem with 60 milliseconds of round-trip latency versus the same modem with a 120-millisecond latency. Since the second connection is twice as long, it can hold twice as much data from end-to-end.

Just as round-trip latency affects the amount of data in-flight, so does the capacity (or the bandwidth) of the connection. In this regard, a fat pipe can carry more data than a skinny pipe can, regardless of the length of the pipe. For example, the 100 MB/s Ethernet link shown in Table 7-10 has 10 times the capacity of a 10 MB/s link, and thus can have 10 times as much data in-flight.

In a scenario where a network has lots of bandwidth and high levels of latency (the "long, fat pipe"), the number of bits-in-flight can really add up. For example, the T-1 circuit with latencies of 120 milliseconds has a substantially larger requirement than the long, thin pipe offered by the 19.2 KB/s modem with 120-millisecond latencies, or the short, fat pipe offered by the 10 MB/s Ethernet LAN with 2 milliseconds of latency.

Once the number of bits-in-flight has been determined, that value must be converted to a byte value, since the Window field advertises buffer space in terms of the available number of bytes (not bits). Therefore, you have to divide the value by eight in order to determine the optimal size in bytes. However, a key consideration in this is to make sure that the byte value is an even multiple of the MTU for the virtual circuit. This is to ensure that the sender can always transmit two fully sized segments, thereby preventing the Silly Window Avoidance algorithm from holding up small segments, and also preventing the Delayed Acknowledgment algorithm from holding up acknowledgments (a problem discussed in Partially Filled Segments or Long Gaps Between Sends).

For example, if the MTU between two endpoints is 1460 bytes, then the Window field must be a multiple of 2920 bytes, but if the end-to-end MTU is only 536 bytes, then the Window field must be a multiple of 1072. Also recall that many BSD-based systems only use multiples of 512 for their MTU size (regardless of the MTU that is offered by the network), and this has to be taken into consideration if you're using those systems.

Generally speaking, the value used for the receive window should be at least four times the MTU, since a receive window smaller than that would not allow for a steady exchange of data. If the receive window were only two times the MTU, then the use of delayed acknowledgments on the receiver would result in a very staggered exchange. Upon seeing a window size of two segments, the sender would transmit two segments and then stop to wait for an acknowledgment, while the two segments worked their way through the network to the recipient. Once received, the recipient's acknowledgment would also have to work its way back to the sender, whereupon the sender could transmit two more segments. This would result in a very jerky exchange of data.
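A rough model of this stop-and-wait cost, assuming every window's worth of data stalls for one full round trip and ignoring slow start, shows how a two-segment window doubles the number of stalls:

```python
import math

def round_trips(total_bytes, window_bytes):
    """Each window's worth of data costs roughly one round trip while
    the sender stalls for an acknowledgment.  This is a deliberately
    simplified model: it ignores slow start and assumes every window
    is fully stalled."""
    return math.ceil(total_bytes / window_bytes)

mss = 1460
# Moving 100 KB with a two-segment window versus a four-segment window:
print(round_trips(102400, 2 * mss))  # 36 stalled round trips
print(round_trips(102400, 4 * mss))  # 18
```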

In addition, setting the receive window to just two segments will prevent fast retransmit from occurring whenever there is lost data. Remember that the fast retransmit algorithm depends upon the sender getting three or more acknowledgments for the same segment, which suggests that the receiver lost that segment, but has also gotten at least some segments sent later. If the receive window is only set to two segments, then the sender will only be able to send two segments at a time, and the recipient will only get one of them, meaning that the recipient will only be able to issue one duplicate acknowledgment (and even that one duplicate acknowledgment will likely be delayed according to the Delayed Acknowledgment algorithm). Without the additional duplicate acknowledgments, the sender cannot utilize the fast retransmit algorithm, and will have to wait until the acknowledgment timer gets triggered, forcing a full timeout-based recovery.

In the end, the state of the virtual circuit will become very messy, although all of this could have been avoided by simply setting the recipient's receive window to four segments or greater. In fact, many systems set the default receive window size to 4 * MTU in an effort to avoid just these kinds of problems. You may need to go higher than this value, but you should not go lower without a very good reason.
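The whole sizing procedure (bits to bytes, rounding up to an even multiple of two segments, with a four-segment floor) can be sketched as follows; the function name and parameters are our own, chosen to reproduce the arithmetic described above:

```python
import math

def optimal_window(bits_in_flight, mss, floor_segments=4):
    """Convert a bits-in-flight figure to a receive window in bytes."""
    byte_value = math.ceil(bits_in_flight / 8)  # the Window field counts bytes
    unit = 2 * mss                              # always room for two full segments
    window = math.ceil(byte_value / unit) * unit
    return max(window, floor_segments * mss)    # never below four segments

# The T-1 with 120 ms of latency: 188,744 bits -> 23,593 bytes -> 26,280 bytes
print(optimal_window(188744, 1460))  # 26280 (1460 * 18)
```

Applying the same function to the satellite and modem figures reproduces the 4,288- and 5,840-byte windows as well.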

With that in mind, some example Window sizes are shown in Table 7-11.

Table 7-11. Window Sizes, Based on the Number of Bits-in-Flight

Bits-per-Second        Latency            Bits       Bytes     MTU         Window
19.2 KB/s (modem)      .120 seconds       2,360      295       1460 * 4    5,840
53 KB/s (modem)        .120               6,513      815       1460 * 4    5,840
64 KB/s (satellite)    .400 (normal)      26,215     3,277     536 * 8     4,288
384 KB/s (DSL)         .120               47,186     5,899     1460 * 6    8,760
1.5 MB/s (T-1)         .120               188,744    23,593    1460 * 18   26,280
10 MB/s (Ethernet)     .002 (local)       20,972     2,622     1024 * 4    4,096
100 MB/s (Ethernet)    .010 (multi-hop)   1,048,576  131,072   1460 * 90   131,400

This process should illustrate just how complicated the art of determining valid window sizes can be. However, by doing these calculations on a regular basis—particularly for the most-commonly accessed systems on your network—you can optimize the delivery times on your TCP network dramatically. For example, if your users are talking to a mail server that is many hops away on a 100 MB/s Ethernet LAN, it may be that large window sizes will improve throughput dramatically (although you may be better off with mail servers that were closer to the user, and thus had lower latencies).

Troubleshooting TCP

Most of the problems that users experience are related to application-specific issues rather than to TCP itself. However, since TCP offers such a wide breadth of services, there are many things that can go wrong with a virtual circuit over the course of a session.

Rejected Connections

The most common TCP failure is the inability of a client to connect to a remote system. This can be caused either by the client specifying a destination port number that does not have a listening application associated with it (such as trying to connect to a nonexistent web server), or by the listening server refusing to accept the connection due to a configuration issue. Figure 7-44 shows what a session looks like when the specified destination does not have a listening application associated with it, while Figure 7-45 shows what a session looks like when the destination server refuses to accept the connection.

As stated, the most likely cause for this error is that the destination port number specified in the active open is not active. In the example shown in Figure 7-44, the

Figure 7-44.
A connection being rejected because there's no listener

destination port number of 80 (the well-known port number for HTTP) was inactive, meaning that no web server was available on the destination host.

Notice that the destination system's TCP provider simply rejects the connection request. In the first segment, the client attempts to establish a connection, and the server responds with a Reset segment. This tells the client to just go away. The circuit never even gets established, with no Synchronize segment being returned.
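This rejection is easy to reproduce with ordinary sockets. The sketch below (an illustration, assuming no other process grabs the probed port in the interim) provokes the same Reset-based refusal:

```python
import socket

# Find a local TCP port with no listener: bind an ephemeral port, note
# its number, and close it again.  Connecting to it afterwards gets a
# Reset in reply to the Synchronize segment, and connect() fails at once.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

try:
    socket.create_connection(("127.0.0.1", dead_port), timeout=2)
    print("connected")  # not expected: nothing is listening
except ConnectionRefusedError:
    print("connection refused")  # the Reset-based rejection
```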

Figure 7-45 shows what happens when a destination application has been misconfigured. In that example, the virtual circuit gets established (as indicated by the completed Synchronize process), but the listening application aborts the connection before any data is transferred (immediately starting the circuit-shutdown process).

The most likely cause of this behavior is that the remote system is unwilling to provide services to this client, although this does not get discovered until after the connection has been established. In the example shown in Figure 7-45, the destination TELNET server's security rules were configured to not allow a connection from the client system. Although the TCP virtual circuit was started successfully, the TELNET server closed the circuit immediately by setting the Finish flag on the packet, indicating that the application was loaded and running in LISTEN mode, but that it did not want to talk to this particular client.

Figure 7-45.
A connection being accepted and then immediately terminated
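The accept-then-terminate behavior can also be sketched with ordinary sockets; here a throwaway listener completes the handshake and then immediately closes the circuit, so the client's first read returns end-of-file rather than data:

```python
import socket
import threading

# A listener that completes the handshake and then immediately closes
# the circuit, mimicking a server whose rules reject this client.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def reject_one():
    conn, _ = server.accept()
    conn.close()  # immediate Finish segment, as in the trace

worker = threading.Thread(target=reject_one)
worker.start()

client = socket.create_connection(("127.0.0.1", port))  # handshake succeeds
data = client.recv(1024)  # returns b"" as soon as the Finish arrives
worker.join()
client.close()
server.close()
print("circuit established, then closed; received %r" % (data,))
```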

Lost Circuits

Sometimes the connection is established just fine, but then the virtual circuit starts to collapse, with one side appearing to just disappear off the network. Most of the time, that's exactly what has happened: the remote endpoint has lost physical connectivity with the rest of the network for some reason, possibly due to power failure, link failure, or any number of other potential problems.

This scenario is illustrated in Figure 7-46, which shows an HTTP client on Greywolf requesting a document from the HTTP server on www.ora.com. After issuing the GET / request, Greywolf loses physical connectivity with the network.

Segments four through ten show the HTTP server on www.ora.com attempting to send the contents of the requested document back to Greywolf. However, since Greywolf is no longer live on the network, it is unable to acknowledge the data, nor is it able to issue a Finish or Reset segment back to the server. Since the server has not been told that the circuit has been torn down (since it hasn't been), it just assumes that the data has been lost, and continually tries to resend the questionable data.

Figure 7-46.
A connection getting dropped due to loss of carrier

What isn't shown in Figure 7-46 is that the time between the retry operations increases while www.ora.com continues trying to resend the data to Greywolf. Since Greywolf is not acknowledging the data, www.ora.com keeps trying to send the data, and keeps doubling the size of its acknowledgment timers. This process will continue for a while, until www.ora.com gives up and drops the connection. The number of times that www.ora.com will retry the operation, and the length of time between retransmissions, will be a function of the stack in use on that system. For more information on this subject, refer to Acknowledgment timers.
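The doubling behavior can be sketched as a simple backoff schedule; the one-second initial timer and six retries below are illustrative values, not any particular stack's defaults:

```python
def retry_schedule(initial_timeout, max_retries):
    """Each failed retransmission doubles the timer (exponential
    backoff).  The starting timeout and the retry count are per-stack
    tunables; the values passed in below are purely illustrative."""
    timeout, schedule = initial_timeout, []
    for _ in range(max_retries):
        schedule.append(timeout)
        timeout *= 2
    return schedule

# Starting from a 1-second timer with 6 retries:
print(retry_schedule(1, 6))  # [1, 2, 4, 8, 16, 32] -> gives up after ~63 s
```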

Partially Filled Segments or Long Gaps Between Sends

In order to maximize network utilization, TCP employs several algorithms that work towards filling the network with the most data, using the fewest segments. Among these mechanisms are the Nagle algorithm (as discussed in The Nagle algorithm), the silly window avoidance algorithm (as discussed in The Silly Window Syndrome), and delayed acknowledgments (as discussed in Delayed acknowledgments). However, sometimes these mechanisms can actually trigger delays on a virtual circuit, rather than preventing them.

Interactions between Nagle and delayed acknowledgments

The Nagle algorithm is used to prevent senders from transmitting unnecessary small segments by mandating that a small segment cannot be transferred until all of the previously sent segments have been acknowledged, or until a full-sized segment can be sent. Thus, if an application (such as an HTTP client) needs to send one-and-a-half segments of data, then the first segment will be sent immediately but the second (small) segment will not be sent until the first segment has been acknowledged.

However, delayed acknowledgments are designed to prevent excessive acknowledgments by holding them until either two full-sized segments have been received, a timer expires, or data is being returned to the sender (which the acknowledgment can then piggyback onto). In this scenario, the first full-sized segment described above would not get acknowledged immediately, since two full-sized segments had not been received, nor would there be any data being returned since not all of the application data had been received by the HTTP server as of yet. Instead, the acknowledgment would not be sent until the acknowledgment timer had expired, which could take as long as 500 milliseconds, depending upon the implementations and the characteristics of the virtual circuit in use on that connection.
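A minimal model of the two rules makes the stall easy to see. The 1,500-byte write, the 1,460-byte segment size, and the 500-millisecond timer below are assumptions drawn from the surrounding discussion, not fixed by any standard:

```python
MSS = 1460
DELAYED_ACK_MS = 500  # common upper bound; the real timer is stack-dependent

def nagle_may_send(pending_bytes, unacked_bytes):
    # Nagle: full segments go out immediately; a small segment may only
    # go out once everything previously sent has been acknowledged.
    return pending_bytes >= MSS or unacked_bytes == 0

def delayed_ack_now(full_segments_received):
    # Delayed acknowledgments: acknowledge immediately only after two
    # full-sized segments; otherwise wait for returning data or the timer.
    return full_segments_received >= 2

# An HTTP client writing 1,500 bytes: the first 1,460 go out at once...
assert nagle_may_send(1500, 0)

# ...but the 40-byte tail is held by Nagle while the server's delayed
# acknowledgment waits for a second full segment that will never come.
stalled = not nagle_may_send(40, 1460) and not delayed_ack_now(1)
print("stalled until the ~%d ms ack timer fires: %s" % (DELAYED_ACK_MS, stalled))
```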

Sometimes this will show up as an application protocol taking a long time to get off the ground, particularly when a connection has already been established between the two endpoints (this is most often seen with application protocols such as HTTP that generate requests larger than a single segment but smaller than two full-sized segments). At other times, this problem may manifest itself as a single trailing segment that appears to get held up for a long time (which can be seen with any TCP-based application).

The HTTP problem is illustrated in Figure 7-47. In that example, the HTTP client is sending 1,500 bytes of data to the HTTP server, but because the client is using the Nagle algorithm, the forty bytes of overflow data are delayed until the first segment has been acknowledged. But since the server is using the delayed acknowledgments mechanism, it is waiting for two full-sized segments before returning an acknowledgment, which causes the overflow data to be held up on the HTTP client.

When these events occur, the result is long pauses between the first and second data segments. The only real cure for this situation is to disable the use of the Nagle algorithm on the system that is sending the one-and-a-half segments of application data, although in all likelihood you will need to have access to the application source code to make this change.

Figure 7-47.
A negative interaction between the Nagle algorithm and delayed acknowledgments

Once the Nagle algorithm is disabled on an application, the sender can transmit the small segment immediately, thereby causing all of the data to get to the remote endpoint immediately, where it can be processed. The remote system is likely to return some sort of data as a result of receiving all 1500 bytes of data (such as a requested web page), which the delayed acknowledgment could ride on.

This process is shown in Figure 7-48. In that example, the client has disabled the use of the Nagle algorithm and is now free to send small segments to the server without waiting for acknowledgments. As a result, the HTTP server gets the entire HTTP request quickly, and is able to return the data (and the acknowledgment) quickly as well.

If you notice long delays with these kinds of applications, check to see if the client is sending more data than will fit within a single segment (but not so much data as to fill two fully sized segments). If so, then you may be able to disable the use of the Nagle algorithm on the client, which will allow it to send the small segment immediately. The server can then bundle the acknowledgments for those segments with whatever data is going to be returned by the application protocol, using the delayed acknowledgment mechanism to its advantage.

Figure 7-48.
Flow control with the Nagle algorithm disabled

These problems can occur in the other direction as well. In some cases, the web server may be sending more data than can fit within a single segment, but not enough to fill two full-sized segments. In those cases, the server may also be suffering from the use of the Nagle algorithm. However, you should be much more cautious about disabling the use of the Nagle algorithm on servers, since they also tend to send large amounts of data (such as GIF images) that benefit from clustering small blocks of data into a single transmission. As a result, disabling the use of the Nagle algorithm on a server may cause other problems due to a large number of small segments.

It is important to realize that this problem occurs only with applications that frequently generate writes larger than one but smaller than two full-sized segments of data. Application developers should not disable the Nagle algorithm if they are writing many small blocks of data, since those writes should be clumped together, thereby preserving network utilization.

Sometimes you can tell the size of the block of data that an application is writing by looking for the Push flag, since this flag is often set on every application write. For example, if an application is writing data to TCP in four-kilobyte chunks, then there may be a Push flag at the end of every four kilobytes of data sent via TCP. Note that this method is not foolproof, since some TCP stacks will set the Push flag on every segment that they send, but it can be used with many implementations.

Wrong MTU sizes

Another set of performance problems can occur whenever a system chooses to delay an acknowledgment until two full-sized segments have been received, but the system is receiving segments that are smaller than the system's local MTU. In this situation, the sender may send many segments, which the recipient does not acknowledge.

This particular scenario most often happens when two devices announce large MTUs during the circuit-setup process (using the MSS option), but the sender then determines that a smaller end-to-end MTU is required in order to transfer data to the recipient (this is typically detected using Path MTU Discovery). The sender will therefore only transmit small segments to the recipient, which will not return an acknowledgment until enough data has been received to fill two full-sized segments (according to its understanding of the end-to-end MTU, which still reflects the local MTU).

This introduces all kinds of performance problems during slow start and congestion avoidance, since the sender's congestion window will only allow a few segments to be sent before waiting on an acknowledgment to arrive. Eventually the recipient's delayed acknowledgment timer expires, and it sends one acknowledgment for all of the data that it has received, thereby allowing the sender to increment its congestion window and resume sending data, resulting in a very bursty ramp-up process.

However, it is important to realize that this scenario only happens when the sender's Path MTU Discovery routines detect a smaller end-to-end MTU than the size announced by the MSS option. If both endpoints are using the same MTU for the local and intermediary networks, then this problem will never occur. But if the MTU of the intermediary path is smaller than the MTU of the two endpoints, then it will likely occur quite frequently.

For example, if the two endpoints both have MTUs of 1500 bytes (common for Ethernet), but an intermediary path is only providing an MTU of 536 bytes (somewhat common on older leased lines and dial-up circuits), then the recipient may continually delay acknowledgments, since it is only receiving what it believes to be small, undersized segments. This problem also occurs frequently with systems that use Token Ring, FDDI, or other large MTU sizes over a 1500-byte intermediary network (1500 bytes is the most common MTU found on the Internet).
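Assuming the recipient's threshold is still two of the segments it believes to be full-sized, the number of undersized segments needed to trigger an immediate acknowledgment can be estimated:

```python
import math

def segments_before_ack(believed_mss, path_mss):
    """How many undersized segments must arrive before the recipient's
    'two full-sized segments' threshold is met, when the recipient still
    believes the MSS negotiated at circuit setup but Path MTU Discovery
    has shrunk the sender's segments.  A simplified model; real stacks
    also fire a delayed-acknowledgment timer."""
    return math.ceil(2 * believed_mss / path_mss)

# Both ends announced a 1460-byte MSS, but the path only carries 536 bytes:
print(segments_before_ack(1460, 536))  # 6 segments per acknowledgment
```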

Uneven MTU multiples

Another interesting opportunity for problems occurs when the receive window size being advertised by a recipient is not an even multiple of the MTU in use on the virtual circuit. This can cause the Silly Window Avoidance algorithm to introduce transmission delays. As you may recall, the Silly Window Syndrome can occur when the recipient's TCP stack advertises a small window size, resulting in the sender transmitting a few bytes (to fill the advertised space), with the process repeating ad infinitum. In order to prevent this behavior, RFC 1122 stated that a system could only advertise a non-zero window if the receive buffers could store one fully sized segment.

However, this becomes a problem when the receive buffers on a system are not an even multiple of the MTU in use on the network, or when the application attempts to write data in blocks that are not an even multiple of the MTU. For example, if a system has a default receive window of four kilobytes but is receiving 1460-byte segments, then the sender can only transmit three fully sized segments, since the Nagle algorithm delays the fourth segment until an acknowledgment arrives for the third segment, or until a fully sized segment can be sent. However, as noted above, the recipient will not issue an acknowledgment until it has received two fully sized segments, or until a timer expires.

As such, the sender sends only three segments of data, and then stops to wait for an acknowledgment before sending the final small segment (the Nagle algorithm at work). Once the acknowledgment for the first two segments has been received, the sender transmits the fourth segment (which may or may not result in an acknowledgment from the remote system, depending on whether the recipient issues acknowledgments for every two segments that it receives, or only for every two full-sized segments it receives). The result is a fast send of three segments, followed by a long pause for the final (small) segment.
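The window arithmetic in this example can be checked directly; the figures are the assumed ones from the text (a 4 KB window and a 1460-byte MSS), not measurements:

```python
# Arithmetic check of the window/MSS mismatch described above.
window = 4 * 1024   # advertised receive window, in bytes (assumed)
mss = 1460          # maximum segment size, in bytes (assumed)

full_segments, leftover = divmod(window, mss)
print(full_segments)   # 2 -- only two full-sized segments fit
print(leftover)        # 1176 -- carried in a smaller trailing segment
```

Because the window is not an even multiple of the segment size, some portion of every window's worth of data must travel in a small trailing segment, which is what triggers the delays described above.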

Most TCP implementations attempt to avoid this problem by forcing the receive window to be at least four times the size of the local MTU (as discussed in Notes on Determining the Optimal Receive Window Size). However, this is still a problem with applications that write data in blocks that are not an even multiple of the end-to-end MTU (which matters more than the local MTU). In an effort to prevent this from occurring, some implementations (such as BSD) will not delay sending the final segment from a write; implementations that do not take this precaution can still run into the problem.

Small send windows and excessively delayed acknowledgments

A variety of performance problems can also occur when the sending system is using a small send window while the recipient is using a large receive window and is delaying acknowledgments for too long. Although this requires an odd mixture of conditions, it occurs often enough to be a genuine problem.

For example, some server-class systems intentionally delay acknowledgments beyond the usual every-other-segment rule in an attempt to reduce network utilization and overhead, with some systems delaying acknowledgments until 50% or 60% of the receive window has been filled. When this mechanism is used to exchange data with other large-scale systems on a half-duplex network, the long gaps between acknowledgments allow the sender to move more data in less time. However, when this model is used with systems that have small send windows (as is common with PC-based implementations), it can trigger serious performance problems.

Although the vast majority of send activity comes from server-class systems with large send windows (such as HTTP servers, or POP3 mail servers), a lot of network traffic also gets generated by PC clients that are sending mail messages with large binary attachments to an SMTP server, or who are uploading large data files to an FTP server, and so forth. Typically, these systems do not have very large send windows (most of them only have send windows that are four times the local MTU).

When these systems try to send data to a system that is delaying acknowledgments until half of the receive window has been filled—and when the receive window on that system is substantially larger than the sender's send window—the sender will have to stop transmitting once the send window has been filled with outstanding, unacknowledged data. If the recipient is not returning acknowledgments for that data (due to an excessively long delay timer), then this will result in very bursty traffic, with the sender transmitting four or so segments, and then stopping until the remote system's acknowledgment timer expires.

Excessive or Slow Retransmissions

One of the most important aspects of TCP's reliability service is the use of acknowledgment timers (also called retransmission timers), which ensure that lost segments get recognized as such in a timely manner, resulting in the quick retransmission of that data. However, as was pointed out earlier in Acknowledgment timers, accurately determining the correct timer values to use on any given virtual circuit is a complex process. If the value is set too low, then the timer will expire too frequently, resulting in unnecessary retransmissions. Conversely, if the value is set too high, then loss may not be detected quickly, resulting in unnecessary delays.

On some systems, you may notice that there are a high number of excessive retransmissions on new circuits (indicating that the default timer is set too low for those virtual circuits), or you may see very long gaps in between retransmissions (indicating that the default acknowledgment timer is set too high). Although these values are typically changed on a per-circuit basis as the smoothed round-trip time is calculated, the default settings can be problematic on new connections, where the smoothing has yet to begin in earnest.

For example, if the default acknowledgment timer is set to 3000 milliseconds, then the system may not detect a lost segment until three full seconds after the segment was sent. This would obviously cause problems with applications that only send one or two segments at a time (such as mail and web clients, that only send a little bit of data). The result would be a very long pause, followed by a retransmission, which may or may not succeed.

Conversely, if the default acknowledgment timer is set to 200 milliseconds, but you are connecting to a site that is on a very slow link, then you may see the same segments getting sent multiple times. Although this does not penalize applications that only send a couple of segments at once (mail and news clients, for example), it would be quite annoying to those applications that send many segments (such as an FTP upload, or a disk-sharing protocol), since those applications would be constantly resending data, at least until the smoothed round-trip time increased sufficiently.
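The smoothing mentioned above is typically done with Jacobson's algorithm (required by RFC 1122); the sketch below uses the common textbook constants (1/8, 1/4, and a factor of four), while real stacks add clamping and coarse timer granularity:

```python
# Sketch of the smoothed round-trip time estimator that drives the
# retransmission timer.  All values are in milliseconds.
def first_sample(rtt):
    """Initialize the estimator from the first measured round-trip time."""
    srtt, rttvar = rtt, rtt / 2.0
    return srtt, rttvar, srtt + 4 * rttvar

def update(srtt, rttvar, rtt):
    """Fold a new round-trip time measurement into the smoothed estimate."""
    rttvar = 0.75 * rttvar + 0.25 * abs(srtt - rtt)
    srtt = 0.875 * srtt + 0.125 * rtt
    return srtt, rttvar, srtt + 4 * rttvar

# Until the first acknowledgment is measured, the *default* timer governs
# retransmission -- hence the problems described above.  After a single
# 100 ms sample, the timer settles near 300 ms:
srtt, rttvar, rto = first_sample(100.0)
print(rto)   # 300.0
```

This is why the default value matters so much: until enough acknowledgments have been measured, the stack retransmits on a schedule that may bear no relation to the actual path latency.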

Most operating systems allow you to define the default value for the acknowledgment timers in use on those systems. You should determine the most appropriate default timer according to the typical connection scenarios in use on your network (i.e., use ping or traceroute to determine typical latency times), and then refer to the system documentation to see how to set the default acknowledgment time to those values.

Slow Throughput on High-Speed Networks

The author has seen poorly written network drivers cause significant throughput problems on TCP/IP networks, particularly with regard to buffer management on the receive queue. In those instances, the sending and receiving systems are never able to get beyond the slow start ramp-up activity, as the recipient is unable to acknowledge more than a couple of segments at a time, due to poor buffer management within the device driver itself.

In this scenario, the sender transmits four segments as part of the slow start algorithm, but the recipient only returns an acknowledgment for the first two segments (or more accurately, fails to acknowledge the third and fourth segments). The sender will interpret this behavior as network congestion and reduce the size of its congestion window to one segment. This process will repeat ad infinitum, with the sender never getting beyond a couple of segments. Unfortunately, the only way to resolve this problem is to either replace the recipient's network driver, or to replace the network card entirely.
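This failure mode can be illustrated with a toy model (not any particular implementation): under slow start the congestion window doubles each round trip, but collapses back to one segment whenever the broken driver fails to acknowledge the burst; the two-segment threshold below is an assumed value for illustration.

```python
# Toy model of the slow-start collapse described above.
def next_cwnd(cwnd, acked_all):
    """One round trip: double the congestion window on success,
    reset it to one segment on an apparent loss."""
    return cwnd * 2 if acked_all else 1

cwnd = 1
history = []
for _ in range(6):
    # The buggy receiver fails to acknowledge bursts larger than
    # two segments (an assumed threshold for illustration).
    cwnd = next_cwnd(cwnd, acked_all=(cwnd <= 2))
    history.append(cwnd)
print(history)   # [2, 4, 1, 2, 4, 1] -- never gets past a few segments
```

The window cycles endlessly between one and four segments, so throughput stays a small fraction of what the network could carry.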

A related set of problems is discussed earlier in Partially Filled Segments or Long Gaps Between Sends.

Lots of Reset Command Segments

A high number of TCP command segments with the Reset flag enabled can indicate a variety of things, although typically it boils down to the recipient getting a segment that is apparently not intended for the current connection, as described in RFC 793. However, RFC 793 also goes on to state that "a reset must not be sent if it is not clear that this is the case," leaving it up to the recipient to make the decision.

For example, Reset segments will be sent whenever a remote endpoint attempts to establish a connection to a non-existent socket on the local system. If a web browser tries to establish a connection to port 80 (the well-known port number for HTTP), but there is no server listening on that port, then the local system's TCP stack should return a Reset segment in response to the incoming Synchronize requests.
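This behavior is easy to observe with a short Python sketch (the host and port below are arbitrary examples): a connection attempt to a closed port draws a Reset, which the sockets API surfaces as a refused connection.

```python
# Probe a TCP port and classify the result.  A "refused" result means
# the remote stack answered our Synchronize segment with a Reset.
import socket

def probe(host, port, timeout=2.0):
    """Return 'open', 'refused' (a Reset came back), or 'filtered'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"        # the peer sent a Reset segment
    except OSError:
        return "filtered"       # no answer at all, or some other failure

# Port 9 (discard) is rarely served, so this will usually print "refused".
print(probe("127.0.0.1", 9))
```

A "filtered" result, by contrast, usually means a firewall silently dropped the Synchronize segment rather than letting the stack answer with a Reset.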

Reset segments can also be sent if the local socket is no longer valid for a previous connection. In that case, the local application has completely closed its end of the connection, but the remote system is still sending data. When those segments arrive, the local TCP stack should reply to each of them with a Reset segment. This can happen when the remote endpoint fails to close its end of the connection after the local system has sent the requisite circuit-termination segments (using the Finish flag), and can also occur if the virtual circuit had to be destroyed due to an excessive number of retransmissions. In both of those cases, the connection has fallen apart and was terminated abruptly, so any additional segments received for that virtual circuit must be refused.

Generally speaking, applications are supposed to close their connections gracefully, so that particular scenario should only happen when fatal errors occur. However, some applications deliberately close their ends of the virtual circuit abruptly in an attempt to boost performance, rather than closing the connection gracefully. For example, some HTTP servers don't try to close their virtual circuits gracefully, but instead simply abort the connection once they've sent all of the requested data. The theory behind this practice is that it is faster to terminate the connection than to go through the laborious task of exchanging shutdown segments. However, this also means that the client will not be able to request any data that was lost, since the server has closed the connection without waiting for acknowledgments to arrive for all of the data that was sent. Any subsequent requests for retransmission will just get rejected, resulting in a stalled client. Welcome to the World Wide Wait!

Another situation where resets can occur is if an application has crashed, leaving TCP thinking that the socket is still valid, although nothing is servicing the receive queues associated with that port number. In that situation, TCP may accept data on behalf of the application, until the buffer is full. If the queue never gets serviced, then eventually TCP should start issuing Reset segments for any new data that it receives, while continuing to advertise a zero-length window.

Weird Command Segments

As the Internet has gained in popularity, it has attracted people of every ilk, not all of whom have the best intentions. In particular, over the past few years there has been a large increase in the number of hackers who have taken a strong liking to TCP/IP (many of whom are probably reading this book), and who are using network probing tools to discover the layout of your network and the weak points on your servers. If you see weird-looking command segments, then the chances are good that your network is being probed by one of the commonly available programs that are the tools of the trade for these users.

Figure 7-49 shows what one type of probe looks like, with Greywolf sending a lamp-test segment to Weasel in an effort to discover the operating system in use on that host. This segment has the Urgent, Push, Synchronize, and Finish flags enabled. The way in which Weasel responds to this illegal segment can be used as

Figure 7-49.
A lamp-test segment

a clue toward discovering the operating system in use on that host (although this information has to be combined with many other such probes). As can be seen from the next segment, Weasel unfortunately responded to this command segment with an Acknowledgment and Synchronize segment of its own, allowing the connection to continue. It certainly should not have done so, given the Finish flag in the original command segment.

Other types of probe segments can consist of simple half-opens, whereby the hacker is testing to see whether or not a server is listening on a known port, or the false-reset segment where a command segment is sent with just the Reset flag enabled, even though no connection has yet been established.

When you see these types of segments, you should examine your security infrastructure for any holes that these segments may be exploiting. A good firewall should block most of these probes.

Path MTU Discovery-Related Problems

Over the years, there have been many implementation-specific problems resulting from the use of Path MTU Discovery with TCP.

For a comprehensive discussion on this subject, refer to Notes on Path MTU Discovery in Chapter 5, The Internet Control Message Protocol.

Misconfigured or Missing Services File

You should verify that the services file in use on your system matches the well-known port numbers expected by the applications you are using. Some applications will ask the system for the port number associated with a service name (SMTP, for example), and if your system's services file does not have an entry for that service, the lookup will fail and no port number will be returned to the application.

This will prevent the client from being able to send any data, since it cannot get the destination port number for the application.

To see the well-known ports used on your system, examine the /etc/services file on a Unix host, or the C:\WinNT\System32\Drivers\Etc\SERVICES file on a Windows NT host.
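On most systems this lookup is exposed through the standard resolver interface; the short Python sketch below shows both the successful case and the failure described above (the service name is just an example):

```python
# Look up a well-known port the same way applications do: through the
# services database (/etc/services on Unix, SERVICES on Windows NT).
import socket

try:
    port = socket.getservbyname("smtp", "tcp")
    print(port)    # 25, if the services file has the usual smtp entry
except OSError:
    # This is the failure described above: no entry, no port number,
    # and the application cannot open its connection.
    print("no services entry for smtp/tcp")
```

If the except branch fires for a service your applications need, add the missing entry to the services file and try again.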

Miscellaneous Interoperability Problems

A variety of interoperability problems with different vendors' TCP implementations have been documented. Some of these problems have been seen in the sample captures shown in this chapter (such as Greywolf failing to use slow start, as shown in Bulk Data Transfer and Error Recovery).

For more information on these problems, refer to RFC 2525, entitled Known TCP Implementation Problems.




Internet Core Protocols: The Definitive Guide
ISBN: 1565925726
Year: 1999
Author: Eric Hall