3.5 Networking

Handling file I/O efficiently is the most significant area for performance tuning of email servers. Second on the list is examining the system's networking. Anyone who has been an Internet email administrator for any length of time has experienced the effects that network slowdowns or outages can have on an email server, so the fact that networking is important in email will come as no surprise. What may be more surprising is that less obvious networking issues may significantly affect service performance.

At one site, I spent a great deal of time talking with a customer about ways to improve email system performance. We discussed many of the issues mentioned in this book concerning improving the system's I/O capability and such. Then the client mentioned something that really caught my attention: The company was experiencing about 10% packet loss rates on the LAN that connected its email server to its customer base. This rate shocked me, but in retrospect I'm at least partially to blame for not exploring the fundamentals of the system before spending time talking about NVRAM, advanced file systems, and efficient queue rotation.

When building a house, it makes no sense to expend effort on exquisite interior craftsmanship if the foundation won't hold up. The same principle holds for providing networked computing services, as it does for any endeavor. The first thing to provide is a solid physical foundation for the systems, which means quality facilities: the machines need clean power, appropriate environmental regulation, a clean location, and physical safety. One cannot build reliable services without providing these basic necessities, and one cannot build high-performance systems without first providing reliable services. For a large ISP, this effort clearly entails building a quality data center. For a smaller company, an office converted to a machine room with a tile floor, extra cooling, and a UPS may suffice. These conditions are all too often neglected, even though they are the very foundation of dependable computing services.

The next step is to provide quality networking. This does not mean over-engineering. For any mail server experiencing performance problems, at the very least I'd recommend a switched, full-duplex connection; even 10 Mbps might be sufficient, if that's enough to handle the load. It's also not adequate to check only the local LAN(s) on which the server resides. Instead, one must thoroughly understand the entire network between the server and the Internet, as well as between the server and whatever internal machines the server communicates with, whether those are other email servers or end users' systems.

Network load on all relevant networks should be monitored regularly. Setting up automated processes to gather this information is invaluable, and tools that analyze the data for warning signs will, in many cases, alert the email administrator before problems become severe. Any full-duplex or other reserved-bandwidth network (such as FDDI or Token Ring) that regularly runs at greater than 60% utilization for an hour per day is a candidate for upgrading. Any shared bus network (e.g., half-duplex Ethernet) that regularly sustains a 30% load is also a candidate for upgrading.
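
These utilization thresholds lend themselves to a simple automated check. Below is a minimal sketch; the function names, the one-hour sample, and the traffic figures are illustrative, and a real monitor would read interface counters via SNMP or from the operating system:

```python
# Flag networks whose sustained utilization suggests an upgrade, using the
# rules of thumb above: 60% for full-duplex or other reserved-bandwidth
# links, 30% for shared (half-duplex) media.

def utilization(bytes_transferred: int, interval_s: float, link_bps: float) -> float:
    """Fraction of link capacity consumed over the sample interval."""
    return (bytes_transferred * 8) / (interval_s * link_bps)

def upgrade_candidate(util: float, shared_bus: bool) -> bool:
    threshold = 0.30 if shared_bus else 0.60
    return util >= threshold

# Example: 1.5 GB moved in one hour across a 10 Mbps half-duplex segment.
util = utilization(1536 * 1024 * 1024, 3600, 10_000_000)
print(f"utilization: {util:.1%}, upgrade: {upgrade_candidate(util, shared_bus=True)}")
# → utilization: 35.8%, upgrade: True
```

Note that the same hour of traffic on a full-duplex link of the same speed would fall below the 60% threshold, which is one reason the two media types warrant different rules of thumb.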

Any shared bus network, or any machine on such a network, that sees 10% packet retransmission rates (whether due to errors or collisions) should prompt immediate concern. On a full-duplex network, packet retransmission rates of even 1% usually indicate that something is wrong. If a full-duplex network suffers heavy packet loss and dropping to half-duplex is an option, it may be better to run well at half-duplex than to perform badly at full-duplex. Ultimately, however, if a network operates poorly at full-duplex, it needs to be fixed. When configuring Ethernet interfaces that should operate at full-duplex, it's usually best to configure them explicitly as full-duplex rather than letting them autonegotiate. Some devices from different vendors still have problems locking in full-duplex mode; in some cases, after a transient network problem, they'll renegotiate down to half-duplex and stay there. This can cause difficult-to-diagnose problems if one isn't alert to the possibility.

3.5.1 Network Interface Cards

Network interface cards (NICs) can be a surprisingly important area of consideration in the design of email servers. Almost any 10 Mbps card is good enough for use. In fact, at the time of this writing, it's difficult to find a new 10 Mbps-only card to purchase. Certainly, quality 10/100 cards are cheap enough these days that one can't rationalize the use of an old 10 Mbps card to save money, especially if email performance might be an issue.

The quality of 100 Mbps Ethernet cards has increased markedly over the last several years. In terms of Internet service, the primary difference in quality between cards is the amount of memory available on the card for buffering data. Lack of buffer space can mean more work for the processor and more packet retransmissions, both of which can affect performance. Typically, cards that carry a "server" designation have more buffer space than cards designed for workstation use, some of which can't even sustain their peak data transfer rate. It's difficult to get information about the real difference between a vendor's regular offering and its "server" card, but physical inspection of the NIC will often reveal the extra memory via the quantity and capacity of the chips on the card. If it does, the marginal cost the server card adds to the total cost of the server would likely be worthwhile.

The contrary view is that (1) these days it's easy to turn out a good 100 Mbps Ethernet card that can easily fill its available pipe, so there's no reason to overbuy, and (2) even if the card doesn't have adequate buffer space, it primarily costs the server some of its CPU capacity to compensate for this shortfall, and the average email server has more CPU capacity than it knows what to do with anyway. These are valid arguments. If an inexpensive card works well, I would be the last person to recommend a change. Some of the cost of a "server" card may be diagnostic capability, which is worthwhile if it can be used. If this extra information cannot be accessed, then clearly it isn't worth the extra cost.

One area in which I still regularly see subpar Ethernet NICs is the interfaces built into system motherboards. Unless the system's networking will not be taxed or the on-board interface is of especially high quality, I recommend using supplemental networking cards for Internet service and relegating the on-board NICs to monitoring or command-and-control networks.

These days, good 100 Mbps cards are fairly routine, but NIC quality remains very important for Gigabit Ethernet interfaces. If a server handles enough email to at least occasionally saturate a full-duplex Fast Ethernet connection, gigabit networking is a necessity. Unfortunately, as recently as three years ago, few Gigabit NICs could be driven at their claimed speed by any computing platform in a test lab, much less under real-world conditions. Of course, few email servers need this sort of speed. Anyone pushing beyond the 100 Mbps barrier must be very concerned about how much CPU effort it takes to drive the card, no matter how large a SPECint_rate2000 number the server might generate. Card quality in this regime is critical, and careful tests should be performed to find one that delivers the needed performance with minimal impact on the server's processing capability.

One feature of Gigabit Ethernet that reduces network overhead in many situations is the optional jumbo frames feature, which allows large packets, around 9,000 bytes in size, rather than the traditional 1,500-byte Ethernet packets. Increasing the packet size can greatly reduce the amount of processing needed to transfer a fixed number of bytes between two servers. In an email context, if the client, the server, and all intermediate networks support jumbo frames, the processing overhead of moving the data drops considerably. Unfortunately, this is rarely the case. In the real world, few POP clients have a Gigabit Ethernet path all the way to their POP server, and certainly we can't expect an SMTP server connecting to the general Internet to handle frames of this size.

3.5.2 Different MTUs in the Network

The Internet comprises networks with different maximum packet sizes, and this mismatch is handled in two ways. First, packets can be fragmented by routers as needed along the way. This is easy on the sending host but requires the router to perform extra work. Second, the sender can determine the largest packet size that every network between the two hosts can handle by sending large packets and receiving messages back from intermediaries that can't handle them. This process, called "Path MTU Discovery," is documented in RFC 1191 [MD90], which eloquently states the benefits of this method over the alternatives. However, every packet sent to a remote host that is larger than the MTU of some intermediate network must be re-sent at least once. This extra overhead might be avoided by setting the network interface's MTU to a more modest size, such as 1,500 bytes.

A big downside associated with Path MTU Discovery bears mentioning. Servers using this protocol typically send IP packets with the "don't fragment" bit set. When a router cannot forward such a packet to the next hop without fragmenting it, it returns an ICMP "Destination Unreachable" message (with the code meaning "fragmentation needed and DF set") to indicate that the packet cannot be processed. RFC 1191 specifies that the MTU of the next-hop network be encoded in this ICMP packet. That way, the originating server knows what size packet to use when re-sending, and it can iteratively determine the largest packet size that can be transferred without fragmentation.
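
The iteration RFC 1191 describes can be sketched as a simple loop; the hop MTUs below are invented for illustration:

```python
# Sketch of Path MTU Discovery: send with "don't fragment" set, and on an
# ICMP "fragmentation needed" response, retry at the next-hop MTU it reports.

def discover_path_mtu(hop_mtus: list, initial_mtu: int) -> int:
    mtu = initial_mtu
    while True:
        # Find the first hop whose MTU is too small for our current packet size.
        bottleneck = next((m for m in hop_mtus if m < mtu), None)
        if bottleneck is None:
            return mtu  # the packet fits everywhere: path MTU found
        mtu = bottleneck  # RFC 1191: the ICMP reply carries the next-hop MTU

# A path whose middle hop only handles 1,006-byte packets:
print(discover_path_mtu([1500, 1006, 1500], initial_mtu=9000))  # → 1006
```

Each reduction in the loop corresponds to one packet that had to be re-sent, which is exactly the extra overhead mentioned above.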

A problem occurs when a router between the two communicating hosts blindly discards ICMP packets. In this case, the sending host doesn't know what to do. It never receives an acknowledgment for the packet it sent, yet no ICMP Destination Unreachable packet is returned to inform it that something went wrong. A common manifestation of this problem is the successful transmission of very small email messages, while larger messages (perhaps totaling more than some "magic" size, such as 536 or 1,500 bytes) can't get through.

This situation is not as uncommon as it ought to be. In the name of security, a significant number of sites have configured packet filters on their routers or firewalls to reject all ICMP packets. This cure might seem worse than the disease, and I certainly don't endorse the strategy, but it's not surprising that some sites choose it. Many fears concerning ICMP-based network attacks, especially distributed denial-of-service (DDoS) attacks, are justified. Path MTU Discovery is a relatively esoteric part of the IP protocol, and a large percentage of the people responsible for implementing network security policy around the Internet have likely never heard of it, so they don't understand why they need to support it. The policy may be wrong, but in this day and age it's understandable.

If an organization experiences problems getting email through to other sites, and Path MTU Discovery is the culprit, it's probably best to disable this feature on the affected servers. Of course, if a site's own network causes the problem, then the obvious fix is to allow ICMP Destination Unreachable packets through to the email servers. While offloading the effort involved in packet fragmentation may decrease the load on email servers, any benefit will be very slight, if it's even measurable.

3.5.3 Kernel Networking

Beyond a certain tuning threshold, once bottlenecks in filesystems, storage, and networking have been removed, email service largely comes down to just moving ones and zeros as fast as possible. Essentially, providing email service (like all Internet services) is a matter of taking bits off the network and putting them on disk, then taking them off disk and putting them back on the network. The faster this transfer occurs without sacrificing integrity, the better. After tuning the I/O system, another significant factor in determining how rapidly data can move is the number of copies performed on each piece of data as it moves through the system. Clearly, much less overhead is associated with writing the data into memory once and then passing it around by reference than with making copies of the data as it moves between applications, different parts of the kernel, and between main memory and devices.

Zero- and one-copy IP implementations within operating systems have been a rich area of research in recent years; a good example may be found in [RAC97]. Running on the same hardware, the operating system that performs the fewest data copies per network transaction would be expected to move data fastest. Even if the operating system is relatively efficient in this regard, email applications run in user space, so data moving between user and kernel space are typically copied rather than mapped; the kernel doesn't want to give applications broad access to its data buffers, for good reason. Each user/kernel space boundary transition generally requires a copy. For example, if we send data between two programs via a pipe, the data are not mapped between the two applications. Rather, data are copied from the user space of the originating process into the kernel, then copied again into the buffers of the target process, resulting in two data copies rather than the expected one copy (directly between processes) or zero copies (as would be the case if the applications used mmap() and shared memory to communicate the same data).
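
The pipe-versus-shared-memory contrast can be demonstrated from user space. The sketch below, using only Python's standard library, shows the two communication styles; it illustrates the programming models, not the kernel's internal copy count, and the message data is made up:

```python
import mmap
import os

# A short, made-up chunk of message data for illustration.
data = b"Received: from mta.example.com by mx.example.net"

# Pipe: the kernel copies the data in on write() and out again on read(),
# so two copies occur even though both ends live on the same machine.
r, w = os.pipe()
os.write(w, data)
assert os.read(r, len(data)) == data
os.close(r)
os.close(w)

# Anonymous shared mapping: cooperating processes (e.g., after a fork)
# touch the same physical pages, so no copy is needed to "move" the
# data once it has been written.
shared = mmap.mmap(-1, len(data))
shared[:] = data
assert shared[:] == data
shared.close()
print("both transfers verified")
```

In a real application the shared mapping would be established before forking the cooperating process, and some synchronization (a semaphore or pipe-based signal) would tell the reader when the data are ready.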

Note, however, that if the tall tent pole when it comes to email server performance tuning is kernel data copying, then tuning of the system has already gone above and beyond the call of duty. Nevertheless, even on the same piece of hardware, email applications running on different operating systems may perform differently. All operating systems are not created equal when it comes to moving data quickly. It's more than fair to consider success at raw data-moving tests a significant factor when evaluating which operating system should be used for email, or any other high-volume Internet service.

Data copying between applications and the kernel, and between the kernel and NICs, generally consumes very little of an email server's overall effort. Nonetheless, some research efforts and products aim to minimize this effort, and they're worth mentioning. Data copies occur not merely between an IP stack and a NIC, but also within the kernel and between applications and the kernel, so eliminating just one of these sources of copies wins only part of the battle. For this reason, network I/O systems that reside in userland and have direct access to both NICs and applications are increasing in popularity. With this approach, data coming in from a network card are written directly into a reserved section of memory and then passed around by reference. No kernel/user space boundary issues arise, because that boundary is never crossed. The most widely known solution along these lines uses Myrinet [BCF+94], an ultra-high-speed, ultra-low-latency network plus software that interfaces with applications to eliminate many of their data copies. Another promising area of research with similar goals is the Virtual Interface Architecture (VI) [VI97]. The principal downside of these solutions is that they must be deployed on both sides of the network link, an unlikely circumstance in a general-purpose internetwork. Nonetheless, the key ideas underlying these intriguing technologies are slowly working their way into traditional solutions.

3.5.4 Bandwidth and Latency

One argument that nearly always surfaces when discussing network performance tuning is the question of bandwidth versus latency. Briefly, bandwidth measures the total capacity of a network between two given points without regard to anything else. Latency measures the amount of time consumed in the transmission of a single message (which can be a packet, a three-way TCP handshake, or an email message) from the time the initial connection is made to the time the destination receives the last bit of data.

In the context of an email server, bandwidth is nearly always what we're interested in; latency is of lesser concern. This is not to say that delays in email are no cause for alarm, but rather that the sorts of network-induced latencies that email server tuning can influence almost never result in a measurable delay.

For example, in any except the most geographically dispersed or heavily loaded corporate network, the IP packet round-trip time between an email gateway and the gateway router that links it to the network service provider's network will be less than 5 ms, and probably no more than 1 ms. The round-trip time between my email server and a "random" selection of three large ISP mail servers, one large university email server, and two Fortune 500 corporate email servers (all within the United States) varies between 20 ms and 90 ms, with a median of about 70 ms.
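
Simple arithmetic with the figures just quoted makes the proportions plain:

```python
# Local, tunable latency as a fraction of a typical end-to-end round trip,
# using the round-trip figures quoted above.
local_rtt_ms = 1.0              # email gateway to its gateway router
remote_rtts_ms = [20, 70, 90]   # observed Internet round trips (ms)

median_remote = sorted(remote_rtts_ms)[len(remote_rtts_ms) // 2]
print(f"local share of the median RTT: {local_rtt_ms / median_remote:.1%}")
# → local share of the median RTT: 1.4%
```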

Clearly, the latencies that are under my control are dwarfed by those that are not. Moreover, even if the latency between my email server and my gateway router grew to 100 ms, this interval would still be too small for a person to notice the extra time a message took to reach its destination. On these time scales, latency/bandwidth trade-offs matter only in a performance context. They remain essentially negligible from a human standpoint, unless an email server becomes so badly overloaded that delays are measured in seconds or longer, at which point the noticeably high latencies are themselves a symptom rather than the root problem.

In terms of Internet services, as long as delays remain short enough to avoid notice by people, increasing latency on email gateways primarily means that individual SMTP sessions take longer, which means that more processes run on that machine and more memory is consumed. Typically, these items are not tall tent poles when it comes to improving the performance of an email server, but they can be important. This topic will be revisited from another angle in Chapter 8.

This section has stated that bandwidth is far more important than latency, while other parts of the book say a great deal about methods to reduce latency. This might at first seem contradictory, but the strategy is consistent: increasing the mail server's throughput is almost always more important than reducing latency, and wherever latency can be reduced without diminishing bandwidth, it should be reduced as much as possible. An example of sacrificing latency for bandwidth is turning off aggressive filesystem pre-fetching. Eliminating unnecessary DNS lookups, or lowering the overhead of each one, reduces latency without affecting overall bandwidth.
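
Caching repeated DNS lookups is a good illustration of a latency win with no bandwidth cost. Below is a hedged sketch of a per-process cache; real MTAs and resolvers cache with TTL awareness, which this simplified version ignores:

```python
import functools
import socket

@functools.lru_cache(maxsize=4096)
def resolve(hostname: str) -> str:
    # Each cache hit saves a resolver round trip (latency), while the
    # bytes of email moved per message are unchanged (bandwidth).
    return socket.gethostbyname(hostname)

print(resolve("localhost"))  # first call queries; later calls are cache hits
```

For a server that relays thousands of messages to the same handful of domains, most lookups after the first become in-memory operations.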



sendmail Performance Tuning
ISBN: 0321115708
Year: 2005