8.3 Load Testing Pitfalls

Several tricky issues arise in setting up a test environment for an email system. Some of these issues may seem obvious, but others probably won't be. In any case, it's always best to be aware of these potential problems before getting started so the necessary choices are made explicitly.

Suppose we have an email test environment in which we have set up a single target machine with a dedicated disk for the queue and a reasonable RAID system for the message store. Further suppose that we plan to test how many messages per second the system can receive and store. Besides its overall capacity in absorbing email messages, we'd like to know the location of the next bottleneck so that we can evaluate whether we might need multiple disks for the queue, a high-performance filesystem for the message store, or some other improvement to provide adequate headroom for future growth. To test this target server, we have also set up two computers to be the load generation (source) machines and connected them with the same type of network and switch we plan to use in production.

Generally, we expect that when the target machine reaches its saturation point, it will become I/O bound somewhere in the system. Because the actual messages used in testing don't matter, a source machine can generate meaningless messages rather than read data from its own disks. It could send the same message over and over again to the target, or it could select randomly from a pool of messages representative of the distribution of message sizes typical of that organization's expected load (a minimal sketch of such a generator appears after the list below). In this case, we'd expect that if the source and target machines had the same configuration, a single source should be able to overwhelm the target with load, because the source never touches its disks and the target does. Oddly, in many cases this result doesn't happen. In fact, it sometimes takes several times the target server's horsepower in source machines to saturate the target. The reasons for this behavior are often difficult to determine, but a partial list of some of the more likely culprits follows:

  1. The tools used are not always efficient load generators. Software such as sendmail and popclient is not designed with load generation in mind. When these tools were written, the programming priorities were data integrity and correctness rather than blinding speed. Even if one modifies their behavior to improve performance, such as by having popclient write its output to /dev/null, the utility continues to perform at least one extra data copy that wouldn't need to be done in a tool designed purely for load testing. These little inefficiencies add up, especially when running large numbers of load generators. When hundreds or more of these processes run on a single machine, the shell scripts that drive them will fork a very large number of processes, and the original popclient author almost certainly didn't expect that 1,000 concurrent copies of his program would run on a single server at the same time. Conversely, the programmers working on sendmail and the Cyrus IMAP daemon, for example, spend a significant amount of time considering how their code will run in high-stress environments, like those on the target server.

  2. A program that is active but not running still consumes resources. When a test program starts, it consumes memory, at least one network socket, one or more file descriptors, and a slot in the process table. Due to the latencies involved in the synchronous request/response conversations of the email protocols covered in this book, most of these processes will spend most of their time waiting. This becomes even more pronounced once the target server starts slowing down, as it will when it becomes more heavily loaded. This feedback causes the testing machine to slow down as well: soon memory consumption and context switching become enough of an issue that the load generator itself starts to lag.

  3. Once a target configuration goes beyond the most basic setup (that is, once its I/O system becomes fairly well tuned), handling large amounts of I/O becomes easier for the server. Even without other inefficiencies, the source and sink machines don't have sophisticated I/O systems onto which to offload part of the work done by the server. As a consequence, the load generation and target machines can be closer to parity in the effort they actually expend than might seem likely at first.

  4. Aside from its I/O capabilities, the target server is likely to be a more powerful system than each of the load generators. Certainly, it wouldn't be completely fair to criticize load generation software running on two Linux PCs for not being able to saturate a well-configured, four-processor Sun E420R. Of course, dollar-for-dollar and processor-for-processor comparisons aren't strictly fair either. Nevertheless, with a significant disparity in the class of machines used for load generation and the target server, we shouldn't be surprised if this difference manifests itself during testing.

  5. Sometimes the source and sink machines actually perform more work than the target server. For example, if the target server acts as an email gateway, relaying email between an SMTP source and sink, the nontarget machines might be expending more effort. While it's true that the target both receives and sends each message, whereas the sink only receives and the source only sends, a well-configured target has the option of using the same data buffers to write out the incoming email and then read it back for the outgoing transfer. The source and sink can't employ the same optimization, nor could they even if both functions resided on the same machine. In some cases, the two load generation machines must handle work that the target never has to perform.
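
To make the load generation side concrete, here is a minimal sketch of the kind of generator described above, written in Python for brevity. It reuses one SMTP connection and either replays the same canned message or draws from a small pool of representative sizes. The host name, addresses, and pool sizes are placeholders rather than measured values, and a real harness would add timing, error handling, and concurrency.

    import random
    import smtplib

    TARGET = "mailtest.example.com"      # placeholder name for the target server
    SENDER = "loadgen@example.com"
    RECIPIENT = "testuser@example.com"

    # A small pool of message bodies roughly matching the expected size
    # distribution; a real test would use sizes taken from production logs.
    POOL = ["x" * 2000, "x" * 10000, "x" * 100000]

    def send_one(conn):
        body = random.choice(POOL)       # or always send POOL[0] for a fixed message
        msg = "Subject: load test\r\n\r\n" + body
        conn.sendmail(SENDER, [RECIPIENT], msg)

    def run(count=1000):
        # Reusing one connection keeps the generator itself cheap; opening a
        # fresh connection per message would stress the target differently.
        with smtplib.SMTP(TARGET) as conn:
            for _ in range(count):
                send_one(conn)

    if __name__ == "__main__":
        run()

Whether to reuse connections, how many generator processes to run, and how the message pool is built are all test-design decisions; each choice stresses a different part of the target.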

Many other reasons exist as to why it might take several load generators to saturate a single target server; the ones mentioned here are just some of the most common. Nonetheless, if the load testing machine and the target machine consist of otherwise identical hardware, then even if the fastest, most efficient I/O system is attached to the target, a well-tuned source should be able to saturate an equally well-tuned target. I haven't tested this theory in all cases, but one of the reasons for the lack of rigorous testing goes back to the issue of overall efficiency: there are few environments in which it makes sense to tune anything except the target server in a test environment. Hardware is usually plentiful and almost always cheaper than expertise.

8.3.1 Difficulties in Approximating the Real World

At the beginning of this chapter, it was stated that no matter how much effort is put into creating a test environment, it can never be more than an approximation of what the server will face in a real-world environment. As specific tools were discussed, it probably became evident that simulating every detail of what an email server will actually experience is nearly impossible. It is worth explicitly exploring where some of these difficulties occur so that the limitations of a given test environment are well understood.

One of the more obvious differences between a production environment and a test environment relates to the number of hosts, domains, and IP addresses that the target machine encounters. When testing a gateway server, it might be possible to simulate the number of internal servers that will actually be encountered in a production environment if the number is small and virtual hosting is used. If the number of internal hosts is very small, then the number of internal IP addresses may even equal the number of servers set up as SMTP sinks. However, one cannot count on these cases. Even when they do apply, no practical way exists to simulate the variety of hosts, domains, and IP addresses that the production machine will encounter on the Internet side. In the test lab, DNS caches will always be smaller, SMTP connection caches will always be more effective, information such as sendmail's persistent host status will always be stored more efficiently, and items such as route tables will always be smaller.

In a test environment, the number of different user accounts used, for example, on a target POP server will often be smaller than in the production environment. If possible, the test should use a number of accounts similar to what is expected on the production system. For all except the largest corporations, this type of testing is probably feasible. Tools can be written so that a test environment simulating the use of tens of thousands of accounts isn't out of the question. Simulating the entire user base of a large ISP or email portal is a different matter, however: testing millions of accounts is a much more difficult problem than testing thousands. Just the extra effort required to select an item from a list with 1 million entries will place different demands on these two test systems. As the anecdote in Chapter 7 shows, testing with an improper number of accounts can generate misleading results.
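
As a rough illustration of that selection cost, the sketch below (using a hypothetical accounts.txt file) loads the account list once and then draws from it at random for each session. Even this simple approach keeps a million-entry list resident in the generator's memory, which is part of the extra demand just mentioned.

    import random

    # Hypothetical file containing one test account name per line.
    with open("accounts.txt") as f:
        accounts = [line.strip() for line in f]   # ~1,000,000 entries held in RAM

    def next_account():
        # O(1) selection once the list is loaded; re-reading the file for every
        # session would instead turn the load generator itself into an I/O test.
        return random.choice(accounts)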

Another set of circumstances that is challenging to account for in a test environment focuses on the collection of unusual events, misconfigured hosts, and antisocial denizens of the Internet with which an email server must cope on a daily basis. Most test suites won't include miscreants probing whether the target server is an open relay by trying to bounce 100 email messages off of it. Most test suites won't include POP clients that disconnect without informing the server through the POP protocol. Most test suites don't simulate rebooting backbone routers, misconfigured DNS servers, or email servers that don't strictly adhere to SMTP. Generally, this approach is appropriate. This sort of detail usually lies far beyond the threshold of what is worthwhile to explicitly account for in a test environment. However, while an email server may be subjected to more load in the lab, it will encounter far more unusual events in the field than any test writer can imagine.

When designing a test scenario, the primary interest is typically simulating the load that an email server will encounter during the period of its peak usage. To make things easier, we often use average loads and average message sizes. Sometimes, however, an email server's load is dominated by a single unusual event. A gateway might be called upon to relay or bounce an enormous message, a POP client might download an especially large message, or some client might issue an IMAP SEARCH command against an enormous volume of email. Averaging out these extreme cases might mean that the server is never tested under its most stressful circumstances.
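
One hedged way to keep the extremes in the mix is to draw message sizes from a long-tailed distribution rather than using a single average; the distribution and parameters below are assumed purely for illustration.

    import random

    def sample_message_size():
        # Log-normal sizes: most draws land near a few kilobytes, but the tail
        # occasionally produces the huge message that dominates real-world load.
        size_kb = random.lognormvariate(2.0, 1.2)   # illustrative parameters only
        return int(size_kb * 1024) + 1              # size in bytes, at least 1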

A critical difference between production and test environments that is extraordinarily tough to simulate is that the network link between two hosts in a test environment typically has much lower latency than in the real world. On the same subnet, the network latency between client and server will be less than 1 ms. Across the Internet, it can easily reach 100 ms. Indeed, over a modem link, a latency of 300 ms is not uncommon. The addition of these latencies to interactive protocols fundamentally changes the nature of the connections. Let's look at this point in some detail.

Let us perform a fairly naive calculation on how long it will take to transfer a 10KB message from one host to another using SMTP. First, let us consider how long it will take with the two hosts on adjacent 10 Mbps Ethernet networks connected by a router that introduces a 1 ms latency. Let us also assume, for the sake of simplicity, that each of the two servers in question is an authoritative name server for its respective domain. We also assume that the CPU time for processing the message is arbitrarily fast, a reasonable assumption; even if it were not true, however, it would add only a few milliseconds to the transaction in each case. Table 8.1 presents a list of the durations of most of the events that occur during the message transfer, assuming sendmail will receive the message and runs in background mode, using a typical configuration. Obviously, the duration of the session is dominated by waiting for disk writes to occur.

Table 8.1. Latency During Various Phases of an SMTP Conversation

Duration   Event
3 ms       Send initial SMTP connection request using TCP
1 ms       Server looks up PTR record of connecting host
1 ms       Server looks up A record of connecting host
1 ms       Receive the initial SMTP banner
1 ms       Send the ESMTP EHLO
1 ms       Respond with 250 to the EHLO
1 ms       Send the MAIL FROM:
1 ms       Verify that the domain of the sender exists
1 ms       Canonify the sender host name
1 ms       Respond to the sender information
1 ms       Send the RCPT TO:
1 ms       Canonify the recipient host name and perform MX lookups
20 ms      Create a sendmail queue entry
1 ms       Respond to the recipient information
1 ms       Send the DATA start
1 ms       Respond with the 354 "go ahead" message
8 ms       Send the message
40 ms      Queue the message data body
1 ms       Respond to the message terminator
1 ms       Send the SMTP session QUIT
1 ms       Give the 221 "closing connection" response

Total elapsed time: 88 ms

Now, let's replace each 1 ms of latency for each part of the conversation with 200 ms of modem latency, which is quite generous, and replace the 8 ms of message transmission with 1400 ms, assuming a 56 Kbps modem. In this case, the same set of transactions will now require more than 5 seconds to complete, almost a 60-fold increase in the duration of a single session. Of course, this calculation is crude, and the receiving server would rarely need to make DNS requests across a modem line, but the example is not completely outlandish. Note that both sessions required the same amount of CPU calculation, I/O capacity, and total network consumption; the first session simply required far less wall-clock time than the second did.
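
The arithmetic is easy to reproduce; the short calculation below simply restates the Table 8.1 figures and the modem assumptions from the text.

    # Durations from Table 8.1, in milliseconds.
    lan_steps = [3] + [1] * 17 + [20, 8, 40]
    print(sum(lan_steps))        # 88 ms on the low-latency test network

    # Each small protocol exchange now costs ~200 ms of modem latency, and the
    # 8 ms message transfer becomes ~1400 ms at 56 Kbps; disk times are unchanged.
    modem_steps = [200] * 18 + [20, 1400, 40]
    print(sum(modem_steps))      # 5060 ms, roughly 60 times longer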

As a result, system utilization usually looks quite different in the lab than it does in the field. Processes that interact with the outside world stick around much longer than the same tasks run on a low-latency test network, even though they move the same amount of data. Under a test suite, a heavily loaded, saturated email server receiving and storing messages, transferring some number of megabytes per second of data in and out of its network interfaces, might at any one time have 30 sendmail and 12 mail.local processes running concurrently. The same system deployed at a dial-up ISP and handling the same throughput might easily run the same 12 mail.local processes but well in excess of 300 sendmail daemons, because high-latency connections to the outside world stretch the duration of each session.
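
The jump from roughly 30 to more than 300 concurrent sendmail processes is simply Little's law at work: at a fixed arrival rate, average concurrency scales with average session duration. The arrival rate and field session time below are assumed for illustration; only the 88 ms figure comes from the Table 8.1 example.

    # Little's law: average concurrency = arrival rate x average session time.
    arrival_rate = 340           # messages per second; assumed for illustration
    lab_session = 0.088          # seconds per session, from the Table 8.1 example
    field_session = 0.9          # seconds; an assumed average over real-world links

    print(round(arrival_rate * lab_session))    # ~30 concurrent sendmail processes
    print(round(arrival_rate * field_session))  # ~306, in line with the example above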

This point leads to an important maxim in network profiling. Assume we have a set of synchronous network transactions, such as email protocols, that are running and consuming a constant amount of network bandwidth. Then, if the latency of the connections increases significantly, the primary effect on the servers involved in these transactions will be a substantial increase in memory consumption. A hint of this result can be seen in the CPU-bound test examples cited earlier in this book. The target server could quite capably handle the load thrown at it in a test environment without swapping, despite having a mere 32MB of RAM. When put in production, however, this system would begin thrashing under even a small fraction of its tested capacity. This problem would arise because the latencies between the target server and its peers in production would be much larger than they were in the lab.
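
The memory consequence follows from the same numbers. Assuming, purely for illustration, something like half a megabyte of private memory per additional sendmail process, the lab and field configurations land on opposite sides of a 32 MB machine:

    per_process_mb = 0.5          # assumed private memory per sendmail process
    print(30 * per_process_mb)    # ~15 MB of extra demand: fits within 32 MB of RAM
    print(300 * per_process_mb)   # ~150 MB of extra demand: certain thrashing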

Besides the obvious risks of swapping, this additional memory requirement for a production email server creates ripple effects that influence the performance of the entire system. The extra memory in use means less memory is available for things such as the filesystem buffer cache, DNS cache, and write-behind buffer space for asynchronous data (e.g., log files). These alterations in system behavior will probably reduce total I/O capability. Occasional bursts of higher traffic may be far more noticeable on the production server than they would be during testing. Larger process tables will make fork()ing, context switching, and memory reclamation more CPU intensive. Because more sessions operate concurrently in a production environment, the system will be characterized by more concurrent queue entries, more open files, and, therefore, longer waits for synchronous data operations. In general, one cannot expect an email server in production to achieve the total throughput that was measured in a test lab. It's important to take this point into account.

Other than setting up a farm of load generators on the other end of a modem bank, there exists at least one good way to simulate high-latency networks. Work has been done on modifying network drivers to allow a computer to operate as if it resided on the other end of a low-bandwidth and/or high-latency network connection. One of these development efforts, dummynet [RIZ97], has been included in the FreeBSD operating system beginning with version 2.2.8. The dummynet facility is built on top of the ipfw (IP firewall) facility in the FreeBSD kernel, so one must configure ipfw to use dummynet. A very good tutorial on configuring ipfw and dummynet is available online [RIZ]. The availability of dummynet, along with the robustness and performance of FreeBSD, makes it my operating system of choice for load-testing servers, even though getting Mstone to work there requires some minor programming.
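
As a pointer to what that configuration looks like, a minimal dummynet setup of that era resembled the following; the interface name and the exact numbers are examples only, and the tutorial cited above is the authoritative reference.

    # The kernel must be built with "options IPFIREWALL" and "options DUMMYNET".
    # Send all traffic leaving interface fxp0 through pipe 1 ...
    ipfw add 1000 pipe 1 ip from any to any out via fxp0
    # ... and make that pipe behave like a 56 Kbps link with 200 ms of delay.
    ipfw pipe 1 config bw 56Kbit/s delay 200ms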

A similar facility, NIST Net [NIS], is available for Linux. Its implementors claim that it should be considered beta code, but many sites already run it, and it has the advantage of being configurable from a GUI. Additional information on NIST Net can be found in [DAW00].


