3.2 Synchronization | sendmail Performance Tuning

Any email system that cannot guarantee successful delivery of a message has a significant deficiency. Section 6.1 of RFC 2821 explicitly states:

When the receiver-SMTP accepts a piece of mail (by sending a "250 OK" message in response to DATA), it is accepting responsibility for delivering or relaying the message. It must take this responsibility seriously. It MUST NOT lose the message for frivolous reasons, such as because the host later crashes or because of a predictable resource shortage.

This is a significant statement, and it is an absolute requirement. Any email system that does not follow this guideline cannot claim to be compliant with the SMTP protocol. Essentially, this paragraph requires that a message be completely committed to stable storage before the acknowledgment that accepts it is returned to the sending server. Not only must the message be received, but to survive a machine crash, the filesystem buffer that contains the message must be flushed to stable storage. On UNIX-like systems, this requirement means that the fsync() or equivalent call must be made. This call ensures that the message has actually been written to disk, so that the data will still be available if the system crashes.
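The ordering described above (write, then fsync(), and only then acknowledge) can be sketched in a few lines. This is a minimal illustration, not sendmail's implementation: the function name commit_message and the file name dfEXAMPLE are hypothetical, and the sketch assumes a UNIX-like system where os.fsync() maps to fsync().

```python
import os
import tempfile


def commit_message(directory: str, name: str, body: bytes) -> str:
    """Write body to directory/name; return only after fsync() has succeeded."""
    path = os.path.join(directory, name)
    # O_EXCL makes the call fail rather than overwrite an existing message.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(body)
        while view:
            # write() may accept fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush the dirty buffers for this file to disk.
        os.fsync(fd)
    finally:
        os.close(fd)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as queue_dir:
        commit_message(queue_dir, "dfEXAMPLE", b"Subject: test\r\n\r\nhello\r\n")
        # Only now would it be safe to acknowledge the message.
        print("250 OK")
```

If os.fsync() raises an error, the function never returns a path, so a caller that acknowledges only after a successful return cannot acknowledge a message that was not flushed.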

Since the early days of UNIX, significant advancements have been made regarding the speed at which data can be written to disk. The most significant involves buffering the data in memory rather than synchronously writing them to disk. When an application performs a write() call to a file, the data are just written to memory, which is much faster than writing them to a disk, before the system call returns indicating success. The kernel marks these data buffers as "dirty," and it knows that they need to be flushed to disk eventually. However, it waits until it can do so conveniently or until a timer has expired, so as to keep programs performing data write()s moving along and to use system resources more efficiently. The downside is that if the system crashes before the buffer is flushed to disk, the contents of the write() will be lost. Usually, this loss would involve only a few seconds of data, a small price to pay for vastly improved performance.

For a machine that claims to support the SMTP standard, however, this price is too high. Therefore, an email server must perform much of its data updating synchronously, negating the most significant performance improvement in writing data to disk available to the system. Because few other applications must adhere to such strict requirements, the general I/O tuning literature pays little attention to synchronous disk operations. Few storage vendors test this scenario because it tends to make their products look slow, and few I/O benchmarks focus on this aspect of I/O performance. Even the PostMark [KAT00] filesystem benchmark, which is designed to mimic the behavior of an email system, falls short in this regard. While it does perform a large number of fairly random creates and deletes, it does not require that the target filesystem perform its updates synchronously.

These sorts of synchronous operations seem expensive from a performance standpoint, yet they are absolutely essential if people are to have any confidence that their messages will reach their intended destinations. Therefore, if it is necessary to support a large amount of email flowing through this system, one may have to be a bit clever in handling synchronous disk operations.

The need to synchronize writes to disk applies to both final email delivery and email queueing. Just because the machine queueing a message may not be its final destination does not mean that this server can have a cavalier attitude toward its integrity. Before a queued message is acknowledged (that is, before the machine performing the acknowledgment accepts responsibility for the message), we must be certain that the message has been committed to stable storage.

In sendmail configurations, a parameter called SuperSafe controls this behavior. By default, SuperSafe mode is turned on. In this case, after a series of disk write()s has occurred but before a close() is issued, an fsync() is performed on that file descriptor, ensuring the correct behavior as documented in RFC 2821. This feature can be turned off, although it should be done only under a very restricted set of circumstances. To turn off SuperSafe, add the following line to the .mc file:

 define(`confSAFE_QUEUE', `False')

For example, if a large presorted batch of messages is being sent to nonlocal email addresses from an email server, sufficient accounting of what does and doesn't actually get accepted by another mail server might render the additional safety of SuperSafe unnecessary. However, it is strongly recommended that this feature not be turned off unless some other accounting system, as in this example, is present to determine whether all email messages have reached their intended destinations. In practice, meeting all of these criteria is unlikely in most situations, as this sort of record keeping is difficult to do.

Beginning with sendmail version 8.12, changes have been made to the way sendmail operates when run in "interactive" mode such that email from remote machines to local mailboxes (or remote servers) may be delivered via the delivery agent and committed to stable storage without creating intermediate entries in the mail queue. This goal can be accomplished using the following lines in the .mc file:

 define(`confDELIVERY_MODE', `interactive')
 define(`confSAFE_QUEUE', `interactive')

Setting confDELIVERY_MODE to "interactive" by itself can reduce the CPU overhead involved in handling mail transfer. With the addition of the confSAFE_QUEUE definition, a substantial reduction in I/O requirements in the queue will occur. With these changes, the sequence of events listed previously is modified as described next.

As is usual in any version of sendmail from 8.10 on, the qf file need not be created to reserve the queue identifier, and the qf file contents are held in memory. Also, the creation and buffering of the message body to the df file are deferred until the value defined by DataFileBufferSize is exceeded. By default, this value is 4,096 bytes. On most systems where the daemon will run predominantly in interactive mode, this value probably should be increased. A value of approximately 20KB seems a reasonable place to start for most machines, allowing most small messages to be held in RAM without exhausting the server of memory. If the system regularly needs to swap, then either add more RAM or lower this number. With interactive queueing, the xf file will also be buffered in memory rather than immediately written to disk as long as the value of the XScriptFileBufferSize parameter isn't exceeded. The default value for this parameter is 4,096 bytes and will rarely be exceeded except on those servers where each message commonly has a very large number of recipients. In such a case, increasing this parameter seems reasonable. Adding the following lines to the .mc file will increase the xf file buffer to 16KB and increase the df file buffer to 100KB:

 define(`confXF_BUFFER_SIZE', `16384')
 define(`confDF_BUFFER_SIZE', `102400')
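When choosing these values, it can help to estimate an upper bound on the RAM the buffers might consume. The sketch below multiplies the two buffer sizes by the number of sendmail child processes that could be handling messages at once. The figure of 200 concurrent children is purely a hypothetical assumption for illustration, and the bound ignores all other per-process memory.

```python
def worst_case_buffer_bytes(children: int,
                            df_buffer: int = 102400,
                            xf_buffer: int = 16384) -> int:
    """Upper bound on RAM held in df and xf buffers if every child fills both."""
    return children * (df_buffer + xf_buffer)


if __name__ == "__main__":
    children = 200  # hypothetical number of simultaneous child processes
    tuned = worst_case_buffer_bytes(children)
    defaults = worst_case_buffer_bytes(children, df_buffer=4096, xf_buffer=4096)
    print(f"tuned buffers:   {tuned / 2**20:.1f} MiB")
    print(f"default buffers: {defaults / 2**20:.1f} MiB")
```

Most messages will not fill either buffer, so real consumption should sit well below this bound. The estimate is useful mainly as a check that the chosen sizes cannot push the machine into swapping.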

Unlike operations in background mode, in interactive mode, once the message has been received by a child sendmail process, another child process is not fork()ed to handle the subsequent delivery. Instead, the same process attempts delivery itself. When confSAFE_QUEUE is also set to interactive, it will attempt to do so without writing the message information out to disk unless it has to. This would be the case if none of the appropriate hosts to which the message could be sent is available or if the DataFileBufferSize buffer is exceeded. Then, and only then, are the qf and df files created in the queue and data written to them. After successful delivery of the message, the same sendmail process that performed this delivery will return an SMTP "250 OK" message to the originating machine signifying that the message has been transferred successfully to the next hop. The next hop may be a remote server, another delivery agent that takes responsibility for that message, or safe queuing of the message if initial delivery fails.
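The decision described above can be summarized as a small model. This is only a sketch of the behavior as the text describes it, not sendmail's code: the names Outcome and handle_message are hypothetical, and the next hop's availability is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    wrote_queue_files: bool  # were qf/df files created on disk?
    responsible: str         # who holds the message when "250 OK" is sent


def handle_message(body: bytes, next_hop_accepts: bool,
                   df_buffer_size: int = 4096) -> Outcome:
    """Model interactive delivery with SuperSafe set to interactive."""
    # A body larger than DataFileBufferSize spills out of memory to disk.
    spilled = len(body) > df_buffer_size
    if next_hop_accepts:
        # The same process delivered the message; the queue is touched
        # only if the buffer was exceeded.
        return Outcome(wrote_queue_files=spilled, responsible="next hop")
    # No host could take the message: queue it safely before acknowledging.
    return Outcome(wrote_queue_files=True, responsible="local queue")


if __name__ == "__main__":
    print(handle_message(b"x" * 1000, next_hop_accepts=True))
    print(handle_message(b"x" * 50000, next_hop_accepts=True))
    print(handle_message(b"x" * 1000, next_hop_accepts=False))
```

In every branch, the acknowledgment is sent only after some party (the next hop or the local queue on disk) has taken responsibility for the message, which is what keeps this mode consistent with the RFC 2821 requirement.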

This algorithm will greatly reduce the amount of disk activity on the relaying machine, which would occur synchronously, at the cost of holding open the network connection from the server that originated the message for a small amount of additional time. In most cases, one would expect this method to considerably increase the number of messages per second a server that dealt predominantly in locally delivered email could handle. As a historical note, this change became available as an option in sendmail version 8.10 for systems, such as FreeBSD, that provided stdio function overrides.

Arguably, this network behavior is "less polite," as resources are consumed (the open connection) on the originating server when, strictly speaking, it isn't necessary. Nevertheless, I believe this behavior should be acceptable for a busy server because the resources consumed are very minor: just one outstanding process, a little bit of memory, and a socket. It does not impose any additional I/O or CPU load on the originating machine. At the same time, it does require the gateway machine to be expedient in its delivery. If the originating machine must wait for several minutes under any but the most extreme circumstances, that would cross the threshold between "acceptable" and "rude" behavior. One could also rationalize this behavior by noting that if, instead of running the master sendmail daemon in interactive mode, it were run in the default background mode, the same server might have significant problems keeping up with the load, inconveniencing the originating server to an even greater extent.

One other potential downside to running a gateway server in interactive mode is the greater chance that a message will be retransmitted unnecessarily and end up as a duplicate message in someone's mailbox. This problem can occur if a message is sent from the originator to the gateway machine, and the connection to the originator is terminated after the message is received by the gateway, but before the message has been accepted for delivery by the next hop or destination from the gateway. The originator won't know that the message is being successfully transmitted to its next destination and the message will be re-sent. The gateway won't know that the second message is a retransmission of the first one, so it will be relayed as well. This scenario can also occur if sendmail runs in background mode on the gateway, although the window within which this event might happen is much shorter. A message will be unnecessarily retransmitted in this case only if the SMTP connection from the originator becomes severed between the acceptance of the end of the message by the gateway and the receipt of the "250 OK" message by the originator. The duration of this window of vulnerability in background mode typically will be on the order of a one-way network trip between the gateway and originator plus the time it takes to make a single disk write. In total, this window typically would be on the order of 10 to 100 ms. If the gateway runs in interactive mode and the message is large, the window of vulnerability may last a little longer than the duration of an entire SMTP session between the gateway and the destination machine, perhaps in the range of several seconds, or even longer if the network connection is slow or the message is very large.

We can examine the effects of changing the delivery mode on the CPU-bound test server introduced in Chapter 1. In this experiment, we set up our test machine to relay email sent from one server to a second machine where the messages are delivered (the messages are actually discarded before final delivery, but our test gateway remains oblivious to this fact). The test gateway runs sendmail 8.12.2. In the first test, the gateway runs in background mode, and at saturation it can relay about 279 messages/minute before running out of CPU resources. During this time, the disk that contains the mail queue runs at about 45% of its throughput capacity. If the delivery mode changes to interactive, throughput jumps to 450 messages/minute and the queue disk loading remains relatively constant, despite the increased I/O load, at about 47% of capacity. Finally, with both the delivery mode and SuperSafe set to interactive, the test server can relay 512 messages/minute, while the queue disk loading drops to 0%.

From these results, we can see that considerably fewer CPU resources are consumed when running in interactive mode than when running in background mode. This difference primarily reflects the reduction in process forking and data copying. We obtain further CPU savings when we go to interactive queueing by eliminating the computational overhead involved in writing the data out to disk. If the gateway were disk bound rather than CPU bound, we would expect that not writing to the queue would result in an even more spectacular improvement in throughput. Even so, simply by changing the delivery mode and queue method in this test, we increased throughput by more than 80%, a tremendous improvement.
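The size of each improvement can be checked directly from the measured rates on the CPU-bound server. The short sketch below only restates the arithmetic; the dictionary labels and the helper name percent_gain are illustrative.

```python
# Messages per minute at saturation on the CPU-bound test server.
RATES = {
    "background": 279,
    "interactive delivery": 450,
    "interactive delivery + interactive SuperSafe": 512,
}


def percent_gain(baseline: float, improved: float) -> float:
    """Relative throughput increase over the baseline, in percent."""
    return (improved - baseline) / baseline * 100


if __name__ == "__main__":
    base = RATES["background"]
    for mode, rate in RATES.items():
        print(f"{mode}: {rate} msgs/min, {percent_gain(base, rate):.1f}% over background")
```

Switching the delivery mode alone accounts for a gain of roughly 61%, and adding interactive queueing brings the total to roughly 84%, consistent with the "more than 80%" figure above.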

We can repeat these experiments on the I/O-bound test server also introduced in Chapter 1. We use the same configuration and testing methodology, using our target server as an email relay running sendmail 8.12.2 and using the Linux ext2fs filesystem in the queue. Queueing messages using background mode, we achieve a throughput of about 1,500 messages/minute. Changing the delivery mode to interactive reduces throughput slightly, to about 1,480 messages/minute. The switch to interactive mode changes the CPU workload of the server, but doesn't affect the I/O operations that must be performed, so it isn't surprising that the throughput essentially remains unchanged.

When we use interactive as the delivery mode and change SuperSafe to interactive as well, throughput jumps to 2,400 messages/minute. The restriction in this test, however, reflects CPU exhaustion on the machine sending the messages to our test server. At this point, the target machine we're testing handles the load easily. Memory consumption isn't a problem, no disks are used, and the CPU operates at about 21% of its maximum loading. While it is dangerous to extrapolate from these data, in this configuration our test server could potentially relay more than 10,000 messages per minute, an impressive amount of email.
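The extrapolation in the last sentence follows from a simple proportion, shown here as a sketch. It assumes that throughput scales linearly with CPU utilization all the way to 100% and that no other resource becomes the bottleneck first, which is precisely the assumption that makes the extrapolation risky.

```python
def extrapolated_rate(observed_rate: float, cpu_fraction: float) -> float:
    """Linear estimate of the rate at full CPU from a rate seen at partial load."""
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in the range (0, 1]")
    return observed_rate / cpu_fraction


if __name__ == "__main__":
    # 2,400 messages/minute observed at about 21% CPU utilization.
    estimate = extrapolated_rate(2400, 0.21)
    print(f"estimated ceiling: {estimate:.0f} messages/minute")
```

The estimate comes out above 11,000 messages per minute, which is where the "more than 10,000" figure comes from.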