3.3 Filesystems

For sendmail-based gateways, the single most common cause of a performance bottleneck is the rate at which files can be created and deleted in the mail queue directory, /var/spool/mqueue or its equivalent. The implementation details of the filesystem on which these messages are stored can contribute greatly to the overall performance of an email server. This section discusses these issues.

3.3.1 FFS-Based Filesystems

Most of the more commonly used UNIX filesystems are based on the venerable Fast File System (FFS) [MJLF84]. While this filesystem has been a workhorse for UNIX systems for more than 15 years, its characteristics can easily lead to performance problems under some types of loads. For instance, it does not perform terribly well when directories become too large, that is, when a single directory contains many files or subdirectories. It also performs relatively poorly when it must quickly carry out a large number of directory-modifying operations, such as creates and deletes, which are common on sendmail servers.

An example will illustrate this point. When it comes time to look up an entry in an FFS directory, the filesystem does a linear scan of that directory looking for any entry that matches the file name in question. With very large directories on extremely busy disks, the server might be so busy that a simple "ls /var/spool/mqueue" can take minutes to write the first line of output to the terminal. Because the same delay occurs whenever a sendmail process looks up a particular queued message identifier, and is exacerbated when a forked sendmail process begins a queue run, it can create a huge bottleneck. While a system will rarely get so bogged down that ls takes a minute to run, even adding a significant fraction of a second to each directory lookup can increase the amount of time it takes to process an email message by a considerable percentage.

Many higher-performance filesystems store their directory information in a hash table or balanced tree (often called a B-tree) structure, making such lookups much less expensive when directories grow very large. Filesystems with this feature include VxFS [VER], XFS [SDH+96], and ReiserFS [REI]. On these filesystems, the directory hashing information is stored on disk. On FreeBSD version 4.4 or later, if the kernel is compiled with the "UFS_DIRHASH" feature, a hashed image of frequently accessed directories is kept in memory instead. Do not underestimate the performance improvement this can provide.
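
For reference, on FreeBSD this amounts to a one-line addition to the kernel configuration file (assuming, as noted above, FreeBSD 4.4 or later):

 # Enable in-memory hashing of large, frequently accessed directories
 options         UFS_DIRHASH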

On FFS-based filesystems, when a file is deleted from a directory, updating of several on-disk data structures becomes necessary. First, the directory entry for this file is deleted, then the inode containing the file is placed on the free list. These operations must be performed in this order or else a directory entry could potentially point to invalid data. When a new file is created, an inode must be allocated for that file and then a directory entry created pointing to that inode. These operations must be performed in this order as the directory entry needs to know to which inode it ought to point. Consequently, some directory plus inode operations need to be performed in one order, whereas others need to be performed in the opposite order.

FFS solves this dilemma by performing its directory operations synchronously. When a file is to be created or deleted, first the directory is locked, then the inode and directory operations are performed synchronously on disk in the correct order, and finally the directory is unlocked. As a consequence, not only are directory operations performed synchronously, but a system where many file creations and deletions are trying to occur at the same time in a single directory will also have all these operations serialized and bracketed by locks. This strategy prevents taking advantage of the buffering designed to speed up disk I/O on modern systems, thereby making these operations perform poorly. Exactly this sort of activity takes place continuously in a sendmail mail queue and often causes mail machines to appear to run slowly. A more thorough discussion of the issues involving FFS filesystem consistency appears in [MBKQ96].

For this reason, many have been enticed by the prospect of running their filesystems in asynchronous mode. Asynchronous filesystem performance is so far superior to FFS's synchronous performance that it has proved a serious temptation even to people who ought to know better. As noted earlier, the email standards require that email that has been acknowledged as received must be able to survive a machine crash. If a filesystem runs in async mode, then not only will the data written to disk be buffered in memory, but the filesystem metadata operations (that is, operations that change file information rather than file contents) will also be buffered. Thus data could be lost during a crash, and entire files or directories may be in jeopardy as well. In fact, the filesystem could reach such a state that a large percentage of its contents, or in extreme cases all of them, might not be recoverable if the server crashes. Clearly, this approach is an unacceptable way to run an email system.

Soft Updates [GP94] is an extension to FFS that improves the performance of filesystem metadata operations. With Soft Updates, each modification of a directory entry is written in a particular way such that if the update is aborted, enough information is available to roll back or commit each individual operation. This approach has been implemented in practice in BSD 4.4-based operating systems [MG99], with metadata operation performance numbers that come very close to asynchronous performance without the potentially disastrous side effects. Researchers can debate whether journaling, which will be discussed later, or Soft Updates provides better performance (and they do; see [SGM+00], for example). For most practical purposes, however, which solution is adopted really doesn't matter. Both approaches have similar performance characteristics for most workloads, and both can offer dramatic improvements over slower synchronous metadata updates.

Even when an advanced filesystem is not available, queue performance in an FFS-based filesystem can be improved in several ways. One strategy is to turn off file access time updates for that filesystem.

On an FFS-based filesystem, each file is represented in a directory by an inode. The inode contains information such as the UID of the file's owner, the type of file, and the location on the disk of the data blocks that make up the file's contents. It also contains the times of three events: its creation time (when the inode was allocated), its modification time (when the file's metadata or data were last modified), and the access time (when the file was last read). Each time a file is read, the atime parameter in the inode is updated, which requires a disk write. Because sendmail never checks the access time of a file, we can safely turn this feature off and eliminate all of these inode updates.

This step typically is done by adding the noatime flag to the list of filesystem options given to the mount command during mounting of the filesystem. For example, on a gateway email server running FreeBSD, our /etc/fstab file might look like the following:

 # Device             Mountpoint  FStype  Options     Dump  Pass#
 /dev/da0s1b          none        swap    sw          0     0
 /dev/da0s1a          /           ufs     rw          1     1
 /dev/da0s1f          /usr        ufs     rw          1     2
 /dev/da1s1h          /var/spool  ufs     rw,noatime  2     2
 server:/export/home  /home       nfs     rw          0     0
 /dev/cd0c            /cdrom      cd9660  ro,noauto   0     0
 proc                 /proc       procfs  rw          0     0

This file indicates that the /var/spool filesystem will be mounted with access time updates turned off. The syntax may vary slightly depending on the exact filesystem and operating system, so consult the mount and related man pages to determine the appropriate options for a given system.

3.3.2 FFS Alternatives

The most popular Linux filesystem, ext2fs [CTT94], runs in asynchronous mode by default. However, it is an extremely bad idea to use this mode on a filesystem that will store email (or any other important data, for that matter). When a filesystem runs in asynchronous mode and the operating system supporting that filesystem crashes or otherwise suddenly halts, the system might not be able to reconstruct that filesystem when it comes back to life. Unfortunately, if ext2fs is run in synchronous mode, then every I/O request to that filesystem is performed synchronously, including every write() call. While this approach makes the filesystem very safe, it degrades performance enough to make this solution unacceptable for high-performance systems.

Equally dangerous is the default behavior of ReiserFS for Linux, which updates its metadata log asynchronously, thereby allowing file renames, creations, and deletions to become lost if the server crashes suddenly. Starting with version 8.12, sendmail works around both the ext2fs and ReiserFS problems by always being extra careful when it runs on Linux systems. On Linux, to ensure the integrity of a queued message when updating a qf file, sendmail will write its temporary data to the tf file, fsync() the file, rename the file to be the qf file, fsync() the qf file (using the same file descriptor), and then fsync() the directory that contains the file. The first fsync() commits the data to disk synchronously. The second fsync() is for those filesystems where fsync() must be run on a file to synchronize its metadata, such as FFS with Soft Updates. On filesystems that perform file metadata updates synchronously, such as FFS, this additional call won't add significantly to the latency of the I/O operations, as the system call should return immediately because no additional work needs to be done. The third fsync() commits the ext2fs and ReiserFS directory metadata (including the file's name) to disk synchronously. This approach makes running sendmail mail queues safe on Linux, but of course this safety is guaranteed only when running at least version 8.12. Not every other piece of email software is equally conscientious. A running sendmail process cannot easily determine what kind of filesystem is mounted in the queue, so sendmail always performs these steps by default on a Linux system, even if other filesystems such as XFS or IBM's JFS [BES00] are used. Finally, note that the second fsync() call was introduced in sendmail version 8.10.
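
To make this sequence concrete, the following is a minimal sketch in C of the write, fsync, rename, fsync, fsync ordering described above. It is not sendmail's actual source; the file names and file contents are placeholders, and error handling is reduced to the bare minimum.

 /* Sketch of the tf -> qf update sequence described above (not sendmail code). */
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>

 int
 main(void)
 {
     const char *tf   = "/var/spool/mqueue/tfAA12345";  /* placeholder names    */
     const char *qf   = "/var/spool/mqueue/qfAA12345";
     const char *dir  = "/var/spool/mqueue";
     const char *data = "V6\nT1020000000\n";            /* placeholder contents */
     int fd, dfd;

     fd = open(tf, O_WRONLY | O_CREAT | O_TRUNC, 0600);
     if (fd < 0 || write(fd, data, strlen(data)) < 0)
         exit(1);
     if (fsync(fd) < 0)          /* first fsync(): commit the tf file's data     */
         exit(1);
     if (rename(tf, qf) < 0)     /* atomically move the tf file into place as qf */
         exit(1);
     if (fsync(fd) < 0)          /* second fsync(): same descriptor, for
                                    filesystems that sync file metadata this way */
         exit(1);
     close(fd);

     dfd = open(dir, O_RDONLY);  /* third fsync(): commit the directory entry    */
     if (dfd < 0 || fsync(dfd) < 0)
         exit(1);
     close(dfd);
     return 0;
 }

The contents written to the file above are meaningless placeholders; the point of the sketch is only the ordering of the system calls.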

This is all very complicated, but unfortunately necessary to ensure that email transport remains reliable in all environments. We simply cannot afford to sacrifice email reliability for the sake of performance. This requirement makes the work more difficult, but it is not negotiable.

What we really want is to somehow match the safety of synchronous operations with the performance of an asynchronous filesystem. Much of the research in filesystems conducted over the last 15 years focuses on how to attain this goal.

One way to provide I/O safety without performing synchronous data operations is to use a synchronous journaling filesystem. A journaling filesystem keeps a log (called the journal) of filesystem metadata operations. This journal is updated synchronously with events such as file creations and deletions, but the actual inode operations are performed asynchronously. Thus, if the system crashes, the journal can be replayed to bring the filesystem back to a consistent state. Updating the journal synchronously is not nearly as expensive as modifying filesystem metadata synchronously, so a significant performance win can result. Examples of journaling filesystems include XFS [SDH+96], VxFS [VER], and ReiserFS [REI], although ReiserFS updates its journal asynchronously. General background information on journaling filesystems can be found in [RO92].

An updated version of the Linux filesystem, called ext3fs [TWE98], has recently become available. In essence, it consists of ext2fs with the addition of asynchronous journaling. As a result, ext3fs filesystems are backward-compatible with ext2fs. That is, an ext3fs filesystem can be mounted as ext2fs, albeit without journaling support. An ext2fs filesystem can be converted to ext3fs simply by running tune2fs -j /dev/diskdevice. The journal is stored in a reserved inode in the same filesystem as the regular files themselves.

The ext3fs filesystem supports three types of journaling modes, two of which are potentially useful for email delivery. The first, data=ordered, is the default configuration for ext3fs. Using this method, filesystem metadata are journaled, but updates to file contents go directly to disk. In the second configuration, data=journal, both file contents and metadata are written to the journal before being written to disk. Under most workloads, filesystem performance using data=ordered will be superior to that with data=journal, because in the latter case file contents are written twice: once to the journal and then again when the appropriate blocks are updated on the filesystem itself.
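
To illustrate how a journaling mode is selected, here is a hypothetical /etc/fstab entry in the same format as the earlier example. The device name and mount point are purely illustrative, and data=journal should be chosen only after testing it against the workload in question:

 # Device       Mountpoint   FStype  Options                   Dump  Pass#
 /dev/sdb1      /var/spool   ext3    rw,noatime,data=journal   0     2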

As with ReiserFS and ext2fs, ext3fs metadata updates are asynchronous by default. The primary stated advantage of ext3fs is that recovery is much faster in case of a system crash. Only the journal needs to be reconciled to put the filesystem into a consistent state. On nonjournaling filesystems, the filesystem checker, fsck, must examine the entire filesystem to assure that its contents remain consistent.

3.3.3 Linux Filesystem Comparison

With so many filesystems available for Linux, which one is the best for use on email servers? This question cannot be answered simply and unequivocally. It is entirely possible to come up with plausible email scenarios in which each of the Linux filesystems performs best. For this book, I've run a set of comparisons that illustrates how surprising the best choice for a given scenario can be, and that highlights the complexities involved in making generalizations.

We test on the I/O-bound Linux server introduced in Chapter 1. As in the tests mentioned earlier in this chapter, this server acts as an email relay. It runs sendmail 8.12.2 in interactive delivery mode, with SuperSafe set to True. On the load generator, the number of concurrent SMTP sessions that relay 1KB email messages off of this server is increased until the I/O-bound server's queue disk becomes saturated. At the saturation point, the number of messages processed per minute is counted. This test is repeated using each of the three Linux filesystems already discussed in this section.

The results of the first test using ext2fs were documented earlier in this chapter. Employing this filesystem in the queue, the test server succeeded in relaying 1,500 messages/minute.

Next, we rerun the test using version 3.6 of ReiserFS as our filesystem. In this test, throughput drops substantially, to only 530 messages/minute. While ReiserFS is advertised as working efficiently for large numbers of small files, this would seem to refer to storage efficiency and directory lookups rather than the performance of synchronous writes. While modifications to ReiserFS after version 3.6 improve on these results, they aren't yet enough to make it competitive with ext2fs in this particular benchmark test. ReiserFS can efficiently store a large number of small files due to its "tail-packing" algorithm, which allows the storage of small files within the same disk block as its directory tree nodes. This option can be turned off using the notail mount option. In the preceding test, remounting the queue using this flag did not significantly improve performance.

In ReiserFS, directory elements are stored in B* trees. This directory storage format allows for fast lookups of files in directories with large numbers of entries. In this test, however, the number of files in each directory never reached 50, so this benefit was not realized.

As ext3fs is just ext2fs with the addition of a journal, we might expect it to offer worse queueing performance than ext2fs. In reality, in a test run using the default data=ordered journal mode, a throughput of around 2,020 messages/minute was achieved, representing about a 35% increase over ext2fs. Even more remarkably, if the journaling mode is changed to data=journal, throughput increases substantially. Under this arrangement, the machine sending email to the gateway test server became the CPU-bound bottleneck, while the iostat utility showed only about 40% disk utilization on the relay server. With extrapolation of these numbers (always very risky), this configuration could potentially relay 4,000 or perhaps even 5,000 messages/minute.

Why does ext3fs perform so well in this environment? Its performance almost certainly reflects the short lifetime of the queue files in this test combined with the way the journal works. Using data=ordered, file data are written to disk and the file metadata are written to the journal. Within a fraction of a second, the file is unlinked and a delete operation is recorded in the journal. Periodically, a checkpointer runs to migrate journaled data onto disk and thereby free up journal space. When the checkpointer encounters a "create/delete" pair, these entries are removed from the journal, the disk blocks to which they refer are freed, and no metadata updates need to be performed. Adding and deleting these entries from the journal, essentially one large file, works considerably more efficiently than removing them from the filesystem, making ext3fs perform better than ext2fs. When the filesystem is mounted with data=journal, the file data are also stored in the journal along with the metadata log. Thus, if the message is relayed quickly enough, every bit of information associated with a queue entry creation and deletion is written and deleted in the log without ever requiring data to be written elsewhere on the disk. In this test, the disk head on the queue disk might never perform a single seek outside of the journal region for the duration of the test, explaining why the throughput is so high.

On a production server, performance may not be quite so good. Not all messages will be delivered within a few milliseconds of their reception. Any file that exists in the queue for more than a few seconds will be rewritten to the disk from the journal. Moreover, most environments have larger average message sizes than those in this test run, and larger messages will consume journal space and bandwidth. Also, because real-world message transfers occur more slowly, more concurrent files will reside in the queue at the same time, which also consumes journal space and bandwidth. In fact, if email relaying occurs so slowly that all relayed messages are written out of the journal to disk before they're unlinked, ext3fs would probably perform worse than ext2fs, as it would have to write everything out twice. Similarly, if the queue directories are very large, it's entirely possible that the cost of queue metadata operations on ext3fs would be so high that ReiserFS would perform best. A production server using ext3fs as the filesystem in a mail queue may still be a good idea, especially if the journal is configured to be especially large and enough RAM is available to hold its entire image in the buffer cache. Nevertheless, the spectacular performance numbers demonstrated in this test are unlikely to be achieved in the real world.

Clearly, evaluating filesystems for a particular purpose can be quite complex. This section has provided some insights into which filesystems might be most appropriate for which workloads, but there is no substitute for direct testing. Excellent information on Linux filesystems, including feature, design, and performance comparisons, can be found in [VH01].

3.3.4 Hardware-Based Acceleration

As has already been mentioned, a high-performance filesystem can greatly improve I/O performance in a sendmail message queue by removing the need for slow directory locking during file creation and deletion. A hardware-based solution to this problem also exists: the use of nonvolatile RAM (NVRAM). With NVRAM, instead of file updates being immediately committed to disk, they are committed to battery-backed memory modules whose contents can survive a power outage or machine crash. For example, when a file is deleted, the deletion can be noted synchronously, but very quickly, in NVRAM (much as is done for a journaling filesystem), and the application can move on to its next operation. Meanwhile, the operating system can update the filesystem to reflect the change at its leisure. If the server crashes before these updates are committed to disk, the log in the NVRAM can be replayed when the server reboots to reconcile any discrepancies found on the disk itself.

Back in the old days, hardware cards provided this support. Legato made one called Prestoserve, originally for VME-based machines and then for Sun's SBus. Users of these cards on systems that did a large number of file creates and deletes typically described their performance improvements as nothing short of amazing. Along the same lines, owners of the Sun Sparc 20 server could add a product called NVSIMM, an NVRAM chip that was installed directly on the machine's motherboard and provided the same service.

Unfortunately, these products have vanished from the face of the computing industry. NVRAM remains in use, however, as part of most commercial RAID systems. Today's disk storage systems typically provide the option of adding RAM whose contents can survive a power outage or other catastrophe, with this RAM acting as a buffer to lower the latency of synchronous operations. It has the additional effect of assisting in the optimization of disk movement: by aggregating disk operations, the storage system can wait until several metadata updates accumulate for the same directory and then commit them to disk at the same time.

On email servers, small messages are often received and written to the queue, sent on to their destinations, and deleted from the queue within a very short amount of time. The time it takes to deliver a message is logged by syslog, so one can see typical numbers for a given mail server. The following is a very inelegant but easy-to-follow script that will create a rough histogram-like list showing the frequency of certain delay times. The name and location of the log file may need to be modified.

 #!/bin/sh
 zcat /var/log/maillog.*.gz |\
     tr ' ' '\n' |\
     grep "delay=" |\
     grep -v "xdelay" |\
     sed 's/,//g' |\
     awk -F'=' '{sum[$2]++} END {for (i in sum) print i, sum[i]}' |\
     sort

Run on a sample machine, it produces the following output:

 00:00:00 27
 00:00:01 29
 00:00:02 4
 00:00:03 3
 00:00:04 3
 00:00:05 3
 00:00:06 3
 00:00:07 3
 00:00:11 1
 00:00:13 2
 00:00:21 2
 00:00:24 1
 00:00:31 1
 00:01:27 1
 00:10:01 1

Approximately one-third of the mail touched by this machine was transmitted in less than a second (of course, much of it is local). Another one-third was transmitted in about one second.

Based on this information, we could see how NVRAM might assist the performance of a sendmail mail queue in a remarkable way. Because most messages spend a very small amount of time in the queue, if the messages are small they can often be written to and deleted from NVRAM before the NVRAM contents are ever written to disk! The elimination of actual disk movement for many, if not most, email messages while still fulfilling RFC 2821's integrity requirements is a remarkable proposition.

NVRAM has been understood to be useful for accelerating disk operations for quite some time. Even though it is quite old in computing terms, a paper written by Mary Baker and others [BAD+92] is no less relevant in demonstrating how much of an advantage NVRAM can provide.

3.3.5 Other Filesystem Strategies

Many operating systems and filesystems, working in combination, try to improve performance by aggressively "pre-fetching" data. The basic idea behind this approach is that if an application read()s several contiguous disk blocks out of a file, it is likely to eventually request the next several disk blocks, or even the rest of the file. Therefore, the operating system can provide faster access by pre-fetching the extra data that are likely to be requested next and storing them in the buffer cache. In this way, when the request comes, the extra data have already been cached in memory and can be passed immediately to the application without moving a disk head.

In general, this strategy is a reasonable one. Unfortunately, it helps very little on an email server. To some small extent, pre-fetching the next set of consecutive disk blocks will result in more efficient use of the disk if it eliminates a later disk head seek to read that section, but there's no guarantee that this will happen. In general, any form of pre-fetching trades I/O bandwidth and increased memory consumption for improved latency. To be precise, the OS consumes disk bandwidth by performing operations in the hope that it can answer a forthcoming query much more quickly than it otherwise would. Of course, if that query never comes, precious bandwidth has been consumed with no return on investment. Whether or not the read-ahead occurs, almost exactly the same disk motion will be required to fulfill the request. Thus, even if applications use the pre-fetched data, only a small savings in total I/O bandwidth will occur.

If a server stores extremely large scientific data files, movies, or other files that are typically accessed sequentially, pre-fetching can offer an enormous benefit. When files are small and accesses are less easy to predict, the pre-fetched files are actually used less often by applications, resulting in wasted I/O operations. While reducing the amount of time an email application must wait for data is a good thing, overall I/O bandwidth to the disks is almost always a much more precious commodity than latency.

On an email server, after reading the first data block from the file, it's almost certain that the application will scan through to the end of that file. Therefore, some read-ahead capability usually doesn't hurt. On systems where the amount is tunable, a minimal, but nonzero, read-ahead policy usually will provide the best performance characteristics. Details concerning which systems support tuning this capability and how to do so vary greatly from system to system, so it is impossible to give more specific advice that is both useful and reasonably brief.
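
System-level read-ahead knobs vary too much to document here, but at the application level POSIX does provide a per-file hint, posix_fadvise(2), that a program can use to request sequential read-ahead on a file it is about to scan. The sketch below is illustrative only; it is not something the text claims sendmail does, and the mailbox path is hypothetical.

 /* Hint to the OS that a file will be read sequentially (illustrative only). */
 #include <fcntl.h>
 #include <stdio.h>
 #include <unistd.h>

 int
 main(void)
 {
     int fd = open("/var/mail/username", O_RDONLY);  /* hypothetical mailbox */
     int rc;

     if (fd < 0)
         return 1;
     /* Ask for read-ahead on this descriptor; the kernel is free to ignore it. */
     rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
     if (rc != 0)
         fprintf(stderr, "posix_fadvise: error %d\n", rc);
     /* ... read the mailbox sequentially here ... */
     close(fd);
     return 0;
 }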

If an email server handles many sessions over slow network connections, filesystem pre-fetching can result in an especially insidious problem. Suppose that an email server supports hundreds or even thousands of concurrent POP sessions at one time. Suppose also that these connections take place over slow modems that allow 2.5 KBps of data to be transferred. Each mailbox download, and hence each POP session, might take tens of minutes to complete if the mailboxes are large.

With a large number of sessions and slow transfers, pre-fetched data pages may consume a lot of memory on the system. If the system runs out of free memory at this point, memory will be reclaimed from the buffer cache using a Least Recently Used (LRU) or similar algorithm. In the worst case, data that have not yet been sent to the POP client will be flushed from the buffer cache and must be reread from disk. This is a horrible eventuality, because now the same data must be read from disk twice.

At this point, the overconsumption of disk bandwidth could potentially cause the system to slow down noticeably, making each session last longer. This delay will cause the number of sessions to increase, which causes increased memory consumption, which causes faster depletion of the buffer cache, which leads to a catastrophe.

Of course, this scenario assumes that pre-fetching is aggressive, the mailbox files are large, and the network connections are very slow. This confluence of circumstances is not entirely implausible, especially for an email server that serves POP clients over dial-up links, such as an ISP. On a server that experiences this problem, the only solutions are (1) to reduce the aggressiveness of the pre-fetching algorithm or (2) to add more memory to increase the size of the buffer cache. Both are reasonable courses of action to take.

Detecting this symptom is straightforward if you monitor the buffer cache hit rate. On many systems, this information can be viewed by running something like sar -b 15. More information about system monitoring and the sar command appears in Chapter 7. On systems without sar, other methods must be used to obtain this information. These methods will usually be system dependent. If the data show the buffer cache hit rate declining precipitously and disk I/O bandwidth consumption measured in megabytes per second rising noticeably while the total amount of network bandwidth moving in and out of the system changes very little, buffer cache thrashing might be the culprit.


