4.3 Storage Systems

Compared to most common file benchmarks, the I/O operation mix that typically takes place in email message stores is write-heavy, synchronous, and nonsequential, just as it is in the queue. Disks spend a lot of time writing data to random locations in the message store, and modifications to these files are followed by fsync()s. Unlike in the mail queue, however, file creates in the message store are less likely to be followed by immediate deletes. Files in the message store are significantly less temporary. Further, the amount of storage required in the message store is at least one and probably two or more orders of magnitude larger than space in the queue. For any performance-sensitive email system, use of a Redundant Array of Independent Disks (RAID) system for the message store will be mandatory.
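
To make this access pattern concrete, here is a minimal C sketch of the write-then-fsync() sequence just described. The function name, file path, and message buffer are hypothetical; a real local delivery agent would also handle locking, partial writes, and retries.

    #include <fcntl.h>
    #include <unistd.h>

    /* Append a message to a mailbox file and force it to stable storage.
     * Illustrative only: a real LDA also handles locking and partial writes. */
    int deliver_message(const char *path, const char *msg, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0)
            return -1;

        if (write(fd, msg, len) != (ssize_t)len) {  /* the data write itself */
            close(fd);
            return -1;
        }
        if (fsync(fd) < 0) {                        /* synchronous commit to disk */
            close(fd);
            return -1;
        }
        return close(fd);
    }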

RAID systems are quite complex, and an entire book easily could be written discussing even the user-visible features and administrative issues. Even though this information might seem like a long diversion from the book's primary topic, storage systems are an important enough factor in evaluating an email server's overall performance to justify this discussion. A RAID system provides two advantages over single disks: redundancy and striping. Redundancy aims to preserve data integrity if a disk in the storage system fails, although usually at the cost of some reduction in performance. All RAID systems, except for RAID Level 0, which is pure disk striping, provide data protection in the event of a single disk failure. Striping aims to divide the I/O load over several disks. The RAID system presents the illusion of a single large virtual disk to the operating system. Behind the scenes, I/O requests for different virtual disk locations are parceled out to different physical disks.

4.3.1 Disk Concatenation

The simplest form of disk striping is concatenation. With concatenation, multiple disks are aggregated sequentially into one large virtual disk. In the worst case, disk blocks will be allocated such that once the first disk fills, the next write will occur on the following disk. As an example, assume that three 4GB disks are concatenated together. If we start writing out a huge file, the first 4GB will reside on the first disk in the aggregation. Once this disk fills, the next byte, at an offset of 4GB + 1B, will reside on the second disk. Once 8GB have been written, the next byte will live on the third disk. Once 12GB have been written, the disk system is full. To applications, this system looks like one 12GB disk. Some filesystems handle concatenation better than this example suggests. That is, they do not allocate storage in a strictly sequential fashion, but make use of all disks, even if the aggregate stores data representing only a fraction of its capacity.

The other way to aggregate physical disks into a larger unit is to stripe them. Instead of addressing the space on the disks sequentially, each disk is divided into "stripes," and addressing proceeds across the first stripe on each disk in turn, then the second stripe on each disk, and so on. Assume that we start with the same set of three 4GB disks mentioned earlier with no data on them and a stripe width of 1MB. If we write out a 4MB file to the disk system, the first 1MB would go on the first disk (filling the first stripe on the first disk), the second 1MB would go at the beginning of the second disk, the third 1MB would go at the beginning of the third disk, and the fourth 1MB would begin at the 1MB + 1B position on the first disk. (As an aside, another way of looking at disk concatenation is that it is a special case of disk striping, where the stripe size is equal to the disk size.)
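
The following small C sketch, using the hypothetical three-disk, 1MB-stripe configuration from the example above, shows how the same virtual-disk offset maps to a physical disk under concatenation versus striping.

    #include <stdio.h>

    #define NDISKS      3
    #define DISK_SIZE   (4ULL * 1024 * 1024 * 1024)  /* 4GB per disk     */
    #define STRIPE_SIZE (1ULL * 1024 * 1024)         /* 1MB stripe width */

    /* Concatenation: fill all of disk 0, then disk 1, then disk 2. */
    static unsigned concat_disk(unsigned long long offset)
    {
        return (unsigned)(offset / DISK_SIZE);
    }

    /* Striping: consecutive stripes rotate round-robin across the disks. */
    static unsigned stripe_disk(unsigned long long offset)
    {
        return (unsigned)((offset / STRIPE_SIZE) % NDISKS);
    }

    int main(void)
    {
        /* The 4MB file from the striping example, one megabyte at a time. */
        unsigned long long off;
        for (off = 0; off < 4ULL * 1024 * 1024; off += 1024 * 1024)
            printf("offset %4lluMB: concatenation -> disk %u, striping -> disk %u\n",
                   off / (1024 * 1024), concat_disk(off), stripe_disk(off));

        /* An offset past the first disk's capacity lands on disk 1 when concatenated. */
        off = 5ULL * 1024 * 1024 * 1024;
        printf("offset %4lluMB: concatenation -> disk %u, striping -> disk %u\n",
               off / (1024 * 1024), concat_disk(off), stripe_disk(off));
        return 0;
    }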

It's important to consider the application when deciding on a RAID stripe size. Consider, for example, storage systems that hold extremely large files manipulated by a single application. There, smaller stripe sizes are generally beneficial: they achieve as much I/O throughput as possible by keeping many disk heads in motion for a single data transfer. With email systems, the opposite situation arises: Several I/O streams will be active between the server and its storage system at any one time, so each access to a single file should affect only a single disk if possible. This approach reduces the total number of disk heads that need to move to satisfy a particular I/O operation. Because moving a disk head is the most expensive operation a disk performs, any reduction in head movement will improve performance.

An email server will almost always contain a very wide assortment of file sizes. Files may range from hundreds of bytes up to very large sizes, depending on several factors. In many environments, single files in message stores can easily run into the tens of megabytes. A stripe size of two times the size of the largest file would minimize the chance that more than one disk head movement will be necessary to access that file. Of course, the stripe need not be larger than every file on the server. For some email servers, the optimal stripe size will be larger than the maximum allowed by the storage system used. In this case, the largest available size should be selected. In fact, for email server use, my advice would generally be to set a RAID system's stripe size to the maximum configurable value.
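
As a rough illustration of why larger stripes help here, the sketch below models the fraction of file placements that stay within a single stripe, under the simplifying assumption that a file begins at a uniformly random offset within a stripe. Real filesystem allocators behave differently, but the trend is the point: the larger the stripe relative to the file, the less often a single-file access must move more than one disk head.

    #include <stdio.h>

    /* Fraction of placements in which a file fits entirely within one stripe,
     * assuming the file starts at a uniformly random offset within a stripe.
     * This is an illustrative model, not any particular filesystem's allocator. */
    static double single_stripe_fraction(double file_size, double stripe_size)
    {
        if (file_size >= stripe_size)
            return 0.0;                        /* file must span a boundary */
        return 1.0 - file_size / stripe_size;  /* placements avoiding a boundary */
    }

    int main(void)
    {
        double file = 10.0;   /* e.g., a 10MB mailbox file */

        printf("stripe =  2x file size: %3.0f%% of placements stay on one disk\n",
               100.0 * single_stripe_fraction(file, 2.0 * file));
        printf("stripe = 10x file size: %3.0f%% of placements stay on one disk\n",
               100.0 * single_stripe_fraction(file, 10.0 * file));
        return 0;
    }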

An argument against disk concatenation, and against very large stripes in general, goes as follows: The typical filesystem stores new files by allocating the first group of sequentially numbered disk blocks that it expects can hold the given file. Some filesystems are not so naive, however. Even on those that are, this serves only as an approximation of what happens. If a filesystem does use a sequential allocation policy and only the first 50% of the disk system is filled with data, then the bandwidth provided by the last 50% of the spindles in the array goes unused.

Non-uniform file distribution can reduce the available disk bandwidth on striped disks as well. As an example, consider the three-disk array described earlier, and assume an extreme case in which each stripe is half of a disk's total capacity. Then using 33% of the available disk space may leave only 67% of the disk system's total I/O capacity in use. Of course, this example assumes that access to the data is uniform and random, but that is a good approximation when dealing with an email message store. Only 67% of the I/O capacity is in use because only the first two of the six available stripes would hold data, and those two stripes reside on just two of the three disks. If we lower the stripe size to 10% of a disk's capacity, this problem effectively vanishes: At nearly 50% space utilization on the array, the load on one spindle will be no more than 20% greater than the load on any other spindle, even in the worst case, again assuming a random distribution. This variation isn't terribly large, and in practice no RAID system even allows a stripe size that large. The lesson here is to not be afraid of extremely large RAID stripe sizes for email applications. Conversely, going to the extreme of concatenation is probably not a good idea unless the filesystem will evenly distribute files across all spindles, even when the filesystem is nearly empty.
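
The short simulation below reproduces the arithmetic of this example: it fills the first stripes of a three-disk array sequentially and counts how many stripes land on each spindle, first with a stripe equal to half a disk and then with a stripe equal to 10% of a disk. The disk counts and stripe sizes are the hypothetical values used above.

    #include <stdio.h>

    /* Fill the first stripes_used stripes of an ndisks-wide stripe set and
     * count how many land on each spindle, mirroring the worst-case
     * sequential fill discussed above. */
    static void show_imbalance(int ndisks, int stripes_per_disk, int stripes_used)
    {
        int per_disk[16] = {0};
        int i;

        for (i = 0; i < stripes_used; i++)
            per_disk[i % ndisks]++;        /* stripes rotate round-robin */

        printf("%2d of %2d stripes filled:", stripes_used, ndisks * stripes_per_disk);
        for (i = 0; i < ndisks; i++)
            printf("  disk%d=%d", i, per_disk[i]);
        printf("\n");
    }

    int main(void)
    {
        show_imbalance(3, 2, 2);    /* stripe = half a disk, 33% of space used   */
        show_imbalance(3, 10, 15);  /* stripe = 10% of a disk, 50% of space used */
        return 0;
    }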

4.3.2 RAID Levels

The trade press abounds with descriptions of RAID Levels, and it's important to understand at least roughly what they mean. While the information presented here doesn't describe all the intricacies of RAID systems, it will suffice for the purposes of this book.

RAID Level 0 refers to data that are striped across multiple disks without any redundancy. RAID Level 1 consists of mirrored data. Mirroring means that two disks will contain exactly the same contents and be kept completely in sync. No RAID Level 2 systems are commercially available, so we won't consider them here. RAID Level 3 introduces the concept of parity. Parity is the use of extra disk space to store additional data that allows the information from a failed disk to be reconstructed just from data on the other disks in the array. By allocating a single disk to hold this parity information, we can protect against any single disk failure regardless of the number of disks in the array. In RAID Level 3, each disk in an array operates in lockstep; that is, the disk arms on all disks move in unison. In a RAID 3 system, one disk is allocated to act as the parity disk. RAID Level 4 is the same as RAID 3, except that data are stored independently on every disk, so that reads and writes do not involve accesses on every disk. These disks do not operate in lockstep, so data from different parts of different disks can be read simultaneously. A single disk holds parity information. RAID Level 5 is the same as RAID 4, except that parity data are striped across the entire array, not stored on a single disk. RAID 6 is the same as RAID 5, except that one extra disk per RAID group is added, and a second checksum algorithm ensures that the system can survive two disk failures. These RAID Levels are defined by the RAID Advisory Board. Much more information on RAID can be found in The RAID Book [MAS97], among other sources.
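
The parity idea behind RAID Levels 3 through 5 can be illustrated in a few lines of C: the parity block is simply the XOR of the data blocks, so any one missing block can be rebuilt from the survivors. The block size and contents below are arbitrary; real arrays do this per sector in hardware or firmware.

    #include <stdio.h>
    #include <string.h>

    #define NDATA 3   /* data disks in the parity group          */
    #define BSIZE 8   /* bytes per block (tiny, for illustration) */

    int main(void)
    {
        unsigned char data[NDATA][BSIZE] = { "disk0..", "disk1..", "disk2.." };
        unsigned char parity[BSIZE] = {0};
        unsigned char rebuilt[BSIZE] = {0};
        int d, i;

        /* The parity block is the byte-wise XOR of all the data blocks. */
        for (d = 0; d < NDATA; d++)
            for (i = 0; i < BSIZE; i++)
                parity[i] ^= data[d][i];

        /* Pretend disk 1 failed: rebuild its block from the survivors + parity. */
        for (i = 0; i < BSIZE; i++)
            rebuilt[i] = parity[i] ^ data[0][i] ^ data[2][i];

        printf("rebuilt block matches original: %s\n",
               memcmp(rebuilt, data[1], BSIZE) == 0 ? "yes" : "no");
        return 0;
    }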

RAID 1, mirroring, provides the most fault protection and has excellent performance characteristics, as no parity calculations are performed. Even though email storage is relatively write-heavy, serious performance gains can be obtained from RAID 1 systems that balance their reads between both disks storing any particular datum. When purchasing a RAID 1 system, the ability to balance reads over the mirrored pairs should be a requirement of any product under consideration. While RAID 1 consumes a great deal of disk space, with fully 50% of all disk storage being devoted to redundancy, it is the recommended solution when a single disk or small number of disks need to be aggregated and protected, as is the case for an email queue, temporary file storage, or small message store.

Although not recognized as an official classification by the RAID Advisory Board, many storage vendors offer a RAID 0+1 option. With this system, the disks are divided into two groups. One becomes a RAID 0 system, and the second group mirrors the first group. Thus disks are striped together, and then the group is mirrored. Again, this high-performance strategy works very well for relatively small numbers of disks.

Because any particular data operation on a RAID 3 system requires accessing all disks, RAID 3 is far more suitable for single-application workloads such as scientific computing or Hollywood-style special effects work than for email applications. On RAID 3 (and RAID 4) systems, the single parity disk often becomes a bottleneck for write-intensive applications such as email. Some vendor implementations of very-high-performance RAID 3 and 4 configurations should not be immediately discounted as storage systems for email applications, however. Typically, these systems have large amounts of NVRAM to buffer writes and RAM to act as a read cache, thereby masking the potential limitations of these two solutions. I've used both RAID 3 and RAID 4 solutions from specific vendors as part of high-performance email solutions, and these products are near the top of my personal list in terms of what they provide for the money. Nevertheless, unless the storage system is specially designed to perform well in write-heavy, random-access environments, it is best to avoid RAID 3 or RAID 4 configurations for email applications.

Once we consider disk systems that are too large to be built economically using RAID 1 or 0+1 systems, the most common and appropriate RAID Level for email applications will be RAID 5. As indicated earlier, it's almost universally true that the quality of the RAID solution is more important than the RAID Level. Assuming a quality implementation, RAID 5 is usually the most appropriate RAID Level for large-scale transactional storage. For RAID Levels 3, 4, and 5, one needs to allocate a certain number of parity disks or disk space for a given amount of data storage. Typically, I'd recommend allocating in the range of 1 parity disk (or equivalent space striped across all disks for a RAID 5 system) for every 5 to 13 data disks, depending on the system in question. The storage vendor should be able to help by providing some specific recommendations. In general, being conservative and sacrificing 20% of the total available disk space to avoid parity-induced slowdowns should be viewed as an investment, and a modest one at that.
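
For a quick sense of what those ratios cost in raw capacity, the sketch below computes the share of space devoted to parity for groups of 5, 9, and 13 data disks plus one parity disk; even the most conservative ratio stays below the 20% figure mentioned above.

    #include <stdio.h>

    int main(void)
    {
        int data_disks;

        /* One parity disk (or its striped equivalent) per group of N data disks. */
        for (data_disks = 5; data_disks <= 13; data_disks += 4)
            printf("%2d data + 1 parity: %4.1f%% of raw space devoted to parity\n",
                   data_disks, 100.0 / (data_disks + 1));
        return 0;
    }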

What is the threshold for a large-scale storage system? For requirements of 10 or fewer disks' worth of data, RAID 0+1 solutions usually make the most sense, as the number of disks is so small that the extra expense of mirroring isn't especially high. If 36 disks will be aggregated into a single storage system, then dividing them into six 6-disk RAID 5 groups probably would be recommended. Between these two limits, I would be loath to provide any specific recommendations. However, disks are generally cheap, so taking a conservative stance typically will not be as costly as being wrong in the other direction.

4.3.3 RAID Implementation

RAID systems can be classified as two distinct types: hardware RAID and software RAID. With software RAID, knowledge of which physical disk stores which data and how the data are protected is maintained by software that is part of the server's operating system itself. On a hardware RAID system, the operating system is merely aware of a large virtual storage device, and the device itself maintains the details of how the data are divided over the disks.

The primary argument in favor of a software RAID system is its low cost: It requires no additional hardware other than the disks and an enclosure to hold them. Further, on an email server the system's CPU is rarely heavily taxed, so it should have capacity available for performing RAID duties. These arguments often lead to an acceptable solution, but one should be very careful before blindly embarking down this path. First, using the main CPU to stripe and mirror data usually doesn't impose an enormous burden. On the other hand, requiring it to perform the more complex parity calculations and handle the extra writes needed for parity storage, such as those required for a RAID 5 system, is almost always a mistake. On I/O-intensive systems these calculations can be considerable, and they can result in noticeable performance degradation even on servers with very powerful processors. Offloading these calculations to a special-purpose processor on a controller card or RAID array is a good idea.

Second, while one might expect software RAID systems to be well-tuned applications implemented by their vendors with performance considerations of paramount importance, often this is not the case. Almost all vendors that sell RAID software also sell large, and expensive, storage systems. Carefully tuning their RAID software might reduce the amount of money they can make on these storage systems, so it might not be too surprising to find that the performance of their RAID software is not as strong as one might at first expect. This is not to say that these products are bad, just that the performance of servers running software RAID can prove disappointing. Improving the speed at which it runs might not be the vendor's top priority.

Hardware RAID systems come in two varieties: self-contained storage systems and controller cards to which a set of separately purchased disks are attached. Essentially, both are RAID controllers, but one sits inside the server and the other comes with a bunch of disks. As with most other computer systems, the quality of both types of products and their suitability for any particular purpose vary wildly. While the cliché "you get what you pay for" largely applies here, one occasionally stumbles across a system that defies its price tag, most commonly in the wrong direction. Nearly all "dirt cheap" RAID systems perform poorly, but not all expensive systems perform well.

As an example, not too long ago I hooked up a RAID system provided by a major UNIX server vendor with NVRAM and six disks into a 3 + 3 RAID 0+1 configuration to do some performance testing. Running mail queues on this arrangement delivered about 1.5 times the throughput of a single disk working alone, with the disk system and not the controller representing the bottleneck. This result is simply unacceptable: Anything less than 3 times the single disk performance would be embarrassing, and a very good RAID system should exceed that level. Under most circumstances I tend to purchase brand-name computer peripherals, usually from the server vendor, but RAID systems are often an exception to that rule. The "best bang for the buck" often comes from third-party equipment.

Of course, not all third-party RAID systems are good; in fact, quite the contrary. As science fiction author Theodore Sturgeon once said, "90% of everything is crud." This rule of thumb certainly applies to RAID systems. In any event, before embarking on a project requiring high-performance storage, allocate plenty of time to try out several pieces of equipment from several vendors before the project needs to go live. Don't expect to find a top-notch solution in the bargain bin, and test, test, test in as close to a real-world environment as possible. Chapter 8 describes testing considerations in more detail.

4.3.4 Evaluating RAID Systems

While no substitute for direct testing exists, one can use some objective measures to begin to evaluate RAID systems. Each storage system capable of hardware RAID must have a processor to perform parity calculations and move the data from the disks to the computer. The same processor may carry out both tasks, or one or more specialized processors may be devoted to each task. As in other areas, an apples-to-oranges comparison of clock speeds means little, but learning that one system uses a processor that another vendor reserves for a higher-end model can be telling. A vendor might also use special-purpose processors for parity calculations and other more familiar or general-purpose processors for data transfer. A system with generally more horsepower available for data movement is a good thing, although of course it is not sufficient to indicate a quality product.

Every quality RAID system should contain its own RAM for data caching and NVRAM to accelerate writes. How much of each the system can hold may be an indicator of its expected capabilities in terms of total throughput. At the very least, expansion capability can provide a potential upgrade path if disk I/O becomes a performance constraint again in the future. NVRAM is indispensable if RAID 3, 4, or 5 solutions are considered, because writes for these RAID Levels are much more resource-intensive than those for RAID Levels 0 and 1.

Other considerations include the following: How much thought has gone into eliminating throughput bottlenecks in the RAID system? Are all internal pipes large enough to deliver the projected maximum sustained throughput? Finally, don't overlook the obvious need to make sure that the connection between the storage system and its host is fast enough to handle the load. No matter how good the components are, no one can get 50MB per second of throughput from a RAID system connected by a SCSI-2 interface.

Evaluating RAID controller cards requires the same sort of careful testing as total RAID systems do. The quality of the processor, the amount of memory on the card, the types of caching performed by the card, and the protection of the data stored on the card if the machine crashes are all important factors that one should be able to evaluate from a specification sheet. The company's data sheet might list information such as how much CPU loading is required per megabyte transferred per second and actual sustained throughput numbers. However, real data need to be determined by testing, and ultimately these numbers are the ones that matter. Overall, RAID controller cards can prove useful for small disk systems, but if one plans to support more than six disks, then a complete RAID system will usually be a better solution.

Whether one uses RAID controller cards or regular SCSI cards to attach hardware RAID systems, there's always a danger of providing more disk I/O than a controller can handle. To prevent this problem, one can use any of several approaches. First, each RAID system, disk pack, or solid state disk (SSD) should have its own disk controller card on the host. It's almost always worthwhile to make sure this card is the best-performing piece of hardware that one can obtain. Second, do not mix high-speed, performance-sensitive storage and low-speed devices on the same SCSI bus. Even if the bus has enough bandwidth to handle both the disk that stores an email queue and a tape drive used to back up the system, this mixing is usually not a good idea. For some buses, the whole chain will sync down to the speed of the slowest device, even if it's never used, which can have disastrous consequences. Most controllers are smart enough to converse at different rates with different devices. Even in this case, however, the slow device will take up more than its fair share of the total available bandwidth when it runs, as it cannot take advantage of the bus's maximum speed. Therefore, data transfers involving this device will consume the bus's resources for a greater amount of time than desired.

Sometimes even a single RAID system or bus does not have enough I/O capacity to satisfy the demand. In these cases, multiple RAID systems must be used, and the application can be made aware that the data in question reside on multiple filesystems that are accessed through different directory paths. As a trivial example, this goal can be accomplished by modifying a POP daemon and LDA to look for mailbox names that begin with the letters a-m under the /mbox1 directory, and locating the n-z mailboxes under /mbox2. As an alternative, one can use software RAID striping to mask the fact that these multiple mount points exist. If each storage system uses RAID Level 5 to protect its data, then striping several of these systems together using software RAID is commonly called RAID 50, even though this RAID Level is not recognized by the RAID Advisory Board. Many variations on this theme are possible.
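
A minimal sketch of the mailbox-partitioning idea follows, assuming the hypothetical /mbox1 and /mbox2 mount points named above; a production system would more likely use a hash or a directory service than the first letter of the mailbox name.

    #include <ctype.h>
    #include <stdio.h>

    /* Route mailboxes beginning with a-m to /mbox1 and everything else,
     * including n-z, to /mbox2.  The mount points are the hypothetical
     * names used in the text. */
    static const char *mailbox_root(const char *mailbox)
    {
        int c = tolower((unsigned char)mailbox[0]);

        return (c >= 'a' && c <= 'm') ? "/mbox1" : "/mbox2";
    }

    int main(void)
    {
        printf("alice -> %s\n", mailbox_root("alice"));
        printf("zoe   -> %s\n", mailbox_root("zoe"));
        return 0;
    }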

As already mentioned, the use of software RAID to perform operations such as striping and mirroring will often provide a reasonably well-performing solution. It is therefore recommended for storage environments in which a single filesystem image is required, yet the I/O bandwidth to the devices exceeds that of a single high-end I/O controller, such as Fibre Channel or Ultra160 SCSI. In the email application realm, this case will arise only with truly mammoth systems.


