5.1 Mailing Lists | sendmail Performance Tuning

In their simplest form, mailing lists are single email addresses that represent more than one actual recipient, sometimes as many as hundreds of thousands of unique individuals. A mailing list server can never be certain when the next message for a given list might come in, but its goal will be to take a single incoming message for the list and send it out to all subscribers as quickly as possible. Messages might come in a flurry or a great deal of time may pass between them, but the server must be able to handle a large volume of messages in short order during times when the list heats up.

A mailing list server largely functions as a mail relay, as described earlier in this book, except that it sends much more outbound email than it receives. The same issues with queue performance arise here. We want cheap metadata operations, the safety of synchronous writes with asynchronous-like performance, and parallelism of as many operations as possible.

When a message is sent to multiple recipients, sendmail usually creates only one df file and one qf file for the whole lot of them. Consequently, it takes far fewer directory operations to send a message to 10 people than it does to send 10 messages to one person each. This works in our favor. Under some circumstances the envelope will be split, but these instances are relatively rare and don't affect general performance. If we can expediently handle the qf files with large numbers of recipients, then the few single-recipient qf files won't pose much of a problem.

With sendmail 8.12, one can force the envelope to split after a specified number of recipients by setting the "r" option for a queue group and assigning an appropriate value to it. Here is an example in which we limit the maximum number of recipients per envelope to 10 for the default queue:

 QUEUE_GROUP(`mqueue', `P=/var/spool/mqueue, r=10')

See the Sendmail Installation and Operation Guide, section 2.3.1 [ASA], for more information about setting up queue groups.

It is somewhat ironic that for very large mailing lists, the creation of more qf files will speed up delivery, because the DNS lookup, connection, and delivery for several recipients at once can be performed in parallel. Of course, achieving parallelism requires multiple concurrent queue runners, but that is straightforward to implement. Splitting does require more operations in the queue, but the increase is not directly proportional to the number of times the envelope is split. Multiple distinct qf files will exist, but there will be only one df file with multiple names (hard links). The same number of metadata operations is performed as if all the split envelopes concerned separate messages, but the contents of the (common) message body are written only once.
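To make the "one df file with multiple names" point concrete, here is a minimal Python sketch of the hard-link mechanism itself. It does not reproduce sendmail's code; the function name and the `dfDEMO` file names are hypothetical, chosen only for illustration, and the sketch assumes a filesystem that supports hard links.

```python
import os
import tempfile


def split_body_demo(n_envelopes):
    """Write one body file, then give it n_envelopes names via hard links.

    Returns (link_count, distinct_inodes) for the resulting names.
    """
    with tempfile.TemporaryDirectory() as queue_dir:
        # Hypothetical names; real sendmail queue IDs look different.
        first = os.path.join(queue_dir, "dfDEMO0")
        with open(first, "w") as body:
            body.write("message body, written once\n")

        names = [first]
        for i in range(1, n_envelopes):
            name = os.path.join(queue_dir, "dfDEMO%d" % i)
            # A new directory entry (metadata operation), but no new data blocks.
            os.link(first, name)
            names.append(name)

        inodes = {os.stat(name).st_ino for name in names}
        return os.stat(first).st_nlink, len(inodes)
```

Calling `split_body_demo(4)` yields four names that all refer to a single inode: each split costs a directory operation, while the body data is stored only once.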

Finding the optimal number of recipients for splitting the qf files is tricky. With large numbers of queue runners and small numbers of messages, we want to set the number of recipients per message as low as possible. This choice creates a large number of qf files in the queue, which we can process in parallel to obtain the maximum message throughput. Obviously, as the message rate increases, the number of concurrent messages in the queue will go up. At some point, we will run out of I/O bandwidth in accessing the queue, and the number of qf files will pass the threshold at which we can speed up delivery through increased parallelism. Instead, the overhead caused by the sheer number of qf files will start to slow throughput down; at this point and beyond, fewer queue entries would be better. This is a delicate act to balance, and no simple formula will yield the optimal number of recipients per message. Instead, this metric must be determined by observation.
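The trade-off can be roughed out with a little arithmetic. The sketch below is a simplified model, not something sendmail computes: it assumes every envelope is split purely on the recipient limit, and the function and parameter names are hypothetical.

```python
def envelopes_per_message(subscribers, max_rcpts):
    """qf files created for one list message (ceiling division)."""
    if subscribers <= 0 or max_rcpts <= 0:
        raise ValueError("subscribers and max_rcpts must be positive")
    return (subscribers + max_rcpts - 1) // max_rcpts


def queue_entries(subscribers, max_rcpts, concurrent_messages):
    """Approximate qf files in the queue when several list messages overlap."""
    return envelopes_per_message(subscribers, max_rcpts) * concurrent_messages
```

Under this model, a 50,000-subscriber list with a limit of 10 recipients produces 5,000 qf files per message, and 20 overlapping messages produce 100,000 queue entries. Raising the limit to 100 cuts both figures by a factor of 10, at the cost of less parallelism per message.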

To find the "sweet spot" in terms of the number of recipients per message, we might start with a relatively aggressive maximum number of recipients (the number 10, used in the configuration example, is quite aggressive) and see how it works. If the queue starts to back up, as represented by large numbers of qf files and slow access times in the queue, then raising this number is probably appropriate. Of course, this choice requires careful monitoring of the system, but scripts that count the number of qf files sitting in the queue are straightforward to write and are part of any vigilant server monitoring effort.
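A counting script of the kind just mentioned might look like the following minimal Python sketch. It assumes only that queue control files have names beginning with "qf" and that they may sit in subdirectories of the queue root; the function names and the threshold are hypothetical and would need to be tuned by observation.

```python
import os


def count_qf_files(queue_root):
    """Count files whose names start with 'qf' anywhere under queue_root."""
    total = 0
    # os.walk yields nothing for a missing or unreadable directory,
    # so the count is simply 0 in that case.
    for _dirpath, _dirnames, filenames in os.walk(queue_root):
        total += sum(1 for name in filenames if name.startswith("qf"))
    return total


def queue_is_backing_up(queue_root, threshold):
    """True when the qf count exceeds an operator-chosen threshold."""
    return count_qf_files(queue_root) > threshold
```

Run periodically (from cron, for instance) against a path such as /var/spool/mqueue, with the results logged over time, this gives a crude picture of whether the current recipient limit is letting the queue drain. Reading the queue directory typically requires root or equivalent privileges.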

After making an adjustment, we want to make sure a sufficient amount of time has elapsed so that we have a representative feel for the loading of the server. If queue processing hasn't slowed down due to the number of concurrent messages in the queue, perhaps an even more aggressive envelope splitting strategy would be appropriate. I wouldn't recommend setting this value any lower than 10 unless it takes longer to process each message than is tolerable for that list.

Another modification to the sendmail.cf file that can help directly is raising the value of the CheckpointInterval option. As a sendmail process tries to deliver messages to the recipients listed in the qf file, after a certain number of successful deliveries it will update the qf file to reflect the delivery status of the message to those users. By default, the value of CheckpointInterval is set to 10. This number can be raised, for example, by adding the following line to the .mc file:

 define(`confCHECKPOINT_INTERVAL', `100')

Updating the qf file requires metadata operations: a tf file is created with the updated information and then rename()ed to replace the existing qf file. Raising this parameter therefore lowers the number of operations performed while processing a single message bound for many recipients. There is a downside to this strategy. If a message is sent to a set of recipients but the machine crashes before the qf file is checkpointed, then when the server comes back up, those subscribers who were sent the message after the last checkpoint will receive it a second time. As inconvenient as receiving the same message twice might be, duplicates are not the same caliber of evil as lost messages. Note also that once the checkpoint interval exceeds the envelope splitting parameter, raising it further will change nothing.
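The interaction between the checkpoint interval and the per-envelope recipient limit can be sketched with simple arithmetic. This is a simplified model under stated assumptions, not sendmail's actual accounting: it assumes every delivery succeeds and that one qf rewrite happens after each full interval of deliveries. The function names are hypothetical.

```python
def checkpoint_rewrites(recipients, interval):
    """qf rewrites for one envelope, one per full interval of deliveries."""
    if recipients < 0 or interval <= 0:
        raise ValueError("recipients must be >= 0 and interval must be > 0")
    return recipients // interval


def worst_case_duplicates(recipients, interval):
    """Upper bound on recipients re-sent the message after a crash."""
    if recipients < 0 or interval <= 0:
        raise ValueError("recipients must be >= 0 and interval must be > 0")
    return min(recipients, interval)
```

In this model, an unsplit envelope of 1,000 recipients is rewritten 100 times at the default interval of 10, but only 10 times at an interval of 100, while the worst-case number of duplicate deliveries after a crash rises from 10 to 100. For an envelope split at 10 recipients, intervals of 100 and 500 both give zero rewrites, which is the "change nothing" case described above.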

As noted earlier, the multiple queues introduced in sendmail version 8.10 can help ease the congestion on busy email relays. On mailing list servers this ability isn't quite as useful. Instead of having many different queue entries over which one may parallelize the delivery load, we have a much smaller number of files with multiple recipients. However, some benefit can still be obtained by frequently rotating queues so that recently spawned queue runners don't have to work through a lot of temporarily undeliverable messages. The queue rotation interval might be quite frequent. In some cases, rotating the queue every few minutes might prove optimal. On servers that handle infrequent messages to very large numbers of recipients, one might even want to rotate the queue after each message. Some experimentation will help to strike the right balance.

A danger with frequent queue rotation for busy mailing lists is that if a clogged queue is rotated, it may take a very long time to drain. In the meantime, new messages coming in to a new, fast queue may be processed very quickly. As a consequence, mailing list messages will be delivered out of order. One cannot completely avoid this hazard, and anyone who has subscribed to an email mailing list for any length of time will have observed this behavior. This unavoidable problem is an intrinsic part of the email medium. Nevertheless, recipients may become confused if email messages that are part of a dialogue arrive out of order, so one should avoid this phenomenon to the extent possible. Usually, paying a little bit in performance to substantially reduce subscriber confusion is worthwhile. Does this mean that busy queues on mailing list servers shouldn't be rotated? No, it doesn't. However, every time a queue or a mailing list server becomes busy, a risk arises that the list's messages will be delivered out of order. Even though many smart MUAs will present the messages to the user in their proper order, this issue should be viewed as just one more reason to make every reasonable effort to avoid letting queues get badly clogged.

Many people have struggled with the prospect of improving mailing list performance. Descriptions of the trials and tribulations of managing a mailing list server using older versions of sendmail can be found in papers by Rob Kolstad [KOL97] and the crew at GNAC [CHK⁺98].