Filtering Junk Mail with Procmail and SpamAssassin

As a FreeBSD email server administrator, it behooves you to make some attempt to hold back the tide of that nemesis of the Internet: junk mail, or spam. Most modern email client applications provide their own internal spam filtering tools, using techniques such as Bayesian pattern matching and other semantics-based and contextual artificial intelligence tricks that let users "train" their email programs to better identify spam and shunt it away into a quarantine area. Still, these client-side techniques are only part of a complete solution to the spam problem, and another big piece of the puzzle can be provided by you at the server level. If you can prevent spam messages from even getting to users' mailboxes, that's so much less useless data they have to download each time they get their mail.

Most email clients can optionally identify spam based on headers set by the mail server. This means that if an email message arrives with a set of headers like the following, the mail program can immediately trust that it's spam and file it away in the Junk folder:

    From nobody@spammer.com Wed Jan 11 12:37:35 2006
    X-Spam-Flag: YES
    X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on mail.example.com
    X-Spam-Level: ******
    X-Spam-Status: Yes, score=6.3 required=5.0 tests=BAYES_99,DATE_IN_PAST_12_24,
            URIBL_SBL autolearn=no version=3.1.0
    X-Spam-Report:
            * 1.2 DATE_IN_PAST_12_24 Date: is 12 to 24 hours before Received: date
            * 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
            *     [score: 1.0000]
            * 1.6 URIBL_SBL Contains an URL listed in the SBL blocklist
            *     [URIs: premiumgoodsavailable.com]

These headers are set in the message by a package called SpamAssassin, which, in conjunction with a third-party local mail handler called Procmail, processes all incoming messages, either on a per-user or global basis, and applies its own set of Bayesian filters and learned contextual rules, as well as a plethora of other tests for suspicious characteristics.

Each scored line below X-Spam-Report: in the example headers shown here represents a test that the message failed, and the "weight" assigned to that test; for example, here, the Bayesian spam probability checks reported with almost complete certainty that the message is junk, and that test is worth a score of 3.5. That in itself isn't enough to make SpamAssassin set the X-Spam-Flag header to YES, because the default threshold for that action is 5.0. However, a couple of further tests failed as well: one, worth 1.2 points, is that the sending timestamp of the message is more than half a day prior to the time the message was received, which is a suspicious indicator; another, worth 1.6 points, is that the body of the message contains a URL that's known to be a spammer site and is thus present in a list available to SpamAssassin. Taken together, these tests add up to 6.3, which exceeds the 5.0 threshold required to mark the message as spam. As a rule, no single test is enough to flag a message as spam; it must fail multiple tests for that to happen. This keeps the probability of false positives very low.
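The arithmetic described above can be reproduced with a short script. This is only an illustrative sketch, not part of SpamAssassin: the function name `total_score` and the line-matching pattern are assumptions made for the example, and the comparison is treated as "score at or above the threshold."

```python
import re
from decimal import Decimal

# X-Spam-Report lines copied from the example headers shown earlier.
REPORT = """\
* 1.2 DATE_IN_PAST_12_24 Date: is 12 to 24 hours before Received: date
* 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
*     [score: 1.0000]
* 1.6 URIBL_SBL Contains an URL listed in the SBL blocklist
*     [URIs: premiumgoodsavailable.com]
"""

# A scored line looks like "* <points> <TEST_NAME> <description>".
# Bracketed detail lines carry no points and do not match.
SCORED_LINE = re.compile(r"^\*\s+(-?\d+(?:\.\d+)?)\s+([A-Z0-9_]+)\b")


def total_score(report):
    """Sum the points of every scored test line in a report (hypothetical helper)."""
    total = Decimal("0")
    for line in report.splitlines():
        match = SCORED_LINE.match(line.strip())
        if match:
            total += Decimal(match.group(1))
    return total


REQUIRED = Decimal("5.0")          # default threshold cited in the text
score = total_score(REPORT)        # 1.2 + 3.5 + 1.6
is_spam = score >= REQUIRED        # assumption: flagged at or above the threshold
print(score, is_spam)
```

Decimal arithmetic is used so that the one-decimal scores add up exactly rather than picking up binary floating-point noise.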

Note

Individual users can specify the threshold at which messages are flagged as spam; this is defined with the required_score keyword in the user_prefs file, which we will discuss shortly.

SpamAssassin alters the message's headers to reflect the results of its tests and then passes it back into the user's mailbox. Then, when the user's email program downloads the message and reads these headers (particularly X-Spam-Flag), it knows to trash the message right away so the user doesn't even see it.
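To make the client side of this concrete, here is a minimal sketch of how a mail program might inspect the X-Spam-Flag header, using Python's standard `email` package. The sample message (including its Subject line and body) and the helper name `is_flagged` are made up for illustration; real mail clients implement this check in their own ways.

```python
from email import message_from_string

# A made-up message carrying the kind of headers SpamAssassin adds.
RAW = """\
From: nobody@spammer.com
Subject: Premium goods available
X-Spam-Flag: YES
X-Spam-Level: ******
X-Spam-Status: Yes, score=6.3 required=5.0

Buy now.
"""


def is_flagged(raw_message):
    """Return True if the message carries X-Spam-Flag: YES (hypothetical helper)."""
    msg = message_from_string(raw_message)
    # Header names are matched case-insensitively by the email package.
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"


print(is_flagged(RAW))
```

A message that never passed through SpamAssassin has no X-Spam-Flag header at all, so the helper treats a missing header as "not spam."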

Additionally, users can opt to have spam that scores particularly high (above a certain threshold, for example 7.0 or higher) be moved to a holding area or immediately deleted on the server (a task handled by Procmail, as you will see later). This clears out the messages that are obviously junk so the user doesn't even have to waste time downloading them.

You can configure FreeBSD to perform this service for your users with just a few relatively simple steps. SpamAssassin can be enabled globally for all users on your system, or for just one user at a time as requested; it's best to start out by enabling it for just one user (for instance, yourself) to make sure it's working properly, before choosing to enable it globally for all users.

Note

SpamAssassin's daemon processes can also be taxing on your system's resources, particularly CPU time. You should consider whether your hardware has the cycles to spare before enabling SpamAssassin on your MTA.

Installing SpamAssassin

Install the SpamAssassin package from the ports or packages in the mail category. The package name is p5-Mail-SpamAssassin (it's actually a set of Perl modules). See Chapter 16 for more details on installing software.

When SpamAssassin is installed, a daemon called spamd that's part of the package must be enabled; to do this, add the following line to /etc/rc.conf:

spamd_enable="YES"

Next, ensure that the startup script called sa-spamd.sh is present in /usr/local/etc/rc.d and is set executable. The next time you reboot, SpamAssassin will be started automatically; alternatively, start it manually by typing /usr/local/etc/rc.d/sa-spamd.sh start.

You must next prepare each user's directory for SpamAssassin by creating a .spamassassin subdirectory and setting its ownership to the user whose mail you're filtering:

    # mkdir ~frank/.spamassassin
    # chown frank:frank ~frank/.spamassassin

The .spamassassin directory will contain the runtime files used by SpamAssassin for a given user, such as the Bayesian tokens "learned" over time that mark a message as "spam" or "not spam" (also known as "ham"), and the automatic whitelist file that helps protect legitimate messages from being incorrectly tagged. These files are created and maintained automatically by SpamAssassin and become more accurate the longer it's used.

Finally, create a file called user_prefs inside the .spamassassin directory, and give it the same ownership as the .spamassassin directory itself. This file defines the behavior of SpamAssassin for each user. Listing 25.2 shows the contents of a user_prefs file that will suit most users' needs; each user can modify it to his own taste as necessary. See man spamassassin for more details on what each of these keywords means.

Listing 25.2. A Sample SpamAssassin `user_prefs` Configuration File

    rewrite_subject 1
    report_header 1
    use_terse_report 1
    defang_mime 0
    report_safe 0
    use_bayes 1
    auto_learn 1
    ok_locales en

Installing Procmail

By itself, SpamAssassin does nothing to your mailbox. Email messages must be actively directed through the spamd server for it to process them. The tool that accomplishes this, and the post-processing that will be applied to each message after spamd is done with it, is a message processing program called Procmail, also available in the mail category of the ports or packages.

Procmail is a highly customizable utility that any user can use to filter messages into folders, redirect them into programs, or delete them outright, depending on the contents of each message. The functions of Procmail for each user are defined by the contents of the .procmailrc file in the user's home directory, and by the .forward file, which forwards all incoming messages into Procmail. Use the following line in .forward to send all of a user's mail through Procmail before it is delivered:

"|IFS=' '&&exec /usr/local/bin/procmail -f-||exit 75 #frank"

After this forwarding command is in place, create a .procmail directory in the user's home directory, using a similar technique to how you created the .spamassassin directory:

    # mkdir ~frank/.procmail
    # chown frank:frank ~frank/.procmail

This directory will hold Procmail's activity log and other related files that are created through its natural operation.
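If you later enable filtering for several users, the two per-user directories (.spamassassin and .procmail) can be created by a small script instead of by hand. The following is a rough sketch only; the function name `prepare_mail_dirs` and its parameters are hypothetical, and changing ownership requires running as root, just as the chown commands do.

```python
import os
import shutil


def prepare_mail_dirs(home, owner=None, group=None):
    """Create .spamassassin and .procmail under a home directory.

    Hypothetical helper: if owner is given, ownership is set the way
    'chown owner:group' would set it (this needs root privileges).
    """
    created = []
    for name in (".spamassassin", ".procmail"):
        path = os.path.join(home, name)
        os.makedirs(path, exist_ok=True)   # no error if it already exists
        if owner is not None:
            shutil.chown(path, user=owner, group=group)
        created.append(path)
    return created
```

For the user in this chapter, the call would look like `prepare_mail_dirs("/home/frank", owner="frank", group="frank")`, assuming the home directory lives under /home.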

Next, you must install an appropriate .procmailrc file for the user, defining the "recipe" for processing his incoming messages. Listing 25.3 shows a .procmailrc file that will accomplish our needs.

Listing 25.3. A Sample `.procmailrc` File for Harnessing SpamAssassin

    PATH=/usr/bin:/usr/local/bin:/usr/sbin:/usr/local/sbin:/home/frank
    VERBOSE=off
    MAILDIR=$HOME/mail
    DEFAULT=/var/mail/frank
    PMDIR=$HOME/.procmail
    LOGFILE=$PMDIR/log
    SHELL=/bin/sh
    FGREP=/usr/bin/fgrep
    FORMAIL=/usr/local/bin/formail
    LOGABSTRACT=all
    NL="
    "
    SPAM=$HOME/spam-folder

    #Spamassassin start
    :0fw: spamassassin.lock
    | /usr/local/bin/spamc
    #Spamassassin end

    :0:
    * ^X-Spam-Level: \*\*\*\*\*\*\*
    $SPAM
    #/dev/null

    # OK, guess we'll keep it
    :0:
    $DEFAULT

The first block of lines in this file represents environment variables that you're setting up for Procmail to have at its disposal; most aren't important for this task, but if you or your users want to customize the behavior of Procmail further to process incoming messages in still more creative ways, these variables can be very helpful. The most critical ones are those used later in the file, namely (in this case) the DEFAULT variable, which defines the default mail spool file. The SPAM line is also important; it defines the name of a folder (which is really just a plain text file) that will store all the messages that SpamAssassin determines are spam.

The two lines in the next block surrounded by comments are where Procmail feeds each incoming message to SpamAssassin. This is done using a command called spamc, which is a client that interfaces with the spamd daemon you enabled earlier. When Procmail encounters these lines, it forwards the message to spamc and waits for it to come back, tagged with headers that indicate whether or not it's spam.

Remote email client users might want to make this the end of their Procmail recipes: the X-Spam-Flag header is now set, and their email programs will be able to dispose of the spam messages based on that alone. However, the next block of lines gives you still more options for how to treat spam messages.

All messages coming back from SpamAssassin will have an X-Spam-Level header which expresses the spam "score" as a string of asterisks (as you saw in the example headers earlier). You can use a line such as * ^X-Spam-Level: \*\*\*\*\*\*\* to indicate that Procmail should perform an additional step of processing on all messages that contain an asterisk string at least that long. (The backslashes are important; they make Procmail treat the asterisks as literal characters rather than as regular-expression repetition operators, which would change what the pattern matches.) In this example, any message with a score of 7.0 or higher (expressed as a string of seven asterisks or more, which matches this regular expression no matter how many more asterisks there are) is redirected into the file called spam-folder in the user's home directory, rather than into his default mailbox. The user can then peruse the spam folder periodically and check for false positives if he's particularly worried. He can then delete the file when he's satisfied, or even set up a cron job to delete it on a regular basis, so that it doesn't eat up your disk space.
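The "seven or more asterisks" behavior is easy to check outside Procmail. The sketch below uses Python's `re` module as a stand-in for Procmail's matcher; the IGNORECASE and MULTILINE flags approximate how Procmail applies conditions to the header, and the helper names are invented for the example.

```python
import re

# Same pattern as the recipe condition; the backslashes make each
# asterisk a literal character. Unanchored at the end, so any
# additional asterisks after the seventh still match.
QUARANTINE = re.compile(r"^X-Spam-Level: \*\*\*\*\*\*\*",
                        re.IGNORECASE | re.MULTILINE)


def level_header(score):
    """Build an X-Spam-Level header: one asterisk per whole point (hypothetical helper)."""
    return "X-Spam-Level: " + "*" * int(score)


def goes_to_spam_folder(headers):
    """Return True if the recipe condition would match these headers."""
    return QUARANTINE.search(headers) is not None


for score in (6.3, 7.0, 9.8):
    print(score, level_header(score), goes_to_spam_folder(level_header(score)))
```

Note that the example message from earlier in this section, with a score of 6.3 and six asterisks, does not match and would therefore stay out of the spam folder.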

Tip

If a user doesn't want to deal with checking the spam folder for misidentified messages or periodically deleting it to save space, he can instead choose to have the spam messages above a certain threshold deleted outright instead of sent into a folder. Just comment out the $SPAM line and uncomment the next line, #/dev/null. This tells Procmail to send those spam messages straight to the bit-bucket.

Finally, the last two lines (which aren't strictly necessary) tell Procmail to return the messages to the default mailbox if they didn't fall through any of the trap doors set by SpamAssassin and your Procmail rules. Legitimate mail will thus get delivered properly to the user, as will messages whose SpamAssassin scores were high but not as high as the threshold value you defined (7.0). This means that the user will still download messages whose spam scores are between 5.0 and 7.0; though his mail program will still identify these as "spam" (because their X-Spam-Flag headers are still set to YES), the user can peruse them at leisure and check for false positives; there will be a lot fewer of them than the obvious, high-scoring spam that's now being harmlessly filtered out.
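The three possible outcomes for a message under this recipe can be summarized as a small decision function. This is a sketch under the thresholds used in this section (5.0 to flag, seven asterisks to quarantine); the function name `destination` and the returned labels are illustrative only.

```python
def destination(score, required=5.0, quarantine_stars=7):
    """Describe where a message with the given score ends up (hypothetical helper)."""
    stars = max(int(score), 0)          # asterisks in X-Spam-Level
    if stars >= quarantine_stars:
        return "spam-folder"            # caught by the X-Spam-Level recipe
    if score >= required:
        return "mailbox, flagged"       # delivered with X-Spam-Flag: YES
    return "mailbox"                    # delivered as ordinary mail


for score in (2.1, 6.3, 7.0):
    print(score, "->", destination(score))
```

Only the middle band reaches the user's mail program as flagged spam, which is why that band is the one worth reviewing for false positives.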

Tip

You can, if you choose, set up every user's mail to be filtered through SpamAssassin whether they want it to be or not. To do this, use the file /usr/local/etc/procmailrc instead of the per-user .procmailrc; Procmail reads recipes in that global file and sends all users' messages to the indicated destinations. See man spamassassin for more information on running SpamAssassin globally.
