2.6 SpamAssassin and the End User


The discussion so far in this chapter has focused on getting SpamAssassin to analyze incoming mail and mark spam by modifying the message before delivery. For end users who read their email on the server or download it with a POP or IMAP client, the final step is to take action on messages. Messages processed through SpamAssassin fall into one of the categories described in the next four sections.

2.6.1 True Negatives (ham)

True negatives are messages that both you and SpamAssassin agree are non-spam, or ham, messages. SpamAssassin modifies these messages only minimally: it adds an X-Spam-Status header beginning with the word "No" and an X-Spam-Checker-Version header giving the version of SpamAssassin in use. These messages look just as they should to a user's mail reader.

2.6.2 True Positives (spam)

True positives are messages that both you and SpamAssassin agree are spam. These messages are tagged by SpamAssassin. At minimum, SpamAssassin adds X-Spam-Level, X-Spam-Status, and X-Spam-Flag headers. If rewrite_subject is on, SpamAssassin also changes the subject of the message to begin with *****SPAM*****. Example 2-10 shows these headers.

Example 2-10. Headers added to spam by SpamAssassin

    Subject: *****SPAM***** Live your dream life!!                MPNWSTU
    X-Spam-Status: Yes, hits=12.9 required=5.0 tests=CLICK_BELOW,
            FORGED_MUA_EUDORA,FROM_ENDS_IN_NUMS,MISSING_OUTLOOK_NAME,
            MSGID_OUTLOOK_INVALID,MSGID_SPAM_ZEROES,NORMAL_HTTP_TO_IP,
            SUBJ_HAS_SPACES,SUBJ_HAS_UNIQ_ID autolearn=no version=2.60
    X-Spam-Flag: YES
    X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp)
    X-Spam-Level: ************
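Scripts that post-process mail can recover the verdict, score, and matched tests from the X-Spam-Status header. The following is a minimal Python sketch and is not part of SpamAssassin; the function name parse_spam_status and the layout of the returned dictionary are illustrative assumptions. It accepts either hits= (as in the 2.60 headers above) or score=, the name used by later SpamAssassin releases.

```python
import re
from email import message_from_string

# Sample message built from the headers in Example 2-10 (body is a placeholder).
SAMPLE = """\
Subject: *****SPAM***** Live your dream life!!                MPNWSTU
X-Spam-Status: Yes, hits=12.9 required=5.0 tests=CLICK_BELOW,
        FORGED_MUA_EUDORA,FROM_ENDS_IN_NUMS,MISSING_OUTLOOK_NAME,
        MSGID_OUTLOOK_INVALID,MSGID_SPAM_ZEROES,NORMAL_HTTP_TO_IP,
        SUBJ_HAS_SPACES,SUBJ_HAS_UNIQ_ID autolearn=no version=2.60
X-Spam-Flag: YES
X-Spam-Level: ************

(message body)
"""


def parse_spam_status(message):
    """Return a dict describing X-Spam-Status, or None if the header is absent.

    Hypothetical helper for illustration only.
    """
    raw = message.get("X-Spam-Status")
    if raw is None:
        return None
    # Unfold the header: collapse line breaks and runs of whitespace,
    # then remove the whitespace that folding left after commas.
    value = re.sub(r",\s+", ",", " ".join(raw.split()))
    # Collect the key=value fields (hits, required, tests, autolearn, ...).
    fields = dict(re.findall(r"(\w+)=(\S*)", value))
    score = fields.get("hits", fields.get("score"))
    required = fields.get("required")
    return {
        "is_spam": value.lower().startswith("yes"),
        "score": float(score) if score is not None else None,
        "required": float(required) if required is not None else None,
        "tests": [name for name in fields.get("tests", "").split(",") if name],
    }


status = parse_spam_status(message_from_string(SAMPLE))
print(status["is_spam"], status["score"], status["required"], len(status["tests"]))
```

The header is unfolded before parsing because the list of tests usually spans several continuation lines.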

Most people will want either to complain about spam to the spammer's ISP or to discard it. In the former case, simply being able to quickly identify spam messages on sight is usually sufficient, and the modified Subject header makes that simple. If the user is reading his mail on a system with the spamassassin script and applications for distributed spam clearinghouses, he can pipe the message to spamassassin --report to report the message to the clearinghouses.

In the latter case, that of wanting to discard spam, users can set up their personal mail filters to delete spam or save it to a "spam" mailbox that they can check now and then. Users on shell accounts with procmail might use the following recipes in their ~/.procmailrc file:

    :0
    * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
    /dev/null

    :0
    * ^X-Spam-Flag: YES
    spambox

The first recipe checks to see if the message has at least 15 asterisks in the X-Spam-Level header. These messages are very likely to be true positives and are discarded by delivering them to /dev/null. The second recipe catches all other messages that SpamAssassin considers spam (e.g., with a threshold of 5.0, scores from 5.0 up to but not including 15) and saves them to a separate mailbox file called spambox.

Users of POP mail clients can use their client's filtering capabilities. Nearly all modern POP mail clients provide the ability to filter messages based on strings contained in the Subject header, so spam can be redirected by checking the Subject for *****SPAM*****. Some POP clients provide greater control over filtering and allow checking arbitrary headers; these clients can do the equivalent of the preceding procmail recipes.
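For a client or script that can inspect arbitrary headers, the logic of the two procmail recipes can be sketched in Python as follows. This is an illustration under stated assumptions, not a feature of any particular mail client: the function name route and the destination labels "discard", "spambox", and "inbox" are hypothetical.

```python
from email import message_from_string

DISCARD_STARS = 15  # same cutoff as the first procmail recipe


def route(message):
    """Pick a destination for a message that SpamAssassin has already tagged.

    Hypothetical helper; returns "discard", "spambox", or "inbox".
    """
    level = (message.get("X-Spam-Level") or "").strip()
    flag = (message.get("X-Spam-Flag") or "").strip().upper()
    # At least 15 asterisks: treat as near-certain spam.
    if level.startswith("*" * DISCARD_STARS):
        return "discard"
    # Any other message that SpamAssassin flagged as spam.
    if flag.startswith("YES"):
        return "spambox"
    return "inbox"


high = message_from_string("X-Spam-Flag: YES\nX-Spam-Level: " + "*" * 17 + "\n\nbody\n")
low = message_from_string("X-Spam-Flag: YES\nX-Spam-Level: " + "*" * 12 + "\n\nbody\n")
ham = message_from_string("Subject: lunch?\n\nbody\n")
print(route(high), route(low), route(ham))
```

As with the procmail recipes, the order of the checks matters: the stricter X-Spam-Level test must come first, because every message it matches also carries X-Spam-Flag: YES.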

2.6.3 False Positives

False positives are the bane of all spam-checkers. A false positive occurs when SpamAssassin incorrectly marks a message as spam that you actually wanted to receive. Because of the potential for false positives, it's a good idea to encourage users to think of SpamAssassin's tags as advisory and to avoid discarding messages unseen on the basis of a spam classification by SpamAssassin. Instead, as illustrated in the earlier section on true positives, spam can be filtered to a special spam mailbox that the user can check periodically to ensure that it does not contain any false positives.

If you're reading email on a system that has the spamassassin script and you find a false positive, you can pipe the message through spamassassin --remove-markup to remove the SpamAssassin report and restore the message to its untagged state.

Identifying false positives and reporting them to SpamAssassin is key to improving SpamAssassin's Bayesian classifier. The Bayesian classifier is discussed in detail in Chapter 4.

2.6.4 False Negatives

A false negative is a missed spam. It occurs when SpamAssassin fails to tag a message as spam that you actually consider spam. The more false negatives you get, the less effective the spam-checking is in saving you time. You can reduce false negatives by lowering SpamAssassin's threshold score, but you will increase false positives at the same time. Keeping track of false negatives can help you find patterns that may let you tweak SpamAssassin's rules to match your environment more closely.
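The trade-off between the two kinds of error can be seen by counting them at several thresholds. The following Python sketch uses a small set of invented scores, hand-labeled as spam or ham purely for illustration; the data and the function name count_errors are assumptions, and a message is treated as spam when its score is at least the threshold.

```python
# Hypothetical (score, really_spam) pairs; not taken from any real corpus.
SCORED = [
    (0.1, False), (2.3, False), (4.2, False), (6.1, False),
    (3.5, True), (4.8, True), (7.9, True), (12.9, True), (21.0, True),
]


def count_errors(scored, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    false_positives = sum(
        1 for score, is_spam in scored if score >= threshold and not is_spam
    )
    false_negatives = sum(
        1 for score, is_spam in scored if score < threshold and is_spam
    )
    return false_positives, false_negatives


for threshold in (3.0, 5.0, 8.0):
    fp, fn = count_errors(SCORED, threshold)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

With this sample, lowering the threshold from 5.0 to 3.0 removes both false negatives but adds a second false positive; raising it to 8.0 does the reverse.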

As with true positives, if the user is reading her mail on a system with the spamassassin script and applications for distributed spam clearinghouses, she can pipe the message to spamassassin --report to report the message to the clearinghouses.

Identifying false negatives and reporting them to SpamAssassin is key to improving SpamAssassin's Bayesian classifier. The Bayesian classifier is discussed in detail in Chapter 4.

Measuring SpamAssassin's Performance

One of the ways that SpamAssassin's developers measure SpamAssassin's performance is by running SpamAssassin on large corpora of messages that are known to be spam or non-spam and measuring the rate of true and false positives and negatives at different thresholds (from -4 to 20) and with different features enabled. The results of these tests are distributed in the rules directory in files STATISTICS.txt (statistics without network or Bayesian tests), STATISTICS-set1.txt (statistics with network tests but no Bayesian tests), STATISTICS-set2.txt (statistics with Bayesian tests but no network tests), and STATISTICS-set3.txt (statistics with both network and Bayesian tests).

Here's an example of the contents of STATISTICS-set3.txt showing performance with a spam threshold of 5.0:

    # SUMMARY for threshold 5.0:
    # Correctly non-spam:  15550  46.59%  (99.90% of non-spam corpus)
    # Correctly spam:      17648  52.87%  (99.08% of spam corpus)
    # False positives:        15  0.04%  (0.10% of nonspam,   1133 weighted)
    # False negatives:       164  0.49%  (0.92% of spam,    437 weighted)
    # TCR: 74.527197  SpamRecall: 99.079%  SpamPrec: 99.915%  FP: 0.04%  FN: 0.49%

With those features and that threshold, SpamAssassin had a true positive rate of 99.08%, a true negative rate of 99.9%, a false positive rate of 0.1%, and a false negative rate of 0.92%.
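These rates follow directly from the four message counts in the summary. The Python sketch below recomputes them; the variable names are illustrative. The total cost ratio formula, spam messages divided by (5 x false positives + false negatives), is an assumption: it reproduces the reported TCR figure, but the summary itself does not state how TCR is calculated.

```python
# Counts taken from the threshold 5.0 summary.
true_negatives = 15550   # correctly non-spam
true_positives = 17648   # correctly spam
false_positives = 15
false_negatives = 164

ham_total = true_negatives + false_positives    # size of the non-spam corpus
spam_total = true_positives + false_negatives   # size of the spam corpus

tp_rate = 100 * true_positives / spam_total     # also reported as SpamRecall
tn_rate = 100 * true_negatives / ham_total
fp_rate = 100 * false_positives / ham_total
fn_rate = 100 * false_negatives / spam_total
precision = 100 * true_positives / (true_positives + false_positives)

# Assumption: each false positive is weighted as five false negatives.
ASSUMED_FP_WEIGHT = 5
tcr = spam_total / (ASSUMED_FP_WEIGHT * false_positives + false_negatives)

print(f"TP {tp_rate:.2f}%  TN {tn_rate:.2f}%  FP {fp_rate:.2f}%  FN {fn_rate:.2f}%")
print(f"SpamPrec {precision:.3f}%  TCR {tcr:.3f}")
```

The rates in the summary's parentheses are measured against each corpus separately (spam or non-spam), while the percentages in the second column are fractions of all messages tested.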
