SpamAssassin's Bayesian classifier learns to distinguish the features that characterize spam from those that characterize non-spam in the messages that you receive. Properly trained, the Bayesian classifier can reduce both false positives and false negatives.
Bayesian filtering is based on Bayes' Theorem, a statement of probability theory propounded by the Reverend Thomas Bayes in 1763. Bayes' Theorem is important in many fields where classifying data is essential, including computer vision, psychophysics, and diagnostic decision-making in health care. SpamAssassin's implementation is mostly based on the work of Paul Graham (archived at http://www.paulgraham.com) and Gary Robinson (http://www.garyrobinson.net).
Conceptually, Bayes' Theorem states that the probability of some event (such as a message being spam) given a test result (such as matching a spam-checking rule) depends on the baseline probability of the event before the test result is known and on the discriminating power of the test. A corollary is that the discriminating power of a test can be measured by comparing the probability of the event given a known test result to the baseline probability before the result is known. The more the test result can increase (or decrease) the probability from baseline, the stronger the test.
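In symbols, the standard statement of the theorem reads as follows, where \(S\) is the event "the message is spam" and \(T\) is the event "the test matched." (This is the textbook form of the theorem, not SpamAssassin's exact internal formula.)

```latex
P(S \mid T) \;=\; \frac{P(T \mid S)\,P(S)}{P(T)}
\;=\; \frac{P(T \mid S)\,P(S)}{P(T \mid S)\,P(S) \;+\; P(T \mid \neg S)\,P(\neg S)}
```

The ratio \(P(S \mid T)/P(S)\) is exactly the "discriminating power" described above: the further it is from 1, the stronger the test.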
In the context of spam-checking, a Bayesian approach amounts to developing potential rules and asking how much each rule, if matched, should change the system's perception of the likelihood that a message is spam. Very strong rules come in two forms. Some are patterns that only occur in spam (and never in non-spam), thus yielding a high probability that a message that matches one of the patterns is spam. Others are patterns that only occur in non-spam (and never in spam), thus yielding a low probability that a message that matches the pattern is spam. Weaker rules (patterns found in both spam and non-spam messages, but with different frequencies) result in less extreme probabilities.
To use Bayesian filtering successfully, you must have a corpus of messages that you have decided are definitely spam, a corpus of messages that you have decided are definitely non-spam, and an algorithm for analyzing the two sets of messages to develop rules and test their strength. SpamAssassin provides the algorithm and a script that you can use to identify messages as spam or non-spam in order to train the filter. It also provides a mechanism for training itself with messages that are very likely to be spam or non-spam.
The results of the SpamAssassin learning process are a set of databases. One database contains tokens (strings of 3-15 characters) that have been seen, how often each has been seen in spam and non-spam messages, and the date and time that each token last proved useful in classifying a message. During learning, tokens are derived from both the message headers (with several commonly misleading headers ignored) and message body. Tokens that haven't been useful in a long time may be removed from the database to increase efficiency. Another database keeps track of which messages have been learned, so SpamAssassin doesn't waste time relearning old messages.
During spam-checking, a message to be checked is split into tokens. SpamAssassin then looks up each token in the token database. Up to 150 of the most diagnostic tokens in the message are identified, and their associated predictive values are combined using one of two mathematical functions to yield a final prediction of the probability that the message is spam. This predicted probability is matched by special SpamAssassin rules that associate probability ranges with spam score modifiers.
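SpamAssassin's actual combining function is a chi-squared approach due to Robinson, which is more involved than can be shown briefly. As a toy sketch of the underlying idea, here is Graham-style naive Bayesian combining of per-token probabilities, written in shell. The `combine` function and the probability values are purely illustrative; this is not SpamAssassin code.

```shell
#!/bin/sh
# Toy illustration of combining per-token spam probabilities.
# Each argument is P(spam | token) for one token; the result is
# the Graham-style combined probability that the message is spam.
combine() {
    awk -v probs="$*" 'BEGIN {
        n = split(probs, p, " ")
        num = 1; den = 1
        for (i = 1; i <= n; i++) {
            num *= p[i]          # product of the spam probabilities
            den *= (1 - p[i])    # product of the ham probabilities
        }
        printf "%.4f\n", num / (num + den)
    }'
}

combine 0.99 0.95 0.90   # three strongly spammy tokens
combine 0.99 0.05 0.50   # mixed evidence pulls the result down
```

Note how a single very hammy token (0.05) substantially offsets a very spammy one, which is why strong ham evidence in a message matters as much as strong spam evidence.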
SpamAssassin's Bayesian classifier is controlled by more than a dozen configuration directives, though only a few are regularly modified by system administrators. These are the most useful:
The following directives influence the internal workings of the Bayesian classifier. For the most part, they can be left to the default settings.
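As a point of reference, a minimal setup usually touches only a handful of directives. The directive names in the following local.cf sketch are real, but the values shown are illustrative (they match common defaults), not recommendations:

```
# Enable the Bayesian classifier
use_bayes 1

# Let SpamAssassin train itself on messages whose scores are
# clearly hammy or clearly spammy
bayes_auto_learn 1

# Score thresholds for automatic learning (illustrative values)
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam    12.0

# Filename prefix for the Bayesian databases (example path)
bayes_path /home/user/.spamassassin/bayes
```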
There are two main strategies for training a Bayesian classifier: train everything and train-on-error. In the train everything strategy, you train the classifier with every message that you receive. This strategy is highly responsive to changes in spam patterns but may change too quickly in response to unrelated variability in messages. In addition, it is resource intensive to scan every message. In the train-on-error strategy, you train the classifier only with messages that it has previously classified incorrectly (i.e., false positives and false negatives). This strategy is resource efficient but may not train the classifier as quickly when spam patterns change.
Based on experiments conducted by Greg Louis (and described at http://www.bgl.nu/bogofilter/), the train everything strategy appears to be more efficient for initial training. Once a suitable number of messages have been learned, however, switching to a train-on-error approach saves resources, because many fewer messages must be trained. Louis suggests that switching to train-on-error after 10,000 spam and 10,000 non-spam messages have been learned may be reasonable. You can train SpamAssassin's Bayesian classifier with either strategy.
The sa-learn script is your primary interface for training the Bayesian classifier. The first step in using Bayesian filtering is collecting a corpus of messages you've received that you have verified are spam and a corpus that you've verified are non-spam. The easiest and best way to do so is to simply start saving spam you receive to one folder and any non-spam messages that you would ordinarily delete to another. The two collections of messages can either be in maildir format (in which each file contains a single message) or mbox format (in which a single file contains multiple messages).
It's important that the messages be from the same time period; if you train SpamAssassin with a set of spam messages from 2003 and a set of non-spam messages from 2004, it will quickly learn that an effective way to detect spam is to look for messages in 2003! Similarly, forwarded spam, or messages discussing spam in your corpus ("Hey, look at this spam I just got; it's really strange. Here it is...") can result in the classifier learning artificial rules that will degrade its accuracy with normal messages.
Next, run sa-learn on each corpus, using either the --spam or --ham command-line options to specify what each corpus represents. Example 4-1 shows the process for a set of mbox files: a file of saved spam, a file of saved (non-spam) messages related to a project, and the user's mail spool. The project files and mail spool files together form a corpus of known good messages. This example assumes that each user maintains her own Bayesian databases, so sa-learn is run by each user on her own messages.
Example 4-1. Learning from a set of mbox files
$ ls -F Mail
spam  myproject
$ sa-learn --mbox --spam Mail/spam
$ sa-learn --mbox --ham Mail/myproject
$ sa-learn --mbox --ham /var/spool/mail/$LOGNAME
Example 4-2 shows the process for a set of maildirs, again assuming that each user has his own Bayesian databases. The commands in the example are those that would be executed by each individual user. Providing a directory as an argument to sa-learn causes it to learn from every file in that directory. The example also illustrates the use of the --no-rebuild option to defer rebuilding of the databases until the --rebuild option is used. When performing learning on a large set of small files (the very essence of a maildir), deferring the expensive database-rebuilding step is more efficient than rebuilding after each file.
Example 4-2. Learning from a set of maildirs
$ ls -F mail
INBOX/  spam/  myproject/
$ sa-learn --no-rebuild --spam mail/spam
$ sa-learn --no-rebuild --ham mail/INBOX
$ sa-learn --no-rebuild --ham mail/myproject
$ sa-learn --rebuild
If you're the sort who likes to see the progress of the training (or who worries when you run a command that takes longer than a few seconds to finish), you can add the --showdots option to cause sa-learn to print a period for each message it processes.
You can also call sa-learn on an individual file containing a mail message, or you can pipe a mail message to sa-learn's standard input. Finally, you can put the names of mailboxes, files, or directories into a file and run sa-learn with the --folders=filename option; it will read the names from filename and learn from each.
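For instance, a target list might be used like this (the list filename and mailbox names are made up for illustration):

```
$ cat my-spam-sources
Mail/spam
Mail/old-spam
$ sa-learn --mbox --spam --folders=my-spam-sources
```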
If you mistakenly train the Bayesian classifier that a message is spam, simply direct sa-learn to relearn it as ham; if you mistakenly learn a message as ham, you can direct sa-learn to relearn it as spam. This process is also how you later train the classifier on errors. You can also cause SpamAssassin to forget a message entirely by running sa-learn --forget on the message.
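For example, to reverse a training mistake on a single saved message (the filename here is hypothetical):

```
$ sa-learn --ham mistaken-message      # relearn a message wrongly trained as spam
$ sa-learn --forget mistaken-message   # or remove it from the databases entirely
```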
sa-learn also accepts the same --configpath /path/to/ruleset/directory , --prefspath /path/to/user_prefs , and --siteconfigpath /path/to/sitewide/directory directives that the spamassassin script does. They are described in Chapter 2.
4.2.4 Daily Use
When you first enable the Bayesian classifier in SpamAssassin, you will initially notice little change in the way messages are checked for spam. Once you've trained the classifier with enough messages, however, your spam scores for messages will begin to change substantially in two ways:
4.2.4.1 Ongoing training
Ongoing training is essential to maintaining the performance of a Bayesian filter. As in initial training, you must continue to provide examples of both spam and non-spam messages.
As you receive messages, check each message classified as spam to be sure that it is really spam and not a false positive. If the message's spam score is higher than the threshold for automatic learning, the message should have already been fed back into the classifier to train it. You can determine if this has happened by looking at the autolearn= section of the X-Spam-Status header added by SpamAssassin. If the message's spam score wasn't high enough for automatic learning, submit it to sa-learn --spam yourself. If you come across a false positive, submit it to sa-learn --ham instead.
Similarly, you can submit your non-spam messages to sa-learn --ham if their spam scores are too high for the automatic learning threshold for ham. Any spam SpamAssassin misses should definitely be submitted to sa-learn --spam .
You can make the ongoing training process more convenient in one of two common ways. If you read your email with an email client that allows you to bind commands to keys, you could define keystrokes to invoke sa-learn --ham or sa-learn --spam on the current message. Another approach is to save all spam messages into a single mail folder and all non-spam messages that you plan to delete into a second folder, and then run sa-learn on each folder (and possibly on your inbox if you keep many undeleted messages there) at the end of your mail-reading session. Users or system administrators can set up cron jobs to automate this process.
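A nightly cron job for the folder-based approach might look like the following crontab entries. The folder names and times are illustrative; note that the spam folder is emptied after learning so messages aren't relearned.

```
# Learn nightly from the saved folders, then empty the spam folder
30 23 * * * sa-learn --mbox --spam $HOME/Mail/spam && cp /dev/null $HOME/Mail/spam
45 23 * * * sa-learn --mbox --ham $HOME/Mail/to-delete
```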
4.2.4.2 Expiration and importing
Expiration and importing are two other functions of sa-learn that you will use infrequently. Expiration removes old tokens from the database, and importing updates the database if a new SpamAssassin release changes database formats.
As discussed earlier in this chapter, when bayes_auto_expire is enabled (the default), SpamAssassin's Bayesian classifier regularly reviews its database of tokens to determine if any should be expired. Expiration is always skipped when fewer than 100,000 tokens are in the database. The automatic expiration process runs no more than once every 12 hours and only when the number of tokens exceeds bayes_expiry_max_db_size.
If you do not use bayes_auto_expire, or if you want to expire tokens manually, you can force an expiration attempt by running sa-learn --force-expire. Doing so may not actually expire any tokens; for example, if the database contains fewer than 100,000 tokens, or if all tokens have been used recently, no tokens will be expired.
The sa-learn --import command is used to update the Bayesian databases from their format in an older version of SpamAssassin to the current format. The release notes for new versions of SpamAssassin should tell you when running sa-learn --import is necessary. In many cases, SpamAssassin will perform importation when it automatically learns a new message, so this command may not be necessary.
4.2.5 Storing Bayesian Data in SQL
SpamAssassin 3.0 can optionally store per-user Bayesian data in an SQL database, which is useful when users don't have accounts on the mail server. To store Bayesian data in SQL, you must install the DBI Perl module and an appropriate driver module for your SQL server. Common choices are DBD-mysql (for the MySQL server), DBD-Pg (for the PostgreSQL server), and DBD-ODBC (for connection to an ODBC-compliant server).
You should create a database and a user with privileges to access it. You must then create a set of tables in the database to store the Bayesian data. The SpamAssassin source code includes schemas for MySQL, PostgreSQL, and SQLite tables in the sql subdirectory. Here is the MySQL schema:
CREATE TABLE bayes_expire (
  username varchar(200) NOT NULL default '',
  runtime int(11) NOT NULL default '0',
  KEY bayes_expire_idx1 (username)
) TYPE=MyISAM;

CREATE TABLE bayes_global_vars (
  variable varchar(30) NOT NULL default '',
  value varchar(200) NOT NULL default '',
  PRIMARY KEY (variable)
) TYPE=MyISAM;

INSERT INTO bayes_global_vars VALUES ('VERSION','2');

CREATE TABLE bayes_seen (
  username varchar(200) NOT NULL default '',
  msgid varchar(200) binary NOT NULL default '',
  flag char(1) NOT NULL default '',
  PRIMARY KEY (username,msgid),
  KEY bayes_seen_idx1 (username,flag)
) TYPE=MyISAM;

CREATE TABLE bayes_token (
  username varchar(200) NOT NULL default '',
  token varchar(200) binary NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (username,token)
) TYPE=MyISAM;

CREATE TABLE bayes_vars (
  username varchar(200) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  last_expire int(11) NOT NULL default '0',
  last_atime_delta int(11) NOT NULL default '0',
  last_expire_reduce int(11) NOT NULL default '0',
  PRIMARY KEY (username)
) TYPE=MyISAM;
For each user, these tables maintain information about token expiration (bayes_expire), messages seen (bayes_seen), tokens seen (bayes_token), and per-user configuration variables (bayes_vars). A table for global configuration variables (bayes_global_vars) is also available. The column names in these tables are similar to the corresponding SpamAssassin configuration variables and indicate the data they store.
To configure SQL support for Bayesian data, set the following configuration parameters in your systemwide configuration file (local.cf):
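The usual parameters are shown below. The directive names are real, but the DSN, database name, and credentials are placeholders that you must replace with your own:

```
# Use the SQL storage module for Bayesian data
bayes_store_module   Mail::SpamAssassin::BayesStore::SQL

# DBI connection string, database user, and password (placeholders)
bayes_sql_dsn        DBI:mysql:spamassassin:localhost
bayes_sql_username   sqluser
bayes_sql_password   sqlpassword
```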
SpamAssassin will now store Bayesian data learned from messages (either automatically or via sa-learn ) in the SQL database and will look up tokens in this database when checking messages for a user.
SpamAssassin provides one additional configuration variable for SQL storage of Bayesian data:
4.2.6 A Sitewide Bayesian Classifier
Bayesian filtering is most effective when each user maintains his own set of token databases trained from his own email. By learning about the peculiar characteristics of spam and non-spam messages received by an individual user, the Bayesian classifier becomes an effective test for future messages to that user. A pharmacist might receive a lot of legitimate email about sildenafil citrate, and having all of these messages tagged as spam (or worse) could be a serious problem.
Many sites, however, prefer to have a single set of databases for all users at the site, either to save disk space or because users do not have home directories and setting up SpamAssassin 3.0's SQL storage is infeasible. Setting up a sitewide Bayesian classifier is possible with SpamAssassin. Perform the following steps:
One solution for enabling users to submit spam messages for training is to ask users to bounce any spam they receive to a central mailbox that can be processed by a privileged script. For example, set up an email alias of spamtrap on the SpamAssassin system that pipes incoming messages to a script like that shown in Example 4-3. As an extra benefit, you can publicize the spamtrap address on public web pages or in Usenet postings and actually use it as a spam trap: spammers who harvest the address and send spam to it will find their spam fed into your learning and reporting systems.
Example 4-3. A sitewide script for learning spam
#!/bin/sh
#
# This script accepts an email message on its standard input
# and feeds it to SpamAssassin's learning and/or reporting systems.
# It is meant to be run as root or as the user who owns the
# SpamAssassin Bayesian databases.

PATH=/bin:/usr/bin:/sbin:/usr/sbin

# Three choices:
#
# 1. Uncomment the following line to use --report if
#    you have bayes_learn_during_report enabled.
spamassassin --report

# 2. Uncomment the following lines to use sa-learn and
#    spamassassin --report when you don't have
#    bayes_learn_during_report enabled.
#sa-learn --spam
#spamassassin --report

# 3. Uncomment the following line to use sa-learn alone.
#sa-learn --spam
A similar solution for non-spam messages is much more difficult, for social rather than technical reasons. Users may well be reluctant to forward their legitimate email to any central address. Unfortunately, without a good corpus of non-spam messages, the Bayesian filter will not perform well. One possible approach is to raise the bayes_auto_learn_threshold_nonspam slightly (e.g., to 0.5 or 1.0) so that much legitimate email will be auto-learned.