# 4.2 Bayesian Filtering

 ‚  < ‚  Day Day Up ‚  > ‚

SpamAssassin's Bayesian classifier learns to distinguish the features that characterize spam from those that characterize non-spam in the messages that you receive. Properly trained, the Bayesian classifier can reduce both false positives and false negatives .

#### 4.2.1 Principles

Bayesian filtering is based on Bayes' Theorem, a statement of probability theory propounded by the Reverend Thomas Bayes in 1763. Bayes' Theorem is important in many fields where classifying data is essential, including computer vision, psychophysics, and diagnostic decision-making in health care. SpamAssassin's implementation is mostly based on the work of Paul Graham (archived at http://www.paulgraham.com) and Gary Robinson (http://www.garyrobinson.net).

Conceptually, Bayes' Theorem states that the probability of some event (such as a message being spam) given a test result (such as matching a spam-checking rule) depends on the baseline probability of the event before the test result is known and on the discriminating power of the test. A corollary is that the discriminating power of a test can be measured by comparing the probability of the event given a known test result to the baseline probability before the result is known. The more the test result can increase (or decrease) the probability from baseline, the stronger the test.

 Actually, SpamAssassin's "Bayesian" system doesn't really compute the baseline probability or frequency of spam versus non-spam messages ‚ which some have argued means it's not strictly Bayesian at all. Instead it assumes values that seem reasonable and useful.

In the context of spam-checking, a Bayesian approach amounts to developing potential rules and asking how much each rule, if matched, should change the system's perception of the likelihood that a message is spam. Very strong rules come in two forms. Some are patterns that only occur in spam (and never in non-spam), thus yielding a high probability that a message that matches one of the patterns is spam. Others are patterns that only occur in non-spam (and never in spam), thus yielding a low probability that a message that matches the pattern is spam. Weaker rules ‚ patterns found in both spam and non-spam messages but with different frequencies ‚ result in less extreme probabilities.

To use Bayesian filtering successfully, you must have a corpus of messages that you have decided are definitely spam, a corpus of messages that you have decided are definitely non-spam, and an algorithm for analyzing the two sets of messages to develop rules and test their strength. SpamAssassin provides the algorithm and a script that you can use to identify messages as spam or non-spam in order to train the filter. It also provides a mechanism for training itself with messages that are very likely to be spam or non-spam.

The results of the SpamAssassin learning process are a set of databases. One database contains tokens (strings of 3-15 characters ) that have been seen, how often each has been seen in spam and non-spam messages, and the date and time that each token last proved useful in classifying a message. During learning, tokens are derived from both the message headers (with several commonly misleading headers ignored) and message body. Tokens that haven't been useful in a long time may be removed from the database to increase efficiency. Another database keeps track of which messages have been learned, so SpamAssassin doesn't waste time relearning old messages.

During spam-checking, a message to be checked is split into tokens. SpamAssassin then looks up each token in the token database. Up to 150 of the most diagnostic tokens in the message are identified, and their associated predictive values are combined using one of two mathematical functions to yield a final prediction of the probability that the message is spam. This predicted probability is matched by special SpamAssassin rules that associate probability ranges with spam score modifiers.

#### 4.2.2 Configuration

SpamAssassin's Bayesian classifier is controlled by more than a dozen configuration directives, though only a few are regularly modified by system administrators. These are the most useful:

use_bayes

This directive controls whether the Bayesian classifier is used at all. It defaults to 1 (use Bayesian filtering). By setting it to 0, Bayesian filtering is disabled completely.

bayes_auto_learn, bayes_auto_learn_threshold_nonspam, bayes_auto_learn_threshold_spam

These directives configure the automatic learning system, which automatically feeds messages with very high or very low spam scores to the Bayesian classifier. The bayes_auto_learn directive enables (1) or disables (0) this feature; it is enabled by default. The threshold directives determine which messages will be automatically learned as spam or non-spam. Messages with spam scores lower than bayes_auto_learn_threshold_nonspam are learned as non-spam; this value defaults to 0.1. Messages with spam scores higher than bayes_auto_learn_threshold_spam are learned as spam; this value defaults to 12 and cannot be set lower than 6. The spam score used for making this determination does not include modifiers for the Bayesian system itself, for the autowhitelist, or for user -configured whitelists or blacklists .

This directive tells the Bayesian classifier to ignore the given header when learning or classifying messages. It is most often used when another spam-tagging system adds headers before SpamAssassin receives the message, in order to prevent the classifier from learning the other spam tag instead of the features of the actual message.

This directive prevents Bayesian classification and learning from being performed on messages sent from address and is a form of whitelisting. It's most useful when you want to receive messages from a few senders and the messages may include tokens that would otherwise suggest spam.

You can use multiple bayes_ignore_from directives or multiple addresses in a single directive to whitelist several addresses. You can also use as asterisk (*) as a wildcard for zero or more characters and a question mark (?) as a wildcard for zero or one character, much as you would to specify filename patterns in a shell.

This directive prevents Bayesian classification and learning from being performed on messages sent to address , and is a form of whitelisting recipients. It's useful in sitewide Bayesian filtering to prevent any learning from being performed from messages sent to postmaster , for example, who is likely to receive forwarded spam, non-spam messages discussing spam, etc. Specify addresses as you would to the bayes_ignore_from directive discussed previously.

bayes_learn_during_report

When this directive is enabled (1), messages that are reported to clearinghouses as spam with the spamassassin --report command are also learned as spam by the Bayesian classifier. This saves you an extra learning step. Set the directive to 0 to disable this feature. It is enabled by default.

bayes_path and bayes_file_mode

By default, SpamAssassin maintains separate Bayesian databases for each user on the system. The databases for a user are stored in the .spamassassin subdirectory of the user's home directory and their names begin bayes_ , such as bayes_seen and bayes_toks . These files are kept in one of several possible database formats (Berkeley DB format is generally preferred when it's available to SpamAssassin).

Separate databases for each user are ideal for Bayesian learning because different users may receive different kinds of spam and non-spam messages. However, it is often necessary to maintain a single Bayesian database for all users of a system, either to save on disk space or because users don't have home directories. You can configure a systemwide Bayesian database set by setting the bayes_path directive to the full path of the Bayesian database file prefix. For example, to set up systemwide Bayesian databases in the files /etc/mail/spamassassin/bayes_* , use the following directive:

` bayes_path /etc/mail/spamassassin/bayes `

By default, the Bayesian databases are created with mode 0700. The bayes_file_mode directive can be used to set a different file mode (e.g., 0770) if you need to share the databases among a group . This might be necessary if SpamAssassin can be invoked with the privileges of different users. Care should be taken with this directive, as a malicious user with access to the Bayesian databases can cause legitimate email to be mistagged as spam.

The following directives influence the internal workings of the Bayesian classifier. For the most part, they can be left to the default settings.

bayes_min_ham_num and bayes_min_spam_num

These directives set the minimum number of ham (non-spam) and spam messages that must be learned by SpamAssassin before it will use the predictions of the Bayesian classifier to score new messages. They default to 200 each; until 200 ham and 200 spam messages have been learned, the SpamAssassin rules that rely on the Bayesian classifier will not be applied to email.

bayes_use_hapaxes

Hapaxes are tokens that have been seen only once during learning so far. Accordingly, SpamAssassin's concept of whether a hapax is associated with spam or ham is based on limited data and may not be reliable. On the other hand, SpamAssassin can learn hundreds or thousands of hapaxes, and using hapaxes seems to provide better accuracy, so this setting defaults to 1 (enabled).

bayes_use_chi2_combining

This directive controls which of the two mathematical functions are used to combine token probabilities into an overall message probability. When enabled (1), the approach is based on the distribution of the chi-squared statistic; when disabled (0), a so-called "na ƒ ve Bayesian" function combines the probabilities using the assumption that errors in classification from each token are independent of one another. SpamAssassin's maintainers have found the chi-squared method more useful, and it is the default.

bayes_auto_expire and bayes_expiry_max_db_size

When bayes_auto_expire is enabled (1), SpamAssassin will automatically attempt to remove old tokens during learning when the token database exceeds bayes_expiry_max_db_size tokens. This is the default. When disabled (0), token expiration must be performed manually. Automatic expiration occurs no more than once every 12 hours.

bayes_learn_to_journal and bayes_journal_max_size

When bayes_learn_to_journal is enabled (1), SpamAssassin will store newly learned data in a journal file, rather than directly into the Bayesian databases. The journal file will be synchronized into the databases at least daily, or when the journal exceeds bayes_journal_max_size bytes (102,400 by default). Using journaling reduces disk contention for the databases, which must be exclusively locked while being updated, but results in a delay between the time a message is learned and the time the learned tokens can be used to classify further messages. Journaling might be particularly useful if the journal could be kept in a different location than the databases (e.g., on a RAM disk), but this directive is not supported as of SpamAssassin 3.0. bayes_learn_to_journal is disabled by default.

#### 4.2.3 Training

There are two main strategies for training a Bayesian classifier: train everything and train-on-error. In the train everything strategy, you train the classifier with every message that you receive. This strategy is highly responsive to changes in spam patterns but may change too quickly in response to unrelated variability in messages. In addition, it is resource intensive to scan every message. In the train-on-error strategy, you train the classifier only with messages that it has previously classified incorrectly (i.e., false positives and false negatives). This strategy is resource efficient but may not train the classifier as quickly when spam patterns change.

Based on experiments conducted by Greg Louis (and described at http://www.bgl.nu/bogofilter/), the train everything strategy appears to be more efficient for initial training. Once a suitable number of messages have been learned, however, switching to a train-on-error approach saves resources, because many fewer messages must be trained. Louis suggests that switching to train-on-error after 10,000 spam and 10,000 non-spam message have been learned may be reasonable. You can train SpamAssassin's Bayesian classifier with either strategy.

The sa-learn script is your primary interface for training the Bayesian classifier. The first step in using Bayesian filtering is collecting a corpus of messages you've received that you have verified are spam and a corpus that you've verified are non-spam. The easiest and best way to do so is to simply start saving spam you receive to one folder and any non-spam messages that you would ordinarily delete to another. The two collections of messages can either be in maildir format (in which each file contains a single message) or mbox format (in which a single file contains multiple messages).

It's important that the messages be from the same time period; if you train SpamAssassin with a set of spam messages from 2003 and a set of non-spam messages from 2004, it will quickly learn that an effective way to detect spam is to look for messages in 2003! Similarly, forwarded spam, or messages discussing spam in your corpus ("Hey, look at this spam I just got; it's really strange . Here it is . . . ") can result in the classifier learning artificial rules that will degrade its accuracy with normal messages.

Next, run sa-learn on each corpus, using either the --spam or --ham command-line options to specify what each corpus represents. Example 4-1 shows the process for a set of mbox files ‚ a file of saved spam, a file of saved (non-spam) messages related to a project, and the user's mail spool. The project files and mail spool files together form a corpus of known good messages. This example assumes that each user maintains her own Bayesian databases, so sa-learn is run by each user on her own messages.

##### Example 4-1. Learning from a set of mbox files
` \$  ls -F Mail  spam    myproject \$  sa-learn --mbox --spam Mail/spam  \$  sa-learn --mbox --ham mail/myproject  \$  sa-learn --mbox --ham /var/spool/mail/\$LOGNAME  `

Example 4-2 shows the process for a set of maildirs , again assuming that each user has his own Bayesian databases. The commands in the example are those that would be executed by each individual user. Providing a directory as an argument to sa-learn causes it to learn from every file in that directory. The example also illustrates the use of the --no-rebuild option to defer rebuilding of the databases until the --rebuild option is used. When performing learning on a large set of small files (the very essence of a maildir ), deferring the expensive database-rebuilding step is more efficient than rebuilding after each file.

##### Example 4-2. Learning from a set of maildirs
` \$  ls -F mail  INBOX/    spam/    myproject/ \$  sa-learn --no-rebuild --spam mail/spam  \$  sa-learn --no-rebuild --ham mail/INBOX  \$  sa-learn --no-rebuild --ham mail/myproject  \$  sa-learn --rebuild  `

If you're the sort who likes to see the progress of the training (or who worries when you run a command that takes longer than a few seconds to finish), you can add the --showdots option to cause sa-learn to print a period for each message it processes.

You can also call sa-learn on an individual file containing a mail message, or you can pipe a mail message to sa-learn 's standard input. Finally, you can put the names of mailboxes, files, or directories into a file and run sa-learn with the --folders= filename option, and it will read the file and directory names from the filename file and learn from each.

 The Bayesian classifier is most effective when trained on large collections of both spam and non-spam messages. In particular, training using many spam messages and fewer non-spam messages is likely to produce an ineffective filter. Aim for a couple thousand messages of each type, collected prospectively from your personally received mail.

If you mistakenly train the Bayesian classifier that a message is spam, simply direct sa-learn to relearn it as ham; if you mistakenly learn a message as ham, you can direct sa-learn to relearn it as spam. This process is also how you later train the classifier on errors. You can also cause SpamAssassin to forget a message entirely by running sa-learn --forget on the message.

sa-learn also accepts the same --configpath /path/to/ruleset/directory , --prefspath /path/to/user_prefs , and --siteconfigpath /path/to/sitewide/directory directives that the spamassassin script does. They are described in Chapter 2.

## What's Being Learned?

Once your Bayesian classifier has been trained and is contributing to spam-checking, you might be curious to find out which tokens are actually being used. The sa-learn --dump type command displays that information. type can be one of these choices:

• data will cause sa-learn to display all of the tokens it has learned, with their associated spam probabilities, number of occurrences in spam and ham messages, and last time used.

• magic will cause sa-learn to display "magic" tokens. Although they're stored in the database, these tokens don't represent parts of email messages. They include such information as the number of spam and ham messages in the databases, the last time a token was used, etc.

• all will cause sa-learn to display tokens of both types.

Here are the first and last five lines of sa-learn --dump data sort -n as executed on one system:

` 0.000    0    110 1072880922  discussion 0.000    0    112 1071162080  HMBOX-Line:2002 0.000    0    112 1072907632  modify 0.000    0    113 1072915324  H*u:Windows 0.000    0    115 1072900545  Sender ... 1.000  310      0 1071162080  N:HEADER_NBITS 1.000  316      0 1072026198  8-bit 1.000  323      0 1071162080  HEADER_8BITS 1.000  328      0 1072026198  N:N-bit 1.000  394      0 1072910571  Forged `

The first five lines show tokens that have only exclusively appeared in non-spam messages. The last five show tokens that have exclusively appeared in spam messages. Tokens starting with H were found in headers; some headers are abbreviated with special codes starting with an asterisk (*) ‚ so H*u : means the User-Agent header. Tokens starting with N: indicate that Ns that appear in the token should match any sequence of digits.

You can restrict which tokens are shown by sa-learn --dump by adding the --regexp regexp command-line option and providing a regular expression pattern regexp . Only tokens that match regexp will be displayed. This option is useful when you want to see the spam probability associated with specific tokens.

#### 4.2.4 Daily Use

When you first enable the Bayesian classifier in SpamAssassin, you will initially notice little change in the way messages are checked for spam. Once you've trained the classifier with enough messages, however, your spam scores for messages will begin to change substantially in two ways:

• Messages will show that they are hitting SpamAssassin rules with names like BAYES_44 or BAYES_80. These rules, which can be found in the 23_bayes.cf file, are triggered when the Bayesian classifier assigns a given probability of spam to a message. For example, the BAYES_44 rule is matched when a message has a probability of spam between 0.44 and 0.4999; the BAYES_80 rule is triggered when a message has a probability of spam between 0.80 and 0.90. Rules that match on probabilities less than 0.5 lower spam scores, and those that match on probabilities greater than 0.5 raise spam scores.

• Most of the non-Bayesian rules assign different scores when the classifier is trained and in use than when it is not. In many cases, non-Bayesian rules produce less extreme scores, which reflects the supposition that the Bayesian classifier should be better than static rules at distinguishing spam from non-spam.

##### 4.2.4.1 Ongoing training

Ongoing training is essential to maintaining the performance of a Bayesian filter. As in initial training, you must continue to provide examples of both spam and non-spam messages.

As you receive messages, check each message classified as spam to be sure that it is really spam and not a false positive. If the message's spam score is higher than the threshold for automatic learning, the message should have already been fed back into the classifier to train it. You can determine if this has happened by looking at the autolearn= section of the X-Spam-Status header added by SpamAssassin. If the message's spam score wasn't high enough for automatic learning, submit it to sa-learn --spam yourself. If you come across a false positive, submit it to sa-learn --ham instead.

Similarly, you can submit your non-spam messages to sa-learn --ham if their spam scores are too high for the automatic learning threshold for ham. Any spam SpamAssassin misses should definitely be submitted to sa-learn --spam .

You can make the ongoing training process more convenient using one of two common ways. If you read your email with an email client that allows you to bind commands to keys, you could define keystrokes to invoke sa-learn --ham or sa-learn --spam on the current message. Another approach is to save all spam messages into a single mail folder and all non-spam messages that you plan to delete into a second folder, and then run sa-learn on each folder (and possibly on your inbox if you keep many undeleted messages there) at the end of your mail-reading session. Users or system administrators can set up cron jobs to automate this process.

##### 4.2.4.2 Expiration and importing

Expiration and importing are two other functions of sa-learn that you will use infrequently. Expiration removes old tokens from the database, and importing updates the database if a new SpamAssassin release changes database formats.

As discussed earlier in this chapter, when bayes_auto_expire is enabled (the default), SpamAssassin's Bayesian classifier regularly reviews its database of tokens to determine if any should be expired . Expiration is always skipped when fewer than 100,000 tokens are in the database. The automatic expiration process runs no more than once every 12 hours and only when the number of tokens exceeds bayes_expiry_max_db_size .

If you do not use bayes_auto_expire , or if you want to expire tokens manually, you can force an expiration attempt by running sa-learn --force-expire . Doing so may not actually expire any tokens; for example, when fewer than 100,000 tokens or all tokens have been recently used, no tokens will be expired.

The sa-learn --import command is used to update the Bayesian databases from their format in an older version of SpamAssassin to the current format. The release notes for new versions of SpamAssassin should tell you when running sa-learn --import is necessary. In many cases, SpamAssassin will perform importation when it automatically learns a new message, so this command may not be necessary.

 The import process can be both CPU and disk intensive, especially with a large database of tokens. It is best run during off-hours or times of low system load.

#### 4.2.5 Storing Bayesian Data in SQL

SpamAssassin 3.0 can optionally store per-user Bayesian data in an SQL database, which is useful when users don't have accounts on the mail server. To store Bayesian data in SQL, you must install the DBI Perl module and an appropriate driver module for your SQL server. Common choices are DBD-mysql (for the MySQL server), DBD-Pg (for the PostgreSQL server), and DBD-ODBC (for connection to an ODBC-compliant server).

You should create a database and a user with privileges to access it. You must then create a set of tables in the database to store the Bayesian data. The SpamAssassin source code includes schemas for MySQL, PostgreSQL, and SQLite tables in the sql subdirectory. Here is the MySQL schema:

` CREATE TABLE bayes_expire (   username varchar(200) NOT NULL default '',   runtime int(11) NOT NULL default '0',   KEY bayes_expire_idx1 (username) ) TYPE=MyISAM; CREATE TABLE bayes_global_vars (   variable varchar(30) NOT NULL default '',   value varchar(200) NOT NULL default '',   PRIMARY KEY  (variable) ) TYPE=MyISAM; INSERT INTO bayes_global_vars VALUES ('VERSION','2'); CREATE TABLE bayes_seen (   username varchar(200) NOT NULL default '',   msgid varchar(200) binary NOT NULL default '',   flag char(1) NOT NULL default '',   PRIMARY KEY  (username,msgid),   KEY bayes_seen_idx1 (username,flag) ) TYPE=MyISAM; CREATE TABLE bayes_token (   username varchar(200) NOT NULL default '',   token varchar(200) binary NOT NULL default '',   spam_count int(11) NOT NULL default '0',   ham_count int(11) NOT NULL default '0',   atime int(11) NOT NULL default '0',   PRIMARY KEY  (username,token) ) TYPE=MyISAM; CREATE TABLE bayes_vars (   username varchar(200) NOT NULL default '',   spam_count int(11) NOT NULL default '0',   ham_count int(11) NOT NULL default '0',   last_expire int(11) NOT NULL default '0',   last_atime_delta int(11) NOT NULL default '0',   last_expire_reduce int(11) NOT NULL default '0',   PRIMARY KEY  (username) ) TYPE=MyISAM; `

For each user, these tables maintain information about token expiration ( bayes_expire ), messages seen ( bayes_seen ), tokens seen ( bayes_token ), and per-user configuration variables ( bayes_vars ). A table for global configuration variables ( bayes_global_vars ) is also available. The names of rows in these tables are similar to the corresponding SpamAssassin configuration variables and indicate the data they store.

To configure SQL support for Bayesian data, set the following configuration parameters in your systemwide configuration file ( local.cf ):

bayes_store_module Mail::SpamAssassin::BayesStore::SQL

Configures SpamAssassin to use SQL-based storage for Bayesian data instead of file-based (DBM) storage.

bayes_sql_dsn DSN

Defines the data source name for the SQL database. See the earlier definition of bayes_awl_dsn for examples of how to define a DSN.

Defines the username that will be used to connect to the database server. This user must have permission to modify the data in the table (including inserting and deleting rows).

Defines the password associated with the username that will be used to connect to the server.

SpamAssassin will now store Bayesian data learned from messages (either automatically or via sa-learn ) in the SQL database and will look up tokens in this database when checking messages for a user.

SpamAssassin provides one additional configuration variable for SQL storage of Bayesian data:

When this directive is set, the SQL query for Bayesian data will use someusername in place of the current user's name when adding new message data or retrieving data for message-checking. Generally, this directive should only be used in per-user configuration files so that most users have their own personal Bayesian data. In principle, you could also use it in the site-wide configuration file to create a sitewide Bayesian database, and then use it in per-user configuration files to exclude certain users from the sitewide data.

#### 4.2.6 A Sitewide Bayesian Classifier

Bayesian filtering is most effective when each user maintains his own set of token databases trained from his own email. By learning about the peculiar characteristics of spam and non-spam messages received by an individual user, the Bayesian classifier becomes an effective test for future messages to that user. A pharmacist might receive a lot of legitimate email about sildenafil citrate, and having all of these messages tagged as spam (or worse ) could be a serious problem.

Many sites, however, prefer to have a single set of databases for all users at the site, either to save disk space or because users do not have home directories and setting up SpamAssassin 3.0's SQL storage is infeasible. Setting up a sitewide Bayesian classifier is possible with SpamAssassin. Perform the following steps:

1. Set bayes_path and bayes_file_mode in the systemwide configuration file. Be sure the directory specified in bayes_path is readable, writable, and searchable by the user that SpamAssassin will be running as, so that it can create the proper files. The bayes_file_mode should be as strict as possible, typically 0700, which is the default setting. It's a good idea to set it explicitly, rather than rely on the default.

2. Provide a mechanism for users or administrators to submit messages for training. This step is the most difficult part of a sitewide Bayesian classifier. Because the database files will be owned by the user that SpamAssassin runs as, even local users typically will not be able to run sa-learn with the proper permissions to update the databases.

One solution for enabling users to submit spam messages for training is to ask users to bounce any spam they receive to a central mailbox that can be processed by a privileged script. For example, set up an email alias of spamtrap on the SpamAssassin system that pipes incoming messages to a script like that shown in Example 4-3. As an extra benefit, you can publicize the spamtrap address on public web pages or in Usenet postings and actually use it as a spam trap ‚ spammers who harvest the address and send spam to it will find their spam fed into your learning and reporting systems.

##### Example 4-3. A sitewide script for learning spam
` #!/bin/sh # # This script accepts an email message on its standard input # and feeds it to SpamAssassin's learning and/or reporting systems # It is meant to be run as root or as the user who owns the  # SpamAssassin Bayesian databases PATH=/bin:/usr/bin:/sbin:/usr/sbin # Three choices: # 1. Uncomment the following line to use --report if # you have bayes_learn_during_report enabled. spamassassin --report # 2. Uncomment the following line to use sa-learn and # spamassassin --report when you don't have # bayes_learn_during_report enabled # sa-learn --spam  spamassassin --report # 3. Uncomment the following line to use sa-learn # alone. #sa-learn --spam `

A similar solution for non-spam messages is much more difficult ‚ for social, rather than technical, reasons. Users may well be reluctant to forward their legitimate email to any central address. Unfortunately, without a good corpus of non-spam messages, the Bayesian filter will not perform well. One possible approach is to raise the bayes_auto_learn_threshold_nonspam slightly (e.g., to 0.5 or 1.0) so that much legitimate email will be auto-learned.

 ‚  < ‚  Day Day Up ‚  > ‚

SpamAssassin
ISBN: 0596007078
EAN: 2147483647
Year: 2004
Pages: 88

Similar book on Amazon