3.4 The Built-in Tests

SpamAssassin is distributed with over 700 test rules defined for English-language spam. SpamAssassin 2.63 includes another 2,900 rules for spam in other languages. (Language support in SpamAssassin 3.0 is currently available only for French and German, but language support is likely to increase as SpamAssassin gets into wider release.) Reading the rules distributed with SpamAssassin is an excellent way to learn to write your own rules.

SpamAssassin's rules are defined in a set of files typically installed in /usr/share/spamassassin:

10_misc.cf: The 10_misc.cf file defines templates for the spam report that SpamAssassin attaches to spam messages, definitions of headers that SpamAssassin adds to messages, and default settings for the most common configuration options. This file is described in more detail later in this chapter.
10_plugins.cf (SpamAssassin 3.0): This file provides a convenient place to load SpamAssassin plug-in modules with the loadplugin directive. Plug-ins extend SpamAssassin's features.
20_fake_helo_tests.cf: This file defines a set of rules used to test for forged HELO hostnames. This file is also described in more detail later in this chapter.
20_body_tests.cf: This file defines most tests against message bodies, spam clearinghouses, message languages, and message locales. It's described in more detail later.
20_dnsbl_tests.cf: This file defines tests against many different DNS blacklists, using the check_rbl( ), check_rbl_sub( ), and check_rbl_txt( ) eval tests described earlier in this chapter. These blacklists include NJABL (http://www.dnsbl.njabl.org/), SORBS (http://www.dnsbl.sorbs.net/), OPM (http://opm.blitzed.org/), Spamhaus (http://www.spamhaus.org/sbl/), DSBL (http://dsbl.org), Spamcop (http://www.spamcop.net/bl.shtml), MAPS (http://www.mail-abuse.org), and several others.
20_ratware.cf and 20_anti_ratware.cf: The 20_ratware.cf file contains tests that look for tell-tale signs of specialized mail programs known to be used by spammers (ratware or spamware). Most of them are tests of message headers. The 20_anti_ratware.cf file is designed to contain tests that look for signs of non-spam mail programs that might be mistaken for spamware, but it doesn't contain any active tests as of SpamAssassin 3.0.
20_head_tests.cf: This file contains most of the tests that SpamAssassin performs against message headers. This includes tests for blacklisted and whitelisted addresses in the From and To headers (discussed in greater detail in Chapter 4).
20_porn.cf (all SpamAssassin versions) and 20_drugs.cf (SpamAssassin 3.0): These files contain body tests that look for common indicators of pornographic spam and online pharmacy spam, respectively.
20_phrases.cf: This file contains body tests that look for common phrases that appear in spam. Most of them are either instructions for how you can be removed from the mailing list or claims that the message conforms to a bill that putatively regulates unsolicited email.
20_uri_tests.cf: This file contains most of the tests that SpamAssassin performs against URIs that appear in messages.
20_compensate.cf: Tests in this file are intended to compensate for common false positives in header tests and are "nice" tests (with negative spam scores).
20_html_tests.cf: This file contains body tests that target messages that contain HTML markup. Certain types of markup are very commonly seen in spam, and several of these tests make for interesting reading.
20_meta_tests.cf: This file contains meta tests. Meta tests are tests that combine other tests, and are described earlier in this chapter.
23_bayes.cf: This file contains tests that act on the results of the Bayesian classifier. The Bayesian system and these tests are described in greater detail in Chapter 5.
25_head_tests_es.cf, 25_body_tests_es.cf, 25_head_tests_pl.cf, 25_body_tests_pl.cf (SpamAssassin 2.6x): These files contain header and body tests for Spanish (es) and Polish (pl) messages.
25_uribl.cf (SpamAssassin 3.0): This file loads the URIDNSBL plug-in and defines URI tests against DNS blacklists.
30_text_*.cf (de,es,fr,it,pl,sk): These files don't define any new tests but provide translations of test descriptions and report templates into different languages, such as German (de), Spanish (es), French (fr), Italian (it), Polish (pl), and Slovak (sk). SpamAssassin 3.0 includes only German and French tests at the time of this writing.
50_scores.cf: This file defines the scores associated with all of the tests defined in the other files. The scores are separated into a single file because they are generated by an algorithm that applies each test to a large corpus of spam and non-spam messages and adjusts the scores to minimize false positives and false negatives.
60_whitelist.cf: The rules in this file set up default whitelists for several large well-known addresses and companies, such as Amazon.com.

Because these files are overwritten whenever SpamAssassin is upgraded, they should not be changed. Make all local rules and changes to the scoring of distributed rules in the systemwide configuration file (or in per-user preference files) rather than in these files. Reading these files, however, provides the most information about how SpamAssassin rules are designed.

The following sections describe some of the more important rule files in greater detail.

3.4.1 10_misc.cf

The 10_misc.cf file defines special rules that are not spam tests. These include templates for the spam report that SpamAssassin attaches to spam messages, definitions of headers that SpamAssassin adds to messages, and default settings for the most common configuration options (such as those described in Chapter 2).

Templates are defined with the report, unsafe_report, and spamtrap directives, and the corresponding utility directives clear_report_template, clear_unsafe_report_template, and clear_spamtrap_template. Use the report template to design the report that SpamAssassin attaches to spam messages. Use the unsafe_report template to design the report that SpamAssassin attaches to messages that contain potentially executable code. Use the spamtrap template to design the message that SpamAssassin sends back to senders who email a spam trap address that calls the spamassassin script with the --report and --warning-from options (spam-reporting is discussed in Chapter 2).

Each time it encounters a template directive, SpamAssassin appends new text to the template. Accordingly, to ensure that you're starting with a clean slate when you define a new template, you must first clear the template and then add your desired text. Here's how the spam report might be defined in SpamAssassin:

 clear_report_template
 report Spam detection software, running on the system "_HOSTNAME_", has
 report identified this email as possible spam. The original message
 report is attached to this so you can view it (if it isn't spam) or block
 report similar future email.  If you have any questions, see
 report _CONTACTADDRESS_ for details.
 report
 report Content preview:  _PREVIEW_
 report
 report Content analysis details:   (_HITS_ points, _REQD_ required)
 report
 report " pts rule name              description"
 report  ---- ---------------------- ------------------------------------
 report _SUMMARY_
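The clear-then-append behavior described above can be illustrated with a short Python sketch. It is not SpamAssassin's implementation: the `apply_report_directives` helper and its simplified parsing are hypothetical and exist only to show why clearing matters.

```python
def apply_report_directives(lines, template=""):
    """Model how `report` lines accumulate and `clear_report_template` resets.

    Hypothetical helper: the real parser handles many more directives.
    """
    for line in lines:
        directive, _, text = line.strip().partition(" ")
        if directive == "clear_report_template":
            template = ""              # start from a clean slate
        elif directive == "report":
            template += text + "\n"    # every report line appends to the template
    return template


# Stand-in for the template defined by the distributed files.
default = apply_report_directives(
    ["report Default line one", "report Default line two"]
)

# Without clearing, new text lands after the existing template.
appended = apply_report_directives(["report My custom report"], default)

# Clearing first discards the existing template.
replaced = apply_report_directives(
    ["clear_report_template", "report My custom report"], default
)

print(repr(appended))
print(repr(replaced))
```

In this model, `appended` still carries both default lines ahead of the custom one, while `replaced` contains only the custom line.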

_HOSTNAME_, _CONTACTADDRESS_, _PREVIEW_, _HITS_, _REQD_, and _SUMMARY_ are variables that are replaced by their values when the template is generated for each message. The complete list of variables, which appears in the Mail::SpamAssassin::Conf manpage, is given in Table 3-3.

Table 3-3. Variables for use in report and header templates

Variable	Value
Variables that depend on the message
_YESNOCAPS_	"YES" if message is spam; "NO" if message is not spam.
_YESNO_	"Yes" if message is spam; "No" if message is not spam.
_HITS_	Spam score for message.
_BAYES_	Bayesian classifier score.
_AUTOLEARN_	"spam" if message was auto-learned as spam by the Bayesian classifier; "ham" if auto-learned as non-spam; "NO" if the message was not auto-learned.
_AWL_	Autowhitelist score modifier.
_DATE_	Date and time of SpamAssassin scan in RFC 2822 format.
_STARS_	A string containing one asterisk for each point of spam score (up to 50).
_STARS( `character` )_	A string containing one of `character` for each point of spam score (up to 50).
_RELAYSTRUSTED_	List of relays found in the message and deemed to be trusted. The list includes the IP address, reverse DNS lookup, and HELO address for each relay.
_RELAYSUNTRUSTED_	List of relay IP addresses found in the message and deemed to be untrusted.
_TESTS_, _TESTSSCORES_	Comma-separated list of tests matched, or tests matched and their associated scores.
_TESTS( `character` )_, _TESTSSCORES( `character` )_	As in _TESTS_, _TESTSSCORES_ but separated by `character` instead of comma.
_LANGUAGES_	List of languages that SpamAssassin thinks a message is written in.
_PREVIEW_	Preview of message content.
_SUMMARY_	Multiline list of tests matched and their scores and descriptions.
_REPORT_	One line list of tests matched.
_RBL_	Results of positive DNSBL queries.
_DCCB_, _DCCR_	Checking host and results of DCC check of message.
_PYZOR_	Results of Pyzor check of message.
Variables that don't depend on the message
_REQD_	SpamAssassin's threshold score for calling a message spam.
_VERSION_, _SUBVERSION_	Version and subversion of SpamAssassin.
_HOSTNAME_	Hostname of SpamAssassin host.
_CONTACTADDRESS_	The value of the `report_contact` directive (typically, the email address of the postmaster).
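To make the substitution step concrete, here is a minimal Python sketch of how placeholders such as those in Table 3-3 might be expanded. The `render_template` function is hypothetical, it handles only simple `_NAME_` variables and `_STARS(character)_`, and it assumes that stars are counted per whole point of score.

```python
import re


def render_template(template, values):
    """Expand _NAME_ and _STARS(c)_ placeholders; unknown names are left alone."""

    def stars(match):
        # Assumption: one character per whole point, capped at 50 as in Table 3-3.
        count = max(0, min(int(values["HITS"]), 50))
        return match.group(1) * count

    def variable(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    text = re.sub(r"_STARS\((.)\)_", stars, template)
    return re.sub(r"_([A-Z]+)_", variable, text)


# Illustrative values for one scanned message.
values = {
    "YESNO": "Yes",
    "HITS": 7.3,
    "REQD": 5.0,
    "TESTS": "FAKE_HELO_AOL,HTML_MESSAGE",
}
line = render_template(
    "_YESNO_, hits=_HITS_ required=_REQD_ tests=_TESTS_ _STARS(x)_", values
)
print(line)
```

Substituted values are not rescanned, so a test name that itself contains underscores is inserted unchanged.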

The variables in Table 3-3 can also be added to customized message headers for messages processed by SpamAssassin by using the add_header directive, which takes the following form:

 add_header   messagetype headername string

The messagetype can be spam, ham (non-spam), or all and determines which kind of messages will have the header added. The new header will be named X-Spam-headername, and string, which should be enclosed in double quotes, will be the value of the header. For example, the following directive, which appears in the distributed 10_misc.cf file, adds an X-Spam-Status header to all messages (spam or not) that shows whether each message is spam, the spam score, the spam threshold score, the tests that were matched, whether the message is being automatically learned (see Chapter 5), and the version of SpamAssassin:

 add_header all Status "_YESNO_, hits=_HITS_ required=_REQD_ tests=_TESTS_ autolearn=_AUTOLEARN_ version=_VERSION_"

If you want to change or remove a default header, you can use the remove_header directive:

 remove_header   messagetype headername

You can remove all headers with the clear_headers directive.
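The interaction of add_header, remove_header, and clear_headers can be sketched as a small Python model. This is an illustration rather than SpamAssassin's own logic: `apply_header_directives` is a hypothetical helper, and the directive lines in `config` are made up for the example.

```python
def apply_header_directives(lines):
    """Return {messagetype: {header: template}} after applying directives in order.

    Hypothetical model that tracks only add_header, remove_header, clear_headers.
    """
    headers = {"spam": {}, "ham": {}}
    for line in lines:
        parts = line.split(None, 3)
        directive = parts[0]
        if directive == "clear_headers":
            headers = {"spam": {}, "ham": {}}
            continue
        messagetype, name = parts[1], parts[2]
        # "all" applies the directive to both spam and ham.
        targets = ("spam", "ham") if messagetype == "all" else (messagetype,)
        for target in targets:
            if directive == "add_header":
                headers[target]["X-Spam-" + name] = parts[3].strip('"')
            elif directive == "remove_header":
                headers[target].pop("X-Spam-" + name, None)
    return headers


# Illustrative directives only.
config = [
    'add_header all Status "_YESNO_, hits=_HITS_ required=_REQD_"',
    'add_header spam Flag "_YESNOCAPS_"',
    'add_header all Level "_STARS(*)_"',
    "remove_header ham Level",
]
result = apply_header_directives(config)
print(sorted(result["spam"]))
print(sorted(result["ham"]))
```

In this model, spam messages end up with three X-Spam headers, while non-spam messages keep only X-Spam-Status because the Level header was removed for ham.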

3.4.2 20_fake_helo_tests.cf

This file defines a set of rules that use the eval test check_for_rdns_helo_mismatch( ). This test takes two arguments: a regular expression pattern to match against the reverse DNS lookup of the connecting client's IP address, and a regular expression pattern to match against the hostname provided by the client in the SMTP HELO command. Spammers often use mail programs that forge the HELO hostname, and these tests look for such forgeries when the clients have hostnames that match those of major commercial ISPs. Here's an example of a test from this file:

 header FAKE_HELO_AOL  eval:check_for_rdns_helo_mismatch("aol\.com","aol\.com")
 describe FAKE_HELO_AOL  Host HELO did not match rDNS: aol.com

This test matches if the client connects from an IP address that reverse-resolves to an aol.com hostname but claims in the HELO to have a hostname that does not match "aol.com". These tests are applied to all of the Received headers from untrusted relays.

You can use this eval test to reject messages that claim, in their HELO, to be from your own host. If your hostname is myhost.example.com, and you know that your IP address reverse-resolves to the same hostname, you could add a rule like this (to the systemwide configuration file):

 header FAKE_MY_HELO eval:check_for_rdns_helo_mismatch("(?!myhost\.example\.com).{18}$","myhost\.example\.com")
 describe FAKE_MY_HELO Host HELO faked my hostname
 score FAKE_MY_HELO 5.0

The regular expression (?!myhost\.example\.com).{18}$ matches any hostname containing at least 18 characters (the length of myhost.example.com) that does not end in myhost.example.com, which should match the reverse DNS lookup of any untrusted relay host other than your own. If any such host claims in its HELO to be myhost.example.com, it is forging your hostname.
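The behavior of this pattern is easy to check in isolation. The sketch below uses Python's `re` module, whose negative lookahead behaves like Perl's for this expression; the sample hostnames are invented for illustration, and the code tests only the regular expression, not the eval test itself.

```python
import re

# Pattern from the FAKE_MY_HELO rule; myhost.example.com is 18 characters long.
pattern = re.compile(r"(?!myhost\.example\.com).{18}$")

# Made-up hostnames covering the interesting cases.
hostnames = [
    "myhost.example.com",         # our own host
    "mail.myhost.example.com",    # ends in our hostname
    "dialup-12.isp.example.net",  # long, unrelated hostname
    "short.example",              # fewer than 18 characters
]

results = {name: bool(pattern.search(name)) for name in hostnames}
for name, matched in results.items():
    print(f"{name:28} {matched}")
```

Only the long, unrelated hostname matches. Note the last case: a reverse DNS name shorter than 18 characters never matches, so a rule built on this pattern cannot fire for such hosts.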

3.4.3 20_body_tests.cf

This file contains most of the tests that SpamAssassin performs against message bodies. In addition to tests for regular expressions in the body, this file defines tests against spam clearinghouses and tests of message language and locale.

A spam clearinghouse is a server that maintains a database of checksums of messages reported as spam and allows clients to test a message against the checksum database. SpamAssassin supports three spam clearinghouses: Vipul's Razor (http://razor.sf.net/), Pyzor (http://pyzor.sf.net), and the Distributed Checksum Clearinghouse, or DCC (http://rhyolite.com/anti-spam/dcc/). Special client software must be installed on the system in order for SpamAssassin to use these tests. The spamassassin --report command can be used to report confirmed spam to these clearinghouses as well.
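The report-and-check idea behind a clearinghouse can be sketched in a few lines of Python. This is a toy model, not how Razor, Pyzor, or DCC actually work: the `ToyClearinghouse` class, its whitespace and case normalization, and the use of SHA-1 are all assumptions for illustration, and there is no networking.

```python
import hashlib


class ToyClearinghouse:
    """In-memory stand-in for a checksum server (hypothetical)."""

    def __init__(self):
        self.reports = {}  # checksum -> number of spam reports received

    @staticmethod
    def checksum(body):
        # Assumed normalization: collapse whitespace and lowercase the text,
        # so trivial reformatting produces the same digest.
        normalized = " ".join(body.split()).lower()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def report(self, body):
        """Record one spam report for this message body."""
        digest = self.checksum(body)
        self.reports[digest] = self.reports.get(digest, 0) + 1

    def check(self, body):
        """Return how many times this message body has been reported."""
        return self.reports.get(self.checksum(body), 0)


server = ToyClearinghouse()
server.report("Buy cheap pills   NOW")

seen_before = server.check("buy cheap pills now")  # same text, different spacing and case
never_seen = server.check("Lunch on Friday?")
print(seen_before, never_seen)
```

A client would treat a nonzero count as evidence that other recipients have already reported the same message as spam.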

In SpamAssassin 3.0, the pyzor_options configuration directive can be set to a string of additional options to be passed to the Pyzor client on the command line when SpamAssassin invokes it. Similarly, the dcc_options directive can be set to provide additional options to the DCC client.
