3.3 Writing Your Own Tests


When none of the existing tests does what you'd like, you can write a custom test of your own. Custom tests are just like the distributed tests, except that you install them in the systemwide configuration file or in a per-user preference file.

Users can write their own tests in their per-user preference files, but for security reasons these tests will not be used when spamd is performing spam-checking, unless the allow_user_rules option is set to 1 in the systemwide configuration. However, setting this option is dangerous because spamd runs as root and a malicious or inexperienced user can construct a custom test that causes the system to hang or to invoke an arbitrary command as nobody or as spamd's uid. Users who want their own tests on a system that uses spamd should reinvoke the spamassassin script on their incoming mail (probably in their .procmailrc). Chapter 2 illustrates this approach.

The first step in writing a custom test is to choose a symbolic test name and write a meaningful test description with the describe directive. For now, do not begin any of your names with a double underscore ( __ ). Test names that begin with two underscores are not listed in test hit reports, nor are they added to the spam score on their own; such names are used for creating sets of subtests that should be applied in combination. SpamAssassin calls these combinations meta tests, and they are discussed later in this section.

Second, determine what part of the message you wish to test. Table 3-1 summarizes the directives used to test different portions of a message. Each is covered in greater detail in the following sections.

Table 3-1. Message portions and associated test directives

Message part	Directive	Possible tests
Headers	header TESTNAME	Match a regexp; don't match a regexp; exists; evaluate Perl code; check Received headers against DNSBL
Message subject and text of message body, decoding all textual MIME parts, with HTML tags and line breaks removed	body TESTNAME	Match a regexp; evaluate Perl code
Text of message body, decoding all textual MIME parts, with HTML tags and line breaks retained	rawbody TESTNAME	Match a regexp; evaluate Perl code
Undecoded message body including all MIME parts	full TESTNAME	Match a regexp; evaluate Perl code
URIs in the message body	uri TESTNAME	Match a regexp
URIs in the message body	uridnsbl TESTNAME	(SpamAssassin 3.0) Check for address in a DNS-based blacklist

Third, decide if your test requires any special test flags. Test flags are used to inform SpamAssassin that your test may apply only under certain conditions or may do something unusual. Use the tflags TESTNAME flaglist directive to indicate test flags. The flaglist is a space-separated list of flags. Table 3-2 lists the available flags in SpamAssassin and their effects.

Table 3-2. Test flags

Flag	Meaning
net	A network-based test that will not be run when SpamAssassin is directed to run local tests only
learn	A test that requires training before use (e.g., the Bayesian tests)
userconf	A test that requires user configuration before use (e.g., a test that expects the user to provide a list of addresses)
nice	A test that will be given a negative score
noautolearn	(SpamAssassin 3.0) A test that will not be applied in the spam score when determining whether the message should be automatically learned as spam or non-spam

For example, the RCVD_IN_BL_SPAMCOP_NET test, which checks the message's Received headers against the DNS-based blacklist at bl.spamcop.net, is defined in 20_dnsbl_tests.cf like this: ^[2]

^[2] Section 3.3.1 explains the details of how DNS-based blacklist-checking is performed.

 header   RCVD_IN_BL_SPAMCOP_NET eval:check_rbl_txt('spamcop', 'bl.spamcop.net.')
 describe RCVD_IN_BL_SPAMCOP_NET Received via a relay in bl.spamcop.net
 tflags   RCVD_IN_BL_SPAMCOP_NET net

Finally, after adding or modifying a test, you should run spamassassin --lint to check your new rules for correct syntax. This command will attempt to parse all of the rules and configuration files in the ruleset directory and systemwide configuration directory. It exits quietly if no errors are found.
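
Putting the steps above together, a complete custom test might look like the following sketch. The test name LOCAL_FREE_WIDGETS, the pattern, and the score value are hypothetical and chosen only for illustration; the score directive assigns the test's contribution to the spam score.

 body     LOCAL_FREE_WIDGETS   /\bfree\s+widgets\b/i
 describe LOCAL_FREE_WIDGETS   Body mentions free widgets
 score    LOCAL_FREE_WIDGETS   0.5

This test needs no tflags line because it is a local, positively scored test that requires neither training nor user configuration.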

Versioning Your Rules

If you plan to create an extensive set of new rules, and especially if you plan to distribute them to other SpamAssassin users, you should use the version_tag configuration option to set a string that will denote your version of the rules. This string will appear in the X-Spam-Status header, after SpamAssassin's version number.

For example, set version_tag like this:

 version_tag example.com

to produce the following in the header:

 X-Spam-Status: No, hits=0.9 required=5.0 tests=FROM_NO_LOWER autolearn=no
     version=3.0.0-example.com

If your rules rely on a particular version of SpamAssassin, include the require_version directive, followed by the required version number. When SpamAssassin encounters this directive while parsing a file, it skips the rest of the file unless the version number exactly matches the running version. For example, to ensure that custom rules you wrote for SpamAssassin 2.63 won't be used in SpamAssassin 3.0, add this line to the top of the file containing your rules:

 require_version 2.63

3.3.1 Header Tests

Use the header directive to define a header test. Header tests can test for the existence of a header or check to see if a header matches (or fails to match) a regular expression.

To check for the existence of a header, use the following syntax:

 header TESTNAME exists:headername
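
For instance, a test that is positive whenever a message carries a Precedence header could be sketched as follows. The test name LOCAL_HAS_PRECEDENCE is hypothetical.

 header   LOCAL_HAS_PRECEDENCE  exists:Precedence
 describe LOCAL_HAS_PRECEDENCE  Message has a Precedence header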

Regular expression tests can be applied to any single header in a message, both the To and Cc headers, all Message-Id headers, or all headers. Use the following form to match a header to a Perl regular expression:

 header TESTNAME headername =~ /regexp/modifiers

Use this next syntax to test whether a header does not match a regular expression:

 header TESTNAME headername !~ /regexp/modifiers

In these tests, the headername can be the name of a single header, or can be ToCc (to match in the To or Cc header), MESSAGEID (to match in any Message-Id header), or ALL (to match in any header). SpamAssassin 3.0 also supports headername EnvelopeFrom to match against the address supplied in the SMTP MAIL FROM command if the MTA provides this information to SpamAssassin.
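
The following sketch shows one rule for each kind of headername. All of the test names and patterns are hypothetical and serve only to illustrate the syntax.

 header LOCAL_SUBJ_WIDGETS      Subject   =~ /\bwidgets?\b/i
 header LOCAL_TOCC_UNDISCLOSED  ToCc      =~ /undisclosed-recipients/i
 header LOCAL_MSGID_LOCALHOST   MESSAGEID =~ /localhost/i
 header LOCAL_ANY_WIDGETMAILER  ALL       =~ /WidgetMailer/

The first rule examines only the Subject header, the second examines both To and Cc, the third examines every Message-Id header, and the last is positive if the pattern appears in any header at all.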

A header that does not exist will not match any regular expression. To handle the possibility of a nonexistent header, you can add an optional [if-unset: STRING] after the regular expression and modifiers, and STRING will be tested against the regular expression if the header does not exist. For example, to look for a Reply-To header that either contains @localhost or is missing, you could use this rule:

 header LOCAL_OR_NO_REPLY reply-to =~ /@localhost/ [if-unset: @localhost]

Many of the methods available in the Mail::SpamAssassin::EvalTests module test headers. This module is not documented, but you can learn about its methods by reading the rules distributed with SpamAssassin. For example, the subject_is_all_caps( ) method matches when the Subject header contains all capital letters. This test is the basis of the SUBJ_ALL_CAPS rule distributed with SpamAssassin:

 header SUBJ_ALL_CAPS      eval:subject_is_all_caps( )

3.3.1.1 Configurable header tests (SpamAssassin 3.0)

Some of the header tests in SpamAssassin 3.0 that use Mail::SpamAssassin::EvalTests methods have configurable parameters that control their operation. These parameters should be defined in sitewide or user configuration files.

The check_for_from_dns( ) method performs a DNS lookup on the address in the message's Reply-To or From header to ensure that an MX record listing a host willing to receive mail for the message sender's host exists. Because DNS lookups can be slow, two configuration file options, check_mx_attempts and check_mx_delay, are provided so you can adjust these lookups. Set check_mx_attempts to the number of lookup attempts you are willing to have SpamAssassin make (the default is 2). Set check_mx_delay to the number of seconds to wait between attempts in case the domain name server is temporarily down (the default is 5).
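
As a sketch, a site that prefers more attempts with a shorter pause between them might use settings like these. The values are illustrative assumptions, not recommendations.

 check_mx_attempts 3
 check_mx_delay    2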

The check_hashcash_value( ) and check_hashcash_double_spend( ) methods implement Hashcash verification (http://www.hashcash.org). If a message includes an X-Hashcash header, SpamAssassin can quickly verify that the sender spent the required processing time to produce a valid header and reduce the message's spam score in proportion to how difficult it was for the sender to produce the header. To control SpamAssassin's use of Hashcash, define the following configuration variables:

use_hashcash

If this variable is set to 1 (the default), Hashcash headers in messages will be checked. To disable Hashcash-checking, set this variable to 0.

hashcash_accept address(es)

In order for SpamAssassin to perform a Hashcash check, it must know all of the valid addresses that could receive mail with Hashcash headers. Set this variable to provide those addresses.

You can use multiple hashcash_accept directives or multiple addresses in a single directive to list several addresses. You can also use an asterisk (*) as a wildcard for zero or more characters and the question mark (?) as a wildcard for zero or one character, much as you would to specify filename patterns in a shell. Finally, you can use %u to represent the current user's username in a sitewide configuration file. For example, a sitewide configuration file for users at example.com might include:

 hashcash_accept %u@example.com %u@*.example.com

hashcash_doublespend_path /path/to/file

Set this variable to the path at which SpamAssassin will create and maintain a (Berkeley DB format) database of previously seen Hashcash headers to prevent a sender from reusing a header. The default file is ~/.spamassassin/hashcash_seen. For a shared sitewide database, the user SpamAssassin runs as must have permission to write to this file and its directory.

hashcash_doublespend_file_mode mode

The file mode, in octal, for the Hashcash double-spend database. The default file mode is 0700. The file mode should include execute bits so that SpamAssassin can create directories, if necessary; i.e., use 0700 rather than 0600.
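
Taken together, a sitewide Hashcash configuration for users at example.com might be sketched as follows. The database path shown here is a hypothetical location chosen for illustration; substitute a path that the user SpamAssassin runs as can write to.

 use_hashcash                   1
 hashcash_accept                %u@example.com %u@*.example.com
 hashcash_doublespend_path      /var/spool/spamassassin/hashcash_seen
 hashcash_doublespend_file_mode 0700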

3.3.1.2 check_rbl( )

The check_rbl( ), check_rbl_txt( ), and check_rbl_sub( ) methods can serve as the basis for new tests. These methods extract IP addresses from a message's Received headers, discard those that are known to be reserved addresses or on trusted networks, and query a DNS-based blacklist for each address. If any of the addresses are listed in the blacklist, the test matches. Rules using these methods are written like other eval rules:

 header A_NEW_BLACKLIST    eval:check_rbl('nasties','new.blacklist.zone')

Call check_rbl( ) with two arguments. The first argument is the zone ID, a string that's used to identify the blacklist. It's primarily useful when you're querying a blacklist that's composed of many different lists, and you later want to evaluate the query result by which sublists the addresses were on (this topic is discussed later in this chapter).

If you append -notfirsthop to the name of the zone ID, the originating IP address will be excluded from RBL lookups unless it is the only IP address. This is useful when querying blacklists of dialup or DSL (Digital Subscriber Line) hosts that are expected to relay all their email through an ISP's mail server. If new.blacklist.zone was this kind of blacklist, you might have written the test like this:

 header A_NEW_BLACKLIST    eval:check_rbl('nasties-notfirsthop','new.blacklist.zone')

Similarly, you can append -firsttrusted to check the IP address that appears in the Received header that was added by the most remote trusted server (IP addresses in Received headers added by more remote relays cannot be trusted). This is useful for querying a DNS-based whitelist to determine whether the server that first relayed the email to a trusted server appears on the whitelist. By appending -untrusted, you will check only the untrusted IP addresses (those more remote than the most remote trusted server). Here's a definition for a test of a DNS-based whitelist:

 header A_NEW_WHITELIST    eval:check_rbl('friends-firsttrusted','new.whitelist.zone')
 tflags A_NEW_WHITELIST    nice

(Remember, as Table 3-2 points out, when defining a test that will lower the spam score, you must set the nice test flag.)
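
A blacklist test restricted to untrusted relays could be sketched in the same way by appending -untrusted to the zone ID. The test name and description below are hypothetical; the net flag marks the test as network-based, as described in Table 3-2.

 header   A_NEW_BLACKLIST_UNTRUSTED  eval:check_rbl('nasties-untrusted','new.blacklist.zone')
 describe A_NEW_BLACKLIST_UNTRUSTED  Untrusted relay listed in new.blacklist.zone
 tflags   A_NEW_BLACKLIST_UNTRUSTED  net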

Trusted and Untrusted Servers

Some mail servers are more trustworthy than others. In many organizations, email is received at an SMTP (Simple Mail Transfer Protocol) gateway on the Internet, checked for viruses, and then relayed through a firewall to an internal SMTP gateway that is responsible for delivering mail to individual machines on the internal network. In such a configuration, messages received by internal machines will have Received headers added by the internal SMTP gateway and the external SMTP gateway. The organization may also maintain (or contract with) off-site machines that serve as backup mail exchangers if the main SMTP gateway is unreachable. All of these machines are under the organization's control (or the control of a trusted provider), and the information in their headers can be trusted. Received headers added by other machines may be forged.

SpamAssassin doesn't check the IP addresses of trusted relays against DNS-based blacklists. By default, SpamAssassin works backward through the Received headers, beginning with the one added by the MTA on its own system (which is always trusted), and decides whether or not the addresses in each header are trusted. SpamAssassin treats Received lines that show messages being received from the local host, from a host on the same /16 subnet, from a host with a private IP address, or by a host with a private IP address as accurate and uses them to infer trusted relays.

When these simple inferences are not sufficient, you can manually define a set of trusted relays or networks using the trusted_networks configuration option, like this:

 trusted_networks 10/8 127/8 209.58.173.10

This specifies that all hosts in the 10.*.*.* range, all hosts in the 127.*.*.* range, and the single host 209.58.173.10 are to be trusted. Multiple trusted_networks directives can be used.

SpamAssassin 3.0 adds the internal_networks configuration option. Set internal_networks to the list of relays or networks that you trust because you manage them (or they are within your organization or are mail exchangers for your organization). trusted_networks may include other hosts that you trust but that are not part of your mail organization. Separating these concepts allows SpamAssassin 3.0 to do a better job of detecting spam from dialup hosts being routed around their ISP's designated outgoing mail server, while still allowing messages from trusted sites to skip blacklist-testing.
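
As a sketch of how the two options might be used together, suppose that the 10.*.*.* network and the gateway 209.58.173.10 belong to your organization, and that 192.0.2.25 is a partner's relay that you trust but do not manage. The partner address is hypothetical, and this sketch repeats the internal hosts in trusted_networks on the assumption that every internal relay is also trusted.

 internal_networks 10/8 209.58.173.10
 trusted_networks  10/8 127/8 209.58.173.10 192.0.2.25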

The second argument to check_rbl( ) is the DNS zone for the blacklist. SpamAssassin checks the blacklist by performing a DNS query for a hostname in this zone. SpamAssassin determines the hostname by reversing the IP address that it's trying to check (e.g., 128.0.10.0 becomes 0.10.0.128) and prepending it to the zone name (e.g., creating 0.10.0.128.new.blacklist.zone). It then issues a query for a DNS A record associated with that hostname. Typically, if an address is blacklisted, the DNS query will succeed and return an IP address (usually 127.0.0.1). If the address is not on the blacklist, the DNS query will fail (returning an NXDOMAIN response).

3.3.1.3 check_rbl_txt( )

Some blacklists are based on DNS TXT records instead of DNS A records. (Blacklist operators should indicate which kind of lookup is appropriate for their blacklist.) Use the check_rbl_txt( ) method to perform lookups using a blacklist based on TXT records. check_rbl_txt( ) accepts the same arguments as check_rbl( ) and works analogously. SpamAssassin reverses the IP address that it's trying to check (e.g., 128.0.10.0 becomes 0.10.0.128) and prepends it to the zone name (e.g., creating 0.10.0.128.new.blacklist.zone). It then issues a query for a DNS TXT record associated with that hostname. If the address is blacklisted, the TXT query will return a string explaining why the address is blacklisted. If the address is not on the blacklist, the DNS query will fail (returning an NXDOMAIN response).
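
A test against a TXT-based blacklist might be sketched as follows. The zone txt.blacklist.zone, the zone ID, and the test name are all hypothetical.

 header   A_NEW_TXT_BLACKLIST  eval:check_rbl_txt('nasties-txt','txt.blacklist.zone')
 describe A_NEW_TXT_BLACKLIST  Received via a relay in txt.blacklist.zone
 tflags   A_NEW_TXT_BLACKLIST  net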

3.3.1.4 check_rbl_sub( )

Some DNSBLs are aggregations of many different blacklists. These DNSBLs typically return different IP addresses in response to a successful A lookup to indicate on which sublist(s) the blacklisted address appears (e.g., the query returns 127.0.0.1 for addresses on sublist 1, 127.0.0.2 for addresses on sublist 2, etc.).

Use the check_rbl_sub( ) method to query a combined DNSBL and determine if the IP address is on a specific sublist. This method also takes two arguments: the first is a zone ID, and the second indicates which response is associated with the desired sublist. For example, if the new.blacklist.zone blacklist is composed of sublists that return 127.0.0.1 and 127.0.0.2, you could check IP addresses against only the second sublist:

 header A_NEW_BLACKLIST    eval:check_rbl('nasties','new.blacklist.zone')
 header NEW_BLACKLIST_2    eval:check_rbl_sub('nasties','127.0.0.2')

Less commonly, composite lists may return a single A record whose IP address is to be interpreted as a bitmask of matching sublists. To check a sublist in this case, provide a bitmask (as a positive decimal number) as the second argument to check_rbl_sub( ).

Note that you must have a rule that uses check_rbl( ) or check_rbl_txt( ) to associate a zone ID string with the blacklist in order to check the result against a sublist.
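
For a bitmask-style list, the rules might be sketched like this. The zone combined.blacklist.zone is hypothetical, and the sketch assumes that the blacklist operator has documented the bit values 2 and 4 as identifying two of its sublists; consult the operator's documentation for the real values.

 header A_NEW_COMBINED     eval:check_rbl('combined','combined.blacklist.zone')
 tflags A_NEW_COMBINED     net
 header NEW_COMBINED_BIT2  eval:check_rbl_sub('combined','2')
 tflags NEW_COMBINED_BIT2  net
 header NEW_COMBINED_BIT4  eval:check_rbl_sub('combined','4')
 tflags NEW_COMBINED_BIT4  net

The first rule performs the lookup and ties the zone ID to the blacklist; the two check_rbl_sub( ) rules only inspect the result of that lookup.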

3.3.2 Body Tests

The body, rawbody, and full directives define tests on the body of an email message. Two basic kinds of tests are provided. Message bodies can be tested against a regular expression pattern, and message bodies can be submitted to an eval test defined in Mail::SpamAssassin::EvalTests.

The body directive defines a test to be applied to the text of a message, as it would be likely to appear to a person reading the message in a text-based mail client. The Subject header is considered to be the first paragraph of the message body. All textual MIME components of the message are decoded, and HTML tags are removed. The message is reformatted into paragraphs (text separated by multiple newlines), and newlines within paragraphs are removed. The test is then applied to each message paragraph. Here's an example of a body test distributed with SpamAssassin that matches if the word "remove" appears in quotes in the body:

 body REMOVE_IN_QUOTES           /\"remove\"/i

The rawbody directive defines a test to be applied to the text of a message, as it would be likely to appear to a person reading the message in an HTML-based mail client. The Subject header is not included. All textual MIME components of the message are decoded, and the message is split into lines based on the line breaks in the message. The test is then applied to each message line. Here's an example of a rawbody test distributed with SpamAssassin that's designed to find a JavaScript statement that's common in spam:

 rawbody HIDE_WIN_STATUS          /<[^>]+onMouseOver=[^>]+window\.status=/i

Note that this test could not be written as a body test because this JavaScript appears inside an HTML tag.

The full directive defines a test to be applied to the full text of a message. All headers are included, along with all textual MIME components of the message body, but no decoding is performed. The message is split into lines based on the line breaks in the message, and the test is then applied to each header and message line. SpamAssassin does not distribute any full tests that match regular expressions; it reserves full for eval tests that must submit the raw message to external spam clearinghouses (which are discussed later in this chapter).

Body tests are powerful but slow. Be especially careful when defining regular expressions to test message bodies, as these expressions will be applied to large amounts of text. Consult Jeffrey Friedl's book Mastering Regular Expressions (O'Reilly) for important tips on optimizing regular expression processing.
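
As one small illustration of the kind of tuning involved, the following sketch contrasts two hypothetical rules that look for similar text. The test names are invented for this comparison, and the actual speed difference will depend on the messages being scanned.

 # Unbounded .* and a capturing group give the regex engine more work
 body LOCAL_CLICK_SLOW  /click.*(here|below)/i
 # A bounded gap and a non-capturing group limit backtracking
 body LOCAL_CLICK_FAST  /click\s.{0,30}(?:here|below)/i

The second form also matches more precisely, because it requires the two words to appear within 30 characters of each other.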

3.3.3 URI Tests

The uri directive defines a test on all URIs that appear in an email message. SpamAssassin creates a list of http, https, ftp, mailto, javascript, and file URIs and transforms bare hostnames starting with www or ftp into appropriate URIs. The test is applied to each URI in the message.

URIs can be matched against a regular expression pattern. Here's an example of a distributed URI test that checks for a mailto URI with the string "remove" in the address portion:

 uri MAILTO_TO_REMOVE            /^mailto:.*?remove/is
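
A custom URI test follows the same pattern. This sketch, with a hypothetical test name, is positive for http or https URIs whose host is written as a dotted numeric address rather than a name. Because the pattern is delimited by slashes, the slashes in the URI itself are escaped.

 uri      LOCAL_URI_NUMERIC_HOST  /^https?:\/\/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?:[:\/]|$)/i
 describe LOCAL_URI_NUMERIC_HOST  URI uses a numeric IP address as its host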

SpamAssassin 3.0 includes a plug-in called Mail::SpamAssassin::Plugin::URIDNSBL. When loaded, this plug-in enables the uridnsbl directive, which takes each URI in the message, extracts the name of the host in the URI, looks up its IP address in DNS, and then checks the IP address against a specified DNSBL. These tests catch spam that is relayed through innocent (or temporary) mail servers but that advertises web sites on spammer servers. Here's a portion of SpamAssassin 3.0's 25_rules.cf file that defines a uridnsbl test called URIBL_SBLXBL:

 loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
 ...
 uridnsbl  URIBL_SBLXBL    sbl-xbl.spamhaus.org.   TXT
 header    URIBL_SBLXBL    eval:check_uridnsbl('URIBL_SBLXBL')
 describe  URIBL_SBLXBL    Contains a URL listed in the SBL/XBL blocklist

3.3.4 Meta Tests

A meta test is a test that combines the results of several other tests using Boolean logic. For example, a meta test might be positive if either of two subtests is positive, or might specify that both subtests must be positive. A meta test can combine several tests using Boolean operators for and ( && ), or ( || ), and not ( ! ), along with parentheses to modify the precedence in the expression.

When using meta tests, you will often want some or all of the subtests to contribute only to the meta test and not to be separately scored. To achieve this effect, give the subtests names that begin with two underscores. This prevents SpamAssassin from scoring them separately. You can then assign a single score to the meta test. Because non-scoring subtests will never be listed in a SpamAssassin report, you need not include a describe directive for these tests.
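
A minimal sketch of this arrangement, using two nonscoring header subtests combined with the or operator ( || ), might look like the following. The test names, patterns, and score are hypothetical.

 header   __LOCAL_FROM_PROMO   From     =~ /promo/i
 header   __LOCAL_REPLY_PROMO  Reply-To =~ /promo/i
 meta     LOCAL_PROMO_SENDER   (__LOCAL_FROM_PROMO || __LOCAL_REPLY_PROMO)
 describe LOCAL_PROMO_SENDER   From or Reply-To address mentions promo
 score    LOCAL_PROMO_SENDER   0.5

Only LOCAL_PROMO_SENDER carries a description and a score; the two subtests exist solely to feed the meta test.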

Example 3-3 shows the CLICK_BELOW meta test in SpamAssassin.

Example 3-3. A meta test and its subtests

 body CLICK_BELOW_CAPS      /CLICK\s.{0,30}(?:HERE|BELOW)/s
 describe CLICK_BELOW_CAPS  Asks you to click below (in capital letters)
 body __CLICK_BELOW         /click\s.{0,30}(?:here|below)/is
 meta CLICK_BELOW           (__CLICK_BELOW && !CLICK_BELOW_CAPS)
 describe CLICK_BELOW       Asks you to click below

The CLICK_BELOW_CAPS test is a standard body test that is positive if the words "CLICK BELOW" or "CLICK HERE" appear in the message in uppercase. Although it is a standard test that is used and scored on its own, SpamAssassin also uses it as a subtest in a meta test. The __CLICK_BELOW test is a nonscoring subtest that is positive if the same phrases appear in any combination of upper- and lowercase letters. The CLICK_BELOW meta test is positive when __CLICK_BELOW is positive and CLICK_BELOW_CAPS is not positive; that is, when the phrase appears in anything except all uppercase. Typically, a mixed or lowercase occurrence is assigned a lower score than the uppercase version.

In addition to using Boolean logic operators, it's also possible to use arithmetic operators ( +, -, *, / ) and comparisons ( >, >=, <, <=, !=, == ). When you combine tests with arithmetic operators, the values of subtests are 1 if they are positive and 0 if they are negative. One such meta test in SpamAssassin is MULTI_FORGED, which counts the number of positive tests for different kinds of Received header forgery and is positive when two or more forgeries appear in the same message. This test is shown in Example 3-4.

Example 3-4. The MULTI_FORGED meta test

 meta MULTI_FORGED     ((FORGED_AOL_RCVD + FORGED_HOTMAIL_RCVD + FORGED_EUDORAMAIL_RCVD + FORGED_YAHOO_RCVD + FORGED_JUNO_RCVD + FORGED_GW05_RCVD) > 1)
