Detection: Identifying Spam | Inside the SPAM Cartel: By Spammer-X


There are four main methods of spam detection used today:

Host-based filtering
Rule-based filters
Bayesian statistical analysis
White lists

However, many variations exist within each process, and every application implements each slightly differently.

Ideally, a spam filter identifies a vector, or some other analytical approach, for verifying a message's validity. Some methods are easily bypassed, some require much more work to evade, and others offer highly effective results.

Host and Network-based Filtering with Real-time Black Holes

If adsl-987.company.com sends you ten million e-mail messages, it’s safe to say that it is spam. It doesn’t make sense for a home-based Digital Subscriber Line (DSL) connection to act as its own e-mail gateway (not use its Internet Service Provider’s [ISP’s] mail gateway). No home user would ever send ten million e-mails, and sending them all directly makes the host highly questionable.

Simple, commonsense rules such as this make up the basis for network- and host-based spam filtering. An e-mail host’s validity can be proven by its network address and by how it delivers e-mail. E-mail clients that send suspicious information when delivering e-mail, such as trying to spoof a different address or identify themselves as an obviously fake host, are easily spotted by host-based filters. Look at the following example of client dialup-102.68.121.20.nationalnet.com.kr who is attempting to send spam by using false headers and a spoofed HELO:

From - Thu Jun 12 23:34:41 2004
Return-path: 928jd2e2@mail.freemail.com
Delivery-date: Wed, 11 Jun 2003 01:53:39 +0100
Received: from [102.68.xxx.xx] (helo=195.8.xx.xxx)
    by mail.spammerx.com with smtp (Exim 4.12)
    id 19Earz-0001Ae-00; Wed, 11 Jun 2003 01:53:38 +0100

As can be seen from the address inside the square brackets in these headers, client 102.68.xxx.xx sent this e-mail; however, you can only slightly trust this information. Directly after this, the host sent the command HELO 195.8.xx.xxx. This message is trying to fool the server into thinking its identity is 195.8.xx.xxx. This is a very old trick, and only very old filtering programs fall for it. Furthermore, the return path is directed to mail.freemail.com using a reply e-mail address that looks very much like a random string. The server mail.freemail.com resolves to 195.8.xx.xxx, the same address passed when the spammer sent their HELO command. Saying, “This mail came from freemail.com. Here’s my HELO string. I am mail.freemail.com. Even my reply address is at mail.freemail.com. I am not spam!” is the spammer’s attempt to prove the server’s validity. However, the e-mail had nothing to do with freemail.com, and was actually sent from 102.68.xxx.xx (dialup-102.68.xxx.xx.nationalnet.com.kr), a Korean-based dialup.
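The HELO check described above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of any particular filter: the function name `helo_claims_other_ip` and the regular expression are my own assumptions, and the pattern only understands the Exim-style `from [address] (helo=argument)` layout shown in the example headers. The addresses in the usage note are reserved documentation addresses, not real hosts.

```python
import ipaddress
import re

# Matches the "from [<address>] (helo=<argument>)" fragment of an Exim-style
# Received header. Other mail servers format this line differently.
RECEIVED_RE = re.compile(
    r"from \[(?P<peer>[^\]\s]+)\]\s+\(helo=(?P<helo>[^)\s]+)\)"
)


def helo_claims_other_ip(received: str) -> bool:
    """Return True when HELO is an IP literal that differs from the peer address."""
    match = RECEIVED_RE.search(received)
    if match is None:
        return False  # unrecognized format: make no claim either way
    try:
        peer = ipaddress.ip_address(match.group("peer"))
        claimed = ipaddress.ip_address(match.group("helo"))
    except ValueError:
        # HELO (or the peer field) is not an IP literal; this check does not apply.
        return False
    return peer != claimed
```

Calling `helo_claims_other_ip("Received: from [192.0.2.10] (helo=198.51.100.7) by mx.example.org with smtp")` flags the mismatch, while a header whose HELO is a hostname is left for other checks to judge.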

Network- and host-based filtering was one of the first methods used to detect spam, and although these simple rules can quickly identify and drop large amounts of spam, they can only catch the easy spammers, the ones who are trying to be sneaky. The more professional spammers are not so easy to spot. Take the following example:

You are 16 years old and a friend offers you marijuana. It is the first time you have used any drugs and you feel very anxious. Suddenly, a police car drives by, and there you are on the side of the street holding a large joint. Do you:

Run as fast as you can and hope you can get away?
Quickly hide the joint in your pocket and then turn and start walking away?
Relax and continue smoking since you know it looks just like tobacco?

Who would the police be the most suspicious of? The person who did something wrong and tried to run and hide, or the person who did not do anything wrong and is relaxed? The same mindset is used when filtering spam; if you try to hide information, falsify your identity, and generally lie, you only draw more attention to yourself.

A host-based filter’s primary focus is on the host that is sending the e-mail. Whether this host is previously known for sending spam is determined by several facts about that host. The domain name has a lot of strength in determining if a host is likely to send spam. For example:

Return-Path: <jack@69-162-xxx-xxx.ironoh.adelphia.net>
Received: from [69.162.xxx.xxx] (HELO 69.162.xxx.xxx)
  by aakadatc.net (CommuniGate Pro SMTP 4.0b5) Thu, 18 Jul 2002 04:59:06 -0400
From: jack@69-162-xxx-xxx.ironoh.adelphia.net
To: <you@yourplace.com>
Subject: Hey there.
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4522.xxxx
Date: Thu, 18 Jul 2002 11:23:39 +-0800
Mime-Version: 1.0
Content-Type: text/plain; charset="Windows-1251"

The headers for this e-mail message are valid; the host has not tried to falsify or hide any information. Because the IP address is in the name, the client who sent this e-mail, 69-162-xxx-xxx.ironoh.adelphia.net, looks like a DSL or dial-up connection; this is a common naming scheme when an ISP has a pool of clients connecting to it.
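A filter can spot this kind of pooled hostname mechanically. The sketch below is only one possible heuristic, and the function name `hostname_embeds_ip` and the example host are hypothetical: it pulls every run of digits out of the hostname and checks whether the four octets of the connecting address appear next to each other, forwards or reversed.

```python
import re


def hostname_embeds_ip(hostname: str, ip: str) -> bool:
    """Return True when the IPv4 octets of `ip` appear consecutively in `hostname`."""
    octets = [str(int(part)) for part in ip.split(".")]
    # Normalize digit runs so "045" and "45" compare equal.
    numbers = [str(int(run)) for run in re.findall(r"\d+", hostname)]
    # Every window of four consecutive numbers found in the hostname.
    windows = [numbers[i:i + 4] for i in range(len(numbers) - 3)]
    return octets in windows or octets[::-1] in windows
```

For example, `hostname_embeds_ip("adsl-203-0-113-45.isp.example", "203.0.113.45")` is true, while a name such as `mx1.mail.example` carries too few numbers to match anything. A match says nothing about intent; it only suggests the address belongs to an ISP's customer pool.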

Any mail server that sends e-mail should be able to receive e-mail. If a mail server is unable to receive e-mail, chances are that mail server is not legitimate. One way to check this is by seeing if the mail server’s Domain Name System (DNS) record contains a mail exchange (MX) entry. This tells any client sending e-mail to this host or network that the e-mail should be directed toward a certain host, as seen in the following example from hotmail.com:

[spammerx@spambox spammerx]$ dig hotmail.com MX
;; QUESTION SECTION:
;hotmail.com.                   IN      MX
;; ANSWER SECTION:
hotmail.com.            2473    IN      MX      5 mx4.hotmail.com.
hotmail.com.            2473    IN      MX      5 mx1.hotmail.com.
hotmail.com.            2473    IN      MX      5 mx2.hotmail.com.
hotmail.com.            2473    IN      MX      5 mx3.hotmail.com.
;; ADDITIONAL SECTION:
mx1.hotmail.com.        2473    IN      A       65.54.xxx.xx
mx2.hotmail.com.        2066    IN      A       65.54.xxx.xxx
mx3.hotmail.com.        2473    IN      A       65.54.xxx.xx
mx4.hotmail.com.        2473    IN      A       65.54.xxx.xxx

As you can see, any e-mail sent to user@hotmail.com is directed to mx1.hotmail.com, mx2.hotmail.com, mx3.hotmail.com, and mx4.hotmail.com. The e-mail client will attempt to deliver to the other MX records in case mx1.hotmail.com is down. This host is valid, and if hotmail.com sent you e-mail, this particular e-mail verification check would pass.

Let’s do the same test on 69-162-xxx-xxx.ironoh.adelphia.net and see what DNS records it holds:

[spammerx@spambox spammerx]$ dig 69-162-xxx-xxx.ironoh.adelphia.net MX
;; QUESTION SECTION:
;69-162-xxx-xxx.ironoh.adelphia.net. IN MX
;; AUTHORITY SECTION:
ironoh.adelphia.net.    3600    IN      SOA     ns1.adelphia.net. hostmaster.adelphia.net. 2004081300 10800 3600 604800 86400

As seen from this example, if we sent e-mail to user@69-162-xxx-xxx.ironoh.adelphia.net we would have to rely on the mail server running locally on that host. Mail servers without MX records are not unusual, but they do raise flags with spam filters. This server is even more suspicious because the hostname looks like a high-speed home Internet user, not a company. This host looks like it would send spam and would undoubtedly be flagged by a critical spam filter. Even though the host has no MX record and the e-mail is highly questionable, it still may be legitimate. Who’s to say that user@69-162-xxx-xxx.ironoh.adelphia.net is not a legitimate e-mail address? What happens if a company forgets to set up their MX record?

Accidents do happen, and blocking e-mail purely on DNS information forces many false positives to occur. Your valid e-mail will be suspected as coming from a spam host and dropped, usually without any notification. This is a serious downside to network- and host-based filtering. There are too many exceptions to the rule to have one “blanketed” rule.
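One way to soften the false-positive problem is to treat DNS findings as evidence that raises a score, rather than as grounds for dropping the message outright. The sketch below assumes a caller-supplied `lookup_mx` function (hypothetical; in practice it would wrap whatever DNS library the mail server uses) and uses score values I made up for illustration.

```python
from typing import Callable, Sequence


def dns_suspicion(
    domain: str,
    lookup_mx: Callable[[str], Sequence[str]],
    residential_name: bool,
    no_mx_score: float = 1.0,        # assumed weight, not from any real filter
    residential_score: float = 1.0,  # assumed weight, not from any real filter
) -> float:
    """Add up suspicion points from DNS evidence; never reject on DNS alone."""
    score = 0.0
    if not lookup_mx(domain):
        # No MX record: unusual for a real mail domain, but not proof of spam.
        score += no_mx_score
    if residential_name:
        # Hostname looks like a home DSL or dial-up pool address.
        score += residential_score
    return score
```

The returned number is meant to be combined with other checks, so a company that simply forgot its MX record collects a point or two instead of losing its mail.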

Tricks Of The Trade…RFC 822

Developers are so determined to stop spam from being delivered that they have even broken Request for Comments (RFC) 822, the core layout for the e-mail delivery process. Filtering e-mail simply on whether a host has an MX entry contradicts the RFC. Because of spam, e-mail standards are evolving and changing into an entirely new protocol.

Another popular method of host filtering is detecting an insecure proxy server. As discussed in Chapter 3, insecure proxy servers can be used to relay e-mail anonymously, obfuscating the original sender’s Internet Protocol (IP) address. Any e-mail coming from a known open proxy server is seen as spam—no exceptions.

Different methods can be used to detect if a host is acting as an open proxy. First, servers can query a central spam database such as MAPS (www.mail-abuse.com) or Spamhaus (sbl.spamhaus.org). These servers can determine if a host is indeed an open proxy, by testing the host to see if it is running a proxy server or by looking at past statistics of messages the host sent. Its validity can be proven easily and the knowledge shared with any clients who ask.
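Most DNS-based black hole lists are queried with an ordinary DNS lookup: the octets of the IPv4 address are reversed, the list's zone is appended, and any answer means the address is listed. The sketch below shows that convention using only the Python standard library. The function names are my own, the sketch ignores the return codes that real lists document, and it cannot tell "not listed" apart from a DNS failure.

```python
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up, e.g. 192.0.2.1 -> 1.2.0.192.<zone>."""
    reversed_octets = ".".join(reversed(ip.split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True when the list answers for this address (performs a live DNS query)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        # No such name (or the lookup failed): treat as not listed.
        return False
    return True
```

A mail server would call something like `is_listed(peer_ip, "sbl.spamhaus.org")` while the sender is still connected. The result depends on the live list and on the resolver being used, so treat this as a sketch rather than a drop-in check.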

Notes from the Underground…

ISP Shut Down

In an attempt to ban hosts before a spammer can find them, some real-time black hole lists (RBLs) actively test random hosts to see if they are acting as an open proxy or open relay.

A few years ago, I set up a friend’s mail server for his small ISP. It didn’t take long before the system was up and working. However, I forgot to restart qmail after I added the relay access control list (ACL), which is what denies anyone from using the server as their e-mail relay.

The service was not going to be used for a while, so I added the relay ACL the following day and continued. This meant that for a single day, qmail was acting as an open relay. Funnily enough, an RBL found my IP at random, tested to see if I had insecure relay rules in place, found that I did, and banned my address, all in a single day.

This was not good when the mail server was first launched, because users were reporting that 40 percent of their e-mail was being returned—rejected for coming from a known open relay. I had to find the RBL and again submit my host for verification. Forty-eight hours later my mail server’s IP address was removed from the list, allowing more of my e-mail to be delivered.

Just my luck.

Using an RBL is one of the most effective methods for stopping spam at the network level. It can take only a matter of minutes for a spam-sending server to be detected and blacklisted.

Notes from the Underground…

MAPS

Even though a network-based RBL such as MAPS is effective at catching spam-sending hosts, this is only because nothing can do it better.

In fact, in a recent study by Giga Information Group, it was found that MAPS was only able to block 24 percent of incoming spam, with 34 percent false positives. Network-based spam filtering does not work. Even though a spammer may be using an insecure server to send spam, ten other people may also be using that server legitimately.

What would happen if a spammer began using maila.microsoft.com and MAPS banned this host, even though only one spammer was abusing it and the remaining thousands were using it legitimately?

RBLs are ingeniously designed. Each client using the RBL indirectly tells the server about every client who sends them e-mail. This allows the RBL to quickly identify hosts that send large volumes of e-mail and flag them as possible spam hosts. If an open proxy is not present but the host is still sending large volumes of e-mail, RBLs often judge the server based on other criteria such as valid DNS entries for MX, the host name itself, and past e-mail sending statistics. At a high level, an RBL can graph a host’s statistics and detect from past e-mail usage if that host has a gradual e-mail gradient (sending a few thousand messages more per day) or if the host has just appeared and has sent one million e-mails in the past hour. Hosts are banned quickly when sending spam (often in under an hour), especially when only one single host is sending the spam.
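The difference between a gradual gradient and a sudden burst is easy to express in code. The following sketch is purely illustrative: the function name and both thresholds are assumptions of mine, and a real RBL would weigh far more history than a handful of hourly counts.

```python
from typing import Sequence


def looks_like_burst(
    hourly_counts: Sequence[int],
    new_host_limit: int = 10_000,  # assumed ceiling for a host with no history
    growth_limit: float = 5.0,     # assumed tolerated jump over the average
) -> bool:
    """hourly_counts holds messages seen per hour, oldest first, current hour last."""
    if not hourly_counts:
        return False
    *history, current = hourly_counts
    if not history or sum(history) == 0:
        # The host has just appeared: judge it on raw volume alone.
        return current > new_host_limit
    baseline = sum(history) / len(history)
    # A host that grows slowly stays under the multiplier; a spam run does not.
    return current > baseline * growth_limit
```

A host that has "just appeared and has sent one million e-mails in the past hour" is `looks_like_burst([1_000_000])`, which is flagged, while `looks_like_burst([1000, 1100, 1200, 1300])` describes steady growth and passes.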

Notes from the Underground…

RBLs and Privacy

Although great for stopping spam, RBLs are not good for online privacy. If you have an RBL that 50 percent of the Internet uses, that RBL will have statistics on every e-mail sent and received from 50 percent of the Internet.

For example: user1.com receives e-mail from user2.com. user1.com submits user2.com to an RBL to test if that host has previously been sending spam. The RBL replies, “No,” and the e-mail is delivered. However, this means that the RBL knows that user2.com sends user1.com e-mails, how often, and when.

This means any RBL can correlate and graph the data you give them, allowing them to see 50 percent of the people you e-mailed and who e-mailed you—any private relationships you hold with these people, and also who those people commonly talk to. In terms of privacy, in my opinion, an RBL is a bad idea. It’s funny how most people just seem to trust an RBL without seeing any threat.

Think about it this way: What if RBLs are actually National Security Agency (NSA)-inspired projects used to spy on people? Think about the possibilities if you knew 50 percent of the people who e-mailed knownterrorist.com, or alternatively, any e-mails from known weapon suppliers being sent to North Korea or Iraq-based mail servers. You could pry into every aspect of modern society with the information held in an RBL. The worst thing is, the Internet willingly gives this information to RBLs worldwide.

There are many forms of RBLs. Some focus on the legitimacy of the host sending the e-mail, and others focus on the message content itself.

One interesting method of catching spam is the use of a distributed hash database (discussed briefly in Chapter 3). A hash is a checksum, a unique mathematical representation of each message, and this database contains the hash of every message sent to everyone using the RBL (see Figure 7.1). This allows the RBL to quickly identify that the same message is being sent to many servers, and enables it to warn future clients that the message is probably spam.

Figure 7.1: Sample Architecture of How a Message Checksum Database Works

As each mail server accepts the e-mail, it in turn asks the hash database whether the message is already known as spam. If a spammer sends the same spam to four mail servers, by the third mail server the hash database begins telling any new mail servers that this particular message is spam. Since the first two mail servers received the same message, that message is suspected of being spam and is filtered for any new mail servers the spammer sends it to.
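The flow in Figure 7.1 can be sketched as a small in-memory class. This is a toy model under my own assumptions (the class name, SHA-256 as the checksum, and a threshold of two earlier reporters), not how any production checksum service is built.

```python
import hashlib
from collections import defaultdict


class ChecksumDatabase:
    """Toy shared database: remembers which servers have seen each message hash."""

    def __init__(self, threshold: int = 2):
        # Number of *other* servers that must have seen a message before it
        # is reported as spam (assumed value, matching the example in the text).
        self.threshold = threshold
        self._seen_by = defaultdict(set)

    def check_and_report(self, server: str, message: str) -> bool:
        """Record that `server` received `message`; return True if it looks like spam."""
        digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
        reporters = self._seen_by[digest]
        is_spam = len(reporters - {server}) >= self.threshold
        reporters.add(server)
        return is_spam
```

If four servers named mx1 through mx4 each submit the same message in turn, the first two calls return `False` and the third and fourth return `True`, which is the behavior described above.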

Razor (http://razor.sourceforge.net) is a spam-hashing application that is unique in the way it calculates the hash values for each message. Since the hash value of the message can easily be changed with a single character difference, Razor has gotten much smarter at coping with random mutations spammers may use within the message, and can analyze the message fully, removing trivial permutations such as a random number in every subject. Using a fuzzy signature-based algorithm called Nilsimsa, Razor can create a statistical model of each message. This model is based on the message body minus any slight text mutations it finds. Having a message with a different random number in it no longer fools a hash-based spam filter.

Razor also supports segmented checksums, allowing the spam filter to only pay attention to the last ten lines of the e-mail, or the first five lines. This means that each spam message has to be entirely random throughout the e-mail. Razor offers a highly creative method of stopping spam and I take my hat off to the authors, but it does not stop spam entirely. Since the element Razor is fighting against is the spammer’s ability to be purely random, Razor will fail if the spam message is entirely different every time.
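To make the idea concrete, here is a deliberately simple stand-in. It is not Nilsimsa and not Razor's algorithm: it just replaces digit runs with a placeholder, collapses spacing, and then hashes the first and last few lines separately. The function names and the default segment sizes (five and ten lines, taken from the paragraph above) are illustrative assumptions.

```python
import hashlib
import re


def _digest(lines):
    """Hash a list of lines after stripping trivial mutations."""
    cleaned = []
    for line in lines:
        line = re.sub(r"\d+", "#", line.lower())   # collapse random numbers
        line = re.sub(r"\s+", " ", line).strip()   # collapse random spacing
        cleaned.append(line)
    return hashlib.sha256("\n".join(cleaned).encode("utf-8")).hexdigest()


def segmented_checksums(body: str, head: int = 5, tail: int = 10) -> dict:
    """Return separate checksums for the first `head` and last `tail` non-empty lines."""
    lines = [line for line in body.splitlines() if line.strip()]
    return {"head": _digest(lines[:head]), "tail": _digest(lines[-tail:])}
```

Two copies of a message that differ only by a random number produce identical checksums, and a message with a rewritten opening still matches on its tail segment. A message that is random throughout defeats both, which is exactly the weakness noted above.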

Rule-based Spam Filtering

Rule-based filtering is a method of static analysis undertaken on each e-mail to judge the likelihood that it is spam. This is achieved by matching the probabilities of known spam tactics with frequencies within each e-mail. If an e-mail has ten known spam elements, the filter will assume it is spam. If it has only one, it is considered legitimate traffic and is deliverable. A good implementation of rule-based filtering can be seen in Spam Assassin, which attempts to match thousands of rules to each message; each rule increases or decreases an individual score the message has.

If the score is above a certain threshold the message is declared spam, and if the score is below that threshold it is considered legitimate. This allows you to quickly make your spam filter more critical, if required, by lowering the threshold number. Spam Assassin has a very impressive rule list. A small fragment is shown in Table 7.1, demonstrating the amount of detail that is used and how much each rule adds to a message’s score when triggered.

Table 7.1: A Snippet of Spam Assassin’s Rule Base
Area Tested	Description of Test	Test Name	Default Scores
Header	Sender is in Bonded Sender Program (trusted relay)	RCVD_IN_BSP_TRUSTED	.3
Header	Sender is in Bonded Sender Program (other relay)	RCVD_IN_BSP_OTHER	-0.1
Header	Sender domain is new and very high volume	SB_NEW_BULK	1
Header	Sender IP hosted at NSP has a volume spike	SB_NSP_VOLUME_SPIKE	1
Header	Received via a relay in bl.spamcop.net	RCVD_IN_BL_SPAMCOP_NET	1.832
Header	Received via a relay in RSL	RCVD_IN_RSL	0.677
Header	Relay in RBL, www.mail-abuse.org/rbl/	RCVD_IN_MAPS_RBL	1
Header	Relay in DUL, www.mail-abuse.org/dul/	RCVD_IN_MAPS_DUL	1
Header	Relay in RSS, www.mail-abuse.org/rss/	RCVD_IN_MAPS_RSS	1
Header	Relay in NML, www.mail-abuse.org/nml/	RCVD_IN_MAPS_NML	1
Header	Envelope sender has no MX or A DNS records	NO_DNS_FOR_FROM	1
Header	Subject contains a gappy version of ‘cialis’	SUBJECT_DRUG_GAP_C	1.917
Header	Subject contains a gappy version of ‘valium’	SUBJECT_DRUG_GAP_VA	1.922
Body	Mentions an E.D. drug	DRUG_ED_CAPS	1.535
Body	Viagra and other drugs	DRUG_ED_COMBO	0.183
Body	Talks about an E.D. drug using its chemical name	DRUG_ED_SILD	0.421
URI	URL uses words/phrases which indicate porn	PORN_URL_SEX	1.427
Body	Talks about Oprah with an exclamation!	BANG_OPRAH	0.212

You can see how comprehensive Spam Assassin’s rule set is. I wonder how long it took the creators to come up with the full list (seen at http://spamassassin.apache.org/tests.html).

The rules can come in many forms: words or language used inside the body, the host being listed as a known RBL, or a string of random numbers in the subject. Spam Assassin’s rules attempt to predict common spam elements, and work well for the most part.
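A rule-based scorer boils down to a list of patterns with weights and a threshold. The sketch below borrows three rule names and their default scores from Table 7.1, but the regular expressions are my own rough guesses at what such rules might match, not Spam Assassin's actual patterns. The threshold of five is the default seen in the report later in this section.

```python
import re
from typing import List, NamedTuple, Pattern, Tuple


class Rule(NamedTuple):
    name: str
    area: str              # "subject" or "body"
    pattern: Pattern[str]  # hypothetical pattern, not the real rule
    score: float


RULES = [
    Rule("SUBJECT_DRUG_GAP_C", "subject",
         re.compile(r"c\W+i\W+a\W+l\W+i\W+s", re.IGNORECASE), 1.917),
    Rule("DRUG_ED_SILD", "body", re.compile(r"sildenafil", re.IGNORECASE), 0.421),
    Rule("BANG_OPRAH", "body", re.compile(r"oprah[^.?]*!", re.IGNORECASE), 0.212),
]


def score_message(subject: str, body: str,
                  threshold: float = 5.0) -> Tuple[float, List[str], bool]:
    """Return (total score, names of triggered rules, whether it is declared spam)."""
    areas = {"subject": subject, "body": body}
    hits = [rule for rule in RULES if rule.pattern.search(areas[rule.area])]
    total = sum(rule.score for rule in hits)
    return total, [rule.name for rule in hits], total >= threshold
```

A message with the subject "C.i.a.l.i.s cheap" and a body mentioning sildenafil and "Oprah!" triggers all three rules, yet still scores only about 2.55, well under the threshold. It takes many rules firing together to condemn a message.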

Tricks Of The Trade…“Click Here”

If a rule-based filter only had one rule and looked for the phrase, “Click here,” it would be capable of catching up to 75 percent of spam. How many legitimate e-mails have you received with “Click Here” in them?

These Markovian-based (referring to something random) rules catch the majority of spam, but it is still a very ineffective method of filtering e-mail because every new variation of spam requires a new rule. As shown in Chapter 3, variation in spam is huge. With many mailing programs, it is easy to add random characters, random spaces, and random words to each message making the body of the message seem entirely different. Each Spam Assassin rule is trying to cover a small piece of the entire entropy pool that the spam program uses, which is highly inefficient. Spam Assassin also tries to use other methods in combination with rule-based filtering to attempt to determine the host’s validity (covered in more detail later in this chapter).

You can see the rule method in action in the e-mail headers in the following section. This was obviously a spam e-mail, and was easily detected because it contained the words Viagra and Online Pharmacy (and a disclaimer at the foot of the body). These are common items found inside spam; it’s likely that a seasoned spammer did not send this.

SPAM: Content analysis details:   (40.80 hits, 5 required)
SPAM: USER_AGENT_OE (-0.3 points) X-Mailer header indicates a non-spam MUA (Outlook Express)
SPAM: X_PRECEDENCE_REF (4.6 points) Found a X-Precedence-Ref header
SPAM: GAPPY_SUBJECT (2.9 points) 'Subject' contains G.a.p.p.y-T.e.x.t
SPAM: FROM_ENDS_IN_NUMS (1.6 points) From: ends in numbers
SPAM: LOW_PRICE (-1.2 points) BODY: Lowest Price
SPAM: EXCUSE_14 (-0.2 points) BODY: Tells you how to stop further SPAM
SPAM: EXCUSE_13 (4.2 points) BODY: Gives an excuse for why message was sent
SPAM: VIAGRA (4.2 points) BODY: Plugs Viagra
SPAM: VIAGRA_COMBO (3.8 points) BODY: Viagra and other drugs
SPAM: BILL_1618 (3.8 points) BODY: Claims compliance with Senate Bill 1618
SPAM: ONLINE_PHARMACY (3.2 points) BODY: Online Pharmacy
SPAM: HR_3113 (3.1 points) BODY: Mentions Spam law "H.R. 3113"
SPAM: NO_COST (2.7 points) BODY: No such thing as a free lunch (3)
SPAM: CLICK_BELOW_CAPS (2.4 points) BODY: Asks you to click below (in caps)
SPAM: DIET (2.3 points) BODY: Lose Weight Spam
SPAM: UCE_MAIL_ACT (2.2 points) BODY: Mentions Spam Law "UCE-Mail Act"
SPAM: OPT_IN (1.6 points) BODY: Talks about opting in
SPAM: EXCUSE_10 (1.3 points) BODY: "if you do not wish to receive any more"
SPAM: CLICK_BELOW (0.3 points) BODY: Asks you to click below
SPAM: GAPPY_TEXT (0.1 points) BODY: Contains 'G.a.p.p.y-T.e.x.t'
SPAM: DISCLAIMER (0.1 points) BODY: Message contains disclaimer
SPAM: HTML_FONT_COLOR_RED (-1.2 points) BODY: HTML font color is red
SPAM: LINES_OF_YELLING_2 (-0.7 points) BODY: 2 WHOLE LINES OF YELLING DETECTED

As can be seen in the header section of this message, various rules were triggered and the message’s score was totaled. Some elements of the e-mail raised the score, while others lowered it. The USER_AGENT_OE rule concluded from the Mail User Agent (MUA) header that the message was sent from Outlook Express; it wasn’t. A fake MUA header (the one Outlook Express uses) was sent, and the score was lowered as a result.

However, no amount of score lowering is going to get this e-mail into the network. Because there are so many known spam keywords and spam traits, this e-mail is obviously spam. Final calculations put the total score for this e-mail at 40.80; however, the message only needed a score of five or higher to be declared spam.
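The total is nothing more than the sum of the individual rule scores, negative ones included. As a quick check, the snippet below adds up the points exactly as they appear in the report (the variable names are mine).

```python
import math

# Points from the content analysis report, in the order listed.
points = [
    -0.3, 4.6, 2.9, 1.6, -1.2, -0.2, 4.2, 4.2, 3.8, 3.8, 3.2, 3.1,
    2.7, 2.4, 2.3, 2.2, 1.6, 1.3, 0.3, 0.1, 0.1, -1.2, -0.7,
]
required = 5.0

total = round(math.fsum(points), 2)
is_spam = total >= required
print(total, is_spam)
```

The sum comes to 40.8, matching the reported 40.80 hits. Even if every score-lowering rule had fired at ten times its weight, the message would still clear the threshold of five.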

Commercial Whitelists

A blacklist is a list of known un-trusted parties who are excluded from any service offered. Alternatively, a whitelist is a list of hosts that should never be distrusted and have a guaranteed trust relationship from a previous communication. What a whitelist means in the context of e-mail is simple: if you send userjoe@companyx.com an e-mail, you will get back another e-mail instantly, telling you to click on a link or reply to that e-mail. Your response back to the server verifies that you are not a spammer, since you are contactable and you clicked on something.

Whitelists consider all human-sending clients legitimate and fully trust any communication from them in the future. One such company offering a whitelist service is spamarrest.com. When any user sends an e-mail to a spamarrest.com user, the sender quickly gets an e-mail back (see Figure 7.2) informing them that their identity needs verification. This requires the sender to click on a link within the e-mail.

Figure 7.2: SpamArrest in Action

Once you have clicked the link, verifying that you are a person with an arm and at least one finger, you receive another e-mail, this time informing you that your e-mail is approved and has been passed to the recipient. You are now fully trusted to send this user e-mail, and any further e-mails from this address will not require verification.
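The challenge-and-response cycle can be modeled in a few lines. The class below is a hypothetical sketch of the general technique, not SpamArrest's implementation: unknown senders have their message held and receive a random token (standing in for the link in the verification e-mail), and confirming that token releases the message and whitelists the sender.

```python
import secrets
from typing import Dict, Optional, Set, Tuple


class ChallengeWhitelist:
    """Toy challenge-response whitelist for a single recipient."""

    def __init__(self) -> None:
        self.trusted: Set[str] = set()
        self.pending: Dict[str, Tuple[str, str]] = {}  # token -> (sender, message)

    def receive(self, sender: str, message: str) -> Tuple[str, Optional[str]]:
        """Return ("deliver", None) for trusted senders, else ("challenge", token)."""
        if sender in self.trusted:
            return ("deliver", None)
        token = secrets.token_urlsafe(16)  # would be embedded in the verification link
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def confirm(self, token: str) -> Optional[str]:
        """Called when the link is clicked; returns the held message, if any."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None  # unknown or already used token
        sender, message = entry
        self.trusted.add(sender)
        return message
```

The first message from a new address comes back as a challenge; once `confirm` is called with the token, the held message is released and every later `receive` from that address is delivered immediately.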

This is probably the most effective method available of stopping spam; however, it is also the most intrusive and requires the most human effort of all methods available.

By using a whitelist, you may lose up to 1 percent of all incoming e-mail simply from people unwilling to click on a link or not receiving the verification e-mail because it was caught by someone else’s spam filter or whitelist. It is yet another link in the e-mail chain and it can possibly stop an e-mail’s delivery, but the risk is worth it. The shelter from spam that a whitelist provides is considerable and, as seen later in this chapter, it can be very hard to evade.

Whitelists like this come at a price, though. spamarrest.com’s free or “lite” version of the service places large advertising banners on any e-mails and is obviously not ideal for all companies. The professional version costs $34.95 per year, is free of advertisements, and includes e-mail support for any problems you may have.

Bayesian Filters and Other Statistical Algorithms

If you became your own spam filter and analyzed every e-mail message you received every day, you would quickly pick out common phrases such as names, companies, or Web sites that identify them as spam.

The more you did this, the more you would distrust the use of those phrases within e-mail. If you saw the same spam message five times with the same subject each time, the sixth time you received this e-mail you would not open it up because you know it contains spam. The other five messages have reduced the amount of trust you hold towards that particular message subject. When you began receiving legitimate e-mails again with the same subject, you would gain more trust for that subject. This is basic human nature at work and is in essence the basis for how a Bayesian filter operates.

Bayesian filters calculate the statistical probability of e-mail being spam, based on previous analyses of spam messages you have deleted. These probabilities and frequencies are then collated into rules that are applied to all incoming e-mail you receive. Elements from many different spam messages are used to identify new spam messages. In turn, the keywords found in the next message the filter catches can be used to help identify new spam e-mails that are similar in nature.

Using this data as a comparison technique is a highly effective method of filtering. Based on my own experience with a filter that learned from 4,000 deleted spam messages, you can expect a 99.8 percent filtering rate on spam e-mails.

Because you tell the Bayesian filter what spam is, you can highly personalize it to the spam you receive. This is far superior compared to a rule-based system. If you never get spam for Viagra products, why check to see if the e-mail you receive contains the word Viagra? Spam is personal; everyone receives different types of spam. Bayesian filters will grow to match the spam you receive, so that you are able to detect and delete new specific types of spam that are unique to you. On the downside, a Bayesian filter needs to learn, and you need to be proactive in teaching it about the spam you receive.
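The learning loop can be sketched as follows. This is a simplified filter in the general style of the statistical approach, under my own assumptions: the class name, the tokenizer, the 0.4 default for unseen words, the clamping to the range 0.01 to 0.99, and the use of the 15 most telling tokens are illustrative choices, not a faithful copy of any published filter.

```python
import math
import re
from collections import Counter


class BayesFilter:
    def __init__(self) -> None:
        self.spam_tokens, self.ham_tokens = Counter(), Counter()
        self.spam_messages = self.ham_messages = 0

    @staticmethod
    def tokens(text: str) -> set:
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text: str, is_spam: bool) -> None:
        """Teach the filter one message you have classified yourself."""
        if is_spam:
            self.spam_tokens.update(self.tokens(text))
            self.spam_messages += 1
        else:
            self.ham_tokens.update(self.tokens(text))
            self.ham_messages += 1

    def token_probability(self, token: str) -> float:
        spam_rate = self.spam_tokens[token] / max(self.spam_messages, 1)
        ham_rate = self.ham_tokens[token] / max(self.ham_messages, 1)
        if spam_rate + ham_rate == 0:
            return 0.4  # never seen before: assumed, mildly innocent default
        return min(0.99, max(0.01, spam_rate / (spam_rate + ham_rate)))

    def spam_probability(self, text: str) -> float:
        probs = [self.token_probability(t) for t in self.tokens(text)]
        # Keep only the tokens that say the most, one way or the other.
        probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:15]
        log_spam = sum(math.log(p) for p in probs)
        log_ham = sum(math.log(1 - p) for p in probs)
        # Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed in log space.
        return 1 / (1 + math.exp(log_ham - log_spam))
```

After training on a couple of pharmacy spams and a couple of ordinary work e-mails, a new message containing "viagra pharmacy" scores close to 1, while "monday meeting" scores close to 0. Train it on different mail and the same words would score differently, which is the personalization described above.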

Tricks Of The Trade…Thomas Bayes

The word “Bayesian” comes from Thomas Bayes, an eighteenth-century mathematician (he died in 1761) whose work described a new method of calculating the probability of an event occurring based on past statistics.

Paul Graham coined the term for spam filters when he released a paper on using the statistical algorithm to catch spam. Within weeks, the algorithm had been implemented in most large open-source spam filters, such as Spam Assassin. The full specification for the term and the mathematics behind it can be read at Paul’s Web site, www.paulgraham.com/spam.html.

There has been recent progress in catching spam with statistical filters such as Bayesian ones. The spam-catching program DSPAM, as in de-spam (www.nuclearelephant.com/projects/dspam), can use the filter more efficiently and can achieve a catch rate of up to 99.9981 percent.

DSPAM’s trick is using a data sanitization technique on the e-mail before the second content-based Bayesian filter is used. Cleaning the e-mail of all mutations, random data, and noise-based words allows the message to be parsed more efficiently the second time around. This cleaning method is called Bayesian Noise Reduction (BNR) and is designed to learn from the typing styles, word spacing, random letters, and useless phrases found in known spam e-mails. BNR is able to use a Bayesian method to remove unneeded characters before the message is handed to the second content-based Bayesian filter. A mixture of Bayesian and other language-related algorithms is used to determine if each word on the page is needed in the sentence, or is vital to the body of the e-mail or the context of the language used.

Tricks Of The Trade…DSPAM White Paper

If you are interested in the logic involved in random and junk word detection, read the published white paper by the DSPAM authors found at www.nuclearelephant.com/projects/dspam/BNR%20LNCS.pdf.

Additional filters are passed over the message to determine if each word in the e-mail is in the dictionary and should be in the sentence. These rules are also designed to catch rogue numbers and extra Hypertext Markup Language (HTML) tags that might be used to obfuscate the true nature of the e-mail.

Using two Bayesian filters in this way is a highly effective way to filter spam. DSPAM is one of the smartest implementations of Bayesian filters I have ever seen. Although this can mean you have a very involved mail server setup, two Bayesian filters really have no other downside. The first filter doesn’t remove the characters from the e-mail permanently, and an untouched version is delivered when the message is declared legitimate.

Combination Filters, Mixing, and Matching

If one filter provides a 95 percent spam catch rate, two filters should be able to provide 98 percent, three filters should be able to provide 99 percent, and so on. This methodology has led to the design of some very significant chains of spam filtering solutions, often with three or four filters running against each e-mail as it makes its way to the user’s in-box. These filter sets can tie into a network- or host-based check with an RBL, and then run one or two Bayesian filters through the e-mail contents. It’s amazing that spam has reached such a level of annoyance that people are required to use four spam filters. Mail servers have to be substantial in both size and power, often using separate spam filtering servers in the process, all for the sake of filtering spam.
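It helps to see where those figures sit against the theoretical best case. If every filter missed spam completely independently of the others (an assumption that rarely holds, since filters tend to miss the same cleverly written messages), the combined catch rate would be one minus the product of the individual miss rates. The function name below is my own.

```python
from typing import Iterable


def combined_catch_rate(rates: Iterable[float]) -> float:
    """Catch rate of a filter chain, assuming each filter's misses are independent."""
    missed = 1.0
    for rate in rates:
        missed *= 1.0 - rate  # a message gets through only if every filter misses it
    return 1.0 - missed
```

Under that idealized assumption, two 95 percent filters would catch 99.75 percent of spam. The more modest 98 percent quoted above is what you would expect when the filters overlap in what they miss, which is why chains keep growing to three or four filters.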

Spam Assassin is a good example of software that implements a combination of filters. It contains the following different spam assessments, each adding to the previously used filter’s success.

Header Analysis Spam Assassin attempts to detect tricks used by spammers to hide their identity, or to convince you that you subscribed to their newsletter or agreed to accept e-mail from them.
Text Analysis Spam Assassin uses a comprehensive rule-based pattern match to analyze the message body to look for known text that indicates spam.
Blacklists and RBLs Spam Assassin actively uses many existing blacklists such as MAPS, ORBS, and Spamhaus as a method of detecting spam-sending hosts.
Bayesian Filters Spam Assassin uses a Bayesian-based probability analysis algorithm, allowing users to train filters to recognize new spam e-mails they receive.
Hash Databases Razor, Pyzor, and DCC are all supported by Spam Assassin, and allow for quick hash generation of known spam messages. Acting as a primary filter, a hash database is one of the first filters to detect the validity of a message.

Even though these spam detection processes are in place, a large amount of spam still gets through. It’s overwhelming when you realize that spam filters are catching 95 to 98 percent of all obvious spam. If you receive ten spam e-mails a day and your filter catches 98 percent, those ten are the 2 percent that slipped through, and your filter has blocked up to 490 other spam messages!
