How to Train Thunderbird's Junk Mail Filter | Firefox and Thunderbird Garage

How to Train Thunderbird's Junk Mail Filter

Junk, junk, and more junk: it seems that some days I get more spam email than I do legitimate email. At least I don't get as much as Bill Gates, who reportedly receives four million emails a day, most of which are spam.^[1] If you put your email address out on the Internet, it is likely that at some point your address will be harvested by spammers and you will become a victim of spam email. Ready to enter a contest that has a prize that looks too good to be true? It just might be that the contest you are entering will lead you down the primrose path to an inbox full of spam (not surprisingly, the entry form probably only asked for your email address). Luckily, Thunderbird has an excellent way to keep spam in check.

^[1] Steve Ballmer, the CEO of Microsoft, was quoted in the same story as saying that an entire department at Microsoft is devoted to doing nothing more than ensuring that nothing unwanted gets into Gates' inbox.

Thunderbird classifies junk mail with Bayesian filtering, a system that requires some degree of user intervention and training (see the FAQ later in this section for an explanation of how Bayesian filtering works). To train Thunderbird to weed out spam, you have to manually mark messages as junk by either clicking the Junk icon or going to Message | Mark | As Junk. The important point to remember is that you also need to mark your "good" messages by going to Message | Mark | As Not Junk (note that no icon is available for this in the toolbar). That way, you train the filter on both kinds of mail, which helps it catch a higher percentage of spam.

Tip: In the Early Phase of Training, Check Your Junk Mail Folder

In the early days of training your filter, you will probably want to check your "Junk" mail folder to make sure that no legitimate mail has been classified incorrectly. If it has, you can mark it as not junk so that similar messages are less likely to be marked as spam in the future.

Easy Way to Mark All Your "Good" Mail

If you want to mark all your "good" mail in one fell swoop, the best way is to go to the View dropdown list and select Not Junk, select the messages that are displayed, and then go to the Message menu and mark them as not junk. Selecting View | Sort by | Junk Status is another way to group the messages so that you can accomplish this.

Thunderbird marks junk mail with a junk icon (see Figure 11-1). Note that if you change Thunderbird's theme (see Chapter 13), the Junk icon will likely not look the same as it does in Thunderbird's default theme.

Figure 11-1. The Junk icon.

FAQ: What Is Bayesian Filtering?

Bayesian filtering first came into vogue when Paul Graham covered it in his seminal paper "A Plan for Spam" (http://www.paulgraham.com/spam.html), even though Graham himself admits that Bayesian text classification methods have been used for years. Bayesian filtering is a technique that can be used to classify many types of data (it has been applied in a number of other disciplines, including the sciences and machine learning in AI). Programs such as Mozilla Thunderbird use it to distinguish spam email (junk) from ham email (non-junk).

The essence of Bayesian filtering is estimating how likely certain words are to appear in ham email versus spam email. For example, a word such as "Rolex" might appear frequently in your spam email but rarely in your ham email (unless, of course, you are a watch dealer). The filter cannot know this at first, but the user can train it over time. Once it is trained, the filter uses Bayes' theorem to compute the probability that an email belongs to the spam category rather than the ham category. This assessment looks at all the words (or combinations of words) contained in the email. If the resulting score exceeds a particular threshold, the filter identifies the email as spam. Mozilla Thunderbird has a handy feature that can automatically move these messages to a "Junk" folder.
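
To make the idea concrete, here is a minimal sketch of a word-based Bayesian classifier in Python. It is not Thunderbird's actual implementation: the class name TinyBayesFilter, the 0.9 threshold, the add-one smoothing, and the choice to look only at the words present in a message are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class TinyBayesFilter:
    """Illustrative naive Bayes junk filter (hypothetical, not Thunderbird's code)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed cutoff for calling a message junk
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message the user has marked; label is 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def _word_prob(self, word, label):
        # Share of messages with this label that contain the word, with
        # add-one smoothing so an unseen word never yields a probability of 0.
        seen = self.word_counts[label][word]
        return (seen + 1) / (self.message_counts[label] + 2)

    def spam_probability(self, text):
        """Estimate P(spam | words in text) using Bayes' theorem in log space."""
        total = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: smoothed share of training messages with this label.
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            for word in tokenize(text):
                score += math.log(self._word_prob(word, label))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))

    def is_junk(self, text):
        return self.spam_probability(text) > self.threshold


# Train on both kinds of mail, as the chapter recommends.
junk_filter = TinyBayesFilter()
junk_filter.train("Cheap Rolex watches, buy now", "spam")
junk_filter.train("Buy a replica Rolex today", "spam")
junk_filter.train("Meeting agenda for Monday", "ham")
junk_filter.train("Lunch on Monday? See agenda attached", "ham")

print(junk_filter.is_junk("Buy a cheap Rolex"))              # True
print(junk_filter.is_junk("Agenda for the Monday meeting"))  # False
```

With only four training messages the numbers are crude, but the pattern matches the description above: words seen mostly in messages marked as spam push the score toward the threshold, and words seen mostly in ham pull it back down.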

The user-centric nature of Bayesian filtering does have some distinct advantages over systems that rely on rule-based filtering or point-value scoring, such as Mailshield. This is largely because we all get different types of spam and ham, and the Bayesian system gives the user the flexibility to make corrections over time when email is classified incorrectly (one person's ham may look like spam to another). The downside of the Bayesian system is that it will not perform well if it is not trained (you must mark both the spam and ham email in the training phase), and it needs some amount of training data (a past collection of email messages is helpful in this regard).

Although Bayesian filtering does a good job of nipping spam in the bud after it is trained, spammers are constantly developing new techniques to get mail into your inbox. Recently, I have started to see emails that have my coworkers' names inserted in the subject line. In this case, the spammers are attempting to defeat the Bayesian system by using familiar name patterns. Bayesian filtering isn't perfect; it is just one of the methods being used to fight the seemingly never-ending battle against spam.