Shortly after his death in 1761, the work of Reverend Thomas Bayes (1702–1761) on probability was published by his friend Richard Price. Its central result is now known as Bayes' Theorem. Almost 240 years later, Paul Graham published his paper "A Plan for Spam," which proposed using Bayes' Theorem (in a slightly modified form) to detect spam.
To use Bayesian filtering, you must have a collection of spam emails (several thousand would be best) and a collection of emails that are not spam (again, a few thousand). These messages are fed to the filter software, which builds a list of words for each category, spam and nonspam. Each word is stored with a count of how often it occurs, so a word such as Viagra might have a relatively high count in the spam messages but not appear at all in the nonspam messages. After these counts are created, they are plugged into an algorithm that computes, for each word, the probability that it is an indicator of spam. Here's how it works:
Finally, we take the resulting probability and make a decision. Most systems flag an email as spam when the probability is 90% or greater. Empirical testing shows that few emails fall in the middle of the range; most score either near the top (and are spam) or near the bottom (and are not spam).
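As a rough illustration of the counting, scoring, and decision steps described above, the following Python sketch assumes the formulas from Graham's "A Plan for Spam": nonspam counts are doubled, per-word probabilities are clamped to between 0.01 and 0.99, words seen too rarely get 0.4, and only the 15 most telling words are combined. The function names, the tokenizer, and the `min_count` parameter are illustrative assumptions and are not taken from any particular filter.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def count_words(messages):
    # Count how often each word occurs across one category of messages.
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def word_probability(word, spam_counts, ham_counts, n_spam, n_ham, min_count=5):
    # Probability that a single word indicates spam.
    b = spam_counts[word]
    g = 2 * ham_counts[word]  # nonspam counts doubled to reduce false positives
    if b + g < min_count or b + g == 0:
        return 0.4  # rarely seen or unknown word: mildly innocent
    spam_freq = min(1.0, b / n_spam)
    ham_freq = min(1.0, g / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))  # never fully certain either way


def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham,
                     min_count=5, top_n=15):
    # Score each distinct word once (an assumption), then keep the words
    # whose probabilities lie farthest from the neutral value 0.5.
    probs = [
        word_probability(w, spam_counts, ham_counts, n_spam, n_ham, min_count)
        for w in set(tokenize(text))
    ]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spam_product = math.prod(top)
    ham_product = math.prod(1.0 - p for p in top)
    return spam_product / (spam_product + ham_product)


def is_spam(probability, threshold=0.9):
    # Flag the message when the combined probability reaches the threshold.
    return probability >= threshold
```

With a real corpus, `count_words` would be run once over each collection and the resulting counts reused for every incoming message. The default `min_count=5` only makes sense with thousands of training messages; for a toy corpus it has to be lowered, otherwise every word falls back to 0.4. Because the clamped word probabilities sit close to 0.01 or 0.99 for strongly indicative words, the combined score tends to land near one end of the range, which matches the observation that few messages score in the middle.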