Playing the Language Game: Tips on How to Beat Bayesian Filters

Bayesian filtering is the way of the future. As each day passes, a Bayesian filter learns more about what spam messages look like, how they sound, and which content tokens they usually contain. Although this pseudo-intelligence is increasingly effective, it suffers from several major logic flaws. The term Bayesian refers to statistical methods based on Thomas Bayes’ probability theorem, which combines prior knowledge with accumulated experience. Thus, the only way to beat a Bayesian filter is to create spam that leans toward the statistically unknown or the statistically legitimate. If a spam e-mail is so far-fetched that the filter has never seen anything like it, chances are it will receive a lower score than a message that closely resembles spam the filter has already seen. The same is true for e-mails containing legitimate content. If you can make spam look like a legitimate e-mail, the filter will be beguiled into believing the message content is legitimate. In essence, this is the basis for Bayesian filter evasion. However, there is a fine line between data evasion and data corruption when trying to evade a Bayesian-based filter.
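To make the scoring idea concrete, the following is a minimal sketch of how a naive Bayesian filter might combine per-token probabilities into one message score. The token table, the neutral value assigned to unknown tokens, and the function name are all assumptions made for illustration; real filters learn these values from training mail and differ in detail.

```python
from math import prod

# Hypothetical per-token spam probabilities a trained filter might hold.
TOKEN_SPAM_PROB = {
    "viagra": 0.99,
    "pharmacy": 0.95,
    "discount": 0.90,
    "meeting": 0.05,
    "invoice": 0.10,
}

# Assumed near-neutral probability for tokens the filter has never seen.
UNKNOWN_PROB = 0.4


def spam_score(message: str) -> float:
    """Combine per-token probabilities using the naive Bayes rule."""
    probs = [TOKEN_SPAM_PROB.get(tok, UNKNOWN_PROB)
             for tok in message.lower().split()]
    if not probs:
        return 0.5  # nothing to judge, so stay neutral
    spam = prod(probs)                    # evidence that the message is spam
    ham = prod(1.0 - p for p in probs)    # evidence that it is legitimate
    return spam / (spam + ham)
```

Under these assumed numbers, a message made of known spam tokens scores close to 1, a message made of known legitimate tokens scores close to 0, and a message made entirely of unseen tokens lands below the midpoint, which is the "statistically unknown" behavior described above.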

Chapter 7, “Spam Filters: Detection and Evasion,” shows basic techniques that, when used correctly, keep the message highly readable to the recipient while obfuscating it slightly to a filter, usually just enough to lower the score and give the mail a higher chance of being delivered (these techniques are primarily aimed at untrained or slightly older filters). The following technique takes the basic evasion method one step further by attempting to evade filters of higher intelligence. Whether it is a well-trained filter or a filter yet to be released, these methods further obfuscate a message and hinder a spam filter’s ability to detect the content. However, you do risk making the spam harder for the recipient to read and understand.

The tradeoff in filter evasion is that corrupting the body of the message to the point where it will hopefully evade all filters also hurts readability. Although the message will be hard for any filter to understand, there’s a strong chance that the recipient will also be unable to understand what it really means. This can greatly reduce the impact of your spam on the reader. Who buys a product from an e-mail they cannot understand? Carefully constructing the language used inside spam e-mails is therefore vital to spam filter evasion.

The English language is rich and meaningful. It is also complex because of the number of exceptions it has to its own rules; homophones (words that sound the same but are spelled differently) are a good example of this. Consider the sentence “Gary walked down to the beach and saw a large beech tree.” If the word “beach” was a known spam word and e-mails containing it were instantly filtered, you could get the same message across using the alternative word, “beech.” It makes sense when read, although the word is slightly out of place in context. This is where the English language itself can be used as a spam evasion technique: you can obfuscate words with different spellings, meanings, and suggestions. The example used earlier in this chapter was “Guess what I am pointing at you, thanks to my wondrous tablets.” This used no known spam content but was suggestive as to the focal point of the message, and English-speaking readers would have few problems understanding what the message was referring to.

The trick is to know what language is banned. Obvious words such as “Viagra,” “Hot Teen,” and “Penis” are a dead giveaway, but just how much language parsing do spam filters undertake? The next example gives a short list of common words that spam filters look for in a message that is selling medication or online pharmacy services. The asterisks are wildcards of any length, while the question marks can be any single letter or number. Knowing what is being looked for will let you create a message that stays within the boundaries of a filter and is not so obvious about its true nature. Remember, language is your friend.

Ch??p Med* Generic Via* online pharmacy discount med* Viagra Cialis Levitra Vicodin Tramadol Vioxx Fioricet V?i?a?g?r?a C?i?a?l?i?s valium ?iagra V?agra Vi?gra Via?ra Viagr? Penis P?n?s Pen?s Pen??? Penil? Penis enl* Erection Pe???? dysfunction *Ile enlagement

As suspected, words such as Viagra and Valium are classed as obvious spam tokens, but because of the single-character wildcards, Ciagra and Xiagra would also be flagged. This is an attempt to stop spammers from altering single letters in drug names. It is very hard to fool a filter with product and brand names; for example, Viagra is a brand name with only one correct spelling.
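The wildcard notation in the list above can be sketched as a small matcher. This is only an illustration under stated assumptions: an asterisk is treated as any run of word characters, a question mark as exactly one letter or digit, matching is case-insensitive, and the helper names and the four sample patterns are chosen here for the example. It is not the implementation of any particular filter.

```python
import re

# A few patterns taken from the list above.
PATTERNS = ["?iagra", "V?agra", "Ch??p Med*", "Pen???"]


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate the list's wildcard notation into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")           # assumed: any run of word characters
        elif ch == "?":
            parts.append(r"[A-Za-z0-9]")   # exactly one letter or digit
        else:
            parts.append(re.escape(ch))    # literal character
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def matched_patterns(text: str) -> list:
    """Return every listed pattern that matches somewhere in the text."""
    return [p for p in PATTERNS if wildcard_to_regex(p).search(text)]
```

With this translation, "Xiagra" is caught by the "?iagra" pattern even though it is not a real product name. The same breadth has a cost for the filter: a pattern as loose as "Pen???" also matches an innocent word such as "pencil".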

Notes from the Underground…

Product Names

In my opinion, product names should not be used in spam evasion. They are too easy to detect.

Brand names are filtered out very quickly by a hash-based or intelligent spam filter. Using them isn’t worth the trouble, because everyone knows that online pharmacies sell Viagra and Cialis. An indirect description is harder to detect. For example:

Attention shoppers, there is a huge mark-down on all epididymis developing products

Epididymis is a medical term for part of the male reproductive anatomy (strictly speaking, a duct attached to the testicle rather than the penis itself). Although the number of people who know what this term means is probably small, it will not be flagged by the filter. The two rules of SpamAssassin that attempt to find “Body Enhancement” spam are seen in Example 7:

body BODY_ENHANCEMENT /\b(?:enlarge|increase|grow|lengthen|larger\b|bigger\b|longer\b|thicker\b|\binches\b).{0,50}\b(?:penis|male organ|pee[ - ]?pee|dick|sc?hlong|wh?anger|breast)/i
body BODY_ENHANCEMENT2 /\b(?:penis|male organ|pee[ - ]?pee|dick|sc?hlong|wh?anger|breast).{0,50}\b(?:enlarge|increase|grow|lengthen|larger\b|bigger\b|longer\b|thicker\b|\binches\b)/i

These rules are expressed as regular expressions (a method of matching patterns within words and sentences). If the body contains any of the words “enlarge,” “increase,” “grow,” “lengthen,” “larger,” “bigger,” “longer,” “thicker,” or “inches” within 50 characters of “penis,” “male organ,” “pee pee,” “dick,” “schlong,” “whanger,” or “breast,” in either order, the message activates the spam rule, thereby increasing the message’s score. However, the phrase “epididymis developing products” evades these two rules while keeping the context and meaning intact. Another possible evasion phrase is “Web-based medicine stores offering economical solutions to emasculation problems.” This is a complicated way of saying “Online pharmacy offering cheap erectile dysfunction medications.” Sending the plain version in an e-mail almost guarantees that it will be filtered instantly.
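How the two rules quoted above behave can be checked with a short sketch that loads them into Python’s `re` module. The patterns are transcribed from the rules as printed; the dictionary, the function name, and the sample sentences in the note below are assumptions for illustration, and Python’s regex engine stands in for the Perl engine that SpamAssassin actually uses.

```python
import re

# The two word groups, transcribed from the rules as printed above.
GROWTH = (r"(?:enlarge|increase|grow|lengthen|larger\b|bigger\b|"
          r"longer\b|thicker\b|\binches\b)")
ORGAN = r"(?:penis|male organ|pee[ - ]?pee|dick|sc?hlong|wh?anger|breast)"

# Each rule allows at most 50 characters between the two groups.
RULES = {
    "BODY_ENHANCEMENT": re.compile(
        r"\b" + GROWTH + r".{0,50}\b" + ORGAN, re.IGNORECASE),
    "BODY_ENHANCEMENT2": re.compile(
        r"\b" + ORGAN + r".{0,50}\b" + GROWTH, re.IGNORECASE),
}


def rules_hit(body: str) -> list:
    """Return the names of the rules that fire on a message body."""
    return [name for name, rule in RULES.items() if rule.search(body)]
```

Calling `rules_hit("Enlarge your penis today")` fires the first rule, while a sentence with the word order reversed fires the second. When the two words sit more than 50 characters apart, neither rule fires, which shows that the rules test proximity and not just the presence of both words.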

By using the English language as a method of shrouding text, you can keep a high level of legitimacy in your message body and evade spam filters without using highly obvious tactics such as random numbers or words. This method is limited only by your imagination. If you want each spam to carry a unique phrase, you can easily integrate synonym swapping into your message body. Instead of inserting random data, swap words for their synonyms, as seen in Table 8.1.

Table 8.1: Synonym Swapping Technique
Order	All	Medication	for	Yourself	Here
Purchase	Complete	Tablet bottles	for	Your-person	At this point
Buy	Entire	Medicine ranges	for	You	At this location
Requisition	Whole	Remedy	for		Now
Request	Comprehensive	Cures	for		At this address
Seize	Wide-ranging	Therapy products	for		At this cursor

If you pick a different row for each column, you can quickly build a unique phrase for each e-mail. The phrase will loosely have the same meaning but use entirely different wording each time. For example:

Purchase wide-ranging cures for yourself at this location

or

Requisition complete medicine ranges for yourself now

Producing many different variations of the same phrase helps a message’s chances against a hash-based filter, since the data is more varied than one single phrase repeated in every message. In theory, if you use a large enough pool of synonyms you can create an endless supply of paragraphs that always carry the same meaning but always look unique to spam filters, without containing any noise or obviously random data.
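The effect on a hash-based filter can be illustrated with a small sketch. It assumes the simplest possible design, a filter that stores an exact digest of each known spam phrase; SHA-256, the function name, and the variable names are choices made for this example, and real hash-based filters may normalize text or use fuzzy digests instead. The column counts are taken from the non-empty cells of Table 8.1.

```python
import hashlib
from math import prod

# Non-empty entries in each column of Table 8.1 (column 4 is always "for").
COLUMN_SIZES = [6, 6, 6, 1, 3, 6]


def digest(text: str) -> str:
    """Exact-match fingerprint, as a naive hash-based filter might store it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


variant_a = "Purchase wide-ranging cures for yourself at this location"
variant_b = "Requisition complete medicine ranges for yourself now"

# A hypothetical blocklist that has only ever seen the first variant.
known_spam_digests = {digest(variant_a)}

# Number of distinct phrases the table can produce.
total_variants = prod(COLUMN_SIZES)
```

Here `digest(variant_b)` is not in `known_spam_digests`, so an exact-match filter that has seen one phrasing learns nothing about the other, and even this small table yields 3,888 distinct phrasings.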
