Noise Filters: Detecting Your Random Data


Throughout this book, you have seen how essential random data can be and how its presence can help a spam message’s chance of being delivered. In Chapter 7 we looked at the concept of random data and how it statistically reduces a message’s spam score. However, what do you do when a spam filter detects your random data, locates it in the spam, and removes all of the random junk before analyzing the content with the primary spam filter? What about when a filter reads and understands your messages, detecting no flow or structure between your random words from the dictionary? If you’re a spammer, you’re helpless as the filter quickly parses out your random data and exposes the true character of the message body. Without the spacer words you added to the message, it becomes much easier to judge the spam content.

A real-life example of this is your own ability to focus on a single object, such as the book you are reading right now. The items in your peripheral vision are not clearly visible; they have blended into the world around you. You ignore them because they are unimportant to what you’re focusing on. The same approach is taken when parsing e-mail: the filter attempts to focus only on the relevant data (for example, the key points in a sentence). The following examples demonstrate e-mails that have been processed by a filter.

Example 1 shows a legitimate e-mail containing the word “Viagra”:

Hey spammerx, hows life? I met up with Andy on the weekend, great guy, the stupid idiot is buying generic Viagra now from the store, trying to pass it off to people as ecstasy, what an idiot Oh well, I catch yah round some time.

After being filtered, this e-mail becomes:

Hey Spammerx hows life met with Andy weekend great guy stupid idiot buying generic Viagra now from store trying pass people ecstasy what idiot well catch yah round some time

Example 2 shows a spam e-mail that contains the word “Viagra”:

improving the quality of people's lives is what Prescription Medications are designed to do, we can offer you Viagra at a very cheap rate! http://www.drugsaregood.com With this he began walking in the air toward the high openings, and Dorothy and Zeb followed him It was the same sort of climb one experiences when walking up a hill, and they were nearly out of breath when they came to the row of openings, which they perceived to be doorways leading into halls in the upper part of the house

After being filtered, this e-mail becomes:

improving quality peoples lives what Prescription Medications are designed can offer you Viagra very cheap rate http://www.drugsaregood.com With this began walking air toward high openings Dorothy Zeb followed him was same sort climb one experiences when walking hill they were nearly breath when came row openings which perceived doorways leading into halls upper part house

After each word is sequentially assessed to see if it commonly appears in known spam e-mails, the message body is given a total score based on the individual score each word received. Although the words “generic” and “Viagra” appear next to each other in the first e-mail, the message content is legitimate. This message receives a relatively low spam score and its chance of successful delivery is high, since it lacks the key evidence that would suggest it is spam. (A link or Web site address in the message body would be needed to mark this message as typical Viagra spam.)
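The word-by-word scoring described above can be sketched roughly as follows. This is a minimal illustration rather than any particular filter's implementation: the `WORD_SCORES` table, the `NEUTRAL_SCORE` default, and the choice of averaging the per-word scores are all assumptions made for the example.

```python
import re

# Hypothetical per-word spam scores (0.0 = never seen in spam, 1.0 = only in spam).
# A real filter would learn these from a corpus; these numbers are invented.
WORD_SCORES = {
    "viagra": 0.9,
    "generic": 0.7,
    "cheap": 0.8,
    "prescription": 0.8,
    "medications": 0.8,
    "http": 0.6,
}
NEUTRAL_SCORE = 0.4  # assumed score for any word missing from the table


def score_body(body: str) -> float:
    """Return the mean per-word score of a message body."""
    words = re.findall(r"[a-z]+", body.lower())
    if not words:
        return 0.0
    total = sum(WORD_SCORES.get(word, NEUTRAL_SCORE) for word in words)
    return total / len(words)


legit = ("Hey Spammerx hows life met with Andy weekend great guy "
         "stupid idiot buying generic Viagra now from store")
spam = ("Prescription Medications Viagra very cheap rate "
        "http://www.drugsaregood.com")
print(score_body(legit) < score_body(spam))
```

With these made-up numbers, the legitimate message is diluted by many neutral words, while the short sales pitch with a link scores higher.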

Removing the spacing and filler words is the most basic form of random noise filtering: a method of cleaning an e-mail of surplus or junk data so that a content-based Bayesian filter can perform more efficiently. The filter is more efficient because there is less content to process and a higher chance of detecting a message’s true nature. Filtering takes place on many levels, from removing two- and three-letter words and all duplicates in its simplest form to removing neutral words, that is, words that carry no spam connotations and are passive in nature.
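As a rough sketch of this simplest level of noise filtering, the following Python function strips punctuation and filler words and can optionally drop repeated words. The `FILLER_WORDS` list and the `strip_noise` name are hypothetical; a real noise filter would use a far larger, possibly learned, word list.

```python
import re

# Hypothetical filler-word list, chosen only for illustration.
FILLER_WORDS = {"i", "a", "an", "the", "is", "it", "to", "on", "up", "of",
                "as", "off", "oh", "in", "we", "do"}


def strip_noise(body: str, drop_duplicates: bool = False) -> list[str]:
    """Remove punctuation and filler words, optionally dropping repeats."""
    tokens = re.findall(r"[A-Za-z0-9']+", body)
    kept = []
    seen = set()
    for token in tokens:
        key = token.lower()
        if key in FILLER_WORDS:
            continue  # filler word: carries no meaning for the content filter
        if drop_duplicates and key in seen:
            continue  # repeated word: already passed on once
        seen.add(key)
        kept.append(token)
    return kept


print(" ".join(strip_noise("I met up with Andy on the weekend, great guy")))
```

The surviving tokens are what would be handed to the underlying content filter.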

Example 3 is Example 2 with all neutral and passive words removed, which further shortens the body and leaves only key words behind, making the message easier to analyze.

improving quality peoples lives what Prescription Medications are designed can offer you Viagra very cheap rate http://www.drugsaregood.com With this began walking air toward high openings Dorothy Zeb followed him was same sort climb one experiences when walking hill they were nearly breath when came row openings which perceived doorways leading into halls upper part house

becomes:

improving quality peoples lives Prescription Medications Viagra very cheap rate http://www.drugsaregood.com began walking air toward high openings Dorothy Zeb followed climb experiences walking breath openings perceived doorways part

When content-based filters look at this, the message is significantly shorter. You can quickly see the main theme by reading the key words:

"Prescription Medication Viagra, very cheap rate http://www.drugsaregood.com"

The aim is to scale down the message size, remove common elements from the entire e-mail, and reduce the total text, making the content filter’s job easier.

The majority of e-mails received share common structures. If you analyzed the header, mid-section, and footer of each legitimate e-mail you receive, you would be able to quickly compile common language rules to help identify random or junk data held within the message. Data that is out of place or uncommon in day-to-day e-mails can be quickly filtered out of view, leaving only the real “spam” data for the content filter to see. For example, you might ask your filter to consider the following list of questions when analyzing a message for spam:

Are strings of random numbers often located in legitimate e-mails, and should these be paid attention to?
Do e-mails often contain a common ending phrase such as “Thanks” or “Catch you later”? What do the majority of your e-mail contacts say when finishing an e-mail?
Are Chinese or Korean words common in the body of legitimate e-mails, or is this something we should notice?
How often do you receive legitimate e-mails with long message bodies?

By powering these rules with a Bayesian algorithm, a random noise detection filter can learn from your past e-mail statistics and easily detect sections within new e-mails that seem out of place. The more you use the filter and identify spam messages, the more the filter learns from those messages, detecting new kinds of random data and sections that are of no use in the body of an e-mail. This makes any content filter more efficient, and it is why Bayesian Noise Reduction has become the next great tool in spam filtering.
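To make "learning from past e-mail statistics" concrete, here is a minimal sketch of per-token Bayesian statistics. The `TokenStats` class, the add-one smoothing, and the assumption of equal prior odds for spam and legitimate mail are illustrative choices, not a description of any specific product.

```python
from collections import Counter


class TokenStats:
    """Per-token statistics learned from messages the user has labeled."""

    def __init__(self) -> None:
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, tokens: list[str], is_spam: bool) -> None:
        # Count each token at most once per message.
        unique = {token.lower() for token in tokens}
        if is_spam:
            self.spam_counts.update(unique)
            self.spam_messages += 1
        else:
            self.ham_counts.update(unique)
            self.ham_messages += 1

    def spam_probability(self, token: str) -> float:
        """Estimate P(spam | token) with add-one smoothing and equal priors."""
        token = token.lower()
        p_token_spam = (self.spam_counts[token] + 1) / (self.spam_messages + 2)
        p_token_ham = (self.ham_counts[token] + 1) / (self.ham_messages + 2)
        return p_token_spam / (p_token_spam + p_token_ham)


stats = TokenStats()
stats.train(["cheap", "viagra"], is_spam=True)
stats.train(["thanks", "andy"], is_spam=False)
print(stats.spam_probability("viagra") > stats.spam_probability("thanks"))
```

Every newly labeled message shifts these estimates, which is how the filter adapts over time. In this sketch, a token that appears in neither class scores exactly 0.5 when both classes hold the same number of training messages, so the filter has no opinion about it until it has been seen.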

If a clear classification of random data is made, evasion becomes more work. Simply adding a few sentences of random words at the base of an e-mail will no longer increase the chances of a successful delivery. The goal of beating a noise filter is to make the data seem legitimate and to increase the size of the message as much as possible (so that even after noise filtering the message will still have a decent amount of body).

Noise filters have one major flaw, however. Like any spam filter, they have to be careful to only filter out spam and to allow legitimate messages to pass through. If a friend writes me a legitimate e-mail telling me he is having fun gambling at a new online casino and suggests I should check it out by providing me with a link in the message body, I hope that my spam filter will not block this message. Sure, the e-mail mentioned an online casino and provided a Hypertext Transfer Protocol (HTTP) link, but the e-mail is legitimate. With this in mind, the authors of noise filters acknowledge that Bayesian noise reduction works best against obviously random noise. The filters catch strings of random numbers or letters and any content that is obviously not English or clearly breaks language rules such as the string:

kazivali skogul zz02 bekka

This is obviously a string of random data. “Kazivali” might be considered legitimate, but since it is used in the same sentence as “zz02” and “bekka,” it is declared random based on the validity of the neighboring words.
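The idea of judging a word by the validity of its neighbors can be sketched as follows. The `KNOWN_WORDS` vocabulary, the function names, and the 50 percent threshold are assumptions for the example; "kazivali" is deliberately placed in the vocabulary to show that one valid-looking word does not rescue the sentence.

```python
import re

# Hypothetical vocabulary of words seen in past legitimate mail.
KNOWN_WORDS = {"hey", "met", "andy", "weekend", "great", "guy", "kazivali"}


def looks_valid(token: str) -> bool:
    """Treat a token as valid if it is purely alphabetic and in the vocabulary."""
    return token.isalpha() and token.lower() in KNOWN_WORDS


def is_random_sentence(sentence: str, min_valid_ratio: float = 0.5) -> bool:
    """Declare the whole sentence noise when too few of its tokens are valid."""
    tokens = re.findall(r"\w+", sentence)
    if not tokens:
        return False
    valid = sum(looks_valid(token) for token in tokens)
    return valid / len(tokens) < min_valid_ratio


print(is_random_sentence("kazivali skogul zz02 bekka"))
```

Here only one of the four tokens is valid, so the sentence as a whole, "kazivali" included, is treated as noise.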

Random numbers are also easily detected. If a token is a number with more than five digits, or a word that contains more than five digits, chances are it is junk and carries no focal point in the e-mail. Detection becomes harder, however, when large amounts of legitimate text are included. Noise reduction can be highly efficient, but only against common spam, that is, spam that uses visibly random data. Spam filters are not designed to catch creative spammers; they are designed to catch the millions of people that send highly predictable spam. If your spam is “weird” or “unique,” you are likely among the few that can successfully bypass a Bayesian noise filter; however, your task may be significantly harder than before.
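The digit rule of thumb is simple enough to sketch directly. The function names below are hypothetical, and the rule is read here as "any token containing more than five digits," which covers both long numbers and words padded with digits.

```python
def is_numeric_junk(token: str, max_digits: int = 5) -> bool:
    """Flag a token that contains more than max_digits digit characters."""
    digit_count = sum(ch.isdigit() for ch in token)
    return digit_count > max_digits


def drop_numeric_junk(tokens: list[str]) -> list[str]:
    """Keep only the tokens that the digit rule does not flag."""
    return [token for token in tokens if not is_numeric_junk(token)]


print(drop_numeric_junk(["made", "$200", "8472619", "x93k2201z7", "2004"]))
```

Short, ordinary numbers such as a year or a dollar amount pass through, while the long digit strings typical of random padding are removed.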

Example 4 shows a message containing tricky random data that has a good chance of getting a large percentage of data through the noise filter to the content filter:

Jack, I am really hooked on this new casino, its really fun, dad thinks I should stop but I made like $200 last nite! The address is www.onlinecasino.com, you should try it.

Obviously, this was overlooked in whatever installation you were looking at. In fact, it looks like your administrator removed the default horde password and replaced it with nothing...even worse than using the default password.

At 10:17 AM 9/3/2004, you wrote:
>The thread says they only tried to
>
>/cfg/slb/real #/dis
>not
>/oper/slb/dis #..
>
>Two completely different ways to disable a real server. Only *trying* to
>offer some help.
>
>-----Original Message-----
>From: Brent Van Dussen [mailto:vandusb@attens.com]
>Sent: September 3, 2004 12:11 AM
>To: lb-l@vegan.net
>Subject: Re: FW: [load balancing] Alteon Backup/Overflow configuration
>question.
>
>
>Hmmmm, nope, that's exactly what *didn't* work. Do you have any other suggestions?
>My boss is going to kill me if I cant get this damm thing working!
>
>-Brent

This message may look extreme, but this is a great way to evade a noise filter.

Although sections of random text from stories look like real text and use legitimate English language techniques, they often do not look like real e-mail. Stories and printed text do not follow the same informal language rules that e-mails usually follow. The personal and often slang-filled text that appears only in personal e-mails is hard to replicate, and noise filters use this fact to detect spam. The most successful methods utilize sections of existing e-mails to build a longer, more legitimate-looking spam e-mail that can be passed off as a valid “reply” message. Even when filters parse out the noise data, a large amount of body is left, enough to carry through to the underlying content filter and increase the message’s chance of successful delivery.

This can be simple to implement. Instead of using a random line from a book or story, use lines from existing e-mails that were sent to you. Don’t include any personal information; the preceding example uses e-mails sent to mailing lists, so the content is generic and varied while still lacking any personal information about the sender. You can further obfuscate this e-mail by changing the subject line to that of a real reply. Find a subject that begins with “RE:” in your inbox and use a different subject for each spam. A legitimate-looking subject can go a long way toward defeating a content-based filter. Remember, a noise filter is only looking for “known” noise; it also deals with large amounts of legitimate e-mail and will have learned the language that is used in these “acceptable” e-mails. All you have to do is send e-mails that contain enough legitimate body that the bulk of the message is increased.

Suppose you compose a message that emulates a reply to another message, as seen in the previous example. Through the eyes of a spam filter, it looks like Brent Van Dussen sent an e-mail to someone else, who then replied, telling him about an online casino Web site that they found addictive. For this message to be successful, the spam needs to be worded differently from typical spam without drawing obvious attention to itself. Using a phrase such as “BUY MY VIAGRA WWW.DRUGSAREGOOD.COM” will get you nowhere. No matter how much message body you paste around the sentence, it is simply too obvious for any filter to let through.

Tricks Of The Trade…Selling Your Product

Avoid phrases such as “Click here” and “Unsubscribe here.” They will not be seen as noise and will be left in the e-mail for any underlying filters to detect.

A handy trick is to use possible noise words to sell your product. Hide your meaning inside words that noise filters will strip out before the real content filter gets the message, such as:

This message is indirectly selling Viagra. A large part of the message meaning has been lost, but it still carries some direction. Although you will suffer from a reduced impact on the client, you will have a higher delivery rate through filters than if you were to use a phrase like, “Get hard, use Viagra.”

This message also uses two neutral words, “pointing” and “Guess,” while the only spam-related word used was “tablets.” “Tablets” is not a commonly used word in spam as the majority of spammers use the word “pills” or “medication.” Being creative and original can easily beat any Bayesian-based spam filter, because the basis for Bayesian technology is filtering based on past statistics. If no past spam contained the word “tablets,” the filter will be confused, especially if legitimate e-mails also contain the word.

Noise filters are changing the shape of spam. If you want to pursue a successful noise filter evasion, your message bodies will need to be highly creative. Furthermore, this creativity can never stop; the second you begin reusing the same language and the same structure, your random data will be filtered and become pointless.

The days of using random words or numbers are long gone. In a few years, Bayesian noise filters will be commonplace. If spam content fails to evolve equally, it will stand no hope against a well-trained filter. Sadly, this is the future of spam. It will become more obscure and cryptic in its language, as spam filters attempt to understand more of the true nature of the spam. Spam’s only hope is to obfuscate the message within legitimate language, hiding the true nature from the filters.
