3.7. Message Content

From a forensics perspective, the content of a message is actually the least interesting part. If the message carries a virus or spyware, then the payload will be contained in the attachment. If it is a phishing attempt, then the web site it links to is where your interest will lie.

The experts in spam analysis and filtering can do a far better job than I at describing the techniques they use to classify messages and decide if they represent spam or not. This is a fascinating area that combines advanced computer science, with its statistical and pattern recognition algorithms, and practical software engineering that builds and deploys tools in an ongoing battle with the spammers.

There are three main approaches to dealing with spam. Here are resources for each of them that you might find useful.

Rule-based filtering looks for specific strings and signatures within a message and assigns a score based on the matches it finds. SpamAssassin is a leading open source tool that uses this approach (http://spamassassin.apache.org/).

Statistical filtering, using Bayesian analysis, looks at things like word frequencies in sets of messages that have been manually classified as spam or not, typically by the end user. As such, it reflects that user's personal interests and can adapt to changes in the types of email that an individual receives. This is the approach taken in the Thunderbird email client, among others. A good introduction to Bayesian filtering is this paper by Paul Graham: http://www.paulgraham.com/spam.html.

If spam can be traced back to a specific network address, then that address can be added to a Block List, or blacklist, of known spammers. A mail server can look up the address of each MTA that wants to transfer a message and automatically reject those that are on the list. The Spamhaus Block List is a leading example of this approach, and their web site is an excellent resource: http://www.spamhaus.org. The problem facing block lists is that they can only react to addresses that have been used repeatedly to send spam, so the approach will become less effective in the face of the proxy servers that were created by the Sobig worms. As I show in Chapter 11, spammers are able to use large networks of hijacked computers such that no one address is used enough to be included in the block lists.
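To make the rule-based idea concrete, here is a minimal sketch in Python. It is not how SpamAssassin is implemented; the patterns, the scores, and the threshold are all invented for illustration, and a real rule set is far larger with weights tuned against big mail corpora.

```python
import re

# Hypothetical rules: each is a (pattern, score) pair. Both the patterns
# and the scores are made up for this illustration.
RULES = [
    (re.compile(r"\bviagra\b", re.IGNORECASE), 2.5),
    (re.compile(r"\bact now\b", re.IGNORECASE), 1.0),
    # A link that uses a raw IP address instead of a host name.
    (re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}", re.IGNORECASE), 2.0),
]

# Assumed cutoff: messages scoring at or above this are treated as spam.
THRESHOLD = 3.0


def score_message(text):
    """Return the sum of the scores of every rule that matches the text."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def is_spam(text, threshold=THRESHOLD):
    """Classify a message by comparing its total score to the threshold."""
    return score_message(text) >= threshold
```

Each rule contributes its score at most once per message, so a single weak match is not enough to cross the threshold, but several together can be.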
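The statistical approach can be sketched in a few lines of Python. This is a deliberately simplified version of the kind of calculation Graham's paper describes, not his exact algorithm: the whitespace tokenizer, the add-one smoothing, and the assumption that spam and legitimate mail are equally likely beforehand are all choices made here for brevity.

```python
from collections import Counter
from math import prod


def tokenize(text):
    """Split a message into lowercase words (a crude, assumed tokenizer)."""
    return text.lower().split()


def train(spam_messages, ham_messages):
    """Count, for each word, how many messages of each class contain it."""
    spam_counts, ham_counts = Counter(), Counter()
    for message in spam_messages:
        spam_counts.update(set(tokenize(message)))
    for message in ham_messages:
        ham_counts.update(set(tokenize(message)))
    return spam_counts, ham_counts, len(spam_messages), len(ham_messages)


def word_spam_probability(word, model):
    """Estimate the probability that a message containing this word is spam."""
    spam_counts, ham_counts, n_spam, n_ham = model
    # Add-one smoothing keeps unseen words from producing 0 or 1.
    p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


def spam_probability(text, model):
    """Combine the per-word probabilities, treating words as independent."""
    probs = [word_spam_probability(w, model) for w in set(tokenize(text))]
    spam_like = prod(probs)
    ham_like = prod(1 - p for p in probs)
    return spam_like / (spam_like + ham_like)


# Usage: the training messages stand in for mail a user has classified by hand.
model = train(
    ["cheap pills online", "cheap watches"],
    ["meeting at noon", "lunch at noon"],
)
print(spam_probability("cheap pills", model))
print(spam_probability("lunch meeting", model))
```

Because the counts come from one user's own mail, the same word can push the result in different directions for different people, which is the adaptability described above. A production filter would work with logarithms to avoid numeric underflow on long messages.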
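Block lists are commonly published through DNS: the mail server reverses the octets of the connecting IPv4 address, appends the name of the list's DNS zone, and looks up the result. The Python sketch below illustrates that convention. The zone name dnsbl.example.org is a placeholder, not a real list, and the function names are invented for this example; consult a list's own documentation for its zone and usage terms.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to ask a block list about an IPv4 address.

    For example, 192.0.2.99 checked against the zone dnsbl.example.org
    becomes 99.2.0.192.dnsbl.example.org.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError if not valid IPv4
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the query name resolves, meaning the address is listed.

    Caveat: a resolver failure also raises socket.gaierror, so False can
    mean "lookup failed" as well as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


# Usage: constructing the name needs no network access.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

The lookup answers only for the one address that is connecting, which is why a spammer who spreads messages across many hijacked machines can stay below the level at which any single address gets listed.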

Believe it or not, not everyone receives spam. Should you be in that enviable position and want to see what you are missing out on, you can find an archive of the stuff at http://www.spamarchive.org/. This can also be a great resource for anyone wanting to test spam-filtering software.

I return to the subject of message content in Chapter 4, specifically discussing the many ways in which phishing attempts try to disguise the real URLs of their fake web sites. I will end this chapter with the speculation that some spam may not be what we think it is.