11.2. Case Study 2: Spam Networks

The aim of the second study was to see where some of the spam that I receive comes from. The consensus view is that most spam is being sent via computers infected with viruses that set up email relays without the owners' knowledge.

I wanted to collect the IP addresses of the machines that relayed the messages to my server and look for any correlations between those and the specific types of spam that they handled. I had no shortage of data. At the time of this analysis, I had 29,041 messages in my Junk folder, which originated from 22,429 different IP addresses. The vast majority of these (92% of the total) were the source of only a single message. Figure 11-2 shows how few addresses were involved in sending multiple emails. Note that the Y-axis is logarithmic.
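The tally behind a plot of this kind is simple to compute. The following Python sketch is an illustration only, not the script used for this study; the function names and the sample addresses are assumptions. Given one relay address per message, it counts how many addresses sent one message, how many sent two, and so on, and reports the fraction of addresses that appear only once.

```python
from collections import Counter


def messages_per_address(addresses):
    """Map 'messages sent' -> 'number of addresses that sent that many'."""
    per_address = Counter(addresses)            # address -> message count
    return dict(Counter(per_address.values()))  # message count -> address count


def singleton_fraction(addresses):
    """Fraction of distinct addresses that sent exactly one message."""
    histogram = messages_per_address(addresses)
    total = sum(histogram.values())
    return histogram.get(1, 0) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical input: one relay address per message.
    sample = ["192.0.2.1", "192.0.2.2", "192.0.2.2", "192.0.2.3",
              "192.0.2.4", "192.0.2.4", "192.0.2.4"]
    print(messages_per_address(sample))
    print(singleton_fraction(sample))
```

Plotting the first result with a logarithmic Y-axis would give a chart of the same general form as the figure.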

Several alternative conclusions can be drawn from this distribution. The spam domain blacklist from Spamhaus (http://www.spamhaus.org/sbl/) that I use to reject known spam sources could be so efficient that most source machines can only get one message through to me before being blocked. I doubt that this is the case.

Figure 11-2. Number of messages originating from each IP address


It could be that the owners of these machines, or their ISPs, realize that they are acting as mail relays as soon as they send out the first batch of spam and either remove the relay or shut down that machine. This is possible, but given that my sample contains more than 20,000 computers, I think it is unlikely.

Perhaps the most likely scenario is that so many computers have been set up with hidden email relay software that the spammers can afford to treat them as disposable. Each is used to relay a single batch of spam and then never used again. This renders the spam blacklists useless, as they never know where the next batch will originate. That these 20,000 distinct computers could be the tip of the iceberg is a chilling prospect, indeed.

This observation immediately throws up a host of questions. Is there a single pool of open email relays that any spammer can access? Or do different groups "own" distinct pools of machines? Do compromised machines tend to occur in specific countries? Viruses tend to have most success infecting poorly protected home computers on broadband connections. Is this reflected in the pool of email relays?

11.2.1. Subsets of Spam

Rather than address these questions using my entire collection of spam, I decided to look at three distinct subsets. Each of these had a distinct signature that allowed me to identify and extract all instances from each spam campaign. All had a large but very manageable number of instances, allowing me to study each group in a reasonable amount of detail.

Subset A consisted of messages advertising pornographic web sites that contained certain signature patterns. Most notable of these was that the name of the sender always took the form of "First Name, Middle Initial, Last Name"; for example, John Q. Public. This common pattern was surprisingly diagnostic for this subset. When combined with a second pattern common to all Message-ID headers for this subset, the signatures were completely specific.

Subset B contained messages advertising discount copies of Microsoft Windows XP and Office. These appeared to contain HTML-formatted text, but actually contained a GIF image of such text. This was always identical, and a string of characters taken from the Base64-encoded image served as a totally specific signature.

Subset C contained messages advertising Viagra. These also contained GIF images and appeared very similar in structure to Subset B, suggesting they came from the same source. Examples from both subsets would often arrive together.

I wrote simple Perl scripts that could extract all examples of these sets from my Junk mail folder. They returned 889 examples of subset A, 180 of subset B, and 119 of subset C.

The next step was to extract the IP address of the final mail relay from each message. Remember that it is trivial to forge Received header lines with the exception of the final step that transfers the message to your server. This is a little more complicated in practice as some of my email arrives via a relay at my ISP. In those cases, I am interested in the address of the relay that transferred the message to them. The Perl script shown in Example 11-2 will extract these addresses from a mail file in the standard MBOX format used to archive messages. You will need to modify the pattern used to detect a relayed message from your ISP's mail server.

Example 11-2. extract_ipaddr.pl
     #!/usr/bin/perl -w
     # Message separator: From - Tue Apr 06 10:20:25 2004

     if(@ARGV == 0) {
         $ARGV[0] = '-';
     } elsif(@ARGV > 1) {
         die "Usage: $0 <mail file>\n";
     }

     my $flag = 0;
     my $separator = 0;

     open INPUT, "< $ARGV[0]" or die "$0: Unable to open file $ARGV[0]\n";
     while(<INPUT>) {
         # The following regular expression defines the message separator
         if(/^From\s+.*200\d$/ and $separator == 1) {
             $separator = 0;
             $flag = 0;
         } elsif(/^\s*$/) {
             $separator = 1;
         } else {
             if(/^Received\:.*seanet/) {
                 # skip any headers from seanet (my ISP)
             } elsif($flag == 0 and /^Received\:\s*.*?\[([\d\.]+)\]/) {
                 print "$1\n";
                 $flag++;
             }
             $separator = 0;
         }
     }
     close INPUT;

Throughout this case study, I created Perl scripts based on this simple template, to select specific types of message from a mail file and to extract specific pieces of information from each of these. Having a collection of these on hand is extremely useful in any study of this kind.

The script in Example 11-2 returns one address for each message in the input file. Piping that output into the Unix sort command with the -u option produces a list of unique addresses:

     % extract_ipaddr.pl subsetA_msgs.dat | sort -u > subsetA_ipaddr.dat 

Running this on the three subsets of messages produced 873 addresses that sent out the subset A messages, 173 that sent out subset B, and 114 that sent out subset C. This means that only a few machines sent out more than one example from each subset.

I was then able to compare those lists of addresses to see if any relays handled messages from more than one subset. This was easy enough to do using the Unix sort and wc commands. The specific command wc -l returns the number of lines in a file. The following set of commands shows how these can be used to determine the overlap between two sets of addresses:

     % wc -l subsetB_ipaddr.dat
     173
     % wc -l subsetC_ipaddr.dat
     114
     % cat subsetB_ipaddr.dat > tmp
     % cat subsetC_ipaddr.dat >> tmp
     % sort -u tmp | wc -l
     212
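The count from sort -u is the size of the combined list, so the overlap follows indirectly: the number of shared addresses is the sum of the two list sizes minus the size of the combined list. The same comparison can be made directly with sets. This Python sketch is one possible way to do it, not the method used in this study; the function names and sample addresses are assumptions.

```python
def load_addresses(path):
    """Read one IP address per line into a set, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def overlap_summary(first, second):
    """Count the shared, first-only, second-only, and combined addresses."""
    first, second = set(first), set(second)
    return {
        "both": len(first & second),
        "first_only": len(first - second),
        "second_only": len(second - first),
        "union": len(first | second),
    }


if __name__ == "__main__":
    # Hypothetical addresses standing in for two lists of unique addresses.
    b = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
    c = {"192.0.2.3", "192.0.2.4"}
    summary = overlap_summary(b, c)
    print(summary)
    # Inclusion-exclusion: |B| + |C| - |B union C| equals the shared count.
    print(len(b) + len(c) - summary["union"])
```

With real data, the two sets would come from calls such as load_addresses("subsetB_ipaddr.dat").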

None of the IP addresses used in subset A were used in either B or C. Subsets B and C appeared to have a similar origin, so it was not surprising to see that some of these messages came from the same relays.

A more interesting comparison for subset A was to look for overlap with the set of more than 22,000 unique IP addresses from my entire spam collection. Remarkably, the addresses used to transfer subset A messages were not used anywhere else.

If the pool of email relays used by spammers were accessible to anyone, you would expect to see the same addresses being used for multiple spam campaigns. That this was not the case could be due to the sample size, relative to the size of the overall pool, or to various other issues. To my mind, a reasonable conclusion is that the source for this spam has sole access to a set of at least 873 relays. The fact that, within this spam campaign, the reuse of these addresses is minimal suggests that the size of the pool is considerably larger than this.

11.2.2. Digging Deeper

Looking more closely at the overlap between subset B and C addresses sheds more light on this. A total of 65 mail relays transferred messages in both campaigns, 108 sent only those from subset B, and 49 sent only those from subset C. Clearly these two spam campaigns are closely linked. Not only do the messages appear very similar in structure, but a large fraction are being sent from a common pool of relays.

That raises the question of whether these relays are being used to send other types of spam. To answer this, I wrote another Perl script, shown as Example 11-3, that extracts all messages from a mail file that are relayed from any IP address contained in a file.

Example 11-3. extract_match_ipaddr.pl
     #!/usr/bin/perl -w

     if(@ARGV == 0 or @ARGV > 2) {
         die "Usage: $0 <ipaddr file> <mail file>\n";
     } elsif(@ARGV == 1) {
         $ARGV[1] = '-';
     }

     my %ipaddrs = (  );
     loadAddresses($ARGV[0], \%ipaddrs);

     my $flag = 0;
     my $separator = 0;
     my $text = '';

     open INPUT, "< $ARGV[1]" or die "$0: Unable to open file $ARGV[1]\n";
     while(<INPUT>) {
         if(/^From\s+.*200\d$/ and $separator == 1) {
             if($flag > 0) {
                 print $text;
                 $flag = 0;
             }
             $separator = 0;
             $text = '';
         } elsif(/^\s*$/) {
             $separator = 1;
         } else {
             $separator = 0;
             if(/^Received\:.*seanet/) {
                 # skip Received: headers from my ISP
             } elsif(/^Received\:\s*.*?\[([\d\.]+)\]/ and $flag==0) {
                 if(exists $ipaddrs{$1}) {
                     $flag++;
                 }
             }
         }
         $text .= $_;
     }
     if($flag == 1) {
         print $text;
     }
     close INPUT;

     sub loadAddresses {
         my $filename = shift;
         my $ipaddrs = shift;
         open INPUT, "< $filename" or die "$0: Unable to open file\n";
         while(<INPUT>) {
             if(/^(\d+\.\d+\.\d+\.\d+)/) {
                 $ipaddrs->{$1} = 1;
             }
         }
         close INPUT;
     }

Running this on my Junk mail file with the list of unique IP addresses from the combined B and C subsets produced a total of 536 messages, including the 299 in the original subsets:

     % extract_match_ipaddr.pl subsetBC_ipaddr.dat junkmail.dat > subsetBC_allmsgs.dat
     % grep 'From -' subsetBC_allmsgs.dat | wc -l
     536

I then looked for the presence of a GIF file in each of these messages using grep and the common string R0lGODl that occurs at the beginning of the encoded form of these GIF files. That showed that 517 of the original 536 contained encoded images and that these fell into 8 groups, based on the first line of that content:

     % grep R0lGODl sameip_as_subsetBC.dat | wc -l
     519
     % grep R0lGODl sameip_as_subsetBC.dat | sort -u
     R0lGODlh4wBRAJEAAMwAAAAAzAAAAP///yH5BAAAAAAALAAAAADjAFEAAAL/1D6
     R0lGODlhBwFLAJEAAP///wAAAMwAAAAAzCH5BAAAAAAALAAAAAAHAUsAAAL/BIJ
     R0lGODlhMQE9AJEAAP8AAAAAAP///wAAACH5BAAAAAAALAAAAAAxAT0AAAL/lI+
     R0lGODlhTAEdAJEAAAAAzMwAAAAAAP///yH5BAAAAAAALAAAAABMAR0AAAL/nI+
     R0lGODlhjABLAJEAAAAI/wAAAP8AAP///yH5BAAAAAAALAAAAACMAEsAAAL/nI+
     R0lGODlhjgBNAJEAAAAAAP8AAP///wAAACH5BAAAAAAALAAAAACOAE0AAAL/hI+
     R0lGODlhjgBNAJEAAP8AAAAAABEA/////yH5BAAAAAAALAAAAACOAE0AAAL/jI+
     R0lGODlhygAbAJEAABEFlgAAAP///wAAACH5BAAAAAAALAAAAADKABsAAAL/jI4
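Piping grep through sort -u lists the distinct signature lines but does not say how many times each occurs. A tally per signature could be produced along the following lines. This is a Python sketch rather than one of the Perl scripts used in this study; the function name and sample lines are assumptions. Note that it counts matching lines, so a message containing two images would be counted twice.

```python
from collections import Counter

# "R0lGODlh" is the Base64 encoding of the GIF header bytes "GIF89a";
# the shorter prefix below is the string searched for in the text.
GIF_PREFIX = "R0lGODl"


def signature_counts(lines, prefix=GIF_PREFIX):
    """Count how often each distinct line starting with prefix occurs."""
    return Counter(line.strip() for line in lines if line.startswith(prefix))


if __name__ == "__main__":
    # Hypothetical, shortened lines standing in for a mail file.
    sample = [
        "Content-Type: image/gif\n",
        "R0lGODlhAAAA\n",
        "R0lGODlhBBBB\n",
        "R0lGODlhAAAA\n",
    ]
    for signature, count in signature_counts(sample).most_common():
        print(count, signature)
```

Passing an open file handle for the mail file in place of the sample list would tally every signature in the file.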

Further slicing and dicing of the messages showed that these groups were all advertising prescription drugs or software, using the same form of message as the original two subsets.

I repeated the cycle of using these signature strings from the encoded images to try to pull out other examples, with the goal of finding additional IP addresses that I could add to the pool. Surprisingly, there were no other examples in my Junk mail folder. So just like subset A, this more diverse set of messages was relayed from a defined set of IP addresses. These observations show that the global set of email relays is partitioned into defined sets that appear to be under the control of distinct groups.

Even within the subset B and C pool, there is a clear partitioning of addresses into smaller subsets. To uncover that, I used the eight GIF file signatures to select out those sets of messages and then extracted the unique IP addresses within each of these. The numbers of messages and addresses in each set are shown in Table 11-1.

Table 11-1. Subsets of messages within the B+C set

Subset of subsets B+C    Number of messages    Number of addresses
1 (Subset B)             180                   173
2 (Subset C)             119                   114
3                        75                    75
4                        46                    46
5                        38                    38
6                        38                    38
7                        12                    12
8                        9                     9


I then compared those lists between all pairs of the eight sets using a Perl script that automated the cat, sort, and wc steps that I used earlier. The results are shown as a Venn diagram in Figure 11-3.

The relative sizes of the circles are only approximate, and the true structure is not well represented by the traditional form of Venn diagram. In fact, subsets 4 and 6 overlap subset 3 exactly, with no overlap between the two. But the figure does illustrate how the entire pool of mail relays has been partitioned in a very clear manner. In no way does this represent the random selection of relays from a pool where all are considered equal. That scenario would yield a diagram with many more intersections between circles.
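A comparison across all pairs of sets can also report how each pair relates (disjoint, nested, or partially overlapping), which is the information a Venn diagram tries to convey. The following Python sketch shows one way this might be done; it is not the Perl script used for this study, and the function names and sample sets are assumptions.

```python
from itertools import combinations


def relationship(a, b):
    """Describe how two address sets relate to each other."""
    a, b = set(a), set(b)
    if not (a & b):
        return "disjoint"
    if a == b:
        return "identical"
    if a < b:
        return "first within second"
    if b < a:
        return "second within first"
    return "partial overlap"


def pairwise_report(named_sets):
    """Yield (name1, name2, shared count, relationship) for every pair of sets."""
    ordered = sorted(named_sets.items(), key=lambda item: item[0])
    for (name1, set1), (name2, set2) in combinations(ordered, 2):
        yield name1, name2, len(set(set1) & set(set2)), relationship(set1, set2)


if __name__ == "__main__":
    # Hypothetical address sets keyed by subset label.
    sets = {
        "3": {"192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"},
        "4": {"192.0.2.1", "192.0.2.2"},
        "6": {"192.0.2.3", "192.0.2.4"},
    }
    for row in pairwise_report(sets):
        print(*row)
```

Eight sets give 28 pairs, which is few enough to read as a table when a diagram cannot show the structure faithfully.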

Figure 11-3. Venn diagram showing overlap between subsets

Perhaps this partitioning reflects something special about the machines that make up subset 3. Running dig -x on these lists turned up hostnames for more than half the IP addresses in the various subsets. They are dispersed around the world, based on their IP address range and the national affiliation of their hostnames. Many of the hostnames contain strings like dsl, ppp, and cable, which suggest they are residential machines that have broadband connections to their ISP. The following examples are typical. Because these systems have been hijacked, I have replaced some of the identifying characters with hash marks.

     customer-209-99-###-###.millicom.com.ar
     dl-lns1-tic-C8B####.dynamic.dialterra.com.br
     modemcable077.56-###-###.mc.videotron.ca
     200-85-###-###.bk4-dsl.surnet.cl
     pop8-###.catv.wtnet.de
     dyn-83-154-###-###.ppp.tiscali.fr
     nilus-####.adsl.datanet.hu
     DSL217-132-###-###.bb.netvision.net.il
     host82-###.pool####.interbusiness.it
     200-77-###-###.ctetij.cablered.com.mx
     ppp07-90######-###.pt.lu
     host-62-141-###-###.tomaszow.mm.pl
     ev-217-129-###-###.netvisao.pt
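A rough summary of a long list of hostnames like this could be automated. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the hint strings are just the three mentioned above, and the final label of a hostname is only a crude guide to location, since names ending in .com or .net say nothing about the country.

```python
from collections import Counter

# Substrings singled out in the text as hints of a residential connection.
ACCESS_HINTS = ("dsl", "ppp", "cable")


def access_hints(hostname):
    """Return the hint substrings found in a hostname, ignoring case."""
    lowered = hostname.lower()
    return [hint for hint in ACCESS_HINTS if hint in lowered]


def top_level_domain(hostname):
    """Return the final dot-separated label, for example 'ca'."""
    # dig prints fully qualified names with a trailing dot, so strip it first.
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()


def summarize(hostnames):
    """Count hostnames per top-level domain and per access hint."""
    hostnames = list(hostnames)
    by_tld = Counter(top_level_domain(name) for name in hostnames)
    by_hint = Counter(hint for name in hostnames for hint in access_hints(name))
    return by_tld, by_hint


if __name__ == "__main__":
    names = [
        "modemcable077.56-###-###.mc.videotron.ca",
        "DSL217-132-###-###.bb.netvision.net.il",
        "dyn-83-154-###-###.ppp.tiscali.fr",
        "pop8-###.catv.wtnet.de",
    ]
    by_tld, by_hint = summarize(names)
    print(dict(by_tld))
    print(dict(by_hint))
```

Comparing these counts between subsets would be one way to test for correlations that are not obvious from a cursory reading of the names.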

From a cursory examination of the names, I can see no obvious correlations between the names in a single subset compared to those in other subsets. But some non-random force has been responsible for this very distinct partitioning.

Perhaps the spam operation uses different addresses over time, expecting that some will be added to spam blacklists. You could test this by splitting the datasets into several time intervals and comparing the partitioning between them. If that were the case, you might be able to calculate the distribution of lifespans across the relays and from that draw some conclusions about the effectiveness of anti-spam measures.
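A first step toward such a test would be to record when each relay was first and last seen. The Python sketch below shows one way to do that, with two assumptions that would need checking against real data: that the date on the MBOX separator line (the "From - Tue Apr 06 10:20:25 2004" form shown in the comment of Example 11-2) is a usable timestamp for each message, and that the hypothetical function names suit your own scripts.

```python
from datetime import datetime


def parse_separator(line):
    """Parse the date from a separator such as 'From - Tue Apr 06 10:20:25 2004'."""
    date_text = line.strip().split(None, 2)[2]
    return datetime.strptime(date_text, "%a %b %d %H:%M:%S %Y")


def relay_lifespans(sightings):
    """Map each address to (first seen, last seen, lifespan in whole days).

    sightings is an iterable of (address, datetime) pairs, one per message.
    """
    first, last = {}, {}
    for address, seen in sightings:
        if address not in first or seen < first[address]:
            first[address] = seen
        if address not in last or seen > last[address]:
            last[address] = seen
    return {
        address: (first[address], last[address],
                  (last[address] - first[address]).days)
        for address in first
    }


if __name__ == "__main__":
    # Hypothetical sightings: (relay address, message timestamp).
    sample = [
        ("192.0.2.1", parse_separator("From - Tue Apr 06 10:20:25 2004")),
        ("192.0.2.1", parse_separator("From - Fri Apr 16 10:20:25 2004")),
        ("192.0.2.2", parse_separator("From - Sat May 01 08:00:00 2004")),
    ]
    for address, (start, end, days) in sorted(relay_lifespans(sample).items()):
        print(address, start.date(), end.date(), days)
```

A histogram of the lifespans in days would then give the distribution described above.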

I cross-checked some of the IP addresses against the Spamhaus SBL blacklist, but none of them were found. Spamhaus will only add an address to their system if they have evidence of multiple offenses. It seems likely that the careful and limited reuse of addresses within these subsets is aimed at keeping them out of blacklists so they can be used again at some point in the future.

This is fertile ground for anyone interested in the statistics of large datasets that are subject to a variety of forces. These include the impact of computer viruses that we believe have created most of these mail relays. The dynamics of the spam marketplace and the impact of legislation are influencing the volume, type, and origin of spam. Spam domain blacklists are blocking some spam from reaching your Inbox, and spam filters applied at ISPs, company mail servers, and within your email client, will identify and partition messages. Spam is evolving rapidly in order to get past these filters.

One of the biggest questions in the field of virus infections and spam relays is the size of the total pool of infected machines. You will see plenty of estimates quoted by anti-virus and anti-spam software makers but no one really has the answer. A similar question applies to subsets of messages. The pornographic spam that makes up subset A continues to flow into my Inbox with new IP addresses every time. At some point, you would expect to see some of the old addresses reappearing.

There is an elegant statistical method that could be used in situations like this. It has been used to estimate the number of biological species in a complex ecosystem such as a rainforest, or a bacterial community such as the human gut. When you sample the population, you will quickly detect the major species but you have to take many more samples before you see any of the rare ones. Even if you have not seen all the species, you can use the distribution in the frequency with which you observe species to make an estimate of how many other species are out there. One of the first approaches to this problem came from a paper by Efron and Thisted (Biometrika, 1976, 63:435-447), which asked the question "How many words did Shakespeare know?" They looked at the frequency with which different words appeared in the plays and sonnets. From the tail of that distribution, they were able to estimate how many other words the playwright knew but just never actually used. They estimated that the playwright had a total vocabulary of 66,534 words. A similar approach might be used to determine how large the email relay pools are.
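To give a flavor of this family of methods, the following Python sketch uses a different and much simpler estimator from the ecology literature, known as Chao1. It is not the method of the Efron and Thisted paper. Chao1 estimates the total number of species as the number observed plus f1 squared divided by twice f2, where f1 and f2 are the numbers of species seen exactly once and exactly twice. Here each relay address plays the role of a species. The function name and sample data are assumptions, and the estimate assumes random sampling from a fixed pool, which may well not hold for spam relays.

```python
from collections import Counter


def chao1_estimate(addresses):
    """Estimate the total pool size from one observed address per message.

    Uses Chao1: observed + f1^2 / (2 * f2), where f1 and f2 are the numbers
    of addresses seen exactly once and exactly twice. If no address has been
    seen twice, the bias-corrected form f1 * (f1 - 1) / (2 * (f2 + 1)) is
    used for the unseen part instead, which avoids dividing by zero.
    """
    per_address = Counter(addresses)
    observed = len(per_address)
    frequency = Counter(per_address.values())
    f1, f2 = frequency.get(1, 0), frequency.get(2, 0)
    if f2 > 0:
        unseen = f1 * f1 / (2 * f2)
    else:
        unseen = f1 * (f1 - 1) / 2
    return observed + unseen


if __name__ == "__main__":
    # Hypothetical sample: four addresses seen once, two seen twice,
    # and two seen three times (letters stand in for addresses).
    sample = list("abcd") + list("eeff") + list("ggghhh")
    print(chao1_estimate(sample))
```

The more the sample is dominated by addresses seen only once, the larger the estimated unseen pool becomes, which matches the intuition behind the species analogy.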

Statisticians have been very active in the field of spam but most of that work has focused on using statistics to define patterns in messages that can be used to identify new spam. I think advanced statistics, combined with forensics, can play an important role in helping understand how the world of spam operates. Even this case study, small as it is, illustrates how you can use Internet forensics to identify and characterize distinct spamming operations. Having a good knowledge of statistics helps, but knowing some basic forensics and being creative in the way you apply those skills are actually more important.

To truly understand broad Internet phenomena, such as spam campaigns or computer virus infections, you would need a way to monitor traffic over the entire network. You might assume that this is only possible within the companies that manage the main Internet backbone links, or perhaps within a government organization such as the U.S. National Security Agency. But thanks to an ingenious approach, called a Network Telescope, some form of monitoring is available to computer security researchers. The idea behind the approach is to monitor network traffic that is sent to unused parts of the Internet. Someone trying to exploit an operating system vulnerability, for example, will often systematically scan across the entire range of IP addresses. Many of these are in blocks of numbers that are not currently in use. Those packets would otherwise be discarded due to the invalid addresses. Network telescopes capture this traffic and look for interesting packets. Because the target addresses are invalid, no legitimate traffic should be intercepted, so no personal email messages or other sensitive material should be captured.
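The core filtering step of such a system can be sketched in a few lines: keep only packets whose destination falls inside the monitored, unused address blocks, then count them by source. The Python below is a toy illustration and not a description of how any real telescope is built. The function names are assumptions, and the monitored blocks are reserved documentation ranges standing in for genuinely unused address space.

```python
import ipaddress

# Hypothetical monitored blocks. These are reserved documentation ranges,
# standing in for the unused address space that a real telescope would watch.
MONITORED_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_telescope_traffic(destination, blocks=MONITORED_BLOCKS):
    """Return True if the destination lies inside a monitored unused block."""
    address = ipaddress.ip_address(destination)
    return any(address in block for block in blocks)


def scanners(packets, blocks=MONITORED_BLOCKS):
    """Count packets per source among those aimed at monitored blocks.

    packets is an iterable of (source, destination) address strings.
    """
    counts = {}
    for source, destination in packets:
        if is_telescope_traffic(destination, blocks):
            counts[source] = counts.get(source, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical packet log: (source address, destination address).
    log = [
        ("203.0.113.5", "192.0.2.10"),
        ("203.0.113.5", "198.51.100.7"),
        ("203.0.113.9", "10.0.0.1"),
    ]
    print(scanners(log))
```

A source that touches many addresses in blocks where nothing is listening is a strong candidate for a scanner.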

Telescopes have been used to study various malicious phenomena such as worms or denial-of-service attacks. One detailed analysis that shows the power of this approach was performed by Abhishek Kumar, Vern Paxson, and Nicholas Weaver. It concerns the Witty worm that spread rapidly across the Internet in March 2004, and their reports are available at http://www.cc.gatech.edu/~akumar/witty.html. Projects such as this, and the work of the Honeynet Project that I describe in Chapter 5, show the direction in which advanced Internet forensics is headed.



Internet Forensics
ISBN: 059610006X
Year: 2003
Pages: 121
