Understanding the Beauty of the Beast

You may not realize it, but you've just created a monster. You have just instrumented your web architecture for performance assessment, debugging, monitoring, and auditingand it was really easy.

Now, you can take my word for it that this stuff is cool, but we will present a few examples of things we can now see that were painful, if not impossible, in a previous logging architecture (whatever it may have been).

The most obvious is that you now have one or more log hosts running spreadlogd journaling your logs to disk. This isn't very glamorous, so instead of staring at those for a long time (but by all means, do so if it makes you happy), we'll spend the next few sections looking at live streams.

Real-time Analysis

First, let's look at a stream going by. In previous examples we used the group logdomain2, so we should be able to connect to Spread using its cantankerous command-line tool spuser and see some logs. The following output presents a spuser real-time observation session:

# /opt/spread/bin/spuser -s 4803 Spread library version is 3.17.3 User: connected to 4803 with private group #user#admin-va-1 ========== User Menu: ----------         j <group> -- join a group         l <group> -- leave a group         s <group> -- send a message         b <group> -- send a burst of messages         r -- receive a message (stuck)         p -- poll for a message         e -- enable asynchonous read (default)         d -- disable asynchronous read         q -- quit User> j logdomain2 ============================ Received REGULAR membership for group lethargy with 2 members, where I am member 1:         #sld-05988#gollum         #user#admin-va-1 grp id is 182571332 1118158447 4 Due to the JOIN of #user#admin-va-1 User> ============================ received RELIABLE message from #ap26194#www-va-2, of type 1, (endian 0) to 1 groups (216 bytes): 85.192.40.13 - - [12/Aug/2005:15:21:30 -0400] "GET /~jesus/archives/29-isaexec-and-magical-builds.html HTTP/1.0" 200 12502 "http://order-carisoprodol.9k.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" User> ============================ received RELIABLE message from #ap26590#www-va-1, of type 1, (endian 0) to 1 groups (210 bytes): 85.192.40.13 - - [12/Aug/2005:15:21:36 -0400] "GET /~jesus/archives/29-isaexec-and-magical-builds.html HTTP/1.0" 200 12502 "http://order-cialis.9k.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" User> ============================ received RELIABLE message from #ap26196#www-va-3, of type 1, (endian 0) to 1 groups (202 bytes): 216.86.156.205 - - [12/Aug/2005:15:21:40 -0400] "GET /~jesus/archives/1970/01.html HTTP/1.1" 200 12493 "http://buy---adipex.at.tut.by" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.0" User> ============================ received RELIABLE message from #ap26192#www-va-2, of type 1, (endian 0) to 1 groups (209 bytes): 85.192.40.13 - - [12/Aug/2005:15:21:47 -0400] "GET /~jesus/archives/29-isaexec-and-magical-builds.html HTTP/1.0" 200 12502 "http://1st-levitra.9k.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" User> Bye.

The first message is one indicating a group membership change. The new users are #sld-05988#gollum and #user#admin-va-1. Well, we're logged in to admin-va-1, so we are the #user#admin-va-1, and on the machine named gollum we are running a spreadlogd daemon (gollum is running Linux):

# ssh gollum "ps auxwww | awk '/spreadlogd/{print "'$2" "$11'";}'" 5988 /usr/local/sbin/spreadlogd

Lo and behold! A spreadlogd process is running at process ID 5988, hence the private member name #sld-05988#gollum. I no longer feel alone.

Now, what are all those other lines? Well, aside from the Apache log line that we see, we have some other useful information:

RELIABLE is a word describing the semantics with which the message was published; mod_log_spread uses RELIABLE messaging by default.
from #ap26192#www-va-2 indicates the sender of the message, which is informative. It means that the Apache child at process ID 26192 on the machine www-va-2 just serviced the request and sent us that log line.
type 1 is an advanced Spread internal that can be ignored.
endian 0 is used to indicate whether the sending machine has a different byte ordering than the local Spread daemon to which we connected.
to 1 groups indicates the cardinality of groups to which this message was sent. mod_log_spread only sends individual messages to a single group.

The rest of the message is the Apache Common Log Format line and shows us more interesting things. These hits happen to be against my personal blog, and we can see that they are all coming from the same IP address from different referrers all within a few seconds of one another. This is suspect. Why would someone be coming to my site four times (this is just a small clip of time, there are actually thousands of these) all from different places?

Well, blogs (at least mine) track the referrers and list the top referrers in a side bar on the page. So if you refer to my blog many times from a particular URL, you may get some exposure under the top referrers section. Looking at these URLs, this person clearly isn't interested in the information on my blog. He just wants his prescription drug website link on my page. I don't like this guy already.

I could write a script that listens to this Spread group to check for IP addresses that contain three non-blank, differing URL referrers within 60 seconds and add them to a suppression list that my blog would use to eliminate "the bad guys" from my top referrers. Just a thought.

A demonstration on my blog doesn't expose the challenges involved with doing real-time assessments on large sites. In those environments, the logs stream by so fast that automated tools are absolutely required to perform any useful analysisunless something is catastrophically broken.

Real-time Monitoring

From the administrative point of view (systems, network, database), real-time monitoring is the capability to understand what is happening over some period of time. Specifically, it is the capability to correlate clearly realized effects with clearly reported occurrences. On the systems administration side, there are some age-old tools that allow administrators to understand system metrics in real-time. Two such tools are iostat and vmstat. These tools allow you to understand disk I/O activity, service times, virtual memory, and scheduling-related activity in near real-time. It is far out of the scope of this book to talk about how you would go about interpreting the output of such tools, but we can use them to draw a parallel into clustered systems.

Where are the tools that tell us what is going on in real-time in our web cluster? iostat and vmstat are fundamentally important to monitoring and troubleshooting problems on production systems. They allow administrators to quickly pinpoint bottlenecks or points of resource contention. As such, they are available on every modern system. When you install your Apache web cluster, where are the analogous tools? Absent.

If you install a load-balancing device in front of your cluster, often that system will provide real-time metricsat least those that it uses to make its load-balancing decisions. As we have learned, these load-balancing devices are not always the right tools for the job, so there must be another way to ascertain these vital metrics in real-time. Additionally, there very well may be metrics related to your web application that should be monitored, and because these load-balancing devices operate above the web application, there is no feasible way they could report application specific metrics. Listing 9.2 shows an example of a real-time web traffic monitor.

Listing 9.2. `wwwstat`A Real-time Web Traffic Monitor

 1: #!/usr/bin/perl  2:  3: use strict;  4: use Spread;  5: use Getopt::Long;  6: use Time::HiRes qw/tv_interval gettimeofday/;  7:  8: # return the maximum of two inputs  9: sub max { my($a,$b) = @_; ($a>$b)?$a:$b; } 10: 11: use vars qw /$daemon @group $interval $last $quit $banner $rows_since_banner 12:              %servers %hits %statuses/; 13: 14: GetOptions("d=s" = > \$daemon, 15:            "g=s" = > \@group, 16:            "i=i" = > \$interval); 17: 18: $interval ||= 1; 19: my ($m, $g) = Spread::connect( { spread_name = > "$daemon", 20:                                  private_name = > "tt_$$" } ); 21: die "Could not connect to Spread at $daemon" unless $m; 22: die "Could not join group" unless(grep {Spread::join($m, $_)} @group); 23: 24: sub pretty_hits { 25:   my ($servers, $statuses, $hitcnt, $banner) = @_; 26:   my @slist = sort keys %$servers; 27:   my @stlist = sort { $a < = > $b } keys %$statuses; 28:   my $minw = scalar(@stlist) * 4 - 1; 29: 30:   # If we were asked to print a banner or it has been over 19 rows 31:   if($banner || $rows_since_banner > 19) { 32:     # print out server names 33:     foreach (@slist) { 34:       printf "%".max($minw, length($_))."s ", $_; 35:     } 36:     print "\n"; 37:     foreach (@slist) { 38:       # print out status codes 39:       my $str; 40:       foreach my $st (@stlist) { $str .= sprintf("%3d ", $st); } 41:       $str =~ s/\s*$//; 42:       printf "%".max($minw, length($_))."s ", $str; 43:     } 44:     print "\n" . ('-' x 72) . "\n"; 45:     $rows_since_banner = 0; 46:   } 47: 48:   # Print out counts for each server/statuscode 49:   foreach (@slist) { 50:     my $str; 51:     foreach my $st (@stlist) { $str.=sprintf("%3d ",$hitcnt->{"$_/$st"}); } 52:     $str =~ s/\s*$//; 53:     printf("%".max($minw, length($_))."s ", $str); 54:   } 55:   $rows_since_banner++; 56:   print "\n"; 57: } 58: 59: $SIG{'INT'} = sub { $quit = 1; }; 60: $last = [gettimeofday]; 61: while(!$quit and my @p = Spread::receive($m, 0.2)) { 62:   if(@p[0] & Spread::REGULAR_MESS()){ 63:     # For each regular message, parse the common log 64:     if(@p[5] =~ /^(\S+)          # remote host address 65:                  \s(\S+)         # remote user 66:                  \s(\S+)         # local user 67:                  \s\[([^\]]+)\]  # date 68:                  \s"([^"]+)"     # request 69:                  \s(\d+)         # status 70:                  \s((?:\d+|-))   # size 71:                  \s"([^"]+)"     # referrer 72:                 /x) {; 73:       my ($raddr, $ruser, $luser, $date, $req, $status, $size, $ref) = 74:          ($1,     $2,     $3,     $4,    $5,   $6,      $7,    $8); 75:       (my $lhost = @p[1]) =~ s/^#[^#]+#//; 76: 77:       # account for the server and status code, if either is "new" we 78:       # must issue a new header/banner 79:       $banner = 1 unless exists $servers{$lhost}; 80:       $banner = 1 unless exists $statuses{$status}; 81:       $servers{$lhost} = 1; 82:       $statuses{$status} = 1; 83:       # tally the server/status hit 84:       $hits{"$lhost/$status"}++; 85:     } 86:   } 87:   if(tv_interval($last) > $interval) { 88:     # if $interval has elapsed, print out our tallies and zero them 89:     pretty_hits(\%servers, \%statuses, \%hits, $banner); 90:     $banner = 0; 91:     undef %hits; 92:     $last = [gettimeofday]; 93:   } 94: } 95: 96: Spread::disconnect($m);

The wwwstat program provides the insight we're missing. Let's take it for spin on a small web cluster of three nodes serving only dynamic page traffic. Output from wwwstat is as follows:

 1:         www-va-1        www-va-2        www-va-3  2:  200 302 404 500 200 302 404 500 200 302 404 500  3:  ------------------------------------------------------------------------  4:  51   3   0   0  69   2   0   0  54   2   0   0  5:  40   0   0   0  57   1   0   0  48   2   0   0  6:  68   2   0   0  70   4   0   0  55   2   0   0  7:  68   2   0   0  94   2   0   0  71   1   0   0  8:  54   2   0   0  60   0   0   0  53   1   1   0  9:  61   1   0   0  69   2   0   0  64   0   0   0 10:  99   4   0   0  92   5   0   0 101   1   0   0 11:  70   4   0   0  89   4   0   0  83   3   0   0 12:  65   2   0   0  57   1   0   0  49   1   0   0 13:  59   4   0   0  60   2   0   0  56   2   0   0 14:  48   3   0   0  50   2   0   0  46   4   0   0 15:  44   2   0   1  58   1   0   0  44   1   0   0 16:  49   0   0   0  64   3   0   0  54   3   0   0 17:  54   6   0   0  50   2   0   0  54   2   0   0 18:  94   1   0   0  84   0   0   0  76   4   0   0

Hey! There's an internal server error on www-va-1 (line 15)! Someone should take a look at that. Most importantly, however, we can see that the number of requests per time interval to each node in the cluster is distributed reasonably well (uniformly).

Now, the wwwstat tool certainly can be cleaned up a bit. The rendering could be prettier, and the information could be more thorough. However, it demonstrates that we can whip up a tool to suit our needs in short order as long as the information we need to analyze is at our fingertips. Getting the information here was cake (lines 19-22, 61, and 96).

This example was fun, but you may say, "I can look on my load balancer and see this information." You are absolutely right, if you (as a developer or sys admin) have access to that load balancer.

Demonstrating this passive analysis technique with web logs in common log format was simple and useful. That is certainly not the be-all, end-all of what can be accomplished. Spread is a simple communications medium, and publishers can publish anything they want. You could change the common log format to a custom one that includes the time it took to service each web request. With this modification, you could augment wwwstat to display average, minimum, and maximum service times (much like iostat does).

Note

Sadly, to get resolution finer than 1 second in Apache 1.3 (you'd want millisecond or microsecond granularity), you have to modify the mod_log_config source code. If you don't code C, or have never touched Apache source or Apache module source, this is a fabulous first project. It is short, simple, and amazingly useful. Apache 2.0 already provides sub-second timing granularity.

Aside from web logs, you could have your database abstraction layer log to a spread group on every statement execution and report the query being run and the time it took to execute it.

Outside the monitoring space (in a manner of speaking) is the capability to track user click-streams in real-time, regardless of the servers they access.

Passive Log Aggregation for Metrics

When you spend a lot of time in a large architecture you begin to realize that looking up at the system from the ground floor is a great way to troubleshoot problems. It also happens to be one of the worst ways to find problems. Your vision is shortsighted, and your view on the system as a whole is poor. So, the best architects and engineers have a healthy respect for the down-and-dirty details, as well as a macroscopic view of the business.

When I say "macroscopic view of the business" I don't mean accounting, HR, cleanliness of the office spacewe haven't left the technical landscape. I'm referring to technical business metrics. The marketing initiatives, the new registrations, the number of orders placed, and so on. These elements make the business money and, in the end, pay your salary, so you better be monitoring them. Somebody has to see the big picture, and if you are the CTO or the director of operations, that is your job. If you aren't and you report to someone, there is no excuse for you to not see the big picturethat's the path to the D and C levels.

The key here is not to lose your analytical engineering-focused thinking. So, you might ask (though I hope not) what I hear from developers and administrators when I speak on this subject. "What good does it do me to see how many orders were placed yesterday or last hour?" "Who cares how many new customers we have as long as the number is positive?" "No one tells me when a new marketing initiative starts, that's handled by another department; besides, what does it have to do with my job?"

There are many answers to these questions. Usually they are delivered after a smack in the head of the person who asked them in the first place. The most obvious reason is because you should care what is going on. The other reason is because you are in the unique position to see an even bigger picture.

On the nontechnical side of the business, you better believe they have reports on every single business related metric. Chances are good that the data for those reports mostly comes from the logs and databases in your architecture. So...you can see those reports, too. However, the business side doesn't see all the technical reports such as bandwidth utilization, database query volume, and cache hit rates on your static content servers.

By seeing all of this data at a high level, you can learn what to expect. You see normal trends and correlations between metrics that fall more on the business side to metrics that fall more on the technical side. Understanding the relationship between these things can lead to profound insights.

So, where are we going with this? Am I saying that you should have daily reports of all the important business metrics and technical metrics in one place so that you can see the picture? Well, obviously, but that's the tip of the iceberg. With centralized real-time logging you can get reports by the minute and see trends as they form. Although understanding a trend that is just forming is pretty much impossible, if you have familiarized yourself with all the big picture data and understand what the relationships between all the metrics look like in a healthy system, you can spot an unhealthy situation as it is gestating.

Real-time Analysis

Real-time Monitoring

Listing 9.2. wwwstatA Real-time Web Traffic Monitor

Passive Log Aggregation for Metrics

Listing 9.2. `wwwstat`A Real-time Web Traffic Monitor