Visualizing Data with RRDtool

If you have ever worked with large number sets, the first thing you learned was that visualization tools are priceless. I can look at the number of octets sent every minute of the day from each of our 10 web servers (1,440 data points times 10) or I can look at a simple graph. Both are useful, but I certainly want to start with the graph: if everything looks okay, I have no need to look deeper. Graphs, specifically those that share a common axis (such as time), make it easy to correlate cause and effect, which can be difficult when there are multiple causes and effects spanning several systems.

Let's stay with our news site as an example. Most operations groups use graphing as an invaluable data correlation tool. Typical graph sources are bandwidth and errors on each switch port, as well as CPU, memory, load, and disk utilization on each server. These graphs are a tremendous resource when attempting to diagnose performance issues and the effects of new strains (newly pushed services or code).

Suppose that we launch a new service that offers users a listing of the most popular "next page visited" list based on the site-local page loaded by all people who were viewing the page in question directly prior. In other words, if 1,000 users viewed page A and then performed some action to deliver them to another page on the site, all those pages would be ranked by popularity and displayed on page A. Implementing this in a scalable manner will be left as an exercise to the reader (read the rest of the book first).

The service has been launched and metrics are down by 5%. Why, and which metrics? These are questions posed to senior technologists, and having a firm grasp of how to arrive at a solution quickly is a valuable skill. New user registrations are following an expected trend, the bandwidth is up by 2%, as is the number of hits on the site. However, the advertising click-through is down by 5%. This is the landscape for our puzzle. How can all these system metrics be up and the business be down?

To answer this question we need more data, or an aware operations team. However, we can also have data coming out our ears, and it will all be useless unless we find a good means of visualizing it. Enter Tobias Oetiker and his useful RRDtool.

A Bit About RRDtool

The more I work with RRDtool, the more I realize that it is utterly obtuse, and yet irreplaceable. RRDtool stands for Round Robin Database tool, and it allows streamlined data metric retention and visualization. Perhaps the most common use of this type of tool is for monitoring the activity on network interface cards on routers, switches, and even hosts. RRDtool will allow you to create a database of tracked metrics with a specific expected frequency and well-defined database retention policies, feed new metrics into the database, and then generate both ugly and stunning visualizations of the data therein.
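The consolidation idea at the heart of those retention policies can be sketched without RRDtool at all: a round-robin archive keeps a fixed number of rows, and each row is built by consolidating some number of raw samples. A toy illustration of the AVERAGE consolidation function, using made-up sample values:

```shell
# Average every 10 "raw samples" (here just the integers 1..30) into one
# consolidated data point, as an RRA with CF=AVERAGE and steps=10 would.
seq 1 30 | awk '{ s += $1; if (NR % 10 == 0) { print s / 10; s = 0 } }'
# prints 5.5, 15.5, 25.5 -- three consolidated rows from 30 raw samples
```

RRDtool does exactly this bookkeeping for you, at several resolutions at once, discarding the oldest row as each new one arrives.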

Like many engineering tools, RRDtool provides an engineer with just enough options to require constant access to the manual page and defaults/samples that are designed to convince people that engineers don't know what really looks good. However, it is powerful and flexible and a tremendous asset to any monitoring suite.

The ins and outs of RRDtool could fill the pages of at least one book, and it simply isn't important to understand how to install, configure, and use RRDtool for me to illustrate its usefulness. Luckily, the documentation for RRDtool and the various open source products that use it are excellent starting points.

We will walk through a simple usage of RRDtool and show how the additional collected information can help us solve the "metrics are up and business is down" problem posed earlier.

Setting Up Our Databases

The first thing we'll do is set up some RRDs to track the metrics we're interested in. Note that you must be interested in these things before the problems occur; otherwise, you will see only the data representing the problematic situation and not historical data for comparison. In other words, we should set up the metric collection as part of launching the service, not as a part of troubleshooting.

We want to track the number of bytes in and out of our web environment and the number of page hits by server, error code, and user registration status. To make this example shorter, we'll set up traffic metrics and page hits by user registration status here and leave the rest as an exercise to the reader.

For all our information we would like second-by-second accuracy. However, updating all these databases once a second for each metric can be expensive, and it really doesn't add much value. Basically, we want to be able to see spikes and anomalies clearly. For example, if something bad happens for one minute out of the hour, we don't want to see it averaged into 60 minutes' worth of data. There is a trade-off here, and I'm giving you some insight into choosing a lower bound on the data collection period. If a graph or set of graphs is displayed, what is the amount of time it will take to fully understand what we're seeing? This includes the big picture, trends, identifying anomalies, and correlating them across data sources. On a good day, it takes me about two minutes, so I collect data at twice that frequency: once every 60 seconds.
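For a sense of the write load at stake in that trade-off (the figures below are plain arithmetic, not RRDtool measurements):

```shell
# Updates per metric per day at the two candidate collection periods:
echo $(( 86400 / 60 ))   # 60-second step: 1440 updates per day
echo $(( 86400 ))        # 1-second step: 86400 updates, 60x the load
```

Multiply by the number of metrics and the number of hosts, and the 1-second option gets expensive quickly.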

The next questions are "how long will this data be retained?" and "how far back must it be available on 60-second intervals?" These are much tougher questions. Typically, I retain information for two years when I set up RRD files. If I want data back farther than that, it is never needed on a granularity finer than a week, so archiving could actually be done by taking RRD-generated graphs and saving one a week forever.

To create the RRDs for the inbound byte counters and outbound byte counters to be used to measure overall traffic, we create one RRD for each (though multiple metrics can be stored in a single RRD, I find juggling them easier when they are one-to-one):

rrdtool create /var/rrds/web-inbytes.rrd --step 60 DS:bytes:COUNTER:60:0:U \
    RRA:AVERAGE:0.5:1:5760 RRA:MAX:0.5:1:5760 RRA:MIN:0.5:1:5760 \
    RRA:AVERAGE:0.5:10:4032 RRA:AVERAGE:0.5:60:5376 \
    RRA:AVERAGE:0.5:120:8064 RRA:MAX:0.5:120:8064 RRA:MIN:0.5:120:8064

rrdtool create /var/rrds/web-outbytes.rrd --step 60 DS:bytes:COUNTER:60:0:U \
    RRA:AVERAGE:0.5:1:5760 RRA:MAX:0.5:1:5760 RRA:MIN:0.5:1:5760 \
    RRA:AVERAGE:0.5:10:4032 RRA:AVERAGE:0.5:60:5376 \
    RRA:AVERAGE:0.5:120:8064 RRA:MAX:0.5:120:8064 RRA:MIN:0.5:120:8064

We will create RRD files for tracking hits by registered versus unregistered users similarly:

rrdtool create /var/rrds/web-visitorhits.rrd --step 60 DS:hits:COUNTER:60:0:U \
    RRA:AVERAGE:0.5:1:5760 RRA:MAX:0.5:1:5760 RRA:MIN:0.5:1:5760 \
    RRA:AVERAGE:0.5:10:4032 RRA:AVERAGE:0.5:60:5376 \
    RRA:AVERAGE:0.5:120:8064 RRA:MAX:0.5:120:8064 RRA:MIN:0.5:120:8064

rrdtool create /var/rrds/web-userhits.rrd --step 60 DS:hits:COUNTER:60:0:U \
    RRA:AVERAGE:0.5:1:5760 RRA:MAX:0.5:1:5760 RRA:MIN:0.5:1:5760 \
    RRA:AVERAGE:0.5:10:4032 RRA:AVERAGE:0.5:60:5376 \
    RRA:AVERAGE:0.5:120:8064 RRA:MAX:0.5:120:8064 RRA:MIN:0.5:120:8064

The preceding statements maintain single-step (60-second) averages, minimums, and maximums for 5760 intervals (4 days); 10-step (10-minute) averages for 4032 intervals (28 days, or 4 weeks); 60-step (60-minute) averages for 5376 intervals (224 days, or 32 weeks); and lastly 120-step (2-hour) averages, minimums, and maximums for 8064 intervals (672 days, or roughly 2 years).
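The arithmetic behind those figures is easy to double-check; this sketch just multiplies step size by steps-per-row by row count for each RRA defined above:

```shell
# seconds-per-step * steps-per-row * rows / seconds-per-day = days retained
step=60
echo $(( step * 1   * 5760 / 86400 ))  # 1-minute rows:  4 days
echo $(( step * 10  * 4032 / 86400 ))  # 10-minute rows: 28 days
echo $(( step * 60  * 5376 / 86400 ))  # 1-hour rows:    224 days
echo $(( step * 120 * 8064 / 86400 ))  # 2-hour rows:    672 days (~2 years)
```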

Collecting Metrics

Perhaps the most challenging aspect of managing stores of metrics is collecting the data in the first place. There are two categories of metrics in the world: those that can be queried via SNMP (Simple Network Management Protocol) and those that cannot. Many a tool sits on top of RRDtool that can collect SNMP-based metrics for you. Cacti is one such tool that often makes its way into production around here.

Querying data from SNMP is simple: there are supporting libraries or extensions for almost every common programming language. I highly recommend using one of the many prebuilt packages to do automated metric collection from all your SNMP-capable devices. Many of these packages even automate the creation and population of the RRD files. For now, we'll set up a simple cron job that updates the two byte counters from the router that sits in front of our web architecture (see Listing 9.3).

Listing 9.3. simple_out_rrdupdate.sh: Simplistic RRDtool Update Script

#!/bin/sh
FILE=$1
AGENT=$2
OID=$3
if test ! -f "/var/rrds/$FILE.rrd"; then
  echo No such RRD file
  exit
fi
COMMUNITY=public
BYTES=`snmpget -c $COMMUNITY -v 2c -Oqv $AGENT $OID`
if test "$BYTES" -le "0"; then
  echo Bad SNMP fetch
  exit
fi
rrdtool update /var/rrds/$FILE.rrd -t bytes N:$BYTES

crontab entries (the router's address, elided here as ROUTER, is the AGENT argument; the OIDs are ifInOctets.1 and ifOutOctets.1):

* * * * * /usr/local/bin/simple_out_rrdupdate.sh web-inbytes \
      ROUTER .1.3.6.1.2.1.2.2.1.10.1
* * * * * /usr/local/bin/simple_out_rrdupdate.sh web-outbytes \
      ROUTER .1.3.6.1.2.1.2.2.1.16.1

It should be clear that you would not want to run a cron job every minute for every single metric you monitor; this specific example is kept short for brevity. Additionally, the SNMP OID .1.3.6.1.2.1.2.2.1.10.1 is in no way intuitive. It represents the inbound octets (old networking lingo for bytes) on interface 1 of the specified device. Any SNMP collection tool worth its weight in electrons will make OIDs human readable and easy to find.

Now that we have statistics being collected (and we'll assume we have been collecting them all along), we can move on to something that is a bit more obtuse (if that's possible). Collecting SNMP metrics is easy because all the tools do the complicated actions described previously for you; you just point them at a device, or set of devices, and click Go. This works because SNMP is an established protocol, and the OIDs (the long dot-delimited numeric identifiers) are either industrywide standards or are well documented by the vendor of the device. Graphing other stuff is not so simple.

To graph the rate of page loads (hits) for visitors to the site, as well as for registered users, we need to know more than any vendor could generally know. Specifically, we need to know how we classify a loaded page as being loaded by a registered user (in our case a user with an account that is currently "signed in" while viewing the page) or a visitor (anyone else).

To do this, we can alter the wwwstat program we wrote before to track the hits and update the RRD files. In our web application we will use the second field of the common log format (remote user) to represent the current viewing user. It is in the form V-{VISITORID} for visitors and U-{USERID} for users. We place it in the remote user field of the log format because it will not change the format and thus will not break existing processes that read those logs. The remote user field is relatively useless these days because many clients don't support RFC1413, and most sites don't attempt to perform RFC1413 (ident) lookups. This means that we simply need to monitor the live log stream and tally logs with a remote user starting with 'V' in the web-visitorhits file and everything else in the web-userhits file. We do this as seen in Listing 9.4.
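The classification rule is simple enough to test on the command line before wiring it into a collector. This quick awk sketch (with fabricated log lines and addresses) tallies the second field the same way Listing 9.4 does:

```shell
# Count hits by remote-user prefix: V-* is a visitor, everything else a user.
printf '%s\n' \
  '10.1.2.3 V-8812 - [01/Jan/2006:00:00:01 -0500] "GET /a HTTP/1.1" 200 512 "-"' \
  '10.1.2.4 U-42 - [01/Jan/2006:00:00:02 -0500] "GET /b HTTP/1.1" 200 128 "-"' \
  '10.1.2.5 V-9 - [01/Jan/2006:00:00:03 -0500] "GET /c HTTP/1.1" 404 - "-"' |
awk '{ if ($2 ~ /^V/) v++; else u++ } END { print v, u }'
# prints "2 1": two visitor hits, one registered-user hit
```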

Listing 9.4. userhits2rrd.pl: Updating RRD Files with External Data

#!/usr/bin/perl

use strict;
use Spread;
use Getopt::Long;
use Time::HiRes qw/tv_interval gettimeofday/;
use RRDs;

use vars qw/$daemon @group $interval $last $quit %ahits/;

GetOptions("d=s" => \$daemon,
           "g=s" => \@group,
           "i=i" => \$interval);

$interval ||= 60;
my ($m, $g) = Spread::connect( { spread_name  => "$daemon",
                                 private_name => "tt_$$" } );
die "Could not connect to Spread at $daemon" unless $m;
die "Could not join group" unless(grep {Spread::join($m, $_)} @group);

$ahits{visitors} = $ahits{users} = 0;
sub tally {
  # This should be called every $interval seconds
  RRDs::update("/var/rrds/web-visitorhits.rrd",
               "--template", "hits", "N:$ahits{visitors}");
  RRDs::update("/var/rrds/web-userhits.rrd",
               "--template", "hits", "N:$ahits{users}");
}

$SIG{'INT'} = sub { $quit = 1; };
$last = [gettimeofday];
while(!$quit and my @p = Spread::receive($m, 0.2)) {
  if($p[0] & Spread::REGULAR_MESS()) {
    # For each regular message, parse the common log
    if($p[5] =~ /^(\S+)          # remote host address
                 \s(\S+)         # remote user
                 \s(\S+)         # local user
                 \s\[([^\]]+)\]  # date
                 \s"([^"]+)"     # request
                 \s(\d+)         # status
                 \s((?:\d+|-))   # size
                 \s"([^"]+)"     # referrer
                /x) {
      my ($raddr, $ruser, $luser, $date, $req, $status, $size, $ref) =
         ($1,     $2,     $3,     $4,    $5,   $6,      $7,    $8);

      if($ruser =~ /^V/) { $ahits{"visitors"}++; }
      else               { $ahits{"users"}++;    }
    }
  }
  if(tv_interval($last) > $interval) {
    tally();
    $last = [gettimeofday];
  }
}

Spread::disconnect($m);

Now we run the script, and data collection and storage into RRD is done. The reason this is substantially more complicated than SNMP is that there are no good generic tools to collect your custom business metrics. Each time you want to track a new data source that isn't SNMP capable, you either have to glue it into an SNMP agent or query via custom logic as we have done here. These metrics are absolutely useful and warrant maintaining one-off collection scripts for each of the business metrics you want to track. In the end, you'll find that your scripts aren't very complicated, aren't all that different from one another, and will yield substantial intra-organizational reuse.

Visualizing Data Through RRDtool

So, we're collecting metrics, but this doesn't help us solve our problem. We must be able to visualize it for it to prove useful. Like everything else with RRDtool, the visualization can do exactly what you want, but the user interface to do it is 100% engineer and 0% designer. We have a web wrapper around most of the RRD graphs we can generate, which makes them quite nice; but I'll show you the "under the hood" graph generation here so that you can have a healthy respect for why someone should wrap it with a simple web interface.

To generate a graph that is ugly, we can call rrdtool in its simple form:

rrdtool graph \
    ugly.png --title "web traffic" -a PNG \
    --vertical-label "bits/sec" --width 800 --height 500 \
    "DEF:inbytes=/var/rrds/web-inbytes.rrd:bytes:AVERAGE" \
    "DEF:outbytes=/var/rrds/web-outbytes.rrd:bytes:AVERAGE" \
    "CDEF:realinbits=inbytes,8,*" \
    "CDEF:realoutbits=outbytes,8,*" \
    "AREA:realinbits#0000ff:Inbound Traffic" \
    "LINE1:realoutbits#ff0000:Outbound Traffic" \
    --start -172800

Yes, that's the simple form. It pulls the in and out byte metrics over the last 172,800 seconds (two days) from their respective RRD files, multiplies them by 8 (to get bits), and graphs the inbound in a blue area curve and the outbound in a red line curve.

It produces something usable, but so ugly I refuse to grace this book with it. So, we will consult a visualization expert and show the traffic received from the Internet on the positive y-axis and the traffic sent to the Internet on the negative y-axis (the total traffic being the area in between). Additionally, we will add some gradient pizzazz and end up with the following (utterly obtuse) command that produces Figure 9.5:

rrdtool graph webtraffic.png --title "web traffic" -a PNG \
    --vertical-label "bits / sec" \
    --width 450 --height 180 \
    "DEF:outbytes=/var/rrds/web-outbytes.rrd:bytes:AVERAGE" \
    "DEF:inbytes=/var/rrds/web-inbytes.rrd:bytes:AVERAGE" \
    "CDEF:realout=outbytes,8,*,-1,*" \
    "CDEF:realout1=outbytes,8,*,-1,*,3,*,4,/" \
    "CDEF:realout2=outbytes,8,*,-1,*,2,*,4,/" \
    "CDEF:realout3=outbytes,8,*,-1,*,1,*,4,/" \
    "CDEF:realoutb=outbytes,8,*,-1,*" \
    "CDEF:realin=inbytes,8,*" \
    "CDEF:realin1=inbytes,8,*,3,*,4,/" \
    "CDEF:realin2=inbytes,8,*,2,*,4,/" \
    "CDEF:realin3=inbytes,8,*,1,*,4,/" \
    "CDEF:realinb=inbytes,8,*" \
    "AREA:realout#ffaa44:" \
    "AREA:realin#ffaa44:" \
    "AREA:realout1#ffcc55:" \
    "AREA:realin1#ffcc55:" \
    "AREA:realout2#ffee77:" \
    "AREA:realin2#ffee77:" \
    "AREA:realout3#ffff88:" \
    "AREA:realin3#ffff88:" \
    "LINE1:realoutb#888833:web outbound traffic" \
    "LINE1:realinb#888833:web inbound traffic" \
    --start -172800

Figure 9.5. A graph representing web network traffic.
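The ",8,*" and ",-1,*" fragments in the preceding CDEFs are RPN (reverse Polish notation) expressions applied to every sample. The same arithmetic, for one hypothetical 125,000 byte/sec sample:

```shell
# "outbytes,8,*" converts bytes/sec to bits/sec;
# "outbytes,8,*,-1,*" also negates it so it plots below the x-axis.
outbytes=125000                 # hypothetical sample, bytes/sec
echo $(( outbytes * 8 ))        # 1000000 bits/sec, drawn above the axis
echo $(( outbytes * 8 * -1 ))   # -1000000, drawn on the negative y-axis
```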

Now let's graph our user hit metrics using the same technique (shown in Figure 9.6):

rrdtool graph webhits.png --title "online users" -a PNG \
    --vertical-label "users" --width 800 --height 500 \
    "DEF:uhits=/var/rrds/web-userhits.rrd:hits:AVERAGE" \
    "DEF:vhits=/var/rrds/web-visitorhits.rrd:hits:AVERAGE" \
    "CDEF:realu=uhits,-1,*" \
    "CDEF:realu1=uhits,-1,*,3,*,4,/" \
    "CDEF:realu2=uhits,-1,*,2,*,4,/" \
    "CDEF:realu3=uhits,-1,*,1,*,4,/" \
    "CDEF:realub=uhits,-1,*" \
    "CDEF:realv=vhits" \
    "CDEF:realv1=vhits,3,*,4,/" \
    "CDEF:realv2=vhits,2,*,4,/" \
    "CDEF:realv3=vhits,1,*,4,/" \
    "CDEF:realvb=vhits" \
    "AREA:realu#ffaa44:" \
    "AREA:realv#ffaa44:" \
    "AREA:realu1#ffcc55:" \
    "AREA:realv1#ffcc55:" \
    "AREA:realu2#ffee77:" \
    "AREA:realv2#ffee77:" \
    "AREA:realu3#ffff88:" \
    "AREA:realv3#ffff88:" \
    "LINE1:realub#888833:Registered Users" \
    "LINE1:realvb#888833:Visitors" \
    --start -172800

Figure 9.6. A graph representing web page-load traffic.

Being Hit in the Face with Data

Back to our problem, so that we don't wander aimlessly any longer: New user registrations are following an expected trend, the bandwidth is up by 2%, as is the number of hits on the site. However, the advertising click-through is down by 5%.

As our problem describes, we should see an increase in bandwidth by about 2% from the previous day. Figure 9.5 shows something that looks reasonably consistent with the problem description.

In Figure 9.6, however, we see something entirely unexpected. We were told that the hits on the site have increased by 2% as well, but the data for today (the right hump) is substantially more than 2% larger than yesterday (the left hump). So, either someone doesn't know what he is talking about, or that person's definition of "hit" doesn't match our definition of "hit."

Indeed this is the case. If we look at our script, it tallies all log lines it sees as a hit for either a registered user or a visitor. It does not limit the count to those requests served with a 200 HTTP response code (as is common in some reporting tools). Although this doesn't identify the problem, we clearly see that a problem of some type is occurring today that did not occur yesterday. Progress.

If we believe that this graph shows dramatically different trends from what the problem report indicates because of the response codes, we should be looking at graphs of response codes.
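Before generating that graph, the hunch is cheap to verify straight from the logs. A quick awk tally by status code works (field 9, once the bracketed date splits into two whitespace-delimited fields; the sample lines below are fabricated):

```shell
# Tally common-log lines by HTTP response code.
printf '%s\n' \
  '10.1.2.3 V-8812 - [01/Jan/2006:00:00:01 -0500] "GET /a HTTP/1.1" 200 512 "-"' \
  '10.1.2.4 U-42 - [01/Jan/2006:00:00:02 -0500] "GET /b HTTP/1.1" 404 - "-"' \
  '10.1.2.5 V-9 - [01/Jan/2006:00:00:03 -0500] "GET /c HTTP/1.1" 404 - "-"' |
awk '{ c[$9]++ } END { for (s in c) print s, c[s] }' | sort
# prints "200 1" and "404 2"
```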

Figure 9.7 shows a graph of pages served over time, broken down by HTTP response code. The problem couldn't be more obvious now. We are throwing 404 (page not found) errors now, and we were not doing so yesterday. Looking at these graphs to come to this conclusion took only a few seconds. The data behind most problems, when visualized correctly, will hit you like a ton of bricks.

Figure 9.7. A graph representing web page-load traffic by HTTP response code.

Scalable Internet Architectures
ISBN: 067232699X
Year: 2006