Hack 62 Graphing Data with RRDTOOL


Graphing data over time, either by itself or in comparison with another dataset, is the Holy Grail of analytical research. With the use of RRDTOOL, you'll be able to store and display time-series data.

In this hack, we're going to get some example data from Amazon.com and use the Round Robin Database Tool (RRDTOOL, http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/) to graph changes in Amazon.com Sales Rank over time.

Round robin is a way of storing a fixed amount of data and a pointer to the current element. This is much like a cyclic buffer with a fixed number of slots for data, where adding a new element pushes out the oldest to make space. This is a nice feature, because you never have to worry about using all your disk space or clearing out old data. The downside is that you have to decide the time period up front. This hack assumes you have RRDTOOL installed as per the online instructions.
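To make the round robin idea concrete, here is a minimal shell sketch of a fixed-size buffer that pushes out its oldest entry when a new one arrives. The names SLOTS, buffer, and push are invented for this illustration, and the sketch only mimics the idea; it says nothing about how RRDTOOL lays out its files internally:

```shell
#!/bin/sh
# Illustrative only: a three-slot buffer held in a space-separated string.
SLOTS=3
buffer=""

push() {
    # Load the stored values plus the new one into the positional parameters.
    set -- $buffer "$1"
    # Drop values from the front until only $SLOTS remain.
    while [ "$#" -gt "$SLOTS" ]; do
        shift
    done
    buffer="$*"
}

push 1; push 2; push 3; push 4
echo "$buffer"    # prints "2 3 4": the oldest value, 1, has been pushed out
```

The buffer never grows beyond three entries, which is the same trade-off described above: bounded storage, at the price of fixing the capacity up front.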

First, let's create a database to log an Amazon.com Sales Rank for a month:

 % rrdtool create salesrank.rrd --start 1057241523 --step 86400 \
     DS:rank:GAUGE:86400:1:U \
     RRA:AVERAGE:0.5:1:31 \
     RRA:AVERAGE:0.5:7:10

We have now created a database called salesrank.rrd, starting when this was written, adding new data every 24 hours, and keeping two round robin archives. There are numerous settings when creating a database, many more than we can hope to explain here. To give you a feel for it, we'll just briefly explain the settings we used in this hack:


--start 1057241523 --step 86400

Defines when the time series starts, using Unix timestamps. Executing date +%s gives you the current time in the necessary format (number of seconds since the Epoch). Setting step to 86400 defines the time in seconds between our data points. We arrive at that number with the following equation: 24 x 60 x 60 = 86400, or 24 hours of 60 minutes each, with each minute containing 60 seconds. In this case, we're recording one data point per day, every day, starting now.
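The same arithmetic can be left to the shell. This is a small sketch, with variable names of our own choosing, that computes the step and fetches a current timestamp suitable for --start:

```shell
#!/bin/sh
# Seconds between data points: 24 hours x 60 minutes x 60 seconds.
step=$((24 * 60 * 60))
echo "step:  $step"    # prints "step:  86400"

# Current time as seconds since the Epoch, usable as a --start value.
now=$(date +%s)
echo "start: $now"
```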


DS:rank:GAUGE:86400:1:U

DS defines a data source, rank is its name, and GAUGE is used when we're more interested in the absolute number than in a rate of change. The first 86400 is the heartbeat: the maximum number of seconds that may pass between updates before the value is recorded as unknown. We set the minimum to 1, because we know that the highest Sales Rank is 1. We set the maximum to unlimited (U), because we don't know how many products Amazon.com has; therefore, we can't know how badly ranked our book will be, and thus the need for unlimited.


RRA:AVERAGE:0.5:1:31

RRA:AVERAGE:0.5:7:10

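Here, we define our two round robin archives (RRAs), each using AVERAGE to consolidate data points. The first keeps daily numbers (1 step per row) and runs for a total of 31 days; the second keeps weekly numbers (7 steps, or 7 days, per row) for a total of 10 weeks. The 0.5 is the fraction of data points in a row that may be unknown before the whole row is considered unknown.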
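To check how much history an archive holds, multiply the steps per row by the number of rows and by the step length. The helper name rra_days below is our own invention; it is only a sketch of that multiplication:

```shell
#!/bin/sh
STEP=86400   # seconds per data point, as passed to --step

# rra_days STEPS ROWS: days covered by an archive that consolidates
# STEPS data points into each of its ROWS rows.
rra_days() {
    echo $(( $1 * $2 * $STEP / 86400 ))
}

rra_days 1 31    # daily archive: prints 31
rra_days 7 10    # weekly archive: prints 70
```

So the weekly archive reaches back 70 days, a little more than twice as far as the daily one.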

Now that we have the database created, it is time to start filling in some numbers by using the rrdtool update command:

 % rrdtool update salesrank.rrd 1057241524:3689
 % rrdtool update salesrank.rrd 1057327924:3629
 ...etc...
 % rrdtool update salesrank.rrd 1059833523:2900

The numbers are in the format of timestamp:value, which, in this case, indicates a Sales Rank of 3689 for the first entry and 3629 for the next entry 24 hours later. The rule is that every update should be at least one second after the previous entry. With a total of 31 data points (not all are shown in the example), we now have something to display. To get textual results, we can use the fetch feature of rrdtool:
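The timestamps for successive days are simply the first timestamp plus a multiple of the step. As a sketch, the hypothetical helper update_cmd below prints the update command for a given day offset rather than running it:

```shell
#!/bin/sh
STEP=86400
START=1057241524   # timestamp of the first update shown in the text

# update_cmd DAY RANK: print the rrdtool update command for the given
# day offset (0 = first day). Remove the echo to run it for real.
update_cmd() {
    echo rrdtool update salesrank.rrd "$(( START + $1 * STEP )):$2"
}

update_cmd 0 3689    # prints: rrdtool update salesrank.rrd 1057241524:3689
update_cmd 1 3629    # prints: rrdtool update salesrank.rrd 1057327924:3629
```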

 % rrdtool fetch salesrank.rrd AVERAGE --start 1057241524 --end 1059833524
 1057190400: nan
 1057276800: 3.6290017008e+03
 1057363200: 3.6094016667e+03
 ...etc...

It's not very pretty to look at, but it's essentially the same as when we entered the data with timestamp:value. These are calculated numbers: rrdtool recalculates the values onto fixed step boundaries, so neither the timestamps nor the values are exactly the same as those we entered. But (finally!) on to where this whole hack started: drawing graphs based on time-series data:
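The timestamps in the fetch output are multiples of the step, which is why they differ from the ones we entered. A short sketch of that rounding follows; the function name align is ours, and the calculation only reproduces the numbers seen above, not rrdtool's internals:

```shell
#!/bin/sh
STEP=86400

# align TIMESTAMP: round a timestamp down to the nearest multiple of STEP.
align() {
    echo $(( $1 - $1 % $STEP ))
}

align 1057241524    # prints 1057190400, the first timestamp in the fetch output
```

Adding one step to that result gives 1057276800, the second timestamp in the output.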

 % rrdtool graph osxhacks.png --start 1057241524 --end 1059833524 \
     --imgformat PNG --units-exponent 0 \
     DEF:myrank=salesrank.rrd:rank:AVERAGE \
     LINE1:myrank#FF0000:"Mac OS X Hacks"

This code produces the graph shown in Figure 4-5.

Figure 4-5. Graph of the Amazon.com Sales Rank for Mac OS X Hacks

There's an almost never-ending list of settings when displaying the graphs, which would be impossible to cover here. Most notable in our previous command is that we get the rank parameter out of our database and graph it in red with the legend "Mac OS X Hacks." Other than that, we ask for files in PNG format and tell the graph not to do any scaling on the y-axis.

Doing this by hand on a regular basis would be incredibly tedious at best. cron and Perl to the rescue! First, we'll create a Perl script that sucks down the Amazon.com product we're interested in, and then we'll capture the Sales Rank with a simple regular expression. This captured data, as well as the current timestamp, will be used to update our RRDTOOL database, and a new graph will be created.

The Code

Save the following code in a file called grabrank.pl:

 #!/usr/bin/perl -w
 #
 # grabrank.pl
 #
 # This code is free software; you can redistribute it and/or
 # modify it under the same terms as Perl.
 #
 use strict;
 use LWP::Simple;

 my $time = time();

 # path to our local RRDTOOL.
 my $rrd = '/usr/local/bin/rrdtool';

 # Get the Amazon.com page for Mac OS X Hacks.
 my $data = get("http://www.amazon.com/exec/obidos/ASIN/0596004605/");

 # Capture the Sales Rank; the parenthesized group ends up in $1.
 $data =~ /Amazon.com Sales Rank: <\/b> (.*) <\/span><br>/;
 my $salesrank = $1; # and now the sales rank is ours! Muahh!

 # Get rid of commas.
 $salesrank =~ s/,//g;

 # Update our rrdtool database.
 `$rrd update salesrank.rrd $time:$salesrank`;

 # Update our graph, covering the last 31 days.
 my $cmd = "$rrd graph osxhacks.png --imgformat PNG --units-exponent ".
           "0 DEF:myrank=salesrank.rrd:rank:AVERAGE LINE1:myrank#FF0000:".
           "'Mac OS X Hacks' --start ".($time-31*86400)." --end $time";
 `$cmd`; # bazam! we're done.
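The script hands whatever the regular expression captured straight to rrdtool, so if Amazon.com changes its page layout, the update receives an empty or meaningless value. As a hedged sketch of a sanity check (written in shell, with the hypothetical name clean_rank, and not part of the original script), one could strip the commas and refuse anything that isn't a plain number:

```shell
#!/bin/sh
# clean_rank TEXT: strip commas and print the result only if what
# remains consists solely of digits; otherwise fail without output.
clean_rank() {
    rank=$(printf '%s' "$1" | tr -d ',')
    case "$rank" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$rank"
}

clean_rank '3,689'                          # prints 3689
clean_rank 'n/a' || echo "not a number"     # prints "not a number"
```

The same test is easy to express in Perl before the update command is run.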

Running the Hack

First, we need a cron job [Hack #90] to run this script once every day. On some systems, you can simply place the script in /etc/cron.daily. If you don't have that option, then add something like this to your crontab file, which will tell cron to run our script every night at five minutes after midnight:

 5 0 * * *       /path/to/your/grabrank.pl 

Hacking the Hack

The graphs are not exactly pretty, so there are many possible improvements to be made, playing with intervals, colors, and so forth. If you look at the graph, you'll see that the way it is displayed is somewhat counterintuitive, because a low figure is a sign of a higher ranking. If we knew the exact Sales Rank of the worst-selling item at Amazon.com in advance, then we could simply subtract the rank of the day from that and create a graph that rose with a higher ranking. Not having the right numbers, it's going to take a few more calculations.
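One possible approach, which is our suggestion rather than part of the original hack, is to let rrdtool graph do the arithmetic with a CDEF, which evaluates an expression in reverse Polish notation. Multiplying the rank by -1 makes a better rank plot higher, at the cost of negative labels on the y-axis. The names inverted and graph_cmd are hypothetical, and the function only prints the command:

```shell
#!/bin/sh
# graph_cmd START END: print an rrdtool graph command that plots the
# negated rank. Remove the echo to run it for real.
graph_cmd() {
    echo rrdtool graph osxhacks.png --start "$1" --end "$2" \
        --imgformat PNG --units-exponent 0 \
        DEF:myrank=salesrank.rrd:rank:AVERAGE \
        'CDEF:inverted=myrank,-1,*' \
        'LINE1:inverted#FF0000:Mac OS X Hacks'
}

graph_cmd 1057241524 1059833524
```

The CDEF argument is quoted so that the shell does not try to expand the asterisk.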

If you want to graph more than one Sales Rank, there's not much to change, other than defining an extra data source when creating the database:

 DS:otherrank:GAUGE:86400:1:U 

And remember to add an extra DEF and LINE1 to the rrdtool graph command:

 DEF:myotherrank=salesrank.rrd:otherrank:AVERAGE
 LINE1:myotherrank#11EE11:"My other book"

Grabbing the extra data from Amazon.com is left as an exercise for the reader.
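Once the data is in hand, a database with two data sources takes both values in a single update, separated by colons and listed in the order the data sources were defined. The helper name update_arg and the second rank below are made up for illustration:

```shell
#!/bin/sh
# update_arg TIMESTAMP RANK OTHERRANK: build the argument for rrdtool update
# when the database holds two data sources, rank and otherrank, in that order.
update_arg() {
    echo "$1:$2:$3"
}

# The second rank (15200) is an invented value, not real Amazon.com data.
# Remove the echo to run the command for real.
echo rrdtool update salesrank.rrd "$(update_arg 1057241524 3689 15200)"
```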

Mads Toftum



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
