#90 Monitoring Network Status

One of the most puzzling administrative utilities in Unix is netstat, which is too bad, because it offers quite a bit of useful information about network throughput and performance. With the -s flag, netstat outputs volumes of information about each of the protocols supported on your computer, including TCP, UDP, IPv6, ICMP, IPsec, and more. Most of those protocols are irrelevant for a typical configuration; the protocol to examine is TCP. This script analyzes TCP traffic, determining the percentage of failures and including a warning if any values are out of bounds.

Analyzing network performance as a snapshot of long-term performance is useful, but a much better way to analyze data is with trends. If your system regularly has 1.5 percent packet loss in transmission, and in the last three days the rate has jumped up to 7.8 percent, a problem is brewing and needs to be analyzed in more detail.

As a result, Script #90 is in two parts. The first part is a short script that is intended to run every 10 to 30 minutes, recording key statistics in a log file. The second script parses the log file and reports typical performance and any anomalies or other values that are increasing over time.


Some flavors of Unix can't run this code as is! It turns out that there is quite a variation in the output format of the netstat command between Linux and Unix versions. This code works for Mac OS X and FreeBSD; the changes for other Unixes should be straightforward (check the log file to see if you're getting meaningful results to ascertain whether you need to tweak it).
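One way to act on that advice is a quick sanity check of a log entry. The sketch below assumes the field names that getstats writes (time, snt, re, and so on); check_logline is a hypothetical helper, not part of the original scripts.

```shell
#!/bin/sh
# check_logline - hypothetical helper, not part of getstats or netperf.
# Reports any getstats field that is missing or not a plain number.

check_logline()
{
  rc=0
  for field in time snt re rec dup oo creq cacc reto ; do
    # Split the entry on semicolons, then keep the text after "field=".
    value="$(printf '%s\n' "$1" | tr ';' '\n' | sed -n "s/^$field=//p")"
    case "$value" in
      ''|*[!0-9]*) echo "bad or missing value for $field" ; rc=1 ;;
    esac
  done
  return $rc
}

# A well-formed entry (copied from this section's sample log) passes:
check_logline "time=1063983000;snt=20364;re=24;rec=25022;dup=589;oo=1181;creq=582;cacc=17;reto=158" \
  && echo "entry looks sane"
```

To check the real log, pass in its last line: check_logline "$(tail -n 1 /var/log/netstat.log)". A field reported as bad usually means that the matching grep pattern in getstats does not fit the wording of your system's netstat output.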

The Code

    #!/bin/sh

    # getstats - Every 'n' minutes, grabs netstats values (via crontab).

    logfile="/var/log/netstat.log"
    temp="/tmp/getstats.tmp"

    trap "/bin/rm -f $temp" 0

    ( echo -n "time=$(date +%s);"

      netstat -s -p tcp > $temp

      # Each value is the leading number on the first matching netstat line.
      sent="$(grep 'packets sent' $temp | cut -d' ' -f1 | sed 's/[^[:digit:]]//g')"
      resent="$(grep 'retransmitted' $temp | cut -d' ' -f1 | sed 's/[^[:digit:]]//g')"
      received="$(grep 'packets received$' $temp | cut -d' ' -f1 | \
        sed 's/[^[:digit:]]//g')"
      dupacks="$(grep 'duplicate acks' $temp | cut -d' ' -f1 | \
        sed 's/[^[:digit:]]//g')"
      outoforder="$(grep 'out-of-order packets' $temp | cut -d' ' -f1 | \
        sed 's/[^[:digit:]]//g')"
      connectreq="$(grep 'connection requests' $temp | cut -d' ' -f1 | \
        sed 's/[^[:digit:]]//g')"
      connectacc="$(grep 'connection accepts' $temp | cut -d' ' -f1 | \
        sed 's/[^[:digit:]]//g')"
      retmout="$(grep 'retransmit timeouts' $temp | cut -d' ' -f1 | \
        sed 's/[^[:digit:]]//g')"

      echo -n "snt=$sent;re=$resent;rec=$received;dup=$dupacks;"
      echo -n "oo=$outoforder;creq=$connectreq;cacc=$connectacc;"
      echo "reto=$retmout"
    ) >> $logfile

    exit 0

The second script analyzes the netstat historical log file:

    #!/bin/sh

    # netperf - Analyzes the netstat running performance log, identifying
    #    important results and trends.

    log="/var/log/netstat.log"
    scriptbc="$HOME/bin/scriptbc"   # Script #9
    stats="/tmp/netperf.stats.$$"
    awktmp="/tmp/netperf.awk.$$"

    trap "/bin/rm -f $awktmp $stats" 0

    if [ ! -r $log ] ; then
      echo "Error: can't read netstat log file $log" >&2
      exit 1
    fi

    # First, report the basic statistics of the latest entry in the log file...

    eval $(tail -1 $log)    # all values turn into shell variables

    rep="$($scriptbc -p 3 $re/$snt\*100)"
    repn="$($scriptbc -p 4 $re/$snt\*10000 | cut -d. -f1)"
    repn="$(( $repn / 100 ))"
    retop="$($scriptbc -p 3 $reto/$snt\*100)"
    retopn="$($scriptbc -p 4 $reto/$snt\*10000 | cut -d. -f1)"
    retopn="$(( $retopn / 100 ))"
    dupp="$($scriptbc -p 3 $dup/$rec\*100)"
    duppn="$($scriptbc -p 4 $dup/$rec\*10000 | cut -d. -f1)"
    duppn="$(( $duppn / 100 ))"
    oop="$($scriptbc -p 3 $oo/$rec\*100)"
    oopn="$($scriptbc -p 4 $oo/$rec\*10000 | cut -d. -f1)"
    oopn="$(( $oopn / 100 ))"

    echo "Netstat is currently reporting the following:"

    echo -n "  $snt packets sent, with $re retransmits ($rep%) "
    echo "and $reto retransmit timeouts ($retop%)"
    echo -n "  $rec packets received, with $dup dupes ($dupp%)"
    echo " and $oo out of order ($oop%)"
    echo "   $creq total connection requests, of which $cacc were accepted"
    echo ""

    ## Now let's see if there are any important problems to flag

    if [ $repn -ge 5 ] ; then
      echo "*** Warning: Retransmits of >= 5% indicates a problem "
      echo "(gateway or router flooded?)"
    fi
    if [ $retopn -ge 5 ] ; then
      echo "*** Warning: Transmit timeouts of >= 5% indicates a problem "
      echo "(gateway or router flooded?)"
    fi
    if [ $duppn -ge 5 ] ; then
      echo "*** Warning: Duplicate receives of >= 5% indicates a problem "
      echo "(probably on the other end)"
    fi
    if [ $oopn -ge 5 ] ; then
      echo "*** Warning: Out of orders of >= 5% indicates a problem "
      echo "(busy network or router/gateway flood)"
    fi

    # Now let's look at some historical trends...

    echo "analyzing trends...."

    while read logline ; do
        eval "$logline"
        rep2="$($scriptbc -p 4 $re / $snt \* 10000 | cut -d. -f1)"
        retop2="$($scriptbc -p 4 $reto / $snt \* 10000 | cut -d. -f1)"
        dupp2="$($scriptbc -p 4 $dup / $rec \* 10000 | cut -d. -f1)"
        oop2="$($scriptbc -p 4 $oo / $rec \* 10000 | cut -d. -f1)"
        echo "$rep2 $retop2 $dupp2 $oop2" >> $stats
    done < $log

    echo ""

    # Now calculate some statistics, and compare them to the current values

    cat << "EOF" > $awktmp
        { rep += $1; retop += $2; dupp += $3; oop += $4 }
    END { rep /= 100; retop /= 100; dupp /= 100; oop /= 100;
          print "reps="int(rep/NR) ";retops=" int(retop/NR) \
             ";dupps=" int(dupp/NR) ";oops="int(oop/NR) }
    EOF

    eval $(awk -f $awktmp < $stats)

    if [ $repn -gt $reps ] ; then
      echo "*** Warning: Retransmit rate is currently higher than average."
      echo "    (average is $reps% and current is $repn%)"
    fi
    if [ $retopn -gt $retops ] ; then
      echo "*** Warning: Transmit timeouts are currently higher than average."
      echo "    (average is $retops% and current is $retopn%)"
    fi
    if [ $duppn -gt $dupps ] ; then
      echo "*** Warning: Duplicate receives are currently higher than average."
      echo "    (average is $dupps% and current is $duppn%)"
    fi
    if [ $oopn -gt $oops ] ; then
      echo "*** Warning: Out of orders are currently higher than average."
      echo "    (average is $oops% and current is $oopn%)"
    fi

    echo \(analyzed $(wc -l < $stats) netstat log entries for calculations\)

    exit 0

How It Works

The netstat program is tremendously useful, but its output can be quite intimidating. Here are just the first ten lines:

    $ netstat -s -p tcp | head
    tcp:
            36083 packets sent
                    9134 data packets (1095816 bytes)
                    24 data packets (5640 bytes) retransmitted
                    0 resends initiated by MTU discovery
                    19290 ack-only packets (13856 delayed)
                    0 URG only packets
                    0 window probe packets
                    6295 window update packets
                    1340 control packets

So the first step is to extract just those entries that contain interesting and important network performance statistics. That's the main job of getstats, and it does this by saving the output of the netstat command into the temp file $temp and going through $temp ascertaining key values, such as total packets sent and received. To ascertain the number of packets sent, for example, the script uses

    sent="$(grep 'packets sent' $temp | cut -d' ' -f1 | sed 's/[^[:digit:]]//g')"

The sed invocation removes any nondigit values to ensure that no spaces or tabs end up as part of the resultant value. Then all of the extracted values are written to the netstat.log log file in the format var1Name=var1Value; var2Name=var2Value; and so forth. This format will let us later use eval on each line in netstat.log and have all the variables instantiated in the shell.
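The round trip from netstat text to shell variables can be sketched as follows. The sample text reuses three lines of the netstat output shown earlier and assumes they are tab-indented, as on Mac OS X and FreeBSD; only two of the fields are extracted to keep the example short.

```shell
#!/bin/sh
# Sample netstat lines, rebuilt with tab indentation (an assumption about
# the BSD-style output format).
sample="$(printf 'tcp:\n\t36083 packets sent\n\t\t9134 data packets (1095816 bytes)\n\t\t24 data packets (5640 bytes) retransmitted\n')"

# Same pipeline as getstats: find the line, keep the first space-delimited
# field, then strip everything that is not a digit (here, the tabs).
sent="$(printf '%s\n' "$sample" | grep 'packets sent' | cut -d' ' -f1 | sed 's/[^[:digit:]]//g')"
resent="$(printf '%s\n' "$sample" | grep 'retransmitted' | cut -d' ' -f1 | sed 's/[^[:digit:]]//g')"

# Build a log entry in name=value;name=value form...
logline="snt=$sent;re=$resent"

# ...and let eval turn it back into shell variables, as netperf does.
eval "$logline"
echo "sent=$snt retransmitted=$re"     # prints: sent=36083 retransmitted=24
```

Because eval runs whatever text it is given, this approach is only safe as long as the log file contains nothing but entries written by getstats.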


The netperf script does the heavy lifting, parsing netstat.log and reporting both the most recent performance numbers and any anomalies or other values that are increasing over time.

Although the netperf script seems complex, once you understand the math, it's quite straightforward. For example, it calculates the current percentage of retransmits by dividing retransmits by packets sent and then multiplying this result by 100. An integer-only version of the retransmission percentage is calculated by taking the result of dividing retransmissions by total packets sent, multiplying it by 10,000, and then dividing by 100:

    rep="$($scriptbc -p 3 $re/$snt\*100)"
    repn="$($scriptbc -p 4 $re/$snt\*10000 | cut -d. -f1)"
    repn="$(( $repn / 100 ))"
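To check the integer math by hand, the same two steps can be reproduced with shell arithmetic alone. This is a sketch of an alternative technique, not what netperf does: it swaps the scriptbc call for $(( )) arithmetic, and the function names hundredths and int_percent are hypothetical. The sample figures come from the netperf output shown in The Results.

```shell
#!/bin/sh
# hundredths PART TOTAL: percentage in hundredths of a percent, truncated.
# This is the value netperf gets from "scriptbc -p 4 ... | cut -d. -f1".
hundredths()
{
  if [ "$2" -eq 0 ] ; then
    echo 0                      # guard against division by zero (an addition)
  else
    echo $(( $1 * 10000 / $2 ))
  fi
}

# int_percent PART TOTAL: whole-number percentage, as in the *pn variables.
int_percent()
{
  echo $(( $(hundredths "$1" "$2") / 100 ))
}

hundredths 1529 34423      # 1529 dupes of 34423 received: prints 444
int_percent 1529 34423     # prints 4
int_percent 158 25108      # 158 timeouts of 25108 sent: prints 0
```

Multiplying before dividing keeps everything in integers, so no decimal point ever needs to be cut off.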

As you can see, the naming scheme for variables within the script begins with the abbreviations assigned to the various netstat values, which are stored in netstat.log at the end of the getstats script:

    echo -n "snt=$sent;re=$resent;rec=$received;dup=$dupacks;"
    echo -n "oo=$outoforder;creq=$connectreq;cacc=$connectacc;"
    echo "reto=$retmout"

The abbreviations are snt, re, rec, dup, oo, creq, cacc, and reto. In the netperf script, the p suffix is added to any of these abbreviations for variables that represent decimal percentages of total packets sent or received. The pn suffix is added to any of the abbreviations for variables that represent integer-only percentages of total packets sent or received. Later in the netperf script, the ps suffix denotes a variable that represents the percentage summaries (averages) used in the final calculations.

The while loop steps through each entry of netstat.log, calculating the four key percentage values (rep2, retop2, dupp2, and oop2, which are retransmits, retransmit timeouts, duplicates, and out of order, respectively), each expressed in hundredths of a percent. All are written to the $stats temp file, and then the awk script sums each column in $stats and calculates average column values by dividing the sums by the number of records in the file (NR).

The following line in the script ties things together:

 eval $(awk -f $awktmp < $stats) 

The awk invocation is fed the set of summary statistics ($stats) produced by the while loop and utilizes the calculations saved in the $awktmp file to output variable=value sequences. These variable=value sequences are then incorporated into the shell with the eval statement, instantiating the variables reps, retops, dupps, and oops, which are average retransmits, average retransmit timeouts, average duplicate packets, and average out-of-order packets, respectively. The current percentage values can then be compared to these average values to spot problematic trends.
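The averaging step can be tried in isolation. In the sketch below, the three input rows are made-up values in the $stats layout (four columns, each in hundredths of a percent), the awk program is passed inline rather than through a temp file, and the current value duppn=5 is likewise hypothetical.

```shell
#!/bin/sh
# Hypothetical $stats contents: rep2 retop2 dupp2 oop2, one row per log entry.
stats_rows='10 60 300 100
20 60 400 200
30 63 500 300'

# Sum each column, convert hundredths to whole percent, average over NR rows,
# and emit the result as name=value pairs for eval.
averages="$(printf '%s\n' "$stats_rows" | awk '
      { rep += $1; retop += $2; dupp += $3; oop += $4 }
  END { print "reps=" int(rep/100/NR) ";retops=" int(retop/100/NR) \
              ";dupps=" int(dupp/100/NR) ";oops=" int(oop/100/NR) }')"

echo "$averages"           # prints: reps=0;retops=0;dupps=4;oops=2
eval "$averages"           # reps, retops, dupps, oops are now shell variables

duppn=5                    # hypothetical current integer percentage
if [ "$duppn" -gt "$dupps" ] ; then
  echo "duplicate receives above average ($duppn% vs $dupps%)"
fi
```

Note that int() truncates, so an average of 0.61 percent is reported as 0 percent; small rates only show up in the decimal figures at the top of the netperf report.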

Running the Script

For the netperf script to work, it needs information in the netstat log file. That information is generated by having a crontab entry that invokes getstats at regular intervals. On a modern Mac OS X, Unix, or Linux system, the following crontab entry will work fine:

    */15 * * * * /home/taylor/bin/getstats

It will produce a log file entry every 15 minutes. To ensure the necessary file permissions, it's best to actually create an empty log file by hand before running getstats for the first time:

    $ sudo touch /var/log/netstat.log
    $ sudo chmod a+rw /var/log/netstat.log

Now the getstats program should chug along happily, building a historical picture of the network performance of your system. To actually analyze the contents of the log file, run netperf without any arguments.

The Results

First off, let's check on the netstat.log file:

    $ tail -3 /var/log/netstat.log
    time=1063981801;snt=14386;re=24;rec=15700;dup=444;oo=555;creq=563;cacc=17;reto=158
    time=1063982400;snt=17236;re=24;rec=20008;dup=454;oo=848;creq=570;cacc=17;reto=158
    time=1063983000;snt=20364;re=24;rec=25022;dup=589;oo=1181;creq=582;cacc=17;reto=158

It looks good, so let's run netperf and see what it has to report:

    $ netperf
    Netstat is currently reporting the following:
      25108 packets sent, with 24 retransmits (0%) and 158 retransmit timeouts (.600%)
      34423 packets received, with 1529 dupes (4.400%) and 1181 out of order (3.400%)
       583 total connection requests, of which 17 were accepted

    analyzing trends....

    *** Warning: Duplicate receives are currently higher than average.
        (average is 3% and current is 4%)
    *** Warning: Out of orders are currently higher than average.
        (average is 0% and current is 3%)
    (analyzed 48 netstat log entries for calculations)

Hacking the Script

You've likely already noticed that rather than using a human-readable date format, the getstats script saves entries in the netstat.log file using epoch time, which represents the number of seconds that have elapsed since January 1, 1970. For example, 1,063,983,000 seconds represents a day in late September 2003.

The use of epoch time will make it easier to enhance this script by enabling it to calculate the time lapse between readings. If, for some odd reason, your system's date command doesn't have the %s option for reporting epoch time, there's a short C program you can install to report the epoch time on just about any system: http://www.intuitive.com/wicked/examples/epoch.c
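As a starting point for that enhancement, here is a minimal sketch that reports the seconds elapsed and the packets sent between consecutive log entries. The function name lapse_report is hypothetical, and the input lines are the three entries from the log excerpt in The Results.

```shell
#!/bin/sh
# lapse_report - hypothetical helper: reads getstats log entries on stdin
# and prints the time lapse and packets sent between consecutive readings.

lapse_report()
{
  prev_time="" ; prev_snt=""
  while read -r logline ; do
    eval "$logline"             # sets time, snt, re, and the other fields
    if [ -n "$prev_time" ] ; then
      echo "$(( $time - $prev_time )) seconds, $(( $snt - $prev_snt )) packets sent"
    fi
    prev_time="$time" ; prev_snt="$snt"
  done
}

printf '%s\n' \
  'time=1063981801;snt=14386;re=24;rec=15700;dup=444;oo=555;creq=563;cacc=17;reto=158' \
  'time=1063982400;snt=17236;re=24;rec=20008;dup=454;oo=848;creq=570;cacc=17;reto=158' \
  'time=1063983000;snt=20364;re=24;rec=25022;dup=589;oo=1181;creq=582;cacc=17;reto=158' |
  lapse_report
# prints:
#   599 seconds, 2850 packets sent
#   600 seconds, 3128 packets sent
```

Against the real log, the equivalent call would be lapse_report < /var/log/netstat.log. Dividing the packet difference by the time lapse would give a packets-per-second rate for each interval.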

Wicked Cool Shell Scripts. 101 Scripts for Linux, Mac OS X, and Unix Systems
ISBN: 1593270127
Year: 2004
Pages: 150
Authors: Dave Taylor
