8.6 Monitoring the Network


For most of us, networking-related tasks make up a large fraction of our system administration duties. Installing and configuring a network can be a daunting task, especially if you're starting from scratch. However, monitoring and managing the network on an ongoing basis can be no less daunting, especially for very large networks. Fortunately, there are a variety of tools to help with this job, ranging from simple single-host network status utilities to complex network monitoring and management packages. In this section, we'll take a look at representative examples of each type, thereby enabling you to select the approach and software that is appropriate for your site.

8.6.1 Standard Networking Utilities

We'll begin with the standard Unix commands designed for various network monitoring and troubleshooting tasks on the local system. Each command provides a specific type of network information and allows you to probe and monitor various aspects of network functionality. (We've already considered three such tools: ping and arp in Section 5.3 and nslookup in Section 8.1.5.2 earlier in this chapter).

The netstat command is the most general of these tools. It is used to monitor a system's TCP/IP network activity. It can provide some basic data about how much and what kind of network activity is currently going on, and also summary information for the recent past.

The specific output of the netstat command varies somewhat from system to system, although the basic information that it provides is the same. Moving from these generic examples to the format on your systems will be easy.

Without arguments, netstat lists all active network connections with the local host.[19] In this output, it is often useful to filter out lines containing "localhost" to limit the display to interesting data:

[19] Some versions of netstat also include data about Unix domain sockets in this report (omitted from the upcoming example).

# netstat | grep -v localhost
Active Internet connections
Proto Recv-Q Send-Q  Local Address  Foreign Address (state)
tcp        0    737  hamlet.1018    duncan.shell    ESTABLISHED
tcp        0      0  hamlet.1019    portia.shell    ESTABLISHED
tcp      348      0  hamlet.1020    portia.login    ESTABLISHED
tcp      120      0  hamlet.1021    laertes.login   ESTABLISHED
tcp      484      0  hamlet.1022    lear.login      ESTABLISHED
tcp     1018      0  hamlet.1023    duncan.login    ESTABLISHED
tcp        0      0  hamlet.login   lear.1023       ESTABLISHED

On this host, hamlet, there are currently two connections each to portia, lear, and duncan, and one connection to laertes. All but one of the connections (the final one, which comes in from lear) are outgoing: an address consisting of a hostname with a port number appended (rather than a service name) indicates the originating system for the connection.[20] The .login suffix indicates a connection made with rlogin or with rsh without arguments; the .shell suffix indicates a connection servicing a single command.

[20] Why is this? Connections on the receiving system use the defined port number for that service, and netstat is able to translate them into a service name like login or shell. The port on the transmitting end is just some arbitrary port without intrinsic meaning and so remains untranslated.

The Recv-Q and Send-Q columns indicate how much data is currently queued between the two systems via each connection. These numbers indicate current, pending data (in bytes), not the total amount transferred since the connection began. (Some versions of netstat do not provide this information and thus always display zeros in these columns.)

If you include netstat's -a option, the display will also include passive connections: network ports where a service is listening for requests. Here is an example from the output:

Proto Recv-Q Send-Q  Local Address  Foreign Address (state)
tcp        0      0  *:imap         *:*             LISTEN

Passive connections are characterized by the LISTEN keyword in the state column.
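For example, to get a quick list of the services the local host is listening for, you can combine the -a option with the filtering technique shown earlier (a simple sketch; adjust the pattern as needed for your netstat's output format):

# netstat -a | grep LISTEN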

The -i option is used to display a summary of the network interfaces on the system:

# netstat -i
Name    Mtu Network      Address      Ipkts      Opkts
lan0   1500 192.168.9.0  greta      2399283     932981
lo0    4136 127.0.0.0    loopback     15856      15856

This HP-UX system has one Ethernet interface named lan0. The output also gives the maximum transmission unit (MTU) size for each interface's local network and a count of the number of incoming and outgoing packets since the last boot. Some versions of this command also report error counts.

On most systems, you can follow the -i option with a time interval argument (in seconds) to obtain an entirely different display comparing network traffic and error and collision rates (in fact, -i is often optional). On Linux systems, substitute the -w option for -i.

Here is an example of this netstat report:

# netstat -i 5 | awk 'NR!=3 {print $0}'
     input   (en0)     output           input   (Total)   output
packets errs  packets errs colls  packets errs  packets errs colls
  47      0      66      0    0      47      0      66      0     0
 114      0     180      0    0     114      0     180      0     0
 146      0     227      0    0     146      0     227      0     0
  28      0      52      0    0      28      0      52      0     0
^C

This command displays network statistics every five seconds.[21] This sample output is in two parts: it includes two sets of input and output statistics. The left half of the table (the first five columns) shows the data for the primary network interface; the second half shows total values for all network interfaces on the system. On this system, like many others, there is only one interface, so the two sides of the table are identical.

[21] The awk command throws away the first line after the headers, which displays cumulative totals since the last reboot.

The input columns show data for incoming network traffic, and the output columns show data for outgoing traffic. The errs columns show the number of errors that occurred while transferring the indicated number of network packets. These numbers should be low, less than one percent of the number of packets. Larger values indicate serious network problems.

The colls column lists the number of collisions. A collision occurs when two hosts on the network try to send a packet within a few milliseconds of one another.[22] When this happens, each host waits a random amount of time before retrying the transmission; this method virtually eliminates repeated collisions by the same hosts. The number of collisions is a measure of how much network traffic there is, because the likelihood of a collision happening is directly proportional to the amount of network activity. Collisions are recorded only by transmitting hosts. On some systems, collision data isn't tracked separately but rather is merged in with the output errors figure.

[22] Remember that collisions occur only on CSMA/CD Ethernet networks; token ring networks, for example, don't have collisions.

On an average, well-behaved network using hubs or coax cable, the collision rate is low: just a few percent of the total traffic. You should start to become concerned when it rises above about five percent. Network segments using full-duplex switches should not see any collisions at all, and any amount of them indicates that the switch is overloaded.
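If you want an ongoing collision-rate figure rather than raw counts, you can postprocess the interval display with awk. The following is only a sketch: it assumes the column layout shown above (output packets in column 3, collisions in column 5 of the left half) and skips the header lines and the cumulative-totals line:

# netstat -i 5 | awk 'NR>3 && $3>0 {printf "collision rate: %.1f%%\n", 100*$5/$3}'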

The -s option displays useful statistics for each network protocol (cumulative since the last boot). Here is an example output section for the TCP protocol:

# netstat -s
...
Tcp:
    50 active connections openings
    0 passive connection openings
    0 failed connection attempts
    0 connection resets received
    3 connections established
    45172 segments received
    48365 segments send out
    1 segments retransmitted
    0 bad segments received
    3 resets sent

Some versions of netstat provide even more detailed per-protocol information.

netstat can also display the routing tables using its -r option. See Section 5.2 for a discussion of this mode.

Graphical utilities to display similar data are also becoming common. For example, Figure 8-6 illustrates some of the output generated by the ntop command, written by Luca Deri (http://www.ntop.org). When it is running, the command generates web pages containing the collected information.

Figure 8-6. Network traffic data produced by ntop
figs/esa3.0806.gif

The window on the left in the illustration depicts one of ntop's most useful displays. It shows incoming network traffic for the local system, broken down by origin. The various columns list average and peak data transmission rates for each one. A similar display for outgoing network traffic is also available. This information can be very useful in narrowing down network performance problems to the specific systems that are involved.

ntop provides many other tables and graphs of useful network data. For example, the pie chart on the right side of the figure illustrates the breakdown of network traffic by packet length.

As we've seen, the ping command is useful for basic network connectivity testing. It can also be useful for monitoring network traffic by observing the round trip time between two locations over time. The best way to do this is to tell ping to send a specific number of queries. The command format to do this varies by system:

AIX and HP-UX
ping host packet-size count
AIX, FreeBSD, Linux, and Tru64
ping -c count [-s packet-size] host
Solaris
ping -s host packet-size count

Here is an example from an AIX system:

# ping beulah 64 5
PING beulah: (192.168.9.84): 56 data bytes
64 bytes from 192.168.9.84: icmp_seq=0 ttl=255 time=1 ms
64 bytes from 192.168.9.84: icmp_seq=1 ttl=255 time=0 ms
64 bytes from 192.168.9.84: icmp_seq=2 ttl=255 time=0 ms
64 bytes from 192.168.9.84: icmp_seq=3 ttl=255 time=0 ms
64 bytes from 192.168.9.84: icmp_seq=4 ttl=255 time=0 ms
----beulah PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 5/5/6 ms

This command pings beulah 5 times, using the default packet size of 64 bytes. The summary at the bottom of the output displays the packet-loss statistics (here, none) and round-trip time statistics. Used in this way, ping can provide a quick measure of network performance, provided that you know what normal is for the connection in question.
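To track a connection over a longer period, you might wrap such a command in a simple loop and record just the summary lines. The following one-liner is only a sketch, assuming the -c form of ping and a five-minute sampling interval (adjust the tail count to match your system's output):

# while true; do date; ping -c 5 beulah | tail -2; sleep 300; done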

You can increase the packet size to a value greater than the MTU to force packet fragmentation (a value above 1500 is usually sufficient for Ethernet networks) and thereby use ping to monitor performance under those conditions.[23]

[23] The "ping of death" attacks (1998) consisted of fragmented ping packets that were too large for their memory buffer. When the packet was reassembled and the buffer overflowed, the system crashed.

The traceroute command (devised by Van Jacobson) is used to determine the route taken by network packets to arrive at their destination. It obtains this route information by a clever scheme that takes advantage of the packet's time-to-live (TTL) field, which specifies the maximum hops the packet can travel before being discarded. This field is automatically decremented by each gateway that the packet passes through. If its value reaches 0, the gateway discards the packet and returns a message back to the originating host (specifically, an ICMP time-exceeded message).

traceroute uses this behavior to identify each location in the route to the destination. It begins with a TTL of 1, so packets are discarded by the first gateway. traceroute then obtains the gateway address from the resulting ICMP message. After a fixed number of packets with TTL 1 (usually 3), the TTL is increased to 2. In the same way, this packet is discarded by the second gateway, whose identity can be determined by the resulting error message. The TTL is gradually increased in this way until a packet reaches the destination.
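You can observe the same mechanism by hand on many systems. For example, the Linux (iputils) version of ping lets you set the outgoing TTL with its -t option, so a single packet sent with a TTL of 1 should provoke an ICMP time-exceeded message from the first gateway along the route (a sketch, assuming a Linux system):

# ping -c 1 -t 1 www.fawc.org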

Here is an example of traceroute in action:

# traceroute www.fawc.org
traceroute to fawc.org (64.226.114.72), 30 hops max, 40 byte packets
 1  route129a.ycp.edu (208.192.129.2)  1.870 ms  1.041 ms  0.976 ms
 2  209.222.29.105 (209.222.29.105)  3.345 ms  3.929 ms  3.524 ms
 3  Serial2-2.GW4.BWI1.ALTER.NET (157.130.25.173)  9.155 ms ...
 4  500.at-0-1-0.XL2.DCA8.ALTER.NET (152.63.42.94)  8.316 ms ...
 5  0.so-0-0-0.TL2.DCA6.ALTER.NET (152.63.38.73)  9.931 ms ...
 6  0.so-7-0-0.TL2.ATL5.ALTER.NET (152.63.146.41)  24.248 ms ...
 7  0.so-4-1-0.XL2.ATL5.ALTER.NET (152.63.146.1)  25.320 ms ...
 8  0.so-7-0-0.XR2.ATL5.ALTER.NET (152.63.85.194)  24.330 ms ...
 9  192.ATM7-0.GW5.ATL5.ALTER.NET (152.63.82.13)  26.824 ms ...
10  interland1-gw.customer.alter.net (157.130.255.134)  24.498 ms ...
11  * * *      No messages received from these hosts.
12  * * *
13  64.224.0.67 (64.224.0.67)  24.937 ms  25.155 ms  24.738 ms
14  64.226.114.72 (64.226.114.72)  26.058 ms  24.587 ms  26.677 ms

Each numbered line corresponds to a successive gateway in the route, and each line displays the hostname (when available), IP address, and the round-trip times for each of the three packets (I've truncated long lines to fit). This particular route spent quite a bit of time traveling inside alter.net.

Sometimes, routers or firewalls drop ICMP packets or fail to send error messages. These situations result in lines like 11 and 12, where three asterisks indicate that the gateway could not be identified. Other lines may also contain asterisks for similar reasons. Occasionally, the successive outgoing packets take different routes to the destination, and different intermediate gateway data is returned. In such cases, all of the gateways are listed.

NOTE

figs/armadillo_tip.gif

Both traceroute and netstat provide a -n option, which specifies that the output contain IP addresses only (no hostname resolution is attempted). These options are useful for determining network information when DNS name resolution is not working or is unavailable.

8.6.2 Packet Sniffers

Packet sniffers provide a means for examining network traffic on an individual packet basis. They can be invaluable for troubleshooting problems related to a specific network operation, such as a client-server application, rather than general network connectivity issues. They can also be abused, of course, and used for eavesdropping purposes. For this reason, they must be run as root.

The free tcpdump utility is the best-known tool of this type (it was originally written by Van Jacobson, Craig Leres, and Steven McCanne and is available from http://www.tcpdump.org). It is provided with the operating system by many vendors (all but HP-UX and Solaris, in our case) but can be built for these systems as well. (Solaris provides the snoop utility instead, which we'll discuss later in this subsection.)

tcpdump allows you to examine the headers of TCP/IP packets. For example, the following command displays the headers for all traffic involving host romeo (some initial and trailing output columns have been stripped off to save space):

# tcpdump -e -t host romeo
arp 42: arp who-has spain tell romeo
arp 60: arp reply spain is-at 03:05:f3:a1:74:e3
ip 58: romeo.1014 > spain.login: S 27643873:27643873(0) win 16384
ip 60: spain.login > romeo.1014: S 19898809:19898809(0) ack 27643874 win 14335
ip 54: romeo.1014 > spain.login: . ack 1 win 15796
ip 55: romeo.1014 > spain.login: P 1:2(1) ack 1 win 15796
ip 60: spain.login > romeo.1014: . ack 2 win 14334
ip 85: romeo.1014 > spain.login: P 2:33(31) ack 1 win 15796
ip 60: spain.login > romeo.1014: . ack 33 win 14303
ip 60: spain.login > romeo.1014: P 1:2(1) ack 33 win 14335
...
ip 60: spain.login > romeo.1014: F 177:177(0) ack 54 win 14335
ip 54: romeo.1014 > spain.login: . ack 178 win 15788
ip 54: romeo.1014 > spain.login: F 54:54(0) ack 178 win 15796
ip 60: spain.login > romeo.1014: . ack 55 win 14334

This output displays the protocol and packet length, followed by the source and destination hosts and ports. For TCP packets, this information is followed by the TCP flags (a period or one or more uppercase letters), ack plus the acknowledgement sequence number, and win plus the contents of the TCP window size field. Note that the literal sequence numbers are displayed only in the first packet in each direction; after that, relative numbers are used to improve readability.

So what good is this output? You can monitor the progress of a TCP/IP operation (the packets that are displayed can be specified in a number of ways); here we see the initial connection and final termination of an rlogin connection from romeo to spain. You can also monitor how network traffic is affecting connections of interest by observing the values in the window field. This field specifies the amount of data, in bytes, that the host sending the packet is willing to accept in future packets. The window field thus serves as the TCP flow-control mechanism: a host reduces the value it places there when it is congested or overloaded (it can even use a value of 0 to temporarily halt incoming transmissions). In our example, there are no congestion problems on either host.
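Because pcap filter expressions can also test header fields by byte offset, you can use tcpdump to watch for signs of congestion directly. The TCP window field occupies bytes 14 and 15 of the TCP header, so a command along these lines (a sketch; the threshold is arbitrary) displays only packets advertising a small window:

# tcpdump 'tcp[14:2] < 1024'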

tcpdump can also be used to display the contents of TCP/IP packets, using its -X option, which displays packet data in hex and ASCII. For example, this command displays the packet data from packets sent from mozart to salieri:

# tcpdump -X -s 0 src mozart and dst salieri
...
0x0000   4510 0053 dd9e 4000 3c06 cbe8 c100 0935  E..S..@.<......5
0x0010   c100 09d8 0201 03fd 1ead 846c c70d c3d6  ...........l....
0x0020   5018 f000 6e99 0000 4672 6920 4d61 7220  P...n...Fri.Mar.
0x0030   2031 2030 393a 3438 3a32 3120 4553 5420  .1.09:48:21.EST.
0x0040   3230 3032 0d0a 6d61 686c 6572 2d32 3032  2002..mozart-202
0x0050   3e3e                                     >>

The output shows only one packet. It contains the current date and time and the initial prompt after a successful rlogin command from salieri to mozart.

The -s 0 option tells tcpdump to increase the number of bytes of data that are dumped from each packet to whatever limit is required to display the entire packet (the default is usually 60 to 80).

We've now seen two examples of the arguments to tcpdump, which consist of an expression specifying the packets to be displayed. In other words, the expression functions as a filter on incoming packets. A variety of keywords are defined for this purpose, and logical connectors are provided for creating complex conditions, as in this example:

# tcpdump src \( mozart or salieri \) and tcp port 21 and not dst vienna

The expression in this command selects packets from mozart or salieri using TCP port 21 (the FTP control port) that are not destined for vienna.

NOTE

figs/armadillo_tip.gif

You can save packets to a file rather than displaying them immediately by using the -w option. The -r option then reads packets back from such a file rather than capturing current network traffic.

A few vendor-provided versions of tcpdump have some eccentricities:

  • The AIX version does not provide the -X option (although you can dump packets in hex with -x). I recommend replacing it with the latest version from http://www.tcpdump.org if you need to examine packet contents.

  • Tru64 requires that the kernel be compiled with packet filtering enabled (via the options PACKETFILTER directive). You must also create the pfilt device (interface):

        # cd /dev; MAKEDEV pfilt

    Finally, you must configure the interface to allow tcpdump to set it to promiscuous mode and to access the frame headers:

        # pfconfig +p +c network-interface

It is often useful to pipe the output of tcpdump to grep to further refine the displayed output. Alternatively, you can use the ngrep command (written by Jordan Ritter, http://www.packetfactory.net/projects/ngrep/) which builds grep functionality into a packet filter utility. For an example of using ngrep, see Section 6.6.
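For example, a command like the following (a sketch) displays only the packets belonging to the rlogin connection we examined earlier; tcpdump's -l option forces line-buffered output so that the pipeline shows packets as they arrive:

# tcpdump -l host romeo | grep 'spain.login'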

8.6.2.1 The Solaris snoop command

The Solaris snoop command is essentially equivalent to tcpdump, although I find its output is more convenient and intuitive. Here is an example of its use:

# snoop src bagel and dst acrasia and port 23
Using device /dev/eri (promiscuous mode)
      bagel -> acrasia      TELNET C port=32574 a
      bagel -> acrasia      TELNET C port=32574
      bagel -> acrasia      TELNET C port=32574 e
      bagel -> acrasia      TELNET C port=32574
      bagel -> acrasia      TELNET C port=32574 f
      bagel -> acrasia      TELNET C port=32574
      bagel -> acrasia      TELNET C port=32574 r
      bagel -> acrasia      TELNET C port=32574
      bagel -> acrasia      TELNET C port=32574 i
      bagel -> acrasia      TELNET C port=32574
      bagel -> acrasia      TELNET C port=32574 s
      bagel -> acrasia      TELNET C port=32574
      bagel -> acrasia      TELNET C port=32574 c
      bagel -> acrasia      TELNET C port=32574
      bagel -> acrasia      TELNET C port=32574 h

As this example illustrates, the snoop command accepts the same expressions as tcpdump for use in filtering the packets to display. This output displays a portion of the login sequence from a telnet session. The data from the packet is displayed to the right of the header information; here we see the login name that was entered.

snoop has several useful options, as illustrated in these examples:

# snoop -o file -q          Save packets to a file.
# snoop -i file             Read packets from a file.
# snoop -v [-p n]           Display packet details (for packet n).
8.6.2.2 Packet collecting under AIX and HP-UX

HP-UX's nettl facility and AIX's iptrace and ipreport utilities are general-purpose packet collection packages. They both collect network packet data into a binary file and can display specified information from such files in an easy-to-read format. They have the advantage that data collection is fundamentally decoupled from its display.

The specific data to save is highly configurable, and data collection occurs automatically via a network daemon or cron job. This allows the facilities to gather and accumulate a body of network information which can be used for troubleshooting and performance analysis. In addition, ad hoc filtering can take place afterwards, allowing for much more complex reporting.
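As a very rough sketch of the AIX workflow (consult the iptrace and ipreport manual pages for the many filtering options and for the preferred way to stop the trace daemon on your system):

# iptrace /tmp/ipt.bin                     Start collecting packets into a binary file.
  ...let it run for a while, then kill the iptrace daemon...
# ipreport /tmp/ipt.bin > /tmp/ipt.txt     Format the collected data for viewing.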

8.6.3 The Simple Network Management Protocol

The tools discussed in the previous subsection can be very useful for examining network operations and/or traffic for one or two systems. However, you'll eventually want to examine network traffic and other data in the context of the network as a whole, moving beyond the point of view of any single system. Much more elaborate tools are needed for this task. We will consider several examples of such packages in the next section. To understand how they work, however, we'll need to consider the Simple Network Management Protocol (SNMP), the network service that underlies a large part of the functionality of most network management programs. We'll begin with a brief look at SNMP's fundamental concepts and data structures and then go on to the practicalities of using it on Unix systems. Finally, we'll discuss some security issues that must be resolved when using SNMP.

For a more extended treatment of SNMP, I recommend Essential SNMP by Douglas Mauro and Kevin Schmidt (O'Reilly & Associates).

8.6.3.1 SNMP concepts and constructs

SNMP was designed to be a consistent interface for both gathering data from and setting parameters of various network devices. The managed devices can range from switches and routers to network hosts (computers) running almost any operating system. SNMP succeeds in doing this reasonably well, once you have it configured and running everywhere you need it. The hardest part is getting used to its somewhat counterintuitive terminology, which I'll attempt to decode in this section.

SNMP has been around for a while, and there are many versions of it (including several flavors of Version 2). The ones that are implemented currently are Version 1 and Version 2c. There is also a Version 3 in development as of this writing. We will address version-specific issues when appropriate.

Figure 8-7 illustrates a basic SNMP setup. In this picture, one computer is the Network Management Station (NMS). Its job is to collect and act on information from the various devices being monitored. The latter are grouped on the right side of the figure and include two computers, a router, a network printer, and an environmental monitoring device (these are only a part of the range of devices that support SNMP).

Figure 8-7. SNMP manager and agents
figs/esa3.0807.gif

In the simplest case, the NMS periodically polls the devices it is managing, sending queries for the devices' current status information. The devices respond by transmitting the requested data. In addition, monitored devices can also send traps: unsolicited messages to the NMS, usually generated when the value of some monitored parameter falls out of the acceptable range. For example, an environmental monitoring device may send a trap when the temperature or humidity is too low or too high.

The term manager is used to refer both to the monitoring software running on the NMS as well as the computer (or other device) running the software. Similarly, the term agent refers to the software used by the monitored devices to generate and transmit their status data, but it is also used more loosely to refer to the device being monitored. Clearly, SNMP is a client-server protocol, but its usage of "client" and "server" is reversed from the typical usage: the local manager functions as the client, and the remote agents function as servers. This is similar to the terms' usage in the X Window system: X clients on remote hosts are displayed by the X server on the local host. SNMP messages use TCP and UDP ports 161, and traps use TCP and UDP ports 162. Some vendors use additional ports for traps (e.g., Cisco uses TCP and UDP ports 1993).

For an SNMP manager to communicate with an agent, the manager must be aware of the various data values that the agent keeps track of. The names and contents of these data values are defined in one or more Management Information Bases (MIBs). A MIB is just a collection of value/property definitions whose names are arranged into a standard hierarchy (tree structure). A MIB is not a database but rather a schema. A MIB does not hold any data values; it is simply a definition of the data values that are being monitored and that may be queried or modified. These data definitions and naming conventions are used internally by the SNMP agent software, and they are also stored in text files for use by SNMP managers. MIBs may be standard (and implemented by every agent) or proprietary, describing data values specific to a manufacturer and possibly to a device class.

This will become clearer when we look at an actual data value name. Consider this one:

iso.org.dod.internet.mgmt.mib-2.system.sysLocation = "Dabney Alley 6 Closet"

The name of this data value is the long, italicized string on the left of the equal sign. The various components of the name separated by periods correspond to different levels of the MIB tree (starting with iso at the top). Thus, sysLocation is eight levels deep within the hierarchy. The tree structure is used to group related data values together. For example, the system group defines various data items that relate to the overall system (or device), including its name, physical location (sysLocation), and primary contact person. As this example indicates, not all SNMP data need be dynamic.

Figure 8-8 illustrates the overall SNMP namespace hierarchy. The top levels of the tree exist mainly for historical reasons, and most data resides in the mgmt.mib-2 and private.enterprises subtrees. The former implements what is now the standard MIB, named MIB II (it is an enhancement to the original standard), and it has a large number of items under it. Only two of its direct children are included in the illustration: system, which holds general information about the device, and host, which holds data related to computer systems. Other important children of mib-2 are interfaces (network interfaces); ip, tcp, and udp (protocol-specific data); and snmp (SNMP traffic data). Note that all names within the MIB are case-sensitive. Clearly, not all parts of the hierarchy apply to all devices, and only the relevant portions are implemented by most agents.

Figure 8-8. General SNMP MIB hierarchy
figs/esa3.0808.gif

The highlighted items in the figure are leaf nodes that actually contain data values. Here, we see the system location description, the current number of processes on the system, and the system load average (moving from left to right).

Each of the points within the MIB hierarchy has both a name and a number associated with it. The numbers for each item are also given in the figure. You can refer to a data point by either name or number. For example, iso.org.dod.internet.mgmt.mib-2.system.sysLocation can also be referred to as 1.3.6.1.2.1.1.6. Similarly, the laLoad data item can be specified as iso.org.dod.internet.private.enterprises.ucdavis.laTable.laEntry.laLoad and as 1.3.6.1.4.1.2021.10.1.3. Each of these name types is known generically as an OID (object ID). Usually, only the name of the final node (sysLocation or laLoad) is needed to refer to a data point, but occasionally the full version of the OID must be specified (as we'll see).

The private.enterprises portion of the MIB tree contains vendor-specific data definitions. Each organization that has applied for one is assigned a unique identifier under this node; the ones corresponding to the vendors of our operating systems, U.C. Davis, and Cisco are pictured. For a listing of all assigned numbers, see ftp://ftp.isi.edu/in-notes/iana/assignments/enterprise-numbers/. You can request a number for your organization from the Internet Assigned Numbers Authority (IANA) at http://www.iana.org/cgi-bin/enterprise.pl.

The ucdavis subtree is important for Linux and FreeBSD systems, because the open source Net-SNMP package is what is used on these systems. This package was developed at U.C. Davis for a long time (and at Carnegie Mellon University before that), and this is the enterprise-specific subtree that applies to open source SNMP agents. The package is available for all the operating systems we are considering.

Another important MIB is the remote monitoring MIB, RMON. This MIB defines a set of generic network statistics. It is designed to allow data collection from a series of autonomous probes positioned around the network, which ultimately transmit summary data to a central manager. Probe capabilities are supported by many current routers, switches, and other network devices. Placing probes at strategic points throughout a WAN can greatly reduce the network traffic required to monitor performance across the entire network, because raw data collection is limited to the probes and only summary data travels to the (possibly distant) NMS.

Access to SNMP data is controlled by passwords called community names (or strings). There are generally separate community names for the agent's read-only and read-write modes, as well as an additional name used with traps. Each SNMP agent knows its name (i.e., password) for each mode and will not answer queries that specify anything else. Community names can be up to 32 characters long and should be chosen using the same security considerations as root passwords. We'll discuss other security implications of community names a bit later.

Unfortunately, many devices are delivered with SNMP enabled, using the default read-only community string public and sometimes the default read-write community string private. It is imperative that you change these values before the device is placed on the network (or that you disable SNMP for the device). Otherwise, you immediately place the device at risk of hijacking and tampering by attackers, and its vulnerability can put other parts of your network at risk as well.

The procedure for changing this value varies by device. For hosts, you change it in the configuration file associated with the SNMP agent. For other types of devices, such as routers, consult the documentation provided by the manufacturer.

In contrast to the relative complexity of the data definitions, the set of SNMP operations that monitor and manage devices is quite limited, consisting of get (to request a value from a device), set (to specify the value of a modifiable device parameter), and trap (to send a trap message to a specified manager). In addition, there are a few variations on these basic operations, such as get-next, which requests the next data item in the MIB hierarchy. We'll see the operations in action in the next subsection.

8.6.3.2 SNMP implementations

The commercial Unix operating systems we are considering all provide an SNMP agent, implemented as a single daemon or a series of daemons. In addition, the Net-SNMP package provides SNMP functionality for Linux, FreeBSD, and other free operating systems. It can also be used with commercial Unix systems that do not provide SNMP support.

AIX and Net-SNMP also provide some simple utilities for performing client operations. The utilities from the latter may also be built and used for systems providing their own SNMP agent.

Table 8-10 lists the various components of the SNMP packages provided by and available to the various operating systems we are considering.

Table 8-10. SNMP components

Component: Insecure agent running after initial OS install?
    AIX           yes
    HP-UX         yes
    Net-SNMP[24]  no
    Solaris       yes
    Tru64         yes

Component: Primary agent daemon
    AIX           /usr/sbin/snmpd
    HP-UX         /usr/sbin/snmpdm
    Net-SNMP      /usr/local/sbin/snmpd; /usr/sbin/snmpd (SuSE Linux)
    Solaris       /usr/lib/snmp/snmpdx
    Tru64         /usr/sbin/snmpd

Component: Agent configuration file(s)
    AIX           /etc/snmpd.conf
    HP-UX         /etc/SnmpAgent.d/snmpd.conf
    Net-SNMP      /usr/local/share/snmp/snmpd.conf; /usr/share/snmp/snmpd.conf (SuSE Linux)
    Solaris       /etc/snmp/conf/snmpdx.* and /etc/snmp/conf/snmpd.conf
    Tru64         /etc/snmpd.conf

Component: MIB files
    AIX           /etc/mib.defs
    HP-UX         /etc/SnmpAgent.d/snmpinfo.dat; /opt/OV/snmp_mibs/* (OpenView)
    Net-SNMP      /usr/share/snmp/mibs/*
    Solaris       /var/snmp/mib/*
    Tru64         /usr/examples/esnmp/*

Component: Enterprise number(s)
    AIX           2 (ibm), 4 (unix)
    HP-UX         11 (hp)
    Net-SNMP      2021 (ucdavis)
    Linux         Red Hat: 3212; SuSE: 7057
    Solaris       42 (sun)
    Tru64         36 (dec), 232 (compaq)

Component: Management/monitoring package
    AIX           Tivoli
    HP-UX         OpenView
    Solaris       Solstice Enterprise Manager

Component: Boot script that starts the SNMP agent(s)
    AIX           /etc/rc.tcpip
    FreeBSD       /etc/rc (add command manually)
    HP-UX         /sbin/init.d/Snmp*
    Linux         /etc/init.d/snmpd
    Solaris       /etc/init.d/init.snmpdx
    Tru64         /sbin/init.d/snmpd

Component: Boot script configuration file: relevant entries
    Usual         none used
    HP-UX         /etc/rc.config.d/Snmp*: SNMP_*_START=1
    Linux         SuSE 7: /etc/rc.config: START_SNMPD="yes"

[24] Net-SNMP is used on FreeBSD and Linux systems.

We'll consider some of the specifics for the various operating systems a bit later in this section.

8.6.3.3 Net-SNMP client utilities

Unlike most implementations, the Net-SNMP package includes several useful utilities that can be used to query SNMP devices. You can build these tools for most operating systems even when they provide their own SNMP agent, so we'll consider them in some detail in this section. In addition, reading these examples will provide you with a greater understanding of how SNMP works, regardless of the specific implementation.

The first tool we'll consider is snmptranslate, which provides information about the MIB structure and its entities (but does not display any actual data). Table 8-11 lists the most useful snmptranslate commands.

Table 8-11. Useful snmptranslate commands

Purpose                                      Command
Display MIB subtree                          snmptranslate -Tp .oid[25]
Text description for OID                     snmptranslate -Td .oid[25]
Show full OID name (mib-2 subtree only)      snmptranslate -IR -On name
Translate OID name to number                 snmptranslate -IR name
Translate OID number to name                 snmptranslate -On .number[25]

[25] Absolute OIDs (numeric or text) are preceded by a period.

As an example, we'll define an alias (using the C shell) which takes a terminal leaf entry name (in the mib-2 tree) as its argument and then displays the definition for that item, including its full OID string and numeric equivalent. Here is the alias definition:

% alias snmpwhat 'snmptranslate -Td  `snmptranslate -IR -On \!:1`'

The alias uses two snmptranslate commands. The one in back quotes finds the full OID for the specified name (substituted in via !:1). Its output becomes the argument of the second command, which displays the description for this data item.

Here is an example using the alias which shows the description for the sysLocation item we considered earlier:

% snmpwhat sysLocation
.1.3.6.1.2.1.1.6
sysLocation OBJECT-TYPE
  -- FROM       SNMPv2-MIB, RFC1213-MIB
  -- TEXTUAL CONVENTION DisplayString
  SYNTAX        OCTET STRING (0..255)
  DISPLAY-HINT  "255a"
  MAX-ACCESS    read-write
  STATUS        current
  DESCRIPTION   "The physical location of this node (e.g.,
                 'telephone closet, 3rd floor'). If the location
                 is unknown, the value is the zero-length string."
::= { iso(1) org(3) dod(6) internet(1) mgmt(2) mib-2(1) system(1) 6 }
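The alias syntax above is specific to the C shell. For Bourne-style shells such as bash, a roughly equivalent function might look like this (a sketch):

snmpwhat () { snmptranslate -Td "$(snmptranslate -IR -On "$1")"; }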

Other forms of the snmptranslate command provide related information.

The snmpget command retrieves data from an SNMP agent. For example, the following command displays the value of sysLocation from the agent on beulah, specifying the community string as somethingsecure:

# snmpget beulah somethingsecure sysLocation.0
system.sysLocation.0 = "Receptionist Printer"

The specified data location is followed by an instance number, which is used to specify the row number within tables of data. For values not in tables (scalars), it is always 0.

For tabular data, indicated by an entry named somethingTable within the OID, the instance number is the desired table element. For example, this command retrieves the 5-minute load average value, because the 1-, 5-, and 15-minute load averages are stored in the successive rows of the enterprises.ucdavis.laTable (as defined in the MIB):

# snmpget beulah somethingsecure laLoad.2
enterprises.ucdavis.laTable.laEntry.laLoad.2 = 1.22

The snmpwalk command displays the entire subtree underneath a specified node. For example, this command displays all data values under iso.org.dod.internet.mgmt.mib-2.host.hrSystem:

# snmpwalk beulah somethingsecure host.hrSystem
host.hrSystem.hrSystemUptime.0 = Timeticks: (31861126) 3 days, 16:30:11.26
host.hrSystem.hrSystemDate.0 = 2002-2-8,11:5:4.0,-5:0
host.hrSystem.hrSystemInitialLoadDevice.0 = 1536
host.hrSystem.hrSystemInitialLoadParameters.0 =
  "auto BOOT_IMAGE=linux ro root=2107
   BOOT_FILE=/boot/vmlinuz enableapic vga=0x0314."
host.hrSystem.hrSystemNumUsers.0 = Gauge32: 1
host.hrSystem.hrSystemProcesses.0 = Gauge32: 205
host.hrSystem.hrSystemMaxProcesses.0 = 0

The format of each output line is:

OID = [datatype:] value

If you're curious what all these items are, use snmptranslate to get their full descriptions.

Finally, the snmpset command can be used to modify writable data values, as in this command, which changes the device's primary contact (the s parameter indicates a string data type):

# snmpset beulah somethingelse sysContact.0 s "chavez@ahania.com"
system.sysContact.0 = chavez@ahania.com

Other useful data types are i for integer, d for decimal, and a for IP address (see the manual page for the entire list).

8.6.3.3.1 Generating traps

The Net-SNMP package includes the snmptrap command for manually generating traps. Here is an example of its use, which also illustrates the general characteristics of traps:

# snmptrap -v2c dalton anothername '' .1.3.6.1.6.3.1.1.5.4 \
    ifIndex i 2 ifAdminStatus i 1 ifOperStatus i 1

The -v2c option indicates that an SNMP version 2c trap is to be sent (technically, version 2 traps are called notifications). The next two arguments are the destination (manager) and community name. The next argument is the device uptime, and it is required for all traps. Here, we specify a null string, which defaults to the current uptime. The final argument in the first line is the trap OID; these OIDs are defined in one of the MIBs used by the device. This one corresponds to the linkUp trap (as defined in the IF-MIB), defined as a network interface changing state.

The remainder of the arguments (starting with ifIndex) are determined by the specific trap being sent. This one requires the interface number and its administrative and operational statuses, each specified via a keyword-data type-value triple (these particular data types are all integer). In this case, the trap specifies interface 2. A status value of 1 indicates that the interface is up, so this trap is a notification that it has come back online after being down.

Here is the syslog message that might be generated by this trap:

Feb 25 11:44:21 beulah snmptrapd[8675]: beulah.local[192.168.9.8]:
  Trap system.sysUpTime.0 = Timeticks: (144235905) 6 days, 06:39:19,
  .iso.org.dod.internet.snmpV2.snmpModules.snmpMIB.snmpMIBObjects.snmpTrap.snmpTrapOID.0
    = OID: 1.1.5.4,
  interfaces.ifTable.ifEntry.ifIndex = 2,
  interfaces.ifTable.ifEntry.ifAdminStatus = up(1),
  interfaces.ifTable.ifEntry.ifOperStatus = up(1)

SNMP-managed devices generally come with predefined traps that you can sometimes enable/disable during configuration. Some agents are also extensible and allow you to define additional traps.

8.6.3.3.2 AIX and Tru64 clients

AIX also provides an SNMP client utility, snmpinfo. Here is an example of its use:

# snmpinfo -c somethingsecure -h beulah -m get sysLocation.0
system.sysLocation.0 = "Receptionist Printer"

The -c and -h options specify the community name and host for the operation, respectively. The -m option specifies the SNMP operation to be performed (here, get); other possibilities are next and set.

Here is the equivalent command as it would be run on a Tru64 system:

# snmp_request beulah somethingsecure get 1.3.6.1.2.1.1.6.0

Yes, it really does require the full OID. The third argument specifies the SNMP operation, and other keywords used there are getnext, getbulk and set.

8.6.3.4 Configuring SNMP agents

In this section, we'll look at the configuration file for each of the operating systems.

8.6.3.4.1 Net-SNMP snmpd daemon (FreeBSD and Linux)

FreeBSD and Linux systems use the Net-SNMP package (http://www.net-snmp.org), also previously known as UCD-SNMP. The package provides both a Unix host agent (the snmpd daemon) and a series of client utilities.

On Linux systems, this daemon is started with the /etc/init.d/snmp boot script and uses the /usr/local/share/snmp/snmpd.conf configuration file by default.[26] On FreeBSD systems, you must add a command like the following to one of the system boot scripts (e.g., /etc/rc):

[26] Be aware that the RPMs provided with recent SuSE operating systems use the /etc/ucdsnmpd.conf configuration file instead, although you can change this by editing the boot script. The canonical configuration file location under SuSE is also different: /usr/share/snmp.

/usr/local/sbin/snmpd -L -A

The options tell the daemon to send log messages to standard output and standard error instead of to a file. You can also specify an alternate configuration file using the -c option.

Here is a sample Net-SNMP snmpd.conf file:

# snmpd.conf
rocommunity    somethingsecure
rwcommunity    somethingelse
trapcommunity  anothername
trapsink       dalton.ahania.com
trap2sink      dalton.ahania.com
syslocation   "Building 2 Main Machine Room"
syscontact    "chavez@ahania.com"
# Net-SNMP-specific items: conditions for error flags
#keyw  [args]  limit(s)
load   5.0 6.0 7.0            1-, 5-, and 15-minute load average maximums.
disk   /  3%                  Root filesystem below 3% free.
proc   portmap  1  1          Must be exactly one portmap process running.
proc   cron     1  1          Require exactly one cron process.
proc   sendmail               Require at least one sendmail process.

The first three lines of the file specify the community name for accessing the agent in read-only and read-write mode and the name that will be used when it sends traps (which need not be a distinct value as above). The next two lines specify the trap destination for SNMP version 1 and version 2 traps; here it is host dalton.

The next section specifies the values of two MIB II variables, describing the location of the device and its primary contact. They are both located under mib-2.system.

The final section defines some Net-SNMP-specific monitoring items. These items check for a 1-, 5-, or 15-minute load average above 5.0, 6.0, or 7.0 (respectively), whether the free space in the root filesystem has dropped below 3%, and whether the portmap, cron, and sendmail daemons are running. When the corresponding value falls outside of the allowed range, the SNMP daemon sets the corresponding error flag data value under enterprises.ucdavis for the table row corresponding to the specified monitoring item: laTable.laEntry.laErrorFlag, dskTable.dskEntry.dskErrorFlag, and prTable.prEntry.prErrorFlag, respectively. Note that traps are not generated.
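You can check the resulting flags with the same client tools we used earlier. For example, a command along these lines (a sketch reusing the hostname and community string from the previous examples) retrieves the error flags for the first row of the load average, disk, and process tables; a value of 0 means the condition is within bounds, and 1 means it has been violated:

# snmpget beulah somethingsecure laErrorFlag.1 dskErrorFlag.1 prErrorFlag.1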

NOTE

figs/armadillo_tip.gif

You can also use the command snmpconf -g to configure a snmpd.conf file. Add the -i option if you want the command to automatically install the new file into the proper directory (rather than placing it in the current directory).

8.6.3.4.2 Net-SNMP access control

The community definition entries introduced above also have a more complex form in which they accept additional parameters to specify access control. For example, the following command defines the read-write community as localonly for the 192.168.10.0 subnet:

rwcommunity localonly 192.168.10.0/24

The subnet to which the entry applies is specified by the second parameter.

Similarly, the following command specifies a read-only community name secureread for host callisto and limits access from that host to the mib-2.host subtree:

rocommunity secureread callisto .1.3.6.1.2.1.25

The starting point for allowed access is specified as the entry's third parameter.

This syntax is really a compact form of the general Net-SNMP access control directives com2sec, view, group, and access. The first two are the most straightforward:

#com2sec  name      origin           community
com2sec   localnet  192.168.10.0/24  somethinggood
com2sec   canwrite  192.168.10.22    somethingbetter

#view   name    in or out   subtree          [mask]
view    mibii   included    .1.3.6.1.2.1
view    sys     included    .1.3.6.1.2.1.1

The com2sec directive defines a named query source-community name pair; this item is known as a security name. In our example, we define the name localnet for queries originating in the 192.168.10 subnet using the community name somethinggood.

The view directive assigns a name to a specific subtree; here we give the mib-2 subtree the label mibii and the name sys to the system subtree. The second parameter indicates whether the specified subtree is included or excluded from the specified view (more than one view directive can be used with the same view name). The optional mask field takes a hexadecimal number, which is interpreted as a mask further limiting access within the given subtree, for example, to specific rows within a table (see the manual page for details).

The group directive associates a security name (from com2sec) with a security model (corresponding to an SNMP version level). For example, the following entries define the group local as the localnet security name with each of the available security models:

#group   grp name  model  sec. name
group    local     v1     localnet
group    local     v2c    localnet
group    local     usm    localnet             usm means version 3.
group    admin     v2c    canwrite

The final entry defines the group admin as the canwrite security name with SNMP Version 3.

Finally, the access entry brings all of these items together to define specific access:

#        group                                 read   write  notify
#access  name   context  model  level   match  view   view   view
access   local  ""       any    noauth  exact  mibii  none   none
access   admin  ""       v2c    noauth  exact  all    sys    all

The first entry allows queries of the mib-2 subtree from the 192.168.10 subnet using the community string somethinggood while rejecting all other operations (access happens via the mibii view). The second entry allows any query and notification from 192.168.10.22 and also allows set operations within the system subtree from this source using SNMP version 2c clients, all using the somethingbetter community name.

See the snmpd.conf manual page for full details on these directives.

8.6.3.4.3 The Net-SNMP trap daemon

The Net-SNMP package also includes the snmptrapd daemon for handling traps that are received. You can start the daemon manually by entering the snmptrapd -s command, which says to send trap messages to the syslog Local0 facility (warning level). If you want it to be started at boot time, you'll need to add this command to the /etc/init.d/snmp boot script.

The daemon can also be configured by the /usr/share/snmp/snmptrapd.conf file. Entries in this file have the following format:

traphandle OID|default program [arguments]

traphandle is a keyword, the second field holds the trap's OID or the keyword default, and the remaining items specify a program to be run when that trap is received, along with any arguments. A variety of data is passed to the program when it is invoked, including the device's hostname and IP address and the trap OID and variables. See the documentation for full details.
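Here is a sketch of what such an entry and its handler might look like; the script name and its contents are hypothetical, and they rely on the fact that snmptrapd passes the sending device's hostname and IP address on the first two lines of the program's standard input, followed by one OID/value pair per line:

traphandle default /usr/local/sbin/log-trap.sh

#!/bin/sh
# /usr/local/sbin/log-trap.sh (hypothetical): record incoming traps via syslog.
read host                   # first line of stdin: sending host's name
read ip                     # second line: sending host's IP address
logger -p local0.notice "SNMP trap from $host ($ip)"
while read oid value; do    # remaining lines: trap OID and variable bindings
    logger -p local0.notice "  $oid = $value"
done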

Note that snmptrapd is a very simple trap-handler. It is useful if you want to record or handle traps on a system without a manager as well as for experimentation and learning purposes. However, in the long run, you'll want a more sophisticated manager. We'll consider some of these later in this section.

8.6.3.4.4 Configuring SNMP under HP-UX

HP-UX uses a series of SNMP daemons (subagents), all controlled by the SNMP master agent, snmpdm. The daemons are started by scripts in the /sbin/init.d subdirectory. The SnmpMaster script starts the master agent.

The subagents are:

  • The HP-UX subagent (/usr/sbin/hp_unixagt), started by the SnmpHpunix script.

  • The MIB2 subagent (/usr/sbin/mib2agt), started by the SnmpMib2 script.

  • The trap destination subagent (/usr/sbin/trapdestagt), started by the SnmpTrpDst script.

HP-UX also provides the /usr/lib/snmp/snmpd script for starting all the daemons in a single operation.

The main configuration file is /etc/SnmpAgent.d/snmpd.conf. Here is an example of this file:

get-community-name:  somethingsecure
set-community-name:  somethingelse
max-trap-dest:       10                     Max. number of trap destinations.
trap-dest:           dalton.ahania.com
location:            "machine room"
contact:             "chavez@ahania.com"

There are also more complex versions of the community name definition entries which allow you to specify access control on a per-host basis, as in this example:

get-community-name: somethingsecure \
  IP: 192.168.10.22 192.168.10.222 \
  VIEW: mib-2 enterprises -host              Use -name to exclude a subtree.
default-mibVIEW: internet                    Default accessible subtree.

The first entry (continued across three lines) allows two hosts from the 192.168.10 subnet to access the mib-2 and enterprises subtrees (except the former's host subtree) in read-only mode, using the somethingsecure community name. The second entry defines the default MIB access; it is applied to queries from hosts for which no specific view has been specified.

HP-UX's SNMP facility is designed to be used as part of its OpenView network management facility, a very elaborate package which allows you to manage many aspects of computers and other network devices from a central control station. In the absence of this package, the SNMP implementation is fairly minimal.

8.6.3.4.5 Configuring SNMP under Solaris

Solaris' SNMP agent is the snmpdx daemon.[27] It controls a number of subagents. The most important of these is mibiisa, which responds to standard SNMP queries within the mib-2 and enterprises.sun subtrees (although MIB II is only partially implemented).

[27] Solaris also supports the Desktop Management Interface (DMI) network management standard, and its daemons can interact with snmpdx on these systems.

The daemons use configuration files in /etc/snmp/conf. The primary settings are contained in snmpd.conf. Here is an example:

# set some system values
sysdescr    "old Sparc used as a router"
syscontact  "chavez@ahania.com"
syslocation "Ricketts basement"
# default communities and trap destination
read-community   hardtoguess
write-community  hardertoguess
trap-community   usedintraps
trap             dalton.ahania.com           Maximum of 5 destinations.
# hosts allowed to query (5/line, max=32)
manager localhost dalton.ahania.com hogarth.ahania.com
manager blake.ahania.com

Be aware of the difference between the community definition entries in the preceding example and those named system-read|write-community; the latter allow access to the system subtree only.

The snmpdx.acl configuration file may be used to define more complex access control, via entries like these:

acl = {
        {
           communities = somethinggreat
           access = read-write
           managers = localhost, dalton.ahania.com
        }
        {
           communities = somethinggood
           access = read-only
           managers = iago.ahania.com, hamlet.ahania.com, ...
        }
      }

This access control entry defines the access levels and associated community strings for two lists of hosts: the local system and dalton receive read-write access using the somethinggreat community name, and the second list of hosts receives read-only access using the somethinggood community name.

8.6.3.4.6 The AIX snmpd daemon

AIX's snmpd agent is configured via the /etc/snmpd.conf configuration file. Here is an example:

# what to log and where to log it
logging  file=/usr/tmp/snmpd.log  enabled
logging  size=0  level=0
# agent information
syscontact   "chavez@ahania.com"
syslocation  "Main machine room"
#community name    [IP-address   netmask         [access    [view]]]
community  something
community  differs  127.0.0.1    255.255.255.255  readWrite
community  sysonly  127.0.0.1    255.255.255.255  readWrite  1.17.2
community  netset   192.168.10.2 255.255.255.0    readWrite  1.3.6.1
#view   name     [subtree(s)]
view    1.17.2   system enterprises
view    1.3.6.1  internet
#trap   community  destination  view      mask
trap    trapcomm   dalton       1.3.6.1   fe

This file illustrates both general server configuration and access control. The latter is accomplished via the community entries, which not only define a community name, but also optionally limit its use to a host and potentially an access type (read-only or read-write) and a MIB subtree. The latter are defined in view directives. Here we define one view consisting of the system and enterprises subtrees and another consisting of the entire internet subtree. Note that the view names must consist of an OID-like string in dotted notation.

8.6.3.4.7 The Tru64 snmpd daemon

The Tru64 snmpd agent is also configured via the /etc/snmpd.conf configuration file. Here is an example:

sysLocation   "Machine Room" sysContact    "chavez@ahania.com" #community  name       IP-address    access community   something  0.0.0.0       read             Applies to all hosts. community   another    192.168.10.2  write #trap  [version]  community  destination[:port] trap              trapcomm   192.168.10.22 trap    v2c       trap2comm  192.168.10.212

The first section of the file specifies the usual MIB variables describing this agent. The second section defines community names; the arguments specify the name, the host to which it applies (0.0.0.0 means all hosts), and the type of access. The final section defines trap destinations for all traps and for version 2c traps.

8.6.3.5 SNMP and security

As with any network service, SNMP has a variety of associated security concerns and tradeoffs. At the time of this writing (early 2002), a major SNMP vulnerability was uncovered and its existence widely publicized (see http://www.cert.org/advisories/CA-2002-03.html). Interestingly, Net-SNMP was one of the few implementations that did not include the problem, while all of the commercial network management packages were affected.

In truth, prior to Version 3, SNMP is not very secure. Unfortunately, many devices do not yet support this version, which is still in development and is a draft standard, not a final one. One major problem is that community names are sent in the clear over the network. Poor coding practices in SNMP agents also mean that some devices are vulnerable to takeover via buffer overflow attacks, at least until their vendors provide patches. Thus, a decision to use SNMP involves balancing security needs with the functionality and convenience that it provides. Along these lines, I can make the following recommendations:

  • Disable SNMP on devices where you are not using it. Under Linux, remove any links to /etc/init.d/snmp in the rcn.d subdirectories.

  • Choose good community names.

  • Change the default community names before devices are added to the network.

  • Use SNMP Version 3 clients whenever possible to avoid compromising your well-chosen community names.

  • Block external access to the SNMP ports: TCP and UDP ports 161 and 162, as well as any additional vendor-specific ports (e.g., TCP and UDP port 1993 for Cisco). You may also want to do so for some parts of the internal network. (A sample packet filter appears after this list.)

  • Configure agents to reject requests from any but a small list of origins (whenever possible).

  • If you must use SNMP operations across the Internet (e.g., from home), do so via a virtual private network or access the data from a web browser using SSL. Some applications that display SNMP data are discussed in the next section of this chapter.

  • If your internal network is not secure and SNMP Version 3 is not an option, consider adding a separate administrative network for SNMP traffic. However, this is an expensive option, and it does not scale well.
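For the port-blocking suggestion above, here is a minimal sketch of what such filtering might look like on a Linux gateway using iptables. The external interface name (eth0) and the choice to silently drop the packets are assumptions; a border router or dedicated firewall will have its own configuration syntax:

# Hypothetical example: discard SNMP traffic arriving on the external interface
iptables -A INPUT   -i eth0 -p udp --dport 161:162 -j DROP
iptables -A INPUT   -i eth0 -p tcp --dport 161:162 -j DROP
iptables -A FORWARD -i eth0 -p udp --dport 161:162 -j DROP
iptables -A FORWARD -i eth0 -p tcp --dport 161:162 -j DROP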

As I've hinted above, SNMP Version 3 goes a long way toward fixing the most egregious SNMP security problems and limitations. In particular, it sends community strings only in cryptographically encoded form. It also provides optional user-based authenticated access control for SNMP operations. All in all, learning about and migrating to SNMP Version 3 is a very good use of your time.

8.6.4 Network Management Packages

Network management tools are designed to monitor resources and other system status metrics on groups of computer systems and other network devices: printers, routers, UPS devices, and so on. In some cases, performance data can be monitored as well. The current data is made available for immediate display, usually via a web interface, and the software updates and refreshes the display frequently.

Some programs are also designed to be proactive and actively look for problems: situations in which a system or service is unusable (basic connectivity tests fail) or the value of some metric has moved outside the acceptable range (e.g., the load average on a computer system rises above some preset level, indicating that CPU resources are becoming scarce). The network monitor will then notify the system administrator about the potential problem, allowing her to intervene before the situation becomes critical. The most sophisticated programs can also begin fixing some problems themselves when they are detected.

Standard Unix operating systems provide very little in the way of status monitoring tools, and those utilities that are included are generally limited to examining the local system and its own network context. For example, you can determine current CPU usage with the uptime command, memory usage with the vmstat command, and various aspects of network connectivity and usage via the ping, traceroute and netstat commands (and their GUI-based equivalents).

In recent years, a variety of more flexible utilities have appeared. These tools allow you to examine basic system status data for a group of computers from a single monitoring program on one system. For example, Figure 8-9 illustrates some simple output from the Angel Network Monitor program, written by Marco Paganini (http://www.paganini.net/angel/). The image has been converted to black and white from the full-color original.

Figure 8-9. The Angel Network Monitor
figs/esa3.0809.gif

The display produced by this package consists of a matrix of systems and monitored items, and it provides an easy-to-understand summary display of the current status for each valid combination. Each row of the table corresponds to a particular computer system, and each column represents a different network service or other system characteristic that is being monitored. In this case, we are monitoring the status of the FTP facility, the web server service, the system load average, and the electronic mail protocol, although not every item is monitored for every system.

In its color mode, the tool uses green bars to indicate that everything is OK (white in the figure), yellow bars for a warning condition, red bars for a critical condition (gray in the figure), and black bars to indicate that data collection failed (black in the figure). A missing bar means that the data item is not being collected for the system in question.

In this case, system callisto is having problems with its load average (it's probably too high), and its SMTP service (probably not responding). In addition, the load average probe to system bagel failed. Everything else is currently working properly.

The angel command is designed to be run manually. Once it is finished, a file named index.html appears in the package's html subdirectory, containing the display we just examined. The page is updated each time the command is run. If you want continuous updates, you can use the cron facility to run the command periodically. If you want to be able to view the status information from any location, you should create a link to index.html within the web server documents directory.
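As a concrete illustration, a crontab entry like the following would regenerate the display every ten minutes; the installation directory is an assumption and should be adjusted to wherever the package actually lives:

# Hypothetical crontab entry: rerun angel every 10 minutes
*/10 * * * *   cd /usr/local/angel && ./angel > /dev/null 2>&1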

The Angel Network Monitor is also very easy to configure. It consists of a main Perl script (angel) and several plug-ins, auxiliary scripts that perform the actual data gathering. The facility uses two configuration files, which are stored in the conf subdirectory of the package's top-level directory. I had to modify only one of them, hosts.conf, to start viewing status data.

Here is a sample entry from that file:

#label :plug-in  :args       :column:images
#                 host!port          critical!warning!failure
ariadne:Check_tcp:ariadne!ftp:FTP:alertred!alertyellow!alertblack

The (colon-separated) fields hold the label for the entry (which appears in the display), the plug-in to run, its arguments (separated by !'s), the table column header, and the graphics to display when the retrieved value indicates a critical condition, a warning condition, or a plug-in failure. This entry checks the FTP service on ariadne by attempting to connect to its standard port (a numeric port number can also be used) and uses the standard red, yellow, and black bars for the three states (the OK state is always green).

The other provided plug-ins allow you to check whether a device is alive (via ping), the system load average (uptime), and the available disk space (df). It is easy to extend its functionality by writing additional plug-ins and to modify its behavior by editing its main configuration file.

The Angel Network Monitor performs well at the job it was designed for: providing a simple status display for a group of hosts. In doing so, it operates from the point of view of the local system, monitoring those items that can be determined easily by external probes, such as connecting to ports on a remote system or running simple commands via rsh or ssh. While its functionality can be extended, more complex monitoring needs are often better met by a more sophisticated package.

8.6.4.1 Proactive network monitoring

There is no shortage of packages that provide more complex monitoring and event-handling capabilities. While these packages can be very powerful tools for information gathering, their installation and configuration complexity scales at least linearly with their features. There are several commercial programs that provide this functionality, including Computer Associates' Unicenter and Hewlett-Packard's OpenView (see the cover article in the January 2000 issue of Server-Workstation Expert magazine for an excellent overview, available at http://swexpert.com/F/SE.F1.JAN.00.pdf). There are also many free and open source programs and projects, including OpenNMS (http://www.opennms.com), Sean MacGuire's Big Brother (free for non-commercial uses, http://www.bb4.com) and Thomas Aeby's Big Sister (http://bigsister.graeff.com). We'll be looking at the widely-used NetSaint package, written by Ethan Galstad (http://netsaint.org).

8.6.4.1.1 NetSaint

NetSaint is a full-featured network monitoring package which can not only provide information about system/resource status across an entire network but can also be configured to send alerts and perform other actions when problems are detected.

NetSaint's continuing development is taking place under a new name, Nagios, with a new web site (http://www.nagios.com). As of this writing, the new package is still in an alpha version, so we'll discuss NetSaint here. Nagios should be 100% backward compatible with NetSaint as it develops toward Version 1.0.

Installing NetSaint is straightforward. Like most of these packages, it has several prerequisites (including MySQL and the mping command).[28] These are the most important NetSaint components:

[28] Recent SuSE Linux distributions include NetSaint (although it installs the package in nonstandard locations).

  • The netsaint daemon, which continually collects data, updates displays, and generates and handles alerts. The daemon is usually started at boot time via a link to the netsaint script in /etc/init.d.

  • Plug-in programs, which perform the actual device and resource probing.

  • Configuration files, which define devices and services to monitor.

  • CGI programs, which support web access to the displays.

Figure 8-10 displays NetSaint's Tactical Overview display. It provides summary information about the current state of everything being monitored. In this case, we are monitoring 20 hosts, of which 4 currently have problems. We are also monitoring 40 services, 5 of which have reached their critical or warning state. The display shows an abnormally high number of failures to make the discussion more interesting.

Figure 8-10. The NetSaint Network Monitor
figs/esa3.0810.gif

Figure 8-10 also shows the NetSaint menu bar in the window's left frame. The items under Monitoring select various status displays. Figure 8-11 is a composite illustration showing selected items corresponding to the second and third menu choices.

Figure 8-11. NetSaint status summaries
figs/esa3.0811.gif

The two tables at the top of the figure present the overall status figures in tabular form. The items in the middle row of the illustration provide a breakdown of host and service status by computer location (on the left) as well as the details for each device in the Printers host group. In this way, the location of trouble can be determined quickly.

NetSaint provides links within each table to more detailed information. If you click on the "2 WARNING" text in Bldg2's Service Status item, the table at the bottom of the figure is displayed. This table provides details about the two warning-level conditions: the FTP service is not responding as expected to queries, and there are 292 processes running (which is above the warning threshold).

Figure 8-12 illustrates NetSaint's individual host-level reports (which we've reformatted slightly to save space). This report is for a host named leah, a Windows system (if the user-defined icon is to be believed). Earlier, this system was down for over 2 hours. In fact, it has been up only about half of the time during which it was monitored.

The Host State Information table displays a variety of specific information about the host's recent monitoring history and its current monitoring configuration. The comment displayed at the bottom of the figure was entered by the system administrator, and it provides a reason for the system's recent outage.

The Host Commands area enables the administrator to change many aspects of this host's monitoring configuration, including enabling/disabling monitoring and/or alert notifications, adding/modifying scheduled downtime for the host (during which monitoring ceases and alerts are not sent), and forcing all defined checks to be run immediately (rather than waiting for their next scheduled instance).

The second menu item allows you to acknowledge any current problem. Acknowledging simply means "I know about the problem, and it is being handled." NetSaint marks the corresponding event as such, and future alerts are suppressed until the item returns to its normal state. This process also allows you to enter a comment explaining the situation, an action that is very helpful when more than one administrator examines the monitoring data.

Table 8-12 lists the locations of the various NetSaint components.

Figure 8-12. Host-specific information from NetSaint
figs/esa3.0812.gif

Table 8-12. NetSaint components

Item                    Standard[29]     SuSE RPM
Daemon                  bin/netsaint     /usr/sbin/netsaint
Configuration files     etc              /etc/netsaint
Plug-ins                libexec          /usr/lib/netsaint/plugins
Generated HTML pages    share/images     /usr/share/netsaint/images
Web interface           sbin             /usr/lib/netsaint/cgi
Logs and comments       var/log          /var/log/netsaint
Documentation           none             /usr/share/netsaint/doc

[29] Relative to /usr/local/netsaint.

Configuring NetSaint can seem daunting at first, but it is actually relatively straightforward once you understand all of the pieces. It has several configuration files:

netsaint.cfg

Defines directory locations for the package's various components, the user and group context for the netsaint daemon, what items to log, log file rotation settings, various timeouts and other performance-related settings, and additional items related to some of the package's advanced features (e.g., enabling event handling and defining global event handlers).

commands.cfg and hosts.cfg

Define host and service test commands and specify which hosts and services are monitored. These two files hold the same sorts of entries, and they exist as separate files simply for the sake of convenience.

nscgi.cfg

Holds settings related to the NetSaint displays, including paths to web page items and scripts, and per-item icon and sound selections. The file also defines allowed access to NetSaint's data and commands.

resource.cfg

Defines macros that may be used within other settings for clarity and security purposes (e.g., to hide passwords from view).
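For instance, resource.cfg entries are simple macro assignments along these lines (the specific macro names and values here are illustrative assumptions); command definitions elsewhere can then reference $USER1$ or $USER3$ rather than embedding the path or password directly:

$USER1$=/usr/local/netsaint/libexec
$USER3$=somethingsecure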

We will briefly consider entries in the second class of files here. These files hold several different kinds of entries, including the following:

command

Define a monitoring task and its associated command. These entries are also used to define commands used for other purposes such as sending alerts and event handlers.

host

Define a host/device to be monitored.

hostgroup

Create a list of hosts to be grouped together in displays.

service

Define an item on a host/device to be checked periodically.

contact

Specify a list of recipients for alerts.

timeperiod

Assign a name to a specified time period.

Here are some example command definitions:

command[do_ping]=/bin/ping -c 1 $HOSTADDRESS$
command[check_telnet]=/usr/local/netsaint/libexec/check_tcp -H $HOSTADDRESS$ -p 23

The first entry defines a command named do_ping, which runs the ping command to send a single ICMP packet to a host. When this command appears in a service entry, the corresponding host is automatically substituted for the built-in NetSaint macro $HOSTADDRESS$.

The second entry defines the check_telnet command, which runs the plug-in named check_tcp, which attempts to connect to the TCP port specified by -p on the indicated host.

It is also possible to define commands with arguments that are replaced at execution time using macros of the form $ARGn$, as in this example:

command[check_tcp]=/usr/local/netsaint/libexec/check_tcp -H $HOSTADDRESS$ -p $ARG1$

The entry defines the check_tcp command and calls the same plug-in, but it uses the first argument as the desired port number.

Many plug-ins use the -w and -c options to define value ranges that should generate warning- and critical-level alerts, respectively. Somewhat counterintuitively, these options expect the range of acceptable values as their argument. For example, the following entry defines the command snmp_load5 and sets the warning level to values over 150:

command[snmp_load5]=/usr/local/netsaint/libexec/check_snmp
    -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.2021.10.1.5.2
    -w 0:150 -c :300 -l load5            Output is wrapped here.

It calls the check_snmp plug-in provided with the package for the current host, using the first command argument as the SNMP community name, and retrieves the 5-minute load average value (in 3-digit form), labeling the data as "load5." The value will trigger a warning alert if it is over 150; -w 0:150 means that values between 0 and 150 are not in the warning range. It will also trigger a critical alert if it is over 300, i.e., not in the range 0 (optional) to 300. If both are triggered, critical wins.

The following entries illustrate the definitions for hosts:

#host[label]=descr.; IP address;parent;check command
host[ishtar]=ishtar;192.168.76.98;taurus;check-printer-alive;10;120;24x7;1;1;1;
host[callisto]=callisto;192.168.22.124;;check-host-alive;10;120;24x7;1;1;1;

Let's take these entries apart, field by field (the fields are separated by semicolons). The first one is the most complicated and has the following syntax: host[label]=description, where label is the label to be used in status displays and description is a (possibly longer) phrase describing the device (we've used the same text for both). The next field holds the device's IP address, which is the item that actually identifies the desired device (the preceding items are just arbitrary labels).

The third field specifies the parent device for the item: a list of one or more labels for intermediate devices located between the current system and this one. For example, to reach ishtar, we must go through the router named taurus, so taurus is specified as its parent. The fourth field specifies the command NetSaint should use to determine whether the host is accessible ("alive"), and the fifth field indicates how many checks must fail before the host is assumed to be down (10 in our example). The parent is optional, and the entry for callisto does not use it.

The remaining fields in the example entries relate to alert notifications. They hold the time interval between alerts when a host remains down, in minutes (here, two hours), the time period during which alerts should be sent, and three flags indicating whether to send notifications when the host recovers after being down, when the host goes down, and when the host is unreachable due to a failure of an intermediate device, respectively (where 0 means no and 1 means yes). The time period is defined elsewhere in the configuration file. This one, named 24x7, is included in the default file and means "all the time." It's a convenient choice when you are getting started using NetSaint. All the flags are set to yes in our examples.

Now that we have both commands and hosts entries, we are ready to define specific items that NetSaint should monitor. These items are known as services. Here are some sample entries:

#service[host]   =label;;  when;;;;   notify;;;;;;           check-command
service[callisto]=TELNET;0;24x7;4;5;1;admins;960;24x7;0;0;0;;check_telnet
service[callisto]=PROCS;0;24x7;4;5;1;admins;960;24x7;0;0;0;;snmp_nproc!commune!250!400
service[ingres]=HPJD;0;24x7;4;5;1;localhost;960;24x7;0;0;0;;check_hpjd

The most important fields in these entries are the first, third, seventh, and final ones, which hold the following settings:

  • The service definition (field 1), using the syntax service[host-label]=service-label. For example, the first example entry defines a service named TELNET for the host entry named callisto.

  • The name of the time period during which this check should be performed (field 3), again defined in a timeperiod entry.

  • The contact name (field 7): this item holds the name of a contact entry defined elsewhere in the file. The latter entry type is used to specify lists of users to be contacted when alerts are generated.

  • The command to run to perform the check (final field), defined via a command entry elsewhere in the configuration file. Arguments to the command are specified as separate !-separated subfields with the command.

The other fields hold the volatility flag (field 2), the maximum number of checks before a service is considered down (4), the number of minutes between normal checks and failure rechecks (5 and 6), the number of minutes between failure alerts while the service remains down (8), the time period during which to send alerts (9), and three alert flags indicating whether to send notifications on service recovery, critical conditions, and warning conditions, respectively (10 through 12). The penultimate field holds the command name for the event handler for this service (see below); no event handler is specified in these cases. The default values, used in the examples, are good starting points.

As we saw in Figure 8-11, NetSaint displays can summarize status information for a group of devices. You specify this by defining a host group. For example, the following configuration file entry defines the Printers host group (as displayed in the right-hand table in the middle row of the illustration):

hostgroup[Printers]=Printers;localhost;ingres,lomein,turtle,catprt

The syntax is simple:

hostgroup[label]=description;contact-group;list-of-host-names

Keep in mind that the host labels refer to the names of host definitions within the NetSaint configuration file (and not necessarily to literal hostnames). The members of the specified contact group will be notified whenever there is any problem with any device in the list.

In addition to sending alert messages, NetSaint also provides support for event handlers: commands to be performed when a service check fails. In this way, you can begin dealing with a problem before you even know about it. Here are the entries corresponding to a simple event handler:

#event handler for disk full failures
command[clean]=/usr/local/netsaint/local/clean $STATETYPE$
service[beulah]=DISK;0;24x7;4;5;1;localhost;960;24x7;0;0;0;clean;check_disk!/!15!5

First, we define a command named clean, which specifies a script to run. Its sole argument is the value of the $STATETYPE$ NetSaint macro, which is set to HARD for critical failures and SOFT for warnings. The clean command is then specified as the event handler for the DISK service on beulah. The script uses the find command to delete junk files within the filesystem and uses the argument value to decide how aggressive to be. In this case, the warning level means that the disk is 85% full and critical alerts correspond to 95% full, values specified via the final two parameters to the service monitoring command named check_disk (defined elsewhere), whose first argument is the filesystem to check.
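The clean script itself is not shown above; a minimal sketch of such a script might look like the following, where the filesystem, file-name patterns, and age thresholds are all arbitrary choices:

#!/bin/sh
# Hypothetical event handler sketch: $1 receives NetSaint's $STATETYPE$ value
# (SOFT for a warning-level failure, HARD for a critical one).
FS=/                                     # filesystem to clean (an assumption)
case "$1" in
    SOFT) AGE=+30 ;;                     # warning: remove junk older than 30 days
    HARD) AGE=+7  ;;                     # critical: be more aggressive
    *)    exit 0 ;;
esac
find $FS -xdev \( -name core -o -name '*.tmp' \) \
    -type f -mtime $AGE -exec rm -f {} \;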

NetSaint has a few other nice features which we'll consider very briefly. First of all, it can save data between runs (and it does so under the default configuration). You can also specify whether to display the saved status information when the NetSaint page is first opened. The following netsaint.cfg entries control this feature:

retain_state_information=1
retention_update_interval=60
use_retained_program_state=1

You can also save the data produced by the status commands for future use outside of NetSaint, using these main configuration file entries:

process_performance_data=1
service_perfdata_command=process-service-perfdata

The command specified in the second entry must be defined in hosts.cfg or another configuration file. Typically, this command simply writes the command's output to an external file: e.g., echo $OUTPUT$ >> file. The $OUTPUT$ macro expands to the full output returned by the monitoring command. You can also specify a separate processing command for host status monitoring commands. The data in the file can be analyzed, sent to a database (see the next section), or processed in any other way that you like.
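Following that pattern, a definition for the command named above might look like this; the log file location is an assumption:

command[process-service-perfdata]=/bin/echo $OUTPUT$ >> /var/log/netsaint/service-perfdata.log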

So far, we have considered NetSaint in the context of a single monitoring location. In other words, all monitoring commands are issued from a single master system. However, the NetSaint daemon can also be configured to accept data sent from outside sources. It refers to this option as passive mode, which may be enabled via the check_external_commands main configuration file directive.

As we noted earlier, access to NetSaint is defined in the nscgi.cfg configuration file. Here are some example entries from that file:

use_authentication=1
authorized_for_configuration_information=netsaintadmin,root,chavez
authorized_for_all_services=netsaintadmin,root,chavez,maresca
hostextinfo[bagel]=;redhat.gif;;redhat.gd2;;168,36;,,;

The first entry enables the access control mechanism. The next two entries specify users who are allowed to view NetSaint configuration information and services status information (respectively). Note that all users also must be authenticated to the web server using the Apache htpasswd mechanism.
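Those web server accounts are created with the standard Apache htpasswd utility, as in this example; the password file location is an assumption and must match the AuthUserFile directive protecting the NetSaint CGI directory:

# create the password file and add the first user (-c), then add another user
htpasswd -c /usr/local/netsaint/etc/htpasswd.users netsaintadmin
htpasswd    /usr/local/netsaint/etc/htpasswd.users chavez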

The final entry specifies extended attributes for the host defined in the entry labeled bagel. The filenames in this example specify image files to be used for the host in status tables (GIF format) and in the status map (GD2 format), and the two numeric values specify the device's location within the status map.

NetSaint status maps provide a quick way of accessing information about individual devices. A sample status map is displayed in Figure 8-13. The illustration shows the saintmap utility written by David Kmoch (http://www.netsaint.org/download/), which provides a convenient way of creating status maps. In this case, we have grouped devices by their physical location (although we haven't bothered to label the groups). The lines from taurus to each device in the bottom group illustrate the fact that taurus is the gateway to this location. When used by NetSaint, each icon will have a status indication (up or down) added to it, enabling an administrator to get an overall view of things right away, even when the network is very large and complex.

Figure 8-13. Using netsaint to create a status map
figs/esa3.0813.gif
8.6.4.2 Identifying trends over time

NetSaint is very good about providing up-to-the-minute status information, but there are also times when it is helpful to compare the current situation to conditions in the past. Accordingly, we now turn to tools that track status and performance data over time, thereby providing the sort of historical usage data that is essential to performance tuning and capacity planning.

8.6.4.2.1 MRTG and RRDtool

One of the best-known packages of this type is the Multi-Router Traffic Grapher (MRTG), written by Tobias Oetiker and Dave Rand. It collects data over time and automatically produces graphs of it over various time periods (see http://www.mrtg.org). As its name suggests, it was first designed to track the ongoing performance of the routers in a network, but it can be used for a wide variety of data (even ranging beyond the computer realm). The general term for this type of data is "time series data," and it consists of any value that can be tracked over time.

More recently, MRTG has been supplanted by Oetiker's newer package, RRDtool (http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/). RRDtool has much more powerful and configurable graphing facilities, although it requires a separate data collection script or package (the web site contains a list of some of the latter).

Both of these tools work by storing only the data needed to produce the various graph types. Instead of saving every data point, they store a collection of the most recent ones, as well as summary values collected over various time periods. When new data comes in, it replaces the oldest point in the current collection of raw values, and the relevant summary data values are updated as appropriate. This strategy results in small, fixed-size databases that nevertheless offer a wealth of important information.

We'll now look briefly at the RRDtool package and then consider a popular data collection front end named Cricket. We'll begin by creating a simple database, using the rrdtool command provided by the package:

# rrdtool create ping.rrd \
  --step 300 \                           Interval is 5 minutes.
  DS:trip:GAUGE:600:U:U \
  DS:lost:GAUGE:600:U:U \
  RRA:AVERAGE:0.5:1:600 \                600 5-minute averages.
  RRA:AVERAGE:0.5:6:700 \                700 30-minute averages.
  RRA:AVERAGE:0.5:24:775 \               775 2-hour averages.
  RRA:AVERAGE:0.5:288:750 \              750 daily averages.
  RRA:MAX:0.5:1:600 \
  RRA:MAX:0.5:6:700 \
  RRA:MAX:0.5:24:775 \
  RRA:MAX:0.5:288:797

This command creates a database named ping.rrd consisting of two fields, trip and lost, defined by the two DS lines (DS for "data set"). They will hold the round-trip travel time for ICMP packets and the percentage of lost packets resulting from running the ping command. Both are of type GAUGE, meaning that the data for these fields should be interpreted as a distinct value. The other data types refer to counters of various sorts, and their values are interpreted as changes with respect to the preceding value; they include COUNTER for monotonically increasing data and DERIVE for data that can vary up or down.

The fourth field in each DS line is the heartbeat: the maximum number of seconds that may elapse between data samples before the value is treated as unknown (here, 10 minutes). The final two fields hold the valid range of the data. A setting of U stands for unknown, and two U's together have the effect of allowing the data itself to define the valid range (i.e., accept any value).

The remaining lines of the command, labeled RRA, create round-robin archive data within the database. Each RRA applies to every defined DS in the file. The second RRA field indicates the kind of aggregate value to compute; here, we compute averages and maximums. The remaining fields specify the maximum percentage of the required sampled data that can be missing, the number of raw values to combine, and the number of data points of this type to store.

Those final two fields can be confusing at first. Let's consider a simple example: values of 6 and 100 would mean that the average (or other function) of 6 raw values will be computed, and the most recent 100 averages will be saved. If the time period between data points is 300 seconds (the default value and also specified via the --step option), this will be a 30-minute average value (6*5 minutes), and we will have 30-minute averages going back for 50 hours (100*6*5). Note that the aggregate periods do not overlap; the 30-minute values are for the preceding 30 minutes, the 30 minutes before that, and so on. In addition, aggregate definitions always start from the present moment.[30]

[30] In other words, contrary to how MRTG works, they do not begin where the preceding one left off.

Thus, in our example database, we are creating 5-minute (--step 300) averages and maximums, 30-minute values of each type (5*6=30), 2-hour values (5*24=120) and daily values (5*288=1440=24 hours). Eventually, we will have data going back for over 2 years. At any given time, we'll have 50 hours worth of 5-minute averages (600*5 minutes), about 14.5 days of 30-minute averages, about 64.5 days of 2-hour averages, and 750 days of daily averages. We'll also have the maximum value data for each point.

There are many ways to add data to an RRDtool database. Here is a script illustrating one of the simplest, using rrdtool's update keyword:

#!/bin/csh
ping -w 30 -c 10 $1 > /tmp/ping_$1
set trip=`tail -1 /tmp/ping_$1 | awk -F= '{print $2}' | \
          awk -F/ '{print $2}'`
set lost=`grep transmitted /tmp/ping_$1 | awk -F, '{print $3}' \
         | awk -F% '{print $1}'`
rm -f /tmp/ping_$1
rrdtool update ping.rrd "N:"$trip":"$lost

We use the ping command to generate the data, then we take apart the output, and finally we use rrdtool update to enter it into our database. The final argument to the command is a colon-separated list of data values, beginning with the time to be associated with the data (N means now) followed by the value for each defined data field, in order. In this case, we use normal Unix commands to obtain the data we need, but we could also have used SNMP as the source.
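For example, a variant of the script could call snmpget and feed the result to rrdtool update. This sketch assumes a Net-SNMP-style snmpget client and a hypothetical one-field database named load.rrd; the community string and the OID (the UCD 5-minute load average used in the earlier snmp_load5 example) are placeholders:

#!/bin/sh
# Hypothetical: fetch one value via SNMP and store it in an RRD database
load=`snmpget -v1 -c somethingsecure -Oqv callisto .1.3.6.1.4.1.2021.10.1.5.2`
rrdtool update load.rrd "N:"$load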

Once we've accumulated data for a while, we can create graphs, again using rrdtool. For example, the following command (taken from a script) creates a simple graph of the data from the previous 24 hours:

rrdtool graph ping.gif \
  --title "Packet Trip Times" \
  DEF:time=ping.rrd:trip:AVERAGE \
  LINE2:time\#0000FF

This command defines a graph of a single value, specified via the DEF (definition) line. The graphed variable is named time, and it comes from the stored averages of the trip field in the ping.rrd database (raw values cannot be graphed). The LINE2 line is what actually graphs its values. It draws a line of width 2 for the defined variable time, displayed in the color corresponding to the RGB value #0000FF (blue). The backslash before the number sign is required to protect it from the shell; it is not part of the command syntax. The resulting output file, named ping.gif, is displayed in Figure 8-14 (although the blue line appears black in this version).

Figure 8-14. A simple RRDtool graph
figs/esa3.0814.gif

In the graph, time flows forward from left to right, and the current time is at the extreme right (here, about 8:00 P.M.).

You can display more than one value per graph. Consider Figure 8-15, which displays the values of the 5-minute load average (black line) and number of processes (gray line) for a system.

Figure 8-15. Graphing two values
figs/esa3.0815.gif

The upper graph displays the values in their normal ranges. In this case, we cannot see much detail in the load average line because its values are too small with respect to the number of processes. In the bottom graph, we correct this by multiplying the load average by 10 to bring the two data sets within the same general numerical range. Since load averages are a somewhat arbitrary metric anyway, this does not distort the data (because only relative load average values are really meaningful).

Here is the command from the script that created the bottom graph:

rrdtool graph cpu.gif \
  --title "CPU Performance" \
  DEF:la=cpu.rrd:la5:AVERAGE \
  CDEF:xla=la,10,* \
  DEF:np=cpu.rrd:nproc:AVERAGE \
  LINE2:xla\#0000FF:"la*10" \
 'GPRINT:la:AVERAGE:(avg=%.0lf' \
 'GPRINT:la:MIN:min=%.0lf' \
 'GPRINT:la:MAX:max=%.0lf)' \
  LINE2:np\#FF0000:"# procs" \
 'GPRINT:np:AVERAGE:(avg=%.0lf' \
 'GPRINT:np:MIN:min=%.0lf' \
 'GPRINT:np:MAX:max=%.0lf)'

The CDEF (computed definition) command is used to create a new graph variable based on an expression. In this case, we define the variable xla by multiplying the la variable by 10. The expression is specified in Reverse Polish Notation (RPN; see the RRDtool documentation if this is unfamiliar). Both variables are graphed by LINE2 subcommands, and these examples use the optional third field to set a label for the line. In addition, the parenthesized summary data for each variable shown at the bottom of the graph is created via the GPRINT subcommands (enclosed in quotation marks to protect special characters from the shell).

As a final graph example, consider Figure 8-16. In this graph, we again display data from ping.rrd. The average round-trip time is again a blue line, but this time the background is shaded to indicate whether the packet loss was significant: green means normal (little or no packet loss), and yellow and red indicate a busy and overloaded network, respectively. Note that the illustration in Figure 8-16 colors the three bands white, light gray, and dark gray, and the blue graph line is black.

This technique was inspired by an example graph created by Brandon Gant (see gallery/brandon_01.html under the main RRDtool page), although his implementation is undoubtedly more sophisticated.

Figure 8-16. Shading a graph based on data values
figs/esa3.0816.gif

Here is the command section that created the shaded bands:

DEF:stat=ping.rrd:lost:AVERAGE \
CDEF:band0=stat,0,GE,stat,13,LT,+,2,EQ,INF,0,IF \
CDEF:band1=stat,13,GE,stat,27,LT,+,2,EQ,INF,0,IF \
CDEF:band2=stat,27,GE,stat,1000,LT,+,2,EQ,INF,0,IF \
AREA:band0\#00FF00:"normal" \
...

We define the variable stat as the lost field from ping.rrd. Next, we create three more variables, named band0, band1, and band2, via a complex conditional expression that sets the variable's value to infinity (INF) if the condition is true and to 0 otherwise. For example, the first RPN expression is equivalent to 0 <= stat < 13. As defined above, the AREA subcommand generates a green area plot labeled "normal," which in this case consists of a series of vertical green lines and white spaces (since the variable is 0 or infinite). There are two additional AREA lines for the other two bands in the full command. Since each value of stat is placed into one of the three bands, the entire graph background is filled in.
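To make the conditional concrete, here is how the band0 expression evaluates for a sample value of stat = 5 (i.e., 5% packet loss); this is a step-by-step reading of the RPN, not output from rrdtool:

stat,0,GE    ->  1      5 >= 0 is true
stat,13,LT   ->  1      5 < 13 is true
+            ->  2      the two partial results are added
2,EQ         ->  1      the sum equals 2, so the combined test is true
INF,0,IF     ->  INF    true selects INF, producing a full-height band at this point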

Creating graphs like these can be tedious, but fortunately, there is a utility named RRGrapher which automates the process. This CGI script, written by Dave Plonka (http://net.doit.wisc.edu/~plonka/RRGrapher/), is illustrated in Figure 8-17.

Figure 8-17. The RRGrapher utility
figs/esa3.0817.gif

You can use this tool to create graphs that draw data from multiple RRD databases. In this example, we are plotting values from two databases over a specified time period. The latter is one of RRGrapher's most convenient features, since rrdtool requires times to be expressed in standard Unix format (seconds since 1/1/1970) but you can enter them here in a readable format.
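If you do need to supply Unix-format times to rrdtool directly, the GNU date command can perform the conversion; note that the -d option is GNU-specific and the date shown is only an example:

date -d "2002-07-01 00:00" +%s              Prints the corresponding Unix time.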

8.6.4.2.2 Using Cricket to feed RRDtool

To use RRDtool to gather and present data from more than a few sources, you will need some sort of front-end package to automate the process. The Cricket package is an excellent choice for this purpose. It was written by Jeff Allen (http://www.afn.org/~jam/software/cricket/). Cricket is written in Perl, and it requires a very large number of modules to function (plan on several visits to CPAN), so installing it may take a bit of time. Once it is up and running, these are its most important components:

  • The cricket-config subdirectory tree, containing specifications for each device to be monitored (see below).

  • The collector script, run periodically from cron (usually every five minutes).

  • The grapher.cgi script, used to display Cricket graphs within a web browser.

The cricket-config directory tree contains the configuration files that tell the collector script what data to get from which devices. It holds a hierarchical set of configuration files. Default values set at each level continue to apply to lower levels unless they are explicitly overridden. Once the initial setup is completed, adding additional devices is very simple.

The first-level subdirectories within this tree refer to broad classes of devices: routers, switches, and so on. We will be examining the device class hosts. It is not part of the default tree installed with the package, but is available at http://www.certaintysolutions.com/tech-advice/cricket-contrib/ (it was created by James Moore). We use this one because it is relatively simple and refers to metrics we have already examined in other contexts.

Within the hosts subdirectory of cricket-config there is a file named Defaults, which supplies default values for entries within this subtree. Here are some lines from that file, which we've annotated with comment lines:

# cricket-config/hosts/Defaults
# device specification
target --default--
    snmp-host = %server%
# define symbolic names for some SNMP OIDs
OID   ucd_load1min      1.3.6.1.4.1.2021.10.1.3.1
OID   ucd_load5min      1.3.6.1.4.1.2021.10.1.3.2
OID   ucd_load15min     1.3.6.1.4.1.2021.10.1.3.3
# define specific data values to be collected (RRD data sources)
datasource ucd_load1min
    ds-source = snmp://%snmp%/ucd_load1min
datasource ucd_load5min
    ds-source = snmp://%snmp%/ucd_load5min
datasource ucd_load15min
    ds-source = snmp://%snmp%/ucd_load15min
# define a data source group named ucd_System
targetType   ucd_System
    ds   =   "ucd_cpuUser, ucd_cpuSystem, ucd_cpuIdle,
              ucd_memrealAvail, ucd_memswapAvail,
              ucd_memtotalAvail, ucd_load1min, ucd_load5min,
              ucd_load15min"
# define 3 subgroups of ucd_System for graphing purposes
    view   =   "cpu: ucd_cpuUser ucd_cpuSystem ucd_cpuIdle,
                Memory: ucd_memrealAvail ucd_memswapAvail ucd_memtotalAvail,
                Load: ucd_load1min ucd_load5min ucd_load15min"
# define graphs to be generated
graph   ucd_load5min
    legend   =   "5 Min Load Av"
    si-units =   false
graph   ucd_memrealAvail
    legend   =   "Used RAM"
    scale    =   1024,*
    bytes    =   true
    units    =   "Bytes"

These entries are all quite intuitive. We can see the underlying RRD database structure used for this data, but using Cricket means that we don't have to worry about it. The entries following the data source definitions relate to the Cricket reporting structure (as we'll see).

Specific hosts to be monitored are generally defined in files named Targets. Each host has a subdirectory under hosts in which such a file lives. Here are some excerpts from the file for host callisto:

# cricket-config/hosts/callisto/Targets
Target --default--
    server         =   callisto
    snmp-community =   somethingsecure
# Specify data source groups to collect
target ucd_sys
    target-type  =   ucd_System
    short-desc   =   "CPU, Memory, and Load"
target boot
    target-type  =   ucd_Storage
    inst         =   1
    short-desc   =   "Bytes used on /boot"
    max-size     =   19487
    storage      =   boot

This file instructs Cricket to collect values for all of the items defined in the ucd_System and ucd_Storage groups. Each target will appear as an option within the Cricket web interface for this host.

Figure 8-18 illustrates some Cricket output. The upper-left window lists the first-level menu; each of its items corresponds to a top-level subdirectory under cricket-config. The lower-right graph shows the page corresponding to the ucd_sys target for host callisto. It begins with a summary of the current data and then displays one or more graphs showing the data over time (you can select which ones appear via the links in the right-hand cell in the Summary table).

Figure 8-18. Cricket status and history reports
figs/esa3.0818.gif

In this case, we have chosen the weekly graph. It shows clearly that callisto generally used very little of its CPU resources in the past seven days, but there was an exceptional period on the previous Sunday (although even then the load average was never very high). Graphs like these can be very helpful in determining what the normal range of behavior is for the various devices for which you are responsible. When you understand the normal status and variation, you are in a much better position to recognize and understand the significance of anomalies that do turn up.

As we've seen, network monitoring software can be a powerful tool for keeping track of system status, both at the current moment and over the long haul. However, don't underestimate the time it will take to implement a monitoring strategy for a real-world environment. As with most things, careful planning can minimize the amount of time that this will require, but putting a monitoring strategy in place is always a big job. You need to consider not only the installation and configuration issues but also the performance impact on your network and the security ramifications of the daemons and protocols you are enabling. While this can be a daunting task and cannot be rushed, in the end it is worth the effort.


