System-Related Questions | Performance and Fault Management: A Practical Guide to Effectively Managing Cisco Network Devices (Cisco Press Core Series)

The questions in this section relate to system characteristics of Cisco devices, such as the following:

CPU
memory
buffers
environmental

How Do I Collect CPU Utilization?

CPU utilization is a good indicator for determining the health of a network device. The following sections detail how to collect and interpret CPU-related information.

Collecting CPU Utilization on IOS Devices

Critical router functions, such as routing protocol processing and process packet switching, are handled in memory and share the CPU. Thus, if the CPU utilization is very high, it is possible that a routing update cannot be handled or a process-switched packet is dropped.

For routers, collect avgBusy5 from the OLD-CISCO-CPU MIB.

The avgBusy5 variable reflects an exponentially decaying five-minute moving average. By using avgBusy5 rather than busyPer, which reflects CPU utilization over the last 5 seconds, you avoid reacting to short CPU spikes that misrepresent the longer-term impact of traffic on the CPU.
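To see why a decaying average hides short spikes, consider the following minimal sketch. It assumes a 5-second sample interval and a 300-second time constant; the exact smoothing constant that IOS uses is not given here, so `decayed_average` and its parameters are illustrative only.

```python
import math

def decayed_average(samples, interval_s=5.0, window_s=300.0):
    """Exponentially decaying average of utilization samples (percent)."""
    # Weight of the newest sample. The real IOS constant is an assumption here.
    alpha = 1.0 - math.exp(-interval_s / window_s)
    avg = samples[0]
    for sample in samples[1:]:
        avg += alpha * (sample - avg)
    return avg

# A CPU that is 20% busy, with one 5-second spike to 100%.
samples = [20.0] * 30 + [100.0] + [20.0] * 5
smoothed = decayed_average(samples)
print(f"peak sample: {max(samples):.0f}%  smoothed: {smoothed:.1f}%")
```

The single spike dominates a 5-second reading but moves the smoothed value by only a point or two, which is the behavior you want from a long-term health indicator.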

TIP

Keep in mind that polling CPU utilization variables (or any other SNMP variables) itself affects CPU utilization. Some customers have reported seeing utilization of 99 percent returned when continuously polling the variable at 1-second intervals. While polling so frequently is overkill, consider the impact on the CPU when deciding how often to poll the variable.

If you discover a high CPU utilization, you may want to determine which process or processes are causing it. Output from the IOS show proc cpu command (see Example 19-7) will help you identify the offending processes.

Example 19-7 A partial show proc cpu listing.

 CPU utilization for five seconds: 42%/16%; one minute: 47%; five minutes: 50%
  PID  Runtime(ms)  Invoked  uSecs    5Sec   1Min   5Min TTY Process
    1        2364     23730     99   0.00%  0.00%  0.00%   0 Load Meter
    2       64576     41029   1573   0.16%  0.07%  0.06%   0 PPP auth
    3      368608      3950  93318   0.00%  0.19%  0.26%   0 Check heaps
    4          72       259    277   0.00%  0.00%  0.00%   0 Pool Manager
    5           0         2      0   0.00%  0.00%  0.00%   0 Timers
    6           0         2      0   0.00%  0.00%  0.00%   0 Serial Background
    7          88      3962     22   0.00%  0.00%  0.00%   0 Environmental monitor
    8       15676     13224   1185   0.08%  0.02%  0.00%   0 ARP Input
    9       16384     49080    333   0.00%  0.01%  0.00%   0 DDR Timers
   10           0         1      0   0.00%  0.00%  0.00%   0 SERIAL A'detect
   11        5924    191457     30   0.08%  0.02%  0.00%   0 Call Management
   12       20192    464930     43   0.00%  0.00%  0.00%   0 Framer background
   13    15470284   7925310   1952  20.69% 21.83% 23.88%   0 IP Input
   14        3544     13902    254   0.00%  0.00%  0.00%   0 CDP Protocol
   15           0         1      0   0.00%  0.00%  0.00%   0 Asy FS Helper
   16        2084     35167     59   0.00%  0.00%  0.00%   0 PERUSER aux
   17       17904     46273    386   0.08%  0.06%  0.04%   0 PPP IP Add Route
   18          28       200    140   0.00%  0.00%  0.00%   0 MOP Protocols
   19           0         1      0   0.00%  0.00%  0.00%   0 X.25 Encaps Manage
   20       37164    183448    202   0.16%  0.17%  0.17%   0 IP Background
   68     1626176   3415015    476   2.29%  2.48%  2.23%   0 CCP manager

The output in Example 19-7 shows that the busiest process is the IP Input process, which is responsible for process-switching IP traffic.
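If you capture this output regularly, picking out the offending processes can be scripted. The following is a rough sketch, assuming the column layout shown in Example 19-7; the function name `busiest_processes` and the abridged `SAMPLE` text are illustrative, and other IOS versions may format the listing differently.

```python
import re

# One process row: PID, Runtime(ms), Invoked, uSecs, 5Sec, 1Min, 5Min, TTY, Process
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+(\d+)\s+(.+?)\s*$"
)

def busiest_processes(output, top=3):
    """Return (process name, five-minute percent) pairs, busiest first."""
    rows = []
    for line in output.splitlines():
        match = ROW.match(line)
        if match:  # header lines do not start with a numeric PID, so they are skipped
            rows.append((match.group(9), float(match.group(7))))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]

# Abridged copy of the listing in Example 19-7.
SAMPLE = """\
CPU utilization for five seconds: 42%/16%; one minute: 47%; five minutes: 50%
 PID  Runtime(ms)  Invoked  uSecs    5Sec   1Min   5Min TTY Process
   3      368608      3950  93318   0.00%  0.19%  0.26%   0 Check heaps
  13    15470284   7925310   1952  20.69% 21.83% 23.88%   0 IP Input
  68     1626176   3415015    476   2.29%  2.48%  2.23%   0 CCP manager
"""

for name, pct in busiest_processes(SAMPLE):
    print(f"{name:<20} {pct:5.2f}%")
```

Sorting on the five-minute column rather than the five-second column keeps the ranking consistent with the avgBusy5 approach described earlier.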

Collecting CPU Utilization on Catalyst 4000, 5000, and 6000 Series Switches

For switches, at the time of this writing there is no CPU utilization variable to watch with SNMP. There are alternative methods of collecting CPU utilization from switches, but first we will discuss the importance of CPU utilization for switches.

For switches, unlike routers, CPU utilization is not as vital an indicator of device health. Switches make packet-forwarding decisions in ASICs, so forwarding does not need processor time. In fact, we've seen faulty software crash on a switch, yet the switch continues to pass packets.

The switch's processor handles some vital functions such as the processing of bridge BPDUs for spanning tree calculation, but problems tend to manifest themselves in different areas first. Switch CPU utilization issues are discussed in more detail in "Catalyst Switch Processors" in Chapter 11, "Monitoring Network Systems Processes and Resources."

To collect CPU utilization from a switch, you can use the command-line show proc cpu command (older versions of Catalyst IOS software use the less reliable ps c command). The output contains individual process utilization as well as overall process utilization.

NOTE

Please note that the output of the ps c command is not always accurate due to the priorities of the switch. The results of the command should be used only as a rough approximation of actual process utilization.

Because the switching of traffic is done without involving the switch's CPU, individual key interface utilization and backplane utilization tend to be better measures of device capacity.

For more information on measuring switch and router characteristics, please see "Performance Data for Switch Processors" in Chapter 11.

Why Is IP SNMP Causing High CPU Utilization?

If an IOS device has high CPU utilization caused by the IP SNMP process, there are several possible underlying causes:

You are polling the device too much. A common culprit is a network management autodiscovery.
You have the process priority for the IP SNMP process set too high. Set the process priority to low with the following command: snmp-server priority low. With the exception of a few 10.x versions of IOS, the priority should be set to low by default.

For more information, please see "Controlling SNMP Access Using Views and Access Lists" in Chapter 18.

How Do I Collect Free and Largest Block of Contiguous Memory?

Memory leaks and abnormal network events are the main reasons for monitoring memory consumption and fragmentation. A memory leak occurs when a process requests memory blocks and does not release them when it is finished with them. Eventually, the process will gobble up all of the available memory. This is considered a bug, and it will eventually cause a router to crash.

Collecting Memory Statistics from IOS Devices

Not having enough memory prohibits the router from, among other things, creating more buffers. The lack of memory can also affect the router's capability to grow data structures such as a routing table.

Monitoring free memory and the largest free block of memory on IOS devices can be good indicators of router health. The variables to watch are ciscoMemoryPoolFree and ciscoMemoryPoolLargestFree from CISCO-MEMORY-POOL-MIB.
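One way to combine the two variables is to compare the largest free block with total free memory: a small ratio suggests fragmentation even when plenty of memory is free. The sketch below assumes you have already polled both values, in bytes, for one memory pool; the function name and both thresholds are illustrative assumptions, not Cisco recommendations.

```python
def memory_health(pool_free, pool_largest_free, min_free=1_000_000, min_ratio=0.5):
    """Classify a pool from ciscoMemoryPoolFree and ciscoMemoryPoolLargestFree (bytes).

    min_free and min_ratio are hypothetical thresholds; tune them per device.
    """
    # Share of the free memory that is available as one contiguous block.
    ratio = pool_largest_free / pool_free if pool_free else 0.0
    if pool_free < min_free:
        return "low-memory", ratio
    if ratio < min_ratio:
        return "fragmented", ratio
    return "ok", ratio

print(memory_health(8_000_000, 6_000_000))
print(memory_health(8_000_000, 1_000_000))
print(memory_health(500_000, 400_000))
```

Tracking the ratio over time is usually more telling than a single reading, because a steady decline in free memory points to a leak while a falling ratio points to fragmentation.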

Collecting Memory Statistics from Catalyst IOS Switches

With switches, the show mbuf command provides output such as that shown in Example 19-8.

Example 19-8 Abridged output from switch show mbuf command.

 Largest block available : 3825456
 Total Memory available  : 3827728
 Total Memory used       : 405840
 Total Malloc count      : 99395
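Because this output is a set of label and value pairs, it is easy to turn into numbers for trending. The following sketch assumes the four labels shown in Example 19-8 and treats "Total Memory available" as free memory; `parse_mbuf` is an illustrative helper, and the labels may differ between software versions.

```python
import re

def parse_mbuf(output):
    """Map each 'label : number' pair in show mbuf output to an integer."""
    pairs = re.findall(r"([A-Za-z][A-Za-z ]*?)\s*:\s*(\d+)", output)
    return {label.strip(): int(value) for label, value in pairs}

# Copy of the values in Example 19-8.
SAMPLE = """\
 Largest block available : 3825456
 Total Memory available  : 3827728
 Total Memory used       : 405840
 Total Malloc count      : 99395
"""

stats = parse_mbuf(SAMPLE)
# Share of available memory that is one contiguous block (close to 1.0 is healthy).
contiguous = stats["Largest block available"] / stats["Total Memory available"]
print(stats)
print(f"contiguous share: {contiguous:.4f}")
```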

For more information, please see "Using the show mbuf Command" in Chapter 11.

How Do I Collect Catalyst Switch Backplane Utilization?

For traditional Cisco switches that have a single backplane, such as the Catalyst 5000 series, sysTraffic from the CISCO-STACK-MIB provides the system backplane utilization. The sysTraffic measurement equates roughly to the meter of the same name on the supervisor card.

For switches that contain multiple backplanes, such as the Catalyst 5500, use the sysTrafficMeterTable from the CISCO-STACK-MIB.

For more information on measuring switch backplane utilization, see "Performance Data for Switch Processors" in Chapter 11.

How Can I Measure Router or Switch Health?

As with any computer, measuring the resources that enable a router or switch to pass packets allows you to gauge the relative health of the device. Like a workstation or mainframe, routers and switches contain one or more CPUs, RAM, storage, and network interfaces. If a resource becomes busy or faulty, it affects the overall operation of the device.

System resources reflect the capability of the device to pass packets. Ultimately, this is why you should be concerned with device health: anything that prevents the router from operating at peak capacity will affect its capability to pass packets. Monitoring device health allows you to detect events and conditions that may affect the processing of traffic.

Five essential areas can be used to gauge the health of a router. Each of these is covered in detail in Chapter 10, "Managing Hardware and Environmental Characteristics," and elsewhere in this chapter:

Device availability
CPU utilization
Ratio of buffer hits to misses
Largest block of contiguous memory and free memory
Ratio of process-switched traffic to other switching paths

For switches, there are fewer overall health indicators:

Device availability
Backplane utilization
Trunk and server port utilization and error rates

Aside from the preceding indicators, switch health should be determined on an interface-by-interface basis.

Please see Chapters 4, 10 and 11 for more details on router and switch health.

How Do I Measure the Ratio of Buffer Hits to Misses?

When a packet is received by a router and must be process-switched, it is temporarily stored in system buffers.

Buffer misses and failures are good indications of the following:

Abnormal network events
Heavy amounts of broadcast traffic
Lack of available and/or contiguous memory
High amounts of process switched traffic
A software bug

The variables in OLD-CISCO-MEMORY-MIB provide most of the desired information to monitor the ratio of buffer hits to misses. Buffer failures are not available via SNMP, but the hit-to-miss ratio should be an adequate indicator of memory resource problems and should allow you to avoid having to process a show command.

Several factors need to be considered when collecting the hit-to-miss ratio. The first is the fact that a high ratio may not indicate a fault. For example, a buffer pool that is seldom used and has few buffers can reflect a high ratio despite the number of hits and misses being extremely low.

The second factor to consider is that you don't want to measure buffers that are bigger than the largest MTU on a router. For instance, if a router has only Ethernet and Fast Ethernet interfaces, the largest MTU will be 1500 bytes; thus, you need only collect information from the small, middle, and big buffers.
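Both considerations can be folded into the calculation itself. The sketch below computes the ratio from the change in hit and miss counters between two polls of one buffer pool (for example, the small-buffer hit and miss counters in OLD-CISCO-MEMORY-MIB) and declines to report a ratio for a pool that saw little activity. The function name and the `min_events` threshold are illustrative assumptions.

```python
def hit_miss_ratio(prev, curr, min_events=100):
    """Hit-to-miss ratio for one buffer pool over one polling interval.

    prev and curr are (hits, misses) counter readings from successive polls.
    Returns None when the interval cannot be judged.
    """
    hits = curr[0] - prev[0]
    misses = curr[1] - prev[1]
    if hits < 0 or misses < 0:      # counters went backward, e.g. after a reload
        return None
    if hits + misses < min_events:  # seldom-used pool: the ratio is not meaningful
        return None
    if misses == 0:
        return float("inf")
    return hits / misses

print(hit_miss_ratio((1000, 10), (6000, 15)))  # busy pool
print(hit_miss_ratio((10, 1), (12, 2)))        # nearly idle pool
```

Working from counter deltas rather than absolute values matters because the counters accumulate from boot, so a ratio of the raw totals hides recent changes.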

For more information, please refer to "MIB Variables for Buffer Utilization on Routers" in Chapter 11.

How Do I Measure the Ratio of Process-switched Packets to Total Packets?

A packet can be forwarded through a router on different switching paths. The slowest, most processor-intensive path is called process switching. Although process switching is not inherently bad, it adds considerable overhead compared to the other switching paths. For example, each time a packet needs to be process switched, it generates a CPU interrupt and must be temporarily stored in a shared system buffer. Thus, it is desirable to have as much traffic as possible switched through paths other than process switching in order to maximize the packet throughput of the router.

Measuring the ratio of process-switched traffic to other switched traffic can be useful when determining network resource utilization. Depending on the type of router, a high amount of process-switched traffic can indicate many different things, including the following:

Non-standard traffic patterns
High rate of broadcasts
Routing protocol problems
Misconfiguration
A software bug

High amounts of process-switched traffic can substantially limit the number of packets a router can forward.

For more information, please see "Correlating High CPU Values" in Chapter 11.

How Can I Avoid a Device Appearing to Be Down if a Single Managed Interface Goes Down?

When managing a network device, you must refer to the device by a resolvable name (such as router.your-company.com) or an IP address. If the interface associated with the managed IP address goes down, the entire device will appear to be down because the IP address is unreachable, despite the fact that the rest of the router's interfaces are forwarding packets.

To work around this problem, you should implement a loopback interface on the router. A loopback interface is a virtual interface that can be reached via any physical interface. Thus, if there are multiple paths from a management workstation to a device and one path goes down, the loopback IP address is still reachable over the remaining paths. In essence, the loopback interface only goes down if the device itself dies.

Keep in mind that the router must advertise the loopback address so that the rest of the network knows how to reach the device. Also, be sure to have the DNS name of the device resolve to the loopback IP address.

For more information, please see "Setting Up a Loopback Interface" in Chapter 18.

How Can I Track When a Power Supply Dies or a Redundant Supply Changes State?

When devices contain redundant power supplies, you can choose to have the device generate traps when a power supply dies or changes state. Both routers and switches provide traps that report a state change or failure and point to further information about the nature of the change.

For routers, watch for the ciscoEnvMonRedundantSupplyNotification trap from the CISCO-ENVMON-MIB. The variables ciscoEnvMonSupplyStatusDescr and ciscoEnvMonSupplyState provide details on the nature of the change. You must configure the snmp-server enable traps envmon command in order to enable the traps.

For switches, watch for the SNMP trap chassisAlarmOn. The variables chassisTempAlarm, chassisMinorAlarm, and chassisMajorAlarm are included with the trap and are necessary for determining the specific chassis alarm in progress.

Several sections in Chapter 10 contain more details: See "MIB Variables for Switch Failure," "Error/Fault Data for Router Environmental Characteristics," and "Error/Fault Data for Switch Environmental Characteristics."

How Do I Track When a Cisco Device Reloads and Determine the Reload Reason?

Sometimes, devices reload. They can crash, be turned off, have a power cable knocked loose, or be issued a reload command. When monitoring for device availability, you should be concerned about two aspects of reloads:

When did the reload occur?
Why did the reload occur?

There are two methods to determine whether a reload has occurred. The first involves having your network management trap receiver watch for traps indicating that a device has rebooted.

The second involves polling the sysUpTime value from MIB II (RFC 1213). This value contains the time, in hundredths of a second, since the device became active. Because the number constantly rises, a decrease between polling cycles indicates that either the device rebooted or the counter rolled over back to zero, which happens after roughly 497 days of continuous uptime.
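Telling a reload apart from a rollover is possible because sysUpTime is a 32-bit TimeTicks counter that counts hundredths of a second: after a rollover, the new value should sit close to where the old value plus the polling interval would land once wrapped. The following is a minimal sketch of that check; the function name and the `slack_s` tolerance are illustrative assumptions.

```python
WRAP = 2 ** 32  # sysUpTime is a 32-bit TimeTicks counter (hundredths of a second)

def classify_uptime(prev_ticks, curr_ticks, elapsed_s, slack_s=60):
    """Return 'running', 'rollover', or 'reload' for two successive sysUpTime polls."""
    if curr_ticks >= prev_ticks:
        return "running"
    # Where the counter would be now if the device had stayed up and wrapped.
    # This is far below zero whenever a wrap was not yet due.
    expected = prev_ticks + int(elapsed_s * 100) - WRAP
    if abs(curr_ticks - expected) <= slack_s * 100:
        return "rollover"
    return "reload"

print(classify_uptime(1_000, 31_000, 300))
print(classify_uptime(50_000_000, 12_000, 300))
print(classify_uptime(WRAP - 10_000, 20_050, 300))
```

The tolerance absorbs polling jitter. A device that reloads at almost exactly the moment its counter was due to wrap would be misclassified, but that coincidence is rare enough to accept in a sketch.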

Once you determine that a reload has occurred, you can check whyReload from the OLD-CISCO-SYSTEM-MIB to determine the reload reason. This variable contains the same text as can be seen from show version.

For more information, refer to "MIB Variables for Router Failure" and "CLI Commands for Router Failure" in Chapter 10.