Performance Data Reporting

A natural next step after collecting performance data is generating reports. Reports are useful only if properly correlated data is presented in a form that suits the intended audience. For example, upper management tends to want reports that express the network's performance in simple, measurable terms, such as a network availability summary report. As a network engineer, you need both a single at-a-glance indication of the network's health and detailed reports of relatively raw data. You need an alarm mechanism that tells you when something is wrong and where the problem occurred. And you need to be able to drill down and obtain more detail to isolate the problem.

For example, the network engineer comes to work in the morning and looks at the at-a-glance health rating for the network. The at-a-glance reports provide quick information on any fault or failing events in the network. Additionally, she runs reports whenever she needs to understand how traffic flows change over time. Finally, she runs reports during and after faults in order to understand any anomalies that may have occurred before or during a fault. Fortunately, the different reports can be built from the same set of collected data. Each of the report types varies, based on the intended audience and use. When implementing performance management, it is important to determine the types of reports needed for your operation. Don't produce reports that no one uses. The following are examples of performance reporting:

Network and device health
Fault
Capacity planning

Network and Device Health Reporting

Network and device health reports provide "dashboard" or at-a-glance reporting about the relative health of the network and its components. Health reporting summarizes the relative health of devices and the network: for example, device availability and interface utilization. This type of reporting is extremely valuable for help desk, operations, engineering, and management.
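As a rough illustration of the two health measures named above, the following Python sketch computes device availability and interface utilization from polled samples. The function names, the sample values, and the assumption that the octet counter does not wrap between polls are illustrative only and are not taken from any particular tool.

```python
def availability_percent(up_seconds: float, period_seconds: float) -> float:
    """Share of the reporting period during which the device was reachable."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return 100.0 * up_seconds / period_seconds


def interface_utilization_percent(octets_start: int, octets_end: int,
                                  interval_seconds: float,
                                  speed_bps: float) -> float:
    """Average one-direction utilization between two octet-counter samples.

    Assumption: the counter did not wrap between the two samples.
    """
    if interval_seconds <= 0 or speed_bps <= 0:
        raise ValueError("interval_seconds and speed_bps must be positive")
    bits = (octets_end - octets_start) * 8
    return 100.0 * bits / (interval_seconds * speed_bps)


# Hypothetical samples: one day of polling, and a 1,544,000-bps link
# whose octet counter was read twice, 300 seconds apart.
print(availability_percent(up_seconds=86_100, period_seconds=86_400))
print(interface_utilization_percent(1_000_000, 29_950_000, 300, 1_544_000))
```

A health report would typically compute values like these for every device and interface and then summarize them, rather than show them one at a time.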

Implementing health reporting provides the quickest benefit when you initially implement performance monitoring and reporting tools. The collected data can be processed, analyzed, and displayed quickly. Thus, when a user calls the help desk, the help desk technician can bring up the present state of the user's network through health reporting, rather than having to Telnet directly to devices or ask the network guru what the status is.

Third-party health reporting systems are reaching a useful level of maturity. Traditional reporting for distributed networks tended to look at the device level, whereas newer commercial systems provide reporting at a more comprehensive network level that encompasses the member devices. For examples of health reporting tools, see Chapter 9. Health reports become even more valuable when they incorporate knowledge of devices and device types. This may involve using proprietary MIBs from devices in order to report on specific health factors. The following sections describe several specific types of health reporting.

Device-Level Health Reporting

The simplest form of health reporting involves reporting device-level performance information. The information is reported by using graphs or graphical elements to indicate each collected variable for each device. Graphic elements include gauges, speedometers, level meters, and on/off switches.

The information is reported in real time, which permits an engineer to monitor it while troubleshooting, as opposed to sifting through archived data. The disadvantage of this type of reporting is that it doesn't scale well past a single device. However, this type of reporting is useful for troubleshooting.

Figure 4-1 shows an example of a real-time health report using speedometers to indicate CPU and interface utilization. Elements such as speedometers and gauges are effective in quickly communicating the state of the variable.

Figure 4-1. Example of Simple Real-Time Health Reporting Using Speedometer Graphic Elements

Tabular Health Reporting

A more sophisticated method for health reporting involves the use of tables to indicate the best and worst elements for a particular collected variable. The advantage of this type of report is that you can present the best and worst for a given element across the whole network. Being able to look at the macro view through these reports helps a network engineer or operations person quickly understand where trouble spots exist or are about to occur.

Reporting is usually presented in a top 10 or bottom 10 format. Top 10 generally reflects the highest values across devices, whereas the bottom 10 reflects the lowest values. Bottom 10 reports can be just as useful as top 10 reports. For example, looking at the least-utilized WAN links in a network may identify redundant links that are not in use even though they should be.

Table 4-1 demonstrates a literal "top 10" list in which the top 10 routers exhibiting the highest CPU utilization are listed, starting with the worst of the bunch.

Table 4-1. Example of a Top 10 Health Report for Router CPU Utilization
Rank	Device Name	CPU %
1.	Comfy-home1	98
2.	Outback-r4	97
3.	Eng-rr-3	85
4.	Eng-rr-5	84
5.	Eng-rr-1	67
6.	172.26.11.13	66
7.	Fw-1	66
8.	Old-router	63
9.	bb-rr-2	62
10.	Du-mr	57
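A report like Table 4-1 can be produced from the collected samples with a simple ranking step. The following Python sketch is one possible approach; the function names are hypothetical, the first four devices come from Table 4-1, and the two Lab routers are invented so that the bottom-N case has something to show.

```python
import heapq

# CPU samples keyed by device name. The first four entries mirror Table 4-1;
# Lab-r1 and Lab-r2 are hypothetical lightly loaded routers.
cpu_by_device = {
    "Comfy-home1": 98,
    "Outback-r4": 97,
    "Eng-rr-3": 85,
    "Eng-rr-5": 84,
    "Lab-r1": 4,
    "Lab-r2": 7,
}


def top_n(samples, n=10):
    """Return the n (device, value) pairs with the highest values, worst first."""
    return heapq.nlargest(n, samples.items(), key=lambda item: item[1])


def bottom_n(samples, n=10):
    """Return the n (device, value) pairs with the lowest values, lowest first."""
    return heapq.nsmallest(n, samples.items(), key=lambda item: item[1])


def format_report(rows):
    """Render rows in the same tab-separated layout as Table 4-1."""
    lines = ["Rank\tDevice Name\tCPU %"]
    for rank, (name, cpu) in enumerate(rows, start=1):
        lines.append(f"{rank}.\t{name}\t{cpu}")
    return "\n".join(lines)


print(format_report(top_n(cpu_by_device, n=3)))
print(format_report(bottom_n(cpu_by_device, n=2)))
```

The same two functions work for any collected variable, such as WAN link utilization, as long as the samples are keyed by device or interface name.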

A variation on this type of report lists only the devices that meet or exceed a certain threshold, such as a CPU utilization threshold. This is called an exception report. For instance, if the threshold of concern for router CPU utilization is 60 percent and none of the routers has utilization greater than 50 percent, the report simply states that there are no entries.
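An exception report amounts to a filter applied before the ranking step. The following Python sketch shows the idea under the assumption that samples are held in a dictionary keyed by device name; the function names and the wording of the empty-report message are illustrative.

```python
def exception_report(samples, threshold):
    """Return (device, value) pairs at or above the threshold, worst first."""
    rows = [(name, value) for name, value in samples.items() if value >= threshold]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def render_exceptions(rows, threshold):
    """Render the report, or a one-line notice when nothing exceeds the threshold."""
    if not rows:
        return f"No entries at or above {threshold} percent"
    return "\n".join(f"{name}\t{value}" for name, value in rows)


# Hypothetical samples: no router is above 50 percent, so with a 60 percent
# threshold the report contains no entries.
quiet_network = {"Eng-rr-1": 45, "Fw-1": 50, "bb-rr-2": 12}
print(render_exceptions(exception_report(quiet_network, 60), 60))
```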

Correlated Predictive Reports

The final example of health reporting is called correlated predictive reports. Predictive reports apply rules and knowledge of specific device characteristics against collected data in order to attempt to predict trends or potential faults that may result. Predictive reports typically look at collected data, gather more if needed, and make some sort of conclusion based on sets of rules.

For instance, suppose that a router's CPU utilization is extremely high (98 percent). Rather than simply reporting high CPU, a predictive analysis may dig deeper by obtaining the output of the show proc cpu command. Example 4-1 shows sample output from this command.

Example 4-1. Partial Output from Router show proc cpu Indicating High CPU Utilization with CDP as the Cause

CPU utilization for five seconds: 98%/99%; one minute: 99%; five minutes: 99%
 PID Runtime(ms) Invoked uSecs   5Sec   1Min   5Min TTY Process
   1        628   490033     1  0.00%  0.00%  0.00%  0 Load Meter
   2       4008     2692  1488  0.00%  0.00%  0.00%  0 PPP auth
   3    1800432   289902  6210  0.00%  0.05%  0.05%  0 Check heaps
   4          0        1     0  0.00%  0.00%  0.00%  0 Pool Manager
   5          0        2     0  0.00%  0.00%  0.00%  0 Timers
   6       3876    44111    87  0.00%  0.00%  0.00%  0 ARP Input
   7       7304   273357    26  0.00%  0.00%  0.00%  0 DDR Timers
   8          0        1     0  0.00%  0.00%  0.00%  0 Entity MIB API
   9          0        2     0  0.00%  0.00%  0.00%  0 Serial Backgroun
  10          0        1     0  0.00%  0.00%  0.00%  0 SERIAL A'detect
  11       2240  1281395     1  0.00%  0.00%  0.00%  0 LED Timers
  12          0        3     0  0.00%  0.00%  0.00%  0 CSM timer proces
  13        324     1568   206  0.00%  0.00%  0.00%  0 POTS
  14      40124 2447944247863 99.00% 99.00% 99.00%  0 CDP Protocol
  15     785576   772809  1016  0.16%  0.02%  0.09%  0 IP Input

From the output in Example 4-1, you can see that something other than normal behavior is causing the high CPU: CDP. You can assume that something is wrong with this router for the following reasons:

All of the other processes are running extremely low.
CDP should not be running so high.
A show interface command shows that packets are being dropped, seemingly because CDP is dominating the processor.

So, rather than simply reporting high CPU, the tools could instead report anomalous behavior with high CDP Protocol utilization as the root cause.
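The kind of rule described here can be sketched in a few lines of Python. The sketch below parses show proc cpu output of the form shown in Example 4-1 and names the process, if any, whose five-minute average dominates the CPU. The regular expression, the function names, and the 90 percent cutoff are assumptions made for illustration; a real tool would apply many such rules and would need to handle other output formats.

```python
import re

# Matches the tail of a process row: three percentages, the TTY, and the name.
# Anchoring on the percentages avoids depending on the earlier numeric columns,
# which can run together when a value overflows its column.
ROW = re.compile(r"(\d+\.\d+)%\s+(\d+\.\d+)%\s+(\d+\.\d+)%\s+\d+\s+(.+?)\s*$")


def parse_processes(output):
    """Return (name, five_sec, one_min, five_min) for each process row found."""
    processes = []
    for line in output.splitlines():
        match = ROW.search(line)
        if match:
            five_sec, one_min, five_min, name = match.groups()
            processes.append((name, float(five_sec), float(one_min), float(five_min)))
    return processes


def dominant_process(output, share=90.0):
    """Name the process whose 5-minute average is at or above `share`, else None."""
    processes = parse_processes(output)
    if not processes:
        return None
    name, _, _, five_min = max(processes, key=lambda proc: proc[3])
    return name if five_min >= share else None


# A few rows taken from Example 4-1.
SAMPLE = """\
CPU utilization for five seconds: 98%/99%; one minute: 99%; five minutes: 99%
 PID Runtime(ms) Invoked uSecs   5Sec   1Min   5Min TTY Process
   1        628   490033     1  0.00%  0.00%  0.00%  0 Load Meter
  14      40124 2447944247863 99.00% 99.00% 99.00%  0 CDP Protocol
  15     785576   772809  1016  0.16%  0.02%  0.09%  0 IP Input
"""

culprit = dominant_process(SAMPLE)
if culprit is not None:
    print(f"Anomalous CPU load: {culprit} is the likely root cause")
```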

Additionally, if this is a known condition, the predictive report could recommend a resolution. By first providing a tactical solution and then searching for a long-term remedy, the software can help quickly solve the problem and suggest ways to permanently avoid it. In this example, you can assume that CDP should never run this way, so predictive software might suggest doing a bug search on Cisco's bug navigator Web site (www.cisco.com).

Predictive analysis for network devices is still in the early stages of maturity as vendors continue to learn more about available devices and devise rule sets around their behavior.

Fault Reporting

A nice side benefit of collecting performance-related data is that the data can be used for multiple purposes. Not only can you use the data for reporting, but you can also set thresholds on the collected data. The thresholds can be used to generate alarms to an event-handling network management station. Thresholds and fault management are discussed in Chapter 5, "Configuring Events," and Chapter 6, "Event and Fault Management."
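To make the idea of thresholds on collected data concrete, the following Python sketch turns a series of samples into alarm events. It raises an event when a value first reaches the threshold and clears it when the value drops back below. The event names and the raise-and-clear behavior are assumptions for illustration and do not describe how any particular management station works.

```python
def threshold_events(samples, threshold):
    """Convert (timestamp, value) samples into raise/clear events.

    An event is emitted only when the state changes, so a value that stays
    above the threshold for several polls produces a single "raise".
    """
    events = []
    alarmed = False
    for timestamp, value in samples:
        if value >= threshold and not alarmed:
            events.append(("raise", timestamp, value))
            alarmed = True
        elif value < threshold and alarmed:
            events.append(("clear", timestamp, value))
            alarmed = False
    return events


# Hypothetical CPU samples polled at four consecutive intervals.
cpu_samples = [(0, 40), (1, 65), (2, 70), (3, 50)]
for event in threshold_events(cpu_samples, threshold=60):
    print(event)
```

Emitting events only on state changes keeps the event-handling station from being flooded with one alarm per poll while a condition persists.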

Faults can be reported in several ways. For catastrophic or severe events such as a device disappearing or an interface becoming non-operational, you must have real-time status reports. These reports may be in the form of topological maps that indicate device and interface status by color, such as turning a port red when it dies. Or they may be tabular reports that list faults in order of severity and duration.
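A tabular fault report ordered by severity and duration can be sketched as a two-key sort. In the following Python example, the severity names, their ranking, and the sample faults are all hypothetical; substitute whatever severity scheme your event system uses.

```python
from dataclasses import dataclass

# Hypothetical severity scale; a lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}


@dataclass
class Fault:
    device: str
    description: str
    severity: str
    duration_minutes: int


def fault_report(faults):
    """Order faults by severity, then by longest duration within a severity."""
    return sorted(
        faults,
        key=lambda fault: (SEVERITY_RANK[fault.severity], -fault.duration_minutes),
    )


# Hypothetical open faults.
open_faults = [
    Fault("Eng-rr-3", "High CPU utilization", "minor", 240),
    Fault("Outback-r4", "Device unreachable", "critical", 15),
    Fault("Fw-1", "Interface down", "critical", 90),
]

for fault in fault_report(open_faults):
    print(f"{fault.severity}\t{fault.duration_minutes}\t{fault.device}\t{fault.description}")
```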

Fault reports are useful for help desks. A person taking trouble calls can refer to the report when determining whether a user's problem may be related to a network outage.

Most, if not all, fault-reporting software will do its job out of the box but requires customization for a smooth fit into your organization. Every network needs to be managed differently, so you must fine-tune and customize the reports. The risk in not customizing these types of systems is rendering the tool useless as people learn not to trust the severity or messages associated with events.

Depending on your budget, you can purchase different levels of sophistication with these types of tools. However, any tool you purchase will require some level of customization in order to reflect your organization's politics and policies.

Capacity Reporting

When drafting a budget or considering network redesigns, you need solid, real-life data in order to understand the current network's capacity. Armed with good data, you can identify congestion and trouble areas in the network that need to be relieved, either through additional equipment or network reconfiguration. You can also identify devices that are being underutilized or have available chassis slots.

The collected data can be reported to reveal available capacity, or a lack of capacity, in the network. The result is accurate, real-life data that helps you understand and plan for network capacity.
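As a minimal sketch of a capacity report, the following Python example sorts links into congested, underutilized, and normal groups based on their peak utilization over the reporting period. The function name, the link names, and the 80 percent and 10 percent cutoffs are assumptions chosen for illustration; appropriate values depend on your network and its traffic.

```python
def classify_capacity(peak_utilization, congested_at=80.0, underused_below=10.0):
    """Group links by peak utilization (percent) over the reporting period."""
    report = {"congested": [], "underutilized": [], "normal": []}
    for name in sorted(peak_utilization):
        peak = peak_utilization[name]
        if peak >= congested_at:
            report["congested"].append(name)
        elif peak < underused_below:
            report["underutilized"].append(name)
        else:
            report["normal"].append(name)
    return report


# Hypothetical peak utilization figures for three WAN links.
peaks = {"wan-1": 92.0, "wan-2": 3.5, "wan-3": 40.0}
for group, links in classify_capacity(peaks).items():
    print(f"{group}: {', '.join(links) or 'none'}")
```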