What Is Performance Management? | Performance and Fault Management: A Practical Guide to Effectively Managing Cisco Network Devices (Cisco Press Core Series)

In terms of networking, performance management is the configuration and measurement of network traffic for the purpose of providing a consistent and predictable level of service. Performance management involves monitoring the network activity and adjusting network design and/or configuration in order to improve its performance and traffic handling. Performance measurement can help identify the following:

Normal baseline network performance (for comparing to perceived "bad" network behavior)
Current or potential utilization problems
Slow response time
Application, server, and network availability
Optimum data transfer times

Performance measurement can be broken into two categories: performance monitoring and performance reporting. Performance monitoring is the collection of performance-related data from network devices. Performance reporting is the presentation of the collected data. With performance reporting, the collected data can be used to analyze faults, growth, and capacity of the network.

The following sections discuss performance data collection and reporting in more detail.

Performance Data Collection

Performance data collection is the process of collecting performance-related management data from network devices and storing it in a database or data file. The following issues should be considered when implementing performance data collection:

Polling versus event-based collection
Data aggregation and reduction
Differences when measuring QoS networks

Polling Versus Event-based Collection

There are two methods for collecting performance data: active polling and event reporting. Each has its own strengths and weaknesses.

The first method, active polling, involves a management station actively obtaining specific management data from network devices. A single management station may do the collection, or multiple distributed data collectors may be used. If data collectors are used, the data is temporarily stored and then forwarded on to a central data repository.

The collected data is then stored in a database and used later for reporting. The advantage of active polling is that as long as a managed device is accessible, data is collected and stored at regular intervals. Disadvantages include the following:

Collecting large amounts of data from devices can impact the performance of the managed device, the data collector, or the network.
Keeping large amounts of data over time can fill up hard disks and cause reports to take a long time to generate.

TIP

Care must be taken when an SNMP counter wraps back to zero or a device reboots; either event causes the value of a counter to start over at zero. Both events are natural occurrences, and your data collection application should be able to handle them.

In most cases, it is best to throw out the data from a poll cycle in which the collected value is less than the previously collected value. By definition, a counter type of SNMP variable only increases, so a lesser value can only mean that the counter wrapped or the device restarted during that interval.
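The discard rule in this tip can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any particular collection tool: the function names counter_delta and deltas are hypothetical, and the samples are assumed to be raw counter readings taken at successive poll cycles.

```python
from typing import List, Optional


def counter_delta(previous: Optional[int], current: int) -> Optional[int]:
    """Return the counter increase since the last poll.

    Returns None when the poll cycle should be thrown out: either there
    is no earlier reading, or the value went down (wrap or reboot).
    """
    if previous is None:
        return None  # first poll: nothing to compare against
    if current < previous:
        return None  # counter wrapped or device rebooted: discard cycle
    return current - previous


def deltas(samples: List[int]) -> List[Optional[int]]:
    """Turn a series of raw counter readings into per-cycle deltas."""
    result = []
    previous = None
    for value in samples:
        result.append(counter_delta(previous, value))
        previous = value  # always re-baseline on the latest reading
    return result


# Hypothetical readings: the third poll is lower than the second,
# so that cycle is discarded and the fourth poll uses it as a baseline.
print(deltas([100, 250, 40, 90]))
```

Discarding the cycle loses one interval of data but avoids guessing the counter width or whether a reboot occurred, which is the trade-off the tip recommends.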

The second method is event reporting, or polling by exception. With this model, active polling does not occur; instead, the managed device or agent generates a trap or event that is then received and logged by the manager. Because this method requires that events be generated based only on thresholds, the manager can assume that the lack of an event indicates that the particular item being measured is performing within acceptable ranges. The advantage of event reporting is that traffic is generated only if an exception occurs or is corrected. The impact to the network and network management station is potentially lessened.

The disadvantages to event reporting are as follows:

It is more difficult to determine the duration and frequency of an event.
Traps are unreliable, although informs are not (see Chapter 8, "Understanding Network Management Protocols," to compare traps and informs). Losing an event trap means you may not be able to determine the start or stop of an event.

Choosing between active polling and event reporting becomes a trade-off: active polling can affect the very performance numbers you are collecting, whereas events can get lost in the network.

Generally, you should employ a combination of polling and event-based reporting. Have the devices monitor themselves and notify the manager of critical items such as interface utilization exceeding 90 percent. Regular polling can be reduced when you also conduct exception polling because you no longer need to poll for the events at the same frequency. The polled traffic becomes less necessary for real-time fault identification and more useful for long-term analysis of the network's performance.
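The self-monitoring behavior described above can be sketched as a threshold check that reports only when a condition is exceeded or cleared. This is an illustrative model, not device configuration: the function name threshold_events is hypothetical, the 90 percent rising threshold comes from the example in the text, and the 80 percent falling threshold is an assumption added so that a value hovering near 90 does not generate a stream of events.

```python
from typing import List, Tuple


def threshold_events(
    samples: List[float],
    rising: float = 90.0,   # from the text: utilization exceeding 90 percent
    falling: float = 80.0,  # assumed clearing threshold (hysteresis)
) -> List[Tuple[int, str]]:
    """Return (sample_index, event) pairs, emitted only on state changes."""
    events = []
    exceeded = False
    for index, value in enumerate(samples):
        if not exceeded and value > rising:
            exceeded = True
            events.append((index, "exceeded"))
        elif exceeded and value < falling:
            exceeded = False
            events.append((index, "cleared"))
    return events


# Hypothetical utilization readings (percent) taken on the device itself.
print(threshold_events([50, 91, 95, 85, 79, 92]))
```

Six readings produce only three events, which is the traffic reduction that event reporting offers. If one of those events is lost, however, the manager cannot tell when the condition started or stopped, so periodic polling remains useful as a backstop.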

Data Aggregation and Reduction

Collecting performance-related data from large distributed networks presents challenges for storing the collected data. Consider collecting 50 variables from 500 devices every 15 minutes. Additionally, suppose that each of the objects collected is, on average, 5 bytes in size. When storing the data, add an additional 40 to 100 bytes for timestamps, devices, and interface instances for each collected object. For this example, assume an average total size of 50 bytes per object.

After one hour, you will have stored 5 MB of data. After one day, the number would grow to 120 MB of data. After one year, you will have stored slightly over 40 GB of data. The data can become overwhelming to store and manipulate.
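The arithmetic behind these figures can be checked with a short calculation. The sketch below uses only the numbers given in the example; the function name stored_bytes is hypothetical, and sizes are reported in decimal units (1 MB = 1,000,000 bytes), which is an assumption about how the figures were rounded.

```python
VARIABLES_PER_DEVICE = 50
DEVICES = 500
POLL_INTERVAL_MINUTES = 15
BYTES_PER_OBJECT = 50  # average size including timestamp and instance data


def stored_bytes(hours: float) -> int:
    """Bytes accumulated after the given number of hours of polling."""
    polls = hours * 60 / POLL_INTERVAL_MINUTES
    objects_per_poll = VARIABLES_PER_DEVICE * DEVICES
    return int(objects_per_poll * BYTES_PER_OBJECT * polls)


per_hour = stored_bytes(1)
per_day = stored_bytes(24)
per_year = stored_bytes(24 * 365)

print(f"per hour: {per_hour / 1e6:.0f} MB")
print(f"per day:  {per_day / 1e6:.0f} MB")
print(f"per year: {per_year / 1e9:.1f} GB")
```

Each poll cycle collects 25,000 objects, so four cycles per hour at 50 bytes per object yields 5 MB per hour, and the yearly total works out to about 43.8 GB in decimal units.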

There are methods for efficiently storing and reducing the collected data. Specifically, applying statistical methods and aggregating data as it ages can reduce the amount of data stored over time. However, you lose detail as you aggregate, and your reporting applications must take the aggregated data into consideration when reporting. Some commercial applications perform aggregation automatically and are able to report seamlessly across reduced data. Chapter 7, "Understanding and Using Basic Network Statistics," provides more detail on data reduction and aggregation techniques. Chapter 9, "Selecting the Tools," provides more detail on selecting tools for this purpose.
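One simple form of aggregation is to roll older samples up into larger time buckets and keep only summary statistics for each bucket. The following sketch is an assumption about how such a rollup might look, not the method used by any particular product: the function name aggregate is hypothetical, samples are assumed to be (epoch seconds, value) pairs, and the choice of minimum, average, maximum, and count as the retained statistics is illustrative.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def aggregate(
    samples: Iterable[Tuple[int, float]],
    bucket_seconds: int = 3600,  # assumed rollup period: one hour
) -> Dict[int, Dict[str, float]]:
    """Summarize (timestamp, value) samples into fixed-size time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "min": min(values),
            "avg": mean(values),
            "max": max(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical 15-minute samples: four fall in the first hour, one in the next.
raw = [(0, 10), (900, 20), (1800, 30), (2700, 40), (3600, 50)]
for start, summary in aggregate(raw).items():
    print(start, summary)
```

Four 15-minute samples collapse into one hourly record, which shows both the saving and the cost: the hourly record preserves the extremes and the average, but the time at which the peak occurred within the hour is lost.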

Differences When Measuring QoS Networks

Part of performance management is the actual configuration of queuing mechanisms and traffic prioritization that enforce different levels of service. After you have designed and implemented Quality of Service (QoS) mechanisms in your network, the natural next step is to see whether traffic is moving as you expect it to: Is it being marked, shaped, or queued appropriately? Unfortunately, this can be difficult to measure and verify.

Some show commands on routers, for instance, allow you to view how different traffic flows are handled within a particular router. With show commands, however, there is no easy method for verifying that all of the routers in a path are successfully and correctly handling your desired QoS.

In response to increased customer use of QoS mechanisms, vendors are providing tools that allow you to model, configure, measure, and report on network-wide QoS. For measuring and reporting, these tools tend to involve placing devices on either end of a path that can emulate the stream of traffic generated by a particular application. The two devices can then conduct their appropriate operations and measure how reliably the network handles the conversation.

As QoS deployment continues to grow, the configuration and reporting tools will mature as well. Please refer to Chapter 2, "Policy-Based Network Management," for more details on how policy-based networking will help, and see Chapter 9 for details of the types of tools that are available.