Understanding Data Sources | Performance and Fault Management: A Practical Guide to Effectively Managing Cisco Network Devices (Cisco Press Core Series)

This section breaks out each network management task you'll want to implement and discusses the knowledge you'll want to have in your knowledge base to support these tasks. The order in which these tasks are discussed is in general the optimal order for performing them, but you may need to modify the sequence for your network.

Network Inventory

The first task you need to do to manage your network is to perform a network audit. Out of this audit, you will obtain the information required to create an inventory of your network. Performing this audit is covered in Chapter 1, "Conducting a Network Audit." The information collected should include the following:

Device name
Device IP address(es)
Device location
Contact information

If the device is connected to a switch, you may want to include information about this connection, as follows:

Switch
Switch port

You may want to include the following information about these devices in your knowledge base. This information may be useful in automating your performance and fault management functions:

Device function
Logical access methods such as passwords or community strings
Physical access methods such as keys or badges

To keep track of your network's layer 2 connectivity, you'll need to add the following information to your knowledge base:

The device's MAC address(es)
The port on each device and the neighbor port it connects to (for example, switch 1 port 2/1 connects to switch 2 port 1/1); this is a more detailed, bidirectional form of the switch and switch port information listed previously
Information on layer 2 redundant paths (for example, redundant bridged paths, or port channels)

You may also want to track your layer 3 connectivity. You'll want to add the following information to your knowledge base:

Layer 3 connectivity information such as default routes and routing tables
VLAN information
Information on layer 3 redundant paths (for example, HSRP or VRRP router peers)
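
One way to picture these inventory items is as a single record per device. The following is a minimal sketch, not a prescribed schema: the class names, field names, and sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Link:
    """One layer 2 adjacency, recorded from the local device's point of view."""
    local_port: str
    neighbor_device: str
    neighbor_port: str


@dataclass
class DeviceRecord:
    """Hypothetical inventory record covering the items listed above."""
    name: str
    ip_addresses: List[str]
    location: str
    contact: str
    function: Optional[str] = None
    mac_addresses: List[str] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)
    vlans: List[int] = field(default_factory=list)


def reverse_link(device_name: str, link: Link) -> Tuple[str, Link]:
    """Return the same adjacency as seen from the neighbor's side."""
    return link.neighbor_device, Link(link.neighbor_port, device_name, link.local_port)


# Sample data only; the address comes from a documentation range.
switch1 = DeviceRecord(
    name="switch1",
    ip_addresses=["192.0.2.1"],
    location="Building A, closet 2",
    contact="network operations",
    function="access switch",
    links=[Link("2/1", "switch2", "1/1")],
)

owner, mirrored = reverse_link(switch1.name, switch1.links[0])
print(owner, mirrored)
```

Storing each connection once and deriving the reverse direction keeps the two ends of a link from drifting out of agreement.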

Policy-based Network Management

The knowledge required to support policy-based network management consists of the following:

Service level agreements
Policies
Rules that are required to implement these agreements and policies

Policy-based network management is covered in Chapter 2.

You may want to implement a two-tier structure: rules described in a device-independent way, plus a set of commands that implement each type of rule on each type of device in your network. For example, you may have a rule that states:

For all switched user ports, set ToS equal to zero

You would then add to your knowledge base the commands required to change the ToS on each device that this rule might be applied to.
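
The two tiers might be sketched as follows. This is an illustration under stated assumptions: the rule name, the device type names, and especially the command strings are placeholders, not real device commands, which you would take from your own devices' documentation.

```python
from typing import Dict, List, Tuple

# Tier 1: device-independent rules (hypothetical names and layout).
RULES: Dict[str, dict] = {
    "user-ports-tos-zero": {
        "applies_to": "switched user port",
        "action": "set_tos",
        "value": 0,
    },
}

# Tier 2: command templates per (action, device type).
# The strings are deliberate placeholders, NOT real CLI syntax.
COMMAND_TEMPLATES: Dict[Tuple[str, str], List[str]] = {
    ("set_tos", "switch-type-a"): [
        "<type-a command: select port {port}>",
        "<type-a command: set ToS {value}>",
    ],
    ("set_tos", "switch-type-b"): [
        "<type-b command: set ToS {value} on port {port}>",
    ],
}


def render(rule_name: str, device_type: str, port: str) -> List[str]:
    """Expand a device-independent rule into commands for one device type."""
    rule = RULES[rule_name]
    templates = COMMAND_TEMPLATES[(rule["action"], device_type)]
    return [t.format(port=port, value=rule["value"]) for t in templates]


print(render("user-ports-tos-zero", "switch-type-b", "2/1"))
```

Adding a new device type then means adding templates, while the rule itself stays unchanged.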

A key part of policy-based network management is verifying that the policies are being met. The information collected during the baselining of your network will provide the base values to compare to the current state of your network and allow you to determine your network's rate of compliance with those policies. This information will be stored in the performance monitoring section of your knowledge base, which is covered next.

Performance Measurement and Reporting

Performance measurement and reporting entails several types of tasks and, therefore, several types of information:

Availability
Response time
Accuracy
Utilization
Reporting

Performance measurement and reporting is covered in detail in Chapter 4, "Performance Measurement and Reporting."

We suggest that you implement availability monitoring first. Your knowledge base should contain information to support this, and if kept flexible, it will support additional performance measurements as well. Keep this in mind when designing your knowledge base.

To support performance measurement and reporting, you'll want to record in your knowledge base which devices are considered "interesting" or important. Note that most network managers choose to ignore most end user devices such as PCs. Monitoring the availability of hubs, switches, routers, servers, and possibly key users will be important. The easiest way of supporting this in your knowledge base may be by supporting groups or classes of devices that you want to monitor at the same service level. Examples of classes of devices are WAN devices, backbone network devices, and server farms.

Let's look at what knowledge is required in your knowledge base to support each component of performance measurement and reporting.

Availability Monitoring

The information your knowledge base will require to support availability monitoring includes the following:

The address to use for availability monitoring
The protocol or protocols to use, such as PING or SNMP
The frequency to monitor, such as every five minutes
The number of attempts to make to reach the device
The length of time to wait for a response

Portions of your network may be "interesting" only during certain times of the day. For example, an office environment that is staffed from 8 a.m. to 5 p.m. may be interesting during these hours and left unmonitored during the other hours of the day. Note that you may need to modify these hours to account for things such as network-based backups or automated report generation.
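
The availability parameters and the time-of-day window could be held together in one entry per monitored address. The sketch below assumes a hypothetical AvailabilityCheck record; the field names and sample values are illustrative only.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional, Tuple


@dataclass
class AvailabilityCheck:
    """Hypothetical knowledge base entry for one monitored address."""
    address: str
    protocol: str                 # for example "ping" or "snmp"
    interval_seconds: int         # how often to poll
    attempts: int                 # tries before declaring the device down
    timeout_seconds: float        # wait per attempt
    window: Optional[Tuple[time, time]] = None  # None means always interesting

    def is_interesting(self, now: time) -> bool:
        """True if the device should be monitored at this time of day."""
        if self.window is None:
            return True
        start, end = self.window
        if start <= end:
            return start <= now < end
        # The window wraps past midnight, e.g. a 22:00 to 02:00 backup window.
        return now >= start or now < end


office = AvailabilityCheck("192.0.2.10", "ping", 300, 3, 2.0, (time(8), time(17)))
backup = AvailabilityCheck("192.0.2.20", "ping", 300, 3, 2.0, (time(22), time(2)))

print(office.is_interesting(time(9, 30)), office.is_interesting(time(22, 0)))
print(backup.is_interesting(time(23, 0)), backup.is_interesting(time(12, 0)))
```

Handling a window that crosses midnight matters for cases such as overnight backups, which would otherwise fall outside a simple start-to-end comparison.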

Response Time Monitoring

To implement response time measurements, you need knowledge of your network at the applications level. You need to know where the users of the applications you want to instrument are located in your network and where the servers are located for those same applications.

Just as you did for availability monitoring, you need to define what your expectations are for response time. You should have information about the response time performance of your network from the baseline. This information should be in your knowledge base and can be used to determine significant variances from this baseline. If you have SLAs or policies, these can be translated into rules that detail what response times are expected in your network and when.

To support response time monitoring, your knowledge base will need to include the following information:

Key source and destination points for measuring response time
Response time expectations for each source/destination pair

You may also want to include the following:

Protocol to use to measure response time
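
A response time entry and a simple comparison against it might look like the following sketch. The record layout, the variance factor of 2.0, and the sample numbers are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseTimeTarget:
    """Hypothetical entry for one source/destination pair."""
    source: str
    destination: str
    baseline_ms: float   # typical value observed during baselining
    max_ms: float        # expectation, for example derived from an SLA
    protocol: str = "ping"


def evaluate(target: ResponseTimeTarget, measured_ms: float,
             variance_factor: float = 2.0) -> str:
    """Classify one measurement against the stored expectation and baseline."""
    if measured_ms > target.max_ms:
        return "violation"
    if measured_ms > target.baseline_ms * variance_factor:
        return "variance"
    return "ok"


erp = ResponseTimeTarget("branch-lan", "erp-server", baseline_ms=40.0, max_ms=150.0)
for sample in (50.0, 100.0, 200.0):
    print(sample, evaluate(erp, sample))
```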

Accuracy Monitoring

Evaluating the accuracy with which data is transmitted in your network requires much the same information you've accumulated for availability monitoring. For a thorough discussion of accuracy, please see the "Accuracy" section in Chapter 4.

You probably want to monitor the same interfaces for accuracy that you monitor for availability. Only a little more information will be required to support accuracy monitoring, including the following:

The objects to monitor for accuracy for each interface type
The level of accuracy you expect for each object and interface type
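
As a rough sketch, the two items above can be stored as a table keyed by interface type and object. ifInErrors is a standard IF-MIB counter; the interface type names and the limits shown are made-up values for illustration, and the caller is assumed to supply counter deltas for one polling interval.

```python
from typing import Dict, Tuple

# Expected maximum error ratio per (interface type, object). Values are illustrative.
EXPECTED_MAX_ERROR_RATIO: Dict[Tuple[str, str], float] = {
    ("ethernet", "ifInErrors"): 0.0001,
    ("serial", "ifInErrors"): 0.001,
}


def error_ratio(errors_delta: int, packets_delta: int) -> float:
    """Errors per packet over one polling interval; 0.0 if no traffic was seen."""
    if packets_delta == 0:
        return 0.0
    return errors_delta / packets_delta


def within_expectation(interface_type: str, obj: str,
                       errors_delta: int, packets_delta: int) -> bool:
    """Compare the observed ratio with the expectation stored for this type."""
    limit = EXPECTED_MAX_ERROR_RATIO[(interface_type, obj)]
    return error_ratio(errors_delta, packets_delta) <= limit


print(within_expectation("ethernet", "ifInErrors", 50, 100000))
print(within_expectation("serial", "ifInErrors", 50, 100000))
```

The same counts pass on one interface type and fail on another, which is the reason for keeping expectations per type rather than globally.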

Utilization Monitoring

For utilization, you may be able to use the same list of interfaces you use for availability and accuracy monitoring because most network managers don't monitor utilization on links going to individual PCs.

The information you'll need to add to your knowledge base for utilization monitoring includes the following:

The utilization limit, or rising threshold, for each interface type
If your utilization-monitoring software supports hysteresis (see the "Hysteresis" section in Chapter 5, "Configuring Events"), the falling threshold for each interface type
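
The effect of pairing a rising threshold with a falling threshold can be shown with a small state machine. This is a minimal sketch of the general hysteresis idea, not the behavior of any particular monitoring product; the class name and the 80/60 percent values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HysteresisThreshold:
    """Emit a rising event once, then stay quiet until the falling threshold is crossed."""
    rising: float
    falling: float
    armed: bool = True  # True means the next rising crossing will generate an event

    def __post_init__(self) -> None:
        if self.falling >= self.rising:
            raise ValueError("falling threshold must be below rising threshold")

    def update(self, value: float) -> Optional[str]:
        if self.armed and value >= self.rising:
            self.armed = False
            return "rising"
        if not self.armed and value <= self.falling:
            self.armed = True
            return "falling"
        return None


threshold = HysteresisThreshold(rising=80.0, falling=60.0)
samples = [50, 85, 90, 78, 82, 55, 88]  # utilization in percent
events = [threshold.update(s) for s in samples]
print(events)
```

Note that the samples 90, 78, and 82 produce no events: utilization hovering around the rising threshold does not generate a new event until it has first dropped to the falling threshold.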

Performance Reporting

Many network managers and network management stations start out by hard-coding the available reports. However, if you design your reporting tool to take the definitions of the reports, and the groups of devices and interfaces you want to report on, from your knowledge base, you will have much more flexibility in producing the desired reports.

You'll want to add the following to your knowledge base:

Report definitions
Groups of devices and interfaces to use for each report type
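
A data-driven report might be sketched like this, with report definitions and groups kept as data rather than code. The report name, group names, and interface labels are hypothetical.

```python
from typing import Dict, List

# Groups of interfaces, as they might be stored in the knowledge base.
GROUPS: Dict[str, List[str]] = {
    "wan": ["router1:Serial0", "router2:Serial0"],
    "backbone": ["switch1:1/1"],
}

# Report definitions refer to groups by name instead of listing interfaces.
REPORTS: Dict[str, dict] = {
    "weekly-wan-utilization": {"metric": "utilization", "group": "wan", "period_days": 7},
}


def plan_report(name: str) -> dict:
    """Resolve a report definition into the concrete targets to report on."""
    definition = REPORTS[name]
    return {
        "metric": definition["metric"],
        "period_days": definition["period_days"],
        "targets": list(GROUPS[definition["group"]]),
    }


print(plan_report("weekly-wan-utilization"))
```

With this layout, adding an interface to a group changes every report that uses the group, without touching the reporting tool.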

Configuring Events

Your next step in implementing network management is configuring events (this is covered in detail in Chapter 5). You should have collected the data you need to add during baselining. This information will build and expand upon the data you put in your knowledge base to support performance measurements.

The specific information required includes which objects are interesting to monitor for specific interfaces or groups of interfaces and what thresholds seem appropriate for these interfaces. Note that you will want to include both rising and falling thresholds for each object to allow you to implement hysteresis to reduce the volume of events received. You may consider implementing thresholds based on network technology. For example, acceptable error thresholds for LANs may be significantly lower than for WAN links.

The information to add to your knowledge base includes the following:

Objects to configure events against
Devices and, if applicable, interfaces to configure events on
Values for each object and device or interface type

Prioritizing Faults

The information needed to process events and faults is an extension of the information you have already collected, as outlined thus far. The difference is the focus. Although you need to know what ports are interesting in order to start availability management, you need to supplement this information with prioritization information. That is, for each port that you might receive an event about, you need some information about whether the port is interesting (already in the knowledge base) and what priority of fault an outage on this port generates.

There are several ways to determine the priority of a fault. You could keep it simple and assign a priority to each port or group, or you could try to derive this information from your knowledge of the network. A good example of where prioritizing faults gets ugly is to look at redundant links. If you lose one link, there is no outage on your network, but fixing the problem is still important because your network now lacks redundancy. You may decide to wait until a scheduled outage period to fix the problem if fixing it requires you to bring down production services. However, if the only other link goes down while the first link is down, the priority of this outage becomes very high. Your fault management system could determine priority by information on each port or from a more generalized view of the topology of the network.
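
The redundant link example can be expressed as a small rule over the state of a redundancy group. The sketch below is one possible reading of that example; the priority labels and link names are assumptions.

```python
from typing import Dict, Optional


def group_priority(link_states: Dict[str, bool]) -> Optional[str]:
    """Derive a fault priority from the up/down state of a redundancy group.

    Returns None when every link is up, "medium" when redundancy is lost or
    reduced but service continues, and "critical" when no link is left.
    """
    links_up = sum(link_states.values())
    if links_up == len(link_states):
        return None
    if links_up == 0:
        return "critical"
    return "medium"


print(group_priority({"link-a": True, "link-b": True}))
print(group_priority({"link-a": False, "link-b": True}))
print(group_priority({"link-a": False, "link-b": False}))
```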

Another use for topology information is event correlation. One issue with availability monitoring is that it is usually done from one place in the network. This can give a radically skewed view of the priority of an outage if, for example, the port that the monitoring system is connected to goes down. The network-management station would perceive this as the whole network being down.

If topology information about the network is available, the fault management system can use event correlation techniques to determine that the directly connected link is down and, therefore, is the place to start determining the actual source of the fault. Placing a high priority on a fault of this type is probably desirable because there may be other faults that you cannot detect during the period in which your network management application cannot contact the network.
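
A very simple form of this correlation walks the topology outward from the management station and stops at the first unreachable device on each path. This is a sketch of the idea at device level only (it does not model individual ports), and the topology and device names are invented for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def root_cause_candidates(topology: Dict[str, Iterable[str]], station: str,
                          unreachable: Set[str]) -> List[str]:
    """Return unreachable devices that border the part of the network still reachable.

    Devices behind them are also unreachable, but their real state is unknown
    until the candidates are restored.
    """
    seen = {station}
    queue = deque([station])
    candidates = set()
    while queue:
        node = queue.popleft()
        for neighbor in topology.get(node, ()):
            if neighbor in seen:
                continue
            if neighbor in unreachable:
                candidates.add(neighbor)  # first failure on this path; stop here
            else:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(candidates)


topology = {
    "nms": ["sw1"],
    "sw1": ["nms", "sw2", "sw3"],
    "sw2": ["sw1", "srv1"],
    "sw3": ["sw1"],
    "srv1": ["sw2"],
}

# Everything looks down from the station, but only sw1 is adjacent to it.
print(root_cause_candidates(topology, "nms", {"sw1", "sw2", "sw3", "srv1"}))
print(root_cause_candidates(topology, "nms", {"sw2", "srv1"}))
```

In the first case the whole network appears down, yet the correlation points to the directly connected device as the place to start.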

The priority of a fault can vary, depending on influences outside of the network itself. These outside influences include the following:

SLAs and associated policies
Users impacted by the outage (for example, the CEO or your boss)
The time of day the outage occurs (3 p.m. may have higher weight than 3 a.m.)
The type of traffic flow (for example, ERP applications)
The criticality of the segment(s) involved (for example, the segment that handles sales receipts)

If you defined an SLA, you may have specified criteria that you need to take into consideration, such as an agreement that a certain category of fault will be fixed within a certain period of time. As that time approaches, the priority of the fault may need to be increased to ensure that proper attention is given to it, considering your SLA.
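
Raising the priority as an SLA deadline approaches could be sketched as follows. The numeric scale (1 is most urgent), the single escalation step, and the 75 percent trigger point are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def escalated_priority(base_priority: int, opened_at: datetime,
                       fix_within: timedelta, now: datetime,
                       escalate_at_fraction: float = 0.75) -> int:
    """Return the fault priority, raised one level once most of the SLA time is used.

    Priority 1 is the most urgent, so escalation lowers the number.
    """
    elapsed_fraction = (now - opened_at) / fix_within
    if elapsed_fraction >= escalate_at_fraction:
        return max(1, base_priority - 1)
    return base_priority


opened = datetime(2024, 1, 1, 9, 0)
sla = timedelta(hours=4)
print(escalated_priority(3, opened, sla, opened + timedelta(hours=1)))
print(escalated_priority(3, opened, sla, opened + timedelta(hours=3, minutes=30)))
```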

Another consideration that may influence the priority of a given fault is who specifically is affected by it. For example, if the CEO of the company is affected by a fault, the priority should probably be increased. This is the mahogany row effect.

Another consideration that could modify the priority of a fault is the time of day it occurs. The previous example covered in the availability monitoring section, in which an office network is very important during business hours but is much less important during other hours, is one example of time modifying the priority of the fault. Another example is a trading floor, in which it is desirable to have the network up at all times, but it is critical to have the network up during trading hours.

You could also determine the priority of a fault by periodically examining the traffic statistics on the link affected by a fault, determining how much traffic and what type of traffic will be affected, and storing this information in your knowledge base. Obviously, this requires knowledge about traffic flow on your network.

Another way of evaluating fault priority is by looking at the financial consequences of a fault. In this case, you need information about what financial contribution each portion of your network makes to your company's bottom line.

We recommend that you start out with a simple priority scheme and enhance it where and when required. If you start with a simple yet flexible scheme, you'll be able to add capabilities as you expand your network management.

So, the items you'll want to add to your knowledge base to support prioritizing faults will include the following:

Priority of interfaces and ports
How time of day affects priorities
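
A simple starting scheme along these lines keeps a base priority per port and adjusts it by time of day. The sketch assumes a scale from 1 (most urgent) to 5, a single business-hours window, and hypothetical device and port names.

```python
from datetime import time
from typing import Dict, Tuple

# Base priority per (device, port); 1 is most urgent. Entries are illustrative.
PORT_PRIORITY: Dict[Tuple[str, str], int] = {
    ("switch1", "2/1"): 2,
    ("router1", "Serial0"): 1,
}
DEFAULT_PRIORITY = 4
LOWEST_PRIORITY = 5
BUSINESS_HOURS = (time(8), time(17))


def fault_priority(device: str, port: str, at: time) -> int:
    """Look up the port's base priority and relax it one level outside business hours."""
    base = PORT_PRIORITY.get((device, port), DEFAULT_PRIORITY)
    start, end = BUSINESS_HOURS
    if start <= at < end:
        return base
    return min(base + 1, LOWEST_PRIORITY)


print(fault_priority("switch1", "2/1", time(15, 0)))
print(fault_priority("switch1", "2/1", time(3, 0)))
```

Because both tables are plain data, the scheme can later be extended with other influences such as SLA deadlines or affected users without restructuring it.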