This section breaks out each network management task you'll want to implement and discusses the knowledge you'll want to have in your knowledge base to support these tasks. The order in which these tasks are discussed is in general the optimal order for performing them, but you may need to modify the sequence for your network.

Network Inventory

The first task you need to do to manage your network is to perform a network audit. Out of this audit, you will obtain the information required to create an inventory of your network. Performing this audit is covered in Chapter 1, "Conducting a Network Audit." The information collected should include the following:
If the device is connected to a switch, you may want to include information about this connection, as follows:
You may want to include the following information about these devices in your knowledge base. This information may be useful in automating your performance and fault management functions:
To keep track of your network's layer 2 connectivity, you'll need to add the following information to your knowledge base:
You may also want to track your layer 3 connectivity. You'll want to add the following information to your knowledge base:
Policy-based Network Management

The knowledge required to support policy-based network management consists of the following:
Policy-based network management is covered in Chapter 2. You may want to implement a two-tier structure: a set of rules described in a device-independent way, and a set of commands that implement each type of rule on each type of device in your network. For example, you may have a rule that states:
You would then add to your knowledge base the commands required to change the ToS on each device that this rule might be applied to. A key part of policy-based network management is verifying that the policies are being met. The information collected during the baselining of your network will provide the base values to compare to the current state of your network and allow you to determine your network's rate of compliance with those policies. This information will be stored in the performance monitoring section of your knowledge base, which is covered next.

Performance Measurement and Reporting

Performance measurement and reporting entails several types of tasks and, therefore, several types of information:
Performance measurement and reporting is covered in detail in Chapter 4, "Performance Measurement and Reporting." We suggest that you implement availability monitoring first. Your knowledge base should contain information to support this, and if kept flexible, it will support additional performance measurements as well. Keep this in mind when designing your knowledge base.

To support performance measurement and reporting, you'll want to record in your knowledge base which devices are considered "interesting" or important. Note that most network managers choose to ignore most end-user devices such as PCs. Monitoring the availability of hubs, switches, routers, servers, and possibly key users will be important. The easiest way of supporting this in your knowledge base may be to define groups or classes of devices that you want to monitor at the same service level. Examples of classes of devices are WAN devices, backbone network devices, and server farms.

Let's look at what knowledge is required in your knowledge base to support each component of performance measurement and reporting.

Availability Monitoring

The information your knowledge base will require to support availability monitoring includes the following:
Portions of your network may be "interesting" only during certain times of the day. For example, an office environment that is staffed from 8 a.m. to 5 p.m. may be interesting during these hours and not monitored during the other hours of the day. Note that you may need to modify these hours to account for things such as network-based backups or automated report generation.

Response Time Monitoring

To implement response time measurements, you need knowledge of your network at the application level. You need to know where the users of the applications you want to instrument are located in your network and where the servers for those same applications are located. Just as you did for availability monitoring, you need to define what your expectations are for response time. You should have information about the response time performance of your network from the baseline. This information should be in your knowledge base and can be used to determine significant variances from this baseline. If you have SLAs or policies, these can be translated into rules that detail the response times expected in your network and the times at which those expectations apply. To support response time monitoring, your knowledge base will need to include the following information:
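As a rough illustration of how baseline response times held in a knowledge base could be used to flag significant variances, the following Python sketch compares a new measurement with the baseline mean plus a multiple of the standard deviation. The function name `is_significant_variance`, the sample values, and the three-sigma cutoff are assumptions made for this example only; your own SLAs or policies may call for a different rule.

```python
from statistics import mean, stdev


def is_significant_variance(baseline_ms, measured_ms, sigmas=3.0):
    """Return True if measured_ms is more than `sigmas` standard
    deviations above the mean of the baseline samples.

    baseline_ms: response times (milliseconds) collected during baselining.
    sigmas: hypothetical cutoff; tune it to your own expectations.
    """
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples")
    limit = mean(baseline_ms) + sigmas * stdev(baseline_ms)
    return measured_ms > limit


# Hypothetical baseline for one user-location/server pair.
baseline = [40.0, 42.0, 38.0, 41.0, 39.0]
slow = is_significant_variance(baseline, 60.0)
normal = is_significant_variance(baseline, 43.0)
```

A check of this kind only looks for slower-than-baseline responses; it says nothing about whether the baseline itself meets your expectations.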
You may also want to include the following:
Accuracy Monitoring

Evaluating the accuracy with which data is transmitted in your network requires much the same information you've accumulated for availability monitoring. For a thorough discussion of accuracy, please see the "Accuracy" section in Chapter 4. You probably want to monitor the same interfaces for accuracy that you monitor for availability. Only a little more information will be required to support accuracy monitoring, including the following:
Utilization Monitoring

For utilization, you may be able to use the same list of interfaces you use for availability and accuracy monitoring because most network managers don't monitor utilization on links going to individual PCs. The information you'll need to add to your knowledge base for utilization monitoring includes the following:
Performance Reporting

Many network managers and network management stations start out by hard-coding the available reports. However, if you design your reporting tool to take the definitions of the reports, and of the groups of devices and interfaces you want to report on, from your knowledge base, you will have much more flexibility in producing the desired reports. You'll want to add the following to your knowledge base:
Configuring Events

Your next step in implementing network management is configuring events (this is covered in detail in Chapter 5). You should have collected the data you need to add during baselining. This information will build and expand upon the data you put in your knowledge base to support performance measurements. The specific information required includes which objects are interesting to monitor for specific interfaces or groups of interfaces and what thresholds seem appropriate for these interfaces. Note that you will want to include both rising and falling thresholds for each object to allow you to implement hysteresis to reduce the volume of events received. You may consider implementing thresholds based on network technology. For example, acceptable error thresholds for LANs may be significantly lower than for WAN links. The information to add to your knowledge base includes the following:
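To show why a pair of rising and falling thresholds reduces event volume, here is a minimal Python sketch of hysteresis. The class name `HysteresisThreshold`, the threshold values, and the sample data are hypothetical and chosen only for illustration; they are not taken from any particular device or management platform.

```python
from dataclasses import dataclass


@dataclass
class HysteresisThreshold:
    rising: float       # a rising event fires at or above this value
    falling: float      # the threshold re-arms at or below this value
    armed: bool = True  # True while a rising event is allowed to fire

    def sample(self, value):
        """Return 'rising', 'falling', or None for one polled value."""
        if self.armed and value >= self.rising:
            self.armed = False
            return "rising"
        if not self.armed and value <= self.falling:
            self.armed = True
            return "falling"
        return None


def events_for(threshold, samples):
    """Collect the events generated by a sequence of polled values."""
    results = (threshold.sample(v) for v in samples)
    return [event for event in results if event is not None]


# Hypothetical utilization samples (percent) hovering around 80.
utilization = [50, 85, 78, 82, 79, 55, 90]
events = events_for(HysteresisThreshold(rising=80, falling=60), utilization)
```

With a single threshold at 80, the samples 85 and 82 would each trigger an event. With the falling threshold at 60, the second crossing is suppressed because the value never dropped far enough to re-arm the threshold.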
Prioritizing Faults

The information needed to process events and faults is an extension of the information you have already collected, as outlined thus far. The difference is the focus. Although you need to know what ports are interesting in order to start availability management, you need to supplement this information with prioritization information. That is, for each port that you might receive an event about, you need some information about whether the port is interesting (already in the knowledge base) and what priority of fault an outage on this port generates.

There are several ways to determine the priority of a fault. You could keep it simple and assign a priority to each port or group, or you could try to derive this information from your knowledge of the network. Redundant links are a good example of where prioritizing faults gets complicated. If you lose one link, there is no outage on your network, but fixing the problem is still important because your network now lacks redundancy. You may decide to wait until a scheduled outage period to fix the problem if fixing it requires you to bring down production services. However, if the only other link goes down while the first link is down, the priority of this outage becomes very high. Your fault management system could determine priority from information stored for each port or from a more generalized view of the topology of the network.

Another use for topology information is event correlation. One issue with availability monitoring is that it is usually done from one place in the network. This can give a radically skewed view of the priority of an outage if, for example, the port that the monitoring system is connected to goes down. The network management station would perceive this as the whole network being down.
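The skewed view described above can be illustrated with a small Python sketch of topology-based correlation. It assumes a simple tree in which every device has exactly one upstream neighbor on the path back to the management station; the `upstream` mapping, the device names, and the `root_causes` function are hypothetical. A network with redundant paths would need a fuller graph model.

```python
def root_causes(upstream, unreachable):
    """Return the unreachable devices whose upstream neighbor is still
    reachable; these are the places to start looking for the fault.

    upstream: maps each device to its next hop toward the management
              station, or None if it connects to the station directly.
    unreachable: devices that currently fail availability polls.
    """
    down = set(unreachable)
    return sorted(d for d in down if upstream.get(d) not in down)


# Hypothetical topology: station - sw1 - rtr1 - (sw2, sw3)
upstream = {"sw1": None, "rtr1": "sw1", "sw2": "rtr1", "sw3": "rtr1"}

# The directly connected switch fails, so every device looks down.
everything_down = root_causes(upstream, ["sw1", "rtr1", "sw2", "sw3"])

# Only the router and the devices behind it fail.
router_down = root_causes(upstream, ["rtr1", "sw2", "sw3"])
```

In both cases a single device is reported rather than one fault per unreachable device, which is the point of correlating events against the topology.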
If topology information about the network is available, the fault management system can use event correlation techniques to determine that the directly connected link is down and, therefore, is the place to start determining the actual source of the fault. Placing a high priority on a fault of this type is probably desirable because there may be other faults that you cannot detect during the period in which your network management application cannot contact the network. The priority of a fault can vary, depending on influences outside of the network itself. These outside influences include the following:
If you defined an SLA, you may have specified criteria that you need to take into consideration, such as an agreement that a certain category of fault will be fixed within a certain period of time. As that time approaches, the priority of the fault may need to be increased to ensure that it receives proper attention under your SLA.

Other considerations that may influence the priority of a given fault include who specifically is affected by the fault. For example, if the CEO of the company is affected by a fault, the priority should probably be increased. This is the mahogany row effect.

Another consideration that could modify the priority of a fault is the time of day it occurs. The earlier example in the availability monitoring section, in which an office network is very important during business hours but much less important during other hours, is one example of time modifying the priority of a fault. Another example is a trading floor, where it is desirable to have the network up at all times but critical to have it up during trading hours.

You could also determine the priority of a fault by periodically examining the traffic statistics on the link affected by a fault, determining how much traffic and what type of traffic will be affected, and storing this information in your knowledge base. Obviously, this requires knowledge about traffic flow on your network.

Another way of evaluating fault priority is by looking at the financial consequences of a fault. In this case, you need information about what financial contribution each portion of your network makes to your company's bottom line.

We recommend that you start out with a simple priority scheme and enhance it where and when required. If you start with a simple yet flexible scheme, you'll be able to add capabilities as you expand your network management. So, the items you'll want to add to your knowledge base to support prioritizing faults will include the following:
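As one possible starting point for a simple yet flexible scheme, the Python sketch below assigns a base priority per device group and then adjusts it for time of day and for an approaching SLA deadline. The group names, the priority numbers (1 is most urgent, 5 least), the 8 a.m. to 5 p.m. window, and the 75 percent escalation point are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical base priorities per device group; 1 is the most urgent.
BASE_PRIORITY = {"backbone": 1, "server_farm": 2, "wan": 2, "user": 4}


def fault_priority(group, opened, now, sla_fix_within, business_hours=(8, 17)):
    """Return a priority from 1 (most urgent) to 5 (least urgent)."""
    priority = BASE_PRIORITY.get(group, 4)

    # Faults outside business hours are treated as one step less urgent.
    start, end = business_hours
    if not (start <= now.hour < end):
        priority += 1

    # Escalate one step once 75% of the SLA repair time has elapsed.
    if now - opened >= 0.75 * sla_fix_within:
        priority -= 1

    return max(1, min(priority, 5))


opened = datetime(2024, 1, 8, 9, 0)
sla = timedelta(hours=4)
early = fault_priority("wan", opened, datetime(2024, 1, 8, 10, 0), sla)
late = fault_priority("wan", opened, datetime(2024, 1, 8, 12, 30), sla)
```

Because the adjustments are separate steps, further modifiers such as affected users or traffic volume could be added later without changing the base table.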