Section 1.3. Applying the Concepts of Network Management

1.3. Applying the Concepts of Network Management

Being able to apply the concepts of network management is as important as learning how to use SNMP. This section of the chapter provides insights into some of the issues surrounding network management.

1.3.1. Business Case Requirements

The endeavor of network management involves solving a business problem through an implementation of some sort. A business case is developed to understand the impact of implementing some sort of task or function. It looks at how, for example, network administrators do their day-to-day jobs. The basic idea is to reduce costs and increase effectiveness. If the implementation doesn't save a company any money while providing more effective services, there is almost no need to implement a given solution.

1.3.2. Levels of Activity

Before applying management to a specific service or device, you must understand the four possible levels of activity and decide what is appropriate for that service or device:

Inactive: No monitoring is being done, and, if you did receive an alarm in this area, you would ignore it.
Reactive: No monitoring is being done; you react to a problem if it occurs.
Interactive: You monitor components but must interactively troubleshoot them to eliminate side-effect alarms and isolate a root cause.
Proactive: You monitor components, and the system provides a root-cause alarm for the problem at hand and initiates predefined automatic restoral processes where possible to minimize downtime.

1.3.3. Reporting of Trend Analysis

The ability to monitor a service or system proactively begins with trend analysis and reporting . Chapters 12 and 13 describe two tools that are capable of aiding in trend reporting. In general, the goal of trend analysis is to identify when systems, services, or networks are beginning to reach their maximum capacity, with enough lead time to do something about it before it becomes a real problem for end users. For example, you may discover a need to add more memory to your database server or upgrade to a newer version of some application server software that adds a performance boost. Doing so before it becomes a real problem can help your users avoid frustration and possibly keep you employed.

1.3.4. Response Time Reporting

If you are responsible for managing any sort of server (HTTP, SMTP, etc.), you know how frustrating it can be when users come knocking on your door to say that the web server is slow or that surfing the Internet is slow. Response time reporting measures how various aspects of your network (including systems) are performing with respect to responsiveness. Chapter 11 shows how to monitor services with SNMP.

1.3.5. Alarm Correlation

Alarm correlation deals with narrowing down many alerts and events into a single alert or several events that depict the real problem. Another name for this is root-cause analysis. The idea is simple, but it tends to be difficult in practice. For example, when a web server on your network goes down, and you are managing all devices between you and the server (including the switch the server is on and the router), you may get any number of alerts including ones for the server being down, the switch being down, or the router being down, depending on where the real failure is.

Let's say the router is the real issue (for example, an interface card died). You really only need to know that the router is down. Network management systems can often detect when some device or network is unreachable due to varying reasons. The key in this situation is to correlate the server, switch, and router down events into a single high-level event detailing that the router is down. This high-level event can be made up of all the entities and their alarms that are affected by the router being down, but you want to shield an operator from all of these until he is interested in looking at them. The real problem that needs to be addressed is the router's failure. Keeping this storm of alerts and alarms away from the operator helps with overall efficiency and improves the trouble resolution capabilities of the staff.

Clearing alarms is also important. For example, once the router is back up and running, presumably it's going to send an SNMP message that it has come back to life, or maybe a network management system will discover that it's back up and create an alarm to this effect. This notion of state transition, from bad to good, is common. It helps operators know that something is indeed up and operational. It also helps with trending. If you see that a certain device is constantly unreliable, you may want to investigate why.

1.3.6. Trouble Resolution

The key to trouble resolution is knowing that what you are looking at is valuable and can help you resolve the problem. As such, alarms and alerts should aid an operator in resolving the problem. For example, when your router goes down, a cryptic message like "router down" is not helpful. If possible, alerts and alarms should provide the operator with enough detail so that she can effectively troubleshoot and resolve the problem.