Chapter 8. Polling and Thresholds | Essential SNMP, Second Edition

SNMP gives you the ability to poll your devices regularly, collecting their management information. Furthermore, you can tell the NMS that there are certain thresholds that, if crossed, require some sort of action. For example, you might want to be notified if the traffic at an interface jumps to an extremely high (or low) value; that event might signal a problem with the interface, or insufficient capacity, or even a hostile attack on your network. When such a condition occurs, the NMS can forward an alarm to an event-correlation engine or have an icon on an OpenView map flash. To make this more concrete, let's say that the NMS is polling the status of an interface on a router. If the interface goes down, the NMS reports what has happened so that the problem can be resolved quickly.

SNMP can perform either internal or external polling. Internal polling is typically used in conjunction with an application that runs as a daemon or a facility such as cron that periodically runs a local application. External polling is done by the NMS. The OpenView NMS provides a great implementation of external polling; it can graph and save your data for later retrieval or notify you if it looks like something has gone wrong. Many software packages make good NMSs, and if you're clever about scripting, you can throw together an NMS that's fine-tuned to your needs. In this chapter, we will look at a few of the available packages.

Polling is like checking the oil in a car; this analogy may help you to think about appropriate polling strategies. Three distinct items concern us when checking the oil: the physical process (opening the hood, pulling out the dipstick, and putting it back in); the preset gauge that tells us if we have a problem (is the level too high, too low, or just right?); and the frequency with which we check it (once an hour, week, month, or year).

Let's assume that you ask your mechanic to go to the car and check the oil level. This is like an NMS sending a packet to a router to perform an snmpget on some piece of information. When the mechanic is finished, you pay him $30 and go on your way. Because a low oil level may result in real engine damage, you want to check the oil regularly. So, how long should you wait until you send the mechanic out to the car again? Checking the oil has a cost: in this scenario, you paid $30. In networks, you pay with bandwidth. Like money, you have only so much bandwidth, and you can't spend it frivolously. So, the real question is, how long can you wait before checking the oil again without killing your budget?

The answer lies within the car itself. A finely tuned race car needs to have its fluids at perfect levels. A VW Beetle,^[*] unlike a race car, can have plus or minus a quart at any time without seriously hindering its performance. You're probably not driving a Beetle, but you're probably not driving a race car either. So, you decide that you can check the oil level about every three weeks. But how will you know what is low, high, or just right?

^[*] The old ones from the 1960s, not the fancy modern ones.

The car's dipstick tells you. Your mechanic doesn't need to know the car model, engine type, or even the amount of oil in the car; he only needs to know what value he gets when he reads the dipstick. On a network, a device's dipstick is called an agent, and the dipstick reading is the SNMP response packet. All SNMP-compatible devices contain standardized agents (dipsticks) that can be read by any mechanic (NMS). It is important to keep in mind that the data gathered is only as good as the agent, or mechanic, that generated it.

In both cases, some predefined threshold determines the appropriate action. In the oil example, the threshold is "low oil," which triggers an automatic response: add oil. (Crossing the "high oil" threshold might trigger a different kind of response.) If we're talking about a router interface, the possible values we might receive are "up" and "down." Imagine that your company's gateway to the Internet, a port on a router, must stay up 24 hours a day, 7 days a week. If that port goes down, you could lose $10,000 for each second it stays down. Would you check that port often? Most organizations won't pay someone to check router interfaces every hour, let alone every second. Even if you had the time, that wouldn't be fun, right? This is where SNMP polling comes in. It allows network managers to guarantee that mission-critical devices are up and functioning properly, without having to pay someone to constantly monitor routers, servers, etc.

Once you determine your monitoring needs, you can specify at what interval you would like to poll a device or set of devices. This is typically referred to as the poll interval and can be as granular as you like (e.g., every second, every hour, etc.). The threshold value at which you take action doesn't need to be binary: you might decide that something's obviously wrong if the number of packets leaving your Internet connection falls below a certain level.

Whenever you are figuring out how often to poll a device, remember to keep three things in mind: the device's agent/CPU, bandwidth consumption, and the types of values you are requesting. Some values you receive may be 10-minute averages. If this is the case, it is a waste to poll every few seconds. Review the MIBs surrounding the data for which you are polling. Our preference is to start polling fairly often. Once we see the trends and peak values, we back off. This can add congestion to the network but ensures that we don't miss any important information.

Whatever the frequency at which you poll, keep in mind other things that may be happening on the network. Be sure to stagger your polling times to avoid other events if possible. Keep in mind backups, data loads, routing updates, and other events that can cause stress on your networks or CPUs.