Determining What to Monitor and Alert Upon


Monitoring and alerting go hand in hand. The real value of system monitoring lies in being able to alert an administrator when something goes wrong. As such, it is important to determine which conditions should generate an alert. As a general rule, any outage in a nonredundant system should be alerted on immediately, as should any outage in a redundant system that would leave only a single system running. Any security-related event should also generate an alert.

Most monitoring and alerting systems today not only track the failures of systems, subsystems, and services but also detect these items coming back online. It is always worthwhile to trigger an event when a system returns to an "up" state. This can mean the difference between coming in to the office at 4 a.m. and going back to sleep.

Mail System Alerts

Alerts regarding a mail system should not be sent via e-mail only. If e-mail is the only choice, try to have more than one mail server available to relay the message. Similarly, alerts regarding an Internet connection should be sent via a pager that is not dependent on that connection.


Hardware Alerting

Although the knowledge that a computer is responding to a ping is of only limited value, the knowledge that a computer has stopped responding to ping is quite useful. Most network hardware places a high priority on ICMP traffic, so it is unusual for a router or switch to fail to respond to ping due to excessive load. This means that hardware monitoring rarely produces false positives, and failures it detects should always generate an alert. These alerts should use event correlation so that the failure of a router generates an alert but the apparent failures of the devices behind it do not create additional alerts.
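The correlation step described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the device names and the parent map are assumed examples, and a real system would build the topology from its own configuration.

```python
# Hypothetical topology: each device maps to its upstream parent
# (the next hop between it and the monitoring station).
TOPOLOGY = {
    "core-router": None,            # directly reachable from the monitor
    "edge-switch": "core-router",
    "server-a": "edge-switch",
    "server-b": "edge-switch",
}

def correlate(down: set[str]) -> list[str]:
    """Return only root causes: down devices whose entire upstream path is up."""
    alerts = []
    for device in down:
        # Walk toward the monitor; if any ancestor is also down,
        # this device's failure is just a symptom and is suppressed.
        node = TOPOLOGY.get(device)
        symptom = False
        while node is not None:
            if node in down:
                symptom = True
                break
            node = TOPOLOGY.get(node)
        if not symptom:
            alerts.append(device)
    return sorted(alerts)
```

With this topology, an outage of the edge switch makes both servers appear down, but only the switch itself is reported: `correlate({"edge-switch", "server-a", "server-b"})` returns `["edge-switch"]`.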

E-Mail Is an Excellent Vehicle for Sending Alerts

Always be aware of whether an alert method depends on the device being monitored. E-mail is an excellent vehicle for sending alerts, but it becomes much less useful when it's e-mailing a pager to say that the Internet connection is down.


Port-Level Alerting

It is not unusual for a port-level monitoring probe to time out before the system has properly responded. As such, it is often necessary to set failure thresholds: rather than having a single port-level failure immediately trigger an alert, require failures on multiple consecutive polling cycles before an alert is generated.
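The consecutive-failure rule might be sketched like this; the threshold of three cycles is an assumed value, not one prescribed by the text.

```python
class PortCheck:
    """Require several consecutive failed polls before alerting, so a
    single slow response or timeout doesn't page anyone (a sketch)."""

    def __init__(self, required_failures: int = 3):
        self.required = required_failures
        self.streak = 0

    def record(self, ok: bool) -> bool:
        """Record one polling cycle; return True when an alert should fire."""
        if ok:
            self.streak = 0          # any success resets the count
            return False
        self.streak += 1
        return self.streak == self.required  # fire once, on the Nth failure
```

A single timeout followed by a success produces no alert; only three failures in a row do.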

Service-Level Alerting

Service-level monitoring asks the operating system whether a service is running. As such, service-level checks rarely produce false positives, and service failures reported by the operating system should immediately generate alerts. Services returning to a "running" state should also generate an alert; these "up and down" alerts are often used to determine system uptime.
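Because the "up and down" alerts are timestamped, they can be used to compute uptime directly. A minimal sketch, assuming the service is up at the start of the reporting period and the event list is sorted:

```python
def uptime_percent(events, period_start, period_end):
    """events: sorted (timestamp, "up"|"down") pairs from the alert log.
    Returns the percentage of the period the service was running."""
    up_since = period_start   # assumed up when the period opens
    up_total = 0.0
    for ts, state in events:
        if state == "down" and up_since is not None:
            up_total += ts - up_since
            up_since = None
        elif state == "up" and up_since is None:
            up_since = ts
    if up_since is not None:          # still up at the end of the period
        up_total += period_end - up_since
    return 100.0 * up_total / (period_end - period_start)
```

For example, a service that went down at t=50 and came back at t=60 over a 100-unit period shows 90% uptime.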

Application-Level Alerting

Application-level alerts are almost always generated by specific counters meeting specific thresholds. Because application parameters can spike under burst conditions, it is necessary to specify how long a parameter must remain at a given level before triggering an alert. Doing so greatly reduces false positives.

As with application-level monitoring, you should work closely with the application owner to determine thresholds for alerting. The application owner will have a much better understanding of the application and will know how to spot abnormal behavior.

Performance Alerting

Performance counters tend to fluctuate greatly during the operation of a system. As a result, point-in-time monitoring can very easily result in false positives. By setting thresholds for not only the value of a performance counter but also for how long the counter must remain above a threshold, an administrator can greatly reduce the number of false positives generated by the system.
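A hold-time rule of this kind applies equally to application counters and performance counters. The sketch below is one way to express it; the 80% limit and 300-second hold used in the example are assumed tuning values, not figures from the text.

```python
class SustainedThreshold:
    """Alert only when a counter stays at or above `limit` for at least
    `hold_seconds`, filtering out momentary spikes (a sketch)."""

    def __init__(self, limit: float, hold_seconds: float):
        self.limit = limit
        self.hold = hold_seconds
        self.breach_start = None   # when the current breach began, if any

    def sample(self, value: float, now: float) -> bool:
        """Feed one sample; return True once the breach has lasted long enough."""
        if value < self.limit:
            self.breach_start = None   # dipped below the limit; reset
            return False
        if self.breach_start is None:
            self.breach_start = now
        return now - self.breach_start >= self.hold
```

A counter that spikes to 95% for a few seconds never fires; one that sits above 80% for five minutes does.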

Be Aware of Any Service-Level Agreements

Be aware of any service-level agreements when determining the thresholds for triggering a performance-based alert. If a system has an SLA requiring it to be back in service within one hour, using a threshold of more than 10 minutes would be ill-advised.


Administrators must depend on their familiarity with servers to determine the thresholds for alerting. Although a system might run fine at 75% utilization, if the system normally spikes to no more than 10% utilization, a sustained load of 20% might be enough to cause the administrator some concern. Avoid falling into the trap of only generating alerts if something is "pegged" at max utilization for an extended period of time. Any sustained and drastic change in system behavior is most likely a sign that something is wrong and the appropriate resources should be notified.
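A two-sided check against a server's normal range captures this idea: alert on any sustained reading well outside normal behavior, low or high. The 30%-40% baseline below mirrors the example in the next section's discussion of defaults; the band factor is an assumed tuning knob.

```python
def deviates(value: float, baseline_low: float, baseline_high: float,
             band: float = 2.0) -> bool:
    """True if a sustained reading falls well below or well above the
    server's normal range (a sketch; `band` widens the normal range
    symmetrically before anything is flagged)."""
    center = (baseline_low + baseline_high) / 2
    half = (baseline_high - baseline_low) / 2
    return abs(value - center) > band * half
```

For a server that normally runs 30%-40%, both a sustained 5% and a sustained 75% are flagged, while 35% is not.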

Alerting Pitfalls

There are many ways to get an alert to the appropriate resource, and it is to your advantage to use more than one method for each alert. The two most common are e-mail and pager. E-mail can be a very effective method, but it is susceptible to failures in the e-mail system and the Internet connection. There is nothing more annoying than receiving two messages back to back stating "The Internet router is down!" and "The Internet router is back up!" because the alerts could not be delivered until the outage was over. Mail server failures can elicit the same response. Whenever possible, have the alert sent via a medium that isn't a single point of failure. E-mail combined with a pager that is dialed via a phone line is an excellent way to ensure that critical alerts reach their intended target.
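Fanning one alert out over independent channels might look like the following sketch. The `send_email` and `send_page` functions here are hypothetical placeholders for real delivery code (an SMTP relay, a modem-dialed paging service, and so on).

```python
def send_email(msg: str) -> bool:
    print(f"EMAIL: {msg}")   # placeholder; real code would use SMTP
    return True

def send_page(msg: str) -> bool:
    print(f"PAGE:  {msg}")   # placeholder; real code would dial a paging service
    return True

def alert(msg: str, channels=(send_email, send_page)) -> int:
    """Try every channel and return how many deliveries succeeded."""
    delivered = 0
    for channel in channels:
        try:
            if channel(msg):
                delivered += 1
        except Exception:
            pass   # one failing channel must not block the others
    return delivered
```

The key design point is that the channels are tried unconditionally and independently, so a dead mail server cannot silence the page about its own failure.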

If alerts are going to an onsite 24/7 resource, make sure that the staff responds correctly to alerts by performing regular tests. Monitoring a "fake" server that can be used to trigger alerts on demand is a good way to keep the monitoring staff on their toes. An onsite monitoring staff that ignores alerts or doesn't know how to react isn't benefiting anyone.

Don't just take the default values offered by the monitoring package. Some servers simply don't behave in a "normal" fashion; generate alerts on values that are outside that server's normal behavior. Don't assume that a resource will hit 100% if there is a problem, and don't focus only on high utilization. If a server has run between 30% and 40% utilization for the past year, a sustained 5% should be just as alarming as a sustained 75%. Both situations suggest that something bad has happened.



Microsoft Windows Server 2003 Insider Solutions
ISBN: 0672326094
Year: 2003
Pages: 325
