Deciding What to Monitor | Microsoft Windows Server 2003 Insider Solutions

There is a plethora of parameters that you can monitor. Some are very useful; some are not. By limiting the monitored items to only those in which you are interested, you reduce the chance of missing important information in the sheer volume of incoming data. Because different types of monitoring have access to different types of data, each of the following sections ends with recommendations of specific items to monitor with that type of monitoring.

Monitoring Hardware

Simple hardware monitoring, such as pinging a device to see if it will respond, is one way to determine if a machine is up and running. This essentially tests layers 1, 2, and 3 of the OSI model. The problem with this type of monitoring is that it only tells you whether the box itself is responding to ping. It does not tell you whether the machine has a particular service running or whether that service is in fact running properly. Hardware monitoring is nonetheless a good basis for other types of monitoring because it enables you to do things such as event correlation. If a number of machines stop responding to a particular query and the router in front of them is not responding to ping, the software can safely assume that the machines are unreachable because the router itself is down. For this level of monitoring, basic hardware monitoring is fairly effective.
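The correlation logic described above can be sketched in a few lines. This is a minimal illustration rather than the way any particular monitoring package works; the device names and the upstream map are hypothetical, and the ping results are assumed to have been collected already.

```python
def root_causes(is_up, upstream):
    """Return the down devices that should raise an alert.

    is_up:    maps device name -> True if it answered the last ping.
    upstream: maps device name -> the device it is reached through
              (devices with no entry are reached directly).
    A down device is suppressed when anything on its upstream path is
    also down, because that failure already explains the outage.
    """
    alerts = []
    for device, up in is_up.items():
        if up:
            continue
        parent = upstream.get(device)
        seen = set()  # guards against loops in a badly built map
        explained = False
        while parent is not None and parent not in seen:
            seen.add(parent)
            if not is_up.get(parent, True):
                explained = True
                break
            parent = upstream.get(parent)
        if not explained:
            alerts.append(device)
    return sorted(alerts)


# Hypothetical topology: two servers sit behind one branch router.
status = {"branch-router": False, "mail01": False,
          "web01": False, "core-switch": True}
paths = {"mail01": "branch-router", "web01": "branch-router",
         "branch-router": "core-switch"}
print(root_causes(status, paths))  # prints ['branch-router']
```

Three devices are down, but only the router is reported, which is the single alert an administrator actually needs.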

Ensure that All Interfaces Are Being Monitored

When monitoring a network with redundant connections, it is critical to ensure that all interfaces are being monitored. If only the "far side" IP addresses are being monitored, a packet could take an alternative route to get there, and a failed local interface could be missed.

Recommended monitoring points include the following:

Local routers
Remote routers
ISP
Switches
VPN devices

Port-Level Monitoring

Going beyond simple ping tests is an effective way to get more information about a system's health. Well-known services such as SMTP or POP3 can be monitored easily by querying the server on ports 25 or 110, respectively. This goes beyond the simple ping test by ensuring that the port is available, which generally means that the application is running. This enables you to determine whether a particular service is available, and it can be especially effective in finding problems before users report them. One of the primary goals of monitoring is to proactively capture system problems before users discover them. Most monitoring packages on the market can monitor a system at the port level.

By combining hardware-level monitoring with port-level monitoring, a clever administrator can reduce the load on the monitoring system. If the link to a remote network is not responding to a ping test, there is no reason for the software to continue processing port checks on machines on that network. This allows the system to perform event correlation to reduce the number of false positives, which in turn reduces the number of alerts being sent. The result is less traffic on the network and less of the "boy who cried wolf" effect.
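A port-level check amounts to attempting a TCP connection and seeing whether it succeeds. The following sketch uses Python's standard socket module. The function names and the gateway_reachable flag are illustrative assumptions; the flag stands in for the result of whatever ping test the monitoring system has already performed.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports, gateway_reachable):
    """Skip port checks when the path to the host is already known to be down."""
    if not gateway_reachable:
        return {}
    return {port: port_is_open(host, port) for port in ports}
```

A call such as check_ports("mail.example.com", [25, 110], gateway_reachable=True) returns a dictionary of port numbers mapped to True or False. Note that an open port shows only that something is listening, not that the service behind it is healthy.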

The Netstat Command

An easy way to find the ports to monitor on a system is via the Netstat command. By identifying the ports that multiple users are connecting to, you can identify the ports that are important to the end users. Running netstat -nao on a Windows Server 2003 system lists the local and remote IP addresses of each connection and the ports on which they are connected. It also lists the process identifier (PID) associated with each connection. These PIDs can be compared to the PID column in Task Manager to see which processes are connecting on which ports.
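If the output is saved to a file, a short script can tally the established connections by local port. The sketch below assumes the usual five-column TCP rows (protocol, local address, foreign address, state, PID). The sample text is illustrative and was not captured from a real server.

```python
from collections import Counter


def established_by_port(netstat_output):
    """Count ESTABLISHED TCP connections per local port in netstat -nao text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have five columns: Proto, Local, Foreign, State, PID.
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "ESTABLISHED":
            continue
        local_port = int(fields[1].rsplit(":", 1)[1])
        counts[local_port] += 1
    return counts


# Illustrative text in the general shape of netstat output, not a real capture.
SAMPLE = """\
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:25             0.0.0.0:0              LISTENING       1432
  TCP    10.0.0.5:25            10.0.0.21:3051         ESTABLISHED     1432
  TCP    10.0.0.5:25            10.0.0.34:4410         ESTABLISHED     1432
  TCP    10.0.0.5:3389          10.0.0.9:2207          ESTABLISHED     816
  UDP    0.0.0.0:53             *:*                                    1108
"""
print(established_by_port(SAMPLE).most_common())  # prints [(25, 2), (3389, 1)]
```

Ports that appear with many connections are good candidates for monitoring. Keep in mind that the tally also includes connections the server itself opened to other machines, which show up under short-lived local port numbers.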

Some recommended monitoring points are as follows:

Port 25 (SMTP)
Port 110 (POP3)
Port 80 (Web)
Port 443 (Secure Web)
Port 53 (DNS)
Port 1723 (PPTP)
Port 3389 (Remote Desktop Protocol)

Service-Level Monitoring

Going one step further in the area of monitoring is the ability to query a service to see if it is running properly. Rather than simply seeing that port 25 is responding, a monitoring package can open an SMTP session to see if the server responds with the correct greeting. This gives you more assurance that an e-mail server, for example, is able to receive e-mail messages. Similarly, software packages that perform service-level monitoring can query the operating system to see if a service is in the "running" state. Services that have failed or that have been stopped can be identified in this manner.
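As a rough sketch of the SMTP check, the following code connects to the server and inspects the first line it sends. An SMTP server that is ready for mail opens the session with a reply that begins with the code 220. The function names are illustrative, and a commercial monitoring package would likely go further, for example by stepping through more of the SMTP conversation.

```python
import socket


def greeting_is_ready(banner):
    """Return True if the server's first line is an SMTP 220 greeting."""
    return banner.startswith("220")


def smtp_greets(host, port=25, timeout=5.0):
    """Connect, read the server's opening reply, and check it for code 220."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512).decode("ascii", errors="replace")
            conn.sendall(b"QUIT\r\n")  # end the session politely
    except OSError:
        # Refused connections, timeouts, and resets all count as failures.
        return False
    return greeting_is_ready(banner)
```

Separating the banner test from the network code makes the rule easy to adjust without touching the connection handling.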

Identify Dependencies for All Services

When possible, identify dependencies for all services to help reduce redundant alerts. For example, in Exchange, if the System Attendant service is down, there is no point in checking the Information Store service. This would only generate a needless alert.

Because the services running will be very specific to the type of server in question, it is recommended that you research the system to determine the specific services needed for the server to properly do its job. For example, on an Exchange server you might monitor the following:

Microsoft Exchange IMAP4
Microsoft Exchange Information Store
Microsoft Exchange Management
Microsoft Exchange MTA Stacks
Microsoft Exchange POP3
Microsoft Exchange Routing Engine
Microsoft Exchange Site Replication Service
Microsoft Exchange System Attendant
Simple Mail Transfer Protocol
World Wide Web Publishing Service
Antivirus
Antispam
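The check for a service in the "running" state, described earlier in this section, could be scripted roughly as follows. This sketch assumes the Windows sc query command and its usual STATE line (a numeric code followed by the state name); verify the output format on your own systems before relying on it. The function names are illustrative.

```python
import subprocess


def parse_state(sc_output):
    """Extract the state name (such as RUNNING or STOPPED) from sc query text."""
    for line in sc_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "STATE":
            parts = value.split()
            # The value looks like "4  RUNNING": a numeric code, then the name.
            return parts[1] if len(parts) >= 2 else None
    return None


def service_is_running(service_name):
    """Run sc query for the named service (Windows only) and test for RUNNING."""
    result = subprocess.run(["sc", "query", service_name],
                            capture_output=True, text=True)
    return parse_state(result.stdout) == "RUNNING"
```

For example, service_is_running("W3SVC") would check the World Wide Web Publishing Service on the local machine.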

Application-Level Monitoring

Monitoring systems at the application level enables you to pull useful performance information from the system. Not only can you determine whether or not a service is running, but you can also determine how well it is running. Specific performance metrics such as SMTP queue sizes or mailbox sizes can be monitored to determine the health of the system. Databases can be monitored for critical things like available file locks, replication status, or even current user load. This type of monitoring allows thresholds to be used to determine when reactive measures should be taken to address a system problem. By layering several types of application-level monitoring, complex tests can be performed on the system. Rather than simply pinging a Web server to make sure it's running, querying it on port 80, or even checking to see if the World Wide Web Publishing Service is running, an application-level monitoring system can send a specific query to the Web server and determine whether the correct response was received. This level of monitoring gives you an impressive level of insight into the workings of the network.
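The Web server test described above might look something like the following sketch, which requests a page and checks both the status code and the content. The function names and the idea of matching a fixed piece of expected text are assumptions made for illustration; a real package may compare responses in more sophisticated ways.

```python
import urllib.error
import urllib.request


def response_is_healthy(status, body, expected_text):
    """Healthy means a 200 status and a body containing the expected text."""
    return status == 200 and expected_text in body


def check_web_app(url, expected_text, timeout=5.0):
    """Fetch url and verify both the status code and the page content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_healthy(resp.status, body, expected_text)
    except (urllib.error.URLError, OSError):
        # HTTP errors, refused connections, and timeouts all count as failures.
        return False
```

A server that returns an error page with a 200 status would pass a port check and a service check but fail this one, which is the point of testing at the application level.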

This type of monitoring can be exceptionally useful in the area of capacity planning. By monitoring and logging application-level performance counters, you can use long-term system-usage tracking to determine when a resource will become insufficient.
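One simple way to turn logged counters into a capacity estimate is to fit a straight line to the samples and project it forward. The sketch below uses an ordinary least-squares fit; it assumes one sample per day and roughly linear growth, which will not hold for every resource. The function name and the figures are illustrative.

```python
def day_capacity_reached(daily_usage, capacity):
    """Fit a least-squares line to daily samples and project when it hits capacity.

    daily_usage: one measurement per day, oldest first (for example, GB used).
    Returns the day index (0 = first sample) at which the fitted line reaches
    capacity, or None if there are too few samples or usage is not growing.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (capacity - intercept) / slope


# Hypothetical store growing 2 GB per day from 50 GB, with a 100 GB limit.
print(day_capacity_reached([50, 52, 54, 56, 58], capacity=100))  # prints 25.0
```

The more history that is logged, the less a single unusual day distorts the projection.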

Not unlike service-level monitoring, the key monitoring points of application-level monitoring will vary by application based on the role of the server. An Exchange server, for example, might be monitored for the following types of items:

SMTP Queue growth
MTA Queue growth
MAPI transaction time, average
Mailbox sizes
NDR count
Information Store size
User load
Concurrent connections
Traffic on connectors to foreign mail systems
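For the queue growth items in this list, a threshold on a single reading can be misleading, because a queue that spikes and then drains is often normal. One possible approach, sketched here under that assumption, is to alert only when the queue has grown across several consecutive samples. The function name and the default window size are illustrative.

```python
def queue_is_growing(samples, window=3):
    """Return True if each of the last `window` intervals shows a longer queue."""
    if len(samples) < window + 1:
        return False
    recent = samples[-(window + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


print(queue_is_growing([5, 4, 6, 9, 15]))     # prints True
print(queue_is_growing([5, 40, 12, 13, 11]))  # prints False
```

The second series contains a much larger peak than the first, yet it does not trigger, because the queue drained again afterward.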

However, a SQL server might be more concerned with the following:

Transaction response time
Number of long running transactions
Error log tracking
Process blocks
Page-level locks
Table-level locks
Exclusive locks
Shared locks
Log space
Database space
Cache hit rate

Application-Level Monitoring Solutions

Most application-level monitoring solutions require the installation of a monitoring agent on the target system. This allows the monitoring system to have greater knowledge of the applications running there but could potentially increase the CPU load on the monitored system.

In any case, it is critical for the administrator who is implementing the monitoring solution to work very closely with the application owners to ensure that the important monitoring points are being captured, both from an alerting standpoint as well as from a capacity monitoring and planning standpoint.

Performance Monitoring

Although a monitoring system is quite useful for spotting problems and outages, it can also be used to measure and track the performance of the system. By identifying key performance metrics such as memory usage or database transaction times, you can not only be aware of outages but also see changes in system performance that would affect the end-user experience. By logging these performance metrics, you also gain a long-term view of the performance of the system. Trends in system usage and resource usage become extremely valuable when tracked over extended periods of time. This information can be used to predict when upgrades will be needed.

Performance Monitoring's Real Value

Although performance monitoring is useful for identifying problems with a system, its real value comes in long-term trend identification.

Some recommended monitoring points are as follows:

CPU usage
Available memory
Available disk space
Transaction rate
Network utilization
Disk I/O
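Long-term trending depends on samples being recorded consistently. As a minimal sketch, the code below measures one of the items above, available disk space, and appends it to a CSV log with a timestamp. The log format, metric name, and function names are assumptions made for illustration; a monitoring package would normally store this data for you.

```python
import csv
import shutil
import time


def free_disk_percent(path):
    """Return free space on the volume containing path, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def append_sample(log_path, metric, value, now=None):
    """Append one timestamped measurement to a CSV log for later trending."""
    # time.localtime(None) uses the current time; tests can pass a fixed value.
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    with open(log_path, "a", newline="") as handle:
        csv.writer(handle).writerow([timestamp, metric, f"{value:.1f}"])
```

Calling append_sample("perf.csv", "disk_free_pct", free_disk_percent("C:\\")) on a schedule builds up the history that trend analysis needs.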

Monitoring Pitfalls

There are a lot of different types of monitoring packages on the market and there are pros and cons to each type. There are a few things you should be aware of when picking a monitoring package.

Resist the Temptation ...

Resist the temptation to turn on monitoring for each and every subsystem available. Limit the scope of the monitoring to data points that will actually be used, either for long-term performance trending or for failure notification. Enabling too many monitoring points only serves to cloud the valid data and discourage you from addressing all of the data. Monitoring too many data points also imposes an unnecessarily harsh load on the system and reduces the scalability of the monitoring system.

A lot of monitoring packages use agents. This means that some piece of code needs to be installed on each machine that will be monitored, which introduces an unknown to the server. You should always baseline the performance of a server before adding a monitoring agent. In this way, any negative impact on the server's performance can be accurately measured.

Also, be careful with packages that rely on protocols built into the operating system. Almost every monitoring package on the market supports Simple Network Management Protocol (SNMP), and SNMP is built into most Windows operating systems. Unfortunately, the version built into older versions of Windows isn't terribly secure: it sends its traps in clear text. Although updates to SNMP are available and have been included in service packs, most administrators don't know to reinstall their latest service pack after loading SNMP on the system. Also, an administrator should never leave the default community strings in place! This is a huge security risk.
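Returning to the baselining advice, the comparison itself can be very simple once the samples exist. The sketch below assumes you have collected CPU usage percentages over comparable periods before and after the agent was installed; the function name and the figures are hypothetical.

```python
from statistics import mean


def agent_overhead(before, after):
    """Compare average CPU usage before and after an agent was installed.

    Returns (baseline_mean, new_mean, change_in_percentage_points).
    """
    baseline = mean(before)
    current = mean(after)
    return baseline, current, current - baseline


# Hypothetical CPU % samples taken at the same times of day.
print(agent_overhead([10, 12, 14], [13, 15, 17]))  # prints (12, 15, 3)
```

Samples should cover similar workloads in both periods; otherwise, a busy day can be mistaken for agent overhead.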

Other monitoring packages gather information about a Windows system through the use of NetBIOS calls. Although at first this seems like a good idea, keep in mind that if a legitimate monitoring system can gather vital information about a server via NetBIOS requests, so can any other system. Never enable NetBIOS for monitoring purposes on a system that is reachable from outside your network. This goes for DMZs and wireless networks alike.