Before you create a custom monitor, decide whether the monitor will be specific to a single member or common to all cluster members. A monitor that is common to all members must be created in the Synchronized Monitors data group; a member-specific monitor can be created in any other data group, such as the Non-Synchronized Monitors group. Once you've made this decision and know what you want to monitor, creating a monitor involves three basic steps: creating a data group (optional), creating the data collector for the monitor, and setting a threshold for the data collector.
In the following example, we're going to create a monitor that checks to see if the latest version of a specific DLL file has been installed on a new member. This is useful in cases where hot fixes have been provided and you'd like to verify that they've all been installed before putting your new member into production.
CAUTION
Every monitor has a performance cost, so create only monitors that are essential to supporting your operations. You need to consider:
- Scoping data collection to ensure that data is collected efficiently and quickly.
- Scheduling the monitor so that it runs only when needed, and collects data only often enough to achieve its purpose.
For this example, we'll assume that one of the hot fixes we have to track is for the Replication Service, and that one of the updates was to the Replication Service Library DLL, called Replib.dll. We'll create a data collector that tests whether an instance of the newest version of this DLL exists. Two actions are associated with this data collector; they run if its threshold is crossed, which is to say, if an instance of the correct file version isn't found.
We'll start by creating a data group named Version Checker in the Synchronized Monitors group.
Create the Data Group
Because we want to test for the presence of the latest hot fix on servers as they're added to the cluster, we'll create the new monitor in the Synchronized Monitors group. As soon as the member joins the cluster, our new monitor will be synchronized to the new member. (We're making the assumption that a new member will be synchronized to the controller, but will not automatically be brought online for load balancing.)
Use the following steps and settings to duplicate our process for creating a new data group.
Now we'll create the data collector.
Create the Data Collector
You can also obtain this information by using the Browse button, but this is very time-consuming because it enumerates all of the files on the server. It's quicker to provide a complete path and file name.
NOTE
Make sure that you use two backslashes (\\) in the path statement.
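The doubled-backslash rule can be illustrated with a minimal sketch. It assumes the path ends up inside a WQL string literal in a query against the WMI CIM_DataFile class; the helper names and the install path below are hypothetical and are not taken from the example in this chapter.

```python
def wql_escape_path(path: str) -> str:
    """Double each backslash so the path can sit inside a WQL string literal."""
    return path.replace("\\", "\\\\")


def build_datafile_query(path: str) -> str:
    """Build a WQL query for one file (CIM_DataFile is an assumption here)."""
    return f'SELECT * FROM CIM_DataFile WHERE Name = "{wql_escape_path(path)}"'


# Hypothetical install location, used only for illustration.
dll_path = r"C:\ExampleDir\Replib.dll"
query = build_datafile_query(dll_path)
print(query)
```

The printed query contains `C:\\ExampleDir\\Replib.dll`, which is the form the path statement expects.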
The properties list determines what is shown in the statistics details pane. You need to select at least one property.
Figure 9.17 The Details tab for the Replication Service Library
We've made the assumption that under normal conditions, new members will be brought online only during nonpeak periods, which is the time of day in our hypothetical environment. Set the collection interval at 60 minutes (check once per hour), and set the number of samples at 1 because we don't need any averaging; we only need one instance.
%EmbeddedCollectedInstance.FileName% is not the latest version. %SystemName% will be taken offline.
You can add these or similar insertion strings by clicking the right angle bracket (>) to the right of the message area.
The final task is to create a threshold for the data collector, because the only threshold created by default is for WMI errors.
Create the Threshold
Follow these steps to create the data collector's threshold.
NOTE
The data collector's Statistics tab (Figure 9.17) shows the value for the LastModified property in WMI date format. To use this date, we right-clicked the last modified item, made a copy, and pasted the line into Notepad. The WMI date, 20000920100455.620875-420, breaks out as follows:
- Year 2000
- Month 09
- Day 20
- Time 10:04:55
- Duration: Any time this occurs
- The following will occur: The status changes to Warning
As soon as this action is completed, the data collector will start collecting data and testing the data against the threshold.
Figure 9.18 shows a typical statistical report that Application Center provides for data collectors that are installed on a cluster member. In this example, information about our new data collector is displayed.
Figure 9.18 The statistics provided for a data collector
For our sample data collector, we used a value for LastModified that was older than the current version of the DLL. Notice in Figure 9.18 that the LastModified date is different, and that a Warning was generated.
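To make the LastModified comparison concrete, here is a minimal sketch of parsing a WMI date string and checking it against a reference date. It assumes the standard layout yyyymmddHHMMSS.mmmmmm followed by a signed UTC offset in minutes; the function names and the reference date are hypothetical, and the check is only an illustration, not how Health Monitor evaluates thresholds internally.

```python
from datetime import datetime, timedelta, timezone


def parse_wmi_datetime(value: str) -> datetime:
    """Parse a WMI date string such as 20000920100455.620875-420."""
    if len(value) != 25 or value[14] != ".":
        raise ValueError(f"not a WMI datetime: {value!r}")
    # First 21 characters: date, time, and microseconds.
    stamp = datetime.strptime(value[:21], "%Y%m%d%H%M%S.%f")
    # Last 4 characters: signed offset from UTC in minutes, e.g. -420.
    offset_minutes = int(value[21:])
    return stamp.replace(tzinfo=timezone(timedelta(minutes=offset_minutes)))


def is_outdated(last_modified: str, required: str) -> bool:
    """Return True if the file's date is earlier than the required date."""
    return parse_wmi_datetime(last_modified) < parse_wmi_datetime(required)


installed = "20000920100455.620875-420"  # value quoted in the text
required = "20001101000000.000000-420"   # hypothetical newer hot-fix date

parsed = parse_wmi_datetime(installed)
print(parsed.isoformat())
print(is_outdated(installed, required))
```

Because both values carry their UTC offset, the comparison stays correct even if the two dates were recorded in different time zones.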
Throttling notifications
When creating data collectors, be careful not to set up a situation where a data collector fires too many administrative events; throttling is extremely important to a reliable monitoring solution. The following scenarios include workarounds that can reduce the number of notifications that are fired.
Scenario 1:
An action is mapped to an event-based data collector, and there are large numbers of events coming in.
- The workaround is to set the collection interval and have Health Monitor count the number of events that are received. Specify that the action is triggered only when the number of events exceeds the count threshold you establish for the collection period. For example, 10 failed log-on attempts in 2 minutes.
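The count-threshold idea can be sketched in a few lines. This is an illustration only: it uses a sliding time window rather than Health Monitor's fixed collection interval, and the class and parameter names are hypothetical.

```python
from collections import deque


class EventCountThrottle:
    """Fire only when enough events arrive within a time window."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of events still inside the window

    def record(self, timestamp: float) -> bool:
        """Record one event; return True if the action should fire."""
        self._times.append(timestamp)
        # Discard events that have aged out of the window.
        while timestamp - self._times[0] >= self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.max_events


# 10 failed log-on attempts in 2 minutes, as in the example above.
throttle = EventCountThrottle(max_events=10, window_seconds=120)
results = [throttle.record(t) for t in range(0, 50, 5)]  # one event every 5 s
print(results)
```

In this run only the tenth event returns True; the first nine are counted but do not trigger the action.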
Scenario 2:
A data collector flips back and forth repeatedly between a good and bad state.
- The first workaround is to require that a bad state stay that way for n collection periods (where n is a configurable value). For nonbinary thresholds, such as processor utilization, you can key off an average over multiple intervals, thus blunting the impact of oscillations right around the threshold value.
- The second workaround is to set up two thresholds on a data collector, a go to good threshold and a go to bad threshold, with an established delta between them. For example, send an alert when disk free space goes below 5 percent, but only reset this data collector when free space returns to above 10 percent. In this example, you would both require a manual reset and establish Critical and go to good thresholds.
- The third workaround is to set up multiple periods or averages. For example, collect the average processor utilization every 30 seconds, but go critical only if this value is over the threshold for 2 minutes (4 collection intervals).
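The two-threshold and multiple-period workarounds can be sketched as small state machines. This is a hedged illustration, not Health Monitor's implementation: the class names are hypothetical, and the 90 percent processor limit is an assumed value because the text does not give one.

```python
class HysteresisThreshold:
    """Go bad below `bad_below`; go good again only above `good_above`."""

    def __init__(self, bad_below: float, good_above: float) -> None:
        if good_above <= bad_below:
            raise ValueError("good_above must be greater than bad_below")
        self.bad_below = bad_below
        self.good_above = good_above
        self.is_bad = False

    def update(self, value: float) -> bool:
        """Feed one sample; return True only when the state changes to bad."""
        if not self.is_bad and value < self.bad_below:
            self.is_bad = True
            return True
        if self.is_bad and value > self.good_above:
            self.is_bad = False
        return False


class SustainedThreshold:
    """Go critical only after `required` consecutive samples exceed `limit`."""

    def __init__(self, limit: float, required: int) -> None:
        self.limit = limit
        self.required = required
        self._streak = 0

    def update(self, value: float) -> bool:
        """Feed one sample; return True while the condition is sustained."""
        self._streak = self._streak + 1 if value > self.limit else 0
        return self._streak >= self.required


# Disk free space: alert below 5 percent, reset only above 10 percent.
disk = HysteresisThreshold(bad_below=5.0, good_above=10.0)
alerts = [disk.update(v) for v in [12, 4, 6, 4, 11, 3]]

# Processor utilization sampled every 30 seconds; critical after 4 intervals.
cpu = SustainedThreshold(limit=90.0, required=4)  # 90.0 is an assumed limit
critical = [cpu.update(v) for v in [95, 96, 80, 91, 92, 93, 94]]
print(alerts, critical)
```

In the disk example, the readings of 6 and 4 percent raise no second alert because the collector has not yet recovered above 10 percent. In the processor example, the dip to 80 resets the streak, so the state goes critical only on the last sample.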