Creating a Custom Monitor | Microsoft Application Center 2000 Resource Kit 2001

Before you create a custom monitor, decide whether the monitor will be specific to a single member or common to all cluster members. A monitor that is common to all members must be created in the Synchronized Monitors data group; a member-specific monitor can be created in any other data group, such as the Non-Synchronized Monitors group. After you make this decision and determine what you want to monitor, creating a monitor involves three basic steps: creating a data group (optional), creating the data collector for the monitor, and setting a threshold for the data collector.

In the following example, we're going to create a monitor that checks to see if the latest version of a specific DLL file has been installed on a new member. This is useful in cases where hot fixes have been provided and you'd like to verify that they've all been installed before putting your new member into production.

CAUTION
Every monitor has a performance cost. It's important to create only those monitors that are essential to supporting your operations. You need to consider:

Scoping data collection to ensure that data is collected efficiently and quickly.

Scheduling the monitor so that it runs only when needed, and collects data only often enough to achieve its purpose.

The Software Version Checker

For this example, we'll assume that one of the hot fixes we have to track is for the Replication Service, and that one of the updates was to the Replication Service Library DLL, called Replib.dll. We'll create a data collector that tests whether an instance of the newest version of this DLL exists. This data collector is associated with two actions that run if its threshold is crossed, which is to say, when an instance of the correct file version isn't found.

We'll start by creating a data group named Version Checker in the Synchronized Monitors group.

Create the Data Group

Because we want to test for the presence of the latest hot fix on servers as they're added to the cluster, we'll create the new monitor in the Synchronized Monitors group. As soon as the member joins the cluster, our new monitor will be synchronized to the new member. (We're making the assumption that a new member will be synchronized to the controller, but will not automatically be brought online for load balancing.)

Use the following steps and settings to duplicate our process for creating a new data group.

Expand the Health Monitor node down to the Synchronized Monitors node.
Right-click Synchronized Monitors; on the pop-up menu, point to New; and then click Datagroup.
In the Data Group Properties dialog box, click the General tab, and then enter the following information:
- Name: Version Checker
- Comment: This group contains collectors that check software versions to see that the latest hot fix is applied on a server.
Leave the default settings for actions—we'll let the data collector handle that aspect of the monitor—and click OK to save the new data group.

Now we'll create the data collector.

Create the Data Collector

Right-click the Version Checker data group; on the pop-up menu, point to New; point to Data Collector; and then click WMI Instance.
In the WMI Instance Properties dialog box, click the General tab, and then enter the following information:
- Name: Replication Service Library
- Comment: This data collector checks the Replication Service Library DLL to see that the newest version is installed on the local system.
Click the Details tab, and then provide the following information (illustrated in Figure 9.17):
- Namespace: root\CIMV2
  
  Use the default namespace for the system for which you are configuring the collector.
- Class: CIM_DataFile
  
  Before creating the collector, we determined that this particular class captured the information that we wanted to test for.
- Instance: CIM_DataFile.Name="D:\\Program Files\\Microsoft Application Center\\replib.dll"
  
  You can also obtain this information by using the Browse button, but this is very time consuming because it enumerates all of the files on the server. It's quicker to provide a complete path and file name.
  
  NOTE
  Make sure that you use two backslashes (\\) in the path statement.
- Properties: FileName, FileSize, LastModified, Version
  The properties list determines what is shown in the statistics details pane. You need to select at least one property.
  
  Figure 9.17 The Details tab for the Replication Service Library
Click the Schedule tab, and then change the schedule so that the collector runs from 12:00 midnight through 4:00 A.M.
We've made the assumption that under normal conditions, new members will be brought online only during nonpeak periods, which in our hypothetical environment is this time of day. Set the collection interval to 60 minutes (check once per hour), and set the number of samples to 1 because we don't need any averaging; we need only one instance.
Click the Action tab, and then click New Action Association.
In the Execute Action Properties dialog box, add the Email Administrator action.
Click the Message tab, and then expand the default message by adding the following text:
%EmbeddedCollectedInstance.FileName% is not the latest version. %SystemName% will be taken offline.

You can add these or similar insertion strings by clicking the right angle bracket (>) to the right of the message area.
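To make the effect of the insertion strings concrete, here is a minimal Python sketch of how `%Name%` tokens in a message could be expanded. It is not how Health Monitor implements substitution; the function name `expand_insertion_strings` and the sample values (`replib`, `WEB03`) are hypothetical and chosen only for illustration.

```python
import re

def expand_insertion_strings(template: str, values: dict) -> str:
    """Replace each %Name% token with its value; leave unknown tokens untouched."""
    def substitute(match):
        key = match.group(1)
        return str(values.get(key, match.group(0)))
    # A token is a name (letters, digits, underscores, dots) wrapped in % signs.
    return re.sub(r"%([A-Za-z_][\w.]*)%", substitute, template)

message = ("%EmbeddedCollectedInstance.FileName% is not the latest version. "
           "%SystemName% will be taken offline.")

# Hypothetical values standing in for what the collector would supply.
sample_values = {
    "EmbeddedCollectedInstance.FileName": "replib",
    "SystemName": "WEB03",
}

print(expand_insertion_strings(message, sample_values))
```

Leaving unknown tokens in place, rather than replacing them with an empty string, makes a misspelled insertion string easy to spot in the delivered message.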

The final task is creating a threshold for the data collector because the only default threshold created is for WMI errors.

Create the Threshold

Follow these steps to create the data collector's threshold.

Right-click the Replication Service Library node; on the pop-up menu, point to New; and then click Threshold.
In the Threshold Properties dialog box, click the General tab, and then provide the following information:
- Name: Date and Time
- Comment: This threshold uses the WMI date and time stamp for the LastModified property.
We decided to use the WMI date and time stamp for this threshold, but file version would have worked as well.
Click the Expression tab, and then create the following expression by using the lists and boxes:
- If this condition is true: The current value for LastModified (Date/Time) is less than 20000920100400.
  NOTE
  The data collector's Statistics tab (Figure 9.17) shows the value for the LastModified property in WMI date format. To use this date, we right-clicked the last modified item, made a copy, and pasted the line into Notepad. The WMI date, 20000920100455.620875-420, breaks out as follows:
  - Year 2000
  - Month 09
  - Day 20
  - Time 10:04:00
- Duration: Any time this occurs
- The following will occur: The status changes to Warning
Click OK to save the threshold information.

As soon as this action is completed, the data collector will start collecting data and testing the data against the threshold.
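The WMI date stamp used in the threshold expression follows the CIM datetime layout `yyyymmddHHMMSS.mmmmmm` followed by a signed UTC offset in minutes. The Python sketch below unpacks such a value and compares it with a threshold at one-second precision. The helper names and the "older" sample value are hypothetical, and the comparison is only an assumption about how a "less than" test on the first 14 digits behaves, not a description of Health Monitor internals.

```python
from datetime import datetime, timedelta, timezone

def parse_wmi_datetime(value: str) -> datetime:
    """Parse yyyymmddHHMMSS.mmmmmm+UUU, where UUU is the UTC offset in minutes."""
    stamp, offset = value[:21], value[21:]
    local = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    return local.replace(tzinfo=timezone(timedelta(minutes=int(offset))))

def is_older_than(wmi_value: str, threshold: str) -> bool:
    """Compare the yyyymmddHHMMSS prefixes; digit strings of equal length sort by time."""
    return wmi_value[:14] < threshold[:14]

parsed = parse_wmi_datetime("20000920100455.620875-420")
print(parsed.year, parsed.month, parsed.day, parsed.time())
print(parsed.utcoffset())  # -420 minutes, that is, 7 hours behind UTC

# Hypothetical stamp for a file modified before the hot fix was released.
print(is_older_than("20000815093000.000000-420", "20000920100400"))
```

Because the fields are ordered from most significant to least significant, a plain string comparison of the 14-digit prefix gives the same result as comparing the dates, provided both values use the same UTC offset.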

Figure 9.18 shows a typical statistical report that Application Center provides for data collectors that are installed on a cluster member. In this example, information about our new data collector is displayed.

Figure 9.18 The statistics provided for a data collector

For our sample data collector, we used a value for LastModified that was older than the current version of the DLL. Notice in Figure 9.18 that the LastModified date is different, and that a Warning was generated.

Throttling notifications
When creating data collectors, you have to be careful about setting up a situation where a data collector fires too many administrative events—throttling is extremely important in having a reliable monitoring solution. The following scenarios include workarounds that can reduce the number of notifications that are fired.
Scenario 1:

An action is mapped to an event-based data collector, and there are large numbers of events coming in.

The workaround is to set the collection interval and have Health Monitor count the number of events that are received. Specify that the action is triggered only when the number of events exceeds the count threshold you establish for the collection period. For example, trigger the action only after 10 failed log-on attempts in 2 minutes.
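The counting approach in Scenario 1 can be sketched in a few lines of Python. This illustrates the idea only and is not Health Monitor's implementation: the function name, the event timestamps (seconds from an arbitrary start), and the choice to fire when the count reaches `min_count` are all assumptions.

```python
from collections import Counter

def intervals_to_alert(event_times, interval_seconds, min_count):
    """Return the start time of every collection interval with at least min_count events."""
    counts = Counter(int(t // interval_seconds) for t in event_times)
    return sorted(bucket * interval_seconds
                  for bucket, n in counts.items() if n >= min_count)

# Hypothetical failed log-on events: 12 in the first 2 minutes, 3 in the next 2.
failed_logons = list(range(0, 120, 10)) + [130, 150, 170]

# One action for the first interval instead of 15 separate notifications.
print(intervals_to_alert(failed_logons, interval_seconds=120, min_count=10))
```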

Scenario 2:

A data collector flips back and forth repeatedly between a good and bad state.

The first workaround is to require that a bad state stay that way for n collection periods (where n is a configurable value). For nonbinary thresholds, such as processor utilization, you can key off an average over multiple intervals, thus blunting the impact of oscillations right around the threshold value.
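As a rough illustration of the first workaround, the Python sketch below reports a bad state only after it has persisted for n collection periods in a row. The function name, the sample readings, and the 90 percent limit are hypothetical.

```python
def confirmed_bad(samples, is_bad, n):
    """For each sample, report True only once the last n samples in a row were bad."""
    results = []
    streak = 0
    for sample in samples:
        streak = streak + 1 if is_bad(sample) else 0
        results.append(streak >= n)
    return results

# Hypothetical processor utilization readings, one per collection period.
cpu = [95, 40, 96, 97, 98, 50]

# The isolated spike in the first period never raises an alert.
print(confirmed_bad(cpu, is_bad=lambda value: value > 90, n=3))
```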

The second workaround is to set up two thresholds on a data collector, a go to good threshold and a go to bad threshold, with an established delta between them. For example, send an alert when disk free space goes below 5 percent, but only reset this data collector when free space returns to above 10 percent. In this example, you would both require a manual reset and establish Critical and go to good thresholds.
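The two-threshold arrangement is a form of hysteresis. The Python sketch below shows the state logic under the 5 percent and 10 percent figures from the example; the function name and the sample readings are hypothetical, and the manual reset is not modeled.

```python
def hysteresis_states(free_space_percent, go_bad_below=5.0, go_good_above=10.0):
    """Track state with separate go to bad and go to good thresholds."""
    state = "good"
    states = []
    for value in free_space_percent:
        if state == "good" and value < go_bad_below:
            state = "bad"
        elif state == "bad" and value > go_good_above:
            state = "good"
        states.append(state)
    return states

# Hypothetical disk free-space readings hovering around the 5 percent mark.
readings = [12, 6, 4, 6, 9, 4.5, 11, 7]
print(hysteresis_states(readings))
```

With a single 5 percent threshold, the readings 4, 6, 4.5 would flip the state three times; with the delta in place, the state changes once when free space drops below 5 percent and once when it recovers past 10 percent.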

The third workaround is to set up multiple periods or averages. For example, collect the average processor utilization every 30 seconds, but go critical only if this value is over the threshold for 2 minutes (4 collection intervals).
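The third workaround combines averaging with a persistence requirement. In the Python sketch below, raw samples are averaged per collection interval, and the state goes critical only when the average stays over the threshold for the required number of consecutive intervals. The function names, the assumption of three raw samples per 30-second interval, and the sample data are hypothetical.

```python
from statistics import mean

def interval_averages(samples, per_interval):
    """Average each complete group of per_interval raw samples."""
    return [mean(samples[i:i + per_interval])
            for i in range(0, len(samples) - per_interval + 1, per_interval)]

def goes_critical(averages, threshold, intervals_required):
    """True if the average exceeds the threshold for enough consecutive intervals."""
    streak = 0
    for average in averages:
        streak = streak + 1 if average > threshold else 0
        if streak >= intervals_required:
            return True
    return False

# Hypothetical readings, three per 30-second collection interval.
sustained = [95, 50, 60, 92, 94, 96, 91, 93, 95, 90, 95, 97, 88, 94, 96]
spiky = [95, 50, 60] * 5

for name, raw in (("sustained", sustained), ("spiky", spiky)):
    averages = interval_averages(raw, per_interval=3)
    print(name, goes_critical(averages, threshold=90, intervals_required=4))
```

The spiky series crosses 90 percent in every interval but never goes critical, because each interval's average stays well below the threshold.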