6.7 Event monitoring


CSM provides an extensible framework for detecting and responding to conditions arising on the cluster. The principle is that the administrator should not have to proactively monitor the status of a cluster; instead, the administrator should be notified when a significant event occurs.

The Resource Management and Control (RMC) subsystem is the scalable backbone of RSCT that provides a generalized framework for managing and monitoring resources.

RMC is comprehensively documented in IBM Reliable Scalable Cluster Technology for Linux: Guide and Reference, SA22-7892. For further detailed information, consult this book.

6.7.1 RMC components

The following components form the core of RMC:

  • Resources

  • Sensors

  • Conditions

  • Responses


A resource is a collection of attributes that describe or measure a logical or physical entity; all of RMC hinges on resources. Resources are introduced to RMC through resource managers, such as IBM.DMSRM and IBM.HWCTRLRM.

Resources may be queried using the lsrsrc command, as shown in Example 6-30. The CT_MANAGEMENT_SCOPE environment variable affects which resources lsrsrc will display. If it is unset or set to 0 or 1, lsrsrc will display only resources local to the current node. If lsrsrc is run on the management node with CT_MANAGEMENT_SCOPE set to 3, resources for all managed nodes will be displayed; this is the most common usage.

Example 6-30: Displaying node resources using lsrsrc

start example
 [root@master /]# export CT_MANAGEMENT_SCOPE=3
 [root@master /]# lsrsrc -Ab IBM.Host Name KernelVersion
 Resource Persistent Attributes for: IBM.Host
 resource 1:
         Name          = "node2.cluster.com"
         KernelVersion = "2.4.18-3smp"
 resource 2:
         Name          = "node1.cluster.com"
         KernelVersion = "2.4.18-3smp"
 resource 3:
         Name          = "node3.cluster.com"
         KernelVersion = "2.4.18-3smp"
 resource 4:
         Name          = "node4.cluster.com"
         KernelVersion = "2.4.18-3smp"
 resource 5:
         Name          = "storage1.cluster.com"
         KernelVersion = "2.4.18-3smp"
 [root@master /]# unset CT_MANAGEMENT_SCOPE
 [root@master /]#
end example

lsrsrc is extremely powerful. Consult the RMC documentation for more information on available resources.
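Because an exported CT_MANAGEMENT_SCOPE stays in effect until it is unset, it can be convenient to set it for a single command only. The following is a minimal sketch, not part of CSM: with_cluster_scope is a hypothetical helper name, and the only assumption is the standard env utility.

```shell
#!/bin/sh
# Hypothetical helper (not a CSM command): run one command with
# cluster-wide scope without changing the calling shell's environment.
with_cluster_scope() {
    # env sets the variable only for the child process it launches
    env CT_MANAGEMENT_SCOPE=3 "$@"
}

# On a CSM management node the call from Example 6-30 would become:
#   with_cluster_scope lsrsrc -Ab IBM.Host Name KernelVersion

# Demonstration that the setting does not leak into this shell:
with_cluster_scope sh -c 'echo "scope inside child: $CT_MANAGEMENT_SCOPE"'
echo "scope in this shell: ${CT_MANAGEMENT_SCOPE:-unset}"
```

With this approach there is nothing to unset afterwards, since the variable is never exported in the interactive shell.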


CT_MANAGEMENT_SCOPE should not be set when running the remaining commands in this section; if it is, they will return no information. The following command will unset this variable:

 # unset CT_MANAGEMENT_SCOPE

Sensors are a way of providing resources to RMC without writing a full-blown resource manager. An individual sensor measures a specific value, for example, the last time root logged into a system. The lssensor command can be used to display information about currently defined sensors, as in Example 6-31.

Example 6-31: Displaying sensor information with lssensor

start example
 [root@master /]# lssensor
 CFMRootModTime
 [root@master /]# lssensor CFMRootModTime
  Name = CFMRootModTime
  Command = /opt/csm/csmbin/mtime /cfmroot
  ConfigChanged = 0
  Description =
  ErrorExitValue = 1
  ExitValue = 0
  Float32 = 0
  Float64 = 0
  Int32 = 0
  Int64 = 0
  NodeNameList = {master.cluster.com}
  RefreshInterval = 60
  String = 1034278600
  Uint32 = 0
  Uint64 = 0
  UserName = root
 [root@master /]#
end example

CFM uses the CFMRootModTime sensor to determine when the last file was modified in /cfmroot.
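To make the idea of a sensor command concrete, here is a rough sketch of a script in the spirit of /opt/csm/csmbin/mtime. It is not the CSM implementation: newest_mtime is a hypothetical name, and the sketch assumes GNU find (for -printf). It prints the most recent modification time of any file under a directory as seconds since the epoch, which is the kind of value seen in the String attribute of Example 6-31.

```shell
#!/bin/sh
# Hypothetical sensor command: print the newest file modification time
# under a directory, in seconds since the epoch. Assumes GNU find.
newest_mtime() {
    dir=$1
    # %T@ prints the modification time as seconds (with a fraction);
    # sort numerically, keep the largest, and strip the fraction.
    find "$dir" -type f -printf '%T@\n' 2>/dev/null |
        sort -n | tail -n 1 | cut -d. -f1
}

# If this function were saved as a standalone script, it could in principle
# be registered with mksensor on the management node, for example:
#   mksensor TmpNewestMtime "/usr/local/sbin/newest_mtime /tmp"
# (the sensor name and script path are assumptions; check the mksensor
# man page for the options available on your RSCT level).

newest_mtime "${1:-/tmp}"
```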


A condition determines, based on resources, if an event has occurred. For example, a file system condition may trigger when a file system reaches 95% full. A condition may also contain a re-arm condition, which must become true before any more events will be generated. The file system monitor could re-arm when the file system is less than 70% full.
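The interplay between the event expression and the re-arm expression can be illustrated with a small simulation. This is only a sketch of the behavior described above, not RMC code: simulate_condition is a hypothetical function, and the sample values are invented. It uses the thresholds from the text (event at 95% or more, re-arm below 70%).

```shell
#!/bin/sh
# Simulate event / re-arm evaluation over a series of usage samples.
# Usage: simulate_condition EVENT_THRESHOLD REARM_THRESHOLD sample...
simulate_condition() {
    event_at=$1
    rearm_at=$2
    shift 2
    armed=1   # 1 = waiting for the event expression, 0 = waiting for re-arm
    for pct in "$@"; do
        if [ "$armed" -eq 1 ] && [ "$pct" -ge "$event_at" ]; then
            echo "event at ${pct}%"
            armed=0
        elif [ "$armed" -eq 0 ] && [ "$pct" -lt "$rearm_at" ]; then
            echo "re-armed at ${pct}%"
            armed=1
        fi
    done
}

# Invented samples: the file system fills, stays full, drains, fills again.
simulate_condition 95 70 80 96 97 90 96 65 96
```

Note that the samples 97, 90, and the second 96 produce no output: once the event has fired, nothing further is reported until the re-arm expression has become true.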

The lscondition command will display currently defined conditions. Example 6-32 shows the pre-defined conditions and detail for the CFMRootModTimeChanged condition.

Example 6-32: Displaying condition information using lscondition

start example
 [root@master /]# lscondition
 Displaying condition information:
 Name                            Node                 MonitorStatus
 "NodePowerStatus"               "master.cluster.com" "Not monitored"
 "NodeChanged"                   "master.cluster.com" "Monitored"
 "NodeGroupMembershipChanged"    "master.cluster.com" "Not monitored"
 "AnyNodeTmpSpaceUsed"           "master.cluster.com" "Not monitored"
 "UpdatenodeFailedStatusChange"  "master.cluster.com" "Monitored"
 "AnyNodeFileSystemSpaceUsed"    "master.cluster.com" "Not monitored"
 "AnyNodeProcessorsIdleTime"     "master.cluster.com" "Not monitored"
 "AnyNodeVarSpaceUsed"           "master.cluster.com" "Not monitored"
 "AnyNodeFileSystemInodesUsed"   "master.cluster.com" "Not monitored"
 "CFMRootModTimeChanged"         "master.cluster.com" "Monitored"
 "NodeReachability"              "master.cluster.com" "Not monitored"
 "AnyNodePagingPercentSpaceFree" "master.cluster.com" "Not monitored"
 [root@master /]# lscondition CFMRootModTimeChanged
 Displaying condition information:
 condition 1:
         Name             = "CFMRootModTimeChanged"
         Node             = "master.cluster.com"
         MonitorStatus    = "Monitored"
         ResourceClass    = "IBM.Sensor"
         EventExpression  = "String!=String@P"
         EventDescription = "An event will be generated whenever a file under /cfmroot is added or modified."
         RearmExpression  = ""
         RearmDescription = ""
         SelectionString  = "Name=\"CFMRootModTime\""
         Severity         = "i"
         NodeNames        = {}
         MgtScope         = "l"
 [root@master /]#
end example

Note that the condition carries English descriptions as a reminder of its function. A CFMRootModTimeChanged event occurs whenever the CFMRootModTime sensor changes value.


A response is the action to perform when a condition event occurs. A response links to one or more scripts that perform that action; it may raise a very general alert or carry out a more specific task. A response may be associated with more than one condition.

Example 6-33 shows the predefined responses, which include sending e-mail to root, sending walls, and so on.

Example 6-33: Displaying condition responses using lsresponse

start example
 [root@master /]# lsresponse
 Displaying response information:
 ResponseName                     Node
 "MsgEventsToRootAnyTime"         "master.cluster.com"
 "LogOnlyToAuditLogAnyTime"       "master.cluster.com"
 "BroadcastEventsAnyTime"         "master.cluster.com"
 "rconsoleUpdateResponse"         "master.cluster.com"
 "DisplayEventsAnyTime"           "master.cluster.com"
 "CFMNodeGroupResp"               "master.cluster.com"
 "CFMModResp"                     "master.cluster.com"
 "LogCSMEventsAnyTime"            "master.cluster.com"
 "UpdatenodeFailedStatusResponse" "master.cluster.com"
 [root@master /]# lsresponse CFMModResp
 Displaying response information:
         ResponseName    = "CFMModResp"
         Node            = "master.cluster.com"
         Action          = "CFMModResp"
         DaysOfWeek      = 1-7
         TimeOfDay       = 0000-2400
         ActionScript    = "/opt/csm/csmbin/CFMmodresp"
         ReturnCode      = 0
         CheckReturnCode = "n"
         EventType       = "b"
         StandardOut     = "n"
         EnvironmentVars = ""
         UndefRes        = "n"
 [root@master /]#
end example

As you can see, most of the predefined responses are merely notifications. Where possible, a response that actually solves the problem is usually more desirable; for example, if /tmp fills up, a response could clean old files from it. CFM uses CFMModResp to trigger the cfmupdatenode command.
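A notification-style response script typically does little more than format the details of the event. The sketch below assumes that the event response subsystem passes those details in ERRM_* environment variables: ERRM_NODE_NAME is used later in this section, while ERRM_TYPE, ERRM_COND_NAME, and ERRM_VALUE are names recalled from the RSCT documentation and should be verified there before use. format_event is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical notification helper: build a one-line summary of an event
# from the ERRM_* environment variables (names to be verified against the
# RSCT documentation; only ERRM_NODE_NAME is confirmed by this section).
format_event() {
    printf '%s: condition "%s" on node %s (value: %s)\n' \
        "${ERRM_TYPE:-Event}" \
        "${ERRM_COND_NAME:-unknown}" \
        "${ERRM_NODE_NAME:-unknown}" \
        "${ERRM_VALUE:-n/a}"
}

# A real response might mail or log this line; here it is simply printed.
format_event
```

The defaults after ":-" keep the script usable when it is run by hand outside of RMC, which is helpful while testing a new response.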

6.7.2 Activating condition responses

Just defining a condition and response will not have any effect; the response must be associated with the condition for actual monitoring to occur.

To start monitoring a condition, use startcondresp. In Example 6-34, we set up a configuration so that root will get a message on the terminal when /var starts to fill up.

Example 6-34: Starting event monitoring with startcondresp

start example
 [root@master /]# startcondresp AnyNodeVarSpaceUsed MsgEventsToRootAnyTime
 [root@master /]# lscondresp AnyNodeVarSpaceUsed
 Displaying condition with response information:
 condition-response link 1:
         Condition = "AnyNodeVarSpaceUsed"
         Response  = "MsgEventsToRootAnyTime"
         Node      = "master.cluster.com"
         State     = "Active"
 [root@master /]#
end example

You may associate more than one response with a condition by specifying more than one response on the startcondresp command line or by running it multiple times. This can be useful if you want to both send an alert and perform an action in response to a condition.

6.7.3 Deactivating condition responses

Condition-response associations may be temporarily deactivated using stopcondresp or permanently removed using rmcondresp. If a condition has multiple response associations, they will all be stopped or removed unless an individual response is specified.

Example 6-35 shows the use of stopcondresp and rmcondresp.

Example 6-35: Pausing and removing condition responses

start example
 [root@master /]# stopcondresp AnyNodeVarSpaceUsed MsgEventsToRootAnyTime
 [root@master /]# lscondresp
 Displaying condition with response information:
 Condition                      Response                         Node                 State
 "UpdatenodeFailedStatusChange" "UpdatenodeFailedStatusResponse" "master.cluster.com" "Active"
 "NodeChanged"                  "rconsoleUpdateResponse"         "master.cluster.com" "Active"
 "ViRunning"                    "BroadcastEventsAnyTime"         "master.cluster.com" "Active"
 "TmpNeedsCleaning"             "BroadcastEventsAnyTime"         "master.cluster.com" "Active"
 "TmpNeedsCleaning"             "CleanTmp"                       "master.cluster.com" "Active"
 "AnyNodeVarSpaceUsed"          "MsgEventsToRootAnyTime"         "master.cluster.com" "Not active"
 [root@master /]# rmcondresp AnyNodeVarSpaceUsed
 [root@master /]#
end example
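On a cluster with many associations it is easy to forget one that was stopped but never restarted or removed. As a sketch, the tabular lscondresp output shown in Example 6-35 can be filtered for links in the "Not active" state. inactive_links is a hypothetical helper, and it assumes the four quoted columns appear exactly as in that example.

```shell
#!/bin/sh
# Hypothetical filter: read lscondresp-style table output on stdin and
# print "condition response" for every link whose State is "Not active".
# Splitting on the double quote puts the four columns in $2, $4, $6, $8.
inactive_links() {
    awk -F'"' '$8 == "Not active" { print $2, $4 }'
}

# On the management node this would be used as:
#   lscondresp | inactive_links
# Here it is fed two rows copied from Example 6-35 instead.
inactive_links <<'EOF'
Condition                      Response                         Node                 State
"NodeChanged"                  "rconsoleUpdateResponse"         "master.cluster.com" "Active"
"AnyNodeVarSpaceUsed"          "MsgEventsToRootAnyTime"         "master.cluster.com" "Not active"
EOF
```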

6.7.4 Creating your own conditions and responses

Here we will provide a very brief overview of how to create your own conditions and responses. There is much more information on this subject in the IBM Reliable Scalable Cluster Technology for Linux: Guide and Reference, SA22-7892.

All resources provided by a resource manager feature static and dynamic attributes. Static attributes are constant for the particular instance of a resource, whereas dynamic attributes vary and may be monitored.

CSM includes a resource manager (IBM.FSRM) that provides a file system resource, IBM.FileSystem. In this example, we will utilize IBM.FileSystem to create a condition and response that monitor /tmp. If it starts to fill, we will automatically clean it out.

The static attributes of a FileSystem resource are those that do not change throughout the time the file system is mounted; for example, Dev represents the mounted device and MountDir represents where it is mounted. The dynamic attributes are those that can change; PercentTotUsed represents how full the file system is as a percentage. If the file system is unmounted and re-mounted elsewhere, the existing FileSystem resource will be destroyed and a new instance created.

To list all the static attributes available on IBM.FileSystem resources, with examples for usage, run:

 # lsrsrcdef -e IBM.FileSystem 

List all the dynamic attributes, with examples, using:

 # lsrsrcdef -e -A d IBM.FileSystem 

You can list all attributes of the currently available IBM.FileSystem resources with:

 # lsrsrc -A b IBM.FileSystem 

The CT_MANAGEMENT_SCOPE environment variable will affect which resources are listed. Unset, 0 or 1 will cause only resources on the local node to be listed. When CT_MANAGEMENT_SCOPE is set to 3, resources from the managed nodes will be listed. Remember to unset it before proceeding with the rest of this section.

In order to watch /tmp, we will monitor the PercentTotUsed attribute. Here we create a condition called TmpNeedsCleaning that monitors the IBM.FileSystem resource class. It looks for an instance where MountDir is /tmp and PercentTotUsed (the percentage of space used) is over 90. The condition will re-arm when the file system drops to 85% or less used. The command used is as follows:

 # mkcondition -r IBM.FileSystem -s 'MountDir="/tmp"' -e 'PercentTotUsed>90' -E 'PercentTotUsed<=85' -m m TmpNeedsCleaning 

Note the -m m switch; this specifies the scope of the monitor. Scope m refers to the nodes being managed and is the same as a CT_MANAGEMENT_SCOPE of 3.
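Before relying on the 90% and 85% thresholds, it can be useful to see how full /tmp currently is on a node. The sketch below is a local approximation only: it uses df rather than RMC, so the figure may differ slightly from the PercentTotUsed value that the resource manager reports. percent_used is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage of the file system
# holding the given directory, as reported by df (not by RMC).
percent_used() {
    # -P forces the POSIX one-line-per-file-system format; field 5 is
    # the capacity column, for example "42%".
    df -P "${1:-/tmp}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

used=$(percent_used /tmp)
if [ "${used:-0}" -gt 90 ]; then
    echo "/tmp is ${used}% used: the event expression would be true"
else
    echo "/tmp is ${used:-?}% used: below the event threshold"
fi
```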

Now we need to create a response that will clean /tmp on the correct node. Let us assume that all the nodes have the fictitious script /usr/local/sbin/tmpscrubber installed that will clean all the old files from /tmp. In Example 6-36, we demonstrate a very basic response script. Note that the script will run on the management node but must clean /tmp on the correct node.

Example 6-36: Creating an event response script

start example
 [root@master /]# cat >> /usr/local/sbin/cleannodetmp
 #!/bin/sh
 [ -z "$ERRM_NODE_NAME" ] && exit 1
 # Use -n switch to rsh/ssh to prevent stdin being read
 dsh -o '-n' -n "$ERRM_NODE_NAME" /usr/local/sbin/tmpscrubber
 ^D
 [root@master /]# chmod 755 /usr/local/sbin/cleannodetmp
 [root@master /]#
end example

We can now use the cleannodetmp command as the basis of a response. We create the response using mkresponse:

 # mkresponse -n cleannodetmp -s /usr/local/sbin/cleannodetmp CleanTmp 

The -n switch refers to the name of this particular action and is free-form; there can be multiple actions associated with a single response.
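As described in 6.7.2, the new condition and response have no effect until they are associated. The following sketch links TmpNeedsCleaning to both CleanTmp and the predefined BroadcastEventsAnyTime response, which matches the associations visible in Example 6-35. activate_tmp_cleaning is a hypothetical wrapper; the guard simply avoids a confusing error when the script is run somewhere other than the management node.

```shell
#!/bin/sh
# Hypothetical wrapper: associate the TmpNeedsCleaning condition with the
# CleanTmp and BroadcastEventsAnyTime responses and show the result.
activate_tmp_cleaning() {
    # Bail out early if the RMC commands are not available on this host
    command -v startcondresp >/dev/null 2>&1 || {
        echo "startcondresp not found; run this on the CSM management node" >&2
        return 127
    }
    # More than one response may be given on a single command line
    startcondresp TmpNeedsCleaning CleanTmp BroadcastEventsAnyTime &&
        lscondresp TmpNeedsCleaning
}

# To activate monitoring, call the function on the management node:
#   activate_tmp_cleaning
```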

6.7.5 RMC audit log

Unless the response is some kind of alert, it is not always obvious whether events and responses are being triggered. For this reason, RMC has an audit log where all activity is recorded.

Example 6-37 shows a section of the audit log on our cluster using the lsaudrec command.

Example 6-37: Displaying the audit log with lsaudrec

start example
 [root@master /]# lsaudrec
 Time                   Subsystem Category Description
 07/14/2003 11:05:43 AM      ERRM Info     Monitoring of condition CFMRootModTimeChanged is started successfully.
 07/14/2003 11:05:43 AM      ERRM Info     Event : CFMRootModTimeChanged occurred at 07/14/2003 11:05:43 AM 599548 on CFMRootModTime on master.cluster.com.
 07/14/2003 11:05:43 AM      ERRM Info     Event from CFMRootModTimeChanged that occurred at 07/14/2003 11:05:43 AM 599548 will cause /opt/csm/csmbin/CFMmodresp from CFMModResp to be executed.
 07/14/2003 11:07:44 AM      ERRM Info     Event from CFMRootModTimeChanged that occurred at 07/14/2003 11:05:43 AM 599548 caused /opt/csm/csmbin/CFMmodresp from CFMModResp to complete with a return code of 0.
 ...
 [root@master /]#
end example
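The audit log grows quickly, and the records of most interest are usually those where a response script failed. As a sketch, the lsaudrec output can be filtered for completion records with a non-zero return code. failed_responses is a hypothetical helper, and it assumes the "return code of N" wording shown in Example 6-37.

```shell
#!/bin/sh
# Hypothetical filter: read lsaudrec output on stdin and print only the
# records that report a non-zero return code from a response script.
failed_responses() {
    awk 'match($0, /return code of [0-9]+/) {
             # "return code of " is 15 characters; the digits follow it
             rc = substr($0, RSTART + 15, RLENGTH - 15)
             if (rc + 0 != 0) print
         }'
}

# On the management node this would be used as:
#   lsaudrec | failed_responses
# The two sample records below are invented for illustration.
failed_responses <<'EOF'
ERRM Info  Event from A caused /opt/csm/csmbin/CFMmodresp from CFMModResp to complete with a return code of 0.
ERRM Info  Event from B caused /usr/local/sbin/cleannodetmp from CleanTmp to complete with a return code of 1.
EOF
```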

Linux Clustering with CSM and GPFS
ISBN: 073849870X
Year: 2003
Pages: 123
Authors: IBM Redbooks
