CSM provides an extensible framework for detecting and responding to conditions arising on the cluster. The principle is that the administrator should not have to proactively monitor the status of a cluster; instead, the administrator should be notified when a significant event occurs.
The Resource Management and Control (RMC) subsystem is the scalable backbone of RSCT that provides a generalized framework for managing and monitoring resources.
RMC is comprehensively documented in IBM Reliable Scalable Cluster Technology for Linux: Guide and Reference, SA22-7892. For further detailed information, consult this book.
The following components form the core of RMC:
Resources
Sensors
Conditions
Responses
A resource is a collection of attributes that describe or measure a logical or physical entity; all of RMC hinges on resources. Resources are introduced to RMC through resource managers, such as IBM.DMSRM and IBM.HWCTRLRM.
Resources may be queried using the lsrsrc command, as shown in Example 6-30. The CT_MANAGEMENT_SCOPE environment variable affects which resources lsrsrc will display. If it is unset or set to 0 or 1, lsrsrc will display resources local to the current node. If lsrsrc is run on the management node with CT_MANAGEMENT_SCOPE set to 3, resources for all managed nodes will be displayed; this is the most common usage.
Example 6-30: Displaying node resources using lsrsrc
[root@master /]# export CT_MANAGEMENT_SCOPE=3
[root@master /]# lsrsrc -Ab IBM.Host Name KernelVersion
Resource Persistent Attributes for: IBM.Host
resource 1:
        Name          = "node2.cluster.com"
        KernelVersion = "2.4.18-3smp"
resource 2:
        Name          = "node1.cluster.com"
        KernelVersion = "2.4.18-3smp"
resource 3:
        Name          = "node3.cluster.com"
        KernelVersion = "2.4.18-3smp"
resource 4:
        Name          = "node4.cluster.com"
        KernelVersion = "2.4.18-3smp"
resource 5:
        Name          = "storage1.cluster.com"
        KernelVersion = "2.4.18-3smp"
[root@master /]# unset CT_MANAGEMENT_SCOPE
[root@master /]#
lsrsrc is extremely powerful. Consult the RMC documentation for more information on available resources.
Important: CT_MANAGEMENT_SCOPE should not be set when running the remainder of the commands in this section; if it is set, they will return no information. The following command will unset this variable:
# unset CT_MANAGEMENT_SCOPE
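One way to avoid leaving the variable set by accident is to scope it to a single command. The following is a minimal sketch; with_scope is a hypothetical helper introduced here for illustration and is not part of CSM or RSCT.

```shell
#!/bin/sh
# with_scope SCOPE COMMAND [ARGS...]
# Hypothetical helper (not part of CSM or RSCT): runs one command with
# CT_MANAGEMENT_SCOPE set, inside a subshell, so the calling shell's
# environment is left untouched.
with_scope() {
    scope=$1
    shift
    ( CT_MANAGEMENT_SCOPE=$scope; export CT_MANAGEMENT_SCOPE; "$@" )
}

# Example use on the management node (requires RSCT to be installed):
#   with_scope 3 lsrsrc -Ab IBM.Host Name KernelVersion
```

Because the assignment happens in a subshell, there is nothing to unset afterwards, and the later commands in this section see the same environment as before.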
Sensors are a way of providing resources to RMC without writing a full-blown resource manager. An individual sensor measures a specific value, for example, the last time root logged into a system. The lssensor command can be used to display information about currently defined sensors, as in Example 6-31.
Example 6-31: Displaying sensor information with lssensor
[root@master /]# lssensor
CFMRootModTime
[root@master /]# lssensor CFMRootModTime
Name = CFMRootModTime
Command = /opt/csm/csmbin/mtime /cfmroot
ConfigChanged = 0
Description =
ErrorExitValue = 1
ExitValue = 0
Float32 = 0
Float64 = 0
Int32 = 0
Int64 = 0
NodeNameList = {master.cluster.com}
RefreshInterval = 60
String = 1034278600
Uint32 = 0
Uint64 = 0
UserName = root
[root@master /]#
CFM uses the CFMRootModTime sensor to determine when the last file was modified in /cfmroot.
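To make the idea of a sensor command concrete, the sketch below shows the kind of small script a sensor could run to produce a single value. It is an illustration only: newest_mtime is a hypothetical name, the real /opt/csm/csmbin/mtime may work differently, and the sketch assumes GNU find (for -printf). Sensors themselves are defined with the mksensor command; consult the RSCT reference for its syntax.

```shell
#!/bin/sh
# newest_mtime DIR
# Prints the most recent modification time, in whole seconds since the
# epoch, of DIR or anything under it. Hypothetical stand-in for a sensor
# command; not the actual CSM mtime program. Requires GNU find.
newest_mtime() {
    find "$1" -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1 | cut -d. -f1
}

# Example: newest_mtime /cfmroot
```

A value of this kind changes whenever a file under the directory is added or modified, which is what allows a condition to compare the current value with the previous one.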
A condition determines, based on resources, if an event has occurred. For example, a file system condition may trigger when a file system reaches 95% full. A condition may also contain a re-arm condition, which must become true before any more events will be generated. The file system monitor could re-arm when the file system is less than 70% full.
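The interaction between the event expression and the re-arm expression can be modeled in a few lines of shell. This is an illustrative model only, not how RMC is implemented; the thresholds mirror the 95% and 70% figures above, and simulate_condition is a hypothetical name.

```shell
#!/bin/sh
# Illustrative model of event / re-arm behavior; not RMC code.
EVENT_AT=95   # generate an event when usage reaches this percentage
REARM_AT=70   # re-arm once usage falls below this percentage

# simulate_condition SAMPLE...
# Takes a series of "percent used" samples and prints one line for each
# event generated and each re-arm.
simulate_condition() {
    armed=1   # 1 = waiting for an event, 0 = waiting for re-arm
    for used in "$@"; do
        if [ "$armed" -eq 1 ] && [ "$used" -ge "$EVENT_AT" ]; then
            echo "event at ${used}%"
            armed=0
        elif [ "$armed" -eq 0 ] && [ "$used" -lt "$REARM_AT" ]; then
            echo "re-armed at ${used}%"
            armed=1
        fi
    done
}

simulate_condition 80 96 97 90 72 69 96
# Prints:
#   event at 96%
#   re-armed at 69%
#   event at 96%
```

Note that the samples at 97% and 90% produce no further events: once an event has fired, nothing more is reported until the re-arm expression has become true.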
The lscondition command will display currently defined conditions. Example 6-32 shows the pre-defined conditions and detail for the CFMRootModTimeChanged condition.
Example 6-32: Displaying condition information using lscondition
[root@master /]# lscondition
Displaying condition information:
Name                            Node                 MonitorStatus
"NodePowerStatus"               "master.cluster.com" "Not monitored"
"NodeChanged"                   "master.cluster.com" "Monitored"
"NodeGroupMembershipChanged"    "master.cluster.com" "Not monitored"
"AnyNodeTmpSpaceUsed"           "master.cluster.com" "Not monitored"
"UpdatenodeFailedStatusChange"  "master.cluster.com" "Monitored"
"AnyNodeFileSystemSpaceUsed"    "master.cluster.com" "Not monitored"
"AnyNodeProcessorsIdleTime"     "master.cluster.com" "Not monitored"
"AnyNodeVarSpaceUsed"           "master.cluster.com" "Not monitored"
"AnyNodeFileSystemInodesUsed"   "master.cluster.com" "Not monitored"
"CFMRootModTimeChanged"         "master.cluster.com" "Monitored"
"NodeReachability"              "master.cluster.com" "Not monitored"
"AnyNodePagingPercentSpaceFree" "master.cluster.com" "Not monitored"
[root@master /]# lscondition CFMRootModTimeChanged
Displaying condition information:
condition 1:
        Name             = "CFMRootModTimeChanged"
        Node             = "master.cluster.com"
        MonitorStatus    = "Monitored"
        ResourceClass    = "IBM.Sensor"
        EventExpression  = "String!=String@P"
        EventDescription = "An event will be generated whenever a file under /cfmroot is added or modified."
        RearmExpression  = ""
        RearmDescription = ""
        SelectionString  = "Name=\"CFMRootModTime\""
        Severity         = "i"
        NodeNames        = {}
        MgtScope         = "l"
[root@master /]#
Note that the condition carries plain-English descriptions as a reminder of its function. A CFMRootModTimeChanged event occurs whenever the CFMRootModTime sensor changes value.
A response is the action to perform when a condition event occurs. A response links to one or more scripts that perform a certain action. It may raise a very general alert or perform a more specific task. Responses may be associated with more than one condition.
Example 6-33 shows the predefined responses, which include sending e-mail to root, sending walls, and so on.
Example 6-33: Displaying condition responses using lsresponse
[root@master /]# lsresponse
Displaying response information:
ResponseName                     Node
"MsgEventsToRootAnyTime"         "master.cluster.com"
"LogOnlyToAuditLogAnyTime"       "master.cluster.com"
"BroadcastEventsAnyTime"         "master.cluster.com"
"rconsoleUpdateResponse"         "master.cluster.com"
"DisplayEventsAnyTime"           "master.cluster.com"
"CFMNodeGroupResp"               "master.cluster.com"
"CFMModResp"                     "master.cluster.com"
"LogCSMEventsAnyTime"            "master.cluster.com"
"UpdatenodeFailedStatusResponse" "master.cluster.com"
[root@master /]# lsresponse CFMModResp
Displaying response information:
        ResponseName    = "CFMModResp"
        Node            = "master.cluster.com"
        Action          = "CFMModResp"
        DaysOfWeek      = 1-7
        TimeOfDay       = 0000-2400
        ActionScript    = "/opt/csm/csmbin/CFMmodresp"
        ReturnCode      = 0
        CheckReturnCode = "n"
        EventType       = "b"
        StandardOut     = "n"
        EnvironmentVars = ""
        UndefRes        = "n"
[root@master /]#
As you can see, most of the predefined responses are merely notifications. Where possible, a response that actually solves the problem is usually more desirable; for example, if /tmp fills up, the response could clean old files from it. CFM uses CFMModResp to trigger the cfmupdatenode command.
Just defining a condition and response will not have any effect; the response must be associated with the condition for actual monitoring to occur.
To start monitoring a condition, use startcondresp. In Example 6-34, we set up a configuration so that root will get a message on the terminal when /var starts to fill up.
Example 6-34: Starting event monitoring with startcondresp
[root@master /]# startcondresp AnyNodeVarSpaceUsed MsgEventsToRootAnyTime
[root@master /]# lscondresp AnyNodeVarSpaceUsed
Displaying condition with response information:
condition-response link 1:
        Condition = "AnyNodeVarSpaceUsed"
        Response  = "MsgEventsToRootAnyTime"
        Node      = "master.cluster.com"
        State     = "Active"
[root@master /]#
You may associate more than one response with a condition by specifying more than one response on the startcondresp command line or by running it multiple times. This can be useful if you want to both send an alert and perform an action in response to a condition.
Condition response associations may be temporarily de-activated using stopcondresp or permanently removed using rmcondresp. If multiple response associations to a condition are defined, they will all be stopped or removed unless an individual response is specified.
Example 6-35 shows the use of stopcondresp and rmcondresp.
Example 6-35: Pausing and removing condition responses
[root@master /]# stopcondresp AnyNodeVarSpaceUsed MsgEventsToRootAnyTime
[root@master /]# lscondresp
Displaying condition with response information:
Condition                      Response                         Node                 State
"UpdatenodeFailedStatusChange" "UpdatenodeFailedStatusResponse" "master.cluster.com" "Active"
"NodeChanged"                  "rconsoleUpdateResponse"         "master.cluster.com" "Active"
"ViRunning"                    "BroadcastEventsAnyTime"         "master.cluster.com" "Active"
"TmpNeedsCleaning"             "BroadcastEventsAnyTime"         "master.cluster.com" "Active"
"TmpNeedsCleaning"             "CleanTmp"                       "master.cluster.com" "Active"
"AnyNodeVarSpaceUsed"          "MsgEventsToRootAnyTime"         "master.cluster.com" "Not active"
[root@master /]# rmcondresp AnyNodeVarSpaceUsed
[root@master /]#
Here we will provide a very brief overview of how to create your own conditions and responses. There is much more information on this subject in the IBM Reliable Scalable Cluster Technology for Linux: Guide and Reference, SA22-7892.
All resources provided by a resource manager feature static and dynamic attributes. Static attributes are constant for the particular instance of a resource, whereas dynamic attributes vary and may be monitored.
CSM includes a resource manager (IBM.FSRM) that provides a file system resource, IBM.FileSystem. In this example, we will utilize IBM.FileSystem to create a condition and response that monitor /tmp. If it starts to fill, we will automatically clean it out.
The static attributes of a FileSystem resource are those that do not change throughout the time the file system is mounted; for example, Dev represents the mounted device and MountDir represents where it is mounted. The dynamic attributes are those that can change; PercentTotUsed represents how full the file system is as a percentage. If the file system is unmounted and re-mounted elsewhere, the existing FileSystem resource will be destroyed and a new instance created.
To list all the static attributes available on IBM.FileSystem resources, with examples for usage, run:
# lsrsrcdef -e IBM.FileSystem
List all the dynamic attributes, with examples, using:
# lsrsrcdef -e -A d IBM.FileSystem
List all attributes of the currently available IBM.FileSystem resources with:
# lsrsrc -A b IBM.FileSystem
The CT_MANAGEMENT_SCOPE environment variable will affect which resources are listed. Unset, 0 or 1 will cause only resources on the local node to be listed. When CT_MANAGEMENT_SCOPE is set to 3, resources from the managed nodes will be listed. Remember to unset it before proceeding with the rest of this section.
To watch /tmp, we will monitor the PercentTotUsed attribute. Here we create a condition called TmpNeedsCleaning that monitors the IBM.FileSystem resource. It looks for an instance where MountDir is /tmp and PercentTotUsed (the percentage of space used) is over 90. The condition will re-arm when the file system drops to 85% or less used. The command used is as follows:
# mkcondition -r IBM.FileSystem -s 'MountDir="/tmp"' -e 'PercentTotUsed>90' -E 'PercentTotUsed<=85' -m m TmpNeedsCleaning
Note the -m m switch; this specifies the scope of the monitor. Scope m refers to the nodes being managed and is the same as a CT_MANAGEMENT_SCOPE of 3.
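The same pattern can be reused for other mount points. The sketch below only prints a mkcondition command line built from the switches shown above, so it can be reviewed before being run; fs_condition_cmd is a hypothetical helper, not a CSM or RSCT command.

```shell
#!/bin/sh
# fs_condition_cmd MOUNTDIR EVENT_PCT REARM_PCT NAME
# Prints (does not run) a mkcondition command modeled on the
# TmpNeedsCleaning example. Hypothetical helper for illustration.
fs_condition_cmd() {
    printf "mkcondition -r IBM.FileSystem -s 'MountDir=\"%s\"' -e 'PercentTotUsed>%s' -E 'PercentTotUsed<=%s' -m m %s\n" \
        "$1" "$2" "$3" "$4"
}

fs_condition_cmd /tmp 90 85 TmpNeedsCleaning
```

Called with the arguments shown, the helper prints the same command as the one used above to create TmpNeedsCleaning.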
Now we need to create a response that will clean /tmp on the correct node. Let us assume that all the nodes have the fictitious script /usr/local/sbin/tmpscrubber installed that will clean all the old files from /tmp. In Example 6-36, we demonstrate a very basic response script. Note that the script will run on the management node but must clean /tmp on the correct node.
Example 6-36: Creating an event response script
[root@master /]# cat >> /usr/local/sbin/cleannodetmp
#!/bin/sh
[ -z "$ERRM_NODE_NAME" ] && exit 1
# Use -n switch to rsh/ssh to prevent stdin being read
dsh -o '-n' -n "$ERRM_NODE_NAME" /usr/local/sbin/tmpscrubber
^D
[root@master /]# chmod 755 /usr/local/sbin/cleannodetmp
[root@master /]#
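Before wiring a response script into RMC, it can help to exercise its logic by hand. The sketch below restates the script's logic as a shell function and replaces dsh with a stub that only echoes its arguments, so it can be tried away from the cluster. The stub and the function wrapper are assumptions made for illustration; the ERRM_NODE_NAME variable is the one used by the script above.

```shell
#!/bin/sh
# Stub standing in for the real dsh command, so that the guard and the
# argument order can be checked without touching any node.
dsh() { echo "dsh $*"; }

# Same logic as the cleannodetmp script, wrapped in a function.
cleannodetmp() {
    # Refuse to run if the event did not supply a node name.
    [ -z "$ERRM_NODE_NAME" ] && return 1
    dsh -o '-n' -n "$ERRM_NODE_NAME" /usr/local/sbin/tmpscrubber
}

# Simulate an event from node1:
ERRM_NODE_NAME=node1.cluster.com
cleannodetmp
# Prints: dsh -o -n -n node1.cluster.com /usr/local/sbin/tmpscrubber
```

With ERRM_NODE_NAME empty, the function returns 1 without calling dsh, which is the behavior the guard on the first line of the script is there to provide.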
We can now use the cleannodetmp command as the basis of a response. We create the response using mkresponse:
# mkresponse -n cleannodetmp -s /usr/local/sbin/cleannodetmp CleanTmp
The -n switch refers to the name of this particular action and is free-form; there can be multiple actions associated with a single response.
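As noted earlier, defining a condition and a response has no effect until the two are associated. A minimal sketch of that final step follows, attaching both the predefined BroadcastEventsAnyTime response and the new CleanTmp response to TmpNeedsCleaning in one startcondresp call. The run wrapper and its DRY_RUN variable are conventions of this sketch, not RSCT features; they let the commands be printed for review instead of executed.

```shell
#!/bin/sh
# run COMMAND [ARGS...]
# Executes the command, or just prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Link the condition to an alert and to the cleanup response, then show
# the resulting associations.
activate_tmp_cleaning() {
    run startcondresp TmpNeedsCleaning BroadcastEventsAnyTime CleanTmp &&
    run lscondresp TmpNeedsCleaning
}

# Review the commands first:
#   DRY_RUN=1; activate_tmp_cleaning
# Then, on the management node:
#   DRY_RUN=0; activate_tmp_cleaning
```

Attaching an alert alongside the cleanup action means the administrator is still told that /tmp filled up, even though the problem is being corrected automatically.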
Unless the response is some kind of alert, it is not always obvious whether events and responses are being triggered. For this reason, RMC has an audit log where all activity is recorded.
Example 6-37 shows a section of the audit log on our cluster using the lsaudrec command.
Example 6-37: Displaying the audit log with lsaudrec
[root@master /]# lsaudrec
Time                   Subsystem Category Description
07/14/2003 11:05:43 AM ERRM      Info     Monitoring of condition CFMRootModTimeChanged is started successfully.
07/14/2003 11:05:43 AM ERRM      Info     Event : CFMRootModTimeChanged occurred at 07/14/2003 11:05:43 AM 599548 on CFMRootModTime on master.cluster.com.
07/14/2003 11:05:43 AM ERRM      Info     Event from CFMRootModTimeChanged that occurred at 07/14/2003 11:05:43 AM 599548 will cause /opt/csm/csmbin/CFMmodresp from CFMModResp to be executed.
07/14/2003 11:07:44 AM ERRM      Info     Event from CFMRootModTimeChanged that occurred at 07/14/2003 11:05:43 AM 599548 caused /opt/csm/csmbin/CFMmodresp from CFMModResp to complete with a return code of 0.
...
[root@master /]#