One of the key capabilities of CSM is to monitor the cluster resources and react to events. Therefore, the monitoring component in CSM is very important to understand.
This section does not presume to replace the Monitoring HowTo documentation, but rather is intended to provide the administrator with a basic understanding of the CSM monitoring architecture and components. Chapter 6, "Cluster management with CSM" on page 149 provides a great deal of information on how to manage a Linux cluster using CSM.
The following are definitions of the terms condition and response in the context of CSM.
Condition | A measurement or status that a subsystem must reach to generate an alert, for example, more than 90 percent of CPU workload. |
Response | An action that is executed in response to a condition that has been encountered. |
CSM uses Resource Managers to monitor a cluster. These Resource Managers use resource classes to update the attributes of the object representing the resources they are monitoring. A CSM daemon then monitors these attributes to identify a condition and to initiate a response.
A Resource Manager is a daemon that runs on the management server and on the managed nodes.
A resource class is a group of functions and values used by the Resource Manager.
The list of Resource Manager subsystems, along with their status, can be obtained by using the lssrc -a command, as in Example 3-2.
Example 3-2: lssrc -a command result
[root@masternode root]# lssrc -a
Subsystem         Group            PID     Status
 ctrmc            rsct             776     active
 IBM.ERRM         rsct_rm          798     active
 IBM.DMSRM        rsct_rm          814     active
 IBM.AuditRM      rsct_rm          850     active
 ctcas            rsct             865     active
 IBM.HostRM       rsct_rm          970     active
 IBM.SensorRM     rsct_rm          1003    active
 IBM.FSRM         rsct_rm          10972   active
 IBM.HWCTRLRM     rsct_rm          14755   active
 IBM.ConfigRM     rsct_rm          20323   active
 IBM.CSMAgentRM   rsct_rm                  inoperative
[root@masternode root]#
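As a small illustration of working with this output, the following sketch filters lssrc -a style text for Resource Managers that are not running. The function name inoperative_rms and the embedded sample rows (copied from Example 3-2) are only for illustration, and the column layout is assumed to match the example above.

```shell
#!/bin/sh
# Print the subsystems in the rsct_rm group whose status is "inoperative".
# Reads lssrc -a style output on stdin. Inoperative rows have no PID column,
# so the status is taken from the last field rather than a fixed position.
inoperative_rms() {
    awk '$2 == "rsct_rm" && $NF == "inoperative" { print $1 }'
}

# Sample rows copied from Example 3-2.
# On a real node, use: lssrc -a | inoperative_rms
sample_lssrc='Subsystem         Group            PID     Status
 ctrmc            rsct             776     active
 IBM.ERRM         rsct_rm          798     active
 IBM.HostRM       rsct_rm          970     active
 IBM.CSMAgentRM   rsct_rm                  inoperative'

printf '%s\n' "$sample_lssrc" | inoperative_rms
```

With the sample rows, only IBM.CSMAgentRM is reported.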
The Resource Manager works with resource classes. These classes contain all the variables and functions that the Resource Manager can use or update.
To get the list of all the available resource classes, use the lsrsrc command, as in Example 3-3.
Example 3-3: lsrsrc command result
[root@masternode root]# lsrsrc
class_name
"IBM.Association"
"IBM.AuditLog"
"IBM.AuditLogTemplate"
"IBM.Condition"
"IBM.EthernetDevice"
"IBM.EventResponse"
"IBM.Host"
"IBM.FileSystem"
"IBM.Program"
"IBM.TokenRingDevice"
"IBM.Sensor"
"IBM.ManagedNode"
"IBM.NodeGroup"
"IBM.ManagementServer"
"IBM.NodeAuthenticate"
"IBM.NetworkInterface"
"IBM.DmsCtrl"
"IBM.NodeHwCtrl"
"IBM.HwCtrlPoint"
"IBM.HostPublic"
[root@masternode root]#
Resource classes contain a great deal of information. Some of this information is static (or persistent), like node hostname, and some is dynamic. Both static and dynamic values can be used to define conditions, but usually the dynamic values are used to set conditions. A static value in theory would not change, so a condition based on a static value is usually less interesting.
The static values of a class can be found with the lsrsrc -a p <class_name> command, where <class_name> is the name of the class, for example "IBM.Host", and p specifies persistent values. Example 3-4 shows the static values from the IBM.Host class.
Example 3-4: lsrsrc -a p "IBM.Host" result
[root@masternode root]# lsrsrc -a p "IBM.Host"
Resource Persistent Attributes for: IBM.Host
resource 1:
        Name                = "masternode.cluster.com"
        NodeNameList        = {"masternode.cluster.com"}
        NumProcessors       = 2
        RealMemSize         = 1055698944
        OSName              = "Linux"
        KernelVersion       = "2.4.18-10smp"
        DistributionName    = "Red Hat"
        DistributionVersion = "7.3"
        Architecture        = "i686"
[root@masternode root]#
It is also possible to list the dynamic values of a class by typing the lsrsrc -a d <class_name> command, where <class_name> is the name of the class, for example "IBM.Host", and d specifies dynamic values. Example 3-5 shows the dynamic values from the IBM.Host class.
Example 3-5: lsrsrc -a d "IBM.Host" output
[root@masternode root]# lsrsrc -a d "IBM.Host"
Resource Dynamic Attributes for: IBM.Host
resource 1:
        ActiveMgtScopes    = 5
        UpTime             = 1635130
        NumUsers           = 9
        LoadAverage        = {0,0,0}
        VMPgSpOutRate      = 0
        VMPgSpInRate       = 0
        VMPgOutRate        = 35
        VMPgInRate         = 0
        PctRealMemFree     = 1
        PctTotalTimeKernel = 0.228220599793863
        PctTotalTimeUser   = 1.4116708831583
        PctTotalTimeWait   = 0.000412104445455916
        PctTotalTimeIdle   = 98.3596964126024
        PctTotalPgSpFree   = 99.626756733549
        PctTotalPgSpUsed   = 0.37324326645104
        TotalPgSpFree      = 522099
        TotalPgSpSize      = 524055
[root@masternode root]#
It is also possible to get both static (persistent) and dynamic values at once by typing the lsrsrc -a b <class_name> command, where <class_name> is the name of the class and b specifies both persistent and dynamic values.
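When only one attribute is of interest, the listing can be filtered. The following is a minimal sketch, assuming the "Name = value" layout shown in Examples 3-4 and 3-5. The function name attr_value is hypothetical, and the sample text is an excerpt of Example 3-5.

```shell
#!/bin/sh
# Extract one attribute value from lsrsrc-style "Name = value" output on stdin.
# Only the first whitespace-delimited token after "=" is printed, so this
# suits numeric attributes rather than quoted strings that contain spaces.
attr_value() {
    awk -v name="$1" '$1 == name && $2 == "=" { print $3; exit }'
}

# Excerpt of Example 3-5.
# On a real node, use: lsrsrc -a d "IBM.Host" | attr_value PctTotalPgSpFree
sample_host='Resource Dynamic Attributes for: IBM.Host
resource 1:
        NumUsers           = 9
        PctTotalTimeIdle   = 98.3596964126024
        PctTotalPgSpFree   = 99.626756733549'

printf '%s\n' "$sample_host" | attr_value PctTotalPgSpFree
```

A value obtained this way is a one-time snapshot. Continuous monitoring is better left to conditions, which are evaluated by the monitoring subsystem itself.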
This section describes the CSM Resource Managers and explains the classes they use to manage the nodes. CSM resource managers are listed in Example 3-2 on page 51. These include the following:
IBM.AuditRM (audit log)
IBM.DMSRM (distributed management server)
IBM.ERRM (event responses)
IBM.FSRM (file system)
IBM.HostRM (host)
IBM.HWCTRLRM (hardware)
IBM.SensorRM (sensor)
IBM.CSMAgentRM (fundamentals)
IBM.ConfigRM (network)
The IBM.AuditRM Resource Manager provides a system-wide facility for recording information about the system's operations, which is particularly useful for tracking subsystems running in the background.
It has two resource classes: IBM.AuditLog and IBM.AuditLogTemplate.
These classes allow subsystems to manage logs (add, delete, and count records in log files).
There are no predefined conditions for these classes.
The IBM.DMSRM Resource Manager manages a set of nodes that are part of a system management cluster. This includes monitoring the status of the nodes and adding, removing, and changing cluster nodes' attributes.
It runs only on the management server.
It has four resource classes:
IBM.ManagedNode
IBM.NodeGroup
IBM.NodeAuthenticate
IBM.DmsCtrl
The IBM.ManagedNode class provides information on a node and its status. The IBM.ManagedNode class has four predefined conditions, which are NodeReachability, NodeChanged, UpdatenodeFailedStatusChange, and NodePowerStatus. A list of all predefined conditions, along with their definitions, may be found in 3.3.3, "Predefined conditions" on page 57.
The IBM.NodeGroup class provides information about created groups. This class also contains the NodeGroupMembershipChanged predefined condition.
The IBM.NodeAuthenticate class holds the private and public key information used to authenticate each CSM transaction between the management node and the managed nodes of a cluster. There are no predefined conditions for this class.
The IBM.DmsCtrl class provides CSM with information such as whether CSM should accept a management request from an unrecognized node, the maximum number of nodes allowed to be managed, the remote shell to be used by CSM applications, the type, model, and serial number assigned to the cluster, and whether CSM should attempt to automatically set up remote shell access to the nodes in the cluster.
The IBM.ERRM Resource Manager provides the ability to take actions in response to conditions occurring on the system. When an event occurs, ERRM runs user-configured or predefined scripts or commands.
The Event Response Manager uses three classes:
IBM.Condition
IBM.EventResponse
IBM.Association
The IBM.Condition class contains all of the defined conditions for the cluster.
The second class is the IBM.EventResponse class. This class contains all the responses that can be applied to an event.
The IBM.Association class contains the associations between conditions and responses.
The IBM.FSRM Resource Manager is used to monitor everything associated with file systems. It includes a list of all file systems, their status, and attributes such as the amount of space or i-nodes used.
The FSRM uses only one class, the IBM.FileSystem class. This class provides the Resource Manager all of the functions and data it needs to monitor a file system.
The IBM.FileSystem class has four predefined conditions, which are AnyNodeFileSystemInodesUsed, AnyNodeFileSystemSpaceUsed, AnyNodeTmpSpaceUsed, and AnyNodeVarSpaceUsed.
The IBM.HostRM Resource Manager monitors resources related to an individual machine. The types of values that are provided relate to load (processes, paging space, and memory usage) and status of the operating system. It also monitors program activity from initiation until termination.
The IBM.HostRM Resource Manager uses five resource classes:
IBM.Host | This class gives the Resource Manager the ability to monitor the paging space and total processor utilization. AnyNodePagingPercentSpaceFree and AnyNodeProcessorsIdleTime are the two predefined conditions for this class. |
IBM.Program | This class is used to monitor the set of processes that are running. |
IBM.EthernetDevice | This class is used to monitor Ethernet network interfaces, and provides interface statistics. |
IBM.TokenRingDevice | This class is used to monitor token ring network interfaces, and provides interface statistics. |
IBM.HostPublic | This class contains a public key used for transaction authentication. |
The IBM.HWCTRLRM Resource Manager is used to monitor and control node hardware. There are two resource classes associated with this Resource Manager.
The IBM.NodeHwCtrl class provides support for powering a node on and off, resetting a node, querying the power status of a node, resetting a node's service processor, and resetting a node's hardware control point. It provides CSM with node control information, such as the node hardware type, hardware model number, hardware serial number, host name of the network adapter for the console server, console method used to open node console, and MAC address of the network adapter used to perform node installations.
The IBM.HwCtrlPoint class provides support for defining a node's hardware control point. It contains information such as the power method used for a particular node, the time interval between power status queries, and the symbolic names of the nodes where the operational interface for hardware control is available.
The IBM.SensorRM Resource Manager provides a means to create a single user-defined attribute to be monitored by the RMC subsystem.
This Resource Manager uses only one resource class: IBM.Sensor. This resource class enables you to create your own monitors. For example, a script can be written to return the number of users logged on to the system, then an ERRM condition and a response can be defined to run an action when the number of users logged on exceeds a certain threshold.
By default, the IBM.Sensor class has one predefined condition: CFMRootModTimeChanged. This condition is used to generate an event each time the /cfmroot directory is changed. Sensors are created using the mksensor command. The mksensor command adds an event sensor command to the Resource Monitoring and Control (RMC) subsystem.
The IBM.CSMAgentRM Resource Manager holds fundamental parameters and definitions used by CSM. It contains only one resource class: IBM.ManagementServer. This resource class provides CSM with information about the management node, such as the host name, NodeID, type, and all the host name aliases.
The IBM.ConfigRM Resource Manager provides CSM with networking information. It contains only one resource class: IBM.NetworkInterface. This resource class holds information such as the name of the network interface of the management node, the network device that hosts the network interface, the base IP address, subnet mask, all the other IP addresses that have been assigned to the network interface (as well as the network switch Network ID), and the device-specific switch adapter logical ID.
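The logged-on users example mentioned above could look roughly like the following sensor script. This is a sketch under stated assumptions rather than a tested sensor: it assumes that the sensor command reports its value by printing a line of the form Int32=<number> on standard output, and the function names count_sessions and sensor_line are hypothetical. Check the mksensor documentation for the output format your RSCT level expects.

```shell
#!/bin/sh
# Hypothetical sensor script body: report the number of login sessions.
# Assumption: the sensor command reports its value by printing an
# "Int32=<number>" line on standard output.

# Count non-empty lines of who(1)-style input read from stdin.
count_sessions() {
    awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Format a value as the assumed sensor output line.
sensor_line() {
    printf 'Int32=%s\n' "$1"
}

# Use the real who(1) when it is available; otherwise report zero sessions.
if command -v who >/dev/null 2>&1; then
    sensor_line "$(who | count_sessions)"
else
    sensor_line 0
fi
```

Such a script would then be registered with mksensor under a name of your choosing, and a condition on the IBM.Sensor class could compare the reported value against a threshold.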
CSM automatically predefines a number of conditions in all resource classes. Use the lscondition command to view the currently defined conditions. It will provide a list of all the condition names with the monitoring status for each condition, as shown in Example 3-6.
Example 3-6: Predefined conditions
[root@masternode root]# lscondition
Displaying condition information:
Name                            Node                      MonitorStatus
"NodePowerStatus"               "masternode.cluster.com"  "Not monitored"
"NodeChanged"                   "masternode.cluster.com"  "Monitored"
"NodeGroupMembershipChanged"    "masternode.cluster.com"  "Not monitored"
"AnyNodeTmpSpaceUsed"           "masternode.cluster.com"  "Not monitored"
"UpdatenodeFailedStatusChange"  "masternode.cluster.com"  "Monitored"
"AnyNodeFileSystemSpaceUsed"    "masternode.cluster.com"  "Not monitored"
"AnyNodeProcessorsIdleTime"     "masternode.cluster.com"  "Not monitored"
"AnyNodeVarSpaceUsed"           "masternode.cluster.com"  "Not monitored"
"AnyNodeFileSystemInodesUsed"   "masternode.cluster.com"  "Not monitored"
"CFMRootModTimeChanged"         "masternode.cluster.com"  "Not monitored"
"NodeReachability"              "masternode.cluster.com"  "Not monitored"
"AnyNodePagingPercentSpaceFree" "masternode.cluster.com"  "Not monitored"
[root@masternode root]#
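To see at a glance which conditions still need a response before they do anything, the listing can be filtered on the MonitorStatus column. The sketch below assumes the double-quoted three-column layout of Example 3-6. The function name unmonitored_conditions is hypothetical, and the sample rows are an excerpt of that example.

```shell
#!/bin/sh
# Print the names of conditions whose MonitorStatus is "Not monitored".
# Splitting on the double-quote character makes the quoted columns
# fields 2, 4 and 6, so names containing spaces are handled correctly.
unmonitored_conditions() {
    awk -F'"' '$6 == "Not monitored" { print $2 }'
}

# Excerpt of Example 3-6.
# On a real node, use: lscondition | unmonitored_conditions
sample_conditions='Name                   Node                      MonitorStatus
"NodePowerStatus"      "masternode.cluster.com"  "Not monitored"
"NodeChanged"          "masternode.cluster.com"  "Monitored"
"AnyNodeTmpSpaceUsed"  "masternode.cluster.com"  "Not monitored"'

printf '%s\n' "$sample_conditions" | unmonitored_conditions
```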
Here we provide a description of all the conditions that are predefined in all resource classes:
NodePowerStatus
An event will be generated whenever the power status of the node is no longer 1 (1 means power is on). This will typically happen either when the node is powered off or the power status of the node cannot be determined for some reason. A rearm event will be generated when the node is powered up again.
NodeChanged
An event is generated when a node definition in the ManagedNode resource class changes.
NodeGroupMembershipChanged
An event will be generated whenever a node is added to or deleted from a previously existing NodeGroup.
AnyNodeTmpSpaceUsed
An event is generated when more than 90% of the total space in the /tmp directory is in use. The event is rearmed when the percentage of space used in the /tmp directory falls below 75%.
UpdatenodeFailedStatusChange
An event will be generated when a node on which the updatenode command failed now has online status.
AnyNodeFileSystemSpaceUsed
An event is generated when more than 90% of the total space in the file system is in use. The event is rearmed when the percentage of space used in the file system falls below 75%.
AnyNodeProcessorsIdleTime
An event is generated when the average time all processors are idle is at least 70% of the time. The event is rearmed when the idle time decreases below 10%.
AnyNodeVarSpaceUsed
An event is generated when more than 90% of the total space in the /var directory is in use. The event is rearmed when the percentage of space used in the /var directory falls below 75%.
AnyNodeFileSystemInodesUsed
An event is generated when more than 90% of the total i-nodes in the file system are in use. The event is rearmed when the percentage of i-nodes used in the file system falls below 75%.
CFMRootModTimeChanged
An event is generated when a file under /cfmroot is modified, added, or removed.
NodeReachability
An event is generated when a node in the network cannot be reached from the server. The event is rearmed when the node can be reached again.
AnyNodePagingPercentSpaceFree
An event is generated when more than 90% of the total paging space is in use. The event is rearmed when the percentage falls below 85%.
The RMC process keeps the attribute values of every class up to date, so that each defined condition can be evaluated against them and generate an alert when its event expression is met. Then, in response to this alert, an action can be launched.
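The interplay between an event threshold and a rearm threshold in the descriptions above can be modeled in a few lines. The following sketch is a simplified illustration of that behavior, not RMC's actual implementation. The function name watch_usage and the series of sample readings are made up, while the thresholds follow AnyNodeFileSystemSpaceUsed.

```shell
#!/bin/sh
# Simplified model of event/rearm behaviour for a "space used" condition.
# Reads one percentage per line; prints "event N" when usage rises above
# the event threshold while armed, and "rearm N" when it later falls
# below the rearm threshold. No further event fires until rearmed.
watch_usage() {
    awk -v event_at="$1" -v rearm_at="$2" '
        BEGIN { armed = 1 }
        armed && $1 > event_at   { print "event", $1; armed = 0; next }
        !armed && $1 < rearm_at  { print "rearm", $1; armed = 1 }
    '
}

# Thresholds follow AnyNodeFileSystemSpaceUsed: event above 90%, rearm below 75%.
printf '%s\n' 50 95 92 80 70 95 | watch_usage 90 75
```

With these readings the sketch reports an event at 95, stays silent at 92 and 80 because the condition has not been rearmed, rearms at 70, and reports a second event at the final 95.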
Once an event has been generated, ERRM can generate a response. The list of actions can be shown with the lsresponse command. This section describes the predefined responses provided by CSM.
Example 3-7 shows the predefined responses.
Example 3-7: List of predefined responses
[root@masternode root]# lsresponse
Displaying response information:
ResponseName                      Node
"MsgEventsToRootAnyTime"          "masternode.cluster.com"
"LogOnlyToAuditLogAnyTime"        "masternode.cluster.com"
"BroadcastEventsAnyTime"          "masternode.cluster.com"
"rconsoleUpdateResponse"          "masternode.cluster.com"
"DisplayEventsAnyTime"            "masternode.cluster.com"
"CFMNodeGroupResp"                "masternode.cluster.com"
"CFMModResp"                      "masternode.cluster.com"
"LogCSMEventsAnyTime"             "masternode.cluster.com"
"UpdatenodeFailedStatusResponse"  "masternode.cluster.com"
[root@masternode root]#
The responses in CSM are scripts or commands. The following list describes what each of the responses does and the associated scripts or commands:
MsgEventsToRootAnyTime
Command: /usr/sbin/rsct/bin/msgevent root
This response sends a message to a specified user (in this example, to the user root).
LogOnlyToAuditLogAnyTime
This response simply logs an event in the audit log, but does not take any action.
BroadcastEventsAnyTime
Command: /usr/sbin/rsct/bin/wallevent
This response sends an event or a rearm event to all users who are logged in.
rconsoleUpdateResponse
Command: /opt/csm/csmbin/rconsoleUpdate_response
This response runs an internal command that is used as part of the automatic rconsole configuration file update facility.
DisplayEventsAnyTime
Command: /usr/sbin/rsct/bin/displayevent admindesktop:0
This response notifies a user of an event by displaying it on the user's X Window console, here the admindesktop:0 console.
CFMNodeGroupResp
Command: /opt/csm/csmbin/CFMnodegroupresp
This response is used internally to determine whether changed files in /cfmroot belong to a particular NodeGroup.
CFMModResp
Command: /opt/csm/csmbin/CFMmodresp
This response is used internally to perform an update to all nodes when the /cfmroot directory changes.
LogCSMEventsAnyTime
Command: /usr/sbin/rsct/bin/logevent /var/log/csm/systemEvents
This response logs events to the /var/log/csm/systemEvents file.
UpdatenodeFailedStatusResponse
Command: /opt/csm/csmbin/updatenodeStatusResponse
The UpdatenodeFailedStatusChange condition generates an event when a node on which updatenode previously did not complete comes back online. In response, UpdatenodeFailedStatusResponse re-runs the updatenode command on those particular nodes.
Of course, the above list of responses is limited to the predefined ones. CSM provides some commands to create, modify, or delete responses. The names of the commands are mostly self-explanatory.
The lsresponse command lists all the defined responses, chresponse adds or changes the actions included in a response, rmresponse deletes a response, and mkresponse creates a new response. Refer to Chapter 6, "Cluster management with CSM" on page 149 for details.
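Because a response is just a script or command, a locally written one can be quite small. The sketch below formats a one-line summary of an event. It assumes that ERRM passes event details to the response command through environment variables named ERRM_TYPE, ERRM_COND_NAME, ERRM_NODE_NAME, and ERRM_VALUE. Treat those names as assumptions to verify against the RSCT documentation; the function name format_event is hypothetical.

```shell
#!/bin/sh
# Hypothetical custom response script: write a one-line summary of an event.
# Assumption: ERRM exports ERRM_TYPE, ERRM_COND_NAME, ERRM_NODE_NAME and
# ERRM_VALUE to the response command; defaults cover variables that are unset.
format_event() {
    printf '%s: %s on %s (value=%s)\n' \
        "${ERRM_TYPE:-Event}" \
        "${ERRM_COND_NAME:-unknown-condition}" \
        "${ERRM_NODE_NAME:-unknown-node}" \
        "${ERRM_VALUE:-n/a}"
}

format_event
```

Run outside ERRM, the script falls back to the defaults, which makes it easy to try before attaching it to a response with mkresponse.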
As already discussed, CSM provides predefined conditions and responses to help the cluster administrator manage the cluster, but most of these conditions and responses are not linked together by default.
To be effective, conditions and the responses need to be linked together and a monitor started. This section provides an example of how to form this association.
To see what condition/response associations already exist, type the lscondresp command. Example 3-8 shows what the output of the lscondresp command should look like immediately after installing CSM.
Example 3-8: List of predefined associations
[root@masternode root]# lscondresp
Displaying condition with response information:
Condition                       Response                          Node          State
"UpdatenodeFailedStatusChange"  "UpdatenodeFailedStatusResponse"  "masternode"  "Active"
"NodeChanged"                   "rconsoleUpdateResponse"          "masternode"  "Active"
[root@masternode root]#
Note | The command output shown in Example 3-8 has been edited to fit in the allotted space. The Node column originally shows the fully qualified name of your management node, in our case, masternode.cluster.com. |
The startcondresp command is used to create condition/response relationships and begin actively monitoring them.
The lsaudrec command may be used to review ERRM audit logs. Use rmaudrec to remove records from the audit logs.
The stopcondresp and rmcondresp commands are used to temporarily stop monitoring and permanently remove condition/response associations, respectively.
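The life cycle of an association can be sketched as the following sequence. It assumes that each command takes the condition name followed by the response name, and it pairs two predefined names purely as an example. The run helper is hypothetical: by default it only prints the commands, so the sequence can be reviewed on a machine without CSM installed.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set,
# so the sequence can be reviewed on a machine without CSM installed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+'
        printf ' %s' "$@"
        printf '\n'
    else
        "$@"
    fi
}

cond="AnyNodeFileSystemSpaceUsed"
resp="BroadcastEventsAnyTime"

run startcondresp "$cond" "$resp"   # link the pair and start monitoring
run lscondresp                      # confirm the association is listed
run stopcondresp "$cond" "$resp"    # stop monitoring, keep the link
run rmcondresp "$cond" "$resp"      # remove the link entirely
```

Setting DRY_RUN=0 on the management server would execute the commands for real, which is best done only after checking the printed sequence.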
Important: | By default, some conditions and responses are created but are not associated. The administrator would have to link conditions with responses to begin using this facility to manage the cluster. |
For more details on managing condition/response associations, including command syntax and examples, see 6.7.1, "RMC components" on page 174.
The predefined conditions and responses may not always meet the needs of the cluster administrator. Therefore, the administrator can create new conditions and new responses to meet local requirements.
The mkcondition and mkresponse commands may be used to implement locally required event tracking. Refer to 6.7.4, "Creating your own conditions and responses" on page 179 for an example of creating and using a custom condition and response.
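As a rough idea of what such definitions involve, the sketch below prints a mkcondition and a mkresponse invocation without running them. The flags shown (-r, -e, -E, and -s for mkcondition; -n and -s for mkresponse), the PercentTotUsed attribute, and every name and path are assumptions for illustration. Check them against the man pages on your management server before use. The show helper is hypothetical and only makes the argument quoting visible.

```shell
#!/bin/sh
# Print a command with each argument in brackets so the quoting is visible.
# Nothing is executed here.
show() {
    printf '%s' "$1"; shift
    printf ' [%s]' "$@"
    printf '\n'
}

# Hypothetical condition: /home more than 90% full, rearm below 75%.
show mkcondition -r IBM.FileSystem \
    -e "PercentTotUsed > 90" -E "PercentTotUsed < 75" \
    -s 'Name == "/home"' "HomeSpaceUsed"

# Hypothetical response built on the logevent script used by LogCSMEventsAnyTime.
show mkresponse -n "LogHomeEvents" \
    -s "/usr/sbin/rsct/bin/logevent /tmp/homeEvents.log" "LogHomeSpaceEvents"
```

Once both objects exist, they would still need to be linked and started with startcondresp before any event is acted upon.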
If the predefined conditions and responses are not a good match for your environment, it is possible to remove or recreate them by using the predefined-condresp command.