3.3 CSM monitoring


One of the key capabilities of CSM is to monitor the cluster resources and react to events, so it is important to understand how the monitoring component works.

This section does not presume to replace the Monitoring HowTo documentation, but rather is intended to provide the administrator with a basic understanding of the CSM monitoring architecture and components. Chapter 6, "Cluster management with CSM" on page 149 provides a great deal of information on how to manage a Linux cluster using CSM.

The following are definitions of the terms condition and response in the context of CSM.

Condition

A measurement or status that a subsystem must reach to generate an alert, for example, more than 90 percent of CPU workload.

Response

An action that is executed in response to a condition that has been encountered.

3.3.1 How CSM monitors a system

CSM uses Resource Managers to monitor a cluster. These Resource Managers use resource classes to update the attributes of the object representing the resources they are monitoring. A CSM daemon then monitors these attributes to identify a condition and to initiate a response.

A Resource Manager is a daemon that runs on the management server and on the managed nodes.

A resource class is a group of functions and values used by the Resource Manager.

The Resource Managers run as subsystems. Their names and current status can be obtained by using the lssrc command, as in Example 3-2.

Example 3-2: lssrc -a command result

start example
[root@masternode root]# lssrc -a
Subsystem         Group             PID     Status
 ctrmc            rsct              776     active
 IBM.ERRM         rsct_rm           798     active
 IBM.DMSRM        rsct_rm           814     active
 IBM.AuditRM      rsct_rm           850     active
 ctcas            rsct              865     active
 IBM.HostRM       rsct_rm           970     active
 IBM.SensorRM     rsct_rm           1003    active
 IBM.FSRM         rsct_rm           10972   active
 IBM.HWCTRLRM     rsct_rm           14755   active
 IBM.ConfigRM     rsct_rm           20323   active
 IBM.CSMAgentRM   rsct_rm                   inoperative
[root@masternode root]#
end example
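When this status listing has to be checked from a script, the text can be split into fields. The following is a minimal sketch in Python, assuming the output has the four-column shape shown in Example 3-2 and that a stopped subsystem simply leaves the PID column empty. The function names parse_lssrc and inoperative are hypothetical helpers, not part of CSM or RSCT.

```python
# Illustrative only: parses text shaped like the `lssrc -a` output in Example 3-2.
SAMPLE = """\
Subsystem         Group             PID     Status
 ctrmc            rsct              776     active
 IBM.ERRM         rsct_rm           798     active
 IBM.HostRM       rsct_rm           970     active
 IBM.CSMAgentRM   rsct_rm                   inoperative
"""


def parse_lssrc(text):
    """Return one dict per subsystem: subsystem, group, pid (or None), status."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Subsystem":
            continue  # skip blank lines and the header row
        if len(fields) == 4:  # running subsystem: PID column is filled in
            name, group, pid, status = fields
            rows.append({"subsystem": name, "group": group,
                         "pid": int(pid), "status": status})
        elif len(fields) == 3:  # stopped subsystem: PID column is empty
            name, group, status = fields
            rows.append({"subsystem": name, "group": group,
                         "pid": None, "status": status})
    return rows


def inoperative(rows):
    """Names of subsystems whose status is anything other than 'active'."""
    return [r["subsystem"] for r in rows if r["status"] != "active"]


rows = parse_lssrc(SAMPLE)
print(inoperative(rows))
```

On a management server, the sample text would be replaced by the captured output of the real command, for example through subprocess.run(["lssrc", "-a"], capture_output=True, text=True).stdout.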

The Resource Manager works with resource classes. These classes contain all the variables and functions that the Resource Manager can use or update.

To get the list of all the available resource classes, use the lsrsrc command, as in Example 3-3.

Example 3-3: lsrsrc command result

start example
[root@masternode root]# lsrsrc
class_name
"IBM.Association"
"IBM.AuditLog"
"IBM.AuditLogTemplate"
"IBM.Condition"
"IBM.EthernetDevice"
"IBM.EventResponse"
"IBM.Host"
"IBM.FileSystem"
"IBM.Program"
"IBM.TokenRingDevice"
"IBM.Sensor"
"IBM.ManagedNode"
"IBM.NodeGroup"
"IBM.ManagementServer"
"IBM.NodeAuthenticate"
"IBM.NetworkInterface"
"IBM.DmsCtrl"
"IBM.NodeHwCtrl"
"IBM.HwCtrlPoint"
"IBM.HostPublic"
[root@masternode root]#
end example

Resource classes contain a great deal of information. Some of this information is static (or persistent), like node hostname, and some is dynamic. Both static and dynamic values can be used to define conditions, but usually the dynamic values are used to set conditions. A static value in theory would not change, so a condition based on a static value is usually less interesting.

The static values of a class can be found with the lsrsrc -a p <class_name> command, where <class_name> is the name of the class, for example "IBM.Host", and p specifies persistent values. Example 3-4 shows the static values from the IBM.Host class.

Example 3-4: lsrsrc -a p "IBM.Host" result

start example
[root@masternode root]# lsrsrc -a p "IBM.Host"
Resource Persistent Attributes for: IBM.Host
resource 1:
        Name                = "masternode.cluster.com"
        NodeNameList        = {"masternode.cluster.com"}
        NumProcessors       = 2
        RealMemSize         = 1055698944
        OSName              = "Linux"
        KernelVersion       = "2.4.18-10smp"
        DistributionName    = "Red Hat"
        DistributionVersion = "7.3"
        Architecture        = "i686"
[root@masternode root]#
end example

It is also possible to list the dynamic values of a class by typing the lsrsrc -a d <class_name> command, where <class_name> is the name of the class, for example "IBM.Host", and d specifies dynamic values. Example 3-5 shows the dynamic values from the IBM.Host class.

Example 3-5: lsrsrc -a d "IBM.Host" output

start example
[root@masternode root]# lsrsrc -a d "IBM.Host"
Resource Dynamic Attributes for: IBM.Host
resource 1:
        ActiveMgtScopes    = 5
        UpTime             = 1635130
        NumUsers           = 9
        LoadAverage        = {0,0,0}
        VMPgSpOutRate      = 0
        VMPgSpInRate       = 0
        VMPgOutRate        = 35
        VMPgInRate         = 0
        PctRealMemFree     = 1
        PctTotalTimeKernel = 0.228220599793863
        PctTotalTimeUser   = 1.4116708831583
        PctTotalTimeWait   = 0.000412104445455916
        PctTotalTimeIdle   = 98.3596964126024
        PctTotalPgSpFree   = 99.626756733549
        PctTotalPgSpUsed   = 0.37324326645104
        TotalPgSpFree      = 522099
        TotalPgSpSize      = 524055
[root@masternode root]#
end example

It is also possible to get both static and dynamic values by typing the lsrsrc -a b <class_name> command, where <class_name> is the name of the class and b specifies both persistent and dynamic values.
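The attribute listings printed by lsrsrc are simple "name = value" lines, which makes them easy to read from a script. The following Python sketch assumes a listing for a single resource, shaped like Examples 3-4 and 3-5. The helper name parse_attributes is hypothetical, and values are kept as raw strings because their types vary (numbers, quoted strings, and brace-delimited lists).

```python
# Illustrative only: a shortened copy of the dynamic attributes in Example 3-5.
SAMPLE = """\
Resource Dynamic Attributes for: IBM.Host
resource 1:
        ActiveMgtScopes    = 5
        NumUsers           = 9
        LoadAverage        = {0,0,0}
        PctTotalTimeIdle   = 98.3596964126024
"""


def parse_attributes(text):
    """Map attribute name -> raw string value for each `name = value` line.

    Assumes one resource per listing; with several resources, later values
    would overwrite earlier ones.
    """
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # lines without '=' are headings, not attributes
            attrs[name.strip()] = value.strip()
    return attrs


attrs = parse_attributes(SAMPLE)
idle = float(attrs["PctTotalTimeIdle"])
print(f"idle: {idle:.1f}%  users: {attrs['NumUsers']}")
```

Keeping the values as strings leaves the conversion decision to the caller, who knows which attributes are numeric.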

3.3.2 Resource Managers

This section describes the CSM Resource Managers and explains the classes they use to manage the nodes. CSM resource managers are listed in Example 3-2 on page 51. These include the following:

  • IBM.AuditRM (audit log)

  • IBM.DMSRM (distributed management server)

  • IBM.ERRM (event responses)

  • IBM.FSRM (file system)

  • IBM.HostRM (host)

  • IBM.HWCTRLRM (hardware)

  • IBM.SensorRM (sensor)

  • IBM.CSMAgentRM (fundamentals)

  • IBM.ConfigRM (network)

IBM.AuditRM (audit log)

This Resource Manager provides a system-wide facility for recording information about the system's operations, which is particularly useful for tracking subsystems running in the background.

This Resource Manager has two resource classes: the IBM.AuditLog and the IBM.AuditLogTemplate.

These classes allow subsystems to manage logs (add, delete, and count records in log files).

There are no predefined conditions for these classes.

IBM.DMSRM (distributed management server)

This Resource Manager manages a set of nodes that are part of a system management cluster. This includes monitoring the status of the nodes and adding, removing, and changing cluster nodes' attributes.

This Resource Manager runs only on the management server.

This Resource Manager has four resource classes:

  • IBM.ManagedNode

  • IBM.NodeGroup

  • IBM.NodeAuthenticate

  • IBM.DmsCtrl

The IBM.ManagedNode class provides information on a node and its status. The IBM.ManagedNode class has four predefined conditions, which are NodeReachability, NodeChanged, UpdatenodeFailedStatusChange, and NodePowerStatus. A list of all predefined conditions, along with their definitions, may be found in 3.3.3, "Predefined conditions" on page 57.

The IBM.NodeGroup class provides information about created groups. This class also contains the NodeGroupMembershipChanged predefined condition.

The IBM.NodeAuthenticate class holds the private and public key information used to authenticate each CSM transaction between the management node and the managed nodes of a cluster. There are no predefined conditions for this class.

The IBM.DmsCtrl class provides CSM with cluster-wide settings: whether CSM should accept a management request from an unrecognized node, the maximum number of nodes allowed to be managed, the remote shell to be used by CSM applications, the type, model, and serial number assigned to the cluster, and whether CSM should attempt to automatically set up remote shell access to the nodes in the cluster.

IBM.ERRM (event response)

This Resource Manager provides the ability to take actions in response to conditions occurring on the system. When an event occurs, ERRM runs user-configured or predefined scripts or commands.

The Event Response Manager uses three classes:

  • IBM.Condition

  • IBM.EventResponse

  • IBM.Association

The IBM.Condition class contains all of the defined conditions for the cluster.

The IBM.EventResponse class contains all the responses that can be applied to an event.

The IBM.Association class contains the associations between conditions and responses.
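The relationship between these three classes can be pictured as three small tables, with the association table deciding which response commands run for a given event. The following Python sketch is only a model of that idea, not of ERRM's actual implementation. The dataclass fields and the responses_for helper are hypothetical; the condition names, response names, and command paths are taken from the examples in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:          # stands in for an IBM.Condition resource
    name: str
    resource_class: str


@dataclass(frozen=True)
class EventResponse:      # stands in for an IBM.EventResponse resource
    name: str
    command: str


@dataclass(frozen=True)
class Association:        # stands in for an IBM.Association resource
    condition: str
    response: str
    active: bool


def responses_for(condition_name, associations, responses):
    """Commands to run when the named condition generates an event."""
    by_name = {r.name: r for r in responses}
    return [by_name[a.response].command
            for a in associations
            if a.condition == condition_name and a.active]


conditions = [Condition("NodeChanged", "IBM.ManagedNode"),
              Condition("NodeReachability", "IBM.ManagedNode")]
responses = [
    EventResponse("rconsoleUpdateResponse",
                  "/opt/csm/csmbin/rconsoleUpdate_response"),
    EventResponse("LogCSMEventsAnyTime",
                  "/usr/sbin/rsct/bin/logevent /var/log/csm/systemEvents"),
]
associations = [Association("NodeChanged", "rconsoleUpdateResponse", True)]

print(responses_for("NodeChanged", associations, responses))
print(responses_for("NodeReachability", associations, responses))
```

In this model, NodeReachability produces no command because nothing associates it with a response, which mirrors the fact that a condition without an active association triggers nothing.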

IBM.FSRM (file system)

This Resource Manager is used to monitor everything associated with file systems. It includes a list of all file systems, their status, and attributes such as the amount of space or i-nodes used, and so on.

The FSRM uses only one class, the IBM.FileSystem class. This class provides the Resource Manager all of the functions and data it needs to monitor a file system.

The IBM.FileSystem class has four predefined conditions, which are AnyNodeFileSystemInodesUsed, AnyNodeFileSystemSpaceUsed, AnyNodeTmpSpaceUsed, and AnyNodeVarSpaceUsed.

IBM.HostRM (host)

This Resource Manager monitors resources related to an individual machine. The types of values that are provided relate to load (processes, paging space, and memory usage) and status of the operating system. It also monitors program activity from initiation until termination.

The IBM.HostRM Resource Manager uses five resource classes:

IBM.Host

This class gives the Resource Manager the ability to monitor the paging space and total processor utilization. AnyNodePagingPercentSpaceFree and AnyNodeProcessorsIdleTime are the two predefined conditions for this class.

IBM.Program

This class is used to monitor the set of processes that are running.

IBM.EthernetDevice

This class is used to monitor Ethernet network interfaces, and provides interface statistics.

IBM.TokenRingDevice

This class is used to monitor token ring network interfaces, and provides interface statistics.

IBM.HostPublic

This class contains a public key used for transaction authentication.

IBM.HWCTRLRM (hardware)

This Resource Manager is used to monitor node hardware. There are two resource classes associated with this resource manager.

The IBM.NodeHwCtrl class provides support for powering a node on and off, resetting a node, querying the power status of a node, resetting a node's service processor, and resetting a node's hardware control point. It provides CSM with node control information, such as the node hardware type, hardware model number, hardware serial number, host name of the network adapter for the console server, console method used to open node console, and MAC address of the network adapter used to perform node installations.

The IBM.HwCtrlPoint class provides support for defining a node's hardware control point. It contains information such as the power method used for a particular node, the time interval between power status queries, and the symbolic names of the nodes where the operational interface for hardware control is available.

IBM.SensorRM (sensor)

This Resource Manager provides a means to create a single user-defined attribute to be monitored by the RMC subsystem.

This Resource Manager uses only one resource class: IBM.Sensor. This resource class enables you to create your own monitors. For example, a script can be written to return the number of users logged on to the system, then an ERRM condition and a response can be defined to run an action when the number of users logged on exceeds a certain threshold.

By default, the IBM.Sensor class has one predefined condition: CFMRootModTimeChanged. This condition is used to generate an event each time the /cfmroot directory is changed. Sensors are created using the mksensor command. The mksensor command adds an event sensor command to the Resource Monitoring and Control (RMC) subsystem.
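The logged-on-users example mentioned above could be backed by a very small script. The sketch below, in Python, counts sessions in who-style output and formats the result for a sensor. The Int32=<value> output convention is an assumption about how the sensor resource manager reads a sensor command's standard output; check the mksensor documentation for your RSCT level before relying on it. The function names and the sample session lines are hypothetical.

```python
# Hypothetical sample of `who` output: one line per login session.
WHO_SAMPLE = """\
root     pts/0        Jan 10 09:12 (admindesktop)
root     pts/1        Jan 10 09:40 (admindesktop)
"""


def count_logged_in_users(who_output):
    """Count non-empty lines of `who`-style output (one per login session)."""
    return sum(1 for line in who_output.splitlines() if line.strip())


def sensor_line(value):
    """Format an integer as `Int32=<n>` (assumed sensor output convention)."""
    return f"Int32={int(value)}"


print(sensor_line(count_logged_in_users(WHO_SAMPLE)))
```

A real sensor script would read the live output of who instead of the sample text, and a condition on the IBM.Sensor class would then compare the reported value against the chosen threshold.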

IBM.CSMAgentRM (fundamentals)

This resource manager holds fundamental parameters and definitions used by CSM. It contains only one resource class: IBM.ManagementServer. This resource class provides CSM with information about the management node, such as the host name, NodeID, type, all the host name aliases, and so on.

IBM.ConfigRM (network)

This resource manager provides CSM with networking information. It contains only one resource class: IBM.NetworkInterface. This resource class holds information such as the name of the network interface of the management node, the network device that hosts the network interface, the base IP address, subnet mask, all the other IP addresses that have been assigned to the network interface (as well as the network switch Network ID), and the device-specific switch adapter logical ID.

3.3.3 Predefined conditions

CSM automatically predefines a number of conditions across several resource classes. Use the lscondition command to view the currently defined conditions. It provides a list of all the condition names with the monitoring status for each condition, as shown in Example 3-6.

Example 3-6: Predefined conditions

start example
[root@masternode root]# lscondition
Displaying condition information:
Name                            Node                     MonitorStatus
"NodePowerStatus"               "masternode.cluster.com" "Not monitored"
"NodeChanged"                   "masternode.cluster.com" "Monitored"
"NodeGroupMembershipChanged"    "masternode.cluster.com" "Not monitored"
"AnyNodeTmpSpaceUsed"           "masternode.cluster.com" "Not monitored"
"UpdatenodeFailedStatusChange"  "masternode.cluster.com" "Monitored"
"AnyNodeFileSystemSpaceUsed"    "masternode.cluster.com" "Not monitored"
"AnyNodeProcessorsIdleTime"     "masternode.cluster.com" "Not monitored"
"AnyNodeVarSpaceUsed"           "masternode.cluster.com" "Not monitored"
"AnyNodeFileSystemInodesUsed"   "masternode.cluster.com" "Not monitored"
"CFMRootModTimeChanged"         "masternode.cluster.com" "Not monitored"
"NodeReachability"              "masternode.cluster.com" "Not monitored"
"AnyNodePagingPercentSpaceFree" "masternode.cluster.com" "Not monitored"
[root@masternode root]#
end example

Here we provide a description of each predefined condition:

  • NodePowerStatus

    An event will be generated whenever the power status of the node is no longer 1 (1 means power is on). This will typically happen either when the node is powered off or the power status of the node cannot be determined for some reason. A rearm event will be generated when the node is powered up again.

  • NodeChanged

    An event is generated when a node definition in the ManagedNode resource class changes.

  • NodeGroupMembershipChanged

    An event will be generated whenever a node is added to or deleted from a previously existing NodeGroup.

  • AnyNodeTmpSpaceUsed

    An event is generated when more than 90% of the total space in the /tmp directory is in use. The event is rearmed when the percentage of space used in the /tmp directory falls below 75%.

  • UpdatenodeFailedStatusChange

    An event will be generated when a node on which the updatenode command failed now has online status.

  • AnyNodeFileSystemSpaceUsed

    An event is generated when more than 90% of the total space in the file system is in use. The event is rearmed when the percentage of space used in the file system falls below 75%.

  • AnyNodeProcessorsIdleTime

    An event is generated when the average time all processors are idle is at least 70% of the time. The event is rearmed when the idle time decreases below 10%.

  • AnyNodeVarSpaceUsed

    An event is generated when more than 90% of the total space in the /var directory is in use. The event is rearmed when the percentage of space used in the /var directory falls below 75%.

  • AnyNodeFileSystemInodesUsed

    An event is generated when more than 90% of the total i-nodes in the file system are in use. The event is rearmed when the percentage of i-nodes used in the file system falls below 75%.

  • CFMRootModTimeChanged

    An event is generated when a file under /cfmroot is modified, added, or removed.

  • NodeReachability

    An event is generated when a node in the network cannot be reached from the server. The event is rearmed when the node can be reached again.

  • AnyNodePagingPercentSpaceFree

    An event is generated when more than 90% of the total paging space is in use. The event is rearmed when the percentage falls below 85%.

The RMC process keeps the attribute values of every class up to date, so each defined condition can be evaluated to decide whether it has to generate an alert. Then, in response to this alert, an action can be launched.
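Several of the conditions above pair an event threshold with a lower rearm threshold, so that a value hovering around the limit does not generate a stream of repeated events. The following Python sketch illustrates that behavior using the 90% and 75% figures from the space-used conditions. It is a simplified model, not RMC's implementation, and the class name ThresholdCondition and its method are hypothetical.

```python
class ThresholdCondition:
    """Toy model of a condition with an event threshold and a rearm threshold."""

    def __init__(self, event_above, rearm_below):
        self.event_above = event_above
        self.rearm_below = rearm_below
        self.armed = True  # armed means the next threshold crossing raises an event

    def observe(self, value):
        """Return 'event', 'rearm', or None for one new measurement."""
        if self.armed and value > self.event_above:
            self.armed = False
            return "event"
        if not self.armed and value < self.rearm_below:
            self.armed = True
            return "rearm"
        return None


# Percentage of space used over six successive samples.
cond = ThresholdCondition(event_above=90, rearm_below=75)
results = [cond.observe(v) for v in [50, 92, 95, 80, 70, 91]]
print(results)
```

In this run the samples 95 and 80 produce nothing, because the condition has already fired and the usage has not yet dropped below the rearm threshold.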

3.3.4 Responses

Once an event has been generated, ERRM can generate a response. The list of defined responses can be shown with the lsresponse command. This section describes the predefined responses provided by CSM.

Example 3-7 shows the predefined responses.

Example 3-7: List of predefined responses

start example
[root@masternode root]# lsresponse
Displaying response information:
ResponseName                     Node
"MsgEventsToRootAnyTime"         "masternode.cluster.com"
"LogOnlyToAuditLogAnyTime"       "masternode.cluster.com"
"BroadcastEventsAnyTime"         "masternode.cluster.com"
"rconsoleUpdateResponse"         "masternode.cluster.com"
"DisplayEventsAnyTime"           "masternode.cluster.com"
"CFMNodeGroupResp"               "masternode.cluster.com"
"CFMModResp"                     "masternode.cluster.com"
"LogCSMEventsAnyTime"            "masternode.cluster.com"
"UpdatenodeFailedStatusResponse" "masternode.cluster.com"
[root@masternode root]#
end example

The responses in CSM are scripts or commands. The following list describes what each of the responses does and the associated scripts or commands:

  • MsgEventsToRootAnyTime

    Command: /usr/sbin/rsct/bin/msgevent root

    This response sends a message to a specified user (in this example, to the user root).

  • LogOnlyToAuditLogAnyTime

    This response simply logs an event in the audit log, but does not take any action.

  • BroadcastEventsAnyTime

    Command: /usr/sbin/rsct/bin/wallevent

    This response sends an event or a rearm event to all users who are logged in.

  • rconsoleUpdateResponse

    Command: /opt/csm/csmbin/rconsoleUpdate_response

    This response runs an internal command that is used as part of the automatic rconsole configuration file update facility.

  • DisplayEventsAnyTime

    Command: /usr/sbin/rsct/bin/displayevent admindesktop:0

    This response notifies a user of an event by displaying it on the user's X Window console, here the admindesktop:0 console.

  • CFMNodeGroupResp

    Command: /opt/csm/csmbin/CFMnodegroupresp

    This response is used internally to determine whether changed files in /cfmroot belong to a particular NodeGroup.

  • CFMModResp

    Command: /opt/csm/csmbin/CFMmodresp

    This response is used internally to perform an update to all nodes when the /cfmroot directory changes.

  • LogCSMEventsAnyTime

    Command: /usr/sbin/rsct/bin/logevent /var/log/csm/systemEvents

    This response logs events to the /var/log/csm/systemEvents file.

  • UpdatenodeFailedStatusResponse

    Command: /opt/csm/csmbin/updatenodeStatusResponse

    The UpdatenodeFailedStatusChange condition generates an event when a node that previously did not complete updatenode is back online. In response, UpdatenodeFailedStatusResponse re-runs the updatenode command on those particular nodes.

Of course, the above list of responses is limited to the predefined ones. CSM provides some commands to create, modify, or delete responses. The names of the commands are mostly self-explanatory.

The lsresponse command lists all the defined responses, chresponse adds or changes actions included in a response, rmresponse deletes a response, and mkresponse creates a new response. Refer to Chapter 6, "Cluster management with CSM" on page 149 for details.

3.3.5 Associating conditions and responses

As already discussed, CSM provides predefined conditions and responses to help the cluster administrator manage the cluster, but most conditions and responses are not linked together by default.

To be effective, conditions and the responses need to be linked together and a monitor started. This section provides an example of how to form this association.

To see what condition/response associations already exist, type the lscondresp command. Example 3-8 shows what the output of the lscondresp command should look like immediately after installing CSM.

Example 3-8: List of predefined associations

start example
[root@masternode root]# lscondresp
Displaying condition with response information:
Condition                      Response                         Node            State
"UpdatenodeFailedStatusChange" "UpdatenodeFailedStatusResponse" "masternode"    "Active"
"NodeChanged"                  "rconsoleUpdateResponse"         "masternode"    "Active"
[root@masternode root]#
end example

Note 

The command output shown in Example 3-8 has been edited to fit in the allotted space. The Node column originally shows the fully qualified name of your management node, in our case, masternode.cluster.com.

The startcondresp command is used to create condition/response relationships and begin actively monitoring them.

The lsaudrec command may be used to review ERRM audit logs. Use rmaudrec to remove records from the audit logs.

The stopcondresp and rmcondresp commands are used to temporarily stop monitoring and permanently remove condition/response associations, respectively.
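The life cycle of an association can be summarized as a small state table: starting creates the link and marks it active, stopping keeps the link but marks it inactive, and removing deletes it. The following Python sketch models only that bookkeeping. The class AssociationTable is hypothetical, its methods merely borrow the command names for readability, and the "Not active" state label and the error handling are assumptions rather than documented behavior.

```python
class AssociationTable:
    """Toy model of condition/response links and their monitoring state."""

    def __init__(self):
        self._state = {}  # (condition, response) -> "Active" or "Not active"

    def startcondresp(self, condition, response):
        # Create the link if it does not exist yet and start monitoring it.
        self._state[(condition, response)] = "Active"

    def stopcondresp(self, condition, response):
        # Keep the link but stop monitoring; an unknown link is an error here.
        if (condition, response) not in self._state:
            raise KeyError("no such association")
        self._state[(condition, response)] = "Not active"

    def rmcondresp(self, condition, response):
        # Remove the link entirely.
        del self._state[(condition, response)]

    def lscondresp(self):
        # Sorted list of (condition, response, state) tuples.
        return sorted((c, r, s) for (c, r), s in self._state.items())


table = AssociationTable()
table.startcondresp("AnyNodeTmpSpaceUsed", "LogCSMEventsAnyTime")
started = table.lscondresp()
table.stopcondresp("AnyNodeTmpSpaceUsed", "LogCSMEventsAnyTime")
stopped = table.lscondresp()
table.rmcondresp("AnyNodeTmpSpaceUsed", "LogCSMEventsAnyTime")
removed = table.lscondresp()
print(started, stopped, removed, sep="\n")
```

The point of the model is the distinction between stopping and removing: after a stop the pair is still listed and can be started again, whereas after a removal it has to be created anew.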

Important: 

By default, some conditions and responses are created but are not associated. The administrator would have to link conditions with responses to begin using this facility to manage the cluster.

For more details on managing condition/response associations, including command syntax and examples, see 6.7.1, "RMC components" on page 174.

3.3.6 Creating new conditions and responses

The predefined conditions and responses may not always meet the needs of the cluster administrator. Therefore, the administrator can create new conditions and new responses to meet his/her requirements.

The mkcondition and mkresponse commands may be used to implement locally required event tracking. Refer to 6.7.4, "Creating your own conditions and responses" on page 179 for an example of creating and using a custom condition and response.

If the predefined conditions and responses are not a good match for your environment, it is possible to remove or recreate them by using the predefined-condresp command.






Linux Clustering with CSM and GPFS
ISBN: 073849870X
Year: 2003
Pages: 123
Authors: IBM Redbooks
