Reliable Scalable Cluster Technology (RSCT)

The RSCT component provides a set of services designed to address issues related to the high-availability of a system. It includes two subsystems, as shown in Figure A-1 on page 254:

  • Topology Services

  • Group Services

Both of these distributed subsystems operate within a domain. A domain is a set of machines on which the RSCT components execute and to which, exclusively of other machines, they provide their services.

The CSM product uses only a relatively small subset of the RSCT components, primarily a set of commands. Therefore, some parts of this chapter relate only to the GPFS product and not to CSM.

Note 

Because the SRC and RSCT components are required by (and shipped with) both CSM and GPFS, it is important to use the same version of SRC and RSCT during the installation. Refer to Chapter 5, "Cluster installation and configuration with CSM" on page 99 and Chapter 7, "GPFS installation and configuration" on page 185 for details.
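
One way to confirm that matching levels are installed on every node is to query the RPM database. This is only a quick sketch; the grep pattern assumes the SRC and RSCT packages are named with src and rsct prefixes, as the RPMs shipped with these products are:

 rpm -qa | grep -E '^(src|rsct)'     # list installed SRC and RSCT packages with their versions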

Topology Services subsystem

Topology Services provides high-availability subsystems with network adapter status, node connectivity information, and a reliable messaging service. The adapter status and node connectivity information is provided to the Group Services subsystem upon request. Group Services makes the information available to its client subsystems. The reliable messaging service, which takes advantage of node connectivity information to reliably deliver a message to a destination node, is available to the other high-availability subsystems.

The adapter status and node connectivity information is discovered by an instance of the subsystem on one node, participating in concert with instances of the subsystem on other nodes, to form a ring of cooperating subsystem instances. This ring is known as a heartbeat ring, because each node sends a heartbeat message to one of its neighbors and expects to receive a heartbeat from its other neighbor.

Actually, each subsystem instance can form multiple rings, one for each network it is monitoring. This system of heartbeat messages enables each member to monitor one of its neighbors and to report to the heartbeat ring leader, called the Group Leader, if it stops responding. The Group Leader, in turn, forms a new heartbeat ring based on such reports and requests for new adapters to join the membership. Every time a new group is formed, it lists which adapters are present and which adapters are absent, making up the adapter status notification that is sent to Group Services.

The Topology Services subsystem consists of the following elements:

  • Topology Services daemon (hatsd)

  • Pluggable Network Interface Methods (NIMs)

  • Port numbers and sockets

  • Topology Services commands

  • Files and directories

Topology Services daemon (hatsd)

The daemon is the central component of the Topology Services subsystem and is located in the /usr/sbin/rsct/bin/ directory. This daemon runs on each node in the GPFS cluster.
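
A simple sanity check on any node is to confirm that the binary is present and the daemon is running, using the path and daemon name given above:

 ls -l /usr/sbin/rsct/bin/hatsd      # the Topology Services daemon binary
 ps -ef | grep '[h]atsd'             # show the running hatsd process, if any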

When each daemon starts, it first reads its configuration from a file set up by the startup command (cthats). This file is called the machines list file, because it lists all the nodes that are part of the configuration and the IP addresses of each adapter on each of those nodes.

From this file, the daemon knows the IP address and node number of all the potential heartbeat ring members.
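
The machines list file lives under the RSCT runtime directory, whose exact path includes the cluster name; a search avoids hard-coding it (the /var/ct location is an assumption based on where the RSCT subsystems keep their runtime files):

 find /var/ct -name machines.lst 2>/dev/null     # locate the machines list file on this node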

The Topology Services subsystem directive is to form as large a heartbeat ring as possible. To form this ring, the daemon on one node must alert those on the other nodes of its presence by sending a proclaim message. According to a hierarchy defined by the Topology Services component, a daemon can send a proclaim message only to IP addresses that are lower than its own and can accept a proclaim message only from an IP address that is higher than its own. Also, a daemon only proclaims if it is the leader of a ring.

When a daemon first starts up, it builds a heartbeat ring for every local adapter, containing only that local adapter. This is called a singleton group and this daemon is the Group Leader in each one of these singleton groups.

To manage the changes in these groups, Topology Services defines the following roles for each group:

  • Group Leader

    The daemon on the node with the local adapter that has the highest IP address in the group. The Group Leader proclaims, handles requests to join, handles death notifications, coordinates group membership changes, and sends connectivity information.

  • Crown Prince

    The daemon on the node with the local adapter that has the second highest IP address in the group. This daemon can detect the death of the Group Leader and has the authority to become the Group Leader of the group in that case.

  • Mayor

    A daemon on a node with a local adapter in this group that has been picked by the Group Leader to broadcast a message to all the adapters in the group. A daemon becomes a Mayor when it receives a message to broadcast.

  • Generic

    This is the daemon on any node with a local adapter in the heartbeat ring. The role of the Generic daemon is to monitor the heartbeat of the upstream neighbor and inform the Group Leader if the maximum allowed number of heartbeats has been missed.

Each of these roles is dynamic: every time a new heartbeat ring is formed, the role of each member is re-evaluated and assigned.

In summary, Group Leaders send and receive proclaim messages. If the proclaim is from a Group Leader with a higher IP address, then the Group Leader with the lower address replies with a join request. The higher address Group Leader forms a new group with all members from both groups. All members monitor their upstream neighbor for heartbeats. If a sufficient number of heartbeats are missed, a message is sent to the Group Leader and the unresponsive adapter will be dropped from the group. Whenever there is a membership change, Group Services is notified if it asked to be.

The Group Leader also accumulates node connectivity information, constructs a connectivity graph, and routes connections from its node to every other node in the GPFS cluster. The group connectivity information is sent to all nodes so that they can update their graphs and also compute routes from their node to any other node. It is this traversal of the graph on each node that determines which node membership notification is provided to each node. Nodes to which there is no route are considered unreachable and are marked as down. Whenever the graph changes, routes are recalculated, and a list of nodes that have connectivity is generated and made available to Group Services.

When a network adapter on a node fails or has a problem, the daemon will, for a short time, attempt to form a singleton group, because the adapter is unable to communicate with any other adapter in the network. Topology Services then invokes self-death logic, which runs network diagnosis to determine whether the adapter is still able to receive data packets from the network. The daemon tries to have data packets sent to the adapter; if the adapter cannot receive any network traffic, it is considered to be down, and Group Services is notified that all adapters in the group are down.

After an adapter that was down recovers, the daemon will eventually find that the adapter is working again by using a mechanism similar to the self-death logic, and will form a singleton group. This should allow the adapter to form a larger group with the other adapters in the network. An adapter up notification for the local adapter is sent to the Group Services subsystem.
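
Independently of this built-in diagnosis, an administrator can look at the state of a local adapter directly; a minimal check, assuming the interface name eth0, is:

 ip link show eth0        # link state as seen by the kernel
 ip -s link show eth0     # add RX/TX counters to see whether any traffic is arriving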

Pluggable NIMs

Topology Services' pluggable NIMs are processes started by the Topology Services daemon to monitor each local adapter. The NIM is responsible for:

  • Sending messages to a peer daemon upon request from the local daemon.

  • Receiving messages from a peer daemon and forwarding them to the local daemon.

  • Periodically sending heartbeat messages to a destination adapter.

  • Monitoring heartbeats coming from a specified source and notifying the local daemon if any heartbeats are missing.

  • Informing the local daemon if the local adapter goes up or down.
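
There is one NIM per monitored local adapter, and its process ID is reported in the detailed subsystem status (see Example A-2); a quick way to list the NIM PIDs on a node is:

 lssrc -ls cthats | grep "NIM's PID"      # one line per monitored adapter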

Port numbers and sockets

The Topology Services subsystem uses several types of communications:

  • UDP port numbers for intra-cluster communications

    For communication between Topology Services daemons within the GPFS cluster, the Topology Services subsystem uses a single UDP port number. This port number is provided by GPFS during GPFS cluster creation. The Topology Services port number is stored in the GPFS cluster data so that, when the Topology Services subsystem is configured on each node, the port number is retrieved from the GPFS cluster data. This ensures that the same port number is used by all Topology Services daemons in the GPFS cluster. This intra-cluster port number is also set in the /etc/services file, using the service name cthats. The /etc/services file is automatically updated on all nodes in the GPFS cluster.

  • UNIX domain sockets

    The UNIX domain sockets used for communication are connection-oriented sockets. For communication between Topology Services clients and the local Topology Services daemon, the socket name is server_socket and it is located in the /var/ct/<cluster_name>/soc/cthats directory, where <cluster_name> is the name of the GPFS cluster. For communication between the local Topology Services daemon and the NIMs, the socket name is <NIM_name>.<process_id> and it is located in the same directory, where <NIM_name> is the name of the NIM and <process_id> is its process identifier (PID).
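
To verify the port assignment and the socket files on a node, two read-only checks are enough (the shell glob stands in for the cluster name):

 grep cthats /etc/services        # UDP port shared by all Topology Services daemons
 ls -l /var/ct/*/soc/cthats/      # server_socket and the per-NIM sockets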

Topology Services commands

The Topology Services subsystem includes several commands:

  • The cthatsctrl control command, which is used to add the Topology Services subsystem to the operating software configuration of the GPFS cluster, to remove it, and to start and stop it, as well as to build the configuration file for the subsystem.

  • The cthats startup command, which is used to obtain the necessary configuration information from the GPFS cluster data server and prepare the environment for the Topology Services daemon. Under normal operating conditions, the Topology Services startup command runs without user initiation.

  • The cthatstune tuning command, which is used to change the Topology Services tunable parameters at run time without the need to shut down and restart the GPFS daemon.

For details on the above commands, refer to the IBM General Parallel File System (GPFS) for Linux: RSCT Guide and Reference, SA22-7854.

Configuring and operating Topology Services

The following sections describe how the components of the Topology Services subsystem work together to provide topology services. Included are discussions of Topology Services tasks.

Attention: 

Under normal operating conditions, Topology Services is controlled by GPFS. Manual intervention in Topology Services may cause GPFS to fail. Be very cautious when manually configuring or operating Topology Services.

The Topology Services subsystem is contained in the rsct.basic RPM. After the RPM is installed, you may change the default configuration options using the cthatsctrl command.
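
To confirm that the package is installed on a node, query the RPM database for the package name given above:

 rpm -q rsct.basic        # reports the installed version, or states that the package is not installed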

Initializing the Topology Services daemon

Normally, the Topology Services daemon is started by GPFS. If necessary, you can start the Topology Services daemon using the cthatsctrl command or the startsrc command directly. The first part of initialization is done by the startup command, cthats. It starts the hatsd daemon, which completes the initialization steps.
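
If a manual start is needed, a minimal sequence using the SRC commands mentioned above is shown below; keep in mind the earlier caution that GPFS normally controls this subsystem:

 startsrc -s cthats       # ask the SRC to start the Topology Services subsystem
 lssrc -s cthats          # confirm that the subsystem is active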

During this initialization, the startup command does the following:

  1. Determines the number of the local node.

  2. Obtains the name of the cluster from the GPFS cluster data.

  3. Retrieves the machines.lst file from the GPFS cluster data.

  4. Performs file maintenance in the log directory and current working directory to remove the oldest log and rename any core files that might have been generated.

  5. Starts the Topology Services hatsd daemon.

The daemon then continues the initialization with the following steps:

  1. Reads the current machines list file and initializes internal data structures.

  2. Initializes daemon-to-daemon communication, as well as client communication.

  3. Starts the NIMs.

  4. For each local adapter defined, forms a membership consisting of only the local adapter.

At this stage, the daemon is now in its initialized state and ready to communicate with Topology Services daemons on other nodes. The intent is to expand each singleton membership group formed during initialization to contain as many members as possible. Eventually, as long as all adapters in a particular network can communicate with each other, there will be a single group to which all adapters belong.

Merging all adapters into a single group

Initially, the subsystem starts out as N singleton groups, one for each node. Each daemon is the Group Leader of its own singleton groups and knows, from the configuration information, which other adapters could join the group. The next step is to begin proclaiming to subordinate nodes.

The proclaim logic tries to find members as efficiently as possible. For the first three proclaim cycles, daemons proclaim only to their own subnet, and if the subnet is broadcast-capable, the message is broadcast. Given that all daemons started out as singletons, this evolves into M groups, where M is the number of subnets that span this heartbeat ring. On the fourth proclaim cycle, those M Group Leaders send proclaims to adapters outside their local subnet.

This causes a merging of groups into larger and larger groups until they have coalesced into a single group.

From the time the groups are formed as singletons until they reach a stabilization point, the groups are considered unstable. The stabilization point is reached when a heartbeat ring has no group changes for an interval of 10 times the heartbeat send interval. Up to that point, the proclaim continues on a four-cycle operation, where three cycles proclaim only to the local subnets and one cycle proclaims to adapters outside the local subnet. After the heartbeat ring has reached stability, proclaim messages go out to all adapters not currently in the group, regardless of the subnet to which they belong. Adapter groups that are unstable are not used when computing the node connectivity graph.

Topology Services daemon operations

Normal operation of the Topology Services subsystem does not require administrative intervention.

The subsystem is designed to recover from temporary failures, such as node failures or failures of individual Topology Services daemons. Topology Services also provides indications of higher-level system failures.

However, there are some operational characteristics of interest to system administrators, and after adding or removing nodes or adapters you might need to refresh the subsystem. The maximum node number allowed is 2047 and the maximum number of networks it can monitor is 16.

Topology Services is meant to be sensitive to network response, and this sensitivity is tunable. However, other conditions can degrade the ability of Topology Services to accurately report adapter or node membership. One such condition is failure to schedule the daemon process in a timely manner, which can cause daemons to be significantly late in sending their heartbeats. This can happen because the interrupt rate is too high, the rate of paging activity is too high, or there are other problems. If the daemon is prevented from running for long enough, the node might not be able to send out heartbeat messages and will be considered down by its peer daemons. When GPFS is notified of the node-down indication, it performs recovery procedures that are, in this case, undesirable, taking over the resources and roles of the node. Whenever these conditions exist, analyze the problem carefully to fully understand it.

Because Topology Services must get processor time slices in a timely fashion, do not intentionally starve it of CPU time, because doing so can cause false failure indications.
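
When false node-down indications are suspected, a first check of the scheduling and paging pressure described above can be done with standard tools; for example:

 vmstat 1 5      # watch the si/so (swap in/out) and CPU idle columns for a few seconds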

Attention: 

Topology Services requires network options that enable IP source routing; these are not the secure default values, and setting them may leave the node vulnerable to network attacks. System administrators are advised to use other methods to protect the cluster from such attacks.

Topology Services requires the IP source routing feature to deliver its data packets when the networks are broken into several network partitions. The network options must be set correctly to enable IP source routing. The Topology Services startup command will set the options:

  • IP forward: Enable

     echo 1 > /proc/sys/net/ipv4/ip_forward 

  • Accept source routing: Enable

     echo 1 > /proc/sys/net/ipv4/conf/all/accept_source_route
     echo 1 > /proc/sys/net/ipv4/conf/interface/accept_source_route

  • RP filter: Disable

     echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
     echo 0 > /proc/sys/net/ipv4/conf/interface/rp_filter
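
To verify the values currently in effect without changing anything, the same kernel parameters can be read back with sysctl (the names are the standard Linux equivalents of the /proc paths above; replace eth0 with the relevant interface):

 sysctl net.ipv4.ip_forward
 sysctl net.ipv4.conf.all.accept_source_route net.ipv4.conf.eth0.accept_source_route
 sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter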

Tuning the Topology Services subsystem

The default settings for the frequency and sensitivity attributes discussed in "Configuring and operating Topology Services" on page 261 are overly aggressive for GPFS clusters that have more than 128 nodes or that run under heavy load. Using the default settings in such environments can result in false failure indications. Decide which settings are suitable for your system by considering the following:

  • Higher values for the frequency attribute result in lower CPU and network utilization by the Topology Services daemon. Higher values for the product of frequency and sensitivity make Topology Services less sensitive to factors that cause the daemon to be blocked or messages to not reach their destinations, but they also cause Topology Services to take longer to detect a failed adapter or node.

  • If the nodes are used primarily for parallel scientific jobs, we recommend the settings shown in Table A-1.

    Table A-1: Parallel scientific environment

    Frequency   Sensitivity   Seconds to detect node failure
    2           6             24
    3           5             30
    3           10            60
    4           9             72

  • If the nodes are used in a mixed environment or for database workloads, we recommend the settings shown in Table A-2 on page 264.

    Table A-2: Mixed environment or database workload

    Frequency   Sensitivity   Seconds to detect node failure
    2           6             24
    3           5             30
    2           10            40

  • If the nodes tend to operate in a heavy paging or I/O intensive environment, as is often the case when running the GPFS software, we recommend the settings shown in Table A-3.

    Table A-3: Heavy paging or I/O environment

    Frequency   Sensitivity   Seconds to detect node failure
    1           12            24
    1           15            30

By default, Topology Services uses the settings shown in Table A-4.

Table A-4: Topology Services defaults

Frequency   Sensitivity   Seconds to detect node failure
1           4             8
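
The values in these tables are consistent with a node failure being declared after roughly 2 x Frequency x Sensitivity seconds; this relationship is inferred from the table values rather than stated explicitly here, but it is a useful rule of thumb when choosing settings. For example, the first row of Table A-1:

 echo $(( 2 * 2 * 6 ))     # frequency=2, sensitivity=6 -> 24 seconds to detect a node failure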

You can adjust the tunable attributes by using the cthatstune command. For example, to change the frequency attribute to the value 2 on network gpfs and then refresh the Topology Services subsystem, use the command:

 cthatstune -f gpfs:2 -r 
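
After the refresh, the per-network values actually in effect can be read back from the detailed subsystem status (the HB Interval and Sensitivity fields shown in Example A-2):

 lssrc -ls cthats | grep 'HB Interval'      # interval and sensitivity for each monitored network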

Refreshing the Topology Services daemon

When your system configuration is changed (such as by adding or removing nodes or adapters), the Topology Services subsystem needs to be refreshed before it can recognize the new configuration.

To refresh the Topology Services subsystem, run either the cthatsctrl command or the cthatstune command, both with the -r option, on any node in the GPFS cluster.
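
For example, either of the following refreshes the subsystem after nodes or adapters have been added or removed:

 cthatsctrl -r
 cthatstune -r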

Note that if there are nodes in the GPFS cluster that are unreachable with Topology Services active, they will not be refreshed. Also, if the connectivity problem is resolved such that Topology Services on that node is not restarted, the node refreshes itself to remove the old configuration. Otherwise, it will not acknowledge nodes or adapters that are part of the configuration.

Topology Services procedures

Normally, the Topology Services subsystem runs itself without requiring administrator intervention. On occasion, you might need to check the status of the subsystem. You can display the operational status of the Topology Services daemon by issuing the lssrc command.

Topology Services monitors the Ethernet, the Myrinet switch, and the Token-Ring networks. To see the status of the networks, run the following command on a node that is up:

 lssrc -ls cthats 

In response, the lssrc command writes the status information to the standard output. The information includes:

  • The information provided by the lssrc -s cthats command (short form).

  • Six lines for each network for which this node has an adapter, including the following information:

    • The network name.

    • The network index.

    • The number of defined members: The number of adapters that the configuration reported existing for this network.

    • The number of members: Number of adapters currently in the membership group.

    • The state of the membership group, denoted by S (Stable), U (Unstable), or D (Disabled).

    • Adapter ID: The address and instance number for the local adapter in this membership group.

    • Group ID: The address and instance number of the membership group. The address of the membership group is also the address of the group leader.

    • Adapter interface name.

    • HB Interval, which corresponds to the Frequency attribute in the GPFS cluster data. This exists on a per-network basis and its value may differ from the overall default.

    • HB Sensitivity, which corresponds to the Sensitivity attribute in the GPFS cluster data. This exists on a per-network basis and its value may differ from the overall default.

    • Two lines of the network adapter statistics.

    • The PID of the NIMs.

  • The number of clients connected and the client process IDs and command names.

  • Configuration instance: The instance number of the machines list file.

  • Whether the daemon is working in a secure environment. If it is, the version number of the key used for mutual authentication is also included.

  • The segments pinned: NONE, a combination of TEXT, DATA, and STACK, or PROC.

  • The size of text, static data, and dynamic data segments. Also, the number of outstanding memory allocations without a corresponding free memory operation.

  • Whether the daemon is processing a refresh request.

  • Daemon process CPU time, both in user and kernel modes.

  • The number of page faults and the number of times the process has been swapped out.

  • The number of nodes that are seen as reachable (up) from the local node and the number of nodes that are seen as not reachable (down).

  • A list of nodes that are either up or down, whichever list is smaller. The list of nodes that are down includes only the nodes that are configured and have at least one adapter that Topology Services monitors. Nodes are specified in the list using the format:

     N1-N2(I1) N3-N4(I2)... 

    where N1 is the initial node in a range, N2 is the final node in a range, and I1 is the increment. For example, 5-9(2) specifies nodes 5, 7, and 9. If the increment is 1, then the increment is omitted. If the range has only one node, only the one node number is specified.

Example A-2 shows the output from the lssrc -ls cthats command on a node.

Example A-2: lssrc -ls cthats output

start example
 [root@node001 root]# lssrc -ls cthats
 Subsystem         Group            PID     Status
  cthats           cthats           1632    active
 Network Name   Indx Defd Mbrs St Adapter ID      Group ID
 gpfs           [ 0]    5    5  S 10.2.1.1        10.2.1.141
 gpfs           [ 0]      myri0    0x85b72fa4      0x85b82c56
 HB Interval = 1 secs. Sensitivity = 15 missed beats
 Missed HBs: Total: 0 Current group: 0
 Packets sent    : 194904 ICMP 0 Errors: 0 No mbuf: 0
 Packets received: 211334 ICMP 0 Dropped: 0
 NIM's PID: 1731
 gpfs2          [ 1]    5    5  S 10.0.3.1        10.0.3.141
 gpfs2          [ 1]      eth0     0x85b72fa5      0x85b82c5a
 HB Interval = 1 secs. Sensitivity = 15 missed beats
 Missed HBs: Total: 0 Current group: 0
 Packets sent    : 194903 ICMP 0 Errors: 0 No mbuf: 0
 Packets received: 211337 ICMP 0 Dropped: 0
 NIM's PID: 1734
   1 locally connected Client with PID: hagsd(1749)
   Configuration Instance = 1035415331
   Default: HB Interval = 1 secs. Sensitivity = 8 missed beats
   Daemon employs no security
   Segments pinned: Text Data Stack.
   Text segment size: 593 KB. Static data segment size: 628 KB.
   Dynamic data segment size: 937. Number of outstanding malloc: 383
   User time 7 sec. System time 2 sec.
   Number of page faults: 844. Process swapped out 0 times.
   Number of nodes up: 5. Number of nodes down: 0.
 [root@node001 root]#
end example

Group Services (GS) subsystem

The function of the Group Services subsystem is to provide other subsystems with a distributed coordination and synchronization service. Other subsystems that utilize or depend upon Group Services are called client subsystems. Each client subsystem forms one or more groups by having its processes connect to the Group Services subsystem and use the various Group Services interfaces. A process of a client subsystem is called a GS client.

A group consists of two pieces of information:

  • The list of processes that have joined the group, called the group membership list

  • A client-specified group state value

Group Services guarantees that all processes that are members of a group see the same values for the group information, and that they see all changes to the group information in the correct order. In addition, the processes may initiate changes to the group information via certain protocols that are controlled by Group Services.

A GS client that has joined a group is called a provider. A GS client that wants only to monitor a group without being able to initiate changes in the group is called a subscriber.

Once a GS client has initialized its connection to Group Services, it can join a group and become a provider. All other GS clients that have already joined the group (those that have already become providers) are notified as part of the join protocol about the new providers that want to join. The existing providers can either accept new joiners unconditionally (by establishing a one-phase join protocol) or vote on the protocol (by establishing an n-phase protocol). During the vote, they can choose to approve the request and accept the new provider into the group, or reject the request and refuse to allow the new provider to join.

Group Services monitors the status of all the processes that are members of a group. If either the processes or the node on which a process is executing fails, Group Services initiates a failure notification that informs the remaining providers in the group that one or more providers have been lost.

Join and failure protocols are used to modify the membership list of the group. Any provider in the group may also propose protocols to modify the state value of the group. All protocols are either unconditional (one-phase) protocols, which are automatically approved, or conditional (n-phase) protocols, sometimes called voting protocols, which are voted on by the providers.

During each phase of an n-phase protocol, each provider can take application-specific action and must vote to approve, reject, or continue the protocol. The protocol completes when it is either approved (the proposed changes become established in the group), or rejected (the proposed changes are dropped).

Group Services components

The Group Services subsystem consists of the following components:

  • Group Services daemon

    This daemon (hagsd) runs on each GPFS cluster node and is the central component of the Group Services subsystem.

  • Group Services Application Program Interface (GSAPI)

    This is the application programming interface that GS clients use to obtain the services of the Group Services subsystem. These API calls are used by both the CSM and GPFS products.

  • Ports

    The Group Services subsystem uses several types of communication:

    • User Datagram Protocol (UDP) port numbers for intra-domain communications. That is, communications between Group Services daemons within an operational domain that is defined within the cluster.

      The port used to communicate between nodes can be found in the /etc/services file (see the cthags tag); a quick way to check it is shown after this list.

    • UNIX domain sockets for communications between GS clients and the local Group Services daemon (via the GSAPI).

  • Control command

    The control command is used to add the Group Services subsystem to the cluster. It can also be used to remove the subsystem from the cluster, and to start or stop it.

  • Files and directories

    The Group Services subsystem uses the /var/ct directory to store files.
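
As with Topology Services, the Group Services port assignment mentioned above (the cthags tag) can be checked directly on any node:

 grep cthags /etc/services       # UDP port used by the Group Services daemons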

Group Services dependencies

The Group Services subsystem depends on the following resources:

  • System Resource Controller

    A subsystem that can be used to define and control other subsystems. For more information, see "System Resource Controller (SRC)" on page 254.

  • Cluster data

    The cluster data contains the system configuration information, such as host names and MAC addresses. This data is used by GPFS and CSM commands to contact cluster nodes. A GPFS cluster has a primary server, which is in charge of maintaining this data, and may have a secondary server as a backup. These nodes are designated during the GPFS cluster creation. CSM stores the cluster data on the management server.

  • Topology Services

    A subsystem that is used to determine which nodes in a system are running at any given time. Group Services requests information such as adapter status and node connectivity from the Topology Services subsystem in order to make this information available to its clients. Details on how the Topology Services subsystem provides such information are given in "Topology Services subsystem" on page 257.

Configuring and operating Group Services

The following sections describe the various aspects of configuring and operating Group Services.

Group Services subsystem configuration

Group Services subsystem configuration is performed by the cthagsctrl command, which is included in the RSCT package.

The cthagsctrl command provides a number of functions for controlling the operation of the Group Services subsystem. These are:

  • Add (configure) the Group Services subsystem

    When the cthagsctrl command is used to add the Group Services subsystem, it first fetches the port number from the GPFS cluster data. Then it adds the Group Services daemon to the SRC using the mkssys command.

  • Start and Stop the Group Services subsystem

    The start and stop functions of the cthagsctrl command run the startsrc and stopsrc commands, respectively.

  • Delete (unconfigure) the Group Services subsystem

    The delete function of the cthagsctrl command removes the subsystem from the SRC and removes the Group Services daemon communications port number from the /etc/services file. It does not remove anything from the GPFS cluster data because the Group Services subsystem may still be configured on other nodes in the operational domain.

  • Turn tracing of the Group Services daemon on or off

    The tracing function of the cthagsctrl command is provided to supply additional problem determination information when it is required by the IBM Support Center. Normally, tracing should not be turned on, because it might slightly degrade Group Services subsystem performance and can consume large amounts of disk space in the /var file system.

Initializing the Group Services daemon

Under normal conditions, the Group Services daemon is started by GPFS. If necessary, the Group Services daemon can be started using the cthagsctrl command or the startsrc command directly.
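
A minimal manual start and status check uses the same SRC commands as for Topology Services (again, under normal conditions GPFS handles this):

 startsrc -s cthags
 lssrc -s cthags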

During initialization, the Group Services daemon performs the following steps:

  • It retrieves the number of the node on which it is running.

  • It tries to connect to the Topology Services subsystem. If the connection cannot be established because the Topology Services subsystem is not running, the attempt is retried periodically until it succeeds. Until the connection is established, the Group Services daemon periodically writes an error log entry, and no clients may connect to the Group Services subsystem.

  • It performs actions that are necessary to become a daemon. This includes establishing communications with the SRC subsystem so that it can return status in response to SRC commands.

  • It establishes the Group Services domain, which is the set of nodes in the GPFS cluster. At this point, one of the Group Services daemons establishes itself as the GS name server. Until the domain is established, no GS client requests to join or subscribe to groups are processed. The GS name server is responsible for keeping track of all client groups that are created in the domain. To ensure that only one node becomes a GS name server, Group Services uses the following protocol:

    1. When the daemon is connected to the Topology Services subsystem, it waits for Topology Services to tell it which nodes are currently running.

    2. Based on the input from Topology Services, each daemon finds the lowest numbered running node in the domain. The daemon compares its own node number to the lowest numbered node and performs one of the following:

      • If the node the daemon is on is the lowest numbered node, the daemon waits for all other running nodes to nominate it as the GS name server.

      • If the node the daemon is on is not the lowest numbered node, it sends a nomination message to the lowest numbered node periodically, initially every five seconds.

    3. Once all running nodes have nominated the GS name server and a coronation timer (about 20 seconds) has expired, the nominee sends an insert message to the nodes. All nodes must acknowledge this message. When they do, the nominee becomes the established GS name server and it sends a commit message to all of the nodes.

    4. At this point, the Group Services domain is established, and requests by clients to join or subscribe to groups are processed.

  • It enters the main control loop. In this loop, the Group Services daemon waits for requests from GS clients, messages from other Group Services daemons, messages from the Topology Services subsystem, and requests from the SRC for status.

Group Services daemon operation

Normal operation of the Group Services subsystem requires no administrative intervention. The subsystem normally automatically recovers from temporary failures, such as node failures or failures of Group Services daemons. However, there are some operational characteristics that might be of interest to administrators:

  • The maximum number of groups to which a GS client can subscribe, or that a GS client can join, is the largest value that can be stored in a signed integer variable.

  • The maximum number of groups allowed within a domain is 65535.

  • These limits are the theoretical maximum limits. In practice, the amount of memory available to the Group Services daemon and its clients will reduce the limits to smaller values.

On occasion, you may need to check the status of the subsystem. You can display the operational status of the Group Services daemon by issuing the lssrc command, as shown in Example A-3.

Example A-3: Verify that Group Services is running

start example
 [root@storage001 root]# lssrc -ls cthags
 Subsystem         Group            PID      Status
  cthags           cthags           2772     active
 1 locally-connected clients.  Their PIDs: 3223(mmfsd)
 HA Group Services domain information:
 Domain established by node 1
 Number of groups known locally: 3
                    Number of   Number of local
 Group name         providers   providers/subscribers
 GpfsRec.1                5           1           0
 Gpfs.1                   5           1           0
 NsdGpfs.1                5           1           0
 [root@storage001 root]#
end example


