Troubleshooting a GPFS file system can be complex due to its distributed nature. In this section, we describe the most common problems you may find when running GPFS and possible solutions. For further information on troubleshooting, refer to IBM General Parallel File System for Linux: Problem Determination Guide, GA22-7842.
ssh and scp (or rsh and rcp) are used by GPFS administration commands to perform operations on other nodes. For these commands to run, the sshd daemon must be running and configured to accept connections from the root user on the other nodes.
The first things to check are the connection authorization from one node to the other nodes and any extraneous messages in the command output. You can find information on OpenSSH customization in Appendix B, "Common facilities" on page 275. Check that all nodes can connect to all others without any password prompt.
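A quick way to verify this is to loop over the node names with ssh in batch mode, which fails instead of prompting when password-less authentication is not in place. The following sketch assumes the node names used in our example cluster; substitute your own:

# Password-less ssh check; the node names below are those from our
# example cluster -- replace them with your own.
for node in storage001-myri0 node001-myri0 node002-myri0 \
            node003-myri0 node004-myri0; do
    # BatchMode=yes makes ssh fail rather than prompt for a password
    if ssh -o BatchMode=yes $node true > /dev/null 2>&1; then
        echo "$node: OK"
    else
        echo "$node: FAILED"
    fi
done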
You can also check whether your GPFS cluster has been configured correctly to use the specified remote shell and remote copy commands by issuing the mmlscluster command, as in Example 8-17. Verify the contents of the remote shell command and remote file copy command fields.
Example 8-17: mmlscluster command
[root@storage001 root]# mmlscluster

GPFS cluster information
========================
  Cluster id:                gpfs1035415317
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Primary network:           myrinet
  Secondary network:         ether

GPFS cluster data repository servers:
-------------------------------------
  Primary server:    storage001-myri0.cluster.com
  Secondary server:  (none)

Nodes in nodeset 1:
-------------------
   1  storage001-myri0  10.2.1.141  storage001-myri0.cluster.com  10.0.3.141
   2  node001-myri0     10.2.1.1    node001-myri0.cluster.com     10.0.3.1
   3  node002-myri0     10.2.1.2    node002-myri0.cluster.com     10.0.3.2
   4  node003-myri0     10.2.1.3    node003-myri0.cluster.com     10.0.3.3
   5  node004-myri0     10.2.1.4    node004-myri0.cluster.com     10.0.3.4
[root@storage001 root]#
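If these fields do not show the commands you intend to use, they can be changed with the mmchcluster command. A minimal sketch; verify the flags against the mmchcluster documentation for your GPFS level:

# Point GPFS at ssh and scp for remote command execution and file copy
mmchcluster -r /usr/bin/ssh -R /usr/bin/scp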
Another reason why SSH may fail is that connectivity to a node has been lost. Error messages from mmdsh may indicate such a condition. For example:
mmdsh: node001 rsh process had return code 1.
There are many things that could cause this problem: cable failures, network card problems, switch failures, and so on. You can start by checking whether the affected node is powered on. If the node is up, check the node connectivity and verify that the sshd daemon is running on the remote node. If not, restart the daemon by issuing:
# service sshd start
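For example, assuming node001 is the unreachable node, you could check basic reachability from a working node, and then check and restart the daemon from the console of node001 itself:

# From a working node: is node001 reachable on the network at all?
ping -c 3 node001

# On node001 itself (console access), check sshd and restart if needed
service sshd status
service sshd restart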
Sometimes you may see an mmdsh error message due to the lack of an mmfsd process on some of the nodes, as in Example 8-18. Make sure mmfsd is running on all nodes, using lssrc -a, as in Example 8-19.
Example 8-18: mmcrfs command
[root@storage001 root]# mmcrfs /gpfs gpfs0 -F DescFile -v yes -r 1 -R 2
GPFS: 6027-624 No disks
GPFS: 6027-441 Unable to open disk 'gpfs2nsd'.
No such device
GPFS: 6027-538 Error accessing disks.
mmdsh: node001 rsh process had return code 19.
mmcommon: Unexpected error from runRemoteCommand_Cluster: mmdsh. Return code: 1
mmcrfs: tscrfs failed. Cannot create gpfs0
[root@storage001 root]#
Example 8-19: Verifying mmfsd is running
# lssrc -a
Subsystem         Group            PID     Status
 cthats           cthats           843     active
 cthags           cthags           943     active
 ctrmc            rsct             1011    active
 ctcas            rsct             1018    active
 IBM.HostRM       rsct_rm          1069    active
 IBM.FSRM         rsct_rm          1077    active
 IBM.CSMAgentRM   rsct_rm          1109    active
 IBM.ERRM         rsct_rm          1110    active
 IBM.AuditRM      rsct_rm          1148    active
 mmfs             aixmm            1452    active
 IBM.SensorRM     rsct_rm                  inoperative
 IBM.ConfigRM     rsct_rm                  inoperative
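If mmfsd is not running on one or more nodes, you can start GPFS on them with the mmstartup command and then re-run lssrc -a to confirm. A sketch; check the node-selection flags for your GPFS level, as the syntax has varied between releases:

# Start GPFS on all nodes in the nodeset
mmstartup -a

# Or start it on selected nodes only (flag syntax varies by release;
# older levels use -w NodeList, newer ones use -N)
mmstartup -w node001-myri0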
In this section, we describe the two most common problems related to NSDs and disks. They are not the only problems you might face, but they are the ones you are most likely to encounter.
Sometimes a disk fails and appears to have disappeared from the system. This can happen if someone removes an in-use hot-swap disk from the server, or in the case of a particularly severe disk failure.
In this situation, GPFS loses connectivity to the disk and, depending on how the file system was created, you may or may not lose access to the file system.
You can verify whether the disk is reachable by the operating system using mmlsnsd -m, as shown in Example 8-20. In this situation, the GPFS disk gpfs1nsd is unreachable. This could mean that the disk has been turned off, has been removed from its bay, or has failed for some other reason.
Example 8-20: mmlsnsd command
[root@storage001 root]# mmlsnsd -m

 NSD name   PVID              Device     Node name   Remarks
 -----------------------------------------------------------------------
 gpfs1nsd   0A0000013BF15AFD  -          node-a      (error) primary node
 gpfs2nsd   0A0000023BF15B0A  /dev/sdb1  node-b      primary node
 gpfs3nsd   0A0000033BF15B26  /dev/sdb1  node-c      primary node
 gpfs4nsd   0A0000013BF2F4EA  /dev/sda9  node-a      primary node
 gpfs5nsd   0A0000023BF2F4FF  /dev/sda3  node-b      primary node
 gpfs6nsd   0A0000033BF2F6E1  /dev/sda6  node-c      primary node
[root@storage001 root]#
To correct this problem, first verify that the disk is correctly attached and has not failed completely. Then verify that the driver for the disk is operational, reloading it with the rmmod and insmod commands if necessary. If the disk had only been removed from its bay or turned off, reloading the driver will activate it again, and you can then bring it back up by following the steps in "The disk is down and will not come up" on page 241. If the disk has a hardware problem that requires replacement, refer to 8.1.3, "Replacing a failing disk in an existing GPFS file system" on page 230.
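For example, on a system whose disks sit behind an adapter with a loadable driver module, the reload could look like the sketch below. The qla2200 module name is purely an assumption for illustration; use lsmod to identify the module your adapter actually uses:

# Identify the loaded driver module (qla2200 is only an example)
lsmod | grep qla2200

# Unload and reload it so the disks are redetected
rmmod qla2200
insmod qla2200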
Occasionally, disk problems will occur on a node and, even after the node has been rebooted, the disk connected to it does not come up again. In this situation, you will have to manually bring the disk up again and then run some recovery commands to restore access to your file system.
In our example, the gpfs0 file system has lost two of its three disks: gpfs1nsd and gpfs3nsd. We have to recover the two disks, run a file system check, and then re-stripe the file system.
Because the file system check and re-stripe require access to the file system, which is down, you must re-activate the disks first; once the file system is up again, recovery can proceed. In Example 8-21, we verify which disks are down using the mmlsdisk command, re-activate them with the mmchdisk command, and then verify the disks again with mmlsdisk.
Example 8-21: Reactivating disks
[root@storage001 root]# mmlsdisk gpfs0
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs1nsd     nsd         512       1 yes      yes   ready         down
gpfs2nsd     nsd         512       2 yes      yes   ready         up
gpfs3nsd     nsd         512       3 yes      yes   ready         down
[root@storage001 root]# mmchdisk gpfs0 start -d "gpfs1nsd;gpfs3nsd"
Scanning file system metadata, phase 1 ...
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning user file metadata ...
 77 % complete on Tue Nov 27 00:13:38 2001
100 % complete on Tue Nov 27 00:13:39 2001
Scan completed successfully.
[root@storage001 root]# mmlsdisk gpfs0
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs1nsd     nsd         512       1 yes      yes   ready         up
gpfs2nsd     nsd         512       2 yes      yes   ready         up
gpfs3nsd     nsd         512       3 yes      yes   ready         up
[root@storage001 root]#
Now that all three disks are up, it is time to verify the file system consistency. Additionally, because some operations could have occurred on the file system while only one of the disks was down, we must re-balance it. We show the output of the mmfsck and mmrestripefs commands in Example 8-22. The mmfsck command has some important options you may need to use, such as -r, for read-only access, and -y, to automatically correct problems found in the file system.
Example 8-22: mmfsck and mmrestripefs commands
[root@storage001 root]# mmfsck gpfs0
Checking "gpfs0"
Checking inodes
Checking inode map file
Checking directories and files
Checking log files
Checking extended attributes file
Checking file reference counts
Checking file system replication status

   33792 inodes
      14 allocated
       0 repairable
       0 repaired
       0 damaged
       0 deallocated
       0 orphaned
       0 attached

  384036 subblocks
    4045 allocated
       0 unreferenced
       0 deletable
       0 deallocated

     231 addresses
       0 suspended

File system is clean.
# mmrestripefs gpfs0 -r
Scanning file system metadata, phase 1 ...
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning user file metadata ...
 72 % complete on Tue Nov 27 00:19:24 2001
100 % complete on Tue Nov 27 00:19:25 2001
Scan completed successfully.
[root@storage001 root]# mmlsdisk gpfs0 -e
All disks up and ready
[root@storage001 root]# mmlsdisk gpfs0
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs1nsd     nsd         512       1 yes      yes   ready         up
gpfs2nsd     nsd         512       2 yes      yes   ready         up
gpfs3nsd     nsd         512       3 yes      yes   ready         up
[root@storage001 root]#
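To summarize the mmfsck options mentioned above, a common sequence is to run a read-only check first and repair only if problems are reported. Note that mmfsck needs the file system to be unmounted on all nodes:

# Read-only check: report problems without changing anything
mmfsck gpfs0 -r

# Repair automatically, answering "yes" to all repair prompts
mmfsck gpfs0 -y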