Troubleshooting a GPFS file system can be complex due to its distributed nature. In this section, we describe the most common problems you may find when running GPFS and possible solutions. For further information on troubleshooting, refer to IBM General Parallel File System for Linux: Problem Determination Guide, GA22-7842.
ssh and scp (or rsh and rcp) are used by GPFS administration commands to perform operations on other nodes. For these commands to run, the sshd daemon must be running and configured to accept connections from the root user on the other nodes.
The first things to check are the connection authorization from one node to the other nodes and any extraneous messages in the command output. You can find information on OpenSSH customization in Appendix B, "Common facilities" on page 275. Check that all nodes can connect to all others without any password prompt.
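A quick way to verify this is to loop over the node names with ssh in batch mode, which fails instead of prompting when password-less authentication is not in place. The following sketch assumes the node names used in our example cluster; substitute your own:

# Password-less ssh check; the node names below are those from our
# example cluster -- replace them with your own.
for node in storage001-myri0 node001-myri0 node002-myri0 \
            node003-myri0 node004-myri0; do
    # BatchMode=yes makes ssh fail rather than prompt for a password
    if ssh -o BatchMode=yes $node true > /dev/null 2>&1; then
        echo "$node: OK"
    else
        echo "$node: FAILED"
    fi
done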
You can also check whether your GPFS cluster has been configured correctly to use the specified remote shell and remote copy commands by issuing the mmlscluster command, as in Example 8-17. Verify the contents of the remote shell command and remote file copy command fields.
Example 8-17: mmlscluster command
[root@storage001 root]# mmlscluster

GPFS cluster information
========================
  Cluster id:                gpfs1035415317
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Primary network:           myrinet
  Secondary network:         ether

GPFS cluster data repository servers:
-------------------------------------
  Primary server:    storage001-myri0.cluster.com
  Secondary server:  (none)

Nodes in nodeset 1:
-------------------
   1  storage001-myri0  10.2.1.141  storage001-myri0.cluster.com  10.0.3.141
   2  node001-myri0     10.2.1.1    node001-myri0.cluster.com     10.0.3.1
   3  node002-myri0     10.2.1.2    node002-myri0.cluster.com     10.0.3.2
   4  node003-myri0     10.2.1.3    node003-myri0.cluster.com     10.0.3.3
   5  node004-myri0     10.2.1.4    node004-myri0.cluster.com     10.0.3.4
[root@storage001 root]#
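If these fields do not show the commands you intend to use, they can be changed with the mmchcluster command. A minimal sketch; verify the flags against the mmchcluster documentation for your GPFS level:

# Point GPFS at ssh and scp for remote command execution and file copy
mmchcluster -r /usr/bin/ssh -R /usr/bin/scp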
Another reason why SSH may fail is that connectivity to a node has been lost. Error messages from mmdsh may indicate such a condition. For example:
mmdsh: node001 rsh process had return code 1.
There are many things that could cause this problem: cable failures, network card problems, switch failures, and so on. You can start by checking whether the affected node is powered on. If the node is up, check the node connectivity and verify that the sshd daemon is running on the remote node. If not, restart the daemon by issuing:
# service sshd start
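For example, assuming node001 is the unreachable node, you could check basic reachability from a working node, and then check and restart the daemon from the console of node001 itself:

# From a working node: is node001 reachable on the network at all?
ping -c 3 node001

# On node001 itself (console access), check sshd and restart if needed
service sshd status
service sshd restart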
Sometimes you may see an mmdsh error message due to the lack of an mmfsd process on some of the nodes, as in Example 8-18. Make sure mmfsd is running on all nodes, using lssrc -a, as in Example 8-19.
Example 8-18: mmcrfs command
[root@storage001 root]# mmcrfs /gpfs gpfs0 -F DescFile -v yes -r 1 -R 2
GPFS: 6027-624 No disks
GPFS: 6027-441 Unable to open disk 'gpfs2nsd'.
No such device
GPFS: 6027-538 Error accessing disks.
mmdsh: node001 rsh process had return code 19.
mmcommon: Unexpected error from runRemoteCommand_Cluster: mmdsh. Return code: 1
mmcrfs: tscrfs failed. Cannot create gpfs0
[root@storage001 root]#
Example 8-19: Verifying mmfsd is running
# lssrc -a
Subsystem         Group            PID     Status
 cthats           cthats           843     active
 cthags           cthags           943     active
 ctrmc            rsct             1011    active
 ctcas            rsct             1018    active
 IBM.HostRM       rsct_rm          1069    active
 IBM.FSRM         rsct_rm          1077    active
 IBM.CSMAgentRM   rsct_rm          1109    active
 IBM.ERRM         rsct_rm          1110    active
 IBM.AuditRM      rsct_rm          1148    active
 mmfs             aixmm            1452    active
 IBM.SensorRM     rsct_rm                  inoperative
 IBM.ConfigRM     rsct_rm                  inoperative
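If mmfsd is not running on one or more nodes, you can start GPFS on them with the mmstartup command and then re-run lssrc -a to confirm. A sketch; check the node-selection flags for your GPFS level, as the syntax has varied between releases:

# Start GPFS on all nodes in the nodeset
mmstartup -a

# Or start it on selected nodes only (flag syntax varies by release;
# older levels use -w NodeList, newer ones use -N)
mmstartup -w node001-myri0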
In this section, we describe the two most common problems related to NSDs and disks. They are not the only problems you might face, but they are the ones you are most likely to encounter.
Sometimes a disk fails and appears to have disappeared from the system. This can happen if someone removes an in-use hot-swap disk from the server, or in the case of a particularly severe disk failure.
In this situation, GPFS loses connectivity to the disk and, depending on how the file system was created, you may or may not lose access to the file system.
You can verify whether the disk is reachable by the operating system using mmlsnsd -m, as shown in Example 8-20. In this situation, the GPFS disk gpfs1nsd is unreachable. This could mean that the disk has been turned off, has been removed from its bay, or has failed for some other reason.
Example 8-20: mmlsnsd command
[root@storage001 root]# mmlsnsd -m

 NSD name   PVID              Device     Node name   Remarks
 -----------------------------------------------------------------------
 gpfs1nsd   0A0000013BF15AFD  -          node-a      (error) primary node
 gpfs2nsd   0A0000023BF15B0A  /dev/sdb1  node-b      primary node
 gpfs3nsd   0A0000033BF15B26  /dev/sdb1  node-c      primary node
 gpfs4nsd   0A0000013BF2F4EA  /dev/sda9  node-a      primary node
 gpfs5nsd   0A0000023BF2F4FF  /dev/sda3  node-b      primary node
 gpfs6nsd   0A0000033BF2F6E1  /dev/sda6  node-c      primary node
[root@storage001 root]#
To correct this problem, first verify that the disk is correctly attached and has not failed completely. Then verify that the driver for the disk is operational, reloading it with the rmmod and insmod commands if necessary. If the disk had only been removed from its bay or turned off, reloading the driver will activate it again, and you can then bring it back up by following the steps in "The disk is down and will not come up" on page 241. If the disk has a hardware problem that requires replacement, refer to 8.1.3, "Replacing a failing disk in an existing GPFS file system" on page 230.
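For example, on a system whose disks sit behind an adapter with a loadable driver module, the reload could look like the sketch below. The qla2200 module name is purely an assumption for illustration; use lsmod to identify the module your adapter actually uses:

# Identify the loaded driver module (qla2200 is only an example)
lsmod | grep qla2200

# Unload and reload it so the disks are redetected
rmmod qla2200
insmod qla2200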
Occasionally, disk problems will occur on a node and, even after the node has been rebooted, the disk connected to it does not come up again. In this situation, you will have to manually bring the disk up again and then run some recovery commands to restore access to your file system.
In our example, the gpfs0 file system has lost two of its three disks: gpfs1nsd and gpfs3nsd. We have to recover the two disks, run a file system check, and then re-stripe the file system.
Because the file system check and re-stripe require access to the file system, which is down, you must re-activate the disks first; once the file system is up again, recovery can proceed. In Example 8-21, we verify which disks are down using the mmlsdisk command, re-activate them with the mmchdisk command, and then verify the disks again with mmlsdisk.
Example 8-21: Reactivating disks
[root@storage001 root]# mmlsdisk gpfs0
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs1nsd     nsd         512       1 yes      yes   ready         down
gpfs2nsd     nsd         512       2 yes      yes   ready         up
gpfs3nsd     nsd         512       3 yes      yes   ready         down
[root@storage001 root]# mmchdisk gpfs0 start -d "gpfs1nsd;gpfs3nsd"
Scanning file system metadata, phase 1 ...
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning user file metadata ...
 77 % complete on Tue Nov 27 00:13:38 2001
100 % complete on Tue Nov 27 00:13:39 2001
Scan completed successfully.
[root@storage001 root]# mmlsdisk gpfs0
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs1nsd     nsd         512       1 yes      yes   ready         up
gpfs2nsd     nsd         512       2 yes      yes   ready         up
gpfs3nsd     nsd         512       3 yes      yes   ready         up
[root@storage001 root]#
Now that all three disks are up, it is time to verify the file system consistency. Additionally, because some operations could have occurred on the file system while only one of the disks was down, we must re-balance it. We show the output of the mmfsck and mmrestripefs commands in Example 8-22. The mmfsck command has some important options you may need to use, such as -r, for read-only access, and -y, to automatically correct problems found in the file system.
Example 8-22: mmfsck and mmrestripefs commands
[root@storage001 root]# mmfsck gpfs0
Checking "gpfs0"
Checking inodes
Checking inode map file
Checking directories and files
Checking log files
Checking extended attributes file
Checking file reference counts
Checking file system replication status

   33792 inodes
      14 allocated
       0 repairable
       0 repaired
       0 damaged
       0 deallocated
       0 orphaned
       0 attached

  384036 subblocks
    4045 allocated
       0 unreferenced
       0 deletable
       0 deallocated

     231 addresses
       0 suspended

File system is clean.
# mmrestripefs gpfs0 -r
Scanning file system metadata, phase 1 ...
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning user file metadata ...
 72 % complete on Tue Nov 27 00:19:24 2001
100 % complete on Tue Nov 27 00:19:25 2001
Scan completed successfully.
[root@storage001 root]# mmlsdisk gpfs0 -e
All disks up and ready
[root@storage001 root]# mmlsdisk gpfs0
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs1nsd     nsd         512       1 yes      yes   ready         up
gpfs2nsd     nsd         512       2 yes      yes   ready         up
gpfs3nsd     nsd         512       3 yes      yes   ready         up
[root@storage001 root]#
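To summarize the mmfsck options mentioned above, a common sequence is to run a read-only check first and repair only if problems are reported. Note that mmfsck needs the file system to be unmounted on all nodes:

# Read-only check: report problems without changing anything
mmfsck gpfs0 -r

# Repair automatically, answering "yes" to all repair prompts
mmfsck gpfs0 -y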