Here we describe the steps we took to install GPFS on our cluster. We used the NSD model, with one internal disk per server node. Both nodes act as client and server. Although our cluster was not a realistic one, bringing up a real production GPFS system would follow the same path.

6.3.1 Planning

GPFS is becoming more and more dynamic in the way it handles its components. You can add or remove disks and nodes and change file system settings without stopping GPFS. However, careful planning is still a very important step. You need to consider the following areas:
6.3.2 Software installation

The following RPMs must be installed:
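The exact RPM names depend on the GPFS release you are installing. As a minimal sketch (the gpfs.*.rpm wildcard assumes the packages follow the usual gpfs.* naming; adjust it to the file names shipped with your release), installing and verifying the packages on a node might look like this:

# rpm -ivh gpfs.*.rpm        # install the GPFS RPMs shipped with your release (names vary)
# rpm -qa | grep gpfs        # verify that the GPFS packages are now installed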
The kernel source, development tools, and cross-compilers need to be installed in order to be able to compile the portability layer. The imake command, which is part of the xdevel package, is also needed. If all your nodes are identical, you may install these development tools on one node only and then copy the binaries to the other nodes; this technique is described in 6.3.3, "Compiling the portability layer" on page 291.

6.3.3 Compiling the portability layer

An explanation of how to compile the portability layer is detailed in a README file located under /usr/lpp/mmfs/src. You can build as a regular user (which requires this user to have write permission in the /usr/lpp/mmfs/src directory and read permission for all files under the /usr/src/linux/ directory). All the action takes place in the /usr/lpp/mmfs/src/config directory.

The first step in compiling the portability layer is to set the environment, as shown in Example 6-11.

Example 6-11. Compilation of the portability layer

r01n33:/usr/lpp/mmfs/src/config # export SHARKCLONEROOT=/usr/lpp/mmfs/src
r01n33:/usr/lpp/mmfs/src/config # cp site.mcr.proto site.mcr

Use the site.mcr.proto file as a template. The real file is site.mcr, and it needs to be edited to suit your needs; this is well documented inside the file. Example 6-12 on page 292 shows the differences between the template file and a file that worked for us.

Example 6-12. diff site.mcr site.mcr.proto

r01n33:/usr/lpp/mmfs/src/config # diff site.mcr.proto site.mcr
13c13
< #define GPFS_ARCH_I386
---
> /* #define GPFS_ARCH_I386 */
15c15
< /* #define GPFS_ARCH_PPC64 */
---
> #define GPFS_ARCH_PPC64
34c34
< LINUX_DISTRIBUTION = REDHAT_LINUX
---
> /* LINUX_DISTRIBUTION = REDHAT_LINUX */
36c36
< /* LINUX_DISTRIBUTION = SUSE_LINUX */
---
> LINUX_DISTRIBUTION = SUSE_LINUX
55c55
< #define LINUX_KERNEL_VERSION 2041900
---
> #define LINUX_KERNEL_VERSION 2042183

The differences are quite easy to see. The architecture needs to be changed to PPC64, and the name of the distribution and the Linux kernel version need to be changed, too. To determine which kernel version you are running, use the command shown in Example 6-13; the relevant version number here is 2.4.21-83.

Example 6-13. Get to know your kernel version

r01n33:/usr/lpp/mmfs/src/config # cat /proc/version
Linux version 2.4.21-83-pseries64 (root@PowerPC64-pSeries.suse.de) (gcc version 3.2.2) #1 SMP Tue Sep 30 11:30:48 UTC 2003

Next, you need to configure the installed kernel source tree. The source tree is not configured properly when the kernel-source RPM is installed; this is true for both SLES8 and RHAS 3 distributions. Several commands are needed to rectify the situation; these commands are shown in Example 6-14 on page 293 for SLES8 (RHAS 3 is not supported at the time of writing). Adjust the names of the files in the /boot directory to your environment, if needed. Root authority is required.

Example 6-14. Configure the kernel source tree under SLES8

r01n33:/usr/lpp/mmfs/src/config # cd /usr/src/linux-2.4.21-83
r01n33:/usr/src/linux-2.4.21-83 # sh make_ppc64.sh distclean
r01n33:/usr/src/linux-2.4.21-83 # cp /boot/vmlinuz-2.4.21.config .config
r01n33:/usr/src/linux-2.4.21-83 # sh make_ppc64.sh oldconfig
$(/bin/pwd)/include/linux/version.h update-modverfile

Once this is done, move back to the /usr/lpp/mmfs/src directory to build and install the portability layer, as shown in Example 6-15.

Example 6-15. Build and install the portability layer

r01n33:/usr/lpp/mmfs/src # make World
...
Checking Destination Directories....
\c \c \c
touch install.he
\c \c \c
touch install.ti
make[1]: Leaving directory `/usr/lpp/mmfs/src/misc'
r01n33:/usr/lpp/mmfs/src # make InstallImages
cd gpl-linux; /usr/bin/make InstallImages;
make[1]: Entering directory `/usr/lpp/mmfs/src/gpl-linux'
mmfslinux
mmfs25
lxtrace
dumpconv
tracedev
make[1]: Leaving directory `/usr/lpp/mmfs/src/gpl-linux'

The five files listed in the make InstallImages output of Example 6-15 (mmfslinux, mmfs25, lxtrace, dumpconv, and tracedev) are copied into the /usr/lpp/mmfs/bin directory. If all the nodes in your cluster are identical and run the same Linux kernel, you can simply copy these files over to the /usr/lpp/mmfs/bin directory on all nodes.

6.3.4 Configuring and bringing up GPFS

This proceeds in seven steps, for NSD-based configurations:
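Before working through these steps, the portability layer binaries from 6.3.3 must be present in /usr/lpp/mmfs/bin on every node. As a minimal sketch of copying them from the build node (the target host name gr01n34 and the use of scp are our assumptions for illustration, based on the two-node setup used in this chapter):

# copy the portability layer binaries and kernel modules built on this node;
# this shortcut is only valid if both nodes run the same Linux kernel
# scp /usr/lpp/mmfs/bin/{mmfslinux,mmfs25,lxtrace,dumpconv,tracedev} gr01n34:/usr/lpp/mmfs/bin/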
Create the GPFS cluster

We defined the GPFS cluster on top of the RSCT peer domain created during 6.2, "RSCT peer domain setup" on page 285. First, you need to create the file that describes the nodes that will be part of the GPFS cluster. Its contents are shown in Example 6-16. The syntax is one node per line.

Example 6-16. Example of a GPFS nodes file

r01n33:~ # cat /tmp/gpfsnodes
gr01n33
gr01n34

To create the GPFS cluster, use the mmcrcluster command as shown in Example 6-17 on page 294. You do not need to specify the RSCT peer domain; the GPFS cluster will be defined on the current peer domain, which must be online when you issue the command. Note that, in the example, we specified the use of ssh and scp. In order for the command to succeed, root must be able to execute ssh from all nodes, to all nodes, using all interfaces, without being prompted for a password. Refer to 3.12, "ssh" on page 144 for a discussion on how to set up ssh.

You have to designate a primary and a backup server. What is meant here by "primary" and "secondary" servers are nodes that hold the cluster data, not the file data, which will be spread over all the nodes in the nodesets defined later. Note also the use of the lc type for the cluster; lc (loose clusters) is the only type supported for Linux on pSeries.

In Example 6-17, we also show the use of the mmlscluster command to list the properties of the newly created cluster. mmlscluster is correct in reporting that the two nodes do not belong to any nodeset.

Example 6-17. mmcrcluster command

r01n33:~ # mmcrcluster -t lc -p gr01n33 -s gr01n34 -r /usr/bin/ssh -R /usr/bin/scp -n /tmp/gpfsnodes 2>&1
Thu Oct 30 15:12:35 PST 2003: mmcrcluster: Processing node gr01n33
Thu Oct 30 15:12:36 PST 2003: mmcrcluster: Processing node gr01n34
mmcrcluster: Command successfully completed
mmcrcluster: Propagating the changes to all affected nodes.
This is an asynchronous process.
r01n33:~ # mmlscluster

GPFS cluster information
========================
  GPFS cluster type:        lc
  GPFS cluster id:          gpfs1067555555
  RSCT peer domain name:    itso
  Remote shell command:     /usr/bin/ssh
  Remote file copy command: /usr/bin/scp

GPFS cluster data repository servers:
-------------------------------------
  Primary server:   gr01n33
  Secondary server: gr01n34

Cluster nodes that are not assigned to a nodeset:
-------------------------------------------------
  1 (1)  gr01n33   10.10.10.33   gr01n33
  2 (2)  gr01n34   10.10.10.34   gr01n34

Create a GPFS nodeset

To create a GPFS nodeset, you need a simple file that lists, one line per node, the names of the nodes to include in the nodeset. This has to be a subset of the nodes composing the cluster. To create the nodeset, use the mmconfig command as shown in Example 6-18 on page 295. Use mmlsconfig to display the properties of the nodeset.

Example 6-18. Creating the GPFS nodeset

r01n33:~ # cat /tmp/gpfsnodeset
gr01n33
gr01n34
r01n33:~ # mmconfig -n /tmp/gpfsnodeset
mmconfig: Command successfully completed
mmconfig: Propagating the changes to all affected nodes.
This is an asynchronous process.
r01n33:~ # mmlsconfig

Configuration data for nodeset 1:
---------------------------------
clusterType lc
comm_protocol TCP
multinode yes
autoload no
useSingleNodeQuorum no
useDiskLease yes
group Gpfs.1
recgroup GpfsRec.1
maxFeatureLevelAllowed 700

File systems in nodeset 1:
--------------------------
(none)

mmlsconfig shows the GPFS nodeset id that was assigned (in our case, it was 1).
You can tailor this at mmconfig time, but the important thing is to remember the id, as it will be used to start up GPFS on the nodeset.

Note: The nodeset id is only significant if you want to have multiple nodesets (which is optional, and not common for lc clusters). If there is only one nodeset in the cluster, there is no need to know the nodeset id, as you never have to supply it (GPFS, by default, picks the id of the nodeset on which the command is being executed).

Start up the GPFS nodeset

We are now ready to start up GPFS. This is done with the mmstartup command, to which we must give the GPFS nodeset id as returned by mmlsconfig. This is described in Example 6-19. It is useful to check the status of Topology Services and Group Services after GPFS has been started. To do so, use the lssrc command with the -ls flag on each of the two subsystems.

Example 6-19. Starting up the GPFS nodeset

r01n33:~ # mmstartup -C 1
Thu Oct 30 15:14:31 PST 2003: mmstartup: Starting GPFS ...
gr01n33: 0513-059 The mmfs Subsystem has been started. Subsystem PID is 18224.
gr01n34: 0513-059 The mmfs Subsystem has been started. Subsystem PID is 8145.
r01n33:~ # lssrc -ls cthats
Subsystem         Group            PID     Status
 cthats           cthats           17262   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
CG1            [ 0]    2    2  S 10.10.10.33     10.10.10.34
CG1            [ 0] eth1            0`47a19a84      0`47a19a7e
HB Interval = 1 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 261 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 317 ICMP 0 Dropped: 0
NIM's PID: 17321
CG2            [ 1]    2    2  S 129.40.34.33    129.40.34.34
CG2            [ 1] eth0            0`47a19a82      0`47a19a7d
HB Interval = 1 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 262 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 317 ICMP 0 Dropped: 0
NIM's PID: 17324
  2 locally connected Clients with PIDs:
  rmcd( 17378) hagsd( 17266)
  Configuration Instance = 1067555418
  Default: HB Interval = 1 secs. Sensitivity = 8 missed beats
  Daemon employs no security
  Segments pinned: Text Data Stack.
  Text segment size: 131611 KB. Static data segment size: 595 KB.
  Dynamic data segment size: 939. Number of outstanding malloc: 130
  User time 0 sec. System time 0 sec.
  Number of page faults: 1100. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
r01n33:~ # lssrc -ls cthags
Subsystem         Group            PID     Status
 cthags           cthags           17266   active
3 locally-connected clients. Their PIDs:
17200(IBM.ConfigRMd) 17378(rmcd) 18344(mmfsd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 5
                     Number of    Number of local
Group name           providers    providers/subscribers
GpfsRec.1            2            1            0
Gpfs.1               2            1            0
rmc_peers            2            1            0
NsdGpfs.1            2            1            0
IBM.ConfigRM         2            1            0

If the portability layer was not properly compiled and installed on all the nodes, mmstartup will fail. You may also see, in the GPFS log file (/var/mmfs/gen/mmfslog), entries such as those shown in Example 6-20.

Example 6-20. mmfsd will not start without the portability layer

# cat /var/mmfs/gen/mmfslog
Wed Oct 29 08:07:10 PST 2003 runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
/bin/mv: cannot stat `/var/adm/ras/mmfs.log.previous': No such file or directory
Unloading modules from /usr/lpp/mmfs/bin
Error: /usr/lpp/mmfs/bin/mmfslinux kernel extension does not exist.
Please compile a custom mmfslinux module for your kernel.
See /usr/lpp/mmfs/src/README for directions.
Error: unable to verify kernel/module configuration
Loading modules from /usr/lpp/mmfs/bin
Error: /usr/lpp/mmfs/bin/mmfslinux kernel extension does not exist.
Please compile a custom mmfslinux module for your kernel.
See /usr/lpp/mmfs/src/README for directions.
Error: unable to verify kernel/module configuration
Wed Oct 29 08:07:10 PST 2003 runmmfs: error in loading or unloading the mmfs kernel extension
Wed Oct 29 08:07:10 PST 2003 runmmfs: stopping SRC
...
0513-044 The mmfs Subsystem was requested to stop.
Wed Oct 29 08:07:10 PST 2003 runmmfs: received an SRC stop request; exiting

In our case, it was compiled and installed properly, and the GPFS kernel modules show up as illustrated in Example 6-21.

Example 6-21. GPFS kernel modules loaded

r01n33:~ # lsmod
Module                  Size  Used by    Tainted: PF
mmfs                 1106392   1
mmfslinux             219800   1  [mmfs]
tracedev               14880   1  [mmfs mmfslinux]
ipv6                  481800  -1  (autoclean)
key                   102936   0  (autoclean) [ipv6]
e1000                 152368   1
e100                  106696   1
lvm-mod               110912   0  (autoclean)

Create the local disk partitions (optional)

GPFS will accept anything that looks like a block device, whether it is a partition or an entire disk. In our example, we set aside a partition on each node, as listed in Example 6-22. On the first node, we used /dev/sda4, and on the second node, we used /dev/sdb4.

Example 6-22. Local disk partitions

r01n33:~ # fdisk /dev/sda

The number of cylinders for this disk is set to 34715.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sda: 64 heads, 32 sectors, 34715 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sda1   *         1         4      4080   41  PPC PReP Boot
/dev/sda2             5      1029   1049600   82  Linux swap
/dev/sda3          1030     10246   9438208   83  Linux
/dev/sda4         10247     26631  16778240   83  Linux

r01n34:~ # fdisk /dev/sdb

The number of cylinders for this disk is set to 34715.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sdb: 64 heads, 32 sectors, 34715 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sdb1   *         1         4      4080   41  PPC PReP Boot
/dev/sdb2             5      1029   1049600   82  Linux swap
/dev/sdb3          1030      9222   8389632   83  Linux
/dev/sdb4          9223     25607  16778240   83  Linux

Create the Network Shared Disks (NSDs)

It is now time to create the Network Shared Disks (NSDs) that we will use to store our file data (this step could have been done after cluster creation). We used the mmcrnsd command to perform this operation, and gave it a description file, listed in Example 6-23. The mmlsnsd command is used to display the NSDs just created. The purpose of this step is to prepare all the NSDs and to assign to each NSD a name that is unique across the cluster; a PVID is stored in the NSD itself. We need a unified naming scheme because each node names its local partitions irrespective of the other nodes. As Example 6-23 on page 300 shows, we have two NSDs (one per node) capable of storing file system data and file system metadata, each one served by a single server (we have no twin-tailed disks here).
Each disk belongs to its own failure group, as there is no common point of failure for the two disks, which are on separate machines. Refer to the GPFS documentation for a full description of the syntax.

Example 6-23. Create the NSDs and check them

r01n33:~ # cat /tmp/gpfsnsd
/dev/sda4:gr01n33::dataAndMetadata:1
/dev/sdb4:gr01n34::dataAndMetadata:2
r01n33:~ # mmcrnsd -F /tmp/gpfsnsd
mmcrnsd: Propagating the changes to all affected nodes.
This is an asynchronous process.
r01n33:~ # mmlsnsd

 File system   Disk name    Primary node             Backup node
---------------------------------------------------------------------------
 (free disk)   gpfs1nsd     gr01n33
 (free disk)   gpfs2nsd     gr01n34

Create a file system

GPFS is in operation, and we have NSDs available for receiving data. Now we can create a file system. We do this using the mmcrfs command, as shown in Example 6-24 on page 300. mmcrnsd is clever enough to modify the disk description file, which we gave it in "Create the Network Shared Disks (NSDs)" on page 299, and convert it to a format suitable for the mmcrfs command: the local partition names are replaced by the global, cluster-wide NSD names. Refer to the mmcrfs man page for a detailed explanation of its syntax.

In our example, we use no data replication, and we chose to name the file system /dev/gpfs0 and to mount it automatically (-A 1) under /bigfs. The mmlsdisk command lists the current usage of the NSD disks by GPFS. df -h shows that the file system is indeed mounted, and the dd command demonstrates that we can write into the GPFS file system. The file system is mounted on all the nodes in the nodeset.

Example 6-24. GPFS file system creation

r01n33: # cat /tmp/gpfsnsd
# /dev/sda4:gr01n33::dataAndMetadata:1
gpfs1nsd:::dataAndMetadata:1
# /dev/sdb4:gr01n34::dataAndMetadata:2
gpfs2nsd:::dataAndMetadata:2
r01n33: # mmcrfs /bigfs gpfs0 -F /tmp/gpfsnsd -C 1

The following disks of gpfs0 will be formatted on node r01n33:
    gpfs1nsd: size 16778240 KB
    gpfs2nsd: size 16778240 KB
Formatting file system ...
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Flushing Allocation Maps
Completed creation of file system /dev/gpfs0.
mmcrfs: Propagating the changes to all affected nodes.
This is an asynchronous process.
r01n33:~ # mmlsdisk gpfs0
disk         driver   sector failure holds    holds
name         type       size   group metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs1nsd     nsd         512       1 yes      yes   ready         up
gpfs2nsd     nsd         512       2 yes      yes   ready         up
r01n33:~ # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             9.1G  4.5G  4.6G  50% /
shmfs                 7.3G     0  7.3G   0% /dev/shm
/dev/gpfs0             33G   43M   32G   1% /bigfs
r01n33:~ # dd if=/dev/zero of=/bigfs/junk bs=1024k count=1024
1024+0 records in
1024+0 records out

6.3.5 Shutting down and restarting GPFS

To shut down GPFS, the first step is to unmount the GPFS file systems on all the nodes where they are mounted. Then the mmshutdown command is issued on one of the nodes, as shown in Example 6-25 on page 301.

Example 6-25. Shutting down GPFS

r01n33:~ # mmshutdown -a
Thu Oct 30 14:57:19 PST 2003: mmshutdown: Starting force unmount of GPFS file systems
Thu Oct 30 14:57:24 PST 2003: mmshutdown: Shutting down GPFS daemons
r01n33: 0513-044 The mmfs Subsystem was requested to stop.
r01n33: Shutting down!
r01n33: Unloading modules from /usr/lpp/mmfs/bin
r01n33: Unloading module mmfs
r01n33: Unloading module mmfslinux
r01n33: Unloading module tracedev
r01n34: 0513-044 The mmfs Subsystem was requested to stop.
r01n34: Shutting down!
r01n34: Unloading modules from /usr/lpp/mmfs/bin
r01n34: Unloading module mmfs
r01n34: Unloading module mmfslinux
r01n34: Unloading module tracedev
Thu Oct 30 14:57:33 PST 2003: mmshutdown: Finished

To restart GPFS, simply issue the mmstartup -C nodesetID command as shown in Example 6-19 on page 296. If you created your GPFS file systems with the automatic mount on startup option, they should get mounted now.
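As a quick check after a restart, you can confirm that the kernel modules are loaded and that the file system came back. This is only a minimal sketch, reusing the nodeset id (1), module names, and mount point (/bigfs) from the examples above; adjust them to your own configuration:

# mmstartup -C 1
# lsmod | grep mmfs      # the mmfs, mmfslinux, and tracedev modules should be loaded
# df -h /bigfs           # /dev/gpfs0 should be mounted under /bigfs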