6.5 Software RAID Implementations

Many people turn up their noses at software RAID implementations, citing concerns about performance and reliability. In general, I have found these concerns to be somewhat overstated. While software RAID solutions are often slightly idiosyncratic in their implementation and require some care in deployment, they are generally not as slow as their reputation suggests, particularly for non-parity-based disk arrays. Their chief shortcoming is probably their inability to provide caching at the device level. We'll briefly discuss the two most popular software array products, Solstice DiskSuite and the Linux md package.

6.5.1 Solaris: Solstice DiskSuite

Solstice DiskSuite revolves around the idea of metadevices. A metadevice is a group of physical slices that the system addresses as a single logical device. These metadevices are treated just like physical devices, and their names begin with the letter d. The block metadevices reside in /dev/md/dsk (e.g., /dev/md/dsk/d0) and the raw metadevices reside in /dev/md/rdsk (e.g., /dev/md/rdsk/d0). By default, the system allows 128 metadevices, but the software supports up to 1,024. [5] One thing I have found helpful is to use the metadevice name to encode the physical device location; under such a scheme, for example, the disk with ID 4 on controller 1 would be metadevice d14. The component I/O requests to and from the physical devices are handled by the metadisk driver. Because metadevices appear as physical devices, you still need to use newfs to create a filesystem after setting them up.
SDS offers two different ways of accomplishing the same tasks: a graphical interface, invoked by metatool, and command-line utilities, which I focus on here. However, SDS is an extremely complex product: what we discuss here is only the essential heart of SDS configuration. Sun publishes two excellent technical references on this product, the Solstice DiskSuite 4.2 Reference Guide and the Solstice DiskSuite 4.2 User's Guide.

6.5.1.1 State databases

SDS stores its configuration information in a state database, also called a metainformation database. Because this information is integral to the proper functioning of the product, multiple copies (replicas) are stored on disk; when the data is written, each copy is updated in sequence. At least three, and at most fifty, replicas must exist. By default, these replicas are 517 KB each. In order to ensure data integrity, SDS implements a majority consensus algorithm for the replicas, in which a majority (half, rounded down, plus one) of the replicas must be available for the system to function. If this criterion is not met, the database is considered stale, and functionality fails.
As a result, the replicas should be spread across as many disks and controllers as possible, in order to limit the chances that a single component will fail and cause the entire database to become stale. Replicas typically reside on either dedicated partitions or on slices that are parts of metadevices. [6] I have always put them on slice 3 or slice 7, but there is no technical reason for this. As a general rule, the total number of replicas should be determined by Table 6-2.
Table 6-2. Choosing how many replicas to configure
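The majority-consensus rule is easy to check with a little arithmetic. Here is a minimal shell sketch (the replica count of 6 is hypothetical) that computes the quorum for a given number of replicas:

```shell
# Majority consensus: half, rounded down, plus one of the replicas
# must be available, or the state database is considered stale.
replicas=6                      # hypothetical total replica count
quorum=$(( replicas / 2 + 1 ))  # shell integer division rounds down
echo "quorum for $replicas replicas: $quorum"
echo "failures tolerated: $(( replicas - quorum ))"
```

With six replicas, four must survive, so the configuration tolerates the loss of two.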
The state database replicas are managed through the metadb command. This command is rather complex, and its behavior is summarized in Table 6-3.

Table 6-3. Summary of metadb usage
In addition, you may specify -c number-of-copies with either of these commands to add more than one replica to the slice. If the default size for each replica is too small, you may use the -l size option to metadb to specify a larger size. Running metadb without arguments reports the condition of all replicas:

# metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t0d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t0d0s7
     a    pc luo        16              1034            /dev/dsk/c0t8d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t8d0s7

Flags in lowercase letters imply an "Okay" status; flags in uppercase letters indicate a problem. The possible flags are summed up in Table 6-4.

Table 6-4. metadb status flags
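Since lowercase flags mean a healthy replica and uppercase flags mean trouble, a simple filter can pull the problem slices out of metadb's report. This is only a sketch: the sample output below is fabricated, and W (device write errors) is used as an example of an uppercase error flag:

```shell
# Fabricated metadb output; the second replica is marked W (write errors).
sample='     a m  pc luo        16              1034            /dev/dsk/c0t0d0s7
     W    pc l          16              1034            /dev/dsk/c0t8d0s7'
# Print the slice of any replica whose flag field contains a capital letter.
printf '%s\n' "$sample" | awk '$0 ~ /[A-Z]/ { print $NF }'
```

In practice you would pipe the live output of metadb (or metadb -i) into the same filter.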
You can easily find inactive database replicas by using metadb -i. If the state database has gone stale (usually by means of a disk failure), you must repair the problem. You can accomplish this by using metadb -i to report the inactive replicas, noting how many copies are on each slice, and then using metadb -d -f replica-slice to remove the inactive replicas. Now, reboot the system and replace the failed disk; when the system comes back up, re-add the replicas with metadb -a -c number-of-copies replica-slice.

To initially create the state database, run metadb with the -a and -f switches. If you have fewer than five disks, you will want to create more than one copy per slice by specifying -c number-of-copies; this is done to maintain quorum in the event of a disk failure. You can't create a state database on a partition that has an existing filesystem. Here's an example of initially creating state database replicas on a system with three disks and two copies per slice:

# metadb -a -f -c 2 c0t0d0s7 c0t1d0s7 c0t2d0s7
# metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t0d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t0d0s7
     a    pc luo        16              1034            /dev/dsk/c0t1d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t1d0s7
     a    pc luo        16              1034            /dev/dsk/c0t2d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t2d0s7

You may add more state database replicas after the initial creation by leaving out the -f switch; the general format of the command is metadb -a -c number-of-copies replica-slice. Once you have the state database set up, you can start working with metadevices.

6.5.1.2 RAID 0: stripes

You cannot use a stripe to hold any filesystem that is used at boot time or during an upgrade/install process (that is, /, swap, /usr, /var, or /opt). A prerequisite to creating a stripe is the existence of at least three stable state database replicas.
Creating a stripe involves using the metainit command:

# metainit d0 1 2 c0t1d0s0 c0t2d0s0 -i 64k
d0: Concat/Stripe is setup

This creates a metadevice d0, which is a single (1) stripe of two (2) slices, namely c0t1d0s0 and c0t2d0s0. The optional -i flag sets the interlace size to 64 KB; by default, the interlace is 16 KB. Here's an example of creating a four-device stripe:

# metainit d1 1 4 c0t1d0s0 c0t2d0s0 c0t3d0s0 c0t4d0s0
d1: Concat/Stripe is setup

Because no interlace value was specified, this array uses a 16 KB interlace. One interesting use of a stripe is the creation of a metadevice that consists of only one physical device. This is called encapsulation, and is accomplished by the following:

# metainit d2 1 1 c1t0d0s0
d2: Concat/Stripe is setup

Encapsulation is useful for mirroring, as we'll discuss later.

6.5.1.3 RAID 1: mirrors

A mirror consists of between one and three submirrors. A submirror consists of one or more striped metadevices -- thus the utility of encapsulation, since we can create a single-device stripe to serve as a submirror. Like a stripe, a mirror (in DiskSuite lingo, a metamirror) requires at least three state database replicas to exist prior to creation; unlike a stripe, a mirror can be used for any filesystem. There are four cases in which you might want to create a mirror: a brand-new mirror with no existing data, a mirror of an existing filesystem that can be unmounted, a mirror of a filesystem that cannot be unmounted during regular system operation, and a mirror of the root filesystem.
Each involves a slightly different procedure. If you are creating a mirror from scratch, the procedure has three high-level steps: first, create the two encapsulations with metainit; second, use metainit to configure the metamirror; third, use metattach to add the second submirror to the metamirror:

# metainit d10 1 1 c1t0d0s0
d10: Concat/Stripe is setup
# metainit d11 1 1 c1t1d0s0
d11: Concat/Stripe is setup
# metainit d50 -m d10
d50: Mirror is setup
# metattach d50 d11
d50: Submirror d11 is attached

When the metattach command is run, it attaches the d11 metadevice to the d50 metamirror, which creates a two-way mirror and causes a mirror resynchronization. Any data on the newly attached submirror is overwritten by the master submirror during this resynchronization. You can watch the progress of the resynchronization via metastat.

If you are creating a mirror for an existing filesystem (say, /data on c2t0d0s0) and do not wish to back up and restore the filesystem, you can perform this operation with a much reduced delay. The basic idea is to create two encapsulations, create the mirror, unmount the filesystem, cause the system to refer to the mirror device rather than the physical device, mount the filesystem, and attach the second submirror:

# metainit -f d20 1 1 c2t0d0s0
d20: Concat/Stripe is setup
# metainit d21 1 1 c2t1d0s0
d21: Concat/Stripe is setup
# metainit d51 -m d20
d51: Mirror is setup
# umount /data
(Edit /etc/vfstab to reflect /data being mounted on /dev/md/dsk/d51 rather than /dev/dsk/c2t0d0s0)
# mount /data
# metattach d51 d21
d51: Submirror d21 is attached

The -f switch to metainit causes the metadevice to be created nondestructively over the existing mounted filesystem. When the metattach command is executed, the contents of the d21 metadevice are destroyed in the resulting resynchronization operation. Sometimes a filesystem cannot be unmounted during regular system operation (for example, /usr on c3t0d0s0).
The procedure for mirroring such a filesystem is essentially the same as for one that can be unmounted, except that we reboot rather than unmount and remount the filesystem:

# metainit -f d30 1 1 c3t0d0s0
d30: Concat/Stripe is setup
# metainit d31 1 1 c3t1d0s0
d31: Concat/Stripe is setup
# metainit d52 -m d30
d52: Mirror is setup
(Edit /etc/vfstab to reflect /usr being mounted on /dev/md/dsk/d52 rather than /dev/dsk/c3t0d0s0)
# reboot
...
# metattach d52 d31
d52: Submirror d31 is attached

Mirroring the root filesystem (on c4t0d0s0 in this example) incurs some particular complications. Specifically, we need to use the specialized metaroot command, as well as flush all pending, logged transactions against all filesystems prior to reboot: [7]
# metainit -f d40 1 1 c4t0d0s0
d40: Concat/Stripe is setup
# metainit d41 1 1 c4t1d0s0
d41: Concat/Stripe is setup
# metainit d53 -m d40
d53: Mirror is setup
# metaroot d53
# lockfs -fa
# reboot
...
# metattach d53 d41
d53: Submirror d41 is attached

6.5.1.4 RAID 5 arrays

There are a few things to keep in mind when creating RAID 5 metadevices: the array must consist of at least three slices, creating it destroys any data already on those slices, and the interlace size cannot be changed once the device has been created.
The good news is that it is very simple to create RAID 5 metadevices:

# metainit d60 -r c5t0d0s0 c5t1d0s0 c5t2d0s0 c5t3d0s0 c5t4d0s0
d60: RAID is setup

Because no interlace size was specified with the -i switch, this array has an interlace size of 16 KB. The metainit command then starts an initialization process, which can be monitored via metastat. You must wait for this to finish before using the new metadevice.

6.5.1.5 Hot spare pools

A hot spare pool is a collection of slices that are reserved for use as automatic substitutes in the event of a submirror or RAID 5 component failure. A hot spare must consist of slices, not metadevices. Although a hot spare slice can be associated with one or more hot spare pools, and a hot spare pool can be shared between multiple submirrors and RAID 5 arrays, each submirror or RAID 5 device can only be associated with one hot spare pool. In addition, a hot spare slice cannot contain a state database replica. Configuring a hot spare pool for use involves two steps: creating the hot spare pool and associating metadevices with it:

# metainit hsp001 c6t0d0s0 c6t1d0s0
hsp001: Hotspare pool is set up

Once the hot spare pool is created, you can associate a metadevice with it:

# metaparam -h hsp001 d10
# metaparam -h hsp001 d11

You may wish to add slices to a given hot spare pool after the initial configuration; this is accomplished with the metahs command:

# metahs -a hsp001 c6t2d0s0

If you wish to add a slice to every hot spare pool in the system, specify -all instead of a hot spare pool name.

6.5.2 Linux: md

Linux supports software disk arrays by means of the multiple device (md) driver package. We are discussing the newer version of the RAID software, which became integrated into the 2.4 production kernels.
If you are interested in using this RAID variant with 2.0 or 2.2 kernels, you should fetch a patch for your kernel, as those kernels do not have direct support for the RAID software I describe here. The Linux md software supports linear mode (strict concatenation of disks), as well as RAID 0, 1, 5, and 0+1. Before we start our discussion of the supported RAID levels and how to implement them, let's briefly talk about two underlying concepts: the persistent superblock and the chunk size.

6.5.2.1 Persistent superblocks

In older versions of the Linux md support, the software RAID toolkit would read your /etc/raidtab configuration file and initialize the array. However, this requires that the filesystem holding /etc/raidtab be mounted, which is a problem if you intend to boot from the array. The solution to this problem is the persistent superblock. When an array is initialized with the persistent-superblock option in the /etc/raidtab file, a special superblock is written to the beginning of all the disks participating in the array. This lets the kernel read the RAID device configuration directly from the disks involved, instead of from a configuration file that might not be available.

6.5.2.2 Chunk size

The notion of a chunk size is analogous to the DiskSuite concept of the interlace size; it is the smallest "atomic" unit of data that can be written to a device. For example, given a RAID 0 device with two disks and a chunk size of 4 KB, a 16 KB write will place the first and third 4 KB chunks on the first disk in the array, and the second and fourth 4 KB chunks on the second disk. The chunk size must be specified for all RAID levels, including linear mode, and is given in kilobytes. Table 6-5 is a precise description of what the chunk size specifies for each type of disk array.

Table 6-5. Descriptions of chunk size effect on disk arrays
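The chunk-to-disk mapping in the RAID 0 example above follows a simple modulo pattern. This shell sketch walks the 16 KB write through a two-disk array with 4 KB chunks (numbering chunks from zero is my own convention here):

```shell
# Chunk i of a RAID 0 write lands on disk (i mod number-of-disks).
chunk_kb=4 ndisks=2 write_kb=16
i=0
while [ $(( i * chunk_kb )) -lt "$write_kb" ]; do
    echo "chunk $i -> disk $(( i % ndisks ))"
    i=$(( i + 1 ))
done
```

Chunks 0 and 2 (the first and third) land on the first disk, and chunks 1 and 3 on the second, matching the description above.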
Good values for the chunk size vary from 32 to 128 KB, depending on the application. All disk arrays are configured by means of entries in /etc/raidtab. Let's walk through some examples of how to write the configuration file entry for each RAID type, then discuss how to create the device itself.

6.5.2.3 Linear mode

A linear mode device is simply a concatenation of existing devices. It doesn't support any of the advanced RAID features we've discussed, but can be useful if your application calls for it. An example /etc/raidtab entry for a linear array is shown in Example 6-1.

Example 6-1. /etc/raidtab entry for a linear array

raiddev /dev/md0
        raid-level              linear
        nr-raid-disks           2
        chunk-size              32
        persistent-superblock   1
        device                  /dev/sdb1
        raid-disk               0
        device                  /dev/sdc1
        raid-disk               1

Since this is the first time we've seen a sample configuration, let's go through it step by step: raiddev names the array device being defined; raid-level selects the type of array (here, linear); nr-raid-disks gives the number of member disks; chunk-size is the chunk size in kilobytes; persistent-superblock 1 enables the persistent superblock discussed earlier; and each device line names a partition, with the following raid-disk line giving that partition's index within the array.
6.5.2.4 RAID 0: stripes

Stripes are configured essentially identically to linear-mode arrays:

raiddev /dev/md0
        raid-level              0
        nr-raid-disks           2
        chunk-size              32
        persistent-superblock   1
        device                  /dev/sdb1
        raid-disk               0
        device                  /dev/sdc1
        raid-disk               1

6.5.2.5 RAID 1: mirrors

The configuration file entry for a mirrored array is quite similar to that for a stripe or a linear mode array, but with a slight twist:

raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        nr-spare-disks          1
        chunk-size              32
        persistent-superblock   1
        device                  /dev/sdb1
        raid-disk               0
        device                  /dev/sdc1
        raid-disk               1
        device                  /dev/sdd1
        spare-disk              0

There are two additional, closely related parameters here: nr-spare-disks gives the number of spare disks in the array, and each spare-disk line designates a device to be brought in automatically as a replacement when an active mirror member fails.
6.5.2.6 RAID 5 arrays

You must have at least three disks to build a RAID 5 array. The /etc/raidtab entry looks like this:

raiddev /dev/md0
        raid-level              5
        nr-raid-disks           7
        nr-spare-disks          0
        chunk-size              128
        persistent-superblock   1
        parity-algorithm        left-symmetric
        device                  /dev/sdb1
        raid-disk               0
        device                  /dev/sdc1
        raid-disk               1
        device                  /dev/sdd1
        raid-disk               2
        device                  /dev/sde1
        raid-disk               3
        device                  /dev/sdf1
        raid-disk               4
        device                  /dev/sdg1
        raid-disk               5
        device                  /dev/sdh1
        raid-disk               6

There is one new specification for a RAID 5 array, the parity-algorithm line; leave it set to left-symmetric. You can add spare disks if you wish.

6.5.2.7 Creating the array

Once you've written the /etc/raidtab file, you can create the array by running mkraid device on the array device. For example, to create any of the arrays I have described, you would use mkraid /dev/md0. Disk arrays that require a complex initialization process (RAID 1 and 5) will start to resynchronize; this has no impact on the system's functionality whatsoever (you can mke2fs the new filesystem, work with it, etc.). You can get information on the state of any RAID device from the /proc/mdstat file. Once you have your array device working properly, use the raidstop and raidstart commands to control it.

There are two important parameters to pass to mke2fs when you construct your filesystem. The first is the block size in bytes, specified by -b; this value should be at least 4,096 (4 KB). The second, which is only useful for RAID 5 devices, is -R stride=N, which allows mke2fs to better place its data structures on the device. For example, a chunk size of 64 KB means that 64 KB of contiguous data will reside on one disk. If we want to build a filesystem with a 4 KB block size, we realize that there will be 16 filesystem blocks in one array chunk, and we can pass this information on to the mke2fs utility:

# mke2fs -b 4096 -R stride=16 /dev/md0

RAID 5 performance is heavily influenced by this option.
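The stride arithmetic above is just a ratio. Here is a small shell sketch, using the values from the example, that derives the mke2fs invocation:

```shell
# stride = chunk size / filesystem block size.
chunk_kb=64
block_bytes=4096
stride=$(( chunk_kb * 1024 / block_bytes ))
echo "mke2fs -b $block_bytes -R stride=$stride /dev/md0"
```

For a 128 KB chunk size (as in the RAID 5 raidtab above) and the same 4 KB blocks, the same arithmetic yields stride=32.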
6.5.2.8 Autodetection

Autodetection allows the kernel to automatically detect and start up RAID devices at system boot. For autodetection to work, you must meet three criteria: autodetection support must be present in the kernel, the partitions that make up the array must have their type set to 0xfd (Linux raid autodetect), and the array must have been created with a persistent superblock.
If all of these criteria are met, you can simply reboot the system -- autodetection should work normally (cat /proc/mdstat to be sure). Autostarted devices are automatically stopped at shutdown. You can now use /dev/md devices just as you would any other devices.

6.5.2.9 Booting from an array device

If you are interested in booting from an array device, there are a few things that you have to take care of. The first is that the kernel used to boot must have the RAID support available to it. There are two ways to do this: either compile a new kernel with the RAID support linked in directly (not built as a loadable module), or instruct LILO to use a RAM disk that contains all the kernel modules necessary to mount the root partition. The RAM disk is built with the mkinitrd --with=module ramdisk-name kernel command. For example, if your root partition resides on a RAID 5 device, you could accomplish this by running mkinitrd --with=raid5 raid5-ramdisk 2.4.5.

There is another problem; namely, that LILO 0.21 (the current distribution) does not understand how to deal with a RAID device. As a result, the kernel cannot be read from a RAID device, and your /boot filesystem will have to reside on a non-RAID device. A patch to LILO exists, included with Red Hat 6.1, that can handle booting from a kernel located on a RAID 1 array only. This patch can be obtained from dist/redhat-6.1/SRPMS/SRPMS/lilo-0.21-10.src.rpm on any Red Hat mirror. The patched version of LILO will accept boot=/dev/md0 in lilo.conf and will make each disk in the mirror bootable.

You must now set up a root filesystem on a RAID device. This method assumes that you have a spare disk that you can install the system on, which isn't part of the RAID you will configure.
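To make the pieces concrete, here is a hypothetical lilo.conf fragment for the patched LILO, assuming /boot lives on a RAID 1 array at /dev/md0, the root filesystem lives on a RAID 5 array at /dev/md1, and the RAM disk was built as in the mkinitrd example above. All device names and the kernel path are illustrative, not prescriptive:

```
# Hypothetical lilo.conf for the Red Hat-patched LILO described above.
# /dev/md0 is the RAID 1 array holding /boot; /dev/md1 is the RAID 5 root.
boot=/dev/md0
image=/boot/vmlinuz-2.4.5
    label=linux
    root=/dev/md1
    initrd=/boot/raid5-ramdisk
    read-only
```

The patched LILO writes a boot block to each disk in the /dev/md0 mirror, so the machine can still boot if one member of that array fails.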
Once you have the system installed, RAID support working, and the RAID device created via the normal procedures we've discussed and persisting properly through reboot, you can run mke2fs to create a filesystem on the new array device and mount it (say, on /newroot). You now have to copy the contents of your current root filesystem (on the spare disk) to the new root filesystem (the array). One good way to do this is via find and cpio:

# cd /
# find . -xdev | cpio -pm /newroot

Now edit the /newroot/etc/fstab file to supply the correct boot device. The last step is to update LILO. Unmount the current /boot filesystem, and mount the boot device on /newroot/boot. Modify /newroot/etc/lilo.conf to point to the correct devices. Note that the boot device must still be a regular (non-RAID) device, but the root device should point to your new RAID. When you're finished, run lilo -r /newroot. You can then reboot your system, which will boot from the new RAID device.