6.5 Software RAID Implementations

Many people turn up their noses at software RAID implementations, citing concerns about performance and reliability. In general, I have found these considerations to be somewhat overstated. While software RAID solutions are often slightly idiosyncratic in their implementation and require some care in deployment, they are generally not as slow as their reputation indicates, particularly for non-parity-based disk arrays. Their chief shortcoming is probably their inability to provide caching at the device level. We'll briefly discuss the two most popular software array products, Solstice DiskSuite and the Linux md package.

6.5.1 Solaris: Solstice DiskSuite

Solstice DiskSuite revolves around the idea of metadevices. A metadevice is a group of physical slices that are addressed by the system as a single, logical device. These metadevices are treated just like physical devices. Metadevice names begin with the letter d. The block metadevices reside in /dev/md/dsk (e.g., /dev/md/dsk/d0) and the raw metadevices reside in /dev/md/rdsk (e.g., /dev/md/rdsk/d0). By default, the system allows 128 metadevices, but the software supports up to 1,024. [5] One thing I have found to be helpful is to use the metadevice name to encode the physical device location. For example, under such a scheme, the disk with ID 4 on controller 1 would be metadevice d14. The component I/O requests to and from physical devices are handled by the metadisk driver. Because these metadevices appear as physical devices, you will still need to use newfs to create a filesystem after setting them up.

[5] Increasing the number of metadevices involves changing the nmd parameter in /kernel/drv/md.conf .
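For reference, here is a sketch of what the edited line in /kernel/drv/md.conf might look like; the other properties on the line vary by release, and only the nmd value matters here. A reconfiguration reboot (boot -r) is required for the change to take effect.

name="md" parent="pseudo" nmd=256 md_nsets=4;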

SDS has two different ways of accomplishing the same tasks. There is a graphical interface, invoked by metatool, and there are command-line utilities, which I focus on here. However, SDS is an extremely complex product: what we discuss here is only the essential heart of SDS configuration. Sun publishes two excellent technical references on this product, the Solstice DiskSuite 4.2 Reference Guide and the Solstice DiskSuite 4.2 User's Guide.

6.5.1.1 State databases

SDS stores its configuration information in a state database , also called a metainformation database . Because this information is integral to the proper functioning of the product, multiple copies ( replicas ) are stored on disk; when the data is written, each copy is updated in sequence. At least three, and at most fifty replicas must exist. By default, these replicas are 517 KB each.

In order to ensure data integrity, SDS implements a majority consensus algorithm for the replicas, in which a majority (half, rounded down, plus one) of the replicas must be available for the system to function. If this criterion is not met, the database is considered stale, and the following rules apply:

  • The system will function normally if at least half of the replicas are active.

  • The system will panic if less than half of the replicas are available.

  • The system will not reboot without a majority of the replicas available.

As a result, the replicas should be spread across as many disks and controllers as possible, in order to limit the chances that a single component will fail and cause the entire database to become stale.

Replicas typically reside on either dedicated partitions or on slices that are parts of metadevices. [6] I have always put them on slice 3 or slice 7, but there is no technical reason for this. As a general rule, the total number of replicas should be determined by Table 6-2.

[6] If you place a replica on a slice that is being used for a metadevice, you must add the replica to the slice before you add the slice to the metadevice.

Table 6-2. Choosing how many replicas to configure

Number of disks    Number of replicas    Placement
1                  3                     All on a single slice
2-4                4-8                   2 per disk
5 or more          5+                    1 per disk

The state database replicas are managed through the metadb command. This command is rather complex, and its behavior is summarized in Table 6-3.

Table 6-3. Summary of metadb usage

Command                  Description
metadb                   Lists all configured replicas
metadb -i                Lists only inactive replicas
metadb -a -f slice       Creates new replicas
metadb -a slice          Creates additional replicas
metadb -d -f slice       Removes a failed replica

In addition, you may specify -c number-of-copies with either of the creation commands to add more than one replica to the slice. If the default size for each replica is too small, you may pass the -l size option to metadb to request a larger one.
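For example, here is a sketch of creating two double-sized replicas on a hypothetical slice; the -l argument is given in disk blocks, and 1034 blocks is the 517 KB default, so 2068 doubles it:

# metadb -a -c 2 -l 2068 c0t3d0s7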

Using metadb without arguments reports the condition of all replicas:

# metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t0d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t0d0s7
     a    pc luo        16              1034            /dev/dsk/c0t8d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t8d0s7

Flags that are in lowercase letters imply an "Okay" status; flags that are in uppercase letters are indicative of a problem. The possible flags are summed up in Table 6-4.

Table 6-4. metadb status flags

Flag    Description
o       Replica active prior to the last configuration change
u       Replica is up to date
l       Replica locator was read successfully
c       Replica's location was in /etc/opt/SUNWmd/mddb.cf
p       Replica's location was in /etc/system
a       Replica is active
m       Replica is the master replica
W       Replica has write errors
R       Replica has read errors
M       Replica has problems with master blocks
D       Replica has problems with data blocks
F       Replica has format problems
S       Replica is too small to hold the current state database

You can easily find inactive database replicas by using metadb -i. If the state database is stale (usually because of a disk failure), you must repair the problem: use metadb -i to report the inactive replicas, noting how many copies are on each affected slice, and then use metadb -d -f replica-slice to remove them. Now, reboot the system and replace the failed disk; when the system comes back up, re-add the replicas with metadb -a -c number-of-copies replica-slice.
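Here is a sketch of the whole recovery sequence, assuming the failed disk held its replicas on the hypothetical slice c0t2d0s7 and that two copies were configured per slice:

# metadb -i                     (identify the replicas flagged with errors)
# metadb -d -f c0t2d0s7         (remove the stale replicas)
# reboot
(replace and repartition the failed disk)
# metadb -a -c 2 c0t2d0s7       (recreate the replicas on the new disk)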

In order to initially create the state database, run metadb with the -a and -f switches. If you have fewer than five disks, you will want to create more than one copy per slice by specifying -c number-of-copies. This is done to maintain quorum in the event of a disk failure. You can't create a state database on a partition that has an existing filesystem.

Here's an example of initially creating state database replicas on a system with three disks, and two copies per slice:

# metadb -a -f -c 2 c0t0d0s7 c0t1d0s7 c0t2d0s7
# metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t0d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t0d0s7
     a    pc luo        16              1034            /dev/dsk/c0t1d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t1d0s7
     a    pc luo        16              1034            /dev/dsk/c0t2d0s7
     a    pc luo        1050            1034            /dev/dsk/c0t2d0s7

You may want to add additional state database replicas after the initial creation, by leaving out the -f switch. The general format of the command is metadb -a -c number-of-copies replica-slice .

Once you have the state database set up, you can start working with metadevices.

6.5.1.2 RAID 0: stripes

You can only use a stripe to hold a filesystem that is not needed at boot time or during an upgrade/install process; this rules out /, swap, /usr, /var, and /opt. A prerequisite to creating a stripe is the existence of at least three stable state database replicas. Creating a stripe involves using the metainit command:

# metainit d0 1 2 c0t1d0s0 c0t2d0s0 -i 64k
d0: Concat/Stripe is setup

This creates a metadevice d0 , which is a single (1) stripe of two (2) slices, namely c0t1d0s0 and c0t2d0s0 . The optional -i flag sets the interlace size to 64 KB; by default, the interlace is set to 16 KB. Here's an example of creating a four-device stripe:

# metainit d1 1 4 c0t1d0s0 c0t2d0s0 c0t3d0s0 c0t4d0s0
d1: Concat/Stripe is setup

Because there was no specified interlace value, this array uses a 16 KB interlace.
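As noted earlier, a metadevice is just a raw container; here is a minimal sketch of putting the new stripe into service (the mount point is hypothetical):

# newfs /dev/md/rdsk/d1
# mount /dev/md/dsk/d1 /export/data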

One interesting use of a stripe is the creation of metadevices that consist of only one physical device. This is called encapsulation, and is accomplished by the following:

# metainit d2 1 1 c1t0d0s0
d2: Concat/Stripe is setup

Encapsulation is useful for mirroring, as we'll discuss later.

6.5.1.3 RAID 1: mirrors

A mirror consists of between one and three submirrors. A submirror consists of one or more striped metadevices -- hence the utility of encapsulation, since a single-device stripe can serve as a submirror. Unlike a stripe, you can use a mirror (in DiskSuite lingo, a metamirror) for any filesystem; like a stripe, a mirror requires at least three state database replicas to exist prior to creation.

There are four cases in which you might want to create a mirror:

  • From scratch (unused slices)

  • From a filesystem that can be unmounted

  • From a filesystem that cannot be unmounted

  • From the root filesystem

Each involves a slightly different procedure.

If you are creating a mirror from scratch, you can follow this procedure, which has three high-level steps. First, create the two encapsulations with metainit ; second, use metainit to configure the metamirror; third, use metattach to add the second submirror to the metamirror:

# metainit d10 1 1 c1t0d0s0
d10: Concat/Stripe is setup
# metainit d11 1 1 c1t1d0s0
d11: Concat/Stripe is setup
# metainit d50 -m d10
d50: Mirror is setup
# metattach d50 d11
d50: Submirror d11 is attached

When the metattach command is run, it attaches the d11 metadevice to the d50 metamirror, which creates a two-way mirror and causes a mirror resynchronization. Any data on the newly attached submirror is overwritten by the master submirror during this resynchronization. You can watch the progress of the resynchronization via metastat .

If you are creating a mirror for an existing filesystem (say, /data on c2t0d0s0 ) and do not wish to back up and restore the filesystem, you can perform this operation with a much reduced delay. The basic idea is to create two encapsulations, create the mirror, unmount the filesystem, cause the system to refer to the mirror device rather than the physical device, mount the filesystem, and attach the second submirror:

# metainit -f d20 1 1 c2t0d0s0
d20: Concat/Stripe is setup
# metainit d21 1 1 c2t1d0s0
d21: Concat/Stripe is setup
# metainit d51 -m d20
d51: Mirror is setup
# umount /data
(Edit /etc/vfstab so that /data is mounted from /dev/md/dsk/d51 rather than /dev/dsk/c2t0d0s0)
# mount /data
# metattach d51 d21
d51: Submirror d21 is attached

The -f switch to metainit causes the metadevice to be created nondestructively over the existing mounted filesystem. When the metattach command is executed, the contents of the d21 metadevice are destroyed in the resulting resynchronization operation.
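For reference, the /etc/vfstab change amounts to swapping the physical device paths for the metadevice paths. Assuming /data was originally mounted from c2t0d0s0, the entry would change roughly as follows (the fsck pass and mount options shown are illustrative):

/dev/dsk/c2t0d0s0    /dev/rdsk/c2t0d0s0    /data    ufs    2    yes    -

becomes:

/dev/md/dsk/d51      /dev/md/rdsk/d51      /data    ufs    2    yes    -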

Sometimes a filesystem cannot be unmounted during regular system operation (for example, /usr on c3t0d0s0 ). The procedure for mirroring such a filesystem is essentially the same as mirroring an unmountable filesystem, except that we will reboot rather than unmount and remount the filesystem:

# metainit -f d30 1 1 c3t0d0s0
d30: Concat/Stripe is setup
# metainit d31 1 1 c3t1d0s0
d31: Concat/Stripe is setup
# metainit d52 -m d30
d52: Mirror is setup
(Edit /etc/vfstab so that /usr is mounted from /dev/md/dsk/d52 rather than /dev/dsk/c3t0d0s0)
# reboot
...
# metattach d52 d31
d52: Submirror d31 is attached

Mirroring the root filesystem (on c4t0d0s0 in this example) incurs some particular complications. Specifically, we need to use the specialized metaroot command, and we must flush all pending logged transactions against all filesystems prior to rebooting: [7]

[7] This procedure will only work on Solaris/SPARC systems. If you are running Solaris/i86pc, please consult the Sun documentation for a similar procedure.

# metainit -f d40 1 1 c4t0d0s0
d40: Concat/Stripe is setup
# metainit d41 1 1 c4t1d0s0
d41: Concat/Stripe is setup
# metainit d53 -m d40
d53: Mirror is setup
# metaroot d53
# lockfs -fa
# reboot
...
# metattach d53 d41
d53: Submirror d41 is attached
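The metaroot command edits /etc/vfstab and /etc/system for you, so no manual vfstab change is needed for the root filesystem. After it runs, the root entry in /etc/vfstab should look something like the following sketch (the fsck pass and mount options shown are the usual defaults):

/dev/md/dsk/d53    /dev/md/rdsk/d53    /    ufs    1    no    -
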
6.5.1.4 RAID 5 arrays

There are a few things to keep in mind when creating RAID 5 metadevices:

  • The metadevice must consist of at least three slices.

  • The interlace value is critical to good RAID 5 performance; see Section 6.2.6 earlier in this chapter for details.

  • The more slices a RAID 5 metadevice contains, the longer operations will take when a slice has failed.

  • RAID 5 metadevices cannot be striped or mirrored within SDS.

The good news is that it is very simple to create RAID 5 metadevices:

# metainit d60 -r c5t0d0s0 c5t1d0s0 c5t2d0s0 c5t3d0s0 c5t4d0s0
d60: RAID is setup

Because no interlace size was specified with the -i switch, this array has an interlace size of 16 KB. The metainit command then starts an initialization process, which can be monitored via metastat . You must wait for this to finish before using the new metadevice.
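Here is a sketch of finishing the job once the initialization completes (the mount point is hypothetical):

# metastat d60                  (repeat until the initialization is reported complete)
# newfs /dev/md/rdsk/d60
# mount /dev/md/dsk/d60 /export/raid5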

6.5.1.5 Hot spare pools

A hot spare pool is a collection of slices that are reserved for use as automatic substitutes in the event of a submirror or RAID 5 component failure. A hot spare device must consist of slices, not metadevices. Although a hot spare slice can be associated with one or more hot spare pools, and a hot spare pool can be shared between multiple submirrors and RAID 5 arrays, each submirror or RAID 5 device can only be associated with one hot spare pool. In addition, a hot spare slice cannot contain a state database replica.

Configuring a hot spare pool for use involves two steps: creating the hot spare pool and associating metadevices with the hot spare pool:

# metainit hsp001 c6t0d0s0 c6t1d0s0
hsp001: Hotspare pool is set up

Once the hot spare pool is created, you can associate a metadevice with it:

# metaparam -h hsp001 d10
# metaparam -h hsp001 d11

You may wish to add additional slices to a given hot spare pool after the initial configuration; this is accomplished by using the metahs command:

# metahs -a hsp001 c6t2d0s0

If you wish to add a slice to every hot spare pool in the system, specify -all instead of a hot spare pool name.

6.5.2 Linux: md

Linux supports software disk arrays by means of the multiple device ( md ) driver package. We are discussing the newer version of the RAID software, which became integrated into the 2.4 production kernels. If you are interested in using this RAID variant with 2.0 or 2.2 kernels, you should fetch a patch for your kernel, as those kernels do not have direct support for the RAID software I describe here. The Linux md software supports linear mode (or strict concatenation of disks), as well as RAID 0, 1, 5, and 0+1.

Before we start our discussion of the RAID levels supported and how to implement them, let's briefly talk about two underlying principles: persistent superblock and chunk size.

6.5.2.1 Persistent superblocks

In older versions of the Linux md support, the software RAID toolkit would read your /etc/raidtab configuration file and initialize the array. However, this requires that the filesystem containing /etc/raidtab be mounted, which is a problem if you intend to boot from the array. The solution to this problem is the persistent superblock. When an array is initialized with the persistent-superblock option in the /etc/raidtab file, a special superblock is written to the beginning of all the disks participating in the array. This lets the kernel read the RAID device configuration directly from the disks involved, instead of reading from a configuration file that might not be available.

6.5.2.2 Chunk size

The notion of a chunk size is analogous to the DiskSuite concept of the interlace size; it is the smallest "atomic" unit of data that can be written to a device. For example, a RAID 0 device with two disks and a chunk size of 4 KB that incurs a 16 KB write will have the first and third 4 KB chunks written to the first disk in the array, and the second and fourth 4 KB chunks written to the second disk. The chunk size must be specified for all RAID levels, including linear mode, and is specified in kilobytes. Table 6-5 is a precise description of what the chunk size means for each type of disk array.

Table 6-5. Descriptions of chunk size effect on disk arrays

Linear
  None whatsoever.

RAID 0
  Chunk-sized blocks are written to each disk, serially. For example, a write of 24 KB to an array of two disks with a chunk size of 8 KB will mean that 8 KB is written to both disks, in parallel, and then the remaining 8 KB is written to the first disk.

RAID 1
  For writes, the chunk size is not important, since all data must be written to all disks. For reads, the chunk size specifies how much data is read from each participating disk; since all active disks contain the same data, reads can be parallelized in a manner akin to a striped array.

RAID 5
  RAID 5 writes must also update parity information. The chunk size is the size of the parity blocks: if even one byte is written to the array, chunk-sized blocks will be read from the data disks, the parity information recalculated, and a chunk-sized block of parity data written back.

Good values for the chunk size vary from 32 to 128 KB, depending on the application.

All disk arrays are configured by means of entries in /etc/raidtab . Let's walk through some examples of how to write the configuration file entry for each RAID type, then discuss how to create the device itself.

6.5.2.3 Linear mode

A linear mode device is simply a concatenation of existing devices. It doesn't support any of the advanced RAID features we've discussed, but can be useful if your application calls for it. An example /etc/raidtab entry for a linear array is shown in Example 6-1.

Example 6-1. /etc/raidtab entry for a linear array
raiddev /dev/md0
    raid-level               linear
    nr-raid-disks            2
    chunk-size               32
    persistent-superblock    1
    device                   /dev/sdb1
    raid-disk                0
    device                   /dev/sdc1
    raid-disk                1

Since this is the first time we've seen a sample configuration, let's go through it step-by-step:

  • The first line defines the device we're creating, in this case /dev/md0 .

  • The raid-level line describes the RAID level of this particular array device.

  • The nr-raid-disks line defines how many disks participate in the array.

  • The chunk-size parameter sets the chunk size, in KB.

  • The persistent-superblock line acts as a Boolean toggle for whether a persistent superblock should be created. It should always be set to 1.

  • The next four lines are device/RAID disk number pairs. They must always exist in pairs. The device line is a path to a disk partition, whereas raid-disk is the number of that disk in the array as a whole.

6.5.2.4 RAID 0: stripes

Stripes are configured essentially identically to linear-mode arrays:

raiddev /dev/md0
    raid-level               0
    nr-raid-disks            2
    chunk-size               32
    persistent-superblock    1
    device                   /dev/sdb1
    raid-disk                0
    device                   /dev/sdc1
    raid-disk                1

6.5.2.5 RAID 1: mirrors

The configuration file entry for a mirrored array is quite similar to that for a stripe or a linear mode array, but with a slight twist:

raiddev /dev/md0
    raid-level               1
    nr-raid-disks            2
    nr-spare-disks           1
    chunk-size               32
    persistent-superblock    1
    device                   /dev/sdb1
    raid-disk                0
    device                   /dev/sdc1
    raid-disk                1
    device                   /dev/sdd1
    spare-disk               0

There are two additional, closely related parameters here (the nr-spare-disks line and the final device/spare-disk pair):

  • The spare-disk line follows a device line, and specifies that the named disk is to be used as a hot spare. If one disk in the mirror fails, data will be reconstructed onto the spare disk.

  • The nr-spare-disks line specifies how many spare disks are attached to this array.

6.5.2.6 RAID 5 arrays

You must have at least three disks to build a RAID 5 array. The /etc/raidtab entry looks like this:

raiddev /dev/md0
    raid-level               5
    nr-raid-disks            7
    nr-spare-disks           0
    chunk-size               128
    persistent-superblock    1
    parity-algorithm         left-symmetric
    device                   /dev/sdb1
    raid-disk                0
    device                   /dev/sdc1
    raid-disk                1
    device                   /dev/sdd1
    raid-disk                2
    device                   /dev/sde1
    raid-disk                3
    device                   /dev/sdf1
    raid-disk                4
    device                   /dev/sdg1
    raid-disk                5
    device                   /dev/sdh1
    raid-disk                6

There is one new specification for a RAID 5 array, the parity-algorithm line. Leave it set to left-symmetric . You can add spare disks if you wish.

6.5.2.7 Creating the array

Once you've written the /etc/raidtab file, you can create the array by running mkraid on the array device. For example, to create any of the arrays I have described, you would use mkraid /dev/md0. Disk arrays that require a complex initialization process (RAID 1 and 5) will start to resynchronize; this has no impact on the system's functionality whatsoever (you can mke2fs the new device, work with it, etc.). You can get information on the state of any RAID device by means of the /proc/mdstat file. Once you have your array device working properly, you use the raidstop and raidstart commands to control it.
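Putting the steps together, here is a minimal sketch of bringing up one of the arrays described above (the mount point is hypothetical, and the mke2fs options are discussed next):

# mkraid /dev/md0
# cat /proc/mdstat              (check the array's state; a resync may be in progress)
# mke2fs -b 4096 /dev/md0
# mount /dev/md0 /data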

There are two important parameters to pass to mke2fs when you construct your filesystem. The first is the block size in bytes, specified by -b; this value should be at least 4,096 (4 KB). The second, which is only available for RAID 5 devices, is -R stride=N, which allows mke2fs to better place its data structures on the device. For example, if the chunk size is 64 KB, then 64 KB of contiguous data resides on one disk; if we build a filesystem with a 4 KB block size, there will be 16 filesystem blocks in one array chunk. We can then pass this information on to the mke2fs utility:

# mke2fs -b 4096 -R stride=16 /dev/md0

RAID 5 performance is heavily influenced by this option.

6.5.2.8 Autodetection

Autodetection allows the kernel to automatically detect and start up RAID devices on system boot. In order for autodetection to work, you must meet three criteria:

  1. Autodetection support must be compiled into the kernel.

  2. The RAID devices must have the persistent superblock enabled.

  3. The partition types of the RAID partitions must be set to 0xfd .

If all of these criteria are met, you can simply reboot the system -- autodetection should work normally ( cat /proc/mdstat to be sure). Autostarted devices are automatically stopped at shutdown. You can now use /dev/md devices just as you would any other devices.
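For the third criterion, the partition type is changed with fdisk; here is a sketch, assuming the array members are partitions on /dev/sdb and /dev/sdc (use the t command within fdisk, enter fd as the type, then w to write the table):

# fdisk /dev/sdb
# fdisk /dev/sdc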

6.5.2.9 Booting from an array device

If you are interested in booting from an array device, there are a few things that you have to take care of. The first is that the kernel used to boot must have the RAID support available to it. There are two ways that you can do this: either compile a new kernel that has the RAID support linked in directly (not built as a loadable module), or instruct LILO to use an initial RAM disk that contains all the kernel modules necessary to mount the root partition. The RAM disk is built with the mkinitrd --with=module ramdisk-name kernel command. For example, if your root partition resides on a RAID 5 device, you would be able to accomplish this by running mkinitrd --with=raid5 raid5-ramdisk 2.4.5.
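If you take the initial RAM disk route, the matching lilo.conf stanza would look something like the following sketch; the kernel image path, RAM disk path, and label are assumptions:

image=/boot/vmlinuz-2.4.5
    label=linux
    initrd=/boot/raid5-ramdisk
    root=/dev/md0
    read-only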

There is another problem; namely, that LILO 0.21 (the current distribution) does not understand how to deal with a RAID device. As a result, the kernel cannot be read from a RAID device, and your /boot filesystem will have to reside on a non-RAID device. A patch to LILO exists, included with Red Hat 6.1, that can handle booting from a kernel located on a RAID 1 array only. This patch can be obtained from dist/redhat-6.1/SRPMS/SRPMS/lilo-0.21-10.src.rpm on any Red Hat mirror. The patched version of LILO will accept boot=/dev/md0 in lilo.conf and will make each disk in the mirror bootable.

You must now set up a root filesystem on a RAID device. This method assumes that you will have a spare disk that you can install the system on, which isn't part of the RAID you will configure. Once you have the system installed, RAID support working, and the RAID device created via the normal procedures we've discussed and persisting properly through reboot, you can run mke2fs to create a filesystem on the new array device, and mount it (say to /newroot ). You now have to copy the contents of your current root filesystem (on the spare disk) to the new root filesystem (the array). One good way to do this is via find and cpio :

# cd /
# find . -xdev | cpio -pm /newroot

Now edit the /newroot/etc/fstab file to supply the correct root device.

The last step is to update LILO. Unmount the current /boot filesystem, and mount the boot device on /newroot/boot. Modify /newroot/etc/lilo.conf to point to the correct devices. Note that the boot device must still be a regular (non-RAID) device, but the root device should point to your new RAID. When you're finished, run lilo -r /newroot. You can then reboot your system, which will boot from the new RAID device.


