Hack 47. Combine LVM and Software RAID

Combining the flexibility of LVM with the redundancy of RAID is the right thing for critical file servers.

RAID (Redundant Array of Inexpensive Disks or Redundant Array of Independent Disks, depending on who you ask) is a hardware and/or software mechanism used to improve the performance and maintainability of large amounts of disk storage through some extremely clever mechanisms. As the name suggests, RAID makes a large number of smaller disks (referred to as a RAID array) appear to be one or more large disks as far as the operating system is concerned. RAID was also designed to provide both performance and protection against the failure of any single disk in your system, which it does by providing its own internal volume management interface.

RAID is provided by specialized disk controller hardware, by system-level software, or by some combination of both. The support for software RAID under Linux is known as the multiple device (md) interface. Hardware RAID has performance advantages over software RAID, but it can be a problem in enterprise environments because hardware RAID implementations are almost always specific to the hardware controller you are using. While most newer hardware RAID controllers from a given manufacturer are compatible with their previous offerings, there's never any real guarantee of this, and product lines do occasionally change. I prefer to use the software RAID support provided by Linux, for a number of reasons:

  • It's completely independent of the disk controllers you're using.

  • It provides the same interface and customization mechanisms across all Linux distributions.

  • Performance is actually quite good.

  • It can be combined with Linux Logical Volume Management (LVM) to provide a powerful, flexible mechanism for storage expansion and management.

Hardware RAID arrays usually enable you to remove and replace failed drives without shutting down your system. This is known as hot swapping, because you can swap out drives while the system is running (i.e., "hot"). Hot swapping is supported by software RAID, but whether or not it's possible depends on the drive hardware you're using. If you're using removable or external FireWire, SCSI, or USB drives with software RAID (though most USB drives are too slow for this purpose), you can remove and replace failed drives on these interfaces without shutting down your system.

5.3.1. Mirroring and Redundancy

To support the removal and replacement of drives without anyone but you noticing, RAID provides services such as mirroring, which is the ability to support multiple volumes that are exact, real-time copies of each other. If a mirrored drive (or a drive that is part of a mirrored volume) fails or is taken offline for any other reason, the RAID system automatically begins using the failed drive's mirror, and no one notices its absence (except for the sysadmins who have to scurry for a replacement).

As protection against single-device failures, most RAID levels support the use of spare disks in addition to mirroring. Mirroring protects you when a single device in a RAID array fails, but at this point, you are immediately vulnerable to the failure of any other device that holds data for which no mirror is currently available. RAID's use of spare disks is designed to immediately reduce this vulnerability. In the event of a device failure, the RAID subsystem immediately allocates one of the spare disks and begins creating a new mirror there for you. When using spare disks in conjunction with mirroring, you really only have a non-mirrored disk array for the amount of time it takes to clone the mirror to the spare disk. However, as explained in the next section, the automatic use of spare disks is supported only for specific RAID levels.

RAID is not a replacement for doing backups. RAID ensures that your systems can continue functioning and that users and applications can have uninterrupted access to mirrored data in the event of device failure. However, the simultaneous failure of multiple devices in a RAID array can still take your system down and make the data that was stored on those devices unavailable. If your primary storage fails, only systems that you have backed up (and whose data can therefore be restored onto new disks) are guaranteed to come back up.


5.3.2. Overview of RAID Levels

The different capabilities provided by hardware and software RAID are grouped into what are known as different RAID levels. The following list describes the most common of these (for information about other RAID levels or more detailed information about the ones listed here, grab a book on RAID and some stimulants to keep you awake):


RAID-0

Often called stripe mode, volumes are created in parallel across all of the devices that are part of the RAID array, allocating storage from each in order to provide as many opportunities for parallel reads and writes as possible. This RAID level is strictly for performance and does not provide any redundancy in the event of a hardware failure.


RAID-1

Usually known as mirroring, volumes are created on single devices and exact copies (mirrors) of those volumes are maintained in order to provide protection from the failure of a single disk through redundancy. For this reason, you cannot create a RAID-1 volume that is larger than the smallest device that makes up a part of the RAID array. However, as explained in this hack, you can combine Linux LVM with RAID-1 to overcome this limitation.


RAID-4

RAID-4 is a fairly uncommon RAID level that requires three or more devices in the RAID array. One of the drives is used to store parity information that can be used to reconstruct the data on a failed drive in the array. Unfortunately, storing this parity information on a single drive exposes this drive as a potential single point of failure.


RAID-5

One of the most popular RAID levels, RAID-5 requires three or more devices in the RAID array and provides redundancy through parity information without restricting that parity information to a single device. Parity information is distributed across all of the devices in the RAID array, removing the bottleneck and potential single point of failure in RAID-4.


RAID-10

A high-performance modern RAID option, RAID-10 provides striped mirrors, essentially a RAID-0 stripe laid across two or more RAID-1 mirrored pairs. The use of striping offsets the potential performance degradation of mirroring and doesn't require calculating or maintaining parity information anywhere.

In addition to these RAID levels, Linux software RAID also supports linear mode, which is the ability to concatenate two devices and treat them as a single large device. This is rarely used any more because it provides no redundancy and is functionally identical to the capabilities provided by LVM.

5.3.3. Combining Software RAID and LVM

Now we come to the conceptual meat of this hack. Native RAID devices cannot be partitioned. Therefore, unless you go to a hardware RAID solution, the software RAID modes that enable you to concatenate drives and create large volumes don't provide the redundancy that RAID is intended to provide. Many of the hardware RAID solutions available on motherboards export RAID devices only as single volumes, due to the absence of onboard volume management software. RAID array vendors get around this by selling RAID arrays that have built-in software (which is often Linux-based) that supports partitioning using an internal LVM package. However, you can do this yourself by layering Linux LVM over the RAID disks in your systems; in other words, by using software RAID drives as physical volumes that you then allocate and export to your system as logical volumes. Voilà! Combining RAID and LVM gives you flexible volume management with the warm fuzzy feeling of redundancy provided by RAID levels such as 1, 5, and 10. It just doesn't get much better than that.

5.3.4. Creating RAID Devices

RAID devices are created by first defining them in the file /etc/raidtab and then using the mkraid command to actually create the RAID devices specified in the configuration file.

For example, the following /etc/raidtab file defines a linear RAID array composed of the physical devices /dev/hda6 and /dev/hdb5:

 raiddev /dev/md0
     raid-level              linear
     nr-raid-disks           2
     chunk-size              32
     persistent-superblock   1
     device                  /dev/hda6
     raid-disk               0
     device                  /dev/hdb5
     raid-disk               1

Executing the mkraid command to create the device /dev/md0 would produce output like the following:

 # mkraid /dev/md0
 handling MD device /dev/md0
 analyzing super-block
 disk 0: /dev/hda6, 10241406kB, raid superblock at 10241280kB
 disk 1: /dev/hdb5, 12056751kB, raid superblock at 12056640kB

If you are recycling drives that you have previously used for some other purpose on your system, the mkraid command may complain about finding existing filesystems on the disks that you are allocating to your new RAID device. Double-check that you have specified the right disks in your /etc/raidtab file, and then use the mkraid command's -f (force) option to make it use the drives regardless.

At this point, you can create your favorite type of filesystem on the device /dev/md0 by using the mkfs command and specifying the type of filesystem by using the appropriate -t type option. After creating your filesystem, you can then update the /etc/fstab file to mount the new volume wherever you want, and you're in business.
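As a minimal sketch of that step (the ext3 filesystem type and the /mnt/data mount point here are illustrative choices, not part of the original example), the commands might look like this:

 # mkfs -t ext3 /dev/md0
 # mkdir /mnt/data
 # mount /dev/md0 /mnt/data

A matching /etc/fstab entry along these lines would then mount the volume automatically at boot time:

 /dev/md0    /mnt/data    ext3    defaults    0 0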

A linear RAID array is RAID at its most primitive, and isn't really useful now that Linux provides mature logical volume support. The /etc/raidtab configuration file for a RAID-1 (mirroring) RAID array that mirrors the single-partition disk /dev/hdb1 using the single partition /dev/hde1 would look something like the following:

 raiddev /dev/md0
     raid-level            1
     nr-raid-disks         2
     nr-spare-disks        0
     chunk-size            4
     persistent-superblock 1
     device                /dev/hdb1
     raid-disk             0
     device                /dev/hde1
     raid-disk             1

Other RAID levels are created by using the same configuration file but specifying other mandatory parameters, such as a third disk for RAID levels 4 and 5, and so on. See the references at the end of this hack for pointers to more detailed information about creating and using devices at other RAID levels.
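For instance, a RAID-5 definition in /etc/raidtab might look something like the following sketch (the device names, chunk size, and the single spare disk are illustrative assumptions, not taken from this hack):

 raiddev /dev/md1
     raid-level            5
     nr-raid-disks         3
     nr-spare-disks        1
     parity-algorithm      left-symmetric
     chunk-size            32
     persistent-superblock 1
     device                /dev/hdb1
     raid-disk             0
     device                /dev/hdd1
     raid-disk             1
     device                /dev/hdf1
     raid-disk             2
     device                /dev/hdh1
     spare-disk            0

Here the nr-spare-disks and spare-disk entries allocate a hot spare of the kind described in the mirroring and redundancy section; the RAID subsystem rebuilds onto it automatically if one of the three active disks fails.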

An important thing to consider when creating mirrored RAID devices is the amount of load they will put on your system's device controllers. When creating mirrored RAID devices, you should always try to put the drive and its mirror on separate controllers so that no single drive controller is overwhelmed by disk update commands.


5.3.5. Combining RAID and LVM

As mentioned earlier, RAID devices can't be partitioned. This generally means that you have to use RAID devices in their entirety, as a single filesystem, or that you have to use many small disks and create a RAID configuration file that is Machiavellian in its complexity. A better alternative to both of these (and the point of this hack) is that you can combine the strengths of Linux software RAID and Linux LVM to get the best of both worlds: the safety and redundancy of RAID with the flexibility of LVM. It's important to create logical volumes on top of RAID storage and not the reverse, though: even software RAID is best targeted directly at the underlying hardware, and trying to (for example) mirror logical devices would stress your system and slow performance as both the RAID and LVM levels competed to try to figure out what should be mirrored where.

Combining RAID and LVM is quite straightforward. Instead of creating a filesystem directly on top of /dev/md0, you define /dev/md0 as a physical volume that can be associated with a volume group [Hack #46]. You then create whatever logical volumes you need within that volume group, format them as described earlier in this hack, and mount and use them however you like on your system.

If you decide to use Linux software RAID and LVM and support for these is not compiled into your kernel, you must remember to update any initial RAM disks that you use to include the RAID and LVM kernel modules. I generally use a standard ext2/ext3 partition for /boot on my systems, which is where the kernel and initial RAM disks live. This avoids boot-strapping problems, such as when the system needs information from a logical volume or RAID device but has not yet loaded the kernel modules necessary to get that information.
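As a rough sketch only (the exact tool and kernel version vary by distribution, and Debian-derived systems use their own initial RAM disk tooling), rebuilding the initial RAM disk on a Red Hat-style system might look like the following, with the kernel version shown being purely illustrative:

 # mkinitrd /boot/initrd-2.6.12.img 2.6.12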


To expand your storage after creating this sort of setup, you physically add additional new devices to your system, define the new RAID device in /etc/raidtab (as /dev/md1, etc.), and run the mkraid command followed by the name of the new device to have your system create and recognize it as a RAID volume. You then create a new physical volume on the resulting device, add that to your existing volume group, and then either create new logical volumes in that volume group or use the lvextend command to increase the size of your existing volumes. Here's a sample sequence of commands to do all of this (using the mirrored /etc/raidtab from the previous section):

 # mkraid /dev/md0
 # pvcreate /dev/md0
 # vgcreate data /dev/md0
 # vgdisplay data | grep "Total PE"
   Total PE              59618
 # lvcreate -n music -l 59618 data
   Logical volume "music" created
 # mkfs -t xfs /dev/data/music
 meta-data=/dev/mapper/data-music isize=256    agcount=16, agsize=3815552 blks
          =                       sectsz=512
 data     =                       bsize=4096   blocks=61048832, imaxpct=25
          =                       sunit=0      swidth=0 blks, unwritten=1
 naming   =version 2              bsize=4096
 log      =internal log           bsize=4096   blocks=29809, version=1
          =                       sectsz=512   sunit=0 blks
 realtime =none                   extsz=65536  blocks=0, rtextents=0
 # mount /dev/mapper/data-music /mnt/music

These commands create a mirrored RAID volume called /dev/md0 using the storage on /dev/hdb1 and /dev/hde1 (which live on different controllers), allocate the space on /dev/md0 as a physical volume, create a volume group called data using this physical volume, and then create a logical volume called music that uses all of the storage available in this volume group. The last two commands then create an XFS filesystem on the logical volume and mount that filesystem on /mnt/music so that it's available for use. To make sure that your new logical volume is automatically mounted each time you boot your system, you'd then add the following entry to your /etc/fstab file:

 /dev/data/music  /mnt/music      xfs    defaults,noatime    0 0 

Specifying the noatime option in the /etc/fstab mount options for the logical volume tells the filesystem not to update inodes each time the files or directories associated with them are accessed.
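As a hedged sketch of the expansion sequence described earlier (assuming a second mirrored device, /dev/md1, has already been defined in /etc/raidtab, and reusing the data volume group and music logical volume from the example above; the free-extent count and size shown are illustrative), the commands might look like this:

 # mkraid /dev/md1
 # pvcreate /dev/md1
 # vgextend data /dev/md1
 # vgdisplay data | grep "Free"
   Free  PE / Size       59618 / 232.88 GB
 # lvextend -l +59618 /dev/data/music
 # xfs_growfs /mnt/music

xfs_growfs operates on the mounted filesystem, so the extra space becomes available without unmounting /mnt/music.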


Until the Linux LVM system supports mirroring, combining software RAID and LVM gives you the reliability and redundancy of RAID with the flexibility and power of LVM. The combination is conceptually elegant and can help you create a more robust, flexible, and reliable system environment. Though RAID levels that support mirroring require multiple disks and thus "waste" some potential disk storage by devoting it to mirroring rather than actual, live storage, you'll be glad that you used them if any of your disks ever fail.

5.3.6. See Also

  • Linux software RAID HOWTO: http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO.html

  • "Create Flexible Storage with LVM" [Hack #46]


