

Certification Objective 13.01—Understanding RAID and SVM Concepts

Exam Objective 3.1: Analyze and explain RAID (0,1,5) and SVM concepts (logical volumes, soft partitions, state databases, hot spares, and hot spare pools).

The Redundant Array of Inexpensive Disks (RAID), also called Redundant Array of Independent Disks, takes the management of disk space from a single disk to multiple disks, called a disk set. On a disk set, the data lives on a volume that can span multiple disks. Solaris offers a tool called Solaris Volume Manager (SVM) to manage these volumes. In this section we explore RAID and SVM concepts.

Implementing Disk System Fault Tolerance

Although a backup and restoration system protects you against disasters such as a hard disk failure, it does not solve all the disk problems related to potential data loss. There are two cases that backup alone does not cover:

  • The data stored between the last backup and the crash will be lost forever.

  • There are situations where you want zero time between data loss and data recovery—that is, the end user should have no awareness that the disk has crashed. For example, think of a disk supporting a 24 × 7 web site.

These two cases are covered by implementing fault tolerance, which is the property of a system that enables the system to keep functioning even when a part of the system fails. Fault tolerance in a disk set is implemented through techniques called mirroring, duplexing, and parity.

Disk Mirroring

Disk mirroring is a process of designating a disk drive, say disk B, as a duplicate (mirror) of another disk drive, say disk A. As shown in Figure 13-1, in disk mirroring both disks are attached to the same disk controller. When the OS writes data to a disk drive, the same data also gets written to the mirror disk. If the first disk drive fails, the data will be served from the mirror disk and the user will not see the failure.

Figure 13-1: Comparison of mirroring and duplexing

Note that the two disks don't have to be the same size, but if you don't want to waste disk space, they should be. For example, consider two disks of size 5GB and 7GB. Using these two disks, you can make a mirrored system of only 5GB (the size of the smaller disk); hence, 2GB of the larger disk is wasted.

Note that mirroring has a single point of failure—that is, if the disk controller fails, both disks will become inaccessible. The solution to this problem is disk duplexing.

Disk Duplexing

You have seen that the fault tolerance provided by mirroring is vulnerable to the failure of the disk controller. Disk duplexing solves this problem by giving each disk drive its own disk controller, as shown in Figure 13-1.

As shown in Table 13-1, duplexing improves fault tolerance over mirroring, because in addition to disk redundancy there is also controller redundancy. Therefore, duplexing is more fault tolerant than mirroring, and mirroring is more fault tolerant than a single disk.

Table 13-1: Comparison of mirrored and duplexed disk systems

One Disk | Both Disks | One Controller | Is Mirrored System Fault Tolerant? | Is Duplexed System Fault Tolerant?
Fails    |            |                | Yes                                | Yes
         |            | Fails          | No                                 | Yes
Fails    |            | Fails          | No                                 | Yes, if the failed controller belongs to the failed disk; otherwise, no
         | Fail       |                | No                                 | No

There is another disk read/write technique that is related to disk volume, called disk striping.

Improving Performance through Disk Striping

Ordinarily, when a disk volume spans multiple disks, you fill up one disk before you start writing to the next disk. Now that we have more than one disk in a volume, if we could write to all the disks in the volume simultaneously, we could improve performance. This is exactly what disk striping is all about. Disk striping breaks the data into small pieces called stripes, and those stripes are written to multiple disks simultaneously. Because the read/write heads are working simultaneously, striping improves read/write performance. This is depicted in Figure 13-2.

Figure 13-2: Disk striping writes stripes of data on multiple disks simultaneously

There is another technique to provide data redundancy called parity.

Data Redundancy by Parity

Parity is another technique that is used to implement fault tolerance. If you lose a part of data, you can reconstruct it from the parity information of this data. When a chunk (say a block) of data is stored on a disk, its parity information is calculated and stored on another disk. If this part of the data is lost, it can be reconstructed from the parity information. Note that parity provides fault tolerance at the cost of performance and disk space, because parity calculations take CPU time, and parity information takes disk space.
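
As a simple illustration (assuming even, XOR-based parity, which is how most RAID implementations compute it): if three data disks hold the bits 1, 0, and 1, the parity disk stores 1 XOR 0 XOR 1 = 0. If the disk holding the first bit fails, XORing the surviving bits with the parity (0 XOR 1 XOR 0 = 1) reconstructs the lost bit.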

In this section, we have discussed concepts that can be used to implement disk fault tolerance and improve the read/write performance of a disk. In the next section, we explore a specific technology that implements these concepts.

Understanding RAID

Redundant Array of Inexpensive Disks (RAID), also known as Redundant Array of Independent Disks, refers to a technology that uses a set of redundant disks to provide fault tolerance. The disks usually reside together in a cabinet and are therefore referred to as an array.

Exam Watch

We have presented here a comprehensive coverage of RAID levels. However, note that Solaris supports only RAID 0, RAID 1, RAID 0 + 1, RAID 1 + 0, and RAID 5. So, for the exam, you need to remember only these RAID levels.


There are several methods that use an array of disks, and they are referred to as RAID levels, as described here:

  • RAID 0. This level of RAID uses striping on multiple disks. This means that a volume contains stripes on multiple disks and the data can be written simultaneously to these stripes. In other words, the data is striped across multiple disks. Because you can write to various stripes simultaneously, RAID 0 improves performance, but it does not implement fault tolerance even though it uses multiple disks. Imagine a file written across stripes residing on multiple disks: if one of the disks that contains a stripe of the file fails, you lose the file.

  • RAID 0 + 1. This is the same as RAID 0, but the stripes are mirrored to provide fault tolerance in addition to performance improvement by striping. However, mirroring can slow down performance.

  • RAID 1. This RAID level uses mirroring (or duplexing) on two disks to provide a very basic level of disk fault tolerance. Both disks contain the same data—that is, one disk is a mirror image of the other. If one disk fails, the other disk takes over automatically. This RAID level also provides performance improvement in data reads, because if one disk is busy, the data can be read from the other disk.

  • RAID 1+0. This is the same as RAID 1, but the mirroring is striped to provide performance improvement in addition to fault tolerance.

  • RAID 2. This RAID level implements striping with parity and therefore needs at least three disks, because parity information is written on a disk other than the data disks. The striping is done at bit level—that is, bits are striped across multiple disks. If a data disk fails and the parity disk does not, the data on the failed disk can be reconstructed from the parity disk.

  • RAID 3. This is similar to RAID 2 (striping with parity). The main difference is that the striping is done at byte level as opposed to bit level—that is, bytes are striped across multiple disks. This offers improved performance over RAID 2, because more data can be read or written in a single read/write operation.

  • RAID 4. This is similar to RAID 2 and RAID 3 (striping with parity). The main difference is that the striping is done at block level as opposed to bit level or byte level—that is, blocks are striped across multiple disks. This offers improved performance over RAID 2 and RAID 3 because more data can be read or written in a single read/write operation.

  • RAID 5. This RAID level implements striping with distributed parity—that is, parity is also striped across multiple disks. The parity function does not have a dedicated disk—that is, the data and parity can be interleaved on all disks. However, it is clear that the parity of a given piece of data on one disk must not be written to the same disk, which would defeat the whole purpose of parity. If a disk fails, the parity information about the data on the failed disk must exist on other disks so that the lost data can be reconstructed. If more than one disk fails, it's possible that you will lose parity information along with the data. If you use only two disks, you cannot distribute the parity information, because data and parity must be on separate disks. Therefore, RAID 5 requires three disks at minimum.

RAID 0, RAID 1, and RAID 5 are the most commonly used RAID levels in practice. However, in addition to these RAID levels, Solaris supports variations of RAID 0 and RAID 1, called RAID 0 + 1 and RAID 1 + 0, respectively.

The different RAID levels are distinguished from each other mainly by how much performance improvement and fault tolerance they provide and the way they provide it. This is summarized in Table 13-2.

Table 13-2: Performance improvement and fault tolerance provided by various RAID levels

RAID Level of the Volume | Main Characteristics                   | Fault Tolerance? | Performance Improvement?
RAID 0                   | Striping, but no parity.               | No               | Yes
RAID 0 + 1               | Striping with mirroring.               | Yes              | Yes
RAID 1                   | Mirroring or duplexing, but no parity. | Yes              | Yes, in reading
RAID 1 + 0               | Mirroring with striping.               | Yes              | Yes
RAID 2                   | Bit-level striping with parity.        | Yes              | Parity slows down performance, whereas striping improves it.
RAID 3                   | Byte-level striping with parity.       | Yes              | Parity slows down performance, whereas striping improves it.
RAID 4                   | Block-level striping with parity.      | Yes              | Parity slows down performance, whereas striping improves it.
RAID 5                   | Both data and parity are striped.      | Yes              | Parity slows down performance, whereas striping improves it.

Exam Watch

Remember that RAID levels are differentiated from each other according to the degree of read/write performance improvement and disk fault tolerance they implement. In choosing one RAID level over another for a given situation, always ask yourself: what is required here, read/write performance, disk fault tolerance, or both?


RAID technology takes disk space management from a single disk to a disk set. Instead of thinking of disk space in terms of a disk slice that resides on only one disk, now you can think of disk space in terms of a logical volume that can span across multiple disks. These volumes with RAID features are called RAID volumes. Solaris offers a tool called Solaris Volume Manager (SVM) to manage these volumes.

Understanding SVM Concepts

A volume is a named chunk of disk space that can occupy part of a disk or the whole disk, or it can span multiple disks. In other words, a volume is a set of physical slices that appears to the system as a single, logical device, also called a virtual (or pseudo) device or a metadevice in standard UNIX terminology. A volume is also called a logical volume. In this book we use the terms volume and logical volume interchangeably.

Solaris offers Solaris Volume Manager (SVM) to manage disk volumes that may reside on multiple disks. In other words, SVM is a software product that helps you manage huge amounts of data spread over a large number of disks. Because SVM deals with a large number of disks, you can use it for the following tasks in addition to read/write performance improvement and disk fault tolerance:

  • Increasing storage capacity

  • Increasing data availability

  • Easing administration of large storage devices

In the following, we describe important concepts related to SVM:

  • Soft partition. With the increase in disk storage capacity and the invention of disk arrays, a user may need to divide disks (or logical volumes) into more than eight partitions. SVM offers this capability by letting you divide a disk slice or a logical volume into as many divisions as needed; these divisions are called soft partitions. Each soft partition must have a name. A named soft partition can be directly accessed by an application if it's not included in a volume, because in this case it appears to a file system (and the application) as a single contiguous logical volume. Once you include a soft partition in a volume, it cannot be directly accessed. The maximum size of a soft partition is limited to the size of the slice or the logical volume of which it is a part.

  • Hot spare. A hot spare is a slice that is reserved for automatic substitution in case the corresponding data slice fails. Accordingly, a hot spare must stand ready for an immediate substitution when needed. The hot spares live on a disk separate from the data disk that's being used. By using SVM, you can dynamically add, delete, replace, and enable hot spares within the hot spare pool. An individual hot spare can be included in one or more hot spare pools.

  • Hot spare pool. A hot spare pool is a collection (an ordered list) of hot spares that SVM uses to provide increased data availability and fault tolerance for a RAID 1 (mirrored) volume and a RAID 5 (striped data with striped parity) volume. You can assign a hot spare pool to one or more RAID 1 or RAID 5 volumes. If a slice failure occurs, SVM automatically substitutes a hot spare for the failed slice.

  • Disk set. From the perspective of SVM, data lives in logical volumes. Where do the logical volumes (and hot spare pools that support improved data availability) live? They live on a set of disks called a disk set. In other words, a disk set is a set of disk drives that contain logical volumes and hot spare pools.

  • State database. A state database contains the configuration and status information about the disk sets, and about the volumes and hot spares in a disk set managed by SVM. SVM also maintains replicas (multiple copies) of the state database to provide fault tolerance. For example, if a replica gets corrupted during a system crash, other replicas are still available; that means each replica should live on a separate disk. When the state database is updated, the update flows to all the replicas. If your system loses a replica of the state database, how does it determine which of the remaining replicas contain valid data? This is determined by running the so-called majority consensus algorithm. According to this algorithm, a minimum majority (half + 1) of the replicas must be available and in agreement before any of them can be declared valid. This imposes the requirement that you create at least three replicas of the state database; that way, if one replica fails, you still have two of them to reach a consensus. It also means that the system cannot reboot into multiuser mode unless a minimum majority (half + 1) of the total number of state database replicas is available. The system continues to run as long as at least half of the state database replicas are available; otherwise the system panics. (See the command sketch following this list for how replicas, soft partitions, and hot spare pools are created.)
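
The following command-line sketch ties several of these concepts together. It is an illustration only: the slice names (c0t0d0s7 and so on), the volume names (d100, d10), and the hot spare pool name (hsp001) are placeholders, and the commands assume superuser privileges on a system running SVM.

    # Create the minimum of three state database replicas on a dedicated
    # slice; -f is required when creating the very first replicas.
    metadb -a -f -c 3 c0t0d0s7

    # Carve a 2GB soft partition, d100, out of slice c1t0d0s0.
    metainit d100 -p c1t0d0s0 2g

    # Create a hot spare pool, hsp001, and add a spare slice to it.
    metainit hsp001 c2t0d0s0
    metahs -a hsp001 c3t0d0s0

    # Associate the pool with an existing submirror or RAID 5 volume
    # (d10 is a placeholder for such a volume).
    metaparam -h hsp001 d10

    # Verify the replicas and the volume configuration.
    metadb -i
    metastat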

Do not confuse redundant data (the mirrored disk in RAID 1 or the distributed parity in RAID 5) with hot spares. Both have their own roles to play. For example, consider a RAID 1 system that has two disks: Disk A and Disk B. Disk A is serving the data and Disk B is the mirrored disk—that is, a duplicate of Disk A; therefore your system is fault tolerant. Suppose Disk A fails; the system will automatically switch to Disk B for the data, and as a result the user does not see the failure. But here is the bad news: your system is not fault tolerant any more. If Disk B fails, there is no disk to take over. This is where a hot spare comes to the rescue; it keeps the system fault tolerant while the failed slice is being repaired or replaced. When a slice from the disk being used fails, the following happens:

  • The mirrored disk takes over, but the system is no longer fault tolerant.

  • The failed slice is automatically replaced with a hot spare, and the hot spare is synchronized with the data from the mirrored slice that is currently in use. Once again, the system is fault tolerant.

  • You can take your time to repair (or replace) the failed slice and then free up the hot spare.

Ultimately, hot spares provide extra protection to the data redundancy that is already available without them.

image from book
Exam Watch

Because a hot spare must be synchronized with the current data, you cannot use hot spare technology in systems in which redundant data is not available, such as RAID 0. Furthermore, although you can assign a hot spare pool to multiple submirror (RAID 1) volumes or multiple RAID 5 volumes, a given volume can be associated with only one hot spare pool.

image from book

Earlier in this chapter, we discussed RAID levels in general. Next, we discuss how (and which of) these RAID levels are implemented in logical volumes supported by SVM.

Logical Volumes Supported by SVM

A logical volume is a group of physical disk slices that appears to be a single logical device. The underlying physical slices are transparent to the file systems, applications, and end users—hence the name logical volume. Logical volumes are used to increase storage capacity, data availability (and hence fault tolerance), and I/O performance.

In a previous section we provided a generic description of RAID technology, including RAID 0, RAID 1, and RAID 5 levels; in this section we discuss RAID 0, RAID 1, and RAID 5 volumes offered by Solaris and managed by SVM.

On the Job 

SVM has the capability of supporting a maximum of 8192 logical volumes per disk set. However, its default configuration is for 128 logical volumes per disk set.
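
If you need more volumes than the default allows, one hedged sketch of the change is to raise the nmd value in /kernel/drv/md.conf and perform a reconfiguration reboot. The nmd and md_nsets tunables are documented for SVM, but the values below are illustrative; verify the procedure against your Solaris release and back up the file first.

    # /kernel/drv/md.conf (excerpt) -- illustrative values only.
    # nmd sets the number of volumes per disk set; md_nsets sets the
    # number of disk sets the system can support.
    nmd=1024;
    md_nsets=4;

    # A reconfiguration reboot is required for the new limits to take effect.
    touch /reconfigure
    init 6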

RAID 0 Volumes

You can compose a RAID 0 volume of either slices or soft partitions. These volumes enable you to dynamically expand disk storage capacity. There are three kinds of RAID 0 volumes, which are discussed here:

  • Stripe volume. This is a volume that spreads data across two or more components (slices or soft partitions). Equally sized segments of data (called interlaces) are interleaved alternately (in a round-robin fashion) across two or more components. This enables multiple controllers to read and write data in parallel, thereby improving performance. The size of the data segment is called the interlace size. The default interlace size is 16KB, but you can set the size value when you create the volume. After the volume has been created, you are not allowed to change the interlace value. The total capacity of a stripe volume is equal to the number of components multiplied by the size of the smallest component, because the stripe volume spreads data equally across its components.

  • Concatenation volume. Unlike a stripe volume, a concatenation volume writes data to multiple components sequentially. That is, it writes data to the first component until it is full and then moves on to write to the next component. There is no parallel access. The advantage of this sequential approach is that you can expand the volume dynamically by adding new components even while the file system is active. Furthermore, no disk space is wasted, because the total capacity of the volume is the sum of the sizes of all the components, even if the components are not of equal size.

  • Concatenated stripe volume. Recall that you cannot change the size of the data segments on the components in a stripe volume after the volume has been created. So how can you extend the capacity of a stripe volume? By adding another component (stripe) and thereby promoting the stripe volume to a concatenated stripe volume. In other words, a concatenated stripe volume is a stripe volume that has been expanded by adding components (see the command sketch following this list).
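
The following sketch shows how each kind of RAID 0 volume might be created with SVM. The slice and volume names are placeholders, and the commands assume superuser privileges.

    # Stripe volume d10: one stripe of three slices with a 32KB interlace.
    metainit d10 1 3 c1t0d0s0 c2t0d0s0 c3t0d0s0 -i 32k

    # Concatenation volume d20: two one-slice stripes written sequentially.
    metainit d20 2 1 c1t0d0s1 1 c2t0d0s1

    # Expand d10 into a concatenated stripe by attaching another component.
    metattach d10 c4t0d0s0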

On the Job 

If you want to stripe an existing file system, back up the file system, create the stripe volume, and restore the file system to the stripe volume. You would need to follow the same procedure if you want to change the interlace value after creating the stripe volume.
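
A hedged outline of that procedure for a UFS file system follows; the file system (/export/home), the tape device, the slices, and the volume name are all placeholders.

    # 1. Back up the existing file system (here, to tape).
    ufsdump 0f /dev/rmt/0 /export/home

    # 2. Unmount it and create the stripe volume on fresh slices,
    #    choosing the interlace value now.
    umount /export/home
    metainit d10 1 2 c1t0d0s7 c2t0d0s7 -i 32k

    # 3. Put a file system on the volume, mount it, and restore the data.
    newfs /dev/md/rdsk/d10
    mount /dev/md/dsk/d10 /export/home
    cd /export/home
    ufsrestore rf /dev/rmt/0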

Because there is no data redundancy, a RAID 0 volume does not directly provide fault tolerance. However, RAID 0 volumes can be used as building blocks for RAID 1 volumes, which do provide fault tolerance.

RAID 1 Volumes

A RAID 1 volume, also called a mirror, is a volume that offers data redundancy (and hence fault tolerance) by maintaining copies of RAID 0 volumes. Each copy of a RAID 0 volume in a RAID 1 volume (mirror) is called a submirror. Obviously, mirroring takes more disk space (at least twice as much as the amount of data that needs to be mirrored), and more time to write. You can mirror the existing file systems.

For a mirror to provide fault tolerance it must contain at least two submirrors; SVM supports up to four submirrors in a RAID 1 volume. You can attach or detach a submirror at any time without interrupting service. In other words, you can create your RAID 1 volume with just one submirror and subsequently attach more submirrors to it. To improve fault tolerance and performance, choose the slices for different submirrors from different disks and controllers. For example, consider a RAID 1 volume with two submirrors: if a single disk contains slices belonging to both submirrors, both submirrors are lost when that disk fails, and hence there is no fault tolerance.
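
A hedged sketch of that sequence follows; the slice and volume names are placeholders, and the slices are assumed to come from different disks and controllers.

    # Create two single-slice submirrors from different disks.
    metainit d21 1 1 c1t0d0s0
    metainit d22 1 1 c2t0d0s0

    # Create the mirror d20 with one submirror, then attach the second;
    # the attach triggers a resync without interrupting service.
    metainit d20 -m d21
    metattach d20 d22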

In a RAID 1 volume, the same data will be written to more than one submirror, and the same data can be read from any of the submirrors to which it was written. This gives rise to multiple read/write options for a RAID 1 volume. To optimize performance, you can configure read and write policies when you create a mirror, and you can reconfigure these policies later when the system is in use. The default write policy is the parallel write, meaning that the data is written to all the submirrors simultaneously. Obviously this policy improves write performance. The alternative write policy is the serial write—that is, the write to one submirror must be completed before starting a write to the second submirror. This policy is designed to handle certain situations—for example, if a submirror should become inaccessible as a result of a power failure. The read policies are described in Table 13-3.

Table 13-3: Read policies for RAID 1 volumes

Read Policy           | Description
First                 | All reads are performed from the first submirror.
Geometric             | Reads are divided among submirrors based on the logical disk block addresses.
Round robin (default) | Reads are spread across all submirrors in a round-robin order to balance the read load.

Now that you know about the various read policies available in SVM, look at the Scenario & Solution section that follows for some practical scenarios and their solutions related to configuring the read/write policies for a RAID 1 volume.

So, mirroring provides fault tolerance through data redundancy. There is another way to provide fault tolerance, and that is by calculating and saving the parity information about data. This technique is used in RAID 5 volumes.

SCENARIO & SOLUTION

While configuring a RAID 1 volume, you want to minimize the seek time for the reads. Which read policy would you choose?

Geometric

While creating a RAID 1 volume, you know that the disk drive supporting the first submirror is substantially faster than the disk drives supporting other submirrors. Which read policy will you choose if read performance is important?

First

You initially configured your RAID 1 volume for parallel write. One of the submirrors of this volume has become inaccessible as a result of a power failure. Which write policy should you switch to?

Serial
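
To act on scenarios like these, the read and write policies of an existing mirror can be changed with metaparam; a hedged sketch follows, where d20 is a placeholder mirror name.

    # Switch the mirror to the geometric read policy and the serial
    # write policy.
    metaparam -r geometric d20
    metaparam -w serial d20

    # Display the mirror's current parameters, including its policies.
    metaparam d20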

RAID 5 Volumes

A RAID 5 volume is a striped volume that also provides fault tolerance by using parity information distributed across all components (disks or logical volumes). In case of a component failure, the lost data is reconstructed from its parity information and the data available on the other components. The parity calculation slows down performance compared with a plain striped volume, which performs better but provides no data redundancy and hence is not fault tolerant. Note the following about RAID 5 volumes:

  • Because parity is distributed across all components, and a piece of data must not reside on the same component as its parity (otherwise losing that component would lose both), a RAID 5 volume requires at least three components.

  • A component that already contains a file system (that you don't want to lose) must not be included in the creation of a RAID 5 volume, because doing so will erase the data during initialization.

  • If you don't want to waste disk space, use components of equal size.

  • You can set the interlace value (size of data segments); otherwise, it will be set to a default of 16KB. (A command sketch follows this list.)
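
A hedged sketch of creating a RAID 5 volume with SVM; the slice and volume names are placeholders, and initialization erases any data on the slices.

    # RAID 5 volume d45 from the minimum of three slices, 32KB interlace.
    metainit d45 -r c1t0d0s0 c2t0d0s0 c3t0d0s0 -i 32k

    # Grow the volume later by concatenating another component
    # (the attached component carries data only, not parity).
    metattach d45 c4t0d0s0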

Because RAID 5 uses parity, which slows down performance, it may not be a good solution for write-intensive applications. SVM also supports variations of RAID 0 and RAID 1 volumes.

RAID 0 + 1 and RAID 1 + 0 Volumes

You have seen that a stripe volume is beneficial for performance because it enables multiple controllers to perform reads and writes simultaneously, but it does not provide fault tolerance because there is no data redundancy. You can add data redundancy to the benefits of a stripe volume by choosing RAID 5. However, RAID 5 must calculate parity, which impedes performance. There is a simpler solution—mirror your stripe. That is exactly what a RAID 0 + 1 volume is: a stripe volume that has been mirrored. The reverse, a mirror volume that is striped, is also supported by SVM and is known as RAID 1 + 0.
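
One hedged way to build a mirrored stripe (RAID 0 + 1) with SVM is to create two stripe volumes and mirror one over the other; the slice and volume names below are placeholders.

    # Two stripe volumes that will serve as submirrors.
    metainit d11 1 2 c1t0d0s0 c2t0d0s0
    metainit d12 1 2 c3t0d0s0 c4t0d0s0

    # Mirror the first stripe, then attach the second as a submirror.
    metainit d10 -m d11
    metattach d10 d12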

Exercise 13-1: Accessing the SVM GUI


The SVM GUI is part of the Solaris Management Console (SMC). To access the GUI, perform the following steps:

  1. Start the SMC by issuing the following command:

        % /usr/sbin/smc

  2. Double-click This Computer in the Navigation pane.

  3. Double-click Storage in the Navigation pane.

  4. Load the SVM tools by double-clicking Enhanced Storage in the Navigation pane. Log in as superuser (root or another account with equivalent access) if the login prompt appears.

  5. Double-click the appropriate icon to manage disk sets, state database replicas, hot spare pools, or volumes.

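If you prefer the command line to the GUI, the same objects can be inspected with SVM's commands; a quick hedged example follows (run as superuser).

    # Summarize volumes (concise form), state database replicas,
    # hot spare pools, and disk sets from the command line.
    metastat -p
    metadb -i
    metahs -i
    metaset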

The configuration and status information about volumes is contained in a database called a state database. In the following section, we describe how you can create mirrors and state databases.



