2.7 Disk configuration | Oracle Real Application Clusters


The previous sections looked at the benefits of clustered systems and cluster interconnects. Neither is of much use without a storage system. Just as hardware configurations have evolved through technologies such as SMP and MPP, the storage tier offers its own configuration choices, such as SAN, NAS, and the traditional direct access architectures. Before discussing these, let us look at the basic disk configurations. A disk, like any other piece of equipment, is prone to failure and can suffer performance problems, especially when many users access it at once. Heavy I/O contention and a backlog of user requests can build up, which in turn has a cascading effect on the performance of the whole system. One option is to combine multiple disks so that files can be spread across them.

Simply having multiple disks does not by itself solve I/O contention. It improves I/O compared to a single-disk configuration; however, contention can still exist, and on a system with a large number of users the performance problems can remain significant.

The next alternative is to combine these disks and stripe data across all of them, creating volumes in which each file is spread over every disk in the stripe. Multiple users accessing these files are then likely to be accessing different locations, and therefore different disks, which reduces performance bottlenecks. In general, striping offers a performance gain by spreading I/O across many channels and devices. This reduces contention and enables throughput beyond the capacity of a single channel or device. The performance gain depends on the database layout, workload, and access patterns, and it should be weighed against the cost of setting up, managing, and maintaining (extending and reducing) striped volumes.

Striping distributes I/O across multiple spindles, but what about availability? When one disk fails, the data in that stripe will probably not be available. As discussed in Section 2.4.3 (High Availability), redundancy should be provided at every tier, including the storage subsystem. The technology that provides redundancy at the disk level is called mirroring: every time information is recorded onto one disk, it is immediately copied to another disk. Mirroring can be done either by a software process or through hardware configuration.

An ideal solution combines both, i.e., a disk configuration that offers both striping and mirroring, because proper storage configuration is vital to the performance of the data tier. The industry has various configuration options available today, and each is used for a specific type of implementation. While most of them work properly in most scenarios, some are better suited than others to specific categories of implementation. Let us look at disk configuration in more detail and elaborate on some of the options available in this tier.

2.7.1 RAID

RAID is the technology for expanding the capacity of the I/O system and providing the capability for data redundancy. RAID stands for redundant array of inexpensive disks. It is the use of two or more physical disks to create one logical disk, where the physical disks operate in tandem to provide greater size and more bandwidth. RAID provides scalability and high availability in the context of I/O and system performance.

RAID can be implemented at the hardware level or at the software level. Software-based RAID implementations have considerable overhead and consume host resources, including CPU and memory. Hardware RAID implementations are more popular than software RAID implementations because they have less resource overhead.

RAID 0

RAID 0 provides striping, where a single data partition is physically spread across all the disks in the stripe bank, effectively giving that partition the aggregate performance of all the component disks combined. The unit of granularity for spreading the data across the drives is called the stripe size or chunk size. Typical settings for the stripe size are 32 KB, 64 KB, and 128 KB.
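As a minimal sketch of how a stripe size spreads data over the drives, the following maps a logical byte offset to a disk and an offset on that disk. The round-robin chunk layout, the function name `locate`, and the eight-disk bank are assumptions made for illustration; real controllers and volume managers may lay chunks out differently.

```python
def locate(offset: int, stripe_size: int, num_disks: int) -> tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Assumes chunks of `stripe_size` bytes are dealt out to the disks
    round-robin, which is one common RAID 0 layout (hypothetical here).
    """
    chunk = offset // stripe_size      # which chunk of the logical volume
    disk = chunk % num_disks           # chunks rotate across the disks
    row = chunk // num_disks           # complete rows of chunks before this one
    return disk, row * stripe_size + offset % stripe_size


STRIPE = 64 * 1024   # 64 KB, one of the typical stripe sizes
DISKS = 8            # assumed width of the stripe bank

# A 1 MB sequential read is served by every disk in the bank.
touched = {locate(off, STRIPE, DISKS)[0] for off in range(0, 1024 * 1024, STRIPE)}
print(sorted(touched))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

With this layout, any request larger than one stripe row is spread over all the disks, which is where the aggregate performance comes from.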

The following are benefits that striping provides:

Load balancing: In a database environment some files are always busier than other files; therefore, I/O load on the system needs to be balanced. In a non-striped environment this causes a hot spot on the disk and results in poor response times. The exposure to hot spots is significantly reduced using a striped disk array, because each file is very thinly spread across all the disks in the stripe bank. The result of this spreading of files is random load balancing. This is very effective in a high-volume, high-transactional, very large online transaction processing system (OLTP).
Concurrency: In a random access environment the overall concurrency of each file is increased. For example, if a file exists on a single disk, only one read or write can be executed against that file at any one time. The reason for this is physical, because the disk drive has only one set of read/write heads. By striping (RAID 0 configuration), many reads and writes can be active at any one time, up to the number of disks in the stripe set.

The effects of striping in both sequential and random access environments make it very attractive from a performance standpoint. The downside is that a RAID 0 configuration has no built-in redundancy and is highly exposed to failure. If a single disk in a stripe set fails, the whole stripe set is effectively disabled.

In the scenario shown in Figure 2.17, where eight disks are all striped together into different stripes or partitions, the MTBF of the stripe set is roughly the MTBF of a single disk divided by eight, and the amount of data that becomes unavailable on a failure is eight times that of a single disk.

Figure 2.17: RAID 0.
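The MTBF arithmetic for a stripe set can be sketched as follows. This assumes that disks fail independently at a constant rate, so that failure rates add across the set; the 500,000-hour figure is purely illustrative and does not come from any vendor specification.

```python
def stripe_set_mtbf(disk_mtbf_hours: float, num_disks: int) -> float:
    """Approximate MTBF of a RAID 0 stripe set.

    The set is lost as soon as any one disk fails, so under the assumption
    of independent, constant failure rates the rates simply add up.
    """
    return disk_mtbf_hours / num_disks


DISK_MTBF = 500_000   # hours, an assumed figure for illustration only
print(stripe_set_mtbf(DISK_MTBF, 8))   # 62500.0
```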

RAID 1

RAID 1 is known as mirroring: all writes issued to a given disk are duplicated to another disk. This provides disk availability; if the first disk fails, the second disk, or mirror, can take over without any data loss.

The write operation to both disks happens in parallel. When the write is ready to be issued, it is set up and sent to DISK A. Without waiting for that write process to complete (as in a sequential write), the write to DISK B is initiated. However, when the host wants to read from the disk (in a mirrored configuration) it takes advantage of the two disks. Depending on the implementation, the host will elect to round-robin (100% gain in read capacity) all read requests to alternate disks in the mirrored set or to send the request to the drive that has its head closest to the required track.
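The read and write behavior of a mirrored pair can be illustrated with a toy model. The class `Mirror` and its fields are hypothetical; the sketch duplicates each write to both disks and alternates reads round-robin, and it does not model real parallel I/O or head positioning.

```python
from itertools import cycle


class Mirror:
    """Toy two-disk mirror: writes go to both sides, reads alternate."""

    def __init__(self):
        self.disks = [{}, {}]          # DISK A and DISK B as block -> data maps
        self.reads = [0, 0]            # number of reads served by each disk
        self._next = cycle(range(2))   # round-robin selector for reads

    def write(self, block: int, data: bytes) -> None:
        # A real controller issues both writes without waiting for the first
        # to complete; here they are simply applied one after the other.
        for disk in self.disks:
            disk[block] = data

    def read(self, block: int) -> bytes:
        i = next(self._next)
        self.reads[i] += 1
        return self.disks[i][block]


m = Mirror()
m.write(7, b"redo")
for _ in range(10):
    m.read(7)
print(m.reads)   # [5, 5]
```

Because each disk serves half of the reads, the pair can sustain roughly twice the read load of a single disk, which is the gain the round-robin policy refers to.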

The biggest advantage of a mirrored configuration (RAID 1) shows when one disk, or one side of the mirror, fails: the system keeps running by using the other side of the mirror exclusively for all read/write requests. As a result of the failure, read capacity is reduced by 50%. Once the failed component has been replaced or repaired, bringing the disk back online involves physically copying the surviving mirrored disk onto the replaced peer. This process is called resilvering. Mirrored disks can be placed independently of each other or in a disk array.

In an Oracle environment, RAID 1 is good for both online and archived redo logs. These files are written sequentially, so the disk's write head stays close to the position of the last write, which improves performance. However, more than one RAID 1 volume is required so that writes to the current redo log can continue while the archiver reads the previous redo logs. This allows for optimal performance, and it is also required to eliminate contention between the LGWR and ARCH background processes.

RAID 0 + 1

RAID 0 + 1 (or RAID 01) is a combination of levels 0 and 1. It does exactly what its name implies: it stripes and mirrors disks, striping first and then mirroring what was just striped. RAID 0 + 1 provides good write and read performance and redundancy without the overhead of parity calculations. Parity is a term for error checking. Parity algorithms contain error correction code (ECC) capabilities, which calculate parity for a given stripe or chunk of data within a RAID volume.

The advantages of both RAID level 0 and RAID level 1 apply in this situation and are illustrated in Figure 2.18, which shows a four-way striped mirrored volume with eight disks (A–H). A given stripe of data (Data 01) in a file is first striped across disks A–D and then mirrored across disks E–H. The drawback of this configuration is that when one of the pieces, for example, Data 01 on Disk A, becomes unavailable because Disk A has failed, the entire mirror member becomes unavailable, which reduces the I/O capacity for reads on the volume by 50%.

Figure 2.18: RAID 01.

RAID 1 + 0

RAID 1 + 0 (or RAID 10) is, like RAID 0 + 1, a combination of the RAID levels 0 and 1 discussed in the previous sections. However, in RAID 10 the disks are mirrored first and then striped. All the advantages of the previous RAID configuration apply to this one as well; what differs is the organization of the mirrored sets.

In Figure 2.19, Data 01 is mirrored on the adjoining disks (A and B) and Data 02 is mirrored on the subsequent two disks (C and D), etc. This configuration also contains eight mirrored and striped disks.

Figure 2.19: RAID 10.

The advantage of the RAID 10 configuration over the RAID 01 configuration is that the loss of one disk in a mirrored member does not make the entire member of the mirrored volume unavailable. This configuration is therefore better suited for high availability. The failure of a single disk also does not reduce the I/O capacity of the volume by 50%. RAID 10 should be the preferred choice for an OLTP implementation.
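The difference in fault tolerance between the two layouts can be checked by enumerating every possible two-disk failure. This is a sketch that assumes the eight-disk layouts of Figures 2.18 and 2.19 (stripe sets A–D and E–H for RAID 01, mirrored pairs A/B, C/D, E/F, G/H for RAID 10); the function names are hypothetical.

```python
from itertools import combinations

DISKS = "ABCDEFGH"
STRIPE_SETS = ["ABCD", "EFGH"]             # RAID 01: two stripe sets mirroring each other
MIRROR_PAIRS = ["AB", "CD", "EF", "GH"]    # RAID 10: mirrored pairs, striped together


def raid01_survives(failed: set) -> bool:
    # The volume stays up only while at least one stripe set is fully intact.
    return any(not (set(s) & failed) for s in STRIPE_SETS)


def raid10_survives(failed: set) -> bool:
    # The volume stays up while every mirrored pair has a surviving disk.
    return all(set(p) - failed for p in MIRROR_PAIRS)


two_disk_failures = [set(c) for c in combinations(DISKS, 2)]
raid01_ok = sum(raid01_survives(f) for f in two_disk_failures)
raid10_ok = sum(raid10_survives(f) for f in two_disk_failures)
print(len(two_disk_failures), raid01_ok, raid10_ok)   # 28 12 24
```

Of the 28 possible double failures, this RAID 01 layout survives 12 and this RAID 10 layout survives 24, which is consistent with RAID 10 being the better fit for high availability.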

RAID 5

Under RAID 5, parity calculations provide data redundancy, and the parity is stored with the data. This means that the parity is distributed across the number of drives configured in the volume.

Figure 2.20 illustrates the physical placement of stripes (Data 01 through Data 04) with their corresponding parities distributed across the five disks in the volume. This is a four-way, striped RAID 5 volume where data and parity are distributed.

Figure 2.20: RAID 5.
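A small sketch shows how parity provides redundancy. It assumes the parity chunk is the bytewise XOR of the data chunks in a stripe, which is the usual RAID 5 scheme but is not spelled out in the text; the chunk contents and helper name are made up for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: four data chunks plus one parity chunk, one chunk per disk.
data = [b"DAT1", b"DAT2", b"DAT3", b"DAT4"]
parity = xor_blocks(*data)

# Simulate losing the disk that holds the third chunk, then rebuild that
# chunk from the parity and the three surviving data chunks.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(parity, *survivors)
print(rebuilt == data[2])   # True
```

Any single missing chunk in a stripe, data or parity, can be recomputed this way from the remaining four.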

RAID 5 is unsuitable for OLTP because of the extremely poor performance of small writes at high concurrency levels. Every small write requires reading the stripe, calculating the new parity, and writing the stripe back to disk with the new parity, which makes writing significantly slower.
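The cost of a small write can be sketched by counting disk operations. This assumes XOR parity and the common read-modify-write update, in which only the old data chunk and the old parity are read rather than the whole stripe; the class `Raid5Stripe` and its I/O counter are hypothetical.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


class Raid5Stripe:
    """Toy single stripe that counts the disk I/Os caused by small writes."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.parity = bytes(len(self.chunks[0]))   # start from all zeros
        for chunk in self.chunks:
            self.parity = xor(self.parity, chunk)
        self.ios = 0

    def small_write(self, index: int, new_data: bytes) -> None:
        old_data = self.chunks[index]      # I/O 1: read the old data chunk
        old_parity = self.parity           # I/O 2: read the old parity chunk
        new_parity = xor(xor(old_parity, old_data), new_data)
        self.chunks[index] = new_data      # I/O 3: write the new data chunk
        self.parity = new_parity           # I/O 4: write the new parity chunk
        self.ios += 4


stripe = Raid5Stripe([b"AAAA", b"BBBB", b"CCCC", b"DDDD"])
stripe.small_write(1, b"bbbb")
print(stripe.ios)   # 4
```

Under these assumptions one logical write costs four physical I/Os, compared with two for a mirrored write, and the two reads must finish before the writes can be issued.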

In an Oracle environment, rollback segments and redo logs are accessed sequentially (usually for writes) and are not suitable candidates for being placed on a RAID 5 device. Also, data files belonging to temporary tablespaces are not suitable for placement on a RAID 5 device. Another reason the redo logs should not be placed on RAID 5 devices is related to the type of caching (if any) being done by the RAID system. Given the critical nature of the contents of the redo logs, catastrophic loss of data could ensue if the contents of the cache are not written to the disk, for example, because of a power failure, when Oracle was notified that they had been written. This is particularly true of write-back caching, where the write is assumed to have been written to the disk when it has only been written to the cache. Write-through caching, where the write is assumed to have completed when it has reached the disk, is much safer but still not recommended for redo logs for the reason mentioned earlier.
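The caching risk described above can be illustrated with a toy controller. It assumes a volatile cache with no battery backup; the class `CachedController` and its methods are hypothetical and do not correspond to any real RAID controller interface.

```python
class CachedController:
    """Toy RAID controller with either a write-back or a write-through cache."""

    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.cache = {}   # volatile memory (assumed: no battery backup)
        self.disk = {}    # persistent storage

    def write(self, block: int, data: bytes) -> bool:
        if self.write_back:
            self.cache[block] = data   # acknowledged before reaching the disk
        else:
            self.disk[block] = data    # acknowledged only once it is on disk
        return True                    # the host is told the write completed

    def flush(self) -> None:
        self.disk.update(self.cache)
        self.cache.clear()

    def power_failure(self) -> None:
        self.cache.clear()             # unflushed contents are gone


wb = CachedController(write_back=True)
wt = CachedController(write_back=False)
for controller in (wb, wt):
    controller.write(1, b"redo record")   # both report success to the host
    controller.power_failure()
print(1 in wb.disk, 1 in wt.disk)   # False True
```

Both controllers acknowledged the write, but only the write-through controller still holds the redo record after the power failure.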

RAID 5 configurations should be preferred where the read patterns are random and are not very bulky in nature. This is because the spindles in a RAID 5 volume work in an independent fashion. For example, all the disks in a given volume can potentially service multiple I/O requests from different disk locations. RAID 5 should be considered for data warehouse (DW), data mining (DM), and operational data store (ODS) applications where data refreshes occur in the off-hours.

2.7.2 Stripe and mirror everything

The "stripe and mirror everything" concept for disk configuration is more commonly known as SAME. This model is based on two key proposals: (1) stripe all files across all disks using a 1 megabyte stripe width, and (2) mirror data for high availability.


Striping all files across all disks ensures that the full bandwidth of all the disks is available for any operation. This equalizes load across disk drives and eliminates hot spots. The recommendation of a 1 megabyte stripe size is based on the transfer rates and throughput of modern disks. If the stripe size is very small, more time is spent positioning the disk head over the data than actually transferring it. Based on internal studies, we have determined that a size of 1 megabyte achieves reasonably good throughput, while anything smaller does not provide adequate throughput.
Mirroring data at the storage subsystem level is the best way to avoid data loss. The only way to lose data in a mirrored environment is to lose multiple disks simultaneously. Given that current disk drives are highly reliable, simultaneous multiple disk failure is a very low probability event.
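The trade-off between positioning time and transfer time can be sketched numerically. The 10 ms positioning time and 20 MB/s transfer rate below are assumed, illustrative figures, not measurements from the studies mentioned above, and the function name is hypothetical.

```python
def transfer_efficiency(stripe_bytes: int,
                        positioning_s: float = 0.010,
                        rate_bytes_per_s: float = 20 * 1024 * 1024) -> float:
    """Fraction of each I/O spent transferring data rather than positioning.

    Defaults are assumed values: 10 ms to position the head and a
    20 MB/s sustained transfer rate.
    """
    transfer_s = stripe_bytes / rate_bytes_per_s
    return transfer_s / (positioning_s + transfer_s)


for size_kb in (32, 64, 128, 1024):
    eff = transfer_efficiency(size_kb * 1024)
    print(f"{size_kb:>5} KB stripe: {eff:.0%} of the time spent transferring")
```

Under these assumptions a 64 KB stripe spends roughly 24% of each I/O moving data, while a 1 megabyte stripe spends roughly 83%, which illustrates why very small stripes waste most of the disk's bandwidth on positioning.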

The primary benefit of the SAME model is that it makes storage configuration very easy and suitable for all workloads. It works equally well for OLTP, data warehouse, and batch workloads. It eliminates I/O hot spots and maximizes bandwidth. Finally, the SAME model is not tied to any particular storage management solution on the market and can be implemented with the technology available today.
