6.6 RAID Recipes


I will provide four "disk array recipes" here, targeted toward the most common deployments. They are not meant to be silver bullets, but rather are intended to provide a starting point for your own efforts. We will then go over a case study to illustrate the principles we've discussed.

6.6.1 Attribute-Intensive Home Directories

Most general-purpose home directory activity fits into this category.

This access pattern exhibits almost entirely random activity. When files are small, access to data is mostly determined by retrieval of directory entries, inode data, and the like; these lookups require a significant amount of seeking. Activity is generally almost entirely reads; writes are relatively rare.

Because of the random workload, bus utilization is probably not going to be a significant issue. Your goal is to provide as many disk drives as possible: the strategy should tend towards large numbers of small disks on a moderate number of controllers. Try for about 10 fully active disks per 40 MB/second SCSI chain.

If your application mix avoids high throughput situations (writes in particular), then RAID 5 is an excellent choice. If you require faster writes and can absorb the extra cost, RAID 0+1 is preferable.
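
As a rough sketch of the RAID 5 case, a software RAID 5 volume spread over many small disks on two controllers might be created with Solstice DiskSuite as follows; the metadevice name and disk slices are hypothetical, and a hardware array controller would be configured through its own administrative tools instead:

    # Hypothetical seven-disk RAID 5 metadevice (d50), with the member
    # slices split across controllers c1 and c2.
    metainit d50 -r c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0 \
                 c2t0d0s0 c2t1d0s0 c2t2d0s0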

Some filesystem tuning is going to be required. In general, adjust the free space reserve to 1%, and configure the filesystem with one inode per 16 KB. Tuning the cluster size will also be necessary (see Section 5.4.2.2).
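
As a sketch of those two adjustments on Solaris (the raw device name is hypothetical; cluster-size tuning is handled separately via maxcontig):

    # -m 1 sets the free space reserve to 1%; -i 16384 allocates one
    # inode per 16 KB of data space.
    newfs -m 1 -i 16384 /dev/rdsk/c1t0d0s6

    # The free space reserve (though not the inode density) can also be
    # changed on an existing filesystem:
    tunefs -m 1 /dev/rdsk/c1t0d0s6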

If the home directories are to be shared via NFS, write performance becomes more of an issue, and NVRAM or other nonvolatile storage is an extremely good idea. A ballpark figure is to configure three drives per concurrently active NFS client. Logging is also an excellent idea; if possible, use Veritas or Solstice DiskSuite's metatrans logging feature over the integrated filesystem logging mechanisms.
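
A minimal DiskSuite metatrans sketch, assuming d60 is the metadevice holding the home directories and d61 is a small dedicated log device; all names are hypothetical:

    # Create a trans metadevice (d62) that logs UFS metadata updates for
    # the master device d60 onto the log device d61.
    metainit d62 -t d60 d61

    # Build and mount the filesystem on the trans device rather than on
    # the underlying master.
    newfs /dev/md/rdsk/d62
    mount /dev/md/dsk/d62 /export/home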

6.6.2 Data-Intensive Home Directories

Data-intensive home directories present a different problem; they are more likely to have a sequential workload that's not so heavily biased toward read operations.

In this case, the approach I've found to be fruitful is to stripe as many of the fastest drives (i.e., "highest internal transfer rate") as possible across a fairly large number of controllers. Try not to configure more than four disks per controller.

One notable problem is that multiple clients accessing a single disk can create a significant bottleneck. For this reason, it is probably a good idea to create multiple small disk arrays (say, one per two populated disk controllers -- use RAID 0+1 if possible, but RAID 5 is also acceptable if you have significant amounts of NVRAM for write caching) and concatenate these devices. Try to configure one RAID device for every three to five active clients.
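
As a sketch of one such small array, here is a DiskSuite mirror of two stripes (its expression of RAID 0+1), with each stripe split across two controllers; all device names and the 64 KB interlace are illustrative:

    # Two 2-way stripes, each spanning controllers c1 and c2.
    metainit d31 1 2 c1t0d0s0 c2t0d0s0 -i 64k
    metainit d32 1 2 c1t1d0s0 c2t1d0s0 -i 64k

    # Mirror the two stripes; d30 is the resulting RAID 0+1 volume.
    metainit d30 -m d31
    metattach d30 d32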

Logging presents a more significant tradeoff here. If you do use log disks, try creating a small RAID 0+1 volume and using it as the log device. If the workspaces are to be accessed via NFS, recall that each client connected via Fast Ethernet can demand up to 10 MB/second of activity.

6.6.3 High Performance Computing

The high performance computing environment is generally characterized by a roughly even mix of reads and writes,[8] sequential access patterns, and a desire for performance above all else -- the data can be easily regenerated.

[8] It's application-dependent, but most scientific applications I've worked with are either almost entirely read-based or split about evenly between reads and writes.

The core idea is to move I/O loads that require high performance off the home directories and onto a separate "workspace" area. In general, this disk subsystem should be configured with a large number of fast (40 MB/second) disk controllers, each with two or three of the fastest disks you can find. The key is not to overpopulate controllers with disks: the sequential workload can easily overload the bus.

Filesystem throttling due to the UFS write throttles (see Section 5.1.4.1) is something to watch out for, as is throttling due to the virtual memory subsystem. Upgrade to Solaris 8 if possible, and turn on priority paging if not. Much of this is not tunable in Linux, but try to keep the kernel revision current, in order to take advantage of ongoing virtual memory subsystem advances.
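
On Solaris 7 (and suitably patched 2.6), priority paging is enabled from /etc/system; the UFS write-throttle high and low watermarks live there as well. The watermark values below are purely illustrative, not recommendations:

    * /etc/system fragment -- enable priority paging (not needed on Solaris 8)
    set priority_paging = 1

    * Raise the UFS write-throttle watermarks (illustrative values, in bytes)
    set ufs:ufs_HW = 16777216
    set ufs:ufs_LW = 8388608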

Logging may well present a significant performance problem. Since the data can be regenerated quickly, and a reboot will necessitate restarting the computational job anyway, neither logging nor redundancy buys much here: the typical approach is to create as many RAID 0 volumes as possible, aiming for about six disks per array, and concatenate them.
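
One way to express this with DiskSuite is a single metadevice built as a concatenation of stripes; the sketch below concatenates two six-wide RAID 0 stripes (all names hypothetical, interlace chosen arbitrarily):

    # d40: two six-disk stripes, concatenated into one workspace volume.
    metainit d40 2 \
        6 c1t0d0s0 c2t0d0s0 c3t0d0s0 c4t0d0s0 c5t0d0s0 c6t0d0s0 -i 128k \
        6 c1t1d0s0 c2t1d0s0 c3t1d0s0 c4t1d0s0 c5t1d0s0 c6t1d0s0 -i 128k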

6.6.4 Databases

Configuring disk arrays for databases opens up a whole new set of considerations. Read operations may be done in small, random blocks, such as index lookups, or in larger, sequential chunks, such as when an entire table is scanned. Writes are normally small and synchronous. Oracle, for example, defaults to a 2 KB block size. This keeps the disk service time low for small, random operations, but many blocks may be requested simultaneously for large table reads.

Configuring a fairly large amount of nonvolatile memory in the RAID controller is an excellent idea for two reasons: first, the writes are generally synchronous and in the "critical path" for user responsiveness; second, large writes often occur as a long stream of small writes, which are amenable to coalescing.

Virtual memory contention can be a serious problem. Using the directio option for UFS filesystems (see Section 5.4.2.6) is a good idea, since most database software caches heavily internally. As usual, keeping up to date with Solaris and Linux core software releases is an excellent idea. Enabling priority paging or migrating to Solaris 8 will be very useful.
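
Direct I/O is requested at mount time; a sketch with a hypothetical metadevice and mount point:

    # Bypass the page cache for this UFS filesystem; the database's own
    # buffer cache does the caching instead.
    mount -F ufs -o forcedirectio /dev/md/dsk/d30 /u01

    # The equivalent /etc/vfstab entry:
    #   /dev/md/dsk/d30  /dev/md/rdsk/d30  /u01  ufs  2  yes  forcedirectio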

While data integrity is paramount, some parts of the database can be regenerated on the fly. For this reason, it may make sense to split the database into two parts:

  • Temporary tablespaces and indexes can reside on large, unsafe, fast stretches of disk, implemented as concatenated RAID 0 arrays. Aim for six to eight fast disks per 40 MB/second controller.

  • Areas where reliability is paramount are usually read-mostly. These can safely be placed on RAID 5 volumes, or on RAID 0+1 if cost is a secondary concern to performance.

6.6.5 Case Study: Applications Doing Large I/O

Let's consider a transaction processing scenario. The application workload consists of a few processes that take in information over the network, perform some in-memory work on it, and then push 16 KB blocks of data out to disk. The users are complaining of poor throughput from the disk subsystem.

In this case, we would expect to see iostat -xtc 1 indicating sustained disk activity, and we know that the applications are doing large disk operations (in this case, 16 KB). In order to maximize disk throughput, we want to send as big a chunk of data as possible to each disk. Since a disk can typically sustain about 150 physical I/O operations per second and about 15 MB per second, as confirmed by some empirical observations, it follows that we should try to send each disk about 100 KB per operation (15 MB per second divided by 150 operations per second). We can accomplish this by doing three things:

  1. When we are designing our volume, we want to specify the interlace size (recall that this is the contiguous chunk given to one disk before switching to the next one). We will pick something in the range of 64 KB-128 KB.

  2. When we build the filesystem, we need to place large chunks of each file on contiguous block numbers; this behavior is controlled by the maxcontig parameter (settable via newfs's -C switch when building a new filesystem, or via tunefs -a on an existing one). A good value for maxcontig here is 128 (specified in units of 8 KB blocks), which tells the filesystem to lay out files in runs of 1 MB of contiguous data.

  3. Eventually, when we are using the filesystem, the kernel will need to flush data to disk. It does this by looking for modified pages that are contiguously allocated in the filesystem and pushing them out in one large, clustered I/O; the largest I/O it can issue is specified by maxphys, so we'll set that to 1,048,576. (A sketch combining all three settings follows this list.)
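
Here is a sketch that pulls the three settings together, using DiskSuite syntax and hypothetical device names; a Veritas volume would be built with different commands but the same interlace:

    # 1. A six-disk stripe with a 128 KB interlace, one disk per controller.
    metainit d70 1 6 c1t0d0s0 c2t0d0s0 c3t0d0s0 \
                     c4t0d0s0 c5t0d0s0 c6t0d0s0 -i 128k

    # 2. Build the filesystem with maxcontig = 128 (128 x 8 KB = 1 MB clusters),
    #    or adjust an existing filesystem in place.
    newfs -C 128 /dev/md/rdsk/d70
    tunefs -a 128 /dev/md/rdsk/d70

    # 3. Let the kernel issue 1 MB physical I/Os: add the setting to
    #    /etc/system and reboot.
    echo "set maxphys = 1048576" >> /etc/system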

We should probably also keep an eye on the UFS write throttle, in case that turns out to be a limiting factor in performance.


