Most of the time, performance issues are related to I/O. However, assuming that a given performance problem is I/O-based grossly oversimplifies the problem. With any filesystem I/O, middle-layer components such as the volume manager, the volume manager's striping, the filesystem, or a multipath I/O driver consume resources and may themselves be the source of I/O contention. When troubleshooting a performance problem, always try to simplify the problem by removing as many middle layers as possible. For example, if a particular filesystem is slow, focus your attention first on the performance of the disk's block or character device before considering the volume manager and filesystem performance.
Breaking a volume down into its simplest form, the physical device (LUN) or lvol beneath it, is required when preparing to run any performance test or locate a performance concern. In this section, we test the raw speed of a storage device by bypassing the filesystem and volume management layers. We bypass as many layers as possible by using a raw device, better known as a character device. A character device must be bound to a block device through the raw command. Here, "raw" means direct access to the block device that bypasses the kernel's block buffer cache. Our first test performs a simple sequential read of a Logical Unit Number (LUN), which resides on a set of spindles, through a single path after we bind the block device to the character device. We create a character device for the LUN because we want to test the speed of the disk, not the buffer cache.
Today's large arrays define a data storage device in many ways. However, the best description is Logical Device (LDEV). When an LDEV is presented to a host, the device changes names and is referred to as a Logical Unit Number (LUN).
The components used throughout this chapter for examples and scenarios include:
The tools for examining the hardware layout and adding and removing LUNs are discussed in Chapter 5, "Adding New Storage via SAN with Reference to PCMCIA and USB." Performance tools were fully discussed in Chapter 3, "Performance Tools," and are used in examples but not explained in detail in this chapter. As stated previously, this chapter's focus is strictly on performance through a system's I/O SCSI bus connected to SAN. Let's look at how to find and bind a block device to a character device using the raw command.
Binding a Raw Device to a Block Device Using the raw Command
The LUN, hereafter called disk, used throughout this example is /dev/sdj, also referred to as /dev/scsi/sdh6-0c0i0l2. Determine the capacity of the disk through the fdisk command:
atlorca2:~ # fdisk -l
Disk /dev/sdj: 250.2 GB, 250219069440 bytes
255 heads, 63 sectors/track, 30420 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot  Start    End      Blocks  Id  System
/dev/sdj1           1  30421  244354559+  ee  EFI GPT
Use lshw (an open source tool explained in more detail in Chapter 5) to show the device detail:
atlorca2:~ # lshw
~~Focus only on single disk test run~~~~
     *-disk:2
          description: SCSI Disk
          product: OPEN-V*4
          vendor: HP
          physical id: 0.0.2
          bus info: email@example.com:0.2
          logical name: /dev/sdj
          version: 2111
          size: 233GB
          capacity: 233GB
          capabilities: 5400rpm ### <- just the drivers attempt to guess the speed via the standard scsi lun interface... Take this at face value.
          configuration: ansiversion=2
Linux does not allow raw access to a storage device by default. To remedy this problem, bind the device's block device file to a /dev/raw/rawX character device file to enable I/O to bypass the host buffer cache and achieve a true measurement of device speed through the host's PCI bus (or other bus architecture). This binding can be done by using the raw command. Look at the block device for /dev/sdj:
atlorca2:~ # ls -al /dev/sdj
brw-rw----  1 root disk 8, 144 Jun 30  2004 /dev/sdj
Take note of the permissions set on the device file brw-rw----. The b means block device, which has a major number 8, which refers to the particular driver in control. The minor number 144 represents the device's location in the scan plus the partition number. Refer to man pages on sd for more info. Continuing with our example, the next step is to bind the /dev/sdj to a raw character device, as depicted in the following:
atlorca2:~ # raw /dev/raw/raw8 /dev/sdj
/dev/raw/raw8: bound to major 8, minor 144
Now, issue one of the following commands to view the binding parameters:
atlorca2:~ # raw -qa
/dev/raw/raw8: bound to major 8, minor 144
atlorca2:~ # raw -q /dev/raw/raw8
/dev/raw/raw8: bound to major 8, minor 144
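The major and minor numbers reported by raw can be cross-checked against the device name. The following is a minimal sketch, assuming the conventional Linux sd numbering under major 8 (16 minors per disk, with the first minor of each group being the whole disk). The function name sd_minor_to_name is hypothetical, and the sketch only covers minors 0 through 255, that is, sda through sdp.

```shell
# Hypothetical helper: map a minor number under major 8 to an sd device name.
# Assumes the conventional layout of 16 minors per disk (0-255 => sda..sdp).
sd_minor_to_name() {
    minor=$1
    disk_index=$((minor / 16))     # 0 => sda, 1 => sdb, ...
    partition=$((minor % 16))      # 0 => whole disk
    # Build the drive letter from its ASCII code (97 is 'a').
    letter=$(printf "\\$(printf '%03o' $((97 + disk_index)))")
    if [ "$partition" -eq 0 ]; then
        echo "sd${letter}"
    else
        echo "sd${letter}${partition}"
    fi
}

sd_minor_to_name 144    # prints sdj
```

Minor 144 divided by 16 is 9 with no remainder, so the binding points at the tenth disk as a whole device, which agrees with /dev/sdj.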
Raw Device Performance
Now that we have bound a block device to a character device, we can measure a read by bypassing the block device, which in turn bypasses the host buffer cache. Recall that our primary objective is to measure performance from the storage device, not from our host cache.
Our next step requires that we measure a sequential read and calculate time required for the predetermined data allotment. Throughout this chapter, our goal is to determine what factors dictate proper performance of a given device. We focus on average service time, reads per second, writes per second, read sectors per second, average request size, average queue size, and average wait time to evaluate performance. In addition, we discuss the I/Os per second with regard to payload "block" size. For now, we start with a simple sequential read to get our baseline.
Though a filesystem may reside on the device in question, as suggested by the earlier fdisk -l output, the filesystem must not be mounted during the test run. If it is mounted, raw access is denied. Proceed as illustrated in the next section.
Using the dd Command to Determine Sequential I/O Speed
The dd command provides a simple way to measure sequential I/O performance. The following shows a sequential read of 1GB (1024MB). There are 1024 1MB (1024KB) reads:
atlorca2:~ # time -p dd if=/dev/raw/raw8 of=/dev/null bs=1024k count=1024
1024+0 records in
1024+0 records out
real 6.77
user 0.00
sys 0.04
The megabytes per second can be calculated as follows:
1024MB / 6.77 sec = 151.26MBps
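The same quotient can be scripted so that it is easy to repeat for each timed run. This is a minimal sketch: the function name dd_mbps is hypothetical, and it assumes you pass the number of megabytes moved and the "real" seconds reported by time -p. The result is rounded to two decimal places.

```shell
# Hypothetical helper: throughput in MB per second for a timed dd run.
# $1 = megabytes transferred, $2 = elapsed ("real") seconds from time -p
dd_mbps() {
    LC_ALL=C awk -v mb="$1" -v secs="$2" \
        'BEGIN { if (secs > 0) printf "%.2f\n", mb / secs; else print "0.00" }'
}

dd_mbps 1024 6.77    # prints 151.26
```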
For those who are unfamiliar with high-speed enterprise servers and disk storage arrays, 151MBps may seem extremely fast. However, higher speeds can be achieved with proper striping and tuning of the filesystem across multiple spindles. Though we discuss some of those tuning options later, we first need to reduce the previous test to its simplest form. Let us begin with calculating MBps, proceeding with the blocking factors on each I/O frame and discussing service time for each round trip for a given I/O.
In the previous example, we saw 1024 I/Os, where each I/O is defined to have a boundary set to a block size of 1MB, thanks to the bs option on the dd command. Calculating MBps simply takes an arithmetic quotient of 1024MB/6.77 seconds, providing a speedy 151MB/sec. In our testing, cache on the array is clear, providing a nice 151MBps, which is not bad for a single LUN/LDEV on a single path. However, determining whether the bus was saturated and whether the service time for each I/O was within specifications are valid concerns. Each question requires more scrutiny.
Different arrays require special tools to confirm that cache within the array is flushed so that a true spindle read is measured. For example, HP's largest storage arrays can have well over 100GB of cache on the controller in which a read/write may be responding, thereby appearing to provide higher average reads/writes than the spindle can truly provide. Minimum cache space should be configured when running performance measurements with respect to design layout.
Using sar and iostat to Measure Disk Performance
Continuing with the dd command, we repeat the test but focus only on the data yielded by the sar command to depict service time and other traits, as per the following.
atlorca2:~ # sar -d 1 100
Linux 2.6.5-7.97-default (atlorca2)   05/09/05

14:19:23       DEV       tps   rd_sec/s  wr_sec/s
14:19:48  dev8-144      0.00       0.00      0.00
14:19:49  dev8-144      0.00       0.00      0.00
14:19:50  dev8-144    178.00  182272.00      0.00
14:19:51  dev8-144    303.00  311296.00      0.00
14:19:52  dev8-144    300.00  307200.00      0.00
14:19:53  dev8-144    303.00  309248.00      0.00
14:19:54  dev8-144    301.00  309248.00      0.00
14:19:55  dev8-144    303.00  311296.00      0.00
14:19:56  dev8-144    302.00  309248.00      0.00
This sar output shows that the total number of transfers per second (TPS) holds around 300. rd_sec/s measures the number of sectors read per second, and each sector is 512 bytes. Divide rd_sec/s by tps, and you have the number of sectors in each transfer. In this case, the average is 1024 sectors at 512 bytes each, which puts the average SCSI block size at 512KB. This is a very important discovery: although the dd command requests a block size of 1MB, the SCSI driver breaks each request into 512KB blocks, so two physical I/Os complete for every logical I/O requested. Different operating systems have this value hard coded in the SCSI driver at different block sizes, so be aware of this issue when troubleshooting.
As always, more than one way exists to capture I/O stats. In this case, iostat may suit your needs. This example uses iostat rather than sar to evaluate the dd run.
atlorca2:~ # iostat
Linux 2.6.5-7.97-default (atlorca2)   05/09/05

avg-cpu:  %user  %nice  %sys  %iowait  %idle
           0.00   0.01  0.02     0.08  99.89

Device:     tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdj        0.06       57.76        0.00  15222123        72
sdj        0.00        0.00        0.00         0         0
sdj        0.00        0.00        0.00         0         0
sdj       98.00   102400.00        0.00    102400         0
sdj      298.00   305152.00        0.00    305152         0
sdj      303.00   309248.00        0.00    309248         0
sdj      303.00   311296.00        0.00    311296         0
sdj      301.00   307200.00        0.00    307200         0
sdj      302.00   309248.00        0.00    309248         0
sdj      302.00   309248.00        0.00    309248         0
sdj      141.00   143360.00        0.00    143360         0
The average I/O size can be calculated from iostat by dividing the blocks read per second (Blk_read/s) by the transactions per second (tps) and converting the result to KB. In the previous example, 311296 Blk_read/s / 303 tps = 1027.38 blocks per transaction; 1027.38 blocks x 512 bytes/block = 526018 bytes; 526018 bytes / 1024 bytes/KB = 513.7KB avg.
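The same ratio works for both the sar columns (rd_sec/s and tps) and the iostat columns (Blk_read/s and tps). The following is a minimal sketch with a hypothetical function name, avg_io_kb, assuming 512-byte sectors as stated above.

```shell
# Hypothetical helper: average I/O size in KB.
# $1 = 512-byte sectors (blocks) read per second, $2 = transfers per second
avg_io_kb() {
    LC_ALL=C awk -v sectors="$1" -v tps="$2" \
        'BEGIN {
            if (tps > 0) printf "%.1f\n", (sectors / tps) * 512 / 1024
            else         print "0.0"      # idle interval, nothing to divide
        }'
}

avg_io_kb 311296 303    # prints 513.7
```

Feeding it an idle interval (tps of 0) returns 0.0 rather than failing on a division by zero.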
Before we explain the importance of the blocking size based on a given driver, let us demonstrate the same test results with a different block size. Again, we move 1GB of data through a raw character device using a much smaller block size. It is very important to understand that the exact same 1GB of data is being read by dd and written to /dev/null.
Understanding the Importance of I/O Block Size When Testing Performance
The I/O block size can impact performance. When the dd read block size is reduced from 1024k to 2k, each FCP payload shrinks to 2k, and the SCSI disk (sd) driver delivers only about 1/14 of the throughput in the test that follows. Additionally, the I/O rate increases dramatically as the block size of each request drops to that of the FCP limit. In the first example, the sd driver was blocking on 512k, which put the I/O rate around 300 per second. In the world of speed, 300 I/O per second is rather dismal; however, we must keep that number in perspective because we were moving large data files at over 100 MBps. Though the I/O rate was low, the MBps was enormous.
Most applications use an 8K block size. In the following demonstration, we use a 2K block size to illustrate the impact of I/O payload (I/O size).
atlorca2:~ # time -p dd if=/dev/raw/raw8 of=/dev/null bs=2k \
count=524288
524288+0 records in
524288+0 records out
real 95.98
user 0.29
sys 8.78
You can easily see that by simply changing the block size of a data stream from 1024k to 2k, the time it takes to move large amounts of data changes drastically. The time to transfer 1GB of data has increased about 14 times, from less than 7 seconds to almost 96 seconds, which should highlight the importance of block size to any bean counter.
We can use sar to determine the average I/O size (payload).
atlorca2:~ # sar -d 1 100| grep dev8-144
14:46:50  dev8-144   5458.00  21832.00  0.00
14:46:51  dev8-144   5478.00  21912.00  0.00
14:46:52  dev8-144   5446.00  21784.00  0.00
14:46:53  dev8-144   5445.00  21780.00  0.00
14:46:54  dev8-144   5464.00  21856.00  0.00
14:46:55  dev8-144   5475.00  21900.00  0.00
14:46:56  dev8-144   5481.00  21924.00  0.00
14:46:57  dev8-144   5467.00  21868.00  0.00
From the sar output, we can determine that 21868 rd_sec/s transpires, while we incur a tps of 5467. The quotient of 21868/5467 provides four sectors in a transaction, which equates to 2048 bytes, or 2K. This calculation shows that we are moving much smaller chunks of data but at an extremely high I/O rate of 5500 I/O per second. Circumstances do exist where I/O rates are the sole concern, as with the access rates of a company's Web site. However, changing perspective from something as simple as a Web transaction to backing up the entire corporate database puts sharp focus on the fact that block size matters. Remember, backup utilities use large block I/O, usually 64k.
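The trade-off between I/O rate and payload can be checked with simple arithmetic: throughput is the I/O rate multiplied by the size of each I/O. The sketch below uses a hypothetical function name, iops_to_mbps, and takes the I/O size in KB.

```shell
# Hypothetical helper: throughput in MB per second from I/O rate and I/O size.
# $1 = I/Os per second (tps), $2 = size of each I/O in KB
iops_to_mbps() {
    LC_ALL=C awk -v iops="$1" -v kb="$2" \
        'BEGIN { printf "%.2f\n", iops * kb / 1024 }'
}

iops_to_mbps 5467 2      # 2K reads:   prints 10.68
iops_to_mbps 303 512     # 512K reads: prints 151.50
```

Roughly 5500 small I/Os per second move less than a tenth of the data that 300 large I/Os per second do, which is why the measurement of interest (I/O rate or MBps) has to be chosen before the test is run.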
With the understanding that small block I/O impedes large data movements, note that filesystem fragmentation and sparse file fragmentation can cause an application's request to be broken into very small I/O. In other words, even though a dd if=/filesystem/file_name of=/tmp/out_file bs=128k is requesting a read with 128k block I/O, sparse file or filesystem fragmentation can force the read to be broken into much smaller block sizes. So, as we continue to dive into performance troubleshooting throughout this chapter, always stay focused on the type of measurement needed: I/O, payload, or block size. In addition to considering I/O, payload, and block size, time is an important factor.
Importance of Time
Continuing with our example, we must focus on I/O round-trip time and bus saturation using the same performance test as earlier. In the next few examples, we use iostat to illustrate average wait time, service time, and percent of utilization of our test device.
The following iostat display is from the previous sequential read test but with block size set to 4096k, or 4MB, illustrating time usage and device saturation.
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=4096k &
atlorca2:~ # iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s \
        avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 308.00 0.00 319488.00 0.00 159744.00 \
    0.00 1037.30 4.48 14.49 3.24 99.80
sdj 0.00 0.00 311.00 0.00 319488.00 0.00 159744.00 \
    0.00 1027.29 4.53 14.66 3.21 99.70
In this iostat output, we see that the device utilization is pegged at 100%. When device utilization reaches 100%, device saturation has been achieved. This value indicates not only saturation but also the percentage of elapsed time during which I/O requests were issued to the device. In addition to the time lost to pending I/O waits, notice the time required for each I/O request: the service time (svctm) and the average wait time (await).
Service time, the time required for a request to be completed on any given device, holds around 3.2ms. Before we go into detail about all the items that must be completed within that 3.2ms, which are discussed later in this chapter, we need to recap the initial test parameters. Recall that the previous iostat data was collected while using the dd command with a block size of 4096k. Running the same test with block size set to 1024k yields nearly identical request sizes and sector counts in iostat, as you can see in this example:
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=1024k &
atlorca2:~ # iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s \
        avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 303.00 0.00 309248.00 0.00 154624.00 \
    0.00 1020.62 1.51 4.96 3.29 99.80
sdj 0.00 0.00 303.00 0.00 311296.00 0.00 155648.00 \
    0.00 1027.38 1.51 4.98 3.29 99.70
sdj 0.00 0.00 304.00 0.00 311296.00 0.00 155648.00 \
    0.00 1024.00 1.50 4.95 3.26 99.00
sdj 0.00 0.00 303.00 0.00 309248.00 0.00 154624.00 \
    0.00 1020.62 1.50 4.93 3.28 99.40
sdj 0.00 0.00 304.00 0.00 311296.00 0.00 155648.00 \
    0.00 1024.00 1.50 4.93 3.28 99.60
Determining Block Size
As we have illustrated earlier in this chapter, block size greatly impacts an application's overall performance. However, there are limits that must be understood concerning who has control over the I/O boundary. Every application has the capability to set its own I/O block request size, but the key is to understand the limits and where they are imposed. In Linux, excluding applications and filesystems, the sd driver blocks all I/O on the largest block size that the medium (such as SCSI LVD or FCP) allows. An I/O operation on the SCSI bus with a typical SCSI RAID controller (not passing through any other port drivers, such as a Qlogic or Emulex FCP HBA) holds around 128KB. In our case, through FCP, the largest block size is set to 512KB, as shown in the previous example when doing a raw sequential read access through dd. Other factors also have influence, as shown later in this chapter when additional middle layer drivers are installed for I/O manipulation.
To determine the maximum blocking factor, or max I/O size, of a request at the SD/FCP layer through a raw sequential read access, we must focus on the following items captured by the previous dd request in iostat examples.
The following example explains how to calculate block size.
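Device:     tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdj      303.00   311296.00        0.00    311296         0
As the output shows, the number of blocks read per second is 311296.00, and the number of transactions per second is 303.00. Dividing the first by the second gives the average number of blocks (sectors) per transaction.
311296 / 303 =~ 1027.38 sectors per transaction
=~ means approximation.
Recall that a sector has 512 bytes.
1027.38 sectors x 512 bytes/sector =~ 526018 bytes
Now convert the value to KB.
526018 bytes / 1024 bytes/KB =~ 513.7KB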
Another way to calculate the block size of an I/O request is to simply look at the avgrq-sz data from iostat. This field depicts the average number of sectors requested in a given I/O request, which in turn only needs to be multiplied by 512 bytes to yield the block I/O request size in bytes.
Now that we have demonstrated how to calculate the in-route block size on any given I/O request, we need to return to our previous discussion about round-trip time and follow up with queue length.
Importance of a Queue
Service time only includes the amount of time required for a device to complete the request given to it. It is important to keep an eye on svctm so that any latency with respect to the end device can be noted quickly and separated from the average wait time. The average wait time (await) is not only the amount of time required to service the I/O at the device but also the amount of wait time spent in the dispatch queue and the roundtrip time. It is important to keep track of both times because the difference between the two can help identify problems with the local host.
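The difference between the two values can be read off directly. The following is a minimal sketch, assuming the simplification that whatever part of await is not service time was spent queued on the host. The function name queue_wait_ms is hypothetical.

```shell
# Hypothetical helper: approximate time (ms) a request waits before the
# device starts servicing it, taken as await minus svctm.
# $1 = await (ms), $2 = svctm (ms), both as printed by iostat -x
queue_wait_ms() {
    LC_ALL=C awk -v await="$1" -v svctm="$2" \
        'BEGIN { printf "%.2f\n", await - svctm }'
}

queue_wait_ms 14.49 3.24    # 4096k run: prints 11.25
queue_wait_ms 4.96 3.29     # 1024k run: prints 1.67
```

In both runs the device needs a little over 3ms per request; the 4096k run simply leaves each request queued far longer before the device gets to it.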
To wrap things up with I/O time and queues, we need to touch on queue length. If you are familiar with C programming, you may find it useful to look at how these values are calculated. The following excerpt from the iostat source code shows how the average queue length and wait time are calculated.
nr_ios = sdev.rd_ios + sdev.wr_ios;
tput   = ((double) nr_ios) * HZ / itv;
util   = ((double) sdev.tot_ticks) / itv * HZ;
svctm  = tput ? util / tput : 0.0;
/*
 * kernel gives ticks already in milliseconds for all platforms
 * => no need for further scaling.
 */
await = nr_ios ?
    (sdev.rd_ticks + sdev.wr_ticks) / nr_ios : 0.0;
arqsz = nr_ios ?
    (sdev.rd_sectors + sdev.wr_sectors) / nr_ios : 0.0;

printf("%-10s", st_hdr_iodev_i->name);
if (strlen(st_hdr_iodev_i->name) > 10)
    printf("\n ");
/* rrq/s wrq/s r/s w/s rsec wsec rkB wkB \
   rqsz qusz await svctm %util */
printf(" %6.2f %6.2f %5.2f %5.2f %7.2f %7.2f %8.2f %8.2f \
%8.2f %8.2f %7.2f %6.2f %6.2f\n",
       ((double) sdev.rd_merges) / itv * HZ,
       ((double) sdev.wr_merges) / itv * HZ,
       ((double) sdev.rd_ios) / itv * HZ,
       ((double) sdev.wr_ios) / itv * HZ,
       ((double) sdev.rd_sectors) / itv * HZ,
       ((double) sdev.wr_sectors) / itv * HZ,
       ((double) sdev.rd_sectors) / itv * HZ / 2,
       ((double) sdev.wr_sectors) / itv * HZ / 2,
       arqsz,
       ((double) sdev.rq_ticks) / itv * HZ / 1000.0,
       await,
       /* The ticks output is biased to output 1000 ticks per second */
       svctm,
       /* Again: ticks in milliseconds */
       util / 10.0);
Though it is nice to understand the calculations behind every value provided in performance tools, the most important thing to recall is that a large number of outstanding I/O requests on any given bus is not desirable when faced with performance concerns.
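As a cross-check on the excerpt, svctm can be recomputed from two columns that iostat already prints. In the source, util is expressed in milliseconds of busy time per second (the %util column is util / 10.0) and tput is the number of I/Os per second, so svctm is approximately %util x 10 / (r/s + w/s). This is a minimal sketch with a hypothetical function name, svctm_ms.

```shell
# Hypothetical helper: recompute svctm (ms) from printed iostat -x columns.
# $1 = %util, $2 = r/s + w/s (total I/Os per second)
svctm_ms() {
    LC_ALL=C awk -v util="$1" -v iops="$2" \
        'BEGIN {
            if (iops > 0) printf "%.2f\n", util * 10 / iops
            else          print "0.00"    # mirrors "tput ? ... : 0.0"
        }'
}

svctm_ms 99.80 308    # 4096k run: prints 3.24
svctm_ms 99.80 303    # 1024k run: prints 3.29
```

Both results agree with the svctm column in the earlier iostat output, which confirms that service time here is derived from utilization and I/O rate rather than measured per request.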
In the following iostat example, we use an I/O request size of 2K, which results in low service time and queue length but high disk utilization.
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s \
        wkB/s avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 5492.00 0.00 21968.00 0.00 10984.00 \
    0.00 4.00 0.97 0.18 0.18 96.70
sdj 0.00 0.00 5467.00 0.00 21868.00 0.00 10934.00 \
    0.00 4.00 0.95 0.17 0.17 94.80
sdj 0.00 0.00 5413.00 0.00 21652.00 0.00 10826.00 \
    0.00 4.00 0.96 0.18 0.18 96.40
sdj 0.00 0.00 5453.00 0.00 21812.00 0.00 10906.00 \
    0.00 4.00 0.98 0.18 0.18 97.80
sdj 0.00 0.00 5440.00 0.00 21760.00 0.00 10880.00 \
    0.00 4.00 0.97 0.18 0.18 96.60
Notice how the %util remains high, while the request size falls to 4 sectors/(I/O), which equals our 2048-byte block size. In addition, the average queue size remains small, and wait time is negligible along with service time. Recall that wait time includes roundtrip time, as discussed previously. Now that we have low values for avgrq-sz, avgqu-sz, await, and svctm, we must decide whether we have a performance problem. In this example, the answer is both yes and no. Yes, the device is at its peak performance for a single thread data query, and no, the results for the fields typically focused on to find performance concerns are not high.
Multiple Threads (Processes) of I/O to a Disk
Now that we have covered the basics, let us address multiple simultaneous read requests to a device.
In the following example, we proceed with the same block size, 2K, as discussed previously; however, we spawn a total of six read threads to the given device to illustrate how service time, queue length, and wait time differ. Let's run six dd commands at the same time.
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
atlorca2:~ # dd if=/dev/raw/raw8 of=/dev/null bs=2k &
Note that the previous code can be performed in a simple for loop:
for I in 1 2 3 4 5 6
do
    dd if=/dev/raw/raw8 of=/dev/null bs=2k &
done
Let's use iostat again to look at the dd performance.
atlorca2:~ # iostat -t -d -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s \
        wkB/s avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 5070.00 0.00 20280.00 0.00 10140.00 \
    0.00 4.00 4.96 0.98 0.20 100.00
sdj 0.00 0.00 5097.00 0.00 20388.00 0.00 10194.00 \
    0.00 4.00 4.97 0.98 0.20 100.00
sdj 0.00 0.00 5103.00 0.00 20412.00 0.00 10206.00 \
    0.00 4.00 4.97 0.97 0.20 100.00
The queue length (avgqu-sz) is 4.97, while the max block request size holds constant. The service time for the device to act on the request remains at 0.20ms. However, the average wait time has increased to 0.98ms because the device must respond to multiple simultaneous I/O requests, which lengthens the round trip. Keep this example in mind when working with a large multithreaded performance problem: the device may be strained, and striping at a volume manager level across multiple devices can help relieve this type of strain.
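The queue length, I/O rate, and wait time are tied together: the average number of requests outstanding is roughly the I/O rate multiplied by the average wait time (Little's law). The sketch below is an approximation for sanity-checking iostat output, not the calculation iostat itself performs, and the function name est_queue is hypothetical.

```shell
# Hypothetical helper: estimate average queue length from I/O rate and await.
# $1 = r/s + w/s (I/Os per second), $2 = await in milliseconds
est_queue() {
    LC_ALL=C awk -v iops="$1" -v await="$2" \
        'BEGIN { printf "%.2f\n", iops * await / 1000 }'
}

est_queue 5070 0.98    # six 2k readers: prints 4.97
est_queue 303 4.96     # one 1024k reader: prints 1.50
```

Both estimates land within a few hundredths of the avgqu-sz values that iostat reported, so a queue length that grows while the I/O rate stays flat points directly at rising wait time.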
Using a Striped lvol to Reduce Disk I/O Strain
To illustrate the reduction of strain, let us create a VG and 4000MB lvol striped across two disks with a 16k stripe size.
atlorca2:/home/greg/sysstat-5.0.6 # pvcreate /dev/sdi
  No physical volume label read from /dev/sdi
  Physical volume "/dev/sdi" successfully created
atlorca2:/home/greg/sysstat-5.0.6 # pvcreate /dev/sdj
  No physical volume label read from /dev/sdj
  Physical volume "/dev/sdj" successfully created
atlorca2:/home/greg/sysstat-5.0.6 # vgcreate vg00 /dev/sdi /dev/sdj
  Volume group "vg00" successfully created
atlorca2:/home/greg/sysstat-5.0.6 # lvcreate -L 4000m -i 2 -I 16 -n \
lvol1 vg00
  Logical volume "lvol1" created
atlorca2:/home/greg/sysstat-5.0.6 # lvdisplay -v /dev/vg00/lvol1
    Using logical volume(s) on command line
  ------ Logical volume ------
  LV Name                /dev/vg00/lvol1
  VG Name                vg00
  LV UUID                UQB5AO-dp8Z-N0ce-Dbd9-9ZEs-ccB5-zG7fsF
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                3.91 GB
  Current LE             1000
  Segments               1
  Allocation             next free (default)
  Read ahead sectors     0
  Block device           253:0
We again use sequential 2k reads with dd to measure the performance of the disks.
atlorca2:/home/greg/sysstat-5.0.6 # raw /dev/raw/raw9 /dev/vg00/lvol1
/dev/raw/raw9: bound to major 253, minor 0
atlorca2:/home/greg/sysstat-5.0.6 # dd if=/dev/raw/raw9 of=/dev/null \
bs=2k &
atlorca2:/home/greg/sysstat-5.0.6 # dd if=/dev/raw/raw9 of=/dev/null \
bs=2k &
atlorca2:/home/greg/sysstat-5.0.6 # dd if=/dev/raw/raw9 of=/dev/null \
bs=2k &
atlorca2:/home/greg/sysstat-5.0.6 # dd if=/dev/raw/raw9 of=/dev/null \
bs=2k &
atlorca2:/home/greg/sysstat-5.0.6 # dd if=/dev/raw/raw9 of=/dev/null \
bs=2k &
atlorca2:/home/greg/sysstat-5.0.6 # dd if=/dev/raw/raw9 of=/dev/null \
bs=2k &
Note that the previous command can be performed in a simple for loop, as previously illustrated. Again we use iostat to measure disk throughput.
atlorca2:/home/greg # iostat -x 1 1000
avg-cpu:  %user  %nice  %sys  %iowait  %idle
           0.01   0.01  0.03     0.11  99.84

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s \
        wkB/s avgrq-sz avgqu-sz await svctm %util
sdi 0.00 0.00 2387.00 0.00 9532.00 0.00 4766.00 \
    0.00 3.99 3.04 1.28 0.42 100.00
sdj 0.00 0.00 2380.00 0.00 9536.00 0.00 4768.00 \
    0.00 4.01 2.92 1.22 0.42 100.00
sdi 0.00 0.00 2318.00 0.00 9288.00 0.00 4644.00 \
    0.00 4.01 3.14 1.35 0.43 99.70
sdj 0.00 0.00 2330.00 0.00 9304.00 0.00 4652.00 \
    0.00 3.99 2.82 1.21 0.43 99.50
Notice that the average wait time per I/O and the service time have increased slightly in this example. However, the average queue has been cut almost in half, as has the physical I/O demand on the device sdj. The result is similar to a seesaw effect: As one attribute drops, another rises. In the previous scenario, the LUN (sdj) is physically composed of multiple physical mechanisms in the array, called an array group, which remains hidden from the OS. By using the LVM strategy, we reduce some of the contention caused by one LUN or array group handling the entire load needed by the device (lvol). With the previous demonstration, you can see the advantages of striping, as well as its weaknesses. It seems true here that, for every action, there is an equal and opposite reaction.
Striped lvol Versus Single Disk Performance
In the following example, we compare a striped raw lvol to a raw single disk. Our objective is to watch the wait time remain almost constant, while the queue size is cut almost in half when using a lvol stripe instead of a single disk.
First let's look at performance using the lvol. In this example, we start six dd commands that perform sequential reads with block size set to 512k. The dd commands run at the same time and read a raw device bound to the lvol (as illustrated previously) with a 16k stripe size. Remember, iostat shows two disk devices for our lvol test because the lvol is striped across two disks. At 512KB, iostat yields values as follows:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s \
        wkB/s avgrq-sz avgqu-sz await svctm %util
sdi 0.00 0.00 152.00 0.00 156672.00 0.00 78336.00 \
    0.00 1030.74 3.00 19.60 6.58 100.00
sdj 0.00 0.00 153.00 0.00 155648.00 0.00 77824.00 \
    0.00 1017.31 2.99 19.69 6.54 100.00
sdi 0.00 0.00 154.00 0.00 157696.00 0.00 78848.00 \
    0.00 1024.00 2.98 19.43 6.49 100.00
sdj 0.00 0.00 154.00 0.00 157696.00 0.00 78848.00 \
    0.00 1024.00 3.01 19.42 6.49 100.00
Notice that the I/O queue length on each disk when reading lvol1 is much shorter than in the identical dd sequential read test on the raw disk sdj shown next. Though the identical blocking size of 512k is used, the service time on the raw disk is lower. Here are the results of the test with a raw disk.
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s \
        wkB/s avgrq-sz avgqu-sz await svctm %util
sdj 0.00 0.00 311.00 0.00 318464.00 0.00 159232.00 \
    0.00 1024.00 5.99 19.30 3.22 100.00
sdj 0.00 0.00 310.00 0.00 317440.00 0.00 158720.00 \
    0.00 1024.00 5.99 19.31 3.23 100.00
sdj 0.00 0.00 311.00 0.00 318464.00 0.00 159232.00 \
    0.00 1024.00 5.99 19.26 3.22 100.00
In the lvol1 test, the total read requests per second (r/s) remain about the same as on the raw disk: 152 on disk device sdi plus 153 on disk device sdj yields a net result of 305 read requests per second, compared with roughly 310 on the single raw disk. The lvol test also cuts the queue length on each device in half, from about 6 to about 3; however, the average wait time stays near 19ms, and we hurt the service time. The service time for the lvol test is about 6.5ms, whereas it is 3.2ms for the raw disk. Upon closer inspection, we notice that the service time is higher because of the way the I/O is issued to each device. In the lvol example, each disk receives a 512KB request only every other time (because we are striping and blocking our I/O both on 512k), so each I/O submitted to a device sits in a shorter queue, which reduces the time spent waiting to be serviced. In the single device example, the time spent waiting in the queue is high because every request waits on the same device to finish the I/O ahead of it, but with no striping overhead, the device services each request faster. This example illustrates the seesaw effect discussed previously. When a device (single LUN or lvol) is slammed in this way, the end user needs to address the application's need to perform such heavy I/O against a single device. In the previous example, tweaking the device or lvol buys no performance gain; it just moves the wait time to another field.
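Comparing runs by eye gets tedious when several devices and intervals are involved. The following is a minimal sketch that averages the key columns per device. It assumes each device sample is on one unwrapped line with the 14-column iostat -x layout shown in this chapter (r/s in column 4, avgqu-sz in column 11, await in column 12, svctm in column 13); other sysstat versions order their columns differently. The function name summarize_iostat is hypothetical.

```shell
# Hypothetical helper: per-device averages from iostat -x lines on stdin.
# Assumes the 14-column layout used in this chapter, one line per sample.
summarize_iostat() {
    LC_ALL=C awk '
        $1 != "Device:" && NF == 14 {
            n[$1]++                       # samples seen for this device
            rs[$1] += $4                  # r/s
            qu[$1] += $11                 # avgqu-sz
            aw[$1] += $12                 # await (ms)
            sv[$1] += $13                 # svctm (ms)
        }
        END {
            for (d in n)
                printf "%s r/s=%.1f avgqu-sz=%.2f await=%.1f svctm=%.1f\n",
                       d, rs[d] / n[d], qu[d] / n[d], aw[d] / n[d], sv[d] / n[d]
        }' | LC_ALL=C sort
}

# Samples from the striped lvol run, joined onto single lines.
summarize_iostat <<'EOF'
sdi 0.00 0.00 152.00 0.00 156672.00 0.00 78336.00 0.00 1030.74 3.00 19.60 6.58 100.00
sdj 0.00 0.00 153.00 0.00 155648.00 0.00 77824.00 0.00 1017.31 2.99 19.69 6.54 100.00
sdi 0.00 0.00 154.00 0.00 157696.00 0.00 78848.00 0.00 1024.00 2.98 19.43 6.49 100.00
sdj 0.00 0.00 154.00 0.00 157696.00 0.00 78848.00 0.00 1024.00 3.01 19.42 6.49 100.00
EOF
```

Running the same function over the raw disk samples makes the seesaw visible at a glance: the wait time is about the same, while queue length and service time trade places.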
With a wider stripe, some performance would be gained in the previous sequential I/O example, but it is unrealistic in the real world. In addition to adding more disks for a wider stripe, you could add more paths to the storage for multipath I/O. However, multipath I/O comes with its own list of constraints.
Many administrators have heard about load balance drivers, which allow disk access through multiple paths. However, very few multipath I/O drivers provide load balance behavior to I/Os across multiple HBA paths as found in enterprise UNIX environments. For example, device drivers such as MD, Autopath, Secure Path (spmgr), and Qlogic's secure path are dedicated primarily to providing an alternate path for a given disk. Though HP's Secpath does offer a true load balance policy for EVA HSG storage on Linux, all the other drivers mentioned only offer failover at this time.
The one true load balancing driver for Linux (HP's Secure Path) provides a round robin (RR) load balance scheduling policy for storage devices on EVA and HSG arrays. Unfortunately, just because a driver that provides load balancing, such as the HP Secure Path driver, exists does not mean support is available for your system. Support for array types is limited. Review your vendor's storage requirements and device driver's hardware support list before making any decisions about which driver to purchase. Keeping in mind that restrictions always exist, let's review a typical RR policy and its advantages and disadvantages.
Though we want to discuss load balancing, the vast majority of Linux enterprise environments today use static (also known as "manual") load balancing or preferred path. With this in mind, we keep the discussion of RR to a minimum.
In the next example, we proceed with a new host and new array that will allow the RR scheduling policy.
The following example illustrates RR through Secure Path on Linux connected through Qlogic HBAs to an EVA storage array. Due to configuration layout, we use a different host for this example.
[root@linny5 swsp]# uname -a
Linux linny5.cxo.hp.com 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST \
2004 i686 i686 i386 GNU/Linux
Our host has two HBAs, /proc/scsi/qla2300/0 and /proc/scsi/qla2300/1, with Secure Path version 3.0cFullUpdate-4.0.SP, shown next.
[root@linny5 /]# cat /proc/scsi/qla2300/0
QLogic PCI to Fibre Channel Host Adapter for QLA2340:
Firmware version: 3.03.01, Driver version 7.01.01
Entry address = f88dc060
HBA: QLA2312 , Serial# G8762
[root@linny5 swsp]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 3 (Taroon Update 3)
Continuing with our raw device testing, we must bind the block device to the character device.
[root@linny5 swsp]# raw /dev/raw/raw8 /dev/spdev/spd
Use the Secure Path command spmgr to display the product's configuration.
[root@linny5 swsp]# spmgr display
Server: linny5.cxo.hp.com   Report Created: Tue, May 10 19:10:04 2005
Command: spmgr display
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Storage: 5000-1FE1-5003-1280
Load Balance: Off   Auto-restore: Off
Path Verify: On   Verify Interval: 30
HBAs: 2300-0 2300-1
Controller: P66C5E1AAQ20AL, Operational
            P66C5E1AAQ20AD, Operational
Devices: spa spb spc spd
To reduce space needed for this example, a large part of the spmgr display has been truncated, and we focus only on device spd, as per the following:
TGT/LUN   Device   WWLUN_ID                                  #_Paths
0/ 3      spd      6005-08B4-0010-056A-0000-9000-0025-0000   4

Controller       Path_Instance     HBA      Preferred?   Path_Status
P66C5E1AAQ20AL                              YES
                 hsx_mod-0-0-0-4   2300-0   no           Available
                 hsx_mod-1-0-2-4   2300-1   no           Active

Controller       Path_Instance     HBA      Preferred?   Path_Status
P66C5E1AAQ20AD                              no
                 hsx_mod-0-0-1-4   2300-0   no           Standby
                 hsx_mod-1-0-1-4   2300-1   no           Standby
Notice that two HBAs and two controllers are displayed. In this case, the EVA storage controller P66C5E1AAQ20AL has been set to preferred active on this particular LUN, and both of that controller's N_Ports are connected to the fabric. In this configuration, each N_Port connects to a different fabric, A or B, which are seen by Qlogic HBAs 0 and 1. In addition, each HBA also sees the alternate controller in case a failure occurs on the selected preferred controller.
We should also mention that not all arrays are Active/Active on all paths for any LUN at any given time. In this case, the EVA storage array is an Active/Active array because both N_Ports on any given controller have the capability to service an I/O. However, any one LUN can only access a single N_Port at any moment, while another LUN can access the alternate port or alternate controller. Now that we have a background in EVA storage, we need to discuss how the worldwide name (WWN) of a given target device can be found. In the following illustration, we simply read the content of the device instance for the filter driver swsp.
[root@linny5 swsp]# cat /proc/scsi/swsp/2
swsp LUN information:
Array WWID: 50001FE150031280
Next, we initiate load balancing and start our raw device test, which is identical to the test performed earlier in this chapter.
[root@linny5 swsp]# spmgr set -b on 50001FE150031280
[root@linny5 swsp]# dd if=/dev/raw/raw8 of=/dev/null bs=512k
While this simple test runs, we collect iostat -x -d 1 100 measurements and show a few time captures.
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s \
wkB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 5008.00 0.00 319488.00 0.00 159744.00 \
0.00 63.80 9.81 1.96 0.19 93.00
sdd 0.00 0.00 4992.00 0.00 320512.00 0.00 160256.00 \
0.00 64.21 9.96 1.96 0.18 91.00
sdd 0.00 0.00 4992.00 0.00 318464.00 0.00 159232.00 \
0.00 63.79 9.80 2.00 0.18 92.00
sdd 0.00 0.00 4992.00 0.00 319488.00 0.00 159744.00 \
0.00 64.00 9.89 1.98 0.19 95.00
Notice that the blocking factor for a given I/O (the avgrq-sz column) has changed to 64 sectors per I/O, which equals a 32k block size (64 x 512 bytes) imposed by the swsp module. To get a good comparison between a sequential read test with RR enabled and one with RR disabled, we disable load balancing and rerun the same test in the following example.
[root@linny5 swsp]# spmgr set -b off 50001FE150031280
[root@linny5 swsp]# dd if=/dev/raw/raw8 of=/dev/null bs=512k
[root@linny5 swsp]# iostat -x 1 100
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s \
wkB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 4718.00 0.00 302080.00 0.00 151040.00 \
0.00 64.03 9.48 2.01 0.21 98.00
sdd 0.00 0.00 4710.00 0.00 302080.00 0.00 151040.00 \
0.00 64.14 9.61 2.02 0.20 94.00
sdd 0.00 0.00 4716.00 0.00 302080.00 0.00 151040.00 \
0.00 64.05 9.03 1.91 0.20 95.00
sdd 0.00 0.00 4710.00 0.00 301056.00 0.00 150528.00 \
0.00 63.92 8.23 1.76 0.20 96.00
Iostat reports that the block size remains constant at about 64 sectors, that throughput drops from roughly 160,000 kB/s to roughly 151,000 kB/s, and that the service time (svctm) for a given I/O is slightly higher, at about 0.20 ms instead of about 0.19 ms. The average wait time (await) stays close to 2 ms in both runs. The higher service time makes sense now that all I/O is on a single path: with more I/O loaded on a single path, each request takes slightly longer to service. Now that we have drawn a quick comparison between load balancing being enabled and disabled on a sequential read, we need to recap the advantages seen thus far.
In the previous example, the host measurements showed no obvious performance gain from enabling RR load balancing through spmgr. However, though the host's overall benefit from enabling load balancing was insignificant, the SAN load was cut in half. Keep in mind that a simple modification can impact the entire environment, even outside the host. Something as minor as having a static load balance with a volume manager stripe across multiple paths or having a filter driver automate the loading of paths can have a large impact on the overall scheme.
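To put numbers on that comparison, the iostat samples from the two runs can be averaged. The following is a minimal sketch that assumes each wrapped iostat sample has been joined back onto a single line, so that rkB/s, await, and svctm are fields 8, 12, and 13. The function name summarize is our own and is not part of iostat.

```shell
#!/bin/sh
# summarize: print mean rkB/s, await, and svctm for iostat -x sample lines
# read from standard input (one joined sample per line).
summarize() {
    awk '{ rkb += $8; await += $12; svctm += $13 }
         END { printf "%.0f %.3f %.4f\n", rkb / NR, await / NR, svctm / NR }'
}

# Samples captured with load balancing (RR) on.
rr_on=$(summarize <<'EOF'
sdd 0.00 0.00 5008.00 0.00 319488.00 0.00 159744.00 0.00 63.80 9.81 1.96 0.19 93.00
sdd 0.00 0.00 4992.00 0.00 320512.00 0.00 160256.00 0.00 64.21 9.96 1.96 0.18 91.00
sdd 0.00 0.00 4992.00 0.00 318464.00 0.00 159232.00 0.00 63.79 9.80 2.00 0.18 92.00
sdd 0.00 0.00 4992.00 0.00 319488.00 0.00 159744.00 0.00 64.00 9.89 1.98 0.19 95.00
EOF
)

# Samples captured with load balancing off.
rr_off=$(summarize <<'EOF'
sdd 0.00 0.00 4718.00 0.00 302080.00 0.00 151040.00 0.00 64.03 9.48 2.01 0.21 98.00
sdd 0.00 0.00 4710.00 0.00 302080.00 0.00 151040.00 0.00 64.14 9.61 2.02 0.20 94.00
sdd 0.00 0.00 4716.00 0.00 302080.00 0.00 151040.00 0.00 64.05 9.03 1.91 0.20 95.00
sdd 0.00 0.00 4710.00 0.00 301056.00 0.00 150528.00 0.00 63.92 8.23 1.76 0.20 96.00
EOF
)

echo "RR on  (rkB/s await svctm): $rr_on"
echo "RR off (rkB/s await svctm): $rr_off"

# Relative throughput difference between the two runs, in percent.
awk -v on="${rr_on%% *}" -v off="${rr_off%% *}" \
    'BEGIN { printf "RR on vs. off throughput: %+.1f%%\n", (on - off) / off * 100 }'
```

Over these four samples per run, the mean read throughput is 159744 kB/s with RR on and 150912 kB/s with RR off, a difference of about 6 percent. Four one-second samples are a small basis for any conclusion, so treat the result as an indication only.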
Finally, with respect to load balance drivers, it is important to watch for the maximum block size for any given transfer. As seen in the previous iostat examples, the Secure Path product reduces the block size to 32k, and if LVM were to be added on top of that, the block size would drop to 16k. Such small transfers are fine for I/O traffic that is dominated by small blocks, but for large data pulls, they can become a bottleneck.
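The arithmetic behind these block sizes is easy to check. The sketch below converts the avgrq-sz value reported by iostat (in 512-byte sectors) to kilobytes and counts how many device requests one 512k dd read would need at each request size. It assumes each read is split evenly into maximum-size requests. The function names are hypothetical.

```shell
#!/bin/sh
# sectors_to_kib SECTORS: convert a count of 512-byte sectors to KiB.
sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# requests_per_transfer TRANSFER_KIB MAX_REQUEST_KIB:
# number of device requests needed for one transfer, rounded up.
requests_per_transfer() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

echo "avgrq-sz of 64 sectors = $(sectors_to_kib 64)k"
echo "512k read at 32k per request: $(requests_per_transfer 512 32) requests"
echo "512k read at 16k per request: $(requests_per_transfer 512 16) requests"
```

Under these assumptions, halving the maximum request size from 32k to 16k doubles the number of requests for the same 512k read, from 16 to 32, which is why a small maximum block size can hurt large sequential transfers.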