8.7 Tuning a VxFS Filesystem

     

With JFS 3.3 (layout version 4), we get a new command: vxtunefs. In order for the vxtunefs command to have a significant impact on performance, it should be used in concert with all the other configuration settings we have established up until now: settings for volumes, filesystems, and even extent attributes for individual files. The result, we hope, will be significantly improved performance of the slowest device in our system, the humble disk drive.

With vxtunefs, we can tune the underlying IO characteristics of the filesystem to work in harmony with the other configuration settings we have so far established. Let's continue the discussion from where we left it with setting extent attributes for individual files. There, we set the extent size to match our stripe size, which in turn matches the IO size of our user application. It makes sense (to me) to match the IO size of the filesystem to these parameters as well, i.e., get the filesystem to perform IO in chunks of 16KB. The other question to ask is how many IOs the filesystem should perform at any one time. This leads to a question regarding our underlying disk technology. If our stripe set is configured on a JBOD connected via multiple interfaces, it would make sense (to me) to send an IO request that was a multiple of the number of disks in the stripe set. In our example in Figure 8-5, this would equate to three IOs. The IO subsystem is quite happy with this, because the disks are connected via multiple interfaces and hence the IOs will be spread across all interfaces. The disks are happy because individually they are performing one IO each. The filesystem is happy because it is performing multiple IOs and hence reading/writing bigger chunks of data to/from the disks (the filesystem may also be performing read-aheads, which can further increase the quantity of data being shipped each interval).

When we are talking about a RAID 5 layout, the only other consideration is when we perform a write. In those instances, we will want to issue a single write that constitutes one entire stripe of data. In this way, the RAID 5 subsystem receives a full stripe of data, from which it calculates the parity information and then issues writes to all the disks in the stripe set.
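As a rough illustration of that last point, here is a sketch of how we might express a full-stripe write preference with the vxtunefs parameters introduced below. The volume r5vol and its geometry of four data columns with a 16KB stripe unit are assumptions made purely for this example; they are not part of the configuration used elsewhere in this chapter.

 # Assumed RAID 5 volume: 4 data columns x 16KB stripe unit = 64KB full data stripe.
 # Aim the preferred write size at one full stripe and issue one write stream at a time.
 vxtunefs -s -o write_pref_io=65536 /dev/vx/dsk/ora1/r5vol
 vxtunefs -s -o write_nstream=1 /dev/vx/dsk/ora1/r5vol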

This discussion becomes a little more complex when we talk about disk arrays. Commonly, disk arrays, e.g., HP's XP, VA, and EVA disk arrays, will stripe data over a large number of disks at a hardware level. To the operating system, the disk is one device; the operating system does not perform any software striping and views the disk as contiguous. In this instance, we need to understand how the disk array has striped the data over its physical disks. If we know how many disks are used in a stripe set, we can view the disk array in a similar way to a JBOD with software RAID. The similarity ends there, because we very seldom talk directly to a disk drive on a disk array; we normally talk to an area of high-speed cache memory on the array. Data destined for an entire stripe will normally be contained in at least one page of cache memory. Ideally, we will want to perform IO in a multiple of the size of a single page of array cache, i.e., send the array an IO that will fill at least one page of cache memory. This is why we need to know how many disks the array is using in a stripe set, the stripe size, and the cache page size. On some arrays, data is striped over every available disk. In that case, our calculation is based on the cache page size, the IO capacity of the array interface, and other IO users of the array.

On top of all these considerations is the fact that the IO/array interfaces we are using commonly have a high IO bandwidth and can therefore sustain multiple IOs to multiple filesystem extents simultaneously.

With this information at hand, we can proceed to consider some of the filesystem parameters we can tune with vxtunefs . When run against a disk/volume, vxtunefs will display the current value for these parameters:

 

 root@hpeos003[] vxtunefs /dev/vg00/lvol3
 Filesystem i/o parameters for /
 read_pref_io = 65536
 read_nstream = 1
 read_unit_io = 65536
 write_pref_io = 65536
 write_nstream = 1
 write_unit_io = 65536
 pref_strength = 10
 buf_breakup_size = 131072
 discovered_direct_iosz = 262144
 max_direct_iosz = 524288
 default_indir_size = 8192
 qio_cache_enable = 0
 max_diskq = 1048576
 initial_extent_size = 8
 max_seqio_extent_size = 2048
 max_buf_data_size = 8192
 root@hpeos003[]

I have chosen to display the parameters for the root filesystem because I know these parameters will assume default values. VxFS can understand disk layout parameters if we use VxVM volumes. When we look at the parameters for the filesystem we have been using recently (the VxVM volume dbvol), we will see that VxFS has been suggesting some values for us:

 

 root@hpeos003[] vxtunefs /dev/vx/dsk/ora1/dbvol
 Filesystem i/o parameters for /db
 read_pref_io = 16384
 read_nstream = 12
 read_unit_io = 16384
 write_pref_io = 16384
 write_nstream = 12
 write_unit_io = 16384
 pref_strength = 20
 buf_breakup_size = 262144
 discovered_direct_iosz = 262144
 max_direct_iosz = 524288
 default_indir_size = 8192
 qio_cache_enable = 0
 max_diskq = 3145728
 initial_extent_size = 4
 max_seqio_extent_size = 2048
 max_buf_data_size = 8192
 root@hpeos003[]

The parameters that I am particularly interested in initially are read/write_pref_io and read/write_nstream.

  • read_pref_io: The preferred read request size. The filesystem uses this in conjunction with the read_nstream value to determine how much data to read ahead. The default value is 64K.

  • read_nstream: The number of parallel read requests of size read_pref_io to have outstanding at one time. The filesystem uses the product of read_nstream and read_pref_io to determine its read-ahead size. The default value for read_nstream is 1.

  • write_pref_io: The preferred write request size. The filesystem uses this in conjunction with the write_nstream value to determine how to do flush-behind on writes. The default value is 64K.

  • write_nstream: The number of parallel write requests of size write_pref_io to have outstanding at one time. The filesystem uses the product of write_nstream and write_pref_io to determine when to do flush-behind on writes. The default value for write_nstream is 1.

For these parameters to take effect, VxFS must detect access to contiguous filesystem blocks in order to "read ahead" or "write behind". With "read ahead", more blocks are read into the buffer cache than are actually needed by the application; the idea is to perform larger but fewer IOs. With "write behind", the idea is to reduce the number of writes to disk by holding more blocks in the buffer cache, because we are expecting subsequent updates to be to the next blocks in the filesystem. If IO is completely random in nature, or multiple processes/threads are randomizing the IO stream, these parameters will have no effect. In such a situation, we rely on the randomness of the IO being distributed, on average, over multiple disks in our stripe set. These parameters work only for the "standard" IO routines, i.e., read() and write(); memory-mapped files don't use read-ahead or write-behind. The key here is knowing the typical workload that your applications put on the IO subsystem.
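To make that arithmetic concrete, here is a small sketch of how the read-ahead and flush-behind windows fall out of the product of these two parameters, using the /db values shown above and the value of 6 we settle on later in this section:

 # Read-ahead window   = read_pref_io  x read_nstream
 # Flush-behind window = write_pref_io x write_nstream
 echo $((16384 * 12))   # 196608 bytes (192KB) with the values VxFS suggested for /db
 echo $((16384 * 6))    # 98304 bytes (96KB) once we drop nstream to 6 later on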

Thinking about the configuration of the underlying volume, we can understand why the preferred IO size has been set to 16KB; the underlying disk (a VxVM volume) has been configured with a 16KB stripe size:

 

 root@hpeos003[] vxprint -g ora1 -rtvs dbvol
 RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
 RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
 V  NAME         RVG          KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
 PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
 SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
 SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
 DC NAME         PARENTVOL    LOGVOL
 SP NAME         SNAPVOL      DCO

 dm ora_disk1    c0t4d0       simple   1024     71682048 FAILING
 dm ora_disk2    c4t8d0       simple   1024     71682048 -
 dm ora_disk3    c4t12d0      simple   1024     71682048 -

 v  dbvol        -            ENABLED  ACTIVE   10485760 SELECT    dbvol-01 fsgen
 pl dbvol-01     dbvol        ENABLED  ACTIVE   10485792 STRIPE    3/16     RW
 sd ora_disk1-01 dbvol-01     ora_disk1 0       3495264  0/0       c0t4d0   ENA
 sd ora_disk2-01 dbvol-01     ora_disk2 0       3495264  1/0       c4t8d0   ENA
 sd ora_disk3-01 dbvol-01     ora_disk3 0       3495264  2/0       c4t12d0  ENA
 root@hpeos003[]

The reason VxFS has chosen 12 read and write streams is that the Veritas software tries to understand the underlying disk technology. We are using Ultra Wide LVD SCSI disks, so Veritas has calculated that we can quite happily perform 12 x 16KB = 192KB of IO with no perceptible problems for the IO subsystem. We would need to consider all other volumes configured on these disks, as well as other disks on these interfaces, before commenting on whether these figures are appropriate. The best measure of this is to run our applications with a normal workload and measure the IO performance against an earlier benchmark. We can increase or decrease these values without unmounting the filesystem, although if performing performance tests, it would be advisable to remount the filesystem with the new parameter settings before running another benchmark.
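One way to gather such a measurement is to sample VxVM's own statistics while the application runs under a normal workload. This is only a sketch; the interval, count, and objects you monitor (and whether you reset counters first) will depend on your environment, so check vxstat(1M) before relying on it:

 # Reset the I/O counters for the disk group, run the workload, then sample dbvol.
 vxstat -g ora1 -r
 vxstat -g ora1 -i 10 -c 6 dbvol    # operations, blocks, and average times every 10 seconds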

Thinking back to the earlier discussion about sending multiple IOs to the filesystem, we can now equate that discussion to these tunable parameters. For a striped volume, we would set read_pref_io and write_pref_io to the stripe unit size, and set read_nstream and write_nstream to a multiple of the number of columns in our stripe set. For a RAID 5 volume, we would set write_pref_io to the size of an entire data stripe and set write_nstream to 1.
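Applied to the three-column, 16KB stripe set behind /db, that mapping might look something like the sketch below. These particular nstream values are purely illustrative (one IO per column); they are not the values we finally settle on later in the section:

 # Striped volume: preferred IO = stripe unit (16KB), nstream = a multiple of the
 # number of columns (here 1 x 3 = 3).
 vxtunefs -s -o read_pref_io=16384 /dev/vx/dsk/ora1/dbvol
 vxtunefs -s -o write_pref_io=16384 /dev/vx/dsk/ora1/dbvol
 vxtunefs -s -o read_nstream=3 /dev/vx/dsk/ora1/dbvol
 vxtunefs -s -o write_nstream=3 /dev/vx/dsk/ora1/dbvol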

For filesystems that are accessed sequentially, e.g., containing redo logs or video capture files, here are some other parameters we may consider:

  • discovered_direct_iosz : When an application is performing large IOs to data files, e.g., a video image, and the IO request is larger than discovered_direct_iosz, the IO is handled as a discovered direct IO. A discovered direct IO is unbuffered, i.e., it bypasses the buffer cache, in a similar way to synchronous IO. The difference with discovered direct IO is that when the file is extended or blocks are allocated, the inode does not need to be synchronously updated as well. This may seem a little crazy because the buffer cache is intended to improve IO performance by evening out small, irregular, and random IO for everyone. For IOs this large, however, the CPU time spent setting up buffers and pushing the IO through the buffer cache becomes more expensive than performing the IO itself, so it makes sense for the filesystem to perform an unbuffered IO. The trick is to know what size IO requests your applications are making. If you set this parameter too high, IOs smaller than this will still be buffered and cost expensive buffer-related CPU cycles. The default value is 256KB. (A short sketch following this list illustrates this parameter and the next two.)

  • max_seqio_extent_size : Increases or decreases the maximum size of an extent. When the filesystem is following its default allocation policy for sequential writes to a file, it allocates an initial extent that is large enough for the first write to the file. When additional extents are allocated, they are progressively larger (the algorithm tries to double the size of the file with each new extent), so each extent can hold several writes' worth of data. This reduces the total number of extents in anticipation of continued sequential writes. When there are no more writes to the file, unused space is freed for other files to use. In general, this allocation stops increasing the size of extents at 2048 filesystem blocks, which prevents one file from holding too much unused space. Remember, in an ideal world, we would establish an extent allocation policy for individual files. This parameter affects the allocation policy only for files without their own predefined (setext) allocation policy.

  • max_direct_iosz : If we look at the value that VxFS has set for our dbvol filesystem, it happens to correspond to the amount of memory in our system: 1GB. If we were to perform large IO requests to this filesystem (invoking direct IO if the request was larger than discovered_direct_iosz), VxFS would allow us to issue an IO request the same size as memory. This parameter sets the maximum size of a direct IO request issued by the filesystem; a larger IO request is broken up into max_direct_iosz chunks. In effect, it defines how much memory a single IO request can lock at once. In our case, the filesystem would allow us to issue a 1GB direct IO to this filesystem. It is suggested that we not set this parameter beyond 20 percent of memory, to prevent a single process from swamping memory with large direct IO requests. The parameter is set as a number of filesystem blocks. In my case, with 1GB of RAM and the 20 percent guideline (20% of 1GB = 104857.6 2KB filesystem blocks), I would set this parameter to 104448.

    The parameters buf_breakup_size , qio_cache_enable , pref_strength , read_unit_io , and write_unit_io are currently not supported on HP-UX.
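To make these parameters a little more concrete, here is a short sketch. The file name and the specific values are assumptions for illustration only; the max_direct_iosz arithmetic assumes 1GB of RAM and the 2KB block size used in this chapter, and reading 104448 as 51 x 2048 (the 20 percent figure rounded down to a whole multiple of 2048 blocks) is my interpretation, not a requirement of vxtunefs:

 # discovered_direct_iosz: 512KB requests exceed the 256KB default, so they are
 # handled as discovered direct IO; 64KB requests would still be buffered.
 dd if=/dev/zero of=/db/big_sequential.img bs=512k count=64

 # max_seqio_extent_size: for a filesystem holding only very large sequential files
 # (and no per-file setext policies), we might let extents grow beyond 2048 blocks.
 vxtunefs -s -o max_seqio_extent_size=4096 /dev/vx/dsk/ora1/dbvol

 # max_direct_iosz: 20% of 1GB of RAM, expressed in 2KB filesystem blocks.
 echo $((1024 * 1024 * 1024 / 5 / 2048))    # 104857 blocks; 104448 = 51 x 2048 sits just below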

When we have established which parameters we are interested in, we need to set them to the desired values now and ensure that they retain their values after a reboot. To set the values now, we use the vxtunefs command. If we wanted to set read_nstream to 6 instead of 12, because we realized there were significant other IO users on those disks, the command would look something like this:

 

 root@hpeos003[] vxtunefs -s -o read_nstream=6 /dev/vx/dsk/ora1/dbvol
 vxfs vxtunefs: Parameters successfully set for /db
 root@hpeos003[] vxtunefs -p /dev/vx/dsk/ora1/dbvol
 Filesystem i/o parameters for /db
 read_pref_io = 16384
 read_nstream = 6
 read_unit_io = 16384
 write_pref_io = 16384
 write_nstream = 12
 write_unit_io = 16384
 pref_strength = 20
 buf_breakup_size = 262144
 discovered_direct_iosz = 262144
 max_direct_iosz = 524288
 default_indir_size = 8192
 qio_cache_enable = 0
 max_diskq = 3145728
 initial_extent_size = 4
 max_seqio_extent_size = 2048
 max_buf_data_size = 8192
 root@hpeos003[]

We would continue in this vein until all parameters were set to appropriate values. To ensure that these values survive a reboot, we need to set up the file /etc/vx/tunefstab. This file does not exist by default. Here is how I set up the file for the parameters I want to set:

 

 root@hpeos003[] cat /etc/vx/tunefstab
 /dev/vx/dsk/ora1/dbvol  read_nstream=6
 /dev/vx/dsk/ora1/dbvol  write_nstream=6
 /dev/vx/dsk/ora1/dbvol  max_direct_iosz=104448
 root@hpeos003[]

I have set only three parameters for this one filesystem; I would list all parameters for all filesystems in this file. You can specify all options for a single filesystem on a single line, but personally I find it easier to read, manage, and understand with one parameter per line. Once we have set up the file, we can test it by using the vxtunefs -s -f command.

 

 root@hpeos003[] vxtunefs -s -f /etc/vx/tunefstab
 vxfs vxtunefs: Parameters successfully set for /db
 root@hpeos003[] vxtunefs /dev/vx/dsk/ora1/dbvol
 Filesystem i/o parameters for /db
 read_pref_io = 16384
 read_nstream = 6
 read_unit_io = 16384
 write_pref_io = 16384
 write_nstream = 6
 write_unit_io = 16384
 pref_strength = 20
 buf_breakup_size = 262144
 discovered_direct_iosz = 262144
 max_direct_iosz = 104448
 default_indir_size = 8192
 qio_cache_enable = 0
 max_diskq = 3145728
 initial_extent_size = 4
 max_seqio_extent_size = 2048
 max_buf_data_size = 8192
 root@hpeos003[]
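Incidentally, since all options for a single filesystem can also be placed on one line (as mentioned above), the equivalent single-line entry might look something like this. The comma separator is my assumption about the format; check tunefstab(4) on your system before using it:

 /dev/vx/dsk/ora1/dbvol  read_nstream=6,write_nstream=6,max_direct_iosz=104448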

The other parameters are explained in the man page for vxtunefs . Some of them may apply in particular situations. As always, you should test the performance of your configuration before, during, and after making any changes. If you are unsure of what to set these parameters to, you should leave them at their default values.

8.7.1 Additional mount options to affect IO performance

Now that we have looked at how we can attempt to improve IO performance through customizing intent logging, using extent attributes, and tuning the filesystem, we will look at additional mount options that can have an effect on performance, for good or bad. We discussed the use of mount options to affect the way the intent log is written. The intent log is used to monitor and manage structural information (metadata) relating to the filesystem; the mount options we are about to look at affect how the filesystem deals with user data. The normal behavior for applications is to issue read()s and write()s to data files. These IOs are asynchronous and are buffered in the buffer cache to improve overall system performance. Applications can use synchronous IO to ensure that data is written to disk before continuing. An example of synchronous IO is committing changes to a database file; the application wants to ensure that the data resides on disk before continuing. It should be noted that there are two forms of synchronous IO: full and data. Full synchronous IO (O_SYNC) is where data and inode changes are written to disk synchronously. Data synchronous IO (O_DSYNC) is where changes to data are written to disk synchronously but inode changes are written asynchronously. The mount options we are going to look at affect the use of the buffer cache (asynchronous IO) and the use of synchronous IO. We are not going to look at every option; that's what the man pages are for. We will look at some common uses.

8.7.2 Buffer cache related options (mincache=)

The biggest use of the mincache directive is mincache=direct. This causes all IO to be performed direct to disk, bypassing the buffer cache entirely. This may seem like a crazy idea because, for normal files, every IO would bypass the buffer cache and be performed synchronously (O_SYNC). Where this is very useful is for large RDBMS applications that buffer their data in large shared memory segments, e.g., Oracle's System Global Area. In this case, the RDBMS buffers the data and then issues a write(), which may or may not be buffered again by the filesystem. Where all the files in a filesystem are going to be managed by the RDBMS, that specific filesystem could be mounted with the mincache=direct directive. Be careful that no normal files exist in this filesystem, because IO to those files will be performed synchronously.
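As a sketch of what that might look like for the /db filesystem used in this chapter (the same combination of options appears in the /etc/fstab entry we add later in this section):

 # Mount the RDBMS-only filesystem with direct IO, bypassing the buffer cache.
 mount -F vxfs -o delaylog,nodatainlog,mincache=direct /dev/vx/dsk/ora1/dbvol /db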

Most other situations will use the default buffering provided by the filesystem. Where you want to improve the integrity of files on systems that are turned off by users, i.e., desktop HP-UX workstations, you may want to consider mincache=closesync, where a file is guaranteed to be written to disk when it is closed. This would mean that data could be lost only for files that were still open when the user hit the power button.

To make a filesystem go faster, there is the caching directive mincache=tmpcache . It should be used with extreme caution, because a file could contain junk after a system crash. Normally, when a user extends the size of a file, the user data is written to disk and then the inode is updated, ensuring that the inode points only to valid user data. If a system crashed before the inode was updated, the data would be lost and the inode would not be affected. With mincache=tmpcache , the inode is updated first, returning control to the calling function (the application can proceed), and then sometime later the data is written to disk. This could lead to an inode pointing to junk if the system crashes before the user data is written to disk. This is dangerous. Be careful.

8.7.3 Controlling synchronous IO (convosync=)

Straight away, I must state that changing the behavior of synchronous IO is exceptionally dangerous. If we think about applications that perform synchronous IO, they are doing it to ensure that data is written to disk, e.g., checkpoint and commit entries in a database. This information is important and needs to be on disk; that's why the application is using synchronous IO. Changing the use of synchronous IO (convosync = convert O_SYNC) means that you are effectively saying, "No, Mr. Application, your data isn't important; I'm going to decide when it gets written to disk." One option that might be considered to speed up an application is convosync=delay, where all synchronous IO is converted to asynchronous IO, negating any data integrity guarantees offered by synchronous IO. I think this is the maddest thing to do in a live production environment; where you want to make a filesystem perform at its absolute peak it might look attractive, but it comes at the cost of data integrity. For some people, convosync=closesync is a compromise, where synchronous and data synchronous IO is delayed until the file is closed. I won't go any further because all these options give me a seriously bad feeling. In my opinion, synchronous IO is sacrosanct; application developers use it only when they have to, and when they use it, they mean it. In live production environments, don't even think about messing with this!

8.7.4 Updating the /etc/fstab file

When we have decided which mount options to use, we need to ensure that we update the /etc/fstab file with these new mount options. I don't need to tell you the format of the /etc/fstab file, but I thought I would just remind you to make sure that you remount your filesystem should you change the current list of mount options:

 

 root@hpeos003[] mount -p
 /dev/vg00/lvol3          /          vxfs  log                    0 1
 /dev/vg00/lvol1          /stand     hfs   defaults               0 0
 /dev/vg00/lvol8          /var       vxfs  delaylog,nodatainlog   0 0
 /dev/vg00/lvol9          /var/mail  vxfs  delaylog,nodatainlog   0 0
 /dev/vg00/lvol7          /usr       vxfs  delaylog,nodatainlog   0 0
 /dev/vg00/lvol4          /tmp       vxfs  delaylog,nodatainlog   0 0
 /dev/vg00/lvol6          /opt       vxfs  delaylog,nodatainlog   0 0
 /dev/vx/dsk/ora1/logvol  /logdata   vxfs  log,nodatainlog        0 0
 /dev/vg00/library        /library   vxfs  log,nodatainlog        0 0
 /dev/vg00/lvol5          /home      vxfs  delaylog,nodatainlog   0 0
 /dev/vx/dsk/ora1/dbvol   /db        vxfs  log,nodatainlog        0 0
 root@hpeos003[]
 root@hpeos003[] vi /etc/fstab
 # System /etc/fstab file.  Static information about the file systems
 # See fstab(4) and sam(1M) for further details on configuring devices.
 /dev/vg00/lvol3 / vxfs delaylog 0 1
 /dev/vg00/lvol1 /stand hfs defaults 0 1
 /dev/vg00/lvol4 /tmp vxfs delaylog 0 2
 /dev/vg00/lvol5 /home vxfs delaylog 0 2
 /dev/vg00/lvol6 /opt vxfs delaylog 0 2
 /dev/vg00/lvol7 /usr vxfs delaylog 0 2
 /dev/vg00/lvol8 /var vxfs delaylog 0 2
 /dev/vg00/lvol9 /var/mail vxfs delaylog 0 2
 /dev/vx/dsk/ora1/dbvol /db vxfs delaylog,nodatainlog,mincache=direct 0 2
 /dev/vg00/library        /library   vxfs  defaults 0 2
 /dev/vx/dsk/ora1/logvol  /logdata   vxfs  defaults 0 2
 root@hpeos003[]
 root@hpeos003[] mount -o remount /db
 root@hpeos003[]


