11.9 Common Bottlenecks for Processes and Threads


Some people would argue that in an ideal world we would always have system resources to spare, in which case we would see these resources in an idle state for some of the time. Others would argue that having no idle time is a good thing as long as there are no outstanding requests that we can't satisfy. Similar issues surround the argument about what "good performance" means. I think the crux of the matter is, "Are we getting the job done quickly enough?" The arguments surrounding what "the job" is are another story altogether, as are the arguments surrounding what "quickly enough" means. It all depends on your perspective of the problem. Are you a senior manager who just wants to ensure that there are enough computing resources available to process all orders, sales, and invoices, for example? Or are you an IT performance specialist who needs to know how long the system spent executing kernel code as opposed to user code? I'll leave it up to you to decide which perspective you are taking today and also how you want to define good performance.

Bottlenecks occur because the system does not have enough of a particular resource. Processes will be blocked from running until the resource can be allocated. Consequently, queues form. If there are no bottlenecks, the average size of a queue will effectively be one; in other words, over a period of time, there is on average only one process requesting a resource, so the system is able to satisfy all other requests within that time period. There is effectively no queue. In such a situation, if we are not happy with the level of performance, we will have to invest in a bigger, faster system than the one we have because the current one has no queues, is operating at 100 percent efficiency, and we're still not happy. Bottlenecks occur for a number of reasons, including the following:

  • There aren't enough CPU cycles to perform all the tasks requested.

  • There isn't enough memory to accommodate all data.

  • Virtual memory runs out because we haven't configured enough swap space.

  • Disks cannot react fast enough to satisfy all IO requests.

  • The network gets congested because of the size and number of requests to use it.

Some bottlenecks are relatively simple to resolve if you can rearrange workloads to occur at non-critical times. This can be harder than it sounds; try telling your users not to log in until two o'clock in the morning. This is not a detailed book on performance and tuning. In the last few sections of this chapter, we have discussed aspects of the life cycle of a process, and I hope that discussion has given us some ideas as to what will affect the performance of a process. In this section, we mention various process and system metrics that hint at a potential bottleneck, the tools that allow us to identify such a bottleneck, and some possible solutions to those bottlenecks. Some solutions are hardware based, which can be expensive to implement. It is not for me to decide how you manage your company's finances, but simply to point out some potential solutions to common problems. You need to work out which of these solutions is best in the long term for your particular situation.

11.9.1 Common CPU bottlenecks

The symptoms of a CPU bottleneck are usually all too evident in that the telephone never stops ringing with users complaining, "Why is the system going so slow?" The fact that it might not be a CPU bottleneck causing the perceived problem is a point worth noting. I think we can categorize a CPU bottleneck as a situation in which a process can't continue executing simply because there aren't enough CPU cycles available to get the job done. This covers situations where we have a single number-crunching process on the system and the system can't execute instructions fast enough, as well as systems that have many thousands of processes each performing small tasks and the system can't context switch quickly enough between them without a perceptible delay in getting the job done. We need to consider three important metrics when trying to establish whether the CPU is a bottleneck:

  • CPU Utilization

  • The average size of the CPU Run Queue, known as the Load Average. This is the average number of threads/processes in the Run Queue measured over the last 1, 5, and 15 minutes. On HP-UX 11i, this does not include processes/threads that are waiting for IO to complete, i.e., high-priority sleepers. On previous versions of HP-UX, the Load Average included high-priority sleepers, which skewed the metric.

  • The time processes/threads were blocked on priority, which is known as the Priority Queue.

If we experience CPU Utilization of 100 percent, then the CPU is executing instructions 100 percent of the time. On its own, this is not an indication of a CPU bottleneck. We also need to consider the average size of the CPU Run Queue. The CPU Run Queue is the number of processes/threads that are runnable or running. If this average is running at 1, it simply means that on average there is only one process/thread in the CPU Run Queue. It may be that instantaneously there are LOTS of processes/threads in the Run Queue, but over a given time period the CPU is able to respond to all current requests, keeping the average at 1. The CPU Run Queue should be aggregated over the total number of CPUs in the system. Take an example where we have four processors in a system and four CPU-intensive threads. The CPU Run Queue would equal 1, and none of the threads would ever be blocked on priority, so the Priority Queue would effectively be 0. If each CPU was executing only user code and was operating at 100 percent CPU Utilization, could we say that the CPU was a bottleneck? We would have to ask, "Is the job getting done?" If the users were happy and the current workload was being satisfied, I would argue that we don't have a CPU bottleneck. It is always nice to have some idle CPU time because this indicates that we have capacity to grow into, but it is not always the case that given more CPU cycles a process will actually do any more useful work. I have been privy to applications that cause the CPU to operate at 100 percent by simply going into a tight loop whenever there is nothing useful to do.

Where the application is truly performing useful work and pushing the CPU to 100 percent utilization, the only thing we could do would be to buy bigger, faster processors and see if the job gets done any quicker. We can't reschedule any tasks to occur at a different time; there seems little point in buying more memory, because the CPU isn't executing system code to manage the virtual memory system; and there seems little point in reconfiguring our disks, because we are not executing system code to perform IO. In this case, we are restricted simply by the raw processing power of the processors. This is seldom the case. Commonly, the CPU is executing both user and system code. Essentially, we want our CPU to be executing user code for most of the time. If the CPU is not executing user code, it is executing system code, possibly on behalf of the process, e.g., performing IO to get data from a disk into memory. Executing system code is necessary in order to page-in data and instructions so that the process can continue executing. If there are enough CPU cycles available to perform all user and system requests, we might even be in the luxurious situation of seeing the CPU idle at times. If there aren't enough CPU cycles, the lack of CPU is compromising the process's/thread's ability to continue executing; the process/thread will be blocked on a particular resource or activity. This is commonly referred to as CPU saturation. If we can gain an understanding of why a process/thread was blocked, it will indicate where we should focus our energies in order to alleviate the problem.

To establish the current CPU Utilization, we can run various commands such as top and glance. The Load Average can be gained from commands such as uptime, w, top, vmstat, sar -q, and glance. The Priority Queue is a little trickier. The only source of information for the Priority Queue is glance, using the B command to get to the screen known as the Global Wait States. Take another look at our previous example of four CPU-intensive processes and four processors. If we introduce another four CPU-intensive processes into the system, how would that affect our CPU Utilization, Load Average/Run Queue, and Priority Queue? Table 11-5 shows these effects.
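As a quick command-line sketch of gathering this information (the exact column layout varies slightly between HP-UX releases, and the Priority Queue itself still needs glance):

    # Load Average over the last 1, 5, and 15 minutes
    uptime

    # Run Queue size (runq-sz) and its occupancy (%runocc), five 5-second samples
    sar -q 5 5

    # r column = runnable processes/threads; us/sy/id = CPU utilization split
    vmstat 5 5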

Table 11-5. Example of a CPU Bottleneck

  • CPU Utilization: 100%. The CPU is working flat out, either context switching between processes or executing user code.

  • Load Average/Run Queue: 2. We have eight processes and four processors. On average, there will be one running and one runnable process per CPU.

  • Priority Queue: 4. There are four processes that are runnable, but the only thing stopping them from gaining the processor is the fact that another process has a higher priority.


Do we have a CPU bottleneck? I think it is evident that in this situation we do. The crucial factor here is the Priority Queue. This gave us an insight into how many processes were blocked simply because we didn't have enough CPU cycles to fulfill all current requests. The combination of CPU Utilization, Load Average, and Priority Queue is a key indicator of a CPU bottleneck. A Priority Queue above 3 or 4 is commonly regarded as a reasonable indication of a CPU bottleneck.

Let's consider a real-life system, one where processes are not only executing user code but also performing the normal things a process does, i.e., IO to disk, communicating over the network, and requesting more memory. When trying to build up a profile of what the system is spending most of its time doing, we need to accumulate information on the main reasons why processes are being blocked. The Global Wait States screen in glance is an excellent starting point. When we considered the life cycle of a process, we realized that when a process is actually executing code, it is either executing in User mode (executing user code) or executing in System mode, whereby the kernel is running on behalf of the process in order to perform some system-level activity such as IO. Essentially, we want our process to be executing in User mode for the vast majority of the time. We realize that it will execute in System mode some of the time. The key is to get the balance more in User mode than in System mode. A typical ratio between User mode and System mode is 70 percent User mode and 30 percent System mode. It should be stressed that this is an average, but over time it has been seen to be a reasonable one. Beyond 30 percent System mode, we need to ask ourselves why the process is having to perform so many system-level functions. If we aren't executing user code, what are we doing?

This is where the Global Wait States screen becomes extremely useful. If we learn, for example, that on average a high proportion of processes were blocked on Disk IO, we could then look at which disks were particularly busy. From there, we could consider whether there was anything we could do to improve the overall performance of the disk subsystem, e.g., striping, mirroring, and so on. The Global Wait States screen allows us to build a profile of what our system does under normal workload conditions. Once we have established a profile for our system, we can compare it against current usage, which will highlight any anomalies. Some metrics are expressed as an absolute value. Absolute values are difficult to deal with because on their own they don't indicate anything. If a metric can be expressed as a percentage, it gives us an idea of proportionally how much of a resource is being used. From there, we need to be able to focus on individual processes and try to establish what they are spending their time doing. Per-process statistics are going to require a tool such as glance, using the s command for individual processes. For threads, we can look at all threads on the system with the G command, and from there we can look at individual threads with the S command. From this individual process/thread screen, we can look at the Wait States to try to decipher where the process/thread is spending most of its time. Ultimately, we can monitor some CPU-related metrics that can indicate whether the system and individual processes are operating less than optimally (see Table 11-6).
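A minimal sketch of checking the User/System/Idle balance from the command line (the dd command and /tmp/scratch file are simply a hypothetical workload to time):

    # %usr, %sys, %wio, and %idle in five 10-second samples
    sar -u 10 5

    # real/user/sys breakdown for a single command
    timex dd if=/dev/zero of=/tmp/scratch bs=64k count=1024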

Table 11-6. Other CPU Metrics to Monitor

  • % User CPU. Tools: top, glance, vmstat, sar -u, timex. Ideally, at or above 70%.

  • % System CPU. Tools: top, glance, vmstat, sar -u, timex. Ideally, at or below 30%.

  • % Idle. Tools: top, glance, vmstat, sar -u.

  • % Blocked on Priority. Tool: glance. At a system and per-process level, this indicates that the process doesn't have enough priority to be selected as the process to execute on the CPU.

  • Nice value. Tools: top, glance, ps. Default of 20. Background processes run at nice=24. A low nice value means a nasty process; a high nice value indicates a very nice process.

  • Priority. Tools: top, ps, glance. Can indicate real-time processes if < 128. Can also indicate whether a sleeping process is interruptible (priority 154-177) or not interruptible (priority 128-153).

  • Context switch rate. Tools: glance, vmstat, sar -w. A high rate can indicate that processes are not being allowed to execute because of other bottlenecks.

  • Interrupts. Tools: glance, vmstat. The time the kernel spends processing interrupts. This could indicate inordinately high disk usage.

  • System call rate. Tools: glance, vmstat, sar -c. There is little we can do about this except ask our application developers why they are using so many system calls. A system call causes a process context switch in order to store the context of the process while in user mode and free the registers in preparation to execute operating system code.

  • CPU switches. Tool: glance. If processes are moving from CPU to CPU, we may consider implementing processor affinity in an attempt to keep processes running on a small subset of processors. This will avoid having to reload the CPU cache and TLB every time a process moves to a different CPU. Maintaining cache coherency is an expensive use of CPU time and will be accounted as System time.
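A sketch of sampling the context switch, interrupt, and system call rates named above:

    # System call rate (scall/s) plus read/write/fork/exec breakdown
    sar -c 5 5

    # Swapping and switching activity; pswch/s is the process context switch rate
    sar -w 5 5

    # vmstat faults columns: in = interrupts, sy = system calls, cs = context switches
    vmstat 5 5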


11.9.1.1 RESOLVING CPU BOTTLENECKS

Essentially, we can employ either hardware or software solutions in order to resolve CPU bottlenecks. A hardware solution may be used to increase the number of available CPU cycles. A software solution will typically attempt to use the existing CPU cycles more efficiently or to influence which processes are given access to the CPU. Software solutions include rewriting applications to make use of features such as compiler optimizations and transforming a single-threaded application into a multithreaded one. Some solutions for resolving CPU bottlenecks are listed in Tables 11-7 and 11-8.

Table 11-7. Hardware Solutions to CPU Bottlenecks

  • Add an additional CPU: Provides more CPU cycles to the system as a whole. Where processes use IPC facilities such as semaphores, there may be little advantage in adding an additional CPU, because they will continue to be blocked on a semaphore. If the current system cannot accommodate an additional CPU, the upgrade will need to be to a multiprocessor system.

  • Upgrade to a faster CPU: A program with the same instruction stream will execute quicker on a faster processor, all other things being equal. While considered a brute-force approach, it is commonly used because it involves no application upgrades or code migration.

  • Upgrade to a CPU with larger/separate data and instruction caches: The cost of flushing cache lines can run to hundreds of instructions. While a cache line is being flushed, the process is blocked. Having a larger cache or separate data and instruction caches can resolve this.

  • Distribute applications to multiple systems: Moving applications to separate systems is a simple solution, but it's often overlooked. Current trends toward server consolidation go against this solution but don't detract from the fact that an application running on its own system will experience fewer CPU bottlenecks than one running on a shared system, all things being equal.

  • Remove a CPU: Where processes use IPC facilities such as semaphores, there may be little advantage in having an additional CPU, because they will continue to be blocked on a semaphore. In such a situation, there is no benefit to having more CPU cycles; in fact, one CPU may be idle most of the time. The additional processing required to relocate a process to a different CPU may be detrimental to overall performance.


Table 11-8. Software Solutions to CPU Bottlenecks

  • Use PRM and/or WLM: PRM and WLM can be used to control the allocation of CPU time to individual users, groups of users, and individual programs themselves.

  • Use nice values: The nice value can influence the priority of a process and hence influence when a process will next get access to the CPU.

  • Use real-time priorities: Although their use should be carefully considered, a real-time priority process can preempt every other process in the system.

  • Use Processor Affinity techniques: Use of mpctl() and processor sets can tie a process to a particular processor. This can avoid having to reload the cache on a different processor whenever a process is scheduled to run on a processor other than its original processor.

  • Use POPS: Variable page sizes can reduce the number of virtual address translations that the operating system needs to perform on behalf of a process. This can significantly reduce the amount of System CPU time allocated for individual processes and the system as a whole.

  • Modify the timeslice kernel parameter: Where applications are single CPU-bound processes, it may be appropriate to increase the timeslice parameter. This will reduce the number of forced context switches at the end of a timeslice, allowing the process to continue executing for longer without any interruptions. This can have a detrimental effect on online system responsiveness. Reducing the timeslice parameter will cause more context switches but may avoid one compute-bound process hogging CPU time.

  • Serialize processes: The serialize command will ensure that one process does not execute before another finishes. This avoids CPU saturation simply by virtue of fewer processes requiring CPU resources simultaneously.

  • Use a batch scheduler: This is similar to the serialize solution, where processes are scheduled to run at different times to avoid CPU saturation. While this is not applicable to online-session-based activities, overnight processing can often be scheduled to run within a batch scheduler environment.

  • Stop CPU-intensive processes: While it would be ill-advised to terminate a running process, we can send a process the SIGSTOP or SIGTSTP signal. This will take the process out of the Run Queue but not remove the process from the process table. When CPU saturation has passed, we can send the process the SIGCONT signal, whereby the process will continue from where it last left off.

  • Set the permissions on shared libraries to 555: Every time a shared library is accessed, a protection ID fault will occur to ensure that pages of the shared library have not changed. This can be expensive in System CPU time because there are limited registers where protection ID keys can be stored: four in a PA-RISC 1.X architecture and eight in a PA-RISC 2.X architecture. If the appropriate key is not found, the process context is searched to find the appropriate key, which is then loaded into the least used register. If the permissions of the shared library are set to 555, the protection ID key used is a public ID and does not cause a protection ID fault whenever a shared library page is accessed.

  • Recompile the application: This is not trivial. If it is possible to recompile the application, compiler optimizations can be used, and the compiler can take advantage of any new architectural features of a new processor, e.g., out-of-order execution and pipelining.

  • Rewrite the application: This can involve application profiling to improve individual functions, rewriting the application to use threads, and reorganizing data structures to ensure that the data is aligned with the size of cache lines. This solution also gives application developers the chance to review the use of expensive system calls such as fork() and IPC message queues.
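A few of the software solutions above can be sketched directly from the shell (the PID 1234, the script name, and the library path are hypothetical):

    # Start a background job with a friendlier (higher) nice value
    nice -n 10 ./batch_report.sh &

    # Make an existing process nicer by raising its nice value by 5
    renice -n 5 -p 1234

    # Temporarily take a CPU-intensive process out of the Run Queue, then resume it later
    kill -STOP 1234
    kill -CONT 1234

    # Give a shared library 555 permissions so accesses use the public protection ID
    chmod 555 /opt/app/lib/libapp.sl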


11.9.2 Common memory bottlenecks

Main memory is one of the scarcest resources in our system, surpassed possibly only by CPU cycles. While we can manage the use of CPU cycles using various scheduling techniques, there aren't comparable scheduling techniques for memory. A process/thread is either going to be allowed to use memory or it isn't. The biggest controlling factor over this will be the kernel parameters that control the size of various memory-resident objects such as user data segments, text segments, and shared objects such as shared memory segments. In Chapter 9, Swap and Dump Space, we discussed in detail when the virtual memory system will start freeing pages of memory in its belief that the amount of free memory has reached a critical threshold: the thresholds of lotsfree, gpgslim, desfree, and minfree. It is the act of starting to page-out that is the biggest indicator of a memory bottleneck. Once we start to page-out, we see effects on other system resources:

  • The vhand process will gain CPU time. Being a process with a high priority, it will preempt other processes on the system when it needs to run.

  • System CPU utilization will increase.

  • IO to disk will increase as pages are paged-out to disk.

  • Processes will start to be blocked on VM when pages they need have been de-staged to disk. This will require more IO to bring those pages back into main memory.

If we were to categorize the most commonly used memory metrics to indicate a memory bottleneck, they would include the following:

  • Page-out rate: This indicates that the virtual memory system has decided we don't have enough free memory and is trying to free pages that haven't been accessed recently. This is the job of the vhand process. The page-out rate can be viewed from vmstat (the po field) and from the glance memory report (a vmstat sketch follows Table 11-9). Per-process faults from memory and disk can be seen from the glance per-process screen.

  • Deactivations: If a process is deactivated, we have reached the desfree threshold and the swapper process is activated to find processes that can be removed entirely from memory. We don't swap out whole processes any more because this can cause serious problems for other system resources; the IO system, for example, would be put under intense pressure if it were asked to swap out a 100MB process. The swapper process will mark a process as being deactivated, which takes the process off the Run Queue. Since it can't run, the pages for a deactivated process will be aged by the vhand process. When the steal hand comes by, it can steal all the aged pages; they haven't been referenced, because the process can't run while it is deactivated. Initially, swapper will use an intelligent algorithm to choose processes to deactivate: non-interactive processes over interactive ones, sleeping processes over running ones, and processes that have been running longest over newer processes (see Table 11-9). To view the number of deactivations, we use the Memory Report from glance.

    Table 11-9. Other Memory Metrics to Monitor

    • Active virtual memory (avm). Tools: glance, vmstat, top. This is the amount of active virtual memory for the data and stack regions of processes that ran in the last 20 seconds.

    • Free Memory. Tools: glance, vmstat, top. If we know how much free memory we have, we can estimate whether we are going to reach any of the paging thresholds in the near future. If so, we may have to STOP any processes currently consuming large amounts of memory.

    • Number of VM reads and writes. Tool: glance (per-process statistic and a global metric in the Memory Report). This is the amount of reads from and writes to a swap device.

    • Resident Set Size (RSS) of a process. Tool: glance. This is the amount of memory currently in use by the process for code, data, and stack. This also includes a process's allocation of shared objects, based on how many attachments a process has to a particular object.

    • Virtual Set Size (VSS) of a process. Tools: glance, top. This includes all the RSS space as well as text and data not yet paged-in from the program and shared libraries, data from memory-mapped files not yet paged-in, and data that has been paged-out to disk.

    • Dynamic Buffer Cache size (dbc_max_pct). Tools: glance, kcweb. The maximum size of the dynamic buffer cache is limited by the kernel parameter dbc_max_pct. If we are performing lots of buffered IO to the file system, we will see an increase in the size of the buffer cache. Like other memory users, buffer cache pages will only be paged-out when we reach our normal paging thresholds, but this can cause us problems if a user inadvertently performs lots of IO to the filesystem, causing a dramatic rise in the size of the buffer cache. Significantly limiting the size of the dynamic buffer cache is a common solution to memory bottlenecks.
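To watch the page-out rate and free memory described above from the command line, a minimal sketch is:

    # avm = active virtual memory and free = free memory (both in pages);
    # a sustained nonzero po (page-outs per second) points at memory pressure
    vmstat 5 5

    # Cumulative counts of paging and swapping events since boot
    vmstat -s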


11.9.2.1 RESOLVING MEMORY BOTTLENECKS

As with CPU bottlenecks, we need to consider both hardware and software solutions to resolving memory bottlenecks. Hardware solutions are the easiest to put in place, but they can be the most expensive (see Tables 11-10 and 11-11).

Table 11-10. Hardware Solutions for Memory Bottlenecks

  • Add more physical memory: This is the simplest and still the most effective way to resolve a memory bottleneck.

  • Add more swap space: Having additional swap space will allow processes to continue to run even though there is a severe memory shortage. This should be a stopgap solution until we can afford more memory. Using interleaved swap areas of the same size on different disks can maximize the IO performance of the virtual memory system. Avoid filesystem swap areas if possible, and don't even think about remote, network-based swap areas.
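A sketch of checking the current swap layout before (and after) adding an area; the volume name lvswap2 is hypothetical:

    # Device, filesystem, reserve, and pseudo-swap usage in MB, with a totals line
    swapinfo -tam

    # Enable an additional device swap area that already exists as an LVM volume
    swapon /dev/vg00/lvswap2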


Table 11-11. Software Solutions for Memory Bottlenecks

  • Control the maximum size of process memory objects via kernel parameters: Controlling the size of kernel parameters such as maxdsiz, maxtsiz, shmmax, and so on is going to be controversial. If an application requires a large user data segment, it's going to be hard to argue against allowing it.

  • Use ulimit to limit the amount of memory that a process is allowed: If a process is using the POSIX shell, you could use the ulimit feature of the POSIX shell. This can limit the size of the user stack and data area up to the limits imposed by the kernel. Applications could be coded with the setrlimit() system call that underpins this solution.

  • Lock memory: Processes with the MLOCK privilege can lock themselves in memory. This can be a good thing in that they will never be paged-out, but also a bad thing because they will never be paged-out. You should consider which groups of users are allowed this privilege (see setprivgrp).

  • Limit the amount of lockable memory: While it might be a good idea for processes to lock themselves in memory, we should always have some memory that is unlockable so that the virtual memory system can always free up some pages. This is controlled by the kernel parameter unlockable_mem.

  • Reclaim unreferenced shared memory segments: A shared memory segment with NATTACH = 0 is a candidate to be reclaimed. If we are sure that the application has terminated (we can use the CPID and LPID to help), we can reclaim the shared memory segment using the ipcrm command.

  • Upgrade to a bigger bit-size operating system: A 32-bit operating system limits the address space of a process to 4GB. Memory partitioning (quadrants) imposes further limits on the amount of private and shared objects that a process can map into its address space. A 64-bit operating system gives a theoretical maximum address space of 16EB.

  • Carefully use program magic numbers to change the use of memory quadrants: We saw that altering the magic number of a program can alter the amount of private and shared objects that a process can map into its address space. This can be a good thing when we want to increase the amount of data that a process can utilize, but it can also be a bad thing when rogue processes utilize more memory than they really need.

  • Use batch scheduling of large processes: Using a batch scheduler or the serialize command will restrict when large processes can run. If fewer processes are running simultaneously, there are fewer chances for contention to occur.

  • Carefully monitor kernel configuration: Parameters that size in-core kernel tables can be expensive as far as memory is concerned. Thankfully, one of the worst culprits, nproc, is less of a problem now that the process table is no longer created statically at boot time. We should also monitor the use of device drivers and software subsystems; commonly, the kernel will include device drivers for devices/subsystems we don't currently use and never intend to use.

  • Use PRM to prioritize memory use: PRM can be used to limit the amount of memory that a user, group of users, or program can allocate. If the limit is reached, PRM can STOP all processes, or the largest process, from executing further. PRM can also be configured to allow part of a memory allocation to be leased out to other users/processes if limits are undersubscribed.

  • Turn off buffer cache use on a filesystem basis: VxFS allows us to turn off buffer cache usage on a filesystem basis. If fewer filesystems are using the buffer cache, we can potentially reduce the overall size of the buffer cache, increasing available memory for user processes.

  • Rewrite applications: Revisiting application code can sometimes reveal the inappropriate use of certain system features, such as message queues where it may be more appropriate to use semaphores. We might also be able to identify whether the use of malloc() to allocate memory is using appropriate values and is returning memory to the system in the best manner.
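Two of these solutions can be sketched directly from the POSIX shell (the limit values and the shared memory ID 4501 are hypothetical):

    # Cap the data segment and stack of processes started from this shell (values in KB)
    ulimit -d 262144    # data area limited to 256MB
    ulimit -s 32768     # stack limited to 32MB

    # List shared memory segments, then reclaim one with no attachments
    # once we are sure its owning application has terminated
    ipcs -ma
    ipcrm -m 4501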


11.9.3 Common disk bottlenecks

As we are aware, disks are the slowest devices in our system. This means that they are given a significant amount of administrative time and effort to make them perform at their best. This commonly requires an understanding of the IO characteristics of an application; the benefits of striping are best seen when the application performs sequential IO to the disk/filesystem. Although Logical IO gives us an indication of how many IOs a process might perform, it is more important to monitor Physical IO, because this is the IO that is causing the operating system to actually schedule a read/write from a disk(s).

The main metrics that hint at a disk IO bottleneck are the following:

  • Disk queue length: If the number of IOs per disk becomes too large, the disk will become saturated with IO requests. The overall number of IOs to a disk is not really relevant as long as the disk can respond quickly enough to these IO requests. If the time that an individual request waits in the disk queue (avwait) becomes greater than the time to service an individual request (avserv), the disk queue will grow and grow. It is a common perception that a disk queue length of more than 4 or 5 is indicative of a disk or controller that cannot react quickly enough to the outstanding requests. Disk queue lengths for individual disks can be monitored with glance (the u command) in the IO by Disk screen, and the time the queue length was between 0-2, 2-4, and 4-8 can be viewed from the disk detail screen (the S command from the IO by Disk screen). We can also monitor the disk queue length with sar -d (see the sar sketch following Table 11-12).

  • Processes blocked on Disk IO, IO, Buffer Cache, Inode: The Global Wait States can be viewed with glance (the B command) or on a per-process basis. If we suspect that processes are blocked on IO activity, it is crucial that we find out not only which physical disk this relates to but also any logical disk configuration associated with it. If we are using volumes, we can identify which volume is performing physical IO. We can then marry the volume configuration to the IO per physical disk to try to work out why a particular disk is busy; if a volume is striped but only one physical disk exhibits a large queue, the stripe unit size we are using for the volume and/or the number of disks in the stripe set are inappropriate for the IO patterns exhibited by the application. Table 11-12 lists other disk metrics to monitor.

    Table 11-12. Other Disk Metrics to Monitor

    • Disk utilization. Tools: glance, sar -d. If the disk is 50% busy, then queuing theory dictates that it takes twice as long to perform an IO as it would if the disk were idle. Sustained disk utilization over 50% may indicate that we need to look at the workload being placed on a particular disk. We need to ensure that the metric is dealing with actual time spent performing IO and not the utilization of the disk queue. If there is only one IO in the disk queue, then we don't have a problem; however, if the metric were measuring queue utilization, it would indicate 100% because there was at least one IO in the queue. This could be very misleading.

    • Average Wait (avwait) and Average Service (avserv) time. Tool: sar -d. If we have lots of IO requests in the disk queue, the amount of time it takes to service a request is crucial. If the disk cannot respond quickly enough (avserv), the average time spent waiting in the IO queue (avwait) will increase. If avwait stays below avserv, requests are spending less time waiting in the queue than they take to service, which indicates that the disk is still processing requests at a reasonable rate. If avwait exceeds avserv, the disk queue will continue to grow because the disk cannot service requests from the queue as quickly as they are arriving on the queue.

    • Buffer cache utilization. Tool: sar -b. A read cache hit ratio of < 90% or a write cache hit ratio of < 70% indicates that the buffer cache wasn't able to satisfy a particular request from the buffer cache. We need to investigate why. The causes can include processes performing large amounts of Physical IO instead of Logical IO, as well as the buffer cache being too small.

    • Physical IO rates. Tool: glance (disk IO report (d) and per-disk statistics (u)). Physical IO is the most expensive activity the kernel can undertake. The reasons for Physical IO can include applications performing synchronous writes, the buffer cache not operating optimally, the VM system paging-out to disk, access to raw volumes, and access to memory-mapped files. All these activities will accrue Physical IO time.

    • Virtual Memory reads and writes. Tools: vmstat, glance. If any page-outs occur, this can be an indication of a memory bottleneck. We should be aware of this and of the fact that page-outs will put additional pressure on an already busy system. Access to memory-mapped files is attributed to Virtual Memory statistics because the IO bypasses the buffer cache.

    • Filesystem IO reads/writes. Tool: glance (disk report (u)). The System IO rates monitored by glance can give you an idea of how much of the overall IO activity is attributed to filesystem activity.

    • Raw IO. Tool: glance (disk report (u)). If we have configured raw disks or volumes for our applications to use, they will use their own buffering techniques to get data to and from the disk. We can only hope that the buffering techniques are optimal for the application. The amount of Raw IO displayed by glance will give us an idea of how much overall IO activity can be attributed to this type of activity.

    • < 100% CPU utilization and memory bottlenecks. Tools: glance, top, vmstat. This is not one single metric but a collection of metrics that we have looked at previously. In essence, we have a system that has spare CPU cycles. Why isn't it using them? We aren't using them because we are waiting for IO to complete. We have ruled out Buffer Cache performance, but the system is showing symptoms of memory starvation. If we are running out of memory, we will be paging-out to disk. This will put an inordinate strain on the IO system, which is trying to cope with normal IO from processes and now has the VM system to contend with as well. In this situation, we would want to ensure that our swap area configuration was optimal in order to give the IO system a good chance of meeting the requirements of the VM system.
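    A sketch of sampling the per-disk queueing and buffer cache metrics above with sar:

        # %busy, average queue length (avque), avwait and avserv per device
        sar -d 5 12

        # Buffer cache read and write hit ratios (%rcache, %wcache)
        sar -b 5 12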


11.9.3.1 RESOLVING DISK BOTTLENECKS

While we can consider hardware and software solutions to disk bottlenecks, we need to be careful how we categorize solutions that effectively sit between a software solution and a hardware solution. I am thinking of solutions such as LVM/VxVM striping: a software solution that is intimately linked to hardware configuration. I will categorize these simply as software solutions, because without the software we would be dealing with simple hardware disk drives (see Tables 11-13 and 11-14).

Table 11-13. Hardware Solutions to Resolve Disk Bottlenecks

  • Use more disks: With more physical spindles, we can spread the IO load over more disks and increase our overall aggregate IO bandwidth. Simply adding more disks is not enough; we will have to rebalance the IO workload to ensure that IO is spread proportionally over all disks.

  • Use faster disks: With faster disks, we improve the average service time for individual IO requests. If we can react faster to IO requests, there is a better chance that IO requests will spend less time in the disk queue.

  • Use more IO interface cards: In the same way as using more disk drives, spreading the IO over multiple interface cards increases the total IO capacity of the system. We will need to rebalance the IO workload to ensure that IO is spread over all IO paths. This solution also covers using multiple paths to a single device; software solutions such as Dynamic Multi-Pathing can spread IO to a single device over multiple interfaces.

  • Use faster interface cards: Having faster interface cards will allow us to react more quickly to individual IO requests. This will keep overall IO queue lengths down as much as possible. There's no point in having fast disks if the interface card cannot deliver IO requests to them.

  • Use mirroring: Having two places to read from can improve read performance. Most mirroring solutions will be software based unless you use a disk array.

  • Use striping: Sequential IO can benefit dramatically if we send successive IOs to successive disks. We need to marry the IO size of the application with the stripe size. Most striping solutions are software based unless you use a disk array. Striping solutions are vulnerable to data loss if a single disk fails; using mirroring and striping together can alleviate this problem. Using RAID 5 can be detrimental to performance due to the read-modify-write problem with parity data.

  • Use immediate reporting on disk drives: Immediate reporting is the ability of a disk to report that an IO has occurred even though the data has not landed on the disk platter. The IO will be held in the disk drive cache memory until the IO has completed. This is controlled via the scsictl command. When turned on, there is a compromise between data integrity and performance, because the disk may fail before the IO occurs; the application using the disk will have assumed that the IO completed successfully.

  • Use high performance disk arrays: Some disk array solutions can utilize gigabytes of cache memory. When performing IO to these devices initially, the disk array will have to read from a physical disk. With high performance features such as striping, multiple data paths, and high bandwidth interfaces, this can be accomplished at high speed. Subsequent IOs to the same data blocks are performed via high-speed cache, with the disk array de-staging data to disk when it deems it necessary, avoiding actual disk IO as much as possible.
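A sketch of how immediate reporting might be queried and enabled with scsictl (the device file is hypothetical, and the data-integrity trade-off described above applies):

    # Display the current device modes, including ir (immediate reporting)
    scsictl -a /dev/rdsk/c2t1d0

    # Turn immediate reporting on (ir=0 turns it back off)
    scsictl -m ir=1 /dev/rdsk/c2t1d0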


Table 11-14. Software Solutions to Disk Bottlenecks

  • Identify IO hot spots: Identifying disks that are saturated with IO requests allows us to rework the underlying logical disk configuration to balance the IO over all available disks and interface cards. IO hot spots occur when IO-intensive activities are all targeted toward only a few disks in the system.

  • Identify application-specific IO patterns and sizes: If we know the typical access patterns used by our applications, it can have a direct impact on the type of logical disk configuration we choose. If striping is a possible solution, we need to know the size of an application IO in order to match it with the stripe size, thereby ensuring that successive IOs are resolved by successive disks.

  • Use striping: See the similar solution under "Hardware solutions."

  • Use mirroring: See the similar solution under "Hardware solutions."

  • Dedicate disks to particular aspects of the application: Having disks dedicated to a particular aspect of an application, e.g., database check-pointing, can guarantee a level of IO performance for that particular IO-intensive process, with no other activities compromising the performance of a specific disk.

  • Tune the filesystem at creation time: Various filesystem options used at filesystem creation time can dramatically affect performance, including block and fragment sizes. These really apply only to HFS filesystems.

  • Tune filesystem IO patterns at mount time: VxFS allows us to tune filesystem IO patterns to match the IO activity in our applications and underlying logical disk configuration. The options chosen will reflect the application and logical disk IO attributes.

  • Use performance-specific mount options: Mount options, especially for VxFS filesystems, can dramatically affect performance. Some options can affect the integrity of the data as well as the integrity of the filesystem by altering the behavior of synchronous writes to the filesystem, and should be used with extreme care.

  • Turn off buffer cache access for specific filesystems: VxFS allows us to bypass the buffer cache and write directly to disk. This is applicable only for applications that perform their own internal caching. If used for normal file IO, this can be detrimental to performance.

  • Use asynchronous IO: A kernel parameter, fs_async, can cause all meta-data updates to filesystems to happen asynchronously, though this can seriously impact the integrity of the filesystem. VxFS filesystems can do this on a per-filesystem basis using specific mount options relating to the use of the intent log and caching advisories.

  • Defragment file systems: Having contiguous data files can improve IO performance because the disk read/write heads don't have to jump over the disk surface to read successive filesystem blocks. VxFS allows us to defragment filesystems while they are being used. Defragmenting an HFS filesystem would require storing all data to tape, reformatting the filesystem, and then reloading all the data.

  • Pre-allocate space for files in an empty filesystem: Pre-allocating space in an empty filesystem will ensure that the file is contiguous within the filesystem. This can significantly improve IO performance in the filesystem.

  • Use raw disks: Some applications promote the use of raw disks because it avoids data being cached twice, once by the filesystem in the buffer cache and again by the application performing its own internal caching. This double caching can be avoided either by using raw disks or, if using VxFS, by turning off buffer cache usage on an individual filesystem basis.

  • Use short directory paths: Resolving deep directories can result in a higher-than-average System CPU utilization because the kernel needs to resolve a large number of inodes. Having shorter directory paths avoids this. We need to weigh the configuration complexities of having lots of files in a single directory against application installation requirements and against any performance gains we may obtain.

  • Avoid symbolic links: Symbolic links cause additional reads of inodes and data blocks for the symbolic links themselves. Using fast symbolic links can improve this situation. VxFS will use fast symbolic links by default; HFS won't.
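For the VxFS-related solutions, the relevant commands might look something like the following sketch (the volume /dev/vg01/lvol3 and mount point /oradata are hypothetical, the direct-IO caching advisories assume the application performs its own caching, and these advisories and online defragmentation generally require the HP OnlineJFS product):

    # Mount a VxFS filesystem so regular file IO bypasses the buffer cache
    mount -F vxfs -o delaylog,mincache=direct,convosync=direct /dev/vg01/lvol3 /oradata

    # Report extent and directory fragmentation, then reorganize online
    fsadm -F vxfs -E -D /oradata
    fsadm -F vxfs -e -d /oradata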



