Bursty Utilization
Bursty utilization (also known as burstiness) represents a special case of underutilization. The web site runs at high utilization for part of the test, but at other points goes quiet. During these quiet periods, throughput drops as well. Burstiness occurs for many of the same reasons as underutilization. A synchronization point in a web site component stalls the site for some period of time. After the congestion clears, the site continues with normal operation until a backlog forms again at the synchronization point.
Figure 13.2 shows bursty behavior ten minutes into a test run. The application server and database server CPU utilization drops to only 5% and 2%, respectively. Several minutes later, the application server CPU utilization rises to 80%, and the database server CPU utilization goes up to 22%.
Figure 13.2. Example of bursty behavior. From IBM AIM Services Performance Workshop presented by Stacy Joines, 2001, Hursley, U.K. IBM Corp. 2001. Reprinted by permission of IBM Corp.
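To make the pattern concrete, here is a minimal sketch of how you might flag quiet periods in CPU samples exported from a monitoring tool. The function name, the 10% threshold, the one-minute minimum duration, and the sample data are all assumptions chosen for illustration; adjust them to your own tools and test plan.

```python
from typing import List, Tuple


def find_quiet_periods(
    samples: List[Tuple[float, float]],
    quiet_threshold: float = 10.0,
    min_duration: float = 60.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs where CPU stays below quiet_threshold.

    samples: (seconds_since_test_start, cpu_percent) pairs, sorted by time.
    Runs shorter than min_duration seconds are ignored.
    """
    periods = []
    run_start = None   # time of the first quiet sample in the current run
    last_quiet = None  # time of the most recent quiet sample
    for t, cpu in samples:
        if cpu < quiet_threshold:
            if run_start is None:
                run_start = t
            last_quiet = t
        else:
            if run_start is not None and last_quiet - run_start >= min_duration:
                periods.append((run_start, last_quiet))
            run_start = None
    # Close a quiet run that lasts until the end of the samples.
    if run_start is not None and last_quiet - run_start >= min_duration:
        periods.append((run_start, last_quiet))
    return periods


# Hypothetical application server samples, one every 30 seconds.
app_cpu = [(0, 78), (30, 81), (60, 80), (90, 6), (120, 5),
           (150, 5), (180, 4), (210, 79), (240, 82)]
print(find_quiet_periods(app_cpu))  # [(90, 180)]
```

Running the same check against each server in the test environment shows whether the quiet periods line up across machines, which is the signature of burstiness rather than an isolated dip.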
Bursty behavior is probably the most challenging performance issue to resolve. Let's look at a few of the more common causes of bursty behavior.
Application Synchronization
Application synchronization sometimes triggers burstiness. The application might contain an infrequently accessed section of code that contains a severe synchronization point. As requests hit this point, they stall until they clear the synchronized block.
How to Diagnose
How to Resolve
Client Synchronization
The test client is a surprisingly frequent cause of burstiness. Sometimes the test team inadvertently sets up the test tool to feed the web site traffic in bursts. Usually this occurs when the test scripts include think time. The test client software synchronizes these think times, resulting in bursty traffic.
How to Diagnose
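One possible check, offered as a sketch rather than a prescribed method, is to bucket the request send times recorded by the test client and compare the busiest bucket with the average. A high peak-to-mean ratio suggests the client itself sends traffic in bursts. The function names and the simulated clients below are assumptions for illustration; in practice the timestamps would come from your test tool's log.

```python
import random
from collections import Counter


def peak_to_mean(send_times, bucket_seconds=1.0, duration=None):
    """Ratio of the busiest bucket's request count to the mean count per bucket."""
    if duration is None:
        duration = max(send_times)
    n_buckets = int(duration // bucket_seconds) + 1
    counts = Counter(int(t // bucket_seconds) for t in send_times)
    mean = len(send_times) / n_buckets
    return max(counts.values()) / mean


def simulate_send_times(clients, duration, think_time):
    """Hypothetical virtual clients: each waits think_time(), sends, and repeats."""
    times = []
    for _ in range(clients):
        t = think_time()
        while t < duration:
            times.append(t)
            t += think_time()
    return times


random.seed(1)  # fixed seed so the illustration is repeatable

# Every client uses the same 10-second think time, so requests arrive together.
synchronized = simulate_send_times(50, 60.0, lambda: 10.0)

# Each client picks a think time between 5 and 15 seconds.
randomized = simulate_send_times(50, 60.0, lambda: random.uniform(5.0, 15.0))

print("synchronized:", round(peak_to_mean(synchronized, duration=60.0), 1))
print("randomized:  ", round(peak_to_mean(randomized, duration=60.0), 1))
```

The synchronized clients produce a much higher ratio than the randomized ones, because all 50 requests land in the same one-second bucket every ten seconds.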
How to Resolve
Better-quality test client software usually allows you to randomize the virtual client think times. You provide a range of acceptable think time values, and the test client software randomly picks a think time within the range. Random think times provide a more even loading of the web site and more realistically represent production traffic.
Back-End Systems
We discussed earlier the impact of stalled back-end systems. Frequently these show up in a thread trace, but you may detect these issues without obtaining a trace. A periodic back-end system stall occurs for a number of reasons. For databases, the system may stall while resolving a deadlock. Other stalls might result from network issues or disk I/O. Once the bottleneck clears, the system runs at capacity until the condition leading to the bottleneck occurs again.
Figure 13.3 shows a misbehaving back-end system where the database CPU utilization jumps to 100% 10 minutes into the run and then returns to 22% at 13 minutes.
Figure 13.3. Example of a bursty back-end database system
How to Diagnose
Observe the CPUs and disk I/O of your back-end systems. Does the CPU stay busy on one of your systems while the other systems are idle? The system that stays busy probably contains the source of the bottleneck.
How to Resolve
You need assistance from the administrator of the system in question. Use your DBAs or host specialists to find the cause of the bottleneck on the remote system and help you resolve it.
Garbage Collection
During normal operation, the application server stops servicing requests during the synchronized parts of the garbage collection cycle. As JVM garbage collection technology continues to improve, the synchronized portion of the cycle shortens; however, all other JVM work still ceases for a period of time during garbage collection. This especially applies during compaction cycles. Usually, garbage collection does not produce prolonged periods of burstiness. However, very large JVM heaps often require long garbage collection cycles and cause noticeable burstiness.
How to Diagnose
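As a rough sketch of one way to judge whether garbage collection explains the quiet periods, you might summarize the pauses reported by the JVM's verbose garbage collection output and compare the long ones with the times the site went quiet. The example assumes you have already extracted each pause's start time and length, since the log format varies by JVM. The function name, the one-second threshold, and the sample values are hypothetical.

```python
def summarize_gc_pauses(pauses, test_duration_s):
    """Summarize GC pauses given as (start_s, pause_ms) pairs.

    start_s is seconds since the test began; pause_ms is the pause length.
    Returns the fraction of the test spent paused and the longest pause.
    """
    if not pauses:
        return {"gc_time_fraction": 0.0, "longest_pause_ms": 0,
                "longest_pause_start_s": None}
    total_ms = sum(ms for _, ms in pauses)
    longest_start, longest_ms = max(pauses, key=lambda p: p[1])
    return {
        "gc_time_fraction": total_ms / (test_duration_s * 1000.0),
        "longest_pause_ms": longest_ms,
        "longest_pause_start_s": longest_start,
    }


# Hypothetical pauses from a 15-minute (900-second) test run.
pauses = [(12.0, 180), (95.5, 220), (301.2, 8400), (610.8, 9100)]

summary = summarize_gc_pauses(pauses, test_duration_s=900)
print(summary)

# Pauses of a second or more are candidates to line up against quiet periods.
long_pauses = [p for p in pauses if p[1] >= 1000]
print(long_pauses)
```

If the long pauses begin at the same times the CPU and throughput drop, garbage collection is a likely contributor; if they do not, look to the other causes in this section.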
How to Resolve
Timeout Issues
Abnormal timeout values often trigger bursty behavior. In these cases, the test system may wait for an unusually long period until the timeout associated with some resource expires. Solaris, for example, defaults the timeout for TCP/IP connections to four minutes. A test generating heavy load against these systems may suddenly encounter a four-minute idle period while waiting for a TCP/IP connection to become available on the servers. Lowering the timeout interval (tcp_time_wait) resolves this problem.
How to Diagnose
Look for a consistent pattern or interval in the idle period. (Use your monitoring tools to closely measure the burstiness pattern.) If the idle period lasts roughly three minutes, for example, then look for a corresponding configuration setting set to approximately three minutes in the test environment.
How to Resolve
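Before changing any timeout, it helps to confirm which setting the idle interval actually corresponds to. The matching step described under How to Diagnose can be sketched as follows. The setting names, their values, the measured idle periods, and the 15% tolerance are all hypothetical; substitute the timeouts configured in your own test environment.

```python
from statistics import median


def matching_timeouts(idle_periods_s, timeout_settings_s, tolerance=0.15):
    """Return the names of settings whose value is close to the typical idle period.

    idle_periods_s: measured lengths of the quiet periods, in seconds.
    timeout_settings_s: mapping of setting name to configured timeout, in seconds.
    tolerance: allowed difference as a fraction of the typical idle period.
    """
    typical_idle = median(idle_periods_s)
    return [
        name
        for name, value in timeout_settings_s.items()
        if abs(value - typical_idle) <= tolerance * typical_idle
    ]


# Hypothetical measurements: three idle periods of roughly four minutes each.
idle_periods = [236.0, 243.0, 239.0]

# Hypothetical inventory of timeouts configured across the test environment.
settings = {
    "TCP connection timeout": 240.0,
    "database connection timeout": 30.0,
    "HTTP session timeout": 1800.0,
}

print(matching_timeouts(idle_periods, settings))  # ['TCP connection timeout']
```

A match only identifies a candidate. Change that one setting, rerun the test, and check whether the idle period changes with it.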
Network Issues
Finally, if all else fails, look at the network involved. Check the settings on your firewalls and any other intervening hardware. Network-induced burstiness occurs less frequently than the other causes we've discussed, but sometimes improper network configuration leads to burstiness under load. Even if the network isn't at fault, it usually holds clues to help you resolve the problem.
How to Diagnose
How to Resolve
Obviously, you need help from your network team. Once you isolate the problem component, you need someone who is knowledgeable about the offending equipment to resolve the issue.