Bursty Utilization

Bursty utilization (also known as burstiness) represents a special case of underutilization. The web site runs at high utilization for part of the test, but at other points goes quiet. During these quiet periods, throughput drops as well.

Burstiness occurs for many of the same reasons as underutilization. A synchronization point in a web site component stalls the site for some period of time. After the congestion clears, the site continues with normal operation until a backlog forms again at the synchronization point.

Figure 13.2 shows bursty behavior ten minutes into a test run. The application server and database server CPU utilization drops to only 5% and 2% respectively. Several minutes later, the application server CPU utilization rises to 80%, and the database server CPU utilization goes up to 22%.

Figure 13.2. Example of bursty behavior. From IBM AIM Services Performance Workshop presented by Stacy Joines, 2001, Hursley, U.K. IBM Corp. 2001. Reprinted by permission of IBM Corp.

Bursty behavior is probably the most challenging performance issue to resolve. Let's look at a few of the more common causes of bursty behavior.

Application Synchronization

Application synchronization sometimes triggers burstiness. The application might contain an infrequently accessed section of code that contains a severe synchronization point. As requests hit this point, they stall until they clear the synchronized block.

How to Diagnose

A thread trace gives you detailed information on what your threads are actually doing. If your threads are waiting to clear a synchronization block, the thread trace makes this clear.

In bursty situations, obtain the thread dump during the idle phase of the bursty behavior, because any wait occurs during the idle cycles.
If you cannot obtain a thread trace, try observing the response time of individual servlets and EJBs, if possible. During the idle phase, look for response time spikes on specific paths through your web site. Consider adding traces to your code to analyze the performance of specific methods. Finally, look for timeouts and other error conditions in your logs.
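
If your JVM exposes the standard java.lang.management API (Java 6 or later, which is an assumption about your environment), you can also collect a rough equivalent of a thread trace from inside the application. The following is a minimal sketch, not a replacement for your application server's own thread dump facility; the class name BlockedThreadReport is hypothetical. It lists the threads that are blocked on a monitor at the moment you call it, which is the information you want during the idle phase.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class BlockedThreadReport {

    /** Returns one line for each thread currently BLOCKED waiting to enter a monitor. */
    public static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // Both flags false: skip the locked monitor and synchronizer details
        // to keep the snapshot cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName()
                        + " waiting for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : blockedThreads()) {
            System.out.println(line);
        }
    }
}
```

If many threads report the same lock during the idle phase, that lock is a good candidate for the synchronization point.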

How to Resolve

If you find that the thread trace shows threads waiting for a serialized resource, remove the synchronization point, or reduce the size of the synchronized code block. If you choose to remove a synchronization point, make sure the code inside the block is threadsafe.
Resolve any stalled back-end systems impacting your web application. For example, if the thread trace shows all of your threads waiting for responses to JDBC queries, improving the responsiveness of the database involved should resolve the wait condition.
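
As an illustration of reducing the size of a synchronized block, consider the sketch below. The RateTable class and its methods are hypothetical and stand in for whatever shared resource your thread trace points to. The first method holds the lock for the whole call; the second holds it only while touching the shared map and does the remaining work on local data.

```java
import java.util.HashMap;
import java.util.Map;

public class RateTable {
    private final Map<String, Double> rates = new HashMap<>();

    public synchronized void put(String currency, double rate) {
        rates.put(currency, rate);
    }

    // Before: the entire method is serialized, including the formatting work.
    public synchronized String describeSlow(String currency) {
        Double rate = rates.get(currency);
        return format(currency, rate);
    }

    // After: only the access to the shared map is inside the synchronized block.
    public String describe(String currency) {
        Double rate;
        synchronized (this) {
            rate = rates.get(currency);
        }
        // format() works only on local values, so it needs no lock.
        return format(currency, rate);
    }

    private static String format(String currency, Double rate) {
        return rate == null ? currency + ": unknown" : currency + ": " + rate;
    }
}
```

In a real application the work moved outside the block would be something expensive, such as building a response or calling another system. The narrower block is safe here only because format() reads no shared state.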

Client Synchronization

The test client is a surprisingly frequent cause of burstiness. Sometimes the test team inadvertently sets up the test tool to feed the web site traffic in bursts. Usually this occurs when the test scripts include think time. The test client software synchronizes these think times, resulting in bursty traffic.

How to Diagnose

Use your HTTP server and application server monitors to see if work is reaching the servers. For example, look at the active thread count in your application server. If the processes and threads in your server are waiting for work during the idle period, this indicates an upstream problem such as client synchronization.
Obtain a thread trace. In these cases, the application server doesn't contain any running threads! The system is idle because it doesn't have anything to do.
Look at the minimum, maximum, and average response times at the test client. Are the response times shorter than the burstiness period? If so, this indicates test clients are not waiting on requests to return during the slow period.
Check the network with a network protocol analyzer. This diagnosis requires more skill than network capacity analysis; you need both the inbound and outbound traffic analyzed. Has the web site responded to the inbound requests, but not received additional requests in a while? This is usually a good indication of test client synchronization.

How to Resolve

Better quality test client software usually allows you to randomize the virtual client think times. You provide a range of acceptable think time values, and the test client software randomly picks a think time within the range. Random think times provide a more even loading of the web site and more realistically represent production traffic.
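
If you write your own test driver, or your tool lets you script the think time, the idea can be sketched as follows. The ThinkTime class is hypothetical, and the sketch assumes a uniform distribution over the range, which your test tool may or may not use.

```java
import java.util.concurrent.ThreadLocalRandom;

public class ThinkTime {

    /** Picks a think time, in milliseconds, uniformly from [minMillis, maxMillis]. */
    public static long next(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis || maxMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("invalid think time range");
        }
        // The upper bound of nextLong is exclusive, so add 1 to include maxMillis.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }
}
```

Each virtual client would call ThinkTime.next(5000, 15000), for example, and sleep for the returned number of milliseconds before sending its next request, so that the clients drift apart instead of firing together.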

Back-End Systems

We discussed earlier the impact of stalled back-end systems. Frequently these show up in a thread trace, but you may detect these issues without obtaining a trace. A periodic back-end system stall occurs for a number of reasons. For databases, the system may stall while resolving a deadlock. Other stalls might result from network issues or disk I/O. Once the bottleneck clears, the system runs at capacity until the condition leading to the bottleneck occurs again.

Figure 13.3 shows a misbehaving back-end system where the database CPU utilization jumps to 100% ten minutes into the run and then returns to 22% at 13 minutes.

Figure 13.3. Example of a bursty back-end database system

How to Diagnose

Observe the CPUs and disk I/O of your back-end systems. Does the CPU stay busy on one of your systems while the other systems are idle? The system that remains busy probably contains the source of the bottleneck.

How to Resolve

You need assistance from the administrator of the system in question. Use your DBAs or host specialists to find the cause of the bottleneck on the remote system and help you resolve it.

Garbage Collection

During normal operation, the application server stops servicing requests during the synchronized parts of the garbage collection cycle. As JVM garbage collection technology continues to improve, the synchronized portion of the cycle shortens; however, all other JVM work still ceases for a period of time during garbage collection. This especially applies during compaction cycles. Usually, garbage collection does not produce prolonged periods of burstiness. However, very large JVM heaps often require long garbage collection cycles and cause noticeable burstiness.

How to Diagnose

Using CPU monitoring techniques described in Chapter 12, look for the CPU utilization on the application server machine dropping to roughly 1/(number of processors); on a four-processor machine, for example, this is about 25%. This behavior occurs during the single-threaded portions of the garbage collection cycle. (In this part of the cycle, only one processor is busy.) Also, look for reduced utilization of back-end systems during this time period.
Use your application server JVM monitors or verbosegc, as described in Chapter 12, to determine if the idle period on the web site corresponds to garbage collection cycles. Also, check the verbosegc log for the length of the garbage collection cycles.
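
The two checks above can be sketched in code, assuming a JVM that provides the java.lang.management API (Java 5 or later). The GcCheck class is hypothetical. Note that the collection time the JVM reports is an approximate accumulated elapsed time, and with concurrent collectors not all of it represents time during which the application was stopped.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {

    /** CPU utilization, in percent, expected when only one of the processors is busy. */
    public static double singleThreadedCpuPercent(int processors) {
        if (processors < 1) {
            throw new IllegalArgumentException("processors must be at least 1");
        }
        return 100.0 / processors;
    }

    /** Approximate total time, in milliseconds, the JVM reports spending in garbage collection. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }
}
```

Sampling totalGcMillis() before and after an idle period gives a rough figure to compare with the length of that period. If the difference accounts for most of the idle time, garbage collection is a likely cause.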

How to Resolve

Reducing the maximum heap setting on your JVM shortens the duration of garbage collection. The JVM garbage collects more often but for shorter periods of time, minimizing the bursty behavior. Refer to Chapter 4 for more discussion on garbage collection dynamics.
Reduce the amount of garbage created by your application. Use this technique in conjunction with reducing the maximum heap size.
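
One common source of avoidable garbage is building strings by repeated concatenation in a loop. The sketch below is only an illustration under that assumption; the ReportLine class is hypothetical, and your own profiling should tell you where your application actually creates its garbage. Both methods assume the array and its elements are not null.

```java
public class ReportLine {

    // Wasteful: each += creates a new String, and the previous one becomes garbage.
    public static String joinWasteful(String[] fields) {
        String line = "";
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line += ",";
            }
            line += fields[i];
        }
        return line;
    }

    // Leaner: a single builder, sized up front so it never has to grow.
    public static String join(String[] fields) {
        int size = fields.length; // room for the separators
        for (String field : fields) {
            size += field.length();
        }
        StringBuilder line = new StringBuilder(size);
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(fields[i]);
        }
        return line.toString();
    }
}
```

Both methods return the same result; the second simply leaves fewer short-lived objects behind for the garbage collector.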

Timeout Issues

Abnormal timeout values often trigger bursty behavior. In these cases, the test system may wait for an unusually long period until the timeout associated with some resource expires. Solaris, for example, defaults the timeout for TCP/IP connections to four minutes. A test generating heavy load against these systems may suddenly encounter a four-minute idle period while waiting for a TCP/IP connection to become available on the servers. Lowering the timeout interval (tcp_time_wait_interval) resolves this problem.

How to Diagnose

Look for a consistent pattern or interval in the idle period. (Use your monitoring tools to closely measure the burstiness pattern.) If the idle period lasts roughly three minutes, for example, then look for a corresponding configuration setting set to approximately three minutes in the test environment.
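
If you record the length of each idle period, a few lines of code can tell you whether the periods are consistent and whether they line up with a suspect setting. This is a minimal sketch; the IdlePattern class and the tolerance value are hypothetical, and the observations in the usage note are made-up illustrative numbers, not measurements.

```java
public class IdlePattern {

    /** Mean of the observed idle-period lengths, in seconds. */
    public static double mean(double[] idleSeconds) {
        if (idleSeconds.length == 0) {
            throw new IllegalArgumentException("no observations");
        }
        double sum = 0;
        for (double s : idleSeconds) {
            sum += s;
        }
        return sum / idleSeconds.length;
    }

    /** True if every observation lies within the given fraction of the mean. */
    public static boolean isConsistent(double[] idleSeconds, double tolerance) {
        double m = mean(idleSeconds);
        for (double s : idleSeconds) {
            if (Math.abs(s - m) > tolerance * m) {
                return false;
            }
        }
        return true;
    }

    /** True if the mean idle period is within the given fraction of a configured timeout. */
    public static boolean matchesTimeout(double[] idleSeconds, double timeoutSeconds,
                                         double tolerance) {
        return Math.abs(mean(idleSeconds) - timeoutSeconds) <= tolerance * timeoutSeconds;
    }
}
```

With idle periods of 178, 183, and 181 seconds and a tolerance of 0.05, isConsistent returns true, and matchesTimeout returns true for a 180-second setting but false for a 240-second one. That points you toward settings of about three minutes.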

How to Resolve

Adjust the configuration setting and retest.
If you cannot adjust the setting, consider adding more of the constrained resource to work around the problem.

Network Issues

Finally, if all else fails, look at the network involved. Check the settings on your firewalls and any other intervening hardware. Network-induced burstiness occurs less frequently than the other causes we've discussed, but sometimes improper network configuration leads to burstiness under load. Even if the network isn't at fault, it usually holds clues to help you resolve the problem.

How to Diagnose

Use a network protocol analyzer. Plug it into all the segments of your network in a systematic manner. Where does the traffic go before the idle phase of the burst hits? Are you seeing network timeouts during the idle phase? Try to determine if the traffic flows through all components and segments of the network.

This requires some skill in network analysis. You need experience with following network "conversations" at a low level inside your network. Some tools support HTTP protocol diagnosis, while others only support analysis at the IP level. Of course, if you use SSL, obtaining detailed conversation information may be difficult because the packets are encrypted. If you can, turn off SSL while you perform the analysis.
Eliminate unnecessary components. Take firewalls, routers, switches, and load balancers out of the equation. If the burstiness goes away, add the equipment back to the network one component at a time. Once you find the offending component, check its settings and operation to uncover the source of the problem.

How to Resolve

Obviously, you need help from your network team. Once you isolate the problem component, you need someone who is knowledgeable about the offending equipment to resolve the issue.