Troubleshooting Performance Problems


Your application and environment are now tuned to perfection, users are happy, and the system is taking hundreds of hits per second without batting an eye, right? If not, then read on as we present a tried-and-true methodology for troubleshooting performance problems.

Successful troubleshooting requires a strong understanding of the system and its components, a good problem-resolution process, and knowledge of common performance problems and their solutions. Every system is different, and every performance problem is likely to be different, but there are a number of best practices worth outlining to help you through your own troubleshooting efforts.

Preparing for Troubleshooting

Troubleshooting performance problems can be a difficult and time-consuming process unless you prepare ahead of time. When the users are unhappy and the pressure is on, you must have the proper infrastructure, processes, and people in place to address the problem.

First, the application should have been thoroughly tested and profiled during performance testing. You need to know how the application performed in the test environment to know whether the performance problem you are tackling is real or simply a normal slowdown under peak loads. Your test results also indicate the normal resource usage of the individual transactions under investigation, for comparison with observed resource usage in production. Good testing is critical to efficient production troubleshooting.

Next, you must have all necessary performance monitoring mechanisms in place to provide information concerning system performance and activity. Recognize that many performance problems do not happen on demand, so you will need some form of logging to reconstruct system resource usage and activity during the period in question. Simple shell scripts that log selected output from system monitoring tools are often sufficient for this purpose.
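
If you would rather keep such a collector in Java, the language used throughout this book, the following minimal sketch illustrates the same idea: it samples a few readily available JVM and operating system metrics on a fixed interval and appends them to a file so that activity during a period in question can be reconstructed later. It assumes a Java 6 or later JVM; the class name, log file name, metric choices, and interval are illustrative, not recommendations.

import java.io.FileWriter;
import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;
import java.util.Date;

/** Appends a timestamped resource sample to a log file every 30 seconds. */
public class ResourceUsageLogger {
    public static void main(String[] args) throws Exception {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        // Open in append mode with auto-flush so samples survive a crash
        PrintWriter log = new PrintWriter(new FileWriter("resource-usage.log", true), true);
        while (true) {
            long heapUsed = memory.getHeapMemoryUsage().getUsed();
            long heapMax = memory.getHeapMemoryUsage().getMax();
            int threadCount = threads.getThreadCount();
            double loadAvg = os.getSystemLoadAverage(); // -1 if unavailable on this platform
            log.printf("%tF %<tT heapUsed=%d heapMax=%d threads=%d loadAvg=%.2f%n",
                    new Date(), heapUsed, heapMax, threadCount, loadAvg);
            Thread.sleep(30 * 1000); // sample every 30 seconds
        }
    }
}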

Finally, you need a team and a process in place before the problem occurs. It is a good idea to form a multi-disciplinary SWAT team and make that team responsible for troubleshooting performance problems. Typically, we recommend using many of the same people who did the original performance testing because they already understand the behavior of the system under various loads. Create a well-documented process for responding to performance problems, including a database or other knowledge repository for storing information on previous incidents and remedies.

Once you've done everything you can to prepare for performance problems, all you can do is wait and see how the system performs. Should a problem arise, the team's first order of business is to identify the root cause of the performance problem, also known as the bottleneck.

Bottleneck Identification and Correction

A bottleneck is a resource within a system that limits overall throughput or adds substantially to response time. Finding and fixing bottlenecks in distributed systems can be very difficult and requires experienced multi-disciplinary teams. Bottlenecks can occur in the Web server, application code, application server, database, network, hardware, network devices, or operating system. Experience has shown that bottlenecks are more likely to occur in some areas than in others, the most common areas being these:

  • Database connections and queries

  • Application server code

  • Application server and Web server hardware

  • Network and TCP configuration

Remember that there is rarely a single bottleneck in a system. Fixing one bottleneck will improve performance but often exposes a different bottleneck. Identify bottlenecks one at a time, correct them, and then retest the system to see whether another bottleneck appears, repeating the process until the required performance levels are reached.

In order to identify bottlenecks quickly and correctly, you must understand your system. The team responsible for problem resolution must know all of the physical components of the system. For each physical component (server, network device, etc.), the team needs detailed knowledge of all the logical components (software) deployed there. Ideally, all of this information will be documented and available to the SWAT team members who are responsible for troubleshooting. The team can prepare for problems by identifying all the potential bottlenecks for each component and determining the proper way to monitor and troubleshoot these areas.

The following lists document some of the typical components and areas of concern related to each of them. The team must be aware of these potential bottlenecks and be prepared to monitor the related resource usage to identify the specific bottleneck responsible for a given performance problem quickly.

Common areas of concern for firewall devices include the following:

  • Total connections.

  • SSL connections: If you exceed 20 SSL handshakes per second per Web server, you may need an SSL accelerator.

  • CPU utilization: Make sure CPU utilization does not average above 80 percent.

  • I/O: If the firewall is logging, make sure it is not I/O bound.

  • Throughput.

Common areas of concern for load balancers include these:

  • Total connections.

  • Connection balance.

  • CPU utilization: Make sure average CPU utilization does not exceed 80 percent.

  • Throughput.

Common areas of concern for Web servers include the following:

  • CPU utilization: Make sure average CPU utilization does not exceed 80 percent.

  • Memory: Make sure excessive paging is not taking place.

  • Throughput: Monitor network throughput to make sure you do not have an overutilized network interface card.

  • Connections: Make sure connections are balanced among the servers.

  • SSL connections: Make sure that the number of SSL handshakes per second is not too much for the hardware and Web server software. Consider using SSL accelerators if it is too high.

  • Disk I/O: Make sure the Web servers are not I/O bound, especially if they are serving a lot of static content.

Common areas of concern for application servers include the following:

  • Memory: Make sure there is enough memory to prevent the JVM from paging.

  • CPU: Make sure average CPU utilization does not exceed 80 percent.

  • Database connection pools: Make sure application threads are not waiting excessively for database connections. Also, check to make sure the application is not leaking connections (a leak-avoidance pattern is sketched after this list).

  • Execute queue: Watch the queue depth to make sure it does not consistently exceed a predetermined depth.

  • Execute queue wait: Make sure requests are not starved in the queue.
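
To illustrate the connection-leak point in the list above, here is a minimal sketch of the acquire-use-release pattern with every close() call in a finally block; leaked connections almost always come from a code path that skips the close. The class name, DataSource, and query are placeholders for your own application code.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public class OrderDao {
    private final DataSource dataSource; // typically obtained via a JNDI lookup of the pool

    public OrderDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public int countOrders() throws SQLException {
        Connection con = null;
        Statement stmt = null;
        ResultSet rs = null;
        try {
            con = dataSource.getConnection();        // borrow from the pool
            stmt = con.createStatement();
            rs = stmt.executeQuery("SELECT COUNT(*) FROM orders");
            return rs.next() ? rs.getInt(1) : 0;
        } finally {
            // Close in reverse order; returning the connection here prevents pool leaks
            if (rs != null) try { rs.close(); } catch (SQLException ignored) { }
            if (stmt != null) try { stmt.close(); } catch (SQLException ignored) { }
            if (con != null) try { con.close(); } catch (SQLException ignored) { }
        }
    }
}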

Common areas of concern for database servers include these:

  • Memory: Make sure excessive paging and high I/O wait time are not occurring.

  • CPU: Make sure average CPU utilization does not exceed 80 percent.

  • Cache hit ratio: Make sure the cache is set high enough to prevent excessive disk I/O.

  • Parse time: Make sure excessive parsing is not taking place.

For each area of concern, you may want to put system-monitoring tools in place that will take measurements of these variables and trigger an alert if they exceed normal levels. If system-monitoring tools are not available for a component, you will need to have scripts or other mechanisms in place that you can use to gather the required information.
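
As one example of such a mechanism, the sketch below samples two easily obtained JVM/OS metrics and calls a notification hook whenever a threshold is exceeded. The thresholds and the notifyTeam method are placeholders; in practice you would wire the hook to e-mail, a pager gateway, or whatever alerting channel your team already uses, and you would add the metrics that matter for the component being watched.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Periodically compares simple metrics against thresholds and raises an alert. */
public class ThresholdWatcher {
    private static final double MAX_LOAD_AVERAGE = 8.0;      // illustrative threshold
    private static final double MAX_HEAP_USED_RATIO = 0.90;  // illustrative threshold

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
            double heapRatio = (double) heap.getUsed() / max;

            if (load > MAX_LOAD_AVERAGE) {
                notifyTeam("Load average " + load + " exceeds " + MAX_LOAD_AVERAGE);
            }
            if (heapRatio > MAX_HEAP_USED_RATIO) {
                notifyTeam("Heap usage at " + Math.round(heapRatio * 100) + " percent");
            }
            Thread.sleep(60 * 1000); // check once a minute
        }
    }

    private static void notifyTeam(String message) {
        // Placeholder: replace with e-mail, paging, or a call into your monitoring console
        System.err.println("ALERT: " + message);
    }
}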

Best Practice  

Identifying bottlenecks quickly in production systems requires a thorough knowledge of the hardware and software components of your system and the types of potential bottlenecks common in each of these areas. Ensure that system-monitoring tools capture appropriate information in all areas of concern to support troubleshooting efforts. Consider creating scripts or processes that monitor system resources and notify team members proactively if values exceed thresholds.

Problem Resolution

Troubleshooting performance problems should be accomplished using a documented, predefined problem-resolution process similar to the high-level flow chart depicted in Figure 12.3. We will touch briefly on each step in the flow chart to give you a better feel for the process.

Figure 12.3:  Problem resolution flow chart.

The first step in the process is to define the problem. There are two primary sources of problems requiring resolution: user reports and system-monitoring alerts. Translating information from these sources into a clear definition of the problem is not as easy as you might think. Reports such as "the system seemed slow yesterday" don't really help you define or isolate the problem. Provide users with a well-designed paper form or online application for reporting problems to ensure that all important information about the problem is captured while it is still fresh in their minds. Understanding how the user was interacting with the system may lead you directly to the root of the problem. If not, move on to the next step in the process.

The next step involves checking all potential bottlenecks, paying special attention to areas that have been problems in the past. Consult your system monitoring tools and logs to check for any suspicious resource usage or changes in normal processing levels.

If you are unable to identify the location of the bottleneck or the root cause of the performance problem, you will need to perform a more rigorous analysis of all components in the system, looking for more subtle evidence of the problem. Start by identifying the layer in the application most likely to be responsible for the problem, and then drill into the components in that layer looking for the culprit. If you discover a new bottleneck or area of concern, make sure to document it, adding it to the list of usual suspects for next time.

Once you've identified the location of the bottleneck, you can apply appropriate tuning options and best practices to solve the problem. Document the specific changes made to solve the problem for future use. If nothing seems to work, you may need to step back, revisit everything you've observed and concluded, and try the process again from the top. Consider the possibility that two or more bottlenecks are combining to cause the problem or that your analysis has led you to an incorrect conclusion about the location of the bottleneck. Persevere, and you will find it eventually.

Common Application Server Performance Problems

This section documents a variety of common problems and how you can identify and solve them in your environment.

Troubleshooting High CPU Utilization and Poor Application Server Throughput

The first step in resolving this problem is to identify the root cause of the high CPU utilization. Consider the following observations and recommendations:

  • Most likely the problem will reside in the application itself, so a good starting point is to profile the application code to determine which areas of the application are using excessive processor resources. These heavyweight operations or subsystems are then optimized or removed to reduce CPU utilization.

  • Profile the garbage collection activity of the application. This can be accomplished using application-profiling tools or by starting your application with the -verbose:gc option set. If the application is spending more than 25 percent of its time performing garbage collection, there may be an issue with the number of temporary objects that the application is creating; reducing the number of temporary objects should reduce garbage collection and CPU utilization substantially. A simple programmatic check of garbage collection overhead is sketched after this list.

  • Refer to information in this chapter and other tuning resources available from BEA to make sure the application server is tuned properly.

  • Add hardware to meet requirements.
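
To complement the -verbose:gc output mentioned above, here is a minimal sketch of the 25 percent rule of thumb, assuming a Java 5 or later JVM and that the check runs inside the JVM you want to measure (for example, from a diagnostic servlet or startup class). It compares the cumulative collection time reported by the garbage collector beans against elapsed uptime over an observation window; the one-minute window is an arbitrary choice.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Rough measurement of the fraction of elapsed time spent in garbage collection. */
public class GcTimeCheck {
    public static void main(String[] args) throws InterruptedException {
        long startUptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long startGcTime = totalGcTime();

        Thread.sleep(60 * 1000); // observe for one minute while the system is under load

        long elapsed = ManagementFactory.getRuntimeMXBean().getUptime() - startUptime;
        long gcTime = totalGcTime() - startGcTime;
        double percent = 100.0 * gcTime / elapsed;

        System.out.println("Spent roughly " + percent + " percent of the last minute in GC");
        if (percent > 25.0) {
            System.out.println("GC overhead is high; look for excessive temporary object creation.");
        }
    }

    private static long totalGcTime() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionTime(); // cumulative milliseconds; -1 if unsupported
        }
        return total;
    }
}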

Troubleshooting Low CPU Utilization and Poor Application Server Throughput

This problem can result from bottlenecks or inefficiencies upstream, downstream, or within the application server. Correct the problem by walking through a process similar to the following:

  1. Verify that the application server itself is functioning normally using the weblogic.Admin command-line administration tool to request a GETSTATE and a series of PING operations. Chapter 11 walked through the use of this tool and the various command-line options and parameters available. Because the GETSTATE and PING operations flow through the normal execute queue in the application server, good response times are an indication that all is well within the server. Poor response times indicate potential problems requiring additional analysis.

  2. If the GETSTATE operation reports a healthy state but the PING operations are slow, check to see if the execute queue is backed up by viewing the queue depth in the WebLogic Console.

  3. A backed-up execute queue may indicate that the system is starved for execute threads. If all execute threads are active and CPU utilization is low, adding execute threads should improve throughput.

  4. If the queue appears starved but adding execute threads does not improve performance, there may be resource contention. Because CPU utilization is low, the threads are probably spending much of their time waiting for some resource, quite often a database connection. Use the JDBC monitoring facilities in the console to check for high levels of waiters or long wait times (a simple in-application spot check is sketched after this list). Adding connections to the JDBC connection pool may be all that is required to fix the problem.

  5. If database connections are not the problem you should take periodic thread dumps of the JVM to determine if the threads are routinely waiting for a particular resource. Take a series of 4 thread dumps about 5 to 10 seconds apart, and compare them with one another to determine if individual threads are stuck or waiting on the same resource long enough to appear in multiple thread dumps. The problem threads may be waiting on a resource held by another thread or may be waiting to update the same table in the database. Once the resource contention is identified you can apply the proper remedies to fix the problem.

  6. If the application server is not the bottleneck, the cause is most likely upstream of the server, perhaps in the network or Web server. Use the system monitoring tools you have in place to check all of the potential bottlenecks upstream of the application server and troubleshoot these components.
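
As a complement to the console's pool statistics mentioned in step 4, you can also time connection acquisition from inside the application. The sketch below is intended to run inside the server (for example, from a diagnostic servlet) where the JNDI lookup resolves; the data source name and threshold are illustrative.

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;

/** Spot check of how long a thread waits to obtain a pooled JDBC connection. */
public class ConnectionWaitCheck {
    public static void check() throws Exception {
        // Illustrative JNDI name; use the name your connection pool's data source is bound to
        DataSource ds = (DataSource) new InitialContext().lookup("jdbc/MyDataSource");

        long start = System.currentTimeMillis();
        Connection con = ds.getConnection();   // blocks while all pooled connections are in use
        long waited = System.currentTimeMillis() - start;
        con.close();                           // return the connection to the pool immediately

        System.out.println("Waited " + waited + " ms for a connection");
        if (waited > 100) {                    // illustrative threshold
            System.out.println("Consider adding connections to the pool or hunting for leaks.");
        }
    }
}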

Troubleshooting Low Activity and CPU Utilization on All Physical Components with Slow Throughput

If CPU utilization stays low even when user load on the system is increasing, you should look at the following:

  1. Is there any asynchronous messaging in the system? If the system employs asynchronous messaging, check the message queues to make sure they are not backing up. If the queues are backing up and there are no message-ordering requirements, try adding more dispatcher threads to increase throughput of the queue. A simple queue-depth check is sketched after this list.

  2. Check to see if the Web servers or application servers are thread starved. If they are, increase the number of server processes or server threads to increase parallelism.
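
For step 1, one portable way to check queue depth from code (in addition to the administration console) is a JMS QueueBrowser, which counts messages without consuming them. The JNDI names below are illustrative; substitute the connection factory and queue used by your application.

import java.util.Enumeration;
import javax.jms.Queue;
import javax.jms.QueueBrowser;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.naming.InitialContext;

/** Counts the messages currently sitting in a JMS queue. */
public class QueueDepthCheck {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        // Illustrative JNDI names; substitute the ones bound for your application
        QueueConnectionFactory factory =
                (QueueConnectionFactory) ctx.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/OrderQueue");

        QueueConnection connection = factory.createQueueConnection();
        try {
            connection.start();
            QueueSession session =
                    connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueBrowser browser = session.createBrowser(queue);

            int depth = 0;
            for (Enumeration<?> messages = browser.getEnumeration(); messages.hasMoreElements(); ) {
                messages.nextElement(); // count without consuming
                depth++;
            }
            System.out.println("Current queue depth: " + depth);
        } finally {
            connection.close();
        }
    }
}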

Troubleshooting Slow Response Time from the Client and Low Database Usage

These symptoms are usually caused by a bottleneck upstream of the database, perhaps in the JDBC connection pooling. Monitor the active JDBC connections in the WebLogic Console and watch for excessive waiters and wait times; increase the pool size if necessary. If the pool is not the problem, there must be some other resource used by the application that is introducing latency or causing threads to wait. Often, periodic thread dumps can reveal what the resource might be.
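
A standard thread dump (for example, one triggered with a kill -3 on Unix) works well here; the sketch below shows an in-process alternative using the java.lang.management API (Java 5 or later), which takes four snapshots roughly ten seconds apart so they can be compared for threads stuck on the same resource. The dump count, interval, and stack depth are illustrative.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Takes several thread snapshots a few seconds apart so they can be compared offline. */
public class PeriodicThreadDump {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int dump = 1; dump <= 4; dump++) {
            System.out.println("=== Thread dump " + dump + " ===");
            ThreadInfo[] infos = threads.getThreadInfo(threads.getAllThreadIds(), 30);
            for (ThreadInfo info : infos) {
                if (info == null) {
                    continue; // thread exited between the two management calls
                }
                // Threads that remain BLOCKED or WAITING on the same lock across dumps
                // are the ones to investigate.
                System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState()
                        + (info.getLockName() != null ? " waiting on " + info.getLockName() : ""));
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
            Thread.sleep(10 * 1000); // spacing between snapshots; adjust as needed
        }
    }
}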

Troubleshooting Erratic Response Times and CPU Utilization on the Application Server

Throughput and CPU will always vary to some extent during normal operation, but large, visible swings indicate a problem. First look at the CPU utilization, and determine if there are any patterns in the CPU variations. Two patterns are common:

  • CPU utilization peaks or patterns coincide with garbage collection. If your application is running on a multiple-CPU machine with only one application server, you are most likely experiencing the effects of non-parallelized garbage collection in the application server. Depending on your JVM settings, garbage collection may be causing all other threads inside the JVM to block, preventing all other processing. In addition, many garbage collectors use a single thread to do their work so that all of the work is done by a single CPU, leaving the other processors idle until the collection is complete. Try using one of the parallel collectors or deploying multiple application servers on each machine to alleviate this problem and use server resources more efficiently. The threads in an application server not performing the garbage collection will be scheduled on processors left idle by the server performing collection, yielding a more constant throughput and more efficient CPU utilization. Also consider tuning the JVM options to optimize heap usage and improve garbage collection using techniques described earlier in this chapter.

  • CPU peaks on one component coincide with valleys on an adjacent component. You should also observe a similar oscillating pattern in the application server throughput. This behavior results from a bottleneck that is either upstream or downstream from the application server. By analyzing the potential bottlenecks being monitored on the various upstream and downstream components you should be able to pinpoint the problem. Experience has shown that firewalls, database servers, and Web servers are most likely to cause this kind of oscillation in CPU and throughput. Also, make sure the file descriptor table is large enough on all Unix servers in the environment.

Troubleshooting Performance Degrading with High Disk I/O

If a high disk I/O rate is observed on the application server machine, the most likely culprit will be excessive logging. Make sure that WebLogic Server is set to the proper logging level, and check to see that the application is not making excessive System.out.println() or other logging method calls. System.out.println() statements make use of synchronized processing for the duration of the disk I/O and should not be used for logging purposes. Unexpected disk I/O on the server may also be a sign that your application is logging error messages. The application server logs should be viewed to determine if there is a problem with the application.
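
The sketch below illustrates the kind of replacement this implies: route diagnostic messages through a logger at an appropriate level and guard the message construction so it costs nothing when that level is disabled. It uses java.util.logging as a placeholder; WebLogic's own logging services or a library such as Log4j would follow the same pattern.

import java.util.logging.Level;
import java.util.logging.Logger;

public class OrderProcessor {
    private static final Logger log = Logger.getLogger(OrderProcessor.class.getName());

    public void process(String orderId) {
        // Instead of: System.out.println("processing order " + orderId);
        if (log.isLoggable(Level.FINE)) {
            // The guard avoids building the message when FINE logging is disabled
            log.fine("processing order " + orderId);
        }

        // ... business logic ...

        // Genuine problems still belong at a level that is always recorded:
        // log.log(Level.WARNING, "order processing failed", exception);
    }
}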

Java Stack Traces

This section discusses the reading and interpretation of Java stack traces in WebLogic Server. A Java stack trace displays a snapshot of the current state of all threads in a JVM (Java Virtual Machine) process. This trace represents a quick and precise way to determine bottlenecks, hung threads, and resource contention in your application.

Understanding Thread States

The snapshot produced by a Java stack trace will display threads in various states. Not all Java stack traces will use the same naming convention, but typically each thread will be in one of the following states: runnable, waiting on a condition variable, or waiting on a monitor lock.

Threads in the runnable state represent threads that are either currently running on a processor or are ready to run when a processor is available. At any given time, there can be only one thread actually executing on each processor in the machine; the rest of the runnable threads will be ready to run but waiting on a processor. You can identify threads in a runnable state by the runnable keyword in the stack trace, as shown in this representative example:

 "ExecuteThread: '12' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x... nid=0x... runnable
     at java.net.SocketInputStream.socketRead0(Native Method)
     at java.net.SocketInputStream.read(SocketInputStream.java)
     ...

