Improving Performance and High Availability



This section discusses four major approaches that enterprise businesses use to improve the performance and high availability of their J2EE enterprise applications: SMP scaling, clustering, topology, and performance tuning.

SMP Scaling

Systems with multiple processors are very common nowadays, even for low-end systems. For example, entry-level IBM xSeries systems can easily be equipped with two to eight Intel Xeon processors. Adding extra processors to an existing system has become an attractive option for business owners because it adds more horsepower, or computing power. Theoretically, when more computing power is added to the system, the application server should be able to process more requests in the same amount of time.

SMP systems are called symmetric because the processors added are identical and connected to the same bus. The processors also share a common memory store. Ideally, adding n processors to a system yields an n-times boost in performance. To achieve this, however, a balance must be maintained between the additional processors and the other resources being shared, such as the system bus, memory, and network interfaces. Adding more processors makes contention for these shared resources more likely. When contention occurs, synchronization must be imposed, and some of the computing power goes unused.
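A standard way to quantify this effect is Amdahl's law: if a fraction p of the workload can run in parallel, the best possible speedup on n processors is

    speedup(n) = 1 / ((1 - p) + p / n)

For example, even when 95% of the work is parallel (p = 0.95), eight processors yield at most 1 / (0.05 + 0.95/8), or about a 5.9-times boost rather than the ideal 8 times.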

SMP systems are definitely a good way to boost an application server's overall performance. As more processors are added, however, the application server might not yield the expected performance boost. If this occurs, the system is experiencing scalability problems.

The cause of scalability problems can lie in any layer of the software performance stack shown in Figure 18-1. It is possible that the enterprise application itself is not scalable, or the problem could be the application server, or even the JVM. Do not discount the possibility that Linux itself is not scaling well. The hardware also limits the maximum number of processors it can handle with good scalability, so it is important to become familiar with the performance specifications of your hardware platform.

Because Linux was originally designed as a UNIX-like operating system for the desktop, its internal design did not consider SMP scaling. Since then, the Linux kernel has undergone major improvements to address SMP scaling issues. If you are using an SMP system, make sure that your kernel supports SMP. You can verify this by issuing the uname -a command and looking for the smp suffix on the kernel version and the SMP marker before the build date, as shown in the following sample output:

    Linux vivaldi 2.4.21-17-smp #1 SMP Wed Jul 30 17:18:41 UTC 2003 i686 unknown
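On an SMP kernel, you can also count the logical processors the kernel has brought online by counting the processor entries in /proc/cpuinfo:

    grep -c '^processor' /proc/cpuinfo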

SMP systems are a common configuration choice for enterprise application servers, on Linux and on other operating systems alike. In practice, the most likely cause of scaling problems is either the enterprise application or the operating system.

Clustering

Clustering is another approach to improve the overall performance of enterprise applications. Clustering can also be used to ensure high availability. Clustering has nothing to do with adding more processors, but instead with adding more instances of the application server. These instances serve exactly the same applications. In other words, imagine these application servers to be clones of each other. In so doing, incoming HTTP requests can be processed by any of these application server instances.

There are two types of clustering: vertical and horizontal. Vertical clustering refers to the creation of two or more instances of the same application server on the same Linux server, as shown in Figure 18-3. Because these application server instances are independent processes, vertical clustering requires enough memory to share among them. Requests that come in through the web server are still channeled to the same host machine; for the requests to be sprayed across the different instances, the instances must be identical and have the same enterprise applications deployed. Any assignment policy, normally round-robin, can be used when spraying the requests. Vertical clustering is typically used when the machine's CPU capacity is better consumed by running more application server instances. It also provides a measure of fault tolerance within the server machine: when one application server instance dies, requests can be rerouted to another instance.

Figure 18-3. Vertical clustering. Two or more instances of the same application server are co-located in the same host machine.


Horizontal clustering is the creation of two or more instances of the same application server across different Linux servers. This approach adds strictly more CPU power because more Linux servers are actually added. Note that the servers do not have to be identical, but the application servers do, as shown in Figure 18-4. The idea is the same as in vertical clustering when requests come in through the web server; the only difference is that requests are sprayed across different Linux servers. Horizontal clustering shows larger performance gains because there are physically more machines. Unlike in vertical clustering, where the application servers must contend for the available CPUs, application servers in horizontal clustering have their own set of CPUs. The fault tolerance provided by horizontal clustering is also at the Linux server level: when a Linux server fails, incoming requests can be rerouted to a different Linux server, thus increasing availability. In vertical clustering, when the Linux server itself fails, all the application servers fail with it. Of course, horizontal clustering might require more investment because it involves more than one Linux server.

Figure 18-4. Horizontal clustering. Two or more instances of the same application server are created in different Linux servers.


Of course, it is also possible to combine the two approaches and build a cluster of several Linux servers, where each Linux server hosts more than one application server. Some business owners use this combined approach to implement two or more sets of horizontal clusters. In other words, the application servers on a given Linux server are not identical in the strict sense of vertical clustering, in that they do not deploy the same applications; they are created on the same server simply to make better use of the available CPU power. In some application servers, the plug-in can spray the requests across the different servers. In others, a third-party sprayer program (for example, the IBM Network Dispatcher or BIG-IP from F5 Networks) is placed between the web server and the Linux servers.

Scalability is also a concern in clustering. In the case of vertical clustering, only so many instances of the application server can be added. Beyond that limit, additional instances of the application server are not helpful and can lead to poorer performance.

The most likely causes of scalability problems in a cluster are the hardware resources: the CPU, system bus, and memory. Another possible cause is the operating system's scheduler, because a cluster has more processes to schedule for the same amount of resources. In the case of horizontal clustering, scalability can be hampered by the networking subsystem, because as more and more requests come in, the traffic on the network connecting the cluster becomes heavier. Additionally, the plug-in or sprayer might not be scalable. Some application servers also include a workload manager that routes requests based on the load of each node in the cluster; the workload manager itself can be a source of bottlenecks.

Topology

Another approach that enterprise businesses use to improve the performance of their J2EE enterprise applications is the type of topology they use to set up the application server infrastructure. Each business system can come up with its own topology based on its needs and, possibly, the nature of the applications that need to be deployed. Remember, however, that the goal of a topology should always be that of addressing the fundamental requirements of high performance and high availability.

Generally, how an application server is used covers a wide spectrum, which is also why the performance characterization of application servers is not a straightforward process. On the one hand, an application server can be used only to serve static pages, like a web server. On the other hand, a cluster of application servers can be configured to communicate with back-end database servers and gateways to external services. In addition, the cluster can be protected by a firewall and replicated in a geographically different location. The application servers can be running hundreds of applications. The choice of a topology mostly depends on the system's complexity and the requirements, such as security, performance, fail-over, and high availability. Figure 18-5 shows the basic three-tier topology in which the web server and the application server are hosted in the same server machine.

Figure 18-5. Basic three-tier topology.


Most businesses use a database to store their persistent data. In this topology, the database server is hosted on a separate server box, possibly a mainframe. The J2EE application makes some JDBC calls to retrieve and update data from the database. In a typical scenario, a web user sends an HTTP request to the web server that is hosted on Server A. By examining the request, the web server determines that it should be processed by the application server and therefore passes the request to the application server. The application server processes the request by giving it to the appropriate application deployed on the server. That application processes the request and returns a response to the web server. The web server in turn returns the response to the original sender of the request. This type of topology might suffice for small businesses. Small businesses typically use Secure Sockets Layer (SSL) to protect customer information and make use of the security features provided by the application server.
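As a concrete illustration of the JDBC step, the following fragment is a minimal sketch in modern Java; the class, the JNDI name jdbc/OrdersDS, and the orders table are hypothetical, but looking up a container-managed DataSource and using a prepared statement is the typical pattern:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.naming.InitialContext;
    import javax.naming.NamingException;
    import javax.sql.DataSource;

    // Minimal sketch of the JDBC step; class and names are illustrative.
    public class OrderDao {
        public String lookUpOrderStatus(int orderId)
                throws NamingException, SQLException {
            // Look up the container-managed connection pool by its JNDI name.
            DataSource ds = (DataSource) new InitialContext()
                    .lookup("java:comp/env/jdbc/OrdersDS");   // hypothetical name
            try (Connection con = ds.getConnection();
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT status FROM orders WHERE order_id = ?")) {
                ps.setInt(1, orderId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString("status") : null;
                }
            }
        }
    }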

Figure 18-6 is a more sophisticated topology for a medium- to large-scale enterprise that requires more stringent security. In this topology, firewalls are set up to put the web server in a demilitarized zone (DMZ) while the application servers and back-end databases or legacy systems are completely protected from the Internet. The DMZ is a location where a computer or a small subnetwork sits between a trusted internal network, such as a corporate private local area network (LAN), and an untrusted external network, such as the public Internet.

Figure 18-6. DMZ topology.


Typically, a security mechanism such as a basic user ID and password is used to authenticate the web user. LDAP servers are commonly used to store users and their respective authorization information. The web server queries the LDAP server to authenticate and authorize a user who has just logged in. For an authorized user, the request is passed through a firewall to the application server, which processes it. For example, the request might be handled by a servlet that requires data residing on two different data sources, an Oracle database and a SQL Server database. The application retrieves the data by invoking either a native database driver or a generic (JDBC or ODBC) driver, formats the resulting data in a way the web browser can understand (using HTML, Java, or XML), and places the data into a template. The web browser then receives the data and renders it for the user.

This example is a simplified view that hides a lot of what happens "under the hood." For example, the application server must be able to authenticate the user in some fashion, it must have the processing power to run the application components and handle the imported data, and it must work in conjunction with an efficient web server (such as the Apache web server) to make sure that data is sent back to the user properly.
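To make the request/response flow concrete, here is a minimal, hypothetical servlet along the lines described above; the class name, URL parameter, and output are all illustrative:

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Receives the request handed off by the web server, fetches the data,
    // and returns an HTML response.
    public class OrderStatusServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String orderId = req.getParameter("orderId");   // hypothetical parameter
            // In a real application, the data would be fetched here via JDBC,
            // as sketched earlier, possibly from more than one data source.
            resp.setContentType("text/html");
            PrintWriter out = resp.getWriter();
            out.println("<html><body>Status for order " + orderId + "</body></html>");
        }
    }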

The next topology, shown in Figure 18-7, extends the previous topology by adding more application servers. The idea behind this topology is to increase the enterprise's processing capability, and thus the overall throughput, by increasing the number of processors. In this case, one or more machines are added, and each additional box has the same configuration as the original: the same application server and the same deployed applications. When the number of simultaneous requests coming in to the web server is high enough that a single application server reaches its saturation point, the additional application servers can help by taking some of the load. The web server, therefore, must be able to spray the requests across the available application servers using a load-balancing algorithm. The most straightforward algorithm is round-robin, where requests are distributed to each application server in a fixed sequence, repeating the same sequence over and over. Theoretically, round-robin distributes the load uniformly among all the application servers. With load balancing, requests are spread among all running application servers in a cluster, rather than risking having one or two servers overworked and on the edge of failure. This type of load balancing is a staple in the web server world. Typically, application servers take one of two approaches: they either employ a simple round-robin method that sends each incoming request to the next server in sequence, or they institute an algorithm-based system in which usage on the entire network is analyzed and a request is sent to the server that can best handle it, based on parameters set up by the system administrator.

Figure 18-7. Load balancing.
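For illustration, the core of a round-robin sprayer can be sketched in a few lines of Java. This is a simplified model, not the actual plug-in code of any product, and all names are illustrative:

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Minimal round-robin selector: hands out servers in a fixed, repeating sequence.
    public class RoundRobinBalancer {
        private final List<String> servers;          // e.g., "host:port" entries
        private final AtomicInteger next = new AtomicInteger();

        public RoundRobinBalancer(List<String> servers) {
            this.servers = servers;
        }

        public String nextServer() {
            // floorMod keeps the index valid even after the counter overflows
            int i = Math.floorMod(next.getAndIncrement(), servers.size());
            return servers.get(i);
        }
    }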


The next topology, shown in Figure 18-8, is a further extension of the previous topology. The main distinction is that the same topology is now being replicated into several instances. Figure 18-8 shows only two instances. Each instance is called a domain, so this topology is called the multidomain topology. This topology essentially achieves another layer of high availability and load balancing. Domains can be spread geographically in many places. Notice that a new server, called the Request Sprayer, becomes the front end (and thus a single point of entry), which accepts requests and distributes them to the available web servers. A web server performs the necessary security checks and the distribution of requests to the application servers, as described in the previous topology. Note that the web servers typically share the same LDAP repository because the splitting of the application servers into several domains is transparent to users. In some architectures, there could be a pool of LDAP servers so that the web servers can access any of them. It is important in this situation, however, that the repositories of these LDAP servers are synchronized for consistency.

Figure 18-8. Multidomain topology.


This multiple-LDAP-server approach is excellent for failover. The multidomain topology in Figure 18-8 shows the back end shared by all domains; again, this need not be the case, as the multiple-LDAP-server example illustrates.

Although it is costly, the advantage of the multidomain topology is high availability. A domain can be down for a while for different reasons, for example, a power outage, an application deployment, a version upgrade, or a system crash. Such an outage does not shut down the entire operation of the enterprise, because requests can be routed to another domain that is still online. The multidomain topology also provides scalability as the workload increases.

Performance Tuning

Performance tuning is a fundamental way to improve the performance of any server. Key to performance tuning is finding the performance bottlenecks in the system under investigation. After they are identified, the bottlenecks should be fixed, tuned, or both. Tuning can be performed at any layer of the software performance stack where parameters can be set; the task is to find the right values for those parameters. It is also the art of putting the system in harmony by making sure that the combination of all parameter values creates a balanced system.

Tuning the performance of an application server is broader in scope than general performance tuning. The system under test must be analyzed from end to end, including all the external components that communicate with the application server. This can be done in a top-down fashion, where the whole system is treated as a black box at first. As performance data is gathered, an analysis is made, and possible suspects are identified, you can go one step lower to get more information related to those suspects. This is repeated until the cause of the bottleneck is discovered.

Performance analysis tools are very important in this undertaking. The golden rule is to use the right tool for the right problem. Hundreds of tools are available for analyzing the performance of J2EE applications, and some of them are commercial tools. Tools for Linux servers are discussed in this book in Part III, "System Tuning." These tools are very handy for analyzing performance at the operating system level.

Performance analysis of application servers can be divided into three levels: J2EE-related, JVM-related, and Linux operating system- and hardware-related. These areas are shown in Figure 18-9.

Figure 18-9. Three levels of performance tuning. Level 3 is typically independent of the operating system, whereas Levels 1 and 2 may involve closer investigation of the OS.


Level 3 represents J2EE-specific areas, including the application server itself. These areas can be tuned independently of the underlying operating system. For example, application server parameters such as the prepared statement cache size, connection pool sizes, and transaction timeouts can be tuned without regard to Linux-specific settings. Literature on performance analysis and tuning of J2EE applications is abundant.

Level 2 represents areas specific to the JVM, the native code that implements the JVM, and the native code emitted by the JIT compiler. Some JVM parameters, such as the garbage collection policies, are largely independent of the operating system. Others are tightly coupled with it; for example, the maximum heap size the JVM can actually obtain depends on the memory and address space the operating system makes available. Most literature on J2EE performance tuning includes JVM performance tuning as well.
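As an illustration, the following command line uses standard HotSpot-style options to pin the heap at a fixed size, select a parallel collector, and log garbage collection activity. The main class is hypothetical, and the -XX options in particular vary by JVM vendor, so consult your JVM's documentation for the options it actually supports:

    java -Xms512m -Xmx512m -XX:+UseParallelGC -verbose:gc com.example.ServerMain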

Level 1 represents areas that are more specific to the Linux operating system and the hardware platform it is running on.

When looking for the cause of a performance problem, the recommended approach is to go from Level 3 to Level 1. This means that before delving into tuning Linux, make sure as much as possible that all the other layers above it have been exhausted. Give Linux the benefit of the doubt by making it the last suspect except when it is rather obvious and undeniable that the problem is Linux-related. Your system administrator should have configured Linux to be at its optimal configuration. However, not all global configurations of Linux may work well with all applications in your system.

When tuning an application server, the best approach is to use CPU utilization as your goal for optimization. This is true in both single-server and clustered configurations. Here is a rough outline of the steps involved in this approach:

1. Choose an application to run on the application server.

2. Set up your environment for testing.

3. Perform a user ramp-up test:

  • Run a test with one user and gather the necessary statistics, such as throughput, average response time, and CPU utilization. If other systems are involved in the environment, such as a remote database machine or a web server, also gather OS-level statistics such as CPU utilization and I/O statistics on those systems.

  • Repeat the previous step, doubling the number of users each time.

  • Stop when CPU utilization is either maxed out (100% used) or stays approximately the same as in the previous runs.

4. If the CPU was maxed out, the system has no bottleneck. Otherwise, you need to investigate what is causing the idle time.

5. After you have determined a possible cause of the bottleneck and have fixed or tuned it, perform another user ramp-up test.

6. Repeat this process until you can max out the application server's CPU.

In the case of a clustered configuration, investigate the CPU utilization of all application server boxes. Test a clustered configuration only after you have done a single-server test. A clustered configuration is good for testing scalability problems. Also, it is recommended that you assign a box exclusively to an application server and move all other servers (database server, JMS server, and so on) to remote boxes. A later chapter illustrates this approach by using a real-life example.

When you have determined that a bottleneck exists, the next step is to look for it. It takes experience and exposure to a number of problem cases to be able to hunt down the culprit quickly. Some guidelines that we can offer include the following:

  • Enterprise application. Make sure that the enterprise application is not causing the problem. Profile the application using tools to detect problems with path length, method time, memory usage, and thread waits. Commercial tools for J2EE performance profiling that are available on the Linux platform include JProbe from Quest Software, OptimizeIt from Borland, and Rational Purify/Quantify from IBM. The Eclipse platform also offers open-source profiling tooling on Linux.

    To be an expert performance analyst, you have to be familiar with J2EE programming concepts and the best programming practices for good performance.

  • Java Virtual Machine. The JVM is crucial for the application server. Make sure that it is properly tuned based on the behavior of the enterprise applications. More specifically, check whether the heap size is set properly. Examine the behavior of the garbage collector. Does using a parallel garbage collector help? If the JVM supports hot-spot compilation, is it enabled? Is large page support available, and does your Linux kernel support it as well? Is the JIT compiler enabled? These are just some of the things you can check.

    The best thing you can do is check the JVM implementor's help guide or web site to find what parameters are available for tuning. If you intend to be an expert performance analyst for application servers, you must understand intimately the underpinnings of the Java Virtual Machine and some of its implementation details.

  • Security. If the web server uses security, verify that it is tuned properly to handle SSL requests. The application server might also have performance tuning knobs for SSL or J2EE security. Security itself is an overhead; it is estimated that security can slow down an application server by 20% to 25%. Test your system with security turned off, and check whether the speedup is much larger than this range.

  • Networking. Because the application server is very intensive with network communications to the web server, database server, JMS server, LDAP server, and other remote servers, verify that the TCP/IP stack is properly tuned. It is also important to check the performance of the servers that the application server is talking to. For example, check if the database server is properly tuned and if it can process the amount of data being passed to it by the application server in a timely manner. Another critical server to look at is the web server. In these situations, you might need the help of other people with the necessary skills to examine those servers.

  • File system. Take note of the file system activity by using tools such as sar and iostat. Also check whether a lot of swapping is occurring.

  • Concurrency. If you are using kernel 2.4, examine the context switches that are taking place by running vmstat (see the example after this list). If the context-switch rate is too high, make sure that the application server's thread pool is not too large. Typically, a thread pool of 10 to 20 threads is sufficient; beyond that, performance can degrade. With NPTL supported in the Linux 2.6 kernel, this limitation might be alleviated.

  • Memory. Check the system memory usage, specifically the memory that is allocated to the application server. Determine whether the system needs to be upgraded with more memory.

  • Scaling. Verify that the system is scaling well. In SMP systems, if the number of processors is a detriment rather than a help, it might be wise to disable some of the processors. If vertical clustering is configured, make sure that there are not too many instances of the application server: if adding more instances merely drives up CPU utilization without improving throughput, remove some of them.
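For the concurrency check above, vmstat reports the context-switch rate in the cs column under the system heading. For example, the following prints a new report every 5 seconds:

    vmstat 5

Compare the cs rate under light and heavy user loads; a rate that grows much faster than the throughput suggests thread contention or an oversized thread pool.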

Depending on the type of environment you have (for example, hyperthreading or NUMA), there might be other things you can look into. Verify that your Linux kernel supports these additional features on your system.

Tuning an application server is a system-wide activity. All the pieces of the system need to have a performance story, and you should be able to work on all the layers in the software performance stack. The application server is a complex piece of middleware and so is the tuning process.
