Best practices advice




This section reviews what we consider the best practices derived from our tests. It starts with general considerations for capacity planning, system tuning, and managing software levels. Next it provides specific recommendations pertaining to parameters for the WebSphere Application Server plug-in, the WebSphere Application Server, and the WebSphere Portal. It concludes with other hints and a summary of recommended parameters.

Capacity planning

To minimize the effect of failures on users, include failure situations in your capacity planning.

For example, if a Web site is supported by four application server machines, and you expect the site to deliver good performance if one of the servers becomes unavailable, then plan such that the site workload can be handled by the remaining three servers.

If the normal workload level of the four servers is such that at peak times they are 80% busy, then if one of the servers becomes unavailable, the surviving three servers, even at 100% utilization, cannot sustain the level of workload with equivalent throughput and response time. This can be verified with a simple calculation:

 Equivalent number of machines in normal operation = 4 x 80% = 3.2 machines. 

The three surviving servers cannot handle the equivalent of 3.2 machines' worth of capacity without degrading throughput and response time.
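Carrying the calculation one step further makes the shortfall explicit: spreading the peak workload across the three survivors would require each of them to run beyond full capacity.

 Required utilization per surviving server = 3.2 / 3 ≈ 107%, which exceeds 100%.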

System tuning

To optimize the throughput and response time that is delivered by the surviving servers after a failure, tune all servers for optimum performance for the workload level that would apply in the event of a server becoming unavailable.

A well-managed installation should follow acknowledged best practices and tune its servers to eliminate unnecessary system bottlenecks, and will normally do so at the workload level reached during normal peak periods. In addition, tune servers for the higher workload level that would pertain if a server was unavailable during a peak period.

It is quite possible that in the capacity planning example above, the servers would be capable of handling the workload smoothly, albeit with longer response times and reduced throughput. In terms of their position in the throughput curve, they will be above the saturation point, but not have reached the buckle zone [2].

Accordingly, we recommend you carry out tuning activities at the normal peak level of workload and at the higher level that would pertain in cases of server unavailability at peak times.

Determining suitable levels of tuning parameters that deliver good performance at both levels of workload may require compromises. What is ideal for normal peak workloads may be less than ideal for the higher workload during times of server unavailability, and vice versa.

If compromises are necessary, make them as conscious business decisions based on evaluating the costs and benefits of the different options.

Software levels

The WebSphere plug-in is the key component handling workload management. Evaluate new versions of the plug-in as they become available, and move to new versions as soon as is practicable. Our testing found that later levels of the plug-in provided better load balancing behavior.

WebSphere Application Server plug-in parameters

In exploring how to minimize the effects of failover on the users of the portal, we edited the plug-in configuration XML to adjust the parameters that control the behavior of the WebSphere plug-in.

When administering a WebSphere Application Server, including one that runs WebSphere Portal, you regenerate the plug-in configuration file from time to time. This is done, for example, if the WebSphere cluster configuration is changed, perhaps to add one or more server nodes or application server clones, or if additional WebSphere Enterprise Applications are deployed.

After you regenerate the plug-in configuration file, be sure to reapply any manual changes, in particular those to the parameters described here.
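One simple way to avoid losing manual edits is to keep a copy of the customized file and compare it against the freshly generated one before restarting the Web server. The sketch below assumes an AIX shell; the path to plugin-cfg.xml is illustrative and should be replaced with the location used by your installation.

 # Illustrative path - substitute the plugin-cfg.xml location for your installation
 CFG=/usr/WebSphere/AppServer/config/plugin-cfg.xml

 # Before regenerating, keep the customized copy
 cp $CFG $CFG.custom

 # ... regenerate the plug-in configuration as usual ...

 # After regeneration, list the manual changes that need to be reapplied
 diff $CFG.custom $CFG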

LoadBalance

This parameter selects the load balancing algorithm: round robin (the default) or random. It is set on the <ServerGroup> tag as follows:

 <ServerGroup Name="xxxxx" LoadBalance="Random"> 

Our tests measured both round robin and random load balancing. In our case, we found that random load balancing gave better results than round robin. 'Better' in this case meant:

  • Fewer fluctuations during failover

  • Less time to achieve even distribution

  • Less variability during clone recovery

To minimize the effect of failovers on portal users, we recommend a trial run of random load balancing to evaluate whether the improvements we observed are obtained.

Factors to study in deciding which load balancing algorithm is better in any particular installation include:

  • Success and speed of load redistribution during failover

  • Success and speed of load redistribution during recovery

  • Any unevenness of load distribution

RetryInterval

This parameter controls the time (in seconds) that the plug-in waits after marking a clone down before attempting to connect to the clone again. It is set on the <ServerGroup> tag as follows:

 <ServerGroup Name="xxxxx" RetryInterval="300"> 

Several considerations apply when deciding on the most beneficial value for RetryInterval.

In general, the smaller the RetryInterval, the sooner a revived clone begins handling its share of the workload. On the other hand, there is little point in setting RetryInterval so short that a clone that has become inaccessible or unavailable is unlikely to have been revived within that interval; this only leads to repeated connection attempts while there is no possibility of the clone being available.

An extreme example is an installation that happened to have incorporated a rogue (badly written) portlet on one of its portal pages. The misbehavior of the rogue portlet eventually caused a clone to hang. System monitoring may be required to recognize the problem; operator intervention, such as closing and restarting the clone, may be required to resolve it. In this case, a reasonable value for RetryInterval is a length of time a little longer than expected for the monitoring and operator action.

If clones are marked down when they are really available, you may be tempted to set an extremely short RetryInterval. In this form of 'false negative' an extremely short RetryInterval helps to bring the falsely down clone back into use quickly. In this case, we recommend finding and removing the cause of the false negative availability rather than setting an artificially short RetryInterval to ameliorate its effect.

This situation might arise when an installation experiences significant 'burstiness' in the arrival pattern of user requests. In this case, if the WebSphere Application Server Web container transport connection backlog (described below) is not set at a sufficiently high level, the burstiness of request arrival patterns could sometimes lead to the connection backlog becoming full, after which subsequent new connection requests would be rejected and the requesting plug-in would mark the clone as down. Again, you might be tempted to set the RetryInterval to an extremely short value (say, one second). However, the real problem to address is that the connection backlog is not set at a high enough value.

In our testing, we deliberately set a long RetryInterval (30 minutes) so we could focus on failover separately from recovery. However, this setting is not likely to be appropriate for production systems, other than in exceptional cases of problems experienced with applications that require relatively time-consuming manual intervention to resolve.

When automatic system management and automatic problem identification (for example, using Tivoli management products, or similar functions) are implemented, a criterion that could help decide a suitable threshold for this parameter is the time typically taken for the monitoring system to recognize a problem and correct it.

We suggest a value of 300 seconds.

ConnectTimeout

This parameter controls how long an instance of the plug-in waits when attempting to establish a TCP connection with an application server clone. If the connection has not been made before this timeout expires, the plug-in marks the clone as 'down' and tries another one.

This parameter is one of the principal determinants of the time taken for user requests to be switched from an inaccessible clone to another clone in the cluster. Reducing this timeout value reduces the delay in responding to users during a failover incident.

The parameter is set on the <Server> tag as follows:

 <Server CloneID="xxxxx" ConnectTimeout="10" Name="xxxxxx"> 

Absence of this parameter means that the system default TCP connection timeout applies. In AIX this is normally 75 seconds, but it can be altered by setting a new value for the tcp_keepinit parameter using the AIX 'no' command. 75 seconds is a long time in the context of connecting to another server within the local, fast Web site infrastructure. In addition, if the HTTP server uses a multi-process or thread-based model, the connection attempt is a 'blocking' connect, which means that during the timeout, connection attempts by other threads in the same HTTP Server process are blocked.
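As an illustration only (we did not rely on this in our tests), the AIX 'no' command can display and lower the system connection establishment timeout. On the AIX levels we are familiar with, tcp_keepinit is expressed in half-second units, so verify the units on your system before changing it.

 # Display the current value (150 half-seconds = 75 seconds by default)
 no -o tcp_keepinit

 # Example: reduce the TCP connection establishment timeout to about 20 seconds
 no -o tcp_keepinit=40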

We recommend:

  • Set ConnectTimeout to as low a value as possible without causing 'false negatives'. In this case, a false negative arises when the worst-case time taken for an application server to respond to a connection request (that is, taking account of peak loads) exceeds the ConnectTimeout setting, leading plug-ins to mark clones down when they are not.

  • Address the cause of the false negatives rather than try to ameliorate the consequences. Set ConnectTimeout high enough to avoid false negatives instead of setting RetryInterval artificially small.

  • Set ConnectTimeout in the five to ten seconds range, and monitor the HTTP server's native.log file for clones being marked down when they are not (see the sketch after this list). If such false negatives are observed, increase ConnectTimeout to find the lowest value that prevents false negatives.
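A simple way to watch for this is to scan the plug-in log for entries reporting a clone being marked down. The exact message text depends on the plug-in level, and the log path below is illustrative; use the file configured on the <Log> tag in your plug-in configuration.

 # Illustrative only: message wording varies by plug-in level, path varies by installation
 grep -i "down" /usr/WebSphere/AppServer/logs/native.log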

WebSphere Application Server parameters

This section introduces the WebSphere tuning parameters and key WebSphere Application Server parameters we focused on. Full details on these can be found in the WebSphere product information.

Web container transport - maximum keepalives

This parameter controls the maximum number of concurrent requests that are accepted for processing in the Web container. To help reduce the time required for plug-ins to detect unresponsive clones, set this value at the minimum level suitable for well-tuned operation at the peak level of workload that is to be handled. Setting this value unnecessarily high increases the time it takes for unresponsive clones to be recognized by the plug-ins.

When deciding on this value, refer also to the WebSphere tuning best practices available at the WebSphere Developer Domain and in the WebSphere Infocenters.

Web container transport connection backlog

This parameter specifies the maximum number of requests that are queued in the application server waiting to be accepted for processing in the Web container. To help reduce the time required for plug-ins to detect unresponsive clones, set this value at the minimum level suitable for well-tuned operation at the peak level of workload that is to be handled. Setting this value unnecessarily high increases the time it takes for unresponsive clones to be recognized by the plug-ins.

In most circumstances you can set a value significantly less than the default value (511) without having a detrimental effect on system operation. We used a value of 128, and during the analysis of failover test 4, we set it to 64 without problems. We recommend a value of 64 as a suitable starting point.

The important factor here is not to make the value so small that 'false negatives' arise. In this case, a 'false negative' would cause a plug-in to mark a clone down because the connection backlog was full when in fact the clone was healthy and had temporarily built up a backlog of requests.

WebSphere Application Server database connection timeout

This parameter specifies the maximum time in seconds that requests for a connection wait if the maximum number of connections is reached and all connections are in use. Reducing the value of this parameter can help to reduce the time taken for plug-ins to recognize the presence of an unresponsive clone.

We changed the default value of 90 seconds to 10 seconds without adversely affecting system behavior. We were using a very large, highly tuned database server engine that was able to deliver very consistent responses. Further, on each node the maximum number of database connections was set to 30, of which typically only about 15 were in use. In these circumstances, therefore, a value of ten seconds made sense: if a database connection timeout did occur, it would most likely not be recoverable.

We recommend setting this parameter in the range of five to ten seconds.

The criterion for this parameter in terms of reducing the time taken for plug-ins to recognize the presence of an unresponsive clone is to set it at a low value but not so low as to cause false negative connection time-outs. Records of such time-outs are in the WebSphere log.

WebSphere Application Server transaction timeout

This parameter specifies the number of seconds to allow a transaction to proceed before stopping it because it is taking too much time. This parameter can help minimize the time taken for unresponsive clones to be recognized.

This parameter measures the time from when a transaction begins to when it starts to commit its held resources. The default value is 120 seconds. A more reasonable value is 10 to 20 seconds. Use the higher end of the range if all the dependencies for the page are not known, or where the servlet may occasionally make recursive calls to a remote database or other network provider. Records of such time-outs are in the WebSphere log.

We used 20 seconds without adversely affecting system behavior and we recommend 20 seconds as a starting value.

Again the criterion for this parameter should be to set it at a low value, but not so low as to introduce undesirable side effects.

WebSphere Application Server transaction inactivity timeout

This parameter specifies the number of milliseconds a transaction can remain inactive before it is stopped. This timer applies when a transaction scope spans multiple methods: the remote method that is invoked sets the timer so that it can release held resources if the calling method does not return with further requests or start a commit. The default is 60 seconds (60,000 milliseconds). Set this value lower than the transaction timeout value. A reasonable value is 5 to 15 seconds. A message is written to the WebSphere log when this timeout occurs.

We used a value of 15 or 20 seconds without adversely affecting system behavior. We recommend 20 as a suitable starting value.

Again the criterion for this parameter should be to set it at a low value, but not so low as to introduce undesirable side effects.

Setting the values for transaction timeout and transaction inactivity timeout

We used LoadRunner to obtain end-user response time metrics and Resource Analyzer to monitor servlet times. We chose timeout values short enough that, when problems did happen, the end user was not left hanging. On the other hand, the values were large enough to tolerate some level of congestion or peak load without causing false timeouts to be detected.

WebSphere Portal parameters

Our tuning of WebSphere Portal used the well-known WebSphere best practices as the starting point, and because we were not driving the system to saturation, little additional tuning was required. Other than parameters concerned with configuring the portal to function the way we intended, we did not adjust any Portal parameters when optimizing the system for failover resilience.

Other hints and tips

Plug-in versions

We used WebSphere Portal Version 4.2.1, which of course runs on WebSphere Application Server Version 4.x. This occasionally led to confusion about which version of the WebSphere plug-in to use. The plug-in that ships with WebSphere Portal V4.x is the WebSphere Version 5 plug-in. So, always verify that any fix you apply, particularly an interim fix, is for the WebSphere Version 5 plug-in.

During our testing, IBM shipped interim fixes for the plug-in. Because the behavior of the plug-in was the principal focus of our testing, we generally applied new versions of the plug-in to the system as soon as they were available. By and large the newer versions performed better.

Accordingly, we recommend that customers interested in optimizing their systems for failover resilience should evaluate new versions of the WebSphere plug-in as soon as is practical.

vmtune settings

During our test we had a problem with one of the portlets on one of our test pages. The problem resulted in Portal writing to its log files to such an extent that the log files filled up rapidly and the system was consumed by writing log entries. One by-product of this was that the AIX file system began to consume real memory to cope with the tremendously high rate of write activity. This in turn led to there being insufficient real memory for the JVM heaps of our application server clones, which then began to be subject to paging by the AIX system with a disastrous impact on the performance of the clones.

The AIX vmtune command provides a degree of control over the consumption of real memory by the file system. As an example, the command below sets minperm to 10% and maxperm to 25%, limiting file system caching to roughly 25% of real memory:

 vmtune -p 10 -P 25 

For more details, see the AIX information on the vmtune command.

Plug-in logging

In studying workload management, we found it useful to monitor the distribution of workload management decisions, and also to explore the workload management decisions made by individual instances of the WebSphere plug-in. We devised a technique for doing this, and have documented it on WebSphere Developer Domain.

The logging provided in the WebSphere plug-in gives a choice of LogLevel="Error", which logs events such as a clone becoming inaccessible, or LogLevel="Trace", which logs an enormous volume of data, only a small fraction of which was relevant to our study of workload management. We used LogLevel="Trace" during our preliminary atomic studies of failover, but we calculated that if we had left it enabled during one of our failover test runs lasting more than one hour, in excess of two terabytes of log would have been generated.

Our technique consisted of customizing the IBM HTTP Server log files to include the session cookies used by WebSphere to manage session affinity. Session cookies identify the specific application server clone that handles a particular user's session and, after a user first accesses the portal, are included in the HTTP headers of every request that passes between the user and the server.
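As a sketch of the kind of customization we mean, IBM HTTP Server accepts Apache-style LogFormat directives, so the incoming Cookie request header (which carries the WebSphere session cookie, JSESSIONID by default) can be appended to each access log entry. The format string and log file name below are illustrative.

 # Illustrative httpd.conf fragment: append the Cookie request header to each entry
 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Cookie}i\"" cookielog
 CustomLog logs/access_cookie.log cookielog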

Connection monitoring

We found it useful to track the TCP connection activity. Two articles in WebSphere Developer Domain (see References) describe techniques that we tailored for our environment.
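For a quick view without additional tooling, standard AIX commands can summarize the connection states between the HTTP servers and the application server clones. The port number below is an assumption (9080 is a common Web container transport port); substitute the transport port used by your clones.

 # Count TCP connections to the assumed Web container port, grouped by connection state
 netstat -an | grep 9080 | awk '{print $6}' | sort | uniq -c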

Summary of recommended parameters

This table summarizes the parameters we recommend. The Best practice column shows the recommendations currently documented in the WebSphere Developer Domain as they apply to WebSphere applications in general, principally focusing on optimizing performance. The HVWS recommendation column shows the values we recommend when the particular focus is on optimizing to minimize the impact on end users of outages in the application server tier. In some cases, a compromise between the values may be necessary, depending on the relative priority of performance versus minimizing the impact of an outage.

Table 4-2: Summary of recommended parameters

Property                         Default value                Best practice               HVWS recommendation
RetryInterval                    60 seconds                   900 seconds                 300 seconds
ConnectTimeout                   75 seconds (AIX, blocking)   10 seconds (nonblocking)    5-10 seconds (nonblocking)
Connection backlog               511                          128                         64
LoadBalance                      Round robin                  Round robin                 Random
DB connection timeout            90 seconds                   N/A                         5-10 seconds
Transaction timeout              120 seconds                  N/A                         20 seconds
Transaction inactivity timeout   60 seconds                   N/A                         20 seconds
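For convenience, the fragment below sketches how the plug-in settings from this table might look together in plugin-cfg.xml. The server group name, clone identifiers, host names, and transport details are placeholders; the file generated for your own cluster will differ.

 <!-- Illustrative plugin-cfg.xml fragment; names and clone details are placeholders -->
 <ServerGroup Name="PortalServerGroup" LoadBalance="Random" RetryInterval="300">
    <Server CloneID="clone1" ConnectTimeout="10" Name="PortalServer_1">
       <Transport Hostname="appserver1" Port="9080" Protocol="http"/>
    </Server>
    <Server CloneID="clone2" ConnectTimeout="10" Name="PortalServer_2">
       <Transport Hostname="appserver2" Port="9080" Protocol="http"/>
    </Server>
 </ServerGroup>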

[2] For an explanation of 'buckle zone' and other performance tuning concepts, see, for example, 'IBM WebSphere 4.0 Performance Tuning Methodology', available at http://www.ibm.com/software/webservers/appserv/doc/v40/ws_40_tuning.pdf


