Ongoing Capacity Planning


Continuous web site monitoring is the key to successful long-term capacity planning. By continually comparing your web site against your plans and projections, you develop more accurate estimates for the site and a better understanding of how to build future test plans. We recommend two analysis perspectives for your ongoing monitoring:

  • Is the performance similar to the test performance?

  • Is the actual usage similar to your plan expectations?

An error in either of these projections often spells disaster for your web site. Sometimes the actual site does not perform as it did in your tests. If your test scripts and environment did not accurately simulate your production environment, the performance numbers you obtained in your tests will not apply to the real site. For example, if your usage patterns actually generate more "buy" traffic than your test scripts simulated, your application servers are undersized for the amount of purchasing business logic that "buy" requests trigger. "Buy" traffic also requires more SSL processing to support the higher percentage of purchases.

On the other hand, you often find that your tests simulated your production performance quite well, but the actual number of users visiting your site differs from your original sizing. If you receive significantly more users than planned, overall response times rise beyond your objectives.

Collecting Production Data

For your production site, monitor the same data captured during load testing on your production machines. We recommend capturing at least the following data:

  • Client load

  • Throughput (requests per second)

  • Response time

  • CPU utilization on all machines

Many additional metrics come in handy for tuning and troubleshooting problems, but these four metrics give you the essentials for comparing the actual performance of your site against capacity planning projections.
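As a concrete sketch of how these essentials reduce to numbers each monitoring interval, the fragment below computes throughput and average response time from a window of request samples. The class and method names are ours, not from the book; it is a minimal illustration, assuming you already collect per-request timings:

```java
import java.util.List;

// Hypothetical sketch: reducing a monitoring window to two of the four
// key metrics (throughput and response time). Names are illustrative.
public class MetricsSummary {

    // Throughput: completed requests divided by the window length in seconds.
    static double throughputPerSec(int completedRequests, double windowSeconds) {
        return completedRequests / windowSeconds;
    }

    // Average response time over the window, in seconds.
    static double avgResponseSec(List<Double> responseTimesSec) {
        return responseTimesSec.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // 1,500 requests completed over a 5-second sample window
        System.out.printf("throughput = %.0f req/sec%n",
                throughputPerSec(1500, 5.0));
        System.out.printf("avg response = %.2f sec%n",
                avgResponseSec(List.of(1.8, 2.0, 2.2)));
    }
}
```

Client load and CPU utilization come from session counters and operating-system tools rather than request timings, so they are not shown here.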

Analyzing Production Data

Let's return to our example from Chapter 6 (rather than continuing with the case study at this point), and compare some actual usage data against the capacity plan. Table 15.20 shows the capacity plan projections and the actual web site performance results.

Table 15.20. Production Performance Results
                            Number of Users   Response Time   Transactions/Sec
  Plan requirement          5,000             2 sec           250
  With headroom             10,000            2 sec           500
  Actual peak performance   2,500             2 sec           350

The original example required 250 transactions per second and a two-second response time. On the surface, the production results of 350 transactions per second and a two-second response time look great! The web site meets the established response time criteria and actually serves more transactions than planned. The web site gets more orders than expected, and everyone is happy. Hopefully, you don't bask in the site's success for too long, but take this opportunity to study the actual web site performance data against your plan and test results. There may be serious discrepancies here that spell disaster in the future. Remember to look at the results against both your test performance and the expected usage patterns of your site.

First, how does the actual performance compare to your test results? Remember you built this site for peak load requirements and with a significant 50% headroom buffer. If your site performance actually mirrors the test environment, the peak performance of the site should match the test plan before including the headroom estimate. Table 15.21 shows the expected peak performance versus the actual production web site results.

Actually, we expected a one-second peak response time at this point, not two seconds. By looking at just this statistic, you already see a potential problem looming. Next, let's consider the transaction rate. At first, 350 transactions per second seemed like a great result; however, based on the buffered capacity of the servers, we really expected 500 transactions per second at this point. Given a peak load of only 2,500 users, not 5,000, you realize that your site does not match the test results.

Table 15.21. Expected versus Actual Production Results
                                          Users    Response Time   Transaction Rate    Application Server CPU Utilization
  "Expected" based on plan/test results   5,000    1 sec           500 requests/sec    75%
  Actual                                  2,500    2 sec           350 requests/sec    80%
  Delta                                   -2,500   +1 sec          -150 requests/sec   +5%
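The delta row of this comparison is simple arithmetic, which means it can be produced mechanically after every monitoring interval rather than by hand. A minimal sketch, with field names of our own choosing:

```java
// Hedged sketch: computing the delta row of an expected-versus-actual
// comparison. A positive delta means the actual value exceeds the plan.
public class PlanVsActual {

    static double delta(double actual, double expected) {
        return actual - expected;
    }

    public static void main(String[] args) {
        System.out.println("users:        " + delta(2500, 5000));  // -2500.0
        System.out.println("resp (sec):   " + delta(2.0, 1.0));    // 1.0
        System.out.println("requests/sec: " + delta(350, 500));    // -150.0
        System.out.println("cpu (%):      " + delta(80, 75));      // 5.0
    }
}
```

Automating the comparison makes it far more likely that the check actually happens every interval, instead of only when someone remembers to look.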

With discrepancies like this, we recommend digging deeper into the web site's performance. Today the site operates just fine, but if the user load increases to the original projections, we can expect the web site to miss its performance objectives. The application server already runs at 80% CPU utilization; if the user load doubles to the originally expected 5,000 users, response time will probably increase significantly, likely approaching four seconds.
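One standard way to quantify this risk (from basic queueing theory, not from the text) is the utilization law, U = X * D: CPU utilization equals throughput times the CPU service demand per request. Working backward from the measured 80% utilization at 350 requests per second gives the demand per request, and from that the maximum throughput the CPU can sustain:

```java
// Sketch using the utilization law (U = X * D) to estimate how far the
// application server CPU is from saturation. Figures are from the
// measured production data: 80% CPU at 350 requests/sec.
public class CapacityProjection {

    // Service demand per request: D = U / X (seconds of CPU per request).
    static double serviceDemandSec(double cpuUtilization, double throughputPerSec) {
        return cpuUtilization / throughputPerSec;
    }

    // Maximum sustainable throughput when the CPU saturates (U = 1.0).
    static double maxThroughputPerSec(double serviceDemandSec) {
        return 1.0 / serviceDemandSec;
    }

    public static void main(String[] args) {
        double d = serviceDemandSec(0.80, 350.0); // ~2.3 ms of CPU per request
        System.out.printf("max throughput ~ %.0f req/sec%n", maxThroughputPerSec(d));
        // ~438 req/sec: short of the 500 req/sec the plan requires at 5,000 users
    }
}
```

Under this simple model the server saturates near 438 requests per second, so it cannot reach the 500 requests per second the plan calls for; queueing at that point is what drives response times sharply upward.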

At this stage, look at both comparisons to the test performance and to the plan. The first difference is the user load of 2,500, half of the planned load. Looking into this, you find that the advertising campaign for your web site is running behind. The marketing team is just getting the word out, and expects the load to meet the planned projections within the next month!

The next difference is the lower transaction rate and higher response time exhibited by the site. To understand the cause behind the lower throughput, look at the actual transactions to see how they match your tests by comparing the HTTP server access logs to your test scripts. For example, you might find that the actual visits to the site consist of a much higher percentage of "buy" requests than the test scripts simulated. While a lot more purchases explain the excitement with the web site's success, capacity becomes a concern: "Buy" transactions use significantly more application server processing. They also use more HTTP server resources for SSL support, and they allow less caching. This taxes the HTTP server, application server, and database more heavily, which results in lower throughput and correspondingly higher response times per user.
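Comparing the access logs to the test scripts can be as simple as tallying what fraction of logged request URIs fall into each transaction category. A minimal sketch, where the `/buy` path segment and the 25% test mix are hypothetical values, not taken from the book:

```java
import java.util.List;

// Hedged sketch: measuring the actual "buy" mix in access-log request URIs
// so it can be compared against the mix the test scripts simulated.
public class LogMixCheck {

    // Fraction of requests whose URI contains the given path segment.
    static double fractionMatching(List<String> requestUris, String segment) {
        if (requestUris.isEmpty()) {
            return 0.0;
        }
        long hits = requestUris.stream()
                .filter(uri -> uri.contains(segment))
                .count();
        return (double) hits / requestUris.size();
    }

    public static void main(String[] args) {
        // In practice these come from parsing the HTTP server access log.
        List<String> uris = List.of(
                "/browse/shoes", "/buy/checkout", "/browse/hats", "/buy/cart");
        double actualBuyMix = fractionMatching(uris, "/buy");
        double testBuyMix = 0.25; // hypothetical mix the test scripts simulated
        System.out.printf("actual buy mix = %.0f%%, test mix = %.0f%%%n",
                actualBuyMix * 100, testBuyMix * 100);
    }
}
```

A sizable gap between the two percentages is exactly the kind of evidence to hand back to the test team when they rebuild their script mix.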

At this point, you should adjust your capacity plan requirements. Plan to increase the amount of HTTP server and application server processing power in time to meet the increased load coming your way. Also, check CPU consumption on your database systems and upgrade them as necessary. Finally, feed this information back to the test team. The next time they test new function for the web site, this information provides them with the correct mix of test scripts to simulate production behavior.

This type of analysis and adjustment needs to be ongoing. Use the actual production site statistics to readjust your capacity planning and to change any future test scripts. Be sure to monitor and compare all the key metrics (user load, throughput, response time, and CPU utilization) so you do not miss major trends. In addition, keep abreast of any marketing initiatives or new applications that might significantly increase the user load against your site.



Performance Analysis for Java Web Sites
ISBN: 0201844540
Year: 2001
Pages: 126
