We left the TriMont team in Chapter 14 working in their test phase. Now let's build a capacity plan based on the results of their test. Since the capacity planning task is relatively short, we'll complete it here, rather than devoting an entire chapter. This part of the case study looks at the three phases of capacity planning just outlined in this chapter: review plan requirements, review performance test results, and project capacity.

Review Plan Requirements

In Chapter 14, the case study testing phase, TriMont estimated load and throughput requirements. They also decided on 25% headroom. Based on the work in Chapter 14, the first part of the Capacity Planning worksheet is completed as shown in Table 15.12.

Table 15.12. TriMont Capacity Planning Worksheet, Part 1
Review Performance Test Results

We left Chapter 14 without developing any final test results, so before going any further, let's look at some performance results achieved during the TriMont tests. The DBA investigated and corrected the HTTP session database CPU utilization problem by adding disk drives to the database to reduce I/O contention. TriMont then completed single-server ramp-ups, with the results shown in Figure 15.2. As you see from the data shown in the figure, the test reached the target response time of five seconds with 400 concurrent users. At this point, we also reach a throughput plateau of 41 pages per second.

Figure 15.2. User ramp-up performance test results
TriMont also set up a two-system cluster and, as expected, worked through some issues; eventually, however, the team produced good cluster scalability, as shown in Figure 15.3.

Figure 15.3. TriMont scalability results
Let's fill in Part 2 of the Capacity Planning worksheet with the TriMont test results. Remember, in order to reduce the load driver virtual user costs, the team ran the tests with reduced think time. Based on the calculations in the testing phase, 1 user with a 45-second average user visit time is approximately equivalent to 10 users with the full 7-minute average user visit. Therefore, the 400-client test load corresponds to 4,000 concurrent users with the full 7-minute user visit time.

To fill out the cluster scalability section of the worksheet, we need the cluster load and throughput corresponding to 5-second response time results. However, TriMont's cluster tests only went up to a 2.5-second response time because they did not simulate enough client load to reach the 5-second mark. If you recall, TriMont only purchased a 500-user license for this phase of testing. For now, let's use the cluster results TriMont achieved at 400 users, and decide later if the capacity plan contains too much risk and requires more testing. Similarly, the CPU utilization collected for the cluster run does not represent the 5-second response time objective. However, single-server CPU utilization results do exist for response times at both 2.5 seconds and 5 seconds. We document all this data for Part 2 of the Capacity Planning worksheet in Tables 15.13, 15.14, and 15.15.

Table 15.13. TriMont Capacity Planning Worksheet, Part 2: Single-Server Results
Table 15.14. TriMont Capacity Planning Worksheet, Part 2: Cluster Scalability Results
Table 15.15. TriMont Capacity Planning Worksheet, Part 2: CPU Utilization Data
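The think-time conversion used in Part 2 can be sketched in a few lines. This is only an illustration: the factor of 10 is the approximate equivalence stated in the text, and the function and constant names are hypothetical.

```python
# Approximate equivalence from the text: 1 test client with a 45-second
# visit stands in for about 10 users with the full 7-minute visit.
USERS_PER_TEST_CLIENT = 10  # rounded factor quoted in the case study


def equivalent_users(test_clients: int, factor: int = USERS_PER_TEST_CLIENT) -> int:
    """Convert reduced-think-time test clients to full-visit concurrent users."""
    if test_clients < 0:
        raise ValueError("test_clients must be non-negative")
    return test_clients * factor


if __name__ == "__main__":
    # The 400-client single-server test load from the ramp-up results.
    print(equivalent_users(400))
```

The same conversion shows why the 500-user license limits the cluster tests: 500 test clients represent roughly 5,000 full-visit users under this factor.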
Project Capacity

Though TriMont only tested a two-server cluster, the scaling ratio data is still useful to the capacity plan. The cluster results of 42 pages per second with a 400-client load against two servers make for an excellent 1.9x scaling ratio, as shown in line 12 of Table 15.16. Using this scaling ratio, we estimate the three- and four-server cluster scaling ratios at 2.8x and 3.7x, respectively, as shown in Table 15.17. Remember, we do not recommend projecting beyond double the cluster size tested; therefore, we do not include estimates for a five- or six-machine cluster.

Now we must estimate the load and throughput even though the cluster results did not reach a five-second response time. Based on the single-server ramp-up results and the cluster results, we assume the two-server cluster reaches a five-second response time with approximately 800 test clients and a throughput of approximately 42 pages per second. This closely matches the single-server results shown earlier for 400 users: a five-second response time and 41 pages per second. Using this data, TriMont estimates that a three-server cluster supports 12,000 concurrent users and 118 pages per second. This matches nicely with the test plan estimate of three application servers to support the web site's peak load. TriMont is relatively confident that a cluster of three application servers supports the load and dynamic page requests for the site.

Now, to complete the capacity planning exercise, we need to consider the processor utilization data across all the servers in the web site. As you can see from the CPU utilization results in Table 15.15, at 400 clients none of the servers in the cluster goes beyond 50% utilization. As mentioned, the DBA solved the original problem of high utilization on the HTTP session database, bringing it down to only 38% at 400 clients.

Table 15.16. TriMont Capacity Planning, Part 3: Cluster Scaling Ratio
Table 15.17. TriMont Capacity Planning Worksheet, Part 3: Estimate Scaling Ratio
Table 15.18. TriMont Capacity Planning Worksheet, Part 3: Estimate Throughput and Load
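The scaling ratio estimates in Part 3 can be reproduced with a short sketch. It assumes that each server added after the first contributes the same increment as the second server did (0.9x here), which is one reading consistent with the 2.8x and 3.7x figures in the text rather than a documented formula. It also treats 42 pages per second as the per-server baseline for the throughput estimate, again an assumption that happens to reproduce the 118 pages per second quoted above. All names are hypothetical.

```python
def projected_scaling_ratio(servers: int, tested_servers: int, tested_ratio: float) -> float:
    """Linearly extrapolate a measured cluster scaling ratio.

    Assumes every server beyond the first adds the same increment that
    was observed in the tested cluster (an illustrative assumption).
    """
    if tested_servers < 2:
        raise ValueError("need a tested cluster of at least two servers")
    if servers < 1:
        raise ValueError("servers must be at least 1")
    # Guideline from the text: do not project beyond double the tested size.
    if servers > 2 * tested_servers:
        raise ValueError("projection exceeds double the tested cluster size")
    per_added_server = (tested_ratio - 1.0) / (tested_servers - 1)
    return round(1.0 + per_added_server * (servers - 1), 2)


def projected_throughput(ratio: float, baseline_pages_per_sec: float) -> int:
    """Scale a baseline throughput by the projected ratio, rounded to whole pages."""
    return round(ratio * baseline_pages_per_sec)


if __name__ == "__main__":
    for n in (3, 4):
        print(n, projected_scaling_ratio(n, tested_servers=2, tested_ratio=1.9))
    three_server_ratio = projected_scaling_ratio(3, 2, 1.9)
    print(projected_throughput(three_server_ratio, 42))
```

Raising an error for five or more servers mirrors the recommendation not to project past double the tested cluster size; a real worksheet might simply leave those rows blank.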
Now let's estimate the CPU utilization for a three-server cluster supporting 10,080 users and 120 pages per second in Table 15.18. First, we double the projections for 800 users. Then, as shown in Table 15.19, we use these results to estimate the CPU utilization on each server at 1,200 test clients. The last column meets the plan objectives for load, throughput, and response time.

Table 15.19. TriMont Processor Utilization Projections
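As a rough illustration of this kind of projection, the sketch below scales a measured CPU utilization linearly with client load. It assumes a server shared by the whole cluster (such as the HTTP session database) and strictly linear growth, and it uses only the 38% at 400 clients figure quoted in the text. It does not reproduce the values in Table 15.19, and the function names are hypothetical.

```python
def project_shared_cpu(measured_pct: float, measured_clients: int, target_clients: int) -> float:
    """Project CPU utilization on a shared server, assuming linear growth with load."""
    if measured_clients <= 0:
        raise ValueError("measured_clients must be positive")
    return measured_pct * target_clients / measured_clients


def is_saturated(projected_pct: float, limit_pct: float = 100.0) -> bool:
    """A projection at or above the limit signals a bottleneck, not a real reading."""
    return projected_pct >= limit_pct


if __name__ == "__main__":
    # HTTP session database: 38% CPU at 400 test clients (from the text).
    for clients in (800, 1200):
        pct = project_shared_cpu(38.0, 400, clients)
        print(clients, pct, is_saturated(pct))
```

A projected value above 100% is not a utilization a server can actually reach; under these assumptions it simply indicates that the server would saturate before the target load.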
Based on these projections, our original estimate of a 1:1 ratio between the HTTP server and the application server looks good. The addition of the caching proxy servers also lowers the HTTP server processor requirements. However, based on these numbers, the HTTP session database server looks like a potential bottleneck at peak load. At a minimum, the HTTP session database requires a larger server. The Order test database server processor utilization projection of 90% doesn't really concern us, since this is not the production database. Of course, we recommend that TriMont monitor the additional load against the real production Order database to make sure that it is sufficiently sized.

Once TriMont resizes the HTTP session database, they can begin the production deployment of their web site. Just like any other web site, the TriMont site needs an ongoing capacity plan, as we'll discuss in the next section.