Testing Underway


In this part of our case study, we'll look at some potential problems the TriMont team might encounter while conducting the performance test. In Chapter 13, we covered some of the performance symptoms and causes we encounter most frequently in the field, and we apply some of those lessons in this case study. Of course, for any given performance test, you may encounter a completely different set of problems. Likewise, for any given performance problem symptom (burstiness, underutilization, and so on), a variety of solutions exist. Please do not assume that the solutions we present in this case study are the only solutions for the symptoms discussed.

Burstiness

As planned, the TriMont team begins the testing with only one Java web application server machine and one HTTP server machine. They set up a small cluster of test client machines and plan to start the test with 100 virtual users. Remember, we expect 720 virtual users to drive both of our systems to full utilization. Because we're testing only one web application server and HTTP server pair, we expect to need half that number (360 virtual users); thus, 100 users represent roughly 28% of the load we expect these servers to handle.

Symptoms

The TriMont test team starts the 100 user test and brings the virtual users on-line over the course of 5 minutes (20 users per minute). The team lets the test run for 10 minutes after all the users log on. Then the virtual users finish the test over a period of 2 minutes. This gives us a total test time of 17 minutes, 10 minutes of which actually generates usable data. (Remember: don't use the ramp-up or ramp-down time in your results; see Chapter 11 for details.)

Unfortunately, the initial data doesn't look good at all. Here's a sample of the CPU data captured during the testing:

Table 14.4. TriMont Initial Test Results

Time (from test start)    HTTP Server    Java Web App Server    Catalog Database Server
6 minutes                 20%            20%                    5%
8 minutes                 0%             0%                     0%
10 minutes                21%            21%                    5%
12 minutes                0%             0%                     0%

The team repeats the test several times but gets similar results for each run. They even try adding more users to the test, but continue to see the same pattern (with higher CPU utilization before the CPUs drop to 0% busy).

Problem Isolation

This is a classic presentation of the burstiness performance problem. For some reason, our servers (and database, in this case) do no work at intervals during the test. As we discussed in Chapter 13, this could be because no work is arriving at the servers or because some back-end resource (such as a database) is not providing needed information in a timely manner to the servers. You ask a few questions of the TriMont test team to try to isolate the problem. Here's the dialog with the test team.

What is the response time during the lulls?

The team has not isolated the response times during the lulls in the test. However, the overall response time remains acceptable (under three seconds).

What type of scripts does the test run?

The test only runs the browse script at this point. This limits the systems involved in the testing to the following:

  • HTTP server machine

  • Application server machine

  • Catalog database server machine

  • HTTP session database server machine

Do you have any CPU data for the HTTP session database server?

Yes. The HTTP session database server exhibits the same pattern of bursty activity as the catalog database.

This exchange tells you that none of the servers involved in the testing have an overwhelmed CPU. Theoretically, all servers involved in the test have enough capacity to support the traffic. Also, you learn that the response time during the test remains good despite the "dead spots." This raises your suspicions about the test client's role in the burstiness. If the clients waited during the "dead spots," you would expect a high overall response time. Granted, you don't have a good run as a basis for comparison at this point, but this data still puts the test client high on your list of suspects. Also, it's early in the test, and TriMont lacks experience with test tools. You want a closer look at the test client setup.

Problem Resolution

You examine the test client settings and discover a few problems:

  • The think times in the scripts are too long. The team forgot to reduce the think times used for the original 7-minute visit to values appropriate for a 45-second visit.

  • The test team did not randomize the think times in the script. All of the virtual users in the test wait exactly the same amount of time after they finish a task.

  • The test does not gradually increase the active virtual users in groups of 20 over a 5-minute period. Instead, the team configured (accidentally, of course) all of the virtual users to start making requests simultaneously after a 5-minute wake-up period: the virtual users "wake up" in groups of 20 every minute, but they don't actually start sending requests until all 100 users are awake.

Overall, TriMont took several shortcuts that made problem resolution more difficult. As discussed in Chapter 7, always take the time to read the generated scripts (Appendix B, Pretest Checklists, Test Simulation and Tooling). Reviewing the scripts would have identified the think time issues. In addition, the team started testing with a 100-client simulation. As described in Chapter 11, a good practice is to start with a single-user run, as this makes problems easier to identify and diagnose.

Also, the team never caught the last problem (the synchronized start) because they didn't capture data during the start-up period. Of course, you don't want to use this data in your final analysis, but it's a good idea to watch the systems during all parts of the testing for unusual behavior. If the TriMont team had followed this advice, they would have noticed that none of the servers were active during the start-up time.

The TriMont team addresses each of these problems. They reduce the think times in their scripts and randomize them within an upper and lower bound. They also fix the start-up to provide the staggered start they always intended. After they make these changes, the burstiness goes away. They see CPU utilization remain steady throughout the testing.
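
Most commercial test tools expose think-time randomization and staggered ramp-up as configuration settings, so no custom code is usually required. Purely to illustrate the two behaviors the TriMont scripts were missing, here is a minimal hand-rolled Java sketch (not TriMont's actual scripts); the 3-to-8-second think-time bounds are invented for the example:

import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: shows randomized think times and a staggered start.
public class ThinkTimeExample {

    // Pick a think time at random between a lower and upper bound (in seconds),
    // so the virtual users don't all pause for exactly the same interval.
    static long randomThinkTimeMillis(long minSeconds, long maxSeconds) {
        return ThreadLocalRandom.current().nextLong(minSeconds * 1000, maxSeconds * 1000 + 1);
    }

    public static void main(String[] args) {
        int totalUsers = 100;
        int usersPerGroup = 20;
        long groupIntervalMillis = 60_000;  // start a new group of users every minute

        for (int u = 0; u < totalUsers; u++) {
            // Each group begins sending requests as soon as it "wakes up,"
            // rather than waiting until all 100 users are awake.
            long startDelay = (u / usersPerGroup) * groupIntervalMillis;
            new Thread(() -> {
                try {
                    Thread.sleep(startDelay);
                    while (!Thread.currentThread().isInterrupted()) {
                        // ... issue a page request here (omitted) ...
                        Thread.sleep(randomThinkTimeMillis(3, 8));  // e.g., 3-8 seconds for a 45-second visit
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "vuser-" + u).start();
        }
    }
}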

Other Considerations

The test client headed our list of suspects and actually turned out to be the source of the problem. However, if we had pursued the test client angle but found everything in order, the next step might be a closer examination of the network. A network protocol analyzer always proves useful in these cases, as it allows us to see which component in the test system has stopped sending or receiving traffic. Also, a thread trace might help by telling us if the application servers are active, or if they're waiting on a back-end resource.

Also, we recommend watching the test run and collecting real-time data. This allows you to double-check the information you receive from third parties (in this case, the TriMont test team). In fact, if you had watched the test run in this case, you might have noticed right away that the remote systems showed no activity during test startup.

Bad information costs a great deal of time in bottleneck resolution. The data points you to potential problem areas, and a bad data point may lead you to waste time examining systems that are not involved in the bottleneck. Nothing replaces watching the test run for yourself and collecting key data points.
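
As an illustration only, even a trivial timestamped load log on each machine helps you spot dead spots during ramp-up and ramp-down. The sketch below uses the standard Java platform MXBean; in practice, vmstat, your operating system's performance tools, or your test tool's monitors work just as well:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.time.LocalTime;

// A minimal, timestamped load log. The point is simply to capture data during
// ramp-up and ramp-down as well as steady state, so unusual behavior is visible.
public class LoadWatcher {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            // getSystemLoadAverage() returns the 1-minute load average, or -1 if unavailable
            System.out.printf("%s load=%.2f%n", LocalTime.now(), os.getSystemLoadAverage());
            Thread.sleep(10_000);  // sample every 10 seconds
        }
    }
}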

Underutilization

After resolving the burstiness problem, the performance test moves ahead. Using the corrected test client setup, the performance team successfully runs 100 virtual users against the HTTP server and application server setup. Table 14.5 contains the recorded system metrics for the test. The test team arrived at these numbers by obtaining measurements at steady state. These represent maximum usage for the systems (discounting garbage collection cycles on the application server).

Table 14.5. One Hundred Virtual Users Test Results

Measurement                        Value
CPU utilization
  HTTP server                      20%
  Web application server           22%
  Catalog database server          5%
  HTTP session database server     25%
  Order test database server       7%
  Boat Selector database server    3%
  Account database server          5%
Response time                      2 sec
Throughput                         11 pages/sec

The systems look terrific at this point. The CPU values indicate that our projections regarding the capacity of the key servers seem just about right (with one exception, but more on that in a moment). The response time remains well under our five-second target for this machine, and the actual throughput is right on target with our projections:

45 sec/visit / 5 pages/visit = 9 sec/page, or 1/9 page/sec per user
1/9 page/sec per user * 100 users = ~11 pages/sec (projected throughput)

The only point of concern right now is the HTTP session database server, which uses more CPU than we anticipated. Remember, the test team only ran 100 virtual users to get these numbers. The HTTP session database must eventually support more than seven times this load, yet it already requires 25% of its available CPU to support our 100 users. The DBA for the HTTP session database agrees to take a look at the server; in particular, he's very interested in whether the database might make more efficient use of the available disk for its writes. In the meantime, the TriMont team rightly decides to proceed with more testing. They want to try doubling the test load to 200 virtual users, and you agree they have enough capacity on the systems to do this (and hopefully more!).

Symptoms

The team runs the 200 virtual user test a few times and comes up with some unexpected results, as shown in Table 14.6. Basically, the team adds more load, but the systems do not respond in kind. The CPU utilization remains at about the same levels seen during the 100 user test, and the throughput barely increases, although the response time doubles.

Table 14.6. Two Hundred Virtual Users Test Results

Measurement                        Value
CPU utilization
  HTTP server                      22%
  Web application server           25%
  Catalog database server          5%
  HTTP session database server     27%
  Order test database server       7%
  Boat Selector database server    3%
  Account database server          5%
Response time                      4 sec
Throughput                         13 pages/sec

Problem Isolation

This is a presentation of an underutilization problem. As we discussed in Chapter 13, underutilization occurs when the system lacks resources other than CPU to support the additional load. Underutilization results from a variety of causes, including insufficient network capacity, synchronization inside the web application code, and poor resource management (connection pools, and so on). In TriMont's case, the team successfully ramped up to 100 users before the problem appeared. This seems to point to something other than the web application software, although that software certainly is not completely in the clear just yet.

You decide to take a multipronged approach to resolving the problem, and you make the following requests of the TriMont team.

  • Run the 200 user test again while monitoring key resource pools (a monitoring sketch follows this list):

    • Database connection pools

    • Connection requests at the host database servers

    • Application server threads

    • HTTP server listeners

  • Try walking the test up from 100 users more slowly to see if they can find a threshold. (Try 110 users, 120 users, and so on, until the response time begins to increase.)

  • Collect all logs from the 200 user runs on the various servers.
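
To make the first request above concrete: many application servers expose connection pool statistics through JMX. The object name and attribute names below (jdbc/AccountDS, NumActive, NumIdle, NumWaiting) are purely hypothetical placeholders; check your server's documentation or use the vendor's resource analyzer, as the TriMont team does. A minimal sketch:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical sketch: polls connection-pool statistics exposed over JMX.
// Run it inside (or attached to) the application server JVM; every server
// names its pool MBeans and attributes differently.
public class PoolMonitor {
    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // Placeholder object name for the account database connection pool
        ObjectName pool = new ObjectName("AppServer:type=ConnectionPool,name=jdbc/AccountDS");

        while (true) {
            Object active  = mbs.getAttribute(pool, "NumActive");   // connections in use
            Object idle    = mbs.getAttribute(pool, "NumIdle");     // connections available
            Object waiting = mbs.getAttribute(pool, "NumWaiting");  // requests queued for a connection
            System.out.printf("active=%s idle=%s waiting=%s%n", active, idle, waiting);
            Thread.sleep(5_000);  // sample every 5 seconds during the run
        }
    }
}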

Problem Resolution

Based on your request, the TriMont team runs additional tests to collect more information. In the meantime, you look at the logs they captured on the first runs at 200 users. The logs from the application server show several database errors. The logs show the application server software attempting to establish more connections to the mainframe account and catalog databases, but these attempts to establish new connections with the mainframe fail. Remember, as discussed in Chapter 11, always check the server logs as part of the measurement validation process (also, see Appendix D: Results Verification checklist).

The resource tool provided by the application server vendor also reports some unusual behavior inside the connection pools for the account and catalog databases. While the pools each allow a maximum of 20 database connections, the resource tool shows a maximum of 5 connections in the pool at any given time. Furthermore, when the test runs more than 110 users, the resource tool shows many requests "waiting" to obtain a database connection from the pools for these two databases.
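
To see why a connection cap produces exactly this signature, consider a toy simulation (unrelated to TriMont's actual code): each request holds one of only five backend connections for 100 ms, so throughput tops out near 50 requests per second no matter how many client threads we add, while the extra threads simply wait and the CPUs stay idle.

import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

// Toy simulation of underutilization caused by a backend connection cap.
public class BackendCapDemo {
    public static void main(String[] args) throws InterruptedException {
        int backendConnections = 5;   // host database only grants 5 connections
        int clientThreads = 200;      // offered load
        Semaphore connections = new Semaphore(backendConnections);
        AtomicLong completed = new AtomicLong();

        for (int i = 0; i < clientThreads; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        connections.acquire();        // wait here when all 5 are busy
                        try {
                            Thread.sleep(100);        // simulated database call
                        } finally {
                            connections.release();
                        }
                        completed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.setDaemon(true);
            t.start();
        }

        Thread.sleep(10_000);  // run for 10 seconds
        System.out.printf("throughput ~ %.1f requests/sec%n", completed.get() / 10.0);
    }
}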

The account and catalog databases reside at the mainframe. The DBA for the mainframe databases runs some reports during a subsequent 200 user test. The reports show the host databases in question only have five connections available each for the Java web application server machines to use. (The DBA also reports all of these connections in use during the 200 user run she observes.)

The DBA bumps the available connections to 60 each for these mainframe databases. Although the maximum connection pool for the server currently under test is only 20, this gives us three times that capacity, enough to support our other two application servers when they come online. In return, we agree to let her know if we raise or lower the individual connection pools based on our testing so she can make similar adjustments. After the DBA makes her changes, the test team runs the 200 user test again. After a few runs, they produce the results given in Table 14.7.

These figures look much better and fall in line with expectations. The CPU utilization on the systems involved increases roughly in proportion to the increase in load. The response time remains low, although it begins to creep up as the CPUs become busier. The systems still keep up well with the throughput demands.

Table 14.7. Two Hundred Virtual Users Test Results (after DBA Changes)

Measurement                        Value
CPU utilization
  HTTP server                      41%
  Web application server           43%
  Catalog database server          8%
  HTTP session database server     49%
  Order test database server       15%
  Boat Selector database server    6%
  Account database server          8%
Response time                      2.5 sec
Throughput                         22 pages/sec

The HTTP session database, however, remains disproportionately busy given the level of load. Notice also the burden on the HTTP server. Keep in mind the tests move a lot of static content, so the HTTP server stays very busy during the test. The team may decide to add the caching proxy server to the test a bit sooner than originally planned to remove some of this burden from the HTTP server and improve response time. Still, given the linear scalability demonstrated so far, the web application itself shows good performance characteristics at this point. The system handles 22 pages per second (or about 352 total hits per second) with a very acceptable response time.

Of course, the web application may still hold some surprises for us. Remember, the HTTP session database is surprisingly busy at this point. The web application might actually generate much larger HTTP sessions than we originally estimated. If this is the case, we may need to revisit our network and database capacity estimates (or ask the web application team to reduce the HTTP session size). However, we need the DBA's analysis before we proceed further along those lines.
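
If the team wants a quick sanity check on session size while waiting for the DBA's analysis, one rough approach (assuming the session attributes are serializable and the servlet API is on the classpath) is to serialize a representative session's attributes and count the bytes. The sketch below is illustrative only:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Enumeration;
import javax.servlet.http.HttpSession;

// Rough estimate of an HTTP session's serialized size. Only meaningful if the
// attributes are Serializable (which they must be anyway for session persistence).
public class SessionSizeEstimator {

    public static int estimateSessionBytes(HttpSession session) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            Enumeration<?> names = session.getAttributeNames();
            while (names.hasMoreElements()) {
                Object value = session.getAttribute((String) names.nextElement());
                if (value instanceof Serializable) {
                    out.writeObject(value);
                }
            }
        }
        return bytes.size();  // approximate size the session database must store per session
    }
}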

Next Steps

What next? The test team needs to solve the HTTP session database problem. This is most likely the next major system bottleneck, and it must be resolved before the test proceeds much further.

After resolving the HTTP session database issue, the team should continue the test-analyze-solve cycle we've demonstrated in the two problems we examined in detail. As outlined in Chapter 11, the team will continue this process until they reach the capacity of this server. Following this, they will add the next HTTP server/application server pair and continue increasing the load. After they're satisfied with the performance of these systems, they will begin adding the firewalls and other components to the tests to gauge their impact on performance.
