The Performance Envelope


Load testing is used to identify a performance envelope for a service or a mix of services under different operating conditions. The performance envelope represents the service's performance under normal and extreme operational ranges. For example, you probably first want to test services at their expected normal operating loads (number of users or transactions per second) over an extended time period to ensure reliable and stable operation. Then you'll want to test under extreme loads to determine the operational limitations, which should point to actions that can reduce bottlenecks.

Determining an accurate performance envelope is notoriously difficult for several reasons. Web environments are complex, and the causes of problem behavior are hard to identify. Even experienced designers and administrators have difficulty developing a feel for the likely factors affecting performance.

The classical performance envelope is represented by Figure 11-1, which shows response time versus offered load. The offered load can be transaction requests, database queries, or network traffic at an interface, for example. This general form of the load curve applies to a variety of managed resources, such as network devices or servers. Web application-level performance, however, differs from this classical behavior in a surprising way: Web users on the Internet abandon sessions when the response time becomes too long, affecting the response time of other sessions. That difference is discussed at the end of this subsection.

Figure 11-1. Classical Load Curve


There are two areas of interest in the classical load curve: where the behavior is linear and where it changes to being nonlinear. The inflection point is the boundary between linear and nonlinear responses and is a function of the peak service rate (whether measured in packets per second, frames per second, or transactions per second) that the specific layer's infrastructure has been engineered to provide. It's based on elementary queuing theory; above the inflection point, even small increases in applied load result in large changes in performance and, possibly, in availability.

The linear portion of the response curve represents the most stable and predictable part of the performance envelope. It represents conditions where the resources are sufficient for the load applied. Queuing delays are minimal, and the response time is low for a range of loads. As the offered load grows, the response time begins to lengthen as resources are more heavily subscribed. Loading up to the inflection point is (approximately) linear: each increment of offered load produces an equal incremental increase in response time.
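To make the shape of the curve concrete, the following is a minimal sketch using the textbook M/M/1 queuing formula, in which mean response time is 1/(service rate − arrival rate). The M/M/1 model and the 100-transactions-per-second service rate are assumptions chosen for illustration, not values from the text; the point is simply that response time grows slowly and almost linearly at low utilization and then explodes as the offered load approaches the knee.

```python
# Minimal sketch: response time versus offered load for an assumed M/M/1 queue.
# The M/M/1 model and the 100-tps service rate are illustrative assumptions;
# the classical load curve in the text is more general, but the shape is similar.

SERVICE_RATE = 100.0  # transactions per second the resource can complete

def mm1_response_time(arrival_rate, service_rate=SERVICE_RATE):
    """Mean time in system (seconds) for an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")  # at or past saturation, the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

if __name__ == "__main__":
    for load in (10, 30, 50, 70, 80, 90, 95, 99):
        t = mm1_response_time(load)
        print(f"offered load {load:3d} tps -> mean response {t * 1000:7.1f} ms")
```

In this sketch, raising the load from 10 to 50 tps roughly doubles the response time, while raising it from 90 to 99 tps increases it tenfold; that second region is the nonlinear behavior the following paragraphs describe.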

The slope of the linear portion gives important information: it indicates the sensitivity to loading changes. A flat slope shows that response is less sensitive to a loading change than a steeper slope. Flat slopes are desirable because they lend stability to the behavior, so that response time doesn't degrade as loads vary, something customers demand. Note that a flat slope may also mean an underutilized system. However, underutilized today also means headroom for further growth, tying directly to the challenges of capacity planning.

An administrator or planner wants to work in the linear part of the performance envelope because he or she can make fairly accurate estimates about expected response times as loading increases.

At the inflection point, the offered loads exceed the ability of the tested environment to process them quickly enough, and queue lengths begin to increase exponentially. Delays grow quickly, degrading time-sensitive activities, congesting servers and networks, and causing customers to go elsewhere and online transactions to fail.

Note that it also takes some time to recover and shift back to linear operation, even if the offered load is removed completely. Queue lengths must be reduced first. Administrators want to avoid the nonlinear area because of its unpredictability. At one moment, operations are still within the metrics of the SLA; then, when a small burst of requests arrives, the entire operation can grind to a halt. This is risky ground to manage.

On the public Web, there are two additional phenomena that must be considered: flash load and abandonment.

On most non-Web, transaction-oriented computer systems, the queue is external; that is, it is outside the system. Customers wait in telephone queues, or there is a pile of incoming documents to be processed in front of each data-entry clerk. The flow of transactions is therefore reasonably steady, with a firm maximum number of sessions set by the number of clerk terminals or dial-in lines. Under heavy load, the external queue builds, and the result is that there's a steady, unchanging workload, classically measured by the concurrent sessions statistic.

On the public Web, in sharp contrast, the incoming traffic hits the system directly. Massive flash loads can appear in response to a television ad or a mention in a news article, with hundreds of thousands of users trying to establish TCP sessions simultaneously. Such loads can overwhelm the system at the precise time that user satisfaction is most important. (Why run a television ad and then convince most of the public that they never want to go to your web site again?)

Loads on public web sites can therefore be much higher than the loads generated by classical load-generation tools; special Web load generation tools are necessary. In addition, load statistics for Web-based transactions should be in terms of arrival rate over a given interval, not concurrent users. For example, the Keynote LoadPro service can handle hundreds of thousands of concurrent user session initiation attempts flooding in from the Internet, and it measures load in terms of session initiation rate, not in terms of concurrent users.
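The difference between classical and Web-oriented load generation is often described as closed-loop versus open-loop: a closed-loop tool holds a fixed number of simulated users and issues a new request only when a previous one completes, whereas an open-loop tool starts new sessions at a target arrival rate regardless of how many earlier sessions are still in flight. The sketch below is a hypothetical open-loop pacer; it is not how LoadPro or any particular product is implemented, and the simulated session is just a random sleep standing in for a real transaction.

```python
# Hypothetical open-loop load generator: new sessions start at a fixed arrival
# rate, no matter how many earlier sessions are still outstanding. Purely
# illustrative; not the implementation of any specific load-testing product.
import asyncio
import random
import time

ARRIVAL_RATE = 20.0   # new sessions per second
TEST_SECONDS = 10

async def one_session(session_id: int) -> None:
    # Stand-in for a real Web transaction; the sleep simulates server response time.
    await asyncio.sleep(random.uniform(0.1, 2.0))

async def run() -> None:
    tasks = []
    started = 0
    start = time.monotonic()
    while time.monotonic() - start < TEST_SECONDS:
        tasks.append(asyncio.create_task(one_session(started)))
        started += 1
        await asyncio.sleep(1.0 / ARRIVAL_RATE)  # pace by arrival rate, not by completions
    await asyncio.gather(*tasks)
    print(f"started {started} sessions in {TEST_SECONDS} s "
          f"(~{started / TEST_SECONDS:.0f} sessions/sec)")

if __name__ == "__main__":
    asyncio.run(run())
```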

The other major difference between classical load transaction testing and Web load transaction testing is abandonment. In classical systems and on corporate intranets, users don't abandon a transaction. They remain in the transaction until completion, regardless of the amount of time it takes. There simply isn't anywhere else to go. Call center operators and data entry operators must wait if the transaction response time is very slow, and external customers who dialed into a corporate mainframe don't usually disconnect and then dial into a competitor on a whim, just to see if the competitor is faster.

On the public Web, however, it's extremely easy to abandon a transaction; people do it all the time. Worse, web protocols usually don't inform the system when an end user has abandoned a transaction; the web server system must use timeouts or other special methods to guess when that has happened. The result is that many transactions in a Web system may be inactive, waiting for timeout, especially under a heavy load with the long response times that encourage abandonment. If the Web system's abandonment-detection and resource-recovery mechanisms are inadequate, the system may clog as a result of massive numbers of abandoned transactions, leading to even worse performance and even more abandonment in a vicious cycle. The system might be able to handle a brief peak load, but be unable to endure a longer-duration peak load because it cannot recover resources efficiently from abandoned transactions.

Load testing of public web sites must therefore include a way to simulate transaction abandonment and must be run long enough to determine the endurance of the system. In the Keynote LoadPro system, dissatisfaction and abandonment scores are kept for each simulated user. They vary according to the type of user (beginner, experienced, and so on) and according to the type of web page (home page, search page, and so on). They vary because different classes of users have different tolerances for server delay, and users are willing to wait different lengths of time for different types of pages. The LoadPro system then simulates abandonment at the appropriate points, and it also reports a dissatisfaction score at the end of the load simulation, to indicate aggregate user satisfaction with the entire experience.
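As an illustration of how abandonment might be modeled in a load test, the sketch below judges each simulated page view against a patience threshold that depends on the user type and the page type, abandoning the view and scoring dissatisfaction accordingly. The thresholds, the distributions, and the scoring rule are invented for the example; they are not LoadPro's actual parameters or algorithm.

```python
# Illustrative abandonment model for a load test. All thresholds and scoring
# rules here are invented for the example, not taken from any real product.
import random

# Hypothetical patience thresholds (seconds) by (user type, page type).
PATIENCE = {
    ("beginner",    "home"):   8.0,
    ("beginner",    "search"): 12.0,
    ("experienced", "home"):   4.0,
    ("experienced", "search"): 6.0,
}

def simulate_page_view(user_type, page_type, response_time):
    """Return (abandoned, dissatisfaction) for one simulated page view."""
    patience = PATIENCE[(user_type, page_type)]
    if response_time > patience:
        return True, 1.0                     # user gives up on the transaction
    return False, response_time / patience   # dissatisfaction grows with delay

if __name__ == "__main__":
    views, abandoned, dissatisfaction = 1000, 0, 0.0
    for _ in range(views):
        user = random.choice(["beginner", "experienced"])
        page = random.choice(["home", "search"])
        rt = random.expovariate(1 / 3.0)     # simulated response times, mean 3 s
        gone, score = simulate_page_view(user, page, rt)
        abandoned += gone
        dissatisfaction += score
    print(f"abandonment rate: {abandoned / views:.1%}, "
          f"mean dissatisfaction: {dissatisfaction / views:.2f}")
```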

Abandonment is another reason, in addition to flash loads, that concurrent sessions should be avoided as a measure of load in the Web environment, where session termination is difficult to detect. Concurrent sessions can, however, be used as a measure of performance. For a given arrival rate of new transactions, the number of concurrent sessions decreases as the system's performance increases; good performance allows end users to do their work quickly and log off.
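The relationship behind this observation is Little's Law: the average number of concurrent sessions equals the arrival rate of new sessions multiplied by the average time a session stays in the system. A brief sketch with illustrative numbers shows how, for the same arrival rate, slower service inflates the concurrent-session count:

```python
# Little's Law: concurrent sessions = arrival rate x mean session duration.
# The numbers below are illustrative only.
arrival_rate = 50.0                            # new sessions per second, set by user demand
for mean_session_seconds in (2.0, 5.0, 20.0):  # faster versus slower service
    concurrent = arrival_rate * mean_session_seconds
    print(f"mean session duration {mean_session_seconds:4.1f} s -> "
          f"about {concurrent:.0f} concurrent sessions")
```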



