Reactive Management


Reactive management will always be needed because, simply, failure happens. Devices unexpectedly fail, changes turn out to have unintended consequences, backhoes cut fiber, or entire electrical grids go down. Reactive management is the most demanding from a time perspective because administrators have no prior warning and still must assemble their resources and attack the problem as best they can.

Components of reaction time include the following:

  • Problem detection and verification, initiated by the instrumentation and refined by the management tools

  • Problem isolation, consisting of further analysis to identify and isolate the cause of any (potential) service disruption

  • Problem resolution, in which steps are taken to resolve the problem and restore service levels, if necessary

Most of the time involved in resolving a problem is usually spent in the problem isolation phase, attempting to determine what is actually causing the problem. Increasingly complex service environments add to the challenge because even the simplest delivery chains span multiple elements and organizations. The instrumentation quickly detects threshold violations, baseline drifts, and other warning conditions.

For an organization that's well prepared for reactive real-time management, many actions to resolve a problem (such as bringing an additional server into the mix, selecting an alternate network route, switching to another service provider, or redirecting traffic to a lightly loaded data center) can be completed quickly. However, even for the most agile organizations, the greatest leverage in reducing resolution time lies in cutting problem isolation time through improvements in speed and accuracy.

Accelerating problem detection and verification, thereby increasing the speed with which validated alarms can be generated, buys time for the problem isolation process. Moreover, faster analysis reduces the lead time needed between the arrival of a warning and the assembly of enough information to take corrective action. Automation makes a significant contribution here: because the analysis itself takes less time, the analytic engine can concentrate on predicting events in the near term, which simplifies its job. Looking at trends to identify a problem that might occur within the next 15 minutes is much simpler than predicting behavior 6 hours into the future.
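
As an illustration only (not from the original text), the following Python sketch shows one way such near-term prediction can work: fit a simple linear trend to recent samples of a metric and estimate whether a threshold will be crossed within the next 15 minutes. The sample data, threshold, and horizon are hypothetical.

  # A minimal sketch of near-term trend prediction, assuming a hypothetical
  # feed of (timestamp_seconds, value) samples and a fixed threshold.
  from typing import List, Optional, Tuple

  def predict_threshold_crossing(samples: List[Tuple[float, float]],
                                 threshold: float,
                                 horizon_s: float = 15 * 60) -> Optional[float]:
      """Fit a straight line to recent samples and estimate when the metric
      will cross the threshold. Returns seconds until the crossing if it
      falls within the horizon, otherwise None."""
      n = len(samples)
      if n < 2:
          return None
      xs = [t for t, _ in samples]
      ys = [v for _, v in samples]
      mean_x = sum(xs) / n
      mean_y = sum(ys) / n
      var_x = sum((x - mean_x) ** 2 for x in xs)
      if var_x == 0:
          return None
      # Ordinary least-squares slope and intercept.
      slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
      intercept = mean_y - slope * mean_x
      if slope <= 0:
          return None  # metric is flat or improving; no crossing predicted
      t_cross = (threshold - intercept) / slope
      seconds_away = t_cross - xs[-1]
      return seconds_away if 0 < seconds_away <= horizon_s else None

  # Example: response time (ms) drifting upward; warn if the 2000 ms threshold
  # is likely to be crossed within the next 15 minutes.
  recent = [(0, 900.0), (60, 980.0), (120, 1100.0), (180, 1230.0), (240, 1370.0)]
  eta = predict_threshold_crossing(recent, threshold=2000.0)
  if eta is not None:
      print(f"Warning: threshold likely to be crossed in about {eta / 60:.1f} minutes")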

The following subsections describe two primary methods for decreasing problem isolation time: triage and root-cause analysis. The descriptions are followed by discussions of how to handle some common factors that complicate both methods.

Triage

Triage is the process of determining which part of the service delivery chain is the most likely source of a potential disruption. First, it's important to understand what triage does not do. It isn't diagnostic; it isn't focused on determining the precise technical explanation for a problem. Instead, it's a technique for very quickly identifying the organizational group or set of subsystems that's probably responsible for the problem.

Triage thereby saves problem isolation time in two ways. First, it ensures that the best-qualified group is identified and set to work on the problem as quickly as possible; second, it decreases finger-pointing time.

Identifying the best-qualified group to deal with the problem means finding those who are most likely to have the specialized tools and knowledge needed to solve the problem more quickly than a generalist group could.

Equally important, triage techniques focus on drastically decreasing finger-pointing time, during which various groups try to avoid taking responsibility for a problem. Triage does that by presenting the responsible group with data that's sufficiently detailed and credible to convince them that the problem is truly theirs.

An example of a triage technique should clarify the difference between triage and detailed diagnosis. In this approach, called the "white box technique," a simple web server (the "white box") is installed, as shown in Figure 6-2, at the point where the enterprise web server systems connect to the Internet infrastructure. (The web server can be extremely inexpensive; it could be just an old PC running a flavor of Unix and the Apache web server without any special configuration, serving the default Apache home page or some other simple content.)

Figure 6-2. Triage Example System


The white box web server in Figure 6-2 is located at the demarcation point between two different organizational groups: the group responsible for the web server systems and the group responsible for Internet connectivity. Active measurement instrumentation is located at end-user locations, outside the enterprise server room at the opposite end of the network, and it measures both the enterprise's web pages and the web page on the white box. Because no end user knows that the white box exists, it has almost no workload; it is used only by the measurement agents.

Figure 6-3 shows an example of some response time measurements from the system diagrammed in Figure 6-2. It's easy to see that when the "event" occurred, the unloaded white box server was unaffected. The chart can be created in a few seconds, and it is sufficient to convince the server group that the problem is almost certainly their responsibility. The root-cause reason for the problem is unknown; the chart is not diagnostic. However, the responsible group has almost certainly been correctly identified within a few seconds, and finger-pointing time has been cut to zero. The server group can then use their root-cause analysis tool or other specialized tools and knowledge to study the problem further.
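
The comparison itself is easy to automate. As an illustration, the Python sketch below uses placeholder URLs and an assumed response-time threshold to time a fetch of an enterprise page and of the white box page, and then reports which side of the demarcation point is the more likely suspect.

  # A minimal sketch of the white-box comparison, assuming hypothetical URLs
  # for the enterprise site and for the unloaded white box at the demarcation.
  import time
  import urllib.request

  ENTERPRISE_URL = "http://www.example-enterprise.com/"      # placeholder
  WHITE_BOX_URL = "http://whitebox.example-enterprise.com/"  # placeholder

  def fetch_seconds(url: str, timeout: float = 10.0) -> float:
      """Return wall-clock seconds to fetch a page, or infinity on failure."""
      start = time.monotonic()
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              resp.read()
      except OSError:
          return float("inf")
      return time.monotonic() - start

  enterprise_t = fetch_seconds(ENTERPRISE_URL)
  whitebox_t = fetch_seconds(WHITE_BOX_URL)

  # Very coarse triage: if the white box stays fast while the enterprise site
  # is slow, the problem is probably inside the server environment; if both
  # are slow, suspect the network path shared by the two measurements.
  SLOW = 2.0  # seconds; an assumed service-level threshold
  if enterprise_t > SLOW and whitebox_t <= SLOW:
      print("Likely server-side problem (white box unaffected)")
  elif enterprise_t > SLOW and whitebox_t > SLOW:
      print("Likely network/Internet problem (both paths affected)")
  else:
      print("No obvious problem detected")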

Figure 6-3. Triage Example Measurements


Triage points can be established at many boundaries within a system, and different techniques can be used to instrument those boundaries. Triage points can be placed at the demarcations between network and server groups, as shown in Figure 6-2, and they can also be placed just outside a firewall, at a load-distribution device, or at a specialized subgroup of web servers.

White boxes can be measured to create easy-to-understand differential measurements, but they're not always necessary. For example, consider an organization that measures the response time of a configuration screen on a load distribution device to see if there are any problems up to that point. Triage can also be performed by placing active measurement instrumentation at demarcation points, such as just outside a major customer's firewall, to see if response time from that point is acceptable.

Finally, detailed measurements can themselves be used for triage, although more technical knowledge is usually necessary. For example, an external agent can measure the time needed to establish a connection between itself and a file server, followed immediately by a measurement of the time needed to download a file from that server. It can probably be assumed that if file download time increases greatly without any corresponding increase in connection time, then there's a problem with the server, not with the network. (This is further discussed in Chapter 10, "Managing the Transport Infrastructure.")
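
As a rough illustration of that differential measurement, the following Python sketch times the connection setup and the file download separately; the file server host, port, and path are placeholders, and the timing is deliberately coarse.

  # A minimal sketch of separating connection time from download time,
  # assuming a hypothetical file server reachable over HTTP.
  import socket
  import time

  HOST = "files.example.com"   # placeholder file server
  PORT = 80
  PATH = "/testfile.bin"       # placeholder test file

  start = time.monotonic()
  sock = socket.create_connection((HOST, PORT), timeout=10)
  connect_time = time.monotonic() - start

  request = f"GET {PATH} HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
  start = time.monotonic()
  sock.sendall(request.encode("ascii"))
  while sock.recv(65536):      # read until the server closes the connection
      pass
  download_time = time.monotonic() - start
  sock.close()

  print(f"connect: {connect_time:.3f}s  download: {download_time:.3f}s")
  # If download time has grown sharply while connect time is unchanged from
  # its baseline, the server (not the network) is the more likely suspect.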

Such triage techniques are very useful in the heterogeneous, fluid world of web systems. They require much less detailed knowledge of the internals of the various subsystems than root-cause analysis does, which is a great virtue when things change frequently and the internals of some systems are hidden. Triage also cuts time from the most time-intensive part of system management. However, it can be difficult to use for complete diagnostic analysis within a complex system because too many triage points, or demarcation points, are needed. For complex systems, true root-cause analysis tools are a necessary complement.

Root-Cause Analysis

Root-cause analysis tools can require considerable investment and configuration, but they can be surprisingly powerful and beneficial. They use a variety of approaches to organize and sift through inputs from many sources, including raw and processed real-time instrumentation (trip-wires), historical (time-sliced) data, topologies, and policy information. They produce a likely cause more quickly and more accurately than staff-intensive analysis. Analysis tools can be activated within a fraction of a second after an alert is generated and can begin collecting data far sooner than staff could respond to a pager or an e-mail. Because conditions can change quickly and critical diagnostic evidence may not be preserved, compressing activation time pays off in more effective analysis.

Root-cause analysis tools can be targeted at elements or services, or both. The earliest root-cause tools focused on a single infrastructure, usually the network; newer products are focusing on service performance spanning many infrastructures.

Speed Versus Accuracy

Many vendors in this part of the industry have emphasized the speed of their solutions: finding the root cause in a fraction of a second. Although speed is important, at the element level it takes second place to accuracy. Most infrastructures have enough redundancy and resilience that an element failure rarely disrupts a service completely, which is why accuracy matters more. For example, identifying a specific interface on a network device, a specific server with a memory leak, an alternate route with added delays, or a congested database engine speeds resolution by addressing the right problem with the right tools and staff. The failure must still be noted and marked for attention based upon policies and priorities.

Finding the root cause of a service disruption is more challenging than finding the root cause of an element problem because the cause can be found in any of several infrastructures. At the service level, root cause is more a matter of determining which part of the service delivery chain is the most likely source of a potential disruption. The triage technique discussed previously can be used here, and speed is very important because a service disruption is serious and compliance is threatened, along with the business that depends on the service. The root-cause tool must quickly pinpoint areas for further analysis. The easiest case is one in which measurements clearly implicate one infrastructure as the likely culprit. After a candidate infrastructure is identified, more specific tools are deployed to isolate the element(s) involved.

A difficult case arises when all the infrastructures are behaving within their normal operating envelopes. This is an opportunity for automated tools to collect as much information as possible for a staff member to use. The information might not be conclusive, but it can guide the staff member's next steps in an effective way.

Assembled information can include the following:

  • Performance trends for each infrastructure: Are any infrastructures trending toward the edges of their envelopes?

  • Historical comparisons: Are any infrastructures showing a significant change in historical patterns, even if they remain within the envelope? Is there a similar problem signature, which is a known pattern of measurements that has previously been linked to service failures or other difficulties?

  • Element failure information: Is there a time correlation between an element failure and the service disruption?

For instance, an end-to-end response problem could automatically result in the comparison of other infrastructure measurements to their historical precedents and could also result in the automatic initiation of new infrastructure measurements. Those automated investigations could fail to find any performance that exceeds thresholds. However, learning that the transport infrastructure delay has suddenly increased fivefold while all the other infrastructures are operating within their normal envelopes would indicate the most likely area for further investigation.
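
As an illustration of that kind of automated comparison, the Python sketch below uses hypothetical per-infrastructure delay baselines and current readings; each infrastructure is scored by how far its current measurement sits outside its historical envelope, and the largest deviation is flagged for further investigation.

  # A minimal sketch of the envelope comparison, using assumed per-
  # infrastructure delay histories (seconds) and current readings.
  import statistics

  history = {                      # recent baseline samples per infrastructure
      "transport": [0.04, 0.05, 0.05, 0.04, 0.06],
      "web":       [0.30, 0.28, 0.32, 0.31, 0.29],
      "database":  [0.12, 0.11, 0.13, 0.12, 0.12],
  }
  current = {"transport": 0.25, "web": 0.31, "database": 0.13}

  def deviation(name: str) -> float:
      """How many standard deviations the current reading sits above its baseline."""
      samples = history[name]
      mean = statistics.mean(samples)
      stdev = statistics.stdev(samples) or 1e-9
      return (current[name] - mean) / stdev

  scores = {name: deviation(name) for name in history}
  suspect = max(scores, key=scores.get)
  print("deviation scores:", {k: round(v, 1) for k, v in scores.items()})
  print(f"most likely area for further investigation: {suspect}")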

Linking service root-cause analysis to element root-cause analysis adds leverage to accelerate the resolution process at both levels. Passing information and control between the two domains speeds operations and keeps both teams informed and effective.

Case Study of Root-Cause Analysis

The following is an example from a company with which I recently spoke. They have stringent SLAs with their internal business units and external providers. They funded their own integration effort because minimizing service disruptions was so essential. The goal was to have two-way interactions between root-cause analysis for service and elements and to leverage each for the other. Some of this work is being implemented in phases as they learn from experience.

Consider first the case of an end-to-end virtual transaction whose response time is slowing, drifting toward greater degradation and eventual service disruption. This is a situation for the service root-cause tools because there is nothing specific for element-oriented tools to pursue yet. More detailed measurements at demarcation points suggest problems with the transport infrastructure, where delays are increasing while other components remain stable.

An alert is forwarded to the alarm manager, which in turn activates element tools and notifies staff. Information is also passed to help the troubleshooting process at this time. It includes the following:

  • The virtual transaction(s) used

  • The end-to-end transport delay measurements

  • The actual route from the testing point to the server and from the server to the testing point (these are usually different over the Internet)

  • Indications of changes in the actual site used (Domain Name System [DNS] redirection)

The troubleshooters already have this information as they start to narrow down the cause, identifying which parts of the services are behaving well and which parts require further investigation.

Today's technology still leaves some manual steps in the hand-off, such as transferring information to the element root-cause tools. This is the step where time is lost and errors can be introduced. In the future, automation of isolation functions (discussed later in this chapter) might be used to simplify and accelerate the process. For example, an automatic script could exercise the route used by agents running the measurement transactions, probing all the devices on the path and looking for any exceptional status or operating loads. It can have this information ready for a staff member or for another tool that can then investigate further.
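
As a rough sketch of such a script, the following Python example assumes a Unix-like measurement host with the standard traceroute and ping utilities and uses a placeholder target; it discovers the hops on the current route and flags any hop with an unusually high round-trip time.

  # A minimal sketch of automated path probing; target and threshold are assumed.
  import re
  import subprocess

  TARGET = "www.example-enterprise.com"   # placeholder server under test
  RTT_WARN_MS = 150.0                     # assumed per-hop latency threshold

  # Discover the hops on the current route (numeric output, no DNS lookups).
  trace = subprocess.run(["traceroute", "-n", TARGET],
                         capture_output=True, text=True, timeout=120)
  hops = re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", trace.stdout, re.MULTILINE)

  # Probe each hop and flag any with an unusually high round-trip time.
  for hop in hops:
      ping = subprocess.run(["ping", "-c", "3", hop],
                            capture_output=True, text=True, timeout=30)
      match = re.search(r"= [\d.]+/([\d.]+)/", ping.stdout)  # average RTT field
      if not match:
          print(f"{hop}: no reply")
          continue
      avg_ms = float(match.group(1))
      note = "  <-- exceptional delay" if avg_ms > RTT_WARN_MS else ""
      print(f"{hop}: avg {avg_ms:.1f} ms{note}")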

The flow must be bi-directional. Element instrumentation may detect an element failure first. Redundancy keeps operations flowing while actions are taken to address the failure.

The primary consideration is the services impacted by an element failure. When a service has been impacted, the management system might respond by monitoring more closely and setting thresholds for more sensitivity. Being able to understand the relationship between elements and the services depending on them allows administrators to prioritize tasks and ensure that critical services have the highest degree of redundancy.

Tools in the services domain may notice unwanted trends and correlate them with the element failure; sometimes a simple time correlation between the failure and the detection of the shift is all that is necessary. In the best case, both domains have information and can communicate effectively as they watch for and resolve developing problems. (This is where a lot of the new investment in management products is going: building management systems that can correlate symptoms from disparate elements, understand the impact on the multiple services and customers, and help operations staff prioritize and fix the problems. The InCharge system from System Management Arts, Inc. is an example of this trend.)

Complicating Factors

Brownouts and virtualized resources make the tasks of triage and root-cause analysis more difficult. These are discussed in the following subsections.

Brownouts

A brownout can be a difficult challenge to diagnose because all the elements are still operable, but performance suffers nonetheless. In contrast, hard (complete) failures are easier to resolve because a hard failure is binary: something works or it doesn't. There are tests that verify a failure and help identify the source of the problem.

It is harder to identify the likely cause of a brownout because there is no definite service failure that lends certainty to the search. Degrading performance can be caused by any of the following: a configuration error, high loads, or an underlying element failure that increases congestion in another part of the environment. Redundancy further complicates isolation of the cause of the brownout because underlying element failures may be hidden from service measurements by element redundancy.

The steps described for basic root-cause analysis still apply to brownout failures. Troubleshooters need all the information and context that they can assemble. Historical comparisons, indications of recent changes, and other data can help them understand the situation more clearly. Some patterns, such as a fixed percentage of all web requests taking an abnormally long time, strongly suggest the probable cause, especially if the same percentage of web servers has recently been upgraded to new software. Sophisticated root-cause analysis tools can learn to look for these patterns and thereby help diagnose brownout failures.
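
A toy Python sketch of that kind of pattern check follows, using hypothetical per-request response times and assumed change records; it simply compares the fraction of abnormally slow requests with the fraction of recently upgraded servers.

  # A minimal sketch of the pattern check described above, using hypothetical
  # per-request response times (ms) and assumed change records.
  request_times_ms = [120, 110, 4800, 130, 5100, 115, 125, 4900, 122, 5300]
  servers = ["web1", "web2", "web3", "web4", "web5"]
  recently_upgraded = ["web2", "web4"]     # assumed recent software upgrades

  SLOW_MS = 2000
  slow_fraction = sum(1 for t in request_times_ms if t > SLOW_MS) / len(request_times_ms)
  upgraded_fraction = len(recently_upgraded) / len(servers)

  print(f"slow requests: {slow_fraction:.0%}, upgraded servers: {upgraded_fraction:.0%}")
  # A close match between the two fractions is a strong hint that the upgraded
  # servers are the probable cause and should be examined first.
  if abs(slow_fraction - upgraded_fraction) < 0.10:
      print("Pattern matches: investigate the recently upgraded servers first")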

Virtualized Resources

Another complicating factor is introduced by the common system architecture of virtualizing resources, in which an entire set of similar resources appears to the end user as a single, virtual resource. Virtualization simplifies many tasks for the end user and the application developer; it's most common in storage systems, where rather than identifying physical sectors on individual disks, storage software presents the storage resources as volumes and file systems. In the webbed services customer's case, for example, geographic load distribution makes a set of distributed sites available under the same name. The geographic load balancer selects the site, and the end user automatically connects to the closest site without having to know the details.

NOTE

Application developers use object brokers to hide the details of locating and transforming the objects that an application accesses.


Load-balancing switches are another means of virtualization; they hide a tier of servers behind the switch. Requests are directed to the switch, which in turn allocates them to any member of the set. Firewalls and hidden networks using Network Address Translation (NAT) technology also create virtualization.

Unfortunately, from a root-cause perspective, virtualization obscures important details. A synthetic measurement transaction might detect a performance shift simply because the geographic load distributor has selected a different site with different transport and transaction delays. Understanding that distinction helps the troubleshooting team save time: they may determine that no further action is needed until the usual site is restored to service. In fact, the redirection may have behaved entirely as expected, with service measurements verifying the resilience of the environment. It might also lead to further investigation because service levels must still be maintained even when these actions are taken.

To handle the complications of virtual resources, management tools must be able to distinguish among the various hidden resources or, at least, must be able to suggest that the problem lies somewhere within the virtual group. Instrumentation within the virtual group can take measurements without having the individual group members' identities obscured by the virtualization process.

As noted earlier, failure patterns can point to the cause of a problem even if the virtualization layer cannot be penetrated. In addition, some IT organizations create special, secret addresses for servers within a virtual group so that those servers can be measured externally without revealing the addresses to the general end-user base, much like the white box triage technique discussed previously.
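
The following Python sketch illustrates that approach with hypothetical addresses: the published virtual address and each member's unpublished measurement address are timed separately, so a slow individual member can be distinguished from a problem with the group as a whole.

  # A minimal sketch of measuring the members of a virtual group directly,
  # assuming hypothetical unpublished addresses for each server behind the
  # load-balanced virtual address.
  import time
  import urllib.request

  VIRTUAL_URL = "http://www.example-enterprise.com/"   # published virtual address
  MEMBER_URLS = {                                      # unpublished per-server addresses
      "web1": "http://10.1.1.11/",
      "web2": "http://10.1.1.12/",
      "web3": "http://10.1.1.13/",
  }

  def fetch_seconds(url: str, timeout: float = 10.0) -> float:
      """Seconds to fetch a page, or infinity if the fetch fails."""
      start = time.monotonic()
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              resp.read()
      except OSError:
          return float("inf")
      return time.monotonic() - start

  results = {"virtual": fetch_seconds(VIRTUAL_URL)}
  results.update({name: fetch_seconds(url) for name, url in MEMBER_URLS.items()})
  for name, seconds in results.items():
      print(f"{name}: {seconds:.3f}s")
  # If the virtual address is intermittently slow but only one member's direct
  # measurement is slow, the problem lies with that member rather than the group.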

The Value of Good Enough

Root-cause analysis is not yet a panacea; it is not foolproof and probably never will be. Nonetheless, a tool can be valuable by being good enough without being perfect: accurate enough, for example, to reduce analysis time significantly. In other words, even when the analysis cannot pinpoint the exact cause, it might come close enough that staff efforts can focus and finish the process more efficiently. However, there is a downside to imperfection: a root-cause exercise that produces incorrect results can waste further time and disrupt staff by sending them in the wrong direction.

The real opportunity for leverage is automating the actual analysis: sorting through large amounts of information, comparing symptoms and test results, and eliminating potential causes. This is the area where humans are easily overwhelmed. Our role, at least for a little while yet, is to determine the rules and policies we want these tools to carry out.



