Administrators usually find themselves with a collection of unrelated instrumentation components. These components can be organized into an instrumentation system, an adaptable framework for carrying out the monitoring, collection, and processing functions described in this section.
The instrumentation system provides the framework for monitoring service behaviors and reporting them to other parts of the service management system. The instrumentation system manages collectors and aggregators, ensuring that they are operating properly and collecting the appropriate information. The processing functions organize the data and save some for long-term storage. The collectors and aggregators collect and reduce data and pass alerts to the processing or event management functions. An instrumentation system produces the information necessary for making sound management decisions at the tactical or strategic levels.

A service instrumentation system provides an organizing framework for leveraging the installed instrumentation base while guiding the incorporation of new components. Instrumentation is dynamic; new instrumentation emerges with new technologies and services. New information sources must be incorporated with minimal staff intervention and then leveraged by other service management tools.

The major components of a service instrumentation system are shown in Figure 4-3. Event handling, in the real-time event manager, and SLA management tools are also included because they are tightly coupled with the instrumentation system. The basic cyclic behavior of instrumentation management, collection, and processing drives many other management functions.

Figure 4-3. The Instrumentation System and Related Management Functions

These components represent an abstract way of discussing what an instrumentation system does. The reality of how it is actually implemented is usually messier; some of these functions take place in several stages and in different parts of the system. Different vendors offer different sets of features and functions; the completeness of the system functions is the goal. The behavior can be viewed as cyclic.
The collected information causes adjustments in the information collection process, which creates new information, which results in a change, and so forth.

Starting with the Instrumentation Managers

Instrumentation managers set and distribute the monitoring policies that govern the collectors and aggregators under their control.
These policies can specify the types of measurements to be taken, their frequency, and the acceptable ranges of values. For example, simple policies can dictate that more than three consecutive abnormal measures should generate an alert. The measurement frequency policy should be based on the failover latency (how long it takes redundant components to respond to a service disruption and resume service delivery at the specified quality levels). Thus, for example, if a service fails over within five minutes, your system should test every 1 to 2 minutes.

Instrumentation managers simplify operations because a single command affects the operation of many collectors and aggregators. Thus, staff time and mistakes are reduced and the instrumentation is managed effectively. Instrumentation managers periodically use a heartbeat to verify continued collector and aggregator operations. A heartbeat is a periodic exchange of messages to verify that both parties are operating properly. Consider that an independent collector (discussed later in this chapter) might not communicate for long periods of time when no problems are detected. The instrumentation manager, in this case, uses a heartbeat to determine whether the collector is still operating; if heartbeats are not returned, the instrumentation manager must take steps to reestablish communication or shift to other monitors.

Collectors

Collectors measure service behavior instead of element behavior. They collect information that is suited to each class of service. The information includes response times for transactions and packet loss for interactive or streaming classes. Collectors measure specific service instances, verifying that individuals, groups, or regions receive acceptable service quality. Collectors can be programmed to provide more granular service information. They can measure subtransactions to make distinctions among functions, such as downloading a page, executing a stock trade, or ordering merchandise.
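The simple policy described above (more than three consecutive abnormal measures generate an alert) can be sketched as follows. This is an illustrative sketch only; the class and parameter names are hypothetical and do not come from any particular management product.

```python
class AlertPolicy:
    """Trip-wire policy: alert after more than three consecutive abnormal measures."""

    def __init__(self, low, high, max_consecutive=3):
        self.low = low                      # acceptable range of values
        self.high = high
        self.max_consecutive = max_consecutive
        self.abnormal_run = 0               # current run of out-of-range measures

    def record(self, value):
        """Record one measurement; return True if an alert should be raised."""
        if self.low <= value <= self.high:
            self.abnormal_run = 0           # a normal reading resets the run
            return False
        self.abnormal_run += 1
        return self.abnormal_run > self.max_consecutive


policy = AlertPolicy(low=0, high=100)       # e.g., acceptable delay in milliseconds
readings = [80, 120, 130, 140, 150]         # one normal, then four abnormal readings
alerts = [policy.record(r) for r in readings]
# only the fourth consecutive abnormal reading trips the alert
```

An instrumentation manager could push parameters such as the acceptable range and the consecutive-measure limit to many collectors with a single directive.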
Collectors use a combination of active and passive techniques. These techniques are discussed later in this chapter. Collectors and aggregators are shown in Figure 4-3 in relation to other instrumentation system components. They are the source of management information and alerts for the processing and event management components. Alerts are the trip wire; the collectors send alerts when a certain condition, such as an unacceptable delay, has been detected. The system provides time-sliced data for SLA tracking and a variety of other purposes. Collectors can be embedded in network elements (as described in the sidebar), incorporated as software modules in desktops or servers, or packaged as standalone components. Continued processor price/performance improvements reduce the impact when more instrumentation processing is embedded. In addition, the additional processing power enables more complex measurements. In the future, collectors will interact with other collectors and instrumentation managers.
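As a rough sketch of the trip-wire behavior, an active measurement might time a transaction and record an alert when the delay is unacceptable. The function name and threshold value here are hypothetical, chosen only for illustration.

```python
import time

ALERT_THRESHOLD_S = 0.5   # illustrative bound on acceptable transaction delay

def measure(transaction, alerts):
    """Time one transaction (active measurement); record an alert on excess delay."""
    start = time.monotonic()
    result = transaction()                  # e.g., fetch a page or execute a trade
    elapsed = time.monotonic() - start
    if elapsed > ALERT_THRESHOLD_S:
        alerts.append(("delay", elapsed))   # trip wire: unacceptable delay detected
    return result, elapsed

alerts = []
# stand-in for a real transaction, such as an HTTP request
result, elapsed = measure(lambda: sum(range(1000)), alerts)
```

A passive collector would instead observe traffic already in flight, but the trip-wire test on the measured value is the same.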
Aggregators

Aggregators provide scaling and efficiency by monitoring and managing a set of local collectors. Aggregators consolidate the information and usually carry out simple filtering to reduce the volume of information they forward to the processing functions. Aggregators also conserve bandwidth by filtering alerts and forwarding only those needing further attention. Figure 4-3 shows how aggregators can be cascaded to scale even further. Aggregators also scale the instrumentation management tasks because they can accept a single management directive and distribute it to the collectors they control; in that case, they are instrumentation managers as well as aggregators. They use heartbeats to check collector health and to set new monitoring policies as directed. Aggregators can also provide local correlation and integration of the information from multiple collectors, creating higher-quality information for components higher in the chain.

Processing

Processing involves a range of functions that are packaged in vendor-dependent ways. Further, these processing functions are widely distributed within the instrumentation system. For example, the collectors themselves usually test for trip-wire situations. In addition, they often build baselines and carry out more sophisticated measurements. Remember this rule of thumb: Functions tend to move toward the information source. Some functions overlap with event management or with features of some management tools. Such situations are acceptable because completeness of monitoring coverage is the goal. When new information arrives, it may need grooming. Grooming is the process that simplifies the information-handling tasks of the other components. For example, some data values might need normalization because different collectors use different value ranges.
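The consolidation and alert-filtering role of an aggregator can be sketched as follows. The report format and severity scale are assumptions made for this example, not part of any standard.

```python
def aggregate(collector_reports, severity_floor=2):
    """Consolidate collector data; forward only alerts needing further attention.

    collector_reports: list of dicts like {"values": [...], "alerts": [(sev, msg), ...]}
    Returns a consolidated summary and the filtered alert list (illustrative format).
    """
    all_values = [v for r in collector_reports for v in r["values"]]
    summary = {
        "count": len(all_values),
        "mean": sum(all_values) / len(all_values) if all_values else None,
    }
    # forward only alerts at or above the severity floor, conserving bandwidth
    forwarded = [a for r in collector_reports
                 for a in r["alerts"] if a[0] >= severity_floor]
    return summary, forwarded


reports = [
    {"values": [10, 12], "alerts": [(1, "minor jitter")]},
    {"values": [11, 30], "alerts": [(3, "sustained delay")]},
]
summary, forwarded = aggregate(reports)
# forwarded contains only the severity-3 alert; the minor one is filtered out
```

Cascading aggregators, as in Figure 4-3, amounts to feeding one aggregator's summarized output into another instance of the same consolidation step.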
For example, collectors from one vendor might have a range from 1 to 10, and another collector might provide values from 1 to 50 for the same type of information. The data cannot be accurately compared until the ranges are normalized to the same scale; in this case, multiplying the first set of data by 5 provides consistency. Grooming can also include artifact reduction, as discussed in Chapter 5, "Event Management." Some of these functions are packaged differently, depending on each vendor's choices.

Trip wires require real-time processing of the collected data to test whether an alert should be sent. Developers of collectors are applying more sophisticated testing to reduce the alert volume. For example, the collector might not forward an alert unless a threshold has been exceeded for three measurements in succession. A single transaction that is close to a performance threshold is another example of trip-wire processing. A single increase in delay times might not be of concern; however, a sustained increase in delay times is a cause for further investigation. Single sporadic incidents might be logged and need no further staff attention at the moment.

Baseline measurements are regularly scheduled to stay abreast of the normal operating envelope. Real-time measurements are compared to current baselines to detect deviations that could be leading to service disruptions. Most real-time information is discarded after it is checked for conditions that exceed threshold values or that deviate from the activity baselines. Real-time tracking over long time intervals generates a large amount of data with no long-term value. Some real-time information is reduced and saved for use in SLAs and in longer-term trend analyses that are expressed with a few points and a formula.

Ending with the Instrumentation Manager

Completing the cycle in Figure 4-3 brings us back to the instrumentation manager.
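The normalization example above (a 1-to-10 scale versus a 1-to-50 scale) can be sketched in a few lines. The function name is hypothetical; only the multiply-by-5 arithmetic comes from the text.

```python
def groom(values, scale_factor):
    """Normalize one collector's values so different ranges become comparable."""
    return [v * scale_factor for v in values]


# Vendor A reports on a 1-to-10 scale; vendor B reports the same metric on 1-to-50.
vendor_a = [2, 7, 10]
vendor_b = [10, 35, 50]
normalized_a = groom(vendor_a, 5)   # multiply by 5 to match vendor B's scale
# normalized_a is now directly comparable with vendor_b
```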
The event manager influences the instrumentation manager, adjusting the measurement activities to suit its immediate needs. Consider the options when a response time trend indicates a future disruption. If the potential cause is undetermined, the event manager can ask for measurements with finer granularity and frequency at key collectors to pinpoint the root of the problem before it escalates. The collected information can be saved so that staff can examine the operating conditions before a disruption occurs. Altering the measurement activities keeps the instrumentation system focused on collecting the most useful information.
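This feedback step might look like the following minimal sketch, in which a worsening response-time trend triggers finer-grained, more frequent measurement. The policy fields, slope unit, and halving rule are illustrative assumptions, not a vendor mechanism.

```python
def adjust_policy(policy, trend_slope, slope_limit=0.1):
    """Tighten measurement when a response-time trend points at future trouble.

    policy: dict with "interval_s" (seconds between measurements) and a
            "granular" flag requesting subtransaction-level detail.
    trend_slope: observed growth in response time per measurement (illustrative).
    """
    if trend_slope > slope_limit:
        policy["interval_s"] = max(policy["interval_s"] // 2, 1)  # measure more often
        policy["granular"] = True   # request finer-grained (subtransaction) data
    return policy


policy = {"interval_s": 120, "granular": False}
policy = adjust_policy(policy, trend_slope=0.3)
# the policy now calls for measurements twice as often, at finer granularity
```

Once the trend subsides, the event manager would relax the policy again so that routine collection does not carry the cost of diagnostic-grade measurement.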