The service management architecture shown in Figure 3-2, and described in the following subsections, is intended to be an example of a typical architecture used by large organizations. It is one that can accommodate the changes that take place during the service delivery lifecycle faced by any organization that relies extensively on networked delivery of information. It encompasses both the components on which service delivery relies and the service that is a product of those components.

Figure 3-2. Web Service Management Architecture

Instrumentation

Instrumentation, described in detail in Chapter 4, "Instrumentation," and Chapters 8 through 10, and shown at the top of Figure 3-2, monitors and measures the performance and availability of system components, as well as that of services. Instrumentation of components, or element instrumentation, tracks the status and behavior of individual components, such as network devices, servers, and applications. Examples of element measurements are CPU busy percentage and the percentage of received packets that contain transmission errors. Service instrumentation tracks the behavior of services using active and passive collectors. Examples of measured services are round-trip time through a network and transaction response time. Instrumentation takes two forms:
Instrumentation Management

Instrumentation managers, described in Chapter 4 and shown in the middle of Figure 3-2, configure the instrumentation systems and receive the measurement data from them. They examine each incoming data item, filtering out obvious measurement errors and comparing measurements to specified thresholds to see whether an alert should be issued. If measurements indicate a possible problem, the instrumentation manager may demand additional measurements to help make sense of the problem and to determine whether the original measurement was an outlier or a true indicator of a difficulty. The instrumentation manager has two primary outputs: alerts and service level indicator data. The former consists of alerts important enough to be escalated to the real-time event handler, where they are combined with other data for evaluation; the latter consists of data sets and aggregated measurements that are forwarded to the SLA statistics system for statistical treatment and reporting on system performance.

SLA Statistics and Reporting

Data sets received from the instrumentation manager are processed to generate statistics appropriate to their use as service level indicators, as described in Chapter 2, "Service Level Management." The summary information is placed in a database for later reference and for use in generating periodic reports about system performance for the team that manages compliance with SLAs and for other concerned groups. Summary information is also made available to the operations groups to help them determine whether changes must be made to the system to maintain compliance with service level commitments. (The goal, after all, is to find and fix problems before an SLA violation occurs.)
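The filtering and thresholding behavior described above can be sketched in a few lines. This is a minimal illustration, not a real product interface; the class, field names, and the 0-100 validity range (appropriate for a percentage metric such as CPU busy) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class InstrumentationManager:
    """Toy instrumentation manager: drops bad samples, splits the rest
    into escalated alerts and service level indicator (SLI) data."""
    threshold: float                              # e.g. 90.0 for CPU busy %
    alerts: list = field(default_factory=list)    # escalated to event handler
    sli_data: list = field(default_factory=list)  # forwarded for SLA statistics

    def ingest(self, value: float) -> None:
        # Filter out obvious measurement errors (a percentage metric
        # cannot be negative or exceed 100).
        if not 0.0 <= value <= 100.0:
            return
        # Every valid sample feeds the SLA statistics system.
        self.sli_data.append(value)
        # Only threshold crossings are escalated as alerts.
        if value > self.threshold:
            self.alerts.append(("threshold-exceeded", value))

mgr = InstrumentationManager(threshold=90.0)
for sample in [42.0, 250.0, 95.5, 88.0, -3.0]:
    mgr.ingest(sample)
# 250.0 and -3.0 are discarded as measurement errors; 95.5 raises an alert.
```

A production manager would of course apply per-metric validity rules and might re-poll the instrumentation before alerting, as the text notes, rather than alerting on a single sample.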
Real-Time Event Handling, Operations, and Policy

The real-time event manager, discussed in Chapter 5, "Event Management," and shown at the center of Figure 3-2, acts as a central switchboard, connecting other parts of the management system to the instrumentation driving them. It is the core component of most commercial management systems, such as HP OpenView, Tivoli Enterprise Console, and Unicenter TNG. It can communicate with many different instrumentation systems, using multiple standards and techniques. In some cases, it passively waits for alerts to arrive; in others, it actively polls remote instrumentation to obtain data on a regular basis. Because it has a far broader view of the system than any individual instrumentation manager, the central real-time event manager can identify performance patterns that the individual instrumentation managers cannot see. It is also aware of the topology and interdependencies of the system being measured. It can, therefore, do a better job of data filtering, aggregation, and problem detection than the instrumentation manager. As just one example, it can have the knowledge necessary to realize that the hundreds of "component unavailable" messages flooding into its receivers are the result of a single router failure, because all the messages concern components that depend on that failed router for access to the rest of the network. The event manager can also determine problem priorities by following preprogrammed rules, and it can automatically activate other management tools, enabling an administrator to design a sequence of steps that applies each tool at the appropriate point. Configuring the event manager is still a manual process, but no further attention is required after the rules and processes are set. The real-time operations management function (described in detail in Chapter 6, "Real-Time Operations") provides much more sophisticated event analysis and handling than the basic event manager.
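The router-failure example above is a classic topology-based correlation. The sketch below shows the idea under a deliberately simplified assumption (each component reaches the network through exactly one upstream device); the device names and event format are hypothetical.

```python
# Hypothetical dependency map: component -> the upstream device it
# relies on for network access. Real topologies are graphs, not maps.
DEPENDS_ON = {
    "web-01":  "router-a",
    "web-02":  "router-a",
    "db-01":   "router-a",
    "mail-01": "router-b",
}

def correlate(events):
    """Collapse a flood of 'unavailable' events to a common root cause.

    If every affected component depends on the same upstream device,
    emit one probable-root-cause event instead of hundreds of symptoms.
    """
    affected = [e["component"] for e in events if e["type"] == "unavailable"]
    roots = {DEPENDS_ON.get(c, c) for c in affected}
    if len(roots) == 1:
        return [{"type": "root-cause", "component": roots.pop()}]
    return events  # No single root; pass the events through unchanged.

flood = [{"type": "unavailable", "component": c}
         for c in ("web-01", "web-02", "db-01")]
print(correlate(flood))  # one root-cause event naming router-a
```

Commercial event managers generalize this with full dependency graphs and rule engines, but the payoff is the same: operators see one actionable problem rather than a storm of symptoms.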
Using the output from the event manager, real-time operations management applies complex algorithms to find more subtle patterns in the data. It can try to predict future failures by noticing patterns that have resulted in failures in the past, and it can also automatically take actions to fix existing or predicted problems without operator intervention, thereby avoiding violation of the SLA and its associated policies. The policy manager applies business rules to the operation of the system. It is an automated tool that identifies the service levels allocated to each end user and application, based on rules programmed by the system operators. It then tunes the system and denies system access as needed to enforce those service levels. Some examples of the functions performed by the trio of event manager, operations manager, and policy manager are listed here and are discussed in more detail in Chapters 5 through 7:
Long-Term Operations

Some operations are considered longer term because their activation or completion within a short time interval is not critical. Such longer-term operations, shown at the bottom of Figure 3-2, can be associated with strategic changes to the service-delivery environment, or they can offer more fundamental remediation of problems identified by alarms. Some examples of longer-term operations include the following:
Load testing is discussed in Chapter 11, "Load Testing," and system modeling and capacity planning are discussed in Chapter 12, "Modeling and Capacity Planning."

Back-Office Operations

Back-office operations, shown at the bottom of Figure 3-2, are related to the business side of service delivery. These processes have usually been described as Operations Support Systems (OSS) in the world of traditional telephone providers. They constitute a bridge between operations of the service-delivery environment and the management of the business that pays for them. Typical back-office functions for service providers include the following:
Service consumers must manage their online business with similar types of information. For example, they should track the performance of their providers, the cost of the services that they use, and the benefit or income from their use of those services. The business-process metrics described in Chapter 2 can be used to suggest metrics that will help manage the overall performance of the back-office operations.
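The consumer-side tracking just described, which compares measured provider performance against the contracted target and weighs cost against benefit, can be illustrated with a small sketch. All service names and figures here are invented for illustration.

```python
# Hypothetical consumer-side view of purchased services:
# (service, measured response ms, SLA target ms, monthly cost, monthly benefit)
services = [
    ("payment-gateway", 180, 250, 2_000, 15_000),
    ("geo-lookup",      420, 300,   500,  1_200),
]

report = []
for name, measured, target, cost, benefit in services:
    report.append({
        "service": name,
        "compliant": measured <= target,  # is the provider meeting its SLA?
        "net_value": benefit - cost,      # what the service is worth to us
    })

for row in report:
    print(row)
```

Even a report this simple answers the two questions the text raises: whether each provider is delivering the contracted service level, and whether the service is worth its cost.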