The service management architecture shown in Figure 3-2, and described in the following subsections, is intended to be an example of a typical architecture used by large organizations. It is one that can accommodate the changes that take place during the service delivery lifecycle faced by any organization that relies extensively on networked delivery of information. It encompasses both the components on which service delivery relies and the service that is a product of those components.

Figure 3-2. Web Service Management Architecture

Instrumentation

Instrumentation, described in detail in Chapter 4, "Instrumentation," and Chapters 8 through 10, and shown at the top of Figure 3-2, monitors and measures the performance and availability of system components, as well as that of services. Instrumentation of components, or element instrumentation, tracks the status and behavior of individual components, such as network devices, servers, and applications. Examples of element measurements are CPU busy percentage and the percentage of received packets that contain transmission errors. Service instrumentation tracks the behavior of services using active and passive collectors. Examples of measured services are round-trip time through a network and transaction response time. Instrumentation takes two forms:
Instrumentation Management

Instrumentation managers, described in Chapter 4 and shown in the middle of Figure 3-2, configure the instrumentation systems and receive the measurement data from them. They examine each incoming data item, filtering out obvious measurement errors and comparing measurements to specified thresholds to see whether an alert should be issued. If measurements indicate a possible problem, the instrumentation manager may demand additional measurements to help make sense of the problem and to determine whether the original measurement was an outlier or a true indicator of a difficulty. The instrumentation manager has two primary outputs: alerts and service level indicator data. The former consists of alerts important enough to be escalated to the real-time event handler, where they are combined with other data for evaluation; the latter consists of data sets and aggregated measurements that are forwarded to the SLA statistics system for statistical treatment and reporting on system performance.

SLA Statistics and Reporting

Data sets received from the instrumentation manager are processed to generate statistics appropriate to their use as service level indicators, as described in Chapter 2, "Service Level Management." The summary information is placed in a database for later reference and for use in generating periodic reports about system performance for the team that manages compliance with SLAs and for other concerned groups. Summary information is also made available to the operations groups to help them determine whether changes must be made to the system to maintain compliance with service level commitments. (The goal, after all, is to find and fix problems before an SLA violation occurs.)
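The filtering and thresholding behavior described above can be sketched in a few lines. This is a minimal illustration, not a real product interface; the class, field names, and the 0-100 validity range (appropriate for a percentage metric such as CPU busy) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class InstrumentationManager:
    """Toy instrumentation manager: drops bad samples, splits the rest
    into escalated alerts and service level indicator (SLI) data."""
    threshold: float                              # e.g. 90.0 for CPU busy %
    alerts: list = field(default_factory=list)    # escalated to event handler
    sli_data: list = field(default_factory=list)  # forwarded for SLA statistics

    def ingest(self, value: float) -> None:
        # Filter out obvious measurement errors (a percentage metric
        # cannot be negative or exceed 100).
        if not 0.0 <= value <= 100.0:
            return
        # Every valid sample feeds the SLA statistics system.
        self.sli_data.append(value)
        # Only threshold crossings are escalated as alerts.
        if value > self.threshold:
            self.alerts.append(("threshold-exceeded", value))

mgr = InstrumentationManager(threshold=90.0)
for sample in [42.0, 250.0, 95.5, 88.0, -3.0]:
    mgr.ingest(sample)
# 250.0 and -3.0 are discarded as measurement errors; 95.5 raises an alert.
```

A production manager would of course apply per-metric validity rules and might re-poll the instrumentation before alerting, as the text notes, rather than alerting on a single sample.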
Real-Time Event Handling, Operations, and Policy

The real-time event manager, discussed in Chapter 5, "Event Management," and shown at the center of Figure 3-2, acts as a central switchboard, connecting other parts of the management system to the instrumentation driving them. It is the core component of most commercial management systems, such as HP OpenView, Tivoli Enterprise Console, and Unicenter TNG. It can communicate with many different instrumentation systems, using multiple standards and techniques. In some cases, it passively waits for alerts to arrive; in others, it actively polls remote instrumentation to obtain data on a regular basis. Because it has a far broader view of the system than any individual instrumentation manager, the central real-time event manager can identify performance patterns that the individual instrumentation managers cannot see. It is also aware of the topology and interdependencies of the system being measured. It can, therefore, do a better job of data filtering, aggregation, and problem detection than the instrumentation manager. As just one example, it can have the knowledge necessary to realize that the hundreds of "component unavailable" messages flooding into its receivers are the result of a single router failure, because all the messages concern components that depend on that failed router for access to the rest of the network. The event manager can also determine problem priorities by following preprogrammed rules, and it can automatically activate other management tools, enabling an administrator to design a sequence of steps that applies each tool at the appropriate point. Configuring the event manager is still a manual process, but no further attention is required after the rules and processes are set. The real-time operations management function (described in detail in Chapter 6, "Real-Time Operations") provides much more sophisticated event analysis and handling than the basic event manager.
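The router-failure example above is a classic topology-based correlation. The sketch below shows the idea under a deliberately simplified assumption (each component reaches the network through exactly one upstream device); the device names and event format are hypothetical.

```python
# Hypothetical dependency map: component -> the upstream device it
# relies on for network access. Real topologies are graphs, not maps.
DEPENDS_ON = {
    "web-01":  "router-a",
    "web-02":  "router-a",
    "db-01":   "router-a",
    "mail-01": "router-b",
}

def correlate(events):
    """Collapse a flood of 'unavailable' events to a common root cause.

    If every affected component depends on the same upstream device,
    emit one probable-root-cause event instead of hundreds of symptoms.
    """
    affected = [e["component"] for e in events if e["type"] == "unavailable"]
    roots = {DEPENDS_ON.get(c, c) for c in affected}
    if len(roots) == 1:
        return [{"type": "root-cause", "component": roots.pop()}]
    return events  # No single root; pass the events through unchanged.

flood = [{"type": "unavailable", "component": c}
         for c in ("web-01", "web-02", "db-01")]
print(correlate(flood))  # one root-cause event naming router-a
```

Commercial event managers generalize this with full dependency graphs and rule engines, but the payoff is the same: operators see one actionable problem rather than a storm of symptoms.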
Using the output from the event manager, real-time operations management applies complex algorithms to find more subtle patterns in the data. It can try to predict future failures by noticing patterns that have resulted in failures in the past, and it can also automatically take actions to fix existing or predicted problems without operator intervention, thereby avoiding violation of the SLA and its associated policies. The policy manager applies business rules to the operation of the system. It is an automated tool that identifies the service levels allocated to each end user and application, based on rules programmed by the system operators. It then tunes the system and denies system access as needed to enforce those service levels. Some examples of the functions performed by the trio of event manager, operations manager, and policy manager are listed here and are discussed in more detail in Chapters 5 through 7:
Long-Term Operations

Some operations are considered longer term because their activation or completion within a short time interval is not critical. Such longer-term operations, shown at the bottom of Figure 3-2, can be associated with strategic changes to the service-delivery environment, or they can offer more fundamental remediation of problems identified by alarms. Some examples of longer-term operations include the following:
Load testing is discussed in Chapter 11, "Load Testing," and system modeling and capacity planning are discussed in Chapter 12, "Modeling and Capacity Planning."

Back-Office Operations

Back-office operations, shown at the bottom of Figure 3-2, are related to the business side of service delivery. These processes have usually been described as Operations Support Systems (OSS) in the world of traditional telephone providers. They constitute a bridge between operations of the service-delivery environment and the management of the business that pays for them. Typical back-office functions for service providers include the following:
Service consumers must manage their online business with similar types of information. For example, they should track the performance of their providers, the cost of the services that they use, and the benefit or income from their use of those services. The business-process metrics described in Chapter 2 can be used to suggest metrics that will help manage the overall performance of the back-office operations.
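The consumer-side tracking just described, which compares measured provider performance against the contracted target and weighs cost against benefit, can be illustrated with a small sketch. All service names and figures here are invented for illustration.

```python
# Hypothetical consumer-side view of purchased services:
# (service, measured response ms, SLA target ms, monthly cost, monthly benefit)
services = [
    ("payment-gateway", 180, 250, 2_000, 15_000),
    ("geo-lookup",      420, 300,   500,  1_200),
]

report = []
for name, measured, target, cost, benefit in services:
    report.append({
        "service": name,
        "compliant": measured <= target,  # is the provider meeting its SLA?
        "net_value": benefit - cost,      # what the service is worth to us
    })

for row in report:
    print(row)
```

Even a report this simple answers the two questions the text raises: whether each provider is delivering the contracted service level, and whether the service is worth its cost.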