An SLM Project Implementation Plan


This section of the chapter presents a plan for implementing SLM on an existing system; implementation for a new project is similar. These steps can proceed somewhat in parallel; for example, SLAs can be drafted and refined while instrumentation is collecting the data used to set service level objectives.

The steps of the plan are as follows:

  • Census and documentation of the existing system

  • Specification of performance metrics

  • Instrumentation choices and locations

  • Baseline of existing system performance

  • Investigation of system performance sensitivities and system tuning

  • Construction of SLAs

Census and Documentation of the Existing System

An overall census and documentation of the existing system configuration and capabilities provides the basic data for filling in the details of any implementation plan. For example, a typical network census identifies network devices and their connections to other devices. The census can be partially performed by automated tools; most organizations have a multitude of automated discovery capabilities. The traditional Simple Network Management Protocol (SNMP) management platforms provide discovery, as do many reporting and troubleshooting tools. Manual effort is also needed to ensure that the tools haven't misinterpreted some configurations and to ensure that unusual network features are correctly documented.

A services census checks whether devices have service management capabilities; the goal is to catalog the service management capabilities already in place. In addition to Quality of Service (QoS)-enabled network devices, elements such as caches, load balancers, and traffic shapers should be identified.

Specification of Performance Metrics

Metrics are the currency by which the service relationship is conducted. Therefore, they must be planned for, accounted for, and reviewed on a regular basis. Just as you wouldn't want to employ a bookkeeper who had only a vague idea about how much money was coming and going in your business, your metrics need to be precise and focus on the key contributors that affect the desired outcomes of the services. Like accounting, the art and science of metrics can range from ad hoc (more than pencil and paper, but not much) to heavy-duty statistical analysis. When the value invested in the service is high, it's prudent to make more than a cursory investment in the metrics. These performance metrics (service level indicators) and service level objectives must be clearly defined in advance to minimize disputes.

The particular metrics chosen depend on the application, of course. For example, transactional applications do not usually specify a packet loss metric. Delays due to retransmissions will affect the response time metric, which is more directly relevant to transaction end users. In contrast, a certain level of packet loss makes VoIP and some other interactive services unacceptable because their real-time nature doesn't allow time for retransmissions of lost packets. Specific recommendations for metrics are detailed in Chapter 2, "Service Level Management," and in Chapters 8 through 10.

Each service level indicator and its accompanying service level objective must be defined clearly and unambiguously. For example, transaction response time can be measured as the time between sending the last request packet and receiving a complete response. Alternatively, it can be specified as the time between the last request packet and the first response packet. These two specifications will yield different results, and while neither is necessarily superior, both parties must agree on which of the two they will use. The degree to which the metric clearly indicates the end user's experience should be considered in this choice.
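
As a concrete illustration, the following sketch (Python standard library only; the host name is a placeholder) times the same transaction both ways, stopping one clock when the response headers arrive and the other when the complete response has been read:

# Sketch: two response-time definitions for the same transaction.
# The host is a placeholder; only the Python standard library is used.
import http.client
import time

def measure_response_times(host, path="/"):
    """Return (first_response_ms, complete_response_ms) for one GET."""
    conn = http.client.HTTPConnection(host, timeout=10)
    start = time.monotonic()
    conn.request("GET", path)              # request sent
    resp = conn.getresponse()              # headers arrive: "first response packet" definition
    first_response_ms = (time.monotonic() - start) * 1000.0
    resp.read()                            # body fully received: "complete response" definition
    complete_response_ms = (time.monotonic() - start) * 1000.0
    conn.close()
    return first_response_ms, complete_response_ms

if __name__ == "__main__":
    ttfb, full = measure_response_times("www.example.com")
    print(f"first response: {ttfb:.1f} ms   complete response: {full:.1f} ms")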

Service metrics often involve synthetic (virtual) transactions. Those synthetic transactions must be specified and should be included in any agreement. Synthetic transactions must be updated as the transaction mix changes; provisions for adding new ones to the agreements should also be addressed. Similarly, the SLA must include them in regularly scheduled reviews.

Measuring availability also demands unambiguous specifications. For example, a service customer would characterize an outage as lasting from the time it was first detected until a customer transaction verifies that the service is available and functioning within the SLA compliance criteria. Providers who own a subcomponent will tend toward defining and measuring the outage duration in terms of the service offered, such as the period of time during which a piece of the network was not functioning. Where the commitments made by providers do not integrate effectively, end users will perceive a different impact from the outages than might be indicated by the availability statistics of the underlying subcomponents.
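
As a worked example (the outage windows are invented), the resulting availability figures can differ noticeably depending on whose outage definition is applied:

# Sketch: availability as a percentage of a reporting period, given outage
# windows in minutes. The windows below are invented for illustration.

def availability(period_minutes, outage_minutes):
    """Return availability in percent for one reporting period."""
    downtime = sum(outage_minutes)
    return 100.0 * (period_minutes - downtime) / period_minutes

MONTH = 30 * 24 * 60  # 43,200 minutes

# Customer view: outage runs from first detection until a verifying
# transaction succeeds within the SLA compliance criteria.
customer_outages = [38.0, 12.0]          # minutes

# Component-owner view: only the window in which that piece of the
# network was actually down.
component_outages = [22.0, 5.0]          # minutes

print(f"customer-perceived availability:  {availability(MONTH, customer_outages):.3f}%")
print(f"component-reported availability: {availability(MONTH, component_outages):.3f}%")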

It is also necessary to specify measurement validation and any statistical treatments that should be applied to the data. These should be combined with sampling frequency to ensure that confidence intervals are acceptable. For example, a critical service might be probed every five minutes, while those of lesser importance are checked every fifteen minutes. (See Chapter 2 for detailed discussions.) The increased granularity of more frequent measurements must be balanced against the additional demand on servers and networks. More organizations are adding a dynamic specification that shortens the measurement interval if the metric is trending toward unacceptable values.
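
A minimal sketch of such a dynamic specification, with assumed thresholds and intervals, might look like this:

# Sketch: dynamic measurement interval. The thresholds, intervals, and the
# measured values are assumptions for illustration only.

NORMAL_INTERVAL_S = 15 * 60     # lesser-importance default: every fifteen minutes
CRITICAL_INTERVAL_S = 5 * 60    # critical-service default: every five minutes
TIGHT_INTERVAL_S = 60           # shortened interval while trending badly

OBJECTIVE_MS = 2000.0           # service level objective for response time
WARNING_FRACTION = 0.8          # start watching closely at 80% of the objective

def next_interval(latest_ms, base_interval_s):
    """Shorten the interval when the metric approaches its objective."""
    if latest_ms >= WARNING_FRACTION * OBJECTIVE_MS:
        return TIGHT_INTERVAL_S
    return base_interval_s

# Example: a 1.7-second response on a critical service triggers tighter sampling.
print(next_interval(1700.0, CRITICAL_INTERVAL_S))   # -> 60
print(next_interval(900.0, CRITICAL_INTERVAL_S))    # -> 300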

After the metrics and their measurement procedures are specified, the service level objectives can be established based on the requirements of the application. In cases where the performance characteristics of a service are well-established, such as those associated with a service from a major external supplier, it may be necessary to choose from the service classes that the supplier offers. For example, an interactive application might be able to choose among three offered classes of service with three different sets of acceptable response times and packet losses, as shown in Table 14-1. Major Internet Service Providers (ISPs) offer service guarantees for transit that's completely on their networks, and some are offering guarantees for transit to and from endpoints on other networks.

Table 14-1. Examples of Interactive Service Classes

Service Class    Maximum Response Time (ms)    Maximum Packet Loss
Platinum         60                            0.25%
Gold             100                           0.5%
Silver           150                           2%

Instrumentation Choices and Locations

Getting the proper instrumentation in place is an important step. Note that the measurements needed for reporting are often somewhat different from the measurements needed for problem management and performance optimization, and both are needed for effective SLM. (Detailed discussions are in Chapter 2 and in Chapters 8 through 10.)

Both passive and active measurements are used to gather all the necessary information. Each is discussed in the following subsections.

Passive Measurements

Passive measurements provide insight into the actual services being used and their volumes throughout the day. Passive monitors should be placed on the access links from data centers where they can capture the actual traffic flowing to and from the center. Placing agents near the organizational boundary also tracks the outbound traffic originating within the organization.

Remote Monitoring (RMON) agents are passive agents that can provide a rich view of application flows across the networked infrastructure. Many LAN switches have embedded RMON agents that can be used. A stand-alone agent can also be used to collect the information, allowing measurements at sites where a switch does not have an RMON probe or where the large volume of collected data impacts the switch performance.

Passive agents in servers can also provide information about the applications that are executing, the numbers of concurrent users, and the time distribution of usage. The server information can supplement or replace the RMON data. The measurements at the organization's edge will still be needed to understand the outbound traffic.
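
Whatever their source (RMON agents, switch-embedded probes, or server agents), the collected observations are typically rolled up into per-application, per-hour volumes. A generic sketch of that aggregation, using an assumed record format rather than any particular RMON interface, is shown here:

# Sketch: aggregating passive flow observations into per-application,
# per-hour traffic volumes. The record format is an assumption; it is not
# tied to any particular RMON or flow-export implementation.
from collections import defaultdict
from datetime import datetime

# (timestamp, application, bytes) tuples as a stand-in for collected records
records = [
    (datetime(2003, 3, 10, 9, 12), "order-entry", 48_000),
    (datetime(2003, 3, 10, 9, 40), "order-entry", 61_500),
    (datetime(2003, 3, 10, 9, 55), "email",       120_000),
    (datetime(2003, 3, 10, 14, 5), "backup",      4_800_000),
]

volumes = defaultdict(int)           # (application, hour) -> total bytes
for ts, app, nbytes in records:
    volumes[(app, ts.hour)] += nbytes

for (app, hour), total in sorted(volumes.items()):
    print(f"{app:<12} {hour:02d}:00  {total:>10,d} bytes")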

Active Measurements

Active measurements are used to build a consistent view of service behavior as seen by the end users. For example, a set of synthetic transactions can be constructed that are realistic approximations of actual end-user activity. Performance is tracked by sending synthetic transactions to the actual site. This offers the performance perspective as experienced by customers, partners, or suppliers. The measurements are used to alert the local management team that service disruptions may be threatening sales or business relationships.
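
A minimal sketch of such an active probe is shown below; the URL, threshold, interval, and alerting hook are all assumptions, and only the Python standard library is used:

# Sketch: a periodic synthetic transaction with a simple alerting hook.
# The URL, threshold, and alert destination are assumptions for illustration.
import time
import urllib.request

TARGET_URL = "https://www.example.com/checkout"   # hypothetical transaction
THRESHOLD_S = 3.0                                 # assumed response-time objective
INTERVAL_S = 300                                  # probe every five minutes

def probe_once(url):
    """Run one synthetic transaction and return its elapsed time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.monotonic() - start

def alert(message):
    # Stand-in for paging or ticketing integration.
    print(f"ALERT: {message}")

if __name__ == "__main__":
    while True:
        try:
            elapsed = probe_once(TARGET_URL)
            if elapsed > THRESHOLD_S:
                alert(f"{TARGET_URL} took {elapsed:.1f}s (objective is {THRESHOLD_S:.1f}s)")
        except OSError as exc:
            alert(f"{TARGET_URL} unreachable: {exc}")
        time.sleep(INTERVAL_S)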

Active probes should be placed in multiple locations for the best results. The internal environment can be as transparent as desired because multiple probes can provide detailed and granular measurements. There is more flexibility within an infrastructure that the organization controls and manages; probe placements should be used to provide overall end-to-end measurements and to break down the components along the path. Performance of different areas (across the backbone, on the web server, or on the back-end server, for example) gives the detailed data needed for resource planning.

Placing probes at different points in the internal network also gives a broad picture of any service quality variations that are related to differences between locations. This gives resource planners a finer level of detail and identifies specific areas that require further attention.

One caution in distributing active probes across the infrastructure is that overduplication of measurements should be avoided. For example, it may seem reasonable to place a probe that monitors an end-to-end service, plus individual probes that monitor each component of that service. However, when multiple edge services use the same core service, you don't need to add multiple monitors for the core service as well.

External parties will not expose their internal operations to outsiders; nonetheless, they still must be measured. Therefore, the main measurement objective for external services is the identification of the proper demarcation points so that the performance of external parties (suppliers, partners, or hosted services) can be isolated and measured by appropriate instrumentation deployed at those points.

Demarcation points close to edge routers connecting to external services are the most desirable locations for such instrumentation. Active measurements can track the performance between these demarcation points, and that performance can be used in SLAs with the external service providers. These measurements can evaluate the delay in provider networks, the delays at hosting sites, or the delays within a partner environment.

Measuring a provider is simple when both demarcation points are within the same organization. Placement is not restricted, and the provider delay can be clearly determined.

Negotiating with key partners or suppliers for placing measurement probes is becoming more common. For security reasons, a business partner may not want any external equipment connected inside the firewall. If a probe cannot be placed inside a firewall, a probe located at a demarcation point just outside the firewall can be used to measure the external network delays.
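
As a simple illustration (the delay figures are invented), the provider's contribution can be isolated by subtracting the delays measured on either side of the demarcation points from the end-to-end measurement:

# Sketch: isolating an external provider's delay from measurements taken at
# demarcation points. The numbers are invented for illustration.

end_to_end_ms = 420.0        # measured between the two internal vantage points
local_segment_ms = 35.0      # delay up to the local demarcation point
remote_segment_ms = 60.0     # delay beyond the remote demarcation point

provider_ms = end_to_end_ms - local_segment_ms - remote_segment_ms
print(f"provider network delay: {provider_ms:.0f} ms")   # -> 325 ms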

Baseline of Existing System Performance

It's important that SLA specifications be determined not only in the abstract, but also by baselining real services under real-world conditions. Neither providers nor consumers are well served by specifications and performance targets that have not undergone a "shakedown cruise" that covers the range of operational extremes.

A baseline of application and service activity defines the actual behavior of the services before any SLM solution is deployed. This information is used for planning the detailed steps of the implementation process and in evaluating the success of the implementation. In other words, the baseline defines the gap between what is wanted and what can be delivered.

Investigation of System Performance Sensitivities and System Tuning

The agreement to provide specified levels of service quality must be realistic; what is promised must be matched with the capacity for delivery. The census of the existing system components, the draft specification of metrics, and the baseline of existing system performance are therefore used in this step, which tries to understand the costs of meeting various performance objectives by understanding the sensitivity of the system to changes in design. Final tuning of the system design and the service objectives may be necessary to match the delivery capabilities of the system without incurring excessive costs.

Adjustments to the design are indicated when the actual performance isn't sufficient to meet the objectives. The gap between the desired and the actual performance is a gauge of the effort and expense needed to bridge it. A small gap may be closed with a modest upgrade or a simple reorganization of resources, such as adding a faster server or distributing content to edge servers to reduce congestion. The census process can help by identifying system components that can be moved to places where they add more leverage and control while reducing the need for new purchases.

Larger differences may indicate that investment in multiple areas, such as network bandwidth or a faster database server, is necessary. A balance between larger investments and the target levels should be considered. Would adding two seconds to the proposed response-time metric result in lost business or productivity? Would the two seconds be a good idea if it saved a substantial sum of money?

Granular internal instrumentation is very helpful at this stage. Measurements may directly identify the main contributors to the delay. If the internal network delays are 10 ms and the server delays are 6.5 seconds, most leverage is going to be found in improving server performance (or whatever back-end services are activated).
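
A trivial breakdown, using the figures just cited, makes the point of leverage obvious:

# Sketch: apportioning end-to-end delay among its measured components,
# using the figures cited above.
components = {"internal network": 0.010, "server and back-end": 6.5}  # seconds

total = sum(components.values())
for name, delay in components.items():
    print(f"{name:<22} {delay * 1000:>8.1f} ms  ({100 * delay / total:5.1f}% of total)")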

Capacity of the systems under the actual workload is also a critical consideration. Acceptable response when loads are light may be deceptive. If some of the services are not yet deployed, there is more uncertainty about the actual infrastructure capacities. Estimating expected growth in transactions or users is important to ensure that a new service management system has the headroom to accommodate growth for some initial period, all the while staying below the inflection point, which is where performance becomes nonlinear. Baseline information or data from load testing is very helpful in building a realistic assessment of the implementation effort, its costs, and time frames. (See Chapter 11, "Load Testing," for more information about inflection points and load testing.)
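
One way to locate an approximate inflection point from baseline or load-test data is to watch for the load level at which response time starts growing much faster than it did at light load, as in this sketch (the measurements are invented):

# Sketch: locating an approximate inflection point in load-test results.
# The (load, response time) pairs are invented for illustration.

samples = [  # (transactions per second, response time in seconds)
    (10, 0.8), (20, 0.9), (30, 1.0), (40, 1.2), (50, 1.6), (60, 2.9), (70, 6.5),
]

def inflection_load(data, growth_factor=3.0):
    """Return the first load level at which the response-time slope has jumped."""
    base_slope = (data[1][1] - data[0][1]) / (data[1][0] - data[0][0])
    for (l0, r0), (l1, r1) in zip(data, data[1:]):
        slope = (r1 - r0) / (l1 - l0)
        if slope > growth_factor * base_slope:
            return l1
    return None

print(inflection_load(samples))   # -> 50 transactions per second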

After the service-level and system capacity needs have been determined, the process of tuning system performance can begin. Having a set of choices increases options and leverages competition. It also makes the selection process more complex because there will be several ways of satisfying any particular requirement. For example, money can be spent directly on the servers to improve processing power or memory, or it can be spent on storage systems to boost server performance. The expenditures may also be indirect and include buying load balancers, web server front ends, or content management and delivery systems.

The instrumentation used for baselines can now be used in sensitivity testing as the new elements are added to the test bed and then to the production environment. It is best to measure one change at a time to get a better feel for the changes that are making the most significant contributions.

Instrumentation may also reveal that some other applications carried by the system are interfering with service performance. For example, some activities, such as playing games over the network or downloading media files, are not related to business goals and waste time and resources. At other times, a legitimate service, such as backing up a database, may interfere with other critical services because it is scheduled incorrectly. Data from instrumentation may therefore indicate a need for admission control policies and enforcement. Undesired applications would be barred from using network resources, while others could be scheduled to reduce their interference with other operations.

Construction of SLAs

The foundation of any service management system is based on instrumentation, reporting, and a clearly defined SLA. As part of an SLM implementation, most organizations will create one set of SLAs with a range of external providers and partners. They will create a separate internal set of SLAs between the IT group and the business units. All parties gain from having an SLA; service customers expect to have more control of service quality and their costs, while providers have clearer investment guidance and can reap premiums for higher service quality.

Because it specifies the consensus across all parties, the SLA becomes the foundation for deciding if services are being delivered in a satisfactory fashion. I discussed SLAs in Chapter 2, and they are reviewed here, along with some additional discussion of SLA dispute resolution.

An SLA should clearly define the following:

  • The metrics (service level indicators) used to measure service quality

  • The service level objectives

  • The roles and responsibilities of the service provider and the service customers

  • Reporting mechanisms

  • Dispute resolution methods

There may be other areas, such as determining financial penalties, that must also be addressed and resolved. This is driven by business considerations: the providers want the lowest possible exposure to penalties, while the customers want realistic compensation for service disruptions.
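
A minimal structural sketch of such an agreement, covering the areas listed above plus a penalty clause, might look like the following; all field names and sample values are assumptions rather than recommendations:

# Sketch: a minimal SLA record covering the areas listed above.
# Field names and values are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class ServiceLevelObjective:
    indicator: str            # e.g., "transaction response time"
    target: float             # numeric objective
    unit: str                 # e.g., "seconds", "percent"

@dataclass
class ServiceLevelAgreement:
    provider: str
    customer: str
    objectives: list = field(default_factory=list)
    reporting: str = "monthly web-published compliance report"
    dispute_resolution: str = "joint measurement and calibration, then escalation"
    penalty_clause: str = "rebate of 5% of monthly fee per missed objective"

sla = ServiceLevelAgreement(
    provider="Hosting provider",
    customer="Business unit",
    objectives=[
        ServiceLevelObjective("transaction response time", 3.0, "seconds"),
        ServiceLevelObjective("availability", 99.9, "percent"),
    ],
)
print(sla.objectives[0])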

Usually the penalties suit the provider rather than the customer. For instance, the common remedy for a disruption is a rebate on future bills or a refund. The downside for customers is that the rebate may be a small fraction of the customer loss, possibly thousands of dollars of lost revenue per minute. As discussed in Chapter 2, there are strategies that can be applied to encourage desired supplier behavior and that can be coupled with risk insurance, if necessary, to compensate for losses.
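
A quick back-of-the-envelope comparison, with invented figures, shows how lopsided the common remedy can be:

# Sketch: comparing a typical rebate with the customer's actual loss.
# All figures are invented for illustration.
outage_minutes = 45
lost_revenue_per_minute = 2_000        # dollars
monthly_fee = 10_000                   # dollars
rebate_fraction = 0.10                 # 10% of the monthly bill

customer_loss = outage_minutes * lost_revenue_per_minute   # 90,000
rebate = rebate_fraction * monthly_fee                      # 1,000
print(f"loss ${customer_loss:,.0f} vs. rebate ${rebate:,.0f}")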

Even where legally binding SLAs with external providers are not involved, it's important to reduce ambiguities to an absolute minimum.

The SLA areas of metrics and service level objectives were discussed in preceding sections; the other SLA areas are described in the following subsections.

Roles and Responsibilities

Although individual roles within a team can be clarified in conversation, the set of activities and responsibilities across separate entities is best agreed upon in advance.

The service provider provides a set of services at a specified service quality and cost. The provider is also responsible for reporting service level indicators and costs on a regular and as-needed basis. This may include business process metrics such as help-desk response times, escalation times, and service activation times.

Responsibilities of the service user, such as adherence to specified workload characteristics and methods for interacting with the service provider, should also be described.

Reporting Mechanisms and Scheduled Reviews

In my experience, putting service quality reporting in place early is most helpful, especially in building internal support for SLM expenditures and potential inconvenience during deployment and cut-over.

There are many reporting tools available. Many management tools have reporting capabilities as part of the package. These capabilities are usually product-specific, providing a set of reports about a particular device or server, for example.

Active measurement systems almost always include methods for reporting their results in a form that can be used for service level indicators. Your choice of tools will depend on the particular service level indicators included in the SLA, the capabilities of the various management tools being used, and the statistical manipulation required by the SLA.

All these products or features for reporting service levels share common characteristics. For instance, they offer a variety of presentation formats, such as bar charts, pie charts, and histograms, so that the information is in a form that is most useful to the person using the report.

Reporting tools save administrators time with a set of predefined templates that are ready to use out-of-the-box. This feature enables administrators to get immediate value without needing to learn much about the reporting product itself. Templates can be supplemented with easy customization so that the reports can be tailored as needs change.

The reporting tools must offer various levels of granularity and detail because there are different sets of consumers. The technical teams usually need more extensive details, such as a breakdown of each service disruption. They want information defining the initial indications of a disruption and quantifying the disruption. (How slow was the transaction relative to compliance guidelines? How long was the duration of the problem?) Upper management usually wants summaries (the numbers of disruptions and the applicable penalties), but underlying data must be easily accessible.

Reporting becomes a useful ally in building an SLM plan because it helps to set expectations. Publishing a set of reports on the Web makes the information available to all internal users, much like traffic reports on the radio. This enables them to see the actual service quality they are experiencing, and they can track improvements over time. Published reports reduce the load on the help desk because users can check performance and other parameters for themselves rather than asking help desk staff for information. User access to SLA compliance reporting also keeps the service management team accountable because everyone sees the results.

It's important that there be consistency between the published reports used for determining SLA compliance and the instant reports available on the Web. However, there may need to be adjustment of the instant reports because of measurement errors and other problems; customers must be made aware of that possibility to avoid losing credibility. Credibility problems can also appear if instant measurements are made available before the reporting system has been fully tested; spurious reports of service level problems will make the load on the help desk worse instead of better.

Accountability is enhanced by scheduled, periodic reviews of the service level reports. Defining who participates in these reviews from each side, how often they are scheduled, and what material is to be reviewed should go hand in hand with the reporting requirements.

Dispute Resolution

There should be a mechanism defined for resolving disputes because they will inevitably arise, despite having as many details as possible spelled out in the SLA. Given the pressures to minimize penalties on the provider's side and the criticality of services on the customer side, discrepancies in the measurements and their interpretation will be subjected to substantial scrutiny.

In the past, service providers traditionally made the measurements themselves and just reported them to customers. Today, many customers want to conduct their own measurements. Some customers feel it keeps the provider honest. In this, the web-based services industry is just catching up with more traditional industries, in which regular monitoring of supplier inputs to manufacturing or other elements of the supply chain is a critical, standard operating procedure.

When provider and consumer are measuring from different points or using different intervals, the results will not be consistent and will lead to disputes when disruptions occur. As the financial consequences mount, the probability of disputes rises accordingly.

There is, therefore, a move to use a trusted third party whose measurements are assumed to be objective. Companies that measure Internet performance, such as Keynote Systems, are used to verify the performance of the cloud (the integrated set of services). Other companies, such as Brix Networks, have also been founded to address this specific concern. Brix places measurement appliances at key demarcation points, collects the performance information, and analyzes it at a central site. Mercury Interactive, through their Topaz Managed Services, offers some of the same capabilities.

It benefits both parties when accountability is clearly determined from agreed-upon metrics and measurements. Any dispute resolution process needs specific steps, such as both parties simultaneously conducting measurements to determine the differences in the readings. Any differences can be used to correct and calibrate the results before determining if services are compliant. Such a process is critical to building a workable relationship between service providers and service customers.
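
A simple sketch of that calibration step follows; the paired readings are invented, and the systematic offset between the two vantage points is computed so it can be removed before compliance is judged:

# Sketch: calibrating provider and customer measurements taken simultaneously.
# The paired readings (in milliseconds) are invented for illustration.

provider_ms = [210.0, 198.0, 225.0, 240.0, 205.0]
customer_ms = [232.0, 221.0, 250.0, 259.0, 228.0]

differences = [c - p for p, c in zip(provider_ms, customer_ms)]
offset = sum(differences) / len(differences)

print(f"mean offset: {offset:.1f} ms")   # systematic difference between vantage points
corrected = [c - offset for c in customer_ms]
print("corrected customer readings:", [round(x, 1) for x in corrected])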

In some cases, especially during initial implementations of SLM, it may be necessary to adjust the service level objectives because it has become apparent that the costs are too high, or the system is not yet capable of meeting the original targets. The possibility of that readjustment must be understood by all parties involved in those initial efforts.



