Implementing VoIP SLAs
Armed with a good understanding of the SLA metrics that are important for VoIP, you can begin actually implementing the right SLAs for your enterprise. Like many topics covered in this book, it is best to view this as a staged process. Working through the five stages discussed in this section, listed next, will bring you the results you are seeking: a trouble-free VoIP system, transparent to your end users:
Define who is responsible for each role in the overall task of SLA implementation.
Identify the right VoIP service levels for your enterprise.
Negotiate the SLA itself.
Begin measurements, to determine when the SLA is being met and when it is not. You want this stage to be automatic, so you will choose among the tools that are available to make this painless.
Manage and enforce the SLA.
Define Responsibilities
The implementation of an SLA should be segmented into several roles. Most of the following roles apply more readily to external SLAs, those negotiated with a third-party provider. However, each implicit task must also be completed by someone when an internal SLA is being developed, and must also remain someone's responsibility once the SLA is in force. The following series of questions illustrates the types of roles that need to be assumed:
Who defines the SLA? Who decides which metrics are important for the organization?
Who writes the contract and guides it through the negotiations? Who determines the penalties?
Who manages the network, its equipment, and the related computer hardware and software? These are the items against which the service covered by the SLA is being measured. Who is responsible for maintaining them?
Who takes the measurements for the metrics specified in the SLA? Who assures the quality of the measurements as they are taken?
Who manages the SLA? Who specifies the thresholds for the metrics and gets notified when they have been crossed?
Who responds to and resolves incidents when a managed threshold or an SLA metric is crossed?
Who does the SLA-related accounting? Who measures the percentage of compliance and presents it to the provider or service recipient at the end of each week or month?
Who enforces compliance with the SLA? This involves determining penalties and collecting them.
Who decides when it is time to get a new provider, if compliance with the SLA becomes an issue?
In a small organization, many of these roles may be handled by the same person. But, in larger organizations, these responsibilities are probably divided among several people who need to communicate well with one another.
Identify Service Levels for VoIP and Other Applications
Before beginning your SLA contract negotiations, determine what you are going to measure and what the target SLA values should be. Although the VoIP SLA metrics discussed earlier in "Determining What to Measure in a VoIP SLA" are all important to varying degrees, your own SLA should comprise the metrics that are most meaningful to your business. A good piece of advice to take to heart follows: "Creat[e] SLAs that are based on the quality of end-user application experience, rather than IT metrics which customers cannot or do not want to understand." VoIP is an example of a network application whose value is driven nearly 100 percent by the perceptions, positive or negative, of ordinary users.
For VoIP, the relevant metrics for an SLA are the metrics discussed previously: availability, call quality, call-setup performance, and incident-resolution time. The first three are end-user, end-to-end measurements; they are what counts. You could alternatively measure only their lower-layer constituents, such as delay and packet loss, but it is best to consider those metrics principally for diagnostic purposes, to reduce the "find and fix" time. You also should include measurements related to problem repair, such as MTTR and incident management: how problems are identified, submitted, and passed among the team.
Having added VoIP to the network, do you now need to add SLAs for your other business-critical applications, which you may not have been monitoring before? These include e-mail, groupware, e-commerce, and industry-specific business programs. Maybe you should have had a response-time SLA for your ERP applications, but you were not aware of much dissatisfaction before you deployed VoIP. Adding VoIP traffic to a system near its capacity may significantly increase the response time of other business-critical applications. Application performance is something you now really need to pay attention to.
To establish the target SLA metric for your most important applications, begin by establishing performance baselines. These let you know what is actually possible and provide a starting point: a baseline records where MOS, response time, or throughput stood at the time you measured it. You obviously should not write an SLA to support 100 VoIP calls with a minimum MOS of 3.9 if there is clearly not enough bandwidth or other resources to support 100 calls with that level of quality.
Start your VoIP call-quality baseline with the results provided by the VoIP-readiness assessment discussed in Chapter 3. Work from the last assessment you did: the one in which the MOS met your standards and which was conducted after you had done all the necessary upgrades and eliminated any bottlenecks or other problems that were identified. It is similarly straightforward to get baselines for the network performance of other applications, such as response time, throughput, or packet loss. Don't go overboard; be sure to measure what is important to end users for each application. For example, for a database application, focus on the response time for queries or updates rather than on throughput.
Likewise, create your availability baseline using the best numbers you have to describe your current availability statistics. Without making any significant changes to the network or the users, this gives you a place to start, a place where users know what to expect.
If you plan to monitor application response time, avoid insisting that all requests be met with a response time of 1 second or less. That is unrealistically strict. Instead, it would be better to state, for example, that 95 percent of requests must have no more than a 1-second response time and 5 percent may have a response time of between 1 second and 5 seconds.
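A percentile-style target of this kind is easy to check mechanically. The following is a minimal sketch, assuming you already have a list of measured response times in seconds; the function name and its parameters are hypothetical, and the default limits simply mirror the example figures above.

```python
def response_time_compliance(samples_s, fast_limit_s=1.0,
                             slow_limit_s=5.0, fast_share=0.95):
    """Check a two-tier response-time target.

    samples_s   : measured response times, in seconds
    fast_limit_s: limit that most requests must meet (1 second in the example)
    slow_limit_s: absolute ceiling for the remaining requests (5 seconds)
    fast_share  : fraction of requests that must meet fast_limit_s (95 percent)

    Returns (compliant, observed_fast_share, worst_sample).
    """
    if not samples_s:
        raise ValueError("no samples to evaluate")
    fast_count = sum(1 for s in samples_s if s <= fast_limit_s)
    observed_fast_share = fast_count / len(samples_s)
    worst = max(samples_s)
    compliant = observed_fast_share >= fast_share and worst <= slow_limit_s
    return compliant, observed_fast_share, worst


# 95 fast requests and 5 slower ones that stay under the 5-second ceiling
print(response_time_compliance([0.5] * 95 + [3.0] * 5))
```

Writing the target as two conditions (a share that must be fast, plus a ceiling for the rest) keeps a handful of slow requests from counting as a violation while still bounding the worst case.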
Starting from these baselines (the expected and observable behavior), create some SLA targets. SLA targets are values representing performance that is so bad that it is no longer acceptable. For example, if your VoIP baseline MOS is 4.15 today, you might create an SLA target that reads something like this:
MOS of 4.0 or above 85 percent of the time; 3.9 or above 95 percent of the time; and 3.8 or above 100 percent of the time; measured on 10 concurrent calls with the G.711 codec, between Raleigh and Houston.
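A tiered target like this one can be evaluated against a series of MOS readings. The sketch below assumes the readings are equally spaced in time, so that "percent of the time" can be approximated by "percent of readings"; the names `MOS_TIERS`, `mos_tier_report`, and `mos_sla_met` are hypothetical.

```python
# (MOS floor, required share of readings at or above that floor),
# taken from the example target: 4.0 for 85%, 3.9 for 95%, 3.8 for 100%.
MOS_TIERS = [(4.0, 0.85), (3.9, 0.95), (3.8, 1.00)]


def mos_tier_report(mos_samples, tiers=MOS_TIERS):
    """Return one (floor, required, observed, ok) tuple per tier."""
    if not mos_samples:
        raise ValueError("no MOS samples to evaluate")
    total = len(mos_samples)
    report = []
    for floor, required in tiers:
        observed = sum(1 for m in mos_samples if m >= floor) / total
        report.append((floor, required, observed, observed >= required))
    return report


def mos_sla_met(mos_samples, tiers=MOS_TIERS):
    """The target is met only if every tier is met."""
    return all(ok for _, _, _, ok in mos_tier_report(mos_samples, tiers))


readings = [4.1] * 85 + [3.95] * 10 + [3.85] * 5
for floor, required, observed, ok in mos_tier_report(readings):
    print(f"MOS >= {floor}: required {required:.0%}, observed {observed:.0%}, ok={ok}")
```

The per-tier report is kept separate from the overall pass/fail result because the report is what you would normally present to the provider or service recipient.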
Negotiate the SLA
The intention of an SLA is to spell out which services are to be provided, how the services will perform, and what should happen if their performance does not meet the expected service levels. But a certain amount of negotiation, compromise, and perhaps even controversy will undoubtedly enter into any SLA you implement. One author made the analogy that "SLAs are nothing more than insurance policies. Just as life insurance doesn't guarantee life, SLAs don't guarantee levels of service. They provide you with compensation in case something goes wrong."
You should anticipate some give-and-take in the relationships affected by a VoIP SLA. Here is a top ten list of topics to be addressed in your VoIP SLA negotiations:
Specify the SLA metrics and their target values.
The earlier section "Identify Service Levels for VoIP and Other Applications" describes this topic in detail. The measurements that affect the end-user experience are important to your organization and should be included in your VoIP SLA: availability, call quality, and call-setup performance. You also may want to include metrics for other applications so that their performance does not degrade because of the addition of VoIP, as well as a metric for incident rates and their rates of resolution.
Describe how the SLA metrics are measured and who measures them.
The earlier section "Define Responsibilities" identifies the wide range of roles and responsibilities associated with creating and enforcing a VoIP SLA. Do you take SLA compliance measurements in your organization, or are they taken by a third party or the service provider? If the provider takes the measurements, how do you, the customer, verify them? Tools for taking measurements are described in the next section, "Deploy Tools to Measure SLAs." The SLA should describe in detail how the measurements are to be taken. It should specify the locations to be monitored. And the SLA should spell out how measurements and compliance should be handled if an end-to-end metric involves multiple ISPs.
The SLA should also explicitly describe what time periods are covered. The following quotation appeared in an ISP offering brochure: "A high-end VoIP carrier will offer 99.99% availability, which does not include scheduled maintenance windows where the carrier may take down the network to upgrade equipment; clean or switch fibres or perform any other work that could lead to network downtime." [Italics added.] Wow! A lot of time may elapse in these periods that are not included in the availability agreement; what time periods are covered in your SLAs, and how are they measured?
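To see why the covered time periods matter, it helps to turn an availability percentage into minutes. Over a 30-day month (43,200 minutes), 99.99 percent availability allows about 4.32 minutes of downtime. The sketch below, with hypothetical function names and a hypothetical 240-minute maintenance window, contrasts the availability the contract measures with the availability users actually experience when maintenance is excluded.

```python
def downtime_budget_minutes(availability_pct, period_days=30,
                            excluded_maintenance_min=0.0):
    """Minutes of downtime the SLA tolerates within the covered time."""
    covered_min = period_days * 24 * 60 - excluded_maintenance_min
    return covered_min * (1 - availability_pct / 100)


def availability_pct(outage_min, period_days=30, excluded_maintenance_min=0.0):
    """Availability as measured over the covered time only."""
    covered_min = period_days * 24 * 60 - excluded_maintenance_min
    return 100 * (1 - outage_min / covered_min)


# Allowed unplanned downtime for 99.99% over 30 days
print(round(downtime_budget_minutes(99.99), 2))

# Assumed scenario: 4 minutes of unplanned outage plus 240 minutes of
# scheduled maintenance that the contract does not count.
contract_view = availability_pct(4, excluded_maintenance_min=240)
user_view = availability_pct(4 + 240)  # users see maintenance as downtime
print(round(contract_view, 3), round(user_view, 3))
```

In this assumed scenario the provider still meets the contracted figure, while the availability seen by users is noticeably lower.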
Describe the SLA reports and their schedule.
Your service provider should demonstrate its compliance with the SLA by sending you monthly or even weekly reports. In your negotiations, make sure the contract specifies what metrics and what parts of the network will be included in the reports. It should also say how often you will get reports.
Allow requests to review SLA compliance information on demand.
The SLA should establish a procedure for requesting SLA compliance information on demand. This type of data can be helpful for troubleshooting. For example, if you are experiencing a delay problem, information from your ISP may help you narrow the problem down to a WAN link that you don't control.
Specify the turnaround times for change requests, by severity.
As you gain more experience with your VoIP system, you will make changes. For example, as you add new locations and new users, you may want to add more locations to be monitored as part of the SLA. This may require a change request to your service provider. What is the expected turnaround time for the change request? It is also reasonable to include a prioritization scheme in the SLA's timetable for such requests. A slow or overburdened link may be one of your highest-severity items and should be expedited accordingly.
Specify support-personnel and help-desk staffing levels.
The last thing that you want is to be placed on hold indefinitely when there is an SLA-related fire to put out. Get it in writing: How many people are available to support your VoIP system when incidents occur? What is their skill level? What hours do they work?
Schedule periodic reviews and adjustments to contract provisions.
Your initial VoIP deployment will no doubt change over time: new users, new locations, new applications, new hardware, more bandwidth, mergers, and so on. These may cause your SLA requirements to change. Don't let your SLA requirements get too far out of date. Schedule regular reviews with your service provider.
Describe the rewards for great compliance and penalties for noncompliance.
What penalties should you build into your SLA? And if an unsatisfactory situation drags on, how long do the penalties build up before you call it quits?
CommWeb.com points out that, "It's easy for service providers to promise 99.999 percent uptime, especially when the penalty for not delivering is a meager day or two worth of credit. Obviously, penalties of this sort are no compensation for the potential loss in revenue when a company's web site is down or critical applications aren't performing." Your provider must have a strong motive for complying with the SLA you have negotiated. That motive may be either positive (a bonus or additional business) or negative (a substantial monetary penalty).
Your best safeguard when entering into an SLA is a "system of rewards and penalties for compliance and noncompliance," notes Mandy Andress of InfoWorld. "An unenforceable SLA serves little purpose. It is well and good to say that all requests should have a 1-second response time, but if the group responsible for system performance does not incur any penalties for slower response times or reap any rewards for faster response times, then they have no real incentive to comply."
Discuss transition assistance for services, should the service provider fail or suffer a setback.
Put together a plan that gets you through the difficulty if something catastrophic happens with your service provider. This type of situation has unfortunately become more common in recent years. Aside from bankruptcies, service providers face the same scary threats that you do: floods, tornadoes, malicious attacks by network intruders, and other unpleasant possibilities that need to be anticipated and planned for.
Create a procedure for terminating an SLA contract.
If you will pardon the analogy, sometimes the relationship with the service provider just does not work out, and an amicable divorce makes sense. Write the "prenuptial" agreement before the marriage, not after the relationship starts to go bad.
Is it reasonable to expect your service provider to agree to all of the types of stipulations just outlined? Figure 7-3 shows the results of a 2001 survey in Network Computing, asking service providers and outsourcers what is covered in their SLA contracts.
Figure 7-3. Survey Results from Network Computing, Showing Provisions Covered in SLA Contracts
Most boilerplate SLA contracts are probably not good enough for you. They can have lots of holes and exceptions. For example, if a carrier's subcarrier goes down, who is responsible? Depending on the size of your deployment and its geographical scope, there may be a chain of subcarriers and subcomponents to take into account; determine who is ultimately responsible. Make sure you fully negotiate the contract details with everyone potentially involved.
Additionally, consider letting SLA contract quality guide your choice of service provider. You are now armed with a top ten list of things to include in the negotiations. "If you are choosing between two otherwise-equivalent service providers, if one has a better SLA does that make a difference? And is that more important than past brand experience, than price?" A 2002 survey of enterprises with SLAs found that "not only were SLAs important, but the enterprises were willing to pay a significant premium for verified quality and guaranteed service…."
Deploy Tools to Measure SLAs
After you have deployed your VoIP system, determined the expected performance, and negotiated your SLA contract, you need to watch the metrics specified in your contract. This means that the performance values must be monitored on an ongoing basis, and events must be triggered when the target SLA value is about to be crossed.
Monitoring SLA compliance can be done by the service provider, by a team in your enterprise, by a third party, or by some combination of these. In any case, the provider will surely be motivated to allow for some comprehensive monitoring of its offerings. And for you, monitoring is even more important. "Enterprises want to outsource their networks, service offerings and have proof that they're getting what they're paying for," notes Laura Spear, VP of marketing at Trinagy. "That means they need tools to provide proof and credibility back to their customers." Figure 7-4 shows the results of a survey in Network Computing that describes how SLA performance measurements are received.
Figure 7-4. Survey Results from Network Computing, Showing how SLA Performance Measurements Are Received
Although one reason to perform consistent SLA monitoring is to check SLA compliance, a more important reason is to avoid SLA infractions altogether. This means determining how much early warning you need to deal with developing problems. Although you would like to know what is going on at any given moment, the closer you get to "real-time monitoring," the greater the amount of data that is collected and the greater the amount of network traffic that is generated in reporting it. A better method is to set useful thresholds.
For a given SLA measurement, set a pair of thresholds that are stricter than the SLA target. When the second threshold is triggered, take immediate action so that the SLA level is not reached. You want to force action to be taken before an SLA violation, not when the SLA metric has been crossed and it is too late. Consider this the good-to-bad threshold; when it is crossed, initiate the incident/fault-management processes discussed in previous chapters.
When setting thresholds, create two threshold crossings: crossing on the way down (going from good to bad), and crossing on the way back up again (going from bad to good, indicating that the incident has been resolved). And make sure you allow for some gap between these threshold-crossing values; you don't want a flurry of alarms to besiege your e-mail inbox if the value you are measuring is fluctuating back and forth across this boundary.
Reading up from the bottom of Figure 7-5, you first encounter the SLA target; if the measured MOS crosses below the line, an SLA violation occurs. Above that is the threshold where, as the MOS declines, you decide that it has gone from good to bad, and you trigger the event and actions necessary to avoid a further decline. You would like to reset that event when the problem is truly fixed, so the top line is the threshold that is crossed on the way back up; as the MOS improves after having crossed below the "Good to Bad Threshold," it can be declared good again when it crosses above the "Bad to Good Threshold."
Figure 7-5. Example of a Fixed SLA Target for the MOS, Along with a Pair of Thresholds Preceding It
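The paired thresholds amount to a small two-state machine with a gap (hysteresis) between the crossing values. The following is a minimal sketch of that idea; the class name, the event names, and the numeric defaults (SLA target 3.8, good-to-bad 3.9, bad-to-good 4.0) are illustrative assumptions, not values prescribed here.

```python
class MosThresholdMonitor:
    """Tracks MOS readings against an SLA target and a pair of thresholds."""

    def __init__(self, sla_target=3.8, good_to_bad=3.9, bad_to_good=4.0):
        # Both thresholds sit above the SLA target, with a gap between them.
        if not sla_target < good_to_bad < bad_to_good:
            raise ValueError("expected sla_target < good_to_bad < bad_to_good")
        self.sla_target = sla_target
        self.good_to_bad = good_to_bad
        self.bad_to_good = bad_to_good
        self.state = "good"

    def update(self, mos):
        """Feed one MOS reading; return the list of events it triggers."""
        events = []
        if self.state == "good" and mos < self.good_to_bad:
            # Crossing on the way down: start incident management.
            self.state = "bad"
            events.append("incident_opened")
        elif self.state == "bad" and mos > self.bad_to_good:
            # Crossing on the way back up: the incident can be closed.
            self.state = "good"
            events.append("incident_cleared")
        if mos < self.sla_target:
            events.append("sla_violation")
        return events


monitor = MosThresholdMonitor()
for reading in [4.1, 3.95, 3.85, 3.92, 3.97, 4.05]:
    print(reading, monitor.update(reading))
```

In this run only the 3.85 reading opens an incident and only the 4.05 reading clears it; the readings of 3.92 and 3.97 in between raise no further alarms, which is the purpose of the gap.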
As you become more adept at working with your SLA thresholds, you may consider implementing thresholds that are not just fixed lines or numeric targets. Thresholds can be intelligent, responding to changes in overall behavior, the time of day, or the number of users.
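One simple way to make a threshold respond to the time of day is to derive it from a per-hour baseline. The sketch below is only an illustration under stated assumptions: the function name, the 0.15 margin, and the 3.9 floor are hypothetical, and the history is assumed to be a list of (hour of day, MOS) pairs collected during normal operation.

```python
from collections import defaultdict
from statistics import mean


def hourly_thresholds(history, margin=0.15, floor=3.9):
    """Build a good-to-bad threshold for each hour of the day.

    history: iterable of (hour_of_day, mos) pairs from normal operation
    margin : how far below the hourly average MOS the alarm should fire
    floor  : lowest threshold allowed, kept above the SLA target
    """
    by_hour = defaultdict(list)
    for hour, mos in history:
        by_hour[hour].append(mos)
    return {hour: max(floor, mean(values) - margin)
            for hour, values in by_hour.items()}


history = [(9, 4.2), (9, 4.3), (14, 4.0), (14, 4.0)]
for hour, threshold in sorted(hourly_thresholds(history).items()):
    print(hour, round(threshold, 2))
```

The floor keeps the adaptive value from ever drifting below a fixed warning level, so a busy hour with a poor baseline still triggers action before the SLA target is reached.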
A wide range of SLA monitoring tools is available to help you monitor your VoIP SLAs. Sterling Research summarizes the different choices available to you:
Some companies offer a limited tool set because they have decided to focus primarily on the monitoring aspect of SLA compliance. Others offer a range of tools that treat SLA compliance as an end-to-end process. One tool is initially used to establish baseline performance for a particular service; the next is used to monitor the service on a day-to-day basis; and, finally, simulation tools are used to spin what-if scenarios that calculate the impact on service performance if changes are made to the environment.
Manage SLA Compliance and Enforcement
Suppose it is the end of the month. You review your VoIP SLA reports and see that one of the SLAs has been violated: That is, too much time has been spent outside the SLA target. What transaction now needs to occur between you and the service provider?
First, don't get in this situation. Overcommunicate with the team that is fulfilling the SLA responsibilities described in the earlier section "Define Responsibilities." All members of the team should be well informed all along the way. You don't ever really want to get into the enforcement or penalty stage of an SLA contract. Avoiding disputes and legal actions altogether is almost always cheaper and less stressful than pursuing them.
Jared Huizenga of Sage Research believes that enterprises are currently trying to develop "a more proactive SLA" with their providers. In a "proactive" SLA, providers "have to spot, correct, and recompense customers for any problems before customers inform the service providers of the problems." Huizenga also believes that most enterprises that enter into an SLA with their providers "want to be able to actually monitor, at their own site, compliance" and "receive an automatic credit" if a compliance issue arises.
As discussed previously, your SLA contract should establish a system of rewards and penalties for compliance. These are the incentives for the SLA provider to perform well. Rewards for excellent SLA compliance may include things like cash bonuses. The penalties for SLA infractions can include automatic credit or reimbursement of your charges, withholding of payment, or cancellation of the contract. Penalties must be stiff enough to have real meaning for a larger provider. A Network Computing survey showed that 67 percent of respondents expected "financial remedies" from their provider if the SLA was breached, as shown in Figure 7-6.
Figure 7-6. Survey Results from Network Computing, Showing Expected Legal Remedies for SLA Noncompliance
When expectations are not met, any changes for the better should come from the provider. They need to determine how to improve their quality and their processes so that expectations are consistently met in the future.
SLAs should be reviewed regularly. An annual review is specified in many contracts. Because of the rapid pace of technology development, user expectations may change frequently; this is especially true of expectations for availability and the response time of business transactions. This means that SLAs must be periodically updated to reflect these changes. Otherwise, SLAs can quickly become outdated, demanding service levels far below existing technological capabilities.
Contract cancellation may be the most effective penalty to levy in cases of SLA noncompliance. David Kaufman, of Brix Networks, argues that the proactive testing and monitoring with SLA thresholds described previously in this chapter is appealing to service providers for that very reason. "Having the advance warning that something is beginning to go wrong with a[n] SLA is really vital because, if there's an SLA outage, you have a one-in-three chance that you've lost that customer," he says.