Optimizing Voice Quality

Because of the inherent characteristics of a converged voice and data IP network, administrators face certain challenges in delivering voice traffic correctly. This section describes these challenges and offers solutions for avoiding and overcoming them when designing a VoIP network for optimal voice quality.

Factors that Affect Voice Quality

Because of the nature of IP networking, voice packets sent via IP are subject to certain transmission problems. Conditions present in the network might introduce problems such as echo, jitter, or delay. These problems must be addressed with QoS mechanisms.

The clarity, or cleanliness and crispness, of the audio signal is of utmost importance. The listener must be able to recognize the speaker's identity and sense the mood of the speaker. These factors can affect clarity:

Fidelity: The degree to which a system, or a portion of a system, accurately reproduces, at its output, the essential characteristics of the signal impressed upon its input or the result of a prescribed operation on the signal impressed upon its input (definition from the Alliance for Telecommunications Industry Solutions [ATIS]). The bandwidth of the transmission medium almost always limits the total bandwidth of the spoken voice. Human speech typically requires a bandwidth from 100 to 10,000 Hz, although 90 percent of speech intelligence is contained between 100 and 3000 Hz.
Echo: A result of electrical impedance mismatches in the transmission path. Echo is always present, even in traditional telephony networks, but at a level that cannot be detected by the human ear. The two components that affect echo are amplitude (that is, loudness of the echo) and delay (that is, the time between the spoken voice and the echoed sound). You can control echo using echo suppressors or echo cancellers.
Jitter: Variation in the arrival of coded speech packets at the far end of a VoIP network. The varying arrival time of the packets can cause gaps in the re-creation and playback of the voice signal. These gaps are undesirable and annoy the listener. This variable delay is induced in the network by variation in the routes of individual packets, contention, or congestion. You can often resolve variable delay by using dejitter buffers.
Packet drops: The discarding of voice packets. Typically, when a VoIP packet is dropped from a network, 20 ms of audio is lost.
Delay: The time between the spoken voice and the arrival of the electronically delivered voice at the far end. Delay results from multiple factors, including distance (that is, propagation delay), coding, compression, serialization, and buffering.
Sidetone: The purposeful design of the telephone that allows the speaker to hear the spoken audio in the earpiece. Without sidetone, the speaker is left with the impression that the telephone instrument is not working.
Background noise: The low-volume audio that is heard from the far-end connection. Certain bandwidth-saving technologies, such as voice activity detection (VAD), can eliminate background noise altogether. When this technology is implemented, the speaker audio path is open to the listener, while the listener audio path is closed to the speaker. The effect of VAD is often that speakers think that the connection is broken, because they hear nothing from the other end.

Although each of the preceding factors affects audio clarity, factors that present the greatest challenges to VoIP networks include jitter, delay, and packet drops. A lack of network bandwidth is usually the underlying cause for these issues, which are addressed in the following sections.

Jitter

Jitter is defined as a variation in the delay of received packets, as illustrated in Figure 7-1. On the sending side, packets are sent in a continuous stream with the packets spaced evenly. Because of network congestion, improper queuing, or configuration errors, this steady stream can become uneven, because the delay between each packet varies instead of remaining constant.

Figure 7-1. Jitter in IP Networks
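
One common way to put a number on this variation is the running interarrival jitter estimate defined for RTP in RFC 3550. The following Python sketch is an illustration only: the function name and the sample timestamps are hypothetical, and it works in milliseconds rather than in RTP timestamp units.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1."""
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        # D is the change in transit time between two consecutive packets.
        d = (recv_times_ms[i] - recv_times_ms[i - 1]) - (
            send_times_ms[i] - send_times_ms[i - 1]
        )
        # RFC 3550 smooths the estimate with a gain of 1/16.
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Hypothetical timestamps: packets leave the sender every 20 ms.
send = [0, 20, 40, 60, 80]
steady = [50, 70, 90, 110, 130]   # constant 50 ms transit time
uneven = [50, 78, 91, 125, 132]   # transit time varies per packet

print("steady stream jitter:", interarrival_jitter(send, steady))
print("uneven stream jitter:", interarrival_jitter(send, uneven))
```

A stream with constant transit time yields an estimate of zero, no matter how large the fixed delay is. Only the packet-to-packet variation raises the value.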

When a router receives a VoIP audio stream, it must compensate for the jitter that is encountered. The mechanism that handles this function is the playout delay buffer, or dejitter buffer. The playout delay buffer must buffer these packets and then play them out in a steady stream to the digital signal processors (DSPs) to be converted back to an analog audio stream. The playout delay buffer, however, affects the overall absolute delay.
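
The trade-off between buffer depth and late packets can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the algorithm used in any router: playout starts a fixed `buffer_ms` after the first packet arrives, one packet is played every 20 ms, and any packet that arrives after its playout deadline counts as late. The function name and arrival times are hypothetical.

```python
def late_packets(arrival_ms, buffer_ms, interval_ms=20):
    """Return the indexes of packets that miss their playout deadline."""
    # Playout of packet 0 is held back by the buffer depth.
    playout_start = arrival_ms[0] + buffer_ms
    late = []
    for seq, arrived in enumerate(arrival_ms):
        deadline = playout_start + seq * interval_ms
        if arrived > deadline:
            late.append(seq)
    return late


# Hypothetical arrival times for packets that were sent every 20 ms.
arrivals = [50, 78, 91, 125, 132]

for depth in (0, 10, 20):
    print(f"buffer {depth:2d} ms -> late packets {late_packets(arrivals, depth)}")
```

A deeper buffer rescues more packets, but every millisecond of buffer depth is added to the absolute delay of the call.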

When a conversation is subjected to jitter, the results can be clearly heard. If the talker says, "Watson, come here. I want you," the listener might hear "Wat....s...on.......come here, I......wa......nt........y......ou." The variable arrival of the packets at the receiving end causes the speech to be delayed and garbled.

Delay

Overall or absolute delay can affect VoIP. You might have experienced delay in a telephone conversation with someone on a different continent. The delays can cause entire words in the conversation to be cut off, and can therefore be very frustrating.

When you design a network that transports voice over packet, frame, or cell infrastructures, it is important to understand and account for the predictable delay components in the network. You must also correctly account for all potential delays to ensure that overall network performance is acceptable. Overall voice quality is a function of many factors, including the compression algorithm, errors and frame loss, echo cancellation, and delay.

Figure 7-2 shows various sources and types of delay. Notice that there are two distinct types of delay:

Fixed delay components are predictable and add directly to overall delay on the connection. Fixed delay components include the following:
- Coding: The time it takes to translate the audio signal into a digital signal

- Packetization: The time it takes to put digital voice information into packets and remove the information from packets

- Serialization: The time it takes to place the bits of a frame onto a link

- Propagation: The time it takes a packet to traverse a link

Variable delays arise from queuing delays in the egress trunk buffers that are located on the serial port connected to the WAN. These buffers create variable delays (that is, jitter) across the network.
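
Two of the fixed components, serialization and propagation, follow directly from arithmetic. The Python sketch below illustrates the calculation; the function names are hypothetical, and the propagation speed of 200,000 km/s (roughly two-thirds of the speed of light, a typical figure for fiber and copper) is an assumption rather than a value taken from this chapter.

```python
def serialization_delay_ms(frame_bytes, link_bps):
    """Time to clock every bit of one frame onto the link."""
    return frame_bytes * 8 * 1000 / link_bps


def propagation_delay_ms(distance_km, speed_km_per_s=200_000):
    """Time for a signal to travel the length of the link (assumed speed)."""
    return distance_km * 1000 / speed_km_per_s


# A 1500-byte data frame on a 64-kbps link occupies the line for 187.5 ms.
print(serialization_delay_ms(1500, 64_000))
# The same frame on a 1.544-Mbps T1 takes under 8 ms.
print(round(serialization_delay_ms(1500, 1_544_000), 2))
# A 5000 km path adds about 25 ms under the assumed propagation speed.
print(propagation_delay_ms(5000))
```

The first result shows why a single large data frame ahead of a voice packet on a slow link can consume an entire delay budget by itself.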

Figure 7-2. Sources of Delay

Acceptable Delay

The ITU specifies network delay for voice applications in Recommendation G.114. This recommendation defines three bands of one-way delay, as shown in Table 7-1.

Table 7-1. One-Way Delay Bands (ITU-T G.114)
Range in Milliseconds	Description
0 to 150	Acceptable for most user applications.
150 to 400	Acceptable, provided that administrators are aware of the transmission time and its impact on the transmission quality of user applications.
Above 400	Unacceptable for general network planning purposes; however, it is recognized that in some exceptional cases, this limit will be exceeded.
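
The three bands can be expressed as a small lookup, which is convenient when checking measured delays in a script. The Python sketch below is illustrative only. The function name is hypothetical, and because the table does not say which band the boundary values 150 and 400 belong to, assigning them to the lower band is an assumption.

```python
def g114_band(one_way_delay_ms):
    """Classify a one-way delay against the three G.114 bands."""
    if one_way_delay_ms < 0:
        raise ValueError("delay cannot be negative")
    if one_way_delay_ms <= 150:   # boundary placement is an assumption
        return "acceptable for most user applications"
    if one_way_delay_ms <= 400:   # boundary placement is an assumption
        return "acceptable if the impact on quality is understood"
    return "unacceptable for general network planning"


for delay in (80, 171, 450):
    print(f"{delay} ms: {g114_band(delay)}")
```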

Note

This recommendation is for connections where echo is adequately controlled, implying that echo cancellers are used. Echo cancellers are required when one-way delay exceeds 25 ms (G.131).

This G.114 recommendation is oriented toward national telecommunications administrations and therefore is more stringent than recommendations that would normally be applied in private voice networks. When the location and business needs of end users are well known to a network designer, more delay might prove acceptable. For private networks, a 200 ms delay is a reasonable goal and a 250 ms delay is a limit. This goal is what Cisco proposes as reasonable, as long as excessive jitter does not impact voice quality. However, all networks must be engineered so that the maximum expected voice connection delay is known and minimized.

The G.114 recommendation is for one-way delay only and does not account for round-trip delay. Network design engineers must consider both variable and fixed delays in their design. Variable delays include queuing and network delays, while fixed delays include coding, packetization, serialization, and dejitter buffer delays. Table 7-2 provides an example of a delay budget calculation.

Table 7-2. Sample Delay Budget
Delay Type	Fixed (ms)	Variable (ms)
Coder delay	18	N/A
Packetization delay	30	N/A
Queuing and buffering	N/A	8
Serialization (64 kbps)	5	N/A
Network delay (through public network)	40	25
Dejitter buffer	45	N/A
Totals	138	33
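
The totals in Table 7-2 can be reproduced, and compared with the targets discussed earlier, in a short Python sketch. The row values come from the table; treating the sum of fixed and variable delay as the worst case is a simplifying assumption, and the variable names are hypothetical.

```python
# Rows from Table 7-2: (delay type, fixed ms, variable ms). N/A is entered as 0.
BUDGET = [
    ("Coder delay", 18, 0),
    ("Packetization delay", 30, 0),
    ("Queuing and buffering", 0, 8),
    ("Serialization (64 kbps)", 5, 0),
    ("Network delay (through public network)", 40, 25),
    ("Dejitter buffer", 45, 0),
]

fixed_total = sum(row[1] for row in BUDGET)
variable_total = sum(row[2] for row in BUDGET)
# Assumption: the worst case is all fixed delay plus all variable delay.
worst_case = fixed_total + variable_total

print("fixed:", fixed_total, "ms")
print("variable:", variable_total, "ms")
print("worst case:", worst_case, "ms")
print("within 150 ms G.114 band:", worst_case <= 150)
print("within 200 ms private-network goal:", worst_case <= 200)
print("within 250 ms private-network limit:", worst_case <= 250)
```

Under that assumption, the sample budget falls outside the first G.114 band when the variable delay is at its maximum, yet it stays inside the 200 ms goal suggested for private networks.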

Packet Loss

Lost data packets, as depicted in Figure 7-3, are recoverable if the endpoints can request retransmission. However, lost voice packets are not recoverable, because the audio must be played out in real time and retransmission is not an option.

Figure 7-3. Packet Loss

Voice packets might be dropped under the following conditions:

The network is unstable (flapping links).
The network is congested.
There is too much variable delay in the network.

Packet loss causes voice clipping and skips. As a result, the listener hears gaps in the conversation. The industry-standard coder-decoder (CODEC) algorithms used in Cisco DSPs correct for 20 to 50 ms of lost voice through the use of Packet Loss Concealment (PLC) algorithms. PLC intelligently analyzes missing packets and generates a reasonable replacement packet to improve the voice quality. Cisco VoIP technology uses 20 ms samples of voice payload per VoIP packet by default. CODEC correction algorithms are effective only when a single packet is lost at any given time. If more packets are lost, the listener experiences gaps.

If a conversation experiences packet loss, the effect is immediately heard. If the talker says, "Watson, come here. I want you," the listener might hear, "Wat...., come here, ......you."
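
The difference between an isolated loss and a burst of losses can be illustrated with a rough Python sketch that inspects the sequence numbers of received packets. It is a simplification under stated assumptions: one lost 20 ms packet is treated as fully concealed, any longer run is treated as entirely audible, and sequence-number wraparound is ignored. The function names and the sample sequence are hypothetical.

```python
def loss_runs(received_seqs):
    """Return the length of each run of consecutive missing packets."""
    ordered = sorted(set(received_seqs))
    runs = []
    for prev, cur in zip(ordered, ordered[1:]):
        missing = cur - prev - 1
        if missing > 0:
            runs.append(missing)
    return runs


def audible_gap_ms(received_seqs, packet_ms=20, concealable_packets=1):
    """Total audio lost in runs that are too long to conceal (assumed model)."""
    return sum(
        run * packet_ms
        for run in loss_runs(received_seqs)
        if run > concealable_packets
    )


# Hypothetical stream: packet 3 is lost alone, packets 6 and 7 are lost together.
received = [1, 2, 4, 5, 8, 9]
print("loss runs:", loss_runs(received))
print("audible gap:", audible_gap_ms(received), "ms")
```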

Quality Metrics

Quality must be measurable in order to be manageable. Three common quality metrics are the Mean Opinion Score (MOS), the Perceptual Speech Quality Measurement (PSQM), and the Perceptual Evaluation of Speech Quality (PESQ).

MOS

MOS is a scoring system for voice quality. An MOS score is generated when listeners evaluate prerecorded sentences that are subject to varying conditions, such as compression algorithms. Listeners then assign the sentences values, based on a scale from 1 to 5, where 1 is the worst and 5 is the best. The sentence used for English language MOS testing is, "Nowadays, a chicken leg is a rare dish." This sentence is used because it contains a wide range of sounds found in human speech, such as long vowels, short vowels, hard sounds, and soft sounds.

The test scores are then averaged to a composite score. The test results are subjective because they are based on the opinions of the listeners. The tests are also relative, because a score of 3.8 from one test cannot be directly compared to a score of 3.8 from another test. Therefore, a baseline needs to be established for all tests, such as G.711, so that the scores can be normalized and compared directly.
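
The averaging step, and one simple way to express a result relative to a baseline, can be sketched in Python as follows. The listener ratings are invented for illustration, and reporting the difference from a G.711 baseline is only one assumed way to normalize scores; the text does not prescribe a specific method.

```python
def mos(ratings):
    """Average a list of listener ratings on the 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return round(sum(ratings) / len(ratings), 2)


# Hypothetical ratings from the same listening panel in the same session.
baseline_g711 = mos([4, 4, 3, 5, 4])
compressed_codec = mos([3, 4, 3, 4, 3])

print("baseline MOS:", baseline_g711)
print("codec MOS:", compressed_codec)
# Assumed normalization: report the score as an offset from the baseline.
print("offset from baseline:", round(compressed_codec - baseline_g711, 2))
```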

PSQM

PSQM is an automated method of measuring speech quality "in service," or as the speech happens. PSQM software usually resides with IP call-management systems, which are sometimes integrated into Simple Network Management Protocol (SNMP) systems.

Equipment and software that can measure PSQM are available through third-party vendors but are not implemented in Cisco devices. The PSQM measurement is made by comparing the original transmitted speech to the resulting speech at the far end of the transmission channel. PSQM systems are deployed as in-service components. The PSQM measurements are made during real conversation on the network. This automated testing algorithm has over 90 percent accuracy compared to subjective listening tests, such as MOS. Scoring is based on a scale from 0 to 6.5, where 0 is the best and 6.5 is the worst. Because it was originally designed for circuit-switched voice, PSQM does not take into account the jitter or delay problems that are experienced in packet-switched voice systems.

PESQ

MOS and PSQM are not recommended for present-day VoIP networks. Both were originally designed before the emergence of VoIP technologies and do not measure typical VoIP problems such as jitter and delay. For example, it is possible to obtain an MOS score of 3.8 on a VoIP network when the one-way delay exceeds 500 ms, because the MOS evaluator has no concept of a two-way conversation and listens only to audio quality. The one-way delay is not evaluated.

PESQ, whose operation is illustrated in Figure 7-4, was originally developed by British Telecom, Psytechnics, and KPN Research of the Netherlands. It has evolved into ITU Standard P.862, which is considered the current standard for voice quality measurement. PESQ can take into account CODEC errors, filtering errors, jitter problems, and delay problems that are typical in a VoIP network. PESQ combines the best of the PSQM method along with a method called Perceptual Analysis Measurement System (PAMS). PESQ scores range from 1 (worst) to 4.5 (best), with 3.8 considered "toll quality" (that is, acceptable quality in a traditional telephony network). PESQ is meant to measure only one aspect of voice quality. The effects of two-way communication, such as loudness loss, delay, echo, and sidetone, are not reflected in PESQ scores.

Figure 7-4. PESQ

Many equipment vendors offer PESQ measurement systems. Such systems are either stand-alone or they plug into existing network management systems. PESQ was designed to mirror the MOS measurement system. So, if a score of 3.2 is measured by PESQ, a score of 3.2 should be achieved using MOS methods.

Quality Measurement Comparison

Early quality measurement methods, such as MOS and PSQM, were designed before widespread acceptance of VoIP technology. PESQ was designed to address the shortcomings of MOS and PSQM.

MOS uses subjective testing where the average opinion of a group of test users is calculated to create the MOS score. This method is both time-consuming and expensive, and might not provide consistent results between groups of testers.

PSQM and PESQ use objective testing where an original reference file sent into the system is compared with the impaired signal that came out. This testing method provides an automated test mechanism that does not rely on human interpretation for result calculations. However, PSQM was originally designed for circuit-switched networks and does not take into account the effects of jitter and packet loss.

PESQ measures the effect of end-to-end network conditions, including CODEC processing, jitter, and packet loss. Therefore, PESQ is the preferred method of testing voice quality in an IP network. Table 7-3 offers a comparison of the various quality metrics.

Table 7-3. Voice Quality Measurement Comparison
Feature	MOS	PSQM	PESQ
Test method	Subjective	Objective	Objective
End-to-end packet loss test	Inconsistent	No	Yes
End-to-end jitter test	Inconsistent	No	Yes

Objectives of QoS

To ensure that VoIP is an acceptable replacement for standard public switched telephone network (PSTN) telephony services, customers must receive the same consistently high quality of voice transmission that they receive with basic telephone services.

Like other real-time applications, VoIP is extremely sensitive to issues related to bandwidth and delay. To ensure that VoIP transmissions are intelligible to the receiver, voice packets cannot be dropped, excessively delayed, or subjected to variations in delay (that is, jitter).

VoIP guarantees high-quality voice transmission only if the signaling and audio channel packets have priority over other kinds of network traffic. A successful VoIP deployment must provide an acceptable level of voice quality by meeting VoIP traffic requirements for issues related to bandwidth, latency, and jitter. QoS provides better, more predictable network service by performing the following:

Supporting dedicated bandwidth: Designing the network such that the necessary bandwidth is always available to support voice and data traffic
Improving loss characteristics: Designing a Frame Relay network such that discard eligibility is not a factor for frames containing voice, keeping voice below the committed information rate (CIR)
Avoiding and managing network congestion: Ensuring that the LAN and WAN infrastructure can support the volume of data traffic and voice calls
Shaping network traffic: Using Cisco traffic-shaping tools to ensure smooth and consistent delivery of frames to the WAN
Setting traffic priorities across the network: Marking the voice traffic as priority traffic and queuing it first

Cisco routers support multiple QoS mechanisms that can be leveraged to accomplish the objectives listed in the preceding bullet points. The following sections detail specific QoS mechanisms and caution against poor design characteristics.

Using QoS to Improve Voice Quality

Voice features that provide QoS are deployed at different points in the network and are designed for use with other QoS features to achieve specific goals, such as minimization of jitter and delay. Cisco IOS includes a complete set of features for delivering QoS throughout the network. Following are a few examples of Cisco IOS features that address the voice packet delivery requirements of end-to-end QoS and service differentiation:

The output queue of the router can use the following QoS mechanisms:
- Class-based weighted fair queuing (CBWFQ): Extends the standard Weighted Fair Queuing (WFQ) functionality by providing support for user-defined traffic classes. You can create a specific class for voice traffic by using CBWFQ.

- Low latency queuing (LLQ): Provides strict priority queuing in conjunction with CBWFQ. LLQ configures the priority status for a class within CBWFQ, in which voice packets receive priority over all other traffic. LLQ is considered a "best practice" by the Cisco Enterprise Solutions Engineering (ESE) group for delivering voice QoS services over a WAN.

- Weighted fair queuing (WFQ): Segregates traffic into flows and then schedules traffic to meet specified bandwidth allocation or delay bounds.

- Weighted random early detection (WRED): Provides differentiated performance characteristics for different classes of service. Specifically, WRED drops lower-priority traffic more aggressively than higher-priority traffic as an interface's output queue begins to become congested.
The WAN or WAN protocol can use the following QoS mechanisms:
- Class-based policing: Provides a rate-limiting feature for allocating bandwidth commitments and bandwidth limitations to traffic sources and destinations. At the same time, it specifies policies for handling the traffic that might exceed bandwidth allocation.

Note

Class-based policing typically replaces the rate-limiting feature previously provided by the committed access rate (CAR) feature.

- Traffic shaping: Delays excess traffic by using a buffer or queuing mechanism to hold packets and shape the flow when the data rate of the source is higher than expected.

- Frame Relay Forum 12 (FRF.12): Ensures predictability for voice traffic by providing better throughput on low-speed Frame Relay links (that is, link speeds less than 768 kbps). FRF.12 interleaves delay-sensitive voice traffic with fragments of a long frame.

- Multilink PPP (MLP): Allows large packets to be multilink encapsulated, fragmented, and interleaved so that they are small enough to satisfy the delay requirements of real-time traffic.
VoIP traffic can use the following QoS mechanisms:
- Compressed Real-Time Transport Protocol (cRTP): The Real-Time Transport Protocol (RTP) is a protocol for the transport of real-time traffic, including voice. RTP uses extensive headers that incorporate time stamps for individual packets. The cRTP feature compresses the extensive RTP header. The result is decreased consumption of available bandwidth for voice traffic and a corresponding reduction in delay.

- Resource Reservation Protocol (RSVP): RSVP supports the reservation of resources across an IP network, allowing end systems to request QoS guarantees from the network. For networks that support VoIP, RSVP, in conjunction with features that provide queuing, traffic shaping, and voice call signaling, provides Call Admission Control (CAC) for voice traffic.
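
The bandwidth saving that cRTP offers can be estimated with simple arithmetic. The Python sketch below uses commonly cited figures that do not appear in this chapter and should be read as assumptions: a 20-byte G.729 payload and a 160-byte G.711 payload per 20 ms packet, 40 bytes of combined IP, UDP, and RTP headers, and a 2-byte compressed header. Layer 2 overhead is left out, and the function name is hypothetical.

```python
def voice_bandwidth_kbps(payload_bytes, header_bytes, packets_per_second=50):
    """Per-call bandwidth in one direction, excluding Layer 2 overhead."""
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000


IP_UDP_RTP_HEADER = 40   # 20 (IP) + 8 (UDP) + 12 (RTP) bytes
CRTP_HEADER = 2          # assumed compressed size without UDP checksums

g729_plain = voice_bandwidth_kbps(20, IP_UDP_RTP_HEADER)
g729_crtp = voice_bandwidth_kbps(20, CRTP_HEADER)
g711_plain = voice_bandwidth_kbps(160, IP_UDP_RTP_HEADER)
g711_crtp = voice_bandwidth_kbps(160, CRTP_HEADER)

print("G.729 without cRTP:", g729_plain, "kbps")
print("G.729 with cRTP:   ", g729_crtp, "kbps")
print("G.711 without cRTP:", g711_plain, "kbps")
print("G.711 with cRTP:   ", g711_crtp, "kbps")
```

Under these assumptions the saving is largest for low-bit-rate codecs, where the headers are larger than the voice payload itself.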

Recognizing Common Design Faults

Successful implementations of delay-sensitive applications such as VoIP require a network that is carefully engineered with QoS from end to end. Fine-tuning the network to adequately support VoIP involves a series of protocols and features geared toward improving voice quality.

QoS is the ability of a network to provide better service levels to selected network traffic over various underlying technologies. However, QoS is not inherent in a network infrastructure. Instead, QoS is implemented by strategically enabling appropriate QoS features throughout the network.

Poor design is characterized by the following issues:

Ignoring Layer 2 QoS requirements: QoS technologies such as priority Layer 2 congestion management, FRF.12, Link Fragmentation and Interleaving (LFI), and traffic shaping must be correctly configured.
Ignoring other QoS requirements: QoS technologies such as LLQ, RTP, congestion management, and congestion avoidance must be enabled.
Ignoring bandwidth considerations: Planning for the total number of calls and their effect on data bandwidth is critical to all users of the network.
Simply adding VoIP to an existing IP network: When considering VoIP, network administrators might need to insist on a complete network redesign for a comprehensive end-to-end solution.
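
Bandwidth planning for voice can start with a rough capacity check like the Python sketch below. Every number in it is a hypothetical planning input rather than a value from this chapter: a per-call rate of 24 kbps, a 768-kbps WAN link, and a policy that reserves at most one-third of the link for voice. The function name is also hypothetical.

```python
def max_voice_calls(link_kbps, per_call_kbps, voice_fraction=0.33):
    """Number of simultaneous calls that fit in the voice share of a link."""
    voice_budget_kbps = link_kbps * voice_fraction
    return int(voice_budget_kbps // per_call_kbps)


link_kbps = 768        # hypothetical WAN link speed
per_call_kbps = 24     # hypothetical per-call rate, one direction

calls = max_voice_calls(link_kbps, per_call_kbps)
left_for_data = link_kbps - calls * per_call_kbps

print("simultaneous calls supported:", calls)
print("bandwidth left for data:", left_for_data, "kbps")
```

Running the same check for the busiest expected hour shows early whether call volume would squeeze data traffic or whether calls must be limited by admission control.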

Many people believe that the fastest way to fix network performance is simply to add a lot of bandwidth. That approach might work well in certain situations like campus networks, in which upgrading from 10 Mbps to 100 Mbps or even 1 Gbps links might be possible. However, it is not always feasible to add bandwidth in a WAN. Upgrading a WAN circuit from 56 kbps to T1 might be cost prohibitive and might not be possible for certain locations on the network.

To provide effective performance in a voice network, you should configure QoS throughout the network, not just on the devices running VoIP. Not all QoS techniques are appropriate for all network routers. Edge routers and backbone routers in a network do not necessarily perform the same operations. The QoS tasks these routers perform might differ as well. To configure an IP network for real-time voice traffic, you should consider the functions of both edge and backbone routers and select the appropriate QoS tools accordingly.