Solution: Voice Codecs Designed for VoIP, Especially VoIP over 802.11 | Wi-Fi Handbook : Building 802.11b Wireless Networks

Many of the impediments to good speech quality in VoIP over 802.11 can be overcome by engineering fixes into the speech codecs used in both circuit- and packet-switched telephony. The following pages describe speech coding and how it affects speech quality.

Speech Coding

Voice or speech compression, as performed by speech codecs, is a crucial factor affecting speech quality, and thereby QoS, in IP telephony. The speech codecs are deployed at the endpoints (gateways, IP phones, or PCs) and therefore determine the achievable end-to-end quality. The speech encoder converts the speech signal, once it has been digitized by analog-to-digital conversion, to a bitstream. This bitstream is packetized and transported over the IP network. The speech decoder reconstructs the speech signal from the bits in the received packets. The reconstructed speech signal is an approximation of the original signal. A speech codec has several attributes, and the most important ones are outlined in Table 6-1.

Table 6-1: Popular voice codecs developed for circuit-switched telephony
ITU Standard	Description
P.800	Subjective rating system to determine the MOS or the quality of telephone connections
G.114	Maximum one-way delay end-to-end for a VoIP call (150 ms)
G.165	Echo cancellers
G.168	Digital network echo cancellers
G.711	PCM of voice frequencies
G.722	7 kHz audio coding within 64 Kbps
G.723.1	Dual-rate speech coder for multimedia communications transmitting at 5.3 and 6.3 Kbps
G.729	Coding for speech at 8 Kbps using conjugate structure algebraic code excited linear prediction (CS-ACELP)
G.729A	Annex A reduced complexity 8 Kbps CS-ACELP speech codec
H.323	Packet-based multimedia communications system
P.861	Specifies a model to map actual audio signals to their representations inside the human head
Q.931	Digital Subscriber Signaling System (DSS) No. 1 Integrated Services Digital Network (ISDN) User-Network Interface Layer 3 Specification for Basic Call Control

Modifying Voice Codecs to Improve Voice Quality

One of the first processes in the transmission of a telephone call is the conversion of an analog signal (the wave of the voice entering the telephone) into a digital signal. This process is called pulse code modulation (PCM). This is a four-step process consisting of pulse amplitude modulation (PAM) sampling, companding, quantization, and encoding. Encoding is a critical process in VoIP and Vo802.11. To date, voice codecs used in VoIP (packet switching) are taken directly from PSTN technologies (circuit switching). Cell phone technologies use PSTN voice codecs. New software in the Vo802.11 industry utilizes modified PSTN codecs to deliver voice quality comparable to the PSTN.

Encoding

The final step in the PCM process used in circuit switching (as opposed to packet switching in IP networks) is encoding the voice signal. This is performed by a codec, of which three types exist: waveform codecs, source codecs (also known as vocoders), and hybrid codecs.

Waveform codecs sample and code an incoming analog signal without regard to how the signal was generated. Quantized values of the samples are then transmitted to the destination, where an approximation of the original signal is reconstructed. Waveform codecs are known for their simplicity and high-quality output. Their disadvantage is that they consume considerably more bandwidth than the other codec types. When waveform codecs are used at a low bandwidth, speech quality degrades markedly.

Source codecs match an incoming signal to a mathematical model of how speech is produced. They use the linear predictive filter model of the vocal tract, with a voiced/unvoiced flag to represent the excitation that is applied to the filter. The filter represents the vocal tract, and the voiced/unvoiced flag indicates whether a voiced or unvoiced input is received from the vocal cords. The information transmitted is a set of model parameters as opposed to the signal itself. The receiver, using the same modeling technique in reverse, reconstructs an analog signal from the values received.
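The decoder side of this model can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the algorithm of any standardized vocoder: the function name synthesize, the pitch_period parameter, and the filter coefficients are all hypothetical, and the excitation is reduced to an impulse train (voiced) or uniform noise (unvoiced).

```python
import random

def synthesize(coeffs, voiced, n_samples, pitch_period=80, seed=0):
    """Toy source-codec decoder: drive an all-pole filter with an excitation.

    coeffs       -- linear prediction coefficients a_1..a_p (illustrative values)
    voiced       -- the voiced/unvoiced flag received from the encoder
    pitch_period -- spacing of voiced impulses in samples (assumed parameter)
    """
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        # Excitation: periodic impulses if voiced, noise if unvoiced.
        if voiced:
            e = 1.0 if n % pitch_period == 0 else 0.0
        else:
            e = rng.uniform(-1.0, 1.0)
        # All-pole (vocal tract) filter: s[n] = e[n] + sum_k a_k * s[n-k]
        s = e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                s += a * out[n - k]
        out.append(s)
    return out

# Only the coefficients and the flag would need to be transmitted,
# not the samples themselves.
voiced_frame = synthesize([0.5], voiced=True, n_samples=160)
unvoiced_frame = synthesize([0.5], voiced=False, n_samples=160)
```

The point of the sketch is the bandwidth argument in the text: a frame of 160 samples is regenerated from a handful of parameters, which is why source codecs reach low bit rates and also why the result sounds synthetic.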

Source codecs operate at low bit rates and reproduce a synthetic-sounding voice. Using higher bit rates with source codecs does not result in improved voice quality. Vocoders (source codecs) are most widely used in private and military applications.

Hybrid codecs are deployed in an attempt to derive the benefits of both technologies. They perform some degree of waveform matching while mimicking the architecture of human speech. Hybrid codecs provide better voice quality at a low bandwidth than waveform codecs. The following section examines the popular speech codecs.

G.711

G.711 is the best known coding technique in use today. It is the coding technique used in circuit-switched telephone networks all over the world. G.711 has a sampling rate of 8,000 Hz. If uniform quantization were used, the signal levels commonly found in speech would require at least 12 bits per sample, giving a bit rate of 96 Kbps. Instead, nonuniform quantization is used, with eight bits representing each sample. This quantization leads to the well-known 64 Kbps DS0 rate. G.711 is often referred to as PCM.

G.711 has two variants: A-law and mu-law. Mu-law is used in North America and Japan, where T-carrier systems prevail. A-law is used everywhere else in the world. The difference between the two is the way nonuniform quantization is performed. Both are symmetrical about zero. Both A-law and mu-law offer good voice quality with an MOS of 4.3. Despite being the predominant codec in the industry, G.711 suffers one significant drawback: it consumes 64 Kbps of bandwidth. Carriers seek to deliver comparable voice quality using less bandwidth, thus saving on operating costs.
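The effect of nonuniform quantization can be illustrated with the continuous mu-law companding curve. The sketch below is an approximation for illustration only: the G.711 standard uses a piecewise-linear (segmented) version of this curve and its own code-word layout, neither of which is reproduced here, and the helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8000
MU = 255  # mu value associated with 8-bit mu-law companding

def mu_law_compress(x, mu=MU):
    """Compress a sample in [-1, 1]; small amplitudes get more resolution."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def encode_8bit(x):
    """Compress, then quantize uniformly to one of 256 levels (one byte)."""
    y = mu_law_compress(x)
    return min(255, int((y + 1.0) / 2.0 * 256))

def decode_8bit(code):
    """Map a byte back to the middle of its quantization interval."""
    y = (code + 0.5) / 256 * 2.0 - 1.0
    return mu_law_expand(y)

# Bit rates quoted in the text: bits per sample times samples per second.
uniform_bps = 12 * SAMPLE_RATE_HZ    # 96,000 bps with 12-bit uniform PCM
companded_bps = 8 * SAMPLE_RATE_HZ   # 64,000 bps, the DS0 rate

quiet = decode_8bit(encode_8bit(0.01))  # a low-level sample survives 8-bit coding
```

Because the curve is odd-symmetric, positive and negative samples are treated alike, which is the symmetry about zero mentioned above.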

G.723.1 ACELP

G.723.1 ACELP can operate at either 6.3 or 5.3 Kbps, with the 6.3 Kbps rate providing higher voice quality. Both bit rates are supported by the coder and decoder, and the transition between the two can be made during a conversation. The coder takes a band-limited input speech signal that is sampled at 8,000 Hz and undergoes uniform PCM quantization, resulting in a 16-bit PCM signal. The encoder then operates on blocks or frames of 240 samples at a time. Each frame corresponds to 30 ms of speech, which means that the coder causes a delay of 30 ms. Including a look-ahead delay of 7.5 ms gives a total algorithmic delay of 37.5 ms.

G.723.1 gives an MOS of 3.8 in a circuit-switched application, which is highly advantageous with regard to the bandwidth used. The one-way delay of 37.5 ms presents an impediment to good quality, but the final delay is determined by the round-trip delay across the various parts of the network and not necessarily by the codec used.

G.729

G.729 is a speech coder that operates at 8 Kbps. This coder uses input frames of 10 ms, corresponding to 80 samples at a sampling rate of 8,000 Hz. It includes a 5 ms look-ahead, resulting in an algorithmic delay of 15 ms (considerably better than G.723.1). Each 10 ms frame is coded into 80 bits, which gives the transmitted bit rate of 8 Kbps. Given that it turns in an MOS of 4.0, G.729 is perhaps the best tradeoff in bandwidth for voice quality.
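The frame and delay figures quoted for G.723.1 and G.729 follow from simple arithmetic on the frame size, sampling rate, and look-ahead. A minimal sketch, with hypothetical helper names and only the numbers given in the text:

```python
SAMPLE_RATE_HZ = 8000

def frame_ms(samples_per_frame, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of one codec frame in milliseconds."""
    return 1000.0 * samples_per_frame / sample_rate_hz

def algorithmic_delay_ms(samples_per_frame, lookahead_ms):
    """Frame duration plus look-ahead; excludes processing and network delay."""
    return frame_ms(samples_per_frame) + lookahead_ms

def bits_per_frame(bit_rate_bps, samples_per_frame):
    """Coded bits produced for each frame at the given bit rate."""
    return bit_rate_bps * samples_per_frame / SAMPLE_RATE_HZ

g7231_delay = algorithmic_delay_ms(240, 7.5)  # 30 ms frame + 7.5 ms look-ahead
g729_delay = algorithmic_delay_ms(80, 5.0)    # 10 ms frame + 5 ms look-ahead
g729_bits = bits_per_frame(8000, 80)          # 8 Kbps over a 10 ms frame
```

These are algorithmic delays only; the delay a caller experiences also includes packetization, jitter buffering, and transit across the network.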

The previous paragraphs provide an overview of the multiple means of maximizing the efficiency of transport via the PSTN. What we find today is that time-division multiplexing (TDM) is synonymous with circuit switching. Telecommunications engineers use the term TDM to describe a circuit-switched solution. The standard in use on the PSTN is 64 Kbps.

The codecs described in the previous pages apply to VoIP as well. VoIP engineers seeking to squeeze more conversations over valuable bandwidth have found these codecs very valuable in compressing VoIP conversations over an IP circuit.^[12]

The construction of the IP packets to be transmitted determines both the bandwidth and the delay. Smaller packets, in terms of the number of speech bits, are less efficient, since the overhead for header information is the same as for larger packets. Conversely, the delay increases when several speech-coded frames are sent in the same packet. Table 6-2 shows the actual bandwidth load on the network, including header information. The numbers are based on the common situation where IP, UDP, and RTP (40 bytes per packet) are used. All numbers are based on 20 ms packets, except for G.723.1, where the 30 ms frame size requires 30 ms packets. The delay figure given includes encoding delay but not processing delay or packet-assembling delay. Note that with voice activity detection and silence compression, the bandwidth load is reduced by approximately 50 percent. For the low-rate codecs, headers account for the major part of the bandwidth.
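The header overhead described above can be worked out directly. The following sketch assumes only what the text states (40 bytes of IP, UDP, and RTP headers per packet, 20 ms or 30 ms of speech per packet); it ignores 802.11 or other link-layer framing, and the function names are hypothetical.

```python
HEADER_BYTES = 40  # IP + UDP + RTP, as stated in the text

def ip_bandwidth_kbps(codec_kbps, packet_ms, header_bytes=HEADER_BYTES):
    """Network load in Kbps for one voice stream, headers included."""
    payload_bits = codec_kbps * packet_ms  # Kbps * ms = bits per packet
    header_bits = header_bytes * 8
    return (payload_bits + header_bits) / packet_ms

def header_share(codec_kbps, packet_ms, header_bytes=HEADER_BYTES):
    """Fraction of each packet that is header rather than speech."""
    payload_bits = codec_kbps * packet_ms
    header_bits = header_bytes * 8
    return header_bits / (payload_bits + header_bits)

g711_load = ip_bandwidth_kbps(64, 20)    # 64 Kbps codec, 20 ms packets
g729_load = ip_bandwidth_kbps(8, 20)     # 8 Kbps codec, 20 ms packets
g7231_load = ip_bandwidth_kbps(6.3, 30)  # 6.3 Kbps codec, 30 ms packets
```

Under these assumptions, G.711 loads the network with 80 Kbps and G.729 with 24 Kbps, and two-thirds of every G.729 packet is header. These are derived figures, not values taken from the table.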

Table 6-2: MOS scores of speech codecs
Standard	Data Rate (Kbps)	Delay (ms)	MOS
G.711 G.721 G.723	64	0.125	4.8
G.726	16, 24, 32, and 40	0.125	4.2
G.728	16	2.5	4.2
G.729	8	10	4.2
G.723.1	5.3 and 6.3	30	3.5 and 3.98

Circuit-Switched Speech Coding in IP Telephony

The most commonly used codecs for IP telephony today are G.711, G.729, and G.723.1 (at 6.3 Kbps). All these codecs were designed for, or based on technology designed for, circuit-switched telephony. Mobile telephony has been the major driver for the development of speech-coding technology in recent years. All the coders used in mobile telephony, as well as G.729 and G.723.1, are based on the CELP paradigm. These codecs are designed for use in circuit-switched networks and do not work well in packet-switched networks, as their design is focused on handling bit errors rather than packet losses.

The important points regarding G.711 as a Vo802.11 codec are that the coder was designed for circuit-switched telephony and that it does not include any means to counter packet loss. Zeros are commonly inserted when packet loss occurs, leading to a disrupted voice stream (coming in "broken") and a steep degradation of quality with increasing packet losses.

It is possible to introduce error concealment by extrapolating and interpolating received speech segments, which improves quality. An example is the new Annex I to G.711 called G.711 PLC, which does not always work well and does not guarantee robust operation.
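The difference between zero insertion and a very simple form of concealment can be sketched as follows. This is an assumption-laden illustration, not the G.711 PLC algorithm: the only concealment shown is repeating the last good frame, which is the crudest form of extrapolation, and the function name conceal and its strategy argument are hypothetical.

```python
def conceal(frames, frame_len, strategy="zeros"):
    """Rebuild a sample stream from received frames.

    frames    -- list of frames (lists of samples); None marks a lost packet
    frame_len -- number of samples per frame
    strategy  -- "zeros" inserts silence, "repeat" replays the last good frame
    """
    out = []
    last_good = [0] * frame_len  # nothing to repeat before the first frame
    for frame in frames:
        if frame is None:
            if strategy == "zeros":
                fill = [0] * frame_len
            elif strategy == "repeat":
                fill = list(last_good)
            else:
                raise ValueError("unknown strategy: " + strategy)
            out.extend(fill)
        else:
            out.extend(frame)
            last_good = frame
    return out

received = [[1, 2], None, [3, 4]]  # the middle packet was lost
with_zeros = conceal(received, 2, "zeros")
with_repeat = conceal(received, 2, "repeat")
```

Zero insertion leaves an audible gap in the stream, while repetition keeps the signal continuous at the cost of replaying stale audio; practical concealment schemes go further by smoothing and attenuating the substituted segment.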

G.729 and G.723.1 belong to a different class of coders compared to G.711. Many important points must be made regarding G.729 and G.723.1 (as well as other CELP coders). The coding paradigm used in these coders has been developed for circuit-switched and mobile telephony. The basic speech quality is worse than PSTN quality (that is, they have mobile telephony quality) and the coding process is based on interframe dependencies leading to interpacket dependencies. Packet loss performance is also very poor because of error propagation resulting from interpacket dependencies and speech quality degrades rapidly with increasing packet losses. The coders have built-in heuristic error concealment methods and they also suffer from interframe dependencies (for some coders, more frames than the lost one need error concealment). The coders also produce an inflexible bitstream and the packet size is restricted to an integer number of frames, which reduces flexibility.

^[12]Report to Congress on Universal Service, CC Docket No. 96-45, a white paper on IP voice services, March 18, 1998.