6.2. Voice Channels
A VoIP softswitch has two main functions: call management (or switching), which is covered in the next chapter, and voice transmission.
Voice transmission (the packaging, transmittal, receipt, and reconstruction of digitized voice data) occurs inside virtualized pathways across the TCP/IP network. Many softPBX systems, Asterisk included, call these pathways channels. The word channel means different things to different vendors, though; keep that in mind as you read VoIP documentation. It also means different things at different layers: the RTP protocol has media channels, which are streams of sound or video data, while a path across the network for call signaling is also sometimes called a channel.
For this chapter, we'll give the word channel a wide definition: the complete virtualized transport that carries the mouth-to-ear analog signal over a great distance using networked software.
There are several steps in the process of transmitting voice sounds over a channel: sampling, digitizing, encoding, transport, decoding, and playback. Usually, each step occurs once per packet of voice data. Complex applications like conference calling, surveillance, or overhead paging may handle these steps in unique ways, but for this chapter, we'll concentrate on how the process works for a standard, two-party, point-to-point phone call.
6.2.1. Sampling and Digitizing
Digital-to-analog conversion (DAC) and analog-to-digital conversion (ADC) are the processes that convert sound from the format in which it is heard (analog sound waves) into the format that VoIP uses to carry it (digital streams) and back again. These processes are necessary in order for inherently analog devices, namely human ears, to use digital sound signals. In the world of traditional telephony, the process is fairly simple, because variations in DAC techniques are driven by the requirements of different data links and devices and by regional standards variations.
The DAC processes employed in Voice over IP aren't tied to the data link, so they can vary greatly: different DAC, digitizing, and compression techniques are used in different circumstances. Sometimes the data link's properties, like bandwidth capacity and latency, are factors in the selection of these techniques, but not always. DAC is required in all telephony environments, even where VoIP isn't used, because just about every traditional telephony system employs digital carriers and analog sound reproduction devices like speakers and transducers .
DAC involves quantizing, or digital "sampling," of sounds; filtering for bandwidth preservation; and signal compression for bandwidth efficiency. Pulse code modulation (PCM) is the most common sampling technique used to turn audible sounds into digital signals. We'll deal with DAC subjects in greater detail later.
6.2.1.1. The 64 kbps channel
To connect a phone call, a traditional telephone, whether analog or digital, requires a loop with enough quality for 64 kilobits per second of digital throughput. In fact, 64 kbps is the fixed line speed of any POTS line. Analog and (most) digital telephone systems offer similar sound clarity because they operate at the same sampling frequency, 8,000 Hz. This frequency, when combined with a sampling resolution of 8 bits, requires 64 kbps of bandwidth, hence the 64 kbps line speed.
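The arithmetic behind the 64 kbps figure is simple enough to check directly. This quick sketch (plain Python, no VoIP libraries) just multiplies the PCM sampling rate by the sample resolution:

```python
# G.711/PCM parameters used by POTS lines and most digital telephony.
SAMPLE_RATE_HZ = 8000   # samples taken per second
SAMPLE_BITS = 8         # resolution of each sample

# Bandwidth needed for one uncompressed voice channel.
channel_bps = SAMPLE_RATE_HZ * SAMPLE_BITS
print(channel_bps)  # 64000 bits per second, i.e., the 64 kbps channel
```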
It should suffice for now that each concurrent voice conversation, analog or digital, requires a link capable of a speed, or bandwidth,[*] of 64 kbps. As you'll discover, the 64 kbps channel is a baseline unit for dealing with sizing issues on your VoIP network. It's important to look at bandwidth conservation methods in relationship to 64 kbps, the "standard unit" of voice bandwidth.
6.2.2. Framing
Framing is the real-time process of dividing a stream of digital sound information into manageable, equal-sized hunks for transport over the network. Consider Figure 6-1, which shows a representation of 60 milliseconds of digitized sound. It's divided into three frames, each 20 milliseconds in duration. At this rate, it takes 50 frames to represent 1 second of digitized sound.
Figure 6-1. Framing is the process of dividing a digital stream into equal-sized hunks
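To make the framing idea concrete, here's a small sketch that divides 60 ms of 8-bit PCM (one byte per sample at 8,000 Hz) into the three 20 ms frames shown in Figure 6-1:

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20

# 60 ms of digitized sound: at 8,000 Hz and 8 bits, one byte per sample.
stream = bytes(480)  # 0.060 s * 8000 samples/s = 480 samples

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples per frame
frames = [stream[i:i + samples_per_frame]
          for i in range(0, len(stream), samples_per_frame)]

print(len(frames))        # 3 frames, as in Figure 6-1
print(1000 // FRAME_MS)   # 50 frames per second of audio
```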
6.2.2.1. Digital versus packet-based
Unlike the sound signals on an analog phone line, those transmitted and received by an IP endpoint are digital. This makes them more akin to the signals carried over a traditional voice T1 or ISDN circuit. But unlike a digital phone company circuit, VoIP calls are also packet based. This means that the sound frames are carried across the network in units that are also used to carry other kinds of data; in VoIP's case, UDP datagrams.
The PSTN offers a way of providing far higher call capacity than a single POTS line: 24 simultaneous calls using two pairs of wire, rather than one pair per simultaneous call. This high-density technology is called T1. It's often used to provide links between PBX systems. For instance, one could use a T1 circuit to link PBXs in separate buildings at disparate locations. T1 provides a far more economical way of allowing many simultaneous calls between users at opposing locations than does POTS. The technique is called multiplexing. Even denser multiplexed voice circuits can carry more voice channels: DS3, which supports 672 individual channels, and OC (optical carrier) circuits are also used to multiplex and link between switches. These circuits tend to be quite expensive. In voice applications, DS3s and higher are most likely to show up in call center environments or as trunks between PSTN switches. In data applications, DS3s and OC circuits are often used by ISPs and application service providers that need very high-capacity Internet connectivity.
VoIP provides an even more economical way of linking those PBXs together. If 100 calls are to occur at the same time using a PBX, then roughly 6.4 mbps of composite bandwidth is required (100 calls at 64 kbps each). This would require five T1s. But VoIP encoding techniques allow for significant compression of the sound sample used to represent the spoken voice on the network, so that far fewer physical links are required in this instance.
It's possible to reduce a 64 kbps voice call down to 44 kbps without a noticeable reduction in sound quality, a feat that, setting aside the concept of overhead (which we'll cover later), is quite common with VoIP compression methods. Now, that link between PBXs uses only about 4.4 mbps, and needs only three T1 circuits instead of five, resulting in a much cheaper trunk.
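The T1 counts in the last two paragraphs can be verified with a few lines of Python. This sketch uses the nominal 1.544 mbps T1 line rate and ignores framing overhead inside the T1 itself:

```python
from math import ceil

T1_BPS = 1_544_000          # nominal T1 line rate
CALLS = 100                 # simultaneous calls between the two PBXs

def t1s_needed(per_call_bps: int) -> int:
    """Smallest number of T1 circuits that can carry the composite load."""
    return ceil(CALLS * per_call_bps / T1_BPS)

print(t1s_needed(64_000))   # uncompressed: 6.4 mbps -> 5 T1s
print(t1s_needed(44_000))   # compressed:   4.4 mbps -> 3 T1s
```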
6.2.3. Codecs
The algorithms VoIP uses to encode sound data, and sometimes to decrease bandwidth requirements, are called codecs. In order to get three T1 circuits to do the work of five in a voice application, bandwidth-conserving codecs are used.
Codecs, short for coder/decoders, are algorithms for packaging multimedia data in order to stream it, or transport it in real time, over the network. There are dozens of codecs for audio and video. We'll be talking about audio codecs, since they are the most common on VoIP networks.
Most of the codecs in use on VoIP networks were defined by ITU-T recommendations of the G series (transmission systems and media). A few are well suited to very high fidelity applications like music streaming, but most are suitable only for the spoken word. Those are the ones we'll be concentrating on.
Telephony audio codecs break down into two groups: PCM codecs, which are the basic 64 kbps codecs based on pulse code modulation, and vocoders, which go a step beyond the essential PCM algorithm by restructuring the digital representation of PCM into a more portable format. Here are the codecs you'll see most often:
Each of the codecs has pros and cons. G.711 is great on data links where there's plenty of capacity and very little latency, like Ethernet. It's also highly resilient to errors. But you wouldn't want to use it on a 56 kbps frame relay link, because there would not be enough bandwidth. Conversely, the codecs that provide compression do so at a loss, or degradation in quality, of the sound. That's why some call them "lossy" codecs.
6.2.3.1. Codec packet rates
Besides the bits that represent sound data, all data packets carry bits used for routing and sometimes for error correction. These "overhead bits" have no direct benefit to voice applications, other than allowing the lower layers to function: things like Ethernet headers, IP routing headers, and other information necessary for transport of the packet. When longer durations of sound are carried by each packet, these overhead items don't have to be transmitted as often, because fewer packets are required to transport the same sound. The net result of decreasing overhead is that the application uses the network more efficiently. Reducing overhead is crucial, just as it is in a business plan, because overhead, while necessary, provides no direct benefit to the application (or to the business).
The knee-jerk way to lower overhead in a VoIP network is to reduce the number of packets per second used to transmit the sound. But this increases the impact of network errors on the voice call. So there needs to be some balance between what's acceptable overhead and what's acceptable resiliency to errors. This is where a diversity of available codecs can help. Different codecs have different packet rates and overhead ratios, which gives VoIP system builders a way to fine-tune their network's voice bandwidth economy.
The packet rate is the number of packets per second (pps) required to transmit the sound. Again, different audio codecs use different rates. The gap between transmitted packets is called the packet interval, and it is inversely proportional to the packet rate: the shorter the packet interval, the more packets are required per second. Some codecs, especially those that use very advanced CELP algorithms, require a longer duration of audio at a time (say, 30 ms rather than 20 ms) in order to encode and decode. The packet interval has the most obvious effect on overhead: the shorter it is, the more overhead is required to transmit the sound, and the longer it is, the less overhead is required.
But with longer packet intervals comes increased lag (see Figure 6-2). The longer the interval, the longer the lag between the time the sound is spoken and the time it is encoded, transported, decoded, and played back for the listener. An IP packet isn't transmitted until it is completely constructed, so a VoIP sound frame can't travel across the network until it's completely encoded. A 30 ms sound frame takes half again as long to fill as a 20 ms one, inflicting 10 ms more lag. As with all networked apps, lag is bad. It's especially bad in VoIP.
Figure 6-2. Longer packet intervals cause lag, but decrease overhead
Long packet intervals have another drawback: the greater the duration of sound carried by each packet, the greater the chance that a listener will notice a negative effect on the sound if a packet is dropped due to congestion or a network error. Dropping a packet carrying 20 ms of sound is almost imperceptible with the G.711 codec, but dropping a 60 ms packet is quite obtrusive. Since VoIP sound frames are carried in "unreliable" UDP datagrams, dropped packets aren't retransmitted. Even if TCP packets were used instead of UDP, error awareness and retransmission would take so long that, by the time the retransmitted packet arrived at the receiving phone, it would be hopelessly out of sequence.
Consider that 8,000 samples per second are required for a basic voice signal at 8 bits per sample. Now, assuming a 20 ms packet interval (1/50th of a second), you can see that it takes a minimum of 1,280 bits of G.711 data in each packet to adequately carry the sound:
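In code form, restating the figures just given (64 kbps of PCM sliced into 20 ms packets):

```python
SAMPLE_RATE_HZ = 8000
SAMPLE_BITS = 8
INTERVAL_MS = 20                         # 20 ms packet interval

# Payload bits per packet: PCM bit rate times the packet interval.
bits_per_packet = SAMPLE_RATE_HZ * SAMPLE_BITS * INTERVAL_MS // 1000
packets_per_second = round(1000 / INTERVAL_MS)

print(bits_per_packet)      # 1280 bits of G.711 payload in each packet
print(packets_per_second)   # 50 packets per second
```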
Mathematically, increasing the sound data in each packet means a reduction in packet overhead. Figure 6-3 illustrates a very simplified cross-section of a VoIP packet carrying 20 ms of G.711 data.
Figure 6-3. TCP/IP adds overhead to a VoIP channel; this IP packet carries 20 ms of sound

Following the previous example, increasing the packet interval to 30 ms (1/33rd of a second) reduces the number of packets required per second, raising the bit count per packet and reducing the amount of overhead required to transmit the sound:
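Repeating the same arithmetic at a 30 ms interval shows the shift: fewer packets per second, with more payload bits in each one:

```python
SAMPLE_RATE_HZ = 8000
SAMPLE_BITS = 8
INTERVAL_MS = 30                         # 30 ms packet interval

bits_per_packet = SAMPLE_RATE_HZ * SAMPLE_BITS * INTERVAL_MS // 1000
packets_per_second = round(1000 / INTERVAL_MS)

print(bits_per_packet)      # 1920 bits of G.711 payload in each packet
print(packets_per_second)   # roughly 33 packets per second
```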
Generally, on Ethernet-to-Ethernet calls, the use of G.711 with a 20 ms packet interval is encouraged, because a 100 mbps data link can support hundreds of simultaneous 64 kbps calls without congestion, and a dropped packet at a 20 ms interval is almost imperceptible.
On calls that cross low-bandwidth links, it's up to the administrator to balance between latency, possible reductions in sound quality incurred by using a compression codec, and network congestion.
Up until this point, we've been talking about the overhead of each packet merely as it relates to the amount of voice payload it carries, and with good reason: codec selection and framing are the things over which you, as an administrator, have the most control.
But packet overhead is affected by the network and data link layers, too. Ethernet frames have a different size and different overhead than ATM cells or frame relay frames. Network overhead is addressed in the following section.
Different codecs have different bandwidth requirements. Table 6-1 shows the characteristics of the most popular VoIP codecs.
Table 6-1. VoIP codec characteristics
6.2.3.2. The T1 carrier versus VoIP
The T1's 24 DS0 channels are each never-ending streams of digitized voice information. In reality, though, the T1 circuit itself is one big stream of binary digits that uses TDM (time-division multiplexing) to divide the T1 into those 24 DS0 channels. Each is assigned a time slice of the big stream, and each time slice is further divided into frames, as shown in Figure 6-1. A voice T1 uses the same amount of bandwidth no matter how many calls are in progress: roughly 1.54 mbps. Trunking with T1s is very stable and predictable as a result.
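The 1.54 mbps figure is just the sum of the 24 multiplexed DS0s plus the T1's own 8 kbps framing channel, a relationship that's easy to confirm:

```python
DS0_BPS = 64_000       # one TDM voice channel
CHANNELS = 24          # DS0s multiplexed into a T1
FRAMING_BPS = 8_000    # T1 framing overhead

t1_bps = CHANNELS * DS0_BPS + FRAMING_BPS
print(t1_bps)          # 1544000 -> roughly 1.54 mbps, calls or no calls
```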
VoIP frees the system builder from requirements traditionally imposed on the lower OSI layers of the network. In a T1, the transport and data link layers are defined together as a bundled carrier, and you have to use the G.711 PCM codec on all the channels, yielding 24 simultaneous voice channels in the available bandwidth. VoIP lets you pick and choose the codec, packet interval, and transport technologies you want and thus gives you ultimate control. Using the G.729A codec and a T1, you could conceivably trunk hundreds of calls at once.
Unlike G.711 traffic on a T1, VoIP's "carrier" is TCP/IP. So VoIP can traverse Ethernet, T1s, DSL lines, cable internet lines, POTS lines, frame relay networks, virtual private networks (VPNs), microwave radio, satellite connections, ATM, and just about any other link. If IP can go there, VoIP can go there, just with varying levels of quality.
6.2.3.3. Voice packet structure
The layered appearance of a VoIP packet is similar to that of other types of networked applications that run within the TCP/IP protocol: the lower layers encapsulate the higher layers recursively.
The lowest layer, shown leftmost in Figure 6-3, is the Internet Protocol (IP) packet header. It contains routing information so that the packet can be handled correctly by the devices responsible for carrying it across the network. It also contains a flag that indicates which protocol of the TCP/IP suite this packet is carrying: TCP, UDP, or something else. Voice packets are almost always UDP. Among other things, the IP header may include a Type of Service flag that allows routers and switches to treat it with a certain priority based on its sensitivity to delay. At a minimum, the IP packet header is 160 bits in length.
The payload of the IP packet is the UDP datagram, whose header is 64 bits long. Its first 32 bits contain the source and destination ports of the traffic it carries in its payload (16 bits each), followed by 16 bits describing the length of the datagram in bytes and 16 bits for an optional checksum.
The payload of the UDP packet is the RTP packet, whose header is 96 bits long. It contains information about the sequence and timing of the packet within the greater data stream.
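Summed up, those three headers (IP, UDP, RTP) are the fixed per-packet cost of the TCP/IP "carrier." A quick sketch using the header sizes just given, and the 1,280-bit G.711 payload from the earlier calculation:

```python
IP_HEADER_BITS = 160    # minimum IPv4 header (20 bytes)
UDP_HEADER_BITS = 64    # fixed UDP header (8 bytes)
RTP_HEADER_BITS = 96    # fixed RTP header (12 bytes)
PAYLOAD_BITS = 1280     # 20 ms of G.711 sound data

overhead_bits = IP_HEADER_BITS + UDP_HEADER_BITS + RTP_HEADER_BITS
overhead_share = overhead_bits / (overhead_bits + PAYLOAD_BITS)

print(overhead_bits)    # 320 bits (40 bytes) of headers on every packet
print(overhead_share)   # 0.2 -> 20% of the packet is overhead
```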
6.2.3.4. Real-Time Transport Protocol
The Real-Time Transport Protocol (RTP) defines a simple way of sending and receiving encoded media streams [*] in connectionless sessions. It provides headers that afford VoIP systems an easy way of discriminating between multiple sessions on the same host. Remember that the codec merely describes how the digitized sample is encoded, compressed, and decoded. RTP is responsible for transporting the encoded sound data within a UDP datagram. RTP was designed for use outside the realm of telephony, too: streaming audio and video for entertainment and education are common with RTP.
RTP supports mixing several streams into a single session in order to support applications like conference calling. It doesn't, however, provide adequate controls for defining multiplexed voice pathways that are normally associated with telephony, like trunks. This is the responsibility of the softPBX and its signaling protocols. Control of RTP's media sessions, and collection of data relevant to those sessions, is accomplished by RTP's sister, RTCP (Real-Time Transport Control Protocol). Together, RTP and RTCP provide:
For the VoIP administrator, RTP is largely invisible. Most VoIP frameworks and system-building tools, including Asterisk and OpenH323, implement RTP so seamlessly that the administrator rarely has to worry about its inner workings. If you are interested in RTP, check out Internet RFC (Request for Comments) 1889, published by the IETF's Network Working Group and since superseded by RFC 3550.
As shown in Figure 6-3, the only part of each VoIP packet not considered overhead is the payload of the RTP packet, which is encoded sound data.
Ethernet is a physical and logical data link specification that makes provisions for error detection among locally connected network devices. Ethernet packets, called frames, are typically no larger than about 1,500 bytes, or 12,000 bits. VoIP packets are very rarely larger than 250 bytes, or 2,000 bits. Not accounting for Ethernet overhead, the packet in Figure 6-3 is 1,600 bits long, a rather small packet.
Like RTP, UDP, and IP, Ethernet adds some bulk to each packet. The overhead Ethernet imposes is 176 bits for its header and 128 bits for its CRC "footer," a bumper at the end of each Ethernet frame that provides an error-detection mechanism used by the network interfaces on participating hosts. Figure 6-4 shows an Ethernet VoIP frame.
The total size of a G.711 Ethernet VoIP frame is 1,904 bits. At a standard packet interval of 20 ms and 50 pps, a voice call digitized using plain-vanilla PCM at a rate of 64 kbps consumes a healthy amount of overhead: specifically, 15.2 kbps of Ethernet overhead and 16 kbps of combined RTP, UDP, and IP overhead.
When you add in the payload to all that overhead, an Ethernet-transported voice channel using the G.711 codec requires 95.2 kbps of bandwidth.
While a G.729A voice channel requires only 8 kbps of bandwidth to frame the sound stream, the overhead of IP, UDP, RTP, and Ethernet adds 31.2 kbps, putting the total bandwidth consumption of a G.729A call at 39.2 kbps. Table 6-2 shows the total Ethernet bandwidth consumed by several of the most popular codecs.
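The totals in the last two paragraphs come from a single formula: the codec's payload rate plus the per-packet overhead times the packet rate. A sketch, assuming the 20 ms interval and the TCP/IP and Ethernet header sizes given above:

```python
PPS = 50                  # packets per second at a 20 ms interval
IP_UDP_RTP_BITS = 320     # IP + UDP + RTP headers, per packet
ETHERNET_BITS = 304       # Ethernet header plus CRC "footer," per frame

def ethernet_call_bps(codec_bps: int) -> int:
    """Total Ethernet bandwidth for one voice channel."""
    return codec_bps + (IP_UDP_RTP_BITS + ETHERNET_BITS) * PPS

print(ethernet_call_bps(64_000))  # G.711:  95200 -> 95.2 kbps
print(ethernet_call_bps(8_000))   # G.729A: 39200 -> 39.2 kbps
```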
Figure 6-4. An Ethernet-encapsulated 20 ms VoIP packet
Table 6-2. VoIP codec bandwidth consumption
Ethernet isn't the only data link suitable for carrying VoIP packets: ATM, frame relay, point-to-point circuits, and other technologies can be used, and each introduces its own overhead factors.
6.2.4. Decoding and Playback
When a VoIP packet is received, it is decoded according to the codec employed to encode it. It is then played back on the analog hardware of the receiving endpoint (a speaker) while undergoing DAC, or digital-to-analog conversion. Decoding generally takes about as much processing power as encoding, depending on the codec employed.
Most IP phones and ATAs support several codecs, as shown in Table 6-3. All support G.711 using both the μ-law and A-law scales, and a majority support G.729A, though with variance in the quality and completeness of their implementations. It's fair to say that G.711 and G.729A are the two most popular VoIP codecs in use today.
Table 6-3. Codecs supported by some leading VoIP endpoint devices
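G.711's μ-law and A-law are companding schemes: they give quiet sounds finer quantization than loud ones before the 8-bit sample is stored. The sketch below implements the continuous μ-law curve (μ = 255), not the exact segmented lookup that real G.711 codecs use, just to show the shape of the compression:

```python
from math import log, copysign

MU = 255.0  # companding parameter for North American/Japanese G.711

def mu_law_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    if x == 0.0:
        return 0.0
    return copysign(log(1 + MU * abs(x)) / log(1 + MU), x)

# Quiet samples are boosted before quantization, preserving their detail:
print(round(mu_law_compress(0.01), 3))  # ~0.228: small signal expanded
print(mu_law_compress(1.0))             # 1.0: full scale is unchanged
```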
6.2.4.1. Things that degrade playback quality
Several factors can degrade the quality of audio transmitted over the network:
In Project 3.1, a SIP endpoint was used to connect through the Asterisk server to a demonstration server on the Internet via the IAX signaling protocol. Though the voice channel from the SIP phone to the Asterisk server was encoded using G.711 μ-law, the cross-Internet voice channel to the demonstration server was encoded using GSM, as shown in Figure 6-5.
Figure 6-5. A call path that uses two codecs
When a call path requires more than one codec (different ones on the two legs of the call, as in Figure 6-5), the softPBX, or another specialized server device called a gateway, must transform each leg of the call in real time, or transcode it. Certain connectivity media don't provide enough bandwidth to facilitate G.711 from end to end. A 64 kbps circuit, for example, can't carry a G.711 call, because the call requires more than 64 kbps once IP packet overhead is accounted for. So bandwidth-conserving codecs have their uses, but not all endpoints support every codec. Transcoding is the solution.
Transcoding is a processing-intensive task, so it's a good idea to minimize the number of codecs that you support as standards on your network. A few conference calls with three or four codecs apiece could be a real handful. Cisco and other commercial vendors recommend G.711 for local calls over Ethernet, and G.729A over low-bandwidth WAN connections. The softPBX will insert itself into the path of the call in order to negotiate the appropriate codec on each leg and then perform transcoding.
6.2.4.2. Call paths
Even though the softPBX is the central call-management and -signaling element on the VoIP network, it doesn't always sit in the call path. One of the purposes of SIP, and of other signaling protocols, is to allow endpoints to discover which codecs their peers support, so that, when beginning a call, both endpoints can use the same one. Another purpose is to allow for multiple pathways through the voice network based on the capabilities of each endpoint and the preferences of the administrator. These pathways are known as call paths.
For example, an IP phone can place a call through the softPBX, and the softPBX can act as a proxy for the sound signals, receiving them from the caller and sending them milliseconds later to the receiver. In this case, the softPBX may or may not be transcoding, but it is a point in the call path. You could call this a softPBX call path, or a proxied call path.
But an IP phone placing a call needn't always have its call path cross through the softPBX. In fact, in most commercial VoIP softPBX implementations, this isn't the preferred method. Indeed, with Cisco's CallManager, it isn't even possible out of the box, with exceptions for a few centralized applications like conferencing, music-on-hold, and bridging. In these setups, the softPBX sets up the call using a signaling protocol, and then the phones themselves communicate the sound data directly to each other in UDP bursts. The big advantage of an independent call path is that it incurs less processing load on the softPBX. One disadvantage is that it's impossible to run centralized applications that deal with the sound stream in the call, like, say, a clandestine call-recording application.
When transcoding is employed, the call path always crosses the softPBX or another gateway device that speaks all the necessary codecs. Fortunately, transcoding tends only to be used when a medium other than Ethernet is being used for connectivity, and a codec besides G.711 is employed for that leg of the call path.
So, the call path determination is affected by several issues:
Commercial vendors support automatic selection of a call path to varying degrees. As indicated earlier, some don't support a softPBX call path at all, except to deliver centralized applications like conferencing, and not for transcoding. Others support automated negotiation of the call path during call setup signaling.
Asterisk falls into the latter group. It allows either kind of path for SIP calls (except a conference call) according to the administrator's design. Project 6.1 describes how to enable an independent call path using a SIP feature called Reinvite.
6.2.4.3. Silence suppression and comfort noise generation
When nobody is speaking, there's a great opportunity to save bandwidth, because during periods of silence, no sound data needs to be transmitted over the network, right?
Several codecs have taken this idea to heart. GSM, G.723.1, and others support silence suppression, a technique that suspends the packet stream during periods of silence. In order to create a seamless experience for the person listening to that silence, silence suppression is usually accompanied by comfort noise generation: a small amount of white noise. This white noise is created by the listener's endpoint, rather than being transmitted to her over the network.