Chopping Your Voice into Byte-Size Pieces

The job of converting analog voice into digital data begins with sampling. To better understand sampling, consider the movies you watch at your local theater. When you're watching your favorite actor or actress on the screen (Meg Ryan, in my case), you are not actually watching their continuous motion. Rather, you are watching still images of them played back very rapidly. Typically, movies show 24 frames every second, and when you see that many sequential frames that quickly, it appears to be smooth motion.

Digitizing voice uses a very similar concept. We take "snapshots" or "samples" of an analog voice wave very frequently. Those samples are then digitized (that is, represented as a series of 1s and 0s). Then, at the other end of the voice conversation, this digitized signal can be converted back into an analog wave, which the listener can understand.

One of the major issues with sampling is determining how often we should take those samples (that is, "snapshots") of the analog wave. We don't want to take too few samples per second because when the equipment at the other end of the phone call attempts to reassemble and make sense of those samples, a different signal (that is, a lower-frequency sound) might also match those samples, and the listener would hear an incorrect sound. This phenomenon is called aliasing, as shown in Figure 2-2.

Figure 2-2. Aliasing
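
To make aliasing concrete, here is a minimal Python sketch. The specific numbers are assumptions chosen for illustration only: a 5000 Hz tone and a 3000 Hz tone, each sampled 8000 times per second. The helper name sample_cosine is hypothetical. Because the 5000 Hz tone is sampled too slowly, its samples come out the same as those of the lower-frequency tone, so the receiving equipment cannot tell the two apart.

```python
import math

SAMPLE_RATE_HZ = 8000  # samples per second (assumed for this illustration)

def sample_cosine(freq_hz, num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return the amplitude of a cosine tone at each sampling instant."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

high_tone = sample_cosine(5000, 16)  # sampled too slowly for its frequency
low_tone = sample_cosine(3000, 16)   # the lower-frequency sound that also fits

# Compare the two sets of samples point by point.
max_difference = max(abs(h - l) for h, l in zip(high_tone, low_tone))
print(f"largest difference between the two sample sets: {max_difference:.2e}")
```

The printed difference is only floating-point rounding noise, which is another way of saying that both tones match the same set of samples.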

Now that you see the evils of undersampling (and the aliasing it causes), you might be tempted to say, "Let's take many more samples per second to avoid aliasing." Although that approach, sometimes called oversampling, does indeed eliminate the issue of aliasing, it also suffers from a major drawback. If we take far more samples per second than we actually need to accurately re-create the original signal, we are consuming more bandwidth than is absolutely necessary. Because bandwidth is a scarce commodity (especially on a WAN), we don't want to perform oversampling, as shown in Figure 2-3.

Figure 2-3. Oversampling

At this point, you have seen that you do not want to take too few samples per second, nor do you want to take too many. So, where is the "sweet spot"? What is the magic number of samples that allows equipment to accurately reproduce the original signal, without consuming more bandwidth than necessary? The answer was provided for us back in 1928 by Harry Nyquist. In fact, the Nyquist Theorem is very popular among telephony professionals. Mr. Nyquist said the sample rate needs to be at least twice as high as the highest frequency being sampled. For voice, in theory, the highest sampled frequency is 4 kHz (that is, 4000 cycles per second). Based on that information, the Nyquist Theorem tells us that we need to take 8000 samples per second (2 * 4000), which means that we need to take a sample every 125 microseconds (1 second / 8000), as shown in Figure 2-4.

Figure 2-4. Sampling
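
The Nyquist arithmetic above can be sketched in a few lines of Python. The function names are hypothetical; the only inputs are the figures already given in the text (a highest sampled frequency of 4000 Hz).

```python
def nyquist_sample_rate(highest_freq_hz):
    """Minimum sample rate: twice the highest frequency being sampled."""
    return 2 * highest_freq_hz

def sample_interval_us(sample_rate_hz):
    """Time between consecutive samples, in microseconds."""
    return 1_000_000 / sample_rate_hz

rate = nyquist_sample_rate(4000)
interval = sample_interval_us(rate)
print(rate, "samples per second")   # 8000 samples per second
print(interval, "microseconds")     # 125.0 microseconds
```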

You might be wondering why we said that the highest frequency sampled for voice traffic is 4 kHz (that is, 4000 Hz or 4000 cycles per second). After all, if we measured the frequency range of the spoken voice with sophisticated test equipment, we would find that frequencies contained in the human voice go well above 4 kHz. The next time you are in one of those huge electronics stores, check out the frequency range (sometimes called the frequency response) on stereo speakers. Those speakers can typically reproduce sounds in the frequency range from 20 Hz at the low end to 20,000 Hz at the high end. Some of the really expensive speakers have an even greater frequency range. However, because most humans cannot hear frequencies above 20,000 Hz, I am not quite sure why customers pay extra money to reproduce sounds that they cannot even hear. Maybe they pay the extra money for the listening enjoyment of their household dog.

However, the question for our purposes is, "Why don't we attempt to reproduce the higher frequency components of the human voice?" The answer is twofold. First, if we sampled more times per second in order to reproduce higher frequencies (that is, frequencies above 4 kHz), the additional required samples would consume more bandwidth. Second, because our goal is to reproduce clear and understandable voice and not to reproduce the fidelity experienced in a concert hall, we don't need to reproduce signals in excess of 4 kHz. In fact, over 90 percent of voice intelligence (that is, frequencies used by human speech) is contained in the 0 to 4000 Hz frequency range.

A common misconception about these voice samples is that when we take a sample, the sample is immediately in digital form. The initial process of sampling is called pulse amplitude modulation (PAM). Interestingly, after PAM is performed, the samples are still in an analog format. These samples, consisting of a single frequency, have amplitudes (that is, volumes) equaling the amplitudes of the sampled waveform at the instant of sampling.

The next step in digitizing the voice waveforms is to take these PAM amplitudes and assign them a number, which can then be transmitted in binary form. The process of assigning a number to an amplitude is called quantization, as shown in Figure 2-5.

Figure 2-5. Linear Quantization

Once PAM samples have been taken, we need to quantize these samples (that is, assign numbers to represent their amplitudes). However, if we use a linear scale (as shown in Figure 2-5), the quantization error (as indicated by the deltas) causes distortion in the voice. This distortion is especially noticeable at lower volumes. Therefore, instead of a linear scale, we use a logarithmic scale, which has more measurement intervals at lower volumes.

In theory, these PAM samples might have an infinite number of amplitudes, and it would not be practical to try to assign a unique number to every sample. As a result, the quantization rounds off these amplitude values to the closest number on a scale, as shown in Figure 2-5 (represented by the deltas). The challenge with rounding off is that it causes quantization error, which sounds like a "hiss" on the line. The example in Figure 2-5 uses the linear scale on the left side to assign numbers to the various amplitudes.
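
As a rough illustration of rounding to a linear scale, the following Python sketch quantizes a handful of amplitudes and reports the error for each. Everything here is an assumption made for the example: the function name quantize_linear, the step size of 1.0, and the sample amplitudes themselves are not taken from Figure 2-5.

```python
def quantize_linear(amplitude, step=1.0):
    """Round an amplitude to the nearest mark on an evenly spaced scale.

    Returns the chosen level number and the quantization error
    (original amplitude minus the amplitude the level represents).
    """
    level = round(amplitude / step)
    error = amplitude - level * step
    return level, error

# Hypothetical PAM amplitudes, chosen only for illustration.
pam_samples = [0.2, 1.4, 2.7, 3.1, 6.8]

for amplitude in pam_samples:
    level, error = quantize_linear(amplitude)
    relative = abs(error) / amplitude
    print(f"amplitude {amplitude:.1f} -> level {level}, "
          f"error {error:+.2f} ({relative:.0%} of the signal)")
```

Notice that an error of the same absolute size is a far larger fraction of a quiet sample than of a loud one, which is one way to see why the distortion stands out at lower volumes.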

Interestingly, quantization error is more noticeable at lower amplitudes (that is, lower volumes). There are two reasons for this. First, when the volume is louder, the speech tends to drown out the relatively quiet "hiss." Second, lower volumes occur more frequently than higher volumes. Based on these characteristics, using more measurement intervals at lower volumes and fewer measurement intervals at higher volumes can help overcome the symptoms of quantization error, while still not using extra bandwidth. To accomplish this result, instead of a linear scale, a logarithmic scale is used, as shown in Figure 2-6.

Figure 2-6. Logarithmic Quantization
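
The effect of a logarithmic scale can be sketched in Python. This example assumes one widely used logarithmic curve, the continuous mu-law companding function with a constant of 255, and assumes 128 intervals per polarity; the text does not give a formula, so treat those choices and the function names as illustrative. Real codecs approximate a curve like this with segments and steps rather than computing logarithms for each sample.

```python
import math

MU = 255  # assumed companding constant (the value used by mu-law)

def compress(x, mu=MU):
    """Map amplitude x in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def expand(y, mu=MU):
    """Inverse of compress(): map a scale position back to an amplitude."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# Evenly spaced marks on the compressed scale (assumed: 128 per polarity).
step = 1 / 128

# Amplitude gap between neighboring marks at the quiet and loud ends.
low_gap = expand(step) - expand(0.0)
high_gap = expand(1.0) - expand(1.0 - step)

print(f"gap between marks near silence:     {low_gap:.6f}")
print(f"gap between marks near full volume: {high_gap:.6f}")
```

The marks are packed much more tightly near silence than near full volume, so quiet samples are rounded far less coarsely than loud ones, even though the total number of marks has not changed.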

There are a couple of popular approaches to defining this logarithmic scale, called μ-Law (pronounced "mu-Law" and sometimes written and pronounced as "u-Law") and a-Law (sometimes written as "A-Law"). μ-Law is the approach most commonly used in North America and Japan, while a-Law is more commonly used in other countries. Although both approaches do a great job of defining a logarithmic scale, μ-Law has lower idle channel noise, while a-Law has a superior signal-to-noise (S/N) ratio for lower-volume samples. However, if VoIP equipment in a country using μ-Law connects to VoIP equipment in a country using a-Law, the common practice is for both sets of equipment to use a-Law. Now that we can measure PAM amplitudes more effectively, let's assign a number to these samples.

An 8-bit (that is, 1-byte) value represents each sample. The first bit of the byte determines the polarity (that is, positive or negative) of the sample. The byte's next 3 bits identify the segment (that is, the major division on the logarithmic scale), while the final 4 bits of the byte specify the step (that is, the minor division on the logarithmic scale), as shown in Figure 2-7.

Figure 2-7. Anatomy of an 8-Bit Sample
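
The 1-3-4 bit layout described above can be pulled apart with simple shifts and masks. The Python sketch below is only an illustration of that layout: the function name split_sample and the example byte are hypothetical, the code does not say which polarity bit value means positive, and it ignores any bit inversion that a real codec might apply before transmission.

```python
def split_sample(byte_value):
    """Split an 8-bit sample into (polarity, segment, step) fields."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("sample must fit in one byte")
    polarity = (byte_value >> 7) & 0b1    # first bit
    segment = (byte_value >> 4) & 0b111   # next 3 bits: major division
    step = byte_value & 0b1111            # final 4 bits: minor division
    return polarity, segment, step

# Example byte written as polarity_segment_step for readability.
print(split_sample(0b1_011_0101))  # (1, 3, 5)
```

Three segment bits allow 8 major divisions and four step bits allow 16 minor divisions within each, which gives 128 marks for each polarity.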

At this point, the spoken voice has been converted into a series of 1s and 0s, but the question is, "How much bandwidth is being used on my network to send this voice conversation?" Let's do the math:

According to Mr. Nyquist, we need to take 8000 samples per second.
Each sample uses 8 bits.
8000 samples per second * 8 bits per sample = 64,000 bits per second (that is, 64 kbps).

These calculations show us that we can transmit digitized voice using 64 kbps of bandwidth. However, in addition to the actual voice, header information also needs to be transmitted. The 64 kbps of bandwidth only represents the voice traffic.
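
For readers who like to check the math in code, the same calculation can be written as a short Python sketch. The function name voice_payload_bps is hypothetical, and the result covers only the voice samples, not any header information.

```python
def voice_payload_bps(samples_per_second=8000, bits_per_sample=8):
    """Bits per second needed for the voice samples alone (no headers)."""
    return samples_per_second * bits_per_sample

bps = voice_payload_bps()
print(f"{bps} bps = {bps / 1000:g} kbps")  # 64000 bps = 64 kbps
```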