The effects of finite binary word lengths touch all aspects of digital signal processing. Using finite word lengths prevents us from representing values with infinite precision, increases the background noise in our spectral estimation techniques, creates nonideal digital filter responses, induces noise in analog-to-digital (A/D) converter outputs, and can (if we're not careful) lead to wildly inaccurate arithmetic results. The smaller the word lengths, the greater these problems will be. Fortunately, these finite word-length effects are rather well understood. We can predict their consequences and take steps to minimize any unpleasant surprises. The first finite word-length effect we'll cover is the errors that occur during the A/D conversion process.
12.3.1 A/D Converter Quantization Errors
Practical A/D converters are constrained to have binary output words of finite length. Commercial A/D converters are categorized by their output word lengths, which are normally in the range from 8 to 16 bits. A typical A/D converter input analog voltage range is from –1 to +1 volt. If we used such an A/D converter having eight-bit output words, the least significant bit would represent

$$1\ \mathrm{lsb} = \frac{2\ \mathrm{volts}}{2^8\ \mathrm{levels}} = 7.81\ \mathrm{millivolts}.$$
What this means is that we can represent continuous (analog) voltages perfectly as long as they're integral multiples of 7.81 millivolts; any intermediate input voltage will cause the A/D converter to output a best-estimate digital data value. The inaccuracies in this process are called quantization errors because an A/D output least significant bit is an indivisible quantity. We illustrate this situation in Figure 12-1(a), where the continuous waveform is being digitized by an eight-bit A/D converter whose output is in the sign-magnitude format. When we start sampling at time t = 0, the continuous waveform happens to have a value of 31.25 millivolts (mV), and our A/D output data word will be exactly correct for sample x(0). At time T when we get the second A/D output word for sample x(1), the continuous voltage is between 0 and –7.81 mV. In this case, the A/D converter outputs a sample value of 10000001, representing –7.81 mV, even though the continuous input was not quite as negative as –7.81 mV. The 10000001 A/D output word contains some quantization error. Each successive sample contains quantization error because the A/D's digitized output values must lie on a horizontal dashed line in Figure 12-1(a). The difference between the actual continuous input voltage and the A/D converter's representation of the input is shown as the quantization error in Figure 12-1(b). For an ideal A/D converter, the quantization error, a kind of roundoff noise, can never be greater than ±1/2 lsb, or ±3.905 mV.
Figure 12-1. Quantization errors: (a) digitized x(n) values of a continuous signal; (b) quantization error between the actual analog signal values and the digitized signal values.
While Figure 12-1(b) shows A/D quantization noise in the time domain, we can also illustrate this noise in the frequency domain. Figure 12-2(a) depicts a continuous sinewave of one cycle over the sample interval shown as the dotted line and a quantized version of the time-domain samples of that wave as the dots. Notice how the quantized version of the wave is constrained to have only integral values, giving it a stairstep effect oscillating above and below the true unquantized sinewave. The quantization here is 4 bits, meaning that we have a sign bit and three bits to represent the magnitude of the wave. With three bits, the maximum peak values for the wave are ±7. Figure 12-2(b) shows the discrete Fourier transform (DFT) of a discrete version of the sinewave whose time-domain sample values are not forced to be integers, but have high precision. Notice in this case that the DFT has a nonzero value only at m = 1. On the other hand, Figure 12-2(c) shows the spectrum of the 4-bit quantized samples in Figure 12-2(a), where quantization effects have induced noise components across the entire spectral band. If the quantization noise depictions in Figures 12-1(b) and 12-2(c) look random, that's because they are. As it turns out, even though A/D quantization noise is random, we can still quantify its effects in a useful way.
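If you'd like to see this effect for yourself, the following short Python sketch (a minimal illustration assuming NumPy is available; the 64-point record and 4-bit quantization mirror Figure 12-2) compares the DFT of high-precision sinewave samples with the DFT of their integer-quantized counterparts:

    import numpy as np

    N = 64                                     # 64-point DFT, one sinewave cycle per record
    n = np.arange(N)
    x = 7.0 * np.sin(2 * np.pi * n / N)        # peak of +/-7: a sign bit plus 3 magnitude bits

    x_quant = np.round(x)                      # constrain samples to integral quantization levels

    X_ideal = np.abs(np.fft.fft(x)) / N        # spectrum of the high-precision samples
    X_quant = np.abs(np.fft.fft(x_quant)) / N  # spectrum of the 4-bit quantized samples

    # The ideal spectrum is nonzero only at m = 1 (and its mirror at m = N-1);
    # quantization spreads low-level noise across all the other bins.
    others = np.delete(np.arange(N), [1, N - 1])
    print("ideal:     bin 1 = %.3f, largest other bin = %.2e" % (X_ideal[1], X_ideal[others].max()))
    print("quantized: bin 1 = %.3f, largest other bin = %.2e" % (X_quant[1], X_quant[others].max()))

Running the sketch shows the high-precision spectrum concentrated in a single bin, while the quantized version exhibits a noise floor across the band, just as in Figure 12-2(c).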
In the field of communications, people often use the notion of output signal-to-noise ratio, or SNR = (signal power)/(noise power), to judge the usefulness of a process or device. We can do likewise and obtain an important expression for the output SNR of an ideal A/D converter, SNR_A/D, accounting for finite word-length quantization effects. Because quantization noise is random, we can't explicitly represent its power level, but we can use its statistical equivalent, the variance, to define SNR_A/D measured in decibels as
Figure 12-2. Quantization noise effects: (a) input sinewave applied to a 64-point DFT; (b) theoretical DFT magnitude of high-precision sinewave samples; (c) DFT magnitude of a sinewave quantized to 4 bits.
Equation 12-8

$$\mathrm{SNR}_{A/D} = 10 \cdot \log_{10}\left(\frac{\sigma^2_{\mathrm{signal}}}{\sigma^2_{\mathrm{A/D\,noise}}}\right)\ \mathrm{dB}$$
Next, we'll determine an A/D converter's quantization noise variance relative to the converter's maximum input peak voltage Vp. If the full-scale (–Vp to +Vp volts) continuous input range of a b-bit A/D converter is 2Vp, a single quantization level q is that voltage range divided by the number of possible A/D output binary values, or q = 2Vp/2^b. (In Figure 12-1, for example, the quantization level q is the lsb value of 7.81 mV.) A depiction of the likelihood of encountering any given quantization error value, called the probability density function p(e) of the quantization error, is shown in Figure 12-3.
Figure 12-3. Probability density function of A/D conversion roundoff error (noise).
This simple rectangular function has much to tell us. It indicates that there's an equal chance that any error value between –q/2 and +q/2 can occur. By definition, because probability density functions have an area of unity (i.e., the probability is 100 percent that the error will be somewhere under the curve), the amplitude of the p(e) density function must be the area divided by the width, or p(e) = 1/q. From Figure D-4 and Eq. (D-12) in Appendix D, the variance of our uniform p(e) is
Equation 12-9

$$\sigma^2_{\mathrm{A/D\,noise}} = \int_{-q/2}^{q/2} e^2\, p(e)\, de = \frac{1}{q}\int_{-q/2}^{q/2} e^2\, de = \frac{q^2}{12}$$
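As a quick numerical sanity check on that q²/12 result, the short sketch below (the q value is borrowed from Figure 12-1; the sample count is an arbitrary choice) draws uniformly distributed errors and compares their sample variance to q²/12:

    import numpy as np

    rng = np.random.default_rng(0)
    q = 7.81e-3                                # quantization level from Figure 12-1, in volts
    e = rng.uniform(-q / 2, q / 2, 1_000_000)  # uniformly distributed quantization errors

    print("sample variance:", e.var())         # approximately 5.08e-06
    print("q**2 / 12      :", q**2 / 12)       # 5.08e-06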
We can now express the A/D noise error variance in terms of A/D parameters by replacing q in Eq. (12-9) with q = 2Vp/2^b to get
Equation 12-10

$$\sigma^2_{\mathrm{A/D\,noise}} = \frac{(2V_p/2^b)^2}{12} = \frac{V_p^2}{3 \cdot 2^{2b}}$$
OK, we're halfway to our goal: with Eq. (12-10) giving us the denominator of Eq. (12-8), we need the numerator. To arrive at a general result, let's express the input signal in terms of its root mean square (rms), the A/D converter's peak voltage, and a loading factor LF defined as
Equation 12-11

$$LF = \frac{\text{rms of the input signal}}{V_p}$$
With the loading factor defined as the input rms voltage over the A/D converter's peak input voltage, we square and rearrange Eq. (1211) to show the signal variance as
Equation 12-12

$$\sigma^2_{\mathrm{signal}} = (LF \cdot V_p)^2 = LF^2 \cdot V_p^2$$
Substituting Eqs. (12-10) and (12-12) in Eq. (12-8),
Equation 12-13

$$\mathrm{SNR}_{A/D} = 10 \cdot \log_{10}\left(\frac{LF^2 \cdot V_p^2}{V_p^2/(3 \cdot 2^{2b})}\right) = 10 \cdot \log_{10}\left(3 \cdot 2^{2b} \cdot LF^2\right) = 6.02b + 4.77 + 20 \cdot \log_{10}(LF)\ \mathrm{dB}$$
Eq. (12-13) gives us the SNR_A/D of an ideal b-bit A/D converter in terms of the loading factor and the number of bits b. Figure 12-4 plots Eq. (12-13) for various A/D word lengths as a function of the loading factor. Notice that the loading factor in Figure 12-4 is never greater than –3 dB, because the maximum continuous A/D input peak value must not be greater than Vp volts. Thus, for a sinusoid input, its rms value must not be greater than Vp/√2 = 0.707Vp volts (3 dB below Vp).
Figure 12-4. SNR_A/D of ideal A/D converters as a function of loading factor in dB.
Recall from Appendix D, Section D.2 that, although the variance σ² is associated with the power of a signal, the standard deviation σ is associated with the rms value of a signal.
When the input sinewave's peak amplitude is equal to the A/D converter's full-scale voltage Vp, the full-scale LF is
Equation 12-14

$$LF = \frac{V_p/\sqrt{2}}{V_p} = \frac{1}{\sqrt{2}} = 0.707$$
Under this condition, the maximum A/D output SNR from Eq. (12-13) is
Equation 12-15

$$\mathrm{SNR}_{A/D\text{-}max} = 6.02b + 4.77 + 20 \cdot \log_{10}(0.707) = 6.02b + 1.76\ \mathrm{dB}$$
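As a rough check on Eq. (12-15), we can simulate an ideal b-bit converter digitizing a full-scale sinusoid and measure the resulting SNR. The sketch below is one possible such experiment (the word length, record length, and test frequency are arbitrary illustrative choices, not values from the text):

    import numpy as np

    b = 12                                      # A/D converter word length, in bits
    Vp = 1.0                                    # converter's peak input voltage
    N = 2**18                                   # number of samples (arbitrary, just "many")

    n = np.arange(N)
    x = Vp * np.sin(2 * np.pi * 0.1234567 * n)  # full-scale sinusoid, nonharmonic frequency

    q = 2 * Vp / 2**b                           # quantization level q = 2Vp / 2^b
    x_q = q * np.round(x / q)                   # ideal b-bit rounding quantizer

    noise = x_q - x
    snr_measured = 10 * np.log10(x.var() / noise.var())
    print("measured SNR: %.2f dB" % snr_measured)
    print("6.02b + 1.76: %.2f dB" % (6.02 * b + 1.76))

The measured SNR lands within a small fraction of a dB of the 74.0 dB that Eq. (12-15) predicts for b = 12.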
This discussion of SNR relative to A/D converters means three important things to us:
† By the way, some folks use the term crest factor to describe how hard an A/D converter's input is being driven. The crest factor is the reciprocal of the loading factor, or CF = Vp/(rms of the input signal).
A word of caution is appropriate here concerning our analysis of A/D converter quantization errors. The derivations of Eqs. (12-13) and (12-15) are based upon three assumptions:

1. The quantization errors are uniformly distributed over one quantization level, as depicted in Figure 12-3.
2. The quantization errors are uncorrelated with the input signal, so signal power and noise power simply add.
3. The A/D converter's analog input is never driven beyond its full-scale range of –Vp to +Vp volts; that is, no clipping occurs.
To conclude our discussion of A/D converters, let's consider one last topic. In the literature the reader is likely to encounter the expression
Equation 12-16

$$b_{\mathrm{eff}} = \frac{\mathrm{SNR}_{\mathrm{measured}} - 1.76}{6.02}$$
Equation (12-16) is used by test equipment manufacturers to specify the sensitivity of test instruments using a b_eff parameter known as the number of effective bits, or effective number of bits (ENOB) [3–8]. Equation (12-16) is merely Eq. (12-15) solved for b. Test equipment manufacturers measure the actual SNR of their product, indicating its ability to capture continuous input signals relative to the instrument's inherent noise characteristics. Given this true SNR, they use Eq. (12-16) to determine the b_eff value for advertisement in their product literature. The larger b_eff, the greater the continuous voltage that can be accurately digitized relative to the equipment's intrinsic quantization noise.
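Translated into code, Eq. (12-16) is a one-liner; the 62 dB figure below is purely illustrative, not a measurement from any particular instrument:

    def effective_bits(snr_measured_db):
        """Eq. (12-16): effective number of bits (ENOB) from a measured SNR in dB."""
        return (snr_measured_db - 1.76) / 6.02

    print(effective_bits(62.0))   # roughly 10.0 effective bits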
12.3.2 Data Overflow
The next finite word-length effect we'll consider is called overflow. Overflow is what happens when the result of an arithmetic operation has too many bits, or digits, to be represented in the hardware registers designed to contain that result. We can demonstrate this situation to ourselves rather easily using a simple four-function, eight-digit pocket calculator. The sum of a decimal 9.9999999 plus 1.0 is 10.9999999, but on an eight-digit calculator the sum is 10.999999 as

      9.9999999
    + 1.0000000
    -----------
     10.999999
The hardware registers, which contain the arithmetic result and drive the calculator's display, can hold only eight decimal digits, so the least significant digit is discarded (of course). Although the above error is less than one part in ten million, overflow effects can be striking when we work with large numbers. If we use our calculator to add 99,999,999 plus 1, instead of getting the correct result of 100 million, we'll get a result of 1. Now that's an authentic overflow error!
Let's illustrate overflow effects with examples more closely related to our discussion of binary number formats. First, adding two unsigned binary numbers is as straightforward as adding two decimal numbers. The sum of 42 plus 39 is 81, or

      101010   (decimal 42)
    + 100111   (decimal 39)
    ---------
     1010001   (decimal 81)
In this case, two 6-bit binary numbers required 7 bits to represent the result. The general rule is that the sum of m individual b-bit binary numbers can require as many as b + log2(m) bits to represent the result. So, for example, a 24-bit result register (accumulator) is needed to accumulate the sum of sixteen 20-bit binary numbers, or 20 + log2(16) = 24. The sum of 256 eight-bit words requires an accumulator whose word length is 8 + log2(256), or 16 bits, to ensure that no overflow errors occur.
In the preceding example, if our accumulator word length was 6 bits, an overflow error occurs as

       101010   (decimal 42)
     + 100111   (decimal 39)
     ---------
    (1)010001 → 010001 = decimal 17, an incorrect result
Here, the most significant bit of the result overflowed the 6bit accumulator, and an error occurred.
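The b + log2(m) growth rule is easy to exercise in code. The sketch below uses the sixteen 20-bit words from the example above (with random values) and confirms that the sum fits in the accumulator width the rule prescribes:

    import math
    import numpy as np

    b, m = 20, 16                              # sixteen 20-bit unsigned words
    rng = np.random.default_rng(1)
    words = rng.integers(0, 2**b, size=m)      # random b-bit values

    total = int(words.sum())
    print("sum needs", total.bit_length(), "bits")           # never more than the rule's value
    print("rule guarantees", b + int(math.log2(m)), "bits")  # 20 + log2(16) = 24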
With regard to overflow errors, the two's complement binary format has two interesting characteristics. First, under certain conditions, overflow during the summation of two numbers causes no error. Second, with multiple summations, intermediate overflow errors cause no problems if the final magnitude of the sum of the b-bit two's complement numbers is less than 2^(b–1). Let's illustrate these properties by considering the four-bit two's complement format in Figure 12-5, whose binary values are taken from Table 12-2.
Figure 12-5. Four-bit two's complement binary numbers.
The first property of two's complement overflow, which sometimes causes no errors, can be shown by the following examples:

      1101   (–3)
    + 1110   (–2)
    --------
    (1)1011 → discard the carry: 1011 = –5, correct

      1100   (–4)
    + 0111   (+7)
    --------
    (1)0011 → discard the carry: 0011 = +3, correct
Then again, the following examples show how two's complement overflow sometimes does cause errors:

      0110   (+6)
    + 0111   (+7)
    --------
      1101 = –3, incorrect (+13 exceeds the four-bit range)

      1010   (–6)
    + 1001   (–7)
    --------
    (1)0011 → 0011 = +3, incorrect (–13 exceeds the four-bit range)
The rule with two's complement addition is if the carry bit into the sign bit is the same as the overflow bit out of the sign bit, the overflow bit can be ignored, causing no errors; if the carry bit into the sign bit is different from the overflow bit out of the sign bit, the result is invalid. An even more interesting property of two's complement numbers is that a series of b-bit word summations can be performed where intermediate sums are invalid, but the final sum will be correct if its magnitude is less than 2^(b–1). We show this by the following example. If we add a +6 to a +7, and then add a –7, we'll encounter an intermediate overflow error but our final sum will be correct as

      0110   (+6)
    + 0111   (+7)
    --------
      1101   (an invalid intermediate sum of –3)

      1101   (–3)
    + 1001   (–7)
    --------
    (1)0110 → 0110 = +6, the correct final sum
The magnitude of the sum of the three four-bit numbers was less than 2^(4–1) (<8), so our result was valid. If we add a +6 to a +7, and next add a –5, we'll encounter an intermediate overflow error, and our final sum will also be in error because its magnitude is not less than 8.
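We can mimic a four-bit two's complement register in software by masking each sum to four bits. The following sketch (a simple model, not any particular piece of hardware) reproduces the +6 plus +7 minus 7 example, showing the invalid intermediate sum and the correct final result:

    def wrap_twos(value, bits=4):
        """Wrap an integer into b-bit two's complement range, as a b-bit register would."""
        v = value & ((1 << bits) - 1)          # keep the low b bits, discarding any carry
        return v - (1 << bits) if v & (1 << (bits - 1)) else v

    s = wrap_twos(6 + 7)                       # +13 overflows four bits...
    print("intermediate sum:", s)              # ...wrapping to the invalid value -3
    print("final sum       :", wrap_twos(s + (-7)))  # -3 + (-7) wraps back to the correct +6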
Another situation where overflow problems are conspicuous is during the calculation of the fast Fourier transform (FFT). It's difficult at first to imagine that multiplying complex numbers by sines and cosines can lead to excessive data word growth, particularly because sines and cosines are never greater than unity. Well, we can show how FFT data word growth occurs by considering a decimation-in-time FFT butterfly from Figure 4-14(c), repeated here as Figure 12-6, and grinding through a little algebra. The expression for the x' output of this FFT butterfly, from Eq. (4-26), is
Equation 12-17

$$x' = x + W_N^k\, y = x + y \cdot e^{-j2\pi k/N}$$
Figure 12-6. Single decimation-in-time FFT butterfly.
Breaking up the butterfly's x and y inputs into their real and imaginary parts and remembering that $W_N^k = e^{-j2\pi k/N}$, we can express Eq. (12-17) as
Equation 12-18

$$x' = x_{\mathrm{real}} + jx_{\mathrm{imag}} + (y_{\mathrm{real}} + jy_{\mathrm{imag}}) \cdot e^{-j2\pi k/N}$$
If we let α be the twiddle factor angle 2πk/N, and recall that e^(–jα) = cos(α) – jsin(α), we can simplify Eq. (12-18) as
Equation 12-19

$$x' = x_{\mathrm{real}} + y_{\mathrm{real}}\cos(\alpha) + y_{\mathrm{imag}}\sin(\alpha) + j\left[x_{\mathrm{imag}} - y_{\mathrm{real}}\sin(\alpha) + y_{\mathrm{imag}}\cos(\alpha)\right]$$
If we look, for example, at just the real part of the x' output, x'_real, it comprises the three terms
Equation 12-20

$$x'_{\mathrm{real}} = x_{\mathrm{real}} + y_{\mathrm{real}}\cos(\alpha) + y_{\mathrm{imag}}\sin(\alpha)$$
If x_real, y_real, and y_imag are of unity value when they enter the butterfly and the twiddle factor angle α = 2πk/N happens to be π/4 = 45°, then x'_real can be greater than 2 as
Equation 12-21

$$x'_{\mathrm{real}} = 1 + 1 \cdot \cos(45^\circ) + 1 \cdot \sin(45^\circ) = 1 + 0.707 + 0.707 = 2.414$$
So we see that the real part of a complex number can more than double in magnitude in a single stage of an FFT. The imaginary part of a complex number is equally likely to more than double in magnitude in a single FFT stage. Without mitigating this word growth problem, overflow errors could render an FFT algorithm useless.
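The worst-case growth in Eq. (12-21) is easy to confirm numerically. The sketch below evaluates the Eq. (12-17) butterfly directly with unity-valued inputs and a swept twiddle angle (a bare-bones evaluation, not production FFT code):

    import numpy as np

    x = 1.0 + 1j                               # butterfly inputs with unity real and
    y = 1.0 + 1j                               # imaginary parts

    alpha = np.linspace(0.0, 2 * np.pi, 3601)  # sweep the twiddle angle 2*pi*k/N
    x_prime = x + y * np.exp(-1j * alpha)      # Eq. (12-17) butterfly output

    i = np.argmax(x_prime.real)
    print("worst-case x'_real = %.4f at alpha = %.1f degrees"
          % (x_prime.real[i], np.degrees(alpha[i])))   # 2.4142 at 45.0 degrees

The printed worst case, 1 + √2 ≈ 2.414, matches Eq. (12-21).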
OK, overflow problems are handled in one of two ways, by truncation or rounding, each inducing its own kind of quantization error, as we shall see.
12.3.3 Truncation
Truncation is the process where a data value is represented by the largest quantization level that is less than or equal to that data value. If we're quantizing to integral values, for example, the real value 1.2 would be quantized to 1. An example of truncation to integral values is shown in Figure 12-7(a), where all values of x in the range of 0 ≤ x < 1 are set equal to 0, values of x in the range of 1 ≤ x < 2 are set equal to 1, x values in the range of 2 ≤ x < 3 are set equal to 2, and so on.
Figure 12-7. Truncation: (a) quantization nonlinearities; (b) error probability density function.
As we did with A/D converter quantization errors, we can call upon the concept of probability density functions to characterize the errors induced by truncation. The probability density function of truncation errors, in terms of the quantization level, is shown in Figure 12-7(b). In Figure 12-7(a) the quantization level q is 1, so, in this case, we can have truncation errors as great as –1. Drawing upon our results from Eqs. (D-11) and (D-12) in Appendix D, the mean and variance of our uniform truncation probability density function are expressed as
Equation 12-22

$$\mu_{\mathrm{truncation}} = -\frac{q}{2}$$
and
Equation 12-23

$$\sigma^2_{\mathrm{truncation}} = \frac{q^2}{12}$$
In a sense, truncation error is the price we pay for the privilege of using integer binary arithmetic. One aspect of this is the error introduced when we use truncation to implement division by some integral power of two. A quick way of dividing a binary value by 2^T is to shift that binary word T bits to the right; that is, we truncate the data value (not the data word) by discarding the rightmost T bits after the shift. For example, let's say we have the value 31 represented by the five-bit binary number 11111₂, and we want to divide it by 16 through shifting the bits T = 4 places to the right and ignoring (truncating) those shifted bits. After the right shift and truncation, we'd have a binary quotient of 31/16 = 00001₂. Well, we see the significance of the problem because our quick division gave us an answer of one instead of the correct 31/16 = 1.9375. Our division-by-truncation error here is almost 50 percent of the correct answer. Had our original dividend been 63, represented by the six-bit binary number 111111₂, dividing it by 16 through a four-bit shift would give us an answer of binary 000011₂, or decimal three. The correct answer, of course, is 63/16 = 3.9375. In this case the percentage error is 0.9375/3.9375, or about 23.8 percent. So, the larger the dividend, the lower the truncation error.
If we study these kinds of errors, we'll find that the resulting truncation error depends on three things: the number of bits shifted and truncated, the values of the truncated bits (were those dropped bits ones or zeros?), and the magnitude of the binary number left over after truncation. Although a complete analysis of these truncation errors is beyond the scope of this book, we can quantify the maximum error that can occur with our division-by-truncation scheme when using binary integer arithmetic. The worst-case scenario is when all of the T bits to be truncated are ones. For binary integral numbers the value of T bits, all ones, to the right of the binary point is 1 – 2^(–T), so if the resulting quotient N is small, those truncated ones make up a large fraction of the true quotient and aggravate the error. We can normalize the maximum division error by comparing it to the correct quotient and expressing it as a percentage. So the maximum percentage error of the binary quotient N after truncation for a T-bit right shift, when the truncated bits are all ones, is
Equation 12-24

$$\text{Max truncation error} = \frac{1 - 2^{-T}}{N + (1 - 2^{-T})} \cdot 100\%$$
Let's plug a few numbers into Eq. (12-24) to understand its significance. In the first example above, 31/16, where T = 4 and the resulting binary quotient was N = 1, the percentage error is

$$\frac{1 - 2^{-4}}{1 + (1 - 2^{-4})} \cdot 100\% = \frac{0.9375}{1.9375} \cdot 100\% \approx 48.4\%.$$
Plotting Eq. (12-24) for three different shift values as functions of the quotient, N, after truncation results in Figure 12-8.
Figure 12-8. Maximum error when data shifting and truncation are used to implement division by 2^T. T = 1 is division by 2; T = 2 is division by 4; and T = 8 is division by 256.
So, to minimize this type of division error, we need to keep the resulting binary quotient N, after the division, as large as possible. Remember, Eq. (12-24) was a worst-case condition where all of the truncated bits were ones. On the other hand, if all of the truncated bits happened to be zeros, the error will be zero. In practice, the errors will be somewhere between these two extremes. A practical example of how division-by-truncation can cause serious numerical errors is given in Reference [9].
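To see these error extremes in action, the following sketch compares right-shift division against true division for the two dividends used above, plus a larger one (the 255 case is an added illustration):

    def shift_divide_error_pct(dividend, T):
        """Percentage error when dividing by 2**T via a right shift (truncation)."""
        truncated = dividend >> T              # quick division: drop the rightmost T bits
        exact = dividend / 2**T
        return 100.0 * (exact - truncated) / exact

    # Worst case for T = 4: all four truncated bits are ones.
    for dividend in (31, 63, 255):
        print("%3d / 16 -> %4.1f%% error" % (dividend, shift_divide_error_pct(dividend, 4)))

This prints 48.4%, 23.8%, and 5.9% respectively: the larger the quotient left after the shift, the smaller the error, exactly as Figure 12-8 indicates.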
12.3.4 Data Rounding
Rounding is an alternate process of dealing with overflow errors where a data value is represented by, or rounded off to, its nearest quantization level. If we're quantizing to integral values, the number 1.2 would be quantized to 1, and the number 1.6 would be quantized to 2. This is shown in Figure 12-9(a), where all values of x in the range of –0.5 ≤ x < 0.5 are set equal to 0, values of x in the range of 0.5 ≤ x < 1.5 are set equal to 1, x values in the range of 1.5 ≤ x < 2.5 are set equal to 2, etc. The probability density function of the error induced by rounding, in terms of the quantization level, is shown in Figure 12-9(b). In Figure 12-9(a), the quantization level q is 1, so, in this case, we can have rounding error magnitudes no greater than q/2, or 1/2. Again, using our Eqs. (D-11) and (D-12) results from Appendix D, the mean and variance of our uniform roundoff probability density function are expressed as
Figure 12-9. Roundoff: (a) quantization nonlinearities; (b) error probability density function.
Equation 12-25

$$\mu_{\mathrm{roundoff}} = 0$$
and
Equation 12-26

$$\sigma^2_{\mathrm{roundoff}} = \frac{q^2}{12}$$
Because the mean (average) and maximum error induced by roundoff are less than those of truncation, rounding is generally preferred in practice to minimize overflow errors.
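A quick simulation contrasts the two quantizers' error statistics; this sketch (arbitrary q, data range, and sample count, chosen only for illustration) should reproduce the means and variances of Eqs. (12-22), (12-23), (12-25), and (12-26):

    import numpy as np

    rng = np.random.default_rng(2)
    q = 1.0                                    # quantization level
    x = rng.uniform(-100, 100, 1_000_000)      # arbitrary high-precision data values

    e_trunc = q * np.floor(x / q) - x          # truncation error, uniform over (-q, 0]
    e_round = q * np.round(x / q) - x          # rounding error, uniform over [-q/2, q/2]

    print("truncation: mean %+.4f (theory -q/2), variance %.4f (theory q^2/12)"
          % (e_trunc.mean(), e_trunc.var()))
    print("rounding  : mean %+.4f (theory  0),   variance %.4f (theory q^2/12)"
          % (e_round.mean(), e_round.var()))

Notice that the two error variances match; it's the nonzero mean (a bias) and the doubled maximum error that make truncation the less attractive choice.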
In digital signal processing, statistical analysis of quantization error effects is exceedingly complicated. Analytical results depend on the types of quantization errors, the magnitude of the data being represented, the numerical format used, and which of the many FFT or digital filter structures we happen to use. Be that as it may, digital signal processing experts have developed simplified error models whose analysis has proved useful. Although discussion of these analysis techniques and their results is beyond the scope of this introductory text, many references are available for the energetic reader [10–18]. (Reference [11] has an extensive reference list of its own on the topic of quantization error analysis.)
Again, the overflow problems using fixed-point binary formats, which we try to alleviate with truncation or roundoff, arise because so many digital signal processing algorithms comprise large numbers of additions or multiplications. Hardware designers avoid this obstacle, particularly in hardware implementations of digital filters and the FFT, through the use of floating-point binary formats.