9.4 Audio Streaming


Above the transport layer is the media-coding layer. Sound is typically captured using an Analog to Digital Converter (ADC), which periodically samples a continuous measure of pressure and converts it into a discrete digital value. The two parameters that control this process are sample rate and resolution. The sample rate is the frequency at which the signal (e.g., pressure) is sampled. The resolution is the number of bits used by the ADC to digitize the continuous pressure level. Typical values are multiples of 44,100 samples per second and 16 bits per sample.
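As a quick check of these two parameters, the raw data rate of uncompressed audio is simply the product of sample rate, resolution, and channel count. A minimal sketch (the helper name is ours, not part of any standard):

```python
def pcm_data_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw (uncompressed) PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

# CD-quality stereo: 44,100 samples/s, 16 bits per sample, 2 channels.
print(pcm_data_rate(44_100, 16, 2))  # 1411200, i.e., about 1.4 Mbps
```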

There are two major audio streaming formats: MPEG-2 and Dolby AC-3. DVB audio is MPEG-2 based and follows ISO/IEC 13818-3, with the minimum requirements for the interoperability of baseline receivers specified in TR 101 154. ATSC audio, A/52, is based on Dolby AC-3. Both formats are derivatives of PCM audio.

9.4.1 Pulse Code Modulation (PCM) Audio

Once captured, the signal is encoded. Pulse Code Modulation (PCM) is the simplest and most common method of (uncompressed) encoding of an analog sound signal into a digital bit-stream. It is not restricted to, e.g., speech signals, but codes all types of sound signals. First, the amplitude of the signal is encoded using Pulse Amplitude Modulation (PAM). Next, the PAM sample is coded (quantized) into a binary (digital) number, namely a sequence of 0's and 1's. The result comprises the raw sampled data, as is, without any compression.
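The two stages, sampling (PAM) followed by quantization (PCM), can be sketched as follows; `pcm_encode` and the test tone are illustrative names, not part of any standard:

```python
import math

def pcm_encode(signal, sample_rate, duration, bits=16):
    """Sample a continuous signal and quantize each sample to a signed integer."""
    max_level = 2 ** (bits - 1) - 1          # 32767 for 16-bit PCM
    n = int(sample_rate * duration)
    samples = []
    for i in range(n):
        t = i / sample_rate                  # sampling instant (PAM stage)
        amplitude = signal(t)                # continuous value in [-1.0, 1.0]
        samples.append(round(amplitude * max_level))  # quantization (PCM stage)
    return samples

# 1 kHz sine tone, 8 samples taken at 8 kHz:
tone = lambda t: math.sin(2 * math.pi * 1000 * t)
print(pcm_encode(tone, 8000, 0.001))
```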

PCM was introduced to improve on PAM and Pulse Duration Modulation (PDM) techniques (see Figure 9.6). The disadvantage of PAM is that any noise riding on the signal changes the pulse height, thereby introducing distortion. One solution to this problem is to use PDM, in which the PAM signal goes to a signal generator that produces new pulses of uniform height but varying length. The information is then carried by the pulse length, or duration, rather than by the pulse height as in PAM. PDM is somewhat less susceptible to noise than PAM. However, any distortion of the pulse shape may change the apparent pulse duration, thereby producing a distorted output signal.

Figure 9.6. Pulse Modulation techniques.

PCM offers a method of overcoming some of the disadvantages of other types of pulse modulation. In PCM, the instantaneous amplitude of the sampled signal is represented by bits depicted as a series of pulses and spaces having the same height (i.e., amplitude) but varying durations (see Figure 9.6). To extract the signal, the receiving equipment only needs to detect the duration of each pulse or space. A distorted pulse does not degrade the signal as long as the relative durations of the pulses and spaces can still be recognized. Because information is carried in pulse durations rather than analog amplitudes, PCM is less sensitive to amplitude distortion and noise.

9.4.1.1 Encoding

The series of pulses representing a single sample from a single channel is called a word. One complete sampling cycle, including a word from each channel, is called a minor frame. Each pulse or space in the word is called a bit (derived from binary digit). The total number of bits in each word determines the resolution, which is a measure of the number of different discrete signal levels that can be identified and coded.

The actual encoding occurs in an ADC. One such ADC encoding process is known as successive approximation . It is done by comparing the amplitude of the analog sample to a series of precision voltages, one for each digit in the code. The first voltage is 50% of maximum amplitude, the second is 25%, the third is 12.5%, and so on. When a sample is compared to the first voltage, the first digit of the code is determined. If the sample is greater than 50% of maximum, logic circuits generate a binary 1 in the Most Significant Bit (MSB) position. If it is less than 50%, the first digit is a 0.
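The successive-approximation loop described above can be sketched as follows, assuming a normalized sample and full-scale value (the function name is ours):

```python
def successive_approximation(sample, full_scale, bits=8):
    """Encode one analog sample by comparing it against halved reference voltages."""
    code = 0
    reference = 0.0
    step = full_scale / 2                # first comparison voltage: 50% of maximum
    for _ in range(bits):
        if sample >= reference + step:   # sample exceeds the trial voltage:
            code = (code << 1) | 1       # emit a 1 in this bit position, keep it
            reference += step
        else:
            code = code << 1             # emit a 0, discard the trial voltage
        step /= 2                        # next voltage: 25%, 12.5%, ...
    return code

# 0.7 of full scale with 4 bits -> binary 1011 (0.5 + 0.125 + 0.0625 = 0.6875)
print(format(successive_approximation(0.7, 1.0, bits=4), '04b'))
```

Note that the first comparison determines the Most Significant Bit, exactly as in the 50% test described in the text.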

There are a number of modulation techniques to represent the logic levels ONE and ZERO, such as the presence or absence of a pulse with respect to time, a switch from one voltage level to another, etc. Information on these other modulation techniques can be found in the Inter-Range Instrumentation Group (IRIG) standards.

9.4.1.2 Adaptive Differential Pulse Code Modulation (ADPCM)

ADPCM is a speech coding method that uses fewer bits than traditional PCM. It calculates the difference between two consecutive speech samples in standard PCM-coded voice signals. This difference is encoded using an adaptive filter and is therefore transmitted at a lower rate than the standard 64 Kbps technique. Typically, ADPCM allows an analog voice conversation to be carried within a 32 Kbps digital channel; 3 or 4 bits are used to describe each sample, each representing the difference between two adjacent samples. Sampling is done 8000 times a second. ADPCM, which many voice processing vendors use, encodes voice signals in half the space PCM requires. In short, ADPCM is a reduced-bitrate variant of PCM audio encoding.
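The following sketch illustrates the ADPCM idea: each inter-sample difference is quantized into a 4-bit code, and the quantizer step size adapts to the signal. This is a simplified illustration of the principle, not the bit layout or adaptation table of the G.726 or IMA ADPCM standards:

```python
def _adapt(step, code):
    """Widen the step after large differences, narrow it after small ones."""
    if abs(code) >= 6:
        return min(2048, step * 2)
    if abs(code) <= 1:
        return max(4, step // 2)
    return step

def adpcm_encode(samples, step=16):
    """Encode integer PCM samples as 4-bit signed difference codes (-8..7)."""
    codes, predicted = [], 0
    for s in samples:
        diff = s - predicted
        code = max(-8, min(7, round(diff / step)))  # quantize the difference
        codes.append(code)
        predicted += code * step     # track what the decoder will reconstruct
        step = _adapt(step, code)
    return codes

def adpcm_decode(codes, step=16):
    samples, predicted = [], 0
    for code in codes:
        predicted += code * step
        samples.append(predicted)
        step = _adapt(step, code)    # same adaptation keeps both sides in lockstep
    return samples

print(adpcm_encode([0, 100, 210, 320]))
print(adpcm_decode(adpcm_encode([0, 100, 210, 320])))
```

Because the decoder applies the identical adaptation rule, no step-size information needs to be transmitted, which is where the bitrate saving comes from.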

9.4.1.3 A-Law

A-Law is the PCM coding and companding standard used in Europe and in areas outside of North American influence. A-law encoding is the method of encoding sampled audio waveforms used in the 2.048 Mbps, 30-channel PCM primary system known as E-carrier.

9.4.1.4 Mu-Law (u-Law)

u-Law is the PCM voice coding and companding standard used in Japan and North America. In this encoding method, the analog voice signal is sampled 8000 times per second, with each sample represented by an 8 bit value, yielding a raw 64 Kbps transmission rate. A sample byte consists of a sign bit, a 3 bit segment specifying a logarithmic range, and a 4 bit step offset into the range. All bits of the sample are inverted before transmission.
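The segment/offset structure approximates the continuous mu-law companding curve. The sketch below implements only that curve (with the standard constant mu = 255), not the exact sign/segment/offset byte packing and inversion described above:

```python
import math

MU = 255  # companding constant of the North American / Japanese standard

def mu_law_compress(x):
    """Map a sample in [-1, 1] through the mu-law companding curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of the companding curve."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Quiet samples are boosted before quantization, so they keep more resolution:
print(round(mu_law_compress(0.01), 3))   # a 1% sample uses ~23% of the range
print(round(mu_law_expand(mu_law_compress(0.5)), 6))
```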

9.4.2 Dolby AC-3

A Dolby AC-3 (or ATSC A52) audio stream is a sequence of self contained frames [DOLBY], each made up of the following:

  • A synchronization information header, which includes the following:

    • A sync word, used for acquiring and maintaining synchronization

    • An indication of the sampling rate, 48 kHz, 44.1 kHz or 32 kHz

    • The size of the sync frame

  • A bit-stream information (BSI) header, which includes the sync frame's time stamp,

  • 6 audio blocks (AB0-AB5), each block represents 256 new audio samples,

  • An auxiliary data field (AUX),

  • An error check field, a Cyclic Redundancy Check (CRC).

Each sync frame is a complete, independent data unit; it does not require any other data to be decoded. A complete sync frame is presented to the decoder for decompression. An incomplete sync frame does not pass the decoder's error detection test, causing the decoder to mute. At 48 kHz this can cause a maximum of 64 ms of muted audio (if the decoder is unable to synchronize with the immediately following sync word).

All sync frames within a sequence are the same size. Frame sizes range from 128 bytes to 3840 bytes. At 48 kHz each sync frame represents 32 ms of audio data (each audio block is 5.33 ms).
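These figures follow directly from the frame structure: 6 blocks of 256 samples give 1536 samples per sync frame, which at 48 kHz spans 32 ms; losing a frame plus the following sync word thus mutes at most two frames, the 64 ms cited above. A quick check:

```python
SAMPLE_RATE = 48_000        # Hz
SAMPLES_PER_BLOCK = 256
BLOCKS_PER_FRAME = 6

samples_per_frame = SAMPLES_PER_BLOCK * BLOCKS_PER_FRAME  # 1536 samples
frame_ms = 1000 * samples_per_frame / SAMPLE_RATE         # 32.0 ms per sync frame
block_ms = 1000 * SAMPLES_PER_BLOCK / SAMPLE_RATE         # ~5.33 ms per audio block
print(samples_per_frame, frame_ms, round(block_ms, 2), 2 * frame_ms)
```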

The AC-3 digital compression algorithm can encode from 1 to 5.1 channels of source audio from a PCM representation into a serial bit-stream at data rates ranging from 32 Kbps to 448 Kbps. The 0.1 channel refers to a fractional bandwidth channel intended to convey only low frequency (subwoofer) signals.

9.4.3 MP3 (ISO/MPEG-1 Audio Layer-3)

MPEG is the major audio format adopted by DVB. MPEG-1 audio standardizes three different coding schemes for digitized sound waves, called Layers I, II, and III [MPEG]. None of the three standardizes the encoder; rather, they standardize the type of information that an encoder has to produce and write to a conformant bit-stream, as well as the way in which the decoder has to parse, decompress, and resynthesize this information to regain the encoded sound.

Layer I has the lowest complexity and is specifically suitable for applications that require simple, low-cost (often negligible-cost) encoders. Layer II requires a more complex encoder and a slightly more complex decoder, and is directed towards one-to-many applications, in which one encoder serves many decoders. Compared to Layer I, Layer II is able to remove more of the signal redundancy and to apply the psychoacoustic threshold more efficiently.

Layer III is the most complex of the three, emphasizing support for lower-bitrate applications through perceptual audio coding rather than lossless coding. In lossless coding, redundancy in the waveform is reduced to compress the sound signal, and the decoded sound wave does not differ from the original sound wave [MP3]. In contrast, a perceptual audio codec does not attempt to retain the input signal exactly after encoding and decoding; rather, its goal is to ensure that the output signal sounds the same to a human listener. After standardization of MPEG-2, sound files encoded with the MPEG-2 lower-sampling-rate extension of Layer III are also called MP3 files; MP3 is often wrongly called MPEG-3.

9.4.3.1 The Minimal Audition Threshold

A psychoacoustic model is used in the encoding stage and plays the major role in audio coding. This model achieves better coding efficiency by taking advantage of the human auditory system. The sensitivity of the human auditory system to audio signals varies across the frequency domain; the minimal audition threshold of the ear is not uniform. The equal-loudness curves of Fletcher and Munson [FRQSENS] indicate that humans are most sensitive to sounds between 2.5 and 5 KHz; outside this range, sensitivity drops sharply. This sensitivity is represented by the threshold in quiet, which is raised by the masking effect of a tonal or noisy audio signal. Any tone below this threshold is not perceived. Improved compression is therefore achieved by encoding only audible sounds, namely sounds above the threshold in quiet.
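A widely used closed-form fit of the threshold in quiet is Terhardt's approximation, sketched below. We use it purely as an illustration; the MPEG standard does not mandate this particular formula:

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate absolute hearing threshold in dB SPL (Terhardt's fit)."""
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive where the threshold is lowest (around 2.5-5 kHz):
for f in (100, 1000, 3300, 10000):
    print(f, round(threshold_in_quiet_db(f), 1))
```

An encoder can safely discard any spectral component whose level falls below this curve (further raised by masking, as described next).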

9.4.3.2 The Masking Effect

Tone masking is a psychoacoustic effect that is used to further improve compression efficiency. When a loud tone is present, nearby tones that fall below its masking threshold are not heard. For every tone in the audio signal, a masking threshold can be calculated. Encoding bitrate allocation is done by assigning the bit resource to audible elements, namely those that are heard, while the inaudible elements are eliminated. The threshold in quiet is used as the minimum threshold of audibility.

9.4.3.3 The Reservoir of Bytes

Often, a passage of the audio track cannot be coded at the target bitrate without reducing quality. In such cases, MP3 uses a short reservoir of bytes that acts as a buffer, borrowing capacity from passages that can be coded at a lower rate within the given stream.

9.4.3.4 The Joint Stereo

In the case of a stereophonic signal, the MP3 format can use a few more tools, referred to as Joint Stereo (JS) coding, to further shrink the compressed file size.

The first JS tool takes advantage of the observation that, for very low and very high frequencies, the human ear is not able to locate the spatial origin of sounds with accuracy. The technique exploiting this effect is called Intensity Stereo (IS), according to which some frequencies are recorded as a monophonic signal, followed by additional information to restore a minimum of spatialization.

The second JS tool is called Mid/Side (M/S) stereo. When the left and the right channels are sufficiently similar, a middle (L+R) and a side (L-R) channel are encoded instead of left and right. The L-R channel is minimal and therefore compresses much better, reducing the final bitrate (or file size). During playback, the MP3 decoder automatically reconstructs the left and right channels.
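A sketch of the M/S transform on integer samples (scaling conventions vary between implementations; this unscaled variant keeps the transform exactly lossless):

```python
def ms_encode(left, right):
    """Mid = L + R, Side = L - R; similar channels yield a near-zero side channel."""
    return ([l + r for l, r in zip(left, right)],
            [l - r for l, r in zip(left, right)])

def ms_decode(mid, side):
    """Reconstruct L and R; mid + side = 2L and mid - side = 2R are always even."""
    return ([(m + s) // 2 for m, s in zip(mid, side)],
            [(m - s) // 2 for m, s in zip(mid, side)])

L = [500, 600, 550]
R = [498, 601, 550]
mid, side = ms_encode(L, R)
print(side)                       # small values, cheap to encode
assert ms_decode(mid, side) == (L, R)
```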

9.4.3.5 Huffman Coding

MP3 also uses the classic Huffman coding algorithm. This coding creates variable-length codes on a whole number of bits; higher-probability symbols receive shorter codes. Huffman codes have the unique-prefix property, so they can be decoded correctly in spite of their variable length. The decoding step is very fast (via a lookup table). This kind of coding saves on average a bit less than 20% of space.
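A compact sketch of Huffman code construction on a toy symbol stream. The helper is illustrative; a real MP3 codec uses the fixed Huffman tables defined in the standard rather than building codes on the fly:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a prefix-free code: frequent symbols get shorter codewords."""
    freq = Counter(symbols)
    # heap entries: (weight, unique tiebreaker, {symbol: code-so-far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # merge the two lightest subtrees,
        w2, _, c2 = heapq.heappop(heap)   # prefixing their codes with 0 and 1
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

data = "aaaabbc"
codes = huffman_codes(data)
encoded = "".join(codes[s] for s in data)
# 'a' (the most frequent symbol) gets a 1-bit code; the 10-bit result beats
# the 14 bits a fixed 2-bit-per-symbol code would need.
print(codes, len(encoded))
```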

Huffman coding is an ideal complement to perceptual coding. The Huffman algorithm is seldom efficient for dense polyphonic passages, for which perceptual coding is very efficient: many sounds are masked or attenuated, leaving little but identical or repetitive information. Huffman is, in turn, very efficient for pure sounds, for which perceptual coding is inefficient. Although there are few masking effects in such material, the digitized sound contains many repetitive bytes, which are replaced by shorter codes.

9.4.3.6 Architecture

The generic architecture for perceptual audio streaming is depicted in Figure 9.7. The input audio signal is converted into a sub-sampled spectral representation using various types of analysis filter banks. The signal's time and frequency dependent perceptibility (masking) threshold is estimated by the perceptual model; this calculates the maximum quantization error that can be introduced without degradation of quality. Subsequently, spectral values are quantized and coded using parameters computed by the perceptual model. Finally, the bit-stream is multiplexed and modulated to enable its transmission. The processing stages in the reverse direction are similar, with the exception that there is no need for a perceptual model for decoding the signal.

Figure 9.7. The generic streaming architecture of a perceptual audio T/F encoder/decoder.

9.4.4 MPEG-4 AAC

MPEG-4 Advanced Audio Coding (AAC), also known as MPEG-2 non-backward-compatible audio, represents the current state of the art in natural audio coding. AAC requires a complex toolbox to perform a wide range of operations, from low-bitrate speech coding to high-quality audio coding and music synthesis [MPEG4]. These are essentially the same coding tools already present in MP3, with improved application methods. Bitrates range from low-bitrate speech coding (down to 2 Kbps) to high-quality audio coding (at 64 Kbps per channel and higher).

Instead of MP3's hybrid filter bank, AAC uses the Modified Discrete Cosine Transform (MDCT) with an increased window length of 2048 points. AAC can switch dynamically between block lengths of 2048 points and 256 points. In other words, long windows are nearly twice as long as MP3's, providing better frequency resolution, and short windows are smaller than MP3's, providing better transient handling and less pre-echo. If a signal change or transient occurs, the short window of 256 points is chosen for better time resolution. Otherwise, the longer 2048-point window is used to improve coding efficiency.

AAC provides the ability to toggle mid/side stereo on a sub-band basis instead of an entire-frame basis. It also provides the ability to toggle intensity stereo on a sub-band basis instead of using it only for a contiguous group of sub-bands. AAC also introduces some new tools over previous coding schemes. Temporal Noise Shaping (TNS) is a tool designed to control the location, in time, of the quantization noise by transmission of filtering coefficients. Prediction is a tool designed to enhance the compressibility of stationary signals.

The most processing-intensive AAC decoding algorithms are prediction, intensity/coupling, and the filter bank, which includes the inverse MDCT. These three algorithms consume 70 percent of the processing power required to run the complete AAC decoding algorithm.

9.4.5 Audio Transmission

To facilitate transmission, the compressed bit-stream is encapsulated in packets so as to enable switching, multiplexing, re-multiplexing, and synchronized reconstruction (i.e., real-time guarantees) and rendering by receivers. In that respect, MPEG and AC-3 audio are similar (but not identical). An audio stream may be combined with another audio stream by interleaving packets from the two streams. Additional streams may be added to an existing stream by inserting additional packets, representing the added streams, between the packets of the existing streams. Removing packets can be achieved by replacing them with null packets. However, inserted packets need to be transmitted, and therefore bandwidth issues need to be considered when adding streams.

MPEG encapsulations, for example, utilize PIDs and association_tags to enable switching, multiplexing, and re-multiplexing, and utilize DTS and PTS time stamps to specify synchronization and real-time delivery and presentation requirements that receivers must satisfy. An example scenario is depicted in Figure 9.8, where a 5.1-channel audio program is converted from a PCM representation requiring more than 5.184 Mbps into a 384 Kbps serial bit-stream by the AC-3 encoder. This means that AC-3 digital compression reduces the bandwidth and power required for transmission by an order of magnitude (from 5.184 Mbps to 448 Kbps, or more typically 384 Kbps). Satellite transmission equipment converts this bit-stream into an RF transmission directed to a satellite transponder. The signal received from the satellite is demodulated back into the typical 384 Kbps serial bit-stream and decoded by the AC-3 decoder. The result is the original 5.1-channel audio program.
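The quoted 5.184 Mbps figure is consistent with 6 channels of 48 kHz PCM at 18 bits per sample; the per-sample width here is our inference, while the rates come from the text:

```python
# Assumed 5.1-channel PCM source: 6 channels, 48 kHz, 18 bits per sample.
pcm_bps = 6 * 48_000 * 18
ac3_bps = 384_000                     # typical AC-3 rate from the example

print(pcm_bps)                        # 5184000 -> the 5.184 Mbps cited above
print(pcm_bps / ac3_bps)              # ~13.5x, about an order of magnitude
```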

Figure 9.8. An example AC-3 audio encoding, transmission, and decoding.

A comparison between AC-3 and MPEG-2 audio compression is presented in Table 9.3. MPEG-1 supports a wide range of bitrates, from 32 Kbps to 320 Kbps. The Low Sampling Frequency (LSF) extension of MPEG-2 extends this range down to 8 Kbps.

AAC accommodates many more channels than MP3: up to 48 full channels and 16 low-frequency enhancement channels, compared to 5 full channels and 1 low-frequency enhancement channel for MP3. It can also handle higher sampling frequencies than MP3: up to 96 KHz compared to 48 KHz. MPEG formal listening tests have demonstrated that, for 2 channels, AAC provides slightly better audio quality at 96 Kbps than Layer III at 128 Kbps or Layer II at 192 Kbps.

Table 9.3. Audio Data Aggregate Specifications

Property                   | Linear PCM       | Dolby AC-3         | MPEG-2 Audio
---------------------------|------------------|--------------------|-------------------
Sampling Frequency         | 48 KHz or 96 KHz | 48 KHz             | 48 KHz
Number of bits per sample  | 16/20/24         | 16 bits compressed | 16 bits compressed
Max transfer rate          | 6.144 Mbps       | 448 Kbps           | 640 Kbps
Max number of channels     | 8                | 5.1                | 5.1 or 7.1

9.4.5.1 Encoding

The AC-3 encoder accepts PCM audio and produces an encoded bit-stream (see Figure 9.9). The AC-3 algorithm achieves high coding gain (the ratio of the input bitrate to the output bitrate) by coarsely quantizing a frequency domain representation of the audio signal. Initially, the PCM time samples are transformed into a sequence of blocks of frequency coefficients using an analysis filter bank. Overlapping blocks of 512 time samples are multiplied by a time window and transformed into the frequency domain. Due to the overlapping blocks, each PCM input sample is represented in two sequential transformed blocks. The frequency domain representation may then be decimated by a factor of two so that each block contains 256 frequency coefficients. The individual frequency coefficients are represented in binary exponential notation as a binary exponent and a mantissa. The set of exponents is encoded into a coarse representation of the signal spectrum that is referred to as the spectral envelope. This spectral envelope is used by the core bit allocation routine that determines how many bits to use to encode each individual mantissa. The spectral envelope and the coarsely quantized mantissas for 6 audio blocks (1536 audio samples per channel) are formatted into an AC-3 frame. The AC-3 bit-stream is a sequence of AC-3 frames.
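The binary exponential notation of the frequency coefficients can be sketched with Python's `frexp`; the 6-bit mantissa width and the helper names are illustrative, not the variable allocation AC-3's bit allocation routine actually computes:

```python
import math

def to_exp_mantissa(coeff, mantissa_bits=6):
    """Split a frequency coefficient into a binary exponent and a quantized mantissa."""
    if coeff == 0.0:
        return 0, 0
    mantissa, exponent = math.frexp(coeff)       # coeff = mantissa * 2**exponent
    q = round(mantissa * (1 << mantissa_bits))   # coarse mantissa quantization
    return exponent, q

def from_exp_mantissa(exponent, q, mantissa_bits=6):
    """Decoder side: rebuild an approximate coefficient."""
    return math.ldexp(q / (1 << mantissa_bits), exponent)

# Small coefficients keep their relative precision, because the exponent
# carries the magnitude and only the mantissa is quantized coarsely:
e, q = to_exp_mantissa(0.0123)
print(e, q, from_exp_mantissa(e, q))
```

Collecting the exponents of all coefficients yields exactly the coarse spectral envelope described above, which then drives how many mantissa bits each coefficient deserves.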

Figure 9.9. A simplified AC-3 encoding architecture.

MPEG encoding processes the input signal through a time-to-frequency mapping, resulting in spectrum components for subsequent coding (see Figure 9.10). The encoder contains a psychoacoustic model leveraging the psychoacoustic effect called auditory masking (described earlier). This model analyzes the input signal within consecutive time blocks and determines, for each block, the spectral components of the input audio signal by applying a frequency transform. It estimates the just-noticeable noise level for each frequency band, namely the threshold of masking. In its quantization and coding stage, the encoder allocates the available number of data bits in a way that meets both the bitrate and masking requirements.

Figure 9.10. A simplified MPEG-1 Audio encoding architecture.

The MPEG perceptual model relies on two types of auditory masking: frequency and temporal masking (see Figure 9.11) [FRQSENS]. With frequency masking, the quiet threshold around a loud signal's frequency is raised, dropping rapidly when moving away from that frequency. With regard to temporal masking, a counterintuitive effect is observed: in addition to masking weaker signals that immediately follow a given signal (post-masking), the arrival of the signal may also mask a weaker signal that occurred a short period prior to it (pre-masking).

Figure 9.11. The hearing threshold behavior for frequency and temporal masking.

9.4.5.2 Decoding

The decoding process is the inverse of the encoding process (see Figure 9.12). The decoder synchronizes its clock to the encoded bit-stream, checks for errors, and deformats the various types of data such as the encoded spectral envelope and the quantized mantissas. The bit allocation routine is run and the results are used to unpack and dequantize the mantissas. The spectral envelope is decoded to produce the exponents. The exponents and mantissas are transformed back into the time domain to produce the decoded PCM time samples.

Figure 9.12. A simplified AC-3 decoding architecture.

The decoding of an MPEG-1 bit-stream is not exactly the inverse of its encoding process. The irrelevant (as opposed to redundant) information eliminated by the psychoacoustic model is not reconstructed (see Figure 9.13). The result is an approximate signal which, while it sounds the same as the original, is not identical to it.

Figure 9.13. A simplified MPEG-1 decoding architecture.



ITV Handbook: Technologies and Standards
ISBN: 0131003127
Year: 2003
Pages: 170
