7.13 Embedded Audio in SDI

In component SDI, there is provision for ancillary data packets to be sent during blanking [10, 12]. The high clock rate of component means that there is capacity for up to 16 audio channels sent in four groups. Composite SDI has to convey the digitized analog sync edges and bursts, and only the sync tip is available for ancillary data. As a result of this and the lower clock rate, composite has much less capacity for ancillary data than component, although it is still possible to transmit one audio data packet carrying four audio channels in a single group. Figure 7.38(a) shows where the ancillary data may be located for PAL and (b) shows the locations for NTSC.

Figure 7.38a: Ancillary data locations for PAL.

As was shown in Chapter 4, the data content of the AES/EBU digital audio subframe consists of validity (V), user (U) and channel status (C) bits, a 20-bit sample and four auxiliary bits which may optionally be appended to the main sample to produce a 24-bit sample. The AES recommends sampling rates of 48, 44.1 and 32 kHz, but the interface permits variable sampling rates. SDI has various levels of support for the wide range of audio possibilities and these levels are defined in Figure 7.39. The default or minimum level is Level A, which operates only with a video-synchronous 48 kHz sampling rate and transmits V, U, C and the main 20-bit sample only. As Level A is the default, it need not be signalled to a receiver: the presence of IDs in the ancillary data is enough to ensure correct decoding. However, all other levels require an audio control packet to be transmitted to teach the receiver how to handle the embedded audio data. The audio control packet is transmitted once per field, in the second horizontal ancillary space after the video switching point, before any associated audio sample data. One audio control packet is required per group of audio channels.

Figure 7.38b: Ancillary data locations for NTSC.
Figure 7.39: The different levels of implementation of embedded audio. Level A is default.

If it is required to send 24-bit samples, the additional four bits of each sample are placed in extended data packets that must directly follow the associated group of audio samples in the same ancillary data space.

There are thus three kinds of packet used in embedded audio: the audio data packet which carries up to four channels of digital audio, the extended data packet and the audio control packet.

In component systems, ancillary data begins with a reversed TRS or sync pattern. Normal video receivers will not detect this pattern and so ancillary data cannot be mistaken for video samples. The ancillary data TRS consists of all zeros followed by all ones twice. There is no separate TRS for ancillary data in composite; immediately following the usual TRS, there will be an ancillary data flag whose value must be 3FC₁₆. Following the ancillary TRS or data flag is a data ID word containing one of a number of standardized codes which tell the receiver how to interpret the ancillary packet. Figure 7.40 shows a list of ID codes for the various types of packets. Next come the data block number and the data count parameters. The data block number increments by 1 on each instance of a block with a given ID number; on reaching 255 it overflows and recommences counting. The data count parameter specifies how many symbols of data are being sent in this block. Typical values for the data count are 36₁₀ for a small packet and 48₁₀ for a large packet. These parameters help an audio extractor to assemble contiguous data relating to a given set of audio channels.
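As an illustration of the header structure just described, the following Python sketch (the function name is hypothetical, not from any standard library) assembles the start of a component ancillary data packet: the three-word ancillary TRS, the data ID, the data block number, and the data count.

```python
def make_anc_header(data_id: int, block_number: int, data_count: int) -> list[int]:
    """Assemble the header of a component SDI ancillary data packet.

    Each entry is a 10-bit word. The ancillary TRS is all zeros followed
    by all ones twice; a composite packet would instead place the single
    ancillary data flag 3FC(hex) after the normal TRS.
    """
    anc_trs = [0x000, 0x3FF, 0x3FF]   # component ancillary TRS
    dbn = block_number % 256          # a simple modulo wrap is assumed here
    return anc_trs + [data_id, dbn, data_count]

# A small audio data packet for group 1 (ID 2FF, see Figure 7.40)
# carrying 36 data symbols:
header = make_anc_header(0x2FF, block_number=7, data_count=36)
```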

 

              Group 1   Group 2   Group 3   Group 4
Audio data      2FF       1FD       1FB       2F9
Audio CTL       1EF       2EE       2ED       1EC
Ext. data       1FE       2FC       2FA       1F8

Figure 7.40: The different packet types have different ID codes as shown here.
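The codes of Figure 7.40 lend themselves to a small lookup table. The sketch below (illustrative Python) records each ID by packet type and group, then inverts the table so a receiver can classify an incoming ID word:

```python
# ID codes from Figure 7.40, keyed by (packet_type, group_number).
ANC_IDS = {
    ("audio_data", 1): 0x2FF, ("audio_data", 2): 0x1FD,
    ("audio_data", 3): 0x1FB, ("audio_data", 4): 0x2F9,
    ("audio_ctl",  1): 0x1EF, ("audio_ctl",  2): 0x2EE,
    ("audio_ctl",  3): 0x2ED, ("audio_ctl",  4): 0x1EC,
    ("ext_data",   1): 0x1FE, ("ext_data",   2): 0x2FC,
    ("ext_data",   3): 0x2FA, ("ext_data",   4): 0x1F8,
}

# Inverted table: classify a received data ID word.
ID_TO_TYPE = {v: k for k, v in ANC_IDS.items()}

assert ID_TO_TYPE[0x2FF] == ("audio_data", 1)
```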

Figure 7.41 shows the structure of the audio data packing. In order to prevent accidental generation of reserved synchronizing patterns, bit 9 is the inverse of bit 8, so the effective system word length is nine bits. Three nine-bit symbols are used to convey all of the AES/EBU subframe data except for the four auxiliary bits. Since four audio channels can be conveyed, there are two 'Ch' or channel number bits specifying the audio channel number to which the subframe belongs. A further bit, Z, specifies the beginning of the 192-sample channel status message. V, U and C have the same significance as in the normal AES/EBU standard, but the P bit reflects parity on the three nine-bit symbols rather than following the AES/EBU definition. The three-word sets representing an audio sample are then repeated for the remaining three channels in the packet, but with different combinations of the Ch bits.

Figure 7.41: AES/EBU data for one audio sample is sent as three nine-bit symbols. A = audio sample. Bit Z = AES/EBU channel status block start bit.
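A hedged sketch of this packing follows. The general arrangement (Z and the Ch bits in the first symbol; V, U, C and P in the last; b9 always the inverse of b8) is as described above, but the exact bit positions used here are an illustrative assumption rather than a quotation of the standard:

```python
def pack_subframe(sample: int, ch: int, z: int, v: int, u: int, c: int) -> list[int]:
    """Pack one AES/EBU subframe (minus the four aux bits) into three
    10-bit words, of which nine bits each are payload.

    Assumed layout: symbol 0 = Z, Ch0-Ch1, audio bits 0-5;
    symbol 1 = audio bits 6-14; symbol 2 = audio bits 15-19, V, U, C, P.
    """
    a = sample & 0xFFFFF                                   # 20-bit sample
    sym0 = (z & 1) | ((ch & 3) << 1) | ((a & 0x3F) << 3)
    sym1 = (a >> 6) & 0x1FF
    sym2 = ((a >> 15) & 0x1F) | ((v & 1) << 5) | ((u & 1) << 6) | ((c & 1) << 7)

    # P gives even parity over b0-b8 of all three symbols (P included),
    # replacing the AES/EBU parity definition as the text notes.
    p = (bin(sym0).count("1") + bin(sym1).count("1") + bin(sym2).count("1")) & 1
    sym2 |= p << 8

    words = []
    for s in (sym0, sym1, sym2):
        b8 = (s >> 8) & 1
        words.append(s | ((1 - b8) << 9))   # b9 is always the inverse of b8
    return words
```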

One audio sample in each of the four channels of a group requires 12 video sample periods, and so packets will contain multiples of 12 symbols. At the end of each packet a checksum is calculated on the entire packet contents.
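The checksum convention assumed in the sketch below is the common ancillary-data one, a nine-bit sum over b0-b8 of the packet words with b9 of the result set to the inverse of b8; the precise set of words covered is an assumption for illustration:

```python
def anc_checksum(words: list[int]) -> int:
    """Nine-bit checksum over packet words (b0-b8 only); b9 of the
    returned 10-bit word is the inverse of its b8."""
    s = sum(w & 0x1FF for w in words) & 0x1FF
    return s | ((1 - ((s >> 8) & 1)) << 9)
```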

If 24-bit samples are required, extended data packets must be employed, in which the additional four bits of each audio sample in an AES/EBU frame are assembled in pairs according to Figure 7.42. Thus for every 12 symbols conveying the four 20-bit audio samples of one group in an audio data packet, two extra symbols will be required in an extended data packet.

Figure 7.42: The structure of an extended data packet.
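Since two 4-bit extensions fit comfortably in one nine-bit symbol, the pairing costs just two extra symbols per group sample period, as stated above. The bit positions in this sketch are an illustrative assumption:

```python
def pack_aux_pair(aux_a: int, aux_b: int) -> int:
    """Combine the 4-bit sample extensions of two subframes into one
    10-bit word (assumed layout: first channel in b0-b3, second in b4-b7)."""
    s = (aux_a & 0xF) | ((aux_b & 0xF) << 4)
    return s | ((1 - ((s >> 8) & 1)) << 9)   # b9 = inverse of b8

# Four channels' aux nibbles become two extension symbols per sample period:
ext_symbols = [pack_aux_pair(0xA, 0x3), pack_aux_pair(0x7, 0xF)]
```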

The audio control packet structure is shown in Figure 7.43. Following the usual header are symbols representing the audio frame number, the sampling rate, the active channels, the processing delay and some reserved symbols. The sampling rate parameter allows the two AES/EBU channel pairs in a group to have different sampling rates if required. The active channel parameter simply describes which channels in a group carry meaningful audio data. The processing delay parameter denotes the delay the audio has experienced measured in audio sample periods. The parameter is a 26-bit two's complement number requiring three symbols for each channel. Since the four audio channels in a group are generally channel pairs, only two delay parameters are needed. However, if four independent channels are used, one parameter each will be required. The e bit denotes whether four individual channels or two pairs are being transmitted.

Figure 7.43: The structure of an audio control packet.
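The fields just listed can be modelled as below. This container is illustrative only (it is not the on-wire symbol layout), and `delay_to_symbols` shows one plausible way a 26-bit two's complement delay could be spread over three nine-bit symbols:

```python
from dataclasses import dataclass

@dataclass
class AudioControlPacket:
    frame_numbers: tuple[int, int]    # one per channel pair in the group
    sampling_rates: tuple[int, int]   # the two AES/EBU pairs may differ
    active_channels: int              # bit mask of channels carrying audio
    delays: tuple[int, ...]           # processing delay in sample periods
    independent_channels: bool        # the 'e' bit: four singles vs two pairs

def delay_to_symbols(delay: int) -> list[int]:
    """Encode a delay as a 26-bit two's complement value split across
    three nine-bit symbols (low bits first; the ordering is assumed)."""
    val = delay & ((1 << 26) - 1)     # two's complement within 26 bits
    return [(val >> (9 * i)) & 0x1FF for i in range(3)]
```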

The frame number parameter comes about in 525-line systems because the frame rate is 29.97 Hz rather than exactly 30 Hz. The resultant frame period does not contain a whole number of audio samples; an integer ratio is only obtained over the multiple-frame sequence shown in Figure 7.44. The frame number conveys the position in the frame sequence. At 48 kHz, odd frames hold 1602 samples and even frames hold 1601 samples in a five-frame sequence. At 44.1 and 32 kHz the relationship is not so simple, and to obtain the correct number of samples in the sequence certain frames (exceptions) have the number of samples altered. At 44.1 kHz the frame sequence is 100 frames long, whereas at 32 kHz it is 15 frames long. As the two channel pairs in a group can have different sampling rates, two frame parameters are required per group. In 50 Hz systems all three sampling rates allow an integer number of samples per frame and so the frame number is irrelevant.

Figure 7.44: The origin of the frame sequences in 525 line systems.
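The sequence lengths are easy to verify from the sampling rates and the 525-line frame rate of 30000/1001 Hz; for instance, five frames at 48 kHz contain exactly 48000 x 5 x 1001/30000 = 8008 samples, which the 1602/1601 alternation supplies:

```python
from fractions import Fraction

FRAME_RATE = Fraction(30000, 1001)     # 29.97 Hz, 525-line systems

def sequence_length(fs: int) -> int:
    """Frames needed before the sample count per sequence is an integer."""
    return (Fraction(fs) / FRAME_RATE).denominator

assert sequence_length(48000) == 5     # 1602+1601+1602+1601+1602 = 8008
assert sequence_length(44100) == 100
assert sequence_length(32000) == 15
assert 3 * 1602 + 2 * 1601 == 48000 * 5 * 1001 // 30000 == 8008
```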

As the ancillary data transfer is in bursts, it is necessary to provide a little RAM buffering at both ends of the link to allow real-time audio samples to be time-compressed up to the video bit rate at the input and expanded back again at the receiver. Figure 7.45 shows a typical audio insertion unit in which the FIFO buffers can be seen. In such a system all that matters is that the average audio data rate is correct; instantaneously there can be timing errors within the range of the buffers. Audio data cannot be embedded at the video switching point or in the areas reserved for EDH packets, but provided that data are evenly spread throughout the frame, 20-bit audio can be embedded and retrieved with about 48 audio samples of buffering. If the additional four bits per sample are sent, this requirement rises to 64 audio samples. The buffering stages cause the audio to be delayed with respect to the video by a few milliseconds at each insertion. Whilst this is not serious, Level I allows a delay-tracking mode in which the embedding logic transmits the encoding delay so that a subsequent receiver can compute the overall delay. If the range of the buffering is exceeded for any reason, such as a non-synchronous audio sampling rate fed to a Level A encoder, audio samples are periodically skipped or repeated in order to bring the delay under control.

Figure 7.45: A typical audio insertion unit. See text for details.

It is permitted for receivers that can only handle 20-bit audio to discard the four-bit sample extension data. However, the presence of the extension data requires more buffering in the receiver. A device having a buffer of only 48 samples for Level A working could experience an overflow due to the presence of the extension data.

In 48 kHz working, the average number of audio samples per channel is just over three per video line. In order to maintain the correct average audio sampling rate, the number of samples sent per line is variable and not specified in the standard. In practice a transmitter generally switches between packets containing three samples and packets containing four samples per channel per line as required to keep the buffers from overflowing. At lower sampling rates either smaller packets can be sent or packets can be omitted from certain lines.
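One plausible transmitter policy (a sketch, not anything mandated by the standard) is to track the exact fractional rate with an accumulator and send the integer difference on each line, which naturally alternates between three and four samples at 48 kHz:

```python
from fractions import Fraction

# Exact samples per line: 48 kHz over 525 lines at 30000/1001 Hz (~3.05).
SAMPLES_PER_LINE = Fraction(48000) / (Fraction(30000, 1001) * 525)

def samples_for_line(line_index: int) -> int:
    """Samples to send on this line so the running total tracks the rate."""
    return int(SAMPLES_PER_LINE * (line_index + 1)) - int(SAMPLES_PER_LINE * line_index)

counts = [samples_for_line(i) for i in range(525)]
assert set(counts) == {3, 4}          # lines carry three or four samples
assert sum(counts) == 1601            # one frame of the Figure 7.44 sequence
```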

As a result of the switching, ancillary data packets in component video occur mostly in two sizes. The larger packet is 55 words in length of which 48 words are data. The smaller packet contains 43 words of which 36 are data. There is space for two large packets or three small packets in the horizontal blanking between EAV and SAV.

A typical embedded audio extractor is shown in Figure 7.46. The extractor recognizes the ancillary data TRS or flag and then decodes the ID to determine the content of the packet. The group and channel addresses are then used to direct extracted symbols to the appropriate audio channel. A FIFO memory is used to timebase expand the symbols to the correct audio sampling rate.

Figure 7.46: A typical audio extractor. Note the FIFOs for timebase expansion of the audio samples.
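To close the loop, here is a hedged sketch of the extractor's routing step, the inverse of the packing sketch given earlier (relying on the same assumed, illustrative bit layout): symbols are taken three at a time, the Ch bits select the destination, and samples are pushed into per-channel FIFOs for timebase expansion.

```python
from collections import deque

def extract_samples(udw: list[int], fifos: dict[int, deque]) -> None:
    """Demultiplex the user data words of an audio data packet into
    per-channel FIFOs (same assumed layout as the packing sketch)."""
    for i in range(0, len(udw) - 2, 3):
        s0, s1, s2 = (w & 0x1FF for w in udw[i:i + 3])   # strip b9
        ch = (s0 >> 1) & 3                               # channel number 0-3
        sample = ((s0 >> 3) & 0x3F) | (s1 << 6) | ((s2 & 0x1F) << 15)
        fifos[ch].append(sample)                         # awaits timebase expansion

fifos = {ch: deque() for ch in range(4)}
```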

