7.8 Video buffer verifier

A coded bit stream contains different types of pictures, and each type ideally requires a different number of bits to encode. In addition, the video sequence may vary in complexity with time, and it may be desirable to devote more coding bits to one part of a sequence than to another. For constant bit rate coding, varying the number of bits allocated to each picture requires that the decoder has a buffer to store the bits not needed to decode the immediate picture. The extent to which an encoder can vary the number of bits allocated to each picture depends on the size of this buffer. If the buffer is large an encoder can use greater variations, increasing the picture quality, but at the cost of increasing the decoding delay. The delay is the time taken to fill the input buffer from empty to its current level. An encoder needs to know the size of the decoder's input buffer in order to determine to what extent it can vary the distribution of coding bits among the pictures in the sequence.

In constant bit rate applications (for example decoding a bit stream from a CD-ROM), problems of synchronisation may occur. In these applications, the encoder should generate a bit stream that is perfectly matched to the device. The decoder will display the decoded pictures at their specific rate. If the display clock is not locked to the channel data rate, and this is typically the case, then any mismatch between the encoder and channel clock, and the display clock will eventually cause a buffer overflow or underflow. For example, assume that the display clock runs one part per million too slow with respect to the channel clock. If the data rate is 1 Mbit/s, then the input buffer will fill at an average rate of one bit per second, eventually causing an overflow. If the decoder uses the entire buffer to allocate bits between pictures, the overflow could occur more quickly. For example, suppose the encoder fills the buffer completely except for one byte at the start of each picture. Then overflow will occur after only 8 s!
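The drift arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name and its parameters are hypothetical and are not part of any standard.

```python
def seconds_to_overflow(headroom_bits, bit_rate, clock_error_ppm):
    """Time until the free space in the buffer is used up.

    headroom_bits   -- free space left in the input buffer (bits)
    bit_rate        -- channel data rate (bit/s)
    clock_error_ppm -- how slow the display clock runs relative to the
                       channel clock, in parts per million
    """
    # Bits that accumulate in the buffer each second because the decoder
    # consumes data slightly more slowly than the channel delivers it.
    surplus_bits_per_second = bit_rate * clock_error_ppm / 1_000_000
    return headroom_bits / surplus_bits_per_second


# One byte of headroom, 1 Mbit/s channel, display clock 1 ppm slow.
print(seconds_to_overflow(headroom_bits=8, bit_rate=1_000_000, clock_error_ppm=1))
```

With the figures used in the text (one byte of headroom, 1 Mbit/s, one part per million), the function returns 8 s.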

The model decoder is defined to resolve three problems: it constrains the variability in the number of bits that may be allocated to different pictures; it allows a decoder to initialise its buffer when the system is started; and it allows the decoder to maintain synchronisation while the stream is played. At the beginning of this chapter we mentioned multiplexing and synchronisation of audio and video streams. The tools defined in the international standard for the maintenance of synchronisation should be used by decoders when multiplexed streams are being played.

The definition of the parameterised model decoder is known as the video buffer verifier (VBV). The parameters used by a particular encoder are defined in the bit stream. Such a model decoder is needed if encoders are to be assured that the coded bit streams they produce will be decodable. The model decoder is shown in Figure 7.15.

Figure 7.15: Model decoder

A fixed rate channel is assumed to put bits at a constant rate into the input buffer. At regular intervals, set by the picture rate, the picture decoder instantaneously removes all the bits pertaining to the next picture from the input buffer. If there are too few bits in the input buffer, that is, not all the bits for the next picture have been received, then the input buffer underflows and there is an underflow error. If, during the time between picture starts, the capacity of the input buffer is exceeded, then there is an overflow error.
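The behaviour of the model decoder can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the normative VBV procedure: the function name, its arguments and the returned tuple are hypothetical, and the buffer is assumed to hold `initial_bits` when the first picture is removed.

```python
def simulate_vbv(picture_bits, bit_rate, picture_rate, buffer_size, initial_bits):
    """Run the idealised model decoder over a list of picture sizes (bits).

    Returns ("ok", None), ("underflow", i) or ("overflow", i), where i is
    the index of the picture at which the error is detected.
    """
    bits_per_interval = bit_rate / picture_rate  # channel input per picture period
    fullness = initial_bits
    for i, bits in enumerate(picture_bits):
        # The picture decoder removes the whole picture instantaneously.
        if bits > fullness:
            return ("underflow", i)  # not all bits of picture i have arrived
        fullness -= bits
        # The channel then fills the buffer at a constant rate until the
        # next picture start; the peak occurs just before the next removal.
        fullness += bits_per_interval
        if fullness > buffer_size:
            return ("overflow", i)
    return ("ok", None)


# 1.2 Mbit/s at 25 pictures/s delivers 48 000 bits per picture period.
print(simulate_vbv([48_000] * 10, 1_200_000, 25, 327_680, 120_000))
print(simulate_vbv([200_000], 1_200_000, 25, 327_680, 120_000))
print(simulate_vbv([10_000] * 10, 1_200_000, 25, 327_680, 120_000))
```

In the first call every picture uses exactly the bits delivered in one picture period, so the fullness stays constant. In the second call the first picture is larger than the buffer content and underflows. In the third the pictures are so small that the buffer eventually overflows.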

Practical decoders may differ from this model in several important ways. They may not remove all the bits required to decode a picture from the input buffer instantaneously. They may not be able to control the start of decoding very precisely as required by the buffer fullness parameters in the picture header, and they take a finite time to decode. They may also be able to delay decoding for a short time to reduce the chance of underflow occurring. But these differences depend in degree and kind on the exact method of implementation. To satisfy the requirements of different implementations, the MPEG video committee chose a very simple model for the decoder. Practical implementations of decoders must ensure that they can decode the bit stream constrained in this model. In many cases this will be achieved by using an input buffer that is larger than the minimum required, and by using a decoding delay that is larger than the value derived from the buffer fullness parameter. The designer must compensate for any differences between the actual design and the model in order to guarantee that the decoder can handle any bit stream that satisfies the model.

Encoders monitor the status of the model and control their output so that neither overflow nor underflow occurs. The calculated buffer fullness is transmitted at the start of each picture so that the decoder can maintain synchronisation.

7.8.1 Buffer size and delay

For constant bit rate operation, each picture header contains a variable delay parameter (vbv_delay) that enables decoders to synchronise their decoding correctly. This parameter defines the time needed to fill the input buffer of Figure 7.15 from an empty state to the current level immediately before the picture decoder removes all the bits for the picture. This time thus represents a delay and is measured in units of 1/90 000 s. This unit was chosen because the picture durations of the common video formats (1/24, 1/25, 1/29.97 and 1/30 s) are exact, or almost exact, multiples of it, and because it is comparable in duration to an audio sample. The delay is given by:

$$\text{delay} = \frac{\text{vbv\_delay}}{90\,000}\ \text{s} \qquad (7.4)$$

For example, if vbv_delay were 9000, then the delay would be 0.1 s. This means that at the start of a picture the input buffer of the model decoder should contain exactly 0.1 s worth of data from the input bit stream.

The bit rate, R, is defined in the sequence header. The number of bits in the input buffer at the beginning of the picture is thus given by

$$B = \text{delay} \times R = \frac{\text{vbv\_delay} \times R}{90\,000}\ \text{bits} \qquad (7.5)$$

For example, if vbv_delay and R were 9000 and 1.2 Mbit/s, respectively, then the number of bits in the input buffer would be 120 kbits. The constrained parameter bit stream requires that the input buffer have a capacity of 327 680 bits, and B should never exceed this value [3].
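The two worked examples above can be reproduced with the following sketch. The function names are hypothetical. The 90 000 divisor and the 327 680-bit limit are the values quoted in the text.

```python
VBV_CLOCK_HZ = 90_000          # vbv_delay is counted in units of 1/90 000 s
CPB_BUFFER_BITS = 327_680      # constrained parameter buffer capacity (from the text)


def vbv_delay_seconds(vbv_delay):
    """Delay represented by a vbv_delay value, in seconds."""
    return vbv_delay / VBV_CLOCK_HZ


def buffer_bits(vbv_delay, bit_rate):
    """Bits in the model decoder's input buffer at the start of the picture."""
    return vbv_delay * bit_rate / VBV_CLOCK_HZ


delay = vbv_delay_seconds(9000)            # 0.1 s
B = buffer_bits(9000, 1_200_000)           # 120 000 bits
print(delay, B, B <= CPB_BUFFER_BITS)
```

Here B is comfortably below the constrained parameter limit, so this vbv_delay is admissible at 1.2 Mbit/s.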

7.8.2 Rate control and adaptive quantisation

The encoder must make sure that the input buffer of the model decoder is neither overflowed nor underflowed by the bit stream. Since the model decoder removes all the bits associated with a picture from its input buffer instantaneously, it is necessary to control the total number of bits per picture. In H.261 we saw that the encoder could control the bit rate by simply checking its output buffer content. As the buffer fills up, so the quantiser step size is raised to reduce the generated bit rate, and vice versa. The situation in MPEG-1, due to the existence of three different picture types, where each generates a different bit rate, is slightly more complex. First, the encoder should allocate the total number of bits among the various types of picture within a GOP, so that the perceived image quality is suitably balanced. The distribution will vary with the scene content and the particular distribution of I, P and B-pictures within a GOP.
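The buffer-based control described for H.261 can be illustrated with a very simple mapping from buffer occupancy to quantiser step size. The linear rule below is an assumption chosen only for illustration; it is not the rule of any particular reference model, and the function name and the 1 to 31 range are hypothetical defaults.

```python
def quantiser_from_fullness(fullness_bits, buffer_size, q_min=1, q_max=31):
    """Map encoder output-buffer occupancy to a quantiser step size.

    The fuller the buffer, the coarser the quantiser, which lowers the
    generated bit rate; an emptier buffer allows a finer quantiser.
    """
    # Clamp occupancy to the range 0..1 so the result stays within bounds.
    occupancy = min(max(fullness_bits / buffer_size, 0.0), 1.0)
    return round(q_min + occupancy * (q_max - q_min))


for fullness in (0, 80_000, 160_000, 320_000):
    print(fullness, quantiser_from_fullness(fullness, 320_000))
```

In MPEG-1 such a rule would have to be combined with per-picture bit targets, since the three picture types generate very different numbers of bits.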

Investigations have shown that for most natural scenes, each P-picture might generate as many as 2–5 times the number of bits of a B-picture, and an I-picture three times those of the P-picture. If there is little motion and high texture, then a greater proportion of the bits should be assigned to I-pictures. Similarly, if there is strong motion, then the proportion of bits assigned to P-pictures should be increased. In both cases lower quality from the B-pictures is expected, to permit the anchor I and P-pictures to be coded at their best possible quality.

Our investigations with variable bit rate (VBR) video, where the quantiser step size is kept constant (no rate control), show that the ratios of generated bits are 6:3:2, for I, P and B-pictures, respectively [9]. Of course at these ratios, due to the fixed quantiser step size, the image quality is almost constant, not only for each picture (in fact slightly better for B-pictures, due to better motion compensation), but throughout the image. Again, if we lower the expected quality for B-pictures, we can change that ratio in favour of I and P-pictures.

Although these ratios appear to be very important for a suitable balance in picture quality, one should not worry very much about their exact values. The reason is that it is possible to make the encoder intelligent enough to learn the best ratio. For example, after coding each GOP, one can multiply the average value of the quantiser scale in each picture by the bit rate generated at that picture. Such a quantity can be used as the complexity index, since larger complexity indices should be due to both larger quantiser step sizes and larger bit rates. Therefore, based on the complexity index one can derive a new coding ratio, and the target bit rate for each picture in the next GOP is based on this new ratio.
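The learning step can be sketched as follows. This is one possible reading of the description, not a definitive algorithm: the function names are hypothetical, the per-type index is taken as the mean over the pictures of that type in the GOP, and the ratio is normalised so that the B-picture value is 1.

```python
from collections import defaultdict


def complexity_index(avg_quantiser_scale, bits):
    """Complexity of one coded picture: average quantiser scale times bits."""
    return avg_quantiser_scale * bits


def coding_ratio(gop):
    """Derive a coding ratio from one coded GOP.

    gop -- list of (picture_type, avg_quantiser_scale, bits) tuples,
           with picture_type in {"I", "P", "B"}.
    Returns the mean complexity per type divided by the B-picture mean.
    """
    indices = defaultdict(list)
    for picture_type, q, bits in gop:
        indices[picture_type].append(complexity_index(q, bits))
    means = {t: sum(v) / len(v) for t, v in indices.items()}
    return {t: m / means["B"] for t, m in means.items()}


# Hypothetical measurements for a GOP with one I, three P and eight B-pictures.
gop = [("I", 10, 120_000)] + [("P", 10, 60_000)] * 3 + [("B", 10, 40_000)] * 8
print(coding_ratio(gop))
```

For these made-up measurements the ratio comes out as 3 : 1.5 : 1, which is the same as 6:3:2.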

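As an example, let us assume that SIF-625 video is to be coded at 1.2 Mbit/s, with a GOP structure of N = 12 and M = 3. There will therefore be one I-picture, three P-pictures and eight B-pictures in each GOP. First of all, at the 25 Hz picture rate of SIF-625, the target bit rate for each GOP is $1200 \times 12/25 = 576$ kbit/GOP. If we assume a coding ratio of 6:3:2, then the target bit rate for each of the I, P and B-pictures will be:

I-picture: $576 \times \dfrac{6}{1 \times 6 + 3 \times 3 + 8 \times 2} = 576 \times \dfrac{6}{31} \approx 111$ kbit
P-picture: $576 \times \dfrac{3}{1 \times 6 + 3 \times 3 + 8 \times 2} = 576 \times \dfrac{3}{31} \approx 56$ kbit
B-picture: $576 \times \dfrac{2}{1 \times 6 + 3 \times 3 + 8 \times 2} = 576 \times \dfrac{2}{31} \approx 37$ kbit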

Therefore, each picture aims for its own target bit rate. As in H.261, the quantiser step size can be controlled for each picture so that the required bit rate is achieved. At the end of the GOP, the complexity index for each picture type is calculated. Note that for P and B-pictures, the complexity index is the average of three and eight complexity indices, respectively. These indices are then used to define new coding ratios between the picture types for coding the next GOP. In addition, the bits generated in that GOP are added together, and the surplus or deficit relative to the GOP target bit rate is carried over to the next GOP.
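The allocation for the SIF-625 example (1.2 Mbit/s, 25 pictures/s, one I, three P and eight B-pictures, coding ratio 6:3:2) can be reproduced with the sketch below. The function names are hypothetical, and the carry-over rule is a simple interpretation of the description: unused bits are added to the next GOP budget, and overspent bits are subtracted from it.

```python
def gop_target_bits(bit_rate, picture_rate, gop_length):
    """Bits available for one GOP at a constant bit rate."""
    return bit_rate * gop_length / picture_rate


def picture_targets(gop_bits, counts, ratio):
    """Split the GOP budget between picture types according to a coding ratio."""
    weight = sum(counts[t] * ratio[t] for t in counts)
    return {t: gop_bits * ratio[t] / weight for t in counts}


def next_gop_budget(nominal_bits, previous_budget, previous_actual):
    """Carry the surplus (or deficit) of the last GOP over to the next one."""
    return nominal_bits + (previous_budget - previous_actual)


counts = {"I": 1, "P": 3, "B": 8}
ratio = {"I": 6, "P": 3, "B": 2}
gop_bits = gop_target_bits(1_200_000, 25, 12)      # 576 000 bits
targets = picture_targets(gop_bits, counts, ratio)
print({t: round(b / 1000) for t, b in targets.items()})   # kbit per picture

# If the previous GOP spent 590 kbit against a 576 kbit budget,
# the next GOP has 14 kbit less to spend.
print(next_gop_budget(gop_bits, gop_bits, 590_000))
```

The per-picture targets come out at approximately 111, 56 and 37 kbit for I, P and B-pictures, respectively.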

In practice, the target bit rates for B-pictures compared with other picture types are deliberately reduced by a factor of 1.4. This is done for two reasons. First, due to efficient bidirectional motion estimation in B-pictures, their transform coefficients are normally small. Increasing the quantiser step size hardly affects these naturally small value coefficients' distortions, but the overall bit rate can be reduced significantly. Second, since B-pictures are not used in the prediction loop of the encoder, even if they are coarsely coded, the encoding error is not transferred to the subsequent frames. This is not the case with the anchor I and P-pictures, since through the prediction loop, any saving in one frame due to coarser coding has to be paid back in the following frames.

Experimental results indicate that by reducing the target bit rates of the B-pictures by a factor of 1.4, the average quantiser step size for these pictures rises by almost the same factor, but their quality (PSNR) deteriorates only slightly, which makes the reduction worthwhile.

Note also that, although in the above example (which is typical) each B-picture receives fewer bits than an I or P-picture, the eight B-pictures in a GOP together generate almost 8 × 37 = 296 kbits, which is more than 50 per cent of the bits in a GOP. The first implication is that use of the factor 1.4 can significantly reduce the overall bit rate. The second concerns the transmission of video over packet networks: during periods of congestion, discarding only B-pictures reduces the network load by about 50 per cent, so congestion can be eased without significantly affecting the picture quality. Because B-pictures are not used for prediction, their loss results in only a very brief (480 ms) reduction in quality.
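The share of the GOP taken by B-pictures can be checked numerically. The sketch assumes the figures of the running example (a 576 kbit GOP of 12 pictures at 25 pictures/s, with about 37 kbit per B-picture). The function names are hypothetical.

```python
def b_picture_share(bits_per_b, b_count, gop_bits):
    """Fraction of the GOP bits carried by B-pictures."""
    return b_count * bits_per_b / gop_bits


def gop_duration_ms(gop_length, picture_rate):
    """Duration of one GOP in milliseconds."""
    return 1000 * gop_length / picture_rate


share = b_picture_share(bits_per_b=37_000, b_count=8, gop_bits=576_000)
print(f"B-pictures carry {share:.1%} of the GOP bits")
print(gop_duration_ms(12, 25), "ms per GOP")
```

The B-pictures account for a little over half of the bits, and the 480 ms quoted in the text equals the duration of one 12-picture GOP at 25 pictures/s.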