Coding of Moving Pictures for Digital Storage Media (MPEG-1)

Overview

MPEG-1 is the first generation of video codecs proposed by the Moving Picture Experts Group as a standard to provide video coding for digital storage media (DSM), such as CD, DAT, Winchester discs and optical drives [1]. This development was in response to industry needs for an efficient way of storing visual information on storage media other than conventional analogue video cassette recorders (VCR). At the time, CD-ROMs had a capacity of 648 Mbytes, sufficient to accommodate movie programmes at a rate of approximately 1.2 Mbit/s, and the MPEG standard aimed to conform roughly to this target. Although in most applications the MPEG-1 video bit rate is in the range of 1–1.5 Mbit/s, the international standard does not limit the bit rate, and higher bit rates may be used for other applications.

It was also envisaged that the stored data would be within both 625 and 525-line television systems and provide flexibility for use with workstations and personal computers. For this reason, the MPEG-1 standard is based on progressively scanned images and does not recognise interlacing. Interlaced sources have to be converted to a noninterlaced format before coding. After decoding, the decoded image may be converted back to provide an interlaced format for display.

Since coding for digital storage can be regarded as a competitor to VCRs, MPEG-1 video quality at the rate of 1–1.5 Mbit/s is expected to be comparable to that of VCRs. It should also provide the viewing modes associated with VCRs, such as forward play, freeze picture, fast forward, fast reverse, slow forward and random access. The ability of the decoder to provide these modes depends to some extent on the nature of the digital storage media. However, it should be borne in mind that efficient coding and flexibility of operation are conflicting requirements. Providing the added functionality of random access necessitates regular intraframe pictures in the coded sequence. Such frames, which do not exploit the temporal redundancy in video, compress poorly, and as a result the overall bit rate is increased.

Both H.261 [2] and MPEG-1 [1] are standards defined for relatively low bit rate coding of low spatial resolution pictures. Like H.261, MPEG-1 utilises DCT for lossy coding of its intraframe and interframe prediction errors. The MPEG-1 video coding algorithm is largely an extension of H.261, and many of the features are common. Their bit streams are, however, incompatible, although their encoding units are very similar.

The MPEG-1 standard, like H.261, does not specify the design of the decoder, and even less information is given about the encoder. What is expected from MPEG-1, like H.261, is to produce a bit stream which is decodable. Manufacturers are free to choose any algorithms they wish, and to optimise them for better efficiency and functionality. Therefore in this Chapter we again look at the fundamentals of MPEG-1 coding, rather than details of the implementation.

Systems coding outline

The MPEG-1 standard gives the syntax description of how audio, video and data are combined into a single data stream. This sequence is formally termed the ISO 11172 stream [3]. The structure of this ISO 11172 stream is illustrated in Figure 7.1. It consists of a compression layer and a systems layer. In this book we study only the video part of the compression layer, but the systems layer is important for the proper delivery of the coded bit stream to the video decoder, and hence we briefly describe it.

Figure 7.1: Structure of an ISO 11172 stream

The MPEG-1 systems standard defines a packet structure for multiplexing coded audio and video into one stream and keeping it synchronised. The systems layer is organised into two sublayers known as the pack and packet layers. A pack consists of a pack header that gives the systems clock reference (SCR) and the bit rate of the multiplexed stream followed by one or more packets. Each packet has its own header that conveys essential information about the elementary data that it carries. The aim of the systems layer is to support the combination of video and audio elementary streams. The basic functions are as follows:

  • synchronised presentation of decoded streams
  • construction of the multiplexed stream
  • initialisation of buffering for playback start-up
  • continuous buffer management
  • time identification.

In the systems layer, elements of direct interest to the video encoding and decoding processes are mainly those of the stream-specific operations, namely multiplexing and synchronisation.

Multiplexing elementary streams

The multiplexing of elementary audio, video and data is performed at the packet level. Each packet thus contains only one elementary data type. The systems layer syntax allows up to 32 audio, 16 video and two data streams to be multiplexed together. If more than two data streams are needed, substreams may be defined.

Synchronisation

Multiple elementary streams are synchronised by means of presentation time stamps (PTS) in the ISO 11172 bit stream. End-to-end synchronisation is achieved when the encoders record time stamps during capture of raw data. The receivers will then make use of these PTS in each associated decoded stream to schedule their presentations. Playback synchronisation is pegged onto a master time base, which may be extracted from one of the elementary streams, the digital storage media (DSM), channel or some external source. This prototypical synchronisation arrangement is illustrated in Figure 7.2. The occurrences of PTS and other information such as the systems clock reference (SCR) and systems headers will also be essential for facilitating random access of the MPEG-1 bit stream. This set of access codes should therefore be located near to the part of the elementary stream where decoding can begin. In the case of video, this site will be near the head of an intraframe.

Figure 7.2: MPEG-1's prototypical encoder and decoder illustrating end-to-end synchronisation (STC— systems time clock; SCR— systems clock reference; PTS— presentation time stamp; DSM— digital storage media)

To ensure guaranteed decoder buffer behaviour, the MPEG-1 systems layer employs a systems target decoder (STD) and decoding time stamp (DTS). The DTS differs from PTS only in the case of video pictures that require additional reordering delay during the decoding process.

Preprocessing

The source material for video coding may exist in a variety of forms such as computer files or live video in CCIR-601 format [4]. If CCIR-601 is the source, since MPEG-1 is for coding of video at VCR resolutions, then SIF format is normally used. These source pictures must be processed prior to coding. In Chapter 2 we explained how CCIR-601 video was converted to SIF format. If the source is film, we also discussed the conversion methodology in that Chapter. However, if computer source files do not have the SIF format, they have to be converted too. In MPEG-1, another preprocessing step is required to reorder the input pictures for coding. This is called picture reordering.

Picture reordering

Because of the conflicting requirements of random access and highly efficient coding, the MPEG suggested that not all pictures of a video sequence should be coded in the same way. They identified four types of picture in a video sequence. The first type is called I-pictures, which are coded without reference to other pictures. They provide access points to the coded sequence for decoding. These pictures are intraframe coded as for JPEG, with a moderate compression. The second type is the P-pictures, which are predictively coded with reference to the previous I- or P-coded picture. They are themselves used as a reference (anchor) for the coding of future pictures. Coding of these pictures is very similar to H.261. The third type is B-pictures, or bidirectionally coded pictures, which may use past, future or combinations of both pictures in their predictions. This increases the motion compensation efficiency, since occluded parts of moving objects may be better compensated for from the future frame. B-pictures are themselves never used for predictions. This feature, which is unique to MPEG, has two important implications:

  1. If B-pictures are not used for predictions of future frames, then they can be coded with the highest possible compression without any side effects. This is because, if one picture is coarsely coded and is used as a prediction, the coding distortions are transferred to the next frame. This frame then needs more bits to clear the previous distortions, and the overall bit rate may increase rather than decrease.
  2. In applications such as transmission of video over packet networks, B-pictures may be discarded (e.g. due to buffer overflow) without affecting the next decoded pictures [5]. Note that if any part of the H.261 pictures, or I and P-pictures in MPEG, are corrupted during the transmission, the effect will propagate until they are refreshed [6].

Figure 7.3 illustrates the relationship between these three types of picture. Since B-pictures use I and P-pictures as predictions, they have to be coded later. This requires reordering the incoming picture order, which is carried out at the preprocessor.

Figure 7.3: An example of MPEG-1 GOP

The fourth picture type is the D-picture. These are intraframe coded, with only the DC coefficients retained. Hence the picture quality is poor, and they are normally used for applications like fast forward. D-pictures are not part of the GOP, and they are not present in a sequence containing any other picture type.

Video structure

Group of pictures (GOP)

Since in the H.261 standard successive frames are similarly coded, a picture is the top level of the coding hierarchy. In MPEG-1, owing to the existence of several picture types, a group of pictures, called a GOP, is the highest level of the hierarchy. A GOP is a series of one or more pictures intended to assist random access into the picture sequence. The first coded picture in the group is an I-picture, followed by an arrangement of P- and B-pictures, as shown in Figure 7.3.

The GOP length is normally defined as the distance between I-pictures, represented by the parameter N in the standard codecs. The distance from an anchor I/P-picture to the next P-picture is represented by M. In the above Figure, N = 12 and M = 3. A group of pictures may be of any length, but there should be at least one I-picture in each GOP. Applications requiring random access, fast forward play or fast and normal reverse play may use short GOPs. A GOP may also start at a scene cut or in other cases where motion compensation is ineffective. The number of consecutive B-pictures is variable, and neither P- nor B-pictures need be present. For most applications, a GOP in the SIF-625/50 format has N = 12 and M = 3; in SIF-525/60 the values are 15 and 3, respectively.
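The regular pattern described above can be generated with a short sketch. This is a simplification (a real encoder may cut a GOP short at a scene change, and the function name is illustrative):

```python
def gop_pattern(n, m):
    """Picture types in display order for one GOP of length n,
    with anchor spacing m (i.e. m - 1 B-pictures between anchors)."""
    types = []
    for i in range(n):
        if i == 0:
            types.append('I')          # each GOP starts with an I-picture
        elif i % m == 0:
            types.append('P')          # anchors every m pictures
        else:
            types.append('B')
    return types

print(''.join(gop_pattern(12, 3)))     # IBBPBBPBBPBB (SIF-625/50 default)
print(''.join(gop_pattern(15, 3)))     # IBBPBBPBBPBBPBB (SIF-525/60 default)
```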

The encoding or transmission order of pictures differs from the display or incoming picture order. In the Figure, B-pictures 1 and 2 are encoded after P-picture 0 and I-picture 3 have been encoded. Also in this Figure, B-pictures 13 and 14 are part of the next GOP. Although the display order is 0, 1, 2, ..., 11, the encoding order is 3, 1, 2, 6, 4, 5, .... This reordering introduces a delay of several frames at the encoder (equal to the number of B-pictures between the anchor I- and P-pictures). The same amount of delay is introduced at the decoder, in putting the transmission/decoding sequence back into its original order. This delay inevitably limits the application of MPEG-1 for telecommunications.
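The display-to-encoding reordering can be sketched as follows. The helper below assumes the GOP starts with its own I-picture, so the first entries differ slightly from the figure, where picture 0 is the previous GOP's anchor; trailing B-pictures would in practice wait for the next GOP's anchor:

```python
def encoding_order(display_types):
    """Reorder display-order picture indices into encoding order:
    each anchor (I or P) is coded before the B-pictures preceding it."""
    order, pending_b = [], []
    for idx, t in enumerate(display_types):
        if t == 'B':
            pending_b.append(idx)      # hold B-pictures until the next anchor
        else:
            order.append(idx)          # code the anchor first
            order.extend(pending_b)    # then the B-pictures it closes off
            pending_b = []
    return order + pending_b           # simplification for GOP-final B-pictures

types = list('IBBPBBPBBPBB')           # N = 12, M = 3, display order 0..11
print(encoding_order(types))           # [0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 10, 11]
```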

Picture

All three main picture types, I, P and B, have the same SIF size with 4:2:0 format. In SIF-625 the luminance part of each picture has 360 pixels, 288 lines and 25 Hz, and each chrominance component has 180 pixels, 144 lines and 25 Hz. In SIF-525, the values for luminance are 360 pixels, 240 lines and 30 Hz, and for chrominance 180, 120 and 30, respectively. For 4:2:0 format images, the luminance and chrominance samples are positioned as shown in Figure 2.3.

Slice

Each picture is divided into a group of macroblocks, called slices. In H.261 such a group was called GOB. The reason for defining a slice is the same as that for defining a GOB, namely resetting the variable length code to prevent channel error propagation into the picture. Slices can have different sizes within a picture, and the division in one picture need not be the same as the division in any other picture.

The slices can begin and end at any macroblock in a picture, but with some constraints. The first slice must begin at the top left of the picture (the first macroblock) and the end of the last slice must be the bottom right macroblock (the last macroblock) of the picture, as shown in Figure 7.4. Therefore the minimum number of slices per picture is one, and the maximum number is equal to the number of macroblocks (e.g. 396 in SIF-625).

Figure 7.4: An example of slice structure for SIF-625 pictures

Each slice starts with a slice start code, followed by a code that defines its position and a code that sets the quantisation step size. Note that in H.261 the quantisation step size was set at each GOB or row of GOBs, but in MPEG-1 it can be set at any macroblock (see below). Therefore, in MPEG-1 the main reason for defining slices is not to reset the quantiser, but to limit the effects of channel error propagation. If the coded data are corrupted and the decoder detects it, it can search for the next slice start, and decoding resumes from that point. Only the part of the picture from the start of the error to the beginning of the next slice is then degraded. Therefore, in a noisy environment it is desirable to have as many slices as possible. On the other hand, each slice carries an overhead, the slice start code (a minimum of 32 bits), which adds to the total bit rate. For example, with the slice structure of Figure 7.4, where there is one slice for each row of MBs, for SIF-625 video there are 18 slices per picture, and with 25 Hz video the slice overhead can be 32 × 18 × 25 = 14 400 bit/s.
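The overhead arithmetic above can be written out directly (function name is illustrative):

```python
def slice_overhead_bps(slices_per_picture, picture_rate, start_code_bits=32):
    """Bit rate consumed by slice start codes alone (minimum 32 bits each)."""
    return start_code_bits * slices_per_picture * picture_rate

# SIF-625, one slice per macroblock row: 18 slices, 25 pictures/s
print(slice_overhead_bps(18, 25))   # 14400 bit/s
```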

To optimise the slice structure, that is, to give a good immunity from channel errors and at the same time to minimise the slice overhead, one might use short slices for macroblocks with significant energy (such as intra-MB), and long slices for less significant ones (e.g. macroblocks in B-pictures). Figure 7.5 shows a slice structure where in some parts the slice length extends beyond several rows of macroblocks, and in some cases is less than one row.

Figure 7.5: Possible arrangement of slices in SIF-625

Macroblock

Slices are divided into macroblocks of 16 × 16 pixels, similar to the division of a GOB into macroblocks in H.261. Macroblocks in turn are divided into blocks for coding. In Chapter 6 we gave a detailed description of how a macroblock is coded: its type, mode of selection, the blocks within the MB, their positional addresses and finally the block pattern. Since MPEG-1 is also a macroblock-based codec, most of these rules apply to MPEG-1 too. However, owing to the differences between slices and GOBs, and between MPEG-1's several picture types and H.261's single picture type, there are bound to be variations in the coding. We first give a general account of these differences and then, in the following section, more details about the macroblocks in the various picture types.

The first difference is that since a slice has a raster scan structure, macroblocks are addressed in a raster scan order. The top left macroblock in a picture has address 0, the next one on the right has address 1 and so on. If there are M macroblocks in a picture (e.g. M = 396), then the bottom right macroblock has address M - 1. To reduce the address overhead, macroblocks are relatively addressed by transmitting the difference between the current macroblock and the previously coded macroblock. This difference is called the macroblock address increment. In I-pictures, since all the macroblocks are coded, the macroblock address increment is always 1. The exception is that, for the first coded macroblock at the beginning of each slice, the macroblock address is set to that of the right-hand macroblock of the previous row. This address at the beginning of each picture is set to -1. If a slice does not start at the left edge of the picture (see the slice structure of Figure 7.5), then the macroblock address increment for the first macroblock in the slice will be larger than one. For example, in the slice structure of Figures 7.4 and 7.5 there are 22 macroblocks per row. For Figure 7.4, at the start of slice two, the macroblock address is set to 21, which is the address of the macroblock at the right-hand edge of the top row of macroblocks. In Figure 7.5, if the first slice contains 30 macroblocks, eight of them would be in the second row, so the address of the first macroblock in the second slice would be 30 and the macroblock increment would be nine. For further reduction of address overhead, macroblock address increments are VLC coded.

There is no code to indicate a macroblock address increment of zero. This is why the macroblock address is set to -1 rather than zero at the top of the picture. The first macroblock will have an increment of one, making its address equal to zero.
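The addressing rules above can be sketched as follows; the function name is illustrative, and the predictor reset shown follows the worked examples from Figures 7.4 and 7.5:

```python
def slice_predictor(first_mb_addr, mb_width):
    """Address predictor at the start of a slice: the right-hand macroblock
    of the row above the slice's starting row (-1 at the top of the picture)."""
    row = first_mb_addr // mb_width
    return row * mb_width - 1

mb_width = 22   # SIF-625: 22 macroblocks per row

# Figure 7.4: slice 2 starts at the left of row 1 (address 22);
# predictor is 21, so the macroblock address increment is 1.
print(22 - slice_predictor(22, mb_width))   # 1

# Figure 7.5: first slice has 30 MBs, second slice starts at address 30;
# predictor is again 21, so the increment is 9.
print(30 - slice_predictor(30, mb_width))   # 9
```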

Block

Finally, the smallest part of the picture structure is the block of 8 × 8 pixels, for both luminance and chrominance components. DCT coding is applied at this block level. Figure 7.6 illustrates the whole structure of partitioning a video sequence, from its GOP level at the top to the smallest unit of block at the bottom.

Figure 7.6: MPEG-1 coded video structure

Encoder

As mentioned, the international standard does not specify the design of the video encoders and decoders. It only specifies the syntax and semantics of the bit stream and signal processing at the encoder/decoder interface. Therefore, options are left open to the video codec manufacturers to trade-off cost, speed, picture quality and coding efficiency. As a guideline, Figure 7.7 shows a block diagram of an MPEG-1 encoder. Again it is similar to the generic codec of Chapter 3 and the H.261 codec of Chapter 6. For simplicity the coding flags shown in the H.261 codec are omitted, although they also exist.

Figure 7.7: A simplified MPEG-1 video encoder

The main differences between this encoder and that defined in H.261 are:

  • Frame reordering: at the input of the encoder coding of B-pictures is postponed to be carried out after coding the anchor I and P-pictures.
  • Quantisation: intraframe coded macroblocks are subjectively weighted to emulate perceived coding distortions.
  • Motion estimation: not only is the search range extended but the search precision is increased to half a pixel. B-pictures use bidirectional motion compensation.
  • No loop filter.
  • Frame store and predictors: to hold two anchor pictures for prediction of B-pictures.
  • Rate regulator: since here there is more than one type of picture, each generating different bit rates.

Before describing how each picture type is coded, and the main differences between this codec and H.261, we can describe the codec on a macroblock basis, as the basic unit of coding. Within each picture, macroblocks are coded in a sequence from left to right. Since 4:2:0 image format is used, then the six blocks of 8 × 8 pixels, four luminance and one of each chrominance component, are coded in turn. Note that the picture area covered by the four luminance blocks is the same as that covered by each of the chrominance blocks.

First, for a given macroblock, the coding mode is chosen. This depends on the picture type, the effectiveness of motion compensated prediction in that local region and the nature of the signal within the block. Secondly, depending on the coding mode, a motion compensated prediction of the contents of the block, based on the past and/or future reference pictures, is formed. This prediction is subtracted from the actual data in the current macroblock to form an error signal. Thirdly, this error signal is divided into 8 × 8 blocks and a DCT is performed on each block. The resulting two-dimensional 8 × 8 block of DCT coefficients is quantised and scanned in zigzag order to convert it into a one-dimensional string of quantised DCT coefficients. Fourthly, the side information for the macroblock, including its type, block pattern, motion vector and address, is coded alongside the DCT coefficients. For maximum efficiency, all the data are variable length coded. The DCT coefficients are run length coded with the generation of events, as we discussed in H.261.

A consequence of using different picture types and variable length coding is that the overall bit rate is very variable. In applications that involve a fixed rate channel, a FIFO buffer is used to match the encoder output to the channel. The status of this buffer may be monitored to control the number of bits generated by the encoder. Controlling the quantiser parameter is the most direct way of controlling the bit rate. The international standard specifies an abstract model of the buffering system (the video buffering verifier) in order to limit the maximum variability in the number of bits that are used for a given picture. This ensures that a bit stream can be decoded with a buffer of known size (see section 7.8).
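As an illustration of buffer-based rate control, the following toy sketch coarsens the quantiser when the FIFO buffer fills and refines it when the buffer drains. The function name, gain and target occupancy are invented for illustration; the standard does not prescribe any control law, only the buffering verifier:

```python
def adjust_quantiser(q, buffer_fullness, buffer_size, target=0.5, gain=4.0):
    """Toy rate regulator: scale the quantiser step with the deviation of
    buffer occupancy from a target level (all constants illustrative)."""
    occupancy = buffer_fullness / buffer_size
    q_new = q * (1.0 + gain * (occupancy - target))
    # MPEG-1's quantiser_scale is an integer in the range 1..31
    return min(31, max(1, round(q_new)))

print(adjust_quantiser(8, 900_000, 1_000_000))   # buffer nearly full: coarser
print(adjust_quantiser(8, 100_000, 1_000_000))   # buffer nearly empty: finer
```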

Quantisation weighting matrix

The insensitivity of the human visual system to high frequency distortions can be exploited for further bandwidth compression. In this case, the higher orders of DCT coefficients are quantised with coarser quantisation step sizes than the lower frequency ones. Experience has shown that for SIF pictures a suitable distortion weighting matrix for the intra DCT coefficients is the one shown in Figure 7.8. This intra matrix is used as the default quantisation matrix for intraframe coded macroblocks.

Intra (default weighting matrix for intraframe coded macroblocks):

 8 16 19 22 26 27 29 34
16 16 22 24 27 29 34 37
19 22 26 27 29 34 34 38
22 22 26 27 29 34 37 40
22 26 27 29 32 35 40 48
26 27 29 32 35 40 48 58
26 27 29 34 38 46 56 69
27 29 35 38 46 56 69 83

Inter (default nonintra weighting matrix, flat):

16 16 16 16 16 16 16 16
(all 64 entries equal to 16)


Figure 7.8: Default intra and inter quantisation weighting matrices

If the picture resolution departs significantly from the SIF size, then some other matrix may give perceptually better results. The reason is that this matrix is derived from the vision contrast sensitivity curve for a nominal viewing distance (e.g. viewing distances of 4–6 times the picture height) [7]. For higher or lower picture resolutions, or a different viewing distance, the perceived spatial frequencies change, and hence a different weighting matrix would be derived.

It should be noted that such frequency-dependent weighting is not used for interframe coded macroblocks. This is because high frequency content in the interframe error does not necessarily correspond to high spatial frequency in the picture; it might be due to poor motion compensation or block boundary artefacts. Hence interframe coded macroblocks use a flat quantisation matrix. This matrix is called the inter or nonintra quantisation weighting matrix.

Note that, since in H.261 all the pictures are interframe coded and a very few macroblocks might be intra coded, then only the nonintra weighting matrix is defined. Little work has been performed to determine the optimum nonintra matrix for MPEG-1, but evidence suggests that the coding performance is more related to the motion and the texture of the scene than the nonintra quantisation matrix. If there is any optimum matrix, it should then be somewhere between the flat default inter matrix and the strongly frequency-dependent values of the default intra matrix.

The DCT coefficients, prior to quantisation, are divided by the weighting matrix. Note that the DCT coefficients prior to weighting have a dynamic range from -2047 to +2047. Weighted coefficients are then quantised with the quantisation step size and, at the decoder, the reconstructed quantised coefficients are multiplied by the weighting matrix to recover the coefficients.
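The weighting-and-quantisation step can be sketched as follows. This is a simplification (plain rounding, no dead zone, and the standard's exact reconstruction arithmetic is omitted); the matrix values are the default intra matrix of Figure 7.8:

```python
# Default intra quantisation weighting matrix (Figure 7.8)
INTRA_W = [
    [ 8, 16, 19, 22, 26, 27, 29, 34],
    [16, 16, 22, 24, 27, 29, 34, 37],
    [19, 22, 26, 27, 29, 34, 34, 38],
    [22, 22, 26, 27, 29, 34, 37, 40],
    [22, 26, 27, 29, 32, 35, 40, 48],
    [26, 27, 29, 32, 35, 40, 48, 58],
    [26, 27, 29, 34, 38, 46, 56, 69],
    [27, 29, 35, 38, 46, 56, 69, 83],
]

def quantise_intra(coeff, u, v, q_scale):
    # divide the coefficient by its weight and the quantiser scale
    # (simplified: plain rounding)
    return round(8 * coeff / (INTRA_W[v][u] * q_scale))

def dequantise_intra(level, u, v, q_scale):
    # decoder side: multiply back by the weight and quantiser scale
    return level * INTRA_W[v][u] * q_scale // 8

print(quantise_intra(200, 1, 1, 4))     # 25
print(dequantise_intra(25, 1, 1, 4))    # 200
```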

Motion estimation

In Chapter 3, block matching motion estimation/compensation and its application in standard codecs were discussed in great detail. We also introduced some fast search methods for motion estimation, which can be used in software-based codecs. As we saw, motion estimation in H.261 was optional. This was mainly due to the assumption that, since motion compensation reduces correlation, DCT coding might no longer be efficient. Investigations since the publication of H.261 have proved that this is not the case. What is expected from a DCT is to remove the spatial correlation within a small area of 8 × 8 pixels. Measurements of the correlation between adjacent error pixels have shown that strong correlation remains between the error pixels, so the potential of the DCT for spatial redundancy reduction is not impaired. Hence motion estimation has become an important integral part of all the later video codecs, such as MPEG-1, MPEG-2, H.263, H.26L and MPEG-4. These will be explained in the relevant chapters.

Considering MPEG-1, the strategy for motion estimation in this codec is different from the H.261 in four main respects:

  • motion estimation is an integral part of the codec
  • motion search range is much larger
  • higher precision of motion compensation is used
  • B-pictures can benefit from bidirectional motion compensation.

These features are described in the following sections.

Larger search range

In H.261, if motion compensation is used, a search is carried out only between successive frames. Moreover, H.261 is normally used for head-and-shoulders pictures, where the motion speed tends to be very small. In contrast, MPEG-1 is used mainly for coding of films with much larger movements and activity. Furthermore, in the search for motion in P-pictures, since the anchors might be several frames apart, the search range becomes many times larger. For example, in a GOP structure with M = 3, where there are two B-pictures between the anchor pictures, the motion displacement is three times greater than that between consecutive pictures. Thus in MPEG-1 we expect a much larger search range. Considering that in full search block matching the number of search positions for a motion displacement of w is (2w + 1)², tripling the search range makes motion estimation prohibitively expensive computationally.

In Chapter 3 we introduced some fast search methods such as logarithmic step searches and hierarchical motion estimation. Although the hierarchical method can be used here, of course needing one or more additional levels of hierarchy, use of a logarithmic search may not be feasible. This is because these methods perform poorly at large search ranges: the search can converge to a local minimum that is very far from the global minimum, causing the estimation to fail [8].

One way of alleviating this problem is to use a telescopic search method. This is unique to MPEG with B-pictures. In this method, rather than searching for the motion between the anchor pictures, the search is carried out on all the consecutive pictures, including B-pictures. The final search between the anchor pictures is then the sum of all the intermediate motion vectors, as shown in Figure 7.9. Note that since we are now searching for motion in successive pictures, the search range is smaller, and even fast search methods can be used.


Figure 7.9: Telescopic motion search
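The telescopic method amounts to summing the vectors found between consecutive pictures; a minimal sketch (function name illustrative):

```python
def telescopic_vector(consecutive_mvs):
    """Sum the motion vectors found between consecutive pictures to obtain
    the overall displacement between the two anchor pictures (Figure 7.9)."""
    dx = sum(v[0] for v in consecutive_mvs)
    dy = sum(v[1] for v in consecutive_mvs)
    return dx, dy

# Three consecutive-picture vectors (I->B1, B1->B2, B2->P), each found with
# a small search range; their sum gives the anchor-to-anchor vector.
print(telescopic_vector([(3, -1), (2, 0), (4, -2)]))   # (9, -3)
```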

Motion estimation with half pixel precision

In the search process with a half pixel resolution, normal block matching with integer pixel positions is carried out first. Then eight new positions, with a distance of half a pixel around the final integer pixel, are tested. Figure 7.10 shows a part of the search area, where the coordinate marked A has been found as the best integer pixel position at the first stage.


Figure 7.10: Subpixel search positions, around pixel coordinate A

In testing the eight subpixel positions, pixels of the macroblock in the previous frame are interpolated, according to the positions to be searched. Let A be an integer pixel position and B, C and D its integer neighbours to the right, below and diagonally below. For the subpixel positions marked h, in the middle of two horizontally adjacent pixels, the interpolation is:

(7.1)  h = (A + B)/2

where the division is truncated. For the subpixels in the vertical midpoints, the interpolated values for the pixels are:

(7.2)  v = (A + C)/2

and for subpixels in the corner (centre of four pixels), the interpolation is:

(7.3)  c = (A + B + C + D)/4
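A sketch of the three truncated-average interpolations, with A an integer pixel and B, C, D its right, lower and diagonal neighbours (integer division provides the truncation; pixel values are non-negative):

```python
def h_interp(a, b):
    # equation (7.1): midpoint of two horizontally adjacent pixels
    return (a + b) // 2

def v_interp(a, c):
    # equation (7.2): midpoint of two vertically adjacent pixels
    return (a + c) // 2

def c_interp(a, b, c, d):
    # equation (7.3): centre of four pixels
    return (a + b + c + d) // 4

print(h_interp(100, 103))           # 101 (101.5 truncated)
print(v_interp(100, 97))            # 98  (98.5 truncated)
print(c_interp(100, 103, 97, 98))   # 99  (99.5 truncated)
```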

Note that in subpixel precision motion estimation, the range of the motion vectors' addresses is increased by 1 bit for each of the horizontal and vertical directions. Thus the motion vector overhead may be increased by two bits per vector (in practice due to variable length coding, this might be less than two bits). Despite this increase in motion vector overhead, the efficiency of motion compensation outweighs the extra bits, and the overall bit rate is reduced. Figure 7.11 shows the motion compensated error, with and without half pixel precision, for two consecutive frames of the Claire sequence. The motion compensated error has been magnified by a factor of four for better representation. It might be seen that half pixel precision has fewer blocking artefacts and, in general, motion compensated errors are smaller.

Figure 7.11: Motion compensated prediction error (a) with half pixel precision (b) without half pixel precision

For further reduction on the motion vector overhead, differential coding is used. The prediction vector at the start of each slice and each intra coded macroblock is set to zero. Note that the predictively coded macroblocks with no motion vectors also set the prediction vector to zero. The motion vector prediction errors are then variable length coded.

Bidirectional motion estimation

B-pictures have access to both past and future anchor pictures. They can then use either past frame, called forward motion estimation, or the future frame for backward motion estimation, as shown in Figure 7.12.

Figure 7.12: Motion estimation in B-pictures

Such an option increases the motion compensation efficiency, particularly when there are occluded objects in the scene. In fact, one of the reasons for the introduction of B-pictures was the fact that the forward motion estimation used in H.261 and P-pictures cannot compensate for the uncovered background of moving objects.

From the forward and backward motion vectors, the coder can choose the forward, the backward or the combined motion compensated prediction. In the latter case, a weighted average of the forward and backward motion compensated pictures is calculated. The weight is inversely proportional to the distance of the B-picture from its anchor pictures. For example, in the GOP structure I, B1, B2, P, the bidirectionally interpolated motion compensated picture for B1 would take two-thirds of its pixels from the forward motion compensated I-picture and one-third from the backward motion compensated P-picture. This ratio is reversed for B2. Note that B-pictures do not use motion compensation from each other, since they are not used as predictors. Also note that the motion vector overhead in B-pictures is much greater than in P-pictures. The reason is that B-pictures have more macroblock types, which increases the macroblock type overhead, and for the bidirectionally motion compensated macroblocks two motion vectors have to be sent.
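Following the distance weighting described above, a single interpolated pixel can be sketched as below. The function name and the rounding rule are illustrative, and a real encoder would of course operate on whole macroblocks rather than single pixels:

```python
def bidir_prediction(fwd, bwd, d_fwd, d_bwd):
    """Combined prediction for a B-picture pixel, weighted inversely with
    the distance to each anchor (rounded integer average, illustrative)."""
    total = d_fwd + d_bwd
    return (fwd * d_bwd + bwd * d_fwd + total // 2) // total

# B1: one frame from the past anchor, two from the future anchor,
# so it takes 2/3 of the forward and 1/3 of the backward prediction.
print(bidir_prediction(120, 90, 1, 2))   # 110
# B2: the ratio is reversed.
print(bidir_prediction(120, 90, 2, 1))   # 100
```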

Motion range

When B-pictures are present, because the distance between a picture and its anchor varies, the search range for motion estimation differs between picture types. For example, with M = 3, P-pictures are three frames apart from their anchor pictures; B1-pictures are one frame from their past anchor and two frames from their future anchor, and for B2-pictures these distances are reversed. Hence the motion range for P-pictures is larger than the backward motion range of B1-pictures, which is itself larger than their forward motion range. For normal scenes, the maximum search range for P-pictures is usually taken as 11 pixels/3 frames, and the forward and backward motion ranges for B1-pictures are 3 pixels/frame and 7 pixels/2 frames, respectively. These values for B2-pictures become 7 and 3.

It should be noted that, although motion estimation for B-pictures is more processing demanding than that of P-pictures, because both forward and backward motion vectors must be calculated, the larger motion range of P-pictures can nevertheless make the latter more costly than the former. For example, if the full search method is used, the number of search operations for P-pictures will be (2 × 11 + 1)² = 529. The corresponding values for the forward and backward motion vectors of B1-pictures will be (2 × 3 + 1)² = 49 and (2 × 7 + 1)² = 225, respectively. For B2-pictures, the forward and backward motion estimation costs become 225 and 49. Thus, although the motion estimation cost for a P-picture in this example is 529, the cost for a B-picture is about 49 + 225 = 274, which is less. For motion estimation with half pixel accuracy, 8 more operations have to be added for P-pictures and 16 for B-pictures. For more active pictures, where the search ranges for both P and B-pictures are larger, the gap in motion estimation cost becomes wider.
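The operation counts quoted above follow directly from the full search formula; a small sketch (illustrative, integer pixel search only):

```python
def full_search_ops(search_range):
    """Number of block-matching positions for a full search over
    a window of +/- search_range pixels in each direction."""
    return (2 * search_range + 1) ** 2

# Costs quoted in the text:
p_cost  = full_search_ops(11)                      # P-picture, +/-11 pixels
b1_cost = full_search_ops(3) + full_search_ops(7)  # B1: forward +/-3, backward +/-7
```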

Coding of pictures

Since the encoder was described in terms of the basic unit of a macroblock, the picture types may be defined in terms of their macroblock types. Each of these picture types is defined in the following.

I pictures

In I-pictures all the macroblocks are intra coded. There are two intra macroblock types: one that uses the current quantiser scale, intra-d, and the other that defines a new value for the quantiser scale, intra-q. Intra-d is the default value when the quantiser scale is not changed. Although these two types can be identified with 0 and 1, and no variable length code is required, the standard has foreseen some possible extensions to the macroblock types in the future. For this reason, they are VLC coded and intra-d is assigned with 1, and intra-q with 01. Extensions to the VLC codes with a start code of 0 are then open. The policy of making the coding tables open in this way was adopted by the MPEG group video committee in developing the international standard. The advantage of future extensions was judged to be worth the slight coding inefficiency.

If the macroblock type is intra-q, then the macroblock overhead contains an extra five bits to define the new quantiser scale, between 1 and 31. For intra-d macroblocks, no quantiser scale is transmitted and the decoder uses the previously set value; the encoder may therefore prefer to use as many intra-d types as possible. However, when the encoding rate is to be adjusted, which normally requires a new quantiser scale to be defined, the type is changed to intra-q. Note that in H.261, where the bit rate is controlled at the start of a GOB or of a row within a GOB, any intra-q macroblock must be the first MB of that GOB or row; in the I-pictures of MPEG-1, an intra-q can be any of the macroblocks.

Each block within the MB is DCT coded, and the coefficients are divided by the quantiser step size and rounded to the nearest integer. The quantiser step size is derived from the multiplication of the quantisation weighting matrix and the quantiser parameter (1 to 31). Thus the quantiser step size is different for different coefficients and may change from MB to MB. The only exception is the DC coefficient, which is treated differently. Because the eye is sensitive to large areas of luminance and chrominance error, the accuracy of each DC value should be high and fixed. The quantiser step size for the DC coefficient is therefore fixed at eight. Since the DC weighting element of the quantisation weighting matrix is eight, the quantiser parameter for the DC coefficient is always 1, irrespective of the quantisation parameter used for the remaining AC coefficients.
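A minimal sketch of this quantisation rule, following the simplified description above (the function name is illustrative, and details of the standard's exact intra/inter quantisation formulas are omitted):

```python
def quantise_block(coeffs, weights, q_scale):
    """Quantise a block of DCT coefficients (flattened lists here).
    AC step sizes come from weight * q_scale, so they vary per coefficient
    and per macroblock; the DC step is fixed at 8, i.e. DC weight 8 with
    an effective quantiser parameter of 1."""
    out = []
    for i, (c, w) in enumerate(zip(coeffs, weights)):
        step = 8 if i == 0 else w * q_scale  # index 0 is the DC coefficient
        out.append(round(c / step))          # rounded to the nearest integer
    return out
```

Note that changing q_scale changes every AC step size but leaves the DC accuracy untouched, which is exactly the behaviour described above.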

Due to the strong correlation between the DC values of blocks within a picture, the DC indices are coded losslessly by DPCM. Such a correlation does not exist among the AC coefficients, and hence they are coded independently. The prediction for the DC coefficients of the luminance blocks follows the coding order of blocks within a macroblock and the raster scan order. For example, in the macroblocks of the 4:2:0 format pictures shown in Figure 7.13, the DC coefficient of block Y2 is used as the prediction for the DC coefficient of block Y3, and the DC coefficient of block Y3 is the prediction for the DC coefficient of Y0 of the next macroblock. For the chrominance blocks, the DC coefficient of the corresponding block in the previous macroblock is used as the prediction.


Figure 7.13: Positions of luminance and chrominance blocks within a macroblock in 4:2:0 format

The differentially coded DC coefficient and the remaining AC coefficients are zigzag scanned, in the same manner as was explained for the H.261 coefficients in Chapter 6. A copy of the coded picture is stored in the frame store to be used in the prediction of the next P-picture and of the past or future B-pictures.

P pictures

As in I-pictures, each P-picture is divided into slices, which are in turn divided into macroblocks and then blocks for coding. Coding of P-pictures is more complex than that of I-pictures, since motion compensated blocks may be constructed. For inter macroblocks, the difference between the motion compensated macroblock and the current macroblock is partitioned into blocks, which are then DCT transformed and coded.

Decisions on the type of a macroblock, or whether motion compensation should be used or not, are similar to those for H.261 (see Chapter 6). Other H.261 coding tools, such as differential encoding of motion vectors, coded block pattern, zigzag scan, nature of variable length coding etc. are similar. In fact, coding of P-pictures is the same as coding each frame in H.261 with two major differences:

  1. Motion estimation has a half pixel precision and, due to larger distances between the P-frames, the motion estimation range is much larger.
  2. In MPEG-1 all intra-MBs use the quantisation weighting matrix, whereas in H.261 all MBs use a flat matrix. Also, in MPEG-1 the intra-MBs of P-pictures are predictively coded like those of I-pictures, with the exception that the prediction value is fixed at 128 × 8 if the previous macroblock is not intra coded.

Locally decoded P-pictures are stored in the frame store for further prediction. Note that, if B-pictures are used, two buffer stores are needed to store two prediction pictures.

B pictures

As in I and P-pictures, B-pictures are divided into slices, which in turn are divided into macroblocks for coding. Due to the possibility of bidirectional motion compensation, coding is more complex than for P-pictures. Thus the encoder has more decisions to make than in the case of P-pictures. These are: how to divide the picture into slices, determining the best motion vectors to use, deciding whether to use forward, backward or interpolated motion compensation or to code intra, and how to set the quantiser scale. These make processing of B-pictures computationally very intensive. Note that motion compensation is the most costly operation in the codecs, and for every macroblock both forward and backward motion compensations have to be performed.

The encoder does not need to store decoded B-pictures, since they are not used for prediction. Hence B-pictures can be coded with larger distortions. In this regard, to reduce the slice overhead, larger slices (fewer slices in the picture) may be chosen.

In P-pictures, as for H.261, there are eight different types of macroblock. In B-pictures, due to backward motion compensation and interpolation of forward and backward motion compensation, the number of macroblock types is about 14. Figure 7.14 shows the flow chart for macroblock type decisions in B-pictures.

Figure 7.14: Selection of macroblock types in B-pictures

The decision on the macroblock type starts with the selection of a motion compensation mode based on the minimisation of a cost function. The cost function is the mean squared (or mean absolute) error of the luminance difference between the motion compensated macroblock and the current macroblock. The encoder first calculates the best motion compensated macroblock from the previous anchor picture, for forward motion compensation. It then calculates the best motion compensated macroblock from the future anchor picture, as the backward motion compensation. Finally, the average of the two motion compensated macroblocks is calculated to produce the interpolated macroblock. The encoder then selects the mode with the smallest error difference from the current macroblock; in the event of a tie, the interpolated mode is chosen.
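The mode selection can be sketched as follows, using the sum of absolute differences as the cost function (names are illustrative; a real encoder works on 16 × 16 luminance macroblocks rather than short lists):

```python
def select_mode(current, fwd_pred, bwd_pred):
    """Choose forward, backward or interpolated prediction for a
    B-macroblock by minimising the sum of absolute luminance differences."""
    interp = [(f + b) / 2 for f, b in zip(fwd_pred, bwd_pred)]

    def sad(pred):
        return sum(abs(c - p) for c, p in zip(current, pred))

    costs = {'forward': sad(fwd_pred),
             'backward': sad(bwd_pred),
             'interpolated': sad(interp)}
    best = min(costs, key=costs.get)
    # in the event of a tie involving the interpolated mode, prefer it
    if costs['interpolated'] == costs[best]:
        best = 'interpolated'
    return best
```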

Another difference between macroblock types in B and P-pictures is in the definition of noncoded and skipped macroblocks. In P-pictures, a skipped MB is one in which none of the blocks has any significant DCT coefficient (cbp (coded block pattern) = 0) and the motion vector is also zero. The first and the last MB in a slice cannot be declared skipped; they are treated as noncoded.

A noncoded MB in P-pictures is one in which none of the blocks has any significant DCT coefficient (cbp = 0) but the motion vector is nonzero. Thus a first or last MB in a slice that would otherwise be skipped is coded as noncoded, with its motion vector set to zero. In H.261 the noncoded MB was called motion vector only (MC).

In B-pictures, the skipped MB has again all zero DCT coefficients, but the motion vector and the type of prediction mode (forward, backward or interpolated) is exactly the same as that of its previous MB. Similar to P-pictures, the first and the last MB in a slice cannot be declared skipped, and is in fact called noncoded.

The noncoded MB in B-pictures has all of its DCT coefficients zero (cbp = 0), but either its motion vector or its prediction (or both) is different from its previous MB.

D pictures

D-pictures contain only low frequency information, and are coded as the DC coefficients of the blocks. They are intended to be used for fast visible search modes. A bit is transmitted for the macroblock type, although there is only one type. In addition there is a bit denoting the end of the macroblock. D-pictures are not part of the constrained bit stream.

Video buffer verifier

A coded bit stream contains different types of pictures, and each type ideally requires a different number of bits to encode. In addition, the video sequence may vary in complexity with time, and it may be desirable to devote more coding bits to one part of a sequence than to another. For constant bit rate coding, varying the number of bits allocated to each picture requires that the decoder has a buffer to store the bits not needed to decode the immediate picture. The extent to which an encoder can vary the number of bits allocated to each picture depends on the size of this buffer. If the buffer is large an encoder can use greater variations, increasing the picture quality, but at the cost of increasing the decoding delay. The delay is the time taken to fill the input buffer from empty to its current level. An encoder needs to know the size of the decoder's input buffer in order to determine to what extent it can vary the distribution of coding bits among the pictures in the sequence.

In constant bit rate applications (for example decoding a bit stream from a CD-ROM), problems of synchronisation may occur. In these applications, the encoder should generate a bit stream that is perfectly matched to the device. The decoder will display the decoded pictures at their specific rate. If the display clock is not locked to the channel data rate, and this is typically the case, then any mismatch between the encoder and channel clock, and the display clock will eventually cause a buffer overflow or underflow. For example, assume that the display clock runs one part per million too slow with respect to the channel clock. If the data rate is 1 Mbit/s, then the input buffer will fill at an average rate of one bit per second, eventually causing an overflow. If the decoder uses the entire buffer to allocate bits between pictures, the overflow could occur more quickly. For example, suppose the encoder fills the buffer completely except for one byte at the start of each picture. Then overflow will occur after only 8 s!
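The overflow times quoted above follow from a simple calculation; as a sketch (function and parameter names are illustrative):

```python
def seconds_to_overflow(bit_rate, drift_ppm, slack_bits):
    """Time for a display clock slower than the channel clock by
    drift_ppm parts per million to overflow an input buffer that
    has slack_bits of headroom."""
    fill_rate = bit_rate * drift_ppm / 1e6  # surplus bits per second
    return slack_bits / fill_rate

# Text example: 1 Mbit/s, 1 ppm drift, one spare byte per picture -> 8 s
t = seconds_to_overflow(1_000_000, 1, 8)
```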

The model decoder is defined to resolve three problems: it constrains the variability in the number of bits that may be allocated to different pictures; it allows a decoder to initialise its buffer when the system is started; and it allows the decoder to maintain synchronisation while the stream is played. At the beginning of this chapter we mentioned multiplexing and synchronisation of audio and video streams. The tools defined in the international standard for maintaining synchronisation should be used by decoders when multiplexed streams are being played.

The definition of the parameterised model decoder is known as the video buffer verifier (VBV). The parameters used by a particular encoder are defined in the bit stream. This really defines a model decoder that is needed if encoders are to be assured that the coded bit stream they produce will be decodable. The model decoder looks like Figure 7.15.

Figure 7.15: Model decoder

A fixed rate channel is assumed to put bits into the buffer at a constant rate. At regular intervals, set by the picture rate, the picture decoder instantaneously removes all the bits pertaining to the next picture from the input buffer. If there are too few bits in the input buffer, that is, not all the bits for the next picture have been received, then the input buffer underflows and there is an underflow error. If, during the time between picture starts, the capacity of the input buffer is exceeded, then there is an overflow error.

Practical decoders may differ from this model in several important ways. They may not remove all the bits required to decode a picture from the input buffer instantaneously. They may not be able to control the start of decoding very precisely as required by the buffer fullness parameters in the picture header, and they take a finite time to decode. They may also be able to delay decoding for a short time to reduce the chance of underflow occurring. But these differences depend in degree and kind on the exact method of implementation. To satisfy the requirements of different implementations, the MPEG video committee chose a very simple model for the decoder. Practical implementations of decoders must ensure that they can decode the bit stream constrained in this model. In many cases this will be achieved by using an input buffer that is larger than the minimum required, and by using a decoding delay that is larger than the value derived from the buffer fullness parameter. The designer must compensate for any differences between the actual design and the model in order to guarantee that the decoder can handle any bit stream that satisfies the model.

Encoders monitor the status of the model to control the encoder so that overflow does not occur. The calculated buffer fullness is transmitted at the start of each picture so that the decoder can maintain synchronisation.

Buffer size and delay

For constant bit rate operation, each picture header contains a delay parameter (vbv_delay) to enable decoders to synchronise their decoding correctly. This parameter defines the time needed to fill the input buffer of Figure 7.15 from an empty state to the current level, immediately before the picture decoder removes all the bits of the picture. This time thus represents a delay and is measured in units of 1/90 000 s. This unit was chosen because it is almost an exact divisor of the picture durations of the various original video formats (1/24, 1/25, 1/29.97 and 1/30 s) and because it is comparable in duration to an audio sample. The delay is given by:

(7.4)  delay = vbv_delay / 90 000 s

For example, if vbv_delay was 9000, then the delay would be 0.1 s. This means that at the start of a picture the input buffer of the model decoder should contain exactly 0.1s worth of data from the input bit stream.

The bit rate, R, is defined in the sequence header. The number of bits in the input buffer at the beginning of the picture is thus given by

(7.5)  B = R × vbv_delay / 90 000

For example, if vbv_delay and R were 9000 and 1.2 Mbit/s, respectively, then the number of bits in the input buffer would be 120 kbits. The constrained parameter bit stream requires that the input buffer have a capacity of 327 680 bits, and B should never exceed this value [3].
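Equations (7.4) and (7.5) can be checked with a small sketch (function names are illustrative):

```python
def vbv_delay_seconds(vbv_delay):
    """Equation (7.4): decoding delay in seconds from the vbv_delay
    field, which counts in units of 1/90 000 s."""
    return vbv_delay / 90_000

def buffer_occupancy_bits(vbv_delay, bit_rate):
    """Equation (7.5): bits in the input buffer at the moment the
    picture decoder removes the picture, for bit rate R."""
    return bit_rate * vbv_delay / 90_000
```

With the text's values, vbv_delay = 9000 gives a 0.1 s delay, and at R = 1.2 Mbit/s the buffer holds 120 kbit, safely below the constrained-parameter limit of 327 680 bits.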

Rate control and adaptive quantisation

The encoder must make sure that the input buffer of the model decoder is neither overflowed nor underflowed by the bit stream. Since the model decoder removes all the bits associated with a picture from its input buffer instantaneously, it is necessary to control the total number of bits per picture. In H.261 we saw that the encoder could control the bit rate by simply checking its output buffer content. As the buffer fills up, so the quantiser step size is raised to reduce the generated bit rate, and vice versa. The situation in MPEG-1, due to the existence of three different picture types, where each generates a different bit rate, is slightly more complex. First, the encoder should allocate the total number of bits among the various types of picture within a GOP, so that the perceived image quality is suitably balanced. The distribution will vary with the scene content and the particular distribution of I, P and B-pictures within a GOP.

Investigations have shown that for most natural scenes, each P-picture might generate as many as 2–5 times the number of bits of a B-picture, and an I-picture three times those of the P-picture. If there is little motion and high texture, then a greater proportion of the bits should be assigned to I-pictures. Similarly, if there is strong motion, then a proportion of bits assigned to P-pictures should be increased. In both cases lower quality from the B-pictures is expected, to permit the anchor I and P-pictures to be coded at their best possible quality.

Our investigations with variable bit rate (VBR) video, where the quantiser step size is kept constant (no rate control), show that the ratios of generated bits are 6:3:2, for I, P and B-pictures, respectively [9]. Of course at these ratios, due to the fixed quantiser step size, the image quality is almost constant, not only for each picture (in fact slightly better for B-pictures, due to better motion compensation), but throughout the image. Again, if we lower the expected quality for B-pictures, we can change that ratio in favour of I and P-pictures.

Although these ratios appear to be very important for a suitable balance in picture quality, one should not worry very much about their exact values. The reason is that it is possible to make the encoder intelligent enough to learn the best ratio. For example, after coding each GOP, one can multiply the average value of the quantiser scale in each picture by the bit rate generated at that picture. Such a quantity can be used as the complexity index, since larger complexity indices should be due to both larger quantiser step sizes and larger bit rates. Therefore, based on the complexity index one can derive a new coding ratio, and the target bit rate for each picture in the next GOP is based on this new ratio.

As an example, let us assume that SIF-625 video is to be coded at 1.2 Mbit/s. Let us also assume that the GOP structure of N = 12 and M = 3 is used. Therefore there will be one I-picture, three P-pictures and eight B-pictures in each GOP. First of all, the target bit rate for each GOP is 1200 × 12/25 = 576 kbit/GOP. If we assume a coding ratio of 6:3:2, then the target bit rate for each of the I, P and B-pictures will be:

I-picture: 576 × 6/31 ≈ 111.5 kbit

P-picture: 576 × 3/31 ≈ 55.7 kbit

B-picture: 576 × 2/31 ≈ 37.2 kbit

(the denominator 31 being the total weight 1 × 6 + 3 × 3 + 8 × 2)

Therefore, each picture is aiming for its own target bit rate. Similar to H.261, one can control the quantiser step size for that picture such that the required bit rate is achieved. At the end of the GOP, the complexity index for each picture type is calculated. Note that for P and B-pictures, the complexity index is the average of three and eight complexity indices, respectively. These indices are used to define new coding ratios between the picture types for coding of the next GOP. Also, the bits generated in that GOP are added together, and any surplus or deficit with respect to the GOP target bit rate is carried over to the next GOP.
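The bit allocation of the example can be sketched as follows (function and variable names are illustrative):

```python
def allocate_gop_bits(gop_kbit, ratios, counts):
    """Split a GOP bit budget among picture types in proportion to
    per-picture coding ratios (6:3:2 for I, P, B in the example),
    given how many pictures of each type the GOP contains."""
    total_weight = sum(r * n for r, n in zip(ratios, counts))
    return [gop_kbit * r / total_weight for r in ratios]

# SIF-625 at 1.2 Mbit/s, N = 12, M = 3: 1200 * 12 / 25 = 576 kbit per GOP
targets = allocate_gop_bits(576, [6, 3, 2], [1, 3, 8])
# per-picture targets in kbit: I ~ 111.5, P ~ 55.7, B ~ 37.2
```

Updating the ratios from the measured complexity indices, as described above, would simply mean calling the function again with new ratio values for the next GOP.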

In practice, the target bit rates for B-pictures compared with other picture types are deliberately reduced by a factor of 1.4. This is done for two reasons. First, due to efficient bidirectional motion estimation in B-pictures, their transform coefficients are normally small. Increasing the quantiser step size hardly affects these naturally small value coefficients' distortions, but the overall bit rate can be reduced significantly. Second, since B-pictures are not used in the prediction loop of the encoder, even if they are coarsely coded, the encoding error is not transferred to the subsequent frames. This is not the case with the anchor I and P-pictures, since through the prediction loop, any saving in one frame due to coarser coding has to be paid back in the following frames.

Experimental results indicate that by reducing the target bit rates of the B-pictures by a factor of 1.4, the average quantiser step size for these pictures rises almost by the same factor, but its quality (PSNR) only slightly deteriorates, which makes it worth doing.

Note also that, although in the above example (which is typical) the bits per B-picture are fewer than those of I and P-pictures, nevertheless the eight B-pictures in a GOP generate almost 8 × 37 = 296 kbits, which is more than 50 per cent of the bits in a GOP. The first implication is that use of the factor 1.4 can have a significant reduction in the overall bit rate. The second implication of this is in the transmission of video over packet networks, where during periods of congestion, if only B-pictures are discarded, so reducing the network load by 50 per cent, congestion can be eased without significantly affecting the picture quality. Note that B-pictures are not used for predictions, so their loss will result in only a very brief (480 ms) reduction in quality.

Decoder

The decoder block diagram is based on the same principle as the local decoder associated with the encoder as shown in Figure 7.16.

Figure 7.16: A block diagram of an MPEG-1 decoder

The incoming bit stream is stored in the buffer, and is demultiplexed into the coding parameters such as DCT coefficients, motion vectors, macroblock types, addresses etc. These are then variable length decoded using the locally provided tables. The DCT coefficients after inverse quantisation are inverse DCT transformed and added to the motion compensated prediction (as required) to reconstruct the pictures. The frame stores are updated by the decoded I and P-pictures. Finally, the decoded pictures are reordered to their original scanned form.

At the beginning of the sequence, the decoder will decode the sequence header, including the sequence parameters. If the bit stream is not constrained, and a parameter exceeds the capability of the decoder, then the decoder should be able to detect this. If the decoder determines that it can decode the bit stream, then it will set up its parameters to match those defined in the sequence header. This will include horizontal and vertical resolutions and aspect ratio, the bit rate and the quantisation weighting matrices.

Next, the decoder will decode the group of pictures header field, to determine the GOP structure. It will then decode the first picture header in the group of pictures and, for constant bit rate operation, determine the buffer fullness. It will delay decoding the rest of the sequence until the input buffer is filled to the correct level. By doing this, the decoder can be sure that no buffer overflow or underflow will occur during decoding. Normally, the input buffer size will be larger than the minimum required by the bit stream, giving a range of fullness at which the decoder may start to decode.

If it is required to play a recorded sequence from a random point in the bit stream, the decoder should discard all the bits until it finds a sequence start code, a group of pictures start code, or a picture start code which introduces an I-picture. The slices and macroblocks in the picture are decoded and written into a display buffer, and perhaps into another buffer. The decoded pictures may be postprocessed and displayed in the order defined by the temporal reference at the picture rate defined in the sequence header. Subsequent pictures are processed at the appropriate times to avoid buffer overflow and underflow.

Decoding for fast play

Fast forward can be supported by D-pictures. It can also be supported by an appropriate spacing of I-pictures in a sequence. For example, if I-pictures were spaced regularly every 12 pictures, then the decoder might be able to play the sequence at 12 times the normal speed by decoding and displaying only the I-pictures. Even this simple concept places a considerable burden on the storage media and the decoder. The media must be capable of speeding up and delivering 12 times the data rate. The decoder must be capable of accepting this higher data rate and decoding the I-pictures. Since I-pictures typically require significantly more bits to code than P and B-pictures, the decoder will have to decode significantly more than 1/12 of the data. In addition, it has to search for picture start codes and discard the data for P and B-pictures. For example, consider a sequence with N = 12 and M = 3, such as:

  • I B B P B B P B B P B B I B B

Assume that the average bit rate per picture is C; each B-picture requires 0.6C, each P-picture requires 1.4C, and the remaining 3C are assigned to the I-picture in the GOP (8 × 0.6C + 3 × 1.4C + 3C = 12C). The I-picture thus carries 25 per cent of the total bit rate in just 1/12 of the display time.

Another way to achieve fast forward in a constant bit rate application is for the medium itself to sort out the I-pictures and transmit only them. This would allow the data rate to remain constant. Since this selection process can be made to produce a valid MPEG-1 video bit stream, the decoder should be able to decode it. If every I-picture of the preceding example were selected, then one I-picture would be transmitted every three picture periods, and the speed up rate would be 12/3 = 4 times.

If alternate I-pictures of the preceding example were selected, then one I-picture would again be transmitted every three picture periods, but the speed up rate would be 24/3 = 8 times. If one in N I-pictures of the preceding example were selected, then the speed up rate would be 4N.
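The speed-up figures above follow from a simple relation; as a sketch (assuming, as in the example, that each I-picture takes three picture periods to transmit at the constant channel rate, since it carries 3C bits):

```python
def i_only_speedup(gop_length, i_periods_per_i, skip):
    """Speed-up when only every skip-th I-picture is transmitted and each
    I-picture takes i_periods_per_i picture periods to deliver at the
    constant channel rate: each transmitted I-picture stands for
    gop_length * skip original pictures."""
    return gop_length * skip / i_periods_per_i
```

For the N = 12 example this gives 4x for every I-picture, 8x for alternate I-pictures, and 4N for one in N.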

Decoding for pause and step mode

Decoding for pause requires the decoder to be able to control the incoming bit stream, and display a decoded picture without decoding any additional pictures. If the decoder has full control over the bit stream, then it can be stopped for pause and resumed when play begins. If the decoder has less control, as in the case of a CD-ROM, there may be a delay before play can be resumed.

Decoding for reverse play

To decode a bit stream and play in reverse, the decoder must decode each group of pictures in the forward direction, store the entire decoded pictures, then display them in reverse order. This imposes severe storage requirements on the decoder in addition to any problems in gaining access to the decoded bit stream in the correct order.

To reduce decoder memory requirements, groups of pictures should be small. Unfortunately, there is no mechanism in the syntax for the encoder to state what the decoder requirements are in order to play in reverse. The amount of display buffer storage may be reduced by reordering the pictures, either by having the storage unit read and transmit them in another order, or by reordering the coded pictures in a decoder buffer. To illustrate this, consider the typical group of pictures shown in Figure 7.17.

B   B   I   B   B   P   B   B   P   B   B   P      pictures in display order
0   1   2   3   4   5   6   7   8   9   10  11     temporal reference

I   B   B   P   B   B   P   B   B   P   B   B      pictures in decoding order
2   0   1   5   3   4   8   6   7   11  9   10     temporal reference

I   P   P   P   B   B   B   B   B   B   B   B      pictures in new order
2   5   8   11  10  9   7   6   4   3   1   0      temporal reference


Figure 7.17: Example of group of pictures, in the display, decoding and new orders

The decoder would decode pictures in the new order, and display them in the reverse of the normal display. Since the B-pictures are not decoded until they are ready to be displayed, the display buffer storage is minimised. The first two B-pictures, 0 and 1, would remain stored in the input buffer until the last P-picture in the previous group of pictures was decoded.

Postprocessing

Editing

Editing of a video sequence is best performed before compression, but situations may arise where only the coded bit stream is available. One possible method would be to decode the bit stream, perform the required editing on the pixels and recode the bit stream. This usually leads to a loss in video quality, and it is better, if possible, to edit the coded bit stream itself.

Although editing may take several forms, the following discussion pertains only to editing at the picture level, that is deletion of the coded video material from a bit stream and insertion of coded video material into a bit stream, or rearrangement of coded video material within a bit stream.

If a requirement for editing is expected (e.g. clip video is provided analogous to clip art for still pictures), then the video can be encoded with well defined cutting points. These cutting points are places at which the bit stream may be broken apart or joined. Each cutting point should be followed by a closed group of pictures (e.g. a GOP that starts with an I-picture). This allows smooth play after editing.

To allow the decoder to play the edited video without having to adopt any unusual strategy to avoid overflow and underflow, the encoder should make the buffer fullness take the same value at the first I-picture following every cutting point. This value should be the same as that of the first picture in the sequence. If this suggestion is not followed, then the editor may make an adjustment either by padding (stuffing bits or macroblocks) or by recoding a few images to make them smaller.

If the buffer fullness is mismatched and the editor makes no correction, then the decoder will have to make some adjustment when playing over an edited cut. For example, consider a coded sequence consisting of three clips, A, B and C, in order. Assume that clip B is completely removed by editing, so that the edited sequence consists only of clip A followed immediately by clip C, as illustrated in Figure 7.18.

Figure 7.18: Edited sequences

Assume that in the original sequence the buffer is three-quarters full at the beginning of clip B, and one-quarter full at the beginning of clip C. A decoder playing the edited sequence will encounter the beginning of clip C with its buffer three-quarters full, but the first picture in clip C will contain a buffer fullness value corresponding to a quarter full buffer. In order to avoid buffer overflow, the decoder may try to pause the input bit stream, or discard pictures without displaying them (preferably B-pictures), or change the decoder timing.

For another example, assume that in the original sequence the buffer is one-quarter full at the beginning of clip B, and three-quarters full at the beginning of clip C. A decoder playing the edited sequence will encounter the beginning of clip C with its buffer one-quarter full, but the first picture in clip C will contain a buffer fullness value corresponding to a three-quarters full buffer. In order to avoid buffer underflow, the decoder may display one or more pictures for longer than the normal time.
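The two mismatch scenarios above can be sketched with a toy buffer model. The buffer size, channel rate and picture sizes below are illustrative numbers, not values from the MPEG-1 standard:

```python
# Toy model of decoder buffer occupancy across an edited cut, illustrating
# why a buffer-fullness mismatch at a splice can overflow the buffer.
# All sizes and rates are made-up numbers, not values from the standard.

BUFFER_SIZE = 100_000      # decoder buffer capacity in bits (hypothetical)
RATE_PER_FRAME = 40_000    # bits delivered per frame interval (hypothetical)

def play(picture_sizes, fullness):
    """Remove one picture per frame interval, then refill at the channel rate."""
    events = []
    for size in picture_sizes:
        fullness -= size               # decoder extracts the coded picture
        if fullness < 0:
            events.append("underflow")  # picture not fully received in time
            fullness = 0
        fullness += RATE_PER_FRAME     # channel keeps delivering bits
        if fullness > BUFFER_SIZE:
            events.append("overflow")   # arriving bits have nowhere to go
            fullness = BUFFER_SIZE
    return events

# Clip C was encoded assuming a quarter-full buffer at its first picture:
clip_c = [20_000, 40_000, 10_000, 40_000]   # coded picture sizes in bits

print(play(clip_c, fullness=25_000))   # as encoded: no events
print(play(clip_c, fullness=75_000))   # after the cut: buffer overflows
```

Played from the fullness the encoder assumed, the trajectory stays in range; played from the mismatched fullness left by the edit, the same pictures drive the buffer over its capacity.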

If provision for editing was not specifically made in the coded bit stream, or if editing must be possible at any picture, then the editing task is more complex, and places a greater burden on the decoder to manage buffer overflow and underflow problems. The easiest task is to cut at the beginning of a group of pictures. If the group of pictures following the cut is open (e.g. the GOP starts with two B-pictures), which can be detected by examining the closed GOP flag in the group of pictures header, then the editor must set the broken link bit to 1 to indicate to the decoder that the previous group of pictures cannot be used for decoding any B-pictures.
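This cutting rule can be sketched as follows. The `GopHeader` record is a made-up structure whose fields merely mirror the closed GOP and broken link flags of the group of pictures header:

```python
# Hypothetical GOP record for illustration; the field names mirror the
# MPEG-1 group-of-pictures header flags, but the structure itself is made up.
from dataclasses import dataclass

@dataclass
class GopHeader:
    closed_gop: bool    # True if the leading B-pictures need no reference
                        # picture from the previous GOP
    broken_link: bool   # set by an editor when that previous GOP is gone

def cut_before(gop: GopHeader) -> GopHeader:
    """Apply the editing rule: an open GOP that has lost its predecessor
    must have broken_link set, so the decoder can skip or conceal the
    leading B-pictures instead of decoding them from a wrong reference."""
    if not gop.closed_gop:
        gop.broken_link = True
    return gop

g = cut_before(GopHeader(closed_gop=False, broken_link=False))
```

A closed GOP passes through unchanged, since its leading B-pictures never referenced the deleted material.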

Resampling and up conversion

The decoded sequence may not match the picture rate or the spatial resolution of the display device. In such situations (which occur frequently), the decoded video must be resampled or scaled. In Chapter 2 we saw that CCIR-601 video was subsampled into SIF format for coding, hence for display it is appropriate to upsample it back into its original format. Similarly, it has to be temporally converted for proper display, as was discussed in Chapter 2. This is particularly important for cases where video was converted from film.
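As a rough illustration of spatial up-conversion, a horizontal 2:1 upsampling from SIF width (352 samples) back toward CCIR-601 width (704 here; the full 720 would need edge padding) can reinsert interpolated samples between the decoded ones. A simple midpoint interpolator is sketched below; a practical decoder would use a proper half-band interpolation filter:

```python
# Minimal sketch of 2:1 horizontal up-conversion of one luminance line.
# Midpoint (linear) interpolation is used for simplicity; real decoders
# use a longer half-band filter for better frequency response.

def upsample_line(line):
    """Double the number of samples: keep originals, interpolate midpoints."""
    out = []
    for i, s in enumerate(line):
        out.append(s)
        nxt = line[i + 1] if i + 1 < len(line) else s  # repeat last sample
        out.append((s + nxt) // 2)                     # midpoint estimate
    return out

print(upsample_line([10, 20, 30]))  # → [10, 15, 20, 25, 30, 30]
```

Applying the same operation column-wise (and temporally, frame repetition or interpolation) completes the conversion for display.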

Problems

1. 

In MPEG-1, with a group of pictures structure of N = 12 and M = 3, the maximum motion speed is assumed to be 15.5 pixels/frame (half-pixel precision). Calculate the number of search operations required to estimate the motion in P-pictures with:

  1. the telescopic method
  2. direct search on the P-pictures

2. 

In an MPEG-1 encoder for head-and-shoulders-type pictures, the maximum motion speed for P-pictures is set to 13 pixels, and the forward and backward search ranges for the first B-picture in the subgroup are set to five pixels and nine pixels, respectively.

  1. explain why the search range for the B-picture is smaller than that of the P-picture
  2. what would be the forward and backward search ranges for the second B-picture in the subgroup?
  3. calculate the number of search operations with half pixel precision for the P and B-pictures

3. 

An I-picture is coded at 50 kbits. If the quantiser step size is linearly distributed between 10 and 16, find the complexity index for this picture.

4. 

In coding of SIF-625 video at 1.2 Mbit/s, with a group of pictures (GOP) structure of N = 12, M = 3, the ratios of the complexity indices of I, P and B pictures are 20:10:7, respectively. Calculate the target bit rate for coding of each frame in the next GOP.

5. 

If in problem 4 the allocated bits to B-pictures were reduced by a factor of 1.4, find the new complexity indices and the target bits to each picture type.

6. 

In problem 4, if due to a scene change the average quantiser step size in the last P-picture of the GOP was doubled, but those of the other pictures did not change significantly:

  1. how do the complexity index ratios change?
  2. what is the new target bit rate for each picture type?

7. 

If in problem 4 the complexity index ratios were wrongly set to 1:1:1, but after coding the average quantiser step sizes for I, P and B were 60, 20 and 15, respectively, find:

  1. the target bit rate for each picture type before coding
  2. the target bit rate for each picture type of the next GOP.

Answers

1. 

  1. each operation = (2 × 15 + 1)² + 8 = 969; total operations = 3 × 969 = 2907
  2. ω = 3 × 15 = 45; total number of operations = (2 × 45 + 1)² + 8 = 8289
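The arithmetic of this answer can be checked with a short script. It assumes full-search block matching, in which a ±w search costs (2w + 1)² integer-pel positions plus 8 half-pel refinements around the best match, with the 15.5 pixels/frame figure treated as a ±15 integer range, following the convention of the answer:

```python
# Operation counts for problem 1, assuming full-search block matching:
# a +/-w search visits (2w + 1)^2 integer positions plus 8 half-pel points.

def search_ops(w):
    return (2 * w + 1) ** 2 + 8

M = 3    # P-picture spacing (N = 12, M = 3)
w = 15   # 15.5 pixels/frame treated as a +/-15 integer-pel range

telescopic = M * search_ops(w)   # three successive frame-to-frame searches
direct = search_ops(M * w)       # one direct search over the full +/-45 range

print(telescopic, direct)  # 2907 8289
```

The telescopic method is cheaper because the cost grows quadratically with the search range but only linearly with the number of intermediate searches.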

2. 

  1. the first B-picture is closer to its forward prediction picture than to its backward prediction picture.
  2. for the second B-picture, FWD = 9 and BWD = 5
  3. for the P-picture (2 × 13 + 1)² + 8 = 737, and for each B-picture (2 × 5 + 1)² + 8 + (2 × 9 + 1)² + 8 = 498

3. 

The average quantiser step size is (10 + 16)/2 = 13, and the complexity index is 50 × 1000 × 13 = 65 × 10⁴.

4. 

(8 × 7) + (3 × 10) + 20 = 106

At 25 frames/s and 1.2 Mbit/s, a 12-frame GOP carries 12/25 × 1.2 Mbit = 576 kbits, so

for I-pictures: 20/106 × 576 = 108.7 kbits, for P = 54.3 kbits and for B = 38 kbits.
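The allocation of problem 4 can be reproduced in a few lines: the GOP budget of 576 kbits (12 frames at 25 frames/s and 1.2 Mbit/s) is shared out in proportion to each picture type's complexity index:

```python
# Target bit allocation for problem 4: share the GOP bit budget among
# picture types in proportion to their complexity indices.

GOP_BITS = 12 / 25 * 1.2e6          # 576_000 bits per 12-frame GOP
counts = {"I": 1, "P": 3, "B": 8}   # picture counts for N = 12, M = 3
index = {"I": 20, "P": 10, "B": 7}  # given complexity-index ratios

total = sum(counts[t] * index[t] for t in counts)          # 106
target = {t: GOP_BITS * index[t] / total for t in counts}  # bits per picture

for t in target:
    print(t, round(target[t] / 1000, 1), "kbits")  # I 108.7, P 54.3, B 38.0
```

The same two lines of arithmetic, with updated indices, also give the answers to problems 5–7.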

5. 

The new index ratio for B becomes 7/1.4 = 5

(8 × 5) + (3 × 10) + 20 = 90, and the bits for I = 128 kbits, for P = 64 kbits and for B = 32 kbits

6. 

20 + 10 + 10 + 20 + (8 × 7) = 116

for P, the average index of (10 + 10 + 20)/3 = 13.3 should be used; hence the target bit rates are: for I = 99.3 kbits, for P = 66.2 kbits and for B = 34.75 kbits.

7. 

  1. all equal to 48 kbit/s
  2. 60 + (3 × 20) + (8 × 15) = 240

    for I = 144 kbits, for P = 48 kbits and for B = 36 kbits

References

1 MPEG-1: 'Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s'. ISO/IEC 11172-2: video, November 1991

2 H.261: 'ITU-T Recommendation H.261, video codec for audiovisual services at p × 64 kbit/s'. Geneva, 1990

3 'Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s'. ISO/IEC 11172-1: systems, November 1991

4 CCIR Recommendation 601: 'Digital methods of transmitting television information'. Recommendation 601, encoding parameters of digital television for studios

5 WILSON, D., and GHANBARI, M.: 'Frame sequence partitioning of video for efficient multiplexing', Electron. Lett., 1998, 34:15, pp. 1480–1481

6 GHANBARI, M.: 'An adapted H.261 two-layer video codec for ATM networks', IEEE Trans. Commun., 1992, 40:9, pp. 1481–1490

7 PEARSON, D.E.: 'Transmission and display of pictorial information' (Pentech Press, 1975)

8 SEFERIDIS, V., and GHANBARI, M.: 'Adaptive motion estimation based on texture analysis', IEEE Trans. Commun., 1994, 42:2/3/4, pp. 1277–1287

9 ALDRIDGE, R.P., GHANBARI, M., and PEARSON, D.E.: 'Exploiting the structure of MPEG-2 for statistically multiplexing video'. Proceedings of 1996 International Picture Coding Symposium, PCS '96, Melbourne, Australia, 13–15 March 1996, pp. 111–113





Standard Codecs: Image Compression to Advanced Video Coding (IET Telecommunications Series)
ISBN: 0852967101
Year: 2005
Authors: M. Ghanbari