3. MPEG Coding and Compressed-Domain Processing

3.1 MPEG Frame Coding

Efficient CDP algorithms are designed to exploit various features of the MPEG video compression standards. Detailed descriptions of the MPEG video compression standards can be found in [1][2]. This section briefly reviews some aspects of the MPEG standards that are relevant to CDP.

MPEG codes video in a hierarchy of units called sequences, groups of pictures (GOPs), pictures, slices, macroblocks, and blocks. 16x16 blocks of pixels in the original video frames are coded as a macroblock, which consists of four 8x8 blocks. The macroblocks are scanned in a left-to-right, top-to-bottom fashion, and series of these macroblocks form a slice. All the slices in a frame comprise a picture, contiguous pictures form a GOP, and all the GOPs form the entire sequence. The MPEG syntax allows a GOP to contain any number of frames, but typical sizes range from 9 to 15 frames. Each GOP refreshes the temporal prediction by coding the first frame in intraframe mode, i.e., without prediction. The remaining frames in the GOP can be coded with intraframe or interframe (predictive) coding techniques.

The MPEG algorithm allows each frame to be coded in one of three modes: intraframe (I), forward prediction (P), and bidirectional prediction (B). A typical IPB pattern in display order is:

B₇ B₈ P₉ B₁₀ B₁₁ I₀ B₁ B₂ P₃ B₄ B₅ P₆ B₇ B₈ P₉ B₁₀ B₁₁ I₀ B₁ B₂ P₃

The subscripts represent the index of the frame within a GOP. I frames are coded independently of other frames. P frames depend on a prediction based on the preceding I or P frame. B frames depend on a prediction based on the preceding and following I or P frames. Notice that each B frame depends on data from a future frame, i.e., future frame must be (de)coded before a current B frame can be (de)coded. For this reason, the coding order is distinguished from the display order. The coding order for the sequence shown above is:

P₉ B₇ B₈ G I₀ B₁₀ B₁₁ P₃ B₁ B₂ P₆ B₄ B₅ P₉ B₇ B₈ G I₀ B₁₀ B₁₁ P₃ B₁ B₂

MPEG requires the coded video data to be placed in the data stream in coding order. G represents a GOP header that is placed in the compressed bitstream.

A GOP always begins with an I frame. Typically, it includes the following (display order) P and B frames that occur before the next I frame, although the syntax also allows a GOP to contain multiple I frames. The GOP header does not specify the number of I, P, or B frames in the GOP, nor does it specify the structure of the GOP -- these are completely determined by the order of the data in the stream. Thus, there are no rules that restrict the size and structure of the GOP, although care should be taken to ensure that the buffer constraints are satisfied.

MPEG uses block motion-compensated prediction to reduce the temporal redundancies inherent to video. In block motion-compensated prediction, the current frame is divided into 16x16 pixel units called macroblocks. Each macroblock is compared to a number of 16x16 blocks in a previously coded frame. A single motion vector (MV) is used to represent this block with the best match. This block is used as a prediction of the current block, and only the error in the prediction, called the residual, is coded into the data stream.

The frames of a video sequence can be coded as an I, P, or B frame. In I frames, every macroblock must be coded in intraframe mode, i.e., without prediction. In P frames, each macroblock can be coded with forward prediction or in intraframe mode. In B frames, each macroblock can be coded with forward, backward, or bidirectional prediction or in intraframe mode. One MV is specified for each forward- and backward-predicted macroblock while two MVs are specified for each bidirectionally predicted macroblock. Thus, each P frame has a forward motion vector field and one anchor frame, while each B frame has a forward and backward motion vector field and two anchor frames. In some of the following sections, we define B_for and B_back frames as B frames that use only forward or only backwards prediction. Specifically, B_for frames can only have intra- and forward-predicted macroblocks while B_back frames can only have intra- and backward-predicted macroblocks.

MPEG uses discrete cosine transform (DCT) coding to code the intraframe and residual macroblocks. Specifically, four 8x8 block DCTs are used to encode each macroblock and the resulting DCT coefficients are quantized. Quantization usually results in a sparse representation of the data, i.e., one in which most of the amplitudes of the quantized DCT coefficients are equal to zero. Then, only the amplitudes and locations of the nonzero coefficients are coded into the compressed data stream.

3.2 MPEG Field Coding

While many video compression algorithms, including MPEG-1, H.261, and H.263, are designed for progressive video sequences, MPEG-2 was designed to support both progressive and interlaced video sequences, where two fields, containing the even and odd scanlines, are contained in each frame. MPEG-2 provides a number of coding options to support interlaced video. First, each interlaced video frame can be coded as a frame picture in which the two fields are coded as a single unit or as a field picture in which the fields are coded sequentially. Next, MPEG-2 allows macroblocks to be coded in one of five motion compensation modes: frame prediction for frame pictures, field prediction for frame pictures, field prediction for field pictures, 16x8 prediction for field pictures, and dual prime motion compensation. The frame picture and field picture prediction dependencies are as follows. For frame pictures, the top and bottom reference fields are the top and bottom fields of the previous I or P frame. For field pictures, the top and bottom reference fields are the most recent top and bottom fields. For example, if the top field is specified to be first, then MVs from the top field can point to the top or bottom fields in the previous frame, while MVs from the bottom field can point to the top field of the current frame or the bottom field of the previous frame. Our discussion focuses on P-frame prediction because the transcoder described in Subsection 5.1.5 only processes the MPEG I and P frames. We also focus on field picture coding of interlaced video, and do not discuss dual prime motion compensation.

In MPEG field picture coding, each field is divided into 16x16 macroblocks, each of which can be coded with field prediction or 16x8 motion compensation. In field prediction, the 16x16 field macroblock will contain a field selection bit which indicates whether the prediction is based on the top or bottom reference field and a motion vector which points to the 16x16 region in the appropriate field. In 16x8 prediction, the 16x16 field macroblock is divided into its upper and lower halves, each of which contains 16x8 pixels. Each half has a field selection bit which specifies whether the top or bottom reference field is used and a motion vector which points to the 16x8 pixel region in the appropriate field.

3.3 MPEG Bitstream Syntax

The syntax of the MPEG-1 data stream has the following structure: A Sequence header consists of a sequence start code followed by sequence parameters. Sequences contain a number of GOPs. Each GOP header consists of a GOP start code followed by GOP parameters. GOPs contain a number of pictures. Each picture header consists of a picture start code followed by picture parameters. Pictures contain a number of slices. Each slice header consists of a slice start code followed by slice parameters. The slice header is followed by slice data, which contains the coded macroblocks.

The sequence header specifies the picture height, picture width, and sample aspect ratio. In addition, it sets the frame rate, bitrate, and buffer size for the sequence. If the default quantizers are not used, then the quantizer matrices are also included in the sequence header. The GOP header specifies the time code and indicates whether the GOP is open or closed. A GOP is open or closed depending on whether or not the temporal prediction of its frames require data from other GOPs. The picture header specifies the temporal reference parameter, the picture type (I, P, or B), and the buffer fullness (via the vbv_delay parameter). If temporal prediction is used, it also describes the motion vector precision (full or half pixel) and the motion vector range. The slice header specifies the macroblock row in which slice starts and the initial quantizer scale factor for the DCT coefficients. The macroblock header specifies the relative position of the macroblock in relation to the previously coded macroblock. It contains a flag to indicate whether intra- or interframe coding is used. If interframe coding is used, it contains the coded motion vectors, which may be differentially coded with respect to previous motion vectors. The quantizer scale factor may be adjusted at the macroblock level. One bit is used to specify whether the factor is adjusted. If it is, the new scale factor is specified. The macroblock header also specifies a coded block pattern for the macroblock. This describes which of the luminance and chrominance DCT blocks are coded. Finally, the DCT coefficients of the coded blocks are coded into the bitstream. The DC coefficient is coded first, followed by the runlengths and amplitudes of the remaining nonzero coefficients. If it is an intra-macroblock, then the DC coefficient is coded differentially.

The sequence, GOP, picture, and slice headers begin with start codes, which are four-byte identifiers that begin with 23 zeros and a one followed by a one byte unique identifier. Start codes are useful because they can be found by examining the bitstream; this facilitates efficient random access into the compressed bitstream. For example, one could find the coded data that corresponds to the 2^nd slice of the 2^nd picture of the 22^nd GOP by simply examining the coded data stream, without parsing and decoding the data. Of course, reconstructing the actual pixels of that slice may require parsing and decoding additional portions of the data stream because of the prediction used in conventional video coding algorithms. However, computational benefits could still be achieved by locating the beginning of the 22^nd GOP and parsing and decoding the data from that point on, thus exploiting the temporal refresh property inherent to GOPs.

3.4 Compressed-Domain Processing for MPEG

The CDP problem statement was described in Section 2. In essence, the goal of CDP is to develop efficient algorithms for performing processing operations on compressed bitstreams. While the conventional approach requires decompressing the bitstream, processing the decoded frames, and re-encoding the result, improved efficiency, with respect to compute and memory requirements, can be achieved by exploiting structures used in the compression algorithms and using this knowledge to avoid the complete decode and re-encode cycle. In the context of MPEG transcoding, improved efficiency can be achieved by exploiting the structures used in MPEG coding. Furthermore, a decode/process/re-encode cycle can lead to significant loss of quality (even if no processing is performed besides the decode and re-encode) -- carefully designed CDP algorithms can greatly reduce and in some cases prevent this loss in quality.

MPEG coding uses a number of structures, and different compressed-domain processing operations require processing at different levels of depth. From highest to lowest level, these levels include:

Sequence-level processing
GOP-level processing
Frame-level processing
Slice-level processing
Macroblock-level processing
Block-level processing

Generally speaking, deeper-level operations require more computations. For example, some processing operations in the time domain require less computation if no information below the frame level needs to be adjusted. Operations of this kind include fast forward recoding and cut-and-paste or splicing operations restricted to cut points at GOP boundaries. However, if frame-accurate splicing [3] is required, frame and macroblock level information may need to be adjusted for frames around the splice point, as described in Section 5. In addition, in frame rate reduction transcoding, if the transcoder chooses to only drop non-reference frames such as B frames, a frame-level parsing operation could suffice.

On the other hand, operations related to the modification of content within video frames have to be performed below the frame level. Operations of this kind include spatial resolution reduction transcoding [4], frame-by-frame video reverse play [5], and many video-editing operations such as fading, logo insertion, and video/object overlaying [6,7]. As expected, these operations require significantly more computations; so for these operations efficient compressed-domain methods can lead to significant improvements.