This section shows how the compressed domain processing methods described in Section 4 can be applied to video transcoding and video processing/editing applications. Algorithms and architectures are described for a number of CDP operations.
With the introduction of the next generation wireless networks, mobile devices will access an increasing amount of media-rich content. However, a mobile device may not have enough display space to render content that was originally created for desktop clients. Moreover, wireless networks typically support lower bandwidths than wired networks, and may not be able to carry media content made for higher-bandwidth wired networks. In these cases, transcoders can be used to transform multimedia content to an appropriate video format and bandwidth for wireless mobile streaming media systems.
A conceptually simple method to perform this transcoding is to decode the original video stream, downsample the decoded frames to a smaller size, and re-encode the downsampled frames at a lower bitrate. However, a typical CCIR601 MPEG-2 video requires almost all the cycles of a 300 MHz CPU to perform real-time decoding. Encoding is significantly more complex and usually cannot be accomplished in real time without the help of dedicated hardware or a high-end PC. These factors render the straightforward transcoding method impractical. Furthermore, this simple approach can lead to significant loss in video quality. In addition, if transcoding is provided as a network service in the path between the content provider and content consumer, it is highly desirable for the transcoding unit to handle as many concurrent sessions as possible. This scalability is critical to enable wireless networks to handle user requests that may be very intense at peak load times. Therefore, it is very important to develop fast algorithms to reduce the compute and memory loads for transcoding sessions.
Video processing applications often involve a combination of spatial and temporal processing. For example, one may wish to downscale the spatial resolution and lower the frame rate of a video sequence. When these video processing applications are performed on compressed video streams, a number of additional requirements may arise. For example, in addition to performing the specified video processing task, the output compressed video stream may need to satisfy additional requirements such as maximum bitrate, buffer size, or particular compression format (e.g. MPEG-4 or H.263). While conventional approaches to applying traditional video processing operations on compressed video streams generally have high compute and memory requirements, the algorithmic optimizations described in Section 4 can be used to design efficient compressed-domain transcoding algorithms with significantly reduced compute and memory requirements. A number of transcoding architectures were discussed in [23][24][25][26].
Figure 40.4 shows a progression of architectures that reduce the compute and memory requirements of such applications. These architectures are discussed in the context of lowering the spatial and temporal resolution of the video from S0, T0 to S1, T1 and lowering the bitrate of the bitstream from R0 to R1. The top diagram shows the conventional approach to processing the compressed video stream. First the input compressed bitstream with bitrate R0 is decoded into its decompressed video frames, which have a spatial resolution and temporal frame rate of S0 and T0. These frames are then processed temporally to a lower frame rate T1 < T0 by dropping appropriate frames. The spatial resolution is then reduced to S1 < S0 by spatially downsampling the remaining frames. The resulting frames with resolution S1, T1 are then re-encoded into a compressed bitstream with a final bitrate of R1 < R0. The memory requirements of this approach are high because of the frame stores required to store the decompressed video frames at resolution S0, T0. The computational requirements are high because of the operations needed to decode, process, and re-encode the frames; in particular, motion estimation performed during re-encoding can be quite compute intensive.
Figure 40.4: Architectural development of CDP algorithms.
The bottom diagram shows an improved approach for this transcoding operation. Once again, the temporal frame rate is reduced at the bitstream layer by exploiting the picture start codes and picture headers. Furthermore, deriving the output coding parameters from those given in the input bitstream can significantly reduce the compute requirements of the final encode operation. This is advantageous because some of the computations that need to be performed in the encoder, such as motion estimation, may have already been performed by the original encoder and may be represented by coding parameters, such as motion vectors, given in the input bitstream. Rather than blindly recomputing this information from the decoded, downsampled video frames, the encoder can exploit the information contained in the input bitstream. In other words, much of the information that is derived in the original encoder can be reused in the transcoder. Specifically, the motion vectors, quantization parameters, and prediction modes contained in the input compressed bitstream can be used to calculate the motion vectors, quantization parameters, and prediction modes used in the encoder, thus largely bypassing the expensive operations performed in the conventional encoder.
Also, when transcoding to reduce the spatial resolution, the number of macroblocks in the input and output frames can differ; the bottom architecture can be further improved to consider this difference and achieve a better tradeoff in complexity and quality [23]. Note that the DCT-domain methods discussed in Section 4 can be used for further improvements.
Images and video frames coded with intraframe methods are represented by sets of block DCT coefficients. When using intraframe DCT coding, the original video frame is divided into 8x8 blocks, each of which is independently transformed with an 8x8 DCT. This imposes an artificial block structure that complicates a number of spatial processing operations, such as translation, downscaling, and filtering, that were considered straightforward in the pixel domain.
For spatial downsampling or resolution reduction on an intra-coded frame, one 8x8 DCT block of the downscaled image is determined from multiple 8x8 DCT blocks of the original image. Efficient downsampling algorithms can be derived in the DCT domain. Based on the distributive property of the DCT discussed in Subsection 4.1, DCT-domain downsampling can be achieved by matrix multiplication. Merhav and Bhaskaran [28] have developed an efficient matrix multiplication scheme for downscaling DCT blocks. Natarajan and Bhaskaran [18] also used approximated DCT matrices to achieve the same goal. The approximated DCT matrices contain only elements of value 0, 1, or a power of 1/2. Effectively, the matrix multiplication can be achieved by integer shifts and additions, leading to a multiplication-free implementation.
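The matrix-multiplication idea can be illustrated with a minimal one-dimensional sketch. It is not the algorithm of [28] or [18]; it only assumes an orthonormal 8-point DCT and a simple pair-averaging filter, and the names `M1`, `M2`, and `downsample_dct` are illustrative. Two adjacent DCT blocks are combined into one DCT block of the half-resolution signal using two precomputed matrices, without returning to the pixel domain.

```python
import math

N = 8  # block length (1-D analogue of the 8x8 block DCT)

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix T, so that X = T x and x = T^t X."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos((2 * i + 1) * k * math.pi / (2 * n))
             for i in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def transpose(a):
    return [list(row) for row in zip(*a)]

T = dct_matrix()
Tt = transpose(T)

# Assumed pixel-domain filter: average neighbouring pairs.
# Q1 maps input block 1 to output samples 0-3, Q2 maps block 2 to samples 4-7.
Q1 = [[0.5 if r < 4 and c in (2 * r, 2 * r + 1) else 0.0
       for c in range(N)] for r in range(N)]
Q2 = [[0.5 if r >= 4 and c in (2 * (r - 4), 2 * (r - 4) + 1) else 0.0
       for c in range(N)] for r in range(N)]

# Precomputed DCT-domain operators: T Q T^t
M1 = matmul(matmul(T, Q1), Tt)
M2 = matmul(matmul(T, Q2), Tt)

def downsample_dct(X1, X2):
    """DCT of the 2:1 downsampled signal, computed from two input DCT blocks."""
    a, b = matvec(M1, X1), matvec(M2, X2)
    return [p + q for p, q in zip(a, b)]
```

Because the DCT is linear and orthonormal, `downsample_dct` gives the same coefficients as decoding both blocks, averaging pairs of samples, and re-transforming. The published methods gain their speed from the structure or approximation of matrices like `M1` and `M2`, which this sketch does not exploit.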
Efficient algorithms have also been developed for filtering images in the DCT domain. For example, [29] proposes a method to apply two-dimensional symmetric, separable filters to DCT-coded images.
Video frames coded with interframe coding techniques are represented with motion vectors and residual DCT coefficients. These frames are coded based on a prediction from one or more previously coded frames; thus, properly decoding one frame requires first decoding one or more other frames. This temporal dependence among frames severely complicates a number of spatial and temporal processing techniques such as translation, downscaling, and splicing.
To facilitate efficient transcoding in the compressed domain, one wants to reuse as much information as possible from the original video bitstream. The motion vector information of the transcoded video can be derived using the motion vector processing method introduced in Subsection 4.2. The computation of the residual DCT data can follow the guidelines provided in Subsection 4.3. Specifically, an interframe representation can be transcoded to an intraframe representation in the DCT domain. Subsequently, the DCT-domain residual data can be obtained based on the derived motion vector information.
Downscaling, or reducing the spatial resolution, of compressed video streams is an operation that benefits from the compressed-domain methods described in Section 4 and the compressed-domain transcoding architectures presented in Subsection 5.1.1. A block diagram of the compressed-domain downscaling algorithm is shown in Figure 40.5. The input bitstream is partially decoded into its motion vector and DCT domain representation. The motion vectors are resampled with the MV resampling methods described in Subsection 4.2. The DCT coefficients are processed with the DCT-domain processing techniques described in Subsections 4.1 and 4.3. A number of coding parameters from the input bitstream are extracted and used in the MV resampling and partial encoding steps of the transcoder. Rate control techniques, like those described in Section 4.4, are used to adapt the bitrate of the output stream. This is discussed in more detail below.
Figure 40.5: Compressed-domain downscaling algorithm.
The compressed-domain downscaling operation is complicated by the prediction dependencies used between frames during compression. Specifically, there are two tracks of dependencies in such a transcoding session. The first dependency is among frames in the original input video stream, while the second is among frames in the output downsampled video stream. The motion vectors for the downsampled version can be estimated based on the motion vectors in the original video. However, even when the motion information in the original video is reused, it is necessary to reconstruct the reference frames to avoid drift error due to imperfect motion vector estimation. As described in Subsection 4.3, the reconstruction may be performed using a DCT domain motion compensation method.
The selection of the coding type for each macroblock in the interframes is also an important issue. In the downsampling-by-two case, there may be four macroblocks, each with a different coding type, involved in the creation of each output macroblock; the transcoder may choose the dominant coding type as the coding type for the output macroblock. In addition, rate control must be used to control the bitrate of the transcoding result.
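A minimal sketch of the dominant-coding-type rule is given below. The type labels and the tie-breaking order are assumptions made for illustration; the text does not specify how ties are resolved.

```python
from collections import Counter

def dominant_coding_type(types, tie_break=("intra", "inter", "skip")):
    """Pick the most frequent coding type among the contributing macroblocks.

    `types` holds the coding types of the (up to four) input macroblocks.
    `tie_break` is an assumed preference order used only when counts are equal.
    """
    counts = Counter(types)
    top = max(counts.values())
    winners = [t for t in counts if counts[t] == top]
    for t in tie_break:
        if t in winners:
            return t
    return winners[0]
```

For example, `dominant_coding_type(["inter", "inter", "intra", "skip"])` selects `"inter"`, while a two-against-two tie between `"inter"` and `"intra"` falls back to the assumed preference for `"intra"`.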
This section focuses on the problem of transcoding a field-coded compressed bitstream to a lower-rate, lower-resolution frame-coded compressed bitstream [26]. For example, conversions from interlaced MPEG-2 sequences to progressive MPEG-1, H.261, H.263, or MPEG-4 simple profile streams lie within this space. To simplify discussion, this section focuses on transcoding a given MPEG-2 bitstream to a lower-rate H.263 or MPEG-4 simple profile bitstream [26][30][31]. This is a practically important transcoding problem for converting MPEG-2 coded DVD and Digital TV video, which is often interlaced, to H.263 or MPEG-4 video for streaming over the Internet or over wireless links (e.g. 3G cellular) to PCs, PDAs, and cell phones that usually have progressive displays. For brevity, we refer to the output format as H.263; however, it can be H.261, H.263, MPEG-1, or MPEG-4.
The conventional approach to the problem is as follows. An MPEG bitstream is first decoded into its decompressed interlaced video frames. These high-resolution interlaced video frames are then downsampled to form a progressive video sequence with a lower spatial resolution and frame rate. This sequence is then re-encoded into a lower-rate H.263 bitstream. This conventional approach to transcoding is inefficient in its use of computational and memory resources. It is desirable to have computation- and memory-efficient algorithms that achieve MPEG-2 to H.263 transcoding with minimal loss in picture quality.
A number of issues arise when designing MPEG-2 to H.263 transcoding algorithms. While both standards are based on block motion compensation and the block DCT, there are many differences that must be addressed. A few of these differences are listed below:
Interlaced vs. progressive video format: MPEG-2 allows interlaced video formats for applications including digital television and DVD. H.263 only supports progressive formats.
Number of I frames: MPEG uses more frequent I frames to enable random access into compressed bitstreams. H.263 uses fewer I frames to achieve better compression.
Frame coding types: MPEG allows pictures to be coded as I, P, or B frames. H.263 has some modes that allow pictures to be coded as I, P, or B frames; but has other modes that only allow pictures to be coded as I, P, or optionally PB frames. Traditional I, P, B frame coding allows any number of B frames to be included between a pair of I or P frames, while H.263 I, P, PB frame coding allows at most one.
Prediction modes: In support of interlaced video formats, MPEG-2 allows field-based prediction, frame-based prediction, and 16x8 field-based prediction. H.263 only supports frame-based prediction but optionally allows an advanced prediction mode in which four motion vectors are allowed per macroblock.
Motion vector restrictions: MPEG motion vectors must point inside the picture, while H.263 has an unrestricted motion vector mode that allows motion vectors to point outside the picture. The benefits of this mode can be significant, especially for lower-resolution sequences where the boundary macroblocks account for a larger percentage of the video.
A block diagram of the MPEG-2 to H.263 transcoder [26][30] is shown in Figure 40.6. The transcoder accepts an MPEG IPB bitstream as input. The bitstream is scanned for picture start codes and the picture headers are examined to determine the frame type. The bits corresponding to B frames are discarded, while the remaining bits are passed on to the MPEG IP decoder. The decoded frames are downsampled to the appropriate spatial resolution and then passed to the modified H.263 IP encoder.
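The bitstream-level B-frame discard step can be sketched as follows. This is a simplified illustration rather than the transcoder of [26][30]: it assumes a raw MPEG video elementary stream held in memory, uses the standard `00 00 01` start-code prefix and the 3-bit `picture_coding_type` field that follows the 10-bit `temporal_reference`, and does not rewrite header fields such as `temporal_reference` or `vbv_delay`. The function names are illustrative.

```python
PICTURE, SEQ_HEADER, SEQ_END, GOP = 0x00, 0xB3, 0xB7, 0xB8
B_TYPE = 3  # picture_coding_type: 1 = I, 2 = P, 3 = B

def split_start_codes(data: bytes):
    """Yield (start_code_value, unit_bytes) for each 00 00 01 xx unit."""
    positions = []
    i = data.find(b"\x00\x00\x01")
    while i != -1 and i + 3 < len(data):
        positions.append(i)
        i = data.find(b"\x00\x00\x01", i + 3)
    for k, pos in enumerate(positions):
        end = positions[k + 1] if k + 1 < len(positions) else len(data)
        yield data[pos + 3], data[pos:end]

def drop_b_pictures(data: bytes) -> bytes:
    """Copy the stream, skipping every B picture and the units that follow it."""
    out = bytearray()
    dropping = False
    for code, unit in split_start_codes(data):
        if code == PICTURE:
            # Byte 4 holds temporal_reference[9:2]; byte 5 holds the two
            # remaining temporal_reference bits, then picture_coding_type.
            coding_type = (unit[5] >> 3) & 0x7
            dropping = coding_type == B_TYPE
        elif code in (SEQ_HEADER, SEQ_END, GOP):
            dropping = False  # these units never belong to a dropped picture
        if not dropping:
            out += unit
    return bytes(out)
```

Only start codes and a single header byte are inspected, so no variable-length decoding is needed to remove the B-frame data. A practical implementation would also have to deal with system-layer packetization and timing information, which this sketch ignores.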
Figure 40.6: MPEG-2 to H.263 transcoder block diagram.
This encoder differs from a conventional H.263 encoder in that it does not perform conventional motion estimation; rather, it uses motion vectors and coding modes computed from the MPEG motion vectors and coding modes and the decoded, downsampled frames. There are a number of ways that this motion vector resampling can be done [4][5]. The class IV partial search method described in Subsection 4.2 was chosen. Specifically, the MPEG motion vectors and coding modes are used to form one or more initial estimates for each H.263 motion vector. A set of candidate motion vectors is generated; this set may include each initial estimate and its neighbouring vectors, where the size of the neighbourhood can vary depending on the available computational resources. The set of candidate motion vectors is tested on the decoded, downsampled frames and the best vector is chosen based on a criterion such as residual energy. A half-pixel refinement may be performed and the final mode decision (inter or intra) is then made.
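The candidate-generation and selection steps described above can be sketched as follows. This is an illustrative simplification, not the class IV method of Subsection 4.2: the initial estimate is a plain scaled average of the input vectors, the matching criterion is the sum of absolute differences (one possible residual-energy measure), frames are lists of integer rows, and half-pixel refinement and the inter/intra decision are omitted. All function names are hypothetical.

```python
def initial_estimate(mvs, scale=2):
    """Average the input vectors and scale them to the output resolution."""
    n = len(mvs)
    return (round(sum(v[0] for v in mvs) / (n * scale)),
            round(sum(v[1] for v in mvs) / (n * scale)))

def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences for the block at (bx, by) displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def partial_search(cur, ref, bx, by, estimates, size=4, radius=1):
    """Test each estimate and its neighbours; return (best_cost, best_vector)."""
    h, w = len(ref), len(ref[0])
    best = None
    for ex, ey in estimates:
        for dy in range(ey - radius, ey + radius + 1):
            for dx in range(ex - radius, ex + radius + 1):
                # Skip candidates that point outside the reference frame.
                if bx + dx < 0 or by + dy < 0:
                    continue
                if bx + dx + size > w or by + dy + size > h:
                    continue
                cost = sad(cur, ref, bx, by, dx, dy, size)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    return best
```

The `radius` parameter plays the role of the neighbourhood size mentioned in the text: a larger value tests more candidates per estimate and costs more computation, while `radius=0` simply reuses the initial estimates.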
Many degrees of freedom exist when designing an MPEG-2 to H.263 transcoder. For instance, a designer can make different choices in the mapping of input and output frame types; and the designer can choose how to vary the temporal frame rate and spatial resolution. Each of these decisions has a different impact on the computational and memory requirements and performance of the final algorithm. This section presents a very simple algorithm that makes design choices that naturally match the characteristics of the input and output bitstreams.
The target format of the transcoder can be chosen based on the format of the input source bitstream. A careful choice of source and target formats can greatly reduce the computational and memory requirements of the transcoding operation.
Spatial and temporal resolutions: The chosen correspondence between the input and output coded video frames is shown in Figure 40.7. The horizontal and vertical spatial resolutions are reduced by factors of two because the MPEG-2 interlaced field format provides a natural factor of two reduction in the vertical spatial resolution. Thus, the spatial downsampling is performed by simply extracting the top field of the MPEG-2 interlaced video frame and horizontally downsampling it by a factor of two. This simple spatial downsampling method allows the algorithm to avoid the difficulties associated with interlaced to progressive conversions. The temporal resolution is reduced by a factor of three, because MPEG-2 picture start codes, picture headers, and prediction rules make it possible to efficiently discard B-frame data from the bitstream without impacting the remaining I and P frames. Note that even though only the top fields of the MPEG I and P frames are used in the H.263 encoder, both the top and bottom fields must be decoded because of the prediction dependencies that result from the MPEG-2 interlaced field coding modes.
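The spatial downsampling step can be illustrated with a short sketch. It assumes a decoded frame stored as a list of pixel rows in which the top field occupies the even-numbered rows (starting at row 0), and it uses pair averaging as the horizontal filter; the text does not specify the horizontal filter, so that choice is an assumption.

```python
def top_field_half_width(frame):
    """Keep the top field and halve the width by averaging pixel pairs."""
    out = []
    for row in frame[0::2]:  # even rows = top field (assumed layout)
        out.append([(row[x] + row[x + 1] + 1) // 2  # rounded average
                    for x in range(0, len(row) - 1, 2)])
    return out
```

Applied to a 4x4 frame, the function returns a 2x2 image built only from rows 0 and 2, which corresponds to the factor-of-two reduction in each dimension described above.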
Figure 40.7: Video formats for MPEG-2 to H.263 transcoding.
Frame coding types: MPEG-2 allows I, P, and B frames while H.263 allows I and P frames and optionally PB frames. With sufficient memory and computational capabilities, an algorithm can be designed to transcode from any input MPEG coding pattern to any output H.263 coding pattern as in [31]. Alternatively, one may take the simpler approach of determining the coding pattern of the target H.263 bitstream based on the coding pattern of the source MPEG-2 bitstream. By aligning the coding patterns of the input and output bitstreams and allowing temporal downsampling, a significant improvement in computational efficiency can be achieved.
Specifically, a natural alignment between the two standards can be obtained by dropping the MPEG B frames and converting the remaining MPEG I and P frames to H.263 I and P frames, thus exploiting the similar roles of P frames in the two standards and the ease with which B frame data can be discarded from an MPEG-2 bitstream without affecting the remaining I and P frames. Since MPEG-2 sequences typically use an IBBPBBPBB structure, dropping the B frames results in a factor of three reduction in frame rate. While H.263 allows an advanced coding mode of PB pictures, it is not used in this algorithm because it does not align well with MPEG's IBBPBBPBB structure.
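The frame-type alignment amounts to a simple filtering of the coding pattern, as the following sketch shows. The function name is illustrative and the pattern is given in display order.

```python
def map_frame_types(mpeg_types):
    """Drop B frames; the remaining I and P frames keep their types in H.263."""
    return [t for t in mpeg_types if t in ("I", "P")]

pattern = "IBBPBBPBB"
h263_types = map_frame_types(pattern)
rate_reduction = len(pattern) / len(h263_types)
```

For the typical IBBPBBPBB pattern this keeps three of every nine frames, which is the factor of three reduction in frame rate noted above.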
The problem that remains is to convert the MPEG-coded interlaced I and P frames to the spatially downsampled H.263-coded progressive I and P frames. The problem of frame conversions can be thought of as manipulating prediction dependencies in the compressed data; this topic was addressed in [22] and in Subsection 4.5 for MPEG progressive frame conversions. This MPEG-2 to H.263 transcoding algorithm requires three types of frame conversions: (1) MPEG I field to H.263 I frame, (2) MPEG I field to H.263 P frame, and (3) MPEG P field to H.263 P frame. The first is straightforward. The latter two require the transcoder to efficiently calculate the H.263 motion vectors and coding modes from those given in the MPEG-2 bitstream. When using the partial search method described in Subsection 4.2, the first step is to create one or more initial estimates of each H.263 motion vector from the MPEG-2 motion vectors. In the following two sections, we discuss the methods used to accomplish this for MPEG I field to H.263 P frame conversions and for MPEG P field to H.263 P frame conversions. Further details of the MPEG-2 to H.263 transcoder, including the progressive to interlace frame conversions, are given in [26][30]. These conversions address the differences between the MPEG-2 and H.263 standards described at the beginning of the section, and exploit the information in the input video stream to greatly reduce the computational and memory requirements of the transcoder with little loss in video quality.
This section describes a series of compressed-domain editing applications. It begins with temporal mode conversion, which can be used to transcode an MPEG sequence into a format that facilitates video editing operations. It then describes two frame-level processing operations, frame-accurate splicing and frame-by-frame reverse play. All these operations use the frame conversion methods described in Subsection 4.5 to manipulate the prediction dependencies of compressed frames [22].
The ability to transcode between arbitrary temporal modes adds a great deal of flexibility and power to compressed-domain video processing. In addition, it provides a method of trading off parameters to achieve various rate/robustness profiles. For example, an MPEG sequence consisting of all I frames, while least efficient from a compression viewpoint, is most robust to channel impairments in a video communication system. In addition, the all I-frame MPEG video stream best facilitates many video-editing operations such as splicing, downscaling, and reverse play. Finally, once an I-frame representation is available, the intraframe transcoding algorithms described in Subsection 5.1 can be applied to each frame of the sequence to achieve the same effect on the entire sequence.
Figure 40.8: Splicing operation.
In general, temporal mode conversions can be performed with the frame conversion method described in Subsection 4.5. For frames that need to be converted to different prediction modes, macroblock and block level processing can be used to convert the appropriate macroblocks between different types.
The following steps describe a DCT-domain approach to transcoding an MPEG video stream containing I, P, and B frames into an MPEG video stream containing only I frames. This processing must be performed for the appropriate macroblocks of the converted frames.
Calculate the DCT coefficients of the motion-compensated prediction. This can be calculated from the intraframe coefficients of the previously coded frames by using the compressed-domain inverse motion compensation routine described in Subsection 4.3.
Form the intraframe DCT representation of each frame. This step simply involves adding the predicted DCT coefficients to the residual DCT coefficients.
Requantize the intraframe DCT coefficients. This step must be performed to ensure that the buffer constraints of the new stream are satisfied. Requantization may be used to control the rate of the new stream.
Reorder the coded data and update the relevant header information. If B-frames are used, the coding order of the IPB MPEG stream will differ from the coding order of the I-only MPEG stream. Thus, the coded data for each frame must be shuffled appropriately. In addition, the appropriate parameters of the header data must be updated.
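The last three steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: DCT blocks are flat lists of coefficients, requantization uses a single uniform step size rather than MPEG quantization matrices, frames are labelled by strings whose first character is the frame type, and the inverse motion compensation of the first step is taken as already done. The function names are hypothetical.

```python
def intra_dct(pred_dct, resid_dct):
    """Intraframe coefficients = predicted coefficients + residual coefficients."""
    return [p + r for p, r in zip(pred_dct, resid_dct)]

def requantize(coeffs, qstep):
    """Uniform requantization (assumed); a larger qstep lowers the bitrate."""
    return [int(round(c / qstep)) for c in coeffs]

def display_order(coded):
    """Convert IPB coding order to display order for an I-only stream.

    An anchor (I or P) is coded before the B frames that precede it in
    display order, so each anchor is held until the next anchor arrives.
    """
    out, held = [], None
    for frame in coded:
        if frame[0] in "IP":
            if held is not None:
                out.append(held)
            held = frame
        else:
            out.append(frame)
    if held is not None:
        out.append(held)
    return out
```

For example, the coding order `I0 P3 B1 B2 P6 B4 B5` is reshuffled to `I0 B1 B2 P3 B4 B5 P6`, which is the order in which the converted I frames would be written to the output stream.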
The goal of the splicing operation is to form a video data stream that contains the first Nhead frames of one video sequence and the last Ntail frames of another video sequence. For uncoded video, the solution is obvious: simply discard the unused frames and concatenate the remaining data. Two properties make this solution obvious: (1) the data needed to represent each frame is self-contained, i.e. it is independent of the data from other frames; and (2) the uncoded video data has the desirable property of original ordering, i.e. the order of the video data corresponds to the display order of the video frames. MPEG-coded video data does not necessarily retain these properties of temporal independence or original ordering (although it can be forced to do so at the expense of compression efficiency). This complicates the task of splicing two MPEG-coded data streams.
This section describes a flexible algorithm that splices two streams directly in the compressed domain [3]. The algorithm allows a natural tradeoff between computational complexity and compression efficiency, so it can be tailored to the requirements of a particular system. The algorithm has several useful attributes. A minimal number of frames are decoded and processed, thus leading to low computational requirements while preserving compression efficiency. In addition, the head and tail data streams can be processed separately. Finally, if desired, the processing can be performed so that the final spliced data stream is a simple concatenation of the two streams and so that the order of the coded video data remains intact.
The conventional splicing solution is to completely decompress the video, splice the decoded video frames, and recompress the result. With this method, every frame in the spliced video sequence must be recompressed. This method has a number of disadvantages, including high computational requirements, high memory requirements, and low performance, since each recoding cycle can deteriorate the video data.
An improved compressed-domain splicing algorithm is shown in Figure 40.9. The computational requirements are reduced by only processing the frames affected by the splice, and by only decoding the frames needed for that processing. This is also shown in Figure 40.9. Specifically, the only frames that need to be recoded are within the GOPs affected by the head and tail cut points; at most, there will be one such GOP in the head data stream and one in the tail data stream. Furthermore, the only additional frames that need to be decoded are the I and P frames in the two GOPs affected by the splice.
Figure 40.9: Compressed-domain splicing and processed bitstreams.
The algorithm results in an MPEG-compliant data stream with variable-sized GOPs. This exploits the fact that the GOP header does not specify the number of frames in the GOP or its structure; rather these are fully specified by the order of the data in the coded data stream.
Each step of the splicing operation is described below. Further discussion is included in [3].
Process the head data stream. This step involves removing any backward prediction dependencies on frames not included in the splice. The simplest case occurs when the cut for the head data occurs immediately after an I or P frame. When this occurs, there are no prediction dependencies on cut frames and all the relevant video data is contained in one contiguous portion of the data stream. The irrelevant portion of the data stream can simply be discarded, and the remaining relevant portion does not need to be processed. When the cut occurs immediately after a B frame, some extra processing is required because one or more B-frame predictions will be based on an anchor frame that is not included in the final spliced video sequence. In this case, the leading portion of the data stream is extracted up to the last I or P frame included in the splice, then the remaining B frames should be converted to Bfor frames or P frames.
Form the tail data stream. This step involves removing any forward prediction dependencies on frames not included in the splice. The simplest case occurs when the cut occurs immediately before an I frame. When this occurs, the video data preceding this frame may be discarded and the remaining portion does not need to be processed. When the cut occurs before a P frame, the P frame must be converted to an I frame and the remaining data remains intact. When the cut occurs before a B frame, extra processing is required because one of the anchor frames is not included in the spliced sequence. In this case, if the first non-B frame is a P frame, it must be converted to an I frame. Then, each of the first consecutive B frames must be converted to Bback frames.
Match and merge the head and tail data streams. The IPB structure and the buffer parameters of the head and tail data streams determine the complexity of the matching operation. This step requires concatenating the two streams and then processing the frames near the splice point to ensure that the buffer constraints are satisfied. This requires matching the buffer parameters of the pictures surrounding the splice point. In the simplest case, a simple requantization will suffice. However, in more difficult cases, a frame conversion will also be required to prevent decoder buffer underflow. Furthermore, since prediction dependencies are inferred from the coding order of the compressed stream, when the merging step is performed the coded frames must be interleaved appropriately. The correct ordering will depend on the particular frame conversions used to remove the dependencies on cut frames.
The first two steps may require converting frames between the I, P, and B prediction modes. Converting P or B frames to I frames is quite straightforward, as are B-to-Bfor and B-to-Bback conversions; however, conversion between any other set of prediction modes can require more computation because new motion vectors must be computed. Exact algorithms involve performing motion estimation on the decoded video; this process can dominate the computational requirements of the algorithm. Approximate algorithms such as motion vector resampling can significantly reduce the computations required for these conversions.
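The frame-type decisions made when processing the head and tail streams can be sketched as follows. The sketch works on frame types listed in display order and only decides which conversions are needed; it does not perform them, and it always chooses the Bfor option for the head stream even though the text also allows conversion to P frames. The function names are illustrative.

```python
def process_head(types, n_head):
    """Frame types for the first n_head frames (display order) after the cut.

    Trailing B frames lose their backward anchor, so they become Bfor frames.
    """
    kept = list(types[:n_head])
    i = len(kept) - 1
    while i >= 0 and kept[i] == "B":
        kept[i] = "Bfor"
        i -= 1
    return kept

def process_tail(types, n_tail):
    """Frame types for the last n_tail frames (display order) after the cut.

    Leading B frames lose their forward anchor, so they become Bback frames;
    if the first anchor is a P frame it must be converted to an I frame.
    """
    kept = list(types[len(types) - n_tail:])
    i = 0
    while i < len(kept) and kept[i] == "B":
        kept[i] = "Bback"
        i += 1
    if i < len(kept) and kept[i] == "P":
        kept[i] = "I"
    return kept
```

With the display-order pattern `IBBPBBP`, cutting the head after five frames converts only the final B frame, while cutting after four frames (immediately after a P frame) requires no conversion at all.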
Results of a spliced video sequence are shown in Figure 40.10. The right side of the figure plots the frame quality (in peak signal-to-noise ratio) for original compressed football and cheerleader sequences, and the spliced result when splicing between the two sequences every twenty frames. In the spliced result, the solid line contains the original quality values from the corresponding frames in the original coded football and cheerleader sequences, while the dotted line represents the quality of the sequence resulting from the compressed-domain splicing operation. Note that the spliced sequence has a slight degradation in quality at the splice points. This slight loss in quality is due to the removal of prediction dependencies in the compressed video in conjunction with the rate matching needed to satisfy buffer requirements. However, note that it returns to full quality a few frames after the splice point (within one GOP). The plots on the left show the buffer occupancy for the original input sequences and the output spliced sequence. In the bottom plot, the bottom line shows the buffer usage if the rate matching operation is not performed; this results in an eventual decoder buffer underflow. The top line shows the result of the compressed-domain splicing algorithm with appropriate rate matching. In this case, the buffer occupancy levels stay consistent with the original streams except in small areas surrounding the splice points. However, as we saw in the quality plots, the quality and buffer occupancy levels match those of the input sequences within a few frames.
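The buffer behaviour discussed above can be explored with a toy decoder-buffer model. It is a simplified sketch and not the MPEG video buffering verifier: bits arrive at a constant rate, each picture is removed instantaneously at its decode time, and all parameter names and values are assumptions chosen for illustration.

```python
def buffer_trace(frame_bits, bitrate, frame_rate, buffer_size, initial):
    """Return the buffer occupancy just after each picture is removed."""
    per_frame = bitrate / frame_rate  # bits arriving per frame interval
    occupancy = initial
    trace = []
    for bits in frame_bits:
        occupancy -= bits  # decoder removes the coded picture
        trace.append(occupancy)
        # Channel refills the buffer until the next decode time.
        occupancy = min(occupancy + per_frame, buffer_size)
    return trace

def underflows(trace):
    """Indices of pictures whose data had not fully arrived when needed."""
    return [i for i, level in enumerate(trace) if level < 0]
```

A negative value in the trace corresponds to the decoder buffer underflow mentioned in the text; rate matching by requantization reduces the size of the pictures near the splice point until `underflows` returns an empty list.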
Figure 40.10: Performance of compressed-domain splicing algorithm.
The goal of the compressed-domain reverse-play operation is to create a new MPEG data stream that, when decoded, displays the video frames in the reverse order from the original MPEG data stream. For uncoded video the solution is simple: reorder the video frame data in reverse order. The simplicity of this solution relies on two properties: the data for each video frame is self-contained and it is independent of its placement in the data stream. These properties typically do not hold true for MPEG-coded video data.
Compressed-domain reverse-play is difficult because MPEG compression is not invariant to changes in frame order, e.g. reversing the order of the input frames will not simply reverse the order of the output MPEG stream. Furthermore, reversing the order of the input video frames does not result in a "reversed" motion vector field. However, if the processing is performed carefully, much of the motion vector information contained in the original MPEG video stream can be reused to save a significant amount of computation.
This section describes a reverse-play transcoding algorithm that operates directly on the compressed-domain data [32][33]. This algorithm is simple and achieves high performance with low computational and memory requirements. This algorithm only decodes the following data from the original MPEG data stream: I frames must be partially decompressed into their DCT representation and P frames must be partially decompressed to their MV/DCT representation, while for B frames only the forward and backward motion vector fields need to be decoded, i.e. only bitstream processing is needed.
Figure 40.11: Reverse play operation.
The development of the compressed-domain reverse-play algorithm is shown in Figure 40.12. In the conventional approach shown at the top of the figure, each GOP in the MPEG stream, starting from the end of the sequence, is completely decoded into uncompressed frames and stored in a frame buffer. The uncompressed frames are reordered, and the resulting frames are re-encoded into an output MPEG stream that contains the original frames in reverse order.
Figure 40.12: Architectures for compressed-domain reverse play.
The middle figure shows an improved approach to the algorithm. This improvement results from exploiting the symmetry of B frames. Specifically, it uses the fact that the coding of the reverse-ordered sequence can be performed so that the same frames are coded as B frames and thus will have the same surrounding anchor frames. The one difference is that the forward and backward anchors are reversed. In this case, major computational savings can be achieved by performing simplified processing on the B frames. Specifically, for B frames only a bitstream-level decoding is used to efficiently decode the motion vectors and coding modes, swap them between forward and backward modes, and repackage the results. This greatly reduces the computational requirements because 2/3 of the frames are B frames (when two B frames are coded between each pair of anchor frames) and because typically the processing required for B frames is greater than that required for P frames, which in turn is much greater than that required for I frames. Also, note that the frame buffer requirements are reduced by a factor of three because the B frames are not decoded.
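The B-frame symmetry can be sketched in a few lines of Python. The `BMacroblock` record and the mode names below are hypothetical stand-ins for whatever a bitstream parser would expose; they are not part of any MPEG library. The sketch assumes that each macroblock carries a prediction mode and up to two motion vectors, and that residual data is passed through untouched. Note that the vectors themselves are not negated: each vector still points from the B-frame macroblock to the same region of the same anchor frame, and only the role of that anchor (past or future) changes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MV = Tuple[int, int]  # (dx, dy) displacement, units as stored in the stream

# Hypothetical mode labels; "intra" macroblocks carry no motion vectors.
_SWAPPED_MODE = {
    "forward": "backward",
    "backward": "forward",
    "bidirectional": "bidirectional",
    "intra": "intra",
}


@dataclass(frozen=True)
class BMacroblock:
    mode: str
    fwd_mv: Optional[MV] = None  # vector into the past anchor
    bwd_mv: Optional[MV] = None  # vector into the future anchor


def swap_prediction_direction(mb: BMacroblock) -> BMacroblock:
    """Relabel a B-frame macroblock for the reverse-ordered stream."""
    return BMacroblock(mode=_SWAPPED_MODE[mb.mode],
                       fwd_mv=mb.bwd_mv,
                       bwd_mv=mb.fwd_mv)


def reverse_b_frame(macroblocks):
    """Apply the swap to every macroblock of one B frame."""
    return [swap_prediction_direction(mb) for mb in macroblocks]
```

Because the operation is a pure relabeling, applying it twice returns the original macroblock, which makes for a convenient sanity check.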
The bottom figure shows a further improvement obtained by applying motion vector resampling, as described in Subsection 4.2, to the I and P frames. In this architecture, the motion vectors given in the input bitstream are used to compute the motion vectors for the output bitstream, thereby avoiding the computationally expensive motion estimation step in the re-encoding process. The computational and performance tradeoffs of these architectures are discussed in detail in [5].
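To make the idea of reusing input motion vectors concrete, the following Python sketch reverses a motion vector field with a simple maximum-overlap rule. This is an illustrative assumption, not necessarily the resampling method of Subsection 4.2: each forward-predicted block is projected onto the reference frame, and every grid-aligned block in the reference frame adopts the negated vector of the projected block that covers it most. The function name `reverse_mv_field`, the integer-pel vector units, and the list-of-lists layout are all hypothetical choices made for the example.

```python
def reverse_mv_field(mvs, block=16):
    """Reverse a forward motion vector field by maximum overlap.

    mvs[r][c] is the (dx, dy) vector, in integer pixels, of the block at
    grid position (r, c) in the predicted frame, or None for intra blocks.
    Returns a field of the same shape for the former reference frame, with
    None where no projected block overlaps a grid block.
    """
    rows, cols = len(mvs), len(mvs[0])
    best_area = [[0] * cols for _ in range(rows)]
    reversed_mvs = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mvs[r][c] is None:
                continue
            dx, dy = mvs[r][c]
            x0, y0 = c * block + dx, r * block + dy  # projected top-left corner
            # A projected block touches at most two grid rows and two columns.
            for rr in {y0 // block, (y0 + block - 1) // block}:
                for cc in {x0 // block, (x0 + block - 1) // block}:
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue
                    w = min(x0 + block, (cc + 1) * block) - max(x0, cc * block)
                    h = min(y0 + block, (rr + 1) * block) - max(y0, rr * block)
                    if w * h > best_area[rr][cc]:
                        best_area[rr][cc] = w * h
                        # The reversed vector points back to the source block.
                        reversed_mvs[rr][cc] = (-dx, -dy)
    return reversed_mvs
```

Blocks left as `None` have no usable input vector; in a real transcoder they would need a fallback such as intra coding or a zero vector, which this sketch leaves open.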
The resulting compressed-domain reverse-play algorithm shown in Figure 40.13 has the following steps:
1. Convert the IP frames to reverse IP frames. While the input motion vectors were originally computed for forward prediction between the I and P frames, the reverse IP frames require output motion vectors that predict in the opposite direction. Motion vector resampling methods described in Subsection 4.2 and in [5] can be used to calculate the new reversed motion vectors. Once the motion vectors are computed, the new output DCT coefficients can be computed directly in the DCT domain by using the compressed-domain inverse motion compensation algorithm described in Subsection 4.3.
2. Exchange the forward and backward motion vector fields used in each B frame. This step exploits the symmetry of the B frame prediction process. In the reversed stream, the B frames will have the same two anchor frames, but in the reverse order. Thus, the forward prediction field can simply be exchanged with the backward prediction field, resulting in significant computational savings. Notice that only the motion vector fields need to be decoded for the B frames.
3. Requantize the DCT coefficients. This step must be performed to ensure that the buffer constraints of the new stream are satisfied. Requantization may be used to control the rate of the new stream.
4. Properly reorder the frame data and update the relevant header information. If no B frames are used, then the reordering process is quite straightforward. However, when B frames are used, care must be taken to properly reorder the data from the original coding order to the appropriate reverse coding order. In addition, the parameters in the header data must be updated appropriately.
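The reordering in the last step can be sketched as follows. The example assumes a single closed group of frames given in display order that begins and ends with an anchor frame (I or P). It further assumes that the last anchor of the input becomes the I frame of the reversed stream and that every other anchor becomes a P frame. The function name `reverse_play_coding_order` is hypothetical. In coding order, each anchor must precede the B frames that are displayed before it, since those B frames depend on it.

```python
def reverse_play_coding_order(display_types):
    """Return [(original_display_index, new_type), ...] in output coding order.

    display_types is a string or list such as "IBBPBBP" giving the frame
    types of the input in display order.
    """
    if not display_types or display_types[0] == "B" or display_types[-1] == "B":
        raise ValueError("sequence must start and end with an anchor frame")

    coding_order = []
    pending_b = []          # B frames waiting for their second anchor
    first_anchor = True
    # Walk the frames in reversed display order.
    for idx in range(len(display_types) - 1, -1, -1):
        if display_types[idx] == "B":
            pending_b.append(idx)
            continue
        new_type = "I" if first_anchor else "P"
        first_anchor = False
        coding_order.append((idx, new_type))
        # Both anchors of the pending B frames have now been emitted.
        coding_order.extend((b, "B") for b in pending_b)
        pending_b = []
    return coding_order


print(reverse_play_coding_order("IBBPBBP"))
# [(6, 'I'), (3, 'P'), (5, 'B'), (4, 'B'), (0, 'P'), (2, 'B'), (1, 'B')]
```

With no B frames the result is simply the anchors in reverse order, which matches the straightforward case noted in the step above. Header fields such as temporal references would still need to be rewritten separately.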
Figure 40.13: MPEG compressed-domain reverse play algorithm.