3. Content Adaptation

Content adaptation can generally be defined as the conversion of a signal from one format to another, where the input and output signals are typically compressed bitstreams. This section focuses mainly on the adaptation, or transcoding, of video signals. In the earliest work, this conversion corresponded to a reduction in bitrate to meet available channel capacity. Additionally, researchers have investigated conversions between constant bitrate (CBR) streams and variable bit-rate (VBR) streams to facilitate more efficient transport of video. To improve the performance, techniques to refine the motion vectors have been proposed.

In this section, we first provide an overview of the techniques used for bitrate reduction and the corresponding architectures that have been proposed. Then, we describe recent advances with regard to spatial and temporal resolution reduction techniques and architectures. This section ends with a summary of the work covered and provides reference to additional adaptation techniques.

3.1 Bitrate Reduction

The objective in bitrate reduction is to reduce the bitrate accurately, while maintaining low complexity and achieving the highest quality possible. Ideally, the reduced-rate bitstream should have the same quality as a bitstream generated directly at the reduced rate. The most straightforward means to achieve this is to decode the video bitstream and fully re-encode the reconstructed signal at the new rate. This is illustrated in Figure 36.6. The best performance can be achieved by calculating new motion vectors and mode decisions for every macroblock at the new rate. However, significant complexity savings can be achieved, while still maintaining acceptable quality, by reusing information contained in the original incoming bitstream and by considering simplified architectures [22,23,24,25,26].

Figure 36.6: Reference transcoding architecture for bit rate reduction.

In the following, we review progress made over the past few years on bit-rate reduction architectures and techniques, where the focus has been centered on two specific aspects, complexity and drift reduction. Drift can be explained as the blurring or smoothing of successively predicted frames. It is caused by the loss of high frequency data, which creates a mismatch between the actual reference frame used for prediction in the encoder, and the degraded reference frame used for prediction in the transcoder and decoder. We will consider two types of systems, a closed-loop and an open-loop system. Each demonstrates the trade-off between complexity and quality.

3.1.1 Transcoding Architectures

In Figure 36.7, the open-loop and closed-loop systems are shown. The architecture in Figure 36.7(a) is an open-loop system, while the architecture in Figure 36.7(b) is a closed-loop system. In the open-loop system, the bitstream is variable-length decoded (VLD) to extract the variable-length codewords corresponding to the quantized DCT coefficients, as well as macroblock data corresponding to the motion vectors and other macroblock-type information. In this scheme, the quantized coefficients are inverse quantized, then simply re-quantized to satisfy the new output bitrate. Finally, the re-quantized coefficients and stored macroblock information are variable-length coded (VLC). An alternative open-loop scheme, which is not illustrated here but is even less complex than the one shown in Figure 36.7(a), is to directly cut high-frequency data from each macroblock [29]. To cut the high-frequency data without actually doing the VLD, a bit-profile for the AC coefficients is maintained. As macroblocks are processed, codewords corresponding to high-frequency coefficients are eliminated as needed so that the target bitrate is met. In terms of complexity, both open-loop schemes are relatively simple since a frame memory is not required and there is no need for an IDCT. In terms of quality, better coding efficiency can be obtained by the re-quantization approach since the variable-length codes that are used for the re-quantized data will be more efficient. However, both architectures are subject to drift.
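The re-quantization step of the open-loop scheme can be sketched as follows. This is only a minimal illustration that assumes a uniform scalar quantizer with round-to-nearest; the names requantize, q_in and q_out are hypothetical, and actual MPEG quantizers additionally use weighting matrices and dead zones.

```python
def requantize(levels, q_in, q_out):
    """Inverse quantize levels with step q_in, then re-quantize with step q_out.

    Assumes a plain uniform quantizer (an assumption made for illustration).
    """
    out = []
    for level in levels:
        coeff = level * q_in                        # inverse quantization
        mag = (abs(coeff) + q_out // 2) // q_out    # round magnitude to nearest level
        out.append(mag if coeff >= 0 else -mag)
    return out


# Small high-frequency levels collapse to zero at the coarser step,
# which is the information loss that later shows up as drift.
print(requantize([10, -3, 1, 0], q_in=4, q_out=10))  # [4, -1, 0, 0]
```

No frame memory or IDCT appears anywhere in this path, which is why the open-loop scheme is so inexpensive.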

Figure 36.7: Simplified transcoding architectures for bitrate reduction. (a) open-loop, partial decoding to DCT coefficients, then re-quantize, (b) closed-loop, drift compensation for re-quantized data.

In general, drift is mainly due to the loss of high-frequency information. Beginning with the I-frame, which is a reference for the next P-frame, high-frequency information is discarded by the transcoder to meet the new target bit-rate. Incoming residual blocks are also subject to this loss. When a decoder receives this transcoded bitstream, it will decode the I-frame with reduced quality and store it in memory. When it is time to decode the next P-frame, the degraded I-frame is used as a predictive component and added to a degraded residual component. Since both components differ from what was originally derived by the encoder, a mismatch between the predictive and residual components is created. As time goes on, this mismatch progressively increases.

The architecture shown in Figure 36.7(b) is a closed-loop system and aims to eliminate the mismatch between predictive and residual components by approximating the cascaded decoder-encoder approach [27]. The main difference in structure between the reference architecture and this simplified scheme is that reconstruction in the reference is performed in the spatial domain, thereby requiring two reconstruction loops with one DCT and two IDCTs. On the other hand, in the simplified structure that is shown, only one reconstruction loop is required with one DCT and one IDCT. In this structure, some arithmetic inaccuracy is introduced due to the non-linear nature in which the reconstruction loops are combined. However, it has been found that the approximation has little effect on the quality [27]. With the exception of this slight inaccuracy, the closed-loop architecture is mathematically equivalent to a cascaded decoder-encoder approach. In [32], additional causes of drift, e.g., due to floating-point inaccuracies, have been further studied. Overall, though, in comparison to the open-loop architectures discussed earlier, drift is eliminated since the mismatch between predictive and residual components is compensated for.
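The difference between the two systems can be illustrated with a deliberately simplified numerical model. The sketch below is not the architecture of Figure 36.7; it assumes a single sample per frame, zero motion, and lossless incoming residuals, and the names quantize, transcode_errors and carry are hypothetical. In the closed-loop case, the re-quantization error of each frame is kept in a one-value "frame memory" and added to the next incoming residual before re-quantization.

```python
def quantize(value, step):
    """Uniform quantizer with round-to-nearest reconstruction."""
    mag = (abs(value) + step // 2) // step * step
    return mag if value >= 0 else -mag


def transcode_errors(frames, step, closed_loop):
    """Return (source - decoded) for each frame after re-quantization."""
    errors = []
    decoded = 0    # decoder's reconstruction of the previous frame
    previous = 0   # encoder's reference (the first frame acts as the I-frame)
    carry = 0      # re-quantization error held in the transcoder's memory
    for value in frames:
        residual = value - previous       # residual carried by the input stream
        if closed_loop:
            residual += carry             # drift compensation
        sent = quantize(residual, step)   # re-quantized residual
        carry = residual - sent
        decoded += sent                   # decoder adds residual to its reference
        errors.append(value - decoded)
        previous = value
    return errors


frames = [100 + 3 * n for n in range(10)]
print(transcode_errors(frames, 8, closed_loop=False))
print(transcode_errors(frames, 8, closed_loop=True))
```

In this toy setting the open-loop error grows with every predicted frame, while the closed-loop error never exceeds half the quantizer step.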

3.1.2 Simulation Results

In Figure 36.8, a frame-based comparison of the quality between the reference, open-loop and closed-loop architectures is shown. The input to the transcoders is the Foreman sequence at CIF resolution coded at 2Mbps with GOP structure N=30 and M=3. The transcoded output is re-encoded with a fixed quantization parameter of 15. To illustrate the effect of drift in this plot, the decoded quality of only the I- and P-frames is shown. It is evident that the open-loop architecture suffers from severe drift, and the quality of the simplified closed-loop architecture is very close to that of the reference architecture.

Figure 36.8: Frame-based comparison of PSNR quality for reference, open-loop and closed-loop architectures for bitrate reduction.
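For reference, the per-frame PSNR used in comparisons of this kind can be computed as in the sketch below. It assumes 8-bit samples (peak value 255) stored as flat sequences; the function name psnr is hypothetical, and the exact measurement setup behind Figure 36.8 is not specified here.

```python
import math


def psnr(reference, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf   # identical frames
    return 10 * math.log10(peak ** 2 / mse)
```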

3.1.3 Motion-Compensation in the DCT Domain

The closed-loop architecture described above provides an effective transcoding structure in which the macroblock reconstruction is performed in the DCT domain. However, since the memory stores spatial-domain pixels, the additional DCT/IDCT is still needed. In an attempt to further simplify the reconstruction process, structures that reconstruct reference frames completely in the compressed domain have been proposed [29,30,27]. All of these architectures are based on the compressed-domain methods for motion compensation proposed by Chang and Messerschmitt [28]. It was found that decoding completely in the compressed domain could yield equivalent quality to spatial-domain decoding [29]. However, this was achieved with floating-point matrix multiplication and proved to be quite costly. In [27], simplification of this computation was achieved by approximating the floating-point elements by power-of-two fractions, so that shift operations could be used, and in [31], simplifications have been achieved through matrix decomposition techniques.

Regardless of which simplification is applied, once the reconstruction has been accomplished in the compressed domain, one can easily re-quantize the drift-free blocks and VLC the quantized data to yield the desired bitstream. In [27], the bit re-allocation has been accomplished using the Lagrangian multiplier method. In this formulation, sets of quantizer steps are found for a group of macroblocks so that the average distortion caused by transcoding error is minimized.
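A generic Lagrangian allocation of this kind can be sketched as follows. This is not the specific formulation of [27]; it assumes that a small table of (quantizer step, rate, distortion) triples is available for each macroblock, and the names allocate, rd_tables and budget are hypothetical. For a given multiplier, each macroblock independently picks the entry minimizing distortion plus multiplier times rate, and the multiplier is found by bisection so that the total rate fits the budget.

```python
def allocate(rd_tables, budget, iterations=50):
    """Choose one quantizer step per macroblock under a total rate budget.

    rd_tables: one list per macroblock of (step, rate, distortion) tuples.
    If even the cheapest choices exceed the budget, those are returned.
    """
    def pick(lam):
        # Independent per-macroblock minimization of D + lam * R.
        return [min(table, key=lambda e: e[2] + lam * e[1]) for table in rd_tables]

    lo, hi = 0.0, 1e6
    for _ in range(iterations):
        lam = (lo + hi) / 2
        if sum(entry[1] for entry in pick(lam)) > budget:
            lo = lam   # over budget: penalize rate more strongly
        else:
            hi = lam   # within budget: try a smaller penalty
    return [entry[0] for entry in pick(hi)]


tables = [
    [(2, 100, 1), (4, 60, 4), (8, 30, 16)],
    [(2, 80, 2), (4, 50, 3), (8, 20, 20)],
]
print(allocate(tables, budget=120))  # [4, 4]
```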

3.1.4 Motion Vector Refinement

In all of the above-mentioned methods, complexity is reduced significantly by assuming that the motion vectors computed at the original bit-rate are simply reused in the reduced-rate bitstream. It has been shown that reusing the motion vectors in this way leads to non-optimal transcoding results due to the mismatch between prediction and residual components [25,32]. To overcome this loss of quality without performing a full motion re-estimation, motion vector refinement schemes have been proposed. Such schemes can be used with any of the above bit-rate reduction architectures for improved quality, as well as with the spatial and temporal resolution reduction architectures described below.

3.1.5 CBR-to-VBR Conversion

While the above architectures have mainly focused on general bit-rate reduction techniques for the purpose of transmitting video over band-limited channels, there has also been study on the conversion between constant bitrate (CBR) streams and variable bitrate (VBR) streams to facilitate more efficient transport of video [33]. In this work, the authors exploit the available channel bandwidth of an ATM network and adapt the CBR encoded source accordingly. This is accomplished by first reducing the bitstream to a VBR stream with a reduced average rate, then segmenting the VBR stream into cells and controlling the cell generation rate by a traffic shaping algorithm.

3.2 Spatial Resolution Reduction

Similar to bitrate reduction, the reference architecture for reduced spatial resolution transcoding refers to cascaded decoding and spatial-domain down-sampling, followed by a full re-encoding. Several papers have addressed this problem of resolution conversion, e.g., [36,25,37,34,38]. The primary focus of the work in [36] was on motion vector scaling techniques. In [25], the problems associated with mapping motion vector and macroblock type data were addressed. The performance of motion vector refinement techniques in the context of resolution conversion was also studied in this work. In [37], the authors propose to use DCT-domain down-scaling and motion compensation for transcoding, while also considering motion vector scaling and coding mode decisions. With the proposed two-loop architecture, computational savings of 40% have been reported with a minimal loss in quality. In [34], a comprehensive study on the transcoding to lower spatio-temporal resolutions and to different encoding formats has been provided based on the reuse of motion parameters. In this work, a full decoding and encoding loop was employed, but with the reuse of information, processing time was improved by a factor of 3. While significant savings can be achieved with the reuse of macroblock data, further simplifications of the reference transcoding architecture were investigated in [35] based on an analysis of drift errors. Several new architectures, including an intra-refresh architecture, were proposed.

In the following, the key points from the above works are reviewed, including the performance of various motion vector scaling algorithms, DCT-domain down-conversion and the mapping of macroblock-type information to the lower resolution. Also, the concepts of the intra-refresh architecture will be discussed.

3.2.1 Motion Vector Mapping

When down-sampling four macroblocks to one macroblock, the associated motion vectors have to be mapped. Several methods suitable for frame-based motion vector mapping have been described in past works [36,34,38]. To map from four frame-based motion vectors, i.e., one for each macroblock in a group, to one motion vector for the newly formed macroblock, a weighted average or a median filter can be applied. This is referred to as a 4:1 mapping. However, with certain compression standards, such as MPEG-4 and H.263, there is support in the syntax for advanced prediction modes that allow one motion vector per 8x8 block. In this case, each motion vector is mapped from a 16x16 macroblock in the original resolution to an 8x8 block in the reduced-resolution macroblock with appropriate scaling by 2. This is referred to as a 1:1 mapping. While 1:1 mapping provides a more accurate representation of the motion, it is sometimes inefficient to use since more bits must be used to code four motion vectors. An optimal mapping would adaptively select the best mapping based on an R-D criterion. A good evaluation of the quality that can be achieved using the different motion vector mapping algorithms can be found in [34].
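The two mappings can be sketched as follows for a 2:1 reduction in each dimension. The sketch assumes equal weights for the average and a component-wise median, which are illustrative choices rather than the specific filters evaluated in the cited works; the function names are hypothetical.

```python
from statistics import median


def map_4_to_1(mvs, method="average"):
    """Map four full-resolution vectors (dx, dy) to one reduced-resolution vector."""
    xs = [mv[0] for mv in mvs]
    ys = [mv[1] for mv in mvs]
    if method == "average":
        dx, dy = sum(xs) / len(xs), sum(ys) / len(ys)
    elif method == "median":
        dx, dy = median(xs), median(ys)   # component-wise median (an assumption)
    else:
        raise ValueError("unknown method: " + method)
    return (dx / 2, dy / 2)               # scale for the 2:1 down-sampling


def map_1_to_1(mvs):
    """Keep one vector per 8x8 block of the new macroblock, scaled by 2."""
    return [(dx / 2, dy / 2) for dx, dy in mvs]


group = [(4, 0), (4, 2), (6, 2), (20, -8)]
print(map_4_to_1(group, "average"))  # (4.25, -0.5)
print(map_4_to_1(group, "median"))   # (2.5, 0.5)
print(map_1_to_1(group))
```

In this example the outlier (20, -8) pulls the average noticeably, while the median stays close to the three consistent vectors.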

Because MPEG-2 supports interlaced video, we also need to consider field-based MV mapping. In [39], the top-field motion vector was simply used. An alternative scheme that averages the top and bottom field motion vectors under certain conditions was proposed in [35]. However, it should be noted that the appropriate motion vector mapping technique is dependent on the down-conversion scheme used. This is particularly important for interlaced data, where the target output may be a progressive frame. Further study on the relation between motion vector mapping and the texture down-conversion is needed.

3.2.2 DCT-Domain Down-Conversion

The most straightforward way to perform down-conversion in the DCT domain is to extract the low-frequency coefficients of each block and recompose the new macroblock using the compositing techniques proposed in [28]. A set of DCT-domain filters can be derived by cascading these two operations. More sophisticated filters that attempt to retain more of the high-frequency information, such as the frequency synthesis filters used in [40] and derived in the references therein, may also be considered. The filters used in this work perform the down-conversion operations on the rows and columns of the macroblock using separable 1D filters. These down-conversion filters can be applied in both the horizontal and vertical directions, and to both frame-DCT and field-DCT blocks. Variations of this filtering approach to convert field-DCT blocks to frame-DCT blocks, and vice versa, have also been derived [29].
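A one-dimensional sketch of this low-frequency extraction is given below. It assumes an orthonormal DCT-II and, for readability, recomposes the output block by passing through the sample domain; because every step is linear, the same result could be obtained with a single precomputed DCT-domain filter matrix, as the text notes. The function names are hypothetical.

```python
import math


def _scale(k, n):
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n))
            for k in range(n)]


def idct(coeffs):
    """Inverse of dct()."""
    n = len(coeffs)
    return [sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]


def downconvert_pair(block_a, block_b):
    """Turn two adjacent 8-point DCT blocks into one 8-point DCT block (2:1)."""
    samples = []
    for coeffs in (block_a, block_b):
        # Keep the 4 low-frequency coefficients; 1/sqrt(2) preserves the mean level.
        low = [c / math.sqrt(2) for c in coeffs[:4]]
        samples.extend(idct(low))     # 4 down-sampled samples per input block
    return dct(samples)               # recompose into a single 8-point block
```

Applying the same operation along rows and then columns gives the separable two-dimensional case.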

3.2.3 Conversion of Macroblock Type

In transcoding video bitstreams to a lower spatial resolution, a group of four macroblocks in the original video corresponds to one macroblock in the transcoded video. To ensure that the down-sampling process will not generate an output macroblock whose sub-blocks have different coding modes, e.g., both inter- and intra-sub-blocks within a single macroblock, the mapping of MB modes to the lower resolution must be considered. Three possible methods to overcome this problem when a so-called mixed-block is encountered are outlined below.

In the first method, ZeroOut, the MB modes of the mixed macroblocks are all modified to inter-mode. The MV's for the intra-macroblocks are reset to zero and so are corresponding DCT coefficients. In this way, the input macroblocks that have been converted are replicated with data from corresponding blocks in the reference frame. The second method, IntraInter, maps all MB's to inter-mode, but the motion vectors for the intra-macroblocks are predicted. The prediction can be based on the data in neighboring blocks, which can include both texture and motion data. As an alternative, we can simply set the motion vector to be zero, depending on which produces less residual. In an encoder, the mean absolute difference of the residual blocks is typically used for mode decision. The same principles can be applied here. Based on the predicted motion vector, a new residual for the modified macroblock must be calculated. In the third method, InterIntra, the MB modes are all modified to intra-mode. In this case, there is no motion information associated with the reduced-resolution macroblock; therefore all associated motion vector data are reset to zero and the intra-DCT coefficients are generated to replace the inter-DCT coefficients.

It should be noted that to implement the IntraInter and InterIntra methods, we need a decoding loop to reconstruct the full-resolution picture. The reconstructed data is used as a reference to convert the DCT coefficients from intra-to-inter, or inter-to-intra. For a sequence of frames with a small amount of motion and a low level of detail, the low-complexity strategy of ZeroOut can be used. Otherwise, either IntraInter or InterIntra should be used. The performance of InterIntra is a little better than IntraInter, because InterIntra can stop drift propagation by transforming inter-blocks to intra-blocks.
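Of the three methods, ZeroOut is the only one that needs no decoding loop, so it is the simplest to sketch. The example below assumes that each macroblock is represented by a small dictionary with hypothetical keys mode, mv and coeffs; this representation is chosen for illustration only.

```python
def is_mixed(group):
    """True if a group of four macroblocks contains both intra and inter modes."""
    return {mb["mode"] for mb in group} == {"intra", "inter"}


def zero_out(group):
    """ZeroOut: convert intra macroblocks of a mixed group to inter mode
    with a zero motion vector and zero DCT coefficients."""
    if not is_mixed(group):
        return list(group)
    result = []
    for mb in group:
        if mb["mode"] == "intra":
            mb = {"mode": "inter", "mv": (0, 0), "coeffs": [0] * len(mb["coeffs"])}
        result.append(mb)
    return result
```

A converted macroblock then simply replicates the co-located data in the reference frame, which is why this strategy is suited only to content with little motion and detail.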

3.2.4 Intra-Refresh Architecture

In reduced resolution transcoding, drift error is caused by many factors, such as requantization, motion vector truncation and down-sampling. Such errors can only propagate through inter-coded blocks. By converting some percentage of inter-coded blocks to intra-coded blocks, drift propagation can be controlled. In the past, the concept of intra-refresh has successfully been applied to error-resilience coding schemes [41], and it has been found that the same principle is also very useful for reducing the drift in a transcoder [35].

The intra-refresh architecture for spatial resolution reduction is illustrated in Figure 36.9. In this scheme, output macroblocks are subject to a DCT-domain down-conversion, requantization and variable-length coding. Output macroblocks are either derived directly from the input bitstream, i.e., after variable-length decoding and inverse quantization, or retrieved from the frame store and subject to a DCT operation. Output blocks that originate from the frame store are independent of other data and are hence coded as intra-blocks; there is no picture drift associated with these blocks.

Figure 36.9: Illustration of intra-refresh architecture for reduced spatial resolution transcoding.

The decision to code an intra-block from the frame store depends on the macroblock coding modes and picture statistics. In a first case based on the coding mode, an output macroblock is converted if the possibility of a mixed-block is detected. In a second case based on picture statistics, the motion vector and residual data are used to detect blocks that are likely to contribute to larger drift error. For this case, picture quality can be maintained by employing an intra-coded block in its place. Of course, the increase in the number of intra-blocks must be compensated for by the rate control. Further details on the rate control can be found in [35], and a complexity-quality evaluation of this architecture compared to the reference method can be found in [43].

3.3 Temporal Resolution Reduction

Reducing the temporal resolution of a video bitstream is a technique that may be used to reduce the bitrate requirements imposed by a channel. However, this technique may also be used to satisfy processing limitations imposed by a terminal. As discussed earlier, motion vectors from the original bitstream are typically reused in bitrate reduction and spatial resolution reduction transcoders to speed up the re-encoding process. In the case of spatial resolution reduction, the input motion vectors are mapped to the lower spatial resolution. For temporal resolution reduction, we are faced with a similar problem in which it is necessary to estimate the motion vectors from the current frame to the previous non-skipped frame that will serve as a reference frame in the receiver. Solutions to this problem have been proposed in [32,44,45]. Assuming a pixel-domain transcoding architecture, this re-estimation of motion vectors is all that needs to be done since new residuals corresponding to the re-estimated motion vectors will be calculated. However, if a DCT-domain transcoding architecture is used, a method of re-estimating the residuals in the DCT-domain is needed. A solution to this problem has been described in [46]. In [47], the issue of motion vector and residual mapping has been addressed in the context of a combined spatio-temporal reduction in the DCT-domain based on the intra-refresh architecture described earlier. The key points of these techniques will be discussed in the following.

3.3.1 Motion Vector Re-Estimation

As described in [32,44,45], the problem of re-estimating a new motion vector from the current frame to a previous non-skipped frame can be solved by tracing the motion vectors back to the desired reference frame. Since the predicted blocks in the current frame are generally overlapping with multiple blocks, bilinear interpolation of the motion vectors in the previous skipped frame has been proposed, where the weighting of each input motion vector is proportional to the amount of overlap with the predicted block. In the place of this bilinear interpolation, a majority-voting or dominant vector selection scheme as proposed in [32,47] may also be used, where the motion vector associated with the largest overlapping region is chosen.
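Both variants can be sketched for a single skipped frame as follows. The example assumes 16x16 macroblocks, integer-pel motion vectors, and a dictionary skipped_mvs (a hypothetical name) that maps macroblock grid positions (column, row) in the skipped frame to their motion vectors; macroblocks without a vector, such as intra macroblocks, are treated as having zero motion, which is an assumption of this sketch.

```python
MB = 16  # macroblock size in pixels


def overlaps(x, y):
    """Overlap areas between a 16x16 region at (x, y) and the macroblock grid."""
    col0, row0 = x // MB, y // MB
    areas = {}
    for row in (row0, row0 + 1):
        for col in (col0, col0 + 1):
            w = min(x + MB, (col + 1) * MB) - max(x, col * MB)
            h = min(y + MB, (row + 1) * MB) - max(y, row * MB)
            if w > 0 and h > 0:
                areas[(col, row)] = w * h
    return areas


def trace_back(block_xy, mv, skipped_mvs, dominant=False):
    """Extend mv (into the skipped frame) to the frame before the skipped one."""
    areas = overlaps(block_xy[0] + mv[0], block_xy[1] + mv[1])
    if dominant:
        # Dominant vector selection: take the vector with the largest overlap.
        ref = skipped_mvs.get(max(areas, key=areas.get), (0, 0))
    else:
        # Interpolation with weights proportional to the overlap area.
        total = sum(areas.values())
        ref = (sum(skipped_mvs.get(k, (0, 0))[0] * a for k, a in areas.items()) / total,
               sum(skipped_mvs.get(k, (0, 0))[1] * a for k, a in areas.items()) / total)
    return (mv[0] + ref[0], mv[1] + ref[1])


skipped = {(1, 0): (2, 0), (2, 0): (10, 0), (1, 1): (2, 4), (2, 1): (10, 4)}
print(trace_back((16, 16), (4, -4), skipped))                 # (8.0, -1.0)
print(trace_back((16, 16), (4, -4), skipped, dominant=True))  # (6, 0)
```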

In order to trace back to the desired reference frame in the case of skipping multiple frames, the above process can be repeated. It is suggested, however, that some refinement of the resulting motion vector be performed for better coding efficiency. In [45], an algorithm to determine an appropriate search range based on the motion vector magnitudes and the number of frames skipped has been proposed. To dynamically determine the number of skipped frames and maintain smooth playback, frame rate control based on characteristics of the video content has also been proposed [45,46].

3.3.2 Residual Re-Estimation

The problem of estimating a new residual for temporal resolution reduction is more challenging for DCT-domain transcoding architectures than for pixel-domain transcoding architectures. With pixel-domain architectures, the residual between the current frame and the new reference frame can be easily computed given the new motion vector estimates. For DCT-domain transcoding architectures, this calculation should be done directly using DCT-domain motion compensation techniques [28]. Alternative methods to compute this new residual have been presented in [46,47].

3.4 Scalable Coding

An important area in the context of content adaptation is scalable video coding. Various forms of scalable video coding have been explored over the past decade, including SNR scalability, frequency scalability, spatial scalability and temporal scalability. The advantage of scalable coding schemes in general is that multiple qualities and/or resolutions can be encoded simultaneously. The ability to reduce bitrate or spatio-temporal resolutions is built into the coding scheme itself. Typically, the signal is encoded into a base layer and enhancement layers, where the enhancement layers add spatial, temporal and/or SNR quality to the reconstructed base layer.

Recently, a new form of scalability, known as Fine Granular Scalability (FGS), has been developed and adopted by the MPEG-4 standard. In contrast to conventional scalable coding schemes, FGS provides an enhancement layer that is continually scalable, allowing a much finer scaling of bits [48]. This is accomplished through a bit-plane coding method of DCT coefficients in the enhancement layer, which allows the enhancement layer bitstream to be truncated at any point. In this way, the quality of the reconstructed frames is proportional to the number of enhancement bits received.
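The principle behind bit-plane coding can be illustrated with the simplified sketch below. It is not the MPEG-4 FGS syntax: it assumes non-negative coefficient magnitudes, omits sign and entropy coding, and truncates only at bit-plane boundaries, whereas FGS permits truncation at any point. The function names are hypothetical.

```python
def to_bit_planes(magnitudes):
    """Split non-negative integers into bit-planes, most significant plane first."""
    planes = max(magnitudes, default=0).bit_length()
    return [[(m >> p) & 1 for m in magnitudes] for p in range(planes - 1, -1, -1)]


def from_bit_planes(received, total_planes, count):
    """Reconstruct magnitudes from the leading bit-planes that were received."""
    values = [0] * count
    for i, plane in enumerate(received):
        shift = total_planes - 1 - i
        for j, bit in enumerate(plane):
            values[j] |= bit << shift
    return values


residual = [13, 5, 0, 2]
planes = to_bit_planes(residual)
for kept in range(1, len(planes) + 1):
    print(kept, from_bit_planes(planes[:kept], len(planes), len(residual)))
# 1 [8, 0, 0, 0]
# 2 [12, 4, 0, 0]
# 3 [12, 4, 0, 2]
# 4 [13, 5, 0, 2]
```

Each additional plane received refines every coefficient, so the reconstruction improves gradually with the number of enhancement bits.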

The standard itself does not specify how the rate allocation, or equivalently, the truncation of bits on a per frame basis is done. The standard only specifies how a truncated bitstream is decoded. In [50,49], optimal rate allocation strategies that essentially truncate the FGS enhancement layer have been proposed. The truncation is performed in order to achieve constant quality over time under a dynamic rate budget constraint.

Embedded image coders, such as the JPEG-2000 coder [51], also employ bit-plane coding techniques to achieve scalability. Both SNR and spatial scalability are supported. In contrast to most scalable video coding schemes that are based on a Discrete Cosine Transform (DCT), most embedded image coders are based on a Discrete Wavelet Transform (DWT). The potential to use the wavelet transform for scalable video coding is also being explored [52].

3.5 Summary

This section has reviewed work related to the transcoding of video bitstreams, including bitrate, spatial resolution and temporal resolution reduction. Also, some transcoding considerations with regard to scalable coding have been discussed.

It should be noted that there are additional content adaptation schemes that have been developed and proposed, but have not been covered here. Included in these additional adaptation schemes are object-based transcoding [53], transcoding between FGS and single-layer [54], and syntax conversions [55,56]. Key frame extraction and video summarization are also considered forms of content adaptation in the sense that the original video has been abstracted or reduced in time [57,58,59]. Joint transcoding of multiple streams has been presented in [60] and system layer transcoding has been described in [61].