4. Compressed-Domain Processing Methods

In this section, we examine the basic techniques of compressed-domain processing. Since the main techniques used in video compression are spatial-to-frequency transformation, particularly the DCT, and motion-compensated prediction, we focus our investigation on compressed-domain methods in these two domains, namely, the DCT domain and the motion domain.

4.1 DCT-Domain Processing

As described in Section 3, the DCT is the transformation used most often in image and video compression standards. It is therefore important to understand some basic operations that can be performed directly in the DCT domain, i.e., without an inverse DCT/forward DCT cycle.

The earliest work on direct manipulation of compressed image and video data expectedly dealt with point processing, which consists of operations such as contrast manipulation and image subtraction in which a pixel value at position p in the output image depends solely on the pixel value at the same position p in the input image. Examples of such work can be found in Chang and Messerschmitt [8], who developed special functions for video compositing, and in Smith and Rowe [9], who developed a set of algorithms for basic point operations. When compressed-domain manipulation is viewed as a matrix operation, point processing operations on compressed images and video can be characterized as inner-block algebra (IBA) operations, since the information in the output block, i.e., the manipulated block, comes solely from the corresponding input block. These operations are listed in Table 40.1.

Table 40.1: Mathematical expression of spatial vs. DCT domain algebraic operations

Operation                Spatial domain signal - x    Transform domain signal - X
Scalar addition          [f] + α                      [F] + α′ (DC coefficient only)
Scalar multiplication    α[f]                         α[F]
Pixel addition           [f] + [g]                    [F] + [G]
Pixel multiplication     [f]•[g]                      [F]⊗[G] (convolution)

In this table, lowercase f and g represent spatial-domain signals, while uppercase F and G represent their corresponding DCT-domain signals; α denotes a scalar. Scalar addition affects only the DC coefficient in the DCT domain (α′ denotes α scaled by the DC gain of the DCT), and ⊗ denotes the DCT-domain convolution discussed below. Since compression standards typically use block-based schemes, each block can be treated as a matrix, and the operations can therefore be expressed in the form of matrix operations. In general, the following relationship holds:

DCT(x) = X,

where DCT() represents the DCT function.

Because of the large number of zeros in a DCT-domain block, the amount of data that must be manipulated is greatly reduced. The speedup of the first three operations in Table 40.1 is quite obvious, given that the number of non-zero coefficients in F and G is quite small. As an example of these IBA operations, consider the compositing operation in which a foreground f is combined with a background b with a factor α to generate an output R in DCT representation. In the spatial domain, this operation can be expressed as R = DCT(α[f]+(1-α)[b]). Given F and B, the DCT representations of f and b in the compressed domain, the operation can be conveniently performed as R = α[F]+(1-α)[B]. The operation rests on the linearity of the DCT and corresponds to a combination of the image algebra operations defined above, so it can be carried out efficiently in the DCT domain. Similar compressed-domain algorithms for subtitling and dissolving applications can also be developed from these IBA operations, with computational speedups of 50 or more over the corresponding processing of the uncompressed data [9].
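To make the IBA compositing concrete, the following minimal sketch (not code from [9]; the dct2/idct2 helpers and the random test blocks are our own assumptions) performs R = α[F] + (1-α)[B] on a single 8x8 block and checks the result against the spatial-domain computation:

# A minimal sketch of IBA compositing on one 8x8 block.
import numpy as np
from scipy.fftpack import dct, idct

def dct2(x):
    # 2-D type-II DCT with orthonormal scaling.
    return dct(dct(x, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)

def idct2(X):
    # 2-D inverse DCT (type-III).
    return idct(idct(X, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)

rng = np.random.default_rng(0)
f = rng.uniform(0, 255, (8, 8))   # foreground block
b = rng.uniform(0, 255, (8, 8))   # background block
alpha = 0.75

F, B = dct2(f), dct2(b)
R = alpha * F + (1 - alpha) * B   # compositing performed entirely in the DCT domain

# Linearity of the DCT guarantees the result matches the spatial-domain one.
assert np.allclose(idct2(R), alpha * f + (1 - alpha) * b)

In practice the speedup comes from operating only on the few non-zero quantized coefficients of F and B rather than on all 64 pixels of each block.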

These methods can also be used for color transformation in the compressed domain. As long as the transformation is linear, it can be performed in the compressed domain using a combination of these IBA operations.
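As a hedged illustration of such a linear color transform, the sketch below assumes the BT.601 luma weights (an assumption; any linear transform works the same way) and shows that computing luma directly from per-component DCT blocks matches transforming in the pixel domain first:

# A sketch of a linear color transform applied directly to DCT blocks.
import numpy as np
from scipy.fftpack import dct

def dct2(x):
    return dct(dct(x, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)

rng = np.random.default_rng(1)
r, g, b = (rng.uniform(0, 255, (8, 8)) for _ in range(3))

# Because the DCT is linear, the per-pixel transform commutes with it.
Y_dct = 0.299 * dct2(r) + 0.587 * dct2(g) + 0.114 * dct2(b)
assert np.allclose(Y_dct, dct2(0.299 * r + 0.587 * g + 0.114 * b))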

Pixel multiplication, in contrast, corresponds to a convolution in the DCT domain. Compressed-domain convolution was derived in [9] by mathematically combining the decompression, manipulation, and re-compression processes to obtain a single equivalent local linear operation, in which one can easily take advantage of the energy compaction property of quantized DCT blocks. A similar approach was taken by Smith [10] to extend point processing to global processing, where the value of a pixel in the output image is an arbitrary linear combination of pixels in the input image. Shen et al. [11] studied the theory behind DCT-domain convolution based on the orthogonality of the DCT, proposed an optimized DCT-domain convolution algorithm, and applied it to DCT-domain alpha blending. Specifically, given a foreground f to be blended with a background b using an alpha channel a that indicates the transparency of each pixel in f, the operation can be expressed as: R = DCT([a]•[f]+(1-[a])•[b]). The DCT-domain operation is performed as: R = [A]⊗[F]+([1]-[A])⊗[B], where A is the DCT representation of a, [1] is the DCT of the all-ones block, and ⊗ denotes the DCT-domain convolution. A masking operation can be performed in the same fashion, with A representing the mask in the DCT domain. This operation enables the overlay of an object of arbitrary shape in the DCT domain. An important application of this is logo insertion. Another example where processing of arbitrarily shaped objects arises is discussed in Section 6.1.

Many image manipulation operations are local or neighborhood operations, where the pixel value at position p in the output image depends on neighboring pixels of p in the input image. We characterize methods that perform such operations in the compressed domain as inner-block rearrangement or resampling (IBR) methods. These methods rely on the fact that the DCT is a unitary orthogonal transform that distributes over matrix multiplication. It also distributes over matrix addition, which is in fact the pixel addition case in Table 40.1. We group these two distributive properties of the DCT in Table 40.2.

Table 40.2: Mathematical expression of distributiveness of DCT

Operation                Spatial domain signal - x    Transform domain signal - X
Matrix addition          [f] + [g]                    [F] + [G]
Matrix multiplication    [f][g]                       [F][G]

Based on the above, Chang and Messerschmitt [8] developed a set of algorithms to manipulate images directly in the compressed domain. Some of the interesting algorithms they developed include translation of images by arbitrary amounts, linear filtering, and scaling. In general, a manipulation requiring uniform and integer scaling, i.e., certain forms of filtering, is easy to implement in the DCT domain using a resampling matrix. Since every block can use the same resampling matrix in space-invariant filtering, such manipulations require little overhead in the DCT domain. In addition, translation of images by arbitrary amounts corresponds to a shifting operation that is often used in video coding. We defer a detailed discussion of this particular method to Section 4.3.

Another set of algorithms has also been introduced to manipulate the orientation of DCT blocks [12]. These methods can be employed to flip-flop a DCT frame, as well as rotate it by multiples of 90 degrees, simply by switching the locations and/or signs of certain DCT coefficients within the DCT blocks. For example, the DCT of a transposed pixel block f is the transpose of the corresponding DCT block. This operation is expressed mathematically as:

DCT([f]^t) = [F]^t.

A horizontal flip of a pixel block ([f]^h) can be achieved in the DCT domain by performing an element-by-element multiplication with a matrix composed of only two values, 1 and -1. The operation is therefore just a sign reversal of some nonzero coefficients. Mathematically, this operation is expressed as:

DCT([f]^h) = [F]•[H],

where, assuming an 8x8 block, [H] is the matrix with elements H(u,v) = (-1)^v for u,v = 0, ..., 7, i.e., a matrix whose columns alternate between all +1s and all -1s.

For the full set of operations of this type, please refer to [12]. Note that in all cases the DC coefficient remains unchanged, since each pixel retains its gray level while only its location within the block changes. These flip-flop and special-angle rotation methods are very useful in applications such as image orientation manipulation, which is often needed in copy machines, printers, and scanners.
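The sketch below verifies both identities numerically; note that the exact form of [H] used here, H(u,v) = (-1)^v, is our reconstruction consistent with the description above rather than a matrix quoted from [12]:

# Verifying the transpose and horizontal-flip identities on a random block.
import numpy as np
from scipy.fftpack import dct

def dct2(x):
    return dct(dct(x, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)

rng = np.random.default_rng(2)
f = rng.uniform(0, 255, (8, 8))
F = dct2(f)

# Transpose: transposing the pixel block transposes the DCT block.
assert np.allclose(dct2(f.T), F.T)

# Horizontal flip: reversing the columns of f flips the signs of the
# odd horizontal-frequency coefficients; the DC term is untouched.
H = np.array([[(-1) ** v for v in range(8)] for _ in range(8)])
assert np.allclose(dct2(f[:, ::-1]), F * H)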

4.2 Motion Vector Processing (MV Resampling)

From a video coding perspective, motion vectors are estimated through block matching in a reference frame. This process is often compute-intensive. The key to compressed-domain manipulation of motion vectors is to derive new motion vectors from the motion vector information already contained in the input compressed bitstream.

Consider a motion vector processing scenario that arises in a spatial resolution reduction transcoder. Given the motion vectors for a group of four 16x16 macroblocks of the original video (NxM), how does one estimate the motion vectors for the 16x16 macroblocks in the downscaled video (e.g., N/2xM/2)? Consider forward-predicted macroblocks in a forward-predicted (P) frame, wherein each macroblock is associated with a motion vector and four 8x8 DCT blocks that represent the motion-compensated prediction residual information. The downscale-by-two operation requires four input macroblocks to form a single new output macroblock. In this case, it is necessary to estimate a single motion vector for the new macroblock from the motion vectors associated with the four input macroblocks.

The question asked above can be viewed as a motion vector resampling problem. Specifically, given a set of motion vectors MV in the input compressed bitstream, how does one compute the motion vectors MV* of the output compressed bitstream? Motion vector resampling algorithms can be classified into five classes, as shown in Figure 40.2 [5]. The most accurate, but least efficient, algorithm is Class V, in which one decompresses the original frames into their full pixel representation and then performs full-search motion estimation on the decompressed frames. Since motion estimation is by far the most compute-intensive part of the transcoding operation, this is a very expensive solution. Simpler motion vector resampling algorithms are given in Classes I through IV in order of increasing computational complexity, where increased complexity typically yields more accurate motion vectors. Class I MV resampling algorithms calculate each output motion vector from its corresponding input motion vector alone. Class II algorithms calculate each output motion vector from a neighbourhood of input motion vectors. Class III algorithms likewise use a neighbourhood of input motion vectors, but additionally consider other parameters from the input bitstream, such as quantization parameters and coding modes. Class IV algorithms use a neighbourhood of motion vectors and other input bitstream parameters, but also use the decompressed frames; for example, the input motion vectors may be used to narrow the search range used when estimating the output motion vectors. Finally, Class V corresponds to full-search motion estimation on the decompressed frames.

Figure 40.2: Classes of motion vector resampling methods.

The conventional spatial-domain approach to estimating the motion vectors for the downscaled video is to first decompress the video, downscale it in the spatial domain, and then use one of the several widely known spatial-domain motion-estimation techniques (e.g., [13]) to recompute the motion vectors. This is computationally intensive. A Class II approach might be to simply take the average of the four motion vectors associated with the four macroblocks and divide it by two, so that the resulting motion vector can be associated with the 16x16 macroblock of the downscaled-by-two video. While this operation requires little processing, the motion vectors obtained in this manner are not optimal in most cases.

Adaptive motion vector resampling (AMVR) is a Class III approach proposed in [4] to estimate the output motion vectors using the original motion information from the MPEG or H.26x bitstream of the original NxM video sequence. This method uses the DCT blocks to derive block-activity information for the motion-vector estimation. When the compressed-domain AMVR method is compared to the conventional spatial-domain method, the results suggest that AMVR generates, with significantly less computation, motion vectors for the N/2xM/2 downscaled video that are very close to the optimal motion vector field that would be derived from an N/2xM/2 version of the original video sequence. A rough sketch of both the Class II and the AMVR-style approach is given below.
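In the sketch, the Class II variant is the plain average-and-halve rule, while the AMVR-style variant weights each vector by a block-activity measure; the specific weight used here (the count of nonzero quantized DCT coefficients) is an assumption in the spirit of [4], not the exact published weighting:

# Motion vector resampling for downscale-by-two (illustrative sketch).
import numpy as np

def resample_mv_class2(mvs):
    # mvs: 4x2 array of (dx, dy) vectors of the four input macroblocks.
    return np.mean(mvs, axis=0) / 2.0

def resample_mv_amvr(mvs, activities):
    # Weight each input vector by its macroblock activity before halving.
    w = np.asarray(activities, dtype=float)
    w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
    return (w[:, None] * np.asarray(mvs, dtype=float)).sum(axis=0) / 2.0

mvs = np.array([[4, 2], [6, 2], [4, 0], [6, 4]])
acts = [10, 50, 5, 35]               # e.g. nonzero DCT coefficients per macroblock
print(resample_mv_class2(mvs))       # -> [2.5 1. ]
print(resample_mv_amvr(mvs, acts))   # biased toward the higher-activity vectors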

This weighted-average motion vector scheme can also be extended to motion vector downsampling by arbitrary factors. In this case, the number of participating macroblocks is not an integer, so the fraction of each participating macroblock's area that contributes to the output macroblock is used to weight the contribution of its motion vector.

A Class IV method for deriving motion vectors from existing ones can be found in [14]. In frame-rate reduction transcoding, if a P-picture is dropped, the motion vectors of macroblocks in the next P-picture must be adjusted, since their reference frame has changed. Youn et al. [14] proposed a motion vector composition method to compute a motion vector from the incoming motion vectors; the derived motion vector can then be refined by performing partial-search motion estimation within a narrow search range, as sketched below.
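The following simplified sketch illustrates the idea; the nearest-macroblock selection used here is an assumption standing in for the dominant-vector criterion of [14], and the narrow-range refinement search is omitted:

# Composing a motion vector across a dropped P-picture (simplified sketch).
def compose_mv(mb_pos, mv_cur, dropped_mvs, mb_size=16):
    # mb_pos: top-left pixel position (x, y) of the current macroblock.
    # mv_cur: its motion vector into the (about to be dropped) P-picture.
    # dropped_mvs: dict mapping macroblock grid indices of the dropped
    #              P-picture to its motion vectors into the surviving reference.
    x, y = mb_pos
    dx, dy = mv_cur
    # Macroblock of the dropped frame that the displaced block mostly overlaps.
    gx = round((x + dx) / mb_size)
    gy = round((y + dy) / mb_size)
    ddx, ddy = dropped_mvs.get((gx, gy), (0, 0))
    # Concatenating the two displacements re-points the vector
    # at the surviving reference frame.
    return (dx + ddx, dy + ddy)

dropped = {(1, 0): (-2, 3), (1, 1): (0, 1)}
print(compose_mv((16, 0), (5, 4), dropped))  # -> (3, 7)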

The MV resampling problem for the compressed-domain reverse play operation was examined in [5]. In this application, the goal is to compute the "backward" motion vectors between two frames of a sequence, given the "forward" motion vectors between the same frames. Perhaps contrary to intuition, the resulting forward and backward motion vector fields are not simply inverted versions of one another, because of the block-based motion-compensated processing used in typical compression algorithms. A variety of MV resampling algorithms are presented in [5], along with experimental results that illustrate the tradeoffs in complexity and performance.

4.3 MC+DCT Processing

The previous section introduced methodologies for deriving output motion vectors from existing input motion vectors in the compressed domain. However, if the newly derived motion vectors are used in conjunction with the original residual data, the result is imperfect and drift error will accumulate. To avoid drift error, it is important to reconstruct the reference frame and re-compute the residual data. This makes the IDCT the next computational bottleneck, since the residual data is in the DCT domain. Alternatively, DCT-domain motion compensation methods, such as the one introduced in [8], can be employed, in which the reference frames are kept in a DCT representation so that no IDCT is needed.

Inverse motion compensation is the process of extracting a 16x16 block from the reference frame given a motion vector. It can be characterized as a group of matrix multiplications, and due to the distributive property of the DCT, it can be achieved by matrix multiplications of DCT blocks. Mathematically, consider a block g of size 8x8 pointed to by a motion vector (x,y) in the reference frame. Block g may lie in an area covered by a 2x2 array of blocks (f1, f2, f3, f4) in the reference frame. g can then be calculated as:

g = Σ_{i=1..4} [m_xi][f_i][m_yi],

where the shifting matrices m_xi and m_yi select the overlapping rows and columns of each block f_i. For a vertical offset x and horizontal offset y they take the forms

m_x1 = m_x2 = | 0  I_{8-x} |      m_x3 = m_x4 = |  0   0 |
              | 0     0    |                    | I_x  0 |

m_y1 = m_y3 = |    0     0 |      m_y2 = m_y4 = | 0  I_y |
              | I_{8-y}  0 |                    | 0   0  |

where I_z denotes the identity matrix of size z×z (each shifting matrix itself is 8x8). In the DCT domain, this operation can be expressed as:

G = Σ_{i=1..4} [M_xi][F_i][M_yi],    (40.1)

where M_xi and M_yi are the DCT representations of m_xi and m_yi, respectively, and F_i is the DCT representation of f_i. Since the shifting matrices are constant, their DCT representations can be pre-computed and stored in memory. However, computing Eq. (40.1) may still be CPU-intensive, since the shifting matrices may not be sparse enough. Various authors have proposed different methods to combat this problem.
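The sketch below implements Eq. (40.1) for one 8x8 block and checks it against direct pixel extraction; the layout of the shifting matrices is our reconstruction of the Chang-Messerschmitt formulation [8], so treat the details as assumptions:

# DCT-domain inverse motion compensation per Eq. (40.1), on one 8x8 block.
import numpy as np
from scipy.fftpack import dct

N = 8
T = dct(np.eye(N), type=2, norm='ortho', axis=0)  # orthonormal DCT matrix: X = T x T^T

def dct2(x):
    return T @ x @ T.T

def shift_matrices(x, y):
    # 8x8 selection matrices for a window at row offset x, column offset y.
    U = np.zeros((N, N)); U[:N - x, x:] = np.eye(N - x)   # rows from the upper blocks
    L = np.zeros((N, N)); L[N - x:, :x] = np.eye(x)       # rows from the lower blocks
    A = np.zeros((N, N)); A[y:, :N - y] = np.eye(N - y)   # columns from the left blocks
    B = np.zeros((N, N)); B[:y, N - y:] = np.eye(y)       # columns from the right blocks
    return (U, U, L, L), (A, B, A, B)

rng = np.random.default_rng(3)
frame = rng.uniform(0, 255, (16, 16))    # a 2x2 array of 8x8 blocks f1..f4
f = [frame[:8, :8], frame[:8, 8:], frame[8:, :8], frame[8:, 8:]]
x, y = 3, 5                              # motion offset into the block array

mx, my = shift_matrices(x, y)
# Eq. (40.1): G = sum_i Mxi Fi Myi, with Mxi = T mxi T^T and Myi = T myi T^T.
G = sum(dct2(mx[i]) @ dct2(f[i]) @ dct2(my[i]) for i in range(4))

# Inverting the DCT recovers exactly the displaced 8x8 block.
assert np.allclose(T.T @ G @ T, frame[x:x + 8, y:y + 8])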

Merhav and Bhaskaran [15] proposed decomposing the DCT-domain shifting matrices. Matrix decomposition methods are based on the sparseness of the factorized DCT transform matrices; factorization of the DCT transform matrix was introduced in a fast DCT algorithm [16]. The goal of the decomposition is to replace the matrix multiplication with a product of diagonal matrices, simple permutation matrices, and sparser matrices. The multiplication with a diagonal matrix can be absorbed into the quantization process; the multiplication with a permutation matrix can be performed by coefficient permutation; and the multiplication with a sparse matrix requires fewer multiplications. Effectively, the matrix multiplication is achieved with less computation.

In an alternative approach, the coefficients in the shifting matrix can be approximated so that floating-point multiplications can be replaced by integer shift and add operations. Work of this kind is introduced in [17]. Effectively, fewer basic CPU operations are needed since multiplications are avoided. A similar method is used in [18] for DCT-domain downsampling of images, employing approximated downsampling matrices.

To further reduce the computational complexity of the DCT-domain motion compensation process, a look-up-table (LUT) based method [19] has been proposed that models the statistical distribution of DCT coefficients in compressed images and video sequences and precomputes all possible combinations of [M_xi][F_i][M_yi] in Eq. (40.1). As a result, the matrix multiplications are reduced to simple table look-ups. Using around 800KB of memory, the LUT-based method can save more than 50% of the computing time.

4.4 Rate Control/Buffer Requirements

Rate control is another important issue in video coding. For compressed-domain processing, the output of the process should also be confined to a certain bitrate so that it can be delivered at a constant transmission rate. Eleftheriadis and Anastassiou [20] considered rate reduction by an optimal truncation or selection of DCT coefficients: since fewer coefficients are coded, fewer bits are spent coding them. Nakajima et al. [21] achieve a similar rate reduction by re-quantization with a larger quantization step size, as sketched below.
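A minimal sketch of such re-quantization follows; the uniform quantizer without a dead zone is a simplifying assumption, not the exact scheme of [21]:

# Rate reduction by re-quantizing DCT levels to a coarser step size.
def requantize(level, q_in, q_out):
    # Map a quantized DCT level from step size q_in to a coarser step q_out.
    coeff = level * q_in                # approximate reconstructed coefficient
    return int(round(coeff / q_out))    # re-quantize with the larger step

levels = [12, -5, 3, 1, 0, -1]
print([requantize(l, q_in=8, q_out=16) for l in levels])  # -> [6, -2, 2, 0, 0, 0]
# More levels collapse to zero, so the entropy coder spends fewer bits.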

For compressed-domain processing, it is important for the rate control module to use the compressed-domain information present in the original stream. This is a challenging problem, since the compressed bitstream lacks information that was available to the original encoder. To illustrate this problem, consider TM5 rate control, which is used with many video coding standards. This rate controller begins by estimating the number of bits available to code the picture and computes a reference value of the quantization parameter based on the buffer fullness and target bitrate. It then adaptively raises or lowers the quantization parameter for each macroblock based on the spatial activity of that macroblock. The spatial activity measure is defined in TM5 in terms of the variance of each block:

act_j = 1 + min_n (vblk_n),  with  vblk_n = (1/64) Σ_{i=1..64} (P_i - P̄_n)²  and  P̄_n = (1/64) Σ_{i=1..64} P_i,

where the minimum is taken over the luminance blocks of macroblock j and P_i are the pixel values of the nth 8x8 block.

However, the pixel-domain information P_i may not be available in compressed-domain processing. In this case, the activity measure has to be derived from the DCT coefficients instead of the pixel-domain frames. For example, the energy of the quantized AC coefficients in the DCT block can be used as a measure of the variance; it has been shown in [4] that this approach achieves satisfactory rate control. In addition, the target bit budget for a particular frame can be derived from the bitrate reduction factor and the number of bits spent on the corresponding original frame, which is directly available from the original video stream.
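A short sketch of this compressed-domain activity measure follows (the function names and the plain AC-energy surrogate are illustrative assumptions; by Parseval's relation, the AC energy of an orthonormally scaled DCT block equals 64 times the block's pixel variance):

# TM5-style activity computed from quantized DCT blocks.
import numpy as np

def block_activity(qdct):
    # qdct: quantized 8x8 DCT block of one luminance block.
    ac = np.array(qdct, dtype=float)
    ac[0, 0] = 0.0                   # discard the DC term
    return np.sum(ac ** 2)           # AC energy as a variance surrogate

def macroblock_activity(blocks):
    # Mirrors TM5's act_j = 1 + min(var) over the luminance blocks.
    return 1.0 + min(block_activity(b) for b in blocks)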

4.5 Frame Conversions

Frame conversions are another basic tool that can be used in compressed-domain video processing operations [22]. They are especially useful in frame-level processing applications such as splicing and reverse play. Frame conversions convert coded frames from one prediction mode to another in order to change the prediction dependencies in coded video streams. For example, an original video stream coded with I, P, and B frames may be temporarily converted to a stream coded with all I frames, i.e., a stream without temporal prediction, to facilitate pixel-level editing operations. Similarly, an IPB sequence may be converted to an IB sequence in which P frames are converted to I frames to facilitate random access into the stream. When splicing two video sequences together, frame conversions can be used to remove prediction dependencies on video frames that are not included in the final spliced sequence. Conversely, one may wish to use frame conversions to add prediction dependencies to a stream, for example converting an all-I-frame compressed video stream to an I- and P-frame stream to achieve a higher compression rate.

A number of frame conversion examples are shown in Figure 40.3. The original IPB sequence is shown at the top. Examples of frame conversions that remove temporal dependencies between frames are given: specifically, P-to-I, B-to-Bfor, and B-to-Bback conversions. These operations are useful for editing operations such as splicing. Finally, an example of I-to-P conversion is shown, in which prediction dependencies are added between frames; this is useful in applications that require further compression of a pre-compressed video stream.

Figure 40.3: Frame conversions and required macroblock conversions.

Frame conversions require macroblock-level and block-level processing because they modify the motion vectors and DCT coefficients of the compressed stream. Specifically, a frame conversion requires examining each macroblock of the compressed frame and, when necessary, changing its coding mode to an appropriate dependency. Depending on the conversion, some, but not all, macroblocks may need to be processed. An example of a macroblock that need not be processed arises in P-to-I frame conversion: since P frames contain i-type and p-type macroblocks while I frames contain only i-type macroblocks, a P-to-I conversion requires converting all p-type input macroblocks to i-type output macroblocks, whereas i-type input macroblocks are left unchanged. The frame types and allowed macroblock coding modes are listed in the upper-right table in Figure 40.3; the lower-right table shows the macroblock conversions needed for some frame conversion operations. These conversions are used in the splicing and reverse play applications described in Section 5. The conversion of a macroblock from p-type to i-type can be performed with the inverse motion compensation process introduced in Section 4.3, as sketched below.
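The macroblock-level flow of a P-to-I conversion can be sketched as follows; the dict-based macroblock layout and the stubbed DCT-domain prediction fetch are illustrative assumptions, not a codec API:

# P-to-I frame conversion at the macroblock level (illustrative sketch).
import numpy as np

def fetch_prediction_dct(reference_dct, motion_vector):
    # Stand-in for the DCT-domain inverse motion compensation of Eq. (40.1);
    # a real implementation would combine the overlapped reference blocks.
    return np.zeros((8, 8))

def p_to_i_frame(macroblocks, reference_dct):
    for mb in macroblocks:
        if mb['mode'] == 'p':
            pred = fetch_prediction_dct(reference_dct, mb['mv'])
            mb['coeff'] = pred + mb['coeff']   # intra coefficients = prediction + residual
            mb['mode'], mb['mv'] = 'i', None   # i-type macroblocks carry no motion vector
    return macroblocks

mbs = [{'mode': 'p', 'mv': (2, -1), 'coeff': np.ones((8, 8))},
       {'mode': 'i', 'mv': None, 'coeff': np.ones((8, 8))}]  # already intra: untouched
p_to_i_frame(mbs, reference_dct=None)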

One should note that in standards like MPEG, frame conversions performed on one frame may affect the prediction used in other frames because of the prediction rules specified for I, P, and B frames. Specifically, I-to-P and P-to-I frame conversions do not affect other coded frames, whereas I-to-B, B-to-I, P-to-B, and B-to-P frame conversions do. This can be understood by considering the prediction dependency rules of MPEG: since P frames are specified to depend on the nearest preceding I or P frame, and B frames on the nearest surrounding I or P frames, frame conversions of certain types will affect the prediction dependency tree inferred from the frame coding types.



