6. Advanced Topics

6.1 Object-Based to Block-Based Transcoding

This chapter has focused on compressed-domain processing and transcoding algorithms for block-based compression schemes such as MPEG-1, MPEG-2, MPEG-4 simple profile, H.261, H.263, and H.264/MPEG-4 AVC. These compression standards represent each video frame as a rectangular array of pixels and compress it with block-based processing, e.g., the block DCT and block-based motion estimation and motion-compensated prediction. These compression algorithms are referred to as block-based or frame-based schemes. Recently, object-based representations and compression algorithms have been developed; the object-based coding part of MPEG-4 is the best-known example. Unlike the block-based representations discussed above, these object-based representations decompose the image or video into arbitrarily shaped (non-rectangular) objects.

Object-based representations describe a scene more naturally than square blocks, and can facilitate a number of new functionalities such as interactivity with objects in the video and greater content-creation flexibility. The object-based profiles of MPEG-4 are especially appealing for content creation and editing. For example, it may be useful to separately represent and encode different objects in a video scene, such as different people or foreground and background objects, in order to simplify manipulation of the scene. Therefore, object-based coding, such as that in MPEG-4, may become a natural approach to create, manipulate, and distribute new content. On the other hand, most clients may have block-based decoders, especially thin clients such as PDAs or cell phones. It may therefore become important to transcode efficiently from object-based coding to block-based coding, e.g., from object-based MPEG-4 to block-based MPEG-2 or MPEG-4 simple profile. Efficient object-based to block-based transcoding algorithms were developed for intraframe (image) and interframe (video) compression in [7]. These transcoding algorithms use many of the compressed-domain methods described in Section 4.

At each time instance (or frame), a video object has a shape, an amplitude (texture) within the shape, and a motion from frame to frame. In object-based coding, the shape (or support region) of the arbitrarily shaped object is often represented by a binary mask, and the texture of the object is represented by DCT transform coefficients. The object-based coding tools are often designed based on block-based coding tools. Typically in object-based image coding, such as in MPEG-4, a bounding box is placed around the object and the box is divided into blocks. The resulting blocks are classified as interior, boundary, or exterior blocks based on whether the block is completely within, partially within, or completely outside the object's support. For intraframe coding, a conventional block-DCT is applied to interior blocks and a modified block transform is applied to boundary blocks. For interframe coding, a macroblock and transform block structure similar to block-based video coding is used, where motion vectors are computed for macroblocks and conventional or modified block transforms are applied to interior and boundary blocks.
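The block classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_blocks`, the list-of-rows mask layout, and the requirement that the bounding box be a multiple of the block size are all hypothetical choices made here, not part of MPEG-4 or of the algorithm in [7].

```python
def classify_blocks(mask, block=8):
    """Label each block of a binary mask as 'interior', 'boundary', or 'exterior'.

    mask is a list of rows of 0/1 values covering the bounding box.
    Assumption: its height and width are multiples of `block`.
    """
    rows, cols = len(mask), len(mask[0])
    labels = []
    for by in range(0, rows, block):
        label_row = []
        for bx in range(0, cols, block):
            # Count how many samples of this block lie inside the object.
            inside = sum(mask[y][x]
                         for y in range(by, by + block)
                         for x in range(bx, bx + block))
            if inside == block * block:
                label_row.append("interior")   # completely within the support
            elif inside == 0:
                label_row.append("exterior")   # completely outside the support
            else:
                label_row.append("boundary")   # partially within the support
        labels.append(label_row)
    return labels


# Toy example: a 4x8 bounding box split into two 4x4 blocks.
# The left block is fully covered; the right block is only partly covered.
toy_mask = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
print(classify_blocks(toy_mask, block=4))  # [['interior', 'boundary']]
```

In this sketch, interior blocks would go on to a conventional block DCT, boundary blocks to a modified block transform, and exterior blocks would not be coded.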

Many of the issues that arise in intraframe object-based to block-based transcoding algorithms can be understood by considering the simplified problem of overlaying an arbitrarily shaped object onto a fixed rectangular image, and producing the output compressed image that contains the rectangular image with the arbitrarily shaped overlay.

The simplest case occurs when the block boundaries of the fixed rectangular image and of the overlaid object are aligned. In this case, each output block falls into one of three cases. First, output image blocks that do not contain any portion of the overlaid object may simply be copied from the corresponding block in the fixed rectangular image. Second, output image blocks that are completely covered by the overlaid object are replaced with the object's corresponding interior block. Finally, output image blocks that contain pixels from both the rectangular image and the overlaid object are computed from the corresponding block of the fixed rectangular image and the corresponding boundary block of the overlaid object. Specifically, the new output coded block can be computed by properly masking the two blocks according to the object's segmentation mask. This can be done in the spatial domain by inverse transforming the corresponding blocks in the background image and object, combining the two blocks with a spatial-domain masking operation, and transforming the result. Alternatively, it can be done with compressed-domain masking operations, as described in Subsection 4.1, to reduce the computational requirements of the operation.
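The three aligned cases can be illustrated with a small spatial-domain sketch. It assumes the blocks have already been decoded to sample values; the function name `overlay_aligned_block` and the returned case labels are hypothetical. In a real transcoder the two copy cases would pass the coded block through without decoding, and the masking case could instead use the compressed-domain operations of Subsection 4.1.

```python
def overlay_aligned_block(background, obj, mask):
    """Compose one output block when the two block grids are aligned.

    background, obj, and mask are equally sized 2-D lists of decoded
    (spatial-domain) samples; mask holds 1 where the object is present.
    Returns the composed block and a label naming the case that applied.
    """
    covered = sum(sum(row) for row in mask)
    total = len(mask) * len(mask[0])
    if covered == 0:
        # Case 1: no object samples, so the background block is kept.
        return [row[:] for row in background], "copy_background"
    if covered == total:
        # Case 2: fully covered, so the object's interior block is used.
        return [row[:] for row in obj], "copy_object"
    # Case 3: mixed block, so select per sample using the segmentation mask.
    out = [[o if m else b for b, o, m in zip(brow, orow, mrow)]
           for brow, orow, mrow in zip(background, obj, mask)]
    return out, "mask"


# Toy 2x2 example: only the top-left sample belongs to the object.
bg = [[10, 10], [10, 10]]
ob = [[200, 200], [200, 200]]
block, case = overlay_aligned_block(bg, ob, [[1, 0], [0, 0]])
print(block, case)  # [[200, 10], [10, 10]] mask
```

Only the third case needs any arithmetic on the block contents, which is why the masking operation is the main target for compressed-domain speedups.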

If the block boundaries of the object are not aligned with the block boundaries of the fixed rectangular image, then the affected blocks need additional processing. In this scenario, a shifting operation and a combined shifting/masking operation are needed for the unaligned block boundaries. Once again, output blocks that do not contain any portion of the overlaid object are copied from the corresponding input block in the rectangular image. Each remaining output block in the original image will overlap with 2 to 4 of the overlaid object's coded blocks (depending on whether one or both of the horizontal and vertical axes are misaligned). For image blocks with full coverage of the object and for which all the overlapping object's blocks are interior blocks, a shifting operation can be used to compute the new output "shifted" block. For the remaining blocks, a combined shifting/masking operation can be used to compute the new output block. As in the previous example, these computations can be performed in the spatial domain, or possibly more efficiently in the transform domain using the operations described in Subsections 4.1 and 4.3.
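The count of 2 to 4 overlapping object blocks follows from simple grid arithmetic, sketched below. The function `overlapping_object_blocks` and its parameters are hypothetical: `(ox, oy)` is assumed to be the pixel position of the object's block-grid origin within the image, and the sketch ignores the limits of the object's bounding box, so some returned indices may lie outside it.

```python
def overlapping_object_blocks(bx, by, ox, oy, block=8):
    """Return (row, col) indices of object-grid blocks that overlap one output block.

    (bx, by) is the output block's column and row index on the image grid.
    (ox, oy) is the pixel offset of the object's grid origin in the image.
    Assumption: the object grid is treated as unbounded.
    """
    # First and last object-grid columns touched by the output block's pixels.
    first_col = (bx * block - ox) // block
    last_col = (bx * block + block - 1 - ox) // block
    # Same computation along the vertical axis.
    first_row = (by * block - oy) // block
    last_row = (by * block + block - 1 - oy) // block
    return [(row, col)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


print(len(overlapping_object_blocks(2, 1, ox=0, oy=0)))  # 1: grids aligned
print(len(overlapping_object_blocks(2, 1, ox=3, oy=0)))  # 2: one axis misaligned
print(len(overlapping_object_blocks(2, 1, ox=3, oy=5)))  # 4: both axes misaligned
```

An output block whose overlapping object blocks are all interior blocks needs only the shifting operation; any other covered block needs the combined shifting/masking operation.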

The object-based to block-based interframe (video) transcoding algorithm shares the issues that arise in the intraframe (image) transcoding algorithm with regard to the alignment of macroblock boundaries between the rectangular video and the overlaid video object, or between multiple arbitrarily shaped video objects. Furthermore, a number of important problems arise because the multiple objects in the object-coded video each have their own prediction dependencies, whereas the block-based coded video requires a single dependency tree. This requires significant manipulation of the temporal dependencies in the coded video. Briefly, given multiple arbitrarily shaped objects described by shape parameters, motion vectors, and DCT coefficients, the transcoding algorithm must compute output block-based motion vectors and DCT coefficients. The solution presented in [7] computes output motion vectors with motion vector resampling techniques and computes output DCT coefficients with efficient transform-domain processing algorithms for combinations of the shifting, masking, and inverse motion compensation operations. Furthermore, the algorithm uses macroblock mode conversions, similar to those described in Subsection 4.5 and [22], to compensate for prediction dependencies that originally may have relied upon areas now covered by the overlaid object. The reader is referred to [7] for a detailed description of the transcoding algorithm.

6.2 Secure Scalable Streaming

It should now be obvious that transcoding is a useful capability in streaming media and media communication applications, because it allows intermediate network nodes to adapt compressed media streams for downstream client capabilities and time-varying network conditions. An additional issue that arises in some streaming media and media communication applications is security, in that an application may require the transported media stream to remain encrypted at all times. In applications where this type of security is required, the transcoding algorithms described earlier in this chapter can only be applied by decrypting the stream, transcoding the decrypted stream, and encrypting the result. By requiring decryption at transcoding nodes, this solution breaks the end-to-end security of the system.

Secure Scalable Streaming (SSS) is a solution that meets the challenge of simultaneously enabling security and transcoding; specifically, it enables transcoding without decryption [34][35]. SSS uses jointly designed scalable coding and progressive encryption techniques to encode and encrypt video into secure scalable packets that are transmitted across the network. The joint encoding and encryption is performed such that the resulting secure scalable packets can be transcoded at intermediate, possibly untrusted, network nodes by simply truncating or discarding packets, without compromising the end-to-end security of the system. The secure scalable packets may have unencrypted headers that provide hints, such as optimal truncation points, which the downstream transcoders use to achieve rate-distortion (R-D) optimal fine-grain transcoding across the encrypted packets.
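The idea of transcoding by truncation can be sketched as follows. The packet layout shown here is purely hypothetical and is not the format defined in [34][35]: it assumes an unencrypted header that lists valid truncation points as payload lengths in bytes, and a progressively encrypted payload of which any listed prefix remains decryptable by the receiver. The transcoder never touches a key and treats the payload as opaque bytes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SecureScalablePacket:
    # Unencrypted header hints: payload lengths (bytes) at which it is valid to cut.
    truncation_points: List[int]
    # Progressively encrypted data; opaque to the transcoder.
    payload: bytes


def transcode_by_truncation(packet: SecureScalablePacket,
                            max_payload_bytes: int) -> Optional[SecureScalablePacket]:
    """Fit a packet into a byte budget by truncating, without decrypting.

    Returns the truncated packet, or None if even the smallest truncation
    point exceeds the budget (the packet is then discarded).
    """
    allowed = [p for p in packet.truncation_points if p <= max_payload_bytes]
    if not allowed:
        return None
    cut = max(allowed)  # keep as much of the payload as the budget permits
    return SecureScalablePacket(
        truncation_points=[p for p in packet.truncation_points if p <= cut],
        payload=packet.payload[:cut],
    )


# Toy example: a 12-byte encrypted payload that may be cut at 4, 8, or 12 bytes.
packet = SecureScalablePacket([4, 8, 12], bytes(range(12)))
smaller = transcode_by_truncation(packet, max_payload_bytes=10)
print(len(smaller.payload), smaller.truncation_points)  # 8 [4, 8]
```

Choosing the largest allowed truncation point is a simplification made for this sketch; an R-D optimal transcoder would weigh the rate and distortion hints across many packets rather than treating each packet independently.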

The transcoding methods presented in this chapter are very powerful in that they can operate on most standard-compliant streams. However, in applications that require end-to-end security (where the transcoder is not allowed to see the bits), SSS can be used with certain types of scalable image and video compression algorithms to simultaneously provide security and scalability by enabling transcoding without decryption.

6.3 Applications to Mobile Streaming Media Systems

The increased bandwidth of next-generation wireless systems will make streaming media a critical component of future wireless services. The network infrastructure will need to handle the demands of mobility and streaming media in a manner that scales to large numbers of users. Mobile streaming media (MSM) systems can be used to enable media delivery over next-generation mobile networks. For example, a mobile streaming media content delivery network (MSM-CDN) can be used to efficiently distribute and deliver media content to large numbers of mobile users [36]. These MSM systems need to handle large numbers of compressed media streams; the CDP methods presented in this chapter can be used to do so in an efficient and scalable manner. For example, compressed-domain transcoding can be used to adapt media streams originally made for high-resolution formats such as DVD into media streams made for lower-resolution portable devices [26], and to adapt streams for different types of portable devices. Furthermore, transcoding can be used to adaptively stream content over error-prone, time-varying wireless links by adapting the error resilience to the channel conditions [37]. Running transcoding sessions in mobile environments raises a number of system-level technical challenges. For example, user mobility may cause a server handoff in an MSM-CDN, which in turn may require the midstream handoff of a transcoding session [38]. CDP is likely to play a critical and enabling role in next-generation MSM systems that require scalability and performance, and in many cases CDP will enable next-generation wireless, media-rich services.