Content-Based Video Coding (MPEG-4)


MPEG-4 is another ISO/IEC standard, developed by MPEG (Moving Picture Experts Group), the committee that also developed the Emmy Award winning standards MPEG-1 and MPEG-2. While MPEG-1 and MPEG-2 video aimed at devising coding tools for CD-ROM and digital television respectively, MPEG-4 video aims at providing tools and algorithms for efficient storage, transmission and manipulation of video data in multimedia environments [1,2]. The main motivations behind such a task are the proven success of digital video in the three fields of digital television, interactive graphics applications (synthetic image content) and interactive multimedia (worldwide web, distribution of and access to image content). The MPEG-4 group believes these aims can be achieved by emphasising the functionalities of the proposed codec, which include efficient compression, object scalability, spatial and temporal scalability, error resilience, etc.

The approach taken by the experts group in coding of video for multimedia applications relies on a content-based visual data representation of scenes. In content-based coding, in contrast to conventional video coding techniques, a scene is viewed as a composition of video objects (VO) with intrinsic properties such as shape, motion and texture. It is believed that such a content-based representation is a key to facilitating interactivity with objects for a variety of multimedia applications. In such applications, the user can access arbitrarily shaped objects in the scene and manipulate these objects.

The MPEG-4 group has defined the specifications of their intended video codec in the form of verification models (VM) [3]. The verification model in MPEG-4 has the same role as the reference and test models defined for H.261 and MPEG-2, respectively. The verification model has evolved over time by means of core experiments in various laboratories round the world. It is regarded as a common platform with a precise definition of the encoding and decoding algorithms that can be represented as tools addressing specific functionalities of MPEG-4. New algorithms/tools are added to the VM and old algorithms/tools are replaced in the VM by successful core experiments.

So far the verification model has gradually evolved from version 1.0 to version 11.0, and at each evolution new functionalities have been added. In this chapter we do not intend to review all of them, but instead address those functionalities that have made MPEG-4 video coding radically different from its predecessors. Hence, it is intended to look at the fundamentals of the new coding algorithms introduced in MPEG-4. Before going into the details of this coding technique, let us examine the functionalities of this codec.

Levels and profiles

MPEG-4, like MPEG-2, has so many functionalities that users may only be interested in a subset of them. These subsets are defined as profiles. For each profile, a number of operating parameters, such as bit rate, frame rate and pixel resolution, are defined as levels. Since MPEG-4 is a very versatile video coding tool, it has several profiles and levels, and more are added from time to time. A profile in MPEG-4 is also synonymous with the support of a set of annexes in H.263. Some well known profiles, with their commonly used levels, are as follows:

  • Simple profile: this profile provides the simplest tools for low cost applications, such as video over mobile networks and the Internet. It supports up to four rectangular objects in a scene within QCIF pictures. There are three levels in this profile to define bit rates from 64 to 384 kbit/s (64, 128 and 384 kbit/s for level-1, level-2 and level-3, respectively). The simple profile also supports most of the options (annexes in H.263) that are mainly useful for error-resilient transmission. In addition to I and P-VOPs (video object planes, to be defined shortly) they include: AC/DC prediction, four motion vectors, unrestricted motion vectors, slice synchronisation, data partitioning and reversible VLC. This profile can decode a bit stream generated by the core H.263.
  • Simple scalable profile: this adds the support for B-VOPs and temporal and spatial scalability to the simple profile. It provides services to the receivers requiring more than one level of quality of service, such as video over Internet.
  • Advanced real time simple: this adds error protection to the simple profile, through the introduction of the back channel. In response to a negative acknowledgement from the decoder, the encoder encodes the affected parts of the picture in intra mode. This profile improves the robustness of real time visual services over error prone channels such as videophone.
  • Advanced simple profile: this improves the compression efficiency of the simple profile, by supporting quarter-pixel resolution and global motion estimation in addition to B-VOPs.
  • Fine granular scalability profile: this is similar to SNR scalability, but the enhancement layers are represented in bit planes, to offer up to eight scalable layers. It is mainly used with the simple or advanced simple profile as the base layer.
  • Core profile: this adds scalability to still textures, B-VOPs, binary shape coding and temporal scalability of rectangular as well as binary shape objects to the simple profile. Its maximum bit rate for level-1 is 384 kbit/s and for level-2 is 1 Mbit/s. This profile is useful for high quality interactive services, as well as mobile broadcast services.
  • Core scalable visual profile: this adds object-based SNR, spatial and temporal scalability to the core profile.
  • Main profile: this adds support for interlaced video, greyscale alpha maps and sprites. This profile is intended for broadcast services that can handle both progressive and interlaced video. It can handle up to 32 objects with a maximum bit rate of 38 Mbit/s.
  • Advanced coding efficiency: this profile is an extension of the main profile, but for bit rates of less than 1 Mbit/s. It adds quarter pixel and global motion estimation to the main profile to improve the encoding efficiency. However, it does not support sprites.
  • Simple studio profile: this only supports I-VOP pictures coded at very high quality up to 1200 Mbit/s. As the name implies it is designed for studio applications, and can support high resolution video for HDTV and digital cinema. Each I-VOP may have an arbitrary shape and have several alpha planes.
  • Core studio profile: this adds P-VOPs to the simple studio profile, to reduce the bit rate for very high quality video.

Video object plane (VOP)

In object-based coding the video frames are defined in terms of layers of video object planes (VOP). Each video object plane is then a video frame of a specific object of interest to be coded, or to be interacted with. Figure 10.1a shows a video frame that is made of three VOPs. In this Figure, the two objects of interest are the balloon and the aeroplane. They are represented by their video object planes VOP1 and VOP2. The remaining part of the video frame is regarded as the background, represented by VOP0. For coding applications, the background is coded only once, and the other object planes are encoded through time. At the receiver the reconstructed background is repeatedly added to the other decoded object planes. Since in each frame the encoder only codes the objects of interest (e.g. VOP1 and/or VOP2), and usually these objects represent a small portion of the video frame, the bit rate of the encoded video stream can be extremely low. Note that, had the video frame of Figure 10.1a been coded with a conventional codec such as H.263, then since the clouds in the background move, the H.263 encoder would inevitably have encoded most parts of the picture, with a much higher bit rate than that generated by the two objects.

Figure 10.1: (a) a video frame composed of (b) balloon VOP1, (c) aeroplane VOP2 and (d) the background VOP0

The VOP can be a semantic object in the scene, such as the balloon and aeroplane in Figure 10.1. It is made of Y, U and V components plus their shapes. The shapes are used to mask the background, and help to identify object borders.

In MPEG-4 video, the VOPs are either known by construction of the video sequence (hybrid sequence based on blue screen composition or synthetic sequences) or are defined by semiautomatic segmentation. In the former, the shape information is represented by eight bits, known as the greyscale alpha plane. This plane is used to blend several video object planes to form the video frame of interest. Thus with eight bits, up to 256 objects can be identified within a video frame. In the second case, the shape is a binary mask, to identify individual object borders and their positions in the video frames.

Figure 10.2 shows the binary shapes of the balloon and aeroplane in the above example. Both cases are currently considered in the encoding process. The VOP can have an arbitrary shape. When the sequence has only one rectangular VOP of fixed size displayed at a fixed interval, it corresponds to frame-based coding. Frame-based coding is similar to H.263.

Figure 10.2: Shape of objects: (a) balloon, (b) aeroplane

Coding of objects

Each video object plane corresponds to an entity that after being coded is added to the bit stream. The encoder sends, together with the VOP, composition information to indicate where and when each VOP is to be displayed. Users are allowed to trace objects of interest from the bit stream. They are also allowed to change the composition of the entire scene displayed by interacting with the composition information.

Figure 10.3 illustrates a block diagram of an object-based coding verification model (VM). After defining the video object planes, each VOP is encoded and the encoded bit streams are multiplexed to a single bit stream. At the decoder the chosen object planes are extracted from the bit stream and then are composed into an output video to be displayed.

Figure 10.3: An object-based video encoder/decoder

Encoding of VOPs

Figure 10.4 shows a general overview of the encoder structure for each of the video object planes (VOPs). The encoder is mainly composed of two parts: the shape encoder and the traditional motion and texture encoder (e.g. H.263) applied to the same VOP.

Figure 10.4: VOP encoder structure

Before explaining how the shape and the texture of the objects are coded, in the following we first explain how a VOP should be represented for efficient coding.

Formation of VOP

The shape information is used to form a VOP. For maximum coding efficiency, the arbitrary shape VOP is encapsulated in a bounding rectangle such that the object contains the minimum number of macroblocks. To generate the bounding rectangle, the following steps are followed:

  1. Generate the tightest rectangle around the object, as shown in Figure 10.5. Since the dimensions of the chrominance VOP are half of the luminance VOP (4:2:0), then the top left position of the rectangle should be an even numbered pixel.

    Figure 10.5: Intelligent VOP formation

  2. If the top left position of this rectangle is the origin of the frame, skip the formation procedure.
  3. Form a control macroblock at the top left corner of the tightest rectangle, as shown in Figure 10.5.
  4. Count the number of macroblocks needed to contain the object, starting at each even numbered point of the control macroblock. Details are as follows:

    1. Generate a bounding rectangle from the control point to the right bottom side of the object that consists of multiples of 16 × 16 pixel macroblocks.
    2. Count the number of macroblocks in this rectangle that contain at least one object pixel.
  5. Select that control point which results in the smallest number of macroblocks for the given object.
  6. Extend the top left coordinate of the tightest rectangle to the selected control coordinate.

This will create a rectangle that completely contains the object but with the minimum number of macroblocks in it. The VOP horizontal and vertical spatial references are taken directly from the modified top left coordinate.
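The steps above can be sketched in Python. This is a simplified illustration, not the normative VM procedure: it assumes the object is given as a binary mask stored as a list of rows, and it simply tries every even-numbered control offset within one 16 × 16 control macroblock extending up and left from the tightest rectangle.

```python
def tightest_rectangle(mask):
    """Tightest rectangle around the object, with the top-left corner
    forced to even coordinates for 4:2:0 chroma alignment."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    top, bottom = min(ys), max(ys)
    left, right = min(xs), max(xs)
    top -= top % 2
    left -= left % 2
    return top, left, bottom, right

def count_object_macroblocks(mask, top, left, bottom, right):
    """Number of 16x16 macroblocks in the bounding rectangle that contain
    at least one object pixel."""
    count = 0
    for y0 in range(top, bottom + 1, 16):
        for x0 in range(left, right + 1, 16):
            if any(mask[y][x]
                   for y in range(y0, min(y0 + 16, len(mask)))
                   for x in range(x0, min(x0 + 16, len(mask[0])))):
                count += 1
    return count

def form_vop_rectangle(mask):
    """Try each even-numbered control point extending up/left from the
    tightest rectangle and keep the one needing fewest macroblocks."""
    top, left, bottom, right = tightest_rectangle(mask)
    best = None
    for dy in range(0, 16, 2):
        for dx in range(0, 16, 2):
            t, l = top - dy, left - dx
            if t < 0 or l < 0:
                continue
            n = count_object_macroblocks(mask, t, l, bottom, right)
            if best is None or n < best[0]:
                best = (n, t, l)
    return best  # (macroblock count, top, left)
```

The returned top-left coordinate then serves as the VOP spatial reference.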

Image segmentation

If VOPs are not available, then video frames need to be segmented into objects and a VOP derived for each one. In general, segmentation consists of extracting image regions of similar properties such as brightness, colour or texture. These regions are then used as masks to extract the objects of interest from the image [4]. However, video segmentation is by no means a trivial task and requires the use of a combination of techniques that were developed for image processing, machine vision and video coding. Segmentation can be performed either automatically or semiautomatically.

Semiautomatic segmentation

In semiautomatic video segmentation, the first frame is segmented by manually tracing a contour around the object of interest. This contour is then tracked through the video sequence by dynamically adapting it to the object's movements. The use of this deformable contour for extracting regions of interest was first introduced by Kass et al. [5], and it is commonly known as active contour or active snakes.

In the active snakes method, the contour defined in the first frame is modelled as a polynomial of an arbitrary degree, the most common being a cubic polynomial. The coefficients of the polynomial are adapted to fit the object boundary by minimising a set of constraints. This process is often referred to as the minimisation of the contour energy. The contour stretches or shrinks dynamically while trying to seek a minimal energy state that best fits the boundary of the object. The total energy of a snake is defined as the weighted sum of its internal and external energy given by:

E_snake = αE_internal + βE_external
In this equation, the constants α and β give the relative weightings of the internal and the external energy. The internal energy (Einternal) of the snake constrains the shape of the contour and the amount it is allowed to grow or shrink. External energy (Eexternal) can be defined from the image property, such as the local image gradient. By suitable use of the gradient-based external constraint, the snake can be made to wrap itself around the object boundary. Once the boundary of the object is located in the first frame, the shape and positions of the contours can be adapted automatically in the subsequent frames by repeated applications of the energy minimisation process.
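A minimal sketch of one greedy minimisation step follows. This is an illustrative simplification, not the variational method of Kass et al.: each contour point moves to the 8-neighbour that minimises a weighted sum of a continuity (internal) term and an external energy sampled from the image, e.g. the negated gradient magnitude.

```python
def snake_step(points, external, alpha=1.0, beta=1.0):
    """One greedy iteration: each contour point moves to the 8-neighbour
    (or stays) that minimises alpha*E_internal + beta*E_external."""
    new_pts = list(points)
    n = len(points)
    for i, (y, x) in enumerate(points):
        py, px = new_pts[(i - 1) % n]      # previous point (already updated)
        qy, qx = points[(i + 1) % n]       # next point
        best = None
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                cy, cx = y + dy, x + dx
                if not (0 <= cy < len(external) and 0 <= cx < len(external[0])):
                    continue
                # internal energy: squared distances to neighbours (continuity)
                e_int = (cy - py) ** 2 + (cx - px) ** 2 + (cy - qy) ** 2 + (cx - qx) ** 2
                # external energy sampled from the image property
                e = alpha * e_int + beta * external[cy][cx]
                if best is None or e < best[0]:
                    best = (e, cy, cx)
        new_pts[i] = (best[1], best[2])
    return new_pts
```

Repeated application of `snake_step` shrinks or stretches the contour towards a low-energy state; practical snakes also include a curvature term in the internal energy.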

Automatic segmentation

Automatic video segmentation aims to minimise user intervention during the segmentation process and is a significantly harder problem than semiautomatic segmentation. Foreground objects or regions of interest in the sequence have to be determined automatically with minimal input from the user. To achieve this, spatial segmentation is first performed on the first frame of the sequence by partitioning it into regions of homogeneous colour or grey level. Following this, motion estimation is used to identify moving regions in the image and to classify them as foreground objects.

Motion information is estimated for each of the homogenous regions using a future reference frame. Regions that are nonstationary are classified as belonging to the foreground object and are merged to form a single object mask. This works well in simple sequences that contain a single object with coherent motion. For more complex scenes with multiple objects of motion disparity, a number of object masks have to be generated. Once the contours of the foreground objects have been identified, they can be tracked through the video sequence using techniques such as active snakes. In the following, each element of video segmentation is explained.

Image gradient

The first step of the segmentation process is the partitioning of the first video frame into regions of homogenous colour or grey level. A technique that is commonly used for this purpose is the watershed transform algorithm [6]. This algorithm partitions the image into regions which correspond closely to the actual object boundaries, but direct application of the algorithm on the image often leads to over segmentation. This is because, in addition to the object borders in the image, texture and noise may also create artificial contours. To alleviate this problem, the image gradient is used as input to the watershed transform instead of the actual image itself.

Nonlinear diffusion

To eliminate the false contours the image has to be smoothed, in such a way that edges are not affected. This can be done by diffusing the image iteratively without allowing the diffusion process to cross the object borders. One of the best known filters for this purpose is nonlinear diffusion, where the diffusion is prohibited across edges but unconstrained in regions of almost uniform texture.

Nonlinear diffusion can be realised by space variant filtering, which adapts the Gaussian kernel dynamically to the local gradient estimate [7]. The Gaussian kernel is centred at each pixel and its coefficients are weighted by a function of the gradient magnitude between the centre pixel and its neighbours. The weighted coefficients aij are given by:

aij = cij wij = cij g(∇Iij)
where cij is the initial filter coefficient, wij is the weight assigned to each coefficient based on the local gradient magnitude ∇Iij and g is the diffusion function [8]. If wij = 1, then the diffusion is linear. The local gradient is computed for all pixels within the kernel by taking the difference between the centre and the corresponding pixels. More precisely, ∇Iij = Iij - Ic, where Ic is the value of the centre pixel and Iij are the pixels within the filter kernel.

Figure 10.6 shows the diffusion function and the corresponding adapted Gaussian kernel to inhibit diffusion across a step edge. The Figure shows that for large gradient magnitudes, the filter coefficient is reduced in order not to diffuse across edges. Similarly, at the lower gradient magnitudes (well away from the edges) within the area of almost uniform texture, the diffusion is unconstrained, so as to have maximum smoothness. The gradient image generated in this form is suitable to be segmented at its object borders by the watershed transform.
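One diffusion iteration can be sketched as follows. The choice of g is an assumption for illustration (a Perona–Malik style exponential, g(∇I) = exp(−(∇I/k)²)); the text does not mandate a particular diffusion function.

```python
import math

def diffuse_once(img, k=10.0):
    """One iteration of space-variant Gaussian smoothing: each 3x3 kernel
    coefficient c_ij is attenuated by w_ij = g(grad), where grad is the
    difference between the neighbour and the centre pixel, so diffusion
    across strong edges is inhibited."""
    gauss = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # unnormalised 3x3 Gaussian
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            centre = img[y][x]
            num = den = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    v = img[y + dy][x + dx]
                    grad = v - centre                 # local gradient estimate
                    wgt = math.exp(-(grad / k) ** 2)  # g(): small across edges
                    a = gauss[dy + 1][dx + 1] * wgt   # a_ij = c_ij * w_ij
                    num += a * v
                    den += a
            out[y][x] = num / den
    return out
```

Pixels inside uniform areas are smoothed with the full kernel, while pixels on either side of a step edge keep their value because the cross-edge coefficients collapse towards zero.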

Figure 10.6: Weighting function (left), adapted Gaussian kernel (right)

Colour edge detection

Colour often gives more information about the object boundaries than the luminance. However, although gradient detection in greyscale images is to some extent easy, deriving the gradient of colour images is less straightforward due to separation of the image into several colour components. Each pixel in the RGB colour space can be represented by a position vector in the three-dimensional RGB colour cube:

c = [R, G, B]T
Likewise, pixels in other colour spaces can be represented in a similar manner. Hence, the problem of detecting edges in a colour image can be redefined as the problem of computing the gradient between vectors in the three-dimensional colour space. Various methods have been proposed for solving the problem of colour image gradient detection and they include component-wise gradient detection and vector gradient.

Component-wise gradient

In this approach, gradient detection is performed separately for each colour channel, followed by the merging of the results. The simplest way is to apply the Prewitt or the Sobel operator separately on each channel and then combine the gradient magnitude using some linear or nonlinear function [10]. For example:

(10.4)  |∇I| = √(∇IR² + ∇IG² + ∇IB²)

where ∇IR, ∇IG, ∇IB represent the gradient magnitude computed for each of the colour channels.
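A component-wise gradient can be sketched as below: the Sobel operator is applied to each channel separately and the magnitudes are combined as in equation (10.4). The root-of-sum-of-squares combination is one of several choices mentioned in the text.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_mag(ch, y, x):
    """Sobel gradient magnitude of one channel at (y, x)."""
    gx = gy = 0
    for dy in range(3):
        for dx in range(3):
            v = ch[y + dy - 1][x + dx - 1]
            gx += SOBEL_X[dy][dx] * v
            gy += SOBEL_Y[dy][dx] * v
    return math.hypot(gx, gy)

def colour_gradient(r, g, b):
    """Combine the per-channel gradient magnitudes (borders left at zero)."""
    h, w = len(r), len(r[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = math.sqrt(sobel_mag(r, y, x) ** 2 +
                                  sobel_mag(g, y, x) ** 2 +
                                  sobel_mag(b, y, x) ** 2)
    return out
```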

Vector gradient

This approach requires the definition of a suitable colour difference metric in the three-dimensional colour space. This metric should be based on human perception of colour difference, such that a large value indicates significant perceived difference, and a small value indicates colour similarity. A metric that is commonly used is the Euclidean distance between two colour vectors. For the n-dimensional colour space, the Euclidean distance between two colour pixels, A and B, is defined as follows:

ΔE(A, B) = √((A1 − B1)² + (A2 − B2)² + ... + (An − Bn)²)
The gradient of the image is then derived, based on the sum of the colour differences between each pixel and its neighbours. This approximates the Prewitt kernel and is given by the following equations.

∇Ix(i, j) = Σk=−1..1 ΔE(P(i+1, j+k), P(i−1, j+k))
∇Iy(i, j) = Σk=−1..1 ΔE(P(i+k, j+1), P(i+k, j−1))
The gradient magnitude is then defined as:
|∇I| = √(∇Ix² + ∇Iy²)
The use of Euclidean distance in the RGB colour space does not give a good indication of the perceived colour differences. Hence, in [9], it was proposed that the vector gradient should be computed in the CIELUV colour space. CIELUV is an approximation of a perceptually uniform colour space, which is designed so that the Euclidean distance could be used to quantify colour similarity or difference [10].
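A vector-gradient sketch follows. The Euclidean colour distance is as defined above; for brevity the per-pixel gradient is taken here as the sum of colour differences to the 4-neighbours, a simplification of the Prewitt-style arrangement described in the text.

```python
import math

def delta_e(a, b):
    """Euclidean distance between two colour vectors (any n-dim space)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def vector_gradient(img):
    """Per-pixel gradient: sum of colour differences to the 4-neighbours.
    img is a 2-D grid of colour tuples."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    out[y][x] += delta_e(img[y][x], img[ny][nx])
    return out
```

With pixels expressed in CIELUV rather than RGB, the same `delta_e` approximates perceived colour difference, which is the motivation given in [9].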

Watershed transform

The watershed transform algorithm can be visualised by adopting a topological view of the image, where high intensity values represent peaks and low intensity values represent valleys. Water is flooded into the valleys and the lines where the waters from different valleys meet are called the watershed lines, as shown in Figure 10.7. The watershed lines separate one region (catchment basin) from another. In actual implementation, all pixels in an image are initially classified as unlabelled and they are examined starting from the lowest intensity value. If the pixel that is currently being examined has no labelled neighbours, it is classified as a new region or a new catchment basin. On the other hand, if the pixel was a neighbour to exactly one labelled region, it would be classified as belonging to that region. However, if the pixel separates two or more regions, it would be labelled as part of the watershed line.

Figure 10.7: Immersion-based watershed flooding

The algorithm can be seeded with a series of markers (seed regions), which represent the lowest point in the image from where water is allowed to start flooding. In other words, these markers form the basic catchment basins for the flooding. In some implementations, new basins are not allowed to form and, hence, the number of regions in the final segmented image will be equal to the number of markers. In other implementations, new regions are allowed to be formed if they are at a higher (intensity) level as compared with the markers. In the absence of markers, the lowest point (lowest intensity value) in the valley is usually chosen as the default marker.

There are two basic approaches to the implementation of the watershed transform: the immersion method [11] and the topological distance method [5]. Results obtained using different implementations are usually not identical.

Immersion watershed flooding

Efficient implementations of immersion-based watershed flooding are largely based on Vincent and Soille's algorithm [11]. Immersion flooding can be visualised by imagining that holes are pierced at the bottom of the lowest point in each valley. The image is then immersed in water, which flows into the valley from the holes. Obviously, the lowest intensity valley will get flooded first before valleys that have a higher minimum point, as shown in Figure 10.7.
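The labelling rules above can be sketched in a few lines of Python. This is a minimal illustration of immersion-style flooding on a small intensity grid, not Vincent and Soille's optimised queue-based algorithm: pixels are simply visited in increasing intensity order.

```python
def watershed(img):
    """Minimal immersion-style watershed on a 2-D intensity image.
    Pixels are processed in increasing intensity order; a pixel joins the
    single labelled basin next to it, starts a new basin if it has no
    labelled neighbour, or becomes a watershed line (label 0) if it
    touches two or more different basins."""
    h, w = len(img), len(img[0])
    labels = [[None] * w for _ in range(h)]
    order = sorted((img[y][x], y, x) for y in range(h) for x in range(w))
    next_label = 1
    for _, y, x in order:
        neigh = set()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] not in (None, 0):
                neigh.add(labels[ny][nx])
        if not neigh:
            labels[y][x] = next_label      # new catchment basin
            next_label += 1
        elif len(neigh) == 1:
            labels[y][x] = neigh.pop()     # grow the existing basin
        else:
            labels[y][x] = 0               # watershed line between basins
    return labels
```

Seeding with markers, as described above, amounts to pre-labelling the marker pixels before the flooding loop so that no new basins are created.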

Vincent and Soille's algorithm uses a sorted queue in which pixels in the image are sorted based on their intensity level. In this way, pixels of low intensity (lowest points in valleys) can be examined before the higher intensity pixels. This algorithm achieves fast performance since the image is scanned in its entirety only once, during the queue setup phase.

Topological distance watershed

In the topological distance approach, a drop of water that falls on the topology will flow to the lowest point via the slope of the steepest descent. The slope of the steepest descent is loosely defined as the shortest distance from the current position to the local minima. If a certain point in the image has more than one possible steepest descent, that point would be a potential watershed line that separates two or more regions. A more rigorous mathematical definition of this approach can be found in [5].

Colour similarity merging

Despite the use of nonlinear diffusion and the colour gradient, the watershed transformation algorithm still produces an excessive number of regions. For this reason, the regions must be merged based on colour similarity in a suitable colour space. The CIELUV uniform colour space is a plausible candidate due to the simplicity of quantifying colour similarity based on Euclidean distances. The mean luminance L and colour differences u* and v* values are computed for each region of the watershed transformed image. The regions are examined in the increasing order of size and merged with their neighbours if the colour difference between them is less than a threshold (Th). More precisely, the regions A and B are merged if:

√((AL − BL)² + (Au* − Bu*)² + (Av* − Bv*)²) < Th
where AL,u*,v* and BL,u*,v* are the mean CIELUV values of the respective regions. Since it is likely that each region would have multiple neighbours, the neighbour that minimises the Euclidean distance is chosen as the target for merging. After each merging, the mean L, u* and v* values of the merged region are updated accordingly.
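The merging pass can be sketched as follows. The data layout (dictionaries of region means, sizes and adjacency) is an assumption for illustration; the merge criterion and the size-ordered scan follow the description above.

```python
import math

def merge_regions(means, sizes, adjacency, th):
    """Greedy colour-similarity merging.  means: region -> (L, u, v);
    sizes: region -> pixel count; adjacency: region -> set of neighbours.
    Regions are examined in increasing order of size and merged into the
    closest neighbour when the CIELUV Euclidean distance is below th."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(means[a], means[b])))

    parent = {r: r for r in means}
    def find(r):                     # representative after merging
        while parent[r] != r:
            r = parent[r]
        return r

    for r in sorted(means, key=lambda r: sizes[r]):
        r = find(r)
        cand = {find(n) for n in adjacency[r]} - {r}
        if not cand:
            continue
        best = min(cand, key=lambda n: dist(r, n))
        if dist(r, best) < th:
            # merge r into best: update the size-weighted mean colour
            sr, sb = sizes[r], sizes[best]
            means[best] = tuple((m1 * sr + m2 * sb) / (sr + sb)
                                for m1, m2 in zip(means[r], means[best]))
            sizes[best] = sr + sb
            adjacency[best] |= adjacency[r]
            parent[r] = best
    return {r: find(r) for r in parent}
```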

Region motion estimation

Spatial segmentation alone cannot discriminate the foreground objects in a video sequence. Motion information between the current frame and a future reference frame must be utilised to identify the moving regions of the image. These moving regions can then be classified as foreground objects or as regions of interest.

Motion estimation is performed for each of the regions after the merging process. Since regions can have arbitrary shapes, the boundary of each region is extended to its tightest fitting rectangle. Using the conventional block matching motion estimation algorithm (BMA), the motion of the fitted rectangle is estimated.
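A full-search BMA can be sketched as below; the block size and search range are illustrative parameters, and the sum of absolute differences (SAD) is the usual matching criterion.

```python
def block_match(cur, ref, y0, x0, size=16, search=7):
    """Full-search block matching: find the (dy, dx) within +/-search that
    minimises the sum of absolute differences (SAD) between the block at
    (y0, x0) in the current frame and the displaced block in the
    reference frame."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y0 + dy, x0 + dx
            if ry < 0 or rx < 0 or ry + size > h or rx + size > w:
                continue  # displaced block must stay inside the frame
            sad = sum(abs(cur[y0 + i][x0 + j] - ref[ry + i][rx + j])
                      for i in range(size) for j in range(size))
            if best is None or sad < best[0]:
                best = (sad, dy, dx)
    return best[1], best[2]  # motion vector of the fitted rectangle
```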

Figure 10.8 gives a pictorial summary of the whole video segmentation process. The first frame of the video, after filtering and the gradient operator, yields the gradient of the diffused image. This image is then watershed transformed into various regions, some of which might be similar in colour and intensity. In the above example, the watershed transformed image has about 2660 regions, but many regions are identical in colour. Through the colour similarity process, identical colour regions are merged into about 60 regions (in this picture). Finally, homogeneous neighbouring regions are grouped together and separated from the static background, resulting in two objects: the ball, and the hand with the table tennis racket.

Figure 10.8: A pictorial representation of video segmentation: (a) original, (b) gradient, (c) watershed transformed, (d) colour merged, (e) segmented image

Object mask creation

The final step of the segmentation process is the creation of the object mask. The object mask is synonymous with object segmentation, since it is used to extract the object from the image. The mask can be created from the union of all regions that have nonzero motion vectors. This works well for sequences that have a stationary background with a single moving object. For more complex scenes with nonstationary background and multiple moving objects, nine object masks are created for the image. Each mask is the union of all regions with coherent motion. Mask 0 corresponds to regions of zero motion, and the other eight masks are regions with motion in the N, NE, E, SE, S, SW, W and NW directions, respectively. These object masks are then used to extract the foreground objects from the video frame.
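The nine-mask classification can be sketched as follows. The 45-degree sector boundaries are an assumption for illustration; the text does not specify how diagonal motion is quantised into the eight compass directions.

```python
import math

def direction_mask_index(dy, dx):
    """Classify a region motion vector into one of nine masks:
    0 = static, 1..8 = N, NE, E, SE, S, SW, W, NW."""
    if dy == 0 and dx == 0:
        return 0
    # image y grows downwards, so negate dy to get a conventional angle
    angle = math.degrees(math.atan2(-dy, dx)) % 360
    compass = ['E', 'NE', 'N', 'NW', 'W', 'SW', 'S', 'SE']
    name = compass[int(((angle + 22.5) % 360) // 45)]
    return 1 + ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW'].index(name)

def build_masks(region_mvs):
    """Group regions into the nine object masks by motion direction.
    region_mvs: region id -> (dy, dx) motion vector."""
    masks = {i: set() for i in range(9)}
    for region, (dy, dx) in region_mvs.items():
        masks[direction_mask_index(dy, dx)].add(region)
    return masks
```

Each nonempty mask (other than mask 0) is then the union of regions with coherent motion, ready to extract one foreground object.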

Figure 10.9 shows two examples of creating the object mask and extracting the object, for two video sequences. The mother and daughter sequence has a fairly stationary background and hence the motion information, in particular motion over several frames, can easily separate foreground from the background.

Figure 10.9: (a) gradient image, (b) object mask, (c) segmented object

For the BBC car sequence, because the camera follows the car, the background is not static. Here, before estimating the motion of the car, the global motion of the background needs to be compensated. In section 10.7 we will discuss global motion estimation.

The contour or shape information for each object can also be derived from the object masks and used for tracking the object through the video sequence. Once the object is extracted and its shape is defined, the shape information has to be coded and sent along with the other visual information to the decoder. In the following section several methods for coding of shapes are presented.

Shape coding

The binary and greyscale shapes are normally referred to as binary and greyscale alpha planes. Binary alpha planes are encoded with one of the binary shape coding methods (to be explained later), and the greyscale alpha planes are encoded by motion compensated DCT similar to texture coding (e.g. H.263). An alpha plane is bounded by a rectangle that includes the shape of the VOP, as described in the formation of VOP, in section 10.2.3. The bounding rectangle of the VOP is then extended on the right bottom side to multiples of 16 × 16 pixel macroblocks. The extended alpha samples are set to zero. Now the extended alpha plane can be partitioned into exact multiples of 16 × 16 pixel macroblocks. Hereafter, these macroblocks are referred to as alpha blocks, and the encoding and decoding process for block-based shape coding is carried out per alpha block.

If the pixels in an alpha block are all transparent (all zero), the block is skipped before motion and/or texture coding. No overhead is required to indicate this mode since this transparency information can be obtained from shape coding. This skipping applies to all I, P and B-VOPs. Since shape coding is unique to MPEG-4 (no other standard codecs use it), then in the following sections we pay special attention to various shape coding methods.
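The extension and transparency test can be sketched as follows, assuming a binary alpha plane stored as a list of rows of 0/1 samples:

```python
def extend_alpha_plane(alpha):
    """Pad a binary alpha plane on the right/bottom with zeros (transparent
    samples) so that both dimensions become multiples of 16."""
    h, w = len(alpha), len(alpha[0])
    H = (h + 15) // 16 * 16
    W = (w + 15) // 16 * 16
    out = [row + [0] * (W - w) for row in alpha]
    out += [[0] * W for _ in range(H - h)]
    return out

def transparent_blocks(alpha):
    """Return the set of (block_row, block_col) alpha blocks that are
    entirely transparent, and can therefore be skipped by motion and
    texture coding with no extra overhead."""
    skipped = set()
    for by in range(0, len(alpha), 16):
        for bx in range(0, len(alpha[0]), 16):
            if not any(alpha[y][x]
                       for y in range(by, by + 16)
                       for x in range(bx, bx + 16)):
                skipped.add((by // 16, bx // 16))
    return skipped
```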

Coding of binary alpha planes

A binary alpha plane is encoded in the intra mode for I-VOPs and the inter mode for P-VOPs and B-VOPs. During the development of MPEG-4 several methods for coding of the binary alpha planes have been considered. These include chain coding of the object contours, quad tree coding, modified modified READ (MMR) and context-based arithmetic encoding (CAE) [3,12]. It appears that CAE, recommended in the latest version of the verification model (VM-11) [3], is the best. Hence, in introducing these methods, more details are given of the CAE.

Chain code

In the chain code method, the object boundaries are represented by a closed contour, as shown in Figure 10.10, and the chain codes are then applied to the contour. Derivation of the contour from the binary alpha plane is similar to detection of the edge, as discussed in section 10.3. For coding, it involves moving on the contour in one of eight directions, as shown in the Figure, and coding the direction. The chain code terminates when the starting point is revisited.

Figure 10.10: Object boundaries for chain coding and the eight directions of the chain around the object

Each chain code contains start point data, followed by the first chain code and the subsequent differential chain codes. If a VOP contains several closed contours, then multiple chain codes are coded, following the data giving the number of regions.

Since a chain code has a cyclic property, a differential chain code in eight directions can be expressed in the range from -3 to 4 by the following definition:

d = cn − cn-1;  if d < −3 then d = d + 8;  if d > 4 then d = d − 8
where d is the differential chain code, cn is the current chain code and cn-1 is the previous chain code. Huffman code is used to encode the differential chain code d. The Huffman table is shown in Table 10.1.

Table 10.1: Huffman table for the differential chain code
At the receiver, after the variable length decoding of d, the current chain code, cn, is reconstructed as:

cn = (cn-1 + d + 8) mod 8

Quad tree coding

Each binary alpha block (BAB) of 16 × 16 pixels, represented by binary data (white 255 and black 0), is first quad tree segmented. The indices of the segments, according to the rules to be explained, are calculated and then Huffman coded. Figure 10.11 shows the quad tree structure employed for coding of a binary alpha block.

Figure 10.11: Quad tree representation of a shape block

At the bottom level (level-3) of the quad tree, a 16 × 16 alpha block is partitioned into 64 subblocks of 2 × 2 samples. Each higher level as shown also contains 16 × 16 pixels, but in groups of 4 × 4, 8 × 8 and 16 × 16 subblocks.

The calculation of the indices is as follows:

  • The indexing of subblocks starts at level-3, where an index is assigned to each 2 × 2 subblock pixel.
  • For the four pixels b[0] to b[3] of the subblock of Figure 10.12, the index is calculated as:

    index = 27 × b[0] + 9 × b[1] + 3 × b[2] + b[3]

    Figure 10.12: A subblock of 2 × 2 pixels

    where b[i] = 2 if the sample value is 255 (white) and b[i] = 0 if it is black. Hence there are 16 different index values, with a minimum of 0 and a maximum of 80.
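As a quick check of this indexing rule (an illustrative Python sketch; the function name is hypothetical):

```python
from itertools import product

def subblock_index(pixels):
    """Index of a 2x2 subblock: b[i] = 2 for white (255), 0 for black (0)."""
    b = [2 if p == 255 else 0 for p in pixels]
    return 27 * b[0] + 9 * b[1] + 3 * b[2] + b[3]

assert subblock_index([0, 0, 0, 0]) == 0              # all black
assert subblock_index([255, 255, 255, 255]) == 80     # all white
# all 16 black/white patterns map to distinct index values
assert len({subblock_index(p) for p in product([0, 255], repeat=4)}) == 16
```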

Step 1: indexing of subblocks at level 3

These indices then become inputs to level-2 for the generation of a new set of indices at this level. However, to increase inter block correlation, the subblocks are swapped in decreasing order of indices. The swapping also makes the distribution of index values predominantly low, and this nonuniform distribution is more efficiently variable length coded. The swap of the four subblocks is carried out according to the relationship of the neighbouring indices, in the following order. The upper and left neighbouring indices are shown in Figure 10.13.

Figure 10.13: Upper and left level indices of a subblock

  1. If upper_index[0] is less than upper_index[1], then swap index[0] with index[1] and index[2] with index[3], except for subblocks numbered 0, 1, 4, 5, 16, 17, 20 and 21.
  2. If left_index[0] is less than left_index[1], then swap index[0] with index[2] and index[1] with index[3], except for subblocks numbered 0, 2, 8, 10, 32, 34, 40 and 42.
  3. If upper_index[0] + upper_index[1] is less than left_index[0] + left_index[1], then swap index[1] with index[2], except for subblocks numbered 0, 1, 2, 4, 5, 8, 10, 16, 17, 20, 21, 32, 34, 40 and 42.
  4. The index of level-2 is computed from index[0], index[1], index[2] and index[3] after swapping according to:

    • index_level_2 = 27 × f(index[0]) + 9 × f(index[1]) + 3 × f(index[2]) + f(index[3])



The current subblock is then reconstructed and used as a reference when processing subsequent subblocks.

Step 2: grouping process for higher levels

The grouping process of blocks at the higher levels first starts at level-2, where four subblocks from level-3 are grouped to form a new subblock. The grouping process involves swapping and indexing similar to that discussed for level-3, except that at this level a 4 × 4 pixel block is represented by a 2 × 2 subblock whose elements are indices rather than pixels. The current subblock is then reconstructed and used as a reference when processing subsequent subblocks. At the decoder, swapping is done in the reverse order of the encoder steps.

The grouping process is also performed similarly for level-1 where four subblocks from level-2 are grouped to form a new subblock. The swapping, indexing and reconstruction of a subblock follows grouping, the same as that for other levels.

Now arrangement of subblocks in decreasing order of their indices at level-2, to utilise interblock correlation, is done as follows:

  1. If f(upper_index[0]) is less than f(upper_index[1]), then swap index[0] with index[1] and index[2] with index[3], except for subblocks numbered 0, 1, 4 and 5.
  2. If f(left_index[0]) is less than f(left_index[1]), then swap index[0] with index[2] and index[1] with index[3], except for subblocks numbered 0, 2, 8 and 10.
  3. If f(upper_index[0]) + f(upper_index[1]) is less than f(left_index[0]) + f(left_index[1]), then swap index[1] with index[2], except for subblocks numbered 0, 1, 2, 4, 5, 8 and 10.
  4. The index of level-1 is computed from index[0], index[1], index[2] and index[3] after swapping, according to eqn. 10.11.

At level-1 no swapping is required.

Step 3: encoding process

The encoding process involves use of results from the grouping process which produces a total of 85 (=1 + 4 + 16 + 64) indices for a 16 × 16 alpha block. Each index is encoded from the topmost level (level 0). At each level, the order for encoding and transmission of indices is shown by numbers in Figure 10.11. Indices are Huffman coded. Note that indices at level-3 can take only 16 different values, but at the other levels they take 80 different values. Hence for efficient variable length coding, two different Huffman tables are used, one with 16 symbols at level-3 and the other with 80 symbols at levels 0 to 2. These tables are shown in Appendix C.

Modified modified READ (MMR)

During the course of MPEG-4 development, another shape coding method named modified modified READ (MMR) was also investigated [12]. The basic idea in this method is to detect the pixel intensity changes (from opaque to transparent and vice versa) within a block, and to code the positions of the changing pixels. The changing pixels are defined as pixels whose colour changes while scanning an alpha block in raster scan order. Figure 10.14 illustrates the changing pixels in an intra and a motion compensated inter alpha block. Also shown in the Figure are the top reference row and left reference column of pixels, respectively.

Figure 10.14: Changing pixels (a) intra alpha block (b) inter alpha block

To code the position of the changing pixels, the following parameters, as shown in Figure 10.14, are defined.

For each length, the starting pixel is denoted by a0. Initially a0 is positioned at the right end of the top reference row and addressed as abs_a0 = -1. The first changing pixel appearing after a0 is called a1. For intra blocks, the pixel located above a0 is called b0. The area from the pixel next to b0 down to a0 is called the reference area in intraMMR. In this case the reference changing pixel is called b1. This pixel is found in reference area r1 or r2, shown in Figure 10.15a. Searching in r1, the first changing pixel whose colour is the opposite of a0 is named b1; if it is not found, the first changing pixel in r2 is named b1. Thus b1 might become identical to a0. Exceptionally, b1 may not be found when abs_a0 = -1.

Figure 10.15: Reference area for detecting the reference changing pixel (a) b1 (b) c1

For interMMR, the pixel at the position corresponding to a0 is called c0. The area from the pixel next to c0 down to the end of the alpha block is called the reference area in interMMR. The reference changing pixel in this mode is called c1. This pixel is found in reference area r1 or r2, shown in Figure 10.15b. Searching in r1, the first changing pixel whose colour is the opposite of a0 is named c1; if it is not found, the first changing pixel in r2 is named c1.

Given a0, the eventual task is to identify a1 by referring to either b1 or c1, depending on the intraMMR or interMMR mode, respectively. This is achieved in one of three modes: the vertical, horizontal and vertical pass modes. The vertical mode is considered first; if it is found not to be applicable, the horizontal mode and then the vertical pass mode are employed in turn.

In the vertical mode the position of a1 is identified relative to that of b1. It is invoked only when the absolute value of the relative distance defined by the relative address DIST = r_a1 - r_b1, is equal to or less than a predefined threshold [12]. The DIST is then variable length coded using the appropriate intraMMR and interMMR VLC tables.

If the vertical mode is not used (DIST > threshold), then the horizontal mode is considered for coding. In this mode the position of a1 is identified on the basis of the absolute distance from a0. If this distance is less than the width of the alpha block, then it is used; otherwise the vertical pass mode is used, which implies that one row of the pixels in the alpha block is passed (not coded).
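The three-way decision between the vertical, horizontal and vertical pass modes can be summarised as follows (an illustrative Python sketch; the function name, argument names and the exact comparison against the block width are assumptions based on the description above):

```python
def choose_mmr_mode(dist, threshold, a0_to_a1, block_width):
    """Pick the MMR coding mode for the next changing pixel.

    dist       : relative address DIST = r_a1 - r_b1 (or r_c1 in inter mode)
    a0_to_a1   : absolute distance from a0 to a1
    block_width: width of the alpha block
    """
    if abs(dist) <= threshold:
        return "vertical"           # code a1 relative to b1/c1
    if a0_to_a1 < block_width:
        return "horizontal"         # code a1 relative to a0
    return "vertical_pass"          # skip one row of the alpha block
```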

Finally, to decide between intraMMR and interMMR, the alpha block is first scanned in both the horizontal and vertical scanning directions, and the scan requiring the fewer bits is chosen; in the case of a tie, horizontal scanning is chosen. For the final decision between intraMMR and interMMR, again the mode giving the fewer coding bits is selected.

Context based arithmetic coding

Each intracoded binary alpha block (BAB), and the intercoded one after being motion compensated by block-based motion compensation, is context-based arithmetic encoded (CAE). In general each binary alpha block is coded according to one of the following seven modes (in C terminology):

  1. MVDs == 0 && No_Update
  2. MVDs != 0 && No_Update
  3. All_0
  4. All_255
  5. IntraCAE
  6. MVDs == 0 && InterCAE
  7. MVDs != 0 && InterCAE

The first and second modes indicate that the shape will not be updated, and All_0 and All_255 indicate that the BAB contains only black and white pixels, respectively. None of these modes is required to be arithmetic coded. Also, in the quad tree and MMR methods, All_0 and All_255 are not coded further.

IntraCAE is the mode for context-based arithmetic coding of BABs that contain a mixture of black and white pixels. In modes 6 and 7, the motion compensated interframe BABs (mixed black and white pixels) are arithmetic coded, with zero and nonzero motion vector data, respectively.

The motion vector data for shape, MVDs, is the difference between the shape motion vector, MVS, and its predictor, MVPS. The prediction procedure is similar to that of the motion vector data for texture, described in section 9.1.2. However, there are differences, such as:

  • the prediction motion vectors can be derived from the candidate motion vectors of shape MVS1, MVS2, MVS3 or the candidate motion vectors of the texture, MV1, MV2 and MV3, similar to those of H.263 illustrated in Figure 9.2; the prediction motion vector is determined by taking the first encountered motion vector that is valid; if no candidate is valid, the prediction is set to zero
  • overlapped, half pixel precision and 8×8 motion compensation is not carried out
  • if a region outside the VOP is referenced, its value is set to zero
  • for B-VOPs, only forward motion compensation is used and neither backward nor interpolated motion compensation is allowed.
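The candidate selection in the first bullet can be sketched as follows (an illustrative Python rendering; representing an invalid or absent candidate as None is an assumption, and the candidate ordering MVS1..MVS3 then MV1..MV3 follows the text):

```python
def predict_shape_mv(candidates):
    """Return the first valid candidate motion vector, or (0, 0) if none.

    candidates: [MVS1, MVS2, MVS3, MV1, MV2, MV3], each an (x, y) tuple
    or None when the candidate is not valid.
    """
    for mv in candidates:
        if mv is not None:
            return mv
    return (0, 0)
```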

It should be noted that when the shape prediction motion vector MVPS is determined, the difference between the motion compensated BAB indicated by MVPS and the current BAB is calculated. If the motion compensated error is less than a certain threshold (AlphaTH) for every 4 × 4 subblock of the BAB, MVPS is directly employed as the best prediction. If this condition is not met, a motion vector MV is searched around the prediction vector MVPS, comparing the BAB indicated by the MV against the current BAB. The MV that minimises the error is taken as the best motion vector for shape, MVS, and the motion vector data for shape is given by MVDs = MVS - MVPS.

Size conversion

Rate control and rate reduction in MPEG-4 is realised through size conversion of the binary alpha information. This method is also applicable to quad tree and MMR. It is implemented in two successive steps.

In the first step, if required, the size of the VOP can be reduced by half in each of the horizontal and vertical directions. This is indicated in the VOP header as the video object plane conversion ratio, VOP_CR, which takes a value of either 1 or 1/2. When VOP_CR is 1/2, size conversion is carried out on the original bounding box of Figure 10.5.

In the case that the value of VOP_CR is 1/2, the locally decoded shape, which is size converted at the VOP level, is stored in the frame memory of the shape frame. For shape motion estimation and compensation, if the VOP_CR of the reference shape VOP is not equal to that of the current shape VOP, the reference shape frame (not VOP) is size converted to correspond to the current shape VOP.

For P-VOPs, if the VOP_CR is 1/2, the components of the shape motion vector are measured on the downsampled shape frame. The predicted motion vector for the shape, MVPS, is calculated using only the shape motion vectors MVS1, MVS2 and MVS3.

In the second step, when required, size conversion is carried out for every binary alpha block (BAB), except for All_0, All_255 and No_Update blocks. At the block level the conversion ratio (CR) can be one of 1/2, 1/4 and 1 (the original size).

For CR = 1/2, if the average of the pixel values in a 2 × 2 pixel block is equal to or larger than 128, the pixel value of the downsampled block is set to 255; otherwise it is set to zero. For CR = 1/4, if the average of the pixels in a 4 × 4 pixel block is equal to or larger than 128, the pixel value of the downsampled block is set to 255; otherwise it is set to zero. In either case, upsampling is carried out for the BAB, and the values of the interpolated pixels are calculated from their neighbouring pixels according to their Euclidean distances.
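The averaging rule above can be sketched as follows (an illustrative Python rendering; the function name and the list-of-lists block representation are assumptions, with the downsampling factor of 2 or 4 corresponding to CR = 1/2 and 1/4):

```python
def downsample_bab(bab, factor):
    """Downsample a square binary alpha block: each factor x factor area
    becomes one pixel, set to 255 if its average is >= 128, else 0."""
    n = len(bab)
    out = []
    for r in range(0, n, factor):
        row = []
        for c in range(0, n, factor):
            total = sum(bab[r + i][c + j]
                        for i in range(factor) for j in range(factor))
            row.append(255 if total / factor ** 2 >= 128 else 0)
        out.append(row)
    return out
```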

Selection of a suitable value for the conversion ratio (CR) is done based on the conversion error between the original BAB and the BAB which is once downsampled and then reconstructed by upsampling. The conversion error is computed for each 4×4 subblock, by taking the absolute difference of pixels in the corresponding subblocks of the original and reconstructed BABs. If this difference is greater than a certain threshold, this subblock is called an error pixel block (Error_PB). Size conversion at a certain conversion ratio is accepted if there is no Error_PB at that ratio. Figure 10.16 summarises determination of the conversion ratio, CR.

Figure 10.16: CR determination algorithm

If a downsampled BAB turns out to be all transparent or all opaque, and the conversion error in every 4 × 4 subblock of the BAB is equal to or lower than the threshold, the shape information is coded as shape_mode = All_0 or All_255. Otherwise, the BAB is coded with context-based arithmetic coding at the size determined by the rate control algorithm.

Generation of context index

The pixels in the binary alpha blocks (BABs) are context-based arithmetic coded for both intra and inter modes. The size of the BAB is determined by the conversion ratio (CR): 16 × 16, 8 × 8 or 4 × 4 pixels for CR values of 1, 1/2 and 1/4, respectively.

The context-based arithmetic encoding (CAE) is a binary arithmetic coding, where the symbol probability is determined from the context of the neighbouring pixels. Such a coding is applied to each pixel of the BAB in the following manner. First, prior to encoding of each BAB, the arithmetic encoder is initialised. Each binary pixel is then encoded in the raster scan order. The process for coding a given pixel is carried out using the following steps:

  1. compute the context number
  2. index a probability table using the context number
  3. use the indexed probability to drive the arithmetic encoder.

When the final pixel has been encoded the arithmetic code is terminated.

Figure 10.17 shows the template of the neighbouring pixels that contribute to the creation of the context number for intra and inter shape pixels.

Figure 10.17: Templates for the construction of the context of (a) intra and (b) inter BABs. The pixel to be coded is marked with '?'

For intra coded BABs, a ten-bit context is calculated for each pixel, as shown in Figure 10.17a. In this Figure the pixel to be encoded is represented by ? and the ten neighbouring pixels C0 to C9 are ordered as shown; the context number is built as C = C0·2^0 + C1·2^1 + ... + C9·2^9. For inter coded BABs, in addition to spatial redundancy, temporal redundancy is exploited by using pixels from the bordered motion compensated BAB to make up part of the context, as shown in Figure 10.17b. In this mode only nine bits are required to calculate the context number, C = C0·2^0 + C1·2^1 + ... + C8·2^8.

In both modes there are some special cases to note:

  • In building contexts, any pixels outside the bounding box of the current VOP to the left and above are assumed to be zero.
  • The template may cover pixels from BABs which are not known at the decoding time. The values of these unknown pixels are estimated by template padding in the following manner:

    1. When constructing the intra context, the following steps are taken in sequence:

      1. if (C7 is unknown) C7 = C8
      2. if (C3 is unknown) C3 = C4
      3. if (C2 is unknown) C2 = C3.
    2. When constructing the inter context, the following conditional assignment is performed:

      1. if (C1 is unknown) C1 = C2.

Once the context number is calculated, it is used to index a probability table for binary arithmetic coding. Two probability tables, a ten-bit one for intra and a nine-bit one for inter BABs, are given in Appendix D. These tables contain the probabilities of a binary alpha pixel being equal to zero, for intra and inter shape coding using context-based arithmetic coding. All probabilities are normalised to the range [1, 65535].

As an example let us assume that the neighbouring pixels for an intra BAB template have a black and white pattern as shown in Figure 10.18.

Figure 10.18: An example of an intra BAB template

In this Figure, C0 = C1 = C2 = C3 = C4 = C7 = 1 and C5 = C6 = C8 = C9 = 0. Hence the context number for coding of pixel ? is C = 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^7 = 159.

If pixel ? were a black pixel, it would be coded with Intra_prob[159]. This value, according to Appendix D, is 74 out of 65535. If it were a white pixel, its probability would be 65535 - 74 = 65461 out of 65535. Such a high probability for a white pixel in this example is expected, since the pixel is surrounded by many white pixels. Note also that, although the probability table itself is fixed, as the pattern of neighbouring pixels changes the calculated context number changes, so that the probability assigned to the pixel is better suited to that pattern. This is a form of adaptive arithmetic coding which looks only at a limited number of past coded symbols. It has been shown that adaptive arithmetic coding with a limited past history compresses more efficiently than fixed arithmetic coding [13].
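The context computation of this worked example can be checked numerically (a small Python sketch; the bit weighting C = Σ Ck·2^k follows the C0..C9 labelling of the template):

```python
# Template bits C0..C9 for the black-and-white pattern of Figure 10.18:
# C0..C4 and C7 are white (1), the rest are black (0).
c = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]

# Context number: weight bit k by 2^k.
context = sum(bit << k for k, bit in enumerate(c))
assert context == 159   # matches the hand calculation in the text
```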

Greyscale shape coding

The grey level alpha plane is encoded as its support function plus the alpha values on the support. The support is obtained by thresholding the grey level alpha plane at zero, and the remaining parts constitute the alpha values, as shown in Figure 10.19.

Figure 10.19: Greyscale shape coding

The support function is encoded by binary shape coding, as described in section 10.4.5. The alpha values are partitioned into 16 × 16 blocks and encoded the same way as the luminance of texture is coded.

Motion estimation and compensation

The texture of each VOP is motion compensated prior to coding. The motion estimation and compensation is similar to that of H.263 with the exception that the blocks on the VOP borders have to be modified to cater for the arbitrary shapes of the VOPs. These modified macroblocks are referred to as polygons, and the motion estimation is called polygon-based matching. Furthermore, since shapes change from time to time, some conversion is necessary to ensure the consistency of the motion compensation.

A macroblock that lies on the VOP boundary, called a boundary macroblock, is padded by replicating the boundary samples of the VOP towards the exterior. This process is carried out by repetitive padding in the horizontal and vertical directions. In case there are macroblocks completely outside the VOP, they are padded by extended padding.

In horizontal padding, each sample at the boundary of a VOP is replicated horizontally in the left or right direction in order to fill the transparent region outside the VOP of a boundary macroblock. If there are two boundary sample values for filling a sample outside a VOP, the two boundary samples are averaged. A similar method is used for vertical padding of the boundary macroblocks in the vertical direction.
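The horizontal pass can be sketched on a single row (an illustrative Python rendering; the function name, the 0/1 alpha mask and the integer average are assumptions, and the real padding operates on whole boundary macroblocks):

```python
def pad_row(row, alpha):
    """Horizontal repetitive padding of one row: a transparent sample takes
    the nearest opaque boundary value on either side; with boundary values
    on both sides, the two are averaged."""
    n = len(row)
    out = list(row)
    for i in range(n):
        if alpha[i]:                      # opaque sample: keep as is
            continue
        # nearest opaque sample to the left and to the right
        left = next((row[j] for j in range(i - 1, -1, -1) if alpha[j]), None)
        right = next((row[j] for j in range(i + 1, n) if alpha[j]), None)
        if left is not None and right is not None:
            out[i] = (left + right) // 2  # two boundary values: average
        elif left is not None:
            out[i] = left
        elif right is not None:
            out[i] = right
    return out
```

Vertical padding then applies the same rule column by column.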

Exterior macroblocks immediately next to boundary macroblocks are filled by replicating the samples at the border of the boundary macroblocks. The boundary macroblocks are numbered in a prioritised order according to Figure 10.20.

Figure 10.20: Priority of boundary macroblocks surrounding an exterior macroblock

The exterior macroblock is then padded by replicating upwards, downwards, leftwards or rightwards the rows of samples from the horizontal or vertical border of the boundary macroblock having the largest priority number. Note that the boundary macroblocks have already been padded by horizontal and vertical repetitive padding. The remaining macroblocks (not located next to any boundary macroblock) are filled with 128. The original alpha plane for the VOP is used to exclude the pixels of the macroblocks that are outside the VOP.

The reference VOP is padded based on its own shape information. For example, when the reference VOP is smaller than the current VOP, the reference is not padded up to the size of the current VOP.

The motion estimation and compensation with the padded VOPs can be carried out in several different forms, such as integer pixel motion estimation, half and quarter sample search, unrestricted motion estimation/compensation, overlapped motion compensation and advanced mode prediction. Motion vectors are then differentially encoded, similar to H.263.

Texture coding

The intra VOPs and motion compensated inter VOPs are coded with 8 × 8 block DCT. The DCT is performed separately for each of the luminance and chrominance planes.

For an arbitrarily shaped VOP, the macroblocks that completely reside inside the VOP shape are coded with a technique identical to H.263. Boundary macroblocks that are of intra type are padded with horizontal and vertical repetition. For inter macroblocks, not only is the macroblock repeatedly padded, but the region outside the VOP within the block is also padded with zeros. These padded blocks are then coded in a manner identical to the interior blocks. Transparent blocks are skipped and therefore are not coded. Blocks that lie outside the original shape are padded with 128, 128 and 128 for the luminance and the two chrominances in the case of intra macroblocks, and with 0, 128 and 128 for inter macroblocks. Blocks that belong neither to the original nor to the coded arbitrary shape, but lie inside the bounding box of the VOP, are not coded at all.

Shape adaptive DCT

At the boundary macroblocks, the horizontally/vertically padded blocks can be coded with a standard 8 × 8 block DCT. This padding removes any abrupt transitions within a block, and hence reduces the number of significant DCT coefficients. At the decoder the added pixels are removed with the help of the shape parameters from the decoded BABs.

Since the number of opaque pixels in the 8 × 8 blocks of some of the boundary macroblocks is usually less than 64, it would be more efficient if these opaque pixels could be DCT coded without padding. This method of DCT coding is called the shape adaptive DCT (SA-DCT). The internal processing of SA-DCT is controlled by the shape information derived from the decoded BAB; hence only the opaque pixels within the boundary blocks are actually coded. As a consequence, in contrast to the standard DCT, the number of DCT coefficients in an SA-DCT is equal to the number of opaque pixels.

There are two types of shape adaptive DCT: one used for inter blocks, known as SA-DCT, and the other for intra blocks, known as ΔSA-DCT, which is an extension of SA-DCT. For both, a two-dimensional separable DCT with a varying length of basis vectors is used.

The basic concept of SA-DCT is shown in Figure 10.21. Segments of the opaque pixels are encapsulated in the picture grid of 8 × 8 pixel blocks, as shown in Figure 10.21a. Each row of pixels is shifted and aligned to the left, as shown in Figure 10.21b. The aligned pixels are then one-dimensionally DCT coded in the horizontal direction with variable basis functions, where the length of each basis function is determined by the number of pixels in that line. For example, the first pixel of the segment is represented by itself, as a DC coefficient.

Figure 10.21: An example of SA-DCT (a) original segment (b) ordering of pixels and horizontal SA-DCT (c) location of 1D coefficients (d) location of samples prior to vertical SA-DCT (e) ordering of 1D samples and vertical SA-DCT (f) location of 2D SA-DCT coefficients

The second line of the pixels is DCT coded with a three-point transform, the third line with a five-point transform, and so on. The coefficients cj of an N-point DCT are defined by:

cj = sqrt(2/N) × gj × SUM[k = 0 to N-1] xk cos((2k + 1)jπ / 2N),   j = 0, 1, ..., N - 1

where g0 = 1/sqrt(2), gj = 1 for j > 0, and xk are the pixel values along the line.

Figure 10.21c illustrates the horizontal DCT coefficients, where the DC values are represented with a dot. These coefficients now become the input to the second stage one-dimensional DCT in the vertical direction. Again they are shifted upwards and aligned to the upper border, and the N-point DCT is applied to each vertical column, as shown in Figure 10.21e. The final two-dimensional DCT coefficients are shown in Figure 10.21f. Note that, since the shape information from the decoded BABs is known, these processes of shifting and alignments in the horizontal and vertical directions are reversible.
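The two-pass transform can be sketched compactly (an illustrative Python rendering; dct_n is a standard orthonormal N-point DCT, the column regrouping emulates the upward alignment of Figure 10.21d, and the function names are hypothetical):

```python
import math

def dct_n(x):
    """Orthonormal N-point DCT with basis length N = len(x)."""
    N = len(x)
    out = []
    for j in range(N):
        g = math.sqrt(0.5) if j == 0 else 1.0
        s = sum(x[k] * math.cos(math.pi * j * (2 * k + 1) / (2 * N))
                for k in range(N))
        out.append(math.sqrt(2.0 / N) * g * s)
    return out

def sa_dct(rows):
    """rows: variable-length runs of opaque pixels, already shifted left.
    Horizontal variable-length DCT, then shift the resulting coefficient
    columns upwards and transform them vertically."""
    h = [dct_n(r) for r in rows]                       # horizontal pass
    width = max(len(r) for r in h)
    cols = [[r[c] for r in h if c < len(r)]            # shift up: collect
            for c in range(width)]                     # existing entries only
    return [dct_n(col) for col in cols]                # vertical pass
```

Note that the total number of output coefficients equals the number of opaque pixels, as the text states.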

The ∆SA-DCT algorithm that is used for intra coded macroblocks is similar to SA-DCT but with extra processing. This additional processing is simply calculating the mean of the opaque pixels in the block, and subtracting the mean from each individual pixel. The resultant zero mean opaque pixels are then SA-DCT coded. The mean value is separately transmitted as the DC coefficient.

Zigzag scanning, quantisation and variable length coding of the SA-DCT coefficients are similar to the corresponding operations on the standard DCT used in H.263.

Coding of the background

An important advantage of the content-based coding approach in MPEG-4 is that the compression efficiency can be significantly improved by not coding the video background. If the background is completely static, all the other noncontent-based codecs would not code the background either. However, due to noise, camera shaking, or deliberate camera movement, such as panning, zooming, tilt etc., the background cannot be entirely static. One of the attractive features of content-based or object-based coding is that the background movement does not need to be represented in its exact form. Viewers normally do not pay attention to the accuracy of the background video, unless it is badly coded, such that the coding distortion distracts the viewer from the main scene.

An efficient way of coding the background is to represent the background movement with a global motion model. In section 9.4.5 we saw that motion compensation with a spatial transform was capable of dealing with complex motions such as translation, shearing, zooming etc. The same concept can be used to compensate for the global motion of the background. Here the whole background VOP is transformed to match against the VOP of the previous frame. The transformation parameters then represent the information required to code the background VOP, which corresponds to an extremely low bit rate.

In MPEG-4, the global motion is either represented by six motion parameters (three motion vectors) through the affine transform, defined as:

u = a0 x + a1 y + a2
v = a3 x + a4 y + a5

or, with eight parameters (four motion vectors), using the perspective transform, defined by:

u = (a0 x + a1 y + a2) / (a6 x + a7 y + 1)      (10.14)
v = (a3 x + a4 y + a5) / (a6 x + a7 y + 1)

In both cases, similar to the bilinear transform of section 9.4.5, the coordinates (u, v) in the current frame are matched against the coordinates (x, y) in the previous frame. The best matching parameters then define the motion of the object (in this case, the background VOP).
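The affine and perspective transforms described above can be written directly as code (an illustrative Python sketch; the parameter naming a0..a7 is an assumption consistent with six- and eight-parameter models):

```python
def affine(x, y, a):
    """Six-parameter affine transform: (x, y) -> (u, v), a = [a0..a5]."""
    u = a[0] * x + a[1] * y + a[2]
    v = a[3] * x + a[4] * y + a[5]
    return u, v

def perspective(x, y, a):
    """Eight-parameter perspective transform, a = [a0..a7]."""
    w = a[6] * x + a[7] * y + 1.0          # common denominator
    u = (a[0] * x + a[1] * y + a[2]) / w
    v = (a[3] * x + a[4] * y + a[5]) / w
    return u, v
```

With a6 = a7 = 0 the perspective transform reduces to the affine one, which is why it is the more general of the two models.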

In MPEG-4, global motion compensation is based on the transmission of a static sprite. A static sprite is a (possibly large) still image, describing panoramic background. For each consecutive image in a sequence, only eight global motion parameters describing camera motion are coded to reconstruct the object. These parameters represent the appropriate perspective transform of the sprite transmitted in the first frame.

Figure 10.22 depicts the basic concept for coding an MPEG-4 video sequence using a sprite panorama image. It is assumed that the foreground object (tennis player, top right image) can be segmented from the background and that the sprite panorama image can be extracted from the sequence prior to coding. (A sprite panorama is a still image that describes the content of the background over all frames in the sequence.) The large panorama sprite image is transmitted to the receiver only once as the first frame of the sequence to describe the background. The sprite is stored in a sprite buffer. In each consecutive frame only the camera parameters relevant for the background are transmitted to the receiver. This allows the receiver to reconstruct the background image for each frame in the sequence based on the sprite. The moving foreground object is transmitted separately as an arbitrary-shape video object. The receiver composes both the foreground and background images to reconstruct each frame (bottom picture in Figure 10.22). For low delay applications it is possible to transmit the sprite in multiple smaller pieces over consecutive frames or to build up progressively the sprite at the decoder.

Figure 10.22: Static sprite of Stefan (courtesy of MPEG-4)

Global motion compensation can also be applied to the foreground objects. Here, for each VOP, a spatial transform, like the perspective transform, is used to estimate the transformation parameters. These are regarded as the global motion parameters that are used to compensate for the global motion of the foreground object. Globally motion compensated VOP is then coded by motion compensated texture coding, where it is once more motion compensated, but this time with a conventional block matching technique.

Coding of synthetic objects

Synthetic images form a subset of computer graphics that can be supported by MPEG-4. Of particular interest in synthetic images is the animation of head-and-shoulders or cartoon-like images. Animation of synthetic faces was studied in the first phase of MPEG-4, and three-dimensional body animation was addressed in the second phase of MPEG-4 development [2].

The animation parameters are derived from a two-dimensional mesh, which is a tessellation of a two-dimensional region into polygonal patches. The vertices of the polygonal patches are referred to as the node points or vertices of the mesh. In coding of the objects, these points are moved according to the movement of the body, head, eyes, lips and changes in the facial expressions. A two-dimensional mesh matched to the Claire image is shown in Figure 10.23a. Since the number of nodes representing the movement can be very small, this method of coding, known as model-based coding, requires a very low bit rate, possibly in the range of 10–100 bit/s [14].

Figure 10.23: (a) a two-dimensional mesh (b) the mapped texture

In order to make synthetic images look more natural, the texture of the objects is mapped into the two-dimensional mesh, as shown in Figure 10.23b. For coding of the animated images, triangular patches in the current frame are deformed by the movement of the node points to be matched into the triangular patches or facets in the reference frame. The texture inside each patch in the reference frame is thus warped onto the current frame, using a parametric mapping, defined as a function of the node point motion vectors.

For triangular meshes, affine mapping with six parameters (three node points or vertices) is a common choice [15]. Its linear form implies that texture mapping can be accomplished with low computational complexity. This mapping can model a general form of motion including translation, rotation, scaling, reflection and shear, and preserves straight lines. This implies that the original two-dimensional motion field can be compactly represented by the motion of the node points, from which a continuous, piecewise affine motion field can be reconstructed. At the same time, the mesh structure constrains movements of adjacent image patches. Therefore, meshes are well suited to representing mildly deformable but spatially continuous motion fields.
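As an illustration of how the six affine parameters follow from three node-point correspondences, the sketch below (function names are ours, not from the standard) solves the map x' = ax + by + c, y' = dx + ey + f for one triangular patch:

```python
def affine_from_triangles(src, dst):
    """Solve the six affine parameters a..f mapping the triangle
    src onto dst, where src and dst each hold three (x, y) node
    points.  The map is x' = a*x + b*y + c and y' = d*x + e*y + f;
    three point correspondences give exactly six linear equations."""
    (x0, y0), (x1, y1), (x2, y2) = src
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)

    def solve(v0, v1, v2):  # fit p*x + q*y + r to values at the nodes
        p = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det
        q = ((v2 - v0) * (x1 - x0) - (v1 - v0) * (x2 - x0)) / det
        return p, q, v0 - p * x0 - q * y0

    a, b, c = solve(dst[0][0], dst[1][0], dst[2][0])  # x' equations
    d, e, f = solve(dst[0][1], dst[1][1], dst[2][1])  # y' equations
    return a, b, c, d, e, f
```

Warping the texture of a patch then amounts to evaluating this map at each pixel of the triangle, which is why the linearity of the affine form keeps texture mapping computationally cheap.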

However, if the movement is more complex, like the motion of lips, then affine modelling may fail. For example, Figure 10.24 shows the reconstructed picture of Claire, after nine frames of affine modelling. The accumulated error due to model failure around the lips is very evident.

Figure 10.24: Reconstructed model-based image with the affine transform

For larger, more complex motion, requiring a more severe patch deformation, one can use quadrilateral mappings with eight degrees of freedom. Bilinear and perspective mappings are mappings of this kind, and have a better deformation capability than affine mapping [16, 17].

Coding of still images

MPEG-4 also supports coding of still images with a high coding efficiency as well as spatial and SNR scalability. The coding principle is based on the discrete wavelet transform, which was described in some length in Chapter 4. The lowest subband after quantisation is coded with a differential pulse code modulation (DPCM) and the higher bands with a variant of embedded zero tree wavelet (EZW) [18]. The quantised DPCM and zero tree data are then entropy coded with an arithmetic encoder. Figure 10.25 shows a block diagram of the still image encoder.

Figure 10.25: Block diagram of a wavelet-based still image encoder

In the following sections each part of the encoder is described.

Coding of the lowest band

The wavelet coefficients of the lowest band are coded independently from the other bands. These coefficients are DPCM coded with a uniform quantiser. The prediction for coding a wavelet coefficient wx is taken from its neighbouring coefficients wA or wC, according to:

if |wA - wB| < |wB - wC| then wprd = wC, otherwise wprd = wA

The difference between the actual wavelet coefficient wx and its predicted value wprd is coded. The positions of the neighbouring coefficients are shown in Figure 10.26.

Figure 10.26: Prediction for coding the lowest band coefficients
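The selection between the two neighbours can be sketched as follows (a sketch of the rule described above; function names and the assumed left/diagonal/top layout of wA, wB and wC are ours):

```python
def predict_lowest_band(wA, wB, wC):
    """Choose the DPCM prediction for a lowest-band coefficient wx.
    wA, wB and wC are the neighbouring coefficients of Figure 10.26
    (left, diagonal and top, under our assumed layout).  When the
    gradient |wA - wB| is smaller than |wB - wC|, wC is taken as
    the prediction; otherwise wA is used."""
    if abs(wA - wB) < abs(wB - wC):
        return wC
    return wA

def dpcm_residual(wx, wA, wB, wC):
    # the coded value is the difference between wx and its prediction
    return wx - predict_lowest_band(wA, wB, wC)
```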

The coefficients after DPCM coding are encoded with an adaptive arithmetic coder. First the minimum value of the coefficients in the band is found. This value, known as band_offset, is subtracted from all the coefficients to limit their lower bound to zero. The maximum value of the coefficients, band_max_value, is also calculated. These two values are included in the bit stream.

For adaptive arithmetic coding [19], the arithmetic coder model is initialised at the start of coding with a uniform distribution in the range of 0 to band_max_value. Each quantised and DPCM coded coefficient after arithmetic coding is added to the distribution. Hence, as the encoding progresses, the distribution of the model adapts itself to the distribution of the coded coefficients (adaptive arithmetic coding).
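A minimal sketch of such an adaptive model (class name ours; the interval arithmetic of the coder itself is omitted):

```python
class AdaptiveModel:
    """Symbol model for the adaptive arithmetic coder: start with a
    uniform distribution over 0..band_max_value and add each coded
    coefficient to the counts, so the model gradually adapts to the
    statistics of the data."""
    def __init__(self, band_max_value):
        self.counts = [1] * (band_max_value + 1)   # uniform start
        self.total = band_max_value + 1

    def probability(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):                      # after coding a symbol
        self.counts[symbol] += 1
        self.total += 1

# band_offset shifts the coefficients so their lower bound is zero:
coeffs = [-3, 0, 5, 5, 2]
band_offset = min(coeffs)                # included in the bit stream
shifted = [c - band_offset for c in coeffs]
band_max_value = max(shifted)            # also included
model = AdaptiveModel(band_max_value)
for s in shifted:
    model.update(s)                      # frequent values gain weight
```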

Coding of higher bands

For efficient compression of higher bands as well as for a wide range of scalability, the higher order wavelet coefficients are coded with the embedded zero tree wavelet (EZW) algorithm first introduced by Shapiro [18]. Details of this coding technique were given in Chapter 4. Here we show how it is used within the MPEG-4 still image coding algorithm.

Figure 10.27 shows a multiscale zero tree coding algorithm, based on EZW, for coding of the higher bands. The wavelet coefficients are first quantised with a quantiser Q0. The quantised coefficients are scanned with the zero tree concept (exploiting similarities among the bands of the same orientation) and then entropy coded with an arithmetic coder (AC). The generated bits comprise the first portion of the bit stream, the base layer data BS0. The quantised coefficients of the base layer, after inverse quantisation, are subtracted from the input wavelet coefficients, and the residual quantisation distortions are requantised by another quantiser, Q1. These are then zero tree scanned (ZTS) and entropy coded to form the second portion of the bit stream, BS1. The procedure is repeated for all the quantisers, Q0 to QN, to generate N + 1 layers of the bit stream.
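The layered requantisation of the residuals can be sketched as follows (function names ours; zero tree scanning and arithmetic coding of each layer are omitted). Truncation towards zero gives each uniform quantiser a dead zone of twice its step size, matching the description in the text:

```python
def multiscale_layers(coeffs, step_sizes):
    """Quantise the coefficients in layers, as in Figure 10.27:
    each quantiser Qi codes what the previous, coarser layers left
    behind.  int(r/q) truncates towards zero, so values in (-q, q)
    map to 0, i.e. a dead zone of twice the step size."""
    residual = list(coeffs)
    layers = []
    for q in step_sizes:                         # Q0 (coarsest) .. QN
        indices = [int(r / q) for r in residual]
        layers.append(indices)
        # subtract the inverse-quantised values; the next layer
        # codes only the remaining quantisation distortion
        residual = [r - i * q for r, i in zip(residual, indices)]
    return layers, residual

def reconstruct(layers, step_sizes):
    """A decoder holding the first k layers simply sums their
    inverse-quantised values (SNR scalability)."""
    out = [0.0] * len(layers[0])
    for indices, q in zip(layers, step_sizes):
        out = [o + i * q for o, i in zip(out, indices)]
    return out
```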

Figure 10.27: A multiscale encoder of higher bands

The quantisers used in each layer are uniform with a dead zone of twice the quantiser step size of that layer. The quantiser step size in each layer is specified by the encoder in the bit stream. Since the quantisation is multilayer, the quantiser step size of a lower layer is several times that of its immediate upper layer. This is because, for a linear quantiser Qi with a quantiser step size of qi, the maximum residual quantisation distortion is qi (for coefficients that fall in the dead zone of Qi) or qi/2 (for those that are quantised). Hence, for a higher layer quantiser Qi+1 with a quantiser step size of qi+1 to be efficient, qi+1 should be several times smaller than qi. If qi+1 = qi/2, the quantiser becomes a bilevel quantiser.

The number of quantisers indicates the number of SNR-scalable layers, and the quantiser step size in each layer determines the granularity of SNR scalability at that layer. For finest granularity of SNR scalability, all the layers can use a bilevel (one bit) quantiser. In this case, for optimum encoding efficiency, the quantiser step size of each layer is exactly twice that of its immediate upper layer. Multistage quantisation in this mode now becomes quantisation by successive approximation or bit plane encoding, described in Chapter 4. Here, the number of quantisers is equal to the number of bit planes, required to represent the wavelet transform coefficients. In this bilevel quantisation, instead of quantiser step sizes, the maximum number of bit planes is specified in the bit stream.
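With bilevel quantisers and step sizes halving from layer to layer, the layering reduces to bit plane coding; a sketch for a nonnegative integer coefficient (function names ours):

```python
def bit_planes(value, n_planes):
    # one bit per layer, most significant plane (coarsest step) first
    return [(value >> p) & 1 for p in range(n_planes - 1, -1, -1)]

def from_bit_planes(planes):
    # successive approximation: each plane halves the uncertainty
    v = 0
    for b in planes:
        v = (v << 1) | b
    return v
```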

As the Figure shows, the quantised coefficients are zero tree scanned (ZTS) to exploit similarities among the bands of the same orientation. The zero tree takes advantage of the principle that if a wavelet coefficient at a lower frequency band is insignificant, then all the wavelet coefficients of the same orientation at the same spatial location are also likely to be insignificant. A zero tree exists at any node, when a coefficient is zero and all the node's children are zero trees. The wavelet trees are efficiently represented and coded by scanning each tree from the root at the lowest band through the children and assigning symbols to each state of the tree.

If a multilevel quantiser is used, then each node takes one of three symbols: zero tree root (ZT), value zero tree root (VZ) and value (V). A zero tree root symbol denotes a coefficient that is the root of a zero tree. When such a symbol is coded, the zero tree does not need to be scanned further, because it is known that all the coefficients in such a tree have zero values. A value zero tree root symbol marks a node where the coefficient has a nonzero value and all four of its children are zero tree roots. The scan of the tree can stop at this symbol. A value symbol identifies a coefficient whose value may be zero or nonzero, but with at least one nonzero descendant, so scanning must continue down the tree. The symbols and the quantised coefficients are then entropy coded with an adaptive arithmetic coder.

When a bilevel quantiser is used, the value of each coefficient is either 0 or 1. Hence, depending on the implementation procedure, different types of symbol can be defined. Since multilayer bilevel quantisation is in fact quantisation by successive approximation, then this mode is exactly the same as coding the symbols at EZW. There we defined four symbols of +, -, ZT and Z, where ZT is a zero tree root symbol, and Z is an isolated zero within a tree and + and - are the values for refinement (see section 4.5 for details).
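For the multilevel case, the three-symbol decision at each node can be sketched like this (a toy recursive tree representation of ours, not the bit stream syntax):

```python
def classify(tree):
    """Return the symbol for a node of a quantised wavelet tree.
    A tree is (value, children), with four child trees per interior
    node and an empty child list at the leaves."""
    value, children = tree

    def all_zero(t):                 # is this subtree a zero tree?
        v, ch = t
        return v == 0 and all(all_zero(c) for c in ch)

    if value == 0 and all(all_zero(c) for c in children):
        return 'ZT'                  # zero tree root: stop scanning
    if value != 0 and children and all(all_zero(c) for c in children):
        return 'VZ'                  # nonzero value, children all zero trees
    return 'V'                       # scanning must continue downwards
```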

In order to achieve both spatial and SNR scalability, two different scanning methods are employed in this scheme. For spatial scalability, the wavelet coefficients are scanned from subband to subband, starting from the lowest frequency band to the highest frequency band. For SNR scalability, the wavelet coefficients are scanned quantiser to quantiser. The scanning method is defined in the bit stream.

Shape adaptive wavelet transform

Shape adaptive wavelet (SA-wavelet) coding is used for compression of arbitrary shaped textures. SA-wavelet coding is different from the regular wavelet coding mainly in its treatment of the boundaries of arbitrary shaped texture. The coding ensures that the number of wavelet coefficients to be coded is exactly the same as the number of pixels in the arbitrary shaped region, and coding efficiency at the object boundaries is the same as for the middle of the region. When the object boundary is rectangular, SA-wavelet coding becomes the same as regular wavelet coding.

The shape information of an arbitrary shaped region is used in performing the SA-wavelet transform in the following manner. Within each region, the first row of pixels belonging to that region and the first segment of consecutive pixels in that row are identified. Depending on whether the starting point of the segment has odd or even coordinates, and on whether the number of pixels in the segment is odd or even, the proper arrangements for 2:1 downsampling and for the use of symmetric extensions are made [3].

Coding of the SA-wavelet coefficients is the same as coding of regular wavelet coefficients, except that a modification is needed to handle partial wavelet trees that have wavelet coefficients corresponding to pixels outside the shape boundary. Such wavelet coefficients are called out nodes of the wavelet trees. Coding of the lowest band is the same as that of the regular wavelet, but the out nodes are not coded. For the higher bands, for any wavelet trees without out nodes, the regular zero tree is applied. For a partial tree, a minor modification to the regular zero tree coding is needed to deal with the out nodes. That is, if the entire branch of a partial tree has out nodes only, no coding is needed for this branch, because the shape information is available to the decoder to indicate this case. If a parent node is not an out node, all the children out nodes are set to zero, so that the out nodes do not affect the status of the parent node as the zero tree root or isolated zero. At the decoder, shape information is used to identify such zero values as out nodes. If the parent node is an out node and not all of its children are out nodes, there are two possible cases. The first case is that some of its children are not out nodes, but they are all zeros. This case is treated as a zero tree root and there is no need to go down the tree. The shape information indicates which children are zeros and which are out nodes. The second case is that some of its children are not out nodes and at least one of such nodes is nonzero. In this case, the out node parent is set to zero and the shape information helps the decoder to know that this is an out node, and coding continues further down the tree. There is no need to use a separate symbol for any out nodes.

Video coding with the wavelet transform

The success of the zero tree in efficient coding of wavelet transform coefficients has encouraged researchers to use it for video coding. Although wavelet-based video coding is not part of the standard, there is no reason why it cannot be used in the future. This of course depends on its encoding efficiency and functionalities. For this reason, in this section we look at some video coding scenarios and examine their encoding efficiency.

One approach to wavelet-based video coding is to use the generic video encoder of Figure 3.18 with the DCT replaced by the DWT, as shown in Figure 10.28. Here, variable length coding of the quantised coefficients is replaced by zero tree coding, either EZW or SPIHT [18,21]. Also, overlapped motion compensation has been found to be very effective with the wavelet transform, as it prevents motion compensation blocking artefacts from creating spurious vertical and horizontal edges.

Figure 10.28: Block diagram of a wavelet video codec

Another method that might be used with the wavelet transform is shown in Figure 10.29.

Figure 10.29: A hybrid H.263/wavelet video coding scheme

Each frame of the input video is transformed into n-band wavelet subbands. The lowest LL band is fed into a DCT-based video encoder, such as MPEG-1 or H.263. The other bands undergo a hierarchical motion compensation. First, the three high frequency bands of the last stage are motion compensated using the motion vectors from MPEG-1/H.263. The reconstructed picture from these four bands (LL, LH, HL and HH), which is the next level LL band, only requires a ±1 pixel refinement [20]. The other three bands at this stage are also motion compensated by the same amount. This process is continued for all the bands. Hence at the end all the bands are motion compensated. Now, these motion compensated bands are coded with a zero tree-based coding method (EZW or SPIHT).

Virtual zero tree (VZT) algorithm

The lowest band of the wavelet, LL, is a reduced size replica of the original video, and hence can be encoded with an efficient encoder such as H.263. When zero tree-based coding such as EZW/SPIHT is used along with the standard codecs (e.g. Figure 10.29), it encounters some problems. First, the subband decomposition stops when the top level LL band reaches a size of SIF/QSIF or subQSIF. At these levels there will be too many clustered zero tree roots. This is very common for static parts of the picture or when motion compensation is very efficient. Even for still images or I-pictures, a large part of the picture may contain only low spatial frequency information. As a result, at the early stages of quantisation by successive approximation, where the yardstick is large, the vast majority of the wavelet coefficients fall below the yardstick. Secondly, even if the subband decomposition is taken to more stages, such that the top stage LL is a small picture of 16 × 16 pixels (e.g. Figure 10.28), it is unlikely that many zero trees can be generated. In other words, with a higher level of wavelet decomposition the tree structure is bound to break, and hence the efficiency of EZW/SPIHT is greatly reduced.

To improve the efficiency of zero tree-based coding, we have devised a version of it called virtual zero tree (VZT) [22]. The idea is to build trees outside the image boundary (hence the word virtual) as an extension to the existing trees that have roots in the top stage, so that the significance map can be represented in a more efficient way. It can be imagined as replacing the top level LL band with zero value coefficients. These coefficients represent the roots of wavelet trees of several virtual subimages in normal EZW/SPIHT coding, although no decomposition and decimation actually takes place, as demonstrated in Figure 10.30.

Figure 10.30: A virtual zero tree

In this Figure, virtual trees, or a virtual map, are built in the virtual subbands on the high frequency bands of the highest stage. Several wavelet coefficients of the highest stage form a virtual node at the bottom level of the virtual map. Then, in the virtual map, four nodes of a lower level are represented by one node of a higher level, in the same way as a zero tree is formed in EZW/SPIHT coding. The virtual map has only two symbols: VZT root or nonVZT root. If the four nodes of a 2 × 2 block on any level of a virtual tree are all VZT roots, the corresponding node on the higher level will also be a VZT root; otherwise the node on the higher level will be a nonVZT node. This effectively constructs a long rooted tree of clustered real zero trees. A node on the bottom level of the virtual map is a VZT root only when the four luminance coefficients of a 2 × 2 block and their two corresponding chrominance coefficients on the top stage wavelet band are all zero tree roots. Chrominance pictures are also wavelet decomposed and, for the 4:2:0 image format, four zero tree roots of the luminance and one from each chrominance can be made a composite zero tree root [22].
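The construction of the virtual map can be sketched as follows (names ours; the composite luminance/chrominance root of the 4:2:0 case is left out for brevity):

```python
def virtual_map(top_stage_roots):
    """Build the virtual map level by level.  top_stage_roots is a
    square 2-D list of booleans, True where a top stage coefficient
    is a real zero tree root.  A node one level up is a VZT root
    only when all four nodes of its 2 x 2 block are VZT roots."""
    levels = [top_stage_roots]
    grid = top_stage_roots
    while len(grid) > 1:
        h, w = len(grid) // 2, len(grid[0]) // 2
        grid = [[grid[2*y][2*x] and grid[2*y][2*x + 1] and
                 grid[2*y + 1][2*x] and grid[2*y + 1][2*x + 1]
                 for x in range(w)] for y in range(h)]
        levels.append(grid)
    return levels   # bottom virtual level first, single virtual root last
```

A single True at the final level stands for a whole cluster of real zero trees, which is exactly what lets VZT code the cluster with only a few bits.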

Coding of high resolution video

A wavelet transform can be an ideal tool for coding of high resolution pictures or video. This is because the high correlation among the dense pixels of high resolution pictures is better exploited if a larger area for decorrelation of pixels is used. Unfortunately, in the standard video codecs, the DCT block sizes are fixed at 8 × 8, and hence pixel correlation beyond eight-pixel distances cannot be exploited. On the other hand, in the wavelet transform, increasing the number of decomposition levels means decorrelating the pixels at larger distances if so desired.

We have tested the encoding efficiency of the wavelet transform (for Figure 10.29) for high definition television (HDTV) as well as super HDTV (SHD). For HDTV, we have used the test sequence Gaynor (courtesy of the BBC). For this image sequence, a two-stage (seven-band) wavelet transform is used and the LL band of SIF size was MPEG-1 coded. Motion compensated higher bands were coded with VZT and EZW. For the SHD video, a park sequence with 2048 × 2048 pixels at 60 Hz (courtesy of NHK Japan [21]) was used. The SHD images after three-stage subband decomposition result in ten bands. The LL band, with a picture resolution of 256 × 256 pixels, is MPEG-1 coded; the remaining nine bands are VZT and EZW coded in two separate experiments.

It appears at first that, by creating virtual nodes, we have increased the number of symbols to be coded, and hence the bit rate tends to increase rather than decrease. However, these virtual roots cluster the zero tree roots into a bigger zero tree root, so that instead of coding these roots one by one at the expense of a large overhead, as simple EZW does, we can code the whole cluster by a single VZT symbol with only a few bits. VZT is more powerful at the early stages of encoding, where the vast majority of top stage coefficients are zero tree roots. This can be seen from Table 10.2, where a complete breakdown of the total bit rate required to code a P-picture of the park sequence by both methods is given.

Table 10.2: Breakdown bit rate in coding of a P-picture with VZT and EZW. For each of the five passes, the VZT columns give the bits of the dominant pass (split into a virtual pass and a real pass) and of the subordinate pass; the EZW columns give the bits of the dominant and subordinate passes. A grand total row closes the table. (The numeric entries of the table are not reproduced here.)

The first row of the Table shows that 171 kbits are used to code the LL band by MPEG-1. The second row shows that 15 kbits are used for the additional ±1 pixel refinement in all bands. For the higher bands the image is scanned in five passes, where the bits used in the dominant and subordinate passes of VZT and EZW are shown. In VZT the dominant pass is made up of two parts: one used in coding of the virtual nodes, and the other for the real data in the actual nine bands. Note that although some bits are used to code the virtual nodes (which are not needed in EZW), the total number of bits of the dominant pass in VZT is much less than in EZW. The number of bits in the subordinate passes, which code the real subordinate data, is the same for both methods. In the Table the grand total is the total number of bits used to code the P-frame under the two coding schemes. It can be seen that VZT requires two-thirds of the bit rate required by EZW.

For HDTV, our results show that, although a good quality video at 18 Mbit/s can be achieved under EZW, VZT only needs 11 Mbit/s [22].

Coding of low resolution video

Although coding of low spatial resolution (e.g. QCIF, subQCIF) video may not benefit from wavelet decomposition to the extent that higher resolution video does, zero tree coding is nevertheless efficient enough to produce good results. In fact, the multiresolution property of wavelet-based coding is a bonus that DCT-based coding lacks. In section 8.5.7, we saw that two-layer spatial and SNR scalability of the standard codecs (MPEG, H.263) reduces the compression efficiency by as much as 30 per cent. That is, spatial/SNR coders are required to generate about 30 per cent more bits to produce the same video quality as a single layer coder does. Undoubtedly, increasing the number of layers, or combining them in a hybrid form, will widen this deficiency gap.

On the other hand, with the wavelet transform, as the number of decomposition levels increases, the encoding efficiency increases too. Moreover, with the concept of virtual zero tree a large area of static parts of the picture can be grouped together for efficient coding, as we have seen from the results of higher resolution pictures of the previous section.

To compare the quality of wavelet-coded video (purely wavelet-based video codec of Figure 10.28) against the best of the standard codec, we have coded 10 Hz QCIF size of Akio and Carphone standard video test sequences at various bit rates, as shown in Figure 10.31, for Akio and Figure 10.32 for Carphone.

Figure 10.31: Quality of QCIF size Akio sequence coded at various bit rates

Figure 10.32: Quality of QCIF size Carphone sequence coded at various bit rates

For the wavelet coder, we have used three levels of wavelet decomposition (ten bands) and two levels of virtual nodes, with an SPIHT-type zero tree coding called virtual SPIHT [23]. Unlike Figure 10.29, where the lowest subband was coded by a standard DCT codec, here we have coded all the bands, including the LL band, with the virtual SPIHT (Figure 10.28). This is because, after three-level wavelet decomposition, the lowest LL band has a dimension of 22 × 18 pixels, which is neither viable nor practical to code with codecs using 16 × 16 pixel macroblocks. For motion compensation, the whole QCIF image was motion compensated with an overlapped block matching algorithm, with half a pixel precision. Motion compensated wavelet coefficients after the zero tree scan were arithmetic coded.

For comparison we have also coded these sequences with single and two-layer SNR scalable H.263 codecs, at various bit rates from 20 kbit/s up to 60 kbit/s. For the two-layer H.263, 50 per cent of the total bits were assigned to the base layer. As the PSNR plots of Figures 10.31 and 10.32 show, single-layer H.263 is consistently better than the wavelet codec, which itself is better than the two-layer H.263. However, subjectively the wavelet-coded video appears better than single-layer H.263, despite being 1–2 dB poorer in PSNR.

Figures 10.33 and 10.34 show snapshots of Akio and Carphone, coded at 20 kbit/s and 40 kbit/s, respectively. Although these small pictures may not show the subjective quality differences between the two codecs, the fact is that the H.263 picture is blocky while that of the wavelet is not. In the Carphone picture, the wavelet-coded picture is smeared but is still much sharper than the two-layer H.263. Some nonexperts prefer the smeared picture to the blocky one, but expert viewers have a different opinion. Comparing the wavelet with the two-layer H.263, both the subjective and objective results are in favour of the wavelet.

Figure 10.33: A snapshot of Akio coded at 20 kbit/s, 10 Hz (a) H.263; (b) SNR-scalable H.263; (c) wavelet SPIHT

Figure 10.34: A snapshot of Carphone coded at 40 kbit/s, 10 Hz (a) H.263; (b) SNR-scalable H.263; (c) wavelet SPIHT

Considering that the above three-level wavelet decomposition can be regarded as N-layer video coding without impairing compression efficiency, the wavelet transform has a high potential for multilayer coding of video. Although this video codec is not part of any standard, this potential makes it very attractive.


Scalability

We have covered scalability in the standard codecs several times, but owing to the nature of content-based coding, scalability in MPEG-4 can differ from that in the other standard codecs. However, since MPEG-4 is also used as a frame-based video codec, the scalability methods discussed so far can be used in this codec as well. Here we introduce two new methods of scalability that are defined only for MPEG-4.

Fine granularity scalability

The scalability methods we have seen so far, SNR, spatial and temporal, are normally carried out in two or more layers. These create coarse levels of quality, spatial and temporal resolution. A few spatial and temporal resolutions are quite acceptable for natural scenes, as it is of no practical use to have more resolution levels of these kinds. Hence, for frame-based video, MPEG-4 also recommends the spatial and temporal scalabilities as we have discussed so far.

On the other hand, the existing SNR scalability has the potential to be represented in more quality levels, and such increments in quality are both practical and appreciated by the human observer. Thus in MPEG-4, instead of SNR scalability, an analogous method named fine granularity scalability (FGS) is recommended. In this method the base layer is coded like the base layer of an SNR scalable coder, namely coding of video at a given frame rate with a relatively large quantiser step size. Then the difference between the original DCT coefficients and the quantised coefficients of the base layer (the base layer quantisation distortion), rather than being quantised with a finer quantiser step size as in an SNR scalable coder, is represented in bit planes. Starting from the highest bit plane that contains nonzero bits, each bit plane is successively coded using run length coding, on a block by block basis. The codewords for the run lengths can be derived either from Huffman or arithmetic coding. Typically, different codebooks are used for different bit planes, because the run length distributions differ across the bit planes.
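The bit plane run length step can be sketched as follows (a simplified per-block illustration of ours; the entropy coding of the runs is omitted):

```python
def fgs_bit_plane_runs(residuals, plane):
    """Extract one bit plane of the magnitudes of the base layer
    quantisation residuals and run length code it as the lengths of
    the zero runs that precede each '1' bit."""
    bits = [(abs(r) >> plane) & 1 for r in residuals]
    runs, run = [], 0
    for b in bits:
        if b == 0:
            run += 1
        else:
            runs.append(run)   # zeros seen before this '1'
            run = 0
    return runs, run           # runs, plus the trailing zero count
```

Truncating the bit stream after any plane still yields a decodable, coarser residual, which is what gives FGS its fine granularity.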

Object based scalability

In the various scalability methods described so far, including FGS, the scalability operation is applied to the entire frame. In object-based scalability, the scalability operations are applied to the individual objects. Of particular interest to MPEG-4 is object-based temporal scalability (OTS), where the frame rate of a selected object is enhanced, such that it has a smoother motion than the remaining area. That is, the temporal scalability is applied to some objects to increase their frame rates against the other objects in the frames.

MPEG-4 defines two types of OTS. In type-1 the video object layer 0 (VOL0) comprises the background and the object of interest. Higher frame rates for the object of interest are coded at VOL1, as shown in Figure 10.35, where it is predictively coded, or Figure 10.36, if the enhancement layer is bidirectionally coded.

Figure 10.35: OTS enhancement structure of type-1, with predictive coding of VOL1

Figure 10.36: OTS enhancement structure of type-1, with bidirectional coding of VOL1

In type-2 OTS, the background is separated from the object of interest, as shown in Figure 10.37. The background, VO0, is sent at a low frame rate without scalability. The object of interest is coded at the two frame rates of the base and enhancement layers, VOL0 and VOL1, respectively, as shown in the Figure.

Figure 10.37: OTS enhancement structure of type-2

MPEG-4 versus H.263

There is a strong similarity between the simple profile of MPEG-4 and H.263, such that a simple profile MPEG-4 decoder can decode a bit stream generated by the core H.263. In fact most of the annexes introduced into H.263 after the year 2000 were parts of the profiles designed for MPEG-4 in the mid 1990s. However, there is no absolute reverse compatibility between the simplest profile of MPEG-4 and H.263.

The main source of incompatibility is concerned with the systems function of MPEG-4. Like its predecessor, MPEG-4 uses a packet structure for transporting compressed audio-visual data. In MPEG-2 we saw that packetised elementary streams (PES) of audio, video and data are multiplexed into a transport packet stream. In MPEG-4, due to the existence of many objects (up to 256 visual objects plus audio and data), the transport stream becomes particularly important. Even if only one object is coded (frame-based mode), the same transport stream has to be used. In particular, for transmission of MPEG-4 video over mobile networks, the packetised system has to add extra error resilience to the MPEG-4 bit stream. It is this packetised format of MPEG-4 that makes it nondecodable by an H.263 codec.

In MPEG-4 the elementary objects of the audio-visual media are fed to the synchronisation layer (SL) to generate SL-packets. An SL-packet has a resynchronisation marker, similar to the GOB and slice resynchronisation markers in H.263. However, in MPEG-4 the resynchronisation marker is inserted after a certain number of bits are coded, whereas in H.263 markers are added after several macroblocks. Since the number of bits generated per macroblock is variable, the distances between the resynchronisation markers in H.263 are variable, while those of MPEG-4 are fixed. This helps to reduce the effect of channel errors in MPEG-4, hence making it more error resilient.
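The fixed-spacing effect can be sketched as follows (a toy illustration of ours; mb_bits holds the number of coded bits per macroblock, and a marker is placed at the first macroblock boundary after every `interval` bits):

```python
def marker_positions(mb_bits, interval):
    """Return the macroblock indices after which a resynchronisation
    marker is inserted, MPEG-4 style: spacing is roughly constant in
    bits, regardless of how many macroblocks that covers.  In H.263,
    by contrast, markers placed every fixed number of macroblocks
    end up at variable bit distances."""
    positions, acc = [], 0
    for i, bits in enumerate(mb_bits):
        acc += bits
        if acc >= interval:
            positions.append(i + 1)  # marker after this macroblock
            acc = 0
    return positions
```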

A copy of the picture header is repeated in every SL-packet, to make them independent of each other. Each medium may generate more than one SL-packet. For example, if scalability is used, several interrelated packets are generated. Packets are then multiplexed into the output stream. This layer is unaware of the transport or delivery layer. The sync layer is interfaced to the delivery layer through the delivery multimedia integration framework (DMIF) application interface (DAI). The DAI is network independent but demands session setup and stream control functions. It also enables setting up quality of service for each stream.

The transport or delivery layer is delivery aware but unaware of media. MPEG-4 does not define any specific delivery layer. It relies mainly on the existing transport layers, such as RTP for Internet, MPEG-2 transport stream for wired and wireless or ATM for B-ISDN networks.

There are also some small differences in the employment of the coding tools in the two codecs. For example, the reversible VLC used in the simple profile of MPEG-4 is not exactly the same as the RVLC used in data partitioning of Annex V of H.263. In the former, RVLC is also used for the DCT coefficients, whereas in the latter it is used for the non-DCT parameters, e.g. motion vectors, macroblock addresses etc. Although in Chapter 9 we showed that RVLC, owing to its higher overhead compared with conventional VLC, is not viable for the DCT coefficients, in those experiments the macroblocks of the P-pictures were all interframe coded. Had any macroblock been intraframe coded, its retrieval through RVLC would have improved the picture quality.

These differences are significant enough to raise the question of which of H.263 and MPEG-4 is more suitable for video over mobile networks. In the late 1990s, an international collaboration under the 3GPP (3rd Generation Partnership Project) set up an extensive investigation to compare the performance of these two codecs [24]. The outcome of one of the experiments is shown in Figure 10.38.

click to expand
Figure 10.38: Comparison between MPEG-4 and H.263

In this experiment wireless channels of 64 kbit/s were used, of which 7.6 kbit/s were assigned to the speech signal. The remaining bits were the target video bit rates, including the overheads, for coding of the Overtime video test sequence with the two codecs. The channel error rates were set to 10⁻⁶, 2 × 10⁻⁴ and 10⁻³, for both fixed and mobile sets, identified by F and M in the figure, respectively. The H.263 codec was equipped with annexes D, F, I, J and N (see the list of H.263 annexes in Chapter 9 [25]) and MPEG-4 used the simple profile (see section 10.1). The quality of the reconstructed pictures after error concealment was subjectively evaluated. The average of the viewers' scores (see section 2.4), the mean opinion score (MOS), is plotted in Figure 10.38 against the tested error rates.

At the low error rate of 10⁻⁶, H.263 outperforms MPEG-4 in both the mobile and the fixed environment. It can thus be concluded that H.263 is a more compression efficient encoder than MPEG-4. Much of this is due to the use of RVLC in MPEG-4: as we saw in Table 9.3, RVLC is less compression efficient than the conventional VLC. Other overheads, such as packet headers and more frequent resynchronisation markers, add to this. On the other hand, at high error rates the superior error resilience of MPEG-4 over H.263 compensates for its compression inefficiency, and in some cases the decoded video under MPEG-4 is perceived to be better than that under H.263.

Considering the above analysis, one cannot say for sure whether H.263 is better or worse than MPEG-4. In fact, what makes a bigger impact on the experimental results is not the basic H.263 or MPEG-4 definition but the annexes or options of the two codecs. For example, had data partitioning (Annex V) been used in H.263, it would have performed as well as MPEG-4 at high error rates. Unfortunately, data partitioning in H.263 was introduced in 2001, while this study was carried out in 1997. Hence the 3GPP project fails to show the true difference between H.263 and MPEG-4.



Figure 10.39 shows a part of a binary alpha plane, where shaded pixels are represented by 1 and blank pixels by 0:

  1. calculate the context number for pixels A, B and C
  2. use the intra table of Appendix D to calculate the probability for coding of alpha pixels A, B and C.

Figure 10.39

For A: c0 = c2 = c3 = 1, index = 1 + 4 + 8 = 13. The frequency table given in Appendix D is for prob(0), hence prob(0) = 29789; but since A is a 1 pixel, its probability is prob(1) = 65535 - 29789 = 35746 out of 65535.

For B: c0 = c1 = c2 = c3 = c4 = 1, and index = 1 + 2 + 4 + 8 + 16 = 31, giving prob(0) = 6554. As we see, this odd 0 pixel among the 1s has a lower probability.

For C: c1 = c2 = c3 = c4 = c5 = c7 = 1, and the index becomes 190, giving prob(0) = 91. Since C, like A, is a 1 pixel, its probability is prob(1) = 65535 - 91 = 65444 out of 65535, which is as expected.
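The index arithmetic above can be sketched in a few lines of Python; the function name is ours, and the neighbour values are those read off Figure 10.39 for pixels A, B and C:

```python
def context_index(neighbours):
    """Context number: weighted sum of template pixels, index = sum(c_k * 2**k)."""
    return sum(bit << k for k, bit in enumerate(neighbours))

# pixel A: c0 = c2 = c3 = 1
index_a = context_index([1, 0, 1, 1])               # 1 + 4 + 8 = 13
# pixel B: c0 = c1 = c2 = c3 = c4 = 1
index_b = context_index([1, 1, 1, 1, 1])            # 31
# pixel C: c1 = c2 = c3 = c4 = c5 = c7 = 1
index_c = context_index([0, 1, 1, 1, 1, 1, 0, 1])   # 190
# the Appendix D table lists prob(0); a 1 pixel has prob(1) = 65535 - prob(0)
prob1_a = 65535 - 29789                             # 35746
```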


Draw a part of the contour of the shape of Figure 10.39 starting from point X, and ending at point Y. Use the Huffman Table 10.1 to calculate the number of bits required to differentially chain code the part of the contour from X to Y.

The chain code is 0, 1, 0, 1, 7, 7. The differential chain code will be 0, 1, -1, 1, -2, 0, requiring 1 + 2 + 3 + 2 + 5 + 1 = 14 bits.
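The differencing can be sketched as follows; the wrap into the range [-3, 4] is inferred from the worked answer (7 - 1 = 6 becomes -2), and the bit lengths are only the subset of Table 10.1 that this example uses:

```python
# Bit lengths for the differential symbols used here (subset of Table 10.1).
BITS = {0: 1, 1: 2, -1: 3, -2: 5}

def differential(chain, start=0):
    """Differential chain code: each link minus the previous, wrapped to [-3, 4]."""
    prev, out = start, []
    for link in chain:
        d = (link - prev) % 8      # 8 compass directions, 0..7
        if d > 4:
            d -= 8                 # wrap into [-3, 4]
        out.append(d)
        prev = link
    return out

diffs = differential([0, 1, 0, 1, 7, 7])   # [0, 1, -1, 1, -2, 0]
total_bits = sum(BITS[d] for d in diffs)   # 1 + 2 + 3 + 2 + 5 + 1 = 14
```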


Figure 10.40 shows a part of a quad tree binary alpha block (BAB). Assume shaded pixels at level three are binary 1 and blank ones binary 0:

  1. calculate the index of each pixel at level 2
  2. calculate the index of the pixel at level 1.

Figure 10.40

At level 2 the indices (without swapping) will be 0 for the two blank blocks, and index = 27 × 2 + 9 × 2 + 0 + 2 = 74 and index = 0 + 0 + 3 × 2 + 2 = 8 for the other two blocks. At the upper level the index is index = 0 + 0 + 3 × 1 + 1 = 4.
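The weighting above can be sketched directly: each of the four children takes the value 0 (all transparent), 2 (all opaque) or 1 (mixed), and the parent index is their base-3 weighted sum with weights 27, 9, 3 and 1 (the function name is ours, and swapping is ignored as in the answer):

```python
def quad_index(b0, b1, b2, b3):
    """Index of a quadtree node from its four children, weights 27, 9, 3, 1."""
    return 27 * b0 + 9 * b1 + 3 * b2 + b3

# level-2 indices of Figure 10.40 (the two blank blocks give index 0)
index_74 = quad_index(2, 2, 0, 2)    # 27*2 + 9*2 + 0 + 2 = 74
index_8 = quad_index(0, 0, 2, 2)     # 3*2 + 2 = 8
# level-1 index: two zero children and two mixed (value 1) children
index_top = quad_index(0, 0, 1, 1)   # 3*1 + 1 = 4
```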


A part of a foreground texture within an 8 × 8 pixel block is shown in Figure 10.41. The background pixels are shown blank. The block is DCT coded and the quantised coefficients with a step size of th = q = 8 are zigzag scanned. Compare the efficiency of the following two coding methods:

  1. use a shape adaptive DCT to code the block; assume the shape of this object is available to the decoder
  2. use normal DCT, but assume the background pixels are zero.

Figure 10.41

The coefficients of the shape adaptive DCT will be 127, -40, 1, 10, -15, 23, -7 and -6, confined to the top left corner, while for those of the normal DCT with padded zeros the significant coefficients are scattered all over the 8 × 8 area.
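The mechanics of the shape-adaptive DCT can be sketched as below. This is an illustrative implementation, not the normative MPEG-4 procedure: the active pixels of each column are shifted to the top and transformed with an N-point DCT, N being the number of active pixels in that column, and the resulting coefficients are then shifted left and transformed row-wise, so only as many coefficients as there are foreground pixels are produced and they gather in the top left corner. The example mask and data are our own.

```python
import math

def dct_n(x):
    """Orthonormal N-point DCT-II of a list x (empty input gives an empty output)."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(v * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n, v in enumerate(x))
        c = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(c * s)
    return out

def sa_dct(block, mask):
    """Shape-adaptive DCT sketch; block and mask are 8x8 lists, mask marks foreground."""
    size = len(block)
    # column pass: shift active samples to the top, DCT each non-empty column
    cols = []
    for j in range(size):
        col = [block[i][j] for i in range(size) if mask[i][j]]
        cols.append(dct_n(col))
    # row pass: shift the column coefficients left, DCT each non-empty row
    coeffs = [[0.0] * size for _ in range(size)]
    for i in range(size):
        row = [cols[j][i] for j in range(size) if len(cols[j]) > i]
        for j, v in enumerate(dct_n(row)):
            coeffs[i][j] = v
    return coeffs

# hypothetical foreground: a triangle of 10 pixels in the top left corner
mask = [[j < 4 - i for j in range(8)] for i in range(8)]
data = [[float(i + j + 1) if mask[i][j] else 0.0 for j in range(8)]
        for i in range(8)]
C = sa_dct(data, mask)
```

Since both 1D passes are orthonormal, the total energy of the foreground pixels is preserved in the (top-left-packed) coefficients, while a normal DCT of the zero-padded block spreads significant coefficients over the whole 8 × 8 area.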





1 KOENEN, R., PEREIRA, F., and CHIARIGLIONE, L.: 'MPEG-4: context and objectives', Image Commun. J., 1997, 9:4

2 MPEG-4: 'Testing and evaluation procedures document'. ISO/IEC JTC1/SC29/WG11, N999, July 1995

3 MPEG-4 video verification model version 11, ISO/IEC JTC1/SC29/WG11, N2171, Tokyo, March 1998

4 Special issue on segmentation, IEEE Trans. Circuits Syst. Video Technol., 1998, 8:5

5 KASS, M., WITKIN, A., and TERZOPOULOS, D.: 'Snakes: active contour models', Int. J. Comput. Vis., 1987, 1, pp.321–331

6 ROERDINK, J.B.T.M., and MEIJSTER, A.: 'The watershed transform: definitions, algorithms, and parallelization strategies', Fundamenta Informaticae, 2000, 41, pp.187–228

7 IZQUIERDO, E., and GHANBARI, M.: 'Key components for an advanced segmentation system', IEEE Trans. Multimedia, 2002, 4:1, pp.97–113

8 PERONA, P., and MALIK, J.: 'Scale-space and edge detection using anisotropic diffusion', IEEE Trans. Pattern Anal. Mach. Intell., 1990, 12:7, pp.629–639

9 SHAFARENKO, L., PETROU, M., and KITTLER, J.: 'Automatic watershed segmentation of randomly textured color images', IEEE Trans. Image Process., 1997, 6:11, pp.1530–1544

10 WYSZECKI, G., and STILES, W.S.: 'Color science: concepts and methods, quantitative data and formulae' (Wiley, 1982, 2nd edn.)

11 VINCENT, L., and SOILLE, P.: 'Watersheds in digital spaces: an efficient algorithm based on immersion simulations', IEEE Trans. Pattern Anal. Mach. Intell., 1991, 13:6, pp.583–589

12 MPEG-4: 'Video shape coding'. ISO/IEC JTC1/SC29/WG11, N1584, March 1997

13 GHANBARI, M.: 'Arithmetic coding with limited past history', Electron. Lett., 1991, 27:13, pp.1157–1159

14 PEARSON, D.E.: 'Developments in model-based video coding', Proc. IEEE, 1995, 83:6, pp.892–906

15 WOLBERG, G.: 'Digital image warping' (IEEE Computer Society Press, Los Alamitos, California, 1990)

16 SEFERIDIS, V., and GHANBARI, M.: 'General approach to block matching motion estimation', Opt. Eng., 1993, 32:7, pp.1464–1474

17 GHANBARI, M., DE FARIA, S., GOH, I.J., and TAN, K.T.: 'Motion compensation for very low bit-rate video', Signal Process. Image Commun., special issue on very low bit rate video, 1995, 7:(4–6), pp.567–580

18 SHAPIRO, J.M.: 'Embedded image coding using zerotrees of wavelet coefficients', IEEE Trans. Signal Process., 1993, 41:12, pp.3445–3462

19 WITTEN, I.H., NEAL, R.M., and CLEARY, J.G.: 'Arithmetic coding for data compression', Commun. ACM, 1987, 30:6, pp.520–540

20 WANG, Q., and GHANBARI, M.: 'Motion-compensation for super high definition video'. Proceedings of IS&T/SPIE symposium, on Electronic imaging, science and technology, very high resolution and quality imaging, San Jose, CA, Jan. 27–Feb. 2, 1996

21 SAID, A., and PEARLMAN, W.A.: 'A new, fast and efficient image codec based on set partitioning in hierarchical trees', IEEE Trans. Circuits Syst. Video Technol., 1996, 6:3, pp.243–250

22 WANG, Q., and GHANBARI, M.: 'Scalable coding of very high resolution video using the virtual zero-tree', IEEE Trans. Circuits Syst. Video Technol., special issue on multimedia, 1997, 7:5, pp.719–729

23 KHAN, E., and GHANBARI, M.: 'Very low bit rate video coding using virtual SPIHT', Electron. Lett., 2001, 37:1, pp.40–42

24 3rd Generation Partnership Project (3GPP): TSG-SA codec working group, release 1999, technical report TR 26.912 V3.0.0

Standard Codecs: Image Compression to Advanced Video Coding (IET Telecommunications Series), M. Ghanbari, ISBN 0852967101, 2005