10.4 Shape coding

The binary and greyscale shapes are normally referred to as binary and greyscale alpha planes. Binary alpha planes are encoded with one of the binary shape coding methods (to be explained later), and greyscale alpha planes are encoded by motion compensated DCT, similar to texture coding (e.g. H.263). An alpha plane is bounded by a rectangle that includes the shape of the VOP, as described in the formation of VOPs in section 10.2.3. The bounding rectangle of the VOP is then extended on the bottom right to a multiple of 16 × 16 pixels, and the extended alpha samples are set to zero. The extended alpha plane can thus be partitioned into an exact number of 16 × 16 pixel macroblocks. Hereafter these macroblocks are referred to as alpha blocks, and the encoding and decoding process for block-based shape coding is carried out per alpha block.

If the pixels in an alpha block are all transparent (all zero), the block is skipped before motion and/or texture coding. No overhead is required to indicate this mode, since the transparency information can be obtained from shape coding. This skipping applies to all I, P and B-VOPs. Since shape coding is unique to MPEG-4 (no other standard codec uses it), the following sections pay special attention to the various shape coding methods.
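As a concrete illustration of the skip decision, here is a minimal sketch (Python/NumPy; the function name is ours, not from the standard) of the transparency test an encoder might apply:

```python
import numpy as np

def can_skip_block(alpha_block: np.ndarray) -> bool:
    """Return True if an alpha block is entirely transparent (all zero).

    Such blocks are skipped before motion/texture coding; no overhead is
    needed because the decoder reaches the same decision from the decoded
    shape information. (Illustrative helper, not normative code.)
    """
    return not alpha_block.any()
```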

10.4.1 Coding of binary alpha planes

A binary alpha plane is encoded in the intra mode for I-VOPs and in the inter mode for P-VOPs and B-VOPs. During the development of MPEG-4, several methods for coding of binary alpha planes were considered, including chain coding of the object contours, quad tree coding, modified modified READ (MMR) and context-based arithmetic encoding (CAE) [3,12]. CAE, recommended in the latest version of the verification model (VM-11) [3], appears to be the best. Hence, in introducing these methods, more detail is given on CAE.

10.4.2 Chain code

In the chain code method, the object boundary is represented by a closed contour, as shown in Figure 10.10, and chain codes are then applied to the contour. Derivation of the contour from the binary alpha plane is similar to edge detection, as discussed in section 10.3. Coding involves moving along the contour in one of eight directions, as shown in the Figure, and coding the direction of each move. The chain code terminates when the starting point is revisited.

Figure 10.10: Object boundaries for chain coding and the eight directions of the chain around the object

Each chain code contains start point data followed by the first chain code and the subsequent differential chain codes. If the VOP contains several closed contours, the chain codes for all of them are coded, preceded by the number of regions.

Since a chain code has a cyclic property, a differential chain code in eight directions can be expressed in the range from -3 to 4 by the following definition:

$$d = \begin{cases} c_n - c_{n-1} - 8 & \text{if } c_n - c_{n-1} > 4 \\ c_n - c_{n-1} + 8 & \text{if } c_n - c_{n-1} < -3 \\ c_n - c_{n-1} & \text{otherwise} \end{cases} \qquad (10.8)$$

where d is the differential chain code, $c_n$ is the current chain code and $c_{n-1}$ is the previous chain code. A Huffman code is used to encode the differential chain code d; the Huffman table is shown in Table 10.1.

Table 10.1: Huffman table for the differential chain code

  d     code
  0     1
  1     00
 -1     011
  2     0100
 -2     01011
  3     010100
 -3     0101011
  4     0101010

At the receiver, after the variable length decoding of d, the current chain code, $c_n$, is reconstructed as follows:

$$c_n = (c_{n-1} + d) \bmod 8 \qquad (10.9)$$
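The chain coding steps above can be summarised in a short sketch. The codewords are those of Table 10.1; how the start point data and the first chain code are actually packed is not specified here, so the three-bit first code is an assumption for illustration only:

```python
# Table 10.1 codewords for the differential chain code d.
DIFF_CHAIN_VLC = {0: '1', 1: '00', -1: '011', 2: '0100', -2: '01011',
                  3: '010100', -3: '0101011', 4: '0101010'}

def encode_chain(codes):
    """Differentially encode a list of absolute chain codes (0..7).

    The first code is written directly (3 bits, illustrative); each later
    code is mapped to d in [-3, 4] as in eqn 10.8 and Huffman coded.
    """
    bits = format(codes[0], '03b')
    for prev, curr in zip(codes, codes[1:]):
        d = (curr - prev + 3) % 8 - 3   # cyclic wrap into [-3, 4]
        bits += DIFF_CHAIN_VLC[d]
    return bits

def decode_step(c_prev, d):
    """Eqn 10.9: reconstruct the current chain code from the previous one."""
    return (c_prev + d) % 8
```

For example, encode_chain([0, 7, 7]) yields '000' + '011' + '1', since the first step wraps to d = -1 and the second gives d = 0.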

10.4.3 Quad tree coding

Each binary alpha block (BAB) of 16 × 16 pixels, represented by binary data (white 255 and black 0), is first quad tree segmented. The indices of the segments, according to the rules to be explained, are calculated and then Huffman coded. Figure 10.11 shows the quad tree structure employed for coding of a binary alpha block.

Figure 10.11: Quad tree representation of a shape block

At the bottom level (level-3) of the quad tree, a 16 × 16 alpha block is partitioned into 64 subblocks of 2 × 2 samples. Each higher level as shown also contains 16 × 16 pixels, but in groups of 4 × 4, 8 × 8 and 16 × 16 subblocks.

The calculation of the indices is as follows.

Step 1: indexing of subblocks at level 3

  • The indexing of subblocks starts at level-3, where an index is assigned to each 2 × 2 pixel subblock.

  • For the four pixels of the subblock b[0] to b[3] of Figure 10.12, the index is calculated as:

    $$\text{index} = 27\,b[0] + 9\,b[1] + 3\,b[2] + b[3] \qquad (10.10)$$


    Figure 10.12: A subblock of 2 × 2 pixels

    where b[i] = 2 if the sample value is 255 (white) and b[i] = 0 if it is black. Hence there are 16 different index values with a minimum of 0 and a maximum of 80.

These indices then become the inputs to level-2 for the generation of a new set of indices at that level. However, to increase interblock correlation, the subblocks are first swapped in decreasing order of indices. The swapping also skews the distribution of index values towards low values, and this nonuniform distribution is more efficiently variable length coded. The swapping of the four subblocks is carried out according to the relationship of the neighbouring indices, in the following order; the upper and left neighbouring indices are shown in Figure 10.13.


Figure 10.13: Upper and left level indices of a subblock

  1. If upper_index[0] is less than upper_index[1], then swap index[0] with index[1] and index[2] with index[3], except for subblocks numbered 0, 1, 4, 5, 16, 17, 20 and 21.

  2. If left_index[0] is less than left_index[1], then swap index[0] with index[2] and index[1] with index[3], except for subblocks numbered 0, 2, 8, 10, 32, 34, 40 and 42.

  3. If upper_index[0] + upper_index[1] is less than left_index[0] + left_index[1], then swap index[1] with index[2], except for subblocks numbered 0, 1, 2, 4, 5, 8, 10, 16, 17, 20, 21, 32, 34, 40 and 42.

  4. The index of level-2 is computed from index[0], index[1], index[2] and index[3] after swapping according to:

    • index_level_2 = 27 × f(index[0]) + 9 × f(index[1]) + 3 × f(index[2]) + f(index[3])

    where

    $$f(\text{index}) = \begin{cases} 0 & \text{if index} = 0 \\ 1 & \text{if } 0 < \text{index} < 80 \\ 2 & \text{if index} = 80 \end{cases} \qquad (10.11)$$

The current subblock is then reconstructed and used as a reference when processing subsequent subblocks.
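A sketch of the level-3 indexing and the index combination may help. The raster ordering of b[0]..b[3] within the 2 × 2 subblock and the exact form of f in eqn 10.11 are our reading of the text (f maps an index to a ternary digit: all black, mixed, all white):

```python
def level3_index(pixels):
    """Eqn 10.10 for a 2x2 subblock given as [b0, b1, b2, b3] pixel values.

    b[i] is 2 for a white (255) pixel and 0 for a black one, so the index
    is a base-3 number with digits in {0, 2}: 16 values from 0 to 80.
    """
    b = [2 if p == 255 else 0 for p in pixels]
    return 27 * b[0] + 9 * b[1] + 3 * b[2] + b[3]

def f(index):
    """Eqn 10.11 (as we read it): 0 = all black, 2 = all white, 1 = mixed."""
    return 0 if index == 0 else 2 if index == 80 else 1

def parent_index(idx):
    """Combine four child indices (after any swapping) into the index of
    the next level up, as in the index_level_2 formula above."""
    return 27 * f(idx[0]) + 9 * f(idx[1]) + 3 * f(idx[2]) + f(idx[3])
```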

Step 2: grouping process for higher levels

The grouping process of blocks at the higher levels first starts at level-2, where four subblocks from level-3 are grouped to form a new subblock. The grouping process involves swapping and indexing similar to that discussed for level-3, except that at this level a 4 × 4 pixel block is represented by a 2 × 2 subblock whose elements are indices rather than pixels. The current subblock is then reconstructed and used as a reference when processing subsequent subblocks. At the decoder, swapping is done following the reverse sequence of steps to those at the encoder.

The grouping process is also performed similarly for level-1 where four subblocks from level-2 are grouped to form a new subblock. The swapping, indexing and reconstruction of a subblock follows grouping, the same as that for other levels.

Now the arrangement of subblocks in decreasing order of their indices at level-2, to exploit the interblock correlation, is done as follows:

  1. If f(upper_index[0]) is less than f(upper_index[1]), then swap index[0] with index[1] and index[2] with index[3], except for subblocks numbered 0, 1, 4 and 5.

  2. If f(left_index[0]) is less than f(left_index[1]), then swap index[0] with index[2] and index[1] with index[3], except for subblocks numbered 0, 2, 8 and 10.

  3. If f(upper_index[0]) + f(upper_index[1]) is less than f(left_index[0]) + f(left_index[1]), then swap index[1] with index[2], except for subblocks numbered 0, 1, 2, 4, 5, 8 and 10.

  4. The index of level-1 is computed from index[0], index[1], index[2] and index[3] after swapping, according to eqn 10.11.

At level-1 no swapping is required.

Step 3: encoding process

The encoding process uses the results of the grouping process, which produces a total of 85 (= 1 + 4 + 16 + 64) indices for a 16 × 16 alpha block. Each index is encoded from the topmost level (level-0). At each level, the order for encoding and transmission of indices is shown by the numbers in Figure 10.11. Indices are Huffman coded. Note that indices at level-3 can take only 16 different values, but at the other levels they take 80 different values. Hence, for efficient variable length coding, two different Huffman tables are used: one with 16 symbols at level-3 and the other with 80 symbols at levels 0 to 2. These tables are given in Appendix C.

10.4.4 Modified modified READ (MMR)

During the course of MPEG-4 development, another shape coding method, named modified modified READ (MMR), was also investigated [12]. The basic idea of this method is to detect the pixel intensity changes (from opaque to transparent and vice versa) within a block, and to code the positions of the changing pixels. A changing pixel is a pixel whose colour changes while scanning an alpha block in raster scan order. Figure 10.14 illustrates the changing pixels in an intra and a motion compensated inter alpha block. Also shown in the Figure are the top reference row and the left reference column of pixels.

Figure 10.14: Changing pixels ((a) intra alpha block; (b) inter alpha block)

To code the position of the changing pixels, the following parameters, as shown in Figure 10.14, are defined.

For each length, the starting pixel is denoted by a0. Initially, a0 is positioned at the right end of the top reference row and addressed as abs_a0 = -1. The first changing pixel appearing after a0 is called a1. For intra blocks, the pixel located above a0 is called b0. The area from the pixel next to b0 down to a0 is called the reference area in intraMMR. In this case the reference changing pixel is called b1. This pixel is found in reference area r1 or r2, shown in Figure 10.15a. Searching in r1, the first changing pixel whose colour is the opposite of a0 is named b1; if it is not found, the first changing pixel in r2 is named b1. Thus b1 might become identical to a0. Exceptionally, b1 may not be found when abs_a0 = -1.

Figure 10.15: Reference area for detecting the reference changing pixel ((a) b1; (b) c1)

For interMMR, the pixel at the position corresponding to a0 is called c0. The area from the pixel next to c0 down to the end of the alpha block is called the reference area in interMMR. The reference changing pixel in this mode is called c1. This pixel is found in reference area r1 or r2, shown in Figure 10.15b. Searching in r1, the first changing pixel whose colour is the opposite of a0 is named c1; if it is not found, the first changing pixel in r2 is named c1.

Given a0, the eventual task is to identify a1 by referring to either b1 or c1, depending on the intraMMR or interMMR mode, respectively. This is achieved with three different modes: the vertical, the horizontal and the vertical pass mode. The vertical mode is considered first; if it is found not to be applicable, the horizontal mode and then the vertical pass mode are employed in turn.

In the vertical mode, the position of a1 is identified relative to that of b1. It is invoked only when the absolute value of the relative distance, DIST = r_a1 - r_b1 (the difference between the relative addresses of a1 and b1), is equal to or less than a predefined threshold [12]. DIST is then variable length coded using the appropriate intraMMR or interMMR VLC table.

If the vertical mode is not used (|DIST| > threshold), the horizontal mode is considered. In this mode the position of a1 is identified by its absolute distance from a0. If this distance is less than the width of the alpha block it is coded; otherwise the vertical pass mode is used, which implies that one row of pixels in the alpha block is passed (not coded).

Finally, to choose the scanning direction, the alpha block is first scanned both horizontally and vertically, and the direction requiring the fewer bits is chosen; in the case of a tie, horizontal scanning is chosen. For the final decision between intraMMR and interMMR, the mode giving the fewer coding bits is again selected.
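A compact sketch of the decision logic described above (positions are treated as scalar addresses along the scan; the threshold value and all names are illustrative, not taken from the standard):

```python
def changing_pixels(scanline):
    """Addresses where the alpha value flips while scanning in raster order."""
    return [i for i in range(1, len(scanline)) if scanline[i] != scanline[i - 1]]

def choose_mode(a0, a1, b1, width, threshold=3):
    """Three-way mode decision: vertical, horizontal or vertical pass.

    a0: current starting pixel, a1: next changing pixel, b1: reference
    changing pixel (c1 in interMMR). 'threshold' is the vertical-mode
    limit of [12]; its value here is only a placeholder.
    """
    dist = a1 - b1                        # relative address DIST
    if abs(dist) <= threshold:
        return ('vertical', dist)         # DIST is VLC coded
    if a1 - a0 < width:
        return ('horizontal', a1 - a0)    # absolute distance from a0
    return ('vertical_pass', None)        # pass one row uncoded
```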

10.4.5 Context-based arithmetic coding

Each intracoded binary alpha block (BAB), and the intercoded one after being motion compensated by block-based motion compensation, is context-based arithmetic encoded (CAE). In general each binary alpha block is coded according to one of the following seven modes (in C terminology):

  1. MVDs == 0 && No_Update

  2. MVDs != 0 && No_Update

  3. All_0

  4. All_255

  5. IntraCAE

  6. MVDs == 0 && InterCAE

  7. MVDs != 0 && InterCAE

The first and second modes indicate that the shape will not be updated, and All_0 and All_255 indicate that the BAB contains only black and white pixels, respectively. None of these modes is required to be arithmetic coded. Also, in the quad tree and MMR methods, All_0 and All_255 are not coded further.

IntraCAE is the mode for context-based arithmetic coding of BABs that contain a mixture of black and white pixels. In modes 6 and 7, interframe BABs (mixed black and white pixels) are motion compensated and arithmetic coded, with zero and nonzero motion vector data, respectively.

The motion vector data of shape, MVDs, is the difference between the shape motion vector, MVs, and its predictor, MVPs. The prediction procedure is similar to that of the motion vector data for texture, described in section 9.1.2. However, there are differences, such as:

  • the prediction motion vector can be derived from the candidate shape motion vectors MVS1, MVS2 and MVS3, or from the candidate texture motion vectors MV1, MV2 and MV3, similar to those of H.263 illustrated in Figure 9.2; the prediction motion vector is the first encountered motion vector that is valid, and if no candidate is valid the prediction is set to zero

  • overlapped, half pixel precision and 8 × 8 motion compensations are not carried out

  • if a region outside the VOP is referenced, its value is taken to be zero

  • for B-VOPs, only forward motion compensation is used and neither backward nor interpolated motion compensation is allowed.

It should be noted that once the shape prediction motion vector, MVPs, is determined, the difference between the motion compensated BAB indicated by MVPs and the current BAB is calculated. If the motion compensated error is less than a certain threshold (AlphaTH) for every 4 × 4 subblock of the BAB, MVPs is directly employed as the best prediction. If this condition is not met, the motion vector is searched around the prediction vector MVPs, comparing the BAB indicated by each candidate vector with the current BAB. The vector that minimises the error is taken as the best shape motion vector, MVs, and the motion vector data for shape is given by MVDs = MVs - MVPs.
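The shape motion search just described might look as follows in outline (Python/NumPy; border handling, the search range and the AlphaTH value are illustrative assumptions, not from the standard):

```python
import numpy as np

def shape_mv_search(curr_bab, ref_shape, y, x, mvps, alpha_th=256, rng=2):
    """Return (MVs, MVDs) for a 16x16 BAB at (y, x) in the current shape.

    If the BAB predicted by MVPs matches the current BAB to within
    alpha_th for every 4x4 subblock, MVPs is used directly; otherwise a
    small search around MVPs picks the MVs minimising the error, and
    MVDs = MVs - MVPs.
    """
    def err(mv):
        dy, dx = mv
        pred = ref_shape[y + dy:y + dy + 16, x + dx:x + dx + 16]
        return np.abs(curr_bab.astype(int) - pred.astype(int))

    e = err(mvps)
    if all(e[j:j + 4, i:i + 4].sum() < alpha_th
           for j in range(0, 16, 4) for i in range(0, 16, 4)):
        return mvps, (0, 0)                    # MVDs == 0

    cands = [(mvps[0] + dy, mvps[1] + dx)
             for dy in range(-rng, rng + 1) for dx in range(-rng, rng + 1)]
    mvs = min(cands, key=lambda mv: err(mv).sum())
    return mvs, (mvs[0] - mvps[0], mvs[1] - mvps[1])
```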

10.4.5.1 Size conversion

Rate control and rate reduction in MPEG-4 are realised through size conversion of the binary alpha information. This method is also applicable to quad tree and MMR coding. It is implemented in two successive steps.

In the first step, if required, the size of the VOP can be reduced by half in each of the horizontal and vertical directions. This is indicated in the VOP header by the video object plane conversion ratio, VOP_CR, which takes a value of either 1 or 1/2. When VOP_CR is 1/2, size conversion is carried out on the original bounding box of Figure 10.5.

In the case that the value of VOP_CR is 1/2, the locally decoded shape, which is size converted at the VOP level, is stored in the frame memory of the shape frame. For shape motion estimation and compensation, if the VOP_CR of the reference shape VOP is not equal to that of the current shape VOP, the reference shape frame (not the VOP) is size converted to correspond to the current shape VOP.

For P-VOPs, if VOP_CR is 1/2, the components of the shape motion vector are measured on the downsampled shape frame. The predicted motion vector for the shape, MVPs, is calculated using only the shape motion vectors MVS1, MVS2 and MVS3.

In the second step, when required, size conversion is carried out for every binary alpha block (BAB), except those coded as All_0, All_255 or No_Update. At the block level, the conversion ratio (CR) can be one of 1/4, 1/2 and 1 (the original size).

For CR = 1/2, if the average of the pixel values in a 2 × 2 pixel block is equal to or larger than 128, the pixel value of the downsampled block is set to 255; otherwise it is set to zero. For CR = 1/4, if the average of the pixel values in a 4 × 4 pixel block is equal to or larger than 128, the pixel value of the downsampled block is set to 255; otherwise it is set to zero. In either case, upsampling is then carried out on the BAB, and the values of the interpolated pixels are calculated from their neighbouring pixels according to their Euclidean distances.
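The downsampling rule has a direct NumPy expression; this sketch (illustrative, with CR passed as 1/2 or 1/4) produces the reduced BAB, while the distance-weighted upsampling is omitted for brevity:

```python
import numpy as np

def downsample_bab(bab, cr):
    """Downsample a 16x16 BAB by CR = 1/2 (2x2 blocks) or 1/4 (4x4 blocks).

    Each n x n block maps to a single pixel: 255 if the block average is
    equal to or larger than 128, otherwise 0, as described above.
    """
    n = 2 if cr == 0.5 else 4
    h, w = bab.shape
    means = bab.reshape(h // n, n, w // n, n).mean(axis=(1, 3))
    return np.where(means >= 128, 255, 0).astype(np.uint8)
```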

Selection of a suitable value for the conversion ratio (CR) is based on the conversion error between the original BAB and the BAB that has been downsampled and then reconstructed by upsampling. The conversion error is computed for each 4 × 4 subblock by taking the absolute difference of the pixels in the corresponding subblocks of the original and reconstructed BABs. If this difference is greater than a certain threshold, the subblock is called an error pixel block (Error_PB). Size conversion at a given ratio is accepted if there is no Error_PB at that ratio. Figure 10.16 summarises the determination of the conversion ratio, CR.

Figure 10.16: CR determination algorithm

If a downsampled BAB turns out to be all transparent or all opaque, and the conversion error in every 4 × 4 subblock of the BAB is equal to or lower than the threshold, the shape information is coded as shape_mode = All_0 or All_255. Otherwise, the BAB is coded with context-based arithmetic coding at the size determined by the rate control algorithm.
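The CR determination of Figure 10.16 can be paraphrased as follows; the order in which the ratios are tried and the threshold value are our assumptions, and downsample/upsample stand for the operations of the previous paragraphs:

```python
import numpy as np

def has_error_pb(orig_bab, recon_bab, threshold=0):
    """True if any 4x4 subblock is an Error_PB, i.e. the absolute
    difference between the original and reconstructed subblocks exceeds
    'threshold' (0 here is only a placeholder; MPEG-4 sets it via rate
    control)."""
    diff = np.abs(orig_bab.astype(int) - recon_bab.astype(int))
    return any(diff[j:j + 4, i:i + 4].sum() > threshold
               for j in range(0, 16, 4) for i in range(0, 16, 4))

def determine_cr(bab, downsample, upsample):
    """Accept the smallest conversion ratio that yields no Error_PB."""
    for cr in (0.25, 0.5):                 # coarsest first (assumed order)
        recon = upsample(downsample(bab, cr), cr)
        if not has_error_pb(bab, recon):
            return cr
    return 1.0                             # fall back to the original size
```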

10.4.5.2 Generation of context index

The pixels in the binary alpha blocks (BABs) are context-based arithmetic coded in both intra and inter modes. The number of pixels in the BAB is determined by the conversion ratio (CR): 16 × 16, 8 × 8 or 4 × 4 pixels for CR values of 1, 1/2 and 1/4, respectively.

The context-based arithmetic encoding (CAE) is a binary arithmetic coding, where the symbol probability is determined from the context of the neighbouring pixels. Such a coding is applied to each pixel of the BAB in the following manner. First, prior to encoding of each BAB, the arithmetic encoder is initialised. Each binary pixel is then encoded in the raster scan order. The process for coding a given pixel is carried out using the following steps:

  1. compute the context number

  2. index a probability table using the context number

  3. use the indexed probability to drive the arithmetic encoder.

When the final pixel has been encoded the arithmetic code is terminated.
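In outline, the per-pixel CAE loop looks like this; the encoder object and its methods are placeholders for a binary arithmetic coder, not an API defined by the standard:

```python
def cae_encode_bab(bab, probs, context_of, encoder):
    """Encode one BAB pixel by pixel with context-based arithmetic coding.

    probs: intra or inter probability table (Appendix D), indexed by the
    context number and giving p(pixel = 0) scaled to [1, 65535].
    context_of: computes the context number from already coded neighbours.
    encoder: a binary arithmetic coder with init/encode/terminate
    (illustrative interface).
    """
    encoder.init()                              # initialise per BAB
    h, w = bab.shape
    for y in range(h):                          # raster scan order
        for x in range(w):
            ctx = context_of(bab, x, y)         # step 1: context number
            p0 = probs[ctx]                     # step 2: probability lookup
            encoder.encode(bab[y, x] != 0, p0)  # step 3: drive the coder
    encoder.terminate()                         # finish after the last pixel
```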

Figure 10.17 shows the template of the neighbouring pixels that contribute to the creation of the context number for intra and inter shape pixels.

Figure 10.17: Templates for the construction of the context from the neighbouring pixels for (a) intra and (b) inter BABs; the pixel to be coded is marked with '?'

For intra coded BABs, a ten-bit context, $C = \sum_{k=0}^{9} c_k \cdot 2^k$, is calculated for each pixel, as shown in Figure 10.17a. In this Figure the pixel to be encoded is represented by ? and the ten neighbouring pixels are ordered as shown. For inter coded BABs, in addition to spatial redundancy, temporal redundancy is exploited by using pixels from the bordered motion compensated BAB to make up part of the context, as shown in Figure 10.17b. In this mode only nine bits are required to calculate the context number, i.e. $C = \sum_{k=0}^{8} c_k \cdot 2^k$.

In both modes there are some special cases to note:

  • In building contexts, any pixels outside the bounding box of the current VOP to the left and above are assumed to be zero.

  • The template may cover pixels from BABs which are not known at the decoding time. The values of these unknown pixels are estimated by template padding in the following manner:

    1. When constructing the intra context, the following steps are taken in sequence:

      1. if (C7 is unknown) C7 = C8

      2. if (C3 is unknown) C3 = C4

      3. if (C2 is unknown) C2 = C3.

    2. When constructing the inter context, the following conditional assignment is performed:

      1. if (C1 is unknown) C1 = C2.

Once the context number is calculated, it is used to index a probability table for binary arithmetic coding. Two probability tables, a 10-bit one for intra and a 9-bit one for inter BABs, are given in Appendix D. These tables contain the probabilities of a binary alpha pixel being equal to 0, for intra and inter shape coding with context-based arithmetic coding. All probabilities are normalised to the range [1, 65535].

As an example let us assume that the neighbouring pixels for an intra BAB template have a black and white pattern as shown in Figure 10.18.


Figure 10.18: An example of an intra BAB template

In this Figure, C0 = C1 = C2 = C3 = C4 = C7 = 1 and C5 = C6 = C8 = C9 = 0. Hence the context number for coding of pixel ? is $C = 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^7 = 159$.

If pixel ? were black, it would be coded with probability Intra_prob[159]; according to Appendix D this value is 74 out of 65535. If it were white, its probability would be 65535 - 74 = 65461 out of 65535. Such a high probability for a white pixel is expected in this example, since the pixel is surrounded by many white pixels. Note also that, although the probability table itself is fixed, as the pattern of neighbouring pixels changes the calculated context number changes, so that the probability assigned to the pixel is better suited to that pattern. This is a form of adaptive arithmetic coding which looks at only a limited number of past coded symbols, and it has been shown that adaptive arithmetic coding with limited past history compresses more efficiently than arithmetic coding with fixed probabilities [13].
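The worked example can be checked mechanically (the bit list reproduces Figure 10.18, and Intra_prob[159] = 74 is the Appendix D value quoted above):

```python
# Context bits C0..C9 from Figure 10.18, least significant bit first.
c = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]
ctx = sum(bit << k for k, bit in enumerate(c))
assert ctx == 159        # 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^7

# With Intra_prob[159] == 74, the coder uses p(black) = 74/65535 and
# p(white) = (65535 - 74)/65535 = 65461/65535 for this pixel.
```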

10.4.6 Greyscale shape coding

The grey level alpha plane is encoded as its support function plus the alpha values on the support. The support is obtained by thresholding the grey level alpha plane at 0, and the remaining parts constitute the alpha values, as shown in Figure 10.19.

Figure 10.19: Greyscale shape coding

The support function is encoded by binary shape coding, as described in section 10.4.5. The alpha values are partitioned into 16 × 16 blocks and encoded in the same way as the luminance of texture.
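The decomposition of a greyscale alpha plane into its support and alpha values is simple to express; a small sketch (illustrative names, Python/NumPy):

```python
import numpy as np

def split_grey_alpha(alpha):
    """Split a greyscale alpha plane into a binary support and the alpha
    values on that support.

    The support is the plane thresholded at 0 and is coded with binary
    shape coding; the values on the support are coded in 16x16 blocks
    like luminance texture.
    """
    support = np.where(alpha > 0, 255, 0).astype(np.uint8)
    values = np.where(support > 0, alpha, 0)
    return support, values
```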


