Video Coding for Low Bit Rate Communications (H.263)


The H.263 Recommendation specifies a coded representation that can be used for compressing the moving picture components of audio-visual services at low bit rates. The detailed specification of the first generation of this codec, together with the test model (TM) used to verify its performance and compliance, was finalised in 1995 [1]. The basic configuration of the video source algorithm in this codec is based on ITU-T Recommendation H.261, which is a hybrid of interpicture prediction to utilise temporal redundancy and transform coding of the residual signal to reduce spatial redundancy. However, during the course of the development of H.261 and the subsequent advances in video coding in the MPEG-1 and MPEG-2 video codecs, substantial experience was gained, which has been exploited to make H.263 an efficient encoder [2–4]. In this Chapter those parts of the H.263 standard that make this codec more efficient than its predecessors will be explained.

It should be noted that the primary goal in the H.263 standard codec was coding of video at low or very low bit rates for applications such as mobile networks, the public switched telephone network (PSTN) and narrowband ISDN. This goal could only be achieved with small image sizes such as subQCIF and QCIF, at low frame rates. Today, this codec has been found so attractive that higher resolution pictures can also be coded at relatively low bit rates. The current standard recommends operation on five standard pictures of the CIF family, known as subQCIF, QCIF, CIF, 4CIF and 16CIF.

Soon after the finalisation of H.263 in 1995, work began to improve the coding performance of this codec further. H.263+ was the first set of extensions to this family, intended for near-term standardisation of enhancements of the H.263 video coding algorithms for real-time telecommunications [5]. Work on improving the encoding performance is still an ongoing process under H.263++, and every now and then a new extension called an annex is added to the family [6]. The codec for long-term standardisation is called H.26L [7]. The H.26L project has the mandate from ITU-T to develop a very low bit rate (less than 64 kbit/s, with emphasis on less than 24 kbit/s) video coding recommendation achieving better video quality, lower delay, lower complexity and better error resilience than are currently available. The project also has an objective to work closely with the MPEG-4 committee in investigating new video coding techniques and technologies as candidates for recommendation [8].

How does H.263 differ from H.261 and MPEG-1?

The source encoder of H.263 follows the general structure of the generic DCT-based interframe coding technique used in the H.261 and MPEG-1 codecs (see Figure 3.18). The core H.263 employs a hybrid of interpicture prediction to utilise temporal redundancy and transform coding of the residual signal to reduce spatial redundancy. The decoder has motion compensation capability, allowing optional incorporation of this technique at the encoder. Half pixel precision is used for the motion compensation, as opposed to the optional full pixel precision and loop filter used in Recommendation H.261. In the new versions of H.263, the use of quarter and even one eighth pixel precision is recommended [6].

Perhaps the most significant differences between the core H.263 and H.261/MPEG-1 are in the coding of the transform coefficients and motion vectors. In the following sections these and some other notable differences such as the additional optional modes are explained.

Coding of H.263 coefficients

In H.261 and MPEG-1, we saw that the transform coefficients are converted via a zigzag scanning process into two-dimensional events of (run, index) (see section 6.4). In H.263 these coefficients are represented as a three-dimensional event of (last, run, level). As in the two-dimensional event, run indicates the number of zero-valued coefficients preceding a nonzero coefficient in the zigzag scan, and level is the normalised magnitude of the nonzero coefficient, which is sometimes called index. last is a new variable that replaces the end of block (EOB) code of H.261 and MPEG-1. It takes only two values, 0 and 1: last = 0 means that there are more nonzero coefficients in the block, and last = 1 means that this is the last nonzero coefficient in the block.

The most likely events of (last, run, level) are then variable length coded. The remaining combinations of (last, run, level) are coded with a fixed 22-bit word consisting of seven bits escape, one bit last, six bits run and eight bits level.
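The event formation and the fixed-length escape word described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a block that has already been zigzag scanned and quantised, the 7-bit escape pattern `0000011` and the two's complement representation of level are assumptions to be checked against the TCOEF table of the Recommendation, and the function names are hypothetical.

```python
# Assumed 7-bit escape prefix; verify against the TCOEF table of the standard.
ESCAPE = 0b0000011


def to_events(scanned):
    """Convert zigzag-scanned quantised coefficients into (last, run, level) events."""
    nonzero = [(pos, c) for pos, c in enumerate(scanned) if c != 0]
    events = []
    prev = -1
    for n, (pos, level) in enumerate(nonzero):
        run = pos - prev - 1                      # zeros preceding this coefficient
        last = 1 if n == len(nonzero) - 1 else 0  # 1 only for the final nonzero one
        events.append((last, run, level))
        prev = pos
    return events


def escape_word(last, run, level):
    """Pack one event into 22 bits: 7 escape + 1 last + 6 run + 8 level."""
    if last not in (0, 1):
        raise ValueError("last must be 0 or 1")
    if not 0 <= run < 64:
        raise ValueError("run must fit in six bits")
    if level == 0 or not -127 <= level <= 127:
        raise ValueError("level must be nonzero and fit in eight bits (assumed range)")
    return (ESCAPE << 15) | (last << 14) | (run << 8) | (level & 0xFF)


if __name__ == "__main__":
    for event in to_events([5, 0, 0, -2, 1, 0, 0, 0]):
        print(event, format(escape_word(*event), "022b"))
```

In a real encoder only the events that have no entry in the variable length code table would be sent with the escape word; here every event is packed simply to show the bit layout.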

Coding of motion vectors

The motion compensation in the core H.263 is based on one motion vector per macroblock of 16 × 16 pixels, with half pixel precision. The macroblock motion vector is then differentially coded with predictions taken from three surrounding macroblocks, as indicated in Figure 9.1. The predictors are calculated separately for the horizontal and vertical components of the motion vectors MV1, MV2 and MV3. For each component, the predictor is the median [1] value of the three candidate predictors for this component:

$$P_x = \mathrm{Median}(MV1_x,\, MV2_x,\, MV3_x), \qquad P_y = \mathrm{Median}(MV1_y,\, MV2_y,\, MV3_y)$$

Figure 9.1: Motion vector prediction

The differences between the components of the current motion vector and their predictions are variable length coded. The vector differences are defined by:

$$MVD_x = MV_x - P_x, \qquad MVD_y = MV_y - P_y$$

In the special cases at the borders of the current group of blocks (GOB) or picture, the following decision rules are applied in order as follows:

  1. The candidate predictor MV1 is set to zero if the corresponding macroblock is outside the picture at the left side (Figure 9.2a).

    Figure 9.2: Motion vector prediction for the border macroblocks

  2. The candidate predictors MV2 and MV3 are set to MV1 if the corresponding macroblocks are outside the picture at the top, or if the GOB header of the current GOB is nonempty (Figure 9.2b).
  3. The candidate predictor MV3 is set to zero if the corresponding macroblock is outside the picture at the right side (Figure 9.2c).
  4. When the corresponding macroblock is coded intra or is not coded, the candidate predictor is set to zero.
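The median prediction and the border rules can be sketched as below. The sketch assumes that MV1, MV2 and MV3 belong to the macroblocks to the left, above and above right of the current one (the figure is not reproduced here), that vectors are held as (x, y) pairs in half pixel units, and that a missing, intra or uncoded neighbour is passed as `None`. All function names are hypothetical.

```python
def median3(a, b, c):
    """Rank order the three values and take the middle one."""
    return sorted((a, b, c))[1]


def candidates(left, above, above_right, top_border=False):
    """Apply the border rules in order and return (MV1, MV2, MV3).

    None stands for a neighbour that is outside the picture, intra coded
    or not coded; top_border is True at the top of the picture or when
    the GOB header of the current GOB is nonempty.
    """
    mv1 = left if left is not None else (0, 0)
    if top_border:
        mv2 = mv3 = mv1
    else:
        mv2 = above if above is not None else (0, 0)
        mv3 = above_right if above_right is not None else (0, 0)
    return mv1, mv2, mv3


def predict_mv(mv1, mv2, mv3):
    """Component-wise median of the three candidate predictors."""
    return (median3(mv1[0], mv2[0], mv3[0]),
            median3(mv1[1], mv2[1], mv3[1]))


def mv_difference(mv, predictor):
    """Vector difference that is passed on to the variable length coder."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


if __name__ == "__main__":
    pred = predict_mv(*candidates((2, -1), (4, 3), (-6, 0)))
    print(pred, mv_difference((3, 1), pred))
```

Taking the median per component means the predictor need not equal any one of the three candidate vectors, since its horizontal and vertical parts may come from different neighbours.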

The values of the difference components are limited to the range [−16, 15.5] pixels. Since, in H.263, the source images are of the CIF family with the 4:2:0 format, each macroblock comprises four luminance blocks and two chrominance blocks, Cb and Cr. The motion vector of the macroblock is used for all four luminance blocks in the macroblock. Motion vectors for both chrominance blocks are derived by dividing the component values of the macroblock vector by two, due to the lower chrominance resolution. The resulting values of the quarter pixel resolution vectors are modified towards the nearest half pixel position (note: the macroblock motion vector has half pixel resolution).
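A minimal sketch of the chrominance vector derivation follows. It assumes that the luminance component is stored as an integer number of half pixels, so that after halving it is an integer number of quarter pixels on the chrominance grid, and that both the 1/4 and 3/4 positions are moved to the 1/2 position. That rounding rule is an interpretation of "towards the nearest half pixel position" and should be checked against the Recommendation; the function name is hypothetical.

```python
def chroma_component(luma_half_pel):
    """Derive one chrominance vector component from a luminance one.

    luma_half_pel: luminance component in half pixel units (integer).
    Returns the chrominance component in half pixel units of the
    chrominance grid.
    """
    # Halving a half-pel value gives the same integer read as quarter pixels.
    magnitude = abs(luma_half_pel)
    whole, fraction = divmod(magnitude, 4)
    # Assumed rule: any fractional quarter position becomes a half position.
    half_units = 2 * whole + (1 if fraction else 0)
    return half_units if luma_half_pel >= 0 else -half_units


if __name__ == "__main__":
    for v in (0, 2, 3, 4, -5):
        print(v / 2, "luma pixels ->", chroma_component(v) / 2, "chroma pixels")
```

For example, a luminance displacement of 1.5 pixels halves to 0.75 and is then moved to 0.5 of a chrominance pixel.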

Source pictures

The source encoder operates on noninterlaced pictures at approximately 29.97 frames per second. These pictures can be one of the five standard picture formats of the CIF family: subQCIF, QCIF, CIF, 4CIF and 16CIF. Since in CIF the luminance and chrominance sampling format is 4:2:0, for each of these pictures the horizontal and vertical resolutions of the chrominance components are half those of the luminance. Table 9.1 summarises pixel resolutions of the CIF family used in H.263.

Table 9.1: Number of pixels per line and number of lines per picture for each of the H.263 picture formats

Picture format   Luminance pixels per line   Luminance lines per picture   Chrominance pixels per line   Chrominance lines per picture
subQCIF          128                         96                            64                            48
QCIF             176                         144                           88                            72
CIF              352                         288                           176                           144
4CIF             704                         576                           352                           288
16CIF            1408                        1152                          704                           576

Each picture is divided into groups of blocks (GOBs). A GOB comprises k × 16 lines, depending on the picture format (k = 1 for subQCIF, QCIF and CIF; k = 2 for 4CIF; k = 4 for 16CIF). The number of GOBs per picture is six for subQCIF, nine for QCIF, and 18 for CIF, 4CIF and 16CIF. Each GOB is divided into macroblocks of 16 × 16 pixels, each of which comprises four luminance blocks and one block of each chrominance component, all of 8 × 8 pixels.
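The GOB counts quoted above follow directly from the picture heights and the value of k, as the short sketch below shows. The luminance dimensions are those of the CIF family; the dictionary and function names are hypothetical.

```python
# Luminance width and height of each CIF-family format.
FORMATS = {
    "subQCIF": (128, 96),
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "4CIF": (704, 576),
    "16CIF": (1408, 1152),
}

# A GOB spans k * 16 luminance lines.
K = {"subQCIF": 1, "QCIF": 1, "CIF": 1, "4CIF": 2, "16CIF": 4}


def gob_layout(picture_format):
    """Return (GOBs per picture, macroblocks per GOB) for a format."""
    width, height = FORMATS[picture_format]
    k = K[picture_format]
    gobs = height // (16 * k)
    macroblocks_per_gob = (width // 16) * k   # k rows of macroblocks per GOB
    return gobs, macroblocks_per_gob


def chroma_size(picture_format):
    """4:2:0 sampling halves both chrominance dimensions."""
    width, height = FORMATS[picture_format]
    return width // 2, height // 2


if __name__ == "__main__":
    for name in FORMATS:
        print(name, gob_layout(name), chroma_size(name))
```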

Picture layer

The picture layer contains the picture header, the GOB header together with various coding decisions on macroblocks in a GOB and finally the coded transform coefficients, which are also used in H.261 and MPEG-1 and 2. The most notable difference in the header information for H.263 is in the type information, called PTYPE. For the first generation of H.263, this is a 13-bit code that gives information about the complete picture, in the form of [1]:

  • bit 1: always 1, in order to avoid start code emulation
  • bit 2: always 0, for distinction with H.261
  • bit 3: split screen indicator, 0 off, 1 on
  • bit 4: document camera indicator, 0 off, 1 on
  • bit 5: freeze picture release, 0 off, 1 on
  • bits 6–8: source format, 000 forbidden, 001 subQCIF, 010 QCIF, 011 CIF, 100 4CIF, 101 16CIF, 110 reserved, 111 extended PTYPE
  • bit 9: picture coding type, 0 intra, 1 inter
  • bit 10: optional unrestricted motion vector mode, 0 off, 1 on
  • bit 11: optional syntax-based arithmetic coding mode, 0 off, 1 on
  • bit 12: optional advanced prediction mode, 0 off, 1 on
  • bit 13: optional PB frame mode, 0 normal picture, 1 PB frame
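As an illustration of how these 13 bits might be interpreted, the following sketch decodes a PTYPE given as a string of 0s and 1s with bit 1 first. The function and field names are hypothetical, and the extended PTYPE (source format 111) is only reported, not parsed.

```python
SOURCE_FORMATS = {
    "000": "forbidden",
    "001": "subQCIF",
    "010": "QCIF",
    "011": "CIF",
    "100": "4CIF",
    "101": "16CIF",
    "110": "reserved",
    "111": "extended PTYPE",
}


def parse_ptype(bits):
    """Decode the 13-bit PTYPE of first generation H.263 (bit 1 first)."""
    if len(bits) != 13 or set(bits) - {"0", "1"}:
        raise ValueError("PTYPE must be a string of 13 binary digits")
    if bits[0] != "1" or bits[1] != "0":
        raise ValueError("bit 1 must be 1 and bit 2 must be 0")

    def flag(n):
        return bits[n - 1] == "1"   # bit numbers in the text start at 1

    return {
        "split_screen": flag(3),
        "document_camera": flag(4),
        "freeze_picture_release": flag(5),
        "source_format": SOURCE_FORMATS[bits[5:8]],
        "picture_coding_type": "inter" if flag(9) else "intra",
        "unrestricted_motion_vector": flag(10),
        "syntax_based_arithmetic_coding": flag(11),
        "advanced_prediction": flag(12),
        "pb_frames": flag(13),
    }


if __name__ == "__main__":
    print(parse_ptype("1000001010000"))
```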

The split screen indicator is a signal that indicates that the upper and lower half of the decoded picture could be displayed side by side. This has no direct effect on the encoding and decoding of the picture.

The freeze picture release is a signal from an encoder which responds to a request for packet retransmission (if not acknowledged) or fast update request, and allows a decoder to exit from its freeze picture mode and display decoded picture in the normal manner.

Bits 10–13 refer to the four early optional modes of H.263. Since 1995 more options have been added as annexes to the extensions of this codec. These optional modes are activated when bits 6–8 of the PTYPE header are in the extended mode of 111, and some additional bits are then needed to define the new options. Hence extensions of H.263 have a longer PTYPE header, and also a different picture layer, than the 13-bit form above. All these optional modes are only used after negotiation between the encoder and the decoder via the control protocol Recommendation H.245 [9].

Also, for further reduction in the overhead, the codes for macroblock type and coded block pattern are combined. The combined code of macroblock type and coded block pattern is called MCBPC. MCBPC is always present for each macroblock, irrespective of its type and the options used. Note that in H.261 and in MPEG-1 and 2 the coded block pattern is defined separately from the macroblock type.

[1] To find the median value, the components are rank ordered and the middle value is chosen.

Switched multipoint

One of the initial aims in the design of a new low bit rate video codec was to replace H.261 with a more efficient codec, which took the name H.263. Hence the functionalities of H.261, with some improvements, needed to be included in this codec; they appear in Annex C of the H.263 specification and can be activated or disabled as desired [25-C]. Since this annex was introduced before all the other annexes, which were later called optionalities, we introduce it before the other options.

In H.263 the decoder can be instructed to alter its normal decoding mode and provide some extra display functions. Instructions for the alterations may be issued by external means such as Recommendation H.245, which is a control protocol for multimedia communications [9]. Some of the commands and the actions are as follows.

Freeze picture request

This signal causes the decoder to freeze its displayed picture until a freeze release signal is received or a time-out period of at least six seconds has expired. A frozen picture is much better perceived by the viewer than, say, a broken picture due to channel errors, or if the encoder cannot deliver the compressed bit stream on time for a continuous display.

Fast update request

This command causes the encoder to encode its next picture in intra mode, with coding parameters to avoid buffer overflow. This mode in conjunction with the back channel reduces the probability of error propagation into the subsequent pictures. This mode improves the resilience of the codec to channel errors.

Freeze picture release

Freeze picture release is a signal from the encoder, which has responded to a fast update request, and allows a decoder to exit from its freeze mode and display decoded pictures in the normal manner. This signal is transmitted by the PTYPE in the picture header of the first picture coded in response to the fast update request.

Continuous presence multipoint

In a multipoint connection, a multipoint control unit (MCU) can assemble two to four video bit streams into one video bit stream, so that at the receiver up to four different video signals can be displayed simultaneously. In H.261, this can be done on a quad screen by only editing the GOB header, but in H.263 it is more complex due to a different GOB structure, overlap motion estimation, multiple motion vectors etc. Therefore, in H.263 a special continuous presence multipoint mode is provided in which four independent video bit streams are transmitted in the four logical channels of a single H.263 video bit stream.

Extensions of H.263

In the late 1990s the Video Coding Experts Group (VCEG) of the ITU Telecommunication Standardisation Sector set up two activities. The aim was to develop very low bit rate video coding at bit rates less than 64 kbit/s and more specifically at less than 24 kbit/s. One activity is looking at video coding for very low bit rates, under the name of H.263+ [5]. Recently the work of this activity has continued under the name of H.263++, indicating further improvements on H.263+ [6]. The other activity, which has more in common with MPEG-4, is work on advanced low bit rate video coding, under the name of H.26L [7].

The H.263+/H.263++ development effort is intended for short term standardisation of enhancements of the H.263 video coding algorithm for real-time telecommunication and related nonconversational services. The H.26L development effort is aimed at identifying new video coding technology beyond the capabilities of enhancements to H.263 by the H.263+/H.263++ coding algorithms.

These two subgroups also cooperate closely in the development of their codecs, since the core codec is still H.263. They also work closely with the other bodies of ITU. For example, the collaboration between H.263+ and the mobile group has led to consideration for greater video error resilience capability. The back channel error resilience in H.263+ is especially designed to address the needs of mobile video and other such unreliable bit stream transport environments. The H.26L group works very closely with the MPEG-4 group, as this group has the mandate of developing advanced video coding for storage and broadcasting applications [7].

One of the key applications of H.263+, H.263++ and H.26L is real-time audio-visual conversational services. In a real-time application, information is simultaneously acquired, processed and transmitted and is usually used immediately at the receiver. This implies critical delay and complexity constraints on the codec algorithm.

An important component in any application is the transmission media over which it will need to operate. The transmission media for H.263+/H.26L applications include PSTN, ISDN (1B), dial-up switched-56/64 kbit/s service, LANs, mobile networks (including GSM, DECT, UMTS, FPLMTS, NADC, PCS etc.), microwave and satellite networks, digital storage media (i.e. for immediate recording) and concatenations of the above media. Due to the large number of likely transmission media and the wide variations in the media error and channel characteristics, error resiliency and recovery are critical requirements for this application class.

Scope and goals of H.263+

The expected enhancements of H.263+ over H.263 fall into two basic categories:

  • enhancing quality within existing applications
  • broadening the current range of applications.

A few examples of the enhancements are:

  • improving perceptual compression efficiency
  • reducing video coding delay
  • providing greater resilience to bit errors and data losses.

Note that H.263+ has all the features of H.263, and further tools are added to this codec to increase its coding efficiency and its robustness to errors. This is an ongoing process and more tools are added every year. Recently this codec has been designated H.263++, to emphasise the ongoing improvement in the coding efficiency [6].

Scope and goals of H.26L

The long-term objective of the ITU-T video experts group, under the Advanced Video Coding project, is to provide a video coding recommendation which at very low bit rates can perform substantially better than that achievable with the existing standards (e.g. H.263+). The adopted technology should provide for:

  • enhanced visual quality at very low bit rates and particularly at PSTN rates (e.g. at rates below 24 kbit/s)
  • enhanced error robustness in order to accommodate the higher error rates experienced when operating for example over mobile links
  • low complexity appropriate for small, relatively inexpensive, audio-visual terminals
  • low end-to-end delay as required in bidirectional personal communications.

In addition, the group is closely working with the MPEG-4 experts group, to include new coding methods and promote interoperability. Advances in this direction will be discussed in detail at the end of this Chapter.

Optional modes of H.263

In the course of development of H.263 numerous optional modes have been added as annexes to the main specifications to improve the visual communication efficiency of this codec. Some of the annexes were introduced along with the introduction of the core H.263, and many more were added gradually under H.263+ and H.263++. It is not the intention to introduce here all these annexes, nor to specify which annex belongs to which generation of the codec. Instead, we try to classify them into groups, without specifying in which generation they were introduced, rather to give a better appreciation of these optional modes in improving coding efficiency. As we will see, since the H.263 video codec is primarily aimed for mobile video communications, and UHF channels are particularly prone to channel errors, the majority of the annexes deal with the protection of visual data against channel errors.

Advanced motion estimation/compensation

Motion estimation/compensation is probably the most evolutionary coding tool in the history of video coding. For every previous generation of video codecs, motion estimation has been considered as a means of improving coding efficiency. In the first video codec (H.120) under COST211, which was a DPCM-based codec, working on pixel-by-pixel, motion estimation for each pixel would have been costly and hence it was never used. Motion estimation was made optional for the H.261 block-based codec, on the grounds that the DCT of this codec is to decorrelate interframe pixels, and since motion compensation reduces this correlation, then nothing is left for DCT! Motion compensation in MPEG-1 was considered seriously, since for B-pictures, which refer to both past and future, the motion of objects even those hidden in the background can be compensated. It was so efficient that it was also recommended for P-pictures, and even with a half pixel precision. In MPEG-2, due to interlacing, a larger variety of motion estimation/compensation between fields and frames, or their combinations, was introduced. The improvement in coding efficiency was at the cost of additional overhead for delivering the motion vectors to the receiver. However, for MPEG-1 and 2, coding at a rate of 1–5 Mbit/s, this overhead is negligible, but the question is how this overhead can be justified for H.263 at a rate of 24 kbit/s or less? In fact, as we will see below, some extensions of H.263 recommend using smaller block sizes, which imply more motion vectors per picture and they even suggest motion estimation precision should be at a quarter or one eighth of a pixel. All these increase the motion vector overhead, which is very significant at a very low bit rate of 24 kbit/s.

The fact is that, if motion compensation is efficient, the motion compensated pictures may not need to be coded by the DCT. That is, these blocks are only represented by their motion vectors. In H.261 and MPEG-1, we have seen a form of macroblock that was coded only by the motion vector, without coding the motion compensated error. However, if the expected video quality is low and the motion estimation is efficient, then we can see more of these macroblocks in a codec. This is in fact what is happening with motion compensation in H.263. In the following sections some improvements to motion estimation/compensation, in addition to those introduced for H.261, MPEG-1 and 2, will be discussed. At the end of this section a form of motion estimation that warps the picture for better compensation of complex motion will be introduced. Although this method is not a part of any form of H.263, nor is it recommended for other video coding standards, there is no reason why this form of motion estimation and compensation cannot be used in future video codecs.

Unrestricted motion vector

In the default prediction mode of H.263, motion vectors are restricted so that all pixels referenced by them are within the coded picture area. In the optional unrestricted motion vector mode this restriction is removed and therefore motion vectors are allowed to point outside the picture [25-D]. When a pixel referenced by a motion vector is outside of the coded picture area, an edge pixel is used instead. This edge pixel is found by limiting the motion vector to the last full pixel position inside the coded picture area. Limitation of the motion vector is performed on a pixel-by-pixel basis and separately for each component of the motion vector.
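The edge pixel substitution can be sketched as a clamp of each coordinate to the coded picture area, applied pixel by pixel. The sketch below works at full pixel positions only and assumes the reference picture is held as a list of rows; the function names are hypothetical.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))


def fetch_pixel(reference, x, y):
    """Read a reference pixel, substituting the nearest edge pixel when
    (x, y) lies outside the coded picture area.

    Each coordinate is limited separately, which mirrors limiting the
    horizontal and vertical vector components independently.
    """
    height = len(reference)
    width = len(reference[0])
    return reference[clamp(y, 0, height - 1)][clamp(x, 0, width - 1)]


def predict_block(reference, x0, y0, mv_x, mv_y, size=16):
    """Motion compensated block for a full pixel vector (mv_x, mv_y)."""
    return [[fetch_pixel(reference, x0 + col + mv_x, y0 + row + mv_y)
             for col in range(size)]
            for row in range(size)]


if __name__ == "__main__":
    ref = [[1, 2, 3], [4, 5, 6]]
    print(predict_block(ref, 0, 0, -1, -1, size=2))
```

The effect is the same as extending the picture by repeating its border pixels, which helps when objects move into or out of the scene or when the camera pans.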

Advanced prediction

The optional advanced prediction mode of H.263 employs overlapped block matching motion compensation and may have four motion vectors per macroblock [25-F]. The use of this mode is indicated in the macroblock-type header. This mode is only used in combination with the unrestricted motion vector mode [25-D], described above.

Four motion vectors per macroblock

In H.263, one motion vector per macroblock is used except in the advanced prediction mode, where either one or four motion vectors per macroblock are employed. In this mode, the motion vectors are defined for each 8 × 8 pixel block. If only one motion vector for a certain macroblock is transmitted, this is represented as four vectors with the same value. When there are four motion vectors, the information for the first motion vector is transmitted as the codeword MVD (motion vector data), and the information for the three additional vectors in the macroblock is transmitted as the codeword MVD2-4.

The vectors are obtained by adding predictors to the vector differences indicated by MVD and MVD2-4, as was the case when only one motion vector per macroblock was present (see section 9.1.2). Again, the predictors are calculated separately for the horizontal and vertical components. However, the candidate predictors MV1, MV2 and MV3 are redefined as indicated in Figure 9.3.

Figure 9.3: Redefinition of the candidate predictors MV1, MV2 and MV3 for each luminance block in a macroblock

As Figure 9.3 shows, the neighbouring 8 × 8 blocks that form the candidates for the prediction of the motion vector MV take different forms, depending on the position of the block in the macroblock. Note, if only one motion vector in the neighbouring macroblocks is used, then MV1, MV2 and MV3 are defined as 8 × 8 block motion vectors, which possess the same motion vector as the macroblock.

Overlapped motion compensation

Overlapped motion compensation is only used for 8 × 8 luminance blocks. Each pixel in an 8 × 8 luminance prediction block is the weighted sum of three prediction values, divided by eight (with rounding). To obtain the prediction values, three motion vectors are used. They are the motion vector of the current luminance block and two out of four remote vectors, as follows:

  • the motion vector of the block at the left or right side of the current luminance block
  • the motion vector of the block above or below the current luminance block.

The remote motion vectors from other groups of blocks (GOBs) are treated the same way as the remote motion vectors inside the GOB.

For each pixel, the remote motion vectors of the block at the two nearest block borders are used. This means that for the upper half of the block the motion vector corresponding to the block above the current block is used, and for the lower half of the block the motion vector corresponding to the block below the current block is used, as shown in Figure 9.4. In this Figure, the neighbouring pixels closer to the pixels in the current block take greater weights.

Figure 9.4: Weighting values for prediction with motion vectors of the luminance blocks on top or bottom of the current luminance block, H1 (i, j)

Similarly, for the left half of the block, the motion vector corresponding to the block at the left side of the current block is used, and for the right half of the block the motion vector corresponding to the block at the right side of the current block is used, as shown in Figure 9.5.

Figure 9.5: Weighting values for prediction with motion vectors of luminance blocks to the left or right of current luminance block, H2(i, j)

The creation of each interpolated (overlapped) pixel, p(i, j), in an 8 × 8 reference luminance block is governed by:

$$p(i,j) = \left[\, q(i,j)\,H_0(i,j) + r(i,j)\,H_1(i,j) + s(i,j)\,H_2(i,j) + 4 \,\right] / 8$$

where q(i, j), r(i, j) and s(i, j) are the motion compensated pixels from the reference picture, p_ref, with the three motion vectors defined by:

$$q(i,j) = p_{ref}(i + MV^0_x,\; j + MV^0_y)$$
$$r(i,j) = p_{ref}(i + MV^1_x,\; j + MV^1_y)$$
$$s(i,j) = p_{ref}(i + MV^2_x,\; j + MV^2_y)$$

where (MV^0_x, MV^0_y) denotes the motion vector for the current block, (MV^1_x, MV^1_y) denotes the motion vector of the block either above or below and (MV^2_x, MV^2_y) denotes the motion vector of the block either to the left or right of the current block. The matrices H0(i, j), H1(i, j) and H2(i, j) are the current, top-bottom and left-right weighting matrices, respectively. Weighting matrices of H1(i, j) and H2(i, j) are shown in Figures 9.4 and 9.5, respectively, and the weighting matrix for prediction with the motion vector of the current block, H0(i, j), is shown in Figure 9.6.

Figure 9.6: Weighting values for prediction with motion vector of current block, H0(i, j)

If one of the surrounding blocks is not coded or is in intra mode, the corresponding remote motion vector is set to zero. However, in PB frames mode (see section 9.5), a candidate motion vector predictor is not set to zero if the corresponding macroblock is in intra mode.

If the current block is at the border of the picture and therefore a surrounding block is not present, the corresponding remote motion vector is replaced by the current motion vector. In addition, if the current block is at the bottom of the macroblock, the remote motion vector corresponding with an 8 × 8 luminance block in the macroblock below the current macroblock is replaced by the motion vector for the current block.
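The weighted sum described in this section can be sketched as follows. The weighting values are given here as recalled from Annex F of the Recommendation, since Figures 9.4 to 9.6 are not reproduced, and should be treated as an assumption to be checked against the standard; the three weights at every pixel position must in any case add up to eight. Matrices are indexed as [row][column] and the function names are hypothetical.

```python
# Weights for the current block vector (assumed values, see lead-in).
H0 = [
    [4, 5, 5, 5, 5, 5, 5, 4],
    [5, 5, 5, 5, 5, 5, 5, 5],
    [5, 5, 6, 6, 6, 6, 5, 5],
    [5, 5, 6, 6, 6, 6, 5, 5],
    [5, 5, 6, 6, 6, 6, 5, 5],
    [5, 5, 6, 6, 6, 6, 5, 5],
    [5, 5, 5, 5, 5, 5, 5, 5],
    [4, 5, 5, 5, 5, 5, 5, 4],
]
# Weights for the vector of the block above or below.
H1 = [
    [2, 2, 2, 2, 2, 2, 2, 2],
    [1, 1, 2, 2, 2, 2, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 2, 2, 2, 2, 1, 1],
    [2, 2, 2, 2, 2, 2, 2, 2],
]
# Weights for the vector of the block to the left or right.
H2 = [
    [2, 1, 1, 1, 1, 1, 1, 2],
    [2, 2, 1, 1, 1, 1, 2, 2],
    [2, 2, 1, 1, 1, 1, 2, 2],
    [2, 2, 1, 1, 1, 1, 2, 2],
    [2, 2, 1, 1, 1, 1, 2, 2],
    [2, 2, 1, 1, 1, 1, 2, 2],
    [2, 2, 1, 1, 1, 1, 2, 2],
    [2, 1, 1, 1, 1, 1, 1, 2],
]


def remote_vectors(row, col, mv_above, mv_below, mv_left, mv_right):
    """Pick the two remote vectors from the two nearest block borders."""
    vertical = mv_above if row < 4 else mv_below
    horizontal = mv_left if col < 4 else mv_right
    return vertical, horizontal


def obmc_pixel(q, r, s, row, col):
    """Weighted sum of the three predictions, divided by eight with rounding."""
    total = q * H0[row][col] + r * H1[row][col] + s * H2[row][col]
    return (total + 4) // 8


if __name__ == "__main__":
    print(obmc_pixel(80, 120, 40, 0, 0))
```

When all three vectors point to the same prediction value the result is unchanged, so the overlap only has an effect where neighbouring blocks move differently.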

Importance of motion estimation

In order to demonstrate the importance of motion compensation and to some extent the compression superiority of H.263 over H.261 and MPEG-1, in an experiment the CIF test image sequence Claire was coded at 256 kbit/s (30 frames/s) with the following encoders:

  • H.261
  • MPEG-1, with a GOP length of 12 frames and two B-frames between the anchor pictures, i.e. N = 12 and M = 3 (MPEG-GOP)
  • MPEG-1, with only P-pictures, i.e. N = ∞ and M = 1 (MPEG-IPPPP...)
  • H.263 with advanced mode (H.263-ADV).

Figure 9.7 illustrates the peak signal-to-noise ratio (PSNR) of the coded sequence. At this bit rate, the worst performance is that of MPEG-1, with a GOP structure of 12 frames per GOP, and two B-frames between the anchor pictures (IBBPBBPBBPBBIBB...). The main reason for the poor performance of this codec at this bit rate is that I-pictures consume most of the bits and, compared with the other coding modes, relatively fewer bits are assigned to the P and B-pictures.

Figure 9.7: PSNR of Claire sequence coded at 256 kbit/s, with MPEG-1, H.261 and H.263
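For reference, the PSNR of a decoded picture is conventionally computed from the mean squared error against the original, with a peak value of 255 for 8-bit video. The sketch below assumes this usual definition and that pictures are held as lists of rows of luminance values; it is not taken from the experiment reported here.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-sized pictures."""
    squared_error = 0
    count = 0
    for row_o, row_d in zip(original, decoded):
        for a, b in zip(row_o, row_d):
            squared_error += (a - b) ** 2
            count += 1
    if squared_error == 0:
        return math.inf                    # identical pictures
    # 10 log10(peak^2 / MSE), written without forming the MSE first
    return 10 * math.log10(peak * peak * count / squared_error)


if __name__ == "__main__":
    a = [[100, 100], [100, 100]]
    b = [[101, 99], [100, 102]]
    print(round(psnr(a, b), 2), "dB")
```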

The second poorest is the H.261, where all the consecutive pictures are interframe coded with an integer pixel precision motion compensation. The second best performance is the MPEG-1 with only P-pictures. It is interesting to note that this mode is similar to H.261 (every frame is predictively coded), except that motion compensation is carried out with half pixel precision. Hence this mode shows the advantage of using half pixel precision motion estimation. The amount of improvement for the used sequence at 256 kbit/s is almost 2 dB.

Finally, the best performance comes from the advanced mode of H.263, which results in an almost 4 dB improvement over the best of MPEG-1 and 6 dB over H.261. The following are some of the factors that may have contributed to such a good performance:

  • motion compensation on smaller block sizes of 8 x 8 pixels results in smaller error signals than for the macroblock compensation used in the other codecs
  • overlapped motion compensation; by removing the blocking artefacts on the block boundaries, the prediction picture has a better quality, so reducing the error signal, and hence the number of significant DCT coefficients
  • efficient coding of DCT coefficients through three-dimensional (last, run, level)
  • efficient representation of the combined macroblock type and block patterns.

Note that, in this experiment, other options such as PB frames mode, arithmetic coding etc. were not used. Had arithmetic coding been used, it is expected that the picture quality would be further improved by 1–2 dB. Experimental results have confirmed that arithmetic coding has approximately 5–10 per cent better compression efficiency than Huffman coding [10].

Deblocking filter

At very low bit rates, the blocks of pixels are mainly made of low frequency DCT coefficients. In these areas, when there is a significant difference between the DC levels of the adjacent blocks, they appear as block borders. At the extreme case pictures break into blocks, and the blocking artefacts can be very annoying.

The overlapped block matching motion compensation to some extent reduces these blocking artefacts. For further reduction in the blockiness, the H.263 specification recommends deblocking of the picture through the block edge filter [25-J]. The filtering is performed on 8 × 8 block edges and assumes that 8 × 8 DCT is used and the motion vectors may have either 8 × 8 or 16 × 16 resolution. Filtering is equally applied to both luminance and chrominance data and no filtering is permitted on the frame and slice edges.

Consider four pixels A, B, C and D on a line (horizontal or vertical) of the reconstructed picture, where A and B belong to block 1 and C and D belong to a neighbouring block 2, which is either to the right of or below block 1, as shown in Figure 9.8.

Figure 9.8: Filtering of pixels at the block boundaries

In order to turn the filter on for a particular edge, either block 1 or block 2 should be an intra or a coded macroblock with the code COD = 0. In this case B1 and C1 replace values of the boundary pixels B and C, respectively, where:


The amount of alteration of pixels, ±d1, is related to a function of pixel differences across the block boundary, d, and the quantiser parameter QP, as shown in eqn. 9.4. The sign of d1 is the same as the sign of d.

Figure 9.9 shows how the value of d1 changes with d and the quantiser parameter QP, to make sure that only block edges which may suffer from blocking artefacts are filtered, and not the natural edges. As a result of this modification, only those pixels on the edge are filtered so that their luminance changes are less than the quantisation parameter, QP.

Figure 9.9: d1 as a function of d

Motion estimation/compensation with spatial transforms

The motion estimation we have seen so far is based on matching a block of pixels in the current frame against a similar size block of pixels in the previous frame, the so-called block matching algorithm (BMA). It relies on the assumptions that the motion of objects is purely translational and the illumination is uniform, which of course are not realistic. In practice, motion has a complex nature that can be decomposed into translation, rotation, shear, expansion and other deformation components and the illumination changes are nonuniform. To compensate for these nonuniform changes between the frames a block of pixels in the current frame can be matched against a deformed block in the previous frame. The deformation should be such that all the components of the complex motion and illumination changes are included.

A practical method for deformation is to transform a square block of N × N pixels into a quadrilateral of irregular shape, as shown in Figure 9.10.

Figure 9.10: Mapping of a block to a quadrilateral

One of the methods for this purpose is the bilinear transform, defined as [11]:


where a pixel at a spatial coordinate (u, v) is mapped onto a pixel at coordinate (x, y). To determine the eight unknown mapping parameters α0-α7, eight simultaneous equations relating the coordinates of vertices A, B, C and D into E, F, G and H of Figure 9.10 must be solved. To ease computational load, all the coordinates are offset to the position of coordinates at u = 0 and v = 0, as shown in the Figure. Referring to the Figure, the eight mapping parameters are derived as:

(9.6)

where k = N - 1 and N is the block size, e.g. for N = 16, k = 15.
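As an illustration of how such mapping parameters can be obtained, the sketch below solves for the coefficients of a bilinear map in closed form when the square block has its corners at (0, 0), (k, 0), (0, k) and (k, k). The parameter ordering x = a0 + a1·u + a2·v + a3·u·v (and likewise for y with a4 to a7), the corner-to-vertex correspondence and the function names are all assumptions made for this example, and they may differ from the ordering used in eqns 9.5 and 9.6.

```python
def bilinear_parameters(k, p00, pk0, p0k, pkk):
    """Return [a0..a7] for x = a0 + a1*u + a2*v + a3*u*v and
    y = a4 + a5*u + a6*v + a7*u*v (assumed ordering).

    p00, pk0, p0k and pkk are the (x, y) vertices onto which the block
    corners (0, 0), (k, 0), (0, k) and (k, k) are mapped.
    """
    params = []
    for axis in (0, 1):  # 0 -> x coefficients, 1 -> y coefficients
        c00, ck0, c0k, ckk = p00[axis], pk0[axis], p0k[axis], pkk[axis]
        params += [
            c00,
            (ck0 - c00) / k,
            (c0k - c00) / k,
            (c00 - ck0 - c0k + ckk) / (k * k),
        ]
    return params


def map_pixel(params, u, v):
    """Map block coordinate (u, v) to quadrilateral coordinate (x, y)."""
    a0, a1, a2, a3, a4, a5, a6, a7 = params
    return (a0 + a1 * u + a2 * v + a3 * u * v,
            a4 + a5 * u + a6 * v + a7 * u * v)


if __name__ == "__main__":
    k = 15  # N = 16
    params = bilinear_parameters(k, (1, 2), (17, 1), (-1, 16), (18, 19))
    print(map_pixel(params, 0, 0), map_pixel(params, k, k))
```

Because the corners are offset to the origin, each coordinate needs only differences of vertex positions and a division by k or k², which is the computational saving the offsetting is meant to provide.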

To use this kind of spatial transformation as a motion estimator, the four corners E, F, G and H in the previous frame are chosen among the pixels within a vertex search window. Using the offset coordinates of these pixels in eqn. 9.6, the motion parameters that transform all the pixels in the current square macroblock of ABCD into a quadrilateral are derived. The positions of E, F, G and H that result in the lowest difference between the pixels in the quadrilateral and the pixels in the transformed block of ABCD are regarded as the best match. Then α0-α7 of the best match are taken as the parameters of the transform that define the motion estimation by spatial transformation.

It is obvious that in general the number of pixels in the quadrilateral is not equal to the N² pixels in the square block. To match these two unequal size blocks, the corresponding pixel locations of the square block in the quadrilateral must be determined. Since these in general do not coincide with the pixel grid, their values should be interpolated from the intensity of their four surrounding neighbours, I0, I1, I2 and I3, as shown in Figure 9.11.

Figure 9.11: Intensity interpolation of a nongrid pixel

The interpolated intensity of the mapped pixels, I, from its four immediate neighbours, is inversely proportional to their distances and is given by:

I = (1 - a)(1 - b)I0 + a(1 - b)I1 + (1 - a)bI2 + abI3

which is simplified to

I = (I1 - I0)a + (I2 - I0)b + (I0 + I3 - I1 - I2)ab + I0

where a and b are the horizontal and vertical distances of the mapped pixel from the pixel with intensity I0.
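The simplified expression can be checked with a few lines of code. The sketch below evaluates it alongside the equivalent weighted form, in which each neighbour is weighted by the product of the distances to the opposite corner. It assumes that the pixel spacing is normalised to one, so that a and b lie between 0 and 1, and that I1 is the horizontal neighbour of I0, I2 the vertical neighbour and I3 the diagonal one, as the coefficients of the formula imply. The function names are hypothetical.

```python
def interpolate(i0, i1, i2, i3, a, b):
    """Simplified bilinear interpolation; a and b are offsets from the I0 pixel."""
    return (i1 - i0) * a + (i2 - i0) * b + (i0 + i3 - i1 - i2) * a * b + i0


def interpolate_weighted(i0, i1, i2, i3, a, b):
    """Equivalent form with explicit weights for the four neighbours."""
    return ((1 - a) * (1 - b) * i0 + a * (1 - b) * i1
            + (1 - a) * b * i2 + a * b * i3)


if __name__ == "__main__":
    print(interpolate(10, 20, 30, 40, 0.5, 0.5))
    print(interpolate_weighted(10, 20, 30, 40, 0.5, 0.5))
```

The simplified form needs three multiplications per pixel instead of eight, which matters because every mapped pixel of every candidate quadrilateral has to be interpolated.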

Note that this type of motion estimation/compensation is much more complex than simple block matching. First, in block matching for a maximum motion speed of ω pixels/frame, there are (2ω + 1)² matching operations, but in the spatial transforms, since each vertex is free to move in any direction, the number of matching operations becomes (2ω + 1)⁸. Second, each operation is more complex than in BMA, since for each match the transformation parameters α0-α7 have to be calculated, and all the mapped pixels have to be interpolated.

There are numerous methods of simplifying the operations [11]. For example, a fast matching algorithm, like the orthogonal search algorithm (OSA) introduced in section 3.3, can be used. Since for a motion speed of eight pixels/frame OSA needs only 13 operations, the total number of search operations becomes 13⁴ = 28 561, which is practical and is much less than the (2 × 8 + 1)⁸ ≈ 7 × 10⁹ operations of the brute force method, which is not practical! Also, the use of simplified interpolation reduces the interpolation complexity to some extent [12].
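The operation counts quoted above follow directly from the number of candidate positions per vertex, as the following short calculation shows. The function names are hypothetical and the figures count matching operations only, not the cost of each match.

```python
def bma_operations(w):
    """Full-search block matching: one vector, (2w + 1)^2 candidate positions."""
    return (2 * w + 1) ** 2


def spatial_transform_operations(w, vertices=4):
    """Full search with every vertex free to move independently."""
    return bma_operations(w) ** vertices


def fast_search_operations(operations_per_vertex, vertices=4):
    """Fast search (e.g. OSA) applied to each of the vertices."""
    return operations_per_vertex ** vertices


if __name__ == "__main__":
    print(bma_operations(8))                # 289
    print(spatial_transform_operations(8))  # 6975757441
    print(fast_search_operations(13))       # 28561
```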

In order to appreciate the motion compensation capability of this method, a head-and-shoulders video sequence was recorded at almost 12 frames per second, in which the head moves from one side to the other in three seconds. Assuming that the first frame is available at the decoder, the remaining 35 frames were reconstructed from the motion vectors only, and no motion compensated error was coded. Figure 9.12 shows frames 5, 15, 25 and 35 of the pictures reconstructed by the bilinear transform, called here block matching with spatial transform (BMST), and by the conventional block matching algorithm (BMA). At frame 5, where the eye should be closed, the BMA cannot track it, as this method only translates the initial eye position of frame one, where it was open, but BMST tracks it well. Also, throughout the sequence, the BMST tracks all the eye and head movements (plus the opening and closing of the mouth) and produces a picture of reasonably good quality, whereas BMA produces a noisy image.

Figure 9.12: Reconstructed pictures with the BMST and BMA motion vectors operating individually

To explain why the BMST can produce such a remarkable performance, consider Figure 9.13, where the reconstructed pictures around frame 30, i.e. frames 28–32, are shown. Looking at the back of the ear, we see that from frame to frame the hair grows, such that at frame 32 it looks quite natural. This is because if, for example, the quadrilateral is only made up of a single black dot, and it is then interpolated over the 16 × 16 pixels, to be matched to the current block of the same size in hair, then the current block can be made from a single pixel.

Figure 9.13: Frame by frame reconstruction of the pictures by BMST

Note that in BMST, each block requires eight transformation parameters equal to four motion displacements at the four vertices of the quadrilateral. Either the eight parameters α0-α7 or four displacement vectors at the four vertices of the quadrilateral as four motion vectors should be sent. The second method is preferred, since α0-α7 are in general noninteger values and need more bits than the four motion vectors. Hence, the motion vector overhead of this method is four times that of BMA for the same block size. However, if BMA of 8 × 8 pixels is used as in the advanced prediction mode [25-F], then we have the same overhead. Again this is irrespective of the block size, since BMA compensates for translational motion only, and cannot produce any better results than those above.

One way of reducing the motion vector overhead is to force the vertices of the four adjacent quadrilaterals to a common vertex. This generates a net-like structure or mesh, as shown in Figure 9.14a. As can be seen, the motion compensated picture (Figure 9.14b) is smooth and free from blocking artefacts.

Figure 9.14: Mesh-based motion compensation (a) mesh (b) motion compensated picture

To generate such a mesh, the three vertices of a quadrilateral are fixed to their immediate neighbours and only one (the bottom right vertex) is free to move. This constrains the efficiency of the motion estimation, and for better performance motion estimation for the whole frame has to be iterated several times. Thus it is not expected to perform as well as the unconstrained movement of the vertices applied to Figure 9.12. Despite this, since mesh-based motion estimation creates a smooth boundary between the quadrilaterals, the motion compensated picture will be free from blockiness. Also, it needs only one motion vector per quadrilateral, similar to BMA. Thus this mesh-based motion estimation is expected to be better than the BMA with the same motion vector overhead, but of course with increased computational complexity. Figure 9.15 compares the motion compensation efficiency of the full search block matching algorithm (BM-FSA) with quadrilateral matching using three-step search (QM-TSS) and the mesh-based iterative algorithm (MB-ITR). The motion compensation is applied between the incoming pictures, to eliminate accumulation of errors. Also, since QM requires four motion vectors per block, in order to reduce motion vector overhead, each macroblock is first tested with BMA, and if the BMA motion compensated error is larger than a given threshold, QM is used, otherwise BMA is used. Hence the QM overhead is less than four times that of BMA. Our investigations show that for head-and-shoulders type pictures, about 20–30 per cent of the macroblocks need QM and the rest can be faithfully compensated by the BMA method. In this Figure, QM also used overlap motion compensation [13]. However, the mesh-based (MB) method requires the same overhead as BMA (slightly less, since no vectors are needed at the picture borders).

Figure 9.15: Performance of spatial transform motion compensation

As the Figure shows, mesh-based motion compensation is superior to the conventional block matching technique, with the same motion vector overhead. Considering the smooth motion compensated picture of the mesh-based method (Figure 9.14) and its superiority over block matching, it is a good candidate to be used in standard codecs.

More information on motion estimation with spatial transforms is given in [11,14,15]. In these papers some other spatial transforms such as Affine and Perspective are also tested. Methods for their use in a video codec to generate equal overhead to those used in H.263 are also explained.

Treatment of B pictures

B-pictures play an important role in low bit rate applications. If they are coded at lower quality, the quantisation distortion is not accumulated (since they are not used for prediction, see section 7.6). This is not the case for P-pictures, where any gain in reducing the bits in one frame may have to be returned at a higher cost later, when the distortion accumulates in a noise-like signal, which is difficult to code. For very low bit rate video, such as video for mobile networks, normally the frame rate is low (e.g. 5–10 frames/s), and hence the number of B-pictures between the anchor P and I-pictures cannot be large. A single B-picture appears to be the ideal choice. Also, in these applications I-pictures are hardly used, or if they are used, the GOP length is normally very large. Hence it is plausible to assume, if there is any B-picture in a video, that it is accompanied by a neighbouring P-picture. Thus one can nearly always code B-pictures in relation to their P-picture counterparts, and interrelate their addressing. Two such schemes are defined as annexes in the H.263 family, and are discussed in the following.

PB frames mode

A PB frame consists of two pictures, a P-picture and a B-picture, coded as one unit [25-G]. The P-picture is predicted from the last decoded P-picture and the B-picture is predicted both from the last decoded P-picture and the P-picture currently being decoded. The prediction process is illustrated in Figure 9.16.

Figure 9.16: Prediction in PB frames mode

Macroblock type

Since in the PB frames mode a unit of coding is a combined macroblock from P and B-pictures, the composite macroblock comprises 12 blocks. First, the data for the six P-blocks is transmitted as in the default H.263 mode, followed by the data for the six B-blocks. The composite macroblock may have various combinations of coding status for the P and B-blocks, which are dictated by the combined macroblock block pattern MCBPC. One of the modes of the MCBPC is the intra macroblock type that has the following meaning:

  • the P-blocks are intra coded
  • the B-blocks are inter coded with prediction as for an inter block.

The motion vector data (MVD) is also included for intra blocks in pictures for which the type information PTYPE indicates inter. In this case the vector is used for the B-block only. The codewords MVD2-4 are never used for intra. The candidate motion vector predictor is not set to zero if the corresponding macroblock was coded in intra mode.

Motion vectors for B-pictures in PB frames

In the PB frames mode, the motion vectors for the B-pictures are calculated as follows. Assume we have a motion vector component MV in half pixel units to be used in the P-pictures. This MV represents a vector component for an 8 × 8 luminance block. If only one motion vector per macroblock is transmitted, then MV has the same value for each of the 8 × 8 luminance blocks.

For prediction of the B-picture we need both forward and backward vector components MVF and MVB. Assume also that MVD is the delta vector component given by the motion vector data of a B-picture (MVDB) and corresponds to the vector component MV. Now MVF and MVB are given in half pixel units by the following formulae:

MVF = (TRB × MV)/TRD + MVD
MVB = ((TRB - TRD) × MV)/TRD    if MVD = 0
MVB = MVF - MV                  if MVD ≠ 0

Here TRD is the increment of temporal reference TR from the last picture header. In the optional PB frames mode, TR only addresses P-pictures. TRB is the temporal reference for the B-pictures, which indicates the number of nontransmitted pictures since the last P or I-picture and before the B-picture.

Division is done by truncation and it is assumed that the scaling reflects the actual position in time of P and B-pictures. Care is also taken that the range of MVF is constrained. Each variable length code for MVDB represents a pair of difference values. Only one of the pair will yield a value for MVF falling within the permitted range of -16 to +15.5. The above relations between MVF, MVB and MV are also used in the case of intra blocks, where the vector is used for predicting B-blocks.
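The scaling of the P-picture vector can be sketched as follows. The sketch assumes the relations of Annex G of the Recommendation, namely MVF = (TRB × MV)/TRD + MVD, with MVB = ((TRB - TRD) × MV)/TRD when MVD is zero and MVB = MVF - MV otherwise, and it takes truncation to mean rounding towards zero. The function names are hypothetical and the range check on MVF is omitted.

```python
def trunc_div(num, den):
    """Integer division truncating towards zero (Python's // floors instead)."""
    quotient = abs(num) // abs(den)
    return quotient if (num >= 0) == (den > 0) else -quotient


def pb_vectors(mv, mvd, trb, trd):
    """Return (MVF, MVB) for one vector component, all in half pixel units."""
    mvf = trunc_div(trb * mv, trd) + mvd
    if mvd == 0:
        mvb = trunc_div((trb - trd) * mv, trd)
    else:
        mvb = mvf - mv
    return mvf, mvb


if __name__ == "__main__":
    # B-picture midway between two P-pictures (TRB = 1, TRD = 2).
    print(pb_vectors(mv=7, mvd=0, trb=1, trd=2))
    print(pb_vectors(mv=-7, mvd=1, trb=1, trd=2))
```

When the delta vector is zero, the forward and backward vectors are simply the P-picture vector split in proportion to the temporal distances, so no vector data beyond that of the P-picture is needed for the B-block.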

For chrominance blocks, the forward and backward motion vectors, MVF and MVB, are derived by calculating the sum of the four corresponding luminance vectors and dividing this sum by 8. The resulting one-sixteenth pixel resolution vectors are modified towards the nearest half pixel position.

Prediction for a B-block in PB frames

In PB frames mode, predictions for the 8 × 8 pixel B-blocks are related to the blocks in the corresponding P macroblock. First, it is assumed that the forward and backward motion vectors MVF and MVB are calculated. Secondly, it is assumed that the luminance and chrominance blocks of the corresponding P-macroblock are decoded and reconstructed. This macroblock is called PREC. Based on PREC and its prediction, the prediction for the B-block is calculated.

The prediction of the B-block has two modes that are used for different parts of the block:

  1. For pixels where the backward motion vector MVB points to inside PREC, use bidirectional prediction. This is obtained as the average of the forward prediction using MVF relative to the previously decoded P-picture, and the backward prediction using MVB relative to PREC. The average is calculated by dividing the sum of the two predictions by two with truncation.
  2. For all other pixels, forward prediction using MVF relative to the previously decoded P-picture is used.
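The two prediction modes can be sketched per pixel as follows. This is a simplified illustration under stated assumptions: vectors are integer-pixel (dx, dy) pairs rather than half pixel values, pictures are lists of rows, PREC is treated as a square region of the same size as the block being predicted, and no bounds checking is done on the previous picture. The function name and argument layout are hypothetical.

```python
def predict_b_block(prev_p, p_rec, top, left, mvf, mvb):
    """Predict a square B-block located at (top, left) in the picture.

    prev_p: previously decoded P-picture (list of rows)
    p_rec:  reconstructed P data co-located with the block (PREC)
    mvf, mvb: forward and backward vectors as (dx, dy), integer pixels
    """
    size = len(p_rec)
    prediction = []
    for y in range(size):
        row = []
        for x in range(size):
            forward = prev_p[top + y + mvf[1]][left + x + mvf[0]]
            bx, by = x + mvb[0], y + mvb[1]
            if 0 <= bx < size and 0 <= by < size:
                # MVB points inside PREC: average with truncation.
                row.append((forward + p_rec[by][bx]) // 2)
            else:
                # Otherwise use forward prediction only.
                row.append(forward)
        prediction.append(row)
    return prediction


if __name__ == "__main__":
    prev_p = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
    p_rec = [[50, 60], [70, 80]]
    print(predict_b_block(prev_p, p_rec, top=1, left=1, mvf=(0, 0), mvb=(1, 0)))
```

Restricting the backward prediction to PREC means that the B-block can be predicted as soon as its own P-macroblock has been reconstructed, without waiting for neighbouring macroblocks.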

Figure 9.17 shows forward and bidirectionally predicted B-blocks. Part of the block that is predicted bidirectionally is shaded and the part that uses forward prediction only is shown unshaded.

Figure 9.17: Forward and bidirectional prediction for a B-block

Improved PB frames

This mode is an improved version of the optional PB frames mode of H.263 [25-M]. Most parts of this mode are similar to the PB frames mode, the main difference being that in the improved PB frames mode, the B part of the composite PB-macroblock, known as the BPB-macroblock, may have a separate motion vector for forward and backward prediction. This is in addition to the bidirectional prediction mode that is also used in the normal PB frames mode.

Hence there are three different ways of coding a BPB-macroblock and the coding type is signalled by the MVDB parameter. The BPB-macroblock coding modes are:

  1. Bidirectional prediction: in the bidirectional prediction mode, prediction uses the reference pictures before and after the BPB-picture. These references are the P-picture part of the temporally previous improved PB frames and the P-picture part of the current improved PB frame. This prediction is equivalent to the prediction in normal PB frames mode when MVD = 0. Note that in this mode the motion vector data (MVD) of the PB macroblock must be included if the P macroblock is intra coded.
  2. Forward prediction: in the forward prediction mode the vector data contained in MVDB are used for forward prediction from the previous reference picture (an intra or inter picture, or the P-picture part of PB or improved PB frames). This means that there is always only one 16 × 16 vector for the BPB-macroblock in this prediction mode. A simple prediction is used for coding of the forward motion vector. The rule for this predictor is that if the current macroblock is not at the far left edge of the current picture or slice and the macroblock to the left has a forward motion vector, then the predictor of the forward motion vector for the current macroblock is set to the value of the forward motion vector of the block to the left; otherwise the predictor is set to zero. The difference between the predictor and the desired motion vector is then VLC coded in the same way as vector data to be used for the P-picture (MVD).
  3. Backward prediction: in the backward prediction mode the prediction of the BPB-macroblock is identical to PREC of normal PB frames mode. No motion vector data is used for the backward prediction.

Quantisation of B pictures

In normal mode the quantisation parameter quant is used for each macroblock of P and B-pictures. In PB frames mode, quant is used for P-blocks only, while for the B-blocks a different quantisation parameter bquant is used. In the header information a relative quantisation parameter known as dbquant is sent which indicates the relation between quant and bquant, as defined in Table 9.2.

Table 9.2: Dbquant codes and relation between quant and bquant

dbquant    bquant
00         (5 × quant)/4
01         (6 × quant)/4
10         (7 × quant)/4
11         (8 × quant)/4

Division is done by truncation, and bquant ranges from 1 to 31. Values outside this range are clipped to its limits. Note that since dbquant is a two-bit codeword whereas quantisation information, such as quant, is a five-bit word (indicating quantisation parameters in the range of 1 to 31), such a strategy significantly reduces the overhead information.
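The derivation of bquant can be sketched in a few lines. The sketch assumes that the two-bit codes 00, 01, 10 and 11 select the multipliers 5, 6, 7 and 8 in that order; the names `DBQUANT_FACTOR` and `bquant_from` are hypothetical.

```python
# Assumed mapping of the two-bit dbquant code to the multiplier of quant.
DBQUANT_FACTOR = {0b00: 5, 0b01: 6, 0b10: 7, 0b11: 8}


def bquant_from(quant, dbquant):
    """Return bquant for a given quant (1 to 31) and two-bit dbquant code."""
    # quant is positive, so floor division equals truncation here.
    bquant = (DBQUANT_FACTOR[dbquant] * quant) // 4
    return max(1, min(31, bquant))  # clip to the permitted range


if __name__ == "__main__":
    for code in (0b00, 0b01, 0b10, 0b11):
        print(format(code, "02b"), bquant_from(10, code))
```

Since every multiplier is greater than four, the B-blocks are never quantised more finely than the P-blocks, which fits the observation that B-picture distortion does not accumulate.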

Advanced variable length coding

H.263 pays special attention to variable length coding (VLC) for two different reasons. First, since H.263 is a low bit rate codec, it exploits every available means, including arithmetic coding as an efficient alternative to VLC, to enhance the compression efficiency. Second, since H.263 is intended for mobile applications, where channel errors can be very severe and VLC coded data is very prone to the effects of errors, it uses a less compression efficient VLC to localise the side effects of channel errors. These two contradictory requirements are of course for two different applications and both are optional. In the normal mode, H.263, like the other standard video codecs, uses the conventional VLC.

Syntax based arithmetic coding

In the optional arithmetic coding mode of H.263 [25-E], all the corresponding variable length coding/decoding operations may be replaced by binary arithmetic coding/decoding. This mode is used to improve the compression efficiency. It is shown that use of arithmetic coding will improve the compression efficiency over the conventional Huffman coded VLC by approximately 5–10 per cent, depending on the type of data to be coded [10]. However, arithmetic coded data is as prone to channel errors as normal VLC data. Hence, care should be taken to protect data against channel errors.

The type of arithmetic coding is called syntax-based, since a symbol is VLC encoded using a specific table based on the syntax of the coder. This table typically stores lengths and values of the VLC codewords. The symbol is mapped to an entry of the table in a table look-up operation, and then the binary codeword specified by the entry is sent out normally to a buffer for transmitting to the receiver.

In variable length decoding (VLD), the received bit stream is matched entry by entry in a specific table based on the syntax of the coder. This table must be the same as the one used at the encoder for encoding the current symbol. The matched entry in the table is then mapped back to the corresponding symbol that is the end result of the VLD decoder and is then used for recovering the video pictures.

The use of this mode is indicated by the type information, ptype. Details of syntax-based binary arithmetic coding (SAC) were given in Chapter 3. The model for the arithmetic coding of each symbol is specified by the cumulative frequency of that symbol.

Reversible variable length coding

To decode VLC coded data, decoders need to find the beginning of the codeword that starts after the resynchronisation marker. The marker has a unique pattern which is known to the decoder. VLC coded data is decoded one bit at a time, and each time the found bits are compared against a set of codewords in a lookup table. If a valid codeword is found, the symbol is decoded, otherwise another bit from the bit stream is appended to the code and tested again. In the event of any error in the bit stream, either a wrong symbol is decoded or the result declared invalid. In the former, it is more likely that the decoded symbols that follow will all be wrong, and/or eventually an invalid codeword is detected. In the case of an invalid codeword, decoding is halted, and the decoder waits for the next resynchronisation marker to start decoding. Thus a single bit error may cause a large part of the picture, from the occurrence of the error to the next resynchronisation marker, to be corrupted.

One way of reducing the damaged area is to be able to decode the bit stream backward as well as forward. This is called reversible variable length coding (RVLC). The decoder normally decodes in the forward mode, but when an invalid codeword is detected it stops decoding and stores the remaining data, up to the next resynchronisation marker. It then decodes backward from the marker, to find an invalid codeword, as shown in Figure 9.18. The area between the forward and reverse nondecodable part becomes the erroneous part.

Figure 9.18: A reversible VLC

A variable length code able to work in both directions (RVLC) is required to be symmetric, and its success very much depends on finding an invalid codeword arising from the error. If this is not found, then there is no way of identifying the erroneous area other than by postprocessing, which will be discussed in section 9.7.4. Fortunately, due to the symmetry of RVLC codes, any error will most likely destroy the symmetry and cause an invalid codeword.
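The idea of two-way decoding can be illustrated with a toy example. The four palindromic codewords below are hypothetical and are not the H.263 RVLC table; the corrupted stream is chosen so that both the forward and the backward pass meet an invalid codeword. A real decoder also has to handle the case where the two passes overlap, which this sketch does not attempt.

```python
# Hypothetical symmetric codewords: prefix-free and, being palindromes,
# also suffix-free, so the same table serves both directions.
CODEBOOK = {"0": "a", "11": "b", "101": "c", "1001": "d"}


def decode_until_invalid(bits):
    """Decode left to right; return (symbols, number of bits decoded safely)."""
    symbols, consumed, word = [], 0, ""
    for index, bit in enumerate(bits):
        word += bit
        if word in CODEBOOK:
            symbols.append(CODEBOOK[word])
            consumed, word = index + 1, ""
        elif not any(code.startswith(word) for code in CODEBOOK):
            break  # no codeword begins with these bits: invalid codeword
    return symbols, consumed


def decode_both_ways(bits):
    """Return forward symbols, backward symbols and the suspect bit span."""
    forward, used_forward = decode_until_invalid(bits)
    backward, used_backward = decode_until_invalid(bits[::-1])
    return forward, backward[::-1], (used_forward, len(bits) - used_backward)


if __name__ == "__main__":
    sent = "11" + "1001" + "1001" + "11"   # symbols b, d, d, b
    received = "111000000111"              # bits 5 and 6 corrupted
    print(decode_until_invalid(sent))
    print(decode_both_ways(received))
```

In this example only the two middle symbols are lost; with forward decoding alone, everything after the first symbol would have been discarded up to the next resynchronisation marker.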

Resynchronisation markers

The resynchronisation markers play an important role in the performance of the H.263 video codec. If used frequently, they limit the damaged area more tightly, so improving the error resilience of the codec. On the other hand, they incur some overheads, which can be costly for low bit rate video. In secure communication environments, they might be used only along with the picture header, where a single bit error can damage the whole picture. In this case, the overhead is minimised and the error-free picture quality is at its best. In normal transmission media, the resynchronisation markers are preferably used at each group of blocks (GOB), to give a balance between the compression efficiency (nonerroneous picture quality) and resilience to errors.

For an optimum balance between the resilience and the coding efficiency, the resynchronisation markers may be inserted where they are needed most. For example, if a GOB does not produce enough bits, or if it belongs to a B-picture, then markers may be inserted between several GOBs. Similarly, if a part of a picture is very active, such as the intraframe coded macroblocks, then within a GOB several markers can be inserted.

This optional mode of H.263 is defined under Annex K and is called the slice structure mode [25-K]. In this mode the slice header information within the bit stream acts as the resynchronisation marker. A slice header has more information than a GOB header (e.g. repeating picture header), such that out of order decoding of slices within a picture is possible. This is particularly useful for packetised transmission of H.263 coded data, where out of sequence decoding of packets reduces the decoding delay. Note that there is no complete independence between the slices, since some processing tools like deblocking filter mode interrelate adjacent slices [25-J].

In order to ensure that slice boundary locations can act as resynchronisation points, and ensure that slices can be sent out of order without causing additional decoding delays, the following rules are adopted in the slice structure mode:

  1. The prediction of motion values is the same as if a GOB header was present (see section 9.1.2), preventing the use of motion vectors of blocks outside the current slice for the prediction of the values of motion vectors within the slice.
  2. The advanced intra coding mode [25-I] treats the slice boundary as if it was a picture boundary with respect to the prediction of intra block DCT coefficient values.
  3. The assignment of remote motion vectors for use in overlapped block motion compensation within the advanced prediction mode [25-F] also prevents the use of motion vectors of blocks outside of the current slice for use as remote motion vectors.

For complete independence between slices, the recommendation describes the optional independent segment decoding mode [25-R]. When this mode is used, the slice boundaries are treated like the picture boundaries, including the treatment of motion vectors which cross these boundaries. If need be, the boundary pixels are extrapolated to be able to use other optional modes such as: unrestricted motion vector, advanced prediction mode, deblocking filter and scalability [25-R].

Advanced intra inter VLC

For further improvement to compression efficiency, H.263 specifies some optional modes that can use the normal Huffman designed VLC codes differently from the other standard video codecs. The following two optional modes describe situations where proper use of VLC improves the encoding efficiency.

Advanced intra coding

In this optional mode [25-I], intra blocks are predictively coded using nearby blocks in the image to predict values in each intra block. A separate VLC table is used for the intra coefficients, and the quantisation of the DC coefficient for intra is also different. This is all done to improve the coding efficiency of the intra macroblocks.

The prediction may be made from the block above or the block to the left of the current block being decoded. An exception occurs in the special case of an isolated intra coded macroblock in an inter coded frame with neither the macroblock above nor to the left being intra coded. In this case, no prediction is made. In prediction, DC coefficients are always predicted in some manner, although either the first row or column of AC coefficients may or may not be predicted as signalled on a macroblock-by-macroblock basis. Inverse quantisation of the intra DC coefficient is identical to the inverse quantisation of AC coefficients for predicted blocks, unlike the core H.263 or other standards that use a fixed quantiser of eight bits for intra DC coefficients.

Also, in addition to zigzag scanning, two more scans are employed, alternate horizontal and alternate vertical scans, as shown in Figure 9.19. Alternate vertical is similar to the alternate scan mode of MPEG-2. For intra predicted blocks, if the prediction mode is set to zero, a zigzag scan is selected for all blocks in a macroblock, otherwise the prediction direction is used to select a scan on a block basis. For instance, if the prediction refers to the horizontally adjacent block, an alternate vertical scan is selected for the current block, otherwise (for DC prediction referring to the vertically adjacent block), alternate horizontal scan is used for the current block.

Figure 9.19: Alternate scans (a) horizontal (b) vertical

For nonintra blocks, the 8 × 8 blocks of transform coefficients are always scanned with zigzag scanning, similar to all the other standard codecs. A separate VLC table is used for all intra DC and AC coefficients.

Depending on the value of intra_mode, either one or eight coefficients are the prediction residuals that must be added to a predictor. Figure 9.20 shows three 8 × 8 blocks of quantised DC levels and prediction residuals labelled A(u, v), B(u, v) and E(u, v), where u and v are row and column indices, respectively.

Figure 9.20: Three neighbouring blocks in the DCT domain

E(u, v) denotes the current block that is being decoded. A(u, v) denotes the block immediately above E(u, v) and B(u, v) denotes the block immediately to the left of E(u, v). Define C(u, v) to be the actual quantised DCT coefficient. The quantised level C(u, v) is recovered by adding E(u, v) to the appropriate prediction as signalled in the intra_mode field.

The reconstruction for each coding mode is given by:

Mode 0: DC prediction only


Mode 1: DC and AC prediction from the block above


Mode 2: DC and AC prediction from the block to the left


where QPA, QPB and QPC denote the quantisation parameter (taking values between 1 and 31) used for A(u, v), B(u, v) and C(u, v), respectively.

Advanced inter coding with switching between two VLC tables

At low frame rates (very common for low bit rate applications), the DCT coefficients of interframe coded macroblocks are normally large. Also, in general, the VLC tables designed for intraframe coded macroblocks suit larger value coefficients better. Hence, to improve the compression efficiency of the H.263 codec, the inter coded macroblocks are allowed to use the VLC tables that were primarily designed for intra macroblocks, but with a different interpretation of level and run. This is made optional and is called alternative inter VLC mode [25-S]. It is activated when significant changes are evident in the picture.

The intra VLC is constructed so that codewords have the same value for last (0 or 1) in both the inter and intra tables. The intra table is therefore produced by reshuffling the meaning of the codewords with the same value of last. Furthermore, for events with large level, the intra table uses a codeword which in the inter table has large run.

Encoder action

The encoder uses the intra VLC table for coding an inter block if the following two criteria are satisfied:

  • the intra VLC results in fewer bits than the inter VLC
  • if the coefficients are coded with the intra VLC table, but the decoder assumes that the inter VLC is used, coefficients outside the 64 coefficients of an 8 × 8 block are addressed.

With many large coefficients, this will easily happen due to the way the intra VLC is used.

Decoder action

At the decoder the following actions are taken:

  • the decoder first receives all coefficient codes of a block
  • the codewords are then interpreted assuming that inter VLC is used; if the addressing of coefficients stays inside the 64 coefficients of a block, the decoding is ended
  • if coefficients outside the block are addressed, the codewords are interpreted according to the intra VLC.
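The encoder and decoder rules above amount to a simple addressing test, sketched below. The sketch assumes that the codewords of a block have already been translated into (run, level) events under each table; the lists passed in and all function names are hypothetical.

```python
def stays_inside_block(events, block_size=64):
    """True if the (run, level) events address only coefficients 0..block_size-1."""
    position = -1
    for run, _level in events:
        position += run + 1  # skip `run` zeros, then place one coefficient
        if position >= block_size:
            return False
    return True


def decoder_choice(events_as_inter, events_as_intra):
    """Interpret with the inter VLC first; fall back to intra on overflow."""
    if stays_inside_block(events_as_inter):
        return "inter", events_as_inter
    return "intra", events_as_intra


def encoder_may_use_intra(bits_intra, bits_inter, events_seen_as_inter):
    """Both encoder criteria: fewer bits, and the inter reading overflows."""
    return bits_intra < bits_inter and not stays_inside_block(events_seen_as_inter)


if __name__ == "__main__":
    seen_as_inter = [(60, 1), (5, 1)]     # second event would land on index 66
    seen_as_intra = [(0, 12), (0, -9)]
    print(decoder_choice(seen_as_inter, seen_as_intra)[0])
    print(encoder_may_use_intra(20, 26, seen_as_inter))
```

The second encoder criterion is what makes the switch detectable without any signalling: the decoder can only recognise the intra table through the overflow, so the encoder must not use it unless the overflow occurs.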

Protection against error

H.263 provides error protection, robustness and resilience to allow accessing of video information over a wide range of transmission media. In particular, due to the rapid growth of mobile communications, it is extremely important that access is available to video information via wireless networks. This implies a need for useful operation of video compression algorithms in a very error prone environment at low bit rates (i.e. less than 64 kbit/s).

In the previous sections we studied the two important coding tools of VLC and resynchronisation markers in the H.263 codec. The former spreads the errors, and the latter tries to confine them into a small area. In this section, we introduce some more useful tools that can enhance the video quality beyond what we have seen so far. Some of these are recommended as options (or annexes) and some as postprocessing tools that can be implemented at the decoder without the help of the encoder. They can be used either together or individually, to improve video quality.

Forward error correction

Forward error correction is the simplest and most effective means of improving video quality in the event of channel errors. It is based on adding some redundancy bits, known as parity bits, to a group of data bits, according to some rules. At the receiver, the decoder, invoking the same rule, can detect if any error has occurred, and in certain cases even correct it. However, for video data, error correction is not as important as is error detection.

The forward error correction for H.263 is the same as for H.261, and is optional [25-H]. However, since the main usage of H.263 will be in a mobile environment with poor error characteristics, forward error correction is particularly important. In most cases (e.g. the GSM system), the error correction will be an integral part of the transmission channel. If it is not, or if additional protection is required, then it should be built into the H.263 system.

To allow the video data and error correction parity information to be identified by the decoder, an error correction framing pattern is included. This pattern consists of multiframes of eight frames, each frame comprising 1 framing bit, 1 fill indicator (FI) bit, 492 bits of coded data and 18 parity bits. One bit from each of the eight frames provides the frame alignment pattern of (S1S2S3S4S5S6S7S8) = (00011011), which helps the decoder to resynchronise itself after the occurrence of errors.

The error detection/correction code is a BCH (511, 493) code [16]. The parity is calculated against a code of 493 bits, comprising the fill indicator (FI) bit and 492 bits of coded video data. The generator polynomial is given by:

$$g(x) = (x^9 + x^4 + 1)(x^9 + x^6 + x^4 + x^3 + 1)$$

The parity bits are calculated by dividing the 493 bits of video data (including the fill indicator bit), left shifted by 18 bits, by the generator polynomial. Since the generator polynomial has 19 coefficients, the remainder is an 18-bit binary number (that is why the data bits had to be shifted by 18 bits to the left), which is used as the parity bits. For example, for the input data of 01111...11 (493 bits), the resulting correction parity bits are 011011010100011011 (18 bits). The encoder appends these 18 bits to the 493 data bits, and the whole 511 bits are sent to the receiver as a block of data. This 511-bit block is exactly divisible by the generator polynomial, so the remainder is zero. Thus the receiver can perform a similar division, and any nonzero remainder is an indication of channel error. This is a very robust form of error detection, since bursts of errors can also be detected.
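
The division described above is ordinary polynomial division over GF(2), which can be sketched with Python integers used as bit vectors. The sketch assumes the H.261 generator polynomial g(x) = (x^9 + x^4 + 1)(x^9 + x^6 + x^4 + x^3 + 1) and treats the most significant bit of the integer as the first transmitted bit; the function names are illustrative, and the sketch does not claim to reproduce the exact bit ordering behind the parity example quoted in the text.

```python
def gf2_mul(a, b):
    """Carry-less (GF(2)) product of two polynomials held as int bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        # cancel the current leading term of the dividend
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend


# Assumed generator polynomial: degree 18, i.e. 19 coefficients
G = gf2_mul((1 << 9) | (1 << 4) | 1,
            (1 << 9) | (1 << 6) | (1 << 4) | (1 << 3) | 1)


def bch_parity(data_bits):
    """18 parity bits for a 493-bit data word (FI bit + 492 video bits)."""
    return gf2_mod(data_bits << 18, G)


def encode(data_bits):
    """511-bit block: data shifted left by 18 bits, parity in the low 18 bits."""
    return (data_bits << 18) | bch_parity(data_bits)


def has_error(block):
    """A nonzero remainder at the receiver indicates a channel error."""
    return gf2_mod(block, G) != 0


data = (1 << 492) - 1                       # 0 followed by 492 ones
block = encode(data)
print(has_error(block))                     # error-free block
print(has_error(block ^ (0b101101 << 50)))  # block hit by a short burst
```

Only the detection side is shown; correcting errors with this code needs a proper BCH decoder, which is beyond this sketch.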

Back channel

The impact of error on interframe coded pictures becomes objectionable when error propagates through the picture sequence. Errors affecting only one video frame are easily tolerated by the viewers, especially at high frame rates. To improve quality of video services, propagation of errors through the picture frames must be prevented. A simple method for this task is: when the decoder detects errors in the bit stream (e.g. section 9.7.1), it may ask the encoder to code that part of the picture in the next frame in intra mode. This is called forced updating, and of course requires a back channel from the decoder to the encoder.

Since intraframe coded macroblocks (MB) generate more bits than interframe coded ones, forced updating may not be too impressive. In particular, in normal interframe coding, only a small number of MBs in a GOB are coded. Forced updating will encode all the MBs in the GOB (including the noncoded MBs) in intramode, which increases the bit rate significantly. This can have a side effect of impairing video quality in the subsequent frames. Moreover, if errors occur in more than one GOB, the situation becomes much worse, since the encoder can exceed its bit rate budget, dropping some picture frames. This results in picture jerkiness, which is equally annoying.

A better way of preventing propagation of errors is to ask the encoder to change its prediction to an error free picture. For example, if error occurs in frame N, then in coding of the next frame (frame N + 1) the encoder uses frame N - 1, which is free of error at the decoder. This of course requires some additional picture buffers at both the encoder and the decoder.

The optional reference picture selection mode of H.263 uses additional picture memory at the encoder to perform such a task [25-N]. The amount of additional picture memory accommodated in the decoder may be signalled by external means to help memory management at the encoder. The source encoder for this mode is similar to the generic interframe coder, but several picture memories are provided in order that the encoder may keep a copy of several past pictures, as shown in Figure 9.21.

Figure 9.21: An encoder with multiple reference pictures

The source encoder selects one of the picture memories according to the backward channel message, GOB by GOB, to suppress the temporal error propagation due to interframe coding. The information signalling which picture is selected for prediction is included in the encoded bit stream. The decoder in this mode also has several additional picture memories, in which it stores correctly decoded pictures together with their temporal reference (TR) information. If the TRP field exists in the forward message, the decoder uses the stored picture whose TR equals TRP as the reference picture for interframe decoding, instead of the last decoded picture. When the picture whose TR equals TRP is not available at the decoder, the decoder may send a forced intra update signal to the encoder.

A positive acknowledgment (ACK) or a negative acknowledgment (NACK) is returned depending on whether the decoder successfully decodes a GOB.
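
As a rough illustration of how an encoder might use these acknowledgments, the sketch below keeps a small store of past pictures indexed by TR and predicts from the most recent one that the decoder has positively acknowledged. The class, its method names, the store capacity and this particular selection policy are assumptions made for illustration; the Recommendation leaves the encoder strategy open, and TR wrap-around is ignored here.

```python
from collections import OrderedDict


class ReferenceSelector:
    """Encoder-side bookkeeping for choosing a reference picture
    from ACK/NACK back channel messages (hypothetical helper)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.status = OrderedDict()   # TR -> "pending" | "ack" | "nack"

    def store(self, tr):
        """Keep a copy of a newly coded picture; drop the oldest if full."""
        self.status[tr] = "pending"
        while len(self.status) > self.capacity:
            self.status.popitem(last=False)

    def feedback(self, tr, ok):
        """Record an ACK (ok=True) or NACK (ok=False) for picture `tr`."""
        if tr in self.status:
            self.status[tr] = "ack" if ok else "nack"

    def choose_reference(self):
        """Return the TR to predict from, or None to fall back to intra coding."""
        for tr in reversed(self.status):
            if self.status[tr] == "ack":
                return tr
        return None


sel = ReferenceSelector()
for tr, ok in [(28, True), (29, True), (30, False)]:
    sel.store(tr)
    sel.feedback(tr, ok)
print(sel.choose_reference())   # frame 30 was damaged, so predict from frame 29
```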

Both forced intra updating and multiple reference picture modes require a back channel from the decoder to the encoder. If a back channel cannot be provided, multiple reference pictures can still be used to alleviate error propagation. For example, the encoder may always use an average of two previous frames for prediction. Of course, in an error-free environment the compression efficiency is not as good as with single-frame prediction, but the scheme is robust against channel errors.

Figure 9.22 illustrates the efficiency of the multiple reference picture in preventing error propagation. In the Figure errors occur at frame 30, where the picture quality drops by 3 dB (graph E) compared with the nonerror case (NE). With the back channel (BC), the encoder in coding of frame 31 uses prediction from frame 29 instead of frame 30, and the picture quality is not very different from the nonerror case. Without the back channel, with a prediction from the average of two previous frames (2F + E), the picture quality improves. However, the improvement is not very significant, since the quality of the picture without error (2F) due to nonoptimum prediction is not as good as with the nonerror mode (NE).

Figure 9.22: Use of multiple reference pictures with and without back channel

Data partitioning

Although the individual bits of the VLC coded symbols in a bit stream are equally susceptible to channel errors, the impact of the error on the symbols is unequal. Between the two resynchronisation markers, symbols that appear earlier in the bit stream suffer less from the errors than those which come later. This is due to the cumulative impact of VLC on decoding of the subsequent data. To illustrate the extent of the difference on the unequal susceptibility to errors, consider a segment of VLC coded video data between two resynchronisation markers. Also assume that the segment has N symbols with an average VLC length of L bits/symbol and a channel with a bit error rate of P. If any of the first L bits of the bit stream (those immediately after the first marker) are in error, then the symbol would be in error with a probability of LP. The probability that the second symbol in the bit stream is in error now becomes 2LP, since any error in the first L bits also affects the second symbol. Hence the probability that the last symbol in the bit stream is in error will be NLP, since every error ahead of this symbol can change the value of this symbol. Thus the last symbol is N times more likely to be in error than the first symbol in the bit stream.
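
The argument above can be written compactly. Assuming independent bit errors, and writing P_k (a symbol introduced here only for illustration) for the probability that the kth symbol after the marker is decoded wrongly, the linear growth is the first-order approximation of the exact expression:

```latex
P_k = 1 - (1 - P)^{kL} \approx kLP \quad (kLP \ll 1),
\qquad
\frac{P_N}{P_1} \approx N
```

For example, with the hypothetical values L = 5 bits/symbol, N = 100 symbols and P = 10^-4, the first symbol is in error with a probability of about 5 × 10^-4, whereas for the last symbol NLP = 0.05 (the exact expression gives about 0.049).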

In applications where some video data is more important than the rest, like the macroblock addresses (as distinct from interframe DCT coefficients), one can significantly reduce the side effects of channel errors by bringing the important data ahead of the less important data. This form of partitioning the VLC coded data into segments of various importance is called data partitioning, which is one of the optional modes of H.263 [25-V]. Note that this form of data partitioning is different from the data partitioning used as a layering technique, described in section 8.5.2. There, through the priority break point, the DCT coefficients were divided into two parts: the lower frequency coefficients along with the other data comprised the base layer, and the high frequency DCT coefficients formed the second layer. Inclusion of the priority break points and other overheads increased the bit rate by about 3–4 per cent (see Figure 8.25). Here, in contrast, the entire set of data in a GOB is partitioned and ordered according to the importance of its contribution to video quality, without any additional overhead. For example, within a GOB, the order of importance of data can be: coding status of MBs, motion vectors, block pattern, quantiser parameter, DC coefficients, AC coefficients. Thus it is also possible to extract all the DC coefficients of the blocks in a GOB, and send them ahead of all the AC coefficients.

To appreciate the importance of data partitioning in protecting video against channel errors, Figure 9.23 shows two snapshots of a video sequence with and without data partitioning. It was assumed that with data partitioning only the DCT coefficients were subjected to errors, whereas in the normal mode a bit error could affect any bit of the data. This is a plausible assumption, since normally the important data comprises a small fraction of the bit stream and can be heavily protected against error. The important data can also use a reversible variable length code (RVLC), such that some of the corrupted data can be retrieved. In fact Annex V on data partitioning recommends RVLC for the slice header (including the macroblock type) and motion vectors [25-V]. The DCT coefficients according to this recommendation use normal VLC. The better picture quality with data partitioning than with the normal mode, as shown in Figure 9.23, justifies such a decision. This also shows the relative insignificance of the DCT coefficients, as their loss hardly affects the picture quality. It should be noted that, in this picture, all the macroblocks were interframe coded. Had there been any intraframe coded macroblock, its loss would have been noticeable.

Figure 9.23: Effects of errors (a) with data partitioning, (b) without data partitioning

Table 9.3 compares the normal VLC and RVLC for the combined macroblock type and block pattern (MCBPC). Note that RVLC is symmetric and it has more bits than the normal VLC. Hence its use should be avoided, unless it is vital to prevent drastic image degradation.

Table 9.3: VLC and RVLC bits of MCBPC

Surviving headings: MB type, Normal VLC. Surviving row labels: 3 (intra), 4 (intra + Q).

Table 9.4 shows the average number of bits used in an experiment for each slice of a QSIF size salesman image test sequence (picture in Figure 9.23). The last column is the average bit/slice in normal coding of the sequence, for the whole nine slices. For data partitioning, the second column is the slice overhead (including the macroblock type, resynchronisation markers), the third column is the motion vector overhead and the fourth column is the number of bits used for the DCT coefficients. The sum of all the bits in data partitioning is given in the fifth column.

Table 9.4: Number of bits per slice for data partitioning

Columns: Slice No; Slice header; motion vectors; DCT coefficients; total for data partitioning; normal coding.

First, since the slice header and motion vectors together make up only 8–28 per cent of the data (less for more active slices), they can be easily protected without significantly increasing the total bit rate. Secondly, comparing the total number of bits in data partitioning with normal coding (columns 5 and 6), we see that data partitioning uses about 3–12 per cent more bits than normal coding. This increase is due to the use of RVLC for only the header and the motion vectors, plus some additional resynchronisation markers at the end of the important data; had RVLC been used for all the bits, the increase in bit rate would have been much higher. Hence, given that DCT coefficients do not contribute much to image quality and that RVLC needs more bits than VLC, it is wise not to use RVLC for the DCT coefficients, as Annex V recommends [25-V].

It should be noted that the main cause of the unpleasant appearance of the picture without data partitioning (Figure 9.23b) is error in the important data of the bit stream, such as MB addresses and motion vectors. When the coding status of an MB is wrongly conveyed to the decoder, visual information is misplaced. Also, in the nondata partitioning mode, since the VLC coded MB data and motion vectors are mixed, any bit error easily causes invalid codewords and a large area of the picture will be in error, as shown in Figure 9.23b.

Note that data partitioning is only used for P and B-pictures, because for I-pictures, DCT coefficients are all important and their absence degrades picture quality significantly.

Error detection by postprocessing

In the error correction/detection section of 9.7.1 we saw that with the help of parity bits the decoder can detect an erroneous bit stream. In data communications, the decoder normally discards the entire segment of bits and requests retransmission. Due to the delay sensitive nature of visual services, retransmission is never used in video communication. Moreover, the decoder can decode a part of the bit stream, up to the point where it finds an invalid codeword. Hence a part of the corrupted bit stream can be recovered, limiting the damaged area.

However, the decoder still cannot identify the exact location of the error (if this were possible, it could have corrected it!). What is certain is that the bits after the invalid codeword up to the next resynchronisation marker are not decodable, as shown in Figure 9.24.

Figure 9.24: Error in a bit stream

It is to be expected that several symbols are wrongly decoded before the decoder finds an invalid codeword. In some cases, the entire data may be decodable without encountering an invalid codeword, although this rarely happens. For example, the grey parts of the slices in Figure 9.23b are due to invalid codewords at which the decoder has given up decoding. Figure 9.23b also shows wrongly decoded blocks of pixels, where the decoder can still carry on decoding beyond these blocks. Hence in those parts that are decodable, the correctly decoded data cannot be separated from the wrongly decoded data, unless some form of processing on the decoded pixels is carried out.

A simple and efficient method of separating correctly decoded blocks from wrongly decoded ones is to test for pixel continuity at the macroblock (MB) boundaries. In nonerroneous pictures, due to high interpixel correlation, pixel differences at the MB borders are normally small, whereas errors create large differences. As shown in Figure 9.25a, for every decoded MB, the average of the upper and lower pixel differences at the MB boundaries is calculated as:

$$BD = \frac{1}{N}\sum_{i=1}^{N}\left| I_i^{\mathrm{in}} - I_i^{\mathrm{out}} \right| \qquad (9.13)$$

where $I_i^{\mathrm{in}}$ and $I_i^{\mathrm{out}}$ are the ith pair of pixels just inside and just outside the MB border, and N is the total number of pixels at the upper and lower borders of the MB.

Figure 9.25: Pixels at the boundary of (a) a macroblock, (b) four blocks

The boundary difference (BD) of each MB is then compared against a threshold, and those MBs whose BD exceeds it are most likely to have been erroneously decoded. Since texture or edges in the image may cause some inherent discontinuity at the MB boundaries, the boundary threshold can be made dependent on the local image statistics. For example, the mean value of the boundary differences of all the MBs in the slice, or in the slice above, with some tolerance (a few times the standard deviation of the differences) can be used as the threshold. Experiments show that the mean plus four times the standard deviation is a good value for the threshold [17]. The boundary difference, BD, can be calculated separately for luminance and for each of the colour differences. A macroblock might have been erroneously decoded if any of these boundary differences so indicates.
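
A minimal sketch of this test is given below for the luminance component only. It assumes a mean absolute difference as the boundary measure, a frame held as a list of pixel rows, and reference statistics taken from a separate slice (for example the slice above); the function names are illustrative. Taking the statistics from a slice that contains only a handful of MBs, including the suspect one, can mask the outlier, which is why the reference values are passed in separately here.

```python
from statistics import mean, pstdev

MB = 16  # macroblock size in luminance pixels


def boundary_difference(frame, mb_row, mb_col):
    """Mean absolute pixel difference across the upper and lower
    borders of one macroblock (borders on the frame edge are skipped)."""
    top, left = mb_row * MB, mb_col * MB
    bottom = top + MB - 1
    diffs = []
    for x in range(left, left + MB):
        if top > 0:                       # first MB row against the row above
            diffs.append(abs(frame[top][x] - frame[top - 1][x]))
        if bottom + 1 < len(frame):       # last MB row against the row below
            diffs.append(abs(frame[bottom][x] - frame[bottom + 1][x]))
    return mean(diffs) if diffs else 0.0


def suspect_macroblocks(bds, reference_bds, k=4.0):
    """Indices of MBs whose BD exceeds mean + k * std of the reference BDs."""
    threshold = mean(reference_bds) + k * pstdev(reference_bds)
    return [i for i, bd in enumerate(bds) if bd > threshold]


# Flat 48 x 32 picture with one macroblock decoded to a wrong level
frame = [[100] * 32 for _ in range(48)]
for y in range(16, 32):
    for x in range(16, 32):
        frame[y][x] = 200
slice_bds = [boundary_difference(frame, 1, c) for c in range(2)]
print(suspect_macroblocks(slice_bds, reference_bds=[1.0, 2.0, 3.0, 2.0]))
```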

Another method is to calculate the boundary differences around the 8 × 8 pixel block boundaries, as shown in Figure 9.25b. In the 4:2:0 image format, each MB has four luminance blocks and one block of each chrominance component, and hence the boundary difference is applied only to the luminance blocks.

In a similar fashion to the boundary difference of eqn. 9.13, the block boundary is calculated on the inner and outer pixels of the blocks, as shown in Figure 9.25b. Again, if any of the four block boundary values, BD, indicates a discontinuity, the macroblock is most likely to be erroneously decoded. Combining the boundary differences of the macroblock (Figure 9.25a) and the block (Figure 9.25b) increases the reliability of detection.

Assuming that these methods can detect an erroneously decoded MB, then if the first erroneous MB in a slice is found, and provided that the error had only occurred in the bits of this MB, in general it is possible to retrieve the remaining data. Here, after identifying the first erroneous MB, some of the bits are skipped and decoding is performed on the remaining bits. The process is continued such that the remaining bits up to the next resynchronisation marker are completely decodable (no invalid codeword is encountered). In doing so, even parts of the slice/GOB that were not decodable before are now decoded and the erroneous part of the GOB can be confined to one MB.

If errors occur in more than one MB, then it may not be possible to have perfect decoding (no invalid codeword) up to the next resynchronisation marker. Thus in general, when decoding proceeds up to the next resynchronisation marker, the number of erroneous MBs is counted. This number should be less than the number of erroneous MBs in the previous run. The process ends when further skipping of bits and decoding does not reduce the number of erroneous macroblocks in a GOB any further.

Figure 9.26 shows the decoded pictures at each stage of this step-by-step skipping and decoding of the bits. For the purpose of demonstration, only one bit error was introduced in the bit stream between the resynchronisation markers of some of the slices. The first picture shows the erroneous picture without any postprocessing. The second picture shows the reconstructed picture after the first round of bit skipping in each slice, and so on. As we see, at each stage the erroneous area (number of erroneous MBs) is reduced, until further processing does not reduce the number of erroneous MBs (there is not much difference between pictures d and e). There is only one erroneous MB in each slice of the final picture (Figure 9.26e), which can easily be concealed.

Figure 9.26: Step-by-step decoding and skipping of bits in the bit stream

In the above example it was assumed that a single bit error, or a burst of errors, affected only one MB. If errors affect more than one MB, then at the end more than one MB will be in error and of course it will take more time to find these erroneous MBs in the decoding. This is because, after finding the first erroneous MB, since there are some erroneous MBs to follow, perfect decoding (not finding an invalid codeword) is not possible. Experiments show that in most cases, all the macroblocks between the first and the last erroneous MB in a slice will be in error. However, it is still possible to recover some of the macroblocks, which would not have been possible without this sort of processing.

Error concealment

If any of the error resilience methods mentioned so far or their combinations is not sufficient to produce satisfactory picture quality, then one may try to hide the image degradation from the viewer. This is called error concealment.

The main idea behind error concealment is to replace the damaged pixels with pixels from parts of the video that have maximum resemblance. In general, pixel substitution may come from the same frame or from the previous frame. These are called intraframe and interframe error concealment, respectively [18].

Intraframe error concealment

In intraframe error concealment, pixels of an erroneous MB are replaced by those of a neighbouring MB with some form of interpolation. For example, pixels at the macroblock boundary may be directly replaced by the pixels from the other side of the border, and for the other pixels, the average of the neighbouring pixels inversely weighted by their distances may be substituted.
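
One simple reading of this interpolation is sketched below: every pixel of the damaged MB is replaced by the average of the nearest pixels just outside the MB (above, below, left and right), each weighted by the inverse of its distance. This is an illustrative variant, not the method of any particular standard; for simplicity it also interpolates the border pixels rather than copying them directly, and the function name and frame layout (a list of pixel rows) are assumptions.

```python
def conceal_block(frame, top, left, size=16):
    """Fill the size x size block at (top, left) in place, using
    inverse-distance weighting of the pixels bordering the block."""
    height, width = len(frame), len(frame[0])
    for y in range(top, top + size):
        for x in range(left, left + size):
            # (distance, value) of the nearest outside pixel in each direction
            candidates = []
            if top - 1 >= 0:
                candidates.append((y - (top - 1), frame[top - 1][x]))
            if top + size < height:
                candidates.append((top + size - y, frame[top + size][x]))
            if left - 1 >= 0:
                candidates.append((x - (left - 1), frame[y][left - 1]))
            if left + size < width:
                candidates.append((left + size - x, frame[y][left + size]))
            if not candidates:
                continue                      # no neighbour available
            weight_sum = sum(1.0 / d for d, _ in candidates)
            frame[y][x] = sum(v / d for d, v in candidates) / weight_sum


# A lost macroblock in the middle of a flat grey area is fully restored
frame = [[80] * 48 for _ in range(48)]
for y in range(16, 32):
    for x in range(16, 32):
        frame[y][x] = 0
conceal_block(frame, 16, 16)
print(round(frame[20][20]))
```

Only pixels outside the block are read, so the result does not depend on the order in which the block is filled.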

An efficient method of intraframe error concealment is shown in the block diagram of Figure 9.27. A block of pixels, larger than the size of a macroblock (preferably 48 × 48 pixels, equivalent to 3 × 3 MBs) encompassing the MB to be concealed, is fast Fourier transformed (FFT). Pixels of the MB to be concealed initially are filled with grey level values. The FFT coefficients are two-dimensionally lowpass filtered (LPF) to remove the discontinuity due to these inserted pixels. The resultant lowpass filtered coefficients are then inverse fast Fourier transformed (IFFT), to reconstruct a replica of the input pixels. Due to lowpass filtering, the reconstructed pixels are similar but not exactly the same as the input pixels. The extent of dissimilarity depends on the cutoff frequency of the lowpass filter. The lower the cutoff frequency, the stronger is the influence of the neighbouring pixels into the concealed MB. The centre MB at the output now replaces the centre MB at the input, and the whole process of FFT, LPF and IFFT repeats again. The process is repeated several times, and at each time the cutoff frequency of the LPF is gradually increased. To improve the quality of error concealment, the lowpass filter can be made directional, based on the characteristics of the surrounding pixels. The process is terminated when the difference between the pixels of the concealed MB at the input and output is less than a threshold.

Figure 9.27: An example of intraframe error concealment

This form of error concealment assumes an isolated erroneous MB surrounded by eight immediate nonerroneous neighbours. This is suitable for JPEG or motion JPEG coded pictures, where error is localised (see Figure 5.17), or for interframe coded pictures, if by means of postprocessing the error is confined to one MB (e.g. Figure 9.26e). For video, where there is a danger of errors elsewhere in the same slice/GOB, pixels of the top and bottom slices should be used, and the MBs to the right and left are treated as if they are in error. This impairs the performance of the concealment, and may not be suitable. For video, a more suitable method is interframe error concealment, which is explained in the following.

Interframe error concealment

In interframe error concealment, pixels from the previous frame are substituted for the pixels of the MB to be concealed, as shown in Figure 9.28. This could be either by direct substitution, or by their motion compensated version, using an estimated motion vector. Obviously, due to movement, motion compensated substitution is better. The performance of this method depends on how accurately the motion vector for concealment is estimated. In the following, several methods of estimating this motion vector are explained and their error concealment fidelity is compared.

Figure 9.28: A grid of 3 × 3 macroblocks in the current and previous frame

Zero mv

Direct substitution of pixels from the MB of the previous frame at the same spatial position of the MB to be concealed (zero motion vector). This is the simplest method of substitution, and is effective in the picture background, or in the foreground with slow motion.

Previous mv

The estimated motion vector is the same as the motion vector of the spatially similar MB in the previous frame. This method, which assumes uniform motion of objects, performs well most of the time, but it is eventually bound to fail.

Top mv

The estimated motion vector is the same as the motion vector of the MB at the top of the wanted MB (e.g. MB number 2 of Figure 9.28). Similarly, the motion vector of the bottom MB (e.g. MB number 5) may be used. Since these two MBs are closest to the current MB, it is expected that their MVs will have the highest similarity. Moreover, this method is as simple as direct substitution (zero mv) and previous mv.

Mean mv

The average of the motion vectors of the six immediate neighbours represents the estimated mv. The mean values of the horizontal displacement, x0, and the vertical displacement, y0, are taken separately:

$$x_0 = \frac{1}{6}\sum_{i=1}^{6} x_i, \qquad y_0 = \frac{1}{6}\sum_{i=1}^{6} y_i$$

where xi and yi are the horizontal and vertical components of motion vector i, mvi(xi, yi). Note that, due to averaging, any small perturbations in the neighbouring motion vector components will cancel each other. Thus the estimated motion vector will be different from the motion vector of the neighbouring MB. This method of error concealment may not produce a smooth picture. The discontinuity at the MB boundaries produces a blocking artefact that appears very annoying. Hence this method is not good for parts of the picture with motion in various directions, such as the movement of lips and eyes of a talking head.
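
A short sketch of the mean estimate follows, with motion vectors held as (x, y) tuples; the function name and the sample vectors are illustrative. The example also shows why the method struggles with motion in different directions: opposing vectors average out to an estimate that matches none of the neighbours.

```python
def mean_mv(neighbours):
    """Component-wise average of the neighbouring motion vectors."""
    n = len(neighbours)
    x0 = sum(x for x, _ in neighbours) / n
    y0 = sum(y for _, y in neighbours) / n
    return x0, y0


# Three neighbours move right and three move left: the mean is (0.0, 0.0)
print(mean_mv([(2, 0), (2, 0), (2, 0), (-2, 0), (-2, 0), (-2, 0)]))
```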

Majority mv

The majority of the motion vectors are grouped together, and their mean or other representative value is taken as the estimated motion vector:

$$x_0 = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_0 = \frac{1}{N}\sum_{i=1}^{N} y_i$$

where N out of the six motion vectors are in almost the same direction. Since in general all motion vectors can differ from each other, then to find the majority, the motion vectors should be vector quantised, and the majority is found among their original values. This method works well for rigid body movement, where the neighbouring motion vectors normally point in the same direction. However, since there are only six neighbouring motion vectors, a definite majority among them cannot be found reliably. Hence for nonrigid movement, such as lips and eyes, this method may not work well.
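
The grouping step can be sketched as follows. A uniform grid is used here as a stand-in for the vector quantiser, which is an assumption made for brevity; the cell size `step` and the function name are likewise illustrative. The vectors falling in the most populated cell form the majority, and their original (unquantised) values are averaged.

```python
from collections import defaultdict


def majority_mv(neighbours, step=2.0):
    """Average the original vectors in the most populated quantiser cell."""
    cells = defaultdict(list)
    for x, y in neighbours:
        cells[(round(x / step), round(y / step))].append((x, y))
    group = max(cells.values(), key=len)   # ties go to the first cell seen
    n = len(group)
    return sum(x for x, _ in group) / n, sum(y for _, y in group) / n


# Four neighbours agree on roughly (4, 0.5); the two outliers are ignored
x0, y0 = majority_mv([(4.0, 0.6), (4.4, 0.4), (3.6, 0.8), (4.0, 0.2), (-6, 3), (0, -5)])
print(round(x0, 3), round(y0, 3))
```

With only six vectors a clear majority often does not exist, in which case the result depends on how ties are broken.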

Vector median mv

The median of a group of vectors is the vector in the group that has the smallest sum of Euclidean distances from all the others. Thus among the six neighbouring motion vectors mv1–mv6 of Figure 9.28, the jth motion vector, mvj, is the median if:

$$\mathrm{dist}_j = \sum_{i=1}^{6} \lVert mv_j - mv_i \rVert \le \mathrm{dist}_k, \qquad 1 \le k \le 6$$

that is, for all motion vectors mvk, 1 ≤ k ≤ 6, the distance of vector j, distj, is no greater than the distance of vector k, distk [19].

This method is expected to produce a good result: since the median vector has the least distance from all the others, it has the largest correlation with them. Also, since the macroblock to be concealed is at the centre of its neighbours and has the highest correlation with them, its motion vector is expected to have the same property as the median vector. This good performance is achieved at a higher computational cost. Here, the Euclidean distance of each vector from each of the five other vectors should be calculated first, which requires vector distance calculations. Then, for each vector, the five distances are averaged to represent the average distance of that vector from the others. Finally, these averages are rank ordered to find the minimum distance.
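
The vector median can be sketched directly from its definition: compute, for each neighbour, the sum of its Euclidean distances to all the others, and keep the vector with the smallest sum. The function name and sample vectors are illustrative. Using the sum instead of the average of the distances does not change which vector is selected.

```python
import math


def vector_median_mv(neighbours):
    """Return the neighbour with the smallest total distance to the rest."""
    def total_distance(v):
        return sum(math.dist(v, u) for u in neighbours)
    return min(neighbours, key=total_distance)


# The outlier (-10, 8) has almost no influence on the selected vector
print(vector_median_mv([(4, 0), (4, 1), (5, 0), (4, 0), (-10, 8), (3, 0)]))
```

Unlike the mean, the result is always one of the actual neighbouring vectors.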

To compare the relative error concealment performance of each method, four sets of head-and-shoulders type image sequences at QSIF resolutions were subjected to channel errors. In the event of error, the whole GOB was concealed by the above mentioned methods. This is because, due to VLC coding, a single bit error may render the remaining bits up to the next GOB nondecodable, as shown in the erroneous picture of Figure 9.23. Tables 9.5 and 9.6 summarise the quality of these error concealment methods for QCIF video at 5 and 12.5 frames/s, respectively. To show just the impact of error concealment, measurements were carried out only on the concealed areas.

Table 9.5: PSNR [dB] of various error concealment methods at 5 frames/s

Surviving entries: 64 Kbps, 5 fps, QCIF; No errors.

Table 9.6: PSNR [dB] of the various error concealment methods at 12.5 frames/s

Surviving entries: 64 Kbps, 12.5 fps, QCIF; No errors.

As the Tables show, the vector median method gives the best result at both high and low frame rates. That of the majority method is the second best. In all cases, the performance of the average method is as poor as the simple method of top and, in some cases, it is even poorer (seq-1 and seq-4 of Table 9.5). The poor performance of the previous mv means that motion is not uniform. This is particularly evident at the low frame rate of five frames/s. However, all the methods are superior to zero motion, implying that loss concealment by an estimated motion vector improves picture quality.

Also, note that since quality measurements were carried out at the error concealed areas, then the performance at a lower frame rate is poorer than at the higher frame rate. That is, as the frame rate is reduced, the estimated motion vector is less similar to the actual motion vector. Despite this, estimating motion vectors by all the methods gives better performance than not estimating them (zero motion vector). Figure 9.29 shows an accumulated erroneous picture of seq-3 at five frames/s and its concealed one with the median vector method [19].

Figure 9.29: An erroneous picture along with its error concealed version

Bidirectional mv

If B-pictures are present in the group of pictures (GOP), then due to the stronger relation between the motion vector of a B-picture and its anchor P or I-picture, a better estimation of the motion vector can be made. As an example consider a GOP of N = ∞ and M = 2, that is, the image sequence is made of alternate P and B-pictures, as shown in Figure 9.30.

Figure 9.30: A group of alternate P and B-pictures

To estimate a missing motion vector for a P-picture, say P31, the available motion vectors of the same spatial coordinates of the B-pictures can be used, with the following substitutions:

  • if only B23 is available, then P31 = 2 × B23
  • if only B21 is available, then P31 = -2 × B21
  • if both B23 and B21 are available, then P31 = B23 - B21
  • if neither is available, then set P31 = 0

To estimate a missing motion vector of a B-picture, simply divide that of the P-picture by two: B23 = P31/2 or B21 = -P31/2.
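
The substitution rules can be collected into two small functions. Motion vectors are (x, y) tuples and `None` marks a vector that is not available; the function names are illustrative, and the B-picture estimates are simply the P-picture substitutions inverted.

```python
def estimate_p_mv(b23=None, b21=None):
    """Estimate the missing P31 from the co-located B-picture vectors."""
    if b23 is not None and b21 is not None:
        return (b23[0] - b21[0], b23[1] - b21[1])   # P31 = B23 - B21
    if b23 is not None:
        return (2 * b23[0], 2 * b23[1])             # P31 = 2 x B23
    if b21 is not None:
        return (-2 * b21[0], -2 * b21[1])           # P31 = -2 x B21
    return (0, 0)                                   # nothing available


def estimate_b_mvs(p31):
    """Estimate missing B-picture vectors: B23 = P31/2 and B21 = -P31/2."""
    b23 = (p31[0] / 2, p31[1] / 2)
    b21 = (-p31[0] / 2, -p31[1] / 2)
    return b23, b21


print(estimate_p_mv(b23=(2, 1), b21=(-2, -1)))
print(estimate_b_mvs((4, 2)))
```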

Here we have used a simple previous mv estimation method, explained earlier. Although in the tests of Tables 9.5 and 9.6 (image sequences made of P-pictures only) this method did not perform well, it does work well here, since the relation between P and B-pictures is strong. For example, using MPEG-1 video we have achieved about 3–4 dB improvement over the majority method [20]. The amount of improvement is picture dependent, and it appears that for QCIF images coded with H.263 at least 1 dB improvement over the majority method can be achieved. Interested readers should consult [20] for further detailed information.

Loss concealment

In transmission of video over packet networks such as IP, ATM or wireless packet networks, the video data is packed into the payload of the packets. In this transmission mode two types of distortion may occur. One is an error in the payload, which results in erroneous reception of the bit stream, similar to the effect of channel errors. The second is packet loss, which is caused either by an error in the packet header or by excessive queueing delay in a congested network. Excessively delayed packets are of no use and hence they are discarded either by the switching nodes (routers) or by the receiver itself.

Detection, correction and concealment of the error in the packet payload is similar to that for the previous methods mentioned. For packet loss the methods can be slightly different. First, the decoder by examining the packet sequence number discovers that a packet is missing. Second, when a packet is lost, unlike channel errors, no part of the video data is decodable. Hence, loss concealment is more vital to video over packet networks than is error concealment in a nonpacket transporting environment.

Considering that in coding of video, in particular at low bit rates, not all parts of the picture are coded, then the best concealment for noncoded macroblocks is the direct copy of the previous macroblock without any motion compensation (i.e. zero mv). For those which are coded, as Tables 9.5 and 9.6 show, a motion compensated macroblock gives a better result. However, the information as to which macroblock was or was not coded is not available at the decoder. It is obvious that any attempt to replace the noncoded area by the motion compensated macroblock will degrade the image quality rather than improve it. Our simulations show that replacing a noncoded MB with an estimated motion compensated MB would degrade the quality of the pixels in that MB by 7–10 dB [21].

Therefore, for proper loss concealment, the coded and noncoded macroblocks should be discriminated from each other. A noncoded MB should be directly copied from the previous frame, whereas a coded one should be motion compensated with an estimated motion vector (using any of the estimation methods described earlier). A decision on the coding status of a missing MB can be made from the coding status of the MB at the same spatial location in the previous frame. Investigations show that if an MB is coded, it is about 70 per cent certain that it will be coded in the next frame [21]. Also, if an MB is not coded, it is 90 per cent certain that it will not be coded in the next frame. Thus a decision on whether a lost MB should be replaced by direct substitution, or by its motion compensated version, can be made based on the coding status of that MB in the previous frame. We call this method of loss concealment selective loss concealment.
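A minimal sketch of this selective decision is given below. It assumes a frame stored as a list of pixel rows, an integer motion vector interpreted as a displacement into the previous frame, and a flag recording whether the co-located MB was coded in the previous frame; the function name and these conventions are illustrative only.

```python
def conceal_macroblock(prev_frame, x, y, coded_in_prev, est_mv, size=16):
    """Return a size x size replacement for the lost MB whose top-left corner is (x, y).

    If the co-located MB of the previous frame was not coded, the block is
    copied directly (zero mv); otherwise it is motion compensated with est_mv.
    """
    dx, dy = est_mv if coded_in_prev else (0, 0)
    height, width = len(prev_frame), len(prev_frame[0])
    block = []
    for r in range(size):
        row = []
        for c in range(size):
            # Clamp the reference position so that it stays inside the picture.
            yy = min(max(y + r + dy, 0), height - 1)
            xx = min(max(x + c + dx, 0), width - 1)
            row.append(prev_frame[yy][xx])
        block.append(row)
    return block
```

The estimated vector itself can come from any of the estimation methods; only the choice between direct copy and motion compensation is shown here.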

To demonstrate the image enhancement due to loss concealment, the Salesman test image sequence coded at 144 kbit/s, 10 Hz, was exposed to channel errors at a bit error ratio of 10⁻², using the channel error model given in Appendix E [22].

Figure 9.31 shows the objective quality of the entire decoded picture sequence with loss and loss concealment. As the Figure shows, while the quality of the decoded video due to loss is impaired by more than 10 dB, loss concealment enhances degraded image quality by around 7 dB. Figure 9.31 also shows the improvement due to selective concealment versus the full concealment, where a lost macroblock is always replaced by the motion compensated previous macroblock, irrespective of whether a macroblock was coded or not.

Figure 9.31: Quality of decoded video with and without loss concealment with a bit error ratio of 10⁻²

Selection of best estimated motion vector

Although Tables 9.5 and 9.6 show that some methods of estimating a lost motion vector are better than others, they represent the average quality over the entire video sequence. Had we compared these methods on a macroblock by macroblock basis, there would be situations in which the overall best method does not perform well. The reason is that the quality of such error/loss concealment depends on the directions and values of the surrounding motion vectors of that macroblock. What makes poor error/loss concealment is that the motion compensated replacement macroblock shows some pixel discontinuity. This makes the reconstructed picture look blocky, which is very disturbing.

To improve the error/loss concealed image quality, one may apply all the above motion estimation methods, and test for image discontinuity around the reconstructed macroblock. The method that gives the least discontinuity is then chosen. Methods introduced in section 9.7.4 can be used as a discontinuity measure.
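The selection can be sketched as follows. The discontinuity measure used here, a sum of absolute differences across the four edges of the replacement block, is an assumption made for illustration and is not necessarily one of the measures of section 9.7.4; the function names and the list-of-rows frame layout are also hypothetical.

```python
def fetch_block(prev, x, y, mv, size):
    """Motion compensated block from the previous frame (positions clamped to the picture)."""
    h, w = len(prev), len(prev[0])
    dx, dy = mv
    return [[prev[min(max(y + r + dy, 0), h - 1)][min(max(x + c + dx, 0), w - 1)]
             for c in range(size)] for r in range(size)]


def boundary_cost(cur, block, x, y, size):
    """Sum of absolute pixel differences between the block edges and its neighbours."""
    h, w = len(cur), len(cur[0])
    cost = 0
    if y > 0:            # top edge
        cost += sum(abs(block[0][c] - cur[y - 1][x + c]) for c in range(size))
    if y + size < h:     # bottom edge
        cost += sum(abs(block[size - 1][c] - cur[y + size][x + c]) for c in range(size))
    if x > 0:            # left edge
        cost += sum(abs(block[r][0] - cur[y + r][x - 1]) for r in range(size))
    if x + size < w:     # right edge
        cost += sum(abs(block[r][size - 1] - cur[y + r][x + size]) for r in range(size))
    return cost


def best_concealment_mv(cur, prev, x, y, candidates, size=16):
    """Pick the candidate vector whose replacement block is least discontinuous."""
    return min(candidates,
               key=lambda mv: boundary_cost(cur, fetch_block(prev, x, y, mv, size), x, y, size))
```

Here `candidates` would hold the vectors produced by the different estimation methods (zero, previous, median and so on). In practice only the edges that border correctly received macroblocks should be included in the cost.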


Although we have extensively described the scalability under JPEG2000 and MPEG-2, in H.263 it is used with different terminology and we visit this subject again. It might be useful to know that scalability in H.263 is not used for distribution purposes, but more as a layering technique. Hence, by unequal error protection on the base layer, this method in conjunction with the other error resilience methods, explained in section 9.7, further improves the robustness of this codec.

Extensions of H.263 also support temporal, SNR and spatial scalability as optional modes [25-O]. This mode is normally used in conjunction with the error control scheme. The capability of this mode and the extent to which its features are supported is signalled by external means such as H.245 [9].

There are three types of enhancement picture in the H.263+ codec that are known as B, EI and EP-pictures [5]. Each of these has an enhancement layer number, ELNUM, which indicates to which layer it belongs, and a reference layer number, RLNUM, which indicates which layer is used for its prediction. The encoder may use any of its basic scalability modes of temporal, SNR, spatial or their combinations in a multilayer scalability mode. Details of the basic and multilayer scalabilities were given in section 8.5. However, because the nature and applications of H.263 differ from those of MPEG-2, there are some differences.

Temporal scalability

Temporal scalability is achieved using bidirectionally predicted pictures or B-pictures. As usual, B-pictures use prediction from either or both of a previous and subsequent reconstructed picture in the reference layer. These B-pictures differ from the B-picture part of a PB or improved PB frames, in that they are separate entities in the bit stream. They are not syntactically intermixed with a subsequent P or its enhancement part EP.

B-pictures and the B part of PB or improved PB frames are not used as reference pictures for the prediction of any other pictures. This property allows for B-pictures to be discarded if necessary without adversely affecting any subsequent pictures, thus providing temporal scalability. There is no limit to the number of B-pictures that might be inserted between the pairs of the reference pictures in the base layer. A maximum number of such pictures may be signalled by external means (e.g. H.245). However, since H.263 is normally used for low frame rate applications (low bit rates, e.g. mobile), then due to larger separation between the base layer I and P-pictures, there is normally one B-picture between them. Figure 9.32 shows the position of base layer I and P-pictures and the B-pictures of the enhancement layer for most applications.

Figure 9.32: B-picture prediction dependency in the temporal scalability

SNR scalability

In SNR scalability, the difference between the input picture and lower quality base layer picture is coded. The picture in the base layer which is used for the prediction of the enhancement layer pictures may be an I-picture, a P-picture, or the P part of a PB or improved PB frame, but should not be a B-picture or the B part of a PB or its improved version.

In the enhancement layer two types of picture are identified, EI and EP. If prediction is only formed from the base layer, then the enhancement layer picture is referred to as an EI-picture. In this case the base layer picture can be an I or a P-picture (or the P part of a PB frame). It is possible, however, to create a modified bidirectionally predicted picture using both a prior enhancement layer picture and a temporally simultaneous base layer reference picture. This type of picture is referred to as an EP-picture or enhancement P-picture. Figure 9.33 shows the positions of the base and enhancement layer pictures in an SNR scalable coder. The Figure also shows the prediction flow for the EI and EP enhancement pictures.

Figure 9.33: Prediction flow in SNR scalability

For both EI and EP-pictures, prediction from the reference layer uses no motion vectors. However, as with normal P-pictures, EP-pictures use motion vectors when predicting from their temporally prior reference picture in the same layer.

Spatial scalability

The arrangement of the enhancement layer pictures in the spatial scalability is similar to that of SNR scalability. The only difference is that before the picture in the reference layer is used to predict the picture in the spatial enhancement layer, it is interpolated (upsampled) by a factor of two either horizontally or vertically (one-dimensional spatial scalability), or both horizontally and vertically (two-dimensional spatial scalability). Figure 9.34 shows the flow of the prediction in the base and enhancement layer pictures of a spatial scalable encoder.

Figure 9.34: Prediction flow in spatial scalability

Multilayer scalability

Undoubtedly multilayer scalability will increase the robustness of H.263 against channel errors. In the multilayer scalable mode, it is possible not only for B-pictures to be temporally inserted between the base layer pictures of type I, P, PB and improved PB, but also between the enhancement picture types of EI and EP, whether these consist of SNR or spatial enhancement pictures. It is also possible to have more than one SNR or spatial enhancement layer in conjunction with the base layer. Thus a multilayer scalable bit stream can be a combination of SNR layers, spatial layers and B-pictures. As the layer number increases, the size of a picture cannot decrease. Figure 9.35 illustrates the prediction flow in a multilayer scalable encoder.

Figure 9.35: Positions of the base and enhancement layer pictures in a multilayer scalable bit stream

As with the two-layer case, B-pictures may occur in any layer. However, any picture in an enhancement layer which is temporally simultaneous with a B-picture in its reference layer must be a B-picture or the B-picture part of a PB or improved PB frame. This is to preserve the disposable nature of B-pictures. Note, however, that B-pictures may occur in any layers that have no corresponding picture in the lower layers. This allows an encoder to send enhancement video with a higher picture rate than for the lower layers.

The enhancement layer number and the reference layer number of each enhancement picture (B, EI, or EP) are indicated in the ELNUM and RLNUM fields, respectively, of the picture header (when present). If a B-picture appears in an enhancement layer in which temporally surrounding SNR or spatial pictures also appear, the reference layer number (RLNUM) of the B-picture shall be the same as the enhancement layer number (ELNUM). The picture height, width and pixel aspect ratio of a B-picture shall always be equal to those of its temporally subsequent reference layer picture.

Transmission order of pictures

Pictures, which are dependent on other pictures, shall be located in the bit stream after the pictures on which they depend. The bit stream syntax order is specified such that for reference pictures (i.e. pictures having types I, P, EI, EP, or the P part of PB or improved PB), the following two rules shall be obeyed:

  1. All reference pictures with the same temporal reference shall appear in the bit stream in increasing enhancement layer order. This is because each lower layer reference picture is needed to decode the next higher layer reference picture.
  2. All temporally simultaneous reference pictures as discussed in item 1 above shall appear in the bit stream prior to any B-pictures for which any of these reference pictures is the first temporally subsequent reference picture in the reference layer of the B-picture. This is done to reduce the delay of decoding all reference pictures, which may be needed as references for B-pictures.

Then, the B-pictures with earlier temporal references shall follow (temporally ordered within each enhancement layer). The bit stream location of each B-picture shall comply with the following rules:

  1. Be after that of its first temporally subsequent reference pictures in the reference layer. This is because the decoding of the B-pictures generally depends on the prior decoding of that reference picture.
  2. Be after that of all reference pictures that are temporally simultaneous with the first temporally subsequent reference picture in the reference layer. This is to reduce the delay of decoding all reference pictures, which may be needed as references for B-pictures.
  3. Precede the location of any additional temporally subsequent pictures other than B-pictures in its reference layer. Otherwise, it would increase picture storage memory requirement for the reference layer pictures.
  4. Be after that of all EI and EP pictures that are temporally simultaneous with the first temporally subsequent reference picture.
  5. Precede the location of all temporally subsequent pictures within the same enhancement layer. Otherwise, it would introduce needless delay and increase picture storage memory requirements for the enhancement layer.

Figure 9.36 shows two allowable picture transmission orders given by the rules above for the layering structure shown as an example. Numbers next to each picture indicate the bit stream order, separated by commas for the two alternatives.

Figure 9.36: Example of picture transmission order

Buffer regulation

Regulation of output bit rates for better distribution of the target bit rate among the encoding parameters is an important part of any video encoder. This is particularly vital in the H.263 encoder, at least for the following reasons:

  • better bit rate regulation requires larger buffer sizes, hence longer delays
  • H.263 is intended for visual telephony, and the encoding delay should be limited, hence smaller buffer sizes are preferred
  • the target bit rate is in the order of 24 kbit/s, and even small size buffers can introduce long delays.

There is no best known method for buffer regulation, and Recommendation H.263 does not standardise any method (neither do other standard encoders). However, at least for the laboratory simulations, one can use those methods designed for the test models. The following is a method that can be used in the simulations [5]. The bit rate is controlled at a macroblock level, by changing the quantiser parameter, QP, depending on the bit rate, the source and target frame rates.

For the first picture, which is intraframe coded, the quantisation parameter is set to its mid range QP = 16 (QP varies from 1 to 31). After the first picture, the buffer content is set to:


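For the following pictures the quantiser parameter is updated at the beginning of each new macroblock line. The formula for calculating the new quantiser parameter is:

$$ QP_{new} = \overline{QP}_{i-1}\left(1 + \frac{\Delta_1 B}{2\bar{B}} + \frac{12\,\Delta_2 B}{R}\right) $$

with

$$ \Delta_1 B = B_{i-1} - \bar{B}, \qquad \Delta_2 B = B_{i,mb} - \frac{mb}{MB}\,\bar{B} $$

where:

  • $\overline{QP}_{i-1}$ = mean quantiser parameter for the previous picture
  • $B_{i-1}$ = number of bits spent for the previous picture
  • $\bar{B}$ = target number of bits per picture
  • $mb$ = present macroblock number
  • $MB$ = number of macroblocks in a picture
  • $B_{i,mb}$ = number of bits spent until now for the picture
  • $R$ = bit rate
  • $FR$ = frame rate of the source picture (typically 25 or 30 Hz)
  • $f_{target}$ = target frame rate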

The first two terms of the above formula are fixed for macroblocks within a picture. The third term adjusts the quantiser parameter during coding of the picture.

The calculated new quantisation parameter, QPnew, must be adjusted so that the difference fits within the definition of dquant. The buffer content is updated after each complete picture by using the following C function:

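buffer_content = buffer_content + B_i99;   /* B_i99 stands for B(i,99): bits spent on the complete picture */
while (buffer_content > 3 * R / FR) {      /* threshold of 3R/FR assumed from the TMN5 test model */
    buffer_content = buffer_content - (R / FR); frame_incr++;
}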

The variable frame_incr indicates how many times the last coded picture must be displayed. It also indicates which picture from the source is coded next.

To regulate the frame rate, f target, a new target number of bits per picture is calculated at the start of each frame:


For this buffer regulation, it is assumed that the process of encoding is temporarily stopped when the physical transmission buffer is nearly full, preventing buffer overflow. However, this means that no minimum frame rate and delay can be guaranteed.
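As an illustration of the quantiser update, the sketch below assumes the test model form QPnew = QPmean × (1 + Δ1B/(2B) + 12Δ2B/R), where Δ1B is the excess of the previous picture over the target number of bits per picture, B, and Δ2B is the excess of the bits spent so far in the current picture over its proportional share of B. This form, the function and parameter names, and the clipping of the change to ±2 (the range of dquant in baseline H.263) are stated here as assumptions rather than as the definitive method.

```python
def new_qp(qp_prev_mean, bits_prev, bits_target, bits_so_far, mb, mb_total, bit_rate):
    """Quantiser parameter suggested at macroblock number mb of the current picture."""
    delta1 = bits_prev - bits_target                        # fixed within a picture
    delta2 = bits_so_far - (mb / mb_total) * bits_target    # varies during the picture
    return qp_prev_mean * (1 + delta1 / (2 * bits_target) + 12 * delta2 / bit_rate)


def clamp_qp(qp_new, qp_current, max_dquant=2):
    """Limit the change to what dquant can signal and keep QP within 1..31."""
    step = max(-max_dquant, min(max_dquant, round(qp_new) - qp_current))
    return max(1, min(31, qp_current + step))
```

For example, at 24 kbit/s with a target of 2400 bits per picture, a previous picture that used 3600 bits raises a mean QP of 10 to 12.5 before clipping, even when the current picture is exactly on target.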

Advanced video coding (H.26L)

The long-term objective of the ITU-T video coding experts group under the advanced video coding project is to provide a video coding recommendation to perform substantially better than the existing standards (e.g. H.263+) at very low bit rates. The group worked closely with the MPEG-4 experts group of ISO/IEC for more than six years (from 1997 to 2002). The joint work of the ITU-T and the ISO/IEC is currently called H.26L, with L standing for long-term objectives. The final recommendations of the codec will be made available in 2003, but here we report only on what is known so far. These may change over time, but we report on those parts that not only look fundamental to the new codec, but are also more certain to be adopted as recommendations. The codec is expected to be approved as H.264 by the ITU-T and as MPEG-4 part 10 (IS 14496-10) by ISO/IEC [7].

Simulation results show that H.26L has achieved substantial superiority of video quality over that achieved by the existing most optimised H.263 and MPEG-4 codecs [23]. Most notable features of the H.26L are:

  • Up to 50 per cent in bit rate saving: compared with the H.263+ (H.263V2) or MPEG-4 simple profile (see Chapter 10), H.26L permits an average reduction in bit rate of up to 50 per cent for a similar degree of encoder optimisation at most bit rates. This means that H.26L offers consistently higher quality at all bit rates including low bit rates.
  • Adaptation to delay constraints: H.26L can operate in a low delay mode to suit telecommunication applications (e.g. H.264 for videoconferencing), while allowing higher processing delay in applications with no delay constraints such as video storage and server-based video streaming applications (MPEG-4V10).
  • Error resilience: H.26L provides the tools necessary to deal with packet loss in packet networks and bit errors in error-prone wireless networks.
  • Network friendliness: the codec has a feature that conceptually separates the video coding layer (VCL) from the network adaptation layer (NAL). The former provides the core high compression representation of the video picture content and the latter supports delivery over various types of network. This facilitates easier packetisation and better information priority control.

How does H.26L (H.264) differ from H.263

Despite the above mentioned features, the underlying approach of H.26L (H.264) is similar to that adopted by the H.263 standard [7]. That is, it follows the generic transform-based interframe encoder of Figure 3.18, which employs block matching motion compensation, transform coding of the residual errors, scalar quantisation with an adaptive quantiser step size, zigzag scanning and run length coding of the transform coefficients. However, some specific changes are made to the H.26L codec to make it not only more compression efficient, but also more resilient against channel errors. Some of the most notable differences between the core H.26L and the H.263 codecs are:

  • H.26L employs 4 × 4 integer transform block sizes, as opposed to the 8 × 8 floating-point DCT transform used in H.263
  • H.26L employs a much larger number of different motion compensation block sizes per 16 × 16 pixel macroblock; H.263 only supports two such block sizes in its optional mode
  • higher precision of spatial accuracy for motion estimation with quarter pixel accuracy as the default mode for lower complexity mode and eighth pixel accuracy for the higher complexity mode; H.263 uses only half pixel and MPEG-4 uses quarter pixel accuracy
  • the core H.26L uses multiple previous reference pictures for prediction, whereas the H.263 standard uses this feature under the optional mode
  • in addition to I, P and B-pictures, H.26L uses a new type of interstream transitional picture, called an SP-picture
  • the deblocking filter in the motion compensation loop is a part of the core H.26L, whereas H.263 uses it as an option.

In the following sections some details of these new changes known so far are explained.

Integer transform

H.26L is unique among the standard coders in employing a purely integer transform, as opposed to the DCT transform with noninteger elements used in the other standard codecs. The core H.26L specifies a 4 × 4 integer transform which is an approximation to the 4 × 4 DCT (compare matrices in eqn. 9.20), hence it has a similar coding gain to the DCT transform. However, since the integer transform, Tint, has an exact inverse transform, there is no mismatch between the encoder and the decoder. Note that in DCT, due to the approximation of cosine values, the forward and inverse transformation matrices cannot be exactly the inverse of each other and hence encoder/decoder mismatch is a common problem in all standard DCT-based codecs that has to be rectified. Transformation matrices in eqn. 9.20 compare the 4 × 4 Tint integer transform against the 4 × 4 DCT transform of the same dimensions.

$$ T_{int} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}, \qquad T_{DCT} = \begin{bmatrix} 0.5 & 0.5 & 0.5 & 0.5 \\ 0.653 & 0.271 & -0.271 & -0.653 \\ 0.5 & -0.5 & -0.5 & 0.5 \\ 0.271 & -0.653 & 0.653 & -0.271 \end{bmatrix} \qquad (9.20) $$

It may be beneficial to note that applying the transform and then the inverse transform does not return the original data. This is due to scaling factors built into the transform definition, where the scaling factors are different for different frequency coefficients. This scaling effect is removed partly by using different quantiser step sizes for the different frequency coefficients, and by a bit shift applied after the inverse transform. Moreover, use of the smaller block size of 4 × 4 reduces the blocking artefacts. Even more interesting, use of integer numbers makes transformation fast, as multiplication by 2 is simply a shift of data by one bit to the left. This can significantly reduce the processing power at the encoder, which can be very useful in power constrained processing applications, such as video over mobile networks, and allow increased parallelism.

Note that, as we saw in Chapter 3, for the two-dimensional transform the second stage transform is applied in the vertical direction of the first stage transform coefficients. This means transposing the transform elements for the second stage. Hence, for integer transform of a block of 4 × 4 pixels, xij, the 16 transform coefficients, yij are calculated as:

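$$ \begin{bmatrix} y_{00} & y_{01} & y_{02} & y_{03} \\ y_{10} & y_{11} & y_{12} & y_{13} \\ y_{20} & y_{21} & y_{22} & y_{23} \\ y_{30} & y_{31} & y_{32} & y_{33} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} \begin{bmatrix} x_{00} & x_{01} & x_{02} & x_{03} \\ x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix} \qquad (9.21) $$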
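The two-dimensional integer transform and its exact inverse can be checked numerically with the short sketch below. It assumes the 4 × 4 matrix with rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1) and (1, -2, 2, -1); the helper names are illustrative, and exact fractions are used for the inverse in place of the quantiser scaling and bit shifts that a real codec would use.

```python
from fractions import Fraction

# 4 x 4 integer transform matrix (assumed form, elements are 1 or 2 in magnitude)
T = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward(x):
    """Y = T X T^T: rows first, then columns, using integer arithmetic only."""
    return matmul(matmul(T, x), transpose(T))


def inverse(y):
    """X = T^T D^-1 Y D^-1 T, where D = T T^T = diag(4, 10, 4, 10)."""
    norms = [sum(v * v for v in row) for row in T]
    scaled = [[Fraction(y[i][j], norms[i] * norms[j]) for j in range(4)] for i in range(4)]
    return matmul(matmul(transpose(T), scaled), T)
```

Since the rows of the matrix are orthogonal but have different norms (4 and 10), the coefficients carry frequency-dependent scaling factors, which is why applying the transpose alone does not return the original block.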

As an optional mode, H.26L also specifies an integer transform of length 8. In this mode, the integer transform, which again is an approximation to the DCT of length 8, is defined as:

(9.22)

Note that the elements of this integer transform are not powers of two, and hence some multiplications are required. Also, in this optional mode, known as the adaptive block transform (ABT), the two-dimensional blocks could be either 4 × 4, 4 × 8, 8 × 4 or 8 × 8, depending on the texture of the image. Hence, a macroblock may be coded by a combination of these blocks, as required.

Intra coding

Intra coded blocks generate a large amount of data, which could be undesirable for very low bit rate applications. Observations indicate that there are significant correlations among the pixels of adjacent blocks, particularly when the block size is as small as 4 × 4 pixels. In core H.26L, for an intra coded macroblock of 16 × 16 pixels, the difference between the 4 × 4 pixel blocks and their predictions are coded.

In order to perform prediction, the core H.26L offers eight directional prediction modes plus a DC prediction (mode 0) for coding of the luminance blocks, as shown in Figure 9.37 [7]. The arrows of Figure 9.37b indicate the directions of the predictions used for each pixel of the 4 × 4 block in Figure 9.37a.

Figure 9.37: A 4 × 4 luminance pixel block and its eight directional prediction modes

For example, when DC prediction (mode 0) is chosen, then the difference between every pixel of the block and a DC predictor defined as:


is coded, where A, B, C etc. are the pixels at the top and left sides of the block to be coded and // indicates division by rounding to the nearest integer. As another example, in mode 1, which is a vertical direction prediction mode, as shown in Figure 9.37b, every column of pixels uses the top pixel at the border as the prediction. In this mode, the prediction for all four pixels a, e, i and m would be A, and the prediction for, say, pixels d, h, l and p would be pixel D. Some modes can have a complex form of prediction. For example, mode 6, which is called the vertical right prediction mode, indicates that the prediction for, say, pixel c is:

prd = (C + D)//2

and the prediction for pixel e in this mode would be


For the complete set of prediction modes, the H.26L Recommendation should be consulted [7].
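Two of the simpler modes can be illustrated as follows. The vertical mode follows the description above directly; for the DC mode the sketch assumes that the predictor is the rounded mean of the four pixels above and the four pixels to the left of the block, which should be checked against the Recommendation. Function names are illustrative.

```python
def predict_vertical(top):
    """Mode 1: each column is predicted by the border pixel above it (A, B, C, D)."""
    return [list(top) for _ in range(4)]


def predict_dc(top, left):
    """Mode 0 (assumed form): one value for the block, the rounded mean of the neighbours."""
    neighbours = list(top) + list(left)
    dc = (sum(neighbours) + len(neighbours) // 2) // len(neighbours)  # '//' with rounding
    return [[dc] * 4 for _ in range(4)]


def residual(block, prediction):
    """Difference between the 4 x 4 block and its prediction; this is what is transformed."""
    return [[block[r][c] - prediction[r][c] for c in range(4)] for r in range(4)]
```

An encoder would typically form the residual for each available mode and keep the mode that is cheapest to code.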

In the plain areas of the picture there is no need to have nine different prediction modes, as they unnecessarily increase the overhead. Instead, H.26L recommends only four prediction modes: DC, horizontal, vertical and plane [7]. Moreover, in the intra macroblocks of plain areas, normally the AC coefficients of each 4 × 4 block are small and may not be needed for transmission. However, the DC coefficients can be large, but they are highly correlated. For this reason, the H.26L standard suggests that the 16 DC coefficients in a macroblock should be decorrelated by the Hadamard transform. Hence if the DC coefficient of the 4 × 4 integer transform of block (i, j) is called xij, then the two-dimensional Hadamard transform of the 16 DC coefficients gives 16 new coefficients yij as:

$$ Y = H X H^{T}, \qquad H = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \qquad (9.25) $$

in which X and Y are the 4 × 4 matrices of the coefficients xij and yij, respectively,

where y00 is the DC value of all the DC coefficients of the 4 × 4 integer transform blocks. The overall DC coefficient y00 is normally large, but the remaining yij coefficients are small. This mode of intra coding is called the intra 16 × 16 mode [7].

For prediction of chrominance blocks (both intra and inter), since there are only four chrominance blocks of each type in a macroblock, then first of all the recommendation suggests using four prediction modes for a chrominance macroblock. Second, the four DC coefficients of the chrominance 4 × 4 blocks are now Hadamard transformed with the transformation matrix:

$$ H_{2} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} $$

Note all these transformation (as well as the inverse transformation) matrices should be properly weighted such that the reconstructed pixel values are positive with eight bits resolution (see problem 5 in section 9.11).

Inter coding

Interframe predictive coding is where H.26L makes most of its gain in coding efficiency. Motion compensation on each 16 × 16 macroblock can be performed with different block sizes and shapes, as shown in Figure 9.38. In mode 16 × 16, one motion vector per macroblock is used, and in modes 16 × 8 and 8 × 16 there are two motion vectors. In addition, there is a mode known as 8 × 8 split, in which each of the 8 × 8 blocks can be further subdivided independently into 8 × 8, 8 × 4, 4 × 8, 4 × 4 or 8 × 8 intra blocks.

Figure 9.38: Various motion compensation modes

Experimental results indicate that using all block sizes and shapes can lead to bit rate savings of more than 15 per cent compared with the use of only one motion vector per 16 × 16 macroblock. Another 20 per cent saving in bit rate is achieved by representing the motion vectors with quarter pixel spatial accuracy as compared with integer pixel spatial accuracy. It is expected that eighth pixel spatial accuracy will increase the saving rate even further.

Multiple reference prediction

The H.26L standard offers the option of using many previous pictures for prediction. This will increase the coding efficiency as well as producing a better subjective image quality. Experimental results indicate that using five reference frames for prediction results in a bit rate saving of about 5–10 per cent. Moreover, using multiple reference frames improves the resilience of H.26L to errors, as we have seen from Figure 9.22.

In addition to allowing immediately previous pictures to be used for prediction, H.26L also allows pictures to be stored for as long as desired and used at any later time for prediction. This is beneficial in a range of applications, such as surveillance, when switching between a number of cameras in fixed positions can be encoded much more efficiently if one picture from each camera position is stored at the encoder and the decoder.

Deblocking filter

The H.26L standard has adopted some of the optional features of H.263 that have been proven to significantly improve the coding performance. For example, while data partitioning was an option in H.263, it is now in the core H.26L. The other important element of the core H.26L is the use of an adaptive deblocking filter that operates on the horizontal and vertical block edges within the prediction loop to remove the blocking artefacts. The filtering is based on 4 × 4 block boundaries, in which pixels on either side of the boundary may be filtered, depending on the pixel differences across the block boundary, the relative motion vectors of the two blocks, whether the blocks have coefficients and whether one or other block is intra coded [7].

Quantisation and scanning

Similar to the H.263 standard, the H.26L standard also uses a dead band zone quantiser and zigzag scanning of the quantised coefficients. However, the number of quantiser step sizes and the adaptation of the step size are different. While in H.263 there are thirty-one different quantisation levels (QP from 1 to 31), H.26L recommends fifty-two quantisation levels. Hence, H.26L coefficients can be coded much more coarsely than those of H.263 for higher compression and more finely to produce better quality, if the bit rate budget permits (see problem 8, section 9.11). In H.26L the quantiser step size is changed at a compound rate of 12.5 per cent. The fidelity of the chrominance components is improved by using finer quantisation step sizes as compared with those used for the luminance component, particularly when the luminance quantiser step size is large.
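A quick calculation shows what a compound rate of 12.5 per cent implies. The sketch below assumes that the fifty-two levels are indexed from 0 to 51 and expresses each step size relative to the smallest one; the function name is illustrative.

```python
RATE = 1.125  # each quantiser index increases the step size by 12.5 per cent


def relative_step(index):
    """Step size at the given quantiser index, relative to index 0."""
    return RATE ** index


# Six indices roughly double the step size, and the full range of
# fifty-two levels spans a ratio of about 400 between coarsest and finest.
doubling = relative_step(6)
full_range = relative_step(51)
```

By comparison, a linear quantiser parameter running from 1 to 31 spans a ratio of only 31 between its coarsest and finest step sizes.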

Entropy coding

Before transmission, generated data of all types is variable length (entropy) coded. As we discussed in Chapter 3, entropy coding is the third element of the redundancy reduction technique that all the standard video encoders try to employ as best they can. However, the compression efficiency of entropy coding is not of the same degree as that of the spatial and temporal redundancy reduction techniques. Despite this, H.26L has put a lot of emphasis on entropy coding. The reason is that, although with sophisticated entropy coding the entropy of symbols may be reduced by a bit or a fraction of a bit, for very low bit rate applications (e.g. 20 kbit/s), when the symbols are aggregated over the frame or video sequence, the reduction can be quite significant.

The H.26L standard, like its predecessor, specifies two types of entropy coding: Huffman and arithmetic coding. To make Huffman encoding more efficient, it adaptively employs a set of VLC tables based on the context of the symbols. For example, Huffman coding of zigzag scanned transform coefficients is performed as follows. Firstly, a symbol that indicates the number of nonzero coefficients, and the number of these at the end of the scan that have magnitude of one, up to a maximum of three, is encoded. This is followed by one bit for each of the indicated trailing ones to indicate the sign of the coefficient. Then any remaining nonzero coefficient levels are encoded, in reverse zigzag scan order, finishing with the DC coefficient, if nonzero. Finally, the total number of zero coefficients before the last nonzero coefficient is encoded, followed by the number of zero coefficients before each nonzero coefficient, again in reverse zigzag scan order, until all zero coefficients have been accounted for. Multiple code tables are defined for each of these symbol types, and the particular table used adapts as information is encoded. For example, the code table used to encode the total number of zero coefficients before the last nonzero coefficient depends on the actual number of nonzero coefficients, as this limits the maximum value of zero coefficients to be encoded. The processes are complex and for more information the H.26L recommendation [7] should be consulted. However, this complexity is justified by the additional compression advantage that it offers.
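The symbols described above can be derived from a zigzag-scanned block as in the following sketch. Only the symbol derivation is shown; the context-dependent code tables are not modelled. The function name, the dictionary keys and the sign convention (0 for positive, 1 for negative) are assumptions made for illustration.

```python
def cavlc_symbols(coeffs):
    """Derive the entropy coder symbols from a list of zigzag-scanned coefficients."""
    positions = [i for i, c in enumerate(coeffs) if c != 0]
    total_coeffs = len(positions)

    # Coefficients of magnitude one at the end of the scan, at most three.
    trailing_ones = 0
    for i in reversed(positions):
        if abs(coeffs[i]) != 1 or trailing_ones == 3:
            break
        trailing_ones += 1

    reversed_levels = [coeffs[i] for i in reversed(positions)]
    sign_bits = [0 if v > 0 else 1 for v in reversed_levels[:trailing_ones]]
    levels = reversed_levels[trailing_ones:]        # remaining levels, reverse scan order

    # Zeros that precede the last nonzero coefficient.
    total_zeros = positions[-1] + 1 - total_coeffs if positions else 0

    # Zero runs before each nonzero coefficient, in reverse order, until all
    # zeros are accounted for (the run before the first coefficient is implied).
    runs, zeros_left = [], total_zeros
    for k in range(total_coeffs - 1, 0, -1):
        if zeros_left == 0:
            break
        run = positions[k] - positions[k - 1] - 1
        runs.append(run)
        zeros_left -= run

    return {"total_coeffs": total_coeffs, "trailing_ones": trailing_ones,
            "sign_bits": sign_bits, "levels": levels,
            "total_zeros": total_zeros, "runs": runs}
```

For the scanned block 0, 3, 0, 1, -1, -1, 0, 1 followed by zeros, this gives five nonzero coefficients, three trailing ones, remaining levels 1 and 3, three zeros in total and runs of 1, 0, 0 and 1.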

For arithmetic coding, H.26L uses context-based adaptive binary arithmetic coding (CABAC). In this mode the intersymbol redundancy of already coded symbols in the neighbourhood of the current symbol to be coded is used to derive a context model. Different models are used for each syntax element (e.g. motion vectors and transform coefficients use different models) and, as with Huffman coding, the models are adapted based on the context of the neighbouring blocks. When the symbol is not binary, it is mapped onto a sequence of binary decisions, called bins. The actual binarisation is done according to a given binary tree. Each binary decision is then encoded with the arithmetic encoder using the probability estimates that were updated during the previous context modelling stage. After each bin has been encoded, the probability estimate for the binary symbol that was just encoded is increased. Hence the model keeps track of the actual statistics. Experiments with test images indicate that CABAC in H.26L reduces the bit rate by up to 10 per cent compared with the Huffman code, even though the adaptive Huffman coding used in H.26L is itself more efficient than the conventional nonadaptive methods used in the other codecs.
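The adaptation step can be sketched with a simple frequency-count model. This is not the probability state machine of the Recommendation: the counting estimator, the unary binarisation and the use of a single context for all bins are simplifying assumptions, and the cost returned is the ideal arithmetic-coding cost rather than the output of a real coder.

```python
import math


class AdaptiveBinaryModel:
    """Frequency-count probability estimate for one context (illustrative only)."""

    def __init__(self):
        self.counts = [1, 1]  # start from a uniform estimate

    def probability(self, bin_value):
        return self.counts[bin_value] / sum(self.counts)

    def code(self, bin_value):
        """Return the ideal cost in bits of this bin, then adapt the estimate."""
        bits = -math.log2(self.probability(bin_value))
        self.counts[bin_value] += 1  # the estimate of the coded symbol is increased
        return bits


def unary_bins(symbol):
    """Map a non-negative integer onto bins (unary tree, chosen for simplicity)."""
    return [1] * symbol + [0]


def ideal_cost(symbols, model):
    """Total ideal cost in bits of coding all bins of all symbols."""
    return sum(model.code(b) for s in symbols for b in unary_bins(s))


model = AdaptiveBinaryModel()
cost = ideal_cost([0] * 20, model)  # twenty identical, highly predictable bins
```

As the model learns that one bin value dominates, the cost per bin falls well below one bit, which is the kind of fractional saving that a fixed code with at least one bit per symbol cannot provide.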

Switching pictures

One of the applications for the H.26L standard is video streaming. In this operation the decoder is expected to decode a compressed video bit stream generated by a standard encoder.

One of the key requirements in video streaming is to adapt the transmission bit rate of the compressed video according to the network congestion condition. If the video stream is generated by an online encoder (real time), then according to the network feedback, rate adaptation can be achieved on the fly by adjusting the encoder parameters such as the quantiser step size, or in the extreme case by dropping frames. In typical streaming, where preencoded video sequences are to be streamed from a video server to a client, the above solution cannot be used. This is because any change in the bit stream makes the decoded picture different from the locally decoded picture of the encoder. Hence the quality of the decoded picture gradually drifts from that of the locally decoded picture of the encoder, and the visual artefacts will further propagate in time so that the quality eventually becomes very poor.

The simplest way of achieving scalability of preencoded bit streams is by producing multiple copies of the same video sequence at several bit rates, and hence qualities. The server then dynamically switches between the bit streams, according to the network congestion or the bandwidth available to the client. However, in switching on the P-pictures, since the prediction frame in one bit stream is different from the other, the problem of picture drift remains unsolved.

To rectify picture drift in bit stream switching, the H.26L Recommendation introduces a new type of picture, called the switching picture or the secondary picture, for short: SP-picture. SP-pictures are generated by the server at the time of switching from one bit stream to another and are transmitted as the first picture after the switching. To see how SP-pictures are generated and how they can prevent picture drift, consider switching between two bit streams as an example, shown in Figure 9.39.

Figure 9.39: Use of S frame in bit stream switching

Consider a video sequence encoded at two different bit rates, generating bit streams A and B. Also, consider a client who up to time N has been receiving bit stream A, and at time N wishes to switch to bit stream B. The bit streams per frame at this time are identified as $S_A$ and $S_B$, respectively.

To analyse the picture drift, let us assume in both cases that the pictures are entirely predictively coded (P-pictures). Thus at time N, in bit stream A the difference between frames N and N - 1, $S_A = A_N - A_{N-1}$, is transmitted, and for bit stream B, $S_B = B_N - B_{N-1}$. The decoded picture in the receiver prediction loop before switching would be frame $A_{N-1}$. Hence, for drift-free pictures, at the switching time the decoder expects to receive $S_{AB} = B_N - A_{N-1}$ from bit stream B, which is different from $S_B = B_N - B_{N-1}$. Therefore, at the switching time, a new bit stream for the duration of one frame, SP, should be generated, as shown in Figure 9.39. SP $= S_{AB}$ now replaces $S_B$ at the switching time; if the switching were from bit stream B to A, SP $= S_{BA} = A_N - B_{N-1}$ would be needed to replace $S_A$.
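The argument can be checked numerically with a small sketch. It treats each picture as a short list of pixel values and ignores quantisation of the residual, so it only illustrates the mismatch of prediction references; the pixel values and function names are made up for the example.

```python
def residual(target, reference):
    """Prediction error transmitted for `target` predicted from `reference`."""
    return [t - r for t, r in zip(target, reference)]


def reconstruct(reference, diff):
    """Decoder side: add the received prediction error to the stored reference."""
    return [r + d for r, d in zip(reference, diff)]


def switch_from_a_to_b(a_prev, b_prev, b_now):
    """Decode frame N after switching, with and without a switching picture."""
    s_b = residual(b_now, b_prev)    # what bit stream B normally carries
    s_ab = residual(b_now, a_prev)   # switching picture for A -> B
    drifted = reconstruct(a_prev, s_b)   # decoder still holds A(N-1)
    exact = reconstruct(a_prev, s_ab)
    return drifted, exact


a_prev = [100, 102, 98, 101]   # decoded frame N-1 of bit stream A
b_prev = [96, 104, 97, 99]     # decoded frame N-1 of bit stream B
b_now = [97, 105, 99, 100]     # decoded frame N of bit stream B
drifted, exact = switch_from_a_to_b(a_prev, b_prev, b_now)
```

Here `exact` equals frame N of bit stream B, whereas `drifted` differs from it by the mismatch between the two reference pictures, and that error would propagate through all following P-pictures.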

The switching pictures $S_{AB}$ and $S_{BA}$ can be generated in two ways, each for a specific application. If it is desired to switch between two bit streams at any time, then at the switching point we need a video transcoder to generate the switching pictures where they are wanted. Figure 9.40 shows such a transcoder, where each bit stream is decoded and then reencoded with the required prediction picture. If a macroblock is intra coded, the intra coded parts of the S frame of the switched bit stream are copied directly into the SP-picture.

Figure 9.40: Creation of the switching pictures from the bit streams

Although the above method makes it possible to switch between two bit streams at any time, it is expensive, since a transcoder has to operate alongside the bit streams. A cheaper solution is to generate these switching pictures ($S_{AB}$ and $S_{BA}$) offline and store them alongside bit streams A and B. Normally, these pictures are created periodically, at reasonable intervals. Here there is a trade-off between how frequently switching between the bit streams is possible and the storage required for these pictures. It is in this application that these pictures are called secondary pictures, as well as switching pictures.

Random access

Another vital requirement in video streaming is to be able to access the video stream and decode the pictures almost at any time. In accessing preencoded video, we have seen that the approach taken by the MPEG codecs was to insert I-pictures at regular intervals (e.g. every 12 frames). Due to the high bit rates of these pictures, their use in very low bit rate codecs, such as H.26L, would be very expensive. The approach taken by the H.26L standard is to create secondary pictures at the time access is required. These pictures are intraframe coded I-pictures and are encoded in such a way as to generate a decoded picture identical to the frame to be accessed. Figure 9.41 shows the position of the secondary picture for accessing the bit stream at the switching frame, S.

Figure 9.41: The position of the secondary picture in random accessing of the bit stream

In the Figure the bit stream is made of P-pictures for high compression efficiency. One of these pictures, identified as picture S, is the picture where the client can access the bit stream if he wishes to. From the original picture an SI-picture is encoded to provide an exact match for picture S. Note that the first part of the accessed bit stream is now the data generated by the SI-picture. Thus the amount of bits at the start is high. However, since in video streaming at each session accessing occurs only once, these extra bits compared with the duration of the session are negligible. The average bit rate hardly changes, and is much less than when regular I-pictures are present.

In the above method, similar to the offline creation of secondary pictures for bit stream switching, the frequency of SI pictures is traded against the storage capacity required to store them along with the bit stream.

Network adaptation layer

An important feature of H.26L is the separation of the video coding layer (VCL) from the network adaptation layer (NAL). The VCL is responsible for high compression representation of the video picture content. In the previous sections some of the important methods used in the video coding layer were presented. In this section we describe the NAL, which is responsible for efficient delivery of compressed video over the underlying transport network. Specifications for the NAL depend on the nature of the transport network. For example, for fixed bit rate channels, the MPEG-2 transport layer can be used. In this mode, 184 bytes of compressed video along with a four-byte header are packed into a 188-byte packet. Even if the channel is variable in rate, the MPEG-2 transport protocol can still be used, provided that some form of quality of service guarantee is available, as in ATM networks. In this case each MPEG-2 packet is segmented into four ATM cells, each cell carrying 47 bytes in its payload.
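The packet arithmetic above can be written out as a brief sketch. The 184, 4 and 47 byte figures come from the text; the 53-byte ATM cell size is the standard cell length and is not stated in this section, and padding of the last transport packet to full length is an assumption.

```python
TS_PAYLOAD_BYTES = 184   # compressed video per MPEG-2 transport packet
TS_HEADER_BYTES = 4      # transport packet header
ATM_PAYLOAD_BYTES = 47   # payload carried per ATM cell in this mapping
ATM_CELL_BYTES = 53      # standard ATM cell size (not stated in the text)


def transport_cost(video_bytes):
    """Return (transport packets, ATM cells, bytes on the line) for a video payload.

    Assumes the last transport packet is padded to full length.
    """
    ts_packets = -(-video_bytes // TS_PAYLOAD_BYTES)  # ceiling division
    ts_bytes = ts_packets * (TS_PAYLOAD_BYTES + TS_HEADER_BYTES)
    atm_cells = ts_bytes // ATM_PAYLOAD_BYTES  # 188 / 47 = 4 cells per packet
    return ts_packets, atm_cells, atm_cells * ATM_CELL_BYTES


packets, cells, line_bytes = transport_cost(1000)
```

For a hypothetical 1000-byte slice this gives six transport packets and 24 cells, which shows how the two layers of headers add to the bit rate seen by the network.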

Perhaps the most interesting type of NAL is the one for the transport of a compressed video stream over a nonreliable network such as the best effort Internet protocol (IP) network. In this application NAL takes the compressed bit stream and converts it into packets that can be conveyed directly over the real time transport protocol (RTP) [24].

RTP has emerged as the de facto specification for the transport of time sensitive streamed media over IP networks. RTP provides a wrapper for media data that identifies the media type and provides synchronisation information in the form of a time stamp. These features enable individual packets to be reconstructed into a media stream by a recipient. To enable flow control and management of the media stream, additional information is carried between the sender and the receiver using the associated RTP control protocol (RTCP). Whenever an RTP stream is required, associated out of band RTCP flows between the sender and receiver are established. This enables the sender and receiver to exchange information concerning the performance of the media transmission, which may be of use to higher level application control functions. Typically, RTP is mapped onto and carried in user datagram protocol (UDP) packets. In such cases, an RTP session is usually associated with an even numbered port and its associated RTCP flows with the next higher odd numbered port.
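As an illustration of the wrapper, the following sketch packs and parses the 12-byte fixed RTP header defined in RFC 1889 [24] (version, marker, payload type, sequence number, time stamp and synchronisation source). The payload type value 96 is only a placeholder from the dynamic range and is an assumption, as are the function names; header extensions and contributing sources are left out.

```python
import struct

RTP_VERSION = 2


def build_rtp_header(sequence_number, timestamp, ssrc, payload_type=96, marker=False):
    """Pack the 12-byte fixed RTP header (no padding, extension or CSRC list)."""
    first = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    second = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", first, second, sequence_number & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(header):
    """Recover the fields that let a receiver reorder and time the packets."""
    first, second, sequence_number, timestamp, ssrc = struct.unpack("!BBHII", header[:12])
    return {
        "version": first >> 6,
        "marker": bool(second >> 7),
        "payload_type": second & 0x7F,
        "sequence_number": sequence_number,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def rtcp_port(rtp_port):
    """RTCP conventionally uses the odd port directly above the even RTP port."""
    if rtp_port % 2:
        raise ValueError("RTP is conventionally carried on an even port")
    return rtp_port + 1


header = build_rtp_header(7, 90000, 0x1234, marker=True)
fields = parse_rtp_header(header)
```

The sequence number lets the receiver detect lost or reordered packets, and the time stamp lets it place each packet on the media timeline.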

Encapsulation of video bit stream into RTP packets should be done with the following procedures:

  • arrange partitions in an intelligent way into packets
  • if necessary, split/merge partitions to match the packet size constraints
  • avoid the redundancy of (mandatory) RTP header information and information in the video stream
  • define receiver/decoder reactions to packet losses.

In packetised transmission, the packetisation scheme has to make some assumptions on typical network conditions and constraints. The following set of assumptions has been made on packetised transmission of H.26L video streams:

  • maximum packet size: around 1500 bytes for any transmission medium except the dial-up link, for which 500 byte packets are to be used
  • packet loss characteristics: nonbursty, due to drop tail router implementations and assuming reasonable pacing algorithm (e.g. no bursting occurs at the sender)
  • packet loss rate: up to 20 per cent.

In packetised transmission, the quality of video can be improved by stronger protection of the important packets against errors or losses. This is very much in line with the data partitioning feature of the H.26L coded bit stream. In the NAL it is expected that the bit stream generated per slice will be data partitioned into three parts, each being carried in a different packet. The first packet contains the important part of the compressed data, such as header type, macroblock type and address, motion vectors etc. The second packet contains intra information, and the third packet is assembled with the rest of the video bit stream, such as transform coefficients.

This configuration allows decoding of the first packet independently from the second and third packets, but not vice versa. The first packet is the most important, since it carries the most vital video information and is also necessary to decode the second and third packets; the second packet is in turn more important than the third. Hence the first packet is more heavily protected with forward error correction than the second and third packets.

It should be noted that slices should be structured to ensure that all packets meet the maximum packet size to avoid a network splitting/recombination process. This means inserting the slice header at appropriate points.
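One way to picture the placement of slice headers is a greedy grouping of macroblocks, sketched below. The per-macroblock sizes, the four-byte slice header and the function name are hypothetical, and each slice is treated as a single packet rather than as three partitions, so this is only an illustration of the size constraint.

```python
def split_into_slices(mb_sizes, max_packet=1500, slice_header=4):
    """Group macroblock indices into slices that each fit in one packet.

    mb_sizes: coded size in bytes of each macroblock (hypothetical values).
    A single macroblock larger than the limit still gets its own slice.
    """
    slices, current, used = [], [], slice_header
    for index, size in enumerate(mb_sizes):
        if current and used + size > max_packet:
            slices.append(current)               # close the slice here
            current, used = [], slice_header     # a new slice header is inserted
        current.append(index)
        used += size
    if current:
        slices.append(current)
    return slices


lan_slices = split_into_slices([600, 600, 600, 200])
dialup_slices = split_into_slices([200, 200, 200], max_packet=500)
```

With the 1500-byte limit the four macroblocks fall into two slices, and with the 500-byte dial-up limit the three smaller macroblocks are also split into two, so no packet needs to be fragmented by the network.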



Problem 1. Assume the DCT coefficients of problem 3 of Chapter 8 are generated by an H.263 encoder. After zigzag scanning, and quantisation with th = q = 8, they are converted into three-dimensional events of (last, run, index). Identify these events.

Answer: Prepend 0 to every event of problem 3 of Chapter 8 except the last event, to which 1 is prepended instead; no EOB is needed. For example, the first event is (0, 4, 0) and the last event is (1, 2, -1).


Problem 2. The neighbouring motion vectors of the motion vector MV are shown in Figure 9.42. Find:

  1. the median of the neighbouring motion vectors
  2. the motion vector data (MVD), if the motion vector MV is (2,1).

Figure 9.42

Answer: For x, the median of (3, 4, -1) is 3 and for y, the median of (-3, 3, 1) is 1. Hence the prediction vector is (3, 1) and MVD = (2 - 3, 1 - 1) = (-1, 0).
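The median prediction in the answer above can be checked with a few lines of code. The pairing of the x and y components into three neighbouring vectors is an assumption (the figure is not reproduced here), but the component-wise median does not depend on that pairing.

```python
from statistics import median


def predict_mv(neighbours):
    """Component-wise median of the neighbouring motion vectors."""
    xs = [x for x, _ in neighbours]
    ys = [y for _, y in neighbours]
    return median(xs), median(ys)


def motion_vector_data(mv, neighbours):
    """Difference between the actual vector and its median prediction."""
    px, py = predict_mv(neighbours)
    return mv[0] - px, mv[1] - py


neighbours = [(3, -3), (4, 3), (-1, 1)]  # assumed pairing of the components
prediction = predict_mv(neighbours)
mvd = motion_vector_data((2, 1), neighbours)
```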


Problem 3. The intensities of four pixels A, B, C and D at the borders of two macroblocks are given in Figure 9.43.

Figure 9.43

Using the deblocking filter of eqn. 9.4, find the interpolated pixels, B1 and C1, at the macroblock boundary for each of:

  1. A = 100, B = 150, C = 115 and D = 50
  2. A = B = 150 and C = D = 50

Assume the quantiser parameter of macroblock 2 is QP = 16.

Answer:

  1. B1 = 150 - 8 = 142 and C1 = 115 + 8 = 123
  2. d = -31.25 and d1 = 0, hence B and C do not change.


Problem 4. Figure 9.44 shows the six neighbouring macroblocks of a lost motion vector. The values of these motion vectors are also given. Calculate the estimated motion vector for this macroblock for each of the following loss concealment methods:

  1. top
  2. bottom
  3. mean
  4. majority
  5. vector median

Figure 9.44

Answer:

  1. (3, 4)
  2. (0, -3)
  3. (1, 0.5)
  4. (3, 2.6)
  5. (-1, -1)


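Problem 5. Show that for the integer transform of length four to be orthonormal, the DC and the second AC coefficients should be divided by 2, but the first and the third AC coefficients should be divided by $\sqrt{10}$. Determine the inverse transformation matrix and show that it is an orthonormal matrix.

Answer: For a matrix to be orthonormal, the inner product of each row with itself should be 1. In rows 1 and 3 (basis vectors 0 and 2) this product is 4, hence their elements should be divided by $\sqrt{4} = 2$. In rows 2 and 4 the products give 4 + 1 + 1 + 4 = 10, hence their elements should be divided by $\sqrt{10}$. Thus the forward 4 × 4 integer transform becomes

$$T = \begin{bmatrix} 1/2 & 1/2 & 1/2 & 1/2 \\ 2/\sqrt{10} & 1/\sqrt{10} & -1/\sqrt{10} & -2/\sqrt{10} \\ 1/2 & -1/2 & -1/2 & 1/2 \\ 1/\sqrt{10} & -2/\sqrt{10} & 2/\sqrt{10} & -1/\sqrt{10} \end{bmatrix}$$

and the inverse transform is its transpose, $T^{-1} = T^{T}$. As can be tested, this inverse transform is orthonormal, e.g. for its first row: $(1/2)^2 + (2/\sqrt{10})^2 + (1/2)^2 + (1/\sqrt{10})^2 = 1/4 + 4/10 + 1/4 + 1/10 = 1$.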
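The normalisation argued above can be verified numerically. The sketch assumes that the H.26L 4 × 4 integer transform has rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1) and (1, -2, 2, -1); the matrix is not reproduced in this section, so this is stated as an assumption, and the helper names are illustrative.

```python
import math

# Assumed integer transform matrix (rows are the basis vectors).
CORE = [[1, 1, 1, 1],
        [2, 1, -1, -2],
        [1, -1, -1, 1],
        [1, -2, 2, -1]]


def normalise_rows(matrix):
    """Divide every row by the square root of its inner product with itself."""
    return [[v / math.sqrt(sum(x * x for x in row)) for v in row] for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)] for row in a]


def is_identity(matrix, tol=1e-12):
    return all(abs(v - (1.0 if i == j else 0.0)) <= tol
               for i, row in enumerate(matrix) for j, v in enumerate(row))


row_energies = [sum(x * x for x in row) for row in CORE]  # 4 or 10 per row
FORWARD = normalise_rows(CORE)
INVERSE = transpose(FORWARD)
```

Multiplying `FORWARD` by `INVERSE` in either order gives the identity matrix to within rounding error, which confirms that the transpose acts as the inverse once the rows are divided by 2 and by the square root of 10.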


Problem 6. A block of 4 × 4 pixels given by:

is two-dimensionally transformed by a 4 × 4 DCT and by the integer transform of problem 5. The coefficients are zigzag scanned and N out of 16 coefficients in the scanning order are retained. For each transform determine the reconstructed block, and its PSNR value, given that the number of retained coefficients is:

  1. N = 10
  2. N = 6
  3. N = 3

Answer: With the integer transform of problem 5, the two-dimensional transform coefficients will be

$$\begin{bmatrix} 431 & -156 & 91 & -15 \\ 43 & 52 & 30 & 1 \\ -6 & -46 & -26 & -7 \\ -13 & 28 & -19 & 14 \end{bmatrix}$$

The reconstructed pixels with the retained coefficients are, for N = 10:

$$\begin{bmatrix} 105 & 121 & 69 & 21 \\ 69 & 85 & 62 & 44 \\ 102 & 100 & 98 & 119 \\ 196 & 175 & 164 & 195 \end{bmatrix}$$

which gives an MSE of 128.75, or a PSNR of 27.03 dB. The reconstructed pixels with 6 and 3 retained coefficients give PSNRs of 22.90 and 18 dB, respectively.

With the 4 × 4 DCT, these values are 26.7, 23.05 and 17.24 dB, respectively.

As can be seen, the integer transform performs practically the same as the DCT. Where it appears to be slightly better, this is due to the approximation of the cosine elements.
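The conversion from mean squared error to PSNR used in this answer can be reproduced as follows, assuming 8-bit pixels with a peak value of 255; the function names are illustrative.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized lists of pixel values."""
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared) / len(squared)


def psnr(mean_squared_error, peak=255):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10 * math.log10(peak ** 2 / mean_squared_error)


psnr_n10 = psnr(128.75)  # MSE quoted for N = 10
```

Rounded to two decimal places, `psnr_n10` is 27.03 dB, in agreement with the value quoted above.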


Problem 7. The fifty-two quantiser levels of an H.26L encoder may be indexed from 0 to 51. If at the lowest index, index_0, the quantiser step size is QP, find the quantiser step sizes at the following indices:

  1. index_8
  2. index_16
  3. index_32
  4. index_48
  5. index_51

Answer: With index-0 = QP:

  1. index-8 = 2QP
  2. index-16 = 4QP
  3. index-32 = 16QP (index-24 = 8QP and index-40 = 32QP)
  4. index-48 = 64QP
  5. index-51 = 88QP


Problem 8. If the lowest index of H.26L has the same quantisation step size as the lowest index of H.263, show that the H.26L quantiser is finer at lower step sizes but is coarser at larger step sizes than that of H.263.

Answer: At lower indices H.263 is coarser, e.g. at index-8 the quantiser step size for H.263 is 8QP, but for H.26L it is 2QP. At higher indices, the largest quantiser step size for H.263 is 31QP, but that of H.26L is 88QP; hence at larger indices H.26L has a coarser quantiser.





1 Draft ITU-T Recommendation, H.263: 'Video coding for low bit rate communication'. July 1995

2 H.261: 'ITU-T Recommendation, H.261, video codec for audiovisual services at p × 64 kbit/s'. Geneva, 1990

3 MPEG-1: 'Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s'. ISO/IEC 11172-2: video, November 1991

4 MPEG-2: 'Generic coding of moving pictures and associated audio information'. ISO/IEC 13818-2 video, draft international standard, November 1994

5 Draft ITU-T Recommendation H.263+: 'Video coding for very low bit rate communication'. September 1997

6 ITU-T recommendation, H.263++: 'Video coding for low bit rate communication'. ITU-T SG16, February 2000

7 Joint video team (JVT) of ISO/IEC MPEG and ITU-T VCEG: text of committee draft of joint video specification. ITU-T Rec, H.264 | ISO/IEC 14496-10 AVC, May 2002

8 MPEG-4: 'Testing and evaluation procedures document'. ISO/IEC JTC1/SC29/WG11, N999, July 1995

9 ITU-T Recommendation, H.245 'Control protocol for multimedia communication'. September 1998

10 WALLACE, G.K.: 'The JPEG still picture compression standard', Commun. ACM, 1991, 34, pp.30–44

11 GHANBARI, M., DE FARIA, S., GOH, I.N., and TAN, K.T.: 'Motion compensation for very low bit rate video', Signal Process., Image Commun., 1994, 7, pp.567–580

12 NETRAVALI, A.N., and ROBBINS, J.B.: 'Motion-compensated television coding: Part I', Bell Syst. Tech. J., 1979, 58, pp.631–670

13 LOPES, F.J.P., and GHANBARI, M.: 'Analysis of spatial transform motion estimation with overlapped compensation and fractional-pixel accuracy', IEE Proc., Vis. Image Signal Process., 1999, 146, pp.339–344

14 SEFERIDIS, V., and GHANBARI, M.: 'General approach to block matching motion estimation', Opt. Eng., 1993, 32, pp.1464–1474

15 NAKAYA, Y., and HARASHIMA, H.: 'Motion compensation based on spatial transformation', IEEE Trans. Circuits Syst. Video Technol., 1994, 4, pp.339–356

16 BLAHUT, R.E.: 'Theory and practice of error control codes' (Addison-Wesley, 1983)

17 KHAN, E., GUNJI, H., LEHMANN, S., and GHANBARI, M.: 'Error detection and correction in H.263 coded video over wireless networks'. International workshop on packet video, PVW2002, Pittsburgh, USA

18 GHANBARI, M., and SEFERIDIS, V.: 'Cell loss concealment in ATM video codecs', IEEE Trans. Circuits Syst. Video Technol., Special issue on packet video, 1993, 3:3, pp.238–247

19 GHANBARI, S., and BOBER, M.Z.: 'A cluster-based method for the recovery of lost motion vectors in video coding'. The 4th IEEE conference on Mobile and wireless communications networks, MWCN'2002, September 9–11, 2002, Stockholm, Sweden

20 SHANABLEH, T., and GHANBARI, M.: 'Loss concealment using B-pictures motion information', IEEE Trans. Multimedia (to be published in 2003)

21 LIM, C.P., TAN, E.A.W., GHANBARI, M., and GHANBARI, S.: 'Cell loss concealment and packetisation in packet video', Int. J. Imaging Syst. Technol., 1999, 10, pp.54–58

22 ITU SGXV working party XV/I, Experts Group for ATM video coding, working document AVC-205, January 1992

23 JOCH, A., KOSSENTINI, F., and NASIOPOULOS, P.: 'A performance analysis of the ITU-T draft H.26L video coding standard'. Proceedings of 12th international Packet video workshop, 24–26 April 2002, Pittsburgh, USA

24 SCHULZRINNE, H., CASNER, S., FREDERICK, R., and JACOBSON, V.: 'RFC 1889: RTP: A transport protocol for real-time applications'. Audio-video transport working group, January 1996

25 Some of the H.263 annexes used in this chapter:

  • Annex C: 'Considerations for multipoint'
  • Annex D: 'Unrestricted motion vector mode'
  • Annex E: 'Syntax-based arithmetic coding mode'
  • Annex F: 'Advanced prediction mode'
  • Annex G: 'PB frames mode'
  • Annex H: 'Forward error correction for coded video signal'
  • Annex I: 'Advanced intra coding mode'
  • Annex J: 'Deblocking filter mode'
  • Annex K: 'Slice structured mode'
  • Annex M: 'Improved PB frames mode'
  • Annex N: 'Reference picture selection mode'
  • Annex O: 'Temporal, SNR, and spatial scalability mode'
  • Annex R: 'Independent segment decoding mode'
  • Annex S: 'Alternative inter VLC mode'
  • Annex V: 'Data partitioning'

Standard Codecs: Image Compression to Advanced Video Coding (IET Telecommunications Series)
ISBN: 0852967101
Year: 2005
Pages: 148
Authors: M. Ghanbari
