3.3 Temporal redundancy reduction

Temporal redundancy is reduced by coding the differences between successive frames; this is called interframe coding. For static parts of the image sequence, the temporal differences are close to zero and hence are not coded. Parts that change between frames, either through illumination variation or through motion of objects, produce a significant interframe error, which needs to be coded. Image changes due to motion can be reduced substantially if the motion of the object can be estimated and the difference is taken on the motion compensated image.

Figure 3.6 shows the interframe error between successive frames of the Claire test image sequence and its motion compensated counterpart. It is clear that motion compensation can substantially reduce the interframe error.

Figure 3.6: (a) interframe error; (b) motion compensated interframe error

3.3.1 Motion estimation

To carry out motion compensation, the motion of the moving objects has to be estimated first; this is called motion estimation. The motion estimation technique commonly used in all the standard video codecs is the block matching algorithm (BMA). In a typical BMA, a frame is divided into blocks of M × N pixels or, more usually, square blocks of N × N pixels [3]. Then, for a maximum motion displacement of w pixels per frame, the current block of pixels is matched against the corresponding block at the same coordinates in the previous frame, within a square search window of width N + 2w (Figure 3.7). The best match according to a matching criterion yields the displacement.

Figure 3.7: The current and previous frames in a search window

Various measures such as the cross correlation function (CCF), mean-squared error (MSE) and mean absolute error (MAE) can be used as the matching criterion [4–6]. For the best match, the correlation has to be maximised with CCF, whereas with the latter two measures the distortion must be minimised. In practical coders both MSE and MAE are used, since it is believed that CCF would not give good motion tracking, especially when the displacement is not large [6]. The MSE and MAE matching functions are defined, for MSE, as:

$$M(i, j) = \frac{1}{N^2}\sum_{m=1}^{N}\sum_{n=1}^{N}\left[f(m, n) - g(m+i,\, n+j)\right]^2, \qquad -w \le i,\, j \le w$$

and for MAE:

$$M(i, j) = \frac{1}{N^2}\sum_{m=1}^{N}\sum_{n=1}^{N}\left|f(m, n) - g(m+i,\, n+j)\right|, \qquad -w \le i,\, j \le w$$

where f(m, n) represents the current block of N × N pixels at coordinates (m, n) and g(m + i, n + j) represents the corresponding block in the previous frame at the displaced coordinates (m + i, n + j). At the best matched position, i = a and j = b, and the motion vector MV(a, b) represents the displacement of all the pixels within the block.
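As an illustration, the two matching functions above can be sketched in Python with NumPy. The function name `block_distortion` and the row/column block coordinates are choices made for this example, not part of any standard codec:

```python
import numpy as np

def block_distortion(cur, prev, m0, n0, i, j, N, criterion="mae"):
    """Distortion between the current N x N block at (m0, n0) and the
    block in the previous frame displaced by the candidate vector (i, j)."""
    f = cur[m0:m0 + N, n0:n0 + N].astype(np.float64)
    g = prev[m0 + i:m0 + i + N, n0 + j:n0 + j + N].astype(np.float64)
    if criterion == "mse":
        return float(np.mean((f - g) ** 2))   # mean-squared error
    return float(np.mean(np.abs(f - g)))      # mean absolute error
```

For identical frames and zero displacement both measures evaluate to zero, as expected from the definitions.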

To locate the best match by full search, (2w + 1)² evaluations of the matching criterion are required. To reduce the processing cost, MAE is preferred to MSE and hence is used in all the video codecs. However, for each block of N × N pixels we still need to carry out (2w + 1)² tests, each with almost 2N² additions and subtractions. This is still far from suitable for the implementation of BMA in software-based codecs. Measurements of video encoder complexity show that motion estimation comprises almost 50–70 per cent of the overall encoder complexity [7]. This of course depends on the motion activity in the scene and on whether a fast DCT is used in deriving the transform coefficients. For example, the percentage of the processing time required to calculate the motion vectors of the Mobile and Claire test image sequences in an MPEG-1 software-based encoder is given in Table 3.1. Note that although motion estimation might be expected to take more processing time in B-pictures than in P-pictures, the search ranges in P-pictures are much larger than in B-pictures, and so the overall processing time for motion estimation in P-pictures can be larger than that of the B-pictures, as shown in the table. The reasons for this will be dealt with in Chapter 7, when we discuss the different picture types in the MPEG-1 encoder. In any case, as motion estimation is a costly process, fast motion estimation techniques are highly desirable.
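A minimal full search over the (2w + 1)² candidate displacements, using the MAE criterion, might look as follows. This is a sketch rather than production encoder code; candidate blocks falling outside the previous frame are simply skipped:

```python
import numpy as np

def full_search(cur, prev, m0, n0, N, w):
    """Exhaustive block matching: return the motion vector (a, b) that
    minimises the MAE over all (2w + 1)^2 candidate displacements."""
    f = cur[m0:m0 + N, n0:n0 + N].astype(np.float64)
    best_mae, mv = np.inf, (0, 0)
    for i in range(-w, w + 1):
        for j in range(-w, w + 1):
            m, n = m0 + i, n0 + j
            if m < 0 or n < 0 or m + N > prev.shape[0] or n + N > prev.shape[1]:
                continue  # candidate block lies outside the previous frame
            g = prev[m:m + N, n:n + N].astype(np.float64)
            mae = float(np.mean(np.abs(f - g)))
            if mae < best_mae:
                best_mae, mv = mae, (i, j)
    return mv, best_mae
```

Applied to a frame that is a pure translation of the previous one, the search recovers the translation exactly and the residual MAE is zero.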

Table 3.1: Percentage of processing time required to carry out motion estimation in an MPEG-1 encoder

                   Fast DCT             Brute force DCT
Picture type       Mobile    Claire     Mobile    Claire
P-frame ME         66.1%     68.4%      53.3%     56.1%
B-frame ME         58.2%     60.9%      46.2%     48.7%

3.3.2 Fast motion estimation

In the past two decades a number of fast search methods for motion estimation have been introduced to reduce the computational complexity of BMA. The basic principle of these methods is that the number of search points can be reduced by selectively checking only a small number of specific points, on the assumption that the distortion measure decreases monotonically towards the best matched point. Jain and Jain [6] were the first to use a two-dimensional logarithmic (TDL) search method to track the direction of minimum mean-squared error distortion. In their method, the distortion is first computed for five initial positions: one at the centre of the search window and four at coordinates (±w/2, ±w/2). In the next step, three more positions with the same step size, in the direction of the previous minimum position, are searched. The step size is then halved and the above procedure is continued until the step size becomes unity. Finally, all nine surrounding positions are searched. With this method, for w = 5 pixels/frame, 21 positions are searched, as opposed to the 121 positions required by the full search method.

Koga et al. [8] use a three-step search (TSS) method to compute motion displacements of up to six pixels/frame. In their method, all eight positions surrounding the centre, with a step size of w/2, are searched first. At each minimum position the search step size is halved and eight new surrounding positions are searched. This method, for w = 6 pixels/frame, searches 25 positions to locate the best match. The technique is the recommended motion estimation method in the test model of software-based H.261 [9] for videophone applications.
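The three-step search can be sketched as follows, assuming for simplicity that w is a power of two so that the step sizes w/2, w/4, ..., 1 are whole pixels; `mae` here is the mean absolute error defined earlier:

```python
import numpy as np

def three_step_search(cur, prev, m0, n0, N, w):
    """Three-step search: test the centre and its eight neighbours at the
    current step size, recentre on the minimum, halve the step, repeat."""
    f = cur[m0:m0 + N, n0:n0 + N].astype(np.float64)

    def mae(i, j):
        m, n = m0 + i, n0 + j
        if m < 0 or n < 0 or m + N > prev.shape[0] or n + N > prev.shape[1]:
            return np.inf  # out-of-frame candidates are never selected
        g = prev[m:m + N, n:n + N].astype(np.float64)
        return float(np.mean(np.abs(f - g)))

    ci, cj, step = 0, 0, max(w // 2, 1)
    while True:
        candidates = [(ci + di * step, cj + dj * step)
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)]
        ci, cj = min(candidates, key=lambda p: mae(*p))
        if step == 1:
            return ci, cj
        step //= 2
```

On an image whose distortion surface decreases monotonically towards the best match (here, a translated quadratic intensity ramp), the search converges to the true displacement; on real images with large motion it may instead settle in a local minimum, as discussed in Section 3.3.3.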

In Kappagantula and Rao's [4] modified motion estimation algorithm (MMEA), prior to halving the step sizes, two more positions are also searched. With this method for w = 7 pixels/frame, only 19 MAE computations are required. In Srinivasan and Rao's [10] conjugate direction search (CDS) method, at every iteration of the direction search two conjugate directions with a step size of one pixel, centred at the minimum position, are searched. Thus, for w = 5 pixels/frame, there will be only 13 searches at most.

Another method of fast BMA is the cross search algorithm (CSA) [11]. In this method, the basic idea is still a logarithmic step search, which has also been exploited in [4,6,8], but with some differences, which lead to fewer computational search points. The main difference is that at each iteration there are four search locations which are the end points of a cross (×) rather than (+). Also, at the final stage, the search points can be either the end points of (×) or (+) crosses, as shown in Figure 3.8. For a maximum motion displacement of w pixels/frame, the total number of computations becomes 5 + 4 log2 w.

click to expand
Figure 3.8: An example of the CSA search for w = 8 pixels/frame

Puri et al. [12] have introduced the orthogonal search algorithm (OSA) in which, with a logarithmic step size, four new locations are searched at each iteration. This is the fastest of all the known fast BMAs. In this method, at every step, two positions are searched alternately in the vertical and horizontal directions. The total number of test points is 1 + 4 log2 w.

Table 3.2 shows the computational complexity of various fast search methods, for a range of motion speeds from 4 to 16 pixels/frame. The motion compensation efficiency of these algorithms for a motion speed of w = 8 pixels/frame for two test image sequences is tabulated in Table 3.3.

Table 3.2: Computational complexity

Algorithm    Maximum number        w = 4    w = 8    w = 16
             of search points
FSM          (2w + 1)²             81       289      1089
TDL          2 + 7 log2 w          16       23       30
TSS          1 + 8 log2 w          17       25       33
MMEA         1 + 6 log2 w          13       19       25
CDS          3 + 2w                11       19       35
OSA          1 + 4 log2 w          9        13       17
CSA          5 + 4 log2 w          13       17       21
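The "maximum number of search points" column can be reproduced directly from the formulas; a small helper function (the naming is ours, for illustration only) makes it easy to check the table entries for any motion speed:

```python
import math

def search_points(method, w):
    """Maximum number of matching-criterion evaluations (Table 3.2)."""
    log2w = int(math.log2(w))  # w is assumed to be a power of two here
    return {
        "FSM":  (2 * w + 1) ** 2,
        "TDL":  2 + 7 * log2w,
        "TSS":  1 + 8 * log2w,
        "MMEA": 1 + 6 * log2w,
        "CDS":  3 + 2 * w,
        "OSA":  1 + 4 * log2w,
        "CSA":  5 + 4 * log2w,
    }[method]
```

For instance, `search_points("TSS", 8)` gives 25 and `search_points("CSA", 16)` gives 21, matching the table.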

Table 3.3: Compensation efficiency

             Split screen                   Trevor white
Algorithm    entropy       standard        entropy       standard
             (bits/pel)    deviation       (bits/pel)    deviation
FSM          4.57          7.39            4.41          6.07
TDL          4.74          8.23            4.60          6.92
TSS          4.74          8.19            4.58          6.86
MMEA         4.81          8.56            4.69          7.46
CDS          4.84          8.86            4.74          7.54
OSA          4.85          8.81            4.72          7.51
CSA          4.82          8.65            4.68          7.42

It can be seen that, although fast search methods reduce the computational complexity of the full search method (FSM) significantly, their motion estimation accuracy (compensation efficiency) has not been degraded noticeably.

3.3.3 Hierarchical motion estimation

The assumption of monotonic variation of image intensity employed in the fast BMAs often causes false estimations, especially for larger picture displacements. These methods perform well for slow moving objects, such as those in video conferencing. However, for higher motion speeds, due to the intrinsic selective nature of these methods, they often converge to a local minimum of distortion.

One method of alleviating this problem is to subsample the image to smaller sizes, such that the motion speed is reduced by the sampling ratio. The process is carried out on a multilevel image pyramid, known as the hierarchical block matching algorithm (HBMA) [13]. In this technique, pyramids of the image frames are constructed by successive two-dimensional filtering and subsampling of the current and past image frames. Figure 3.9 shows a three-level pyramid, where for simplicity each pixel of an upper level is taken as the average of four adjacent pixels of the level below. Effectively this is a form of lowpass filtering.

Figure 3.9: A three-level image pyramid
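The averaging described above can be sketched as follows; each upper level is formed by averaging non-overlapping 2 × 2 blocks of the level below, a crude lowpass filter combined with 2:1 subsampling in both directions:

```python
import numpy as np

def build_pyramid(frame, levels):
    """Construct an image pyramid in which each upper level is the average
    of non-overlapping 2 x 2 blocks of the level below."""
    pyramid = [frame.astype(np.float64)]
    for _ in range(levels - 1):
        f = pyramid[-1]
        h, w = f.shape[0] // 2 * 2, f.shape[1] // 2 * 2
        f = f[:h, :w]  # trim odd rows/columns before 2:1 subsampling
        pyramid.append((f[0::2, 0::2] + f[0::2, 1::2] +
                        f[1::2, 0::2] + f[1::2, 1::2]) / 4.0)
    return pyramid  # pyramid[0]: full resolution, pyramid[-1]: coarsest
```

In practice, standard codec implementations may use better lowpass filters before subsampling; the 2 × 2 average is the simplification used in Figure 3.9.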

Conventional block matching, either full search or any fast method, is first applied to the highest level of the pyramid (level 2 in Figure 3.9). The resulting motion vector is then doubled in size, and a further refinement within a ±1 pixel search is carried out at the next level down. The process is repeated down to the lowest level. Therefore, with an n-level pyramid, a maximum motion speed of w is reduced to w/2^(n−1) at the highest level.

For example, a maximum motion speed of 32 pixels/frame with a three-level pyramid is reduced to eight pixels/frame, which is quite manageable by any fast search method. Note also that this method can be regarded as another type of fast search, with a performance very close to that of the full search irrespective of the motion speed, but with a computational complexity close to that of the fast logarithmic methods.

As an example, consider a maximum motion speed of 32 pixels/frame, which is very common in high definition video and most TV sports programmes (particularly in the P-pictures of the standard codecs, which can be several frames apart from each other). Although the nonhierarchical full search BMA requires (2 × 32 + 1)² = 4225 operations, a four-level hierarchy, where the motion speed at the top level is 32/2^(4−1) = 4 pixels/frame, requires only (2 × 4 + 1)² + 3 × 9 = 108 operations. Here, with the full search method, 81 operations are carried out at the top level, and at each of the three lower levels nine new positions are searched.
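The operation counts in this example can be checked with a few lines of arithmetic:

```python
# Hierarchical versus nonhierarchical operation counts for w = 32
w, levels = 32, 4
top_w = w // 2 ** (levels - 1)            # motion speed at the top level: 4
full_search_ops = (2 * w + 1) ** 2        # nonhierarchical full search: 4225
# full search at the top level, then 9 refinement positions per lower level
hier_ops = (2 * top_w + 1) ** 2 + (levels - 1) * 9
print(full_search_ops, hier_ops)          # 4225 108
```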



Standard Codecs: Image Compression to Advanced Video Coding (IET Telecommunications Series), M. Ghanbari, ISBN 0852967101, 2005.