9.10 Advanced video coding (H.26L)

The long-term objective of the ITU-T video coding experts group under the advanced video coding project is to provide a video coding recommendation that performs substantially better than the existing standards (e.g. H.263+) at very low bit rates. The group worked closely with the MPEG-4 experts group of ISO/IEC for more than six years (from 1997 to 2002). The joint work of the ITU-T and the ISO/IEC is currently called H.26L, with L standing for long-term objectives. The final recommendations of the codec will be made available in 2003, but here we report only on what is known so far. These details may change over time, but we concentrate on those parts that not only look fundamental to the new codec, but are also most certain to be adopted as recommendations. The codec is expected to be approved as H.264 by the ITU-T and as MPEG-4 part 10 (IS 14496-10) by the ISO/IEC [7].

Simulation results show that H.26L achieves substantially better video quality than the most highly optimised implementations of the existing H.263 and MPEG-4 codecs [23]. The most notable features of H.26L are:

  • Up to 50 per cent bit rate saving: compared with H.263+ (H.263V2) or the MPEG-4 simple profile (see Chapter 10), H.26L permits an average reduction in bit rate of up to 50 per cent for a similar degree of encoder optimisation at most bit rates. This means that H.26L offers consistently higher quality at all bit rates, including low bit rates.

  • Adaptation to delay constraints: H.26L can operate in a low delay mode to suit telecommunication applications (e.g. H.264 for videoconferencing), while allowing higher processing delay in applications with no delay constraints, such as video storage and server-based video streaming (MPEG-4V10).

  • Error resilience: H.26L provides the tools necessary to deal with packet loss in packet networks and bit errors in error-prone wireless networks.

  • Network friendliness: the codec has a feature that conceptually separates the video coding layer (VCL) from the network adaptation layer (NAL). The former provides the core high compression representation of the video picture content and the latter supports delivery over various types of network. This facilitates easier packetisation and better information priority control.

9.10.1 How does H.26L (H.264) differ from H.263?

Despite the above-mentioned features, the underlying approach of H.26L (H.264) is similar to that adopted by the H.263 standard [7]. That is, it follows the generic transform-based interframe encoder of Figure 3.18, which employs block matching motion compensation, transform coding of the residual errors, scalar quantisation with an adaptive quantiser step size, zigzag scanning and run length coding of the transform coefficients. However, specific changes have been made to the H.26L codec to make it not only more efficient in compression, but also more resilient against channel errors. Some of the most notable differences between the core H.26L and H.263 codecs are:

  • H.26L employs a 4 × 4 integer transform block size, as opposed to the 8 × 8 floating-point DCT used in H.263

  • H.26L employs a much larger number of different motion compensation block sizes per 16 × 16 pixel macroblock; H.263 only supports two such block sizes in its optional mode

  • higher precision of spatial accuracy for motion estimation, with quarter pixel accuracy as the default in the lower complexity mode and eighth pixel accuracy in the higher complexity mode; H.263 uses only half pixel accuracy and MPEG-4 uses quarter pixel accuracy

  • the core H.26L uses multiple previous reference pictures for prediction, whereas the H.263 standard uses this feature under the optional mode

  • in addition to I, P and B-pictures, H.26L uses a new type of interstream transitional picture, called an SP-picture

  • the deblocking filter in the motion compensation loop is a part of the core H.26L, whereas H.263 uses it as an option.

In the following sections some details of these changes, as known so far, are explained.

9.10.2 Integer transform

H.26L is unique among the standard coders in employing a purely integer transform, as opposed to the DCT with noninteger elements used in the other standard codecs. The core H.26L specifies a 4 × 4 integer transform which is an approximation to the 4 × 4 DCT (compare the matrices in eqn. 9.20); hence it has a similar coding gain to the DCT. However, since the integer transform, T_int, has an exact inverse, there is no mismatch between the encoder and the decoder. Note that in the DCT, due to the approximation of the cosine values, the forward and inverse transformation matrices cannot be exactly the inverse of each other, and hence encoder/decoder mismatch is a common problem in all standard DCT-based codecs that has to be rectified. Eqn. 9.20 compares the 4 × 4 integer transform T_int against the 4 × 4 DCT of the same dimensions:

(9.20)
$$
T_{\mathrm{int}} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\qquad
T_{\mathrm{DCT}} = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix}
$$

where $a = \tfrac{1}{2}$, $b = \sqrt{\tfrac{1}{2}}\cos\tfrac{\pi}{8} \approx 0.653$ and $c = \sqrt{\tfrac{1}{2}}\cos\tfrac{3\pi}{8} \approx 0.271$.

It may be beneficial to note that applying the transform and then the inverse transform does not return the original data directly. This is due to scaling factors built into the transform definition, which differ for the different frequency coefficients. This scaling effect is removed partly by using different quantiser step sizes for the different frequency coefficients, and partly by a bit shift applied after the inverse transform. Moreover, the use of the smaller 4 × 4 block size reduces blocking artefacts. More interestingly, the use of integer numbers makes the transformation fast, as multiplication by two is simply a one-bit left shift of the data. This can significantly reduce the processing power needed at the encoder, which is very useful in power-constrained applications, such as video over mobile networks, and it also allows increased parallelism.

Note that, as we saw in Chapter 3, for the two-dimensional transform the second stage of the transform is applied in the vertical direction of the first stage transform coefficients. This means transposing the transform matrix for the second stage. Hence, for the integer transform of a block of 4 × 4 pixels x_ij, the 16 transform coefficients y_ij are calculated as:

(9.21)
$$
Y = T_{\mathrm{int}}\, X\, T_{\mathrm{int}}^{T}, \qquad X = \{x_{ij}\},\; Y = \{y_{ij}\},\; i, j = 0, \ldots, 3
$$
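To make the exact-inverse property concrete, the following sketch (our own illustration; the names and the explicit rescaling are not part of the recommendation, which folds the scaling into the quantiser and a final bit shift) applies eqn. 9.21 to a block and recovers it exactly. The key observation is that the rows of T_int are mutually orthogonal with squared norms 4, 10, 4 and 10, so the inverse is just the transpose followed by a diagonal rescaling:

```python
import numpy as np

# The 4x4 integer transform of eqn. 9.20 (core H.26L/H.264)
T = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward(X):
    # Two-dimensional transform (eqn. 9.21): Y = T X T^T.
    # All arithmetic is integer, so encoder and decoder cannot mismatch.
    return T @ X @ T.T

def inverse(Y):
    # T @ T.T = diag(4, 10, 4, 10), hence the exact inverse is the
    # transpose combined with a per-row/per-column rescaling.
    D = np.diag([1/4, 1/10, 1/4, 1/10])
    return T.T @ D @ Y @ D @ T

X = np.arange(16).reshape(4, 4)        # any 4x4 block of pixel values
Y = forward(X)                          # integer coefficients
assert np.allclose(inverse(Y), X)       # exact reconstruction, no drift
```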

As an optional mode, H.26L also specifies an integer transform of length 8. In this mode, the integer transform, which again is an approximation to the DCT of length 8, is defined as:

(9.22)
$$
T_8 = \begin{bmatrix}
13 & 13 & 13 & 13 & 13 & 13 & 13 & 13 \\
19 & 15 & 9 & 3 & -3 & -9 & -15 & -19 \\
17 & 7 & -7 & -17 & -17 & -7 & 7 & 17 \\
15 & -3 & -19 & -9 & 9 & 19 & 3 & -15 \\
13 & -13 & -13 & 13 & 13 & -13 & -13 & 13 \\
9 & -19 & 3 & 15 & -15 & -3 & 19 & -9 \\
7 & -17 & 17 & -7 & -7 & 17 & -17 & 7 \\
3 & -9 & 15 & -19 & 19 & -15 & 9 & -3
\end{bmatrix}
$$

Note that the elements of this integer transform are not powers of two, and hence some multiplications are required. Also, in this optional mode, known as the adaptive block transform (ABT), the two-dimensional blocks can be 4 × 4, 4 × 8, 8 × 4 or 8 × 8, depending on the texture of the image. Hence, a macroblock may be coded by a combination of these blocks, as required.

9.10.3 Intra coding

Intra coded blocks generate a large amount of data, which could be undesirable for very low bit rate applications. Observations indicate that there are significant correlations among the pixels of adjacent blocks, particularly when the block size is as small as 4 × 4 pixels. In core H.26L, for an intra coded macroblock of 16 × 16 pixels, the difference between the 4 × 4 pixel blocks and their predictions are coded.

In order to perform prediction, the core H.26L offers eight directional prediction modes plus a DC prediction (mode 0) for coding of the luminance blocks, as shown in Figure 9.37 [7]. The arrows of Figure 9.37b indicate the directions of the predictions used for each pixel of the 4 × 4 block in Figure 9.37a.

Figure 9.37: A 4 × 4 luminance pixel block and its eight directional prediction modes

For example, when DC prediction (mode 0) is chosen, then the difference between every pixel of the block and a DC predictor defined as:

(9.23)
$$
\mathrm{prd} = (A + B + C + D + I + J + K + L)//8
$$

is coded, where A, B, C and D are the pixels immediately above the block, I, J, K and L are the pixels immediately to its left, and // indicates division with rounding to the nearest integer. As another example, in mode 1, which is the vertical prediction mode shown in Figure 9.37b, every column of pixels uses the pixel at the top border as its prediction. In this mode, the prediction for all four pixels a, e, i and m would be A, and the prediction for, say, pixels d, h, l and p would be pixel D. Some modes have a more complex form of prediction. For example, mode 6, which is called the vertical right prediction mode, indicates that the prediction for, say, pixel c is:

prd = (C + D)//2

and the prediction for pixel e in this mode would be

(9.24) 

For the complete set of prediction modes, the H.26L Recommendation should be consulted [7].
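As a minimal sketch of the two simplest modes (our own illustration, with the rounding offset of the // division made explicit), the DC and vertical predictors of a 4 × 4 block can be written as:

```python
import numpy as np

def intra_4x4_predict(above, left, mode):
    """Illustrative DC (mode 0) and vertical (mode 1) predictors.

    above: the four reconstructed pixels A, B, C, D immediately above
    left:  the four reconstructed pixels immediately to the left
    """
    above = np.asarray(above, dtype=int)
    left = np.asarray(left, dtype=int)
    if mode == 0:
        # eqn. 9.23: rounded mean of the eight neighbouring pixels
        dc = (above.sum() + left.sum() + 4) // 8
        return np.full((4, 4), dc)
    if mode == 1:
        # vertical: each column copies the border pixel above it
        return np.tile(above, (4, 1))
    raise NotImplementedError("remaining directional modes omitted")

# The encoder codes the difference between the block and this prediction.
pred = intra_4x4_predict([100, 102, 104, 106], [98, 99, 101, 103], mode=0)
```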

In the plain areas of the picture there is no need for nine different prediction modes, as they would unnecessarily increase the overhead. Instead, H.26L recommends only four prediction modes: DC, horizontal, vertical and plane [7]. Moreover, in the intra macroblocks of plain areas the AC coefficients of each 4 × 4 block are normally small and may not need to be transmitted. The DC coefficients, however, can be large, but they are highly correlated. For this reason, the H.26L standard suggests that the 16 DC coefficients in a macroblock should be decorrelated by the Hadamard transform. Hence, if the DC coefficient of the 4 × 4 integer transform of block (i, j) is called x_ij, then the two-dimensional Hadamard transform of the 16 DC coefficients gives 16 new coefficients y_ij as:

(9.25)
$$
Y = H_4\, X\, H_4^{T}, \qquad H_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}
$$

where $X = \{x_{ij}\}$ and $Y = \{y_{ij}\}$.

where y_00 is the DC value of all the DC coefficients of the 4 × 4 integer transform blocks. The overall DC coefficient y_00 is normally large, but the remaining y_ij coefficients are small. This mode of intra coding is called the intra 16 × 16 mode [7].

For prediction of the chrominance blocks (both intra and inter), since there are only four chrominance blocks of each type in a macroblock, the recommendation first suggests using four prediction modes for a chrominance macroblock. Second, the four DC coefficients of the chrominance 4 × 4 blocks are Hadamard transformed with the transformation matrix

$$
H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
$$

Note that all these transformation (as well as inverse transformation) matrices should be properly weighted such that the reconstructed pixel values are positive with eight-bit resolution (see problem 5 in section 9.11).
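A short sketch of the two DC decorrelating transforms (our own illustration; the scaling and weighting just mentioned are omitted here, as the recommendation folds them into the quantiser):

```python
import numpy as np

H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]])

H2 = np.array([[1,  1],
               [1, -1]])

def luma_dc_transform(dc16):    # dc16: 4x4 array of luminance DC coefficients
    return H4 @ dc16 @ H4.T     # eqn. 9.25; element [0, 0] is the overall DC

def chroma_dc_transform(dc4):   # dc4: 2x2 array of chrominance DC coefficients
    return H2 @ dc4 @ H2.T
```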

9.10.4 Inter coding

Interframe predictive coding is where H.26L makes most of its gain in coding efficiency. Motion compensation on each 16 × 16 macroblock can be performed with different block sizes and shapes, as shown in Figure 9.38. In the 16 × 16 mode, one motion vector per macroblock is used; in the 16 × 8 and 8 × 16 modes there are two motion vectors. In addition, there is a mode known as 8 × 8 split, in which each of the 8 × 8 blocks can be further subdivided independently into 8 × 8, 8 × 4, 4 × 8 or 4 × 4 blocks, or coded as an 8 × 8 intra block.

Figure 9.38: Various motion compensation modes

Experimental results indicate that using all the block sizes and shapes can lead to bit rate savings of more than 15 per cent compared with the use of only one motion vector per 16 × 16 macroblock. Another 20 per cent saving in bit rate is achieved by representing the motion vectors with quarter pixel spatial accuracy, as compared with integer pixel accuracy. Eighth pixel spatial accuracy is expected to increase the saving even further.
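The gain from flexible partitioning is easy to reproduce: over the same search range, four independently motion-compensated 8 × 8 blocks can never match worse than a single 16 × 16 block, and the encoder trades this residual saving against the cost of three extra motion vectors. A minimal full-search sketch (our own; the recommendation's rate-constrained mode decision is far more elaborate):

```python
import numpy as np

def best_sad(cur, ref, x, y, bw, bh, search=8):
    """Full-search block matching: minimum SAD for the bw x bh block at (x, y)."""
    best = None
    H, W = ref.shape
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy and yy + bh <= H and 0 <= xx and xx + bw <= W:
                sad = np.abs(cur[y:y+bh, x:x+bw].astype(int)
                             - ref[yy:yy+bh, xx:xx+bw].astype(int)).sum()
                if best is None or sad < best:
                    best = sad
    return best

rng = np.random.default_rng(0)
cur = rng.integers(0, 256, (64, 64))
ref = rng.integers(0, 256, (64, 64))

sad_16x16 = best_sad(cur, ref, 16, 16, 16, 16)
sad_8x8 = sum(best_sad(cur, ref, 16 + dx, 16 + dy, 8, 8)
              for dx in (0, 8) for dy in (0, 8))
# Four 8x8 vectors never match worse than one shared 16x16 vector; the
# encoder weighs this against the cost of sending extra motion vectors.
assert sad_8x8 <= sad_16x16
```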

9.10.5 Multiple reference prediction

The H.26L standard offers the option of using many previous pictures for prediction. This increases the coding efficiency as well as producing a better subjective image quality. Experimental results indicate that using five reference frames for prediction results in a bit rate saving of about 5–10 per cent. Moreover, using multiple reference frames improves the resilience of H.26L to errors, as we saw in Figure 9.22.

In addition to allowing immediately previous pictures to be used for prediction, H.26L also allows pictures to be stored for as long as desired and used at any later time for prediction. This is beneficial in a range of applications, such as surveillance, when switching between a number of cameras in fixed positions can be encoded much more efficiently if one picture from each camera position is stored at the encoder and the decoder.
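The mechanism can be sketched as a small buffer of decoded pictures over which the motion search runs, with the chosen reference index signalled to the decoder; long-term "pinned" pictures give the surveillance-style behaviour described above (the names and structure here are our own illustration):

```python
import numpy as np
from collections import deque

class ReferenceBuffer:
    """Keeps the last n decoded pictures, plus any 'pinned' long-term ones."""
    def __init__(self, n=5):
        self.short_term = deque(maxlen=n)   # e.g. the five previous pictures
        self.long_term = {}                 # e.g. one picture per camera view

    def add(self, picture):                 # called after decoding each picture
        self.short_term.append(picture)

    def pin(self, key, picture):            # stored for as long as desired
        self.long_term[key] = picture

    def candidates(self):
        return list(self.short_term) + list(self.long_term.values())

def best_reference(block, refs, y, x):
    # Pick the reference whose co-located 16x16 block matches best (SAD);
    # a real encoder also searches spatially within each reference.
    sads = [np.abs(block.astype(int) - r[y:y+16, x:x+16].astype(int)).sum()
            for r in refs]
    return int(np.argmin(sads))

buf = ReferenceBuffer(n=5)
for _ in range(7):                          # only the last five are kept
    buf.add(np.random.randint(0, 256, (64, 64)))
block = np.random.randint(0, 256, (16, 16))
ref_index = best_reference(block, buf.candidates(), y=0, x=0)
```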

9.10.6 Deblocking filter

The H.26L standard has adopted some of the optional features of H.263 that have been proven to improve the coding performance significantly. For example, while data partitioning was an option in H.263, it is now in the core H.26L. Another important element of the core H.26L is an adaptive deblocking filter that operates on the horizontal and vertical block edges within the prediction loop to remove blocking artefacts. The filtering is based on 4 × 4 block boundaries, where pixels on either side of the boundary may be filtered depending on the pixel differences across the boundary, the relative motion vectors of the two blocks, whether the blocks have coded coefficients and whether one or the other block is intra coded [7].
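The normative filter is considerably more elaborate, but a toy one-dimensional version conveys the adaptive principle: a discontinuity across a block edge is smoothed only when it is small enough to be a quantisation artefact rather than a genuine image edge (the threshold names and values below are our own):

```python
def deblock_edge_1d(p1, p0, q0, q1, alpha=20, beta=6):
    """Toy edge filter: pixels p1 p0 | q0 q1 straddle a 4x4 block boundary.

    Filter only if the step across the boundary is small (likely an
    artefact); leave large steps (likely true image edges) untouched.
    """
    if abs(p0 - q0) < alpha and abs(p1 - p0) < beta and abs(q1 - q0) < beta:
        d = (q0 - p0) // 2
        p0, q0 = p0 + d // 2, q0 - d // 2   # pull both sides together
    return p1, p0, q0, q1

# A step of 4 across the boundary is smoothed; a step of 60 is preserved.
print(deblock_edge_1d(100, 102, 106, 107))   # -> (100, 103, 105, 107)
print(deblock_edge_1d(100, 102, 162, 163))   # -> unchanged
```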

9.10.7 Quantisation and scanning

Similar to the H.263 standard, H.26L uses a dead band zone quantiser and zigzag scanning of the quantised coefficients. However, the number of quantiser step sizes and the adaptation of the step size are different. While H.263 has thirty-two different quantisation levels, H.26L recommends fifty-two. Hence, H.26L coefficients can be coded much more coarsely than those of H.263 for higher compression, or more finely to produce better quality if the bit rate budget permits (see problem 8, section 9.11). In H.26L the quantiser step size is changed at a compound rate of approximately 12.5 per cent. The fidelity of the chrominance components is improved by using finer quantisation step sizes than those used for the luminance component, particularly when the luminance quantiser step size is large.
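A compound rate of roughly 12.5 per cent means the step size doubles every six levels, which is easy to check (the base step value below is illustrative, not a normative table entry):

```python
# 52 quantiser levels; each step is ~12.25% larger than the previous one,
# so the step size doubles every six levels: step(QP) = base * 2**(QP / 6).
base = 0.625                                     # illustrative smallest step
steps = [base * 2 ** (qp / 6) for qp in range(52)]

assert abs(steps[6] / steps[0] - 2.0) < 1e-9     # doubles every 6 levels
assert abs(steps[1] / steps[0] - 1.1225) < 1e-3  # ~12.5 per cent compound rate
```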

9.10.8 Entropy coding

Before transmission, the generated data of all types is variable length (entropy) coded. As we discussed in Chapter 3, entropy coding is the third element of the redundancy reduction techniques that all the standard video encoders try to exploit as best they can. However, the compression gain from entropy coding is not of the same order as that of the spatial and temporal redundancy reduction techniques. Despite this, H.26L puts a lot of emphasis on entropy coding. The reason is that, although sophisticated entropy coding may reduce the entropy of the symbols by only a bit or a fraction of a bit, for very low bit rate applications (e.g. 20 kbit/s) the reduction can be quite significant when the symbols are aggregated over a frame or a video sequence.

The H.26L standard, like its predecessor, specifies two types of entropy coding: Huffman and arithmetic coding. To make Huffman encoding more efficient, it adaptively employs a set of VLC tables based on the context of the symbols. For example, Huffman coding of the zigzag scanned transform coefficients is performed as follows. Firstly, a symbol is encoded that indicates the number of nonzero coefficients, together with the number of those at the end of the scan that have a magnitude of one, up to a maximum of three. This is followed by one bit for each of the indicated trailing ones, to give the sign of the coefficient. Then any remaining nonzero coefficient levels are encoded, in reverse zigzag scan order, finishing with the DC coefficient, if nonzero. Finally, the total number of zero coefficients before the last nonzero coefficient is encoded, followed by the number of zero coefficients before each nonzero coefficient, again in reverse zigzag scan order, until all the zero coefficients have been accounted for. Multiple code tables are defined for each of these symbol types, and the particular table in use adapts as information is encoded. For example, the code table used to encode the total number of zero coefficients before the last nonzero coefficient depends on the actual number of nonzero coefficients, as this limits the maximum number of zero coefficients to be encoded. The processes are complex, and for more information the H.26L recommendation [7] should be consulted. However, the complexity is justified by the additional compression advantage that it offers.
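To make the encoding order concrete, the sketch below (our own simplification: it extracts the symbols only, omitting the adaptive selection of code tables and the actual codewords) processes one zigzag-scanned 4 × 4 block:

```python
def coeff_symbols(scan):
    """Extract the symbols of the coefficient coding order described above.

    scan: 16 quantised coefficients in zigzag order.
    """
    nonzero = [c for c in scan if c != 0]
    if not nonzero:
        return 0, [], [], 0, []
    total_coeffs = len(nonzero)

    # up to three trailing magnitude-1 coefficients, one sign bit each
    trailing_signs = []
    for c in reversed(nonzero):
        if abs(c) == 1 and len(trailing_signs) < 3:
            trailing_signs.append(0 if c > 0 else 1)
        else:
            break

    # remaining levels in reverse scan order, ending with the DC term
    levels = list(reversed(nonzero))[len(trailing_signs):]

    # zeros before the last nonzero coefficient, then the zero run
    # preceding each nonzero coefficient, again in reverse scan order
    last = max(i for i, c in enumerate(scan) if c != 0)
    total_zeros = sum(1 for c in scan[:last] if c == 0)
    runs, run = [], 0
    for c in reversed(scan[:last + 1]):
        if c == 0:
            run += 1
        else:
            runs.append(run)
            run = 0
    return total_coeffs, trailing_signs, levels, total_zeros, runs

print(coeff_symbols([7, 6, -2, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))
# -> (5, [0, 1], [-2, 6, 7], 3, [0, 2, 1, 0, 0])
```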

For arithmetic coding, H.26L uses context-based adaptive binary arithmetic coding (CABAC). In this mode, the intersymbol redundancy of already coded symbols in the neighbourhood of the current symbol is used to derive a context model. Different models are used for each syntax element (e.g. motion vectors and transform coefficients use different models) and, as with the Huffman tables, the models adapt based on the context of the neighbouring blocks. When a symbol is not binary, it is first mapped onto a sequence of binary decisions, called bins, according to a given binary tree. Each binary decision is then encoded with the arithmetic encoder using probability estimates that were updated during the previous context modelling stage: after each bin is encoded, the probability estimate for the binary symbol just encoded is incremented, so the model keeps track of the actual statistics. Experiments with test images indicate that context-based adaptive binary arithmetic coding in H.26L improves the bit saving by up to 10 per cent over the Huffman code, while the Huffman coding used in H.26L is itself more efficient than the conventional nonadaptive methods used in the other codecs.
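The adaptation step can be illustrated with a count-based context model (a sketch of the principle only; the context names here are invented, and CABAC's actual probability state machine and arithmetic coding engine are far more refined):

```python
class BinContext:
    """Adaptive probability model for one binary decision (a 'bin')."""
    def __init__(self):
        self.counts = [1, 1]          # smoothed counts of 0s and 1s seen

    def p1(self):                     # current estimate of P(bin = 1)
        return self.counts[1] / sum(self.counts)

    def update(self, bit):            # called after each encoded bin
        self.counts[bit] += 1

# One context per syntax element and neighbourhood state, for example:
contexts = {("mvd", "large_neighbour_mvd"): BinContext(),
            ("mvd", "small_neighbour_mvd"): BinContext()}

ctx = contexts[("mvd", "small_neighbour_mvd")]
for bit in [0, 0, 1, 0]:              # bins produced by binarisation
    p = ctx.p1()                      # probability fed to the arithmetic coder
    ctx.update(bit)                   # the model tracks the actual statistics
```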

9.10.9 Switching pictures

One of the applications for the H.26L standard is video streaming. In this operation the decoder is expected to decode a compressed video bit stream generated by a standard encoder.

One of the key requirements in video streaming is to adapt the transmission bit rate of the compressed video according to the network congestion. If the video stream is generated by an online (real-time) encoder, then rate adaptation can be achieved on the fly, according to the network feedback, by adjusting the encoder parameters such as the quantiser step size or, in the extreme case, by dropping frames. In typical streaming, where preencoded video sequences are streamed from a video server to a client, this solution cannot be used, because any change in the bit stream makes the decoded picture different from the locally decoded picture at the encoder. The quality of the decoded picture then gradually drifts from that of the locally decoded picture, the visual artefacts propagate in time, and the quality eventually becomes very poor.

The simplest way of achieving scalability of preencoded bit streams is to produce multiple copies of the same video sequence at several bit rates, and hence qualities. The server then dynamically switches between the bit streams, according to the network congestion or the bandwidth available to the client. However, when switching on P-pictures, since the prediction frame in one bit stream differs from that in the other, the problem of picture drift remains unsolved.

To rectify picture drift in bit stream switching, the H.26L recommendation introduces a new type of picture, called the switching picture or secondary picture, SP-picture for short. SP-pictures are generated by the server at the time of switching from one bit stream to another and are transmitted as the first picture after the switching. To see how SP-pictures are generated and how they prevent picture drift, consider switching between two bit streams, as shown in Figure 9.39.

Figure 9.39: Use of S frame in bit stream switching

Consider a video sequence encoded at two different bit rates, generating bit streams A and B. Also, consider a client who up to time N has been receiving bit stream A and at time N wishes to switch to bit stream B. The bits generated for the frame at this time are identified as S_A and S_B, respectively.

To analyse the picture drift, let us assume in both cases that the pictures are entirely predictively coded (P-pictures). Thus at time N, in bit stream A the difference between frames N and N − 1, S_A = A_N − A_{N−1}, is transmitted, and for bit stream B, S_B = B_N − B_{N−1}. The decoded picture in the receiver's prediction loop before switching would be frame A_{N−1}. Hence, for drift-free pictures, at the switching time the decoder expects to receive S_AB = B_N − A_{N−1} from bit stream B, which is different from S_B = B_N − B_{N−1}. Therefore, at the switching time, a new bit stream for the duration of one frame, SP, should be generated, as shown in Figure 9.39. SP now replaces S_B at the switching time; if switching were done from bit stream B to A, we would need SP = S_BA = A_N − B_{N−1} to replace S_A.
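A toy pixel-domain simulation verifies this arithmetic (our own illustration; actual SP-pictures operate on quantised transform coefficients so that the match remains exact after quantisation):

```python
import numpy as np

rng = np.random.default_rng(1)
frames_A = [rng.integers(0, 256, (4, 4)) for _ in range(4)]   # stream A
frames_B = [rng.integers(0, 256, (4, 4)) for _ in range(4)]   # stream B
N = 2                                     # the switching time

# The decoder has followed stream A, so its prediction memory holds A[N-1].
decoded = frames_A[N - 1]

# Sending the ordinary residual S_B = B_N - B_{N-1} causes drift:
drifted = decoded + (frames_B[N] - frames_B[N - 1])
assert not np.array_equal(drifted, frames_B[N])

# Sending the switching residual S_AB = B_N - A_{N-1} lands exactly on B_N:
switched = decoded + (frames_B[N] - frames_A[N - 1])
assert np.array_equal(switched, frames_B[N])
```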

The switching pictures S_AB and S_BA can be generated in two ways, each suited to a specific application. If it is desired to switch between two bit streams at any time, then at the switching point we need a video transcoder to generate the switching pictures where they are wanted. Figure 9.40 shows such a transcoder, where each bit stream is decoded and then reencoded with the required prediction picture. If a macroblock is intra coded, the intra coded parts of the S frame of the switched bit stream are copied directly into the SP-picture.

Figure 9.40: Creation of the switching pictures from the bit streams

Although the above method makes it possible to switch between two bit streams at any time, it is expensive to run alongside the bit streams. A cheaper solution is to generate the switching pictures (S_AB and S_BA) offline and store them along with bit streams A and B. Normally these pictures are created periodically, at reasonable intervals; there is a trade-off between the frequency of switching between the bit streams and the storage required to hold these pictures. It is in this application that the switching pictures are also called secondary pictures.

9.10.10 Random access

Another vital requirement in video streaming is the ability to access the video stream and decode the pictures at almost any time. In accessing preencoded video, we have seen that the approach taken by the MPEG codecs is to insert I-pictures at regular intervals (e.g. every 12 frames). Due to the high bit rates of these pictures, their use in very low bit rate codecs, such as H.26L, would be very expensive. The approach taken by the H.26L standard is to create secondary pictures at the points where access may be required. These are intraframe coded SI-pictures, encoded in such a way as to generate a decoded picture identical to the frame to be accessed. Figure 9.41 shows the position of the secondary picture for accessing the bit stream at the switching frame, S.

Figure 9.41: The position of the secondary picture in random accessing of the bit stream

In the figure the bit stream is made of P-pictures for high compression efficiency. One of these pictures, identified as picture S, is the picture at which the client can access the bit stream if desired. From the original picture, an SI-picture is encoded to provide an exact match for picture S. Note that the first part of the accessed bit stream is now the data generated by the SI-picture, so the number of bits at the start is high. However, since in video streaming accessing occurs only once per session, these extra bits are negligible over the duration of the session. The average bit rate hardly changes, and is much less than when regular I-pictures are present.

In the above method, similar to the offline creation of secondary pictures for bit stream switching, the frequency of SI pictures is traded against the storage capacity required to store them along with the bit stream.

9.10.11 Network adaptation layer

An important feature of H.26L is the separation of the video coding layer (VCL) from the network adaptation layer (NAL). The VCL is responsible for the high compression representation of the video picture content, and some of its important methods were presented in the previous sections. In this section we describe the NAL, which is responsible for efficient delivery of compressed video over the underlying transport network. The specification of the NAL depends on the nature of the transport network. For example, for fixed bit rate channels the MPEG-2 transport layer can be used; in this mode, 184 bytes of compressed video along with a four-byte header are packed into a packet. Even if the channel rate is variable, provided some form of quality of service guarantee is available, as in ATM networks, the MPEG-2 transport protocol can still be used. In this case each MPEG-2 packet is segmented into four ATM cells, each cell carrying 47 bytes of the packet in its payload.

Perhaps the most interesting type of NAL is the one for the transport of a compressed video stream over a nonreliable network such as the best effort Internet protocol (IP) network. In this application NAL takes the compressed bit stream and converts it into packets that can be conveyed directly over the real time transport protocol (RTP) [24].

RTP has emerged as the de facto specification for the transport of time-sensitive streamed media over IP networks. RTP provides a wrapper for media data that identifies the media type and provides synchronisation information in the form of a time stamp. These features enable individual packets to be reconstructed into a media stream by a recipient. To enable flow control and management of the media stream, additional information is carried between the sender and the receiver using the associated RTP control protocol (RTCP). Whenever an RTP stream is required, associated out-of-band RTCP flows between the sender and receiver are established. This enables the sender and receiver to exchange information concerning the performance of the media transmission and may be of use to higher level application control functions. Typically, RTP is mapped onto and carried in user datagram protocol (UDP) packets. In such cases, an RTP session is usually associated with an even numbered port and its associated RTCP flows with the next higher odd numbered port.

Encapsulation of video bit stream into RTP packets should be done with the following procedures:

  • arrange partitions in an intelligent way into packets

  • split or merge partitions, where necessary, to match the packet size constraints

  • avoid the redundancy of (mandatory) RTP header information and information in the video stream

  • define receiver/decoder reactions to packet losses.

In packetised transmission, the packetisation scheme has to make some assumptions on typical network conditions and constraints. The following set of assumptions has been made on packetised transmission of H.26L video streams:

  • maximum packet size: around 1500 bytes for any transmission medium except dial-up links, for which 500-byte packets are to be used

  • packet loss characteristics: nonbursty, due to drop-tail router implementations and assuming a reasonable pacing algorithm (i.e. no bursting occurs at the sender)

  • packet loss rate: up to 20 per cent.

In packetised transmission, the quality of the video can be improved by stronger protection of the important packets against errors or losses. This is very much in line with the data partitioning feature of the H.26L coded bit stream. In the NAL it is expected that the bit stream generated per slice will be data partitioned into three parts, each carried in a different packet. The first packet contains the most important parts of the compressed data, such as the header type, macroblock types and addresses, motion vectors etc. The second packet contains intra information, and the third packet carries the rest of the video bit stream, such as the transform coefficients.

This configuration allows the first packet to be decoded independently of the second and third packets, but not vice versa. Since the first packet is more important than the second, and the second more important than the third (the first carries the most vital video information and is also necessary for decoding the second and third packets), it is more heavily protected with forward error correction than the other two.
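A sketch of the three-way split (the element names and classification below are our own illustration of the principle, not the normative syntax):

```python
def partition_slice(slice_syntax):
    """Split one coded slice into three packets by importance.

    slice_syntax: list of (element_name, bits) pairs in coded order.
    """
    part_a, part_b, part_c = [], [], []   # headers/MVs, intra data, residue
    for name, bits in slice_syntax:
        if name in ("slice_header", "mb_type", "mb_address", "motion_vector"):
            part_a.append(bits)           # most important: protect heavily
        elif name.startswith("intra"):
            part_b.append(bits)
        else:
            part_c.append(bits)           # e.g. transform coefficients
    return part_a, part_b, part_c

packets = partition_slice([("slice_header", "0101"),
                           ("mb_type", "1"),
                           ("motion_vector", "0011"),
                           ("intra_pred_mode", "10"),
                           ("coefficients", "110010")])
```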

It should be noted that slices should be structured to ensure that all packets meet the maximum packet size, so as to avoid a network splitting/recombination process. This means inserting slice headers at appropriate points.


