2. Standards and State of the Art

2.1 Still Image Coding

For many years the Discrete Cosine Transform (DCT) represented the state of the art in still image coding; JPEG is the standard that incorporated this technology [1,2]. JPEG has been a success and has been deployed in many applications, reaching worldwide use. However, it became clear that a new still image coding standard was needed to serve the range of applications that has emerged in recent years. The result is JPEG2000, which has already been standardized [3,4]. The JPEG2000 standard uses the Discrete Wavelet Transform (DWT). Tests indicate that at low data rates JPEG2000 provides about 20% better compression efficiency than JPEG for the same image quality. JPEG2000 also offers a new set of functionalities, including error resilience, arbitrarily shaped regions of interest, random access, lossless and lossy coding, and a fully scalable bit stream. These functionalities come at the cost of higher encoder complexity. MPEG-4 has a "still image" mode known as Visual Texture Coding (VTC), which also uses wavelets but supports fewer functionalities than JPEG2000 [5,6]. For a comparison between the JPEG2000 standard, JPEG, MPEG-4 VTC, and lossless JPEG schemes see [7]. For further discussion on the role of image and video standards see [8].
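To make the transform difference concrete, the following sketch (assuming NumPy, SciPy, and PyWavelets are available; the image is a smooth synthetic placeholder and 'bior4.4' is used only as a stand-in for JPEG2000's 9/7 biorthogonal filter) applies an 8x8 block DCT, as in JPEG, and a multi-level 2-D DWT, as in JPEG2000, to the same picture and compares how concentrated the resulting coefficients are:

```python
# Sketch: 8x8 block DCT (JPEG-style) versus whole-image 2-D DWT (JPEG2000-style).
# Requires NumPy, SciPy, and PyWavelets; the image below is a synthetic stand-in.
import numpy as np
from scipy.fft import dctn
import pywt

x = np.linspace(0.0, 1.0, 256)
image = np.outer(np.sin(4 * np.pi * x), np.cos(6 * np.pi * x))  # smooth test picture

def block_dct(img, block=8):
    """Apply an independent orthonormal 2-D DCT to each 8x8 block."""
    out = np.empty_like(img)
    for r in range(0, img.shape[0], block):
        for c in range(0, img.shape[1], block):
            out[r:r+block, c:c+block] = dctn(img[r:r+block, c:c+block], norm='ortho')
    return out

dct_coeffs = block_dct(image)

# JPEG2000 works on a multi-level wavelet decomposition of the whole picture;
# 'bior4.4' is PyWavelets' closest relative of the 9/7 filter used by JPEG2000.
dwt_coeffs, _ = pywt.coeffs_to_array(pywt.wavedec2(image, 'bior4.4', level=3))

for name, coeffs in (("8x8 block DCT", dct_coeffs), ("3-level 2-D DWT", dwt_coeffs)):
    significant = np.mean(np.abs(coeffs) > 0.01 * np.abs(coeffs).max())
    print(f"{name}: {significant:.1%} of coefficients above 1% of the peak magnitude")
```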

2.2 Video Coding

During the last ten years, the hybrid scheme combining motion-compensated prediction and the DCT has represented the state of the art in video coding. This approach is used by the ITU-T H.261 [9] and H.263 [10], [11] standards as well as by the MPEG-1 [12] and MPEG-2 [13] standards. However, in 1993 the need to add new content-based functionalities and to give the user the possibility to manipulate the audio-visual content was recognized, and a new standardization effort known as MPEG-4 was launched. In addition to these functionalities, MPEG-4 also provides the possibility of combining natural and synthetic content. MPEG-4 phase 1 became an international standard in 1999 [5,6]. MPEG-4 is having difficulties finding widespread use, mainly due to intellectual property protection issues and to the need for automatic and efficient segmentation schemes.

The frame-based part of MPEG-4, which incorporates error resilience tools, is finding its way into mobile communications and Internet streaming. H.263 and several of its variants [11] are also widely used in mobile communication and streaming, and it will be interesting to see how these two standards compete in these applications.

The natural video part of MPEG-4 is also based on motion-compensated prediction followed by the DCT; the fundamental difference is the addition of coding of the object shape. Due to its powerful object-based approach, its use of the most efficient coding techniques, and the large variety of data types it incorporates, MPEG-4 represents today the state of the art in visual data coding technology [6]. How MPEG-4 will be deployed and which applications will make use of its many functionalities is still an open question.

2.3 The New Standard JVT/H.26L

The Joint Video Team (JVT) standard development project is a joint project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) for the development of a new video coding standard [14], [15]. The JVT project was created in December of 2001 to take over the work previously under way in the ITU-T H.26L project of VCEG and create a final design for standardization in both the ITU-T and MPEG [16].

The main goals of the JVT/H.26L standardization effort are the definition of a simple and straightforward video coding design to achieve enhanced compression performance and provision of a "network-friendly" packet-based video representation addressing "conversational" (i.e., video telephony) and "non-conversational" (i.e., storage, broadcast, or streaming) applications. Hence, the JVT/H.26L design covers a Video Coding Layer (VCL), which provides the core high compression representation of the video picture content, and a Network Adaptation Layer (NAL), which packages that representation for delivery over each distinct class of networks. The VCL design has achieved a significant improvement in rate-distortion efficiency - providing nearly a factor of two in bitrate savings against existing standards. The NAL designs are being developed to transport the coded video data over existing and future networks such as circuit-switched wired networks, MPEG-2/H.222.0 transport streams, IP networks with RTP packetization, and 3G wireless systems.
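As an illustration of the NAL concept, the following sketch (based on the byte-stream format and one-byte NAL unit header that were eventually standardized for H.264/AVC; the sample bytes are invented for illustration) splits a stream on start codes and decodes the header fields of each unit:

```python
# Sketch: splitting an Annex B style byte stream into NAL units and reading the
# one-byte NAL unit header (forbidden bit, nal_ref_idc, nal_unit_type).
# The byte values below are illustrative, not taken from a real bitstream.

def split_nal_units(stream: bytes):
    """Yield NAL unit payloads delimited by 0x000001 / 0x00000001 start codes."""
    i, starts = 0, []
    while True:
        j = stream.find(b"\x00\x00\x01", i)
        if j < 0:
            break
        starts.append(j + 3)
        i = j + 3
    for k, s in enumerate(starts):
        end = (starts[k + 1] - 3) if k + 1 < len(starts) else len(stream)
        # Strip trailing zero bytes (padding or the extra zero of a 4-byte start code).
        yield stream[s:end].rstrip(b"\x00")

def parse_nal_header(nal: bytes):
    header = nal[0]
    return {
        "forbidden_zero_bit": header >> 7,
        "nal_ref_idc": (header >> 5) & 0x03,  # 0 means the unit is not used for reference
        "nal_unit_type": header & 0x1F,       # e.g. 5 = IDR slice, 7 = sequence parameter set
    }

example = b"\x00\x00\x00\x01\x67\x42\x00\x1e" + b"\x00\x00\x01\x65\x88\x80"
for nal in split_nal_units(example):
    print(parse_nal_header(nal))
```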

A key achievement expected from the JVT project is a substantial improvement in video coding efficiency for a broad range of application areas. The JVT goal is a capability of 50% or greater bit rate savings relative to H.263v2 or MPEG-4 Advanced Simple Profile at all bit rates. The new standard will be known as MPEG-4 Part 10 (Advanced Video Coding, AVC) and as ITU-T Recommendation H.264.

2.3.1 Technical Overview

The following description of the new JVT/H.26L standard has been extracted from the excellent review presented in [17].

The JVT/H.26L design supports the coding of video (in 4:2:0 chrominance format) that contains either progressive or interlaced frames, which may be mixed together in the same sequence. Each frame or field picture of a video sequence is partitioned into fixed size macroblocks that cover a rectangular picture area of 16x16 luminance and 8x8 chrominance samples. The luminance and chrominance samples of a macroblock are generally spatially or temporally predicted, and the resulting prediction error signal is transmitted using transform coding. The macroblocks are organized in slices, which generally represent subsets of a given picture that can be decoded independently. Each macroblock can be transmitted in one of several coding modes depending on the slice-coding type. In all slice-coding types, two classes of intra-coding modes are supported. In addition to the intra-modes, various predictive or motion-compensated modes are provided for P-slice macroblocks. The JVT/H.26L syntax supports quarter- and eighth-pixel accurate motion compensation. JVT/H.26L generally supports multi-picture motion-compensated prediction. That is, more than one prior coded picture can be used as reference for building the prediction signal of predictive coded blocks.
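The prediction step can be illustrated with a toy integer-pel motion search (a simplified stand-in for the standard's actual sub-pel search and interpolation; the frame contents, search range, and block position below are made up) that finds the best 16x16 match for one macroblock across several previously decoded reference pictures:

```python
# Sketch: integer-pel block matching for one 16x16 macroblock over several
# reference pictures, a toy stand-in for multi-picture motion-compensated
# prediction (real encoders also search quarter-pel positions).
import numpy as np

MB = 16  # macroblock size in luminance samples

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def motion_search(current, references, mb_y, mb_x, search_range=8):
    """Return (best_ref_index, (dy, dx), best_sad) for the macroblock at (mb_y, mb_x)."""
    block = current[mb_y:mb_y + MB, mb_x:mb_x + MB]
    best = (None, (0, 0), float("inf"))
    h, w = current.shape
    for ref_idx, ref in enumerate(references):
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                y, x = mb_y + dy, mb_x + dx
                if y < 0 or x < 0 or y + MB > h or x + MB > w:
                    continue
                cost = sad(block, ref[y:y + MB, x:x + MB])
                if cost < best[2]:
                    best = (ref_idx, (dy, dx), cost)
    return best

# Toy data: a "current" frame and two prior decoded pictures.
rng = np.random.default_rng(0)
ref0 = rng.integers(0, 256, (64, 64), dtype=np.uint8)
ref1 = np.roll(ref0, shift=(2, 3), axis=(0, 1))     # ref1 is ref0 shifted by (2, 3)
current = np.roll(ref0, shift=(4, 6), axis=(0, 1))  # current frame is shifted further

print(motion_search(current, [ref0, ref1], mb_y=16, mb_x=16))
```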

In comparison to prior video coding standards, the concept of B-slices/B-pictures (B for bi-predictive) is generalized in JVT/H.26L. For example, other pictures can reference B-pictures for motion-compensated prediction, depending on the memory management control operation of the multi-picture buffering. JVT/H.26L is similar to prior coding standards in that it uses transform coding of the prediction error signal; however, the transform is applied to 4x4 blocks unless the Adaptive Block Size Transform (ABT) is enabled. For the quantization of transform coefficients, JVT/H.26L uses scalar quantization. Two methods of entropy coding are supported: a VLC-based scheme and context-adaptive binary arithmetic coding (CABAC).
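As a small numerical illustration of the transform and quantization stages, the sketch below applies the 4x4 integer transform matrix adopted in the final H.264/AVC design to a made-up residual block and quantizes the result with a plain uniform quantizer (the real standard folds normalization factors into its quantization tables, so the step size here is purely illustrative):

```python
# Sketch: the 4x4 integer core transform applied to a prediction residual,
# followed by plain uniform scalar quantization.
import numpy as np

# Forward 4x4 core transform matrix of H.264/AVC.
C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward_transform(residual_4x4):
    """Apply the separable integer transform: C * X * C^T."""
    return C @ residual_4x4 @ C.T

def quantize(coeffs, step=10):
    """Toy uniform quantizer; the standard uses QP-dependent scaling instead."""
    return np.round(coeffs / step).astype(int)

residual = np.array([[ 5, -2,  0,  1],   # a made-up 4x4 prediction error block
                     [ 3,  1, -1,  0],
                     [ 0,  2,  4, -3],
                     [-1,  0,  1,  2]])

print(quantize(forward_transform(residual)))
```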

2.3.2 JVT/H.26L Tools

The following tools are defined in JVT/H.26L:

Table 39.1: Tools supported in JVT/H.26L

Coding tools                              Baseline Profile      Main Profile
-----------------------------------------------------------------------------
Picture formats
  Progressive pictures                    X                     X
  Interlaced pictures                     Level 2.1 and above   Level 2.1 and above
Slice/picture types
  I and P coding types                    X                     X
  B coding types                                                X
  SI and SP coding types
Macroblock prediction
  Tree-structured motion compensation     X                     X
  Intra blocks on 8x8 basis
  Multi-picture motion compensation       X                     X
  1/4-pel accurate motion compensation    X                     X
  1/8-pel accurate motion compensation
Transform coding
  Adaptive block size transform                                 X
Entropy coding
  VLC-based entropy coding                X                     X
  CABAC                                                         X
In-loop filtering
  In-loop deblocking filter               X                     X

2.3.3 JVT/H.26L Comparisons with Prior Coding Standards

For demonstrating the coding performance of JVT/H.26L, the new standard has been compared to the successful prior coding standards MPEG-2 [13], H.263 [10], and MPEG-4 [5] for a set of popular QCIF (10 Hz and 15 Hz) and CIF (15 Hz and 30 Hz) sequences with different motion and spatial detail. The QCIF sequences are Foreman, News, Container Ship, and Tempete; the CIF sequences are Bus, Flower Garden, Mobile and Calendar, and Tempete. All video encoders have been optimized with respect to their rate-distortion efficiency using Lagrangian techniques [18], [19]. Besides yielding performance gains, the use of a single, efficient coder control for all video encoders allows a fair comparison of their coding efficiency; for details see [17]. Table 39.2 shows the average bit rate savings of JVT/H.26L with respect to MPEG-4 ASP (Advanced Simple Profile), H.263 HLP (High Latency Profile), and MPEG-2.

Table 39.2: Average bit rate savings of JVT/H.26L and prior standards (each row coder relative to each column coder)

Coder         MPEG-4 ASP    H.263 HLP    MPEG-2
------------------------------------------------
JVT/H.26L     38.62%        48.80%       64.46%
MPEG-4 ASP    -             16.65%       42.95%
H.263 HLP     -             -            30.61%
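The Lagrangian coder control used to make these comparisons fair selects, for each macroblock, the coding mode that minimizes J = D + lambda * R. A minimal mode-decision sketch follows; the distortion, rate, and lambda values are invented and do not follow the exact formulas of [18], [19]:

```python
# Sketch: Lagrangian mode decision. For each candidate mode the encoder measures
# a distortion D (e.g. SSD against the original block) and a rate R (bits needed
# to code the mode), and keeps the mode minimizing J = D + lambda * R.
# The numbers below are invented purely to show the mechanics.

def choose_mode(candidates, lam):
    """candidates: list of (mode_name, distortion, rate_bits)."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

candidates = [
    ("SKIP",        900,   1),   # (mode, distortion, rate in bits)
    ("INTER_16x16", 400,  38),
    ("INTER_8x8",   250,  95),
    ("INTRA_4x4",   300, 120),
]

for lam in (1.0, 5.0, 20.0):  # larger lambda weights rate more, favoring cheap modes
    print(lam, choose_mode(candidates, lam))
```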

2.4 What Can be done to Improve the Standards?

Can something be done to "significantly" improve the performance of compression techniques? How will this affect the standards? We believe that no significant improvements are to be expected in the near future. However, compression techniques that support new, application-driven functionalities will be developed. For example, Internet applications may require techniques that support scalability modes tied to the network transport (the JVT/H.26L standard addresses some of this). We may also see proprietary methods that are variations on the standards, such as the video compression technique used by RealNetworks, aimed at applications where the content provider wants users to obtain both the encoder and the decoder from it in order to gain an economic advantage.

2.4.1 Still Image Coding

JPEG2000 represents the state of the art in still image coding standards, mainly due to its roughly 20% improvement in coding efficiency over the DCT-based JPEG and to the new set of functionalities it incorporates. Nonlinear wavelet decomposition may bring further improvement [20]. Other improvements will include the investigation of color transformations for color images [20] and perceptual models [22]. Although other techniques, such as fractal coding and vector quantization, have also been studied, they have not found their way into the standards. Other alternative approaches, such as "second generation" techniques [23], raised a lot of interest because of their potential for high compression ratios; however, they have not been able to deliver very high quality. Second generation techniques, and in particular segmentation-based image coding schemes, have produced a coding approach better suited to content access and manipulation than to pure coding applications. These schemes are the basis of the MPEG-4 object-based coding schemes.

There are many schemes that might increase the coding efficiency of JPEG2000, but all of them are likely to yield only small gains. We believe that the JPEG2000 framework will be widely used in many applications.

2.4.2 Video Coding

All the video coding standards based on motion prediction and the DCT produce block artifacts at low data rates, and much work has gone into postprocessing techniques that reduce these blocking artifacts [24], [25], [26]. A great deal of work has also investigated the use of wavelets in video coding. This work has taken two main directions: the first codes the prediction error of the hybrid scheme using the DWT [27]; the second uses a full 3-D wavelet decomposition [28], [29]. Although these approaches have reported coding efficiency improvements over the hybrid schemes, most of them are intended mainly to provide further functionalities such as scalability and progressive transmission. See [30], [31] for later developments in fully scalable 3-D subband coding schemes.
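As a rough illustration of the 3-D wavelet direction, the following sketch (assuming NumPy and PyWavelets; it omits the motion-compensated temporal filtering that the cited schemes employ) decomposes a short group of frames along the temporal axis with a Haar filter and then applies a spatial 2-D DWT to each temporal subband:

```python
# Sketch: a "t + 2D" wavelet decomposition of a group of frames: a temporal Haar
# transform followed by a spatial 2-D DWT of each temporal subband frame.
# Real 3-D subband coders add motion-compensated temporal filtering.
import numpy as np
import pywt

frames = np.random.rand(8, 64, 64)  # toy group of 8 grayscale frames

# Temporal decomposition along axis 0 (the frame index).
t_low, t_high = pywt.dwt(frames, 'haar', axis=0)

def spatial_dwt(stack, wavelet='bior4.4', level=2):
    """Spatial 2-D wavelet decomposition of every frame in a temporal subband."""
    return [pywt.wavedec2(frame, wavelet, level=level) for frame in stack]

low_band = spatial_dwt(t_low)    # coarse temporal approximation frames
high_band = spatial_dwt(t_high)  # temporal detail (motion/changes)

print(len(low_band), len(high_band))  # 4 temporal-low and 4 temporal-high frames
```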

One approach that reports major improvements within the hybrid framework is that proposed in [32]. Long-term memory prediction extends motion compensation from the previous frame to several past frames, with the result of increased coding efficiency, and it is combined with affine motion compensation. Bit rate savings between 20% and 50% are achieved relative to the test model of H.263+, with corresponding PSNR gains between 0.8 and 3 dB.

It can be said that MPEG-4 Part 10, also known as JVT/H.26L, represents the state of the art in video coding. In addition, MPEG-4 combines frame-based and segmentation-based approaches, along with the mixing of natural and synthetic content, allowing efficient coding as well as content access and manipulation. Other schemes may well improve on the coding efficiency established by JVT/H.26L, but no significant breakthrough is expected. The basic question remains: what is next? The next section tries to provide some clues.



