Coding of High Quality Moving Pictures (MPEG-2)

Overview

Following the universal success of the H.261 and MPEG-1 video codecs, there was a growing need for a video codec to address a wide variety of applications. Considering the similarity between H.261 and MPEG-1, ITU-T and ISO/IEC made a joint effort to devise a generic video codec. Joining the study was a special group in ITU-T, Study Group 15 (SG15), which was interested in coding of video for transmission over the future broadband integrated services digital networks (B-ISDN) using asynchronous transfer mode (ATM) transport. The devised generic codec was finalised in 1995 and takes the name MPEG-2/H.262, although it is more commonly known as MPEG-2 [1].

At the time of the development, the following applications for the generic codec were foreseen:

  • BSS: broadcasting satellite service (to the home)
  • CATV: cable TV distribution on optical networks, copper etc.
  • CDAD: cable digital audio distribution
  • DAB: digital audio broadcasting (terrestrial and satellite)
  • DTTB: digital terrestrial television broadcast
  • EC: electronic cinema
  • ENG: electronic news gathering (including SNG, satellite news gathering)
  • FSS: fixed satellite service (e.g. to head ends)
  • HTT: home television theatre
  • IPC: interpersonal communications (videoconferencing, videophone, . . .)
  • ISM: interactive storage media (optical discs etc.)
  • MMM: multimedia mailing
  • NCA: news and current affairs
  • NDS: networked database services (via ATM etc.)
  • RVS: remote video surveillance
  • SSM: serial storage media (digital VTR etc.).

Of particular importance is the application to satellite systems where the limitations of radio spectrum and satellite parking orbit result in pressure to provide acceptable quality TV signals at relatively low bit rates. As we will see at the end of this Chapter, today we can accommodate about 6–8 MPEG-2 coded TV programs in the same satellite channel that used to carry only one analogue TV program. Numerous new applications have been added to the list. In particular high definition television (HDTV) and digital versatile disc (DVD) for home storage systems appear to be the main beneficiaries of further MPEG-2 development.

MPEG 2 systems

The MPEG-1 standard targeted coding of audio and video for storage, where the media error rate is negligible [2]. Hence the MPEG-1 systems layer is not designed to be robust to bit errors. Also, MPEG-1 was aimed at software oriented image processing, where large and variable length packets could reduce the software overhead [3].

The MPEG-2 standard, on the other hand, is more generic, serving a variety of audiovisual coding applications. It has to include error resilience for broadcasting and for ATM networks. Moreover, it has to deliver multiple programs simultaneously without requiring them to have a common time base. These requirements dictate that the MPEG-2 transport packet length should be short and fixed.

MPEG-2 defines two types of stream: the program stream and the transport stream. The program stream is similar to the MPEG-1 systems stream, but uses a modified syntax and new functions to support advanced functionalities (e.g. scalability). It also provides compatibility with the MPEG-1 systems stream; that is, an MPEG-2 decoder should be capable of decoding an MPEG-1 bit stream. Like MPEG-1 decoders, program stream decoders typically employ long and variable length packets. Such packets are well suited for software-based processing and error-free transmission environments, such as coding for storage of video on a disc. Here the packet sizes are usually 1–2 kbytes long, chosen to match the disc sector sizes (typically 2 kbytes). However, packet sizes as long as 64 kbytes are also supported.

The program stream also includes features not supported by MPEG-1 systems. These include scrambling of data, assignment of different priorities to packets, information to assist alignment of elementary stream packets, indication of copyright, indication of fast forward, fast reverse and other trick modes for storage devices. An optional field in the packets is provided for testing the network performance, and optional numbering of a sequence of packets is used to detect lost packets.

In the transport stream, MPEG-2 significantly differs from MPEG-1 [4]. The transport stream offers robustness for noisy channels as well as the ability to assemble multiple programs into a single stream. The transport stream uses fixed length packets of size 188 bytes with a new header syntax. Each packet can be segmented into four 47-byte units to be accommodated in the payloads of four ATM cells under the AAL1 adaptation scheme [5]. It is therefore more suitable for hardware processing and for error correction schemes, such as those required in television broadcasting, satellite/cable TV and ATM networks. Furthermore, multiple programs with independent time bases can be multiplexed in one transport stream. The transport stream also allows synchronous multiplexing of programs, fast access to the desired program for channel hopping, multiplexing of programs with clocks unrelated to the transport clock and correct synchronisation of elementary streams for playback. It also allows control of the decoder buffers during start up and playback for both constant and variable bit rate programs.

A basic data structure that is common to the organisation of both the program stream and transport stream is called the packetised elementary stream (PES) packet. Packetising the continuous streams of compressed video and audio bit streams (elementary streams) generates PES packets. Simply stringing together PES packets from the various encoders with other packets containing necessary data to generate a single bit stream generates a program stream. A transport stream consists of packets of fixed length containing four bytes of header followed by 184 bytes of data, where the data is obtained by segmenting the PES packets.
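As a rough illustration of these packet sizes, the sketch below slices a PES packet into 188-byte transport packets consisting of a 4-byte header plus 184 bytes of payload. The header layout and function names are our own simplification; the normative transport packet header (sync byte, PID, continuity counter, adaptation field and so on) is richer than shown here.

```python
# Illustrative MPEG-2 style transport packetisation of a PES packet.
# Each transport packet is 188 bytes = 4-byte header + 184-byte payload.

TS_PACKET_SIZE = 188
TS_HEADER_SIZE = 4
TS_PAYLOAD_SIZE = TS_PACKET_SIZE - TS_HEADER_SIZE  # 184 bytes

SYNC_BYTE = 0x47  # every transport packet starts with this sync byte


def packetise_pes(pes_bytes: bytes, pid: int) -> list:
    """Split a PES packet into fixed-length 188-byte transport packets."""
    packets = []
    for offset in range(0, len(pes_bytes), TS_PAYLOAD_SIZE):
        payload = pes_bytes[offset:offset + TS_PAYLOAD_SIZE]
        # pad the final packet so every transport packet is exactly 188 bytes
        payload = payload.ljust(TS_PAYLOAD_SIZE, b"\xff")
        # simplified 4-byte header: sync byte, 13-bit PID, dummy flags
        header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
        packets.append(header + payload)
    return packets


if __name__ == "__main__":
    pes = bytes(1000)                       # a dummy 1 kbyte PES packet
    ts = packetise_pes(pes, pid=0x100)
    print(len(ts), "transport packets of", len(ts[0]), "bytes each")
```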

Figure 8.1 illustrates both types of program and transport stream multiplexes of MPEG-2 systems. Like MPEG-1, the MPEG-2 systems layer is also capable of combining multiple sources of user data along with encoded audio and video. The audio and video streams are packetised to form PES packets that are sent to either a program multiplexer or a transport multiplexer, resulting in a program stream or transport stream, respectively. As mentioned earlier, program streams are intended for an error-free environment such as digital storage media (DSM). Transport streams are intended for noisier environments such as terrestrial broadcast channels.

Figure 8.1: MPEG-2 systems multiplex of program and transport streams

At the receiver the transport streams are decoded by a transport demultiplexer (which includes a clock extraction mechanism), unpacketised by a depacketiser, and sent to audio and video decoders for decoding, as shown in Figure 8.2.

Figure 8.2: MPEG-2 systems demultiplexing of program and transport streams

The decoded signals are sent to the receiver buffer and presentation units, which output them to a display device and a speaker at the appropriate time. Similarly, if program streams are used, they are decoded by the program stream demultiplexer and depacketiser and sent to the audio and video decoders. The decoded signals are sent to the respective buffers to await presentation. Also similar to MPEG-1 systems, information about systems timing is carried by the clock reference field in the bit stream, which is used to synchronise the decoder system time clock (STC). Presentation time stamps (PTS), which are also carried by the bit stream, control the presentation of the decoded output.

Level and profile

MPEG-2 is intended to be generic in the sense that it serves a wide range of applications, bit rates, resolutions, qualities and services. Applications should cover, among other things, digital storage media, television broadcasting and communications. In the course of the development, various requirements from typical applications were considered, and they were integrated into a single syntax. Hence MPEG-2 is expected to facilitate the interchange of bit streams among different applications. Considering the practicality of implementing the full syntax of the bit stream, however, a limited number of subsets of the syntax are also stipulated by means of profile and level [6].

A profile is a subset of the entire bit stream syntax that is defined by the MPEG-2 specification. Within the bounds imposed by the syntax of a given profile it is still possible to encompass very large variations in the performance of encoders and decoders depending upon the values taken by parameters in the bit stream. For instance, it is possible to specify frame sizes as large as (approximately) 2^14 samples wide by 2^14 lines high. It is currently neither practical nor economical to implement a decoder capable of dealing with all possible frame sizes. In order to deal with this problem levels are defined within each profile. A level is a defined set of constraints imposed on parameters in the bit stream. These constraints may be simple limits on numbers. Alternatively, they may take the form of constraints on arithmetic combinations of the parameters (e.g. frame width multiplied by frame height multiplied by frame rate). Both profiles and levels have a hierarchical relationship, and the syntax supported by a higher profile or level must also support all the syntactical elements of the lower profiles or levels.
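As a simple illustration of such constraints, the sketch below checks a candidate picture format against Main-level style limits (the 720 pels/line, 576 lines/frame, 30 frames/s and 15 Mbit/s figures quoted in Table 8.2). The dictionary layout and function are our own illustration, not the normative conformance procedure.

```python
# Illustrative profile/level constraint check using Main-level figures.

MAIN_LEVEL = {
    "max_pels_per_line": 720,
    "max_lines_per_frame": 576,
    "max_frames_per_second": 30,
    "max_bit_rate_bps": 15_000_000,
    "max_luminance_rate": 720 * 576 * 30,  # pels/line x lines/frame x frames/s
}


def conforms_to_level(width, height, fps, bit_rate_bps, level=MAIN_LEVEL):
    """Return True if the picture parameters satisfy the level limits."""
    return (
        width <= level["max_pels_per_line"]
        and height <= level["max_lines_per_frame"]
        and fps <= level["max_frames_per_second"]
        and bit_rate_bps <= level["max_bit_rate_bps"]
        # constraint on an arithmetic combination of parameters
        and width * height * fps <= level["max_luminance_rate"]
    )


print(conforms_to_level(720, 576, 25, 6_000_000))      # True: fits Main level
print(conforms_to_level(1920, 1152, 60, 80_000_000))   # False: needs High level
```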

Bit streams complying with the MPEG-2 specification use a common syntax. In order to achieve a subset of the complete syntax, flags and parameters are included in the bit stream which signal the presence or otherwise of syntactic elements that occur later in the bit stream. Then to specify constraints on the syntax (and hence define a profile) it is only necessary to constrain the values of these flags and parameters that specify the presence of later syntactic elements.

In order to adapt the bit stream to specific applications, the coded data are ordered into layers. If there is only one layer, the coded video data is called a nonscalable video bit stream. For two or more layers, the bit stream is called a scalable hierarchy. In the scalable mode, the first layer, called the base layer, is always decoded independently. The other layers are called enhancement layers, and can only be decoded together with the lower layers.

Before describing how various scalabilities are introduced in MPEG-2, let us see how levels and profiles are defined. MPEG-2 initially defined five hierarchical structure profiles and later on added two profiles that do not fit the hierarchical structure. The profile mainly deals with the supporting tool for coding, such as the group of picture structure, the picture format and scalability. The seven known profiles and their current applications are summarised in Table 8.1.

Table 8.1: Various profiles defined for MPEG-2

Type           Supporting tools                                                   Application
Simple         I & P pictures, 4:2:0 format; nonscalable                          currently not used
Main           simple profile + B-pictures                                        broadcast TV
SNR scalable   main profile + SNR scalability                                     currently not used
Spatial        SNR profile + spatial scalability                                  currently not used
High           spatial profile + 4:2:2 format                                     currently not used
4:2:2          IBIBIB... pictures; extension of main profile to high bit rates    studio postproduction; high quality video for storage (VTR) & video distribution
Multiview      main profile + temporal scalability                                several video streams; stereo presentation

The level deals with the picture resolution, such as the number of pixels per line, lines per frame, frames per second (fps) and bits per second, or the bit rate (e.g. Mbps). Table 8.2 summarises the levels that are most suitable for each profile.

Table 8.2: The levels defined for each profile

Level       Parameter   Simple   Main     SNR      Spatial   High       4:2:2      Multiview
                        I,P      I,P,B    I,P,B    I,P,B     I,P,B      I,P,B      I,P,B
                        4:2:0    4:2:0    4:2:0    4:2:0     4:2:0,     4:2:0,     4:2:0
                                                             4:2:2      4:2:2
Low         pel/line    -        352      352      -         -          -          352
            line/fr     -        288      288      -         -          -          288
            fps         -        30/25    30/25    -         -          -          30/25
            Mbps        -        4        4        -         -          -          8
Main        pel/line    720      720      720      -         720        720        720
            line/fr     576      576      576      -         576        512/608    576
            fps         30/25    30/25    30/25    -         30/25      30/25      30/25
            Mbps        15       15       15       -         20         50         25
High 1440   pel/line    -        1440     -        1440      1440       -          1440
            line/fr     -        1152     -        1152      1152       -          1152
            fps         -        60       -        60        60         -          60
            Mbps        -        60       -        60        80         -          100
High        pel/line    -        1920     -        -         1920       1920       1920
            line/fr     -        1152     -        -         1152       1152       1152
            fps         -        60       -        -         60         60         60
            Mbps        -        80       -        -         100        300        130

The main profile at main level (MP@ML) is the most widely used pair for broadcast TV, and the 4:2:2 profile at main level (4:2:2@ML) is used for studio video production and recording.

How does the MPEG 2 video encoder differ from MPEG 1?

Major differences

From the profile and level we see that the picture resolutions in MPEG-2 can vary from SIF (352 × 288 × 25 or 30) to HDTV with 1920 × 1250 × 60. Moreover, most of these pictures are interlaced, whereas in MPEG-1 pictures are noninterlaced (progressive). Coding of interlaced pictures is the first difference between the two coding schemes.

In the MPEG-2 standard, combinations of the various picture formats and the interlaced/progressive option create a new range of macroblock types. Although each macroblock in the progressive mode has six blocks in the 4:2:0 format, the number of blocks in the 4:4:4 image format is 12. Also, the dimensions of the unit of blocks used for motion estimation/compensation can change. In interlaced pictures, since the number of lines per field is half the number of lines per frame, then with equal horizontal and vertical resolutions it might be appropriate to choose motion estimation blocks of 16 × 8, i.e. 16 pixels over eight lines. These types of submacroblock have half the number of blocks of the progressive mode.
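The block counts mentioned above can be tabulated directly; the snippet below (illustrative only) maps each chroma format to the number of 8 × 8 blocks per macroblock, i.e. four luminance blocks plus the chrominance blocks.

```python
# Number of 8x8 blocks per 16x16 macroblock for each chroma format:
# 4 luminance blocks plus 2, 4 or 8 chrominance blocks.
BLOCKS_PER_MACROBLOCK = {
    "4:2:0": 4 + 2,   # 6 blocks
    "4:2:2": 4 + 4,   # 8 blocks
    "4:4:4": 4 + 8,   # 12 blocks
}

for fmt, n in BLOCKS_PER_MACROBLOCK.items():
    print(f"{fmt}: {n} blocks per macroblock")
```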

The second significant difference between the MPEG-1 and the MPEG-2 video encoders is the new function of scalability. The scalable modes of MPEG-2 are intended to offer interoperability among different services or to accommodate the varying capabilities of different receivers and networks upon which a single service may operate. They allow a receiver to decode a subset of the full bit stream in order to display an image sequence at a reduced quality, spatial and temporal resolution.

Minor differences

Apart from the two major distinctions there are some other minor differences, which have been introduced to increase the coding efficiency of MPEG-2. They are again due to the picture interlacing used in MPEG-2. The first one is the scanning order of DCT coefficients. In MPEG-1, like H.261, zigzag scanning is used. MPEG-2 has the choice of using alternate scan, as shown in Figure 8.3b. For interlaced pictures, since the vertical correlation in the field pictures is greatly reduced, should field prediction be used, an alternate scan may perform better than a zigzag scan.

Figure 8.3: Two types of scanning method: (a) zigzag scan; (b) alternate scan

The second minor difference is on the nature of quantisation of the DCT coefficients. MPEG-2 supports both linear and nonlinear quantisation of the DCT coefficients. The nonlinear quantisation increases the precision of quantisation at high bit rates by employing lower quantiser scale values. This improves picture quality at low contrast areas. At lower bit rates, where larger step sizes are needed, again the nonlinear behaviour of the quantiser provides a larger dynamic range for quantisation of the coefficients.

MPEG 1 and MPEG 2 syntax differences

The IDCT mismatch control in MPEG-1 is slightly different from that in MPEG-2. After the inverse quantisation process of the DCT coefficients, the MPEG-1 standard requires that all nonzero coefficients are adjusted by 1 or -1 so that they take odd values. In the MPEG-2 standard, only the last coefficient need be adjusted by 1 or -1, and only if the sum of all coefficients is even after inverse quantisation. Another significant difference concerns the run-level values. In MPEG-1, those that cannot be coded with a variable length code (VLC) are coded with the escape code, followed by either a 14-bit or 22-bit fixed length code (FLC), whereas in MPEG-2 they are followed by an 18-bit FLC.
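The MPEG-2 mismatch control rule lends itself to a very small sketch, given below; it follows the description above (toggle the parity of the last coefficient when the sum of all inverse-quantised coefficients is even), while the saturation and exact rounding of the normative inverse quantiser are omitted and the function name is our own.

```python
import numpy as np


def mpeg2_mismatch_control(coeffs: np.ndarray) -> np.ndarray:
    """Sketch of MPEG-2 style IDCT mismatch control on an 8x8 block of
    inverse-quantised DCT coefficients: if the sum of all coefficients is
    even, the last coefficient (position [7][7]) is nudged by +/-1 so that
    the sum becomes odd.  (MPEG-1 instead forces every nonzero coefficient
    to an odd value.)"""
    out = coeffs.copy()
    if int(out.sum()) % 2 == 0:
        # toggle the parity of the last coefficient
        out[7, 7] += -1 if out[7, 7] % 2 else 1
    return out


block = np.zeros((8, 8), dtype=int)
block[0, 0] = 64                                  # even sum -> adjustment needed
print(mpeg2_mismatch_control(block)[7, 7])        # prints 1
```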

The constraint parameter flag mechanism in MPEG-1 has been replaced by the profile and level structures in MPEG-2. The additional chroma formats (4:2:2 and 4:4:4) and the interlaced related operations (field prediction and scalable coding modes) make MPEG-2 bit stream syntax different from that of MPEG-1.

The concept of the group of pictures (GOP) layer is slightly different. GOP in MPEG-2 may indicate that certain B-pictures at the beginning of an edited sequence comprise a broken link which occurs if the forward reference picture needed to predict the current B-pictures is removed from the bit stream by an editing process. It is an optional structure for MPEG-2 but mandatory for MPEG-1. The final point is that slices in MPEG-2 must always start and end on the same horizontal row of macroblocks. This is to assist the implementations in which the decoding process is split into some parallel operations along horizontal strips within the same pictures.

Although these differences may make direct decoding of the MPEG-1 bit stream by an MPEG-2 decoder infeasible, the fundamentals of video coding in the two codecs remain the same. In fact, as we mentioned, there is a need for backward compatibility, such that the MPEG-2 decoder should be able to decode the MPEG-1 encoded bit stream. Thus MPEG-1 is a subset of MPEG-2. They employ the same concept of a group of pictures, the interlaced field pictures now become I, P and B-fields, and all the macroblock types have to be identified as field or frame based. Therefore in describing the MPEG-2 video codec, we will avoid repeating what has already been said about MPEG-1 in Chapter 7. Instead we concentrate on those parts which have arisen due to the interlacing and scalability of MPEG-2. However, for more information on the differences between MPEG-1 and MPEG-2 refer to [7].

MPEG 2 nonscalable coding modes

This simple nonscalable mode of the MPEG-2 standard is the direct extension of the MPEG-1 coding scheme with the additional feature of accommodating interlaced video coding. The impact of the interlaced video on the coding methodology is that interpicture prediction may be carried out between the fields, as they are closer to each other. Furthermore, for slow moving objects, vertical pixels in the same frame are closer, making frame prediction more efficient.

As usual we define the prediction modes on a macroblock basis. Also, to be in line with the MPEG-2 definitions, we define the odd and the even fields as the top and bottom fields, respectively. A field macroblock, similar to the frame macroblock, consists of 16 x 16 pixels. In the following, five modes of prediction are described [3]. They can be equally applied to P and B-pictures, unless specified otherwise.

Similar to the reference model in H.261, software-based reference codecs for laboratory testing have also been considered for MPEG-1 and 2. For these codecs, the reference codec is called the test model (TM), and the latest version of the test model is TM5 [8].

Frame prediction for frame pictures

Frame prediction for frame pictures is identical to the prediction used in MPEG-1. Each P-frame can make a prediction from the previous anchor frame, and there is one motion vector for each motion compensated macroblock. B-frames may use the previous, the future or the interpolated past and future anchor frames. There will be up to two motion vectors (forward and backward) for each B-frame motion compensated macroblock. Frame prediction works well for slow to moderate motion as well as panning over a detailed background.

Field prediction for field pictures

Field prediction is similar to frame prediction, except that pixels of the target macroblock (MB to be coded) belong to the same field. Prediction macroblocks also should belong to one field, either from the top or the bottom field. Thus for P-pictures the prediction macroblock comes from the two most recent fields, as shown in Figure 8.4. For example, the prediction for the target macroblocks in the top field of a P-frame, TP, may come either from the top field, TR, or the bottom field, BR, of the reference frame.


Figure 8.4: Field prediction of field pictures for P-picture MBs

The prediction for the target macroblocks in the bottom field, BP, can be made from the two most recent fields: the top field of the same frame, TP, or the bottom field of the reference frame, BR.

For B-pictures the prediction MBs are taken from the two most recent anchor pictures (I/P or P/P). Each target macroblock can make a forward or a backward prediction from either of the fields.

For example in Figure 8.5 the forward prediction for the bottom field of a B-picture, BB, is either TP or BP, and the backward prediction is taken from TF or BF. There will be one motion vector for each P-field target macroblock, and two motion vectors for those of B-fields.

Figure 8.5: Field prediction of field pictures for B-picture MBs

Field prediction for frame pictures

In this case the target macroblock in a frame picture is split into its top field and bottom field pixels, forming two 16 × 8 blocks, as shown in Figure 8.6. Field prediction is then carried out independently for each of the two 16 × 8 pixel target blocks.

Figure 8.6: A target macroblock is split into two 16 × 8 field blocks

For P-pictures, two motion vectors are assigned for each 16 × 16 pixel target macroblock. The 16 × 8 predictions may be taken from either of the two most recently decoded anchor pictures. Note that the 16 × 8 field prediction cannot come from the same frame, as was the case in field prediction for field pictures.

For B-pictures, due to the forward and the backward motion, there can be two or four motion vectors for each target macroblock. The 16 × 8 predictions may be taken from either field of the two most recently decoded anchor pictures.

Dual prime for P pictures

Dual prime is only used in P-pictures where there are no B-pictures in the GOP. Here only one motion vector is encoded (in its full format) in the bit stream together with a small differential motion vector correction. In the case of the field pictures two motion vectors are then derived from this information. These are used to form predictions from the two reference fields (one top, one bottom) which are averaged to form the final prediction. In the case of frame pictures this process is repeated for the two fields so that a total of four field predictions are made.

Figure 8.7 shows an example of dual prime motion compensated prediction for the case of frame pictures. The transmitted motion vector has a vertical displacement of three pixels. From the transmitted motion vector two preliminary predictions are computed, which are then averaged to form the final prediction.

Figure 8.7: Dual prime motion compensated prediction for P-pictures

The first preliminary prediction is identical to the field prediction, except that the reference pixels should come from the previously coded fields of the same parity (top or bottom fields) as the target pixels. The reference pixels, which are obtained from the transmitted motion vector, are taken from two fields (taken from one field for field pictures). In the Figure the predictions for target pixels in the top field, TP, are taken from the top reference field, TR. Target pixels in the bottom field, BP, take their predictions from the bottom reference field, BR.

The second preliminary prediction is derived using a computed motion vector plus a small differential motion vector correction. For this prediction, reference pixels are taken from the field of opposite parity to that used for the first preliminary prediction. For the target pixels in the top field, TP, pixels are taken from the bottom reference field, BR. Similarly, for the target pixels in the bottom field, BP, prediction pixels are taken from the top reference field, TR.

The computed motion vectors are obtained by a temporal scaling of the transmitted motion vector to match the field in which the reference pixels lie, as shown in Figure 8.7. For example, for the transmitted motion vector of value 3, the computed motion vector for TP would be 3 × 1/2 = 1.5, since the reference field BR is midway between the top reference field and the top target field. The computed motion vector for the bottom field is 3 × 3/2 = 4.5, as the distance between the reference top field and the bottom target field is three fields (3/2 frames). The differential motion vector correction, which can have up to one half pixel precision, is then added to the computed motion vector to give the final corrected motion vector.

In the Figure the differential motion vector correction has a vertical displacement of -0.5 pixel. Therefore the corrected motion vector for the top target field, TP, would be 1.5 - 0.5 = 1, and for the bottom target field it is 4.5 - 0.5 = 4, as shown with thicker lines in the Figure.
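The arithmetic of this example can be written out explicitly. The sketch below reproduces only the vertical-component scaling for a frame picture (factors of 1/2 and 3/2 frame periods) plus the differential correction; the variable names are ours, and the normative derivation also covers the horizontal component, rounding and the field-parity offsets.

```python
def dual_prime_vectors(transmitted_mv: float, correction: float):
    """Illustrative vertical-component calculation for dual prime prediction
    in a frame picture.

    transmitted_mv -- vertical displacement of the transmitted vector (pixels)
    correction     -- small differential motion vector correction (pixels)
    """
    # the opposite-parity reference field for the top target field is half a
    # frame period away, for the bottom target field 3/2 frame periods away
    computed_top = transmitted_mv * 1 / 2
    computed_bottom = transmitted_mv * 3 / 2
    # the differential correction is added to each computed vector
    return computed_top + correction, computed_bottom + correction


top, bottom = dual_prime_vectors(transmitted_mv=3, correction=-0.5)
print(top, bottom)   # 1.0 4.0, matching the worked example in the text
```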

For interlaced video the performance of dual prime prediction can, under some circumstances, be comparable to that of B-picture prediction and has the advantage of low encoding delay. However, for dual prime, unlike B-pictures, the decoded pixels should be stored to be used as reference pixels.

16 × 8 motion compensation for field pictures

In this motion compensation mode, a field macroblock of 16 × 16 pixels is split into upper half and lower half 16 × 8 pixel blocks, and a separate field prediction is carried out for each. Two motion vectors are transmitted for each P-picture macroblock and two or four motion vectors for each B-picture macroblock. This mode of motion compensation may be useful in field pictures that contain irregular motion. Note the difference between this mode and the field prediction for frame pictures in section 8.4.3: here a field macroblock is split into two halves, whereas in the field prediction for frame pictures a frame macroblock is split into its top and bottom field blocks.

Thus the five modes of motion compensation in MPEG-2 in relation to field and frame predictions can be summarised in Table 8.3.

Table 8.3: Five motion compensation modes in MPEG-2

Motion compensation mode                         Use in field pictures   Use in frame pictures
Frame prediction for frame pictures              no                      yes
Field prediction for field pictures              yes                     no
Field prediction for frame pictures              no                      yes
Dual prime for P-pictures                        yes                     yes
16 × 8 motion compensation for field pictures    yes                     no

Restrictions on field pictures

It should be noted that field pictures have some restrictions on I, P and B-picture coding type and motion compensation. Normally, the second field picture of a frame must be of the same coding type as the first field. However, if the first field picture of a frame is an I-picture, then the second field can be either I or P. If it is a P-picture, the prediction macroblocks must all come from the previous I-picture, and dual prime cannot be used.

Motion vectors for chrominance components

As explained, the motion vectors are estimated based on the luminance pixels, hence they are used for the compensation of the luminance component. For each of the two chrominance components the luminance motion vectors are scaled according to the image format:

  • 4:2:0 both the horizontal and vertical components of the motion vector are scaled by dividing by two
  • 4:2:2 the horizontal component of the motion vector is scaled by dividing by two; the vertical component is not altered
  • 4:4:4 the motion vector is unmodified.
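A compact way of expressing this scaling is sketched below (illustrative only; the standard works with half-sample accurate vectors, whereas integer division is used here for brevity).

```python
def chroma_motion_vector(mv_x: int, mv_y: int, chroma_format: str):
    """Scale a luminance motion vector for the chrominance components.

    4:2:0 -- halve both components
    4:2:2 -- halve the horizontal component only
    4:4:4 -- use the vector unmodified
    """
    if chroma_format == "4:2:0":
        return mv_x // 2, mv_y // 2
    if chroma_format == "4:2:2":
        return mv_x // 2, mv_y
    if chroma_format == "4:4:4":
        return mv_x, mv_y
    raise ValueError(f"unknown chroma format: {chroma_format}")


print(chroma_motion_vector(6, 4, "4:2:0"))   # (3, 2)
print(chroma_motion_vector(6, 4, "4:2:2"))   # (3, 4)
```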

Concealment motion vectors

Concealment motion vectors are motion vectors that may be carried by intra macroblocks for the purpose of concealing errors should a transmission error result in loss of information. A concealment motion vector is present for all intra macroblocks if, and only if, the concealment_motion_vectors flag in the picture header is set. In the normal course of events no prediction is formed for such macroblocks, since they are of intra type. The specification does not define how error recovery shall be performed. However, it is a recommendation that concealment motion vectors should be suitable for use by a decoder that is capable of performing the function. If concealment is used in an I-picture then the decoder should perform prediction in a similar way to a P-picture.

Concealment motion vectors are intended for use in the case where a data error results in information being lost. There is therefore little point in encoding the concealment motion vector in the macroblock for which it is intended to be used. This is because, if the data error results in the need for error recovery, it is very likely that the concealment motion vector itself would be lost or corrupted. As a result the following semantic rules are appropriate:

  • For all macroblocks except those in the bottom row of macroblocks concealment motion vectors should be appropriate for use in the macroblock that lies vertically below the macroblock in which the motion vector occurs.
  • When the motion vector is used with respect to the macroblock identified in the previous rule a decoder must assume that the motion vector may refer to samples outside of the slices encoded in the reference frame or reference field.
  • For all macroblocks in the bottom row of macroblocks the reconstructed concealment motion vectors will not be used. Therefore the motion vector (0,0) may be used to reduce unnecessary overhead.

Scalability

Scalable video coding is often regarded as being synonymous with layered video coding, which was originally proposed by the author to increase robustness of video codecs against packet (cell) loss in ATM networks [9]. At the time (late 1980s), H.261 was under development and it was clear that purely interframe coded video by this codec was very vulnerable to loss of information. The idea behind layered coding was that the codec should generate two bit streams, one carrying the most vital video information and called the base layer, and the other to carry the residual information to enhance the base layer image quality, named the enhancement layer. In the event of network congestion, only the less important enhancement data should be discarded, and the space made available for the base layer data. Such a methodology had an influence on the formation of ATM cell structure, to provide two levels of priority for protecting base layer data [5]. This form of two-layer coding is now known as SNR scalability in the MPEG-2 standard, and currently a variety of new two-layer coding techniques have been devised. They now form the basic scalability functions of the MPEG-2 standard.

Before describing various forms of scalability in some detail, it will be useful to know the similarity and more importantly any dissimilarity between these two coding methods. In the next section this will be dealt with in some depth, but since scalability is the commonly adopted name for all the video coding standards, throughout the book we use scalability to address both methods.

The scalability tools defined in the MPEG-2 specifications are designed to support applications beyond those supported by single-layer video. Among the noteworthy application areas addressed are video telecommunications, video on asynchronous transfer mode networks (ATM), interworking of video standards, video service hierarchies with multiple spatial, temporal and quality resolutions, HDTV with embedded TV, systems allowing migration to higher temporal resolution HDTV etc. Although a simple solution to scalable video is the simulcast technique, which is based on transmission/storage of multiple independently coded reproductions of video, a more efficient alternative is scalable video coding, in which the bandwidth allocated to a given reproduction of video can be partially reutilised in coding of the next reproduction of video. In scalable video coding, it is assumed that given an encoded bit stream, decoders of various complexities can decode and display appropriate reproductions of the coded video. A scalable video encoder is likely to have increased complexity when compared with a single-layer encoder. However, the standard provides several different forms of scalability which address nonoverlapping applications with corresponding complexities. The basic scalability tools offered are: data partitioning, SNR scalability, spatial scalability and temporal scalability. Moreover, combinations of these basic scalability tools are also supported and are referred to as hybrid scalability. In the case of basic scalability, two layers of video, referred to as the base layer and the enhancement layer, are allowed, whereas in hybrid scalability up to three layers are supported.

Layering versus scalability

Considering the MPEG-1 and MPEG-2 systems functions, defined in section 8.1, we see that MPEG-2 puts special emphasis on the transport of the bit stream. This is because MPEG-1 video is mainly for storage and software-based decoding applications in an almost error free environment, whereas MPEG-2, or H.262 in the ITU-T standard, is for transmission and distribution of video over various networks. Depending on the application, the emphasis can be put on either transmission or distribution. In the introduction to MPEG-2/H.262 the potential applications for this codec were listed. The major part of these applications is the transmission of video over networks, such as satellite and terrestrial broadcasting, news gathering, personal communications, video over ATM networks etc., where for better quality of service the bit stream should be protected against channel misbehaviour, which is very common in these environments. One way of protecting data against channel errors is to add some redundancy, such as forward error correcting bits, into the bit stream. The overhead is a percentage of the bit rate (depending on the amount of protection) and will be minimal if the part that needs protection requires only a small channel rate. Hence it is logical to design the codec in such a way as to generate more than one bit stream, and to protect the most vital bit stream against errors so as to guarantee a picture of basic quality. The remaining bit streams should be such that their presence enhances the video quality, but their absence or corruption should not degrade the video quality significantly. Similarly, in ATM networks, partitioning the bit stream into parts of various importance, and then providing a guaranteed channel capacity for a small part of the bit stream, is much easier than dealing with the entire bit stream. This is the fundamental concept behind layered video coding. The most notable point is that the receiver always expects to receive the entire bit stream, but if some parts are not received, the picture will not break up, nor will the quality degrade significantly. Increasing the number of layers, and unequally protecting the layers against errors (cell loss in ATM networks) according to the importance of their contribution to video quality, can give a graceful degradation of video quality.

On the other hand, some applications set forth for MPEG-2 are mainly for the distribution of digital video to receivers of various capabilities. For example, in cable TV distribution over optical networks (CATV), the prime requirement is to be able to decode pictures of various qualities from a single bit stream. In this network, there might be receivers with various decoding capabilities (processing power) or customers with different requirements for video quality. Then, according to the need, a portion of the bit stream is decoded for that specific service. Here, it is assumed that the error rate in the optical channels is negligible and that sufficient channel capacity for the bit stream exists; both assumptions are plausible.

Thus, comparing layered coding with scalability, the fundamental difference is that in layered coding the receiver expects to receive the entire bit stream, but occasionally some parts might be in error or missing, while in scalable coding a receiver expects to decode a portion of the bit stream, but when that portion is available it remains so for the entire communication session. Of course, in scalable coding, for efficient compression, generation of the bit stream is made in a hierarchical structure such that the basic portion of the bit stream gives a video of minimum acceptable quality, similar to the base layer video. The subsequent segments of the bit stream enhance the video quality accordingly, similar to the enhancement layers. If the channel requires any protection, then the base layer bit stream should be guarded or, in ATM networks, the required channel capacity should be provided for it. In this respect we can say scalable and layered video coding are the same. This means that scalable coding can be used as a layering technique, whereas layered coded data may not be scalable. Thus, scalability is a more generic name for layering, and throughout the book we use scalability to address both. Those parts where the video codec acts as a layered encoder but not as a scalable encoder will be particularly identified.

Data partitioning

Data partitioning is a tool intended for use when two channels are available for the transmission and/or storage of a video bit stream, as may be the case in ATM networks, terrestrial broadcasting, magnetic media etc. Data partitioning is in fact not true scalable coding but, as we will see, a layered coding technique. It is a means of dividing the bit stream of a single-layer MPEG-2 into two parts or two layers. The first layer comprises the critical parts of the bit stream (such as headers, motion vectors, lower order DCT coefficients), which are transmitted in the channel with the better error performance. The second layer is made of less critical data (such as higher order DCT coefficients) and is transmitted in the channel with the poorer error performance. Thus, degradation due to channel errors is minimised since the critical parts of a bit stream are better protected. Data from neither channel may be decoded by a decoder that is not intended for decoding data partitioned bit streams. Even with the proper decoder, data extracted from the second layer cannot be used unless the decoded base layer data is available.

A block diagram of a data partitioning encoder is shown in Figure 8.8. The single-layer encoder is in fact a nonscalable MPEG-2 video encoder that may or may not include B-pictures. At the encoder, during the quantisation and zigzag scanning of each 8 × 8 block of DCT coefficients, the scanning is broken at the priority break point (PBP), as shown in Figure 8.9.

Figure 8.8: Block diagram of a data partitioning encoder

Figure 8.9: Position of the priority break point in a block of DCT coefficients

The first part of the scanned quantised coefficients after variable length coding, with the other overhead information such as motion vectors, macroblock types and addresses etc., including the priority break point (PBP), is taken as the base layer bit stream. The remaining scanned and quantised coefficients plus the end of block (EOB) code constitute the enhancement layer bit stream. Figure 8.9 also shows the position of the priority break point in the DCT coefficients.

The base and the enhancement layer bit streams are then multiplexed for transmission into the channel. For prioritised transmission such as ATM networks, each bit stream is first packetised into high and low priority cells and the cells are multiplexed. At the decoder, knowing the position of the PBP, a block of DCT coefficients is reconstructed from the two bit streams. Note that the PBP indicates the last DCT coefficient of the base layer. Its position at the encoder is determined by the portion of the total bit rate allocated to the base layer.
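The partitioning operation itself is straightforward. The sketch below splits one zigzag-scanned run of quantised coefficients at the PBP into base and enhancement parts and merges them again at the decoder; this is illustrative code, not the bit-exact MPEG-2 data partitioning syntax.

```python
def partition_block(scanned_coeffs, pbp):
    """Split one zigzag-scanned block of quantised DCT coefficients at the
    priority break point (PBP).

    The base part carries the first `pbp` coefficients (together with the
    headers, motion vectors etc. handled elsewhere); the enhancement part
    carries the remaining coefficients up to the end of block."""
    base = scanned_coeffs[:pbp]
    enhancement = scanned_coeffs[pbp:]
    return base, enhancement


def merge_block(base, enhancement):
    """Decoder-side reconstruction: both parts are needed to rebuild the
    block; the base part alone gives only a coarse (and drifting) picture."""
    return list(base) + list(enhancement)


coeffs = [45, -12, 7, 3, 0, 2, 1, 0, 1]        # toy scanned block
base, enh = partition_block(coeffs, pbp=3)
print(base)                                     # [45, -12, 7]
print(merge_block(base, enh) == coeffs)         # True
```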

Figure 8.10 shows single shots of an 8 Mbit/s data partitioning MPEG-2 coded video and its associated base layer picture. The priority break point is adjusted for a base layer bit rate of 2 Mbit/s. At this bit rate, the quality of the base layer is almost acceptable. However, some areas in the base layer show blocking artefacts and in others the picture is blurred. Blockiness is due to the reconstruction of some macroblocks from only the DC and/or from a few AC coefficients. Blurriness is due to loss of high frequency DCT coefficients.

Figure 8.10: Data partitioning: (a) enhanced picture; (b) base layer picture

It should be noted that, since the encoder is a single-layer interframe coder, at the encoder both the base and the enhancement layer coefficients are used in the encoding prediction loop. Thus reconstruction of the picture from the base layer only can result in a mismatch between the encoder and decoder prediction loops. This causes picture drift in the reconstructed picture, that is, loss of enhancement data at the decoder accumulates and appears as mosquito-like noise. Picture drift only occurs in P-pictures, but since B-pictures may use P-pictures for prediction, they suffer from picture drift too. I-pictures reset the feedback prediction, hence they clean up the drift. The more frequent the I-pictures, the less the appearance of picture drift, but at the expense of higher bit rates.

In summary, to have drift-free video, the receiver should receive the entire bit stream. That is, a receiver that decodes only the base layer portion of the bit stream cannot produce a stable video. Therefore a data partitioned bit stream is not scalable, but it is layered coded. It is in fact the simplest form of layering technique, having no extra complexity over the single-layer encoder.

Although the base picture suffers from picture drift and may not be usable alone, the enhanced picture (base layer plus the enhancement layer) with occasional losses is quite acceptable. This is due to the normally low loss rates in most networks (e.g. less than 10^-4 in ATM networks), such that before the accumulation of loss becomes significant, the loss area is cleaned up by I-pictures.

SNR scalability

SNR scalability is a tool intended for use in video applications involving telecommunications and multiple quality video services with standard TV and enhanced TV, i.e. video systems with the common feature that a minimum of two layers of video quality is necessary. SNR scalability involves generating two video layers of the same spatio-temporal resolution but different video qualities from a single video source, such that the base layer is coded by itself to provide the basic video quality and the enhancement layer is coded to enhance the base layer. The enhancement layer, when added back to the base layer, regenerates a higher quality reproduction of the input video. Since the enhancement layer is said to enhance the signal-to-noise ratio (SNR) of the base layer, this type of scalability is called SNR scalability. Alternatively, as we will see later, SNR scalability could have been called coefficient amplitude scalability or quantisation noise scalability. These names, although a bit wordy, may better describe the nature of this encoder.

Figure 8.11 shows a block diagram of a two-layer SNR scalable encoder. First, the input video is coded at a low bit rate (lower image quality), to generate the base layer bit stream. The difference between the input video and the decoded output of the base layer is coded by a second encoder, with a higher precision, to generate the enhancement layer bit stream. These bit streams are multiplexed for transmission over the channel. At the decoder, decoding of the base layer bit stream results in the base picture. When the decoded enhancement layer bit stream is added to the base layer, the result is an enhanced image. The base and the enhancement layers may either use the MPEG-2 standard encoder or the MPEG-1 standard for the base layer and MPEG-2 for the enhancement layer. That is, in the latter a 4:2:0 format picture is generated at the base layer, but a 4:2:0 or 4:2:2 format picture at the second layer.

Figure 8.11: Block diagram of a two-layer SNR scalable coder

It may appear that the SNR scalable encoder is much more complex than the data partitioning encoder. The former requires at least two nonscalable encoders, whereas data partitioning uses a simple single-layer encoder and the partitioning is just carried out on the bit stream. The fact is that if both layer encoders in the SNR coder are of the same type, e.g. both nonscalable MPEG-2 encoders, then the two-layer encoder can be simplified. Consider Figure 8.12, which represents a simplified nonscalable MPEG-2 encoder used at the base layer [10].

Figure 8.12: A DCT based base layer encoder

According to the Figure, the difference between the input pixel block X and its motion compensated prediction Y is transformed into coefficients T(X - Y). After quantisation these coefficients can be represented as T(X - Y) - Q, where Q is the introduced quantisation distortion. The quantised coefficients after the inverse DCT (IDCT) reconstruct the prediction error. They are then added to the motion compensated prediction to reconstruct a locally decoded pixel block Z.

Thus the interframe error signal X - Y after transform coding becomes:

(8.1)   T(X - Y)

and after quantisation, a quantisation distortion Q is introduced to the transform coefficients. Then eqn. 8.1 becomes:

(8.2)   T(X - Y) - Q

After the inverse DCT the reconstruction error can be formulated as:

(8.3)   T^-1[T(X - Y) - Q]

where T^-1 is the inverse transformation operation. Because the transformation is a linear operator, the reconstruction error can be written as:

(8.4)   T^-1T(X - Y) - T^-1Q

Also, due to the orthonormality of the transform, where T^-1T = 1, eqn. 8.4 is simplified to:

(8.5)   (X - Y) - T^-1Q

When this error is added to the motion compensated prediction Y, the locally decoded block becomes:

(8.6)   Z = Y + (X - Y) - T^-1Q = X - T^-1Q

Thus, according to Figure 8.12, what is coded by the second layer encoder is:

(8.7)   X - Z = T^-1Q

that is, the inverse transform of the base layer quantisation distortion. Since the second layer encoder is also an MPEG encoder (e.g. a DCT-based encoder), then DCT transformation of X - Z in eqn. 8.7 would result in:

(8.8)   T(X - Z) = TT^-1Q = Q

where again the orthonormality of the transform is employed. Thus the second layer transform coefficients are in fact the quantisation distortions of the base layer transform coefficients, Q. For this reason the codec can also be called a coefficient amplitude scalability or quantisation noise scalability unit.

Therefore the second layer of an SNR scalable encoder can be a simple requantiser, as shown in Figure 8.12, without much more complexity than a data partitioning encoder. The only problem with this method of coding is that, since the base layer is normally poor, or at least worse than the enhanced image (base plus the second layer), the prediction used is not good. A better prediction would be a picture of the sum of both layers, as shown in Figure 8.13. Note that the second layer is still encoding the quantisation distortion of the base layer.

Figure 8.13: A two-layer SNR scalable encoder with drift at the base layer

In this encoder, for simplicity, the motion compensation, variable length coding of both layers and the channel buffer have been omitted. In the Figure Qb and Qe are the base and the enhancement layer quantisation step sizes, respectively. The quantisation distortion of the base layer is requantised with a finer precision (Qe < Qb), and then it is fed back to the prediction loop, to represent the coding loop of the enhancement layer. Now compared with data partitioning, this encoder only requires a second quantiser, and so the complexity is not so great.
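The requantisation idea of eqns 8.1–8.8 can be illustrated numerically. The sketch below uses plain uniform quantisers (the real MPEG-2 quantiser includes weighting matrices and a dead zone, which are ignored here): the base layer quantises the coefficients with step size Qb, and the enhancement layer requantises the resulting quantisation distortion with the finer step size Qe.

```python
import numpy as np


def snr_two_layer_quantise(coeffs: np.ndarray, qb: float, qe: float):
    """Sketch of the two-layer SNR idea of Figures 8.12/8.13 with uniform
    quantisers.

    The base layer quantises the DCT coefficients with step size qb; the
    enhancement layer requantises the base layer quantisation distortion Q
    (eqn 8.8) with the finer step size qe (qe < qb)."""
    base_levels = np.round(coeffs / qb)
    base_recon = base_levels * qb
    distortion = coeffs - base_recon          # this is Q of eqn 8.8
    enh_levels = np.round(distortion / qe)
    enh_recon = enh_levels * qe
    return base_recon, base_recon + enh_recon


c = np.array([100.0, -37.0, 12.0, 5.0])
base, enhanced = snr_two_layer_quantise(c, qb=16, qe=4)
print(base)       # coarse base layer reconstruction
print(enhanced)   # base + enhancement: closer to the original coefficients
```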

Note the tight coupling between the two-layer bit streams. For freedom from drift in the enhanced picture, both bit streams should be made available to the decoder. For this reason this type of encoder is called an SNR scalable encoder with drift at the base layer or no drift in the enhancement layer. If the base layer bit stream is decoded by itself then, due to loss of differential refinement coefficients, the decoded picture in this layer will suffer from picture drift. Thus, this encoder is not a true scalable encoder, but is in fact a layered encoder. Again, although the drift should only appear in P-pictures, since B-pictures use P-pictures for predictions, this drift is transferred into B-pictures too. I-pictures reset the distortion and drift is cleaned up.

For applications with the occasional loss of information in the enhancement layer, parts of the picture have the base layer quality, and other parts that of the enhancement layer. Therefore picture drift can be noticed in these areas.

If a true SNR scalable encoder with drift-free pictures at both layers is the requirement, then the coupling between the two layers must be loosened. Applications such as simulcasting of video with two different qualities from the same source need such a feature. One way to prevent picture drift is not to feed back the enhancement data into the base layer prediction loop. In this case the enhancement layer will be intra coded and bit rate will be very high.

In order to reduce the second layer bit rate, the difference between the input to and the output of the base layer (see Figure 8.11) can be coded by another MPEG encoder [11]. However, here we need two encoders and the complexity is much higher than that for data partitioning. To reduce the complexity, we need to code only the quantisation distortion. Since the transformation operator and, most importantly in SNR scalability, the temporal and spatial resolutions of the base and enhancement layer pictures are identical, the motion estimation and compensation can be shared between them. Following the previous eqns 8.1–8.8, we can also simplify the two independent encoders into one encoder generating two bit streams, such that each bit stream is drift-free decodable. Figure 8.14 shows a block diagram of a three-layer truly SNR scalable encoder, where the generated picture of each layer is drift free and can be used for simulcasting [12].

Figure 8.14: A three-layer drift-free SNR scalable encoder

In the Figure, T, B, IT and EC represent transformation, prediction buffer, inverse transformation and entropy coding, respectively, and Qi is the ith layer quantiser. A common motion vector is used at all layers. Note that although this encoder looks to be made of several single-layer encoders, since motion estimation and many coding decisions are common to all the layers, and the motion estimation comprises about 55-70 per cent of the encoding complexity of an encoder, the increase in complexity is moderate. Figure 8.15 shows the block diagram of the corresponding three-layer SNR decoder.

Figure 8.15: A block diagram of a three-layer SNR decoder

After entropy decoding (ED) each layer is inverse quantised and then all are added together to represent the final DCT coefficients. These coefficients are inverse transformed and are added to the motion compensated previous picture to reconstruct the final picture. Note that there is only one motion vector, which is transmitted at the base layer. Note also that the decoder for the encoder of Figure 8.13, with drift in the base layer, is similar to this Figure, but with only two layers of decoding.

Figure 8.16 shows the picture quality of the base layer at 2 Mbit/s. That of the base plus the enhancement layer would be similar to those of the data partitioning, albeit with slightly higher bit rate. At this bit rate the extra bits would be of the order of 15–20 per cent, due to the overhead of the second layer data [11] (also see Figure 8.26). Due to coarser quantisation, some parts of the picture are blocky, as was the case in data partitioning. However, since any significant coefficient can be included at the base layer, the base layer picture of this encoder, unlike that of data partitioning, does not suffer from loss of high frequency information.

Figure 8.16: Picture quality of the base layer of SNR encoder at 2 Mbit/s

Experimental results show that the picture quality of the base layer of an SNR scalable coder is much superior to that of data partitioning, especially at lower bit rates [13]. This is because, at lower base layer bit rates, data partitioning can only retain DC and possibly one or two AC coefficients. Reconstructed pictures with these few coefficients are very blocky.

Spatial scalability

Spatial scalability involves generating two spatial resolution video streams from a single video source such that the base layer is coded by itself to provide the basic spatial resolution, and the enhancement layer, which employs the spatially interpolated base layer, carries the full spatial resolution of the input video source [14]. The base and the enhancement layers may either both use the coding tools of the MPEG-2 standard, or the MPEG-1 standard for the base layer and MPEG-2 for the enhancement layer, or even an H.261 encoder at the base layer and an MPEG-2 encoder at the second layer. Use of MPEG-2 for both layers achieves a further advantage by facilitating interworking between video coding standards. Moreover, spatial scalability offers flexibility in the choice of video formats to be employed in each layer. The base layer can use SIF or even lower resolution pictures at 4:2:0, 4:2:2 or 4:1:1 formats, while the second layer can be kept at CCIR-601 with 4:2:0 or 4:2:2 format. Like the other two scalable coders, spatial scalability is able to provide resilience to transmission errors, as the more important data of the lower layer can be sent over a channel with better error performance, and the less critical enhancement layer data can be sent over a channel with poorer error performance. Figure 8.17 shows a block diagram of a two-layer spatial scalable encoder.

Figure 8.17: Block diagram of a two-layer spatial scalable encoder

An incoming video is first spatially reduced in both the horizontal and vertical directions to produce a reduced picture resolution. For 2:1 reduction, normally a CCIR-601 video is converted into an SIF image format. The filters for the luminance and the chrominance colour components are the 7 and 4 tap filters, respectively, described in section 2.3. The SIF image sequence is coded at the base layer by an MPEG-1 or MPEG-2 standard encoder, generating the base layer bit stream. The bit stream is decoded and upsampled to produce an enlarged version of the base layer decoded video at CCIR-601 resolution. The upsampling is carried out by inserting zero level samples between the luminance and chrominance pixels, and interpolating with the 7 and 4 tap filters, similar to those described in section 2.3. An MPEG-2 encoder at the enhancement layer codes the difference between the input video and the interpolated video from the base layer. Finally, the base and enhancement layer bit streams are multiplexed for transmission into the channel.

If the base and the enhancement layer encoders are of the same type (e.g. both MPEG-2), then the two encoders can interact. This is not only to simplify the two-layer encoder, as was the case for the SNR scalable encoder, but also to make the coding more efficient. Consider a macroblock at the base layer. Due to the 2:1 picture resolution ratio between the enhancement and the base layers, the base layer macroblock corresponds to four macroblocks at the enhancement layer. Similarly, a macroblock at the enhancement layer corresponds to a block of 8 × 8 pixels at the base layer. The interaction takes the form of upsampling the base layer block of 8 × 8 pixels into a macroblock of 16 × 16 pixels, and using it as a part of the prediction in the enhancement layer coding loop.

Figure 8.18 shows a block of 8 × 8 pixels from the base layer that is upsampled and is combined with the prediction of the enhancement layer to form the final prediction for a macroblock at the enhancement layer. In the Figure the base layer upsampled macroblock is weighted by w and that of the enhancement layer by 1 - w.

Figure 8.18: Principle of spatio-temporal prediction in the spatial scalable encoder
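The weighted combination of Figure 8.18 can be sketched as follows; a simple pixel-repeat upsampler stands in for the proper interpolation filters, and the function and weight handling are illustrative rather than the normative spatial scalability prediction.

```python
import numpy as np


def spatio_temporal_prediction(base_block_8x8: np.ndarray,
                               enh_prediction_16x16: np.ndarray,
                               w: float) -> np.ndarray:
    """Sketch of the weighted prediction of Figure 8.18: the co-located
    8x8 base layer block is upsampled 2:1 to 16x16 and mixed with the
    enhancement layer's own (temporal) prediction, weighted by w and 1-w."""
    upsampled = np.kron(base_block_8x8, np.ones((2, 2)))   # 8x8 -> 16x16
    return w * upsampled + (1.0 - w) * enh_prediction_16x16


base = np.full((8, 8), 100.0)          # dummy base layer block
temporal = np.full((16, 16), 120.0)    # dummy enhancement layer prediction
pred = spatio_temporal_prediction(base, temporal, w=0.5)
print(pred[0, 0])                      # 110.0
```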

More details of the spatial scalable encoder are shown in Figure 8.19. The base layer is a nonscalable MPEG-2 encoder, where each block of this encoder is upsampled, interpolated and fed to a weighting table (WT). The coding elements of the enhancement layer are shown without the motion compensation, variable length code and the other coding tools of the MPEG-2 standard. A statistical table (ST) sets the weighting table elements. Note that the weighted base layer macroblocks are used in the prediction loop, which will be subtracted from the input macroblocks. This part is similar to taking the difference between the input and the decoded base layer video and coding their differences by a second layer encoder, as was illustrated in the general block diagram of this encoder in Figure 8.17.

Figure 8.19: Details of spatial scalability encoder

Figure 8.20 shows a single shot of the base layer picture at 2 Mbit/s. The picture produced by the base plus the enhancement layer at 8 Mbit/s would be similar to that resulting from data partitioning, shown in Figure 8.10. Note that, since the picture size is one quarter of the original, the 2 Mbit/s allocated to the base layer is sufficient to code the base layer pictures at almost the same quality as the base plus the enhancement layer at 8 Mbit/s. An upsampled version of the base layer picture, enlarged to the CCIR-601 display size, is also shown in the Figure. Comparing this picture with those of data partitioning and the simple version of the SNR scalable coder, it can be seen that the picture is almost free from blockiness. However, some very high frequency information is still missing, and aliasing distortions due to the upsampling are introduced into the picture. Note that the base layer picture can be used alone without picture drift; this was not the case for the data partitioning and simple SNR scalable encoders. The price paid is that this encoder is made up of two MPEG encoders and is more complex than the data partitioning and SNR scalable encoders. Note also that, unlike the true SNR scalable encoder of Figure 8.14, here, due to the difference in picture resolution between the base and enhancement layers, the same motion vector cannot be used for both layers.

Figure 8.20: (a) Base layer picture of a spatial scalable encoder at 2 Mbit/s; (b) its enlarged version

Temporal scalability

Temporal scalability is a tool intended for use in a diverse range of video applications, from telecommunications to HDTV. In such applications, migration from lower temporal resolution systems to higher temporal resolution systems may be necessary. In many cases, the lower temporal resolution video systems may be either the existing systems or the less expensive early generation systems, with the more sophisticated systems introduced gradually.

Temporal scalability involves partitioning of video frames into layers, in which the base layer is coded by itself to provide the basic temporal rate and the enhancement layer is coded with temporal prediction with respect to the base layer. The layers may have either the same or different temporal resolutions, which, when combined, provide full temporal resolution at the decoder. The spatial resolution of frames in each layer is assumed to be identical to that of the input video. The video encoders of the two layers may not be identical. The lower temporal resolution systems may only decode the base layer to provide basic temporal resolution, whereas more sophisticated systems of the future may decode both layers and provide high temporal resolution video while maintaining interworking capability with earlier generation systems.

Since in temporal scalability the input video frames are simply partitioned between the base and the enhancement layer encoders, the encoder need not be more complex than a single-layer encoder. For example, a single-layer encoder may be switched between the two base and enhancement modes to generate the base and the enhancement bit streams alternately. Similarly, a decoder can be reconfigured to decode the two bit streams alternately. In fact, the B-pictures in MPEG-1 and MPEG-2 provide a very simple temporal scalability that is encoded and decoded alongside the anchor I and P-pictures within a single codec. I and P-pictures are regarded as the base layer, and the B-pictures become the enhancement layer. Decoding of I and P-pictures alone will result in the base pictures with low temporal resolution, and when added to the decoded B-pictures the temporal resolution is enhanced to its full size. Note that, since the enhancement data does not affect the base layer prediction loop, both the base and the enhanced pictures are free from picture drift.

Figure 8.21 shows the block diagram of a two-layer temporal scalable encoder. In the Figure, a temporal demultiplexer partitions the input video into the base and enhancement layer input pictures. For the 2:1 temporal scalability shown in the Figure, the odd numbered pictures are fed to the base layer encoder and the even numbered pictures become inputs to the second layer encoder. The encoder at the base layer is a normal MPEG-1, MPEG-2 or any other encoder. Again, for greater interaction between the two layers, either to simplify encoding or to make it more efficient, both layers may employ the same type of coding scheme.

Figure 8.21: A block diagram of a two-layer temporal scalable encoder

At the base layer the lower temporal resolution input pictures are encoded in the normal way. Since these pictures can be decoded independently of the enhancement layer, they do not suffer from picture drift. The second layer may use prediction from the base layer pictures or from its own pictures, as shown for frame 4 in the Figure. Note that at the base layer some pictures may be coded as B-pictures, using their own previous and future pictures or their interpolation as prediction, but it is essential that some pictures are coded as anchor pictures. In the enhancement layer, on the other hand, pictures can be coded in any mode, although for greater compression most, if not all, of the enhancement layer pictures are coded as B-pictures. These B-pictures have the choice of using past, future and interpolated predictions, from either the base or the enhancement layer.
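The 2:1 temporal demultiplexing described above can be sketched as follows; frame numbering from 1 in display order is assumed, so odd numbered frames feed the base layer and even numbered frames the enhancement layer.

```python
def temporal_demultiplex(frames):
    """Split an input picture sequence 2:1 between the base and enhancement layers.

    frames: pictures in display order, numbered 1, 2, 3, ...
    Returns (base_layer_pictures, enhancement_layer_pictures).
    """
    base = frames[0::2]         # frames 1, 3, 5, ... -> base layer encoder
    enhancement = frames[1::2]  # frames 2, 4, 6, ... -> enhancement layer encoder
    return base, enhancement

# Decoding the base layer alone gives half the frame rate; adding the
# enhancement layer restores full temporal resolution without picture drift.
base, enh = temporal_demultiplex(list(range(1, 9)))
assert base == [1, 3, 5, 7] and enh == [2, 4, 6, 8]
```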

Hybrid scalability

MPEG-2 allows combination of individual scalabilities such as spatial, SNR or temporal scalability to form hybrid scalability for certain applications. If two scalabilities are combined, then three layers are generated and they are called the base layer, enhancement layer 1 and enhancement layer 2. Here enhancement layer 1 is a lower layer relative to enhancement layer 2, and hence decoding of enhancement layer 2 requires the availability of enhancement layer 1. In the following some examples of hybrid scalability are shown.

8.5.6.1 Spatial and temporal hybrid scalability

Spatial and temporal scalability is perhaps the most common use of hybrid scalability. In this mode the three-layer bit streams are formed by using spatial scalability between the base and enhancement layer 1, while temporal scalability is used between enhancement layer 2 and the combined base and enhancement layer 1, as shown in Figure 8.22.

Figure 8.22: Spatial and temporal hybrid scalability encoder

In this Figure, the input video is temporally partitioned into two lower temporal resolution image sequences In-1 and In-2. The image sequence In-1 is fed to the spatial scalable encoder, where its reduced version, In-0, is the input to the base layer encoder. The spatial encoder then generates two bit streams, for the base and enhancement layer 1. The In-2 image sequence is fed to the temporal enhancement encoder to generate the third bit stream, enhancement layer 2. The temporal enhancement encoder can use the locally decoded pictures of a spatial scalable encoder as predictions, as was explained in section 8.5.4.

8.5.6.2 SNR and spatial hybrid scalability

Figure 8.23 shows a three-layer hybrid encoder employing SNR scalability and spatial scalability. In this encoder the SNR scalability is used between the base and enhancement layer 1, and the spatial scalability is used between layer 2 and the combined base and enhancement layer 1. The input video is spatially downsampled (reduced) to a lower resolution image sequence, In-1, which is fed to the SNR scalable encoder. The output of this encoder forms the base and enhancement layer 1 bit streams. The locally decoded pictures from the SNR scalable coder are upsampled to full resolution to form the prediction for the spatial enhancement encoder.

Figure 8.23: SNR and spatial hybrid scalability encoder

8.5.6.3 SNR and temporal hybrid scalability

Figure 8.24 shows an example of an SNR and temporal hybrid scalability encoder. The SNR scalability is performed between the base layer and the first enhancement layer. The temporal scalability is used between the second enhancement layer and the locally decoded picture of the SNR scalable coder. The input image sequence is partitioned by a temporal demultiplexer into two image sequences, which are fed to the individual encoders.

Figure 8.24: SNR and temporal hybrid scalability encoder

8.5.6.4 SNR, spatial and temporal hybrid scalability

The three scalable encoders might be combined to form a hybrid coder with a larger number of levels. Figure 8.25 shows an example of four levels of scalability, using all the three scalability tools mentioned.

Figure 8.25: SNR, spatial and temporal hybrid scalability encoder

The temporal demultiplexer partitions the input video into image sequences In-1 and In-2. Image sequence In-2 is coded at the highest enhancement layer (enhancement 3), with the prediction from the lower levels. The image sequence In-1 is first downsampled to produce a lower resolution image sequence, In-0. This sequence is then SNR scalable coded to provide the base and the first enhancement layer bit streams. An upsampled and interpolated version of the SNR scalable decoded video forms the prediction for the spatial enhancement encoder. The output of this encoder results in the second enhancement layer bit stream (enhancement 2).

Figure 8.25 is just one example of how the various scalability tools can be combined to produce bit streams of various degrees of importance. Depending on the application, the formation of the base layer and the hierarchy of the higher enhancement layers may be defined differently to suit the application. For example, when the above scalability methods are applied to each of the I, P and B-pictures, which have different levels of importance, their layered versions can increase the number of layers even further.

Overhead due to scalability

Although scalability or layering techniques provide a means of delivering a better video quality to the receivers than do single-layer encoders, this is done at the expense of higher encoder complexity and higher bit rate. We have seen that data partitioning is the simplest form of layering, and spatial scalability the most complex one. The amount of extra bits generated by these scalability techniques is also different.

Data partitioning is a single-layer encoder, but inclusion of the priority break point (PBP) in the zigzag scanning path, and the fact that the zero run of the zigzag scan is now broken into two parts, incur some additional bits. These extra bits, along with the redundant declaration of the macroblock addresses at both layers, generate some overhead over the single-layer coder. Our investigations show that this overhead is of the order of 3–4 per cent of the single-layer bit rate, almost irrespective of the percentage of the total bit rate assigned to the base layer.

In SNR scalability the second layer codes the quantisation distortions of the base layer, plus the other addressing information. The additional bits over the single layer depend on the relationship between the quantiser step sizes of the base and enhancement layers, and consequently on the percentage of the total bits allocated to the base layer. At the lower percentages, the quantiser step size of the base layer is large and hence the second layer efficiently codes any residual base layer quantisation distortions. This is very similar to successive approximation (two sets of bit planes), hence it is not expected that the SNR scalable coding efficiency will be much worse than the single layer.

At the higher percentages, the quantiser step sizes of the base and enhancement layers become close to each other. For a base layer quantiser step size of Qb, the maximum quantisation distortion of the quantised coefficients is Qb/2, and that of the nonquantised coefficients falling in the dead zone is Qb. Hence, as long as the enhancement quantiser step size satisfies Qb > Qe > Qb/2, none of the significant base layer coefficients is recoded by the enhancement layer, except of course the ones in the dead zone of the base layer. Thus both layers code the data efficiently, that is each coefficient is coded either at the base layer or at the enhancement layer, and the overall coding efficiency is not worse than that of the single layer. Reducing the base layer bit rate from its maximum value means increasing Qb. As long as Qb/2 < Qe, none of the base layer quantisation distortions (except those in the dead zone) can be coded by the enhancement layer. Hence the enhancement layer does not improve the picture quality noticeably, and since the base layer is coded at a lower bit rate, the overall quality will be worse than that of the single layer. The worst quality occurs when Qe = Qb/2.
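This argument can be checked numerically. The sketch below assumes a simple uniform quantiser with a dead zone of one step size around zero, which is only an approximation of the MPEG-2 inter quantiser characteristic; names and values are illustrative.

```python
def quantise(value, q):
    """Uniform quantiser with a dead zone of +/- q around zero (levels truncated)."""
    level = int(value / q)        # int() truncates towards zero, creating the dead zone
    return level, level * q       # (transmitted level, reconstructed value)

def snr_scalable(coeff, qb, qe):
    """Code a coefficient at the base layer, then recode its error at the enhancement layer."""
    _, base_rec = quantise(coeff, qb)
    residual = coeff - base_rec            # base layer quantisation distortion
    _, enh_rec = quantise(residual, qe)
    return base_rec, residual, base_rec + enh_rec

# With Qb = 16 and Qb/2 < Qe < Qb, a residual of at most Qb/2 = 8 falls inside the
# enhancement layer dead zone, so the second layer cannot refine the base layer.
print(snr_scalable(23, qb=16, qe=10))   # (16, 7, 16): the residual of 7 is lost
# With a much finer enhancement quantiser the residual is recovered.
print(snr_scalable(23, qb=16, qe=2))    # (16, 7, 22): most of the residual is coded
```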

If the aim is to produce the same picture quality, then the bit rate of the SNR scalable coder has to be increased, as shown in Figure 8.26. In this Figure the overall bit rate of the SNR scalable coder is increased over the single layer such that the picture quality under both encoders is identical. The percentage of the total bits assigned to the base layer is varied from its minimum value to its maximum value. As we see, the poorest performance of the SNR scalable coder (the highest overhead) occurs when the base layer is allocated about 40–50 per cent of the total bit rate. In fact, at this point the average quantiser step size of the enhancement layer is half that of the base layer. This maximum overhead is about 30 per cent. The overhead of data partitioning is also shown, which remains at about three per cent irrespective of the proportion of bits assigned to the base layer.

Figure 8.26: Increase in bit rate due to scalability

In spatial scalability the smaller size picture of the base layer is upsampled and its difference with the input picture is coded by the enhancement layer. Hence the enhancement layer, in addition to the usual redundant addressing, has to code two new items of information. One is the aliasing distortion of the base layer due to upsampling and the other is the quantisation distortion of the base layer, if there is any.

At the very low percentages of bits assigned to the base layer, both of these distortions are coded efficiently by the enhancement layer. As the percentage of bits assigned to the base layer increases, similar to SNR scalability, the overhead increases too. At the point where the quantiser step sizes of both layers become equal, Qb = Qe, any further increase in the base layer bit rate (making Qb < Qe) means that the enhancement layer cannot improve the distortions of the base layer any further. Beyond this point, aliasing distortion is the dominant distortion and any increase in the base layer bit rate is wasted. Thus, as the bit rate budget of the base layer increases, the overhead increases too, as shown in Figure 8.26. This differs from the behaviour of SNR scalability.

In fact, in spatial scalability with a fixed total bit rate, increasing the base layer bit rate beyond the critical point of Qb = Qe reduces the enhancement layer bit rate budget. In this case, increasing the base layer bit rate increases the aliasing distortion and hence, as the base layer bit rate increases, the overall quality decreases.

In temporal scalability, in contrast to the other scalability methods, the bit rate can in fact be less than that of a single-layer encoder. This is because there is no redundant addressing to create overhead. Moreover, since the enhancement layer pictures have more choice for their optimum prediction, from either the base or the enhancement layer, they are coded more efficiently than in the single layer. A good example is the B-pictures in MPEG-2, which can be coded at a much lower bit rate than the P-pictures. Thus temporal scalability can in fact be slightly more efficient than single-layer coding.

Applications of scalability

Considering the nature of the basic scalability of data partitioning, SNR, spatial and temporal scalability and their behaviour with regard to picture drift and the overhead, suitable applications for each method may be summarised as:

  1. Data partitioning: this mode is the simplest of all but, since it has a poor base layer quality and is sensitive to picture drift, it can be used in environments where there is rarely any loss of enhancement data (e.g. loss rate < 10^-6). Hence the best application would be video over ATM networks, where, through admission control, the loss ratio can be maintained at low levels [15].
  2. SNR scalability: in this method two pictures of the same spatio-temporal resolution are generated, but one has a lower picture quality than the other. SNR scalability generally has a higher bit rate than nonscalable encoders, but can give a good base picture quality and can be drift free. Hence suitable applications can be:

    • transmission of video at different quality of interest, such as multiquality video, video on demand, broadcasting of TV and enhanced TV
    • video over networks with high error or packet loss rates, such as the Internet or heavily congested ATM networks.
  3. Spatial scalability: this is the most complex form of scalability, where each layer requires a complete encoder/decoder. Such a loose dependency between the layers has the advantage that each layer is free to use any codec, with different spatio-temporal and quality resolutions. Hence there can be numerous applications for this mode, such as:

    • interworking between two different standard video codecs (e.g. H.263 and MPEG-2)
    • simulcasting of drift free good quality video at two spatial resolutions, such as standard TV and HDTV
    • distribution of video over computer networks
    • video browsing
    • reception of good quality low spatial resolution pictures over mobile networks
    • similar to other scalable coders, transmission of error resilience video over packet networks.
  4. Temporal scalability: this is a moderately complex encoder, where either a single-layer coder encodes both layers, such as coding of B and the anchor I and P-pictures in MPEG-1 and 2, or two separate encoders operating at two different temporal rates. The major applications can then be:

    • migration to progressive (HDTV) from the current interlaced broadcast TV
    • internetworking between lower bit rate mobile and higher bit rate fixed networks
    • video over LANs, Internet and ATM for computer workstations
    • video over packet (Internet/ATM) networks for loss resilience.

Tables 8.4, 8.5 and 8.6 summarise a few applications of various scalability techniques that can be applied to broadcast TV. In each application, parameters of the base and enhancement layers are also shown.

Table 8.4: Applications of SNR scalability

  Base layer              Enhancement layer                           Application
  ITU-R 601               same resolution and format as lower layer   two quality service for standard TV
  high definition         same resolution and format as lower layer   two quality service for HDTV
  4:2:0 high definition   4:2:2 chroma simulcast                      video production/distribution

Table 8.5: Applications of spatial scalability

  Base                  Enhancement           Application
  progressive (30 Hz)   progressive (30 Hz)   CIF/QCIF compatibility or scalability
  interlace (30 Hz)     interlace (30 Hz)     HDTV/SDTV scalability
  progressive (30 Hz)   interlace (30 Hz)     ISO/IEC 11172-2 compatibility with this specification
  interlace (30 Hz)     progressive (60 Hz)   migration to HR progressive HDTV

Table 8.6: Applications of temporal scalability

  Base                  Enhancement           Higher                Application
  progressive (30 Hz)   progressive (30 Hz)   progressive (60 Hz)   migration to HR progressive HDTV
  interlace (30 Hz)     interlace (30 Hz)     progressive (60 Hz)   migration to HR progressive HDTV

Video broadcasting

Currently more than ninety-five percent of MPEG-2 coded video is for broadcasting applications, carried via terrestrial, satellite and cable TV networks to homes. In Europe, the standard ITU-R 601 video is encoded in the range of 3–8 Mbit/s, depending on the scene content of the video. The lower end of the bit rate is for head-and-shoulders type video such as the video clip of a news reader, and the higher bit rates are required for the critical scenes such as sports programs, similar to the snap shot shown in Figure 8.10. Normally, scenes with grass and tree leaves, which have detailed texture and random motion due to wind, if they appear alongside a plain scene, like a lake or stream, are the most difficult to code. Random motion of the detailed area makes motion estimation useless, and for a limited bit rate budget, increase in quantiser step size will cause blocking artefacts in the plain areas. For HDTV video, the required bit rate is of the order of 20 Mbit/s.

In both terrestrial and satellite TV, for better channel utilisation, several TV programs may be multiplexed and then digitally modulated on to a carrier. At the destination, the receiver, known as the set top box, separates the channels, decodes each program and feeds the individual analogue signals to the television set for display. Although the same multiplexing technique can be used, because digital modulation techniques for terrestrial and satellite are different, unfortunately the same set top box cannot be used for both.

For multiplexing of TV programs, the individual bit streams are first decoded and then reencoded to a new target bit rate. For optimum multiplexing, the target bit rate for each TV channel is made dependent on the content (statistics) of each program, and hence this is called statistical multiplexing. More complex video is assigned a higher bit rate and, since video complexity varies over time, for optimum statistical multiplexing the video complexity needs to be monitored continuously.

One way of calculating the complexity of a scene in a video program is to define the scene complexity as the sum of the complexity indices of its I, P and B-pictures in a group of pictures (GOP) [16]. For each picture type, the complexity index is the product of its average quantiser step size and its bit rate in that frame. For example, the scene complexity index (SCI) of a video with a GOP structure of N = 12 and M = 3, which has one I, three P and eight B-pictures, is:

(8.9)  SCI = I·QI + 3·P·QP + 8·B·QB

where I, P and B are the target bit rates for the I, P and B pictures, and QI, QP and QB are their respective average quantiser step sizes. After calculating SCI for each TV program, the total bit rate is divided between the TV channels in proportion to their SCI. Values of the SCI can be continuously calculated on a frame by frame basis (within a window of a GOP), to provide optimum statistical multiplexing.
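As an illustration of equation (8.9) and of the proportional bit allocation, the sketch below computes the SCI of each programme and splits the multiplex rate accordingly. The picture counts follow the N = 12, M = 3 GOP of the text; all numerical values and names are illustrative assumptions, not measured data.

```python
GOP_COUNTS = {'I': 1, 'P': 3, 'B': 8}   # N = 12, M = 3

def scene_complexity(rates, qsteps):
    """Scene complexity index of one GOP, equation (8.9).

    rates  : bits spent on each picture type, e.g. {'I': ..., 'P': ..., 'B': ...}
    qsteps : average quantiser step size used for each picture type
    """
    return sum(GOP_COUNTS[t] * rates[t] * qsteps[t] for t in GOP_COUNTS)

def statistical_multiplex(channel_rate, programmes):
    """Divide the multiplex bit rate between programmes in proportion to their SCI."""
    scis = [scene_complexity(r, q) for r, q in programmes]
    return [channel_rate * s / sum(scis) for s in scis]

# Two programmes sharing one multiplex (the 27 Mbit/s figure is illustrative):
# a quiet scene and a busy sports scene.
quiet  = ({'I': 0.4, 'P': 0.2, 'B': 0.1}, {'I': 8,  'P': 10, 'B': 12})
sports = ({'I': 0.8, 'P': 0.5, 'B': 0.3}, {'I': 16, 'P': 20, 'B': 24})
print(statistical_multiplex(27.0, [quiet, sports]))   # the sports scene gets the larger share
```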

One of the main attractions of digital satellite TV is the benefit of broadcasting many TV programmes from a single transponder. In the analogue era, one satellite transponder with a bandwidth of 36 MHz could accommodate only one frequency modulated (FM) TV programme, whereas currently about 6–8 digital TV programmes can be multiplexed into 27 Msymbol/s and are accommodated in the same transponder. There are even stations that squeeze about 10–15 digital TV programmes into a transponder, albeit at a slightly lower video quality. In addition to this increase of the number of TV channels, the required transmitted power for digital can be of the order of 10–20 per cent of analogue, or for the same power, the satellite dishes can be made much smaller (45–60 cm diameter dishes compared to 80 cm used in analogue). Digital terrestrial TV also benefits from the low power transmitters.

In digital terrestrial TV, normally one programme is digitally modulated onto an 8 MHz (European) UHF channel [17]. The bit stream prior to channel modulation is orthogonal frequency division multiplexed (OFDM) onto 1705 carriers (the 2K mode; an 8K mode with 6817 carriers is also an option) and the carrier modulation is 64-QAM. At a higher modulation order (e.g. 256-QAM) it is even possible to accommodate an 18–24 Mbit/s bit stream in the same 8 MHz UHF channel, making it possible to multiplex 2–3 digital programmes (or even more at poorer quality) into one existing analogue UHF terrestrial channel.

Since at the baseband a 4–8 Mbit/s MPEG-2 video is OFDM modulated onto almost 2000 carriers, each carrier conveys data at a rate of only 2–4 kbit/s. Such a low data rate per carrier (and hence a long symbol interval) is very robust against interference, which appears as a short high frequency burst of noise and can be cleaned up easily. Thus OFDM is particularly attractive for ghost-free TV broadcasting in big cities, where multiple reflections from tall buildings create interference (a common problem with analogue TV). Moreover, it is possible to cover the whole broadcast TV network (nationwide programmes, not regional programmes) with a single frequency, since interference is not a problem. This releases a large amount of wireless bandwidth for other communication services. Finally, as with satellite, the transmitter power can be reduced by a factor of ten.
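A back-of-the-envelope check of the per-carrier rate quoted above, assuming the 2K mode carrier count; the figures are approximate.

```python
# Approximate per-carrier bit rate when a baseband MPEG-2 stream is spread over
# the carriers of a 2K mode OFDM multiplex.
carriers = 1705                    # DVB-T 2K mode
for video_rate in (4e6, 8e6):      # 4 and 8 Mbit/s MPEG-2 video
    per_carrier = video_rate / carriers
    print(f"{video_rate / 1e6:.0f} Mbit/s -> {per_carrier / 1e3:.1f} kbit/s per carrier")
# prints roughly 2.3 and 4.7 kbit/s per carrier, i.e. of the order of 2-4 kbit/s
```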

The price paid for all these benefits of digital TV is its sensitivity to channel errors. During heavy rain or snow, pictures become blocky or, in more severe cases, the picture is lost completely (picture freeze). This is the main disadvantage of digital TV: in analogue TV, weaker reception merely causes snowy pictures, which is better than the picture break-up or freeze of digital TV. To alleviate this problem, layered video coding with unequal error protection of the various layers may be used. Alternatively, one may use the more intelligent technique of distributing the transmitter power among the layers, such that the picture quality degrades gradually, closer to the quality degradation of analogue TV.

Digital versatile disc (DVD)

Digital versatile/video disc (DVD) is a new storage medium for MPEG-2 coded high quality video. DVD discs with 9 Gbytes storage capacity (in two tracks of 4.5 Gbytes) were introduced to replace the 648 (or 700) Mbyte CD-ROM. The main reason for the introduction of this new product is that viewers' expectations of video quality have grown over time. CD-ROMs could only store MPEG-1 compressed video of SIF format at a target rate of about 1.2 Mbit/s. When SIF pictures are enlarged to the standard size (e.g. 720 pixels by 576 lines) for display on TV sets, for certain scenes the enlarged pictures look blocky. This quality is usually not sufficient for home movies or HDTV programmes.

In DVD, video of the CCIR-601 standard size is MPEG-2 compressed. Considering double-track DVD discs with a total capacity of 9 Gbytes, a nominal movie of 90 minutes can be coded at an average bit rate of 6–12 Mbit/s, depending on whether one or both tracks are used.

To increase the video quality and at the same time optimise the storage capacity, the MPEG-2 encoder is set to encode the video at a variable bit rate (VBR). This is done by fixing the quantiser step size at a constant value, producing video of almost constant quality over the entire programme, irrespective of scene complexity. Due to the constant quantiser step size, during high picture activity the instantaneous bit rate of the encoder can be very high (e.g. 30 Mbit/s). However, these events occur only for short periods, and there are occasions when the scenes are very quiet, producing lower bit rates (e.g. 2 Mbit/s). Depending on the proportions of the scene activities in the video, the peak-to-mean bit rate ratio, even smoothed over a GOP, can be of the order of 3–5 (smoothed over one frame, the peak-to-mean ratio can easily rise above 10). Thus, had the video been coded at a constant bit rate (CBR), then for the same picture quality as VBR the target bit rate would have to be set to the peak bit rate, and for quiet scenes the storage capacity of the disc would be wasted. In fact the advantage of VBR over CBR is the saving in storage capacity by the ratio of the peak bit rate to the mean bit rate, which can be considerable.
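The saving of VBR over CBR can be illustrated with a small calculation; the segment durations and bit rates below are invented for illustration only and are not measurements from a real DVD.

```python
def storage_gbytes(segments):
    """Storage needed for a programme given (duration in minutes, bit rate in Mbit/s) segments."""
    bits = sum(minutes * 60 * rate * 1e6 for minutes, rate in segments)
    return bits / 8 / 1e9

# Illustrative 90 minute programme: a short burst of high activity and long quieter stretches.
vbr_profile = [(5, 15.0), (25, 8.0), (60, 4.0)]
peak_rate = max(rate for _, rate in vbr_profile)

vbr_size = storage_gbytes(vbr_profile)            # what VBR actually stores
cbr_size = storage_gbytes([(90, peak_rate)])      # CBR pinned at the peak rate for equal quality
print(f"VBR {vbr_size:.1f} Gbytes, CBR {cbr_size:.1f} Gbytes, "
      f"saving factor {cbr_size / vbr_size:.1f}")  # roughly 2.6 with these figures
```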

The main problem with VBR is that a chunk of compressed data read from the disc decodes into a variable number of pictures per unit time. For a uniform and smooth display (e.g. 25 pictures per second), the data read from the disc has to be smoothed. This is done by writing it into a random access memory (RAM) and reading it out at the rate required by the decoder. Considering that today even notebook computers are equipped with 256–512 Mbytes of RAM, such memory is not too expensive to include in DVD decoders, and it is sufficient to store about 5–10 minutes of the programme, well over what is needed to display pictures without interruption.

Video over ATM networks

MPEG-2, and in particular layered video coding, and the asynchronous transfer mode (ATM) networks have a very strong link. They were introduced at about the same time (the early 1990s) and influenced each other's development. The cell loss priority in ATM is a direct product of the success of two-layer video coding in delivering a minimum acceptable picture quality [9]. Selection of the 188-byte packet size for the MPEG-2 transport stream was influenced by the ATM cell size. An ATM cell (packet) is 53 bytes long, with a five-byte header and a 48-byte payload. One of the ATM adaptation layers (AAL), where the data to be transported is interfaced to the channel, called AAL1, accepts 47 bytes of raw data and adds one byte of synchronisation and other information to the payload [15]. Hence each MPEG-2 transport packet can be carried in exactly four ATM cells.
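The packing of one transport packet into four cells can be sketched as below. The one-byte AAL1 overhead is modelled as a bare sequence number and the five-byte ATM header as zero bytes, both simplifications of the real field layouts.

```python
TS_PACKET = 188     # MPEG-2 transport stream packet, bytes
AAL1_PAYLOAD = 47   # video bytes per cell (48 byte payload minus 1 byte AAL1 header)
ATM_HEADER = 5
CELL_SIZE = 53

def pack_ts_packet(ts_packet, seq=0):
    """Split one 188 byte transport packet into four 53 byte ATM cells (AAL1-style).

    The ATM header and the AAL1 byte are placeholders, not the real field formats.
    """
    assert len(ts_packet) == TS_PACKET
    cells = []
    for n, offset in enumerate(range(0, TS_PACKET, AAL1_PAYLOAD)):   # exactly 4 cells
        header = bytes(ATM_HEADER)                                   # placeholder cell header
        aal1_byte = bytes([(seq + n) % 8])                           # simplified sequence number
        cells.append(header + aal1_byte + ts_packet[offset:offset + AAL1_PAYLOAD])
    return cells

cells = pack_ts_packet(bytes(TS_PACKET))
assert len(cells) == 4 and all(len(c) == CELL_SIZE for c in cells)
```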

ATM is a slotted channel, where each 53-byte cell is seized by the server to insert its data for transmission. If the server has nothing to send, the cell is left empty. Hence, the source can send its data at a variable rate. Variable bit rate (VBR) video transmission is particularly attractive since compressed video is variable in rate by nature. With a constant quantiser step size, the video is coded at almost constant quality. At low picture activity (low motion and low texture) less bits are generated and at high picture activity more bits are generated. Such variable bit rate transmission makes statistical multiplexing even more effective than the one used with fixed bit rate broadcast TV. It is easy to show that more variable bit rate services can be accommodated in a given channel than the fixed bit rate services.

The main problem with VBR transmission is that, if bursts of data from various services occur at the same time, there will be more traffic than the network can handle and it will be congested. In this case, cells carrying visual information might be excessively delayed. There is a maximum tolerable delay beyond which late arrival cells will be of no use. Either the switching nodes or the receiver can discard these cells. In the former case, the cell discard is due to the limited capacity of the switching multiplex buffer, and in the latter, the received information is too late to be of any use by the decoder. In both cases, loss of cells leads to degradation in picture quality.

The cell loss priority bit in ATM cells coupled with two-layer video coding can enhance the video quality significantly. Here the base layer video is assigned high priority and the enhancement layer lower priority. In the event of network congestion low priority cells (enhancement data) can be discarded and room made available to the high priority cells (base layer). For example, even in the normal MPEG-2 with a GOP structure of N = 12 and M = 3, which can be regarded as temporal scalability, during the network congestion all the B-pictures can be temporarily discarded to make room for the I and P-pictures.

In ATM networks, in addition to layering, the packetisation strategy also plays an important role in the video quality. One form of packetisation may confine the effect of a lost packet to a small area of the picture, while other methods may spread degradation to a larger area. With AAL1 packetisation [15], where every 47 bytes of the bit stream are packed into the ATM cell payload without any further processing, if a cell is lost, the following cells may not be recoverable until the next slice or GOB. Thus a large part of a picture slice may be degraded, depending on the location of the lost macroblock. This problem can be overcome by making the first macroblock of each cell absolutely addressed, hence the loss can be confined to a smaller area of the picture [18]. Let us call this method of packing AALx, as shown in Figure 8.27.

Figure 8.27: Structure of AAL1 and AALx cells

In AALx, where the first macroblock in each ATM cell is absolutely addressed, the lost area could be confined to the area covered by the lost cell. All following cells could then be decodable. For the decoder to be able to recognise the absolute address, an additional 11-bit header (absolute address header) must be inserted before the address. Also, the average length of the relative addressing is normally two bits, whereas the absolute address can be nine bits long, resulting in an additional seven bits [18]. Thus AALx has an almost five per cent extra overhead compared to AAL1. Referring to the multiplex cell discard graphs, this can result in five to ten times more cell loss, depending on the network load and the number of channels in the multiplex [19].
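The five per cent figure can be checked from the bit counts quoted above; the arithmetic below simply restates them.

```python
# Extra bits per AALx cell compared with AAL1: an 11 bit absolute-address header
# plus roughly 7 bits of address growth (9 bit absolute vs ~2 bit relative address).
extra_bits_per_cell = 11 + (9 - 2)
payload_bits = 47 * 8                      # video bits carried in each cell

overhead = extra_bits_per_cell / payload_bits
print(f"AALx overhead over AAL1: {overhead:.1%}")   # about 4.8 per cent
```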

In an experiment, 90 frames of the Salesman image sequence were MPEG-2 coded, with the first frame intra (I-frame) coded and the remaining frames predictively (P-frame) coded (N = ∞, M = 1). Two packetisation methods, AALx and AAL1, were used. The AALx-type cells were discarded according to the ITU-T cell loss model with a cell loss rate of 10^-2 and a mean burst length of 1 (see Appendix E) [20]. Those of AAL1 were discarded at cell loss rates of 10^-3 (ten times lower) and 10^-4 (100 times lower) with the same mean burst length. From Figure 8.28 it can be seen that AALx outperforms AAL1 at the ten times lower cell loss rate, but is inferior to AAL1 at the cell loss rate 100 times lower. Considering that in the experiment AALx is likely to experience five to ten times more loss than AAL1, AALx is the better packetisation scheme for this type of image format (e.g. H.261 or H.263).

Figure 8.28: PSNR of MPEG-2 coded video sequence, GOP (IPPPPPP...): (a) AALx with cell loss rate of 10^-2; (b) AAL1 with cell loss rate of 10^-3; (c) AAL1 with cell loss rate of 10^-4

In another experiment the same 90 frames of the Salesman image sequence were MPEG-2 coded with a GOP structure of N = 12, M = 1. The packetisation techniques were similar to those of the previous experiment. In this case, shown in Figure 8.29, AALx does not show the same improvement over AAL1, as was the case for Figure 8.28. In fact its performance, due to higher overhead, is worse than AAL1 with ten times lower cell loss rate.

Figure 8.29: PSNR of MPEG-2 coded video sequence with 12 frames per GOP (IPP...IPPP...IP...): (a) AALx with cell loss rate of 10^-2; (b) AAL1 with cell loss rate of 10^-3; (c) AAL1 with cell loss rate of 10^-4

The implications of these two experiments are that with MPEG-1 and 2 structures, where there are regular I-pictures every N frames, AAL1 outperforms AALx. But for very large N (e.g. in H.261) AALx is better than AAL1.

It should be noted that the video quality can be improved by concealing the effects of packet loss or channel errors. In MPEG-2 there is an option to derive additional motion vectors for I-pictures and transmit them in the following slice. If some of the macroblocks are damaged, these motion vectors are used to copy pixels from the previous frame, displaced by the amount of the motion vectors, to replace the damaged macroblocks. In the next Chapter more general forms of concealing the side effects of packet losses and channel errors will be discussed in greater depth.

Problems

1. Why are the systems in MPEG-2 different from those in MPEG-1?

2. Which of the following represents level and profile?

  1. 1.5 Mbit/s
  2. SIF
  3. SNR scalability
  4. 720 × 576 pixels.

3. The DCT coefficients of a motion compensated picture block are given as:

    33   -10   -41     3    17     2     7   -13
    61    -5    23    12   -11     5     6    -9
    -3    11     3     9   -15     6     3    -1
     2   -34     6     4     0     1     3     1
   -21    -3     0     5    12     3     0     1
    -7    -5     9     3     2     7    -1    -2
     6     3     2     5     7    -2    -3     1
    -5     4    -2     6     3     1     2     1

They are linearly quantised with th = q, zigzag scanned and the assigned bits are calculated from Figure 6.12. For q = 8, identify the two-dimensional events of (run, index) and the number of bits required to code the block.

4. The block of problem 3 is partitioned into two, and the priority break point (PBP) is set at coefficient (2, 2). Assuming that the PBP can be identified with six bits, and the quantiser step size is q = 8, calculate the number of bits generated in each layer and the total number of bits. (Note: the first DCT coefficient is defined at (0, 0).)

5. The block in problem 3 is SNR scalable coded with base and enhancement quantiser step sizes of 14 and 8, respectively. What are the numbers of bits generated in each layer, and the total number of bits (assume in each layer th = q)?

6. An MPEG-2 coded video with its associated audio and forward error correcting codes comprises 8 Mbit/s. With a 64-QAM modulation, determine how many such videos can be accommodated in a UHF channel of 8 MHz bandwidth, with 2 MHz guard band. Assume each modulated symbol occupies 1.25 Hz of the channel.

7. Draw a two-state channel error model and determine the transition probabilities for each of the following conditions:

  1. bit error rate of P = 10^-5 and burst length of B = 5
  2. bit error rate of P = 10^-5 and burst length of B = 1.

8. Table 8.7 shows the duration of various parts of a 90 minute VBR MPEG-2 coded video stored on a DVD. The given bit rate is smoothed over a GOP and is presented in Mbit/s:

  1. calculate the required storage capacity
  2. calculate the peak-to-mean bit rate ratio
  3. calculate the storage required if the video was coded in CBR at a quality not poorer than the VBR.
Table 8.7: Duration of various picture activities in a DVD program

  Duration [min]      0.5   5    10   20    30   24.5
  Bit rate [Mbit/s]   20    15   10   7.5   5    4

9. The ATM cells with AAL1 adaptation layer have a five-byte header and 48-byte payload, of which 47 bytes are used for packing the video data. If the channel bit error rate is 10^-7, calculate the probability that:

  1. video is decoded erroneously
  2. the cell is lost.

10. The stored DVD video in problem 8 is to be streamed via an ATM network, with a maximum channel capacity of 50 Mbit/s. Due to the other users on the link, on average only 30 per cent of the link capacity can be used by the DVD server. With AAL1 packetisation, calculate the time required to download the entire DVD video stream over the link.

11. The cell loss rate of an ATM link can be modelled with , where 0 ≤ ρ ≤ 1 is the load of the link. Twenty-five video sources, each coded at an average bit rate of 4 Mbit/s, are streamed via a 155 Mbit/s ATM link, with AAL1 adaptation layer. Calculate:

  1. the network load, ρ
  2. the loss rate that each ATM cell may experience.

12. The video sources in problem 11 were two-layer coded with SNR scalability, but the overall video quality was assumed to remain the same. If in each source 50 per cent of its data is assigned to the base layer, and the base layer cells are always served in preference to the enhancement layer cells, calculate the cell loss probability at the:

  1. base layer
  2. enhancement layer

(hint: use Figure 8.26 for the additional overhead due to scalability).

13. Repeat problem 12 for data partitioning.

14. Repeat problem 12 for spatial scalability.

References

1 MPEG-2: 'Generic coding of moving pictures and associated audio information'. ISO/IEC 13818-2 Video, draft international standard, November 1994

2 MPEG-1: 'Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s'. ISO/IEC 11172-2: video, November 1991

3 HASKELL, B.G., PURI, A., and NETRAVALI, A.N.: 'Digital video: an introduction to MPEG-2' (Chapman and Hall, 1997)

4 'Generic coding of moving pictures and associated audio information', ISO/IEC 13818-1 Systems, draft international standard, November 1994

5 ITU-T recommendation I.363: 'B-ISDN ATM adaptation layer (AAL) specification'. June 1992

6 OKUBO, S., McCANN, K., and LIPPMAN, A.: 'MPEG-2 requirements, profile and performance verification', Signal Process., Image Commun., 1995, 7:3, pp.201–209

7 SAVATIER, T.: 'Difference between MPEG-1 and MPEG-2 video'. ISO/IEC JTC1/SC29/WG11 MPEG94/37, March 1994

8 Test model editing committee: 'MPEG-2 video test model 5'. ISO/IEC JTC1/SC29/WG11 doc. N0400, April 1993

9 GHANBARI, M.: 'Two-layer coding of video signals for VBR networks', IEEE J. Sel. Areas Commun., 1989, 7:5, pp.771–781

10 GHANBARI, M.: 'An adapted H.261 two-layer video codec for ATM networks', IEEE Trans. Commun., 1992, 40:9, pp.1481–1490

11 GHANBARI, M., and SEFERIDIS, V.: 'Efficient H.261 based two-layer video codecs for ATM networks', IEEE Trans. Circuits Syst. Video Technol., 1995, 5:2, pp.171–175

12 ITU-T study group XVI: 'Efficient coding of synchronised H.26L streams'. Document VCEG-N35, September 2001

13 HERPEL, C.: 'SNR scalability vs data partitioning for high error rate channels'. ISO/IEC JTC1/SC29/WG11 doc. MPEG 93/658, July 1993

14 MORRISON, G., and PARKE, I.: 'A spatially layered hierarchical approach to video coding', Signal Process., Image Commun., 1995, 5:5–6, pp.445–462

15 ITU-T draft recommendation I.371: 'Traffic control and congestion control in B-ISDN'. Geneva, 1992

16 ROSDIANA, E., and GHANBARI, M.: 'Picture complexity based rate allocation algorithm for transcoded video over ABR networks', Electron. Lett., 2000, 36:6, pp.521–522

17 DVB Blue Book, ftp://dvbftp@ftp.dvb.org/Blue_Books/

18 GHANBARI, M., and HUGHES, C.J.: 'Packing coded video signals into ATM cells', IEEE ACM Trans. Networking, 1993, 1:5, pp.505–509

19 HUGHES, C.J., GHANBARI, M., PEARSON, D.E., SEFERIDIS, V., and XIONG, J.: 'Modelling and subjective assessment of cell discard in ATM video', IEEE Trans. Image Process., 1993, 2:2, pp.212–222

20 ITU SGXV working party XV/I, Experts Group for ATM video coding, working document AVC-205, January 1992
