8.5 Scalability

Scalable video coding is often regarded as being synonymous with layered video coding, which was originally proposed by the author to increase the robustness of video codecs against packet (cell) loss in ATM networks [9]. At the time (late 1980s), H.261 was under development and it was clear that purely interframe coded video from this codec was very vulnerable to loss of information. The idea behind layered coding was that the codec should generate two bit streams: one carrying the most vital video information, called the base layer, and the other carrying the residual information that enhances the base layer image quality, named the enhancement layer. In the event of network congestion, only the less important enhancement data should be discarded, and the space made available for the base layer data. This methodology influenced the formation of the ATM cell structure, which provides two levels of priority for protecting base layer data [5]. This form of two-layer coding is now known as SNR scalability in the MPEG-2 standard, and a variety of new two-layer coding techniques have since been devised. They now form the basic scalability functions of the MPEG-2 standard.

Before describing the various forms of scalability in detail, it is useful to examine the similarities and, more importantly, the dissimilarities between these two coding methods. The next section deals with this in some depth, but since scalability is the name commonly adopted in the video coding standards, throughout the book we use scalability to refer to both methods.

The scalability tools defined in the MPEG-2 specifications are designed to support applications beyond those supported by single-layer video. Among the noteworthy application areas addressed are video telecommunications, video on asynchronous transfer mode (ATM) networks, interworking of video standards, video service hierarchies with multiple spatial, temporal and quality resolutions, HDTV with embedded TV, systems allowing migration to higher temporal resolution HDTV etc. Although a simple solution to scalable video is the simulcast technique, which is based on transmission/storage of multiple independently coded reproductions of video, a more efficient alternative is scalable video coding, in which the bandwidth allocated to a given reproduction of video can be partially reutilised in coding the next reproduction. In scalable video coding, it is assumed that, given an encoded bit stream, decoders of various complexities can decode and display appropriate reproductions of the coded video. A scalable video encoder is likely to have increased complexity when compared with a single-layer encoder. However, the standard provides several different forms of scalability which address nonoverlapping applications with corresponding complexities. The basic scalability tools offered are: data partitioning, SNR scalability, spatial scalability and temporal scalability. Moreover, combinations of these basic scalability tools are also supported and are referred to as hybrid scalability. In the case of basic scalability, two layers of video, referred to as the base layer and the enhancement layer, are allowed, whereas in hybrid scalability up to three layers are supported.

8.5.1 Layering versus scalability

Considering the MPEG-1 and MPEG-2 systems functions, defined in section 8.1, we see that MPEG-2 puts special emphasis on the transport of the bit stream. This is because MPEG-1 video is mainly for storage and software-based decoding applications in an almost error free environment, whereas MPEG-2, or H.262 in the ITU-T standard, is for transmission and distribution of video in various networks. Depending on the application, the emphasis can be put on either transmission or distribution. In the introduction to MPEG-2/H.262 the potential applications for this codec were listed. Most of them involve the transmission of video over networks, such as satellite and terrestrial broadcasting, news gathering, personal communications, video over ATM networks etc., where for better quality of service the bit stream should be protected against channel misbehaviour, which is very common in these environments. One way of protecting data against channel errors is to add some redundancy, such as forward error correction bits, to the bit stream. The overhead is a percentage of the bit rate (depending on the amount of protection) and will be minimal if the part needing protection requires only a small channel rate. Hence it is logical to design the codec in such a way as to generate more than one bit stream, and to protect the most vital bit stream against errors so that a basic quality picture can always be produced. The remaining bit streams should be such that their presence enhances the video quality, but their absence or corruption does not degrade the video quality significantly. Similarly, in ATM networks, partitioning the bit stream into various parts of importance, and then providing a guaranteed channel capacity for a small part of the bit stream, is much easier than dealing with the entire bit stream. This is the fundamental concept behind layered video coding. The most notable point is that the receiver always expects to receive the entire bit stream, but if some parts are not received, the picture will not break up, nor will the quality degrade significantly. Increasing the number of layers, and unequally protecting the layers against errors (cell loss in ATM networks) according to their contribution to video quality, can give a graceful degradation of video quality.

On the other hand, some applications set forth for MPEG-2 are mainly for distribution of digital video to receivers of various capabilities. For example, in cable TV distribution over optical networks (CATV), the prime requirement is to be able to decode pictures of various qualities from a single bit stream. In such a network there might be receivers with various decoding capabilities (processing powers), or customers with different requirements for video quality. Then, according to the need, a portion of the bit stream is decoded for that specific service. Here, it is assumed that the error rate in the optical channels is negligible and that sufficient channel capacity for the bit stream exists; both assumptions are plausible.

Thus, comparing layered coding with scalability, the fundamental difference is that in layered coding the receiver expects to receive the entire bit stream, but occasionally some parts might be in error or missing, while in scalable coding a receiver expects to decode a portion of the bit stream, but when that portion is available, it remains so for the entire communication session. Of course, in scalable coding, for efficient compression, the bit stream is generated in a hierarchical structure such that the basic portion of the bit stream gives a video of minimum acceptable quality, similar to the base layer video. The subsequent segments of the bit stream enhance the video quality accordingly, similar to the enhancement layers. If the channel requires any protection, then the base layer bit stream should be guarded or, in ATM networks, the required channel capacity provided for it. In this respect we can say scalable and layered video coding are the same. This means that scalable coding can be used as a layering technique, but layered coded data may not be scalable. Thus, scalability is a more generic name for layering, and throughout the book we use scalability to address both. Those parts where the video codec acts as a layered encoder but not as a scalable encoder will be particularly identified.

8.5.2 Data partitioning

Data partitioning is a tool intended for use when two channels are available for the transmission and/or storage of a video bit stream, as may be the case in ATM networks, terrestrial broadcasting, magnetic media etc. Data partitioning is not in fact true scalable coding but, as we will see, a layered coding technique. It is a means of dividing the bit stream of a single-layer MPEG-2 into two parts, or two layers. The first layer comprises the critical parts of the bit stream (such as headers, motion vectors and lower order DCT coefficients), which are transmitted in the channel with the better error performance. The second layer is made of less critical data (such as higher order DCT coefficients) and is transmitted in the channel with poorer error performance. Thus degradation due to channel errors is minimised, since the critical parts of the bit stream are better protected. Neither channel's data can be decoded by a decoder that is not designed for data partitioned bit streams. Even with the proper decoder, data extracted from the second layer cannot be used unless the decoded base layer data is available.

A block diagram of a data partitioning encoder is shown in Figure 8.8. The single-layer encoder is in fact a nonscalable MPEG-2 video encoder that may or may not include B-pictures. At the encoder, during the quantisation and zigzag scanning of each 8 × 8 block of DCT coefficients, the scanning is broken at the priority break point (PBP), as shown in Figure 8.9.

Figure 8.8: Block diagram of a data partitioning encoder

Figure 8.9: Position of the priority break point in a block of DCT coefficients

The first part of the scanned quantised coefficients after variable length coding, with the other overhead information such as motion vectors, macroblock types and addresses etc., including the priority break point (PBP), is taken as the base layer bit stream. The remaining scanned and quantised coefficients plus the end of block (EOB) code constitute the enhancement layer bit stream. Figure 8.9 also shows the position of the priority break point in the DCT coefficients.

The base and the enhancement layer bit streams are then multiplexed for transmission into the channel. For prioritised transmission, such as in ATM networks, each bit stream is first packetised into high and low priority cells and the cells are multiplexed. At the decoder, knowing the position of the PBP, a block of DCT coefficients is reconstructed from the two bit streams. Note that the PBP indicates the last DCT coefficient of the base layer. Its position is determined at the encoder by the proportion of the total bit rate allocated to the base layer.
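
To make the split concrete, the following Python/numpy sketch partitions one block of quantised coefficients at a hypothetical PBP and rebuilds it at the decoder. Only the coefficient split is modelled; headers, motion vectors, variable length coding and the exact MPEG-2 PBP signalling are left out, and all names are illustrative.

```python
# A minimal sketch of data partitioning at the block level, assuming the
# coefficients are already quantised. Not the MPEG-2 syntax, just the idea.
import numpy as np

def zigzag_order(n=8):
    """(row, col) visiting order of the standard zigzag scan."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def partition_block(quantised_block, pbp):
    """Break the zigzag scan at the PBP: the first `pbp` coefficients go to
    the base layer, the rest (plus the EOB code, not modelled here) to the
    enhancement layer."""
    scan = [quantised_block[r, c] for r, c in zigzag_order()]
    return scan[:pbp], scan[pbp:]

def merge_block(base_part, enh_part):
    """Decoder: knowing the PBP, rebuild the 8 x 8 block from both parts.
    A lost enhancement part is replaced by zeros, which is what produces
    the blocky/blurred base-only picture described in the text."""
    scan = list(base_part) + list(enh_part)
    scan += [0] * (64 - len(scan))
    block = np.zeros((8, 8), dtype=int)
    for value, (r, c) in zip(scan, zigzag_order()):
        block[r, c] = value
    return block

block = np.zeros((8, 8), dtype=int)
block[0, 0], block[0, 1], block[1, 0], block[3, 3] = 96, 12, -7, 2
base, enh = partition_block(block, pbp=3)
print(merge_block(base, []))   # base-only decode: the high order detail is gone
```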

Figure 8.10 shows single shots of an 8 Mbit/s data partitioning MPEG-2 coded video and its associated base layer picture. The priority break point is adjusted for a base layer bit rate of 2 Mbit/s. At this bit rate, the quality of the base layer is almost acceptable. However, some areas in the base layer show blocking artefacts and in others the picture is blurred. Blockiness is due to the reconstruction of some macroblocks from only the DC and/or from a few AC coefficients. Blurriness is due to loss of high frequency DCT coefficients.

Figure 8.10: Data partitioning (a) enhanced (b) base picture

It should be noted that, since the encoder is a single-layer interframe coder, both the base and the enhancement layer coefficients are used in the encoder's prediction loop. Thus reconstruction of the picture from the base layer alone results in a mismatch between the encoder and decoder prediction loops. This causes picture drift in the reconstructed picture: the loss of enhancement data at the decoder accumulates and appears as mosquito-like noise. Picture drift only occurs in P-pictures, but since B-pictures may use P-pictures for prediction, they suffer from picture drift too. I-pictures reset the feedback prediction, hence they clean up the drift. The more frequent the I-pictures, the less the appearance of picture drift, but at the expense of a higher bit rate.

In summary, to have drift-free video the receiver should receive the entire bit stream. That is, a receiver that decodes only the base layer portion of the bit stream cannot produce a stable video. Therefore a data partitioned bit stream is not scalable, but it is layered coded. It is in fact the simplest form of layering technique, with no extra complexity over the single-layer encoder.

Although the base picture suffers from picture drift and may not be usable alone, the enhanced picture (base layer plus the enhancement layer) with occasional losses is quite acceptable. This is due to the normally low loss rates in most networks (e.g. less than 10^-4 in ATM networks), such that before the accumulation of loss becomes significant, the lost area is cleaned up by I-pictures.
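
The drift mechanism itself can be imitated with a toy simulation. In the Python/numpy sketch below (not MPEG-2 itself: 1-D "pictures", made-up quantiser steps, no motion), the encoder's prediction loop uses the full base plus enhancement reconstruction, while a decoder receiving only the base layer accumulates mismatch over successive P-pictures until an I-picture resets the loop.

```python
# Toy picture drift demo: mean |encoder ref - base-only decoder ref| grows
# over P-pictures and collapses at each I-picture.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(128.0, 20.0, size=(13, 64))   # thirteen 1-D "pictures"

def quantise(x, step):
    return np.round(x / step) * step

enc_ref = np.zeros(64)        # encoder local decode (base + enhancement)
dec_ref = np.zeros(64)        # decoder that receives the base layer only
for n, frame in enumerate(frames):
    intra = (n % 12 == 0)                     # I-picture every 12 frames
    pred = np.zeros(64) if intra else enc_ref
    err = frame - pred                        # prediction error at the encoder
    base_rec = quantise(err, step=16.0)       # coarse base layer reconstruction
    full_rec = quantise(err, step=2.0)        # base + enhancement reconstruction
    enc_ref = pred + full_rec                 # encoder prediction loop
    dec_ref = (np.zeros(64) if intra else dec_ref) + base_rec
    print(f"frame {n:2d} {'I' if intra else 'P'} "
          f"mean drift = {np.abs(enc_ref - dec_ref).mean():5.2f}")
```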

8.5.3 SNR scalability

SNR scalability is a tool intended for use in video applications involving telecommunications and multiple quality video services with standard TV and enhanced TV, i.e. video systems with the common feature that a minimum of two layers of video quality is necessary. SNR scalability involves generating two video layers of the same spatio-temporal resolution but different video qualities from a single video source, such that the base layer is coded by itself to provide the basic video quality and the enhancement layer is coded to enhance the base layer. The enhancement layer, when added back to the base layer, regenerates a higher quality reproduction of the input video. Since the enhancement layer is said to enhance the signal-to-noise ratio (SNR) of the base layer, this type of scalability is called SNR scalability. Alternatively, as we will see later, SNR scalability could have been called coefficient amplitude scalability or quantisation noise scalability. These names, although a bit wordy, may better describe the nature of this encoder.

Figure 8.11 shows a block diagram of a two-layer SNR scalable encoder. First, the input video is coded at a low bit rate (lower image quality), to generate the base layer bit stream. The difference between the input video and the decoded output of the base layer is coded by a second encoder, with a higher precision, to generate the enhancement layer bit stream. These bit streams are multiplexed for transmission over the channel. At the decoder, decoding of the base layer bit stream results in the base picture. When the decoded enhancement layer bit stream is added to the base layer, the result is an enhanced image. The base and the enhancement layers may either use the MPEG-2 standard encoder or the MPEG-1 standard for the base layer and MPEG-2 for the enhancement layer. That is, in the latter a 4:2:0 format picture is generated at the base layer, but a 4:2:0 or 4:2:2 format picture at the second layer.

Figure 8.11: Block diagram of a two-layer SNR scalable coder

It may appear that the SNR scalable encoder is much more complex than the data partitioning encoder. The former requires at least two nonscalable encoders, whereas data partitioning is a simple single-layer encoder in which partitioning is just carried out on the bit stream. In fact, if both layer encoders in the SNR coder are of the same type, e.g. both nonscalable MPEG-2 encoders, then the two-layer encoder can be simplified. Consider Figure 8.12, which represents a simplified nonscalable MPEG-2 encoder of the base layer [10].

Figure 8.12: A DCT based base layer encoder

According to the Figure, the difference between the input pixel block X and its motion compensated prediction Y is transformed into coefficients T(X - Y). After quantisation these coefficients can be represented as T(X - Y) - Q, where Q is the introduced quantisation distortion. The quantised coefficients after the inverse DCT (IDCT) reconstruct the prediction error. They are then added to the motion compensated prediction to reconstruct a locally decoded pixel block Z.

Thus the interframe error signal X - Y after transform coding becomes:

(8.1)  T(X - Y)

and after quantisation, a quantisation distortion Q is introduced to the transform coefficients. Then eqn. 8.1 becomes:

(8.2)  T(X - Y) - Q

After the inverse DCT the reconstruction error can be formulated as:

(8.3)  T^-1[T(X - Y) - Q]

where T^-1 is the inverse transformation operation. Because transformation is a linear operation, the reconstruction error can be written as:

(8.4)  T^-1T(X - Y) - T^-1Q

Also, due to the orthonormality of the transform, where T^-1T = 1, eqn. 8.4 simplifies to:

(8.5)  (X - Y) - T^-1Q

When this error is added to the motion compensated prediction Y, the locally decoded block becomes:

(8.6)  Z = Y + (X - Y) - T^-1Q = X - T^-1Q

Thus, according to Figure 8.12, what is coded by the second layer encoder is:

(8.7)  X - Z = T^-1Q

that is, the inverse transform of the base layer quantisation distortion. Since the second layer encoder is also an MPEG encoder (e.g. a DCT-based encoder), then DCT transformation of X - Z in eqn. 8.7 would result in:

(8.8)  T(X - Z) = T(T^-1Q) = Q

where again the orthonormality of the transform is employed. Thus the second layer transform coefficients are in fact the quantisation distortions of the base layer transform coefficients, Q. For this reason the codec can also be called a coefficient amplitude scalability or quantisation noise scalability unit.
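
This identity is easy to verify numerically. The Python/numpy sketch below builds an orthonormal 2-D DCT to play the role of T (so that T^-1T = 1, as used in deriving eqn 8.5; the block values and quantiser step are made up) and checks that the coefficients seen by the second layer, T(X - Z), equal the base layer quantisation distortion Q.

```python
# Numerical check of eqns 8.1-8.8 with an orthonormal DCT-II matrix.
import numpy as np

N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)              # orthonormal DCT-II basis: C @ C.T = I

T  = lambda x: C @ x @ C.T              # forward 2-D transform
Ti = lambda x: C.T @ x @ C              # inverse transform (T^-1)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 30.0, (N, N))       # input pixel block
Y = X + rng.normal(0.0, 5.0, (N, N))    # motion compensated prediction

step = 16.0
coeffs = T(X - Y)                       # eqn 8.1
quantised = np.round(coeffs / step) * step
Q = coeffs - quantised                  # eqn 8.2: quantised = T(X - Y) - Q
Z = Y + Ti(quantised)                   # eqn 8.6: locally decoded block

assert np.allclose(T(X - Z), Q)         # eqn 8.8: the second layer codes Q
```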

Therefore the second layer of an SNR scalable encoder can be a simple requantiser, as shown in Figure 8.12, without much more complexity than a data partitioning encoder. The only problem with this method of coding is that, since the base layer is normally poor, or at least worse than the enhanced image (base plus the second layer), the prediction used is not a good one. A better prediction would be a picture formed from the sum of both layers, as shown in Figure 8.13. Note that the second layer is still encoding the quantisation distortion of the base layer.

Figure 8.13: A two-layer SNR scalable encoder with drift at the base layer

In this encoder, for simplicity, the motion compensation, variable length coding of both layers and the channel buffer have been omitted. In the Figure Qb and Qe are the base and the enhancement layer quantisation step sizes, respectively. The quantisation distortion of the base layer is requantised with a finer precision (Qe < Qb), and then it is fed back to the prediction loop, to represent the coding loop of the enhancement layer. Now compared with data partitioning, this encoder only requires a second quantiser, and so the complexity is not so great.
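
A minimal sketch of this requantisation, with assumed step sizes and uniformly distributed stand-in coefficients rather than a real coding loop, shows how refining the base layer distortion with Qe < Qb tightens the worst case reconstruction error from Qb/2 to Qe/2 per coefficient.

```python
# Second layer as a simple requantiser of the base layer distortion.
import numpy as np

def quantise(x, step):
    return np.round(x / step) * step

rng = np.random.default_rng(2)
coeffs = rng.uniform(-100.0, 100.0, 1000)   # stand-in DCT coefficients

Qb, Qe = 16.0, 4.0                          # base / enhancement step sizes
base = quantise(coeffs, Qb)
enhancement = quantise(coeffs - base, Qe)   # requantised base layer distortion

print(np.abs(coeffs - base).max())                # up to Qb/2 = 8
print(np.abs(coeffs - base - enhancement).max())  # up to Qe/2 = 2
```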

Note the tight coupling between the two-layer bit streams. For freedom from drift in the enhanced picture, both bit streams should be made available to the decoder. For this reason this type of encoder is called an SNR scalable encoder with drift at the base layer or no drift in the enhancement layer. If the base layer bit stream is decoded by itself then, due to loss of differential refinement coefficients, the decoded picture in this layer will suffer from picture drift. Thus, this encoder is not a true scalable encoder, but is in fact a layered encoder. Again, although the drift should only appear in P-pictures, since B-pictures use P-pictures for predictions, this drift is transferred into B-pictures too. I-pictures reset the distortion and drift is cleaned up.

In applications with occasional loss of information in the enhancement layer, parts of the picture have base layer quality and other parts enhancement layer quality; picture drift can therefore be noticed in these areas.

If a true SNR scalable encoder with drift-free pictures at both layers is the requirement, then the coupling between the two layers must be loosened. Applications such as simulcasting of video with two different qualities from the same source need such a feature. One way to prevent picture drift is not to feed the enhancement data back into the base layer prediction loop. In this case the enhancement layer will be intra coded and its bit rate will be very high.

In order to reduce the second layer bit rate, the difference between the input to and the output of the base layer (see Figure 8.11) can be coded by another MPEG encoder [11]. However, here we need two encoders and the complexity is much higher than that of data partitioning. To reduce the complexity, we need to code only the quantisation distortion. Since the transformation operator and, most importantly in SNR scalability, the temporal and spatial resolutions of the base and enhancement layer pictures are identical, the motion estimation and compensation can be shared between them. Following the previous eqns 8.1-8.8, we can also simplify the two independent encoders into one encoder generating two bit streams, such that each bit stream is decodable free of drift. Figure 8.14 shows a block diagram of a three-layer truly SNR scalable encoder, where the picture generated at each layer is drift free and can be used for simulcasting [12].

Figure 8.14: A three-layer drift-free SNR scalable encoder

In the Figure, T, B, IT and EC represent transformation, prediction buffer, inverse transformation and entropy coding, respectively, and Qi is the ith layer quantiser. A common motion vector is used at all layers. Note that although this encoder looks to be made of several single-layer encoders, since motion estimation and many coding decisions are common to all the layers, and motion estimation comprises about 55-70 per cent of the encoding complexity of an encoder, the increase in complexity is moderate. Figure 8.15 shows the block diagram of the corresponding three-layer SNR decoder.

Figure 8.15: A block diagram of a three-layer SNR decoder

After entropy decoding (ED) each layer is inverse quantised, and then all layers are added together to form the final DCT coefficients. These coefficients are inverse transformed and added to the motion compensated previous picture to reconstruct the final picture. Note that there is only one motion vector, which is transmitted at the base layer. Note also that the decoder for the encoder of Figure 8.13, with drift in the base layer, is similar to this Figure but with only two layers of decoding.
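
The refinement structure behind this decoder can be sketched in a few lines. In the Python/numpy example below (made-up step sizes and coefficients), each layer quantises whatever the coarser layers left behind, so summing any prefix of the inverse quantised layers gives a valid drift-free approximation, which is exactly the decoding rule of Figure 8.15.

```python
# Layered coefficient refinement: any prefix of layers decodes consistently.
import numpy as np

def quantise(x, step):
    return np.round(x / step) * step

def encode_layers(coeffs, steps):
    """Each layer codes the residual left by the coarser layers above it."""
    layers, residual = [], coeffs
    for step in steps:                      # e.g. base, enhancement 1, 2
        layer = quantise(residual, step)
        layers.append(layer)
        residual = residual - layer
    return layers

rng = np.random.default_rng(3)
coeffs = rng.normal(0.0, 40.0, (8, 8))      # one block of DCT coefficients
layers = encode_layers(coeffs, steps=[32.0, 8.0, 2.0])
for k in range(1, 4):
    approx = np.sum(layers[:k], axis=0)     # decoder: sum inverse quantised layers
    print(k, "layer(s): max coefficient error =", np.abs(coeffs - approx).max())
# One inverse DCT of `approx`, added to the motion compensated picture,
# would then give the displayed frame.
```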

Figure 8.16 shows the picture quality of the base layer at 2 Mbit/s. The quality of the base plus the enhancement layer would be similar to that of data partitioning, albeit at a slightly higher bit rate. At this bit rate the extra bits would be of the order of 15-20 per cent, due to the overhead of the second layer data [11] (also see Figure 8.26). Due to the coarser quantisation, some parts of the picture are blocky, as was the case in data partitioning. However, since any significant coefficient can be included in the base layer, the base layer picture of this encoder, unlike that of data partitioning, does not suffer from loss of high frequency information.

Figure 8.16: Picture quality of the base layer of SNR encoder at 2 Mbit/s

Experimental results show that the picture quality of the base layer of an SNR scalable coder is much superior to that of data partitioning, especially at lower bit rates [13]. This is because, at lower base layer bit rates, data partitioning can only retain DC and possibly one or two AC coefficients. Reconstructed pictures with these few coefficients are very blocky.

8.5.4 Spatial scalability

Spatial scalability involves generating two spatial resolution video streams from a single video source, such that the base layer is coded by itself to provide the basic spatial resolution and the enhancement layer employs the spatially interpolated base layer to carry the full spatial resolution of the input video source [14]. The base and the enhancement layers may both use the coding tools of the MPEG-2 standard, or MPEG-1 for the base layer and MPEG-2 for the enhancement layer, or even an H.261 encoder at the base layer and an MPEG-2 encoder at the second layer. Use of MPEG-2 for both layers achieves a further advantage by facilitating interworking between video coding standards. Moreover, spatial scalability offers flexibility in the choice of video formats to be employed in each layer. The base layer can use SIF or even lower resolution pictures in 4:2:0, 4:2:2 or 4:1:1 formats, while the second layer can be kept at CCIR-601 with 4:2:0 or 4:2:2 format. Like the other two scalable coders, spatial scalability is able to provide resilience to transmission errors, as the more important data of the lower layer can be sent over a channel with better error performance and the less critical enhancement layer data over a channel with poorer error performance. Figure 8.17 shows a block diagram of a two-layer spatial scalable encoder.

Figure 8.17: Block diagram of a two-layer spatial scalable encoder

An incoming video is first spatially reduced in both the horizontal and vertical directions to produce a reduced resolution picture. For 2:1 reduction, normally a CCIR-601 video is converted into an SIF image format. The filters for the luminance and the chrominance colour components are the 7 and 4 tap filters, respectively, described in section 2.3. The SIF image sequence is coded at the base layer by an MPEG-1 or MPEG-2 standard encoder, generating the base layer bit stream. The bit stream is decoded and upsampled to produce an enlarged version of the base layer decoded video at CCIR-601 resolution. The upsampling is carried out by inserting zero level samples between the luminance and chrominance pixels and interpolating with the 7 and 4 tap filters, similar to those described in section 2.3. An MPEG-2 encoder at the enhancement layer codes the difference between the input video and the interpolated video from the base layer. Finally, the base and enhancement layer bit streams are multiplexed for transmission into the channel.
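
A 1-D Python/numpy sketch of this pre/postprocessing chain is given below. The actual 7 and 4 tap MPEG-2 filters of section 2.3 are not reproduced here; a generic half-band filter stands in for them, and the test signal is made up.

```python
# 2:1 downsampling for the base layer, then upsampling by zero insertion
# and interpolation, as in the spatial scalable pre/postprocessors.
import numpy as np

halfband = np.array([-1, 0, 9, 16, 9, 0, -1]) / 32.0   # stand-in 7 tap filter

def downsample(line):
    filtered = np.convolve(line, halfband, mode="same")  # antialias first
    return filtered[::2]                                 # keep every other sample

def upsample(line):
    zero_stuffed = np.zeros(2 * len(line))
    zero_stuffed[::2] = line                 # insert zero level samples
    return np.convolve(zero_stuffed, 2 * halfband, mode="same")  # interpolate

full = 128 + 100 * np.sin(np.linspace(0, 3 * np.pi, 64))  # a "CCIR-601 line"
base = downsample(full)                  # what the base layer encoder codes
interpolated = upsample(base)            # enlarged base layer decoded video
residual = full - interpolated           # what the enhancement layer codes
print(np.abs(residual).mean())           # small but nonzero: aliasing remains
```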

If the base and the enhancement layer encoders are of the same type (e.g. both MPEG-2), then the two encoders can interact. This not only simplifies the two-layer encoder, as was the case for the SNR scalable encoder, but also makes the coding more efficient. Consider a macroblock at the base layer. Due to the 2:1 picture resolution ratio between the enhancement and the base layers, a base layer macroblock corresponds to four macroblocks at the enhancement layer. Similarly, a macroblock at the enhancement layer corresponds to a block of 8 × 8 pixels at the base layer. The interaction takes the form of upsampling the base layer block of 8 × 8 pixels into a macroblock of 16 × 16 pixels and using it as a part of the prediction in the enhancement layer coding loop.

Figure 8.18 shows a block of 8 × 8 pixels from the base layer that is upsampled and is combined with the prediction of the enhancement layer to form the final prediction for a macroblock at the enhancement layer. In the Figure the base layer upsampled macroblock is weighted by w and that of the enhancement layer by 1 - w.

Figure 8.18: Principle of spatio-temporal prediction in the spatial scalable encoder
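
The weighted combination of Figure 8.18 amounts to a single line of arithmetic. The Python/numpy sketch below makes the rule explicit; the weight here is a free parameter, whereas the encoder selects w per macroblock from a weighting table.

```python
# Spatio-temporal prediction for one enhancement layer macroblock.
import numpy as np

def spatio_temporal_prediction(upsampled_base_mb, temporal_pred_mb, w):
    """Mix the upsampled 16 x 16 base layer block (weight w) with the
    enhancement layer's own motion compensated prediction (weight 1 - w).
    w = 1 gives pure spatial prediction, w = 0 pure temporal prediction."""
    return w * upsampled_base_mb + (1.0 - w) * temporal_pred_mb

pred = spatio_temporal_prediction(np.full((16, 16), 120.0),
                                  np.full((16, 16), 100.0), w=0.25)
print(pred[0, 0])   # 0.25 * 120 + 0.75 * 100 = 105.0
```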

More details of the spatial scalable encoder are shown in Figure 8.19. The base layer is a nonscalable MPEG-2 encoder, where each block of this encoder is upsampled, interpolated and fed to a weighting table (WT). The coding elements of the enhancement layer are shown without the motion compensation, variable length coding and the other coding tools of the MPEG-2 standard. A statistical table (ST) sets the weighting table elements. Note that the weighted base layer macroblocks are used in the prediction loop and are subtracted from the input macroblocks. This is similar to taking the difference between the input and the decoded base layer video and coding the difference with a second layer encoder, as was illustrated in the general block diagram of this encoder in Figure 8.17.

Figure 8.19: Details of spatial scalability encoder

Figure 8.20 shows a single shot of the base layer picture at 2 Mbit/s. The picture produced by the base plus the enhancement layer at 8 Mbit/s would be similar to that resulting from data partitioning, shown in Figure 8.10. Note that since the picture size is one quarter of the original, the 2 Mbit/s allocated to the base layer is sufficient to code the base layer pictures at almost identical quality to the base plus the enhancement layer at 8 Mbit/s. An upsampled version of the base layer picture, enlarged to fill the display at the CCIR-601 size, is also shown in the Figure. Comparing this picture with those of data partitioning and the simple version of the SNR scalable coder, it can be seen that the picture is almost free from blockiness. However, some very high frequency information is still missing, and aliasing distortion due to the upsampling is introduced into the picture. Note that the base layer picture can be used alone without picture drift; this was not the case for data partitioning and the simple SNR scalable encoders. The price paid is that this encoder is made up of two MPEG encoders and is more complex than the data partitioning and SNR scalable encoders. Note also that, unlike the true SNR scalable encoder of Figure 8.14, here, due to the difference in the picture resolutions of the base and enhancement layers, the same motion vector cannot be used for both layers.

Figure 8.20: (a) Base layer picture of a spatial scalable encoder at 2 Mbit/s (b) its enlarged version

8.5.5 Temporal scalability

Temporal scalability is a tool intended for use in a range of diverse video applications, from telecommunications to HDTV, in which migration from lower temporal resolution systems to higher temporal resolution systems may be necessary. In many cases, the lower temporal resolution video systems may be either the existing systems or the less expensive early generation systems, with the more sophisticated systems introduced gradually.

Temporal scalability involves partitioning of video frames into layers, in which the base layer is coded by itself to provide the basic temporal rate and the enhancement layer is coded with temporal prediction with respect to the base layer. The layers may have either the same or different temporal resolutions which, when combined, provide the full temporal resolution at the decoder. The spatial resolution of the frames in each layer is assumed to be identical to that of the input video. The video encoders of the two layers need not be identical. The lower temporal resolution systems may decode only the base layer to provide the basic temporal resolution, whereas more sophisticated systems of the future may decode both layers and provide high temporal resolution video, while maintaining interworking capability with earlier generation systems.

Since in temporal scalability the input video frames are simply partitioned between the base and the enhancement layer encoders, the encoder need not be more complex than a single-layer encoder. For example, a single-layer encoder may be switched between the two base and enhancement modes to generate the base and the enhancement bit streams alternately. Similarly, a decoder can be reconfigured to decode the two bit streams alternately. In fact, the B-pictures in MPEG-1 and MPEG-2 provide a very simple temporal scalability that is encoded and decoded alongside the anchor I and P-pictures within a single codec. I and P-pictures are regarded as the base layer, and the B-pictures become the enhancement layer. Decoding of I and P-pictures alone will result in the base pictures with low temporal resolution, and when added to the decoded B-pictures the temporal resolution is enhanced to its full size. Note that, since the enhancement data does not affect the base layer prediction loop, both the base and the enhanced pictures are free from picture drift.

Figure 8.21 shows the block diagram of a two-layer temporal scalable encoder. In the Figure, a temporal demultiplexer partitions the input video into the base and enhancement layer input pictures. For the 2:1 temporal scalability shown in the Figure, the odd numbered pictures are fed to the base layer encoder and the even numbered pictures become inputs to the second layer encoder. The encoder at the base layer is a normal MPEG-1, MPEG-2 or any other encoder. Again, for greater interaction between the two layers, either to make encoding simpler or more efficient, both layers may employ the same type of coding scheme.

Figure 8.21: A block diagram of a two-layer temporal scalable encoder
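
The demultiplexing itself is trivial; the Python sketch below (frame numbering and names are illustrative) shows the 2:1 odd/even partition and its lossless reassembly at a decoder that receives both layers.

```python
# 2:1 temporal demultiplexing and remultiplexing.
def temporal_demux(frames):
    base = frames[0::2]           # pictures 1, 3, 5, ... (odd numbered)
    enhancement = frames[1::2]    # pictures 2, 4, 6, ... (even numbered)
    return base, enhancement

def temporal_mux(base, enhancement):
    merged = []
    for b, e in zip(base, enhancement):
        merged += [b, e]          # interleave back to full temporal rate
    merged += base[len(enhancement):]   # odd count: base holds one extra
    return merged

frames = list(range(1, 9))        # picture numbers 1..8
b, e = temporal_demux(frames)
assert temporal_mux(b, e) == frames   # both layers -> full temporal resolution
```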

At the base layer the lower temporal resolution input pictures are encoded in the normal way. Since these pictures can be decoded independently of the enhancement layer, they do not suffer from picture drift. The second layer may use prediction from the base layer pictures, or from its own picture, as shown for frame 4 in the Figure. Note that at the base layer some pictures might be coded as B-pictures, using their own previous, future or their interpolation as prediction, but it is essential that some pictures should be coded as anchor pictures. On the other hand, in the enhancement layer, pictures can be coded in any mode. Of course, for greater compression, at the enhancement layer, most if not all the pictures are coded as B-pictures. These B-pictures have the choice of using past, future and their interpolated values, either from the base or the enhancement layer.

8.5.6 Hybrid scalability

MPEG-2 allows combination of individual scalabilities such as spatial, SNR or temporal scalability to form hybrid scalability for certain applications. If two scalabilities are combined, then three layers are generated and they are called the base layer, enhancement layer 1 and enhancement layer 2. Here enhancement layer 1 is a lower layer relative to enhancement layer 2, and hence decoding of enhancement layer 2 requires the availability of enhancement layer 1. In the following some examples of hybrid scalability are shown.

8.5.6.1 Spatial and temporal hybrid scalability

Spatial and temporal scalability is perhaps the most common use of hybrid scalability. In this mode the three-layer bit streams are formed by using spatial scalability between the base and enhancement layer 1, while temporal scalability is used between enhancement layer 2 and the combined base and enhancement layer 1, as shown in Figure 8.22.

Figure 8.22: Spatial and temporal hybrid scalability encoder

In this Figure, the input video is temporally partitioned into two lower temporal resolution image sequences In-1 and In-2. The image sequence In-1 is fed to the spatial scalable encoder, where its reduced version, In-0, is the input to the base layer encoder. The spatial encoder then generates two bit streams, for the base and enhancement layer 1. The In-2 image sequence is fed to the temporal enhancement encoder to generate the third bit stream, enhancement layer 2. The temporal enhancement encoder can use the locally decoded pictures of a spatial scalable encoder as predictions, as was explained in section 8.5.4.

8.5.6.2 SNR and spatial hybrid scalability

Figure 8.23 shows a three-layer hybrid encoder employing SNR scalability and spatial scalability. In this encoder the SNR scalability is used between the base and enhancement layer 1, and the spatial scalability between layer 2 and the combined base and enhancement layer 1. The input video is spatially downsampled (reduced) to a lower resolution sequence, In-1, which is fed to the SNR scalable encoder. The output of this encoder forms the base and enhancement layer 1 bit streams. The locally decoded pictures from the SNR scalable coder are upsampled to full resolution to form the prediction for the spatial enhancement encoder.

Figure 8.23: SNR and spatial hybrid scalability encoder

8.5.6.3 SNR and temporal hybrid scalability

Figure 8.24 shows an example of an SNR and temporal hybrid scalability encoder. The SNR scalability is performed between the base layer and the first enhancement layer. The temporal scalability is used between the second enhancement layer and the locally decoded picture of the SNR scalable coder. The input image sequence through a temporal demultiplexer is partitioned into two sets of image sequences, and these are fed to each individual encoder.

Figure 8.24: SNR and temporal hybrid scalability encoder

8.5.6.4 SNR, spatial and temporal hybrid scalability

The three scalable encoders might be combined to form a hybrid coder with a larger number of levels. Figure 8.25 shows an example of four levels of scalability, using all the three scalability tools mentioned.

Figure 8.25: SNR, spatial and temporal hybrid scalability encoder

The temporal demultiplexer partitions the input video into image sequences In-1 and In-2. Image sequence In-2 is coded at the highest enhancement layer (enhancement 3), with the prediction from the lower levels. The image sequence In-1 is first downsampled to produce a lower resolution image sequence, In-0. This sequence is then SNR scalable coded, to provide the base and the first enhancement layer bit streams. An upsampled and interpolated version of the SNR scalable decoded video forms the prediction for the spatial enhancement encoder. The output of this encoder results in the second enhancement layer bit stream (enhancement 2).

Figure 8.25 is just one example of how the various scalability tools can be combined to produce bit streams of various degrees of importance. Of course, depending on the application, the formation of the base layer and the hierarchy of the higher enhancement layers might be defined differently to suit the application. For example, when the above scalability methods are applied to each of the I, P and B-pictures, since these pictures have different levels of importance, their layered versions can increase the number of layers even further.

8.5.7 Overhead due to scalability

Although scalability or layering techniques provide a means of delivering a better video quality to the receivers than do single-layer encoders, this is done at the expense of higher encoder complexity and higher bit rate. We have seen that data partitioning is the simplest form of layering, and spatial scalability the most complex one. The amount of extra bits generated by these scalability techniques is also different.

Data partitioning is a single-layer encoder, but the inclusion of the priority break point (PBP) in the zigzag scanning path, and the fact that the zero run of the zigzag scan is now broken into two parts, incur some additional bits. These extra bits, along with the redundant declaration of the macroblock addresses at both layers, generate some overhead over the single-layer coder. Our investigations show that this overhead is of the order of 3-4 per cent of the single-layer counterpart, almost irrespective of the percentage of the total bit rate assigned to the base layer.

In SNR scalability the second layer codes the quantisation distortions of the base layer, plus the other addressing information. The additional bits over the single layer depend on the relationship between the quantiser step sizes of the base and enhancement layers, and consequently on the percentage of the total bits allocated to the base layer. At the lower percentages, the quantiser step size of the base layer is large and hence the second layer efficiently codes any residual base layer quantisation distortions. This is very similar to successive approximation (two sets of bit planes), hence it is not expected that the SNR scalable coding efficiency will be much worse than the single layer.

At the higher percentages, the quantiser step sizes of the base and enhancement layers become close to each other. Considering that, for a base layer quantiser step size of Qb, the maximum quantisation distortion of the quantised coefficients is Qb/2 and that of the nonquantised coefficients falling in the dead zone is Qb, then as long as the enhancement quantiser step size satisfies Qb > Qe > Qb/2, none of the significant base layer coefficients is coded by the enhancement layer, except of course the ones in the dead zone of the base layer. Thus again both layers code the data efficiently, that is each coefficient is coded at either the base layer or the enhancement layer, and the overall coding efficiency is not worse than the single layer. Reducing the base layer bit rate from its maximum value means increasing Qb. As long as Qb/2 < Qe, none of the base layer quantisation distortions (except the ones in the dead zone) can be coded by the enhancement layer. Hence the enhancement layer does not improve the picture quality noticeably and, since the base layer is coded at a lower bit rate, the overall quality will be worse than that of the single layer. The worst quality occurs when Qe = Qb/2.
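
This dead zone argument can be checked numerically. The sketch below models the quantiser as round-to-nearest with a dead zone of one full step (a simplification of the real MPEG inter quantiser; the coefficient distribution and step sizes are made up) and confirms that, for Qb > Qe > Qb/2, no significant base layer coefficient is coded twice, while some of the dead zone residuals are picked up by the enhancement layer.

```python
# Dead zone illustration for the SNR scalability overhead argument.
import numpy as np

def deadzone_quantise(x, step):
    """Round to the nearest multiple of `step`, but zero everything whose
    magnitude is below one full step (the dead zone)."""
    q = np.round(x / step) * step
    q[np.abs(x) < step] = 0.0
    return q

rng = np.random.default_rng(4)
coeffs = rng.uniform(-64.0, 64.0, 100000)

Qb, Qe = 16.0, 10.0                  # chosen so that Qb > Qe > Qb/2
base = deadzone_quantise(coeffs, Qb)
residual = coeffs - base             # <= Qb/2 if coded, < Qb in the dead zone
enh = deadzone_quantise(residual, Qe)

significant = base != 0
print((enh[significant] != 0).mean())    # 0.0: nothing is coded twice
print((enh[~significant] != 0).mean())   # only dead zone residuals are coded
```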

If the aim is to produce the same picture quality, then the bit rate of the SNR scalable coder has to be increased, as shown in Figure 8.26. In this Figure the overall bit rate of the SNR scalable coder is increased over the single layer such that the picture quality under both encoders is identical, while the percentage of the total bits assigned to the base layer is varied from its minimum to its maximum value. As we see, the poorest performance of the SNR scalable coder (the highest overhead) occurs when the base layer is allocated about 40-50 per cent of the total bit rate. In fact, at this point the average quantiser step size of the enhancement layer is half that of the base layer. This maximum overhead is 30 per cent. The overhead of data partitioning is also shown, which reads about three per cent, irrespective of the bits assigned to the base layer.

Figure 8.26: Increase in bit rate due to scalability

In spatial scalability the smaller size picture of the base layer is upsampled and its difference with the input picture is coded by the enhancement layer. Hence the enhancement layer, in addition to the usual redundant addressing, has to code two new items of information. One is the aliasing distortion of the base layer due to upsampling and the other is the quantisation distortion of the base layer, if there is any.

At very low percentages of bits assigned to the base layer, both of these distortions are coded efficiently by the enhancement layer. As the percentage of bits assigned to the base layer increases, similarly to SNR scalability, the overhead increases too. At the point where the quantiser step sizes of both layers become equal, Qb = Qe, any further increase in the base layer bit rate (making Qb < Qe) means that the enhancement layer cannot improve the distortions of the base layer any further. Beyond this point aliasing distortion becomes the dominant distortion, and any increase in the base layer bit rate is wasted. Thus, as the bit rate budget of the base layer increases, the overhead increases too, as shown in Figure 8.26. This differs from the behaviour of SNR scalability.

In fact, in spatial scalability with a fixed total bit rate, increasing the base layer bit rate beyond the critical point of Qb = Qe reduces the enhancement layer bit rate budget. In this case, increasing the base layer bit rate leaves more of the aliasing distortion uncorrected and hence, as the base layer bit rate increases, the overall quality decreases!

In temporal scalability, in contrast to the other scalability methods, the bit rate can in fact be less than that of a single-layer encoder! This is because there is no redundant addressing to create overhead. Moreover, since the enhancement layer pictures have more choice for their optimum prediction, from either the base or the enhancement layer, they are coded more efficiently than in the single layer. A good example is the B-pictures in MPEG-2, which can be coded at a much lower bit rate than P-pictures. Thus temporal scalability can in fact be slightly more efficient than single-layer coding.

8.5.8 Applications of scalability

Considering the nature of the basic scalability tools of data partitioning, SNR, spatial and temporal scalability, and their behaviour with regard to picture drift and overhead, suitable applications for each method may be summarised as follows:

  1. Data partitioning: this mode is the simplest of all, but since it has a poor base layer quality and is sensitive to picture drift, it can be used in environments where loss of enhancement data is rare (e.g. loss rate < 10^-6). Hence the best application would be video over ATM networks, where, through admission control, the loss ratio can be maintained at low levels [15].

  2. SNR scalability: in this method two pictures of the same spatio-temporal resolution are generated, but one has a lower picture quality than the other. SNR scalability generally has a higher bit rate than nonscalable encoders, but can have a good base picture quality and can be drift free. Hence suitable applications can be:

    • transmission of video at different quality of interest, such as multiquality video, video on demand, broadcasting of TV and enhanced TV

    • video over networks with high error or packet loss rates, such as the Internet or heavily congested ATM networks.

  3. Spatial scalability: this is the most complex form of scalability, where each layer requires a complete encoder/decoder. Such a loose dependency between the layers has the advantage that each layer is free to use any codec, with different spatio-temporal and quality resolutions. Hence there can be numerous applications for this mode, such as:

    • interworking between two different standard video codecs (e.g. H.263 and MPEG-2)

    • simulcasting of drift-free good quality video at two spatial resolutions, such as standard TV and HDTV

    • distribution of video over computer networks

    • video browsing

    • reception of good quality low spatial resolution pictures over mobile networks

    • similar to other scalable coders, transmission of error resilience video over packet networks.

  4. Temporal scalability: this is a moderately complex encoder, where either a single-layer coder encodes both layers, such as the coding of B-pictures and the anchor I and P-pictures in MPEG-1 and MPEG-2, or two separate encoders operate at two different temporal rates. The major applications can then be:

    • migration to progressive (HDTV) from the current interlaced broadcast TV

    • internetworking between lower bit rate mobile and higher bit rate fixed networks

    • video over LANs, Internet and ATM for computer workstations

    • video over packet (Internet/ATM) networks for loss resilience.

Tables 8.4, 8.5 and 8.6 summarise a few applications of various scalability techniques that can be applied to broadcast TV. In each application, parameters of the base and enhancement layers are also shown.

Table 8.4: Applications of SNR scalability

Base layer            | Enhancement layer                         | Application
----------------------|-------------------------------------------|------------------------------------
ITU-R-601             | same resolution and format as lower layer | two quality service for standard TV
high definition       | same resolution and format as lower layer | two quality service for HDTV
4:2:0 high definition | 4:2:2 chroma simulcast                    | video production/distribution

Table 8.5: Applications of spatial scalability

Base                | Enhancement         | Application
--------------------|---------------------|------------------------------------------------------
progressive (30 Hz) | progressive (30 Hz) | CIF/QCIF compatibility or scalability
interlace (30 Hz)   | interlace (30 Hz)   | HDTV/SDTV scalability
progressive (30 Hz) | interlace (30 Hz)   | ISO/IEC 11172-2 compatibility with this specification
interlace (30 Hz)   | progressive (60 Hz) | migration to HR progressive HDTV

Table 8.6: Applications of temporal scalability

Base                | Enhancement         | Higher              | Application
--------------------|---------------------|---------------------|----------------------------------
progressive (30 Hz) | progressive (30 Hz) | progressive (60 Hz) | migration to HR progressive HDTV
interlace (30 Hz)   | interlace (30 Hz)   | progressive (60 Hz) | migration to HR progressive HDTV


