9.8 Scalability

Although scalability was described at length for JPEG2000 and MPEG-2, H.263 uses different terminology, so we revisit the subject here. It is worth noting that scalability in H.263 is used less for distribution purposes than as a layering technique. Hence, by applying unequal error protection to the base layer, this method, in conjunction with the other error resilience methods explained in section 9.7, further improves the robustness of the codec.

Extensions of H.263 also support temporal, SNR and spatial scalability as optional modes [25-O]. These modes are normally used in conjunction with the error control scheme. The capability of these modes, and the extent to which their features are supported, are signalled by external means such as H.245 [9].

There are three types of enhancement picture in the H.263+ codec, known as B, EI and EP-pictures [5]. Each of these has an enhancement layer number, ELNUM, which indicates the layer to which it belongs, and a reference layer number, RLNUM, which indicates the layer used for its prediction. The encoder may use any of the basic scalability modes (temporal, SNR or spatial) or their combinations in a multilayer scalability mode. Details of the basic and multilayer scalabilities were given in section 8.5. However, because H.263 differs from MPEG-2 in nature and application, there are some differences.

9.8.1 Temporal scalability

Temporal scalability is achieved using bidirectionally predicted pictures, or B-pictures. As usual, B-pictures use prediction from either or both of a previous and a subsequent reconstructed picture in the reference layer. These B-pictures differ from the B-picture part of a PB or improved PB frame in that they are separate entities in the bit stream. They are not syntactically intermixed with a subsequent P-picture or its enhancement part EP.

B-pictures and the B part of PB or improved PB frames are not used as reference pictures for the prediction of any other pictures. This property allows B-pictures to be discarded if necessary without adversely affecting any subsequent pictures, thus providing temporal scalability. There is no limit to the number of B-pictures that may be inserted between pairs of reference pictures in the base layer. A maximum number of such pictures may be signalled by external means (e.g. H.245). However, H.263 is normally used for low frame rate applications (low bit rates, e.g. mobile), where the base layer I and P-pictures are already widely separated in time, so there is normally only one B-picture between them. Figure 9.32 shows the position of the base layer I and P-pictures and the B-pictures of the enhancement layer for most applications.

Figure 9.32: B-picture prediction dependency in the temporal scalability
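
The disposable nature of B-pictures can be illustrated with a minimal Python sketch. It is not part of H.263: the `Picture` record, its fields and the function `drop_b_pictures` are hypothetical names chosen for illustration, and the stream is assumed to hold only I, P and separate B-pictures (no PB frames).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    """Hypothetical picture record, for illustration only."""
    temporal_ref: int  # position in display order
    ptype: str         # "I", "P" or "B"


def drop_b_pictures(stream):
    """Return the stream without its B-pictures.

    Since no picture is predicted from a B-picture, the remaining
    I and P-pictures are still decodable, at a lower picture rate.
    """
    return [p for p in stream if p.ptype != "B"]


# Bit stream order: each B-picture follows the reference picture after it.
stream = [
    Picture(0, "I"),
    Picture(2, "P"),
    Picture(1, "B"),
    Picture(4, "P"),
    Picture(3, "B"),
]

base_only = drop_b_pictures(stream)
print([(p.temporal_ref, p.ptype) for p in base_only])
```

A network node or decoder that cannot afford the full picture rate could apply such a filter and still decode the base layer, here at half the original rate.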

9.8.2 SNR scalability

In SNR scalability, the difference between the input picture and lower quality base layer picture is coded. The picture in the base layer which is used for the prediction of the enhancement layer pictures may be an I-picture, a P-picture, or the P part of a PB or improved PB frame, but should not be a B-picture or the B part of a PB or its improved version.

In the enhancement layer two types of picture are identified, EI and EP. If prediction is formed only from the base layer, the enhancement layer picture is referred to as an EI-picture. In this case the base layer picture can be an I or a P-picture (or the P part of a PB frame). It is possible, however, to create a modified bidirectionally predicted picture using both a prior enhancement layer picture and a temporally simultaneous base layer reference picture. This type of picture is referred to as an EP-picture or enhancement P-picture. Figure 9.33 shows the positions of the base and enhancement layer pictures in an SNR scalable coder. The figure also shows the prediction flow for the EI and EP enhancement pictures.

Figure 9.33: Prediction flow in SNR scalability

For both EI and EP-pictures, prediction from the reference layer uses no motion vectors. However, as with normal P-pictures, EP-pictures use motion vectors when predicting from their temporally prior reference picture in the same layer.
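
The principle of coding the difference between the input and the lower quality base layer picture can be sketched as follows. This is a deliberately simplified illustration of the EI-type case (prediction from the base layer only), not the H.263 coding process: it quantises sample values directly rather than DCT coefficients, and the function names and step sizes are assumptions made for the example.

```python
def quantise(values, step):
    """Uniform quantiser: round each value to the nearest multiple of step."""
    return [step * round(v / step) for v in values]


def encode_two_layers(samples, base_step, enh_step):
    """Return (base, enhancement) for a coarse base and a finer residual."""
    base = quantise(samples, base_step)
    # The enhancement layer codes what the base layer missed.
    residual = [s - b for s, b in zip(samples, base)]
    enhancement = quantise(residual, enh_step)
    return base, enhancement


def decode(base, enhancement=None):
    """Reconstruct from the base layer, refined by the enhancement if present."""
    if enhancement is None:
        return list(base)
    return [b + e for b, e in zip(base, enhancement)]


samples = [11, 23, 37, 52]
base, enhancement = encode_two_layers(samples, base_step=16, enh_step=4)

coarse = decode(base)
refined = decode(base, enhancement)

coarse_error = sum(abs(s - r) for s, r in zip(samples, coarse))
refined_error = sum(abs(s - r) for s, r in zip(samples, refined))
print(coarse, coarse_error)
print(refined, refined_error)
```

A receiver that gets only the base layer still obtains a usable picture, while one that also gets the enhancement layer reconstructs it with a smaller error.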

9.8.3 Spatial scalability

The arrangement of the enhancement layer pictures in spatial scalability is similar to that of SNR scalability. The only difference is that before the picture in the reference layer is used to predict the picture in the spatial enhancement layer, it is upsampled (interpolated) by a factor of two either horizontally or vertically (one-dimensional spatial scalability), or both horizontally and vertically (two-dimensional spatial scalability). Figure 9.34 shows the flow of the prediction in the base and enhancement layer pictures of a spatial scalable encoder.

Figure 9.34: Prediction flow in spatial scalability
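
As a rough illustration of how a lower resolution reference picture is brought to the enhancement layer dimensions, the sketch below doubles the picture size in one or both directions. It uses simple pixel replication as a stand-in for the interpolation filter that the standard defines, and the function name `upsample_2x` is hypothetical.

```python
def upsample_2x(picture, horizontal=True, vertical=True):
    """Double a picture (list of rows) by pixel replication.

    Pixel replication is an assumption made to keep the sketch short;
    it is not the interpolation filter specified for H.263.
    """
    out = []
    for row in picture:
        if horizontal:
            new_row = [px for px in row for _ in range(2)]
        else:
            new_row = list(row)
        out.append(new_row)
        if vertical:
            out.append(list(new_row))
    return out


base_picture = [[1, 2],
                [3, 4]]

two_dimensional = upsample_2x(base_picture)
horizontal_only = upsample_2x(base_picture, vertical=False)
vertical_only = upsample_2x(base_picture, horizontal=False)
print(two_dimensional)
```

The enhancement layer then codes the difference between the full resolution input and this resized reference, exactly as in the SNR case.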

9.8.4 Multilayer scalability

Multilayer scalability increases the robustness of H.263 against channel errors. In the multilayer scalable mode, it is possible for B-pictures to be temporally inserted not only between the base layer pictures of type I, P, PB and improved PB, but also between the enhancement picture types EI and EP, whether these consist of SNR or spatial enhancement pictures. It is also possible to have more than one SNR or spatial enhancement layer in conjunction with the base layer. Thus a multilayer scalable bit stream can be a combination of SNR layers, spatial layers and B-pictures. The size of a picture cannot decrease as the layer number increases. Figure 9.35 illustrates the prediction flow in a multilayer scalable encoder.

Figure 9.35: Positions of the base and enhancement layer pictures in a multilayer scalable bit stream

As with the two-layer case, B-pictures may occur in any layer. However, any picture in an enhancement layer which is temporally simultaneous with a B-picture in its reference layer must be a B-picture or the B-picture part of a PB or improved PB frame. This is to preserve the disposable nature of B-pictures. Note, however, that B-pictures may occur in layers that have no corresponding picture in the lower layers. This allows an encoder to send enhancement video with a higher picture rate than that of the lower layers.

The enhancement layer number and the reference layer number of each enhancement picture (B, EI, or EP) are indicated in the ELNUM and RLNUM fields, respectively, of the picture header (when present). If a B-picture appears in an enhancement layer in which temporally surrounding SNR or spatial pictures also appear, the reference layer number (RLNUM) of the B-picture shall be the same as the enhancement layer number (ELNUM). The picture height, width and pixel aspect ratio of a B-picture shall always be equal to those of its temporally subsequent reference layer picture.
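
The two B-picture constraints stated above lend themselves to a simple consistency check. The sketch below is only an illustration: the `LayerPicture` record, its field names and the function `check_b_picture` are hypothetical, the layer numbers and picture sizes in the example are arbitrary, and the pixel aspect ratio check is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerPicture:
    """Hypothetical header summary, for illustration only."""
    ptype: str   # "B", "EI", "EP", ...
    elnum: int   # enhancement layer number
    rlnum: int   # reference layer number
    width: int
    height: int


def check_b_picture(b, next_reference, layer_has_snr_or_spatial):
    """Return a list of the constraints that the B-picture violates."""
    problems = []
    # A B-picture sharing a layer with SNR or spatial pictures
    # must be predicted from that same layer.
    if layer_has_snr_or_spatial and b.rlnum != b.elnum:
        problems.append("RLNUM must equal ELNUM")
    # Its size must match the temporally subsequent reference layer picture.
    if (b.width, b.height) != (next_reference.width, next_reference.height):
        problems.append("size differs from subsequent reference picture")
    return problems


next_ep = LayerPicture("EP", elnum=2, rlnum=1, width=352, height=288)
good_b = LayerPicture("B", elnum=2, rlnum=2, width=352, height=288)
bad_b = LayerPicture("B", elnum=2, rlnum=1, width=176, height=144)

print(check_b_picture(good_b, next_ep, layer_has_snr_or_spatial=True))
print(check_b_picture(bad_b, next_ep, layer_has_snr_or_spatial=True))
```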

9.8.5 Transmission order of pictures

Pictures that depend on other pictures shall be located in the bit stream after the pictures on which they depend. The bit stream syntax order is specified such that for reference pictures (i.e. pictures of type I, P, EI or EP, or the P part of PB or improved PB frames), the following two rules shall be obeyed:

1. All reference pictures with the same temporal reference shall appear in the bit stream in increasing enhancement layer order. This is because each lower layer reference picture is needed to decode the next higher layer reference picture.
2. All temporally simultaneous reference pictures as discussed in item 1 above shall appear in the bit stream prior to any B-pictures for which any of these reference pictures is the first temporally subsequent reference picture in the reference layer of the B-picture. This is done to reduce the delay of decoding all reference pictures, which may be needed as references for B-pictures.

Then, the B-pictures with earlier temporal references shall follow (temporally ordered within each enhancement layer). The bit stream location of each B-picture shall comply with the following rules:

1. Be after that of its first temporally subsequent reference picture in the reference layer. This is because the decoding of the B-picture generally depends on the prior decoding of that reference picture.
2. Be after that of all reference pictures that are temporally simultaneous with the first temporally subsequent reference picture in the reference layer. This is to reduce the delay of decoding all reference pictures, which may be needed as references for B-pictures.
3. Precede the location of any additional temporally subsequent pictures other than B-pictures in its reference layer. Otherwise, the picture storage memory requirement for the reference layer pictures would increase.
4. Be after that of all EI and EP-pictures that are temporally simultaneous with the first temporally subsequent reference picture.
5. Precede the location of all temporally subsequent pictures within the same enhancement layer. Otherwise, needless delay would be introduced and the picture storage memory requirement for the enhancement layer would increase.

Figure 9.36 shows two allowable picture transmission orders given by the rules above for the layering structure shown as an example. Numbers next to each picture indicate the bit stream order, separated by commas for the two alternatives.

Figure 9.36: Example of picture transmission order
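
Two of the ordering requirements in this section, that a picture follows the pictures on which it depends and that reference pictures with the same temporal reference appear in increasing layer order, can be checked mechanically. The sketch below is a partial illustration only: it does not test the remaining B-picture rules, and the `Pic` record, the function `order_violations`, the layer numbers and the example stream are all hypothetical (the stream is not the one of Figure 9.36).

```python
from dataclasses import dataclass

REFERENCE_TYPES = {"I", "P", "EI", "EP"}


@dataclass(frozen=True)
class Pic:
    """Hypothetical picture record, for illustration only."""
    name: str
    tr: int          # temporal reference
    layer: int       # layer number
    ptype: str       # "I", "P", "EI", "EP" or "B"
    deps: tuple = () # names of the pictures used for prediction


def order_violations(stream):
    """Return descriptions of ordering violations in a bit stream order."""
    position = {p.name: i for i, p in enumerate(stream)}
    problems = []
    # Dependency rule: every picture follows the pictures it depends on.
    for p in stream:
        for d in p.deps:
            if d not in position or position[d] > position[p.name]:
                problems.append(f"{p.name} precedes its reference {d}")
    # Layer rule: same temporal reference implies increasing layer order.
    highest_layer = {}
    for p in stream:
        if p.ptype in REFERENCE_TYPES:
            if p.layer < highest_layer.get(p.tr, p.layer):
                problems.append(f"{p.name} is out of layer order")
            highest_layer[p.tr] = max(highest_layer.get(p.tr, p.layer), p.layer)
    return problems


i0 = Pic("I0", 0, 1, "I")
ei0 = Pic("EI0", 0, 2, "EI", ("I0",))
p2 = Pic("P2", 2, 1, "P", ("I0",))
ep2 = Pic("EP2", 2, 2, "EP", ("EI0", "P2"))
b1 = Pic("B1", 1, 2, "B", ("EI0", "EP2"))

valid_order = [i0, ei0, p2, ep2, b1]
invalid_order = [ei0, i0, p2, ep2, b1]

print(order_violations(valid_order))
print(order_violations(invalid_order))
```

In the invalid order the EI-picture is sent before the base layer I-picture it refines, so both checks report a problem.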