10.2 Video object plane (VOP)

In object-based coding, video frames are defined in terms of layers of video object planes (VOPs). Each video object plane is a video frame of a specific object of interest to be coded or interacted with. Figure 10.1a shows a video frame made of three VOPs. In this figure, the two objects of interest are the balloon and the aeroplane, represented by their video object planes VOP₁ and VOP₂. The remaining part of the video frame is regarded as the background, represented by VOP₀. For coding applications, the background is coded only once, and the other object planes are encoded through time. At the receiver, the reconstructed background is repeatedly added to the other decoded object planes. Since in each frame the encoder codes only the objects of interest (e.g. VOP₁ and/or VOP₂), and these objects usually represent a small portion of the video frame, the bit rate of the encoded video stream can be extremely low. Note that had the video frame of Figure 10.1a been coded with a conventional codec such as H.263, the moving clouds in the background would have forced the encoder to code most parts of the picture, at a much higher bit rate than that generated from the two objects.

Figure 10.1: (a) A video frame composed of (b) balloon VOP₁, (c) aeroplane VOP₂ and (d) the background VOP₀

The VOP can be a semantic object in the scene, such as the balloon and aeroplane in Figure 10.1. It is made of Y, U and V components plus their shapes. The shapes are used to mask the background and help to identify object borders.

In MPEG-4 video, the VOPs are either known by construction of the video sequence (hybrid sequences based on blue screen composition, or synthetic sequences) or are defined by semiautomatic segmentation. In the former case, the shape information is represented by eight bits, known as the greyscale alpha plane. This plane is used to blend several video object planes to form the video frame of interest. Thus with eight bits, up to 256 objects can be identified within a video frame. In the latter case, the shape is a binary mask that identifies individual object borders and their positions in the video frames.

Figure 10.2 shows the binary shapes of the balloon and aeroplane in the above example. Both cases are currently considered in the encoding process. The VOP can have an arbitrary shape. When the sequence has only one rectangular VOP of fixed size displayed at a fixed interval, it corresponds to frame-based coding. Frame-based coding is similar to H.263.

Figure 10.2: Shape of objects (a) balloon (b) aeroplane
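
As a rough illustration of how a greyscale alpha plane might be used to blend one plane over another, the following Python sketch treats an alpha value of 255 as fully opaque and 0 as fully transparent. This convention, the rounding rule and the function names are assumptions made for the example and are not taken from the MPEG-4 specification. A binary mask is handled as the special case in which alpha takes only the two extreme values.

```python
def blend_pixel(fg, bg, alpha):
    """Blend one 8-bit sample (assumed convention: 255 = opaque object, 0 = background)."""
    # Integer weighted average with rounding to the nearest value.
    return (alpha * fg + (255 - alpha) * bg + 127) // 255


def blend_plane(fg, bg, alpha):
    """Blend two equally sized 2-D sample arrays using a greyscale alpha plane."""
    return [
        [blend_pixel(f, b, a) for f, b, a in zip(fg_row, bg_row, a_row)]
        for fg_row, bg_row, a_row in zip(fg, bg, alpha)
    ]


def binary_to_alpha(mask):
    """Express a binary shape mask as an alpha plane holding only 0 or 255."""
    return [[255 if m else 0 for m in row] for row in mask]
```

With a binary mask the blend reduces to a simple selection: each output sample is copied either from the object or from the background.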

10.2.1 Coding of objects

Each video object plane corresponds to an entity that after being coded is added to the bit stream. The encoder sends, together with the VOP, composition information to indicate where and when each VOP is to be displayed. Users are allowed to trace objects of interest from the bit stream. They are also allowed to change the composition of the entire scene displayed by interacting with the composition information.

Figure 10.3 illustrates a block diagram of an object-based coding verification model (VM). After the video object planes have been defined, each VOP is encoded and the encoded bit streams are multiplexed into a single bit stream. At the decoder, the chosen object planes are extracted from the bit stream and are then composed into an output video to be displayed.

Figure 10.3: An object-based video encoder/decoder
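
The role of the composition information at the decoder can be sketched as follows. Each decoded VOP is pasted onto a copy of the background at the position given by its composition data, and the binary shape decides which samples belong to the object. The dictionary keys (`x`, `y`, `texture`, `mask`) and the function name `compose` are hypothetical, only a single sample plane is handled, and timing ("when") is ignored; a real MPEG-4 decoder carries this information in its own bit stream syntax.

```python
def compose(background, vops):
    """Paste each VOP onto a copy of the background using its binary shape mask."""
    frame = [row[:] for row in background]  # leave the stored background untouched
    height, width = len(frame), len(frame[0])
    for vop in vops:  # later VOPs are drawn over earlier ones
        x0, y0 = vop["x"], vop["y"]
        for dy, (tex_row, mask_row) in enumerate(zip(vop["texture"], vop["mask"])):
            for dx, (sample, inside) in enumerate(zip(tex_row, mask_row)):
                y, x = y0 + dy, x0 + dx
                # Copy only object samples that fall inside the output frame.
                if inside and 0 <= y < height and 0 <= x < width:
                    frame[y][x] = sample
    return frame
```

Because the background is copied rather than modified, the same reconstructed background can be reused for every output frame, and changing a VOP's `x` and `y` values moves the object without touching its coded data.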

10.2.2 Encoding of VOPs

Figure 10.4 shows a general overview of the encoder structure for each of the video object planes (VOPs). The encoder is mainly composed of two parts: the shape encoder and the traditional motion and texture encoder (e.g. H.263) applied to the same VOP.

Figure 10.4: VOP encoder structure

Before explaining how the shape and texture of the objects are coded, we first explain how a VOP should be represented for efficient coding.

10.2.3 Formation of VOP

The shape information is used to form a VOP. For maximum coding efficiency, the arbitrary shape VOP is encapsulated in a bounding rectangle such that the rectangle contains the minimum number of macroblocks. To generate the bounding rectangle, the following steps are followed:

1. Generate the tightest rectangle around the object, as shown in Figure 10.5. Since the dimensions of the chrominance VOP are half those of the luminance VOP (4:2:0), the top left position of the rectangle should be an even numbered pixel.
2. If the top left position of this rectangle is the origin of the frame, skip the formation procedure.
3. Form a control macroblock at the top left corner of the tightest rectangle, as shown in Figure 10.5.
4. Count the number of macroblocks that completely contain the object, starting at each even numbered point of the control macroblock. Details are as follows:
   1. Generate a bounding rectangle from the control point to the right bottom side of the object that consists of multiples of 16 × 16 pixel macroblocks.
   2. Count the number of macroblocks in this rectangle that contain at least one object pixel.
5. Select the control point that results in the smallest number of macroblocks for the given object.
6. Extend the top left coordinate of the tightest rectangle to the selected control coordinate.

Figure 10.5: Intelligent VOP formation

This will create a rectangle that completely contains the object but with the minimum number of macroblocks in it. The VOP horizontal and vertical spatial references are taken directly from the modified top left coordinate.
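
The formation procedure can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the verification model's code: the control macroblock is assumed to extend up and to the left of the tightest rectangle's corner, so the candidate control points are that corner shifted by even offsets from 0 to 14 pixels; candidates falling outside the frame are skipped; and ties are resolved in favour of the smallest shift. The function names are hypothetical, and `mask` is a binary shape given as a list of rows.

```python
MB = 16  # macroblock size in luminance pixels


def tightest_rectangle(mask):
    """Return (left, top, right, bottom), inclusive, around the object pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, m in enumerate(row) if m]
    if not rows:
        raise ValueError("mask contains no object pixels")
    left, top = min(cols), min(rows)
    # 4:2:0 chrominance is subsampled by two, so snap the corner to even values.
    return left - left % 2, top - top % 2, max(cols), max(rows)


def count_macroblocks(mask, x0, y0, right, bottom):
    """Count macroblocks of the grid anchored at (x0, y0) holding an object pixel."""
    occupied = set()
    for y in range(y0, bottom + 1):
        for x in range(x0, right + 1):
            if mask[y][x]:
                occupied.add(((x - x0) // MB, (y - y0) // MB))
    return len(occupied)


def form_vop(mask):
    """Return (horizontal reference, vertical reference, macroblock count)."""
    left, top, right, bottom = tightest_rectangle(mask)
    if (left, top) == (0, 0):
        # Corner already at the frame origin: skip the formation procedure.
        return 0, 0, count_macroblocks(mask, 0, 0, right, bottom)
    best = None
    for dy in range(0, MB, 2):       # even vertical shifts (assumed: upwards)
        for dx in range(0, MB, 2):   # even horizontal shifts (assumed: leftwards)
            x0, y0 = left - dx, top - dy
            if x0 < 0 or y0 < 0:
                continue  # candidate control point lies outside the frame
            n = count_macroblocks(mask, x0, y0, right, bottom)
            if best is None or n < best[2]:
                best = (x0, y0, n)
    return best
```

For a 48 × 48 mask holding a 2 × 2 patch at (20, 20) and a 16 × 16 patch at (30, 30), `form_vop` moves the corner from (20, 20) to (14, 14), which aligns the larger patch with a single macroblock and reduces the count from four macroblocks to two.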