This section reviews related research in video modeling, video segmentation techniques for the detection of CUTs (abrupt cuts) and GTs (gradual transitions), and wavelet theory for video content analysis.
Two approaches have been proposed to model the temporal and logical structure of video, namely the structured modeling approach  and stratification .
The structured modeling approach is based on the idea of dividing a video sequence into atomic shots. Each shot corresponds to an event and serves as the basic unit for manipulation. The contextual information between shots is modeled using additional higher-level constructs, so that the whole video is represented in a hierarchical concept structure. The basic idea of structure-based modeling is illustrated in Figure 8.1. This modeling reveals the basic need for video segmentation, which is the first step in extracting the lowest-level information.
Figure 8.1: Structure modeling of video.
Stratification modeling is more flexible: it is based on logical rather than temporal structure. The video sequence is modeled as overlapping chunks called strata, each of which represents a single concept that can be easily described. The content information of a subsequence is derived from the union of the descriptions of all associated strata. An example of stratification modeling of news video is shown in Figure 8.2.
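The stratum-union idea can be sketched in a few lines. In this hypothetical example, a stratum is a labeled frame interval, and the description of any subsequence is the union of the labels of all strata that overlap it; the labels and frame ranges are illustrative, not taken from any cited system.

```python
def describe(strata, start, end):
    """Return the set of concept labels whose strata overlap frames [start, end]."""
    return {label for (s, e, label) in strata if s <= end and e >= start}

# Illustrative strata for a news video: overlapping labeled intervals.
strata = [
    (0, 120, "anchor-person"),
    (60, 300, "election-report"),
    (250, 400, "weather"),
]

labels = describe(strata, 80, 100)   # {"anchor-person", "election-report"}
```

Because strata may overlap freely, a single frame range can inherit descriptions from several concepts at once, which is exactly the flexibility the stratification model offers over a strict shot hierarchy.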
Figure 8.2: Stratification Modeling of Video.
In another modeling approach,  builds a quad-tree over the temporal structure to facilitate navigation of video frames. In this model, the video is indexed at multiple temporal resolutions, each corresponding to a layer of the quad-tree. When browsing the video, switching between temporal resolutions results in different playback rates.
 only separates the sequence into units of frames, which is less flexible than shots for higher-level analysis. In addition, the static quad-tree structure stores much redundant data, because adjacent frames are usually very similar in content. Nevertheless, the idea of studying video at multiple temporal resolutions can be applied to the video segmentation problem.
In another multi-resolution approach,  models video by generating a trail of points in a low-dimensional space, where each point is derived from physical features of a single frame of the video clip. Intuitively, this produces clusters of points whose frames are similar in the reduced feature space and correspond to parts of the clip with little or no change in content. This modeling offers substantial dimensionality reduction and a good visual representation. Since the model maps the video content into another space, features of the new space can be used to analyze the content, which may yield new insights.
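A minimal sketch of the trail-of-points idea, under the simplifying assumption that each frame is reduced to its mean RGB color (the cited work uses richer physical features): frames with similar content land close together in the low-dimensional space, so a shot traces a compact cluster while a transition appears as a jump along the trail.

```python
def frame_to_point(frame):
    """Reduce a frame (a list of (r, g, b) pixel tuples) to its mean color."""
    n = len(frame)
    return tuple(sum(p[c] for p in frame) / n for c in range(3))

def trail(frames):
    """Map a frame sequence to a trail of points in 3-D feature space."""
    return [frame_to_point(f) for f in frames]
```

Consecutive points that stay close indicate a stable shot; a large inter-point distance suggests a content change worth examining.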
Based on the modeling techniques presented, it can be seen that video is a multi-level structured medium, which makes it well suited to study from a multi-resolution view. Instead of studying the video data stream directly, one can transform the stream into a trajectory in another feature space: phenomena that are not observable in the original data space may be observable in the new one. Another advantage is that the complexity of the problem may decrease as the dimensionality is reduced. Other similar modeling approaches may be found in .
Video segmentation is a fundamental step in video processing and has received a lot of attention in recent years. A number of techniques have been suggested for segmenting both raw and compressed video streams. These techniques can be divided broadly into six classes: pixel or block comparison, histogram comparison, methods based on DCT coefficients, methods based on motion vectors in MPEG-encoded video sequences, feature comparison, and video editing model-based methods. Some of these approaches are described in this section.
The change between frames can be detected by comparing the difference in intensity values of corresponding pixels in the two frames. The algorithm counts the number of pixels or regions changed, and a shot boundary is declared if the percentage of pixels or regions changed exceeds a certain threshold .
However, large camera or object movements within a shot may cause false shot boundaries to be detected by such an algorithm, because these movements produce pixel or region changes large enough to exceed the threshold. Fortunately, the effects of camera and object movement can be reduced to some degree by enlarging the regions over which average intensity values are computed.
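The pixel-comparison scheme can be sketched as follows. This is a simplified illustration, not the exact method of any cited work; the per-pixel and global thresholds are illustrative values.

```python
def is_cut(frame_a, frame_b, pixel_thresh=20, ratio_thresh=0.5):
    """Declare a shot boundary between two frames (flat lists of gray values
    of equal length) when the fraction of pixels whose intensity change
    exceeds pixel_thresh is larger than ratio_thresh."""
    changed = sum(1 for a, b in zip(frame_a, frame_b) if abs(a - b) > pixel_thresh)
    return changed / len(frame_a) > ratio_thresh
```

Replacing individual pixels with block averages before comparison is the region-enlarging variant mentioned above: averaging smooths out small displacements caused by camera or object motion.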
Nagasaka et al.  presented the earliest work demonstrating the importance of video segmentation. They observed that there is a large difference between the histograms of the two frames separated by a CUT, and they evaluated several basic functions for computing the histogram-difference measurement for CUT detection. Their implementations are based mainly on differences between gray-level and color histograms. Because histogram features are more global, they are less sensitive to camera motion and object movement. As in the pixel- or region-based approach, a shot boundary is declared when the difference between frames exceeds some threshold.
To eliminate momentary noise, they assume that its influence never extends over more than half a frame and apply sub-frame techniques to remove its effect. This assumption does not always hold in practice: a camera flash typically brightens the entire picture, and a large object moving quickly into view also violates the half-frame assumption. In this sense, their noise tolerance is rather limited.
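The histogram-comparison idea can be sketched with a gray-level histogram and a sum-of-absolute-bin-differences measure, one of the basic difference functions of the kind Nagasaka et al. evaluated; the bin count here is an illustrative choice, not theirs.

```python
def histogram(frame, bins=16):
    """Gray-level histogram of a frame given as a flat list of values in 0..255."""
    h = [0] * bins
    for v in frame:
        h[v * bins // 256] += 1
    return h

def hist_diff(frame_a, frame_b, bins=16):
    """Sum of absolute bin-count differences between two frame histograms."""
    ha, hb = histogram(frame_a, bins), histogram(frame_b, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb))

# A CUT is declared when hist_diff exceeds a chosen threshold.
```

Because the histogram discards spatial layout, moderate object motion within a shot barely changes it, which is why this measure is more robust than per-pixel comparison.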
Although the pixel- or region-based algorithms and histogram-based algorithms are not perfect, these techniques are well developed and can achieve a high level of accuracy in CUT detection. However, their accuracy for GT detection is not high. One of the most successful early methods that attempted to handle both CUTs and GTs is the twin-comparison method . Under the observation that a sharp change in content between successive frames indicates a CUT while a moderate change corresponds to a GT, the algorithm sets two thresholds. If the difference is above the upper threshold, a CUT is declared. If the difference lies between the upper and a lower threshold, the algorithm starts to accumulate the differences between successive frames; when this accumulated difference exceeds the upper threshold, a GT is declared.
The algorithm also analyses motion vectors in video frames to distinguish camera motions such as panning and zooming from GTs. It is simple and intuitive, and has been found effective on a wide variety of videos. The twin-comparison algorithm appears to be the first real attempt to solve the GT problem by extending histogram comparison from the CUT to the GT domain.
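The two-threshold logic above can be sketched as a single pass over a sequence of inter-frame differences. The threshold values are illustrative, and the motion-vector check that filters out camera motion is omitted for brevity.

```python
def twin_comparison(diffs, t_lo=5, t_hi=30):
    """Detect transitions in a list of inter-frame differences.
    Returns (kind, index) pairs: a CUT where a difference exceeds t_hi,
    a GT where differences between t_lo and t_hi accumulate past t_hi."""
    transitions = []
    acc, start = 0.0, None
    for i, d in enumerate(diffs):
        if d >= t_hi:                      # sharp change: CUT
            transitions.append(("CUT", i))
            acc, start = 0.0, None
        elif d >= t_lo:                    # moderate change: accumulate
            if start is None:
                start = i
            acc += d
            if acc >= t_hi:                # accumulated change: GT
                transitions.append(("GT", start))
                acc, start = 0.0, None
        else:                              # change subsided: reset
            acc, start = 0.0, None
    return transitions
```

A sequence with one sharp jump and one stretch of moderate differences, e.g. `twin_comparison([1, 1, 40, 1, 10, 10, 15, 1])`, yields one CUT and one GT.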
Solving the video segmentation problem by extracting image features from the decompressed video stream is obviously not very efficient. A number of researchers have developed techniques that work directly on the MPEG video stream without decompression .
In MPEG compression, the image is divided into a set of 8-by-8-pixel blocks. The pixels in each block are transformed into 64 coefficients using the discrete cosine transform (DCT), which are then quantized and Huffman entropy encoded. Consequently, the most popular way to work with a compressed video stream is to use the DCT coefficients, since these frequency-domain coefficients are mathematically related to the spatial domain.
The best-known work in the compressed domain is the DC image introduced by . DC images are generated from a few coefficients (DC + 2AC) of each original frame, and the reduced images retain a content representation similar to that of the full frames. The authors also applied template matching and global color statistics comparison (RGB color histograms) on DC images to solve the video segmentation problem . They first computed the inter-frame difference sequence and then applied a sliding window of fixed size m with a dynamic threshold to declare video transitions. Their experimental results showed the algorithm running about 70 times faster than raw-domain methods with similar accuracy.
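One common form of sliding-window dynamic thresholding over a difference sequence can be sketched as follows: a difference is declared a transition when it is the maximum of its window and sufficiently larger than the second-largest value. The window size and the factor k are illustrative, not the parameters of the cited work.

```python
def detect(diffs, m=7, k=2.0):
    """Return indices of diffs declared transitions: the center of each
    length-m window must be the window maximum and at least k times the
    second-largest value in that window."""
    cuts = []
    half = m // 2
    for i in range(half, len(diffs) - half):
        window = diffs[i - half:i + half + 1]
        ranked = sorted(window, reverse=True)
        if diffs[i] == ranked[0] and diffs[i] >= k * ranked[1]:
            cuts.append(i)
    return cuts
```

The appeal of the dynamic threshold is that an isolated spike stands out against its local neighborhood, so no single global threshold has to fit the whole video.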
In MPEG-compressed video, motion information is readily available in the form of motion vectors and the coding type of the macroblocks. The compressed data consists of I-, P- and B-frames. An I-frame is completely intra-frame coded. A P-frame is predictively coded with motion compensation from past I- or P-frames. Both of these frame types are used for bi-directional motion compensation of B-frames.
Macroblock type information is a simple video coding feature that is very helpful for video indexing and analysis. In I-frames, all macroblocks are intra-coded. In P-frames, a macroblock may be forward predicted, intra-coded, or skipped. In B-frames, a macroblock is one of five possible types: forward predicted, backward predicted, bi-directionally predicted, intra-coded, or skipped. The counts of these macroblock types and their ratios are very useful.
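Computing those type ratios is straightforward; as a hedged sketch, methods in the style of Meng et al. exploit statistics such as a B-frame whose macroblocks are predominantly backward predicted, which suggests a shot boundary just before it, since the encoder then predicts mostly from the future reference. The type labels below are illustrative strings, not syntax from the MPEG bitstream.

```python
from collections import Counter

def mb_ratios(mb_types):
    """Given a list of macroblock type labels (e.g. 'fwd', 'bwd', 'bi',
    'intra', 'skip'), return the fraction of each type in the frame."""
    total = len(mb_types)
    return {t: n / total for t, n in Counter(mb_types).items()}
```

A detector would then threshold a quantity such as `ratios.get('bwd', 0) / max(ratios.get('fwd', 0), 1e-9)` per B-frame.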
Zhang et al.  have experimented with motion-based segmentation using motion vectors as well as DCT coefficients. Similarly, Meng et al.  use the ratio of intra-coded macroblocks to detect scene changes in P- and B-frames. Kobla et al.  use macroblock type information for shot boundary detection, and Nang et al.  compute macroblock type changes to detect shot boundaries.
Camera motion should be taken into account during shot segmentation. Kobla et al.  and Zhang et al.  estimate camera pan and tilt using motion vectors.
The motion vectors represent only a sparse approximation to the real optical flow, and the macroblock coding type is encoder-specific. Gargi et al.  noted that block-matching methods do not perform as well as intensity/color-based algorithms.
GTs are produced during the editing stage, when a video editor merges different shots using a variety of techniques to create different types of gradual transitions. GT detection can therefore be considered the inverse of the editing process, and it relates directly to the particular editing techniques applied. By modeling the techniques used to produce different video transitions, researchers can find ways to solve this problem .
Hampapur et al.  proposed a framework for video segmentation that builds a different video editing model for each type of GT. Their work classifies GTs according to three basic editing models: chromatic edits, spatial edits, and mixed chromatic and spatial edits. When processing a video, they calculate chromatic images and spatial images from the difference images of consecutive frames. By observing the different responses of these two images, they can not only declare a video transition but also roughly classify it. The disadvantage of their technique is that each new type of transition requires a new response threshold to be set for detection.
There are many other approaches that handle different types of GTs.  adopted a statistical approach that analyzes the distribution of difference histograms and characterizes different types of transitions by their distribution patterns. Zabih et al.  proposed a feature-based algorithm that uses patterns of increasing and decreasing edges in the video frames to detect GTs. To date, more than one hundred algorithms or approaches have been proposed for the video segmentation problem. Recent research focus is moving from CUTs to GTs, from simple shot detection to more complex scene detection, and increasingly into the compressed domain. The amount of research carried out on this topic demonstrates that it is an important and difficult problem.
Wavelet analysis is a mathematical theory that was introduced to the engineering world in recent years. Wavelets are functions that satisfy certain mathematical requirements and are used to represent data or other functions; wavelet algorithms process data at different scales or resolutions. This enables gross features as well as small details of the data to be observed at the same time, achieving the effect of seeing both the forest and the trees, so to speak. With this advantage, wavelets are well suited to multi-resolution analysis and have been used in many fields, including data compression, astronomy, acoustics, nuclear engineering, sub-band coding, signal and image processing, neurophysiology, music, magnetic resonance imaging, speech discrimination, optics, fractals, turbulence, earthquake prediction, radar, human vision and pure mathematics.
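The multi-scale decomposition can be illustrated with a single level of the (unnormalized) Haar transform, the simplest wavelet: it splits a signal into a coarse approximation (pairwise averages) and detail (pairwise differences), and applying the step recursively to the approximation yields the full multi-resolution view.

```python
def haar_step(signal):
    """One level of the unnormalized Haar wavelet transform.
    Expects an even-length sequence; returns (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

a, d = haar_step([4, 2, 5, 5])   # a = [3.0, 5.0], d = [1.0, 0.0]
```

The detail coefficients are large exactly where the signal changes within a pair, which is why wavelet coefficients make good change detectors.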
Wavelets have also been applied to the video segmentation problem.  used wavelets to spatially decompose each frame into low- and high-resolution components to extract the edge spectrum average feature for fade detection, and applied the double chromatic difference on the low-resolution component to identify dissolve transitions.
The temporal multi-resolution analysis (TMRA) approach used in this chapter builds on the ideas presented in the early works reviewed. Essentially, TMRA models video streams as trails in a low-dimensional space, as is done in . However, TMRA adopts a global or spatial color representation for each video frame and maps the frames into a fixed low-dimensional space. TMRA also performs multi-resolution wavelet analysis on the features, as adopted in . However, those approaches analyse one video frame at a time and are thus essentially single-temporal-resolution approaches. In contrast, TMRA considers varying numbers of frames, and thus different temporal resolutions, in the analysis. By analysing video transitions in the multiple-temporal-resolution space, we are able to develop a unifying approach to characterizing different types of transitions, whereas methods that rely on a single temporal resolution need different techniques to handle different types of transitions.
The only other work that uses the temporal resolution idea is . However, that approach uses a fixed temporal resolution of 20, which can handle only short transitions of about 12–30 frames in length. Through the simultaneous analysis of multiple temporal resolutions, TMRA can handle transitions of arbitrary length, and its detection is not sensitive to threshold selection. This chapter is based on our published work on TMRA for video segmentation , with several modifications and improvements.
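The core intuition of analysing transitions at several temporal resolutions can be sketched as follows. This is a simplified illustration of the principle, not the TMRA algorithm itself: a per-frame feature signal is differenced over windows of increasing size, so a sharp CUT produces a peak at every resolution, while a GT spread over many frames shows up only at coarser resolutions. The window sizes are illustrative.

```python
def multi_resolution_diff(features, scales=(1, 2, 4, 8)):
    """features: 1-D list of per-frame feature values.
    For each window size w, compare the mean of the w frames before each
    position with the mean of the w frames after it."""
    out = {}
    for w in scales:
        diffs = []
        for i in range(w, len(features) - w + 1):
            before = sum(features[i - w:i]) / w
            after = sum(features[i:i + w]) / w
            diffs.append(abs(after - before))
        out[w] = diffs
    return out
```

For a step-like signal (a CUT), every scale shows a strong peak; for a slow ramp (a GT), the fine scales stay flat while the coarse scales peak, and the scale at which the peak appears indicates the transition length.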