Digital video has become ubiquitous in today's society. It comes in many forms, including music videos, movies, surveillance video, Unmanned Aerial Vehicle (UAV) video, home movies, and news broadcasts. Although there are many similarities, each type of video has its own distinct and defining characteristics. For example, news broadcasts can be thought of as a series of anchorperson and story shots [2, 4], while UAV video consists of constant and continuous camera motion. As a result of these and other distinct characteristics, there is no catch-all segmentation algorithm. If the type of video is not known before analysis, how does one determine the best possible segmentation algorithm to use? One answer is to develop a composite of the existing algorithms that is sensitive to the video under analysis. To date, little research has addressed creating a composite video segmentation solution. This section describes the current research in this area as well as suggestions for developing a composite solution.
Document retrieval methods have shown increased performance when combining results from various document representations. Katzer et al. compared text document retrieval performance using different document representation methods. Their results showed that the different document representations retrieved different sets of relevant documents. As a result, performing information retrieval with multiple document representations improved retrieval performance over using a single method. Just as document retrieval methods rely on various document representations, the various video segmentation algorithms rely on different characteristics of the underlying video. Color-based methods create histograms of the frame content and compute distance metrics between histograms to search for shot boundaries [10, 12, 13, 15–17]. Model-based methods create Hidden Markov Models (HMMs) of each possible state and transition in a video sequence [14, 22]. Edge-based methods utilize edge maps of the frame content to search for segmentation boundaries [12, 20]. Motion-based methods rely on velocity and displacement vectors to compute the amount of motion between video frames and thereby determine shot boundaries. The results obtained by the document filtering community from combining multiple methods suggest that the various digital video representations can likewise be combined to increase shot boundary detection performance.
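As an illustration of the color-based family of methods described above, the following is a minimal sketch of threshold-based histogram differencing. The bin count, fixed threshold, and L1 distance are illustrative choices, not values taken from the cited algorithms.

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=16):
    """L1 distance between normalized gray-level histograms of two frames."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 255))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 255))
    ha = ha / ha.sum()  # normalize so frame size does not matter
    hb = hb / hb.sum()
    return float(np.abs(ha - hb).sum())

def detect_boundaries(frames, threshold=0.5):
    """Flag a shot boundary wherever consecutive frame histograms
    differ by more than a fixed threshold."""
    return [i for i in range(1, len(frames))
            if histogram_difference(frames[i - 1], frames[i]) > threshold]
```

A real implementation would operate on color channels and, as discussed later in this section, replace the fixed threshold with an adaptive one.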
Research efforts, to date, have focused on the implementation of segmentation algorithms independently [5, 8, 16, 49]. Limited research has attempted to combine the shot boundary detection algorithms into a composite system. The process of combining various input sources can be achieved in a variety of ways. Most of the research on combining multiple input sources has come from the area of document retrieval [51–58]. In this area, researchers attempt to combine multiple representations of queries, documents, or multiple retrieval techniques. Research in this field has shown that significant improvements can be achieved by combining multiple sources of evidence [51, 56]. One application that attempts to incorporate results from multiple input sources is the metasearch engine, a system that provides access to multiple existing search engines. Its primary goal is to collect and reorganize the results of user queries submitted to multiple search engines. There has been considerable research with respect to the effectiveness and performance of metasearch engines [59–66]. The result-merging step of a metasearch engine combines the query results of several search engines into a single result. A metasearch engine usually associates a weight or similarity measure with each retrieved document and returns a ranked list of documents based on this value. These weights can be derived by adjusting the document rank values of the local search engines or by defining a global value based on all the retrieved documents. This type of approach focuses only on the results of multiple searches and not on a combination of the actual search engines.
Browne experimented with combining three shot boundary detection algorithms: color histograms, edge detection, and encoded macroblocks. The experiments showed that a dynamic-threshold implementation of each algorithm improved shot boundary detection performance. Weighted Boolean logic was used to combine the three algorithms, which works as follows. The three shot detection algorithms are executed in parallel, each with its own dynamic threshold, and a shot boundary is determined in a hierarchical manner. If the color histogram algorithm is above its adaptive threshold, a shot boundary is detected. Otherwise, if the edge detection algorithm is above its adaptive threshold and the color histogram algorithm is above a minimum threshold, a shot boundary is detected. Lastly, if the encoded macroblock algorithm is above its threshold and the color histogram algorithm is above its minimum threshold, a shot boundary is detected.
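The hierarchical Boolean logic just described can be sketched as follows. The function and parameter names are hypothetical; in Browne's system the per-algorithm thresholds are adaptive rather than passed in as constants.

```python
def hierarchical_combine(hist_score, edge_score, macro_score,
                         hist_thresh, edge_thresh, macro_thresh, hist_min):
    """Hierarchical Boolean combination of three detector scores,
    following the ordering described in the text."""
    # Rule 1: histogram evidence alone is sufficient.
    if hist_score > hist_thresh:
        return True
    # Rule 2: edge evidence, backed by minimal histogram evidence.
    if edge_score > edge_thresh and hist_score > hist_min:
        return True
    # Rule 3: macroblock evidence, backed by minimal histogram evidence.
    if macro_score > macro_thresh and hist_score > hist_min:
        return True
    return False
```

Note that every rule depends on the color histogram score, which is what gives the combination its hierarchical character.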
One problem with the Browne implementation is that the three algorithms are not truly combined. Each algorithm runs independently, and only their results are combined for further analysis. As a result, the system does not try to determine the most appropriate algorithm to use based on the content, nor does it measure how well an algorithm works for certain content; all three algorithms always run simultaneously. Additionally, the dynamic thresholds are utilized only within the context of each individual algorithm. There is no mechanism for the combined algorithms to adapt to a global threshold.
Figure 12.1 depicts an adaptive video shot boundary detection technique motivated by the metasearch engine approach. Each algorithm is run independently and the results of each algorithm are fused together by a data fusion technique. Data fusion can be facilitated by Boolean logic or a weighting scheme.
Figure 12.1: Multiple Methods
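One simple form the weighting scheme mentioned above could take is a weighted vote over the per-algorithm Boolean decisions. This is a minimal sketch under assumed uniform normalization; the weights and threshold are placeholders, not values from any published system.

```python
def fuse_decisions(decisions, weights, threshold=0.5):
    """Decision-level data fusion: a weighted vote over the Boolean
    boundary decisions of independently run algorithms."""
    if len(decisions) != len(weights):
        raise ValueError("one weight per algorithm is required")
    vote = sum(w for d, w in zip(decisions, weights) if d)
    return vote / sum(weights) > threshold
```

Setting all weights equal reduces this to majority voting; unequal weights let the fusion engine favor algorithms known to work well on the current content.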
An improved approach to creating an adaptive system fusing multiple approaches draws on experience gained in document filtering. Document filtering attempts to select documents identified as relevant to one or more query profiles. Documents are accepted or rejected independently of previously or subsequently examined documents. Hull et al. developed statistical classifiers that estimate the probability of relevance and combined these relevance measures to improve filtering performance. The solution they chose is based on machine learning, with each classifier utilizing a different learning algorithm. The hypothesis in their experiment was that different learning algorithms use different document representations and optimization strategies; thus the combination of multiple algorithms would perform better than any individual method. This study concluded that the combination of multiple strategies could improve performance under various conditions.
Associating probabilities with each shot detection algorithm could facilitate transitioning this approach to the video shot boundary domain. Different learning algorithms use different document representations; similarly, the various shot boundary detection techniques require different representations of the video. Histogram-based techniques require the creation of color histograms for each video frame, whereas motion-based techniques require the creation of motion vectors to determine object and camera motion between frames. These probabilities could express how confident a particular type of algorithm is in the presence of a shot boundary. Figure 12.2 depicts a multiple method video shot boundary detection technique motivated by the combination of multiple probabilities from different algorithms.
Figure 12.2: Multiple Probability Methods
One technique for combining the results of multiple methods would be simple averaging of the probabilities. If the average probability of the multiple methods is above a threshold, a shot boundary is detected. Hull et al. state that it may be more useful to average the log-odds ratios instead. If one algorithm determines with high probability that a shot boundary is detected or not detected, the average log-odds will reflect this certainty much more directly than the average probability will. One problem with this technique is that each algorithm's probability is given equal weight. A better strategy would be to weight the algorithms based on the content, since certain algorithms perform better on some content than others.
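The contrast between averaging probabilities and averaging log-odds can be sketched as follows. The decision threshold of zero (even odds) is an illustrative choice.

```python
import math

def average_log_odds(probs):
    """Mean of the per-algorithm log-odds log(p / (1 - p)).
    A probability near 0 or 1 pulls this mean far more strongly
    than it pulls a plain average of probabilities."""
    return sum(math.log(p / (1.0 - p)) for p in probs) / len(probs)

def is_boundary(probs, logit_threshold=0.0):
    """Declare a boundary when the mean log-odds favors 'boundary'."""
    return average_log_odds(probs) > logit_threshold
```

For example, with probabilities [0.99, 0.5, 0.5] the plain average is about 0.66, while the strongly positive mean log-odds makes the one confident detector's certainty dominate the decision.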
This method instruments each segmentation method to assign a probability, or likelihood, that a shot boundary has been observed. The wide variety of algorithms must be benchmarked to ascertain performance over a wide class of video content. These benchmark results will be used to derive a statistical method that assigns confidence values to results based on content analysis. For example, the confidence measure for a threshold-based histogram method may be based on variance properties of the histogram differences. Large variances indicate content that is not well modelled by histogram methods, so the confidence will be low.
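The variance-based confidence measure suggested above might be sketched as follows. The exponential mapping and its scale parameter are assumptions chosen only to illustrate the shape of such a measure; in practice it would be fitted from the benchmark data described in the text.

```python
import numpy as np

def histogram_confidence(recent_diffs, scale=1.0):
    """Map the variance of recent histogram-difference values to a
    confidence in (0, 1]: zero variance gives full confidence, and
    large variance (content poorly modelled by histograms) gives a
    confidence near zero."""
    return float(np.exp(-scale * np.var(recent_diffs)))
```

A fusion engine could multiply each algorithm's boundary probability by such a confidence before combining, down-weighting algorithms that are unreliable on the current content.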
Katzer et al. compared document retrieval performance using different document representation methods. They demonstrated that performing information retrieval with multiple document representations improved retrieval performance over using a single method. Transitioning this idea to the digital video domain suggests that the various digital video representations can be combined to increase shot boundary detection performance. Figure 12.3 depicts a multiple representation video shot boundary detection technique.
Figure 12.3: Multiple Inputs
In this technique, multiple video representations (histograms, motion vectors, edge maps, DC terms, etc.) can be used together to determine shot boundaries. This technique differs from the techniques described above in that it does not run multiple algorithms for each type of input source. Instead, all of the different video representations are sent to a data fusion engine, which uses them in a single unified algorithm to detect shot boundaries. This unified algorithm would be based on a translation of existing algorithmic solutions.
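In contrast to the decision-level fusion discussed earlier, such a unified algorithm fuses at the feature level. The sketch below assumes each representation has been reduced to a normalized change signal in [0, 1]; the feature names and weights are illustrative placeholders, not published values.

```python
def unified_boundary_score(features, weights):
    """Feature-level fusion: each entry of `features` is a normalized
    change signal derived from one representation (e.g. histogram
    difference, edge-change fraction, motion discontinuity, DC-term
    difference). Returns a combined score in [0, 1]."""
    if len(features) != len(weights):
        raise ValueError("one weight per representation is required")
    return sum(w * f for w, f in zip(weights, features)) / sum(weights)
```

Because the representations are combined before any per-algorithm decision is made, a weak signal in one representation can be reinforced or suppressed by the others, which is precisely what independent algorithms cannot do.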