2. Review of Video Segmentation and Summarization

Network constraints or bandwidth limitations do not always allow for the dissemination of high-bandwidth media such as video. Moreover, depending on its content, it may be inefficient to transmit an entire video. In a wireless environment, the available bandwidth often cannot accommodate the transmission of an entire video sequence, so there must be a mechanism to transmit only the information of relative importance. Often, the content of a video sequence can be captured and conveyed with a few representative images. Imagine recording the video of a still scene whose content remains fixed or changes minimally. It would be impractical and inefficient to capture and transmit the entire video because of the massive amount of redundancy; little if anything has changed from the first frame to the last. The content and semantic meaning of the video could be captured with a few frames, tremendously reducing the required bandwidth. Eliminating the redundancy is effective as long as it preserves the overall meaning and nature of the video. Much research has been conducted on developing content-based summarizations of digital video. The goal of this research has been to detect temporal boundaries, identify meaningful segments of the video, and extract and construct a subset of the video from the original via keyframes or key-sequences [8]. Thus, summarization can be thought of as a two-step process of video segmentation and abstraction.

Commercial video sequences can be thought of as a hierarchy: a sequence of stories, each composed of a set of scenes, which are in turn composed of a set of shots, with each shot containing a sequence of frames. The most common segmentation algorithms rely on low-level image features at the shot level to partition the video, because partitioning a video at the scene or story level is difficult; there is no standard or universal definition of a scene or story. A shot is an unbroken sequence of frames taken from one camera [9]. There are two basic types of shot transitions: abrupt and gradual. Abrupt transitions, called cuts, occur when a frame from a subsequent shot immediately follows a frame from the previous shot. These transitions are relatively easy to detect. Gradual transitions consist of a slow change from the frames of one shot to the frames of a different shot. They include cross-dissolves, fade-ins, fade-outs, and other graphical editing effects such as wipes [10]. A fade-in is a gradual increase in intensity from frame to frame; a fade-out is a slow decrease in brightness from frame to frame. In a cross-dissolve one shot is superimposed on another: as one gets dimmer, the other gets brighter. A dissolve can thus be considered an overlapping fade-in and fade-out [11]. Gradual transitions are more difficult to detect, because camera and object motion can interfere with their accurate detection and cause false positives.

There have been several research projects comparing and evaluating the performance of shot detection techniques. Koprinska et al. [9] provide a survey of existing approaches in the compressed and uncompressed domains. Dailianas [11] compared several segmentation algorithms across different types of video. Lienhart [12] evaluated the performance of various existing shot detection algorithms on a diverse set of video sequences with respect to the accuracy of each detection method and the recognition of cuts, fades, and dissolves. Lupatini et al. [13] compared and evaluated the performance of three classes of shot detection algorithms: histogram-based, motion-based, and contour-based. Boreczky and Rowe [14] compared various compressed and uncompressed video shot detection algorithms.

2.1 Frame-Based Techniques

The majority of shot-based segmentation algorithms operate in the uncompressed domain. A similarity measure between successive frames is defined and compared against a predetermined threshold. A cut is declared when the similarity between two frames falls below this threshold (equivalently, when their difference exceeds it). Gradual transitions can be detected by adaptive thresholding techniques [15] or by using cumulative difference measures.

One way to detect possible transitions between successive frames is to compare the corresponding pixel values of the two frames and count how many pixels have changed. If the number of changed pixels is above a predetermined threshold, a shot boundary is detected [9]. A potential problem with this implementation is its sensitivity to camera motion: if a camera moves a few pixels between two successive frames, a large number of pixels will be counted as changed. Zhang [15] attempts to reduce this effect with a smoothing filter: before each pixel comparison, the candidate pixel is replaced with the average value of the pixels in its 3x3 neighborhood. This filter also reduces noise in the input images.
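A minimal sketch of this pair-wise pixel comparison, assuming grayscale frames supplied as NumPy arrays; the 3x3 averaging filter follows the smoothing idea described above, and both threshold values are illustrative only:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def pixel_difference_cut(prev_frame, curr_frame,
                         pixel_thresh=20, change_ratio_thresh=0.4):
    """Flag a cut if enough pixels change between two grayscale frames."""
    # Smooth each frame with its 3x3 neighborhood average to reduce
    # sensitivity to noise and small camera motion.
    prev_s = uniform_filter(prev_frame.astype(np.float32), size=3)
    curr_s = uniform_filter(curr_frame.astype(np.float32), size=3)

    # Count pixels whose intensity changed by more than pixel_thresh.
    changed = np.abs(curr_s - prev_s) > pixel_thresh
    change_ratio = changed.mean()

    # Declare a shot boundary when the fraction of changed pixels is large.
    return change_ratio > change_ratio_thresh
```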

A likelihood ratio has been utilized to compare successive frames based on the assumption of uniform second-order statistics over a region [15]. In this algorithm each frame is subdivided into blocks and the corresponding blocks are compared based on the statistical characteristics (means and variances) of their intensity values. One advantage this method has over the pixel difference method is its improved tolerance to the noise associated with camera and object movement. Additionally, the likelihood ratio has a broader dynamic range than the percentage used with the pixel difference method [15], which makes it easier to choose a threshold that distinguishes changed from unchanged areas. A problem with the likelihood ratio algorithm is that two different corresponding blocks may have the same density function, in which case the boundary between them goes undetected.
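A sketch of the block-based likelihood ratio test, assuming grayscale NumPy frames. The formula below is the commonly cited second-order form built from per-block means and variances; the block size and thresholds are illustrative:

```python
import numpy as np

def likelihood_ratio(block_a, block_b, eps=1e-6):
    """Likelihood ratio between two corresponding blocks, based on
    their intensity means and variances (uniform second-order statistics)."""
    mu_a, mu_b = block_a.mean(), block_b.mean()
    var_a, var_b = block_a.var(), block_b.var()
    num = ((var_a + var_b) / 2.0 + ((mu_a - mu_b) / 2.0) ** 2) ** 2
    return num / (var_a * var_b + eps)

def block_likelihood_cut(prev_frame, curr_frame, block=16,
                         ratio_thresh=3.0, block_ratio_thresh=0.5):
    """Flag a cut when enough corresponding blocks exceed the ratio threshold."""
    h, w = prev_frame.shape
    changed, total = 0, 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = prev_frame[y:y + block, x:x + block].astype(np.float32)
            b = curr_frame[y:y + block, x:x + block].astype(np.float32)
            total += 1
            if likelihood_ratio(a, b) > ratio_thresh:
                changed += 1
    return changed / max(total, 1) > block_ratio_thresh
```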

Histogram-based techniques use a distance metric between histograms as a similarity measure. The basic assumption is that content does not change abruptly within shots, but does across shot boundaries [12]. Once an image has been represented as a histogram, various distance metrics can be used. Zhang et al. [15] concluded that the χ2 method of histogram comparison enhances the difference between two frames across a cut; however, it also increases the difference between frames with small camera and object movements. Additionally, they concluded that the overall performance of the χ2 method is little better than that of the absolute bin difference method, while being more computationally intensive.
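A sketch comparing the absolute bin difference and a χ2 distance between gray-level histograms of successive frames. The exact χ2 normalization varies across papers, so the form below (dividing by the bin sum) is one common choice, and the thresholds are illustrative:

```python
import numpy as np

def gray_histogram(frame, bins=64):
    """Normalized gray-level histogram of a frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist.astype(np.float64) / frame.size

def bin_difference(h1, h2):
    """Sum of absolute bin-wise differences."""
    return np.abs(h1 - h2).sum()

def chi_square(h1, h2, eps=1e-12):
    """One common chi-square form: (h1-h2)^2 / (h1+h2), summed over bins."""
    return ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()

def histogram_cut(prev_frame, curr_frame, metric=chi_square, thresh=0.25):
    """Flag a cut when the histogram distance exceeds the threshold."""
    h1, h2 = gray_histogram(prev_frame), gray_histogram(curr_frame)
    return metric(h1, h2) > thresh
```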

Histograms are attractive because they are effective at detecting abrupt changes between frames. They are also tolerant of translational and rotational motions about the view axis, and change gradually with the angle of view, change in scale, or occlusion [16]. However, they completely ignore the spatial distribution of intensity values within a frame: consecutive frames that have different spatial distributions but similar histograms are considered similar. Zhang et al. [15] and Nagasaka and Tanaka [17] are examples of color-based histogram implementations.

One solution to the problems associated with global histograms is to use local histograms, which segment a frame into smaller blocks and compute a histogram for each region. This method is tolerant of local changes in motion; however, it is still sensitive to changes in luminance over an entire frame [13]. Nagasaka and Tanaka [17] split each frame into 16 uniform blocks and evaluate the difference between histograms of corresponding blocks. The χ2 method is used to compute a distance metric between frames, and the largest block difference is discarded in order to reduce the effects of noise and of object and camera movement.
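A sketch of the local-histogram variant in the spirit of this approach, splitting each frame into 16 blocks, comparing corresponding block histograms, and discarding the largest block difference. It reuses the `gray_histogram` and `chi_square` helpers from the previous sketch, and the threshold is illustrative:

```python
import numpy as np

def block_histogram_distance(prev_frame, curr_frame, grid=4, bins=64):
    """Compare 4x4 = 16 corresponding blocks and drop the largest
    difference to suppress noise and local object/camera motion."""
    h, w = prev_frame.shape
    bh, bw = h // grid, w // grid
    diffs = []
    for gy in range(grid):
        for gx in range(grid):
            a = prev_frame[gy * bh:(gy + 1) * bh, gx * bw:(gx + 1) * bw]
            b = curr_frame[gy * bh:(gy + 1) * bh, gx * bw:(gx + 1) * bw]
            diffs.append(chi_square(gray_histogram(a, bins),
                                    gray_histogram(b, bins)))
    diffs.sort()
    return sum(diffs[:-1])   # discard the single largest block difference

def local_histogram_cut(prev_frame, curr_frame, thresh=2.0):
    return block_histogram_distance(prev_frame, curr_frame) > thresh
```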

Gargi et al. [10] experimented with computing histograms for various shot detection methods in different color spaces, including RGB, HSV, YIQ, XYZ, L*a*b*, L*u*v*, Munsell, and Opponent. They concluded that the Munsell space produced the best performance results. The Munsell space is attractive because it is close to the human perception of color [16]. Additionally, Zhang et al. [16] used a dominant color technique that examines only the colors corresponding to the histogram bins with the largest counts, based on the assumption that bins with small counts are likely to contain noise that distorts shot detection results.

Lupatini et al. [13] compared the performance of twelve shot detection methods based on histograms, motion, and contours. Although the best performance was achieved with histogram-based algorithms that use color or hue information, these algorithms failed in the presence of continuous camera motion, such as long pans that change the content of the scene. Boreczky [14] compared the performance of three histogram-based algorithms, a motion-based algorithm, and an algorithm based on DCT coefficients. They concluded that the histogram-based algorithms generally perform better than the motion-based and DCT-based algorithms, but do a poor job of identifying gradual transitions. As a result, the authors suggest using an additional edge-based algorithm to identify gradual transitions accurately. Histogram-based algorithms alone cannot accurately detect all shot boundaries; multiple methods need to be combined in order to characterize the different types of shot transitions.

Ferman et al. [18] introduced a temporal video segmentation technique based on 2-class clustering to eliminate the subjective nature of selecting thresholds. Video segmentation is treated as a 2-class clustering problem, where the two classes are "scene change" and "no scene change." The K-means clustering algorithm [19] is applied to the similarity measure of color histograms between successive frames. The χ2 method and the histogram difference method are used to compute the similarity metric in the RGB and YUV color spaces. In their experiments, the χ2 method in the YUV color space detected the largest number of correct transitions; however, when the complexity of the distance metric is factored in, the histogram difference method in the YUV color space was the best in terms of overall performance. One major advantage of this technique is that it eliminates the need to select a predetermined threshold. Additionally, multiple features can be used to improve the performance of the algorithm. In subsequent research, Ferman and Tekalp [18] utilize two features, histograms and pixel differences, for video segmentation via clustering.
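A minimal sketch of the threshold-free idea: compute a histogram distance for every consecutive frame pair, cluster the one-dimensional distance values into two groups with K-means, and label the pairs in the higher-mean cluster as scene changes. The simple K-means loop below is generic, not Ferman et al.'s exact procedure:

```python
import numpy as np

def two_class_kmeans(distances, iters=50):
    """Cluster 1-D distance values into 'no scene change' vs. 'scene change'."""
    d = np.asarray(distances, dtype=np.float64)
    centers = np.array([d.min(), d.max()])        # initial cluster centers
    for _ in range(iters):
        # Assign each distance to its nearest center, then update the centers.
        labels = (np.abs(d - centers[0]) > np.abs(d - centers[1])).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = d[labels == k].mean()
    change_cluster = int(np.argmax(centers))      # cluster with larger distances
    return [i for i, lab in enumerate(labels) if lab == change_cluster]

# Usage sketch, with the chi_square/gray_histogram helpers defined earlier:
# dists = [chi_square(gray_histogram(f0), gray_histogram(f1))
#          for f0, f1 in zip(frames[:-1], frames[1:])]
# boundaries = two_class_kmeans(dists)
```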

Zabih et al. [20] detect cuts, fades, dissolves, and wipes based on the appearance of intensity edges that are distant from edges in the previous frame. Counts of the edge pixels that appear far from existing edges (entering edge pixels) and the edge pixels that disappear far from existing edges (exiting edge pixels) are used to detect cuts, fades, and dissolves. The method is further improved to tolerate camera motion by performing motion compensation: the global motion between frames is calculated and used to align the frames before the entering and exiting edge pixels are detected. One disadvantage of this technique is that it cannot handle multiple moving objects [9]. Another disadvantage is false positives due to the limitations of edge detection: changes in image brightness, or low-quality frames where edges are harder to detect, may cause false positives. Lienhart [12] experimented with the edge-based segmentation algorithm of Zabih et al. [20] and concluded that false positives can arise from abruptly entering and exiting lines of text; to reduce these false positives, the classification of hard cuts was extended. Additionally, Lienhart [12] found that hard cuts from monochrome images were commonly classified as fades.
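A rough sketch of the entering/exiting edge-pixel idea (without the motion-compensation step), using a simple gradient-threshold edge detector and binary dilation to decide whether an edge pixel is "far" from any edge in the other frame; all parameters are illustrative:

```python
import numpy as np
from scipy.ndimage import binary_dilation, sobel

def edge_map(frame, edge_thresh=60.0):
    """Crude edge detector: gradient magnitude above a threshold."""
    gx = sobel(frame.astype(np.float32), axis=1)
    gy = sobel(frame.astype(np.float32), axis=0)
    return np.hypot(gx, gy) > edge_thresh

def edge_change_ratio(prev_frame, curr_frame, radius=3):
    """Fraction of entering and exiting edge pixels between two frames."""
    e_prev, e_curr = edge_map(prev_frame), edge_map(curr_frame)
    struct = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    near_prev = binary_dilation(e_prev, structure=struct)
    near_curr = binary_dilation(e_curr, structure=struct)

    entering = (e_curr & ~near_prev).sum() / max(e_curr.sum(), 1)  # new edges
    exiting = (e_prev & ~near_curr).sum() / max(e_prev.sum(), 1)   # vanished edges
    return max(entering, exiting)   # large values suggest a transition
```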

2.2 Other Techniques

Boreczky and Wilcox [21] segment video using Hidden Markov Models (HMM) [22]. The features used for segmentation are the distance between gray-level histograms, an audio distance based on the acoustic difference in intervals just before and just after the frames, and an estimate of the object motion between frames. States in the HMM are used to model fades, dissolves, cuts, pan, and zoom. The arcs between states model the allowable progression of states. The arc from a state to itself denotes the length of time the video is in that state. Transition probabilities and the means and variances of the Gaussian distribution are learned during the training phase using the Baum-Welch algorithm [22]. Training data consists of features (histograms, audio distance, and object motion) from a collection of video content. After the parameters are trained, segmenting the video into its shots and camera motions is performed using the Viterbi algorithm [22]. One advantage of this technique is that thresholds are not subjectively determined; they are learned automatically based on the training data. Another advantage of this technique is that it allows for the inclusion of multiple features in the training data.
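A compact Viterbi sketch to show how learned state parameters replace hand-picked thresholds. It uses a generic discrete-observation HMM with hypothetical state names rather than the actual Gaussian-output model and trained parameters of Boreczky and Wilcox:

```python
import numpy as np

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for a discrete-observation HMM.
    obs: sequence of observation indices; all probabilities are in log space."""
    n, T = len(states), len(obs)
    delta = np.full((T, n), -np.inf)       # best log-probability ending in each state
    back = np.zeros((T, n), dtype=int)     # backpointers for path recovery
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for j in range(n):
            scores = delta[t - 1] + log_trans[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] + log_emit[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

# Hypothetical usage: states = ["shot", "cut", "dissolve", "pan", "zoom"], with
# observations being quantized frame-difference features; the start, transition,
# and emission matrices would come from Baum-Welch training, not be hand-set.
```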

Most video is stored in a compressed format such as MPEG. As a result, there has been considerable interest in performing video segmentation directly on coded MPEG compressed video [23–27]. One advantage of working in the compressed domain is decreased computational cost, since full decoding is avoided and algorithm operations are faster [9]. Additionally, the encoded video inherently contains precomputed features, such as motion vectors and block averages, that can be exploited. However, the speed and efficiency of compressed-domain algorithms come at the cost of increased implementation complexity.

2.3 Summarization

Video summarization attempts to present a pictorial summary of an underlying video sequence in a more compact form, eliminating redundancy. It focuses on finding a smaller set of images to represent the visual content and presenting these keyframes to the user. Abstraction involves mapping an entire segment of video into a smaller number of representative images [16]. A still-image abstract, also known as a static storyboard, is a collection of salient still images or keyframes generated from the underlying video. Most summarization research involves extracting keyframes and developing a browser-based interface that best represents the original video. Li et al. [28] describe three advantages of a still-image representation: 1) still-image abstracts can be created much faster than moving-image abstracts, since no manipulation of audio or text information is necessary; 2) the temporal order of the representative frames can be displayed so that users can grasp concepts more quickly; and 3) extracted still images are available for printing, if desired. There have been numerous research efforts on keyframe extraction [29–31]. Li et al. [28] group these techniques into the following categories: sampling-based, shot-based, color-based, motion-based, mosaic-based, and segment-based keyframe extraction.

2.3.1 Still Image Representation

In some systems the first and last frame of each shot are selected to represent the shot content [26, 28, 32–34]. Although this bounds the total number of keyframes and makes that number known a priori, it is not an accurate or sufficient representation of the shot content, because it does not capture dynamic action or motion within a shot. Ideally, keyframes should be extracted based on the underlying semantic content, but semantic analysis of video remains a difficult research problem. As a result, most keyframe extraction techniques rely on low-level image features such as color and motion.

Zhang et al. [16] extract keyframes based on color content. Keyframes are extracted sequentially, and the density of the selection process can be adjusted; by default, the first and last frames of each shot are considered keyframes. Once a keyframe has been selected, a color histogram comparison is made between each subsequent frame and the previous keyframe, and when the distance metric exceeds a predetermined threshold a new keyframe is selected. The Munsell color space was chosen because of its relation to human perception [16]. The color space is quantized into 64 "super-cells" using a standard least-squares clustering algorithm, and a 64-bin histogram is calculated for each frame, with each bin holding the normalized count of the pixels that fall in the corresponding super-cell.
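A simplified sketch of sequential, color-driven keyframe selection: keep the first frame of a shot as a keyframe, then add a new keyframe whenever the histogram distance to the most recent keyframe exceeds a threshold. It uses a plain quantized RGB histogram rather than the 64-supercell Munsell quantization described above, and the threshold is illustrative:

```python
import numpy as np

def rgb_histogram(frame, bins_per_channel=4):
    """Coarse joint RGB histogram (4*4*4 = 64 bins), normalized."""
    q = (frame.astype(np.int64) // (256 // bins_per_channel)).reshape(-1, 3)
    idx = q[:, 0] * bins_per_channel ** 2 + q[:, 1] * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel ** 3)
    return hist.astype(np.float64) / hist.sum()

def select_keyframes(shot_frames, thresh=0.3):
    """Sequentially pick keyframes within a shot by color-histogram distance."""
    keyframes = [0]                                   # first frame is a keyframe
    last_hist = rgb_histogram(shot_frames[0])
    for i, frame in enumerate(shot_frames[1:], start=1):
        hist = rgb_histogram(frame)
        if np.abs(hist - last_hist).sum() > thresh:   # content drifted enough
            keyframes.append(i)
            last_hist = hist
    return keyframes
```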

Ferman and Tekalp [35] propose a keyframe extraction method based on the clustering of frames within a shot. Frames within a shot are clustered into a group based on a color histogram similarity measure. The frame closest to the center of the largest cluster is selected as the keyframe for the shot. Zhuang et al. [36] propose a method for keyframe extraction based on unsupervised clustering. The color histogram is used to represent the visual content within a frame. A 16x8 2D histogram is used in the HSV color space. For each cluster, the frame that is closest to the centroid is selected as the keyframe.
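A sketch of the clustering-based selection idea, assuming per-frame color-histogram feature vectors (for instance from the `rgb_histogram` helper above) and using scikit-learn's KMeans as a stand-in for the papers' own clustering procedures; the number of clusters is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_keyframes(frame_histograms, n_clusters=3, random_state=0):
    """Pick, for each cluster, the frame whose histogram is closest to the centroid."""
    X = np.asarray(frame_histograms)
    km = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10).fit(X)
    keyframes = []
    for k in range(n_clusters):
        members = np.where(km.labels_ == k)[0]
        if len(members) == 0:
            continue
        dists = np.linalg.norm(X[members] - km.cluster_centers_[k], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))   # closest to centroid
    return sorted(keyframes)
```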

Wolf proposes a motion-based keyframe selection algorithm based on optical flow [37]. The algorithm computes the flow field for each frame and uses the sum of the magnitudes of the optical flow components as a motion metric. Keyframes are selected at the local minima of motion between two local maxima. Dixon and Owen propose a method based on motion analysis and overlap management [38]. This method is optimized for content in which camera motion dominates the motion field and saliency requires coverage of the viewable area.
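A sketch of the motion-magnitude metric with OpenCV's dense Farnebäck optical flow standing in for Wolf's flow computation; keyframes are then taken at local minima of the per-frame motion curve, and the smoothing details are simplified away:

```python
import numpy as np
import cv2

def motion_metric(frames_gray):
    """Sum of optical-flow magnitudes between each pair of consecutive frames."""
    metric = []
    for prev, curr in zip(frames_gray[:-1], frames_gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        metric.append(np.hypot(flow[..., 0], flow[..., 1]).sum())
    return np.array(metric)

def motion_keyframes(frames_gray):
    """Select frames at local minima of the motion metric."""
    m = motion_metric(frames_gray)
    keyframes = []
    for i in range(1, len(m) - 1):
        if m[i] < m[i - 1] and m[i] < m[i + 1]:   # local minimum of motion
            keyframes.append(i + 1)               # m[i] lies between frames i and i+1
    return keyframes
```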

One of the main problems with keyframe extraction algorithms is that they produce an excessive number of keyframes for long video sequences. Additionally, due to redundant scenes and alternating shots, most keyframe extraction algorithms that operate on single shots produce redundant keyframes. In order to combat these problems, Girgensohn et al. [29] calculate an importance measure for each segment based on its rarity and duration. A pictorial summary is created from the segments whose scores are above a threshold, with keyframes selected near the center of each such segment. Uchihashi et al. [39] reduce the overall number of keyframes by developing pictorial summaries based on a shot importance measure and a frame-packing algorithm. Hierarchical clustering is used to group similar shots together and a score is given to each cluster; a keyframe is extracted from each cluster, and the size of the keyframe is based on the cluster's score. A frame-packing algorithm then lays out the different-sized keyframes for the user in a unified display.
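One plausible form of such an importance measure, loosely following the rarity-and-duration idea rather than reproducing either paper's exact formula: cluster the shots, weight each cluster by the fraction of total footage it covers, and score each shot by its length times the negative log of its cluster's weight, so that long shots from rare clusters score highest. The clustering itself is assumed to exist already:

```python
import numpy as np

def shot_importance(shot_lengths, shot_cluster_ids):
    """Importance of each shot: duration times rarity of its cluster.

    shot_lengths: length (in frames or seconds) of each shot.
    shot_cluster_ids: cluster label assigned to each shot by some
    similarity-based clustering (assumed given).
    """
    lengths = np.asarray(shot_lengths, dtype=np.float64)
    ids = np.asarray(shot_cluster_ids)
    total = lengths.sum()
    scores = np.zeros_like(lengths)
    for k in np.unique(ids):
        weight = lengths[ids == k].sum() / total       # fraction of footage in cluster
        scores[ids == k] = lengths[ids == k] * np.log(1.0 / weight)
    return scores   # threshold these scores to choose which shots to summarize
```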

2.3.2 Moving Image Representation

Video skims are short video clips consisting of a collection of image sequences, and the corresponding audio, extracted from the original longer video sequence. Video skims are a temporal multimedia abstraction that is played rather than viewed statically; they comprise the most relevant phrases, sentences, and image sequences. The goal of video skims is to present the original video sequence in an order of magnitude less time [40]. There are two basic types of video skimming: summary sequences and highlights [28]. Summary sequences give a user an impression of the entire video sequence, while a highlight skim contains only its most interesting parts. The MoCA project [41] developed automated techniques for extracting highlight video skims to produce movie trailers; scenes containing important objects, events, and people are used to build the skims. Selecting highlights from a video sequence is a subjective process, so most existing video-skimming work focuses on the generation of summary sequences [28].

The Informedia Project at Carnegie Mellon University [40, 42] utilizes speech, closed-caption text, speech processing, and scene detection to automatically segment news and documentary video. The project has created a digital library with over a terabyte of video data. One aspect of the summarization method of the Informedia Digital Video Library System is the partitioning of the video into shots and keyframes. Multi-level video summarization is provided through visual icons (keyframes with a relevance measure displayed as a thermometer), one-line headlines, static filmstrip views that use one frame per scene change, active video skims, and transcript following of the audio track. Text keywords are extracted from the transcript and the closed-caption text using a Term Frequency Inverse Document Frequency (TF-IDF) technique. Audio data is extracted from the segments corresponding to the selected keywords and their neighboring areas to improve audio comprehension. Image extraction favors frames with faces or text, frames following camera motion, frames with both camera motion and a face or text, and frames at the beginning of a video sequence. The video skim is created by combining the extracted audio and image material. Experiments with this skimming approach have shown impressive results on limited types of documentary video that have explicit speech or text content [28]. It remains unclear whether this technique can produce similar results on video with more complex audio content.
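A small sketch of TF-IDF keyword scoring over transcript segments, using scikit-learn's `TfidfVectorizer` as a generic stand-in for whatever weighting Informedia actually used; the example transcripts in the comment are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(transcripts, top_k=5):
    """Return the top_k highest TF-IDF terms for each transcript segment."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(transcripts)          # segments x terms
    terms = vectorizer.get_feature_names_out()
    keywords = []
    for row in tfidf.toarray():
        best = row.argsort()[::-1][:top_k]                 # highest-weighted terms
        keywords.append([terms[i] for i in best if row[i] > 0])
    return keywords

# Hypothetical usage on placeholder transcript segments:
# print(top_keywords(["the rover landed on mars and sent back images",
#                     "the senate debated the new budget bill"]))
```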

2.3.3 User Interfaces

The second aspect of video summarization involves developing user interfaces that best represent the original video sequence. User-interface design is the part of a system that combines the video technology with the users' requirements; if users cannot use the overall system effectively, the system fails. The major concern in developing effective user interfaces is the trade-off between the different levels and types of abstraction presented to the user [43]. The more condensed the abstraction, the easier it is for a potential user to browse; however, a condensed abstraction may not provide enough information to convey the overall meaning of the video. Conversely, a more detailed abstraction may give the user enough information to comprehend the video sequence, but the amount of information it contains may take a long time to browse.

Lee and Smeaton [43] surveyed and categorized the user interfaces of various browsing tools. Some of the criteria used in classifying the browsing tools were whether they had a single-keyframe display, a keyframe storyboard, options to change the density of the keyframes, an interactive hierarchical keyframe browser, timed playback of keyframes, video skims, VCR-like playback, and keyframes with playback synchronization. They concluded that in order to develop an effective user interface, it is necessary to identify the different classes of users and the features and functionality needed by each class.

Researchers have also used results from user studies to determine the user interface that maximizes efficiency and user comprehension. Christel et al. [40, 44] performed user studies on secondary school and college students on the effectiveness of the multiple levels of abstraction and summarization techniques utilized in the Informedia Digital Library [45]. The studies examined the relative value of each abstraction method and its impact on the understanding of the content. The abstraction methods used in the Informedia Digital Library are text titles or headlines, thumbnail keyframe images or poster frames, filmstrips, video skims, and match bars; the library thus provides the user with multiple levels of abstraction and summarization. Although speech and text processing can significantly aid in the creation of a rich summarization technique, these modalities are beyond the scope of our research.

Komlodi et al. [46] performed user studies on 30 undergraduate psychology majors to determine the effectiveness of three keyframe selection interfaces. Comparisons were made with respect to user performance, interface efficiency, and user satisfaction. The study used both static storyboard displays and dynamic slideshow displays; the three displays were a static storyboard with 12 keyframes, a static storyboard with 4 keyframes, and a dynamic slideshow with 12 keyframes. The dependent variables measured in the experiment were object identification, action identification, gist comprehension, selection precision, selection recall, examination time, and user satisfaction. The study concluded that static keyframe displays are better for object identification than dynamic keyframe displays. An adaptive algorithm must be able to determine when object identification is important for a certain type of video and provide a display that maximizes the user's understanding and comprehension of the content.

Wei et al. [47] studied the effectiveness of video summaries containing keyframes only, keywords and phrases only, and a combination of keyframes and keywords or phrases with respect to user comprehension and understanding. The study concluded that summaries containing both text and imagery are more effective than either modality alone.

2.4 Summary

This section has presented a broad view of the current research related to video segmentation and summarization. Most video segmentation research focuses on the creation of algorithms for specific classes of content. When the type of video is known a priori, any number of algorithms can be chosen with relative success. However, when the type of video is unknown, an adaptive method is needed to adjust to the type of content for the best possible segmentation result and the best summarization method for maximum user comprehension and efficiency.



