2. Brief Review of Related Work

To date, video content overviews have mainly been produced from keyframes extracted from the original video sequences. Many works focus on segmenting a video into shots and then selecting a fixed number of keyframes for each detected shot. Tonomura et al. used the first frame of each shot as its keyframe [1]. Ueda et al. represented each shot by its first and last frames [2]. Ferman and Tekalp clustered the frames in each shot and selected the frame closest to the center of the largest cluster as the keyframe [3].
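The cluster-based selection described above can be illustrated with a minimal Python sketch. It is not the implementation of [3]: the use of plain k-means, the evenly spaced seeding, the Euclidean distance, and the names `kmeans` and `keyframe_for_shot` are all assumptions made for illustration. Each frame is assumed to be already reduced to a numeric feature tuple, and `k` is assumed not to exceed the number of frames.

```python
from math import dist

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic, evenly spaced seeds (an assumption)."""
    step = max(1, len(points) // k)
    centers = [points[min(i * step, len(points) - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every frame to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Move each center to the mean of the frames assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

def keyframe_for_shot(features, k=2):
    """Index of the frame closest to the center of the largest cluster."""
    labels, centers = kmeans(features, k)
    largest = max(range(k), key=labels.count)
    members = [i for i, lab in enumerate(labels) if lab == largest]
    return min(members, key=lambda i: dist(features[i], centers[largest]))

# Hypothetical 2-D features for a six-frame shot: four similar frames, two outliers.
shot = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.0)]
keyframe_index = keyframe_for_shot(shot, k=2)
```

Whatever the shot length, this style of selection returns exactly one frame per shot, which is the fixed-number property the methods in this paragraph share.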

An obvious disadvantage of these fixed-number keyframe assignments is that they cannot adequately represent long shots in which camera pans and zooms, as well as object motion, progressively unveil the entire event. To address this problem, DeMenthon et al. proposed assigning a variable number of keyframes to each shot according to its activity level [4]. Their method represents a video sequence as a trajectory curve in a high-dimensional feature space, and uses a recursive binary curve splitting algorithm to find a set of perceptually significant points that approximate the video curve. The splitting is repeated until the approximation error falls below a user-specified value. The frames corresponding to the perceptually significant points are then used as keyframes to summarize the video content. Because the curve splitting algorithm assigns more points to curve segments of larger curvature, this method naturally assigns more keyframes to shots with more variation.
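As a rough illustration of recursive binary curve splitting, the following Python sketch keeps splitting a polyline at the point farthest from the chord joining the current segment's endpoints. The chord-distance error measure, the explicit stack, and the names `point_to_segment` and `split_curve` are assumptions made here; the text above does not specify how [4] measures the approximation error. The curve is assumed to contain at least two points, one per frame.

```python
from math import dist

def point_to_segment(p, a, b):
    """Euclidean distance from point p to the segment a-b, in any dimension."""
    ab = [bj - aj for aj, bj in zip(a, b)]
    denom = sum(x * x for x in ab)
    if denom == 0.0:
        return dist(p, a)
    # Parameter of the orthogonal projection, clamped to the segment.
    t = sum((pj - aj) * x for pj, aj, x in zip(p, a, ab)) / denom
    t = max(0.0, min(1.0, t))
    proj = tuple(aj + t * x for aj, x in zip(a, ab))
    return dist(p, proj)

def split_curve(curve, tol):
    """Sorted indices of the points kept once every segment's error is <= tol."""
    keep = {0, len(curve) - 1}
    stack = [(0, len(curve) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue  # no interior points left to test
        errs = [point_to_segment(curve[i], curve[lo], curve[hi])
                for i in range(lo + 1, hi)]
        worst = max(range(len(errs)), key=errs.__getitem__)
        if errs[worst] > tol:
            mid = lo + 1 + worst
            keep.add(mid)
            stack.extend([(lo, mid), (mid, hi)])
    return sorted(keep)

# Hypothetical 2-D trajectory: flat, then a sharp excursion at frame 4.
curve = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
         (4.0, 3.0), (5.0, 0.0), (6.0, 0.0)]
coarse = split_curve(curve, 2.0)
fine = split_curve(curve, 0.5)
```

Lowering `tol` adds points only around the excursion, while the flat stretch keeps just its endpoints; this mirrors the way more keyframes go to the more active part of a sequence.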

Keyframes extracted from a video sequence may contain duplications and redundancies. In a television program showing two people talking, the camera usually switches back and forth between the two speakers, with occasional overall views of the scene inserted. Applying the above keyframe selection methods to this kind of video sequence yields many keyframes that are almost identical. To remove such redundancies, Yeung et al. selected one keyframe from each video shot, performed hierarchical clustering on these keyframes based on their visual similarity and temporal distance, and then retained only one keyframe per cluster [5]. Girgensohn and Boreczky also applied hierarchical clustering, grouping the keyframes into as many clusters as the user specifies. For each cluster, a keyframe is selected so as to satisfy two constraints: an even distribution of keyframes over the length of the video, and a minimum distance between keyframes [6].
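The redundancy removal idea can be sketched in Python as a small agglomerative clustering over per-shot keyframes. This is a simplified illustration, not the method of [5] or [6]: it uses visual (Euclidean) distance only and ignores temporal distance, uses complete linkage, and keeps the earliest keyframe of each cluster. All of these choices, and the names `cluster_keyframes` and `representatives`, are assumptions.

```python
from math import dist

def cluster_keyframes(feats, n_clusters):
    """Complete-link agglomerative clustering; returns lists of keyframe indices."""
    clusters = [[i] for i in range(len(feats))]

    def link(a, b):
        # Complete linkage: distance between the two most dissimilar members.
        return max(dist(feats[i], feats[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)  # merge the closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def representatives(feats, n_clusters):
    """Keep one keyframe per cluster (here, simply the earliest one)."""
    return sorted(min(c) for c in cluster_keyframes(feats, n_clusters))

# Hypothetical features for six shots: speaker A, speaker B, A, B, wide view, A.
feats = [(0.0, 0.0), (10.0, 0.0), (0.2, 0.0), (10.1, 0.0), (5.0, 8.0), (0.1, 0.1)]
kept = representatives(feats, 3)
```

In this toy talk-show example the six per-shot keyframes collapse to three: one for each speaker and one for the overall view.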

Regardless of the keyframe selection method used, summarizing video content with keyframes has inherent limitations. A video program is a continuous audio/visual recording of real-world scenes, whereas a set of static keyframes captures neither the temporal properties nor the audio content of the original video. While keyframes are effective in helping the user identify desired frames or shots in a video, they are far from sufficient for the user to get a general idea of the video content and to judge whether the content is relevant. Moreover, a long video sequence, e.g., one or two hours, is likely to produce thousands of keyframes, and this excessive number of keyframes may well create another information flood rather than serve as an information abstraction.

There have been research efforts that strive to output motion video summaries in order to provide better content overviews. The CueVideo system from IBM provides two summarization functions: moving story board (MSB) and fast video playback. The MSB is a slide show that displays a string of keyframes, one for each shot, together with a synchronized playback of the entire audio track of the original video. The time scale modification (TSM) technique is employed to achieve a faster audio playback speed while preserving the timbre and quality of the speech [7]. Although the MSB can shorten the video watching time by 10% to 15%, it does not provide a content abstract of the original video. Therefore, rather than considering the MSB a video summarization tool, it is more appropriate to consider it a lower-bitrate, higher-speed video playback method. The fast video playback function of the CueVideo system, on the other hand, plays long, static shots at a faster speed (higher frame rate) and short, dynamic shots at a slower speed (lower frame rate) [8]. However, this variable frame rate playback makes static shots look more dynamic and dynamic shots look more static, and therefore dramatically distorts the temporal characteristics of the video sequence.

The Informedia system from Carnegie Mellon University provides the video skim, which strives to summarize the input video by identifying the video segments that contain either semantically important scenes or statistically significant keywords and phrases [9]. The importance of each video segment is measured using a set of heuristic rules that are highly subjective and content specific. This rule-based approach limits the ability of the system to handle diverse video sequences. Yahiaoui et al. proposed a similar method that summarizes multi-episode videos based on statistics as well as heuristics [10].