3. Summarization

Video summarization aims to provide an abstract of a video that shortens navigation and browsing of the original. The challenge is to present the content of a video concisely while preserving 'the essential message of the original' [38]. According to [39], there are two types of video abstracts: still-image and moving-image abstracts. Still-image abstracts, such as Video Manga [40][41], are presentations of salient images or key-frames, while moving-image abstracts consist of a sequence of selected image sequences. The former are referred to as video summaries, the latter as video skimming.

Video skimming is the more difficult process, since it requires audio-visual synchronization of the selected image sequences in order to produce a coherent abstract of the entire video content. Video skimming can be achieved through audio time-scale modification [42], which compresses the video and speeds up the audio and speech while preserving the timbre, voice quality and pitch. Another approach highlights the important scenes (sequences of frames) in order to build a video trailer. In [38][43], scenes with high contrast, scenes close to the average coloration of the video, and scenes with many distinct frames are automatically detected and integrated into the trailer, as they are assumed to be important scenes. Action scenes (containing explosions, gun shots, or rapid camera movement) are also detected. Closed captions can also be used to select audio segments that contain certain keywords [44]; together with the corresponding image segments, put in chronological order, these constitute an abstract. Clustering is also often used to gather video frames that share similar color or motion features [45]. Once the set of frame clusters is obtained, the most representative key-frame is extracted from each cluster, and the video skim is built by assembling the video shots that contain these key-frames. In [46], sub-parts of shots are processed with a hierarchical algorithm to generate the video summaries.
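The following is a minimal sketch of the clustering approach just described: frames are clustered by color histogram and one representative key-frame is selected per cluster. It assumes frames are already decoded as RGB arrays and that shot boundaries are computed elsewhere; the function names and the choice of k-means are illustrative, not the exact algorithm of [45].

```python
# Sketch: color-based frame clustering for skim generation (assumptions noted above).
import numpy as np
from sklearn.cluster import KMeans

def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Joint RGB histogram of one frame, flattened and L1-normalized."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)))
    hist = hist.ravel()
    return hist / hist.sum()

def select_keyframes(frames: list[np.ndarray], n_clusters: int = 10) -> list[int]:
    """Cluster frames by color and return one representative frame index per cluster."""
    features = np.stack([color_histogram(f) for f in frames])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    keyframes = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Representative key-frame = cluster member closest to the centroid.
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[dists.argmin()]))
    return sorted(keyframes)
```

The skim itself would then be assembled by concatenating, in chronological order, the shots containing the returned key-frames.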

Video summary can be seen as a simpler task, since it consists in extracting from a video the sequences of frames or video segments that best abstract its content, without considering audio segment selection and synchronization or closed captions. Video summaries can be composed of still images or of moving images, and may also exploit the cinematographic structure of the video (clips, scenes, shots, frames). The problem remains to define which parts of the video (still or moving images) are relevant enough to be kept in the summary.

Many existing approaches consider only signal-based summary generation. For instance, Sun and Kankanhalli [47] proposed Content Based Adaptive Clustering, which hierarchically removes clusters of images according to color differences. This work is conceptually similar to [48] (based on genetic algorithms) and [49] (based on singular value decomposition). Other approaches, like [50] and [51], try to use objects and/or background for summary generation, hoping for more meaningful results at the expense of increased complexity. Another signal-level feature present in videos is motion; the authors of [52] use motion and gesture recognition of people in the context of filmed talks with slides. In [53], MPEG-7 content representations are used to generate semantic summaries based on relevant shots, which can be subsampled according to the motion and colors in each shot.
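As a rough illustration of hierarchical reduction by color difference, loosely in the spirit of [47], the sketch below repeatedly drops the frame that contributes the least color change relative to the previously kept frame, until the requested summary length is reached. The L1 histogram distance and the greedy removal order are assumptions made for the example, not the exact measure of the cited work; `color_histogram` from the previous sketch is reused.

```python
# Sketch: greedy hierarchical reduction of a frame sequence by color difference.
import numpy as np

def reduce_by_color(features: np.ndarray, target: int) -> list[int]:
    """Keep `target` frame indices out of features (one histogram row per frame),
    dropping at each step the frame most similar to its kept predecessor."""
    kept = list(range(len(features)))
    while len(kept) > target:
        # L1 distance of each kept frame to the frame kept just before it.
        diffs = [np.abs(features[kept[i]] - features[kept[i - 1]]).sum()
                 for i in range(1, len(kept))]
        # Remove the frame adding the least new color information.
        kept.pop(int(np.argmin(diffs)) + 1)
    return kept
```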

In some specific repetitive contexts, such as electrocardiograms [54], the process uses a priori knowledge to extract summaries. Other contexts, like broadcast news [55], help the system identify the important parts of the original videos.

Some approaches also use multiple features to summarize videos: [55] uses closed-caption extraction and speaker change detection, while [44] extracts human faces and significant audio parts.
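A simple way to picture such multi-feature selection, in the spirit of [44][55], is to give each candidate segment a score from several independent detectors and rank segments by a weighted combination. The particular features, field names, and weights below are assumptions chosen for illustration only.

```python
# Sketch: weighted fusion of per-segment feature scores (hypothetical features).
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # seconds
    end: float            # seconds
    keyword_hits: int     # e.g., matches from closed-caption keyword spotting
    face_count: int       # e.g., faces found by a face detector
    audio_energy: float   # e.g., normalized significant-audio score in [0, 1]

def rank_segments(segments: list[Segment],
                  w_kw: float = 0.5,
                  w_face: float = 0.3,
                  w_audio: float = 0.2) -> list[Segment]:
    """Order segments by a weighted combination of their feature scores."""
    def score(s: Segment) -> float:
        return w_kw * s.keyword_hits + w_face * s.face_count + w_audio * s.audio_energy
    return sorted(segments, key=score, reverse=True)
```

An abstract would then be built from the top-ranked segments, restored to chronological order before playback.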

The process of defining video summaries is complex and prone to errors. We consider that the use of strata-like information in a video database environment makes it possible to produce meaningful summaries from high-level semantic annotations. To achieve this goal, we propose to use the powerful formalism of Conceptual Graphs to represent complex video content. We are currently developing this approach in a generator of adaptive video summaries called VISU (VIdeo SUmmarization). This system is based on the two models we present in the following section.



