3. Three Types of Video Summarization

Video is a medium that contains both an audio track and an image track. Depending on which track is analyzed for content summarization, a video summarization method falls into one of three basic categories: audio-centric summarization, image-centric summarization, and audio-visual summarization. An audio-centric summarization is produced by analyzing mainly the audio track of the original video and selecting those video segments that contain either important sounds or semantically important speech. Conversely, an image-centric summarization is produced by focusing on the image track and selecting those video segments whose image content is visually or semantically significant. Finally, an audio-visual summarization is produced by decoupling the audio and image tracks of the input video, summarizing the two tracks separately, and then integrating the two summaries according to certain alignment rules.
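The three categories can be contrasted with a minimal sketch. It assumes that each video segment already carries an importance score for its audio track and one for its image track. The `Segment` type, the score fields, the `top_k` helper, and the pairing rule used in `audio_visual` are all hypothetical and chosen only for illustration; the text above does not specify how scores are computed or which alignment rules are applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    index: int          # position of the segment in the original video
    audio_score: float  # hypothetical importance of the audio track
    image_score: float  # hypothetical importance of the image track


def top_k(segments, key, k):
    """Return the indices of the k highest-scoring segments, in playback order."""
    best = sorted(segments, key=key, reverse=True)[:k]
    return sorted(s.index for s in best)


def audio_centric(segments, k):
    # Select whole video segments by looking mainly at the audio track.
    return top_k(segments, lambda s: s.audio_score, k)


def image_centric(segments, k):
    # Select whole video segments by looking mainly at the image track.
    return top_k(segments, lambda s: s.image_score, k)


def audio_visual(segments, k):
    # Decouple the tracks, summarize each one separately, then integrate.
    audio = audio_centric(segments, k)
    image = image_centric(segments, k)
    # Placeholder alignment rule (an assumption): pair the i-th selected
    # audio segment with the i-th selected image segment in playback order.
    return list(zip(audio, image))


segments = [
    Segment(0, audio_score=0.9, image_score=0.1),
    Segment(1, audio_score=0.2, image_score=0.8),
    Segment(2, audio_score=0.7, image_score=0.3),
    Segment(3, audio_score=0.1, image_score=0.9),
]

print(audio_centric(segments, 2))  # [0, 2]
print(image_centric(segments, 2))  # [1, 3]
print(audio_visual(segments, 2))   # [(0, 1), (2, 3)]
```

In this toy input the audio-centric and image-centric summaries pick disjoint segments, while the audio-visual summary plays the audio of segments 0 and 2 over the images of segments 1 and 3, so the highest-scoring content of both tracks is retained.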

Audio-centric summarization is useful for video programs whose visual content is less significant than their audio content. Examples include news conferences, political debates, and seminars, where shots of talking heads, static scenes, and audiences make up most of the visual content. In contrast, image-centric summarization is useful for video programs that contain little significant audio content. Action movies, televised sports games, and surveillance videos are good examples of this kind of program.

Certain video programs do not have a strong synchronization between their audio and visual content. Consider a television news program in which an audio segment reports the number of casualties caused by a recent earthquake. The corresponding image segment could be a close shot of a reporter in the field, of rescue teams working at the scene of a collapsed building, or of a regional map showing the epicenter of the earthquake. The audio content is related to, but does not directly refer to, the corresponding image content. This production pattern is very common in video programs such as news and documentaries. In these programs, a video segment containing significant audio content does not necessarily contain significant visual content at the same time (and vice versa). An audio-visual summarization is therefore the most appropriate type, because it can maximize the coverage of both the audio and visual content of the original video without sacrificing either.