Chapter 10: Audio and Visual Content Summarization of a Video Program | Handbook of Video Databases: Design and Applications (Internet and Communications)

Yihong Gong
NEC USA, Inc.
C&C Research Laboratories
10080 North Wolfe Road, SW3-350
Cupertino, CA, U.S.A.
<ygong@ccrl.sj.nec.com>

1. Introduction

Video is a voluminous, redundant, and time-sequential medium whose overall content cannot be captured at a glance. The voluminous and sequential nature of video programs not only creates congestion in computer systems and communication networks, but also causes bottlenecks in human information comprehension, because humans process information at a limited speed. In the past decade, great efforts have been made to relieve the computer and communication congestion problems. However, across the whole loop of video content creation, processing, storage, delivery, and utilization, the human bottleneck problem has long been neglected. Without technologies that enable fast and effective content overviews, browsing through video collections and finding desired video programs in a long list of search results will remain arduous and painful tasks.

Automatic video content summarization is one of the promising solutions to the human bottleneck problem. Video summarization is a process intended to create a concise form of the original video in which important content is preserved and redundancy is eliminated. Similar to paper abstracts and book prefaces, video summaries are valuable for accelerating human information comprehension. A concise and informative video summary enables users to quickly perceive the general content of a video and helps them decide whether the content is of interest. In today's fast-paced world with floods of information, such a video summary will remarkably enhance the users' ability to sift through huge volumes of video data, and facilitate their decisions on what to take and what to discard. When accessing videos from remote servers, a compact video summary can be used as a video "thumbnail" of the original video, one that requires much less effort to download and comprehend. For most home users with limited network bandwidth, this type of video thumbnail can spare them from spending minutes or tens of minutes downloading lengthy video programs, only to find them irrelevant. With the deployment and evolution of third-generation mobile communication networks, it will soon be possible to access videos using palm computers and cellular phones. For wireless video access, video summaries will become a must because mobile devices have limited memory and battery capacities, and are subject to expensive airtime charges. For video content retrieval, a video summary will allow users to quickly browse through large video libraries and efficiently spot the desired videos in a long list of search results.

There are many possible ways to summarize video content. To date, the most common approach is to extract a set of keyframes from the original video and display them as thumbnails in a storyboard. However, keyframes extracted from a video sequence are a static image set that contains neither temporal properties nor audio information. While keyframes are effective in helping the user identify the desired shots in a video, they are far from sufficient for the user to get a general idea of the video content and to judge whether the content is relevant.

In this chapter, we present three video content summarization systems developed by multidisciplinary researchers at NEC USA, C&C Research Laboratories. These summarization systems are able to produce three kinds of motion video summaries: (1) audio-centric summary, (2) image-centric summary, and (3) audio-visual content summary. The audio-centric summary is created by using text summarization techniques to select video segments that contain semantically important spoken sentences; the image-centric summary is composed by eliminating redundant video frames and preserving visually rich video segments; and the audio-visual summary is constructed by summarizing the audio and visual contents of the original video separately, and then integrating the two summaries with a partial alignment. A bipartite graph-based audio-visual alignment algorithm was developed to efficiently find the best alignment solution between the audio and the visual summaries that satisfies the predefined alignment requirements. These three types of motion video summaries are intended to provide natural, diverse, and effective audio/visual content overviews for a broad range of video programs.
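To give a feel for what a bipartite graph-based alignment involves, the following is a minimal sketch of generic maximum bipartite matching between audio and visual summary segments. It is not the alignment algorithm developed in this chapter. The segment representation, the function names, and the "overlap in time" edge rule are all assumptions made purely for illustration; the actual alignment requirements are not specified in this introduction.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a segment is (start_sec, end_sec).
Segment = Tuple[float, float]


def overlaps(a: Segment, b: Segment) -> bool:
    """True when two time intervals share a positive-length span."""
    return min(a[1], b[1]) > max(a[0], b[0])


def build_edges(audio: List[Segment], visual: List[Segment]) -> Dict[int, List[int]]:
    """Assumed edge rule: audio segment i may align with visual segment j
    only when the two overlap in time."""
    return {i: [j for j, v in enumerate(visual) if overlaps(a, v)]
            for i, a in enumerate(audio)}


def max_bipartite_matching(edges: Dict[int, List[int]], n_visual: int) -> Dict[int, int]:
    """Augmenting-path (Kuhn's) maximum matching.
    Returns a mapping {audio_index: visual_index}."""
    match_of_visual = [-1] * n_visual  # audio index matched to each visual segment

    def try_assign(i: int, seen: Set[int]) -> bool:
        for j in edges[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take j if it is free, or if its current partner can move elsewhere.
            if match_of_visual[j] == -1 or try_assign(match_of_visual[j], seen):
                match_of_visual[j] = i
                return True
        return False

    for i in edges:
        try_assign(i, set())
    return {i: j for j, i in enumerate(match_of_visual) if i != -1}


# Toy data, invented for this example.
audio_segments = [(0.0, 5.0), (4.0, 9.0), (20.0, 25.0)]
visual_segments = [(3.0, 6.0), (0.0, 4.0), (30.0, 35.0)]
edges = build_edges(audio_segments, visual_segments)
matching = max_bipartite_matching(edges, len(visual_segments))
```

In this toy run the first audio segment initially takes the first visual segment, then gives it up so that the second audio segment, which has no other candidate, can be matched as well. The third audio segment overlaps nothing and stays unaligned, which loosely mirrors the idea of a partial alignment.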

In the next section, related video summarization work is briefly reviewed. A discussion of the different types of video programs and their appropriate summarization methods is provided in Section 3. Sections 4, 5, and 6 describe the audio-centric summarization, the image-centric summarization, and the audio-visual summarization methods, respectively. Section 7 provides a brief account of future video content summarization research. A glossary of important terms can be found at the end of the chapter.