Chapter 44: Segmenting Stories in News Video


Lekha Chaisorn, Tat-Seng Chua, and Chin-Hui Lee
School of Computing
National University of Singapore
3 Science Drive 2, Singapore 117543
{lekhacha, chuats, chl}@comp.nus.edu.sg

1. Introduction

The rapid advances in computing, multimedia, and networking technologies have resulted in the production and distribution of large amounts of multimedia data, in particular digital video. To manage these video sources effectively, it is necessary to organize them in a way that facilitates user access and supports personalization. For example, when viewing news on the Internet, we may want to watch only the video segments of interest, such as sports, or only those segments in which the Prime Minister is giving a speech.

Research on segmenting an input video into shots, and on using these shots as the basis for video organization, is well established [3, 15]. A shot represents a contiguous sequence of visually similar frames; it does not, however, usually convey any coherent semantics to the users. Because users remember video contents in terms of events or stories, not in terms of changes in visual appearance as in shots, it is necessary to organize video contents into small, single-story units that correspond to the conceptual chunks in users' memory. These units can then be organized hierarchically to facilitate browsing. For a specific domain such as news, these video units can further be classified according to their semantics, such as meeting, sport, etc.

This work aims at developing a system that automatically segments and classifies news video into semantic units using a learning-based approach. It is well known that learning-based approaches are sensitive to feature selection and often suffer from data sparseness problems due to the difficulty of obtaining annotated training data. One way to tackle the data sparseness problem is to perform the analysis at multiple levels, as has been done successfully in natural language processing (NLP) research [7]. For example, in NLP it has been found effective to perform part-of-speech tagging at the word level before carrying out phrase or sentence analysis at the higher level. Thus, in this research, we propose a two-level, multi-modal framework to tackle the news story boundary detection problem. The video is analyzed at the shot and story unit (or scene) levels using a variety of features. At the shot level, we use a set of low-level and high-level features to model the contents of each shot, and employ a Decision Tree to classify the video shots into one of 13 pre-defined categories. At the story level, we perform Hidden Markov Model (HMM) analysis [19] to identify news story boundaries. To focus our research, we adopt the news video domain, as such video is usually more structured and has clearly defined story units.
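The story-level step can be illustrated with a toy sketch. Assuming shots have already been classified into category labels at the lower level, an HMM over the sequence of shot categories can mark where stories begin: a state such as "Begin" would tend to emit anchor-person shots, while "Middle" emits report or sport shots. The states, categories, and all probabilities below are invented for illustration; the chapter's actual model uses 13 shot categories and trained parameters.

```python
import math

# Hypothetical shot categories (the chapter defines 13; three shown here).
# Toy HMM states mark whether a shot begins a news story.
STATES = ["Begin", "Middle"]
start_p = {"Begin": 0.8, "Middle": 0.2}
trans_p = {
    "Begin":  {"Begin": 0.1, "Middle": 0.9},
    "Middle": {"Begin": 0.3, "Middle": 0.7},
}
# Anchor-person shots typically open a story, so "Begin" emits them often.
emit_p = {
    "Begin":  {"Anchor": 0.7, "LiveReport": 0.2, "Sport": 0.1},
    "Middle": {"Anchor": 0.1, "LiveReport": 0.5, "Sport": 0.4},
}

def viterbi(obs):
    """Return the most likely state sequence for a list of shot categories."""
    # Log-probabilities for the first shot.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev, score = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in STATES),
                key=lambda x: x[1])
            col[s] = score + math.log(emit_p[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace back the best path from the highest-scoring final state.
    state = max(V[-1], key=V[-1].get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

shots = ["Anchor", "LiveReport", "LiveReport", "Anchor", "Sport"]
states = viterbi(shots)
# Story boundaries are the shot indices decoded as "Begin".
boundaries = [i for i, s in enumerate(states) if s == "Begin"]
print(boundaries)  # → [0, 3]
```

With these toy parameters, the two anchor shots are decoded as story openings, yielding boundaries at shots 0 and 3; in the actual system the transition and emission probabilities would be estimated from annotated news video.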

The remainder of this chapter is organized as follows. Section 2 describes related research, and Section 3 discusses the design of the multi-modal two-level classification framework. Section 4 presents the details of shot-level classification, and Section 5 discusses the details of story/scene segmentation. Section 6 discusses the experimental results, and Section 7 contains our conclusion and a discussion of future work.




Handbook of Video Databases: Design and Applications
ISBN: 084937006X
Year: 2003
