Most research in the area of content-based indexing and retrieval of multimedia information focuses on searching large video archives. These systems are similar to ours in that they use features from the visual, audio, and/or transcript data to segment and index video content. However, they differ in (1) their scale (they deal with large video archives, such as the 600,000 hours of news owned by the BBC) and (2) their need for precision, since they are intended as professional applications. Our research, in contrast, focuses on devices for consumers' homes. To keep such a device affordable, we assume limited channel tuning, storage, and processing power. We also focus on the kinds of content-based retrieval that consumers would want in their homes, such as automatic personalization of content retrieval based on user profiles.
The following are complementary approaches to multimodal processing of visual, audio, and transcript information in video analysis. Rui et al. present an approach that uses low-level audio features to detect excited speech and hits in a baseball game, employing a probabilistic framework for automatic "highlight" extraction. Syeda-Mahmood et al. present event detection in multimedia presentations from teaching and training videos. The foils (slides) are detected using visual analysis, and the system searches the audio track for phrases that appear as screen text. They employ a probabilistic model that exploits the co-occurrence of visual and audio events.
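To make the flavor of such probabilistic multimodal fusion concrete, the following is a minimal sketch in which per-modality likelihoods for a video segment are combined under a naive conditional-independence assumption. All probabilities, event names, and the function itself are illustrative assumptions, not values or code from any of the cited systems.

```python
# Sketch of probabilistic fusion of audio and visual evidence for
# "highlight" detection. The numbers below are hypothetical.

def fuse_modalities(p_audio, p_visual, prior=0.1):
    """Posterior P(highlight | audio, visual) via Bayes' rule,
    assuming the modalities are conditionally independent."""
    num = prior * p_audio["highlight"] * p_visual["highlight"]
    den = num + (1 - prior) * p_audio["other"] * p_visual["other"]
    return num / den

# Hypothetical per-modality likelihoods for one segment:
audio = {"highlight": 0.8, "other": 0.2}    # e.g. excited speech detected
visual = {"highlight": 0.6, "other": 0.3}   # e.g. replay/scoreboard cue

posterior = fuse_modalities(audio, visual, prior=0.1)
print(round(posterior, 3))  # the two weak cues together raise a 0.1 prior
```

Even with a low prior, agreement between the audio and visual cues pushes the posterior well above either modality's contribution alone, which is the intuition behind exploiting audio-visual co-occurrence.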
Another approach, based on the observation that semantic concepts in videos interact and appear in context, was proposed by Naphade. To model this contextual interaction explicitly, a probabilistic graphical network of multijects, or multinet, was proposed. Using probabilistic models for the multijects "rocks", "sky", "snow", "water-body", and "forestry/greenery", and using a factor graph as the multinet, they built a framework for semantic indexing.
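A toy version of such a multinet can be sketched as a factor graph over binary concept variables: each concept carries independent detector evidence, and a pairwise factor encodes co-occurrence. The concept names follow the text above, but the evidence scores, the co-occurrence values, and the code are illustrative assumptions; for this tiny graph the marginal is computed by brute-force enumeration rather than the sum-product message passing a real multinet would use.

```python
from itertools import product

concepts = ["sky", "snow", "water-body"]

# Hypothetical per-concept evidence (e.g. from individual multiject detectors):
evidence = {"sky": 0.7, "snow": 0.4, "water-body": 0.6}

def co_occur(a, b):
    """Pairwise factor: sky and snow tend to appear together (assumed values)."""
    return 2.0 if a == b else 0.5

def joint(assign):
    """Unnormalized joint: evidence factors times the co-occurrence factor."""
    p = 1.0
    for c in concepts:
        p *= evidence[c] if assign[c] else (1 - evidence[c])
    return p * co_occur(assign["sky"], assign["snow"])

# Exact marginal P(snow = 1) by summing over all assignments.
z, num = 0.0, 0.0
for bits in product([0, 1], repeat=len(concepts)):
    a = dict(zip(concepts, bits))
    p = joint(a)
    z += p
    if a["snow"]:
        num += p
print(round(num / z, 3))
```

The point of the exercise: the snow detector alone gives P(snow) = 0.4, but the strong sky evidence raises the marginal through the co-occurrence factor, illustrating how a multinet lets context correct individual detectors.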
Reported systems for content-based access to images and video include Query-By-Image-and-Video-Content (QBIC), VisualGrep, DVL of AT&T, InforMedia, VideoQ, MoCA, Vibe, and CONIVAS. In particular, the InforMedia, MoCA, and VideoQ systems are most closely related to Video Scouting. The InforMedia project is a digital video library system containing methods to create a short synopsis of each video, based primarily on speech recognition, natural language understanding, and caption text. The MoCA project is designed to provide content-based access to a movie database; besides segmenting movies into salient shots and generating an abstract of the movie, the system detects and recognizes title credits and performs audio analysis. The VideoQ system classifies videos using compressed-domain analysis consisting of three modules: parsing, visualization, and authoring.