Nevenka Dimitrova, Radu Jasinschi, Lalitha Agnihotri, John Zimmerman, Thomas McGee, and Dongge Li
345 Scarborough Road
Briarcliff Manor, NY 10598
(Nevenka.Dimitrova, Radu.Jasinschi, Lalitha.Agnihotri)@philips.com
For many years there has been a vision of future television where users can "watch what they want, when they want." This vision first emerged through the concept of video-on-demand (VOD), where users could get streaming TV programs and movies on demand. While this approach was expected to be widely available by now, it can currently be found only in hotels and, to a limited degree, as pay-per-view movies on cable and satellite. Recently, however, a second approach has emerged in the form of hard disk recorders (HDRs), available today from TiVo and ReplayTV. The current models of these devices record 30–140 hours of programming, allowing users to truly watch what they want when they want. Users navigate electronic program guides to select programs for recording. In addition, TiVo "personalizes" the TV experience: users rate shows, and the TiVo recorder automatically records TV programs it infers the users will like. Hard disk recorders offer users tremendous flexibility to time-shift, and they have pushed the idea of personalization into the living room. However, they are limited because they provide no tools for personalizing content at a sub-program level.
Content-based video analysis and retrieval has been an active area of research for the last decade. When low-level visual analysis algorithms were first applied to video indexing [Nagasaka], the possibility of combining results from audio, visual, and transcript analysis was considered far-fetched. The InforMedia project was among the first to combine visual processing with keyword spotting [Hauptman]. Recently there have been multiple reports of combining different modalities, for example to index topical events in multimedia presentations or to detect highlights in baseball TV programs. We have been building a system called Video Scout, which monitors TV channels and selects and records content based on multimodal integration of video, audio, and transcript processing. Video Scout advances previously reported systems by using a comprehensive set of features: visual characteristics such as cuts, videotext, and faces; audio characteristics such as noise, silence, music, and speech; and transcript characteristics such as keywords, word histograms, and categories. The system architecture allows for integrated processing of all these features within a probabilistic framework.
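To illustrate one common way such multimodal evidence can be integrated probabilistically, the sketch below combines per-modality likelihoods with a naive-Bayes-style product, computed in log space. This is a minimal illustration under assumed conditional independence of modalities, not the actual Video Scout implementation; all category names, scores, and function names are hypothetical.

```python
import math

def fuse_modalities(likelihoods, prior):
    """Naive-Bayes-style fusion: multiply the category prior by each
    modality's likelihood, working in log space for numerical stability."""
    log_score = math.log(prior)
    for p in likelihoods.values():
        log_score += math.log(p)
    return log_score

def classify(segment_evidence, priors):
    """Return the category whose fused score is highest for this segment."""
    scores = {cat: fuse_modalities(segment_evidence[cat], priors[cat])
              for cat in priors}
    return max(scores, key=scores.get)

# Illustrative evidence that a segment is financial news vs. a commercial:
# P(observed feature | category) for each of the three modalities.
evidence = {
    "financial-news": {"visual": 0.6, "audio": 0.7, "transcript": 0.9},
    "commercial":     {"visual": 0.4, "audio": 0.3, "transcript": 0.1},
}
priors = {"financial-news": 0.5, "commercial": 0.5}

print(classify(evidence, priors))  # → financial-news
```

In practice the per-modality likelihoods would come from the analysis stages (cut detection, audio classification, transcript categorization); the fusion step itself stays simple.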
We designed Video Scout so that users can specify content requests in their profiles, after which Scout begins recording TV programs. In addition, Scout actually "watches" the TV programs it records and personalizes program segments based on the users' profiles. Scout analyzes the visual, audio, and transcript data in order to segment and index the programs. When accessing full programs, users see a high-level overview as well as topic-specific starting points. For example, users can quickly find and play back Dolly Parton's musical performance within an episode of Late Show with David Letterman. In addition, users can access video segments organized by topic. For example, users can quickly find all of the segments on Philips Electronics that Scout has recorded from various financial news programs.
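The topic-based access described above can be pictured as a simple index from recorded segments to topic keywords, queried across programs. The following is a hedged sketch only; the data structures, field names, and sample topics are hypothetical and do not reflect Scout's internal representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    program: str          # source TV program
    start_sec: int        # offset of the segment within the recording
    topics: frozenset     # topic keywords assigned during indexing

def segments_for_topic(index, topic):
    """Return all recorded segments tagged with the requested topic,
    regardless of which program they came from."""
    return [seg for seg in index if topic in seg.topics]

# Illustrative index built from several recorded programs.
index = [
    Segment("Late Show", 1800, frozenset({"music", "dolly parton"})),
    Segment("Moneyline", 600, frozenset({"philips", "earnings"})),
    Segment("Marketwatch", 120, frozenset({"philips", "stocks"})),
]

hits = segments_for_topic(index, "philips")
print([s.program for s in hits])  # → ['Moneyline', 'Marketwatch']
```

A real system would match topics against the user profile and rank segments, but the cross-program lookup shown here is the essential retrieval step.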