David Gibbon, Lee Begeja, Zhu Liu, Bernard Renger, and Behzad Shahraray
Multimedia Processing Research
AT&T Labs - Research
Middletown, New Jersey, USA
At a minimum, a multimedia database must contain the media itself, plus some level of metadata describing the media. In early systems, this metadata was limited to simple fields containing basic attributes such as the media title and author, and the user would select media using relational queries on these fields; video-on-demand systems are one example. More recently, systems have been developed that analyze the content of the media to create rich streams of features for indexing and retrieval. For example, by including a transcription of the dialog (along with its temporal information) in the database, the advances that have been made in the field of text information retrieval can be applied to the multimedia information retrieval domain. Typically, these systems support random access to the linear media using keyword searches, and in some cases they support queries based on media content.
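The idea of random access via keyword search can be sketched with a small inverted index that maps each spoken word to the times at which it occurs. The transcript data below is hypothetical; a real system would derive it from closed captions or speech recognition output.

```python
from collections import defaultdict

# Hypothetical time-stamped transcription: (start_seconds, word) pairs.
transcript = [
    (0.0, "good"), (0.4, "evening"), (5.2, "the"),
    (5.5, "senate"), (5.9, "voted"), (12.1, "weather"),
]

def build_index(words):
    """Invert the transcription: map each word to the times it was spoken."""
    index = defaultdict(list)
    for t, w in words:
        index[w.lower()].append(t)
    return index

index = build_index(transcript)

# A keyword query returns entry points into the linear media stream,
# so playback can begin at (or just before) each occurrence.
print(index["senate"])   # [5.5]
```

Because each posting carries a timestamp rather than a document identifier, a match is not merely a pointer to a video but a seek position within it.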
While media content analysis algorithms can provide a wealth of data for searching and browsing, this data often contains a significant amount of error. For example, speech recognition word error rates for broadcast news tasks are approximately 30%, and shot boundary detection algorithms may produce false positives or miss shots altogether. Multimodal processing of multimedia content can be used to alleviate the problems introduced by such imperfect media analysis algorithms. Multimodal processing can also yield higher-level semantics than processing each medium in isolation, which in turn makes information retrieval from multimedia databases more accurate.
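To illustrate where such errors come from, the following minimal sketch implements one common family of shot boundary detectors: thresholding frame-to-frame color histogram differences. The frames here are stand-in histograms (lists of bin counts) rather than decoded video, and the threshold value is illustrative.

```python
# A minimal sketch of threshold-based shot boundary detection using
# frame-to-frame color histogram differences.

def histogram_diff(h1, h2):
    """Sum of absolute bin differences, normalized by total bin count."""
    total = sum(h1) + sum(h2)
    return sum(abs(a - b) for a, b in zip(h1, h2)) / total

def detect_shot_boundaries(histograms, threshold=0.3):
    """Flag a boundary wherever consecutive frames differ sharply.
    A low threshold yields false positives (e.g. on camera flashes);
    a high one misses gradual transitions such as dissolves."""
    return [i for i in range(1, len(histograms))
            if histogram_diff(histograms[i - 1], histograms[i]) > threshold]

# Three-bin toy histograms: a sharp cut occurs between frames 1 and 2.
frames = [[90, 5, 5], [88, 7, 5], [10, 80, 10], [12, 78, 10]]
print(detect_shot_boundaries(frames))  # [2]
```

The comments in `detect_shot_boundaries` point at exactly the failure modes mentioned above: any single global threshold trades false positives against missed shots, which is one motivation for combining evidence from multiple modalities.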
Multimodal processing can also be used to go beyond random access to media. For some classes of content, it is possible to recover the boundaries of topics and extract segments of interest to the user. This reduces the burden on the user: rather than presenting a list of points of interest in a particular media stream, the system can extract a logically cohesive content unit and present it with little or no interaction.
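Once topic boundaries have been recovered, selecting segments of interest can be as simple as matching each segment's text against the user's profile. The sketch below assumes hypothetical `(start, end, text)` segments and a keyword-list profile; the actual matching used in a deployed system would be richer.

```python
# A hypothetical sketch: given topic segments recovered by segmentation,
# select clips whose text mentions any keyword from a user's profile.

def select_clips(segments, profile_keywords):
    """Return (start, end) clips whose text contains a profile keyword."""
    keywords = {k.lower() for k in profile_keywords}
    clips = []
    for start, end, text in segments:
        words = set(text.lower().split())
        if words & keywords:          # any overlap with the profile
            clips.append((start, end))
    return clips

segments = [
    (0, 120, "local weather forecast rain expected"),
    (120, 300, "senate vote on the budget"),
    (300, 420, "sports scores from last night"),
]
print(select_clips(segments, ["weather", "budget"]))  # [(0, 120), (120, 300)]
```

Note that the unit returned is a whole topic segment, not an isolated keyword hit, which is what allows a clip to be played back with little or no user interaction.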
This chapter will focus on using multimodal processing in the domain of broadcast television content, with the goal of automatically producing customized video content for individual users. Multimodal topic segmentation is used to extract, from a wide range of content sources, video clips that match the interests indicated in each user's profile. We will primarily use closed captions as the text source, although the textual information may come from a variety of sources, including post-production scripts, very large vocabulary automatic speech recognition, and transcripts aligned with the audio using speech processing methods.