Ashutosh Garg, Milind R. Naphade, and Thomas S. Huang
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Urbana, Illinois, USA
Generation and dissemination of digital media content pose a challenging problem of efficient storage and retrieval. Of particular interest to us are audio and visual content. From the sharing of picture albums and home videos to movie advertisement through interactive preview clips, live broadcasts of various shows, and multimedia reports of news as it happens, multimedia information has found in the internet and television powerful media to reach us. With innovations in hand-held and portable computing devices and wired and wireless communication technology (pocket PCs, organizers, cell phones) on one end and broadband internet devices on the other, the supply and dissemination of unclassified multimedia is overwhelming. Humans assimilate content at a semantic level and apply their knowledge to the task of sifting through large volumes of multimodal data. To invent tools that can gain widespread popularity, we must try to emulate human assimilation of this content. We are thus faced with the problem of multimedia understanding if we are to bridge the gap between media features and semantics.
Current techniques in content-based retrieval for image sequences support the paradigm of query by example using similarity in low-level media features [1,2,3,4,5,6]. The query must be phrased in terms of a video clip, or at least a few key frames extracted from the query clip. Retrieval is based on a matching algorithm that ranks the database clips according to a heuristic measure of similarity between the query and the target. While effective for browsing and low-level search, this paradigm has limitations. Low-level similarity may not match the user's perception of similarity. Also, the assumption that clips reflecting the user's desire are available at query time is unrealistic. It is also essential to fuse information from multiple modalities, especially the image sequence and the audio stream. Most systems use either the image sequence [5,6,4,2,1] or the audio track [7,8,9,10,11,12], while few use both modalities [13,14,12].
One way of organizing a video for efficient browsing and searching is shown in Figure 2.1. A systematic top-down breakdown of the video into scenes, shots and key frames exists in the form of a table of contents (ToC). To enable access to the video in terms of semantic concepts, there needs to be a semantic index (SI). The links connect entries in the SI to shots/scenes in the ToC and also indicate a measure of confidence.
Figure 2.1: Organizing a Video with a Table of Contents (ToC) and a Semantic Index (SI). The ToC gives a top-down breakdown in terms of scenes, shots and key frames. The SI lists key concepts occurring in the video. The links indicate the exact locations of these concepts and the associated confidence measures.
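The ToC/SI organization described above can be sketched as a simple data structure. The following is an illustrative sketch only; the class names (`Shot`, `Scene`, `SemanticIndex`) and the confidence-filtered lookup are our assumptions, not an implementation from the paper.

```python
# Hypothetical sketch of a ToC/SI organization: a ToC of scenes -> shots
# -> key frames, plus a semantic index linking concepts to shots with a
# confidence measure. Names and API are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Shot:
    shot_id: int
    key_frames: list        # frame indices chosen to summarize the shot

@dataclass
class Scene:
    scene_id: int
    shots: list             # the Shot objects grouped into this scene

@dataclass
class SemanticIndex:
    # concept -> list of (shot_id, confidence) links into the ToC
    links: dict = field(default_factory=dict)

    def add_link(self, concept, shot_id, confidence):
        self.links.setdefault(concept, []).append((shot_id, confidence))

    def lookup(self, concept, min_confidence=0.0):
        """Return shot ids where `concept` occurs with enough confidence."""
        return [sid for sid, conf in self.links.get(concept, [])
                if conf >= min_confidence]

# Usage: index "explosion" in two shots with different confidences
si = SemanticIndex()
si.add_link("explosion", 3, 0.9)
si.add_link("explosion", 7, 0.4)
print(si.lookup("explosion", min_confidence=0.5))  # [3]
```

A query by concept then reduces to a lookup in the SI, which returns pointers into the ToC rather than raw frames.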
Automatic techniques for generating the ToC exist, though they use low-level features for extracting key frames as well as for constructing scenes. The first step in generating the ToC is the segmentation of the video track into smaller units. Shot boundary detection can be performed in the compressed domain [15,16,17] as well as the uncompressed domain. Shots can be grouped based on continuity, temporal proximity and similarity to form scenes. Most systems support query by image sequence content [2,3,4,5,6] and can be used to group shots and enhance the ability to browse. Naphade et al. presented a scheme that supports query by audiovisual content using dynamic programming. The user may browse a video and then provide one of the clips in the ToC structure as an example to drive the retrieval systems mentioned earlier. Chang et al. allow the user to provide a sketch of a dominant object along with its color, shape and motion trajectory. Key frames can be extracted from shots to support efficient browsing.
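As a concrete illustration of uncompressed-domain shot boundary detection, a common low-level approach flags a cut when consecutive frame histograms differ by more than a threshold. This is a minimal sketch under that assumption; the cited systems use more elaborate (often compressed-domain) methods, and the bin count and threshold here are arbitrary.

```python
# Minimal illustrative shot-boundary detector: threshold the L1 distance
# between intensity histograms of consecutive frames. Parameters
# (bins=8, threshold=0.5) are arbitrary assumptions for the sketch.

def histogram(frame, bins=8, levels=256):
    """Normalized intensity histogram of a frame given as a flat pixel list."""
    h = [0] * bins
    for p in frame:
        h[p * bins // levels] += 1
    n = len(frame)
    return [c / n for c in h]

def detect_cuts(frames, threshold=0.5):
    """Report frame indices where a hard cut is suspected."""
    cuts = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        dist = sum(abs(a - b) for a, b in zip(prev, cur))  # L1 distance
        if dist > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic example: three dark frames followed by two bright frames
dark = [10] * 100
bright = [240] * 100
print(detect_cuts([dark, dark, dark, bright, bright]))  # [3]
```

Such a detector handles hard cuts; gradual transitions (fades, dissolves) require more sophisticated temporal analysis.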
A semantic index is needed to facilitate search using key words or key concepts. To support such semantics, models of semantic concepts in terms of multimodal representations are needed. For example, a query to find an explosion on a beach can be supported if models for the concepts explosion and beach are represented in the system. This is a difficult problem. The difficulty lies in the gap that exists between low-level media features and high-level semantics. Query using semantic concepts has motivated recent research in semantic video indexing [13,19,20,12] and structuring [21,22,23]. We presented novel ideas in semantic indexing by learning probabilistic multimedia representations of semantic events like explosion and sites like waterfall. Chang et al. introduced the notion of semantic visual templates. Wolf et al. used hidden Markov models to parse video. Ferman et al. attempted to model semantic structures like dialogues in video.
The two aspects of mapping low-level features to high-level semantics are the concepts represented by the multiple media and the context in which they appear. We view the problem of semantic video indexing as a multimedia understanding problem. Semantic concepts do not occur in isolation; there is always a context to the co-occurrence of semantic concepts in a video scene. We presented a probabilistic graphical network to model this context [24,25] and demonstrated that modeling the context explicitly provides a significant improvement in performance. For further details on modeling context, the reader is referred to [24,25]. In this paper we concentrate on the problem of detecting complex audiovisual events. We apply a novel learning architecture and algorithm to fuse information from multiple loosely coupled modalities to detect audiovisual events such as explosion.
Detecting semantic events from audio-visual data with spatio-temporal support is a challenging multimedia understanding problem. The difficulty lies in the gap that exists between low-level features and high-level semantic labels. Often, one must depend on multiple modalities to interpret the semantics reliably. This necessitates efficient schemes that can capture the characteristics of high-level semantic events by fusing the information extracted from multiple modalities.
Research in fusing multiple modalities for detection and recognition has attracted considerable attention. Most techniques for fusing features from multiple modalities with temporal support are based on Markov models. Examples include the hidden Markov model (HMM) and several variants of the HMM, such as the coupled hidden Markov model, the factorial hidden Markov model, and the hierarchical hidden Markov model. A distinguishing characteristic of these models is the stage at which the features from the different modalities are merged.
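All of these variants build on the same basic HMM inference machinery. As a reference point, the forward recursion below computes the likelihood of an observation sequence under a discrete-observation HMM; the tiny two-state model is illustrative only, and a brute-force sum over all state paths is included as a sanity check.

```python
# Forward recursion for a discrete-observation HMM -- the core inference
# step shared by the HMM variants discussed above. Model parameters below
# are illustrative, not taken from the paper.
from itertools import product

def forward(pi, A, B, obs):
    """Return P(obs | model) via the forward algorithm.
    pi: initial state probs; A: state transition matrix; B: emission matrix."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[sp] * A[sp][s] for sp in range(n)) * B[s][o]
                 for s in range(n)]
    return sum(alpha)

def brute_force(pi, A, B, obs):
    """Sanity check: sum path probabilities over every possible state path."""
    total = 0.0
    for path in product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
print(abs(forward(pi, A, B, obs) - brute_force(pi, A, B, obs)) < 1e-12)  # True
```

The coupled, factorial, and hierarchical variants differ chiefly in how the state space and transition structure are factored across modalities, not in this underlying recursion.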
We present a novel algorithm that combines features with temporal support from multiple modalities. Two main features distinguish our model from existing schemes: (a) the ability to account for non-exponential duration and (b) the ability to map discrete-state input sequences to decision sequences. Standard algorithms for modeling video events use HMMs, which model the duration of events as an exponentially decaying distribution. We argue, however, that duration is an important characteristic of each event, and we demonstrate this through improved performance over standard HMMs. We test the model on the audio-visual event explosion. Using a set of hand-labeled video data, we compare the performance of our model with and without the explicit duration model. We also compare the performance of the proposed model with the traditional HMM and observe that detection performance improves.
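The exponential-decay limitation mentioned above follows directly from the HMM's self-transition structure: with self-transition probability a, the time spent in a state is geometrically distributed, P(d) = a^(d-1)(1-a), with expected duration 1/(1-a). The short sketch below verifies this numerically; the value a = 0.9 is an illustrative assumption.

```python
# State duration in a standard HMM: with self-transition probability a,
# P(duration = d) = a**(d-1) * (1 - a), a geometric (exponentially
# decaying) distribution with mean 1 / (1 - a). The value a = 0.9 is
# illustrative; real events often have peaked, non-exponential durations.

def geometric_duration_pmf(a, d):
    """Probability of staying exactly d steps in a state with self-loop a."""
    return a ** (d - 1) * (1 - a)

a = 0.9
# Truncate the infinite sum at a large horizon; the tail is negligible here.
pmf = [geometric_duration_pmf(a, d) for d in range(1, 500)]
mean = sum(d * p for d, p in enumerate(pmf, start=1))
print(round(mean, 2))  # matches the analytic mean 1 / (1 - a) = 10
```

Because the mode of this distribution is always at d = 1, a standard HMM inherently favors very short stays in each state, which motivates the explicit duration model evaluated in this paper.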