In this section we review prior work on event detection and information fusion using multimedia features.
Recent work on temporal modeling of image sequences includes parsing and structuring as well as modeling visual events. Statistical models such as hidden Markov models (HMMs) have been used for structuring image sequences [5,21,22], and Yeung et al. introduced dialog detection. Topical classification of image sequences can provide information about the genre of a video, such as news or sports. Extracting semantics from image sequences is difficult. Recent work on semantic analysis of image sequences includes Naphade et al. [13,30], Chang et al., and Brand et al. Naphade et al. [13,30] use hidden Markov models to detect events in image sequences, Chang et al. allow user-defined templates of semantics in image sequences, and Brand et al. use coupled HMMs to model complex actions in Tai Chi movies.
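To make the HMM-based event detection concrete, the following is a minimal sketch of how a feature sequence can be scored against competing event models with the forward algorithm; the scalar Gaussian emissions, the toy parameters, and the function names are illustrative assumptions, not the actual models of [13,30].

```python
import numpy as np

def logsumexp(x, axis=None):
    # numerically stable log(sum(exp(x)))
    m = np.max(x, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return s.squeeze(axis) if axis is not None else s.item()

def forward_log_likelihood(obs, pi0, A, means, var):
    """Log-likelihood of a 1-D feature sequence under a Gaussian-emission HMM,
    computed in log space with the forward algorithm."""
    T = len(obs)
    # per-state Gaussian log-emission probabilities, shape (T, n_states)
    log_b = -0.5 * (np.log(2 * np.pi * var) + (obs[:, None] - means) ** 2 / var)
    log_alpha = np.log(pi0) + log_b[0]
    for t in range(1, T):
        # forward recursion: alpha_t(j) = b_t(j) * sum_i alpha_{t-1}(i) * A[i, j]
        log_alpha = log_b[t] + logsumexp(log_alpha[:, None] + np.log(A), axis=0)
    return logsumexp(log_alpha)
```

An event is then detected by scoring the observed sequence under each event's HMM and picking the model with the highest likelihood.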
Recent work on segmentation and classification of audio streams includes [7,8,9,10,11,12,31,32]. Naphade and Huang used HMMs to represent the probability density functions of auditory features computed over a time series. Zhang and Kuo used features based on heuristics for audio classification. HMMs have also been applied successfully in speech recognition.
Among state-of-the-art techniques in multimedia retrieval, very few use multiple modalities. Most techniques that use audiovisual data perform temporal segmentation on one medium and then analyze the other. For example, the image sequence is used for temporal segmentation and the audio is then analyzed for classification. Examples include [33,34] and the Informedia project, which uses the visual stream for segmentation and the audio stream for content classification. Such systems also exist for particular video domains such as broadcast news, sports [12,29,37], and meeting videos. Wang et al. survey several techniques that take a similar approach in similar domains. For domain-independent retrieval, while existing techniques attempt to determine what is happening in the speech audio, most go no further than classifying the genre of the video using audiovisual features. Other techniques for video analysis include unsupervised clustering of videos. Naphade et al. have presented an algorithm to support query by audiovisual content. Another popular domain is the detection and verification of a speaker using speech together with an image sequence from a camera observing the person, which is particularly applicable to intelligent collaboration and human-computer interaction. Recent work in semantic video indexing includes Naphade et al. [13,40,41].
Audio-visual analysis for detecting semantic concepts in videos is a challenging problem. One main difficulty arises from the fact that the different sensors are noisy with respect to the information they carry about different semantic concepts. For example, based on vision alone it is hard to distinguish an explosion from a normal fire; similarly, audio alone may give confusing information. On the one hand, one may be able to filter out the ambiguity arising from one source of information by analyzing another source; on the other hand, the different sources may provide complementary information that is essential for inference.
Motivated by these difficulties, much research in the past few years has gone into developing algorithms for fusing information from different modalities. Since different modalities may not be sampled at the same temporal rate, seamlessly integrating them is challenging (e.g., audio is typically sampled at 44 kHz whereas video is sampled at 30 fps). The streams may not even be synchronized, and the sources of information may have very different characteristics (audio is continuous, while keyboard input to a computer is discrete). If we assume that features from the different streams can be placed on a common time scale, the two main categories of fusion models are those that favor early integration of features and those that favor late integration. Early integration refers to combining the information at the level of raw features. Simple early integration often takes the form of concatenating weighted features from the different streams. More involved models of early integration use some form of Markov model: Brand et al. proposed coupled hidden Markov models and used them for detecting human activities, and Ghahramani et al. proposed factorial hidden Markov models. The main difference between these models lies in the conditional independence assumptions they make between the states of the different information sources. They assume that the different sources are tightly coupled and model them using a single generative process.
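As an illustration of simple early integration, the sketch below puts the two streams on a common time scale by averaging the denser audio-feature stream over each video frame, then concatenates the weighted features into one joint vector per frame. The feature dimensions, weights, and alignment scheme are assumptions for illustration.

```python
import numpy as np

def early_fusion(audio_feats, video_feats, w_audio=0.5, w_video=0.5):
    """Early integration: resample the denser audio-feature stream to the
    video frame rate, then concatenate weighted features per frame.
    Assumes the audio stream has at least as many windows as video frames."""
    n_frames = len(video_feats)
    # partition the audio windows among the video frames and average each slice
    idx = np.linspace(0, len(audio_feats), n_frames + 1).astype(int)
    audio_per_frame = np.stack([audio_feats[a:b].mean(axis=0)
                                for a, b in zip(idx[:-1], idx[1:])])
    # one concatenated feature vector per video frame
    return np.hstack([w_audio * audio_per_frame, w_video * video_feats])
```

The concatenated vectors can then be fed to a single classifier or generative model, which is exactly what makes early integration sensitive to the relative scaling of the streams.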
In many situations, especially when the different sources provide complementary information, one may prefer late integration. It refers to performing inference on each stream independently of the others and then combining the outputs. This is especially important, and often improves results, because only the essential information contained in each stream is considered and sensor-dependent characteristics play no role. It also allows one to learn a model for each source independently and then combine the outputs. One may simply take a weighted combination of the decisions of the different sources, or use probabilistic models to capture the dependencies among them. For example, dynamic Bayesian networks over the outputs of the different streams have been proposed to solve the problem of speaker detection; similarly, hierarchical HMMs have been proposed.
We observed that in the case of movies, the audio and visual streams normally carry complementary information. For example, a scene of an explosion is characterized not just by a loud blast but also by a visual effect of bright red and yellow colors. Motivated by this fact, we propose the use of late coupling, which seems better suited to this framework. Fusion of multimodal feature streams (especially audio and visual feature streams) has been applied to problems such as bimodal speech, speaker detection, video summarization, query by audio-visual content, and event detection in movies. Examples of fusing other streams include fusion of text and image content, motion and image content, etc.