In this paper we have analyzed the problem of detecting temporal events in videos using multiple sources of information. We have argued that two of the main characteristics of this problem are the duration of the events and the dependence between the different concepts. Past approaches that used standard HMMs for fusion ignored both of these issues. We have shown how standard probabilistic models can be modified to address them and obtain superior performance.
In particular, we have presented a new model, the duration-dependent input-output Markov model (DDIOMM), for integrating intermediate decisions from different feature streams to detect events in multimedia. The model provides a hierarchical mechanism that maps media features to output decision sequences through intermediate state sequences. It forces the multimodal input streams to be dependent given the target event, and it supports discrete, non-exponential duration models for events. By combining these two features within the framework of generative models for inference, we obtain a simple and efficient Viterbi-style algorithm for decoding the decision sequence. We have demonstrated the strength of our model by experimenting with audio-visual data from movies, targeting the audio-visual event "explosion". Experiments comparing the DDIOMM with the IOMM as well as the HMM reveal that the DDIOMM yields lower classification error and improved detection.
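The duration-dependent decoding step can be made concrete with a small sketch. The following Python/NumPy function is our illustration, not the implementation used in the paper; the name duration_viterbi, the uniform initial-state assumption, and the per-frame log-likelihood inputs are assumptions made for the example. It shows how explicit, non-exponential duration models enter a Viterbi-style search: each candidate segment is scored with a duration term in addition to the usual transition and observation terms.

```python
import numpy as np


def duration_viterbi(frame_loglik, log_trans, log_dur):
    """Explicit-duration Viterbi decoding (a minimal sketch, not the paper's code).

    frame_loglik: (T, S) array, frame_loglik[t, s] = log p(inputs at frame t | state s)
    log_trans:    (S, S) array of log state-transition probabilities
    log_dur:      (S, D) array, log_dur[s, d] = log p(state s lasts d + 1 frames)

    Returns the most likely state label for every frame.
    """
    T, S = frame_loglik.shape
    D = log_dur.shape[1]
    delta = np.full((T, S), -np.inf)       # best score of a segment in state s ending at t
    back = np.zeros((T, S, 2), dtype=int)  # (previous state, segment length) for backtracking

    for t in range(T):
        for s in range(S):
            for d in range(1, min(D, t + 1) + 1):
                # log-likelihood of a d-frame segment in state s ending at frame t
                seg = frame_loglik[t - d + 1 : t + 1, s].sum() + log_dur[s, d - 1]
                if d == t + 1:
                    # segment starts at frame 0; a uniform initial state is assumed here
                    score, prev = seg, -1
                else:
                    cand = delta[t - d] + log_trans[:, s]
                    prev = int(np.argmax(cand))
                    score = cand[prev] + seg
                if score > delta[t, s]:
                    delta[t, s] = score
                    back[t, s] = (prev, d)

    # Recover the frame-level decision sequence by walking back over segments.
    path = np.empty(T, dtype=int)
    t, s = T - 1, int(np.argmax(delta[T - 1]))
    while t >= 0:
        prev, d = back[t, s]
        path[t - d + 1 : t + 1] = s
        t, s = t - d, prev
    return path
```

The search runs in O(T S^2 D) time, where D is the longest modelled duration, so allowing non-exponential durations adds only a factor of D over standard Viterbi decoding.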