HP Cambridge Research Laboratory
Cambridge, Massachusetts, USA
Given the recent advances in video coding and streaming technology and the pervasiveness of video as a form of communication, there is currently strong interest in the development of techniques for browsing, categorizing, retrieving, and automatically summarizing video. In this context, two tasks are of particular relevance: the decomposition of a video stream into its component units, and the extraction of features for the automatic characterization of these units. Unfortunately, current video characterization techniques rely on image representations based on low-level visual primitives (such as color, texture, and motion) that, while practical and computationally efficient, fail to capture most of the structure that is relevant for the perceptual decoding of the video. As a result, it is difficult to design systems that are truly useful for naive users. Significant progress can only be attained through a deeper understanding of the relationship between the message conveyed by the video and the patterns of visual structure that it exhibits.
There are various domains in which these relationships have been thoroughly studied, albeit not always from a computational standpoint. For example, it is well known among film theorists that the message strongly constrains the stylistic elements of the video [2,3], which are usually grouped into two major categories: the elements of montage and the elements of mise-en-scène. Montage refers to the temporal structure, namely the aspects of film editing, while mise-en-scène deals with spatial structure, i.e., the composition of each image, and includes variables such as the type of set in which the scene develops, the placement of the actors, aspects of lighting, focus, camera angles, and so on.
Building computational models for these stylistic elements can prove useful in two ways. On one hand, it will allow the extraction of semantic features, enabling video characterization and classification far closer to the categories people actually use than current descriptors based on texture properties or optical flow. On the other hand, it will provide constraints for the low-level analysis algorithms required to perform tasks such as video segmentation, key-framing, and so on.
The first point is illustrated by Figure 3.1, where we show how a collection of promotional trailers for commercially released feature films populates a 2-D feature space based on the most elementary characterization of montage and mise-en-scène: average shot duration vs. average shot activity. Despite the coarseness of this characterization, it captures aspects that are important for semantic movie classification: close inspection of the genre assigned to each movie by the Motion Picture Association of America reveals that in this space the movies cluster by genre!
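To make the two coordinates of Figure 3.1 concrete, the following is a minimal sketch of how they could be computed for one trailer. The function name, the input format (shot-boundary frame indices plus a per-frame activity measure such as the mean magnitude of frame differences), and all numbers are illustrative assumptions, not the chapter's implementation.

```python
# Hedged sketch (not the authors' implementation): computing the two
# Figure 3.1 features for one trailer from hypothetical inputs --
# shot-boundary frame indices and a per-frame activity measure.

def shot_features(shot_boundaries, frame_activity, fps=24.0):
    """Return (average shot duration in seconds, average shot activity)."""
    durations, activities = [], []
    # Consecutive boundary pairs delimit the shots.
    for start, end in zip(shot_boundaries[:-1], shot_boundaries[1:]):
        durations.append((end - start) / fps)
        frames = frame_activity[start:end]
        activities.append(sum(frames) / len(frames))
    avg_duration = sum(durations) / len(durations)
    avg_activity = sum(activities) / len(activities)
    return avg_duration, avg_activity

# Example: three shots over 240 frames at 24 fps,
# with cuts at frames 48 and 144.
boundaries = [0, 48, 144, 240]
activity = [0.2] * 48 + [0.8] * 96 + [0.4] * 96
print(shot_features(boundaries, activity))  # roughly (3.33, 0.47)
```

Each trailer then contributes one point to the 2-D scatter plot, and genre clusters can be inspected visually, as in the figure.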
Figure 3.1: Shot activity vs. duration features. The genre of each movie is identified by the symbol used to represent the movie in the plot (© 2000 IEEE).
The long-term goal of our video understanding research is to exploit knowledge of video structure as a means to enable the principled design of computational models for video semantics. This knowledge can either be derived from existing theories of content production, or learned from collections of training examples. The basic idea is to ground the semantic analysis directly on the high-level patterns of structure rather than on lower-level attributes such as texture, color, or optical flow. Since prior knowledge plays a fundamental role in our strategy, we have been placing significant emphasis on the use of Bayesian inference as the computational foundation of all content characterization work. In this chapter, we 1) review ongoing efforts for the development of statistical models for characterizing semantically relevant aspects of video and 2) present our first attempt at building a system that relies on such models to achieve the goal of semantic characterization. We show that accounting for video structure can both lead to significant improvements in low-level tasks such as shot segmentation, and enable surprisingly accurate semantic classification in terms of primitives such as action, dialog, or the type of set.
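The computational pattern behind this strategy can be sketched as follows. This is purely an illustration of Bayesian inference over structural features, not the chapter's actual models: the per-genre Gaussian parameters, the independence assumption between the two features, and the uniform priors are all invented for the example.

```python
# Illustrative sketch only: Bayes' rule applied to the two-feature
# space of Figure 3.1, with made-up per-genre Gaussian class-conditional
# densities and uniform priors. The chapter's actual models are richer;
# this shows just the computational pattern of Bayesian inference.
import math

# Hypothetical (mean_duration, mean_activity, std_duration, std_activity)
# per genre -- all numbers invented for illustration.
GENRES = {
    "action":  (2.5, 0.8, 1.0, 0.2),
    "comedy":  (5.0, 0.3, 1.5, 0.1),
    "romance": (7.0, 0.2, 2.0, 0.1),
}

def gaussian(x, mu, sigma):
    """Univariate Gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior(duration, activity):
    """P(genre | features), assuming independent features, uniform priors."""
    likelihoods = {
        g: gaussian(duration, md, sd) * gaussian(activity, ma, sa)
        for g, (md, ma, sd, sa) in GENRES.items()
    }
    total = sum(likelihoods.values())
    return {g: lk / total for g, lk in likelihoods.items()}

post = posterior(2.8, 0.75)      # short, high-activity shots
print(max(post, key=post.get))   # the action genre dominates here
```

The same pattern, with richer structural features and learned rather than hand-set densities, underlies the semantic classification results discussed later in the chapter.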
The activity features are described in Section 4.