2. The Role of Video Structure

The main premise behind our work is that the ability to infer semantic information from a video stream is a monotonic function of the amount of structure exhibited by the video. While newscasts are at the highly structured end of the spectrum and, therefore, constitute one of the simplest categories to analyse [4,5,6], the raw output of a personal camera exhibits almost no structure and is typically too difficult to characterize semantically [7]. Between these two extremes lie various types of content whose characterization poses varying degrees of difficulty. Our interests are mostly in domains that are more generic than newscasts, but still follow enough content production codes to exhibit a significant amount of structure. A good example of such a domain is that of feature films.

We have already mentioned that, in this domain, computational modelling of the elements of montage and mise-en-scene is likely to play a central role in any type of semantics-based processing. From the content characterization perspective, the important point is that both montage and mise-en-scene tend to follow well-established production codes or rules. For example, a director trying to put forth a text deeply rooted in the construction of character (e.g., a drama or a romance) will necessarily have to rely on a fair number of facial close-ups, as close-ups are the most powerful tool for displaying emotion,[2] an essential requirement for establishing a bond between audience and characters. If, on the other hand, the goal is to put forth a text of the action or suspense genres, the elements of mise-en-scene become less relevant than the rhythmic patterns of montage. In action or suspense scenes it is imperative to rely on fast cutting, and manipulation of the cutting rate is the tool of choice for keeping the audience "at the edge of their seats." Directors who exhibit supreme mastery in the manipulation of editing patterns are even referred to as montage directors.[3]

While montage has one fundamental element, shot duration, it is more difficult to identify a single defining characteristic of mise-en-scene. It is, nevertheless, clear that scene activity is an important one: while action movies contain many active shots, character-based stories are better conveyed by scenes of lower activity (e.g., dialogues). Furthermore, the amount of activity is usually correlated with the amount of violence in the content (at least violence of a gratuitous nature) and can provide clues for its detection. It turns out that activity measures also provide strong clues for the most basic form of video parsing, namely the detection of shot boundaries. For these reasons, we start the chapter by analysing a simple model of video structure that reduces montage to shot duration and mise-en-scene to shot activity. It is shown that even such a simple model can lead to shot segmentation algorithms that significantly outperform the current state of the art.
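
To make this reduction concrete, the Python sketch below illustrates how shot activity and shot duration might be combined into a rudimentary boundary detector. It is only an assumption-laden illustration, not the algorithm developed later in the chapter: the frame-difference activity measure, the thresholds, and the duration penalty are all invented for the example. The idea it conveys is that the detector stays conservative immediately after a cut, where a new boundary is a priori unlikely, and relaxes as the current shot grows longer.

    import numpy as np

    def frame_activity(prev, curr):
        """Mean absolute intensity difference between consecutive frames.
        A crude stand-in for a proper activity feature."""
        return np.mean(np.abs(curr.astype(float) - prev.astype(float)))

    def detect_shot_boundaries(frames, base_threshold=20.0, min_shot_len=10):
        """Hypothetical detector: declare a cut when activity exceeds a
        threshold that is inflated right after a cut (short shots are a
        priori unlikely) and relaxed as the shot grows longer."""
        boundaries = []
        last_cut = 0
        prev = None
        for i, frame in enumerate(frames):
            if prev is not None:
                act = frame_activity(prev, frame)
                shot_len = i - last_cut
                # Duration-aware penalty: large for very recent cuts, zero
                # once the shot has reached a plausible minimum length.
                penalty = base_threshold * max(0.0, (min_shot_len - shot_len) / min_shot_len)
                if act > base_threshold + penalty:
                    boundaries.append(i)
                    last_cut = i
            prev = frame
        return boundaries

    # Toy usage with synthetic frames: two "shots" of constant intensity.
    frames = [np.full((64, 64), 50, dtype=np.uint8) for _ in range(30)]
    frames += [np.full((64, 64), 200, dtype=np.uint8) for _ in range(30)]
    print(detect_shot_boundaries(frames))   # -> [30]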

This, of course, does not mean that the semantic characterization problem is solved. In the second part of the chapter, we introduce a generic Bayesian architecture that addresses that more ambitious problem. The basic idea is to rely on 1) a multitude of low-level sensors for visual events that may have semantic relevance (e.g., activity, skin tones, energy in some spatial frequency bands) and 2) knowledge about how these sensors react to the presence of the semantic stimuli of interest. These two components are integrated through a Bayesian formalism that encodes the knowledge of sensor behaviour into a collection of conditional probabilities for sensor measurements given the semantic state.
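
As a hedged illustration of how such a formalism combines sensor outputs, the fragment below computes a posterior over a single binary semantic attribute from two hypothetical sensors, assuming the sensors are conditionally independent given the semantic state (the models discussed in the chapter are richer than this). The attribute, the sensor quantization, and the probability tables are all invented for the example; in practice the conditional probabilities would be learned from labelled data.

    # Hypothetical setup: one binary semantic attribute and two sensors,
    # each reporting a quantized measurement.

    prior = {"action": 0.3, "non-action": 0.7}

    # P(activity level | state), activity quantized to {low, medium, high}
    p_activity = {
        "action":     {"low": 0.1, "medium": 0.3, "high": 0.6},
        "non-action": {"low": 0.6, "medium": 0.3, "high": 0.1},
    }

    # P(skin-tone area | state), quantized to {small, large}
    p_skin = {
        "action":     {"small": 0.7, "large": 0.3},
        "non-action": {"small": 0.4, "large": 0.6},
    }

    def posterior(activity, skin):
        """Combine the two sensor readings with Bayes' rule, assuming
        conditional independence of the sensors given the state."""
        joint = {
            s: prior[s] * p_activity[s][activity] * p_skin[s][skin]
            for s in prior
        }
        z = sum(joint.values())
        return {s: v / z for s, v in joint.items()}

    print(posterior("high", "small"))  # strongly favours "action"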

The shot segmentation module and the Bayesian semantic classification architecture form the basis of the Bayesian Modelling of Video Editing and Structure (BMoViES) system for video characterization. An overview of this system is presented in the third part of the chapter, which also illustrates various applications of semantic modelling to video retrieval, summarization, browsing, and classification. More details about the work discussed here can be found in [7–13].

[2] The importance of close-ups is best summarized in the quote from Charles Chaplin: "Tragedy is a close-up, comedy a long shot."

[3] The most popular example in this class is Alfred Hitchcock, who relied extensively on editing to create suspense in movies like "Psycho" or "The Birds" [2].



