The most obvious application of BMoViES is the classification of movies or movie clips into semantic categories. These categories can then be used as the basis for the development of content filtering agents that screen a media stream for items that comply with a previously established user profile. However, filtering is not the only problem of interest, since users may also be interested in actively searching for specific content in a given database. In this case, the problem becomes one of information retrieval. As illustrated by Figure 3.10, the classification and retrieval problems are dual.
Figure 3.10: The duality between classification and retrieval (© 1998 IEEE).
A classifier is faced with data observations and required to make inferences with respect to the attributes on the basis of which the classification is to be performed. On the other hand, a retrieval system is faced with attribute specifications (e.g., "find all the shots containing people in action") and required to find the data that best satisfies these specifications. Hence, while classification requires the ability to go from sensor observations to attributes, retrieval requires the ability to go from attribute specifications to sensor configurations.
This type of two-way inference requires a degree of flexibility from the content characterization architecture that cannot be achieved with most statistical classifiers (such as neural networks, decision trees, or support vector machines), which require a precise definition of which variables are inputs and which are outputs. Since a Bayesian network does not have inputs or outputs, but only hidden and observed nodes, any node can play the role of input or output at a given point in time, and the two dual problems can be handled equally well. Hence, it provides a unified solution to the problems of information filtering and retrieval.
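The two-way nature of the inference can be sketched with a toy two-node network: a hidden "action" attribute generating a binary motion sensor. All probabilities below are invented for illustration and are not the BMoViES parameters; the point is only that the same conditional model supports both directions of inference.

```python
# Toy two-node network (attribute -> sensor). Illustrative numbers only.

# Prior over the hidden semantic attribute "action".
p_action = {True: 0.3, False: 0.7}

# Conditional P(sensor reads "high motion" | action).
p_motion_given_action = {True: 0.9, False: 0.2}

def classify(motion_high):
    """Classification direction: infer P(action | sensor) via Bayes' rule."""
    joint = {}
    for a in (True, False):
        p_s = p_motion_given_action[a] if motion_high else 1 - p_motion_given_action[a]
        joint[a] = p_action[a] * p_s
    z = sum(joint.values())
    return {a: p / z for a, p in joint.items()}

def retrieve(action):
    """Retrieval direction: the sensor configuration most likely under the
    attribute specification, i.e. argmax over s of P(s | action)."""
    return "high motion" if p_motion_given_action[action] >= 0.5 else "low motion"

post = classify(motion_high=True)
print(post[True])             # posterior belief in "action" given high motion
print(retrieve(action=True))  # most likely sensor state given "action = yes"
```

In the classification direction Bayes' rule inverts the conditional; in the retrieval direction the same conditional is read forward, which is exactly the symmetry the figure illustrates.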
To evaluate the accuracy of the semantic classification of BMoViES, we applied the system to a database of about 100 video clips (a total of about 3000 frames) from the movie "Circle of Friends." The database is a sub-sampling of approximately 25 minutes of film, and contains a wide variety of scenes with large variation of imaging conditions such as lighting, camera viewpoints, etc. To establish ground truth, the video clips were also manually classified.
Table 3.1 presents the classification accuracy achieved by BMoViES for each of the semantic attributes in the model. Overall the system achieved an accuracy of 88.7%. Given the simplicity of the sensors this is a very satisfying result. Some of the classification errors, which illustrate the difficulty of the task, are presented in Figure 3.11.
Figure 3.11: Classification errors in BMoViES. People were not detected in the left three clips; the crowd was not recognized on the right (© 1998 IEEE).
As a retrieval system, BMoViES supports standard query by example, where the user provides the system with a video clip and asks it to "find all the clips that look like this". BMoViES then searches for the video clip in the database that maximizes the likelihood P(S|A) of sensor configurations given attribute specifications.
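A minimal sketch of this ranking criterion, assuming (purely for illustration) two hypothetical binary sensors that are independent given the action attribute, with invented conditional probabilities:

```python
import math

# Hypothetical P(sensor fires | action attribute), for two toy sensors.
P_SENSOR_GIVEN_ACTION = {
    "motion": {True: 0.9, False: 0.2},
    "energy": {True: 0.8, False: 0.3},
}

def log_likelihood(sensors, action):
    """log P(S | A): sensors assumed independent given the attribute."""
    ll = 0.0
    for name, fired in sensors.items():
        p = P_SENSOR_GIVEN_ACTION[name][action]
        ll += math.log(p if fired else 1 - p)
    return ll

def rank(database, action):
    """Clip ids sorted by decreasing likelihood of their sensor readings
    under the query's attribute specification."""
    return sorted(database,
                  key=lambda cid: log_likelihood(database[cid], action),
                  reverse=True)

clips = {
    "rugby":    {"motion": True,  "energy": True},
    "dialogue": {"motion": False, "energy": False},
}
print(rank(clips, action=True))  # the action query ranks the rugby clip first
```

In the full system the attribute specification A would itself be inferred from the example clip; here it is given directly to keep the sketch small.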
The point that distinguishes BMoViES from most current retrieval systems is that its retrieval criterion, semantic similarity, is much more meaningful than the standard visual similarity criterion. In fact, whenever a user asks the machine to "search for a picture like this," the user is most likely referring to pictures that are semantically similar to the query image (e.g., "pictures which also contain people") but which do not necessarily contain identical patterns of colour and texture.
Figure 3.12 presents an example of retrieval by semantic similarity. The image in the top left is a key frame of the clip submitted as a query by the user, and the remaining images are key frames of the clips returned by BMoViES. Notice that most of the suggestions made by the system are indeed semantically similar to the query, but very few are similar in terms of colour and texture patterns.
Figure 3.12: Example-based retrieval in BMoViES. The top left image is a key frame of the clip submitted to the retrieval system. The remaining images are key frames of the best seven matches found by the system (© 1998 IEEE).
Because retrieval of visual information is a complex problem, it is unlikely that any retrieval system will always find the desired database entries in response to a user request. Consequently, the retrieval process is usually interactive, and it is important that retrieval systems can learn from user feedback in order to minimize the number of iterations required for each retrieval operation. While simply combining a relevance feedback mechanism with a low-level image representation is unlikely to solve the retrieval problem, it is equally unlikely that the sophistication of the representation can provide a solution by itself. The goal is therefore to build systems that combine sophisticated representations with learning or relevance feedback mechanisms. Once again, the flexibility inherent to the Bayesian formulation enables a unified framework for addressing both issues.
In the particular case of Bayesian architectures, support for relevance feedback follows immediately from the ability to propagate beliefs, and from the fact that nodes can be either hidden or observed at any point in time. The situation is illustrated in Figure 3.13, which depicts two steps of an interactive retrieval session. The user starts by specifying a few attributes, e.g., "show me scenes with people." After belief propagation the system finds the sensor configurations that are most likely to satisfy those specifications and retrieves the entries in the database that lead to those configurations. The user inspects the returned sequences and gives feedback to the system (e.g., "not interested in action scenes"); beliefs are then updated, new data is fetched from the database, and so on. Since the attributes known to the system are semantic, this type of feedback is very intuitive, and much better tuned to the way in which the user evaluates the content than relevance feedback based on low-level features.
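The interactive loop can be sketched as follows: each round of feedback instantiates one more attribute, and the clips are re-scored under the accumulated specification. The per-clip attribute beliefs are hypothetical stand-ins for the posteriors the network would infer, not actual BMoViES values.

```python
# Hypothetical per-clip beliefs P(attribute active | clip).
clips = {
    "ballroom": {"people": 0.95, "action": 0.80},
    "meadow":   {"people": 0.90, "action": 0.10},
    "graphics": {"people": 0.05, "action": 0.60},
}

def score(beliefs, spec):
    """P(spec satisfied | clip), treating attribute beliefs as independent."""
    s = 1.0
    for attr, wanted in spec.items():
        p = beliefs[attr]
        s *= p if wanted else 1 - p
    return s

def best(spec):
    """Clip that best satisfies the accumulated attribute specification."""
    return max(clips, key=lambda c: score(clips[c], spec))

spec = {"people": True}     # round 1: "show me scenes with people"
print(best(spec))           # the ballroom clip ranks first
spec["action"] = False      # round 2: "not interested in action scenes"
print(best(spec))           # the updated beliefs demote the ballroom clip
```

Unspecified attributes simply stay out of the specification, which is the "don't care" state used in the figures below.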
Figure 3.13: Relevance feedback in the Bayesian setting (© 1998 IEEE).
Figure 3.14 illustrates the ability of BMoViES to support meaningful user interaction. The top row presents the video clips retrieved in response to a query where the action attribute was instantiated with yes, and the remaining attributes with don't care. The system suggests a shot of ballroom dancing as the most likely to satisfy the query, followed by a clip containing some graphics and a clip of a rugby match. In this example, the user was not interested in clips containing a lot of people. Specifying no for the crowd attribute led to the refinement shown in the second row of the figure. The ballroom shot is no longer among the top suggestions, which tend to include at most one or two people. At this point, the user specified that he was looking for scenes shot in a natural set, leading the system to suggest the clips shown in the third row of the figure. The clips that are most likely to satisfy the specification contain scenes of people running in a forest. Finally, the specification of no for the close-up attribute led to the suggestions in the bottom row of the figure, where the clips containing close-ups were replaced by clips where the set becomes predominant.
Figure 3.14: Relevance feedback in BMoViES. Each row presents the response of the system to the query on the left. The action (A), crowd (C), natural set (N), and close-up (D) attributes are instantiated with yes (y), no (n), or don't care (x). The confidence of the system in each of the retrieved clips is shown on top of the corresponding key frame (© 1998 IEEE).
In addition to intuitive relevance feedback mechanisms, the semantic characterization performed by BMoViES enables powerful modes of user interaction via a combination of summarization, visualization, and browsing. For a system capable of inferring content semantics, summarization is a simple outcome of the characterization process. Because system and user understand the same language, all that is required is that the system can display the inferred semantic attributes in a way that does not overwhelm the user. The user can then use his/her own cognitive resources to extrapolate from these semantic attributes to other attributes, usually of higher semantic level, that may be required for a coarse understanding of the content.
In BMoViES, this graphical summarization is attained in the form of a time-line that displays the evolution of the state of the semantic attributes throughout the movie. Figure 3.15 presents the time-lines resulting from the analysis of the promotional trailers of the movies "Circle of Friends" (COF) and "The River Wild" (TRW). Each line in the time-line corresponds to the semantic attribute identified by the letter on the left margin - "A" for action, "D" for close-up, "C" for crowd, and "S" for natural set - and each interval between small tick marks displays the state of the attribute in one shot of the trailer - filled (empty) intervals mean that the attribute is active (not present). The shots are represented in the order in which they appear in the trailer.
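A text-mode rendition of such a time-line is easy to sketch: one row per attribute, one character per shot, filled for "active". The shot sequence below is invented for illustration and does not correspond to either trailer.

```python
def timeline(states):
    """Render a semantic time-line.

    states: dict mapping an attribute letter to a list of booleans,
    one per shot; '#' marks shots where the attribute is active.
    """
    rows = []
    for letter, shots in states.items():
        row = "".join("#" if on else "." for on in shots)
        rows.append(f"{letter} |{row}|")
    return "\n".join(rows)

# Hypothetical five-shot trailer.
demo = {
    "A": [False, True, True, False, True],    # action
    "D": [True, False, False, True, False],   # close-up
    "C": [False, False, True, False, False],  # crowd
    "S": [False, True, True, False, True],    # natural set
}
print(timeline(demo))
```

Even this crude rendering conveys at a glance where the action concentrates and which shots are close-ups, which is the point of the graphical summary.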
Figure 3.15: Semantic time-lines for the trailers of the movies "Circle of Friends" (top) and "The River Wild" (bottom) (© 1998 IEEE).
By simple visual inspection of these time-lines, the user can quickly extract a significant amount of information about the content of the two movies. Namely, it is possible to immediately understand that while COF contains very few action scenes, consists mostly of dialogue, and is for the most part shot in man-made sets, TRW is mostly about action, contains few dialogues, and is shot in the wilderness. When faced with such descriptions, few users looking for a romance would consider TRW worthy of further inspection, and few users looking for a thriller would give COF further consideration.
In fact, it can be argued that, given written summaries of the two movies, few people would have difficulty establishing the correspondence between summaries and movies based on the information provided by the semantic time-lines alone. To verify this, the reader is invited to consider the following summaries, extracted from the Internet Movie Database.
Circle of Friends:
A story about the lives, loves, and betrayals of three Irish girls, Bennie, Eve, and Nan as they go to Trinity College, Dublin. Bennie soon seems to have found her ideal man in Jack, but events conspire to ruin their happiness.
The River Wild:
Gail, an expert at white water rafting, takes her family on a trip down the river to their family's house. Along the way, the family encounters two men who are inexperienced rafters that need to find their friends down river. Later, the family finds out that the pair of men are armed robbers. The men then physically force the family to take them down the river to meet their accomplices. The rafting trip for the family is definitely ruined, but most importantly, their lives are at stake.
Since visual inspection is significantly easier and faster than reading text descriptions, the significance of this conjecture is that semantic summarization could be a viable replacement for the textual summaries that are now so prevalent. We are currently planning experiments with human subjects that will allow a more objective assessment of the benefits of semantic summarization.
It can obviously be argued that the example above does not fully test the capabilities of the semantic characterization, i.e., that the movies belong to such different genres that even the roughest semantic characterization would allow a smart user to find the desired movie. What if, instead of distinguishing COF from TRW, we wanted to differentiate TRW from "Ghost and the Darkness" (GAD)? GAD is summarized as follows:
Ghost and the Darkness:
Set in 1898, this movie is based on the true story of two lions in Africa that killed 130 people over a nine-month period, while a bridge engineer and an experienced old hunter tried to kill them.
From this summary, one would also expect GAD to contain a significant amount of action, little dialogue, and be shot in the wilderness. How would the semantic characterization help here?
There are two answers to this question. The first is that it would not, because the characterization is not fine enough to distinguish between TRW and GAD. The solution would be to augment the system with finer semantic attributes, e.g., to subdivide the natural set attribute into classes like "river," "forest," "savannah," "desert," etc. The second, significantly simpler, answer is that while simply looking at the time-lines would not help, interacting with them would.
Consider the action scenes in the two movies. While in TRW we would expect to see a river, a woman, and good and bad guys, in GAD we would expect to see savannah, lions, and hunters. Thus, the action scenes would probably be the place to look first. Consider next the TRW time-line in the bottom of Figure 3.15. The high concentration of action shots in the highlighted area indicates that this is likely to be the best area to look for action. This is confirmed by Figure 3.16, which presents key frames for each of the shots in the area. By actually viewing the shots represented in the figure, it becomes clear that the action occurs in a river, that there are good and bad guys (the first and third shots depict a fight), and that there are a woman and a child in the boat. That is, even when the information contained in the semantic attributes is not enough to completely disambiguate the content, they provide a way to quickly access the relevant portions of the video stream. Semantic-based access is an important feature on its own, since it allows users browsing the video to move quickly to the portions that really interest them.
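Locating such a high-concentration region can be sketched as a fixed-width sliding window over the binary action track of the time-line; the shot sequence below is illustrative, not the actual TRW trailer.

```python
def densest_window(active, width):
    """Return (start, count) of the width-shot window containing the
    most action shots; ties go to the earliest window."""
    best_start, best_count = 0, -1
    for start in range(len(active) - width + 1):
        count = sum(active[start:start + width])
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count

# Hypothetical action track: 1 = action shot, 0 = no action.
shots = [0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0]
start, count = densest_window(shots, width=5)
print(start, count)  # the densest five-shot window starts at shot 3
```

In an interactive setting the highlighted area of the time-line plays exactly this role: it directs the user to the shots most worth viewing.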
Figure 3.16: Key-frames of the shots in the highlighted area of the time-line in the bottom of Figure 3.15. The shot (correctly) classified as not containing action is omitted (© 1998 IEEE).