The Bayesian shot segmentation module of section 5 and the Bayesian architecture of section 6 form the basis of the BMoViES system for video classification, retrieval, summarization, and browsing. This system is our first attempt at evaluating the practical feasibility of extracting semantics in a domain as sophisticated as that of movies.
In the current implementation, the system recognizes four semantic shot attributes: the presence/absence of a close-up, the presence/absence of a crowd in the scene, the type of set (nature vs. urban), and whether the shot contains a significant amount of action. These attributes can be seen as a minimalist characterization of mise-en-scene which, nevertheless, provides a basis for categorizing the video into relevant semantic categories such as "action vs. dialog," "city vs. countryside," or combinations of these. Also, as discussed in section 2, it captures the aspects of mise-en-scene that are essential for the inference of higher-level semantic attributes such as suspense or drama.
Currently, the sensor set consists of three sensors measuring the following properties: shot activity, texture energy, and amount of skin tones in the scene. Shot activity is measured as discussed in section 4. The texture energy sensor performs a 3-octave wavelet decomposition of each image and measures the ratio of the total energy in the high-pass horizontal and vertical bands to the total energy in all the bands other than the DC. It produces a low output whenever there is a significant amount of vertical or horizontal structure in the images (as is the case in most man-made environments) and a high output when this is not the case (as is typically the case in natural settings). Finally, the skin tones sensor identifies the regions of each image that contain colours consistent with human skin, measures the area of each of these regions, and computes the entropy of the resulting vector (regarding each component as a probability). This sensor outputs a low value when there is a single region of skin and high values otherwise. The complete absence of skin tones is also detected, in which case the sensor output is set to one.
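The skin tones sensor can be sketched as follows. The function name and the choice of base-2 entropy are assumptions for illustration; the region-detection step itself is omitted, and the function operates directly on the vector of skin-region areas described above.

```python
import numpy as np

def skin_tone_sensor(region_areas):
    """Hypothetical sketch of the skin-tones sensor.

    Takes the areas of the detected skin regions in a frame,
    regards the normalized areas as a probability vector, and
    returns its entropy: low for a single region (close-up),
    high for many regions (crowd), and one when no skin is found.
    """
    if len(region_areas) == 0:
        return 1.0  # complete absence of skin tones
    p = np.asarray(region_areas, dtype=float)
    p = p / p.sum()  # regard each component as a probability
    return float(-np.sum(p * np.log2(p + 1e-12)))
```

For instance, a frame with one large skin region yields an entropy near zero, while four equal-sized regions yield an entropy near two bits.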
Sensor measurements are integrated across each shot by averaging the individual frame outputs. In order to quantize the sensor outputs, their range was thresholded into three equally sized bins. In this way, each sensor provides a ternary output corresponding to the states no, yes, or maybe. For example, the activity sensor can output one of three states: "there is no significant activity in this shot," "there is a significant amount of activity in this shot," or "maybe there is significant activity, I can't tell."
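The averaging and quantization step above can be sketched as a small routine. The function name and the explicit range endpoints are assumptions; the three equally sized bins follow the description in the text.

```python
import numpy as np

def quantize_ternary(frame_outputs, lo=0.0, hi=1.0):
    """Average a sensor's frame-level outputs over a shot, then
    threshold the result into one of three equally sized bins,
    giving the ternary states no / maybe / yes."""
    avg = float(np.mean(frame_outputs))
    third = (hi - lo) / 3.0
    if avg < lo + third:
        return "no"
    elif avg < lo + 2 * third:
        return "maybe"
    return "yes"
```

For example, a shot whose activity sensor averages 0.15 over its frames maps to "no", 0.5 to "maybe", and 0.85 to "yes".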
The Bayesian network implemented in BMoViES is presented in Figure 3.9. The parameters of this model can either be learned from training data or set according to expert knowledge. In the current implementation we followed the latter approach. Both the structure and the probabilities in the model were hand-coded, using common sense (e.g., the output of the skin tones sensor will be yes with probability 0.9 for a scene of a crowd in a man-made set). No effort was made to optimize the overall performance of the system by tweaking the network probabilities.
Figure 3.9: Bayesian network implemented in BMoViES (© 1998 IEEE).
To see how explaining away occurs in BMoViES, consider the observation of a significant amount of skin tones. Such an observation is consistent with either a close-up or a scene of a crowd. However, if a crowd is present there will also be a significant response by the texture sensor, while the opposite will happen if the shot consists of a close-up. Hence, the texture sensor "explains away" the observation of skin tones and rules out the close-up hypothesis, even though it is not a crowd detector.
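The explaining-away effect can be demonstrated with a toy two-cause network computed by exhaustive enumeration. All priors and conditional probabilities below are illustrative assumptions, not the hand-coded values of the BMoViES network; only the qualitative structure (skin responds to close-ups or crowds, texture mainly to crowds) follows the text.

```python
from itertools import product

# Illustrative parameters (assumed, not the BMoViES values)
P_C = 0.2  # prior P(close-up)
P_W = 0.2  # prior P(crowd)

def p_skin(c, w):
    """Skin sensor fires for a close-up or a crowd."""
    return 0.9 if (c or w) else 0.1

def p_texture(w):
    """Texture sensor responds mainly to crowds."""
    return 0.9 if w else 0.2

def posterior_closeup(texture_observed):
    """P(close-up = 1 | skin = 1 [, texture = 1]) by enumerating
    the joint distribution over the two binary causes."""
    num = den = 0.0
    for c, w in product((0, 1), repeat=2):
        p = (P_C if c else 1 - P_C) * (P_W if w else 1 - P_W) * p_skin(c, w)
        if texture_observed:
            p *= p_texture(w)
        den += p
        if c:
            num += p
    return num / den
```

With these numbers, observing skin alone raises the close-up posterior from its 0.2 prior to about 0.46; additionally observing a strong texture response pulls it back down to about 0.30, because the crowd hypothesis now accounts for the skin evidence.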