This section discusses the experimental setups and results of the shot-level classification and scene/story boundary detection.
We use two days of news video (one from May 2001, the other from June 2001) obtained from MediaCorp of Singapore to test the performance of our system. Each day's news video is half an hour long. One day is used for training and the other for testing. To eliminate indexing errors, we manually index all the features of the shots segmented using the multi-resolution analysis algorithm [14]. After the removal of commercials, the training data set contains 200 shots and the testing data set contains 183 shots. The training and test data sets contain 39 and 40 story/scene boundaries, respectively.
In information retrieval, system performance is commonly measured using precision (P) and recall (R) values. The formulas are expressed as below:
$$P = \frac{NC}{NC + FP} \tag{44.4}$$
$$R = \frac{NC}{NC + FN} \tag{44.5}$$
where
NC | the number of correct boundaries detected |
FN | the number of False Negatives (missed) |
FP | the number of False Positives (not a boundary but is detected as a boundary) |
By giving equal weights to precision and recall, we can derive an F1 value to measure the overall system performance as:
$$F_1 = \frac{2PR}{P + R} \tag{44.6}$$
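The three measures can be checked numerically. The following is a minimal Python sketch, assuming the standard definitions P = NC / (NC + FP), R = NC / (NC + FN), and F1 = 2PR / (P + R). The function name `boundary_metrics` is hypothetical; the counts are those reported for Test II in Table 44.2.

```python
def boundary_metrics(nc, fn, fp):
    """Return (precision, recall, f1) as fractions in [0, 1].

    nc: correct boundaries detected
    fn: false negatives (missed boundaries)
    fp: false positives (non-boundaries detected as boundaries)
    """
    precision = nc / (nc + fp) if (nc + fp) else 0.0
    recall = nc / (nc + fn) if (nc + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the Test II row of Table 44.2: NC = 35, FN = 5, FP = 3
p, r, f1 = boundary_metrics(nc=35, fn=5, fp=3)
print(f"P = {p:.1%}, R = {r:.1%}, F1 = {f1:.1%}")
# prints: P = 92.1%, R = 87.5%, F1 = 89.7%
```

The zero-denominator guards are an addition for robustness; they do not affect any of the results reported in this section.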
The results of shot-level classification using the Decision Tree are presented in Table 44.1. The diagonal entries in Table 44.1 show the number of shots correctly classified into the respective category, while the off-diagonal entries show those wrongly classified. The largest classification error occurs in the Anchor category, where a large number of shots are misclassified as Speech. The contents of these two categories are quite similar, so additional features such as background or speaker change, together with the context of neighboring shots, are probably needed to differentiate them.
Category (row) | Classified as -> (non-empty entries of columns a to j, left to right) |
---|---|
a | 26, 1 |
b | 16, 4 |
c | 2 |
d | 13 |
e | 1 |
f | 82, 1 |
g | 1, 11 |
h | 1, 8 |
i | 6 |
j | 5 |
Overall, our initial results indicate that we could achieve a classification accuracy of over 95%.
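Overall accuracy is the sum of the diagonal entries of the confusion matrix divided by the total number of shots. The sketch below illustrates the computation in Python on a small hypothetical 3-category matrix; the numbers are illustrative only and are not taken from Table 44.1, and the function names are assumptions.

```python
def accuracy(confusion):
    """Overall accuracy: diagonal (correct) count over the total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """Fraction of each true category (row) that was classified correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


# Hypothetical matrix: rows are true categories, columns are predicted ones.
toy = [
    [8, 2, 0],   # 2 shots of category 0 misclassified as category 1
    [1, 9, 0],
    [0, 0, 10],
]
print(accuracy(toy))          # 27 correct out of 30 shots
print(per_class_recall(toy))  # one value per true category
```

Per-category recall makes it easy to spot the kind of concentrated error described for the Anchor category, where most of the misclassified shots fall into a single other category.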
To ascertain the effectiveness of the selected feature set, we perform separate experiments using different numbers of features. As face is found to be the most important feature, it is the first feature given to the system. With the face feature alone, the system returns an accuracy of only 59.6%. Including the audio feature raises the accuracy sharply to 78.2%, which is still far below the accuracy achievable with all the features. When we successively add the remaining features in the order of shot type, motion, videotext, text centralization, and shot duration, the performance improves steadily and eventually reaches an accuracy of 95.10%. The analysis indicates that all the features are essential in shot classification. Figure 44.7 summarizes the analysis.
Figure 44.7: Summary of the features analysis
Figure 44.8 gives the interface of the shot classification system and shows the output category of each shot in an input news video.
Figure 44.8: The output category for each shot in an input news video sequence
We set up three experiments (Tests I, II, and III) for scene/story boundary detection. As explained in Section 5, our experiments indicate that using 4 states gives the best results. Thus, we set the number of states to 4 in these three HMM tests.
For Test I, we assume that all the shots are correctly tagged. We apply the HMM to locate the story boundaries and achieve an F1 value of 93.7%. This experiment demonstrates that the HMM is effective in news story boundary detection.
Test II is similar to Test I except that we perform the HMM analysis on the set of shots tagged by the earlier shot classification stage, which has about 5% tagging error. The test shows that we are able to achieve an F1 measure of 89.7%.
The results of both tests are detailed in Table 44.2.
Test | NB | NC | FN | FP | R (%) | P (%) | F1 (%) |
---|---|---|---|---|---|---|---|
I | 40 | 37 | 3 | 2 | 92.5 | 94.9 | 93.7 |
II | 40 | 35 | 5 | 3 | 87.5 | 92.1 | 89.7 |
NB: the total number of actual boundaries in the test data set.
In Test III, we want to verify whether the two-level analysis is necessary to achieve the desired level of performance. We perform HMM analysis on the set of shots with their original feature set but without the category information. We vary the number of features used from the full feature set to only a few essential features. The best result we could achieve is an F1 value of only 37.6%. This test shows that although a single-stage analysis should in theory perform best, in practice the two-level analysis is superior because of data sparseness.
To evaluate the importance of each feature used in Test II, we perform another set of experiments, first using each feature individually and then adding the second and third features to the Tag-ID feature. The results are listed in Table 44.3.
Feature | NS | NC | FN | FP | R (%) | P (%) | F1 (%) |
---|---|---|---|---|---|---|---|
Tag | 6 | 35 | 5 | 6 | 87.5 | 85.4 | 86.4 |
Sp | 6 | 35 | 5 | 93 | 87.5 | 27.3 | 41.7 |
Sc | 5 | 26 | 14 | 90 | 65.0 | 22.4 | 33.3 |
Tag+Sp | 6 | 37 | 3 | 7 | 92.5 | 84.1 | 88.9 |
Tag+Sp+Sc | 4 | 35 | 5 | 3 | 87.5 | 92.1 | 89.7 |
Table 44.3 indicates that by using only the Tag-ID feature, the system achieves an F1 measure of 86.4%. In contrast, using the second feature (Sp) or the third feature (Sc) alone returns low F1 measures of 41.7% and 33.3%, respectively. However, combining these two features with the Tag-ID feature improves the system's F1 performance gradually from 86.4% (Tag-ID only) to 88.9% (Tag-ID+Sp), and to 89.7% when all three features are included (Tag-ID+Sp+Sc). The analysis indicates that Tag-ID is the most important feature for scene/story boundary detection. It further confirms that shot classification facilitates the detection of news boundaries, and therefore that our two-level approach is effective.
Here we analyse how the HMM framework detects the story boundaries. Figure 44.9 lists two examples of the output state sequences resulting from the HMM analysis with 4 states. The figure indicates that State 4 signals the transition from the current news topic to the next. In most cases, State 4 corresponds to an Anchor shot (see the first example in Figure 44.9). However, there are exceptions, as in the second example, where State 4 represents the transition from Weather news to Special news. Thus, by detecting State 4, we can locate the boundary of the current news topic.
Figure 44.9: Two examples of the observation sequences and their output state sequences
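The idea of decoding a state sequence and then reading boundaries off a designated transition state can be sketched in Python. This section does not state which decoding algorithm produces the output state sequences, so the sketch assumes standard Viterbi decoding. It also uses a toy two-state model (one "in story" state and one "boundary" state standing in for State 4) with two observation symbols; all probabilities, names, and the observation sequence are hypothetical and are not the trained 4-state model.

```python
import math


def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for obs (all indices are 0-based)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(start_p)
    # score[s]: best log-probability of any path that ends in state s
    score = [log(start_p[s]) + log(emit_p[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + log(trans_p[best][s]) + log(emit_p[s][o]))
        back.append(pointers)
    # Backtrack from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy model (assumption, not the chapter's trained HMM)
IN_STORY, BOUNDARY = 0, 1      # hidden states
ANCHOR, OTHER = 0, 1           # observed shot tags
start_p = [0.5, 0.5]
trans_p = [[0.8, 0.2],         # IN_STORY -> IN_STORY / BOUNDARY
           [0.9, 0.1]]         # BOUNDARY -> IN_STORY / BOUNDARY
emit_p = [[0.1, 0.9],          # IN_STORY mostly emits OTHER
          [0.9, 0.1]]          # BOUNDARY mostly emits ANCHOR

shots = [ANCHOR, OTHER, OTHER, OTHER, ANCHOR, OTHER, OTHER]
states = viterbi(shots, start_p, trans_p, emit_p)
boundaries = [i for i, s in enumerate(states) if s == BOUNDARY]
print(states)      # [1, 0, 0, 0, 1, 0, 0]
print(boundaries)  # [0, 4]
```

In this toy run the boundary state coincides with the Anchor shots, mirroring the common case described above; with a richer observation alphabet the boundary state can also fire on non-Anchor shots, as in the Weather to Special news example.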