4. Performance Evaluation of Temporal Segmentation Techniques

In this section we give some guidelines for the performance evaluation of temporal segmentation techniques. In particular, we define a test bed made of two techniques, respectively named T1 and T2, used as comparison terms and suitable for detection of abrupt and abrupt/gradual transitions, respectively. Both T1 and T2 techniques require a manual or semi-automatic work for adapting thresholds and other parameters to the feature of input sequences, in order to obtain performances near to the optimum for both of them. It is with these optimal or near-to-optimal performances that a new temporal segmentation technique can be compared. As an example, we will evaluate the performance of our MLP-based approach. The design of this evaluation framework is part of a project on video temporal segmentation currently under development at the Computer Science and Artificial Intelligence lab of the University of Palermo.

We also describe the dataset we used for the evaluation. We used four video sequences representative of different types of video. In particular, we considered two sport sequences (soccer and F1 world championship), one news sequence and one naturalistic documentary. Details on test sequences are reported in Table 7.2.

Table 7.2: Characteristics of dataset
Sequence	Frame rate (fps)	# of frames	# of abrupt transitions	# of gradual transitions	Total # of transitions
Soccer	25	19260	115	51	166
F1	25	20083	106	14	120
News	25	20642	122	6	128
Nature	25	25016	150	12	162
	Total	85001	493	83	576

Most of the gradual transitions are dissolves. A few wipes are present and they are mainly in the F1 sequence. As previously stated both the T2 technique and the MLP-based one are aimed to the detection of a gradual transition but cannot discriminate between different kinds of gradual transitions.

In order to evaluate the performance of a segmentation algorithm we should primarily consider correctly detected transitions, false detections and missed transitions. Then we should analyze the performance with respect to gradual transition classification accuracy.

Recall and precision factors are normally used to evaluate performance [28] of transition detection techniques. These quality factors are defined as follows:

(7.25)

where n_c is the number of transitions correctly detected, n_f is the number of false positives and n_m is the number of missed detections. Ideally, n_f = n_m =0 so that both the factors are 1. In our test bed we found useful a new quality factor defined in the following way:

(7.26)

where n_tot is the total number of transitions and n_m = n_tot - n_c. This quality factor takes simultaneously into account the ability to avoid both false and missed detections.

To evaluate the performance of transition classification techniques a different quality factor is needed. Assume n_cc as the number of abrupt transitions detected and correctly classified, n_cg as the number of gradual transitions detected and correctly classified and n_sw the number of transitions detected but misclassified. We can define the following quality factor:

(7.27)

This quality factor is 1 if n_sw=0. Note that, as previously stated, I_sw is only a measure of transition classification accuracy and then cannot replace the quality factor Q_r that is a measure of transition detection performance. The two factors should always be used together to evaluate the performance of a system.

4.1 Performance Evaluation of the Technique T1

We used the interframe metrics bin-to-bin, chi-square and histogram correlation, and called T1 the technique whose result coincides, in each analyzed case, with the output of the best performing of the three methods. We evaluated the quality factor Q_r for the different HDM used varying the value of the threshold.

Using the annotated dataset described in the previous section we were able to determine for each sequence the optimal threshold. For example, in Figures 7.7 and 7.8 is reported the quality factor Q_r as a function of the threshold using the bin-to-bin metric, respectively, for soccer and nature videos. Obviously, as the threshold increases the number of false detections decreases but the number of missed detections increases. From Figures 7.7 and 7.8 it is possible to see that Q_r for the nature video is higher, meaning a better performance, than Q_r for the soccer video. This is due to the larger presence of gradual transitions in soccer video that are misdetected.

click to expand
Figure 7.7: Performance of abrupt detection technique T1 for the soccer test sequence.

click to expand
Figure 7.8: Performance of abrupt detection technique T1 for the nature test sequence.

For the other sequences and for the other metrics we obtained very similar shapes of the curves of Figures 7.7 and 7.8. In Tables 7.3–7.5 it is summarized the performance of this technique on the four different test video and using three different histogram metrics in correspondence to the optimal value of the threshold:

Table 7.3: Performance using the b2b metric
Sequence	Optimal threshold	Qr
Soccer	0.30	0.68
News	0.30	0.76
F1	0.30	0.72
Nature	0.35	0.79

Table 7.4: Performance using the chi-square metric
Sequence	Optimal threshold	Qr
Soccer	0.35	0.62
News	0.35	0.74
F1	0.40	0.69
Nature	0.40	0.75

Table 7.5: Performance using the histogram correlation metric.
Sequence	Optimal threshold	Qr
Soccer	0.20	0.64
News	0.25	0.74
F1	0.25	0.71
Nature	0.25	0.77

It should be noted that the bin-to-bin metric, despite of its simpleness, exhibits the best behavior for the proposed input sequences.

4.2 Performance Evaluation of the Technique T2

To evaluate the performance of a temporal segmentation technique in presence of gradual transitions, as a term of comparison we implemented a multi-threshold technique, called T2, inspired by the algorithm presented in [7]. The technique T2, similarly to other techniques presented in literature, is based on the observation that the video properties relevant for the shot transition detection are intrinsically local, i.e., depending on the behavior of the interframe metric within a temporal window of a few decades of frames. Within the window, a shot boundary may be detected and classified analyzing statistical factors like the mean and the standard deviation of the sequence of interframe metric values. In particular, the local standard deviation calculated when an abrupt transition is present within the window is very different from those obtained in case of a gradual transition, due to the more distributed changes of the interframe metric values in the latter case.

Table 7.6 shows the definition and the meaning of thresholds and other factors intervening in the transition detection process.

Table 7.6: Factors intervening in transition detection (technique T2)
Parameter	Description
w	Temporal window of analysis
T_hc = μw + α_cσ_w	High threshold for abrupt transitions
T_hg = μ_w + α_gσ_w	High threshold for gradual transitions
T_lc = β_cμ_T	Low threshold for abrupt transitions
T_lg = β_gμ_T	Low threshold for gradual transitions

In the above definitions, μ_w and σ_w, respectively, are the mean and the standard deviation of the interframe metric values within the temporal window w, while μ_T is the global mean of the interframe metric. The coefficients α_c, α_g, β_c and β_g are determined experimentally, as well as the optimal size of the window w.

The technique T2 operates in the following way. Shot boundary occurrences are detected in correspondence to the central pair of frames within the current temporal window. To detect a cut, the corresponding interframe metric value must be the maximum of values belonging to the current window, and also greater than the local thresholds T_hc and T_hg. Moreover, it must also be greater than the global threshold T_lc, whose value is depending, through the coefficient β_c, on the global mean of the interframe metric. Similarly, a gradual transition is detected in correspondence to the central pair of frames within the current temporal window if the corresponding interframe metric value is a local maximum, and if it is greater than T_hg and T_lg, but not greater than T_hc.

It should be noted that verifying the presence of a local maximum before declaring a transition is a condition necessary to avoid the detection of more than one transition within the same temporal window. Moreover, two different local thresholds, T_lc and T_lg, have been introduced to overcome the problem of local perturbation effects, which could cause exceeding the threshold T_hg, as it has also been experimentally verified. T_lc and T_lg must be different, because the average metric values related to abrupt transitions are very different from the average metric values related to gradual transitions.

Figure 7.9 shows the behavior of the interframe metric bin-to-bin for the soccer sequence, outlining the variation of local thresholds T_hc and T_hg (the values w = 21, α_c = 4 and α_g = 1.5 have been assumed). It should be noted that both the thresholds are exceeded in correspondence to abrupt transitions, while T_hg (but not T_hc) is sometimes exceeded when searching for gradual transitions.

click to expand
Figure 7.9: Interframe bin-to-bin metric and local thresholds for a soccer sequence.

These aspects are even more evident in Figure 7.10, where the metric bin-to-bin is shown for a shorter sequence of 250 frames extracted from the same video sequence. In this figure, two cuts of different value are reported, along with a gradual transition (dissolve) characterized by prominent fluctuations.

click to expand
Figure 7.10: Interframe bin-to-bin metric and local thresholds for a soccer sequence.

As the technique T2 is largely dependent on various parameters and thresholds, we carefully evaluated its performance by letting the parameters vary in quite large intervals, with small variation steps (intervals and steps were determined on the basis of preliminary experiments), as shown in Table 7.7.

Table 7.7: Range of parameters used for experiments
Parameter	Range	Step
w	15–49	2
α_c	2.0–5.0	0.5
α_g	1.0–3.0	0.5
β_c	2.0–5.0	0.5
β_g	1.0–4.0	0.5

In Tables 7.8–7.10 we report the best result obtained for each metric and each sequence and the set of parameters that led to the result:

Table 7.8: Optimal result using bin-to-bin metric
Sequence	Qr	*I_sw*	W	*α_c*	*α_g*	*β_c*	*β_g*
Soccer	0.84	0.94	21	4.0	1.5	3.0	2.0
F1	0.82	0.92	25	3.5	1.5	3.0	2.0
News	0.83	0.97	21	3.5	2.0	3.0	2.0
Nature	0.83	0.96	21	3.5	2.0	3.0	2.0

Table 7.9: Optimal result using chi-squares metric
Sequence	Qr	*I_sw*	W	*α_c*	*α_g*	*β_c*	*β_g*
Soccer	0.72	0.80	29	3.5	2.0	3.5	3.0
F1	0.75	0.76	27	3.0	2.0	3.0	2.5
News	0.78	0.86	27	3.5	2.5	3.5	2.5
Nature	0.71	0.88	25	3.0	2.5	3.5	3.0

Table 7.10: Optimal result using correlation metric
Sequence	Qr	*I_sw*	W	*α_c*	*α_g*	*β_c*	*β_g*
Soccer	0.76	0.90	25	4.0	2.5	2.0	1.5
F1	0.77	0.92	29	4.0	2.5	2.5	1.5
News	0.78	0.86	27	4.0	2.5	2.0	1.5
Nature	0.75	0.79	25	4.5	2.5	2.5	2.0

4.3 Performance Evaluation of the MLP-Based Technique

To evaluate the performance of the proposed MLP-based technique we used the soccer and news video as training set and the nature and F1 video as test set. After each training epoch, the MLP performance is evaluated both on the training and test set.

The MLP architecture has been determined experimentally. The number of input units is obviously related to the probable number of frames belonging to a gradual transition. From video data available in our dataset we observed that the duration of gradual transitions is usually between 10 and 20 frames, i.e., below one second at 25 fps, a result in accord with modern electronic editing tools. We then tried several networks with 21 or 31 input units, a choice allowing to capture an entire transition without the risk of enclosing more than one transition within an unique temporal window. A negative consequence of this choice is the impossibility of dealing with very short shot transitions, like those ones typical of some commercial or musical sequences. However, we assumed our dataset more representative of kinds of video sequences of interest for video segmentation.

We tried network architectures with different numbers of hidden units. If the number of input units is intuitively related to the duration of gradual transitions, the choice of the number of hidden units is critical, in the sense that a limited number of hidden units can limit the ability of the network to generalize, while an excessive number of hidden units can induce problems of overtraining. Even this aspect has been investigated experimentally.

In Table 7.11 the performance of different MLP architectures after 50 training epochs is reported. The network architecture is indicated with three numbers indicating, respectively, the number of input, hidden and output units. The performance is expressed in terms of the quality factors previously defined, both on training set and testing set:

Table 7.11: Performance of MLPs of different architecture
	(21,10,3)		(21,40,3)		(21,100,3)		(31,40,3)
	Qr	*I_sw*	*Q_r*	*I_sw*	*Q_r*	*I_sw*	*Q_r*	*I_sw*
Soccer	0.87	0.92	0.95	0.99	0.86	0.90	0.99	0.99
News	0.84	0.94	0.94	1.00	0.84	0.94	0.94	1.00
F1	0.84	0.90	0.89	0.97	0.82	0.91	0.88	0.97
Nature	0.80	0.91	0.91	0.99	0.77	0.88	0.89	0.96

From inspection of Table 7.11 it is evident the best performance is obtained using a network with 21 input units and 40 hidden units. As expected, the network with 10 or 100 hidden units does not perform well. Since the use of 31 input units does not give significant performance improvement on our dataset, the (21-40-3) network has been selected for its lower computational load and for its better ability to detect temporally close transitions.

4.4 Comparison of Different Techniques

In this section we compare the three techniques we described in the previous sections. In Table 7.12 a summary of most important results is reported. Note that the results of techniques T1 and T2 were obtained using a choice of optimal parameters (for T1 and T2). Similarly the results of the MLP-based technique for the sequences soccer and news are computed on the training set. These results are then to be considered optimal and, in practice, we should expect worse performance.

Table 7.12: Performance comparison among the described techniques
	T1		T2		MLP
	*Q_r*	*I_sw*	*Q_r*	*I_sw*	*Q_r*	*I_sw*
Soccer	0.68	---	0.84	0.94	0.95	0.99
News	0.76	---	0.82	0.92	0.94	1.00
F1	0.72	---	0.83	0.97	0.89	0.97
Nature	0.79	---	0.83	0,96	0.91	0.99

On the other side, it is important to note that the performance of the MLP-based classifier on the sequences F1 and Nature, which are outside of the training set, is not based on the knowledge of the ground truth and then may be considered representative of the algorithm behavior. If we used the optimal parameters computed for techniques T1 and T2 for the sequences soccer and news to analyze the sequences F1 and Nature, we would obtain a worse performance. Finally, although there is no theoretical justification, we note that, independently of the decision technique, best results are obtained using bin-to-bin metric.