4. Performance Evaluation of Temporal Segmentation Techniques


4. Performance Evaluation of Temporal Segmentation Techniques

In this section we give some guidelines for the performance evaluation of temporal segmentation techniques. In particular, we define a test bed made of two techniques, respectively named T1 and T2, used as comparison terms and suitable for detection of abrupt and abrupt/gradual transitions, respectively. Both T1 and T2 techniques require a manual or semi-automatic work for adapting thresholds and other parameters to the feature of input sequences, in order to obtain performances near to the optimum for both of them. It is with these optimal or near-to-optimal performances that a new temporal segmentation technique can be compared. As an example, we will evaluate the performance of our MLP-based approach. The design of this evaluation framework is part of a project on video temporal segmentation currently under development at the Computer Science and Artificial Intelligence lab of the University of Palermo.

We also describe the dataset we used for the evaluation. We used four video sequences representative of different types of video. In particular, we considered two sport sequences (soccer and F1 world championship), one news sequence and one naturalistic documentary. Details on test sequences are reported in Table 7.2.

Table 7.2: Characteristics of dataset

Sequence

Frame rate (fps)

# of frames

# of abrupt transitions

# of gradual transitions

Total # of transitions

Soccer

25

19260

115

51

166

F1

25

20083

106

14

120

News

25

20642

122

6

128

Nature

25

25016

150

12

162

Total

85001

493

83

576

Most of the gradual transitions are dissolves. A few wipes are present and they are mainly in the F1 sequence. As previously stated both the T2 technique and the MLP-based one are aimed to the detection of a gradual transition but cannot discriminate between different kinds of gradual transitions.

In order to evaluate the performance of a segmentation algorithm we should primarily consider correctly detected transitions, false detections and missed transitions. Then we should analyze the performance with respect to gradual transition classification accuracy.

Recall and precision factors are normally used to evaluate performance [28] of transition detection techniques. These quality factors are defined as follows:

(7.25)

where nc is the number of transitions correctly detected, nf is the number of false positives and nm is the number of missed detections. Ideally, nf = nm =0 so that both the factors are 1. In our test bed we found useful a new quality factor defined in the following way:

(7.26) click to expand

where ntot is the total number of transitions and nm = ntot - nc. This quality factor takes simultaneously into account the ability to avoid both false and missed detections.

To evaluate the performance of transition classification techniques a different quality factor is needed. Assume ncc as the number of abrupt transitions detected and correctly classified, ncg as the number of gradual transitions detected and correctly classified and nsw the number of transitions detected but misclassified. We can define the following quality factor:

(7.27)

This quality factor is 1 if nsw=0. Note that, as previously stated, Isw is only a measure of transition classification accuracy and then cannot replace the quality factor Qr that is a measure of transition detection performance. The two factors should always be used together to evaluate the performance of a system.

4.1 Performance Evaluation of the Technique T1

We used the interframe metrics bin-to-bin, chi-square and histogram correlation, and called T1 the technique whose result coincides, in each analyzed case, with the output of the best performing of the three methods. We evaluated the quality factor Qr for the different HDM used varying the value of the threshold.

Using the annotated dataset described in the previous section we were able to determine for each sequence the optimal threshold. For example, in Figures 7.7 and 7.8 is reported the quality factor Qr as a function of the threshold using the bin-to-bin metric, respectively, for soccer and nature videos. Obviously, as the threshold increases the number of false detections decreases but the number of missed detections increases. From Figures 7.7 and 7.8 it is possible to see that Qr for the nature video is higher, meaning a better performance, than Qr for the soccer video. This is due to the larger presence of gradual transitions in soccer video that are misdetected.

click to expand
Figure 7.7: Performance of abrupt detection technique T1 for the soccer test sequence.

click to expand
Figure 7.8: Performance of abrupt detection technique T1 for the nature test sequence.

For the other sequences and for the other metrics we obtained very similar shapes of the curves of Figures 7.7 and 7.8. In Tables 7.3–7.5 it is summarized the performance of this technique on the four different test video and using three different histogram metrics in correspondence to the optimal value of the threshold:

Table 7.3: Performance using the b2b metric

Sequence

Optimal threshold

Qr

Soccer

0.30

0.68

News

0.30

0.76

F1

0.30

0.72

Nature

0.35

0.79

Table 7.4: Performance using the chi-square metric

Sequence

Optimal threshold

Qr

Soccer

0.35

0.62

News

0.35

0.74

F1

0.40

0.69

Nature

0.40

0.75

Table 7.5: Performance using the histogram correlation metric.

Sequence

Optimal threshold

Qr

Soccer

0.20

0.64

News

0.25

0.74

F1

0.25

0.71

Nature

0.25

0.77

It should be noted that the bin-to-bin metric, despite of its simpleness, exhibits the best behavior for the proposed input sequences.

4.2 Performance Evaluation of the Technique T2

To evaluate the performance of a temporal segmentation technique in presence of gradual transitions, as a term of comparison we implemented a multi-threshold technique, called T2, inspired by the algorithm presented in [7]. The technique T2, similarly to other techniques presented in literature, is based on the observation that the video properties relevant for the shot transition detection are intrinsically local, i.e., depending on the behavior of the interframe metric within a temporal window of a few decades of frames. Within the window, a shot boundary may be detected and classified analyzing statistical factors like the mean and the standard deviation of the sequence of interframe metric values. In particular, the local standard deviation calculated when an abrupt transition is present within the window is very different from those obtained in case of a gradual transition, due to the more distributed changes of the interframe metric values in the latter case.

Table 7.6 shows the definition and the meaning of thresholds and other factors intervening in the transition detection process.

Table 7.6: Factors intervening in transition detection (technique T2)

Parameter

Description

w

Temporal window of analysis

Thc = μw + αcσw

High threshold for abrupt transitions

Thg = μw + αgσw

High threshold for gradual transitions

Tlc = βcμT

Low threshold for abrupt transitions

Tlg = βgμT

Low threshold for gradual transitions

In the above definitions, μw and σw, respectively, are the mean and the standard deviation of the interframe metric values within the temporal window w, while μT is the global mean of the interframe metric. The coefficients αc, αg, βc and βg are determined experimentally, as well as the optimal size of the window w.

The technique T2 operates in the following way. Shot boundary occurrences are detected in correspondence to the central pair of frames within the current temporal window. To detect a cut, the corresponding interframe metric value must be the maximum of values belonging to the current window, and also greater than the local thresholds Thc and Thg. Moreover, it must also be greater than the global threshold Tlc, whose value is depending, through the coefficient βc, on the global mean of the interframe metric. Similarly, a gradual transition is detected in correspondence to the central pair of frames within the current temporal window if the corresponding interframe metric value is a local maximum, and if it is greater than Thg and Tlg, but not greater than Thc.

It should be noted that verifying the presence of a local maximum before declaring a transition is a condition necessary to avoid the detection of more than one transition within the same temporal window. Moreover, two different local thresholds, Tlc and Tlg, have been introduced to overcome the problem of local perturbation effects, which could cause exceeding the threshold Thg, as it has also been experimentally verified. Tlc and Tlg must be different, because the average metric values related to abrupt transitions are very different from the average metric values related to gradual transitions.

Figure 7.9 shows the behavior of the interframe metric bin-to-bin for the soccer sequence, outlining the variation of local thresholds Thc and Thg (the values w = 21, αc = 4 and αg = 1.5 have been assumed). It should be noted that both the thresholds are exceeded in correspondence to abrupt transitions, while Thg (but not Thc) is sometimes exceeded when searching for gradual transitions.

click to expand
Figure 7.9: Interframe bin-to-bin metric and local thresholds for a soccer sequence.

These aspects are even more evident in Figure 7.10, where the metric bin-to-bin is shown for a shorter sequence of 250 frames extracted from the same video sequence. In this figure, two cuts of different value are reported, along with a gradual transition (dissolve) characterized by prominent fluctuations.

click to expand
Figure 7.10: Interframe bin-to-bin metric and local thresholds for a soccer sequence.

As the technique T2 is largely dependent on various parameters and thresholds, we carefully evaluated its performance by letting the parameters vary in quite large intervals, with small variation steps (intervals and steps were determined on the basis of preliminary experiments), as shown in Table 7.7.

Table 7.7: Range of parameters used for experiments

Parameter

Range

Step

w

15–49

2

αc

2.0–5.0

0.5

αg

1.0–3.0

0.5

βc

2.0–5.0

0.5

βg

1.0–4.0

0.5

In Tables 7.8–7.10 we report the best result obtained for each metric and each sequence and the set of parameters that led to the result:

Table 7.8: Optimal result using bin-to-bin metric

Sequence

Qr

Isw

W

αc

αg

βc

βg

Soccer

0.84

0.94

21

4.0

1.5

3.0

2.0

F1

0.82

0.92

25

3.5

1.5

3.0

2.0

News

0.83

0.97

21

3.5

2.0

3.0

2.0

Nature

0.83

0.96

21

3.5

2.0

3.0

2.0

Table 7.9: Optimal result using chi-squares metric

Sequence

Qr

Isw

W

αc

αg

βc

βg

Soccer

0.72

0.80

29

3.5

2.0

3.5

3.0

F1

0.75

0.76

27

3.0

2.0

3.0

2.5

News

0.78

0.86

27

3.5

2.5

3.5

2.5

Nature

0.71

0.88

25

3.0

2.5

3.5

3.0

Table 7.10: Optimal result using correlation metric

Sequence

Qr

Isw

W

αc

αg

βc

βg

Soccer

0.76

0.90

25

4.0

2.5

2.0

1.5

F1

0.77

0.92

29

4.0

2.5

2.5

1.5

News

0.78

0.86

27

4.0

2.5

2.0

1.5

Nature

0.75

0.79

25

4.5

2.5

2.5

2.0

4.3 Performance Evaluation of the MLP-Based Technique

To evaluate the performance of the proposed MLP-based technique we used the soccer and news video as training set and the nature and F1 video as test set. After each training epoch, the MLP performance is evaluated both on the training and test set.

The MLP architecture has been determined experimentally. The number of input units is obviously related to the probable number of frames belonging to a gradual transition. From video data available in our dataset we observed that the duration of gradual transitions is usually between 10 and 20 frames, i.e., below one second at 25 fps, a result in accord with modern electronic editing tools. We then tried several networks with 21 or 31 input units, a choice allowing to capture an entire transition without the risk of enclosing more than one transition within an unique temporal window. A negative consequence of this choice is the impossibility of dealing with very short shot transitions, like those ones typical of some commercial or musical sequences. However, we assumed our dataset more representative of kinds of video sequences of interest for video segmentation.

We tried network architectures with different numbers of hidden units. If the number of input units is intuitively related to the duration of gradual transitions, the choice of the number of hidden units is critical, in the sense that a limited number of hidden units can limit the ability of the network to generalize, while an excessive number of hidden units can induce problems of overtraining. Even this aspect has been investigated experimentally.

In Table 7.11 the performance of different MLP architectures after 50 training epochs is reported. The network architecture is indicated with three numbers indicating, respectively, the number of input, hidden and output units. The performance is expressed in terms of the quality factors previously defined, both on training set and testing set:

Table 7.11: Performance of MLPs of different architecture

(21,10,3)

(21,40,3)

(21,100,3)

(31,40,3)

Qr

Isw

Qr

Isw

Qr

Isw

Qr

Isw

Soccer

0.87

0.92

0.95

0.99

0.86

0.90

0.99

0.99

News

0.84

0.94

0.94

1.00

0.84

0.94

0.94

1.00

F1

0.84

0.90

0.89

0.97

0.82

0.91

0.88

0.97

Nature

0.80

0.91

0.91

0.99

0.77

0.88

0.89

0.96

From inspection of Table 7.11 it is evident the best performance is obtained using a network with 21 input units and 40 hidden units. As expected, the network with 10 or 100 hidden units does not perform well. Since the use of 31 input units does not give significant performance improvement on our dataset, the (21-40-3) network has been selected for its lower computational load and for its better ability to detect temporally close transitions.

4.4 Comparison of Different Techniques

In this section we compare the three techniques we described in the previous sections. In Table 7.12 a summary of most important results is reported. Note that the results of techniques T1 and T2 were obtained using a choice of optimal parameters (for T1 and T2). Similarly the results of the MLP-based technique for the sequences soccer and news are computed on the training set. These results are then to be considered optimal and, in practice, we should expect worse performance.

Table 7.12: Performance comparison among the described techniques

T1

T2

MLP

Qr

Isw

Qr

Isw

Qr

Isw

Soccer

0.68

---

0.84

0.94

0.95

0.99

News

0.76

---

0.82

0.92

0.94

1.00

F1

0.72

---

0.83

0.97

0.89

0.97

Nature

0.79

---

0.83

0,96

0.91

0.99

On the other side, it is important to note that the performance of the MLP-based classifier on the sequences F1 and Nature, which are outside of the training set, is not based on the knowledge of the ground truth and then may be considered representative of the algorithm behavior. If we used the optimal parameters computed for techniques T1 and T2 for the sequences soccer and news to analyze the sequences F1 and Nature, we would obtain a worse performance. Finally, although there is no theoretical justification, we note that, independently of the decision technique, best results are obtained using bin-to-bin metric.




Handbook of Video Databases. Design and Applications
Handbook of Video Databases: Design and Applications (Internet and Communications)
ISBN: 084937006X
EAN: 2147483647
Year: 2003
Pages: 393

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net