It has been pointed out that the aim of temporal segmentation is the decomposition of a video in camera-shots. Therefore, the temporal segmentation must primarily allow to exactly locate transitions between consecutive shots. Secondarily, the classification of the type of transition is of interest. Basically, temporal segmentation algorithms are based on the evaluation of the quantitative differences between successive frames and on some kind of thresholding. In general an effective segmentation technique must combine an interframe metric computationally simple and able to detect video content changes with robust decision criteria.
In subsections 2.1–2.4 we will review some of the most significant approaches proposed in recent years and discuss their strength and limitations. These techniques also constitute the building blocks of more sophisticated segmentation methodologies and tools, such as the ones illustrated in subsection 2.5, and may be used to evaluate new segmentation techniques.
Metrics classified as PDM (pixels difference metrics) are based on the intensity variations of pixels in equal position in consecutive frames. Temporal segmentation techniques using interframe differences based on color are conceptually similar, but they are not very popular as they have a greater computational burden and almost the same accuracy of their intensity based counterpart.
A basic PDM is the sum of the absolute differences of intensity of the pixels of two consecutive frames [8]. In particular, indicating with Y(x, y, j) and Y(x, y, k) the intensity of the pixels at position (x, y) and frames j and k, the metric can be expressed in the following way:
(7.1) |
where the summation is taken all over the frame.
The same authors propose other metrics based on first and second order statistical moments of the distributions of the levels of intensity of the pixels. Indicating with μ_{k} and σ_{k} respectively the values of mean and standard deviation of the intensity of the pixels of the frame k, it is possible to define the following interframe metric between the frames j and k:
(7.2) |
This metric has been also used in [7], [30], and is usually named likelihood ratio, assuming a uniform second order statistic. Other metrics based on statistical properties of pixels are:
(7.3) |
(7.4) |
(7.5) |
Associating a threshold to a metric it is possible to detect a transition whenever the metric value exceeds the threshold value. As it was pointed out in [8], PDMs offer the best performance for the detection of abrupt transitions. In general all the PDMs are particularly exposed to the effects of noise, camera/object movements or sudden lighting changes, leading to temporally or spatially localized luminance perturbations, and then to a potentially large number of false transitions. From this point of view, slightly better performances are obtained considering block-based comparisons between successive frames, with matching criteria based on values of mean and variance in corresponding blocks. Several block-based PDMs are reviewed in [18].
In order to limit the effects of local perturbations, some authors have developed top-down methods based on mathematical models of video. For example, in [1] a differential model of motion picture is presented, where three factors concur to the intensity variations of pixels of consecutive frames: a small amplitude additive zero-centered Gaussian noise, essentially modelling the noise effects of camera, film and digitizer; the intrashot intensity variations due to camera/object motion and focal length or lightness changes; and finally the intershot variations due to the presence of abrupt or gradual transitions. In [34], another approach for gradual transitions detection based on a model of intensity changes during fade out, fade in and dissolve effects is presented.
Another interesting approach based on PDMs has been proposed in [6]. In this paper the authors propose the use of moment invariants of the image. Properties such as the scale and rotation invariance make them particularly suited to represent the frame. Denoting by Y(x, y) the intensity at the position (x, y), the generic moment of order pq of the image is defined as:
(7.6) |
Moment invariants are derived from normalized central moments defined as:
(7.7) |
where:
(7.8) |
Limiting our attention to the first three moment invariants, these are defined as:
(7.9) |
These three numbers may be interpreted as the components of a vector, say σ, that can be used to represent the image:
(7.10) |
The interframe metric adopted in [6] is the Euclidean distance between the vector σ associated to frames j and k:
(7.11) |
Although the use of moment invariants can lead to robust segmentation algorithms, with respect to noise and other local perturbations, techniques based on statistical properties generally exhibit a computational load not adequate for specific applications, e.g., real time systems.
Metrics classified as HDM (histograms difference metrics) are based on the evaluation of the histograms of one or more channels of the adopted color space. As it is well known, the histogram of a digital image is a measure that supplies information on the general appearance of the image. With reference to an image represented by three color components, each quantized with 8 bit/pixel, a three-dimensional histogram or three monodimensional histograms can be defined. Although the histogram does not contain any information on the spatial distribution of intensity, the use of interframe metrics based on image histograms is very diffused because it represents a good compromise between the computational complexity and the ability to represent the image content.
In recent years several histogram-based techniques have been proposed [9], [6], [23], [37]; some of them are based only on the luminance channel, others on the conversion of 3-D or 2-D histograms to linear histograms [2]. An RGB 24 bit/pixel image would generate an histogram with 16.7 millions of bins and this is not usable in practice. To make manageable histogram-based techniques, a coarse color quantization is needed. For example in [2] the authors use an histogram in the HSV color space using 18 levels for hue and 3 for saturation and value leading to a easily tractable 162 bins histogram. Other approaches are based both on PDM and HDM. For example, in [24] two metrics are combined together: a PDM metric is used to account for spatial variation in the images and a HDM metric to account for color variations. Ideally, a high value in both metrics corresponds to a transition; however the choice of thresholds and weights may be critical.
In what follows some of the most popular HDM metrics are reviewed. In all the equations M and N are respectively the width and height (in pixels) of the image, j and k are the frame indices, L is the number of intensity levels and H[j, i] the value of the histogram for the i-th intensity level at frame j. A commonly used metric [9] is the bin-to-bin difference, defined as the sum of the absolute differences between histogram values computed for the two frames:
(7.12) |
The metric can easily be extended to the case of color images, computing the difference separately for every color component and weighting the results. For example, for a RGB representation we have:
(7.13) |
where r, g and b are the average values of the three channels and s is:
(7.14) |
Another metric, used for example in [9], [6], is called intersection difference and is defined in the following way:
(7.15) |
In other approaches [28], the chi-square test has been used, which is generally accepted as a test useful to detect if two binned distributions are generated from the some source:
(7.16) |
Also the correlation between histograms is used:
(7.17) |
where cov(i, j) is the covariance between frame histograms:
(7.18) |
and μ_{j} and σ_{j} represent the mean and the standard deviation, respectively, of the histogram of the frame j:
(7.19) |
(7.20) |
All the metrics discussed so far are global, i.e., based on the histogram computed over the entire frame. Some authors propose metrics based on histograms computed on subframes. For example, in [5] a rectangular 6x4 frame partitioning is used. Each frame is subdivided into 6 horizontal blocks and 4 vertical blocks. The reason for this asymmetry is that horizontal movements are statistically more frequent than vertical ones. Then an HDM is used to compute difference between corresponding blocks in consecutive frames, after an histogram equalization. The shape of histograms is also taken into account for each block, using histogram moments up to the third order. Region differences are also weighted, using experimentally determined coefficients, to account for different contributions, in correspondence of a shot transition, of low and high intensity levels. The global metric is finally obtained adding up the contribution of all the sub-regions. In despite of its greater complexity, the performance of this technique remains, in many cases, quite sensitive to the choice of thresholds and weights.
Both PDM and HDM techniques are based on the computation of a similarity measure of two subsequent frames and on comparison of this measure with a threshold. The choice of the threshold is critical, as too low threshold values may lead to false detections and too high threshold values may cause the opposite effect of missed transitions.
To limit the problem of threshold selection, several techniques have been proposed. For example, in [5] a short-term analysis of the sequence of interframe differences is proposed to obtain temporally localized thresholds. The analysis is performed within a temporal window of appropriate size, for example 7 frames. This approach is more robust to local variations of brightness and camera/object motion. In detail, naming D_{k} the difference metric between the frames k and k-1 and W_{k} the short-term temporal window centered on the frame k, we have:
(7.21) |
If ρ(k) exceeds a threshold, computed experimentally, then an abrupt transition at frame k is detected. In a similar approach based on temporal windows and local thresholds [7], the statistical properties of a bin-to-bin histogram metric are evaluated over a 21-frame temporal window and used to detect transitions.
The techniques reported in the previous sections, based on PDMs or HDMs, are mainly suited for the detection of abrupt transitions. Other techniques have been developed to detect gradual transition like fades or wipes. For example, in [5] the authors propose a technique to detect fading effects based on a linear model of the luminance L in the CIE L*u*v color space. Assuming that chrominance components are approximately constant during the fading, the model for a fade-out is:
(7.22) |
where L(x, y, t) is the luminance of the pixel at position (x, y) and time t, t_{0} is the time of beginning of the fading effect and d its duration. Similarly the model for a fade-in is:
(7.23) |
Even if the behavior of real luminance is not strictly linear during the fading, a technique based on the recognition of pseudo-linearity of L may be used to detect transitions. Even in this case, however, local thresholds have to be considered. Moreover, for some kind of video with very fast dynamic characteristics (for example TV commercials) this technique cannot be used. Other approaches have been proposed to overcome this limitation. For example, in [12] a production-based model of the most common editing effects is used to detect gradual transitions; in [37] a simple and effective two-thresholds technique based on HDM is reported. The two thresholds respectively detect the beginning and the end of the transition. The method, called twin-comparison, takes into account the cumulative differences between frames of the gradual transition. In the first pass a high threshold T_{h} is used to detect cuts; in the second pass a lower threshold T_{l} is employed to detect the potential starting frame F_{s} of a gradual transition. F_{s} is then compared to subsequent frames. An accumulated comparison is performed as during a gradual transition this difference value increases. The end frame F_{e} of the transition is detected when the difference between consecutive frames decreases to less than T_{l}, while the accumulated comparison has increased to a value higher than T_{h}. If the consecutive difference falls below T_{l} before the accumulated difference exceeds T_{h}, then the potential start frame F_{s} is dropped and the search continues for other gradual transitions.
Specific techniques have been aimed at the detection of other gradual transitions. For example, in [32] a two-step technique for the detection of wipe effects is proposed. It is based on statistical and structural properties of the video sequence and operates on partially decompressed MPEG streams. In [26] the authors propose a technique for transition detection and camera motion analysis based on spatiotemporal textures (spatiotemporal images will be further treated in the next subsection 2.5). The analysis of the texture changes can lead to the estimation of shooting conditions and to the detection of some types of wipes.
Finally, techniques based on higher level image features have also been tried. For example, in [35] the analysis of intensity edges between consecutive frames is used. During a cut or a dissolve, new intensity edges appear far from the locations of the old edges. Similarly, old edges disappear far from the location of new edges. Thus, by counting the entering and exiting edge pixels, cuts, fades and dissolves may be detected and classified.
Due to the increasingly availability of MPEG [19] compressed digital video, many authors have focused their attention on temporal segmentation techniques operating directly on the compressed domain or on a partially decompressed video sequence. Before discussing some of presented methods, we shortly review the fundamentals of MPEG compression standard.
MPEG uses two basic compression techniques: 16 x 16 macroblock-based motion compensation to reduce temporal redundancy and 8 x 8 Discrete Cosine Transform (DCT) block-based compression to capture spatial redundancy. An MPEG stream consists of three types of pictures, I, P and B, which are combined in a repetitive pattern called group of picture (GOP).
I (Intra) frames provide random access points into the compressed data and are coded using only information present in the picture itself. DCT coefficients of each block are quantized and coded using Run Length Encoding (RLE) and entropy coding. The first DCT coefficient is called DC term and is proportional to the average intensity of the respective block.
P (Predicted) frames are coded with forward motion compensation using the nearest previous reference (I or P) pictures.
B (Bi-directional) pictures are also motion compensated, this time with respect to both past and future reference frames.
Motion compensation is performed finding for each 16 x 16 macroblock of the current frame the best matching block in the respective reference frame(s). The residual error is DCT-encoded and one or two motion vectors are also transmitted.
A well known approach to temporal segmentation in the MPEG compressed domain, useful for detecting both abrupt and gradual transitions, has been proposed in [33] using the DC sequences. A DC sequence is a low resolution version of the video, since it is made up of frames where each pixel is the DC term of a block (see Figure 7.3). Since this technique uses I, P and B frames, a partial decompression of the video is necessary. The DC terms of I frames are directly available in the MPEG stream, while those of B and P frames must be estimated using the motion vectors and the DCT coefficients of previous I frames. This reconstruction process is computationally very expensive.
Figure 7.3: Sample frame from a sequence and corresponding DC image.
Differences of DC images are compared and a sliding window is used to set the thresholds for abrupt transitions. Both PDM and HDM metrics are suited as similarity measures, but pixel differences-based metrics give satisfactory results as DC images are already smoothed versions of the corresponding full images. Gradual transitions are detected through a accurate temporal analysis of the metric.
The technique reported in [28] uses only I pictures. It is based on the chi-square test applied to the luminance histogram and to row and column histograms of DC frames. The use of horizontal and vertical projections of the histogram introduces local information that is not available in the global histogram. In particular, for frames of M x N blocks, row and column histograms are defined as:
(7.24) |
where b_{0,0}(i, j) is the DC coefficient of the block (i, j). The three interframe differences computed on the three histograms are then used in a binned decision scheme to detect abrupt or gradual transitions. As only I frames are used, the DC recovering is eliminated. Note that as in a video there are typically two I frames per second, the analysis based on I frames only is adequate to approximately detect abrupt transitions. For gradual transitions this coarse temporal sub-sampling of the video may introduce more serious problems.
Another technique operating in the compressed domain is presented in [3], which is based on the correlation of DCT coefficients for M-JPEG compressed sequences. Other authors [38] extended this approach to MPEG sequences.
Many authors tried to organize one or more of the previously described basic algorithms within more general frameworks, with the aim of defining and implementing more robust techniques. In particular, most of the approaches discussed so far rely on suitable thresholding of similarity measures between successive frames. However, the thresholds are typically highly sensitive to the type of input video.
In [11], the authors try to overcome this drawback by applying an unsupervised clustering algorithm. In particular, the temporal video segmentation is viewed as a 2-class clustering problem (scene change and no scene change) and the well-known K-means algorithm [27] is used to cluster frame dissimilarities. Then the frames from the cluster scene change which are temporary adjacent are labeled as belonging to a gradual transition and the other frames from this cluster are considered as cuts. Both chi-square statistics and HDMs were used to measure frame similarity, both in RGB and YUV color spaces. This approach is not able to recognize the type of the gradual transitions, but it exhibits the advantage that it is a generic technique that not only eliminates the need for threshold setting but also allows multiple features to be used simultaneously to improve the performance. The same limitations and advantages characterize the technique presented at end of this chapter.
Among others, Hanjalick recently proposed a robust statistical shot-boundary detector [14]. The problem of shot-boundary detection is analyzed in detail and a conceptual solution to the shot-boundary detection problem is presented in the form of a statistical detector based on minimization of the average detection-error probability. The idea is that to draw reliable conclusions about the presence or absence of a shot boundary, a clear separation should exist between discontinuity-value ranges for measurements performed within shots and at shot boundaries. Visual-content differences between consecutive frames within the same shot are mainly caused by two factors: object/camera motion and lighting changes. Unfortunately, depending on the magnitude of these factors, the computed discontinuity values within shots vary and sometimes may cause detection mistakes.
An effective way to reduce the influence of motion and lighting changes on the detection performance is to embed additional information in the shot boundary detector. The main characteristic of this information is that it is not based on the range of discontinuity values but on some other measurements performed on a video. For example, this information may result from the comparison between the measured pattern formed by discontinuity values surrounding the interval of frames taken into consideration and a known template pattern of a shot boundary. Various template patterns, specific for different types of shot boundaries (cuts or fades, wipes and dissolves), may be considered.
Another type of useful additional information may result from observation of the characteristic behavior of some visual features for frames surrounding a shot boundary for the case of gradual transitions. For example, since a dissolve is the result of mixing the visual material from two neighboring shots, it can be expected that variance values measured per frame along a dissolve ideally reveal a downwards-parabolic pattern [22]. Hence, the decision about the presence of a dissolve can be supported by investigating the behavior of the intensity variance in the suspected series of frames.
Figure 7.4: Representation of the shot boundary detector proposed in [14].
Further improvement of the detection performance can be obtained by taking into account a priori information about the presence or absence of a shot boundary at a certain time stamp along a video. The difference between additional information and a priori information is that the latter is not based on any measurement performed on the video. An example of a priori information is the dependence of the probability for shot boundary occurrence on the number of elapsed frames since the last detected shot boundary. While it can be assumed zero at the beginning of a shot, this probability grows and converges to the value 0.5 as the number of frames in the shot increases. The main purpose of this probability is to make the detection of one shot boundary immediately after another one practically impossible and so to contribute to a reduction of false detection rate.
Combining measurements of discontinuity values with additional and a priori information may result in a robust shot-boundary detector, e.g., by continuously adapting the detection threshold T(k) for each frame k. Statistical decision theory is employed to obtain T(k) based on the criterion that average probability for detection mistakes is minimized. To this aim, all detection errors (e.g., both missed and false detections) are treated equally; thus the detector performance is determined by the average probability that any of the errors occurs. This average probability is expressed in terms of the probability of missed detection and of the probability of false detection. The necessary likelihood functions are pre-computed using training data. Both additional information (for cuts and dissolves) and a priori information are taken into account in the computation of the average probability, while discontinuity values are computed by using a block-matching, motion compensated procedure [31]. The input sequence is assumed to be a MPEG partially decoded sequence, i.e., a DC sequence. This makes possible to limit the block dimensions to 4 x 4 pixels. The system performance is evaluated by using five test sequences (different from those employed for training) belonging to four video categories: movies, soccer game, news and commercials. The authors report a 100% precision and recall for cuts, 79% precision and 83% recall for dissolve detection. ^{[1]}
Another probabilistic approach was presented in [21], where Li et al. propose a temporal segmentation method based on spatial-temporal joint probability image (ST-JPI) analysis. Here, joint probabilities are viewed as a similarity estimate between two images (only luminance images are considered in the paper). Given two images A(x, y) and B(x, y), a joint probability image (JPI) is a matrix whose element value JPI_{A,B}(i_{1}, i_{2}) is the probability that luminance i_{1} and i_{2} appear at the same position in image A and image B, respectively. Each element of a JPI corresponds to an intensity pair in two images. The distribution of the values in a JPI maps the correlation between two images: for two identical images, the JPI shows a diagonal line, while for two independent images the JPI consists of a uniform distribution. Because of the high correlation between video frames within a single shot, a JPI derived from two frames belonging to the same shot usually has a narrow distribution along the diagonal line. On the contrary, since narrative and visual content changes between two consecutive shots, then a uniform distribution is expected in the JPI. Thus the JPI behavior may be used to develop transition detection methods.
In particular, a spatial-temporal joint probability image (ST-JPI) is defined as a series of JPIs in chronological order, with all JPIs sharing the same initial image. The ST-JPI reflects the temporal evolution of video contents. For example, if a ST-JPI is derived between frames 0 and T, and a cut happens within this frame interval, the JPIs before the cut have very limited dispersion from the diagonal line, while after the cut uniform JPIs are usually obtained. The shift from narrow dispersion JPIs to uniform JPIs happens instantaneously at the cut position. By estimating the uniformity of JPIs, cuts can be detected and reported.
Detection of gradual transitions is also obtained with this approach, even if in a more complicated way. In particular, two kinds of gradual transition are considered, cross transitions and dither transitions. During a cross transition, every pixel value gradually changes from one shot to another, while during a dither transition a small portion of pixels abruptly change from pixels values from the first shot to those of the second shot every moment. With time, more and more pixels change until all of the pixels change into the second video shot. Wipes, fades and various types of dissolve may be described in terms of this scheme.
Template ST-JPIs are derived both for cross transitions and dither transitions. The detection of gradual transitions is performed by analyzing the pattern match between model ST-JPIs and the ST-JPI derived for the frame interval under consideration. Experiments performed on several digital videos of various kind gave the following results: 97% (recall) and 96% (precision) for cuts, 82% and 93% for cross transitions, 75% and 81% for dither transitions.
The algorithm proposed in [25] is based on the coherence analysis of temporal slices extracted from the digital video. Temporal slices are extracted from the video by slicing through the sequence of video frames and collecting temporal signatures. Each of these slices contains both spatial and temporal information from which coherent regions are indicative of uninterrupted video partitions separated by camera breaks (cuts, wipes and dissolves). Each spatiotemporal slice is a collection of scans, namely horizontal, vertical or diagonal image stripes, as a function of time. The detection of a shot boundary therefore becomes a problem of spatiotemporal slice segmentation into regions each of a coherent rhythm. Properties could further be extracted from the slice for both the detection and classification of camera breaks. For example, cut and wipes are detected by color-texture properties, while dissolves are detected by mean intensity and variance. The analysis is performed on the DC sequence extracted from a MPEG video. The approach has been tested by experiments on news sequences, documentary films, movies, and TV streams, with the following results: 100% (recall) and 99% (precision) for cuts, 75% and 80% for wipes, 76% and 77% for dissolves.
Finally, a fuzzy theoretic approach for temporal segmentation is presented in [16], with a fusion of various syntactic features to obtain more reliable transition detection. Cuts are detected using histogram intersection; gradual changes are detected using a combination of pixel difference and histogram intersection, while fades are detected using a combination of pixel difference, histogram intersection and edge-pixel-count. Frame-to-frame differences of these properties are considered as the input variable of the problem expressed in fuzzy terms. In particular, the linguistic variable inter-frame-difference is fuzzified so that it can be labeled as "negligible," "small," "significant," "large" or "huge." The values of metric differences are represented as these linguistic terms. To this aim, appropriate class boundaries and membership functions must be selected for each category. This is made by modeling the interframe property difference through the Rayleigh distribution. The appropriateness of this model has been tested by fitting Rayleigh distribution (chi-square test) to interframe difference data for about 300 video sequences of various kind, having 500–5000 frames each. Fuzzy rules for each property are derived by taking into account the current interframe difference, the previous interframe difference and the next interframe difference.
^{[1]}Recall and precision are popular quality factors whose formal definition is given in the following Sect. 4.