Video shot detection has a fairly long history and has become one of the most important research areas in content-based video analysis and retrieval. The detection of boundaries between video shots provides a basis for almost all existing video segmentation and abstraction methods [24]. However, it is quite difficult to give a precise definition of a video shot transition, since many factors, such as camera motion, may change the video content significantly. Usually, a shot is defined as a sequence of frames that was (or appears to be) continuously captured by the same camera [16]. Ideally, a shot can encompass camera motions such as pans, tilts, or zooms, and video editing effects such as fades, dissolves, wipes, and mattes [8, 24]. Video shot transitions can be categorized into two types: abrupt (sharp) shot transitions, also called cuts, where a frame from one shot is followed by a frame from a different shot, and gradual shot transitions, such as cross dissolves, fade-ins, fade-outs, and various other editing effects. Many researchers have proposed methods to cope with these two types of shot transitions. These methods fall into one of two domains, uncompressed or compressed, depending on whether they are applied to the raw video stream or to compressed video data. According to [8], methods working on uncompressed video are, in general, more reliable than techniques in the compressed domain, but they require more storage and computational resources. In this chapter, our discussion focuses only on methods in the uncompressed domain. For more details on compressed-domain shot detection algorithms and evaluations of their performance, please refer to [8, 16, 17, 24, 26].
Usually, a similarity measure between successive video frames is defined based on various visual features. When a frame and the frame that follows it are sufficiently dissimilar, an abrupt transition (cut) is declared. Gradual transitions are found by using measures of cumulative differences and more sophisticated thresholding mechanisms.
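The thresholding idea can be illustrated with a minimal sketch. It assumes that a list of frame-to-frame dissimilarity scores has already been computed by some measure, and it uses two hypothetical thresholds, `t_cut` and `t_low`: a single score above `t_cut` is reported as a cut, while a run of moderate scores whose sum reaches `t_cut` is reported as a gradual transition. The function name and the exact accumulation rule are assumptions made for illustration, not a method taken from the cited works.

```python
def detect_transitions(diffs, t_cut, t_low):
    """Label transitions in a list of frame-to-frame dissimilarity scores.

    diffs[i] is the dissimilarity between frame i and frame i + 1.
    Returns (kind, first_index, last_index) tuples over those indices.
    """
    transitions = []
    i = 0
    while i < len(diffs):
        if diffs[i] >= t_cut:
            # One large jump between neighbouring frames: abrupt transition.
            transitions.append(("cut", i, i))
            i += 1
        elif diffs[i] >= t_low:
            # Run of moderate differences: accumulate them and compare the
            # cumulative difference against the cut threshold.
            start, total = i, 0.0
            while i < len(diffs) and t_low <= diffs[i] < t_cut:
                total += diffs[i]
                i += 1
            if total >= t_cut:
                transitions.append(("gradual", start, i - 1))
        else:
            i += 1
    return transitions


scores = [0.1, 0.9, 0.1, 0.3, 0.3, 0.3, 0.1, 0.3, 0.1]
found = detect_transitions(scores, t_cut=0.8, t_low=0.2)
```

With these illustrative scores, the isolated value 0.9 is labeled a cut and the three consecutive values of 0.3 are labeled a gradual transition, whereas the single later 0.3 is ignored because its cumulative difference stays below `t_cut`.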
In [37], pair-wise pixel comparison, also called template matching, was introduced to evaluate the differences in intensity or color values of corresponding pixels in two successive frames. The simplest approach is to calculate the sum of absolute pixel differences and compare it against a threshold. The main drawback of this method is that both the feature representation and the similarity comparison are closely tied to pixel position. Therefore, methods based on simple pixel comparison are very sensitive to object movement, camera movement, and noise.
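A minimal sketch of pixel comparison follows. It assumes grayscale frames stored as nested Python lists of intensities in the range 0 to 255, and it divides the sum of absolute differences by the pixel count so that a threshold does not depend on frame size; that normalization, the function names, and the threshold value are assumptions made for this illustration.

```python
def pixel_difference(frame_a, frame_b):
    """Mean absolute intensity difference between two grayscale frames."""
    same_shape = len(frame_a) == len(frame_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(frame_a, frame_b)
    )
    if not same_shape:
        raise ValueError("frames must have the same dimensions")
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )
    return total / sum(len(row) for row in frame_a)


def is_cut(frame_a, frame_b, threshold):
    """Declare a cut when the mean pixel difference exceeds the threshold."""
    return pixel_difference(frame_a, frame_b) > threshold


# A vertical edge, and the same edge moved one pixel to the left.
edge = [[0, 0, 255, 255] for _ in range(4)]
shifted_edge = [[0, 255, 255, 255] for _ in range(4)]
score = pixel_difference(edge, shifted_edge)
```

The two frames show the same scene displaced by a single pixel, yet a quarter of the pixels change by the full intensity range. This is the position sensitivity described above.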
In contrast to pair-wise comparison, which is based on global visual features, block-based approaches use local characteristics to increase the robustness to object and camera movements. Each frame is divided into a number of blocks that are compared against their counterparts in the successive frame. Typically, the similarity or dissimilarity between two frames can be measured by using a likelihood ratio, as proposed in [25, 37]. A shot transition is identified if the number of changed blocks exceeds a given threshold. This approach tolerates slow and small motions between frames better than pixel-wise comparison does.
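The block-based scheme can be sketched as follows. The likelihood ratio used here is the form commonly quoted in the shot detection literature, built from the mean and variance of corresponding blocks; we assume, without having verified it against [25, 37], that it matches the ratio those works propose. The block size, the ratio threshold, and the small constant `eps` that guards against division by zero on flat blocks are illustrative assumptions.

```python
def block_stats(frame, top, left, size):
    """Mean and variance of the size x size block whose corner is (top, left)."""
    values = [
        frame[r][c]
        for r in range(top, top + size)
        for c in range(left, left + size)
    ]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def likelihood_ratio(mean_a, var_a, mean_b, var_b, eps=1e-6):
    """Commonly quoted likelihood ratio; eps (an assumption) avoids 0 / 0."""
    numerator = ((var_a + var_b) / 2 + ((mean_a - mean_b) / 2) ** 2) ** 2
    return numerator / (var_a * var_b + eps)


def changed_block_count(frame_a, frame_b, size, ratio_threshold):
    """Count corresponding blocks whose likelihood ratio exceeds the threshold."""
    rows, cols = len(frame_a), len(frame_a[0])
    changed = 0
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            mean_a, var_a = block_stats(frame_a, top, left, size)
            mean_b, var_b = block_stats(frame_b, top, left, size)
            if likelihood_ratio(mean_a, var_a, mean_b, var_b) > ratio_threshold:
                changed += 1
    return changed


frame_a = [
    [10, 20, 10, 20],
    [20, 10, 20, 10],
    [10, 20, 10, 20],
    [20, 10, 20, 10],
]
# Same frame, except that the top-right 2 x 2 block is much brighter.
frame_b = [
    [10, 20, 200, 210],
    [20, 10, 210, 200],
    [10, 20, 10, 20],
    [20, 10, 20, 10],
]
changed = changed_block_count(frame_a, frame_b, size=2, ratio_threshold=3.0)
```

A transition would then be declared when `changed` exceeds a chosen number of blocks. Because each block is summarized by its mean and variance, small displacements inside a block leave the ratio close to 1.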
To further reduce the sensitivity to object and camera movement, and thus provide a more robust shot detection technique, histogram comparison was introduced to measure the similarity between successive frames. In fact, histogram-based approaches have been widely used in content-based image analysis and retrieval. Beyond the basic histogram comparison algorithm, several researchers have proposed approaches to improve its performance, such as histogram equalization [1], histogram intersection [32], histograms on groups of frames [14], and the normalized χ² test [27]. However, experimental results show that approaches which enhance the difference between two frames across a cut may also magnify the difference due to object and camera movements [37]. Because of this trade-off, the overall performance of the χ² test, for instance, is not necessarily better than that of linear histogram comparison, even though it is more time consuming.
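The following sketch compares three histogram measures on grayscale frames: a linear bin-wise difference, histogram intersection, and a χ² statistic. Several variants of each measure exist; the symmetric χ² form, the normalization of the intersection by the first histogram's total, and the choice of 8 bins are assumptions made for illustration rather than the exact definitions used in [27] or [32].

```python
def gray_histogram(frame, bins=8, max_value=256):
    """Histogram of intensities in [0, max_value) quantized into equal bins."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[value * bins // max_value] += 1
    return hist


def bin_difference(h1, h2):
    """Linear comparison: sum of absolute bin-wise differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def intersection_similarity(h1, h2):
    """Histogram intersection, normalized by the total count of h1."""
    return sum(min(a, b) for a, b in zip(h1, h2)) / sum(h1)


def chi_square(h1, h2):
    """One symmetric chi-square form; bins empty in both frames are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


edge = [[0, 0, 255, 255] for _ in range(4)]
shifted_edge = [[0, 255, 255, 255] for _ in range(4)]
h_edge = gray_histogram(edge)
h_shifted = gray_histogram(shifted_edge)
```

Squaring the bin differences makes the χ² statistic grow faster than the linear measure as histograms diverge. This is consistent with the observation above that enhancing differences across a cut also magnifies differences caused by motion.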
Another interesting issue is which color space to use for color-based techniques such as color histogram comparison. The HSV color space is generally considered to reflect human perception of color. In [18], the performance of several color histogram based methods using different color spaces, including RGB, HSV, and YIQ, was evaluated. Experimental results showed that HSV performs quite well with regard to classification accuracy and that it is among the least expensive in terms of the computational cost of conversion from the RGB color space. Therefore, the HSV color space will be used in the experimental study in the following sections of this chapter.
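As an illustration of how an HSV color histogram might be built from RGB frames, the sketch below uses the `colorsys` module from the Python standard library for the conversion. The quantization into 8 hue, 4 saturation, and 4 value bins is a hypothetical choice for this example and is not taken from [18] or from the experiments in this chapter.

```python
import colorsys


def hsv_histogram(frame, h_bins=8, s_bins=4, v_bins=4):
    """Joint HSV histogram of a frame given as rows of (r, g, b) in 0..255."""
    hist = [0] * (h_bins * s_bins * v_bins)
    for row in frame:
        for r, g, b in row:
            # colorsys expects and returns components in the range [0, 1].
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            # Clamp so that a component equal to 1.0 falls in the last bin.
            h_index = min(int(h * h_bins), h_bins - 1)
            s_index = min(int(s * s_bins), s_bins - 1)
            v_index = min(int(v * v_bins), v_bins - 1)
            hist[(h_index * s_bins + s_index) * v_bins + v_index] += 1
    return hist


frame = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 0, 0)],
]
hist = hsv_histogram(frame)
```

The resulting flat list can be passed to any histogram comparison measure. Hue is given more bins than saturation or value here on the assumption that hue carries most of the perceptual color information.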
The reasoning behind all of these approaches is that two images (or frames) with an unchanging background and unchanging (although moving) objects will show only minor differences in their histograms [26]. In addition, histograms are invariant to rotation and can minimize the sensitivity to camera movements such as panning and zooming. Moreover, they are not appreciably affected by histogram dimensionality [16]. The precision achieved is quite impressive, as shown in [4, 6]. Finally, histogram comparison does not require intensive computation. Despite these attractive characteristics, histogram-based similarity measures may, in theory, lead to incorrect classifications, since the whole process depends on the distribution of feature values and does not take spatial properties into account. The overall distribution of features, and thus the histogram, may remain mostly unchanged even if pixel positions have changed significantly.
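The spatial blindness of histograms can be seen in a small constructed example. The two synthetic frames below contain exactly the same intensities in different positions, so their histograms coincide although every pixel differs. The frames and helper names are made up for this illustration.

```python
from collections import Counter


def intensity_histogram(frame):
    """Exact intensity histogram: a count for each distinct pixel value."""
    return Counter(value for row in frame for value in row)


def mean_abs_difference(frame_a, frame_b):
    """Mean absolute difference between corresponding pixels."""
    pairs = [
        (a, b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Bright left half versus bright right half: same content, mirrored layout.
left_bright = [[255, 255, 0, 0] for _ in range(4)]
right_bright = [[0, 0, 255, 255] for _ in range(4)]

same_histogram = intensity_histogram(left_bright) == intensity_histogram(right_bright)
pixel_score = mean_abs_difference(left_bright, right_bright)
```

A histogram-based detector would treat these two frames as identical, while a pixel-based detector would report the largest possible difference. Neither answer is right in every situation, which is the trade-off described above.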
There are many other shot detection approaches, such as clustering-based [13, 21, 31], feature-based [36], and model-driven [1, 5, 23, 35] techniques. For details of these methods please refer to surveys such as [8, 16, 26, 30].