Sen-ching Samson Cheung and Avideh Zakhor
Department of Electrical Engineering and Computer Sciences
University of California
Berkeley, California USA
<sccheung@ieee.org>, <avz@eecs.berkeley.edu>
The amount of information on the World Wide Web has grown enormously since its creation in 1990. By February 2000, the web had over one billion uniquely indexed pages and 30 million audio, video and image links [1]. Since there is no central management on the web, duplication of content is inevitable. A study done in 1998 estimated that about 46% of all text documents on the web have at least one "near-duplicate", that is, a document identical except for low-level details such as formatting [2]. The problem is likely to be more severe for web video clips, as they are often stored in multiple locations and compressed with different algorithms and bitrates to facilitate downloading and streaming. Similar versions of the same video, in part or as a whole, can also be found on the web when users modify and combine original content with their own productions. Identifying such similar content benefits many web video applications:
- As users typically do not view beyond the first result screen from a search engine, it is detrimental to have "near-duplicate" entries cluttering the top retrievals. Rather, it is advantageous to group similar entries together before presenting the retrieval results to users.
- When a particular web video becomes unavailable or suffers from slow network transmission, users can opt for a more accessible version among the similar video content identified by the video search engine.
- Similarity detection algorithms can also be used for content identification when conventional techniques such as watermarking are not applicable. For example, multimedia content brokers may use similarity detection to check for copyright violation, as they have no right to insert watermarks into original material.
One definition of a video similarity measure suitable for these applications is the percentage of visually similar frames between two video sequences. This is the video similarity measure used in this chapter. A similar measure, called the Tanimoto measure, is commonly used in comparing text documents, where similarity is defined as the percentage of words common to the two documents [2][3]. Other similarity measures have been proposed in the literature, and some of them are reviewed in this chapter. No matter which measure is used, the major challenge is to compute it efficiently. As video is a complex data type, many of the proposed similarity measures are computationally intensive. At the same time, for every new video added to the database or every query video presented by a user, similarity measurements need to be performed against possibly millions of database entries. Thus, it is imperative to develop fast methods for computing similarity measurements in database applications.
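As an illustration, the frame-based measure can be sketched in a few lines of Python. The Euclidean frame metric, the tuple-of-floats feature representation and the function names are our own assumptions for this sketch, not definitions from the chapter:

```python
import math

def frame_similarity_fraction(X, Y, eps):
    """Sketch of the percentage-of-similar-frames measure: the fraction
    of frames in X and Y that have a visually similar counterpart
    (within eps under the frame metric) in the other sequence."""
    def num_matched(frames, others):
        # A frame counts as matched if some frame of the other video
        # lies within eps of it in feature space (Euclidean here).
        return sum(1 for x in frames
                   if any(math.dist(x, y) <= eps for y in others))

    return (num_matched(X, Y) + num_matched(Y, X)) / (len(X) + len(Y))
```

Two identical sequences score 1.0 and two sequences with no frames within ε of each other score 0.0, mirroring the behavior of the Tanimoto measure on word sets.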
Finding visually similar content is the central theme of Content-Based Information Retrieval (CBIR). In the past decade, numerous algorithms have been proposed to identify visual content similar in color, texture, shape, motion and many other attributes. These algorithms typically represent a particular attribute of an image or a video shot as a high-dimensional feature vector, so that visual similarity can be measured by a metric defined on the feature vector space. The basic premise of this process is that the visual content being analyzed is homogeneous in the attribute of interest. Since a full-length video is typically a collection of video shots with vastly different content, it must be modeled as a set or a time series of feature vectors, usually with one vector per video shot. When extending similarity measurement to video, the first challenge is to define a single measurement that gauges the similarity between two video sequences. Multiple proposals can be found in the literature: in [4], [5] and [6], a warping distance is used to measure the temporal edit differences between video sequences; a Hausdorff distance is proposed in [7] to measure the maximal dissimilarity between shots; and template matching of shot change duration is used by Indyk et al. [8] to identify overlap between video sequences. A step common to all of the above schemes is matching similar feature vectors between the two video sequences, which usually requires searching through part of or the entire video. Computing these measurements in full thus requires storing the entire video and takes time at least linear in its length. Applying such computations to find similar content within a database of millions of video sequences may be too complex in practice.
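For instance, the Hausdorff-style measure of [7] over two sets of per-shot feature vectors can be sketched as below (the Euclidean metric and the function names are our assumptions). Even this simple measure must scan both videos in full:

```python
import math

def hausdorff(X, Y):
    """Hausdorff distance between two videos, each given as a list of
    per-shot feature vectors: the largest distance from any vector in
    one video to its nearest neighbor in the other."""
    def directed(A, B):
        # Worst-case distance from a vector in A to the set B.
        return max(min(math.dist(a, b) for b in B) for a in A)

    return max(directed(X, Y), directed(Y, X))
```

The two nested scans make the cost quadratic in the video lengths, which illustrates why such measures are expensive to apply across a large database.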
On the other hand, the precise value of a similarity measurement is typically not required. As feature vectors are idealistic models and do not entirely capture the process of how similarity is judged in the human visual system [9], many CBIR applications only require an approximation of the underlying similarity value. As such, it is unnecessary to maintain full fidelity of the feature vector representations, and approximation schemes can be used to alleviate high computational complexity. For example, in a video similarity search system, each video in the database can be first summarized into a compact fixed-size representation. Then, the similarity between two video sequences can be approximated by comparing their corresponding representations.
There are two types of summarization techniques for similarity approximation: higher-order and first-order techniques. Higher-order techniques summarize all feature vectors in a video as a statistical distribution. These techniques are useful in classification and semantic retrieval, as they are highly adaptive and robust against small perturbations. Nonetheless, they typically assume a restricted form of density model, such as a Gaussian or a mixture of Gaussians, and require a computationally intensive method like Expectation-Maximization for parameter estimation [10][11][12]. As a result, they may not be applicable for matching the enormous amount of extremely diverse video content on the web. First-order techniques summarize a video by a small set of representative feature vectors. One approach is to compute the "optimal" representative vectors by minimizing the distance between the original video and its representation. If the feature space is finite-dimensional Euclidean and the distance is the sum of squared distances, the well-known k-means method can be used [13]. For general metric spaces, we can use the k-medoid method, which selects k feature vectors within the video so as to minimize the distance [14][15]. Both of these algorithms are iterative, with each iteration running in O(l) time for k-means and O(l^{2}) time for k-medoid, where l is the length of the video. For summarizing long video sequences such as feature-length movies or documentaries, such methods are clearly too complex to be used in large databases.
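To make the first-order approach concrete, here is a deliberately naive k-medoid summarizer. It is exhaustive rather than iterative, and the function name and Euclidean metric are our own assumptions; real implementations use iterative PAM-style updates. Its combinatorial cost illustrates why even the iterative O(l^{2})-per-iteration variants become impractical for long videos:

```python
import math
from itertools import combinations

def k_medoid_summary(frames, k):
    """Pick the k frames of the video that minimize the total distance
    from every frame to its nearest chosen representative (a brute-force
    stand-in for the iterative k-medoid method)."""
    best, best_cost = None, float("inf")
    for cand in combinations(range(len(frames)), k):
        # Cost of this candidate set: each frame is charged its
        # distance to the closest candidate medoid.
        cost = sum(min(math.dist(frames[i], frames[j]) for j in cand)
                   for i in range(len(frames)))
        if cost < best_cost:
            best, best_cost = cand, cost
    return [frames[i] for i in best]
```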
In this chapter, we propose a randomized first-order video summarization technique called the Video Signature (ViSig) method. The ViSig method can be applied to any metric space. Unlike the k-means or k-medoid methods, it is a single-pass O(l) algorithm in which each video is represented by a set of "randomly" selected frames called the ViSig. In this chapter, we show analytically and experimentally that we can obtain a reliable estimate of the underlying video similarity by using a very small ViSig to represent each video in the database. Based on a ground-truth set extracted from a large database of web video clips, we show that the ViSig method is able to achieve almost the same retrieval performance as the k-medoid of the same size, without the O(l^{2}) complexity of k-medoid.
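The core idea can be sketched as follows: every video answers the same small set of random seed vectors with its closest frame, and the similarity of two videos is estimated by comparing their answers seed by seed. The Python below is a minimal sketch under a Euclidean frame metric; the helper names are ours:

```python
import math

def visig(frames, seeds):
    """ViSig of a video: for each seed vector s, keep the frame of the
    video closest to s. One pass over the video per seed, so O(l)
    overall for a fixed number of seeds."""
    return [min(frames, key=lambda x: math.dist(x, s)) for s in seeds]

def vss_b(sig_x, sig_y, eps):
    """Basic ViSig Similarity: the fraction of seed positions whose two
    signature frames lie within eps of each other."""
    matches = sum(1 for x, y in zip(sig_x, sig_y)
                  if math.dist(x, y) <= eps)
    return matches / len(sig_x)
```

The same seed vectors must be used for every video in the database; the similarity between two videos is then estimated purely from their fixed-size signatures, never from the full sequences.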
This chapter is organized as follows: we describe the ViSig method and show a number of analytical results in Section 2. Experimental results on a large dataset of web video and a set of MPEG-7 test sequences with simulated similar versions are used in Section 3 to demonstrate the retrieval performance of our proposed algorithms. We conclude this chapter in Section 4 by discussing related research. The proofs for all the propositions can be found in the appendices. The following is a list of acronyms and notations used in this chapter:
Acronyms:

| Acronym | Meaning |
|---|---|
| NVS | Naive Video Similarity |
| IVS | Ideal Video Similarity |
| VVS | Voronoi Video Similarity |
| ViSig | Video Signature |
| SV | Seed Vector |
| VSS_{b} | Basic ViSig Similarity |
| PDF | Probability Density Function |
| VG | Voronoi Gap |
| VSS_{r} | Ranked ViSig Similarity |
Notations:

| Notation | Meaning |
|---|---|
| (F, d(·,·)) | Feature vector space F with metric d(·,·) |
| ε | Frame Similarity Threshold |
| 1_{x} | Indicator function |
| \|X\| | Cardinality of set X |
| nvs(X,Y;ε) | NVS between X and Y |
| [X]_{ε} | Collection of clusters in X |
| ivs(X,Y;ε) | IVS between X and Y |
| V(X) | Voronoi Diagram of video X |
| V_{X}(x) | Voronoi Cell of x ∈ X |
| V_{X}(C) | Voronoi Cell of a cluster C ∈ [X]_{ε} |
| g_{X}(s) | The frame in X closest to s |
| R(X,Y;ε) | Similar Voronoi Region |
| Vol(A) | Volume of a region A |
| Prob(A) | Probability of event A |
| vvs(X,Y;ε) | VVS between X and Y |
|  | ViSig of X with respect to the SV set S |
| vss_{b}( , ;ε, m) | VSS_{b} between two ViSigs |
| f(u; X⋃Y) | PDF that assigns equal probability to the Voronoi Cell of each cluster in [X⋃Y]_{ε} |
| G(X,Y;ε) | VG between X and Y |
| Q(g_{X}(s)) | Ranking function for the ViSig frame g_{X}(s) |
| vss_{r}( , ;ε, m) | VSS_{r} between two ViSigs |
| d(x, y), d'(x, y) | l_{1} and modified l_{1} color histogram distances |
| d_{q}(x, y), d'_{q}(x, y) | Quadrant d(x, y) and d'(x, y) |
| ρ | Dominant Color Threshold |
| rel(X) | The set of video sequences that are subjectively similar (relevant) to video X |
| ret(X,ε) | The set of video sequences that are declared to be similar to X by the ViSig method at level ε |
| Recall(ε) | The recall in retrieving the ground truth by the ViSig method at level ε |
| Precision(ε) | The precision in retrieving the ground truth by the ViSig method at level ε |
^{[1]} © 2003 IEEE. Reprinted, with permission, from S.-C. Cheung and A. Zakhor, "Efficient Video Similarity Measurement With Video Signature," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 1, January 2003, pp. 59–74.