Among the different media types, image retrieval has been the most active area in recent years. We give a brief review of the state of the art in relevance feedback techniques in the context of image retrieval. Again, many of these techniques are directly applicable to the retrieval of video.
In its short history, relevance feedback has developed along a path from heuristic-based techniques to optimal learning algorithms, with early work inspired by term-weighting and relevance feedback techniques in document retrieval. These methods proposed heuristic formulations with empirical parameter adjustment, mainly along the line of independent axis weighting in the feature space. The intuition is to emphasize the feature(s) that best cluster the positive examples and best separate the positive examples from the negative ones.
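As a concrete illustration of the axis-weighting intuition (a hedged sketch, not any one published system; the function names and toy data are ours), each feature axis can be weighted inversely to the spread of the positive examples along it, so that axes on which the positives cluster tightly dominate the distance computation:

```python
import numpy as np

def axis_weights(positives, eps=1e-6):
    """Weight each feature axis by the inverse standard deviation of the
    positive examples along it (illustrative version of the heuristic)."""
    std = positives.std(axis=0)
    return 1.0 / (std + eps)

def weighted_distance(query, candidates, w):
    """Axis-weighted Euclidean distance from the query to each candidate."""
    diff = candidates - query
    return np.sqrt((w * diff ** 2).sum(axis=1))

# Toy example: positives cluster tightly along axis 0, loosely along axis 1,
# so axis 0 receives a much larger weight.
positives = np.array([[1.0, 0.0], [1.1, 5.0], [0.9, -5.0]])
w = axis_weights(positives)
```

With these weights, a candidate that deviates slightly on axis 0 is penalized far more than one that deviates on axis 1.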
Early work bears the mark of its origins in document retrieval. For example, in one scheme, learning based on "term frequency" and "inverse document frequency" in the text domain is transformed into learning based on the ranks of the positive and negative images along each feature axis in the continuous feature space. Another scheme quantizes the features and then groups the images or regions into hierarchical trees whose nodes are constructed through single-link clustering; weighting of the groupings is then based on "set operations."
Aside from the lack of any optimality claim, the assumption of feature independence imposed by these methods is also artificial, unless independent components can be extracted beforehand (using, for example, independent component analysis (ICA), which itself is not guaranteed to give satisfactory results).
Later on, researchers began to look at this problem from a more systematic point of view, formulating it as an optimization, learning, or classification problem. Based on minimizing the total distance of the positive examples from the new query, the optimal solutions turn out to be the weighted average of the positive examples as the optimal query and a whitening transform (or, equivalently, a Mahalanobis distance) as the optimal distance metric. One approach additionally adopts a two-level weighting scheme to better cope with the singularity caused by the small number of training samples. To take the negative examples into account, another method updates the feature weight along each feature axis by comparing the variance of the positive examples with the variance of the union of the positive and negative examples. (It will become clear later that this scheme is in fact the reduced, diagonal solution of minimizing the ratio of the positive scatter to the overall scatter over both positive and negative examples, with respect to a transformation in the original space. This is obviously not the best intuition for this problem; cf. Section 5.)
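The optimal-query/optimal-metric solution can be sketched as follows (a minimal sketch under the stated assumptions: uniform or user-supplied relevance scores, and a small regularizer added to guard against singular covariances; the function names are ours):

```python
import numpy as np

def optimal_query_and_metric(positives, scores=None, reg=1e-3):
    """Sketch of the optimal solution: the new query is the (score-)weighted
    average of the positive examples, and the optimal metric is the
    Mahalanobis distance under their covariance (a whitening transform).
    `reg` regularizes the covariance when training samples are few."""
    if scores is None:
        scores = np.ones(len(positives))
    scores = scores / scores.sum()
    q = scores @ positives                        # weighted mean = new query
    cov = np.cov(positives, rowvar=False, aweights=scores)
    cov += reg * np.eye(cov.shape[0])             # guard against singularity
    M = np.linalg.inv(cov)                        # Mahalanobis metric matrix
    return q, M

def mahalanobis(x, q, M):
    """Squared Mahalanobis distance of x from the query q under metric M."""
    d = x - q
    return float(d @ M @ d)

rng = np.random.default_rng(0)
positives = rng.normal(size=(10, 3))
q, M = optimal_query_and_metric(positives)
```

With uniform scores the optimal query reduces to the plain mean of the positive examples, and the query itself has distance zero under the learned metric.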
Assuming that the user is searching for a particular target and that the feedback takes the form of a "relative judgment," one method proposes the stochastic comparison search as its relevance feedback algorithm.
MacArthur et al. cast relevance feedback as a two-class learning problem and use a decision tree algorithm to sequentially "cut" the feature space until all points within a partition are of the same class. The database is then classified by the resulting decision tree: images that fall into a relevant leaf are collected, and the nearest neighbors of the query are returned.
While most CBIR systems use well-established image features such as color histograms/moments, texture, shape, and structure features, there are alternatives. Tieu and Viola use more than 45,000 "highly selective features" together with a boosting technique to learn a classification function in this feature space. The features were shown to be sparse, with high kurtosis, and were argued to be expressive of high-level semantic concepts. Weak two-class classifiers were formulated under a Gaussian assumption for the positive and the (randomly chosen) negative examples along each feature component, independently. The strong classifier is a weighted sum of the weak classifiers, as in AdaBoost.
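The flavor of this approach can be sketched on toy two-dimensional features rather than the actual highly selective features (a simplified sketch, with our own names and data: the per-feature Gaussian weak learners are fit once, unweighted, and only their selection uses the boosting weights, which simplifies full AdaBoost):

```python
import numpy as np

def gaussian_weak_classifier(X_pos, X_neg, j):
    """Weak learner on feature j: fit one Gaussian to the positives and one
    to the negatives, predict +1 wherever the positive likelihood wins."""
    mp, sp = X_pos[:, j].mean(), X_pos[:, j].std() + 1e-6
    mn, sn = X_neg[:, j].mean(), X_neg[:, j].std() + 1e-6
    def h(X):
        lp = -0.5 * ((X[:, j] - mp) / sp) ** 2 - np.log(sp)
        ln = -0.5 * ((X[:, j] - mn) / sn) ** 2 - np.log(sn)
        return np.where(lp > ln, 1, -1)
    return h

def adaboost(X, y, n_rounds=5):
    """Minimal AdaBoost over the per-feature Gaussian weak learners:
    each round picks the weak learner with the lowest weighted error,
    weights it by alpha, and reweights the training examples."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            h = gaussian_weak_classifier(X[y == 1], X[y == -1], j)
            err = w[h(X) != y].sum()
            if best is None or err < best[0]:
                best = (err, h)
        err, h = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * h(X))
        w /= w.sum()
        ensemble.append((alpha, h))
    def strong(X):
        # Strong classifier: sign of the weighted sum of weak classifiers.
        return np.sign(sum(a * h(X) for a, h in ensemble))
    return strong

# Toy data: positives around +2, negatives around -2 on both axes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
strong = adaboost(X, y)
```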
In this line of work, a Gaussian mixture model on DCT coefficients is used as the image representation, and Bayesian inference is applied for regional image matching and for learning over time. Note that each image is represented by a distribution (a mixture model) instead of a single feature vector; it is the richer information captured by the mixture model that makes Bayesian inference and regional image matching possible.
Recently, there have also been attempts to incorporate the support vector machine (SVM) into the relevance feedback process. However, the SVM as a two-class classifier is not directly suitable for relevance feedback, because the training examples are far too few to be representative of the true distributions. A kernel-based one-class SVM used as a density estimator for the positive examples, on the other hand, has been shown to outperform the whitening-transform method.
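The one-class idea can be sketched with scikit-learn's `OneClassSVM` (an assumed stand-in for the cited method; the toy data and parameter values are ours): fit on the positive feedback examples only, then rank the database by the decision value.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Positive feedback examples cluster near the origin.
positives = rng.normal(0.0, 0.5, (20, 2))
# Database: five relevant-like points near the positives, five far away.
database = np.vstack([rng.normal(0.0, 0.5, (5, 2)),
                      rng.normal(5.0, 0.5, (5, 2))])

# Fit the one-class SVM on positives only; it estimates the support of
# their distribution with an RBF kernel.
oc_svm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(positives)
scores = oc_svm.decision_function(database)   # higher = more "positive-like"
ranking = np.argsort(-scores)                 # best-ranked images first
```

No negative examples are needed at all, which sidesteps the scarcity problem of two-class training in early feedback rounds.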
Without assuming a single Gaussian mode for the positive examples, Parzen window density estimation can be applied to capture nonlinearity in the distribution; alternatively, an "aggregate dissimilarity" function combines, for each candidate image, its pairwise distances to every positive example. The major weakness of these schemes is their equal treatment of different feature components across different tasks.
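A minimal Parzen-window scorer, assuming an isotropic Gaussian kernel (the function name, bandwidth, and toy data are ours), makes the weakness visible: every feature axis is treated identically.

```python
import numpy as np

def parzen_scores(candidates, positives, h=1.0):
    """Parzen-window (kernel) density estimate of the positive class,
    evaluated at each candidate's feature vector: place an isotropic
    Gaussian of bandwidth h on every positive example and average.
    Note that all feature axes are weighted equally."""
    diff = candidates[:, None, :] - positives[None, :, :]   # pairwise diffs
    sq = (diff ** 2).sum(axis=2)                            # squared distances
    dim = candidates.shape[1]
    k = np.exp(-0.5 * sq / h ** 2) / ((2 * np.pi) ** (dim / 2) * h ** dim)
    return k.mean(axis=1)                                   # mean kernel value

positives = np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]])
candidates = np.array([[0.0, 0.0], [4.0, 4.0]])
scores = parzen_scores(candidates, positives, h=0.5)
```

A candidate near the positive cluster receives a much higher density score than one far from it, with no parametric (e.g., single-Gaussian) assumption on the positive distribution.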
Formulated in the transductive learning framework, the D-EM algorithm uses examples from the user's feedback (labeled data) as well as other data points (unlabeled data).