3. The Similarity Measure

Although new features are being discovered everyday, it is hard to imagine that we can find one set of features that can kill all the applications. Moreover, finding new features requires a lot of trials and errors, which in many cases has few clues to follow. If the feature set has been fixed for a certain retrieval system, another place that researchers can improve is the similarity measure.

It is said that a feature is good if and only if similar objects are close to each other in the feature space, and dissimilar objects are far apart. Obviously, to decide "close" or "far," the similarity measure plays an equally important role as the original feature space. For example, Euclidian distance or other Minkowski-type distances are widely used as similarity measures. A feature space is considered as "bad," probably because it does not satisfy the above criterion under Euclidian distance. However, if the feature space can be "good" by first (nonlinearly) "warping" it and then applying the Euclidian distance or by employing some new distance metrics, it is still fine for our purpose. Here the "warping" can also be considered as a preprocessing step for the feature space. However, in this chapter, such kind of "warping" is included in similarity measure, thus giving the latter a fairly general definition.

The difficult problem is how to get the "warping" function, or, in general, find the right similarity measure. Santini and Jain suggested to do it based on psychological experiments [29]. They analysed some similarity measure proposed in the psychological literature to model human similarity perception, and showed that all of them actually challenge the Euclidean distance assumption in non-trivial ways. Their research implies that a new distance metric is not only a preprocessing method, but also a necessity given the property of human perception. They developed a similarity measure, named Fuzzy Feature Contrast, based on fuzzy logic and Tverky's famous Feature Contrast [30].

One problem with the psychological view is that it does not have sound mathematical or computational models. A much more popular approach to find the "warping" function is through learning. By learning we mean given a small amount of training data, the system can automatically find the right "warping" function to make the feature space better.

Before learning, one must decide what to use as the ground truth data, in other words, what one wants the "warping" function to be optimised for. It turns out that there are two major forms that are most often used: similarity/dissimilarity (SD) and keyword annotations (KA). If we know that some objects are similar to each other while others are not, the feature space should be "warped" so that similar objects get closer, and dissimilar objects get farther. If the training data has some keyword annotated for each object, we want objects that share the same keywords to get closer while otherwise get farther. Both SD and KA have their advantages and good applications. SD is convenient for an end-user, and it does not require any explanation why two objects are similar or not (sometimes the reason is hard to be presented to the system as an end-user). Therefore, SD is suitable for end-user optimised learning, e.g., to learn what the end-user really means by giving some examples. SD is almost exclusively used by relevance feedback - a very hot research topic today. KA is good for system maintainers to improve the general performance of the retrieval system. Recent work in video retrieval has shown an interesting shift from query by example (QBE) [31][1] to query by keywords (QBK). The reason is that it allows the end users to specify queries with keywords, as they have been used to in text retrieval. Moreover, KA allows the knowledge learned to be accumulated by simply adding more annotations, which is often not obvious when using SD. Therefore, by adding more and more annotations, the system maintainer can let the end-user feel that the system works better and better. Both SD and KA have their constraints, too. For example, SD is often too user-dependent and the knowledge obtained is hard to accumulate, while KA is often limited by a predefined small Lexicon.

The learning process is determined not only by the form of the training data, but also by their availability. Sometimes all the training data are available before the learning, and the process can be done in one batch. We often call this off-line learning. If the training data are obtained gradually and the learning is progressively refined, we call it on-line learning. Both cases were widely studied in the literature. In the following sections, we will focus on applying these learning algorithms on retrieval systems to improve the similarity measure. Offline learning is discussed in Section 4 and on-line learning is in Section 5.