2. Background

To enable multimedia retrieval, retrieval systems extract from each multimedia object a set of properties that capture some aspects of its content. These properties, also called features, are all that a retrieval system understands of the multimedia objects and thus limit the system's capabilities. In the broadest sense, features are of a textual or visual nature. Textual features include manually or automatically assigned annotations or keywords. Visual features capture properties such as color, texture, shape, and faces, and are typically extracted automatically using image processing techniques, although manual extraction (e.g., for segmentation) can also be used. Textual features and their associated retrieval techniques have been studied extensively in the field of Information Retrieval [26][2], and we will not discuss them any further.

Visual features are an active research area with many exciting results and a rich literature. They can be broadly classified into general and specific features. General features deal with aspects common to most objects, such as color, texture, and shape. Specific features, on the other hand, focus on properties that matter only for particular kinds of content, such as human fingerprints, faces, or gestures.

Many techniques have been developed for both general and specific features. For example, there are many different ways to represent the color content of a multimedia object, including color histograms, color moments, and color sets [17]. This variety reflects the subjectivity with which humans perceive the content of multimedia objects, and each of these feature representations captures the feature from a different perspective. Extensive descriptions of feature representations appear in other chapters of this book.

In this chapter, we will assume that a multimedia retrieval system contains, for each multimedia object, a set of features F. Each feature F_i may itself contain a set of feature representations f_i,j. For example, F_1 can be the color feature with color histogram (f_1,1) and color moments (f_1,2) representations, while F_2 can be the texture feature with wavelet (f_2,1) and Tamura (f_2,2) representations.

Associated with each feature representation f_i,j is a set of comparison functions d_i,j,k that determine how well two values of that representation match each other. Retrieval systems have adopted two interpretations for these functions. Under the distance interpretation, a value of 0 means a perfect match, with higher values indicating progressively worse matches. For example, the Euclidean distance can be used as the distance function for a feature representation. Under the similarity interpretation, values lie in the range [0,1], where a value of 1 means a perfect match and 0 means no match. These two interpretations are generally interchangeable and can easily be converted into each other. We will focus on the distance interpretation in the remainder of the chapter.
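The hierarchy of features F_i, representations f_i,j, and comparison functions d_i,j,k can be sketched as a small data model. The following Python fragment is only an illustration under stated assumptions: the class names, the dictionary layout, the inclusion of a city-block distance as a second comparison function, and in particular the conversion s = 1 / (1 + d) between the two interpretations are choices made for this sketch, not something the chapter prescribes. Other conversions are equally possible.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
DistanceFn = Callable[[Vector, Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance: 0 for identical vectors, larger for worse matches."""
    return math.dist(a, b)  # requires Python 3.8 or later


def manhattan(a: Vector, b: Vector) -> float:
    """City-block distance, included only as a second comparison function."""
    return sum(abs(x - y) for x, y in zip(a, b))


@dataclass
class Representation:
    """One feature representation f_i,j with its comparison functions d_i,j,k."""
    name: str
    distances: Dict[str, DistanceFn] = field(default_factory=dict)


@dataclass
class Feature:
    """One feature F_i holding its representations f_i,j."""
    name: str
    representations: Dict[str, Representation] = field(default_factory=dict)


def distance_to_similarity(d: float) -> float:
    """Assumed conversion: maps a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)


def similarity_to_distance(s: float) -> float:
    """Inverse of the assumed conversion; similarity 0 (no match) maps to infinity."""
    return math.inf if s == 0.0 else 1.0 / s - 1.0


def make_representation(name: str) -> Representation:
    """Hypothetical helper: attach the same two distance functions everywhere."""
    return Representation(name, {"euclidean": euclidean, "manhattan": manhattan})


# The running example: color and texture, each with two representations.
features = {
    "color": Feature("color", {n: make_representation(n) for n in ("histogram", "moments")}),
    "texture": Feature("texture", {n: make_representation(n) for n in ("wavelet", "tamura")}),
}

d = features["color"].representations["histogram"].distances["euclidean"]((0.0, 0.0), (3.0, 4.0))
s = distance_to_similarity(d)
```

With this particular conversion a perfect match (distance 0) maps to similarity 1, and similarity decreases toward 0 as the distance grows, which agrees with both interpretations described above.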

Regardless of their details, most feature representations are stored as an array of real values. We can therefore view a feature value as a vector in a multidimensional space, and the distance functions for a feature representation as measuring the distance between two objects in this space.
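To make the vector-space view concrete, the sketch below treats each object's feature value as a tuple of real numbers and orders a small collection by Euclidean distance to a query vector. The function names, the object identifiers, and the ranking step itself are assumptions introduced for illustration; the text above only establishes that a distance between two feature vectors can be computed.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Distance between two feature vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_distance(
    query: Sequence[float],
    collection: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (object id, distance) pairs, best match (smallest distance) first."""
    scored = [(oid, euclidean_distance(query, vec)) for oid, vec in collection.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical collection of three objects with two-dimensional feature values.
collection = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 0.0)}
ranking = rank_by_distance((0.0, 0.0), collection)
```

Under the distance interpretation, the first entry of `ranking` is the object whose feature vector lies closest to the query.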

To explore relevance feedback, we will assume the presence of two features, each with two representations, as described above. To simplify our discussion, we will assume each of these feature representations to be two-dimensional, with the understanding that the same principles also apply in higher dimensions. For example, we can interpret the color histogram representation as the average hue and average saturation of an image.
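As an illustration of such a two-dimensional representation, the following sketch reduces an image to its average hue and average saturation. It assumes that the image is given as a list of 8-bit RGB triples, and it uses a plain arithmetic mean of the hue values, which ignores the fact that hue is a circular quantity. Both are simplifications made for this example, and the function name is hypothetical.

```python
import colorsys
from typing import Iterable, Tuple


def mean_hue_saturation(pixels: Iterable[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Reduce 8-bit RGB pixels to a two-dimensional (hue, saturation) vector in [0, 1]^2."""
    hue_sum = 0.0
    sat_sum = 0.0
    count = 0
    for r, g, b in pixels:
        # colorsys expects channel values in [0, 1] and returns (h, s, v).
        h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hue_sum += h
        sat_sum += s
        count += 1
    if count == 0:
        raise ValueError("an image needs at least one pixel")
    # Simplification: arithmetic mean of hue, ignoring wrap-around at 0 and 1.
    return (hue_sum / count, sat_sum / count)


# A tiny two-pixel "image": one pure red and one pure green pixel.
vector = mean_hue_saturation([(255, 0, 0), (0, 255, 0)])
```

The resulting pair can be used directly as a point in the two-dimensional space assumed in the discussion above.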

Once the sets of features, representations, and distance functions are established, we must turn to the problem of determining the overall distance between a multimedia object and the query. Because the query model heavily influences how relevance feedback works, we present the different query models in conjunction with the feedback models they support.