Since different types of multimedia information can be represented in the same form of feature vectors, media type becomes transparent to the machine. Therefore, the algorithms discussed here are applicable for retrieval of different media types. We assume that each meaningful "unit" of multimedia information is represented by a feature vector. For images, the "unit" can be the whole image, image blocks, or segmented regions; and for videos, the "unit" can be shots or keyframes, depending upon the application scenarios.
To combine the use of low-level features with keywords, we convert keyword annotations for each image into a vector, with components v_{ij} indicating the appearance or "probability" of keyword j in image i. With v_{ij}∊ [0,1], we call it a soft vector representation [28].
Another way of modeling relationships among keywords is to apply linear multidimensional scaling (MDS) [29] on the word similarity matrix to arrive at a low-dimensional space; or to use nonlinear techniques such as locally linear embedding to construct such a space [30] in which each word is represented by a point, and their mutual distances are preserved as much as possible. These schemes have the advantage of compact feature representation in a low dimensional feature space, but with the drawback of losing the semantic meaning of the axes. Also, they have poor scalability; i.e., with new keywords the MDS procedure has to be repeated, and the new lower-dimensional embedding can be very different from the previous one, even with a relatively small amount of new data. While with the simple soft vector representation, the insertion of new words is a linear incremental process.
In the abstraction of the feature space, each "unit" of multimedia data becomes a point. Relevance feedback becomes a supervised classification problem, or an on-line learning problem in a batch mode, but with unique characteristics. The uniqueness lies in at least three aspects:
First, the machine needs to learn and respond in real time. The feature subset and the similarity measure should be dynamically determined by the current user for the current task. Therefore real-time response is important.
Second, the number of training samples is very small relative to the dimension of the feature space, and relative to the requirements by popular learning machines such as support vector machines (SVM) [31].
Third, the desired output is not necessarily a binary decision on each point, but rather a rank-ordered top-k return, as a typical search engine will do. This is a less demanding task since we actually do not care about the rank or configuration of the negative points as long as they are far beyond the top-k returns. In fact, algorithms aiming at binary classification are ill fitted to this problem and may perform poorly.
We use the phrase "relevance feedback" to denote the on-line learning process during multimedia information retrieval, based on the relevance judgments fed back by the user. The scenario is like this:
Machine provides initial retrieval results through query-by-keyword, sketch, or example, etc.
Then, iteratively:
User provides judgment on the current results as to whether, and to what degree, they are relevant to her/his request;
Machine learns and tries again.
The task is to design an algorithm that learns a discriminating subspace from the limited number of examples provided by the user in an interactive fashion. Two cases need to be addressed, ideally in a unified framework: first, when only positive examples are given, a transformation, linear or nonlinear, shall bring them together in the new metric space. (Feature weighting [24] and whitening transform [11] are the linear solutions, with and without the independence assumption on feature components, respectively.) Second, when negative examples are also given, the transformation shall separate the positive and negative examples in the new space.