4. Off-Line Learning

If all the training data is available at the outset, learning can be done in a single step. This kind of off-line learning is usually performed before the system is opened to end-users. Although training may take considerable time, speed is not a concern, because the end-user never experiences the delay.

Most off-line learning systems handle keyword annotations (KA). The keywords are usually drawn from a predetermined set, which can be organized in different ways. For example, Basu et al. [32] defined a lexicon of relatively independent keywords describing events, scenes and objects. Many authors prefer a tree structure [34][35][33], as it is clean and easy to understand. Naphade et al. [36] and Lee et al. [37] used a graph structure, which is appropriate when the relationships between keywords are very complex.

Once the training data is given, a number of learning algorithms, parametric or non-parametric, can be used to learn the concepts behind the keywords. To the authors' knowledge, at least the Gaussian Mixture Model (GMM) [32][35], Support Vector Machine (SVM) [38], Hybrid Neural Network [39], Multi-nets [36], Distance Learning Network [40] and Kernel Regression [33] have been studied in the literature. A common characteristic of these algorithms is that each can, in principle, model almost any distribution of the data. This flexibility is needed because we do not know how objects that share the same concept are distributed in the low-level feature space. One assumption we can reasonably make is that if two objects are very close to each other in the low-level feature space, they should be semantically similar, or at least each should tell us something about the other. Conversely, if two objects are far apart, the semantic link between them should be weak. Because this semantic inference is local, the assumption allows objects with the same semantic meaning to lie in different places in the feature space, a situation that simple methods such as linear feature reweighting cannot handle. If the assumption does not hold, probably none of the above learning algorithms will improve retrieval performance much; the only remedy in that case may be to find better low-level features for the objects.
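The locality assumption above can be illustrated with a kernel-weighted vote over annotated neighbours, the idea behind kernel regression. The following is only a minimal sketch under stated assumptions: the Gaussian kernel, the bandwidth value, the function names and the toy one-dimensional data are all hypothetical and are not taken from any of the cited systems.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Weight that decays smoothly with distance (assumed kernel choice)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def predict_concept(query, annotated, bandwidth=1.0):
    """Kernel-weighted average of neighbour labels.

    annotated: list of (feature_vector, label) pairs, where label is 1.0 if
    the object carries the keyword and 0.0 otherwise.
    """
    weights = [gaussian_kernel(math.dist(query, x), bandwidth)
               for x, _ in annotated]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every annotated object is too far away to contribute
    return sum(w * label for w, (_, label) in zip(weights, annotated)) / total

# Toy 1-D feature space (hypothetical): the keyword occurs in two separate
# regions, with objects lacking the keyword lying in between.
annotated = [((0.0,), 1.0), ((1.0,), 1.0),
             ((4.0,), 0.0), ((5.0,), 0.0), ((6.0,), 0.0),
             ((9.0,), 1.0), ((10.0,), 1.0)]

score_left = predict_concept((0.5,), annotated)    # near the first region
score_middle = predict_concept((5.0,), annotated)  # between the regions
score_right = predict_concept((9.5,), annotated)   # near the second region
```

In this toy setting the two outer queries receive a score close to 1 and the middle query a score close to 0, even though the keyword occupies two disjoint regions. A single weighted distance around one query point selects an interval here, so it could not include both regions without also including the objects between them.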

Different learning algorithms have different properties and suit different circumstances. Take the Gaussian Mixture Model as an example. It assumes that objects having the same semantic meaning are clustered into groups. The groups can lie at different places in the feature space, but each follows a Gaussian distribution. If these assumptions hold, GMM is an excellent way to model the data: it is simple, elegant, easy to fit with algorithms such as EM [41][42], and theoretically sound. However, the assumptions are fragile: the number of clusters is not known in advance, and in real data the clusters are rarely, if ever, exactly Gaussian. Despite these limitations, GMM remains very popular for its many advantages.

Kernel regression (KR) is another popular machine learning technique. Instead of using a global model like GMM, KR assumes some local inference (a kernel function) around each training sample. When the semantic meaning of an unannotated object is predicted, a closer annotated object has more influence and a farther one has less, so the object receives semantic meanings similar to those of its close neighbours. KR can model any distribution naturally, and also has sound theory behind it [43]. Its limitations are that the kernel function is hard to select and that the number of samples needed for a reasonable prediction is often high.

Support Vector Machine (SVM) [44][45] is a more recent addition to the toolbox of machine learning algorithms that has shown improved performance over standard techniques in many domains, and it has become one of the most popular methods among researchers. The basic idea is to find the hyperplane that has the maximum margin with respect to the sample objects, where the margin is the distance the hyperplane can move along its normal before hitting any sample object. Intuitively, the greater the margin, the lower the chance that any sample point will be misclassified. For the same reason, a sample object that is far from the hyperplane is less likely to be misclassified. This reasoning is also the basis of the SVM Active Learning approaches introduced in Section 5. For detailed information on SVM, please refer to [44][45].
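The notion of margin used above can be made concrete with a short calculation. The sketch below does not train an SVM; it only measures the margin of a given hyperplane, so that two candidate separators can be compared. The function name, the sample points and the two hyperplanes are hypothetical.

```python
import math

def geometric_margin(w, b, samples):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0.

    samples: list of (feature_vector, label) pairs with label +1 or -1.
    A negative result means at least one sample lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, label in samples)

# Hypothetical, linearly separable toy data in two dimensions.
samples = [((0.0, 0.0), -1), ((1.0, 0.0), -1),
           ((3.0, 0.0), +1), ((4.0, 0.0), +1)]

# Two hypothetical separators: the vertical lines x = 2 and x = 1.5.
margin_centered = geometric_margin((1.0, 0.0), -2.0, samples)
margin_offset = geometric_margin((1.0, 0.0), -1.5, samples)
```

Both lines separate the two classes, but the centred one leaves a margin of 1.0 while the offset one leaves only 0.5. An SVM would prefer the former.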

Although the learned semantic model can already be used to measure the similarity between any two objects, most systems add a fusion step [33][32][35]. The reason is that the performance of statistically learned models is largely determined by the size of the training data set. Because training data is often produced manually, it is expensive and therefore small, and it is risky to assume that the semantic model alone is good enough. In [33], semantic distance is combined with low-level feature distance through a weighting mechanism to give the final output, and the weight is determined by the confidence of the semantic distance. In [32], several GMMs are trained for each feature type, and the final result is generated by fusing the outputs of all the GMMs. In [35], where audio retrieval was studied, the semantic space and the feature space are designed symmetrically: each node in the semantic model is linked to equivalent sound documents in the acoustic space with a GMM, and each audio file or document is linked with a probability model in the semantic space. The spaces themselves are organized with hierarchical models. Given a new query in either space, the system first searches that space to find the best node, and then applies the link model to obtain the retrieval results in the other space.
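A confidence-weighted fusion of the kind described for [33] might look like the sketch below. The convex combination is an assumption made here for illustration, not the formula used in [33], and the function name and numeric values are hypothetical.

```python
def fuse_distances(semantic_distance, feature_distance, confidence):
    """Blend two distances (assumed convex combination).

    confidence lies in [0, 1] and is the weight given to the semantic
    distance; the remainder goes to the low-level feature distance.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return (confidence * semantic_distance
            + (1.0 - confidence) * feature_distance)

# Same pair of objects, scored under low and high confidence in the
# semantic model (all values hypothetical).
low_conf = fuse_distances(semantic_distance=0.2, feature_distance=0.8,
                          confidence=0.1)
high_conf = fuse_distances(semantic_distance=0.2, feature_distance=0.8,
                           confidence=0.9)
```

When the semantic model is trusted little, for instance because few annotated samples are available, the fused value stays close to the low-level feature distance. As confidence grows, it moves towards the semantic distance.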

Keyword annotation is very expensive because it requires a lot of manual work. Chang and Li [46] proposed another way of obtaining ground truth data. They used 60,000 images as the original set and synthesized another set by applying 24 transforms such as rotation, scaling and cropping. An image after a transform should be similar to the image before it. Based on this data, they discovered a perceptual function called the dynamic partial distance function (DPF). Synthesizing new images by transforms and using them as training data is not new; the same trick is used in face recognition systems when the training set contains very few images (e.g., only one). Although transforms may not give a complete model of similarity, this is a very convenient way of obtaining a large amount of training data, and DPF appears to perform reasonably well, as reported in [46].
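The idea of synthesizing similar pairs by transforms can be sketched as follows. This is not the procedure of [46]: the three transforms stand in for the 24 used there, images are represented as small nested lists purely for illustration, and all names are hypothetical.

```python
def rotate90(image):
    """Rotate a row-major 2-D image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def crop_border(image, margin=1):
    """Remove `margin` pixels from every side of the image."""
    return [row[margin:-margin] for row in image[margin:-margin]]

# Hypothetical stand-ins for the transforms mentioned in the text.
TRANSFORMS = {"rotate90": rotate90,
              "flip": flip_horizontal,
              "crop": crop_border}

def synthesize_pairs(images):
    """Return (original, transformed, transform_name) triples.

    Each triple is treated as a ground-truth similar pair, obtained
    without any manual annotation.
    """
    return [(img, fn(img), name)
            for img in images
            for name, fn in TRANSFORMS.items()]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
pairs = synthesize_pairs([image])
```

Each original image yields one labelled pair per transform, so the amount of training data grows with the number of transforms at no annotation cost.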