2. Main Concepts



Regarding information representation, the main concepts involved in similarity search on multimedia data are:

  • the feature spaces where the data under analysis is described,

  • the dissimilarity metric or distance function used to compare two instances of data,

  • the searching spaces and

  • the access methods specially tailored to process similarity searches quickly.

These concepts and their relationships are detailed in the following subsections.

2.1 Feature Spaces

The first and crucial step in content-based indexing and retrieval of multimedia data is the definition of the proper features to be extracted from the data, aiming to represent these complex data as closely as possible and to provide adequate support for the intended use of the data. Based on this representation, access methods must be able to exploit the feature vectors, allowing efficient retrieval when answering similarity queries. It is important to notice that, depending on the intended application, different features may be computed from the same data, and different ways of comparing them may be used to improve the retrieval capability.

As the volume of information in multimedia databases grows larger every day, it is important to look for feature extraction techniques that do not demand human interaction. There are basically two main approaches: extraction of features directly from the raw data, and extraction of features from transformed data, where the information usually comes from a compressed transform domain, such as wavelet coefficients [14]. For example, in image databases it is usual to extract features from raw data, that is, directly from the image pixels. Color histograms represent this kind of feature. Other feature spaces are constructed in a more elaborate way over the raw data, and thus lie at a higher abstraction level. These include the detection of contours, edges and surfaces. Statistical techniques, such as determining centroids, moment invariants and principal axes, among others, extract global information from the data or from part of the data. Such statistical results have been used to define feature spaces as well [15].
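
As an illustration of such global statistical features, the sketch below computes the intensity centroid and second-order central moments of a greyscale image held in a NumPy array. It is a minimal, illustrative example; the function name and the particular choice of moments are ours rather than taken from the cited works.

```python
import numpy as np

def global_moment_features(image: np.ndarray) -> np.ndarray:
    """Illustrative global statistical features: intensity centroid and
    second-order central moments of a greyscale image."""
    img = image.astype(np.float64)
    total = img.sum()
    if total == 0:
        return np.zeros(5)
    rows, cols = np.indices(img.shape)
    # Centroid (intensity-weighted mean position).
    r_c = (rows * img).sum() / total
    c_c = (cols * img).sum() / total
    # Second-order central moments (spread and orientation information).
    mu_rr = (((rows - r_c) ** 2) * img).sum() / total
    mu_cc = (((cols - c_c) ** 2) * img).sum() / total
    mu_rc = ((rows - r_c) * (cols - c_c) * img).sum() / total
    return np.array([r_c, c_c, mu_rr, mu_cc, mu_rc])
```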

Among the more abstract feature spaces are the techniques that deal with the relationship between the information already obtained by other feature extractors. Structural features built over pattern graphs [16] or grammars composed from patterns [17] are examples of high-level feature spaces.

The extraction of features based on transform-domain techniques takes advantage of the fact that such transformations are also employed to compress the data [15]. Thus, extracting features from compressed data requires less memory space and usually less computational effort, increasing the overall performance of the multimedia system. In the literature many transformations are used for this purpose, such as the Short Window Fourier Transform [18], the Discrete Cosine Transform (DCT) [18] and the Wavelet Transform [19], among others.
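
A minimal sketch of a transform-domain feature extractor is shown below: it keeps only the k-by-k lowest-frequency coefficients of the two-dimensional DCT of a greyscale image as a compact feature vector. It assumes SciPy's dctn routine is available; the function name and the choice of k are illustrative, not taken from the cited works.

```python
import numpy as np
from scipy.fft import dctn  # type-II DCT applied over both image axes

def dct_features(image: np.ndarray, k: int = 8) -> np.ndarray:
    """Keep the k x k lowest-frequency DCT coefficients of a greyscale
    image as a compact transform-domain feature vector."""
    coeffs = dctn(image.astype(np.float64), norm="ortho")
    return coeffs[:k, :k].flatten()
```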

The previous discussion about feature spaces leads to an inherent hierarchy of the properties extracted from the multimedia data. Each multimedia data type has specific properties which are used to represent the data. To discuss such properties it is necessary to focus on a specific multimedia data type. In this chapter we frequently base the discussion on images as an illustrative example, since the concepts presented can be applied to other multimedia or complex data types with straightforward adaptations. However, video data is composed of other multimedia data, such as sound, images and temporal sequences. Therefore, special consideration is given to this data type.

Regarding images, we can define four abstraction levels to classify the feature spaces, as shown in Figure 29.1. Level 1 of this figure holds the features that are directly extracted from the raw data, that is, from the pixels. On level 2 are the features based on contours, edges, lines, regions, surfaces and salient regions (sharp borders, corners, crossings). The features represented in level 2 are combined, producing the more structured features that pertain to level 3. Level 4 is composed of the interpretation of the relationships among the features from level 3 and corresponds to the features with the highest semantic content embedded. Features from levels 1 and 2 can be processed automatically; however, in level 3 a semi-automatic approach would be expected. Feature spaces handled in level 4 are highly dependent on human interaction, and they basically comprise annotations and descriptions of the images made by humans in a structured language, which can be interpreted by the system.

Figure 29.1: Abstraction levels describing the feature spaces for image databases.

2.1.1 Video Databases

Dealing with video databases brings more challenges than is usual with other multimedia data types. The key characteristics of video data are its intrinsic temporal nature and the combination of spatial and temporal information. Therefore, to answer similarity queries on video data it is necessary to work on these two aspects. A digital video is frequently represented as a sequence of scenes. Each scene consists of a collection of objects (e.g., a kid playing with a dog in a garden) with their spatial and temporal relationships (the position of the kid and the dog in the scene and the time of movements in the scene). Besides the relationships between the objects in the scene, it is necessary to manage and to process the features of these objects, such as color, shape and texture. Therefore, to extract the relevant features from video data, it is necessary to address both the spatial and the temporal aspects of the data.

Regarding the spatio-temporal nature of video data, the first step in indexing video data should be the temporal segmentation of the video elements, which are episodes, scenes, shots and frames [4]. The elementary video unit is the frame, which is actually a still image. Representing a video by its frames leads to a lot of redundant data, because sequential frames usually have very similar properties. Thus, it is more common to use other techniques based on video segments and shots. A shot is a set of contiguous frames representing a continuous action regarding time and space. A scene is composed of a sequence of shots focusing on the same location of interest. A series of related scenes forms an episode.

Nonetheless, a shot still encompasses a lot of data, which calls for further summarization through its extracted properties. A condensed representation of a shot can be given by the key frame of the shot. The key frame can be chosen based on the evaluation of object dynamics across the frames of the shot, or even based on simple heuristics. For example, in the Informedia Project [10] the key frame is set as the one in the middle of the shot. In this way, a digital video is reduced to a sequence of its key frames.
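
The sketch below illustrates both steps on a list of greyscale frames: shots are cut wherever the difference between consecutive frame histograms exceeds a threshold, and the middle frame of each shot is taken as its key frame, following the heuristic mentioned above. The threshold value and the helper names are illustrative assumptions, not taken from any of the cited systems.

```python
import numpy as np

def grey_histogram(frame: np.ndarray, bins: int = 256) -> np.ndarray:
    """Normalized grey-level histogram of a frame (2-D uint8 array)."""
    hist = np.bincount(frame.ravel(), minlength=bins).astype(np.float64)
    return hist / hist.sum()

def segment_into_shots(frames, threshold: float = 0.4):
    """Return (start, end) index pairs of shots, cutting wherever the L1
    difference between consecutive frame histograms exceeds the
    (illustrative) threshold."""
    shots, start = [], 0
    prev_hist = grey_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = grey_histogram(frames[i])
        if np.abs(hist - prev_hist).sum() > threshold:
            shots.append((start, i - 1))
            start = i
        prev_hist = hist
    shots.append((start, len(frames) - 1))
    return shots

def key_frames(frames, shots):
    """Middle frame of each shot, as in the heuristic cited above."""
    return [frames[(s + e) // 2] for s, e in shots]
```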

After establishing a key frame, which is basically a still image, it is processed by image processing techniques and its main features are extracted. Such features can be organized hierarchically and used to generate indexes over the video and to allow similarity searching. This approach tackles only the spatial aspect of the video data. Thus, continuing on this aspect, the hierarchy of levels presented in Figure 29.1 should be extended to include video data. Two more levels would appear before level 1, that is:

  • Level 0.0 : partition of each video segment into shots;

  • Level 0.1 : choice of a key frame (representative frame) for each shot.

  • The temporal reference of each key frame should be kept and used to support further refinements when needed.

2.1.2 Searching Complex Data

The general idea behind any content-based multimedia retrieval system is this: for every multimedia item to be indexed, such as an image, video or time series, a feature vector representing the data is stored in the database along with the original information. To answer a query, the feature vector of the query object is computed and the database is searched for the stored feature vectors that match the searching criteria. When a match is found, the corresponding multimedia data is returned as a query result.

The choice of the meaningful characteristics or properties of multimedia information is made by a specialist on the data domain. Based on the proposal given by such experts, the features corresponding to those characteristics are extracted from every multimedia object stored in the database. Thus, complex data will be "seen" through their features. Let us say that n features are extracted from a given piece of data. Usually the features are placed in an n-dimensional vector and can be manipulated as an n-dimensional object. After obtaining the feature vectors of the data it is necessary to compare them, which is done through dissimilarity or distance functions.

Considering images, a usual descriptor, or feature, is the histogram computed from the image. Histograms give the global color distribution of an image, as each bin of the histogram counts the pixels of the image that present the corresponding color. That is, a histogram indicates the number of image pixels per color (or grey level) used in the image quantization. The number of elements of the feature vector which holds the histogram is therefore equal to the number of colors of the image. For example, images acquired in 8 bits, or 256 grey levels, produce histograms with 256 elements or dimensions, so the corresponding feature vector is 256-dimensional. Besides giving a drastic reduction of the data volume, the extraction of a feature aims to capture some intrinsic and meaningful property of the data, one that is usually able to discriminate the data under analysis. The color distribution is one of the main properties of images, and histograms are simple to calculate, so it is natural that they are one of the first features to be extracted from images. Figure 29.2 presents three images from a video sequence and their histograms. Instead of comparing the original images, a content-based image retrieval system will deal with the image histograms (a sketch of this idea follows Figure 29.2).

Figure 29.2: An example of feature extraction— grey scale histograms with 256 bins.
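
The following sketch makes this concrete: it extracts the 256-bin grey-level histogram of an image stored as an 8-bit NumPy array and compares two images through their normalized histograms using the Euclidean distance. The function names are illustrative.

```python
import numpy as np

def grey_histogram(image: np.ndarray) -> np.ndarray:
    """256-bin grey-level histogram: the feature vector discussed above."""
    return np.bincount(image.ravel(), minlength=256).astype(np.float64)

def histogram_distance(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Compare two images through their histograms (Euclidean distance),
    instead of comparing the raw pixels."""
    ha, hb = grey_histogram(img_a), grey_histogram(img_b)
    # Normalizing makes images of different sizes comparable.
    ha, hb = ha / ha.sum(), hb / hb.sum()
    return float(np.linalg.norm(ha - hb))
```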

Section 2.4 discusses the applicability of distance functions and metric spaces to speed up searching operations through the use of metric access methods.

2.2 Distance Function and Metric Domains

In this section we present the basic definitions necessary to delimit the scope of this work. The first definition aims to establish a way to compare multimedia and complex data through their features.

  • Definition 1 (metric distance function): Given a set of objects S = {s1, s2, ..., sn}, a distance function d() that has the following properties:

    1. Symmetry: d(s1, s2) = d(s2, s1)

    2. Non-negativity: 0 < d(s1, s2) < ∞ for s1 ≠ s2, and d(s1, s1) = 0

    3. Triangle inequality: d(s1, s3) ≤ d(s1, s2) + d(s2, s3) for any s1, s2, s3 ∈ S

    is called a metric distance function, or a metric for short.

These properties are fundamental to building indexing access methods that allow fast and effective searching and retrieval of multimedia data. The triangle inequality property, in particular, helps to prune the searching space by providing a lower bound on distances.
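
A small, self-contained check of these properties for the Euclidean distance on illustrative sample vectors is given below, together with the lower bound that the triangle inequality provides; the vectors are arbitrary and only serve to exercise the definition.

```python
import numpy as np

def d(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean (L2) distance, a metric over feature vectors."""
    return float(np.linalg.norm(x - y))

s1, s2, s3 = np.array([1.0, 2.0]), np.array([4.0, 6.0]), np.array([0.0, 5.0])

assert d(s1, s2) == d(s2, s1)                      # symmetry
assert d(s1, s2) > 0 and d(s1, s1) == 0            # non-negativity
assert d(s1, s3) <= d(s1, s2) + d(s2, s3)          # triangle inequality

# The triangle inequality also yields a lower bound on d(s1, s3),
# which is what indexing structures exploit to discard comparisons:
assert abs(d(s1, s2) - d(s2, s3)) <= d(s1, s3)
```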

The objects si stated in Definition 1 are the feature vectors extracted from the multimedia data. This comparison is intended to work on the main characteristics of the multimedia data, aiming to preserve the semantics of the data. Note that the distance function actually measures the dissimilarity between the feature vectors. However, dissimilarity and similarity are easily interchangeable, as one is the opposite of the other. Thus, from now on we will refer to a distance function as a way to measure the similarity between multimedia data. The domain of the data is stated in the following definition.

  • Definition 2 (metric domain): Given a set of objects S and a metric distance function d() as stated in Definition 1, the pair M = {S, d()} is a metric domain.

For example, a set of two-dimensional points representing latitude and longitude of cities in a Geographical Information System (GIS) and the Euclidean distance function, which is a metric, constitutes a metric domain. In this example the data are in a spatial domain as well as in a metric domain. In fact, spatial datasets following a metric [1] distance function (such as the L2 or the Euclidean distance) are special cases of metric domains. [2] Thus, multidimensional feature vectors and a metric measuring the dissimilarity between data are a special case of a metric domain as well.

An example of a pure metric space is the set of words of a language dictionary together with the LEdit [3] distance function. Clearly the words are not in a spatial domain, but as LEdit gives a way to measure how similar two words are, we have a metric space.
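
A minimal implementation of the LEdit distance, using the standard dynamic-programming formulation summarized in footnote 3, is sketched below; the function name is illustrative.

```python
def ledit(x: str, y: str) -> int:
    """LEdit (Levenshtein) distance: minimum number of insertions,
    deletions and substitutions needed to turn x into y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

assert ledit("head", "hobby") == 4  # the example given in footnote 3
```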

2.3 Searching Spaces and Similarity Queries

As already highlighted in the introductory section of this chapter, multimedia data are usually compared by similarity, because there is no sense in searching for the exact data that the user already has under analysis. What the user customarily desires is to find other data in the database that resemble the given data. It is widely accepted that there are mainly two classes of similarity queries: range queries and nearest neighbor queries. Both are defined as follows:

  • Definition 3 (Range query): Given a query object q ∈ S and a maximum search distance rq, the answer is the subset of S whose objects are within distance rq of the query object, that is: Rquery(q, rq) = {si | si ∈ S ∧ d(si, q) ≤ rq}.

An example of a range query on a word dataset with the LEdit distance function is: "Find the words that are within distance 3 from the word 'book'."

  • Definition 4 (Nearest Neighbor query): Given a query object q ∈ S, the nearest neighbor of q is the unitary subset of S such that Nnquery(q) = {sn ∈ S | ∀ si ∈ S, d(sn, q) ≤ d(si, q)}.

A usual variation is the k-nearest neighbor query, which finds the k > 0 objects closest to q in the dataset S. An example of a k-nearest neighbor query with k=5 is: "Select the 5 words nearest to 'book'." It is possible to have ties when asking for k-nearest neighbor queries, so the system must be able to handle an answer list with more than k objects. The number of ties usually increases when the distance function cannot separate the objects well. For example, the LEdit metric always gives integer results from a small range of values; therefore, many words lie at the same distance from each other, producing a tie list.
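
The sketch below renders Definitions 3 and 4 as a sequential scan over a list of feature vectors with the Euclidean distance, including a simple treatment of ties in the k-nearest neighbor query. It is only an illustration of the query semantics; the metric access methods of Section 2.4 answer the same queries while avoiding most of these distance calculations.

```python
import numpy as np

def d(x, y):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def range_query(dataset, q, rq):
    """Rquery(q, rq): every object within distance rq of q."""
    return [s for s in dataset if d(s, q) <= rq]

def knn_query(dataset, q, k=1):
    """k-nearest neighbor query (assumes at least k objects); objects
    tied with the k-th distance are kept in the answer list."""
    dists = sorted((d(s, q), i) for i, s in enumerate(dataset))
    cutoff = dists[k - 1][0]
    return [dataset[i] for dist, i in dists if dist <= cutoff]
```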

2.4 Metric Access Methods

Many approaches to accelerate the processing of similarity queries have been proposed. The majority of them led to the development of metric access methods (MAM). The intent of a MAM is to minimize the number of comparisons (distance calculations) and the number of disk accesses during the query processing.

The study of metric datasets, which comprise the great majority of multimedia data, including DNA strings, fingerprints, video and images, has attracted the attention of researchers in the database area. Recall that, when dealing with this kind of data, only the objects (the extracted features) and the distances between them are available. Usually there is no additional information, such as that available in spatial domains. Therefore, the objects and the distances between them are all the information that metric access methods can use to organize the data.

The work of Burkhard and Keller [20] is considered the landmark of the MAM area. They presented interesting techniques for recursively partitioning a metric dataset, where the recursive process is materialized as a tree. The first technique partitions a dataset by choosing a representative from the set and grouping the elements with respect to their distance from the representative. The second technique partitions the original set into a fixed number of subsets and chooses a representative from each subset. The representative and the maximum distance from the representative to a point of the corresponding subset are also maintained. The metric tree of Uhlmann [21] and the vantage-point tree (vp-tree) of Yianilos [22] are somewhat similar to the first technique of [20], as they partition the elements into two groups according to a representative, called a "vantage point." In [22] the vp-tree was also generalized to a multi-way tree, and in [23] it was further improved through a better algorithm to choose the vantage points and to answer nearest neighbor queries. In order to reduce the number of distance calculations, Baeza-Yates et al. [24] suggested using the same vantage point in all nodes that belong to the same level. Then, a binary tree degenerates into a simple list of vantage points. Another method of Uhlmann [21] is the generalized hyper-plane tree (gh-tree). The gh-tree partitions the dataset into two subsets, picking two objects as representatives and assigning the remaining objects to the closest representative. Bozkaya and Ozsoyoglu [25] proposed an extension of the vp-tree called the "multi-vantage-point tree" (mvp-tree), which carefully chooses m vantage points for a node that has a fanout of m². The Geometric Near-neighbor Access Tree (GNAT) of Brin [26] can be viewed as a refinement of the second technique presented in [20]. In addition to the representative and the maximum distance, it is suggested that the distances between pairs of representatives be stored. These distances can be used to prune the searching space using the triangle inequality. Also, an approach to estimate distances between objects using pre-computed distances on selected objects of the dataset is proposed by Shasha and Wang in [27].
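
To give the flavor of these ball-partitioning techniques, the sketch below recursively splits a dataset around a representative by the median of the distances to it, in the spirit of the first Burkhard-Keller technique and of vp-trees. It is an illustration of the idea only: the representative is chosen naively, the node layout is ours, and no search algorithm of any specific structure is reproduced.

```python
import numpy as np

def d(x, y):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def build_ball_partition(objects, leaf_size=4):
    """Recursively split the objects around a representative (naively,
    the first object) by the median of their distances to it."""
    if len(objects) <= leaf_size:
        return {"leaf": objects}
    rep, rest = objects[0], objects[1:]
    dists = [d(rep, o) for o in rest]
    median = float(np.median(dists))
    inner = [o for o, dist in zip(rest, dists) if dist <= median]
    outer = [o for o, dist in zip(rest, dists) if dist > median]
    return {"rep": rep,
            "radius": median,
            "inner": build_ball_partition(inner, leaf_size),
            "outer": build_ball_partition(outer, leaf_size)}
```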

All the aforementioned methods are static, in the sense that they do not support insertions and deletions after the creation of the tree. The M-tree of Ciaccia, Patella and Zezula [28] was the first method to overcome this deficiency, allowing further insertions. The M-tree is a height-balanced tree where the data elements are stored in the leaves. Note that the M-tree supports insertions in a way similar to R-trees [29].

In all metric access methods, the searching space is divided into "regions" or nodes, each region having one or more representatives or centers, which are strategically chosen by the MAM. At the same time, the nodes are organized hierarchically as a tree. So, each node of a tree basically holds the representative, the covering radius and the objects of the dataset which are in the node region. Figure 29.3(a) shows the organization of a node which has maximum capacity equal to 8. The representative (or center) object and the covering radius are depicted along with the remaining objects. For the sake of clarity, the node is drawn in a two-dimensional space using the Euclidean distance (L2) to depict its shape, which results in a circle. Each distance defines a specific shape for the node. Thus, for the Manhattan (L1) distance the shape of the node is a diamond, and for the L∞ (L-infinity) distance the shape becomes a square. As the shape of the nodes changes depending on the distance used, so does the object composition of the nodes.

Figure 29.3: Nodes in a two-dimensional space defined by the Euclidean distance. The nodes have the maximum capacity of 8 objects. (a) The representative of the node is the object A. At inserting the object I, the node surpasses its capacity and needs to be split. (b) After the node splits, there are two nodes, the first one with four objects and its representative is B, and the second node with five objects having as representative the object C.

As objects are inserted into the tree and the nodes reach their capacity for holding objects, new nodes must be created. This is done by the splitting algorithms provided by the MAM. Figure 29.3(b) illustrates this procedure. When all objects have been inserted the tree is built. Figure 29.4 presents a metric tree built over 17 objects. The capacity, or maximum number of objects per node, of this tree is three. This tree also has three levels, where the root node is on level 1, and it is the only node without a representative. On the second level are the index-nodes, which keep the routing information of the tree. On the third level are the leaf-nodes, which store the indexed objects. The leaf-nodes always appear only at the deepest level of the tree. Observe that there is overlap between the regions defined by the nodes of the tree (how these regions support query pruning is sketched after Figure 29.4).

Figure 29.4: A metric tree indexing 17 objects. The objects are organized in three levels. The node capacity is 3.
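
The covering radius stored in each node is what makes pruning possible during a search. The sketch below shows a range query over a tree whose nodes follow the layout just described (a representative, a covering radius, and either child nodes or leaf objects); the node structure and the tree construction are assumed here for illustration, and the pruning test is the only point being made.

```python
import numpy as np

def d(x, y):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def range_search(node, q, rq, result):
    """Descend the tree, visiting a node only when the query ball
    (center q, radius rq) intersects the node region
    (center node["rep"], radius node["radius"])."""
    if "objects" in node:                       # leaf node: test each object
        result.extend(s for s in node["objects"] if d(s, q) <= rq)
        return
    for child in node["children"]:              # index node: prune subtrees
        if d(child["rep"], q) <= rq + child["radius"]:
            range_search(child, q, rq, result)
```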

As there are many interesting MAMs in the literature, it is important to highlight their particularities. The differences among the metric access methods presented before lie either in the way the representatives of the nodes are chosen, in the strategy for splitting the nodes and building the tree, or in the techniques for insertion and data organization. A survey on metric access methods comparing these aspects among the MAMs is presented in [30].

A common problem with all metric access methods presented in the literature is the overlap between the nodes imposed by their structures. As the overlap increases, the efficiency of the structure decreases, because every node covering a query region needs to be processed during the search operation. The Slim-tree, proposed in [31][32], was developed aiming to minimize the overlap between nodes. This metric access method decreases the amount of overlap between nodes of a tree using the Slim-down algorithm, which achieved performance gains of up to 50% when answering similarity queries.

Two aspects involved in measuring the efficiency of a MAM for answering similarity queries are the number of disk accesses and the number of distance calculations. The Omni-family, proposed in [33], is a set of metric access methods that, taking advantage of global elements called foci, can drastically decrease the number of distance calculations when processing similarity queries. Another advantage of such methods is that they can be built on top of commercial data managers, improving their performance without disturbing the execution of previously developed applications.
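
A minimal sketch of the pivot-based filtering idea behind such methods is shown below: the distances from every object to a few global foci are pre-computed, and at query time the triangle inequality discards any object whose lower bound already exceeds the query radius, so that the actual distance is computed only for the survivors. The foci here are chosen by the caller for illustration, not by the selection algorithms of the actual methods, and the function names are ours.

```python
import numpy as np

def d(x, y):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def precompute_foci_distances(dataset, foci):
    """Distance from every object to every focus, computed once."""
    return [[d(s, f) for f in foci] for s in dataset]

def foci_filtered_range_query(dataset, foci, foci_dists, q, rq):
    q_to_foci = [d(q, f) for f in foci]
    answer = []
    for s, s_to_foci in zip(dataset, foci_dists):
        # Lower bound on d(s, q) from each focus; if any bound exceeds
        # rq, s cannot be in the answer and d(s, q) is never computed.
        if any(abs(qf - sf) > rq for qf, sf in zip(q_to_foci, s_to_foci)):
            continue
        if d(s, q) <= rq:
            answer.append(s)
    return answer
```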

The improvements in query time already achieved by MAMs have brought the performance of similarity queries over objects in metric spaces to acceptable levels, that is, comparable to the times obtained for their counterparts in spatial domains. Therefore, although more research is still required on techniques to support queries over objects in metric spaces, other demands now need to be fulfilled. Among them is the need to represent the structure of metric objects through their feature vectors, the distance functions and the similarity queries. The next section addresses a generic architecture to store multimedia data following the basic concepts of the relational data model, as well as an extension of the standard relational query language, SQL.

[1]An Lp-norm between two arrays X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is expressed as Lp(X, Y) = ( Σ_{i=1..n} wi |xi − yi|^p )^{1/p}, where wi is the weight of each attribute or dimension. The weight is usually equal to one. When p = 2, it is the Euclidean norm.
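
For reference, a direct rendering of this weighted Lp distance as code (a minimal sketch, with unit weights by default; the function name is illustrative):

```python
import numpy as np

def lp_distance(x, y, p=2.0, w=None):
    """Weighted Lp distance: ( sum_i w_i * |x_i - y_i|**p ) ** (1/p)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return float(np.sum(w * np.abs(x - y) ** p) ** (1.0 / p))
```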

[2]Some authors refer to metric domains as metric spaces.

[3]The distance LEdit (or Levenshtein) between two strings X and Y is a metric which counts the minimal number of symbols that have to be inserted, deleted or substituted, to transform X into Y. For example, LEdit ("head", "hobby") = 4, i.e., three substitutions and one insertion.



