Content Description, Search and Delivery (MPEG-7 and MPEG-21)


As more and more audio-visual information becomes available in digital form, there is an increasing pressure to make use of it. Before one can use any information, however, it has to be located. Unfortunately, widespread availability of interesting material makes this search extremely difficult.

For textual information, many text-based search engines, such as Google, Yahoo and AltaVista, are currently available on the worldwide web (www), and they are among the most visited sites, which indicates a real demand for searching information in the public domain. However, identifying audio-visual content is not so trivial, and no generally recognised description of these materials exists. Meanwhile, there is no efficient way of searching the www for, say, a video clip of a Pavarotti concert, or of improving the user friendliness of interconnected computers via the Internet with rich spoken queries, hand-drawn sketches and image-based queries.

The question of finding content is not restricted to database retrieval applications. For example, TV programme producers may want to search for and retrieve famous events stored among thousands of hours of audio-visual records, in order to collect material for a programme. This will reduce programme production time and increase the quality of its content. Another example is the selection of a favourite TV programme from the vast number of available satellite television channels. Currently, six to eight MPEG-2 coded TV programmes can be accommodated in a satellite transponder. Considering that each satellite can have up to 12 transponders, each usable in horizontal and vertical polarisation modes, and that satellites can be stationed within two-degree guard bands, it is not unrealistic that users may have access to thousands of TV channels. Certainly the current method of printing weekly TV programme guides will not be practical (tens of thousands of pages per week!), and a more intelligent computerised way of choosing a TV programme is needed. MPEG-7, under the name 'Multimedia content description interface', aims to address these issues and define how humans expect to interact with computers [1].

The increasing demand for searching multimedia content on the web has opened up new opportunities for creation and delivery of these items on the Internet. Today, many elements exist to build an infrastructure for the delivery and consumption of multimedia content. There is, however, no standard way of describing these elements or relating them to each other. It is hoped that such a standard will be devised by the ISO/IEC MPEG committee under the name of MPEG-21 [2].

The main aim of this standard is to specify how various elements for content creation fit together and, where a gap exists, MPEG-21 will recommend which new standards are required. MPEG will then develop the new standards as appropriate, while other bodies may develop other relevant standards. These specifications will be integrated into the multimedia framework through collaboration between MPEG and these bodies. The result is an open framework for multimedia delivery and consumption, with both content creators and content consumers as the main beneficiaries. The open framework aims to provide content creators and service providers with equal opportunities in an MPEG-21 enabled market. It will also benefit content users, providing them with access to a large variety of data in an interoperable manner.

In summary, MPEG-7 is about describing and finding content, and MPEG-21 deals with the delivery and consumption of that content. As we see, neither of these standards is about video compression, which is the main subject of this book. However, for the completeness of a book on standard codecs, we briefly describe these two new standards, which are, incidentally, also developed by the ISO/IEC MPEG standards body.

MPEG-7: multimedia content description interface

The main goal of MPEG-7 is to specify a standard set of descriptors that can be used to describe various types of multimedia information coded with the standard codecs, as well as other databases and even analogue audio-visual information. This will be in the form of defining descriptor schemes or structures and the relationship between various descriptors. The combination of the descriptors and description schemes will be associated with the content itself to allow a fast and efficient searching method for material of user interest. The audio-visual material that has MPEG-7 data associated with it can be indexed and searched for. This material may include still pictures, graphics, three-dimensional models, audio, speech, video and information about how these elements are combined in a multimedia presentation.

Figure 11.1 shows a highly abstract block diagram of the MPEG-7 mission. In this figure, object features are extracted and described in a manner which is meaningful to the search engine. As usual, MPEG-7 does not specify how the features should be extracted, nor how they should be searched for, but only the format in which the features should be described.

Figure 11.1: Scope of MPEG-7

Description levels

Since the description features must be meaningful in the context of the application, they may be defined in different ways for different applications. Hence a specific audio-visual event might be described with different sets of features if the applications differ. Visual events are first described at a lower abstraction level, by features such as shape, size, texture, colour, movement and position inside the picture frame. At this level, audio material may be described by key, mood, tempo, tempo changes and position in the sound space.

The high level of abstraction is then a description of the semantic relation between the above lower level abstractions. For the earlier example of a piece of Pavarotti's concert, the lower level of abstraction for the picture would be: his portrait, a picture of an orchestra, shapes of musical instruments etc. For the audio, this level of abstraction could be, of course, his song, as well as other background music. All these descriptions are of course coded in a way that allows them to be searched as efficiently as possible.

The level of abstraction is related to the way the required features are extracted. Many low level features can be extracted in a fully automatic manner. High level features, however, need more human interaction to define the semantic relations between the lower level features.

In addition to the description of contents, it may also be required to include other types of information about the multimedia data. For example:

  • The form: an example of the form is the coding scheme used (e.g. JPEG, MPEG-2), or the overall data size.
  • Conditions for accessing material: this would include copyright information, price etc.
  • Classification: this could include parental rating, and content classification into a number of predefined categories.
  • Links to other relevant material: this information will help the users to speed up the search operation.
  • The context: for some recorded events, it is very important to know the occasion of recording (e.g. World Cup 2002, final between Brazil and Germany).

In many cases the addition of textual information to the descriptors may be useful. Care must be taken that the usefulness of the descriptors is as language independent as possible; examples are the names of authors, films and places. However, providing text-only documents will not be among the goals of MPEG-7.

The MPEG group has also defined a laboratory reference model for MPEG-7. This time it is called the eXperimentation Model (XM), and it has the same role as the RM, TM and VM in H.261, MPEG-2 and MPEG-4, respectively.

Application area

The elements that MPEG-7 standardises will support a broad range of applications. MPEG-7 will also make the web as searchable for multimedia content as it is searchable for text today. This applies especially to large content archives, as well as to multimedia catalogues enabling people to identify content for purchase. The information used for content retrieval may also be used by agents, for the selection and filtering of broadcast material or for personalised advertising.

All application domains making use of multimedia will benefit from MPEG-7. Some of the domains that might find MPEG-7 useful are:

  • architecture, real estate and interior design (e.g. searching for ideas)
  • broadcast multimedia selection (e.g. radio and TV channels)
  • cultural service (e.g. history museums, art galleries etc.)
  • digital libraries (e.g. image catalogue, musical dictionary, biomedical imaging catalogues, film, video and radio archives)
  • e-commerce (e.g. personalised advertising, online catalogues, directories of electronic shops)
  • education (e.g. repositories of multimedia courses, multimedia search for support material)
  • home entertainment (e.g. systems for the management of personal multimedia collections, including manipulation of content, e.g. home video editing, searching a game)
  • investigation services (e.g. human characteristics recognition, forensics)
  • journalism (e.g. searching speeches of a certain politician using his name, his voice or his face)
  • multimedia directory services (e.g. yellow pages, tourist information, geographical information systems)
  • multimedia editing (e.g. personalised electronic news services, media authoring)
  • remote sensing (e.g. cartography, ecology, natural resources management)
  • shopping (e.g. searching for clothes that you like)
  • surveillance (e.g. traffic control, surface transportation, nondestructive testing in a hostile environment)

and many more.

The way MPEG-7 data will be used to answer user queries is outside the scope of the standard. In principle, any type of audio-visual material may be retrieved by means of any type of query material. For example, video material may be queried using video, music, speech etc. It is up to the search engine to match the query data and the MPEG-7 audio-visual description. A few query examples are:

  • play a few notes on a keyboard and retrieve a list of musical pieces similar to the required tune
  • draw a few lines on a screen and find a set of images containing similar graphics, logos, and ideograms
  • sketch objects, including colour patches or textures and retrieve examples among which you select the interesting objects to compose your design
  • on a given set of multimedia objects, describe movements and relations between the objects and so search for animations fulfilling the described temporal and spatial relations
  • describe actions and get a list of scenarios containing such actions
  • using an excerpt of Pavarotti's voice, obtain a list of Pavarotti's records, video clips where Pavarotti is singing and photographic material portraying Pavarotti.

Indexing and query

Research at this stage of MPEG-7 development is concentrated on two interrelated areas, indexing and query. In the former, significant events of video shots are indexed; in the latter, given a description of an event, the video shot for that event is sought. Figure 11.2 shows how the indices for a video clip can be generated.

Figure 11.2: Index generation for a video clip

In the Figure, a video programme (normally 30-90 minutes) is temporally segmented into video shots. A shot is a piece of video clip in which the picture content does not change significantly from one frame to the next, and in general there is no scene cut within a shot. Therefore, a single frame in a shot has a high correlation with all the pictures within the shot. One of these frames is chosen as the key frame. Selection of the key frame is an interesting research issue: an ideal key frame is the one that has maximum similarity with all the pictures within its own shot, but minimum similarity with those of the other shots. The key frame is then spatially segmented into objects with meaningful features. These may include colour, shape and texture, where a semantic relation between these individual features defines an object of interest. As mentioned, depending on the type of application, the same features might be described in a different order. Also, in extracting the features, other information such as the motion of the objects, background sound or sometimes text might be useful. The features are then indexed, and the indexed data, along with the key frames, is stored in the database; this descriptive data is sometimes called metadata.
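The key-frame criterion above can be sketched in code. This is an illustrative toy, not the standard's method: each frame is reduced to a grey-level histogram, and the frame whose histogram is most similar, on average, to the other frames of the shot is chosen as the key frame. The bin count and similarity measure (histogram intersection) are assumptions for illustration.

```python
def histogram(frame, bins=8, levels=256):
    """Quantise the pixel values of a frame (flat list of ints) into bins."""
    h = [0] * bins
    for p in frame:
        h[p * bins // levels] += 1
    return h

def intersection(h1, h2):
    """Histogram intersection similarity: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def key_frame(shot):
    """Return the index of the frame with maximum average similarity
    to all other frames of the shot."""
    hists = [histogram(f) for f in shot]
    best, best_score = 0, -1.0
    for i, hi in enumerate(hists):
        score = sum(intersection(hi, hj)
                    for j, hj in enumerate(hists) if j != i)
        if score > best_score:
            best, best_score = i, score
    return best
```

A real system would of course work on colour histograms of decoded frames and also penalise similarity to neighbouring shots, as noted above.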

The query process is the opposite of indexing. In this process, the database is searched for a specific visual content. Depending on how the query is defined to the search engine, the process can be very complex. For instance in our earlier example of Pavarotti's singing, the simplest form of the query is that a single frame (picture) of him, or a piece of his song is available. This picture (or song) is then matched against all the key frames in the database. If such a picture is found then, due to its index relation with the actual shot and video clip, that piece of video is located. Matching of the query picture with the key frames is under active research, since this is very different from the conventional pixel-to-pixel matching of the pictures. For example, due to motion, obstruction of the objects, shading, shearing etc. the physical dimensions of the objects of interest might change such that pixel-to-pixel matching does not necessarily find the right object. For instance, with the pixel-to-pixel matching, a circle can be more similar to a hexagon of almost the same number of pixels and intensity than to a smaller or larger circle, which is not a desired match.

The extreme complexity in the query is when the event is defined verbally or in text, like the text of Pavarotti's song. Here, this data has to be converted into audiovisual objects, to be matched with the key frames. There is no doubt that most of the future MPEG-7 activity will be focused on this extremely complex audio and image processing task.

In the following, some of the description tools used for indexing and retrieval are described. For brevity we ignore speech and audio and consider only the visual description tools. Currently there are five visual description tools that can be used for indexing. During the search, any of them, or a combination, as well as other data, say from the audio description tools, might be used for retrieval.

Colour descriptors

Colour is the most important descriptor of an object. MPEG-7 defines seven colour descriptors, to be used in combination for describing an object. These are defined in the following sections.

Colour space

Colour space is the feature that defines how the colour components are used in the other colour descriptors. For example, R, G, B (red, green, blue), Y, Cr, Cb (luminance and chrominance), HSV (hue, saturation, value) and monochrome are examples of colour spaces.

Colour quantisation

Once the colour space is defined, the colour components of each pixel are quantised to represent them with a small (manageable) number of levels or bins. These bins can then be used to represent the colour histogram of the object, that is, the distribution of the colour components over the various levels.

Dominant colour(s)

This colour descriptor is most suitable for representing local (object or image region) features where a small number of colours is enough to characterise the colour information in the region of interest. Dominant colour can also be defined for the whole image. To define the dominant colour, colour quantisation is used to extract a small number of representative colours in each region or image. The percentage of each quantised colour in the region then indicates the degree of dominance of that colour. A spatial coherency measure on the entire descriptor is also defined and is used in similarity retrieval (objects having similar dominant colours).

Scalable colour

The scalable colour descriptor is a colour histogram in the HSV colour space, encoded by a Haar transform. Its binary representation is scalable in terms of the number of bins and the bit accuracy of the representation over a wide range of data rates. The scalable colour descriptor is useful for image-to-image matching and retrieval based on colour features. Retrieval accuracy increases with the number of bits used in the representation.

Colour structure

The colour structure descriptor is a colour feature that captures both colour content (similar to a colour histogram) and information about the structure of this content (e.g. the colour of neighbouring regions). The extraction method embeds colour structure information into the descriptor by taking into account the colours in a local neighbourhood of pixels instead of considering each pixel separately. Its main usage is image-to-image matching, and it is intended for still image retrieval. Compared with the ordinary histogram, the colour structure descriptor provides additional functionality and improved similarity-based retrieval performance for natural images.

Colour layout

This descriptor specifies the spatial distribution of colours for high speed retrieval and browsing. It can be used not only for image-to-image and video-clip-to-video-clip matching, but also for layout-based colour retrieval, such as sketch-to-image matching, which is not supported by the other colour descriptors. For example, to find an object, one may sketch the object and paint it with the colour of interest. The descriptor can be applied to a whole image or to any part of it, including arbitrarily shaped regions.

GOP colour

The group of pictures (GOP) colour descriptor extends the scalable colour descriptor that is defined for still images to a video segment or a collection of still images. Before applying the Haar transform, the way the colour histogram is derived should be defined. MPEG-7 considers three ways of defining the colour histogram for GOP colour descriptor, namely average, median and intersection histogram methods.

The average histogram refers to averaging the counter value of each bin across all pictures, which is equivalent to computing the aggregate colour histogram of all pictures with proper normalisation. The median histogram refers to computing the median of the counter value of each bin across all pictures. It is more robust to roundoff errors and the presence of outliers in image intensity values compared with the average histogram. The intersection histogram refers to computing the minimum of the counter value of each bin across all pictures to capture the least common colour traits of a group of images. The same similarity/distance measures that are used to compare scalable colour descriptions can be employed to compare GOP colour descriptors.
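The three GOP histogram definitions above are simple bin-wise operations across the pictures of the group. A minimal sketch (with toy bin counts; a real implementation would operate on the Haar-transformed scalable colour representation):

```python
from statistics import median

def gop_histograms(hists):
    """hists: list of per-picture colour histograms (equal-length lists
    of bin counts). Returns the (average, median, intersection)
    GOP histograms, computed bin by bin across all pictures."""
    avg, med, inter = [], [], []
    for bin_counts in zip(*hists):          # one tuple per bin
        avg.append(sum(bin_counts) / len(bin_counts))  # aggregate histogram
        med.append(median(bin_counts))      # robust to outliers
        inter.append(min(bin_counts))       # least common colour traits
    return avg, med, inter

# Three pictures, three bins each:
a, m, i = gop_histograms([[4, 0, 2], [2, 2, 2], [0, 4, 2]])
```

Here `a == [2.0, 2.0, 2.0]`, `m == [2, 2, 2]` and `i == [0, 0, 2]`: the intersection histogram keeps only the colour content common to every picture.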

Texture descriptors

Texture is an important structural descriptor of objects. MPEG-7 defines three texture-based descriptors, which are explained in the following sections.

Homogeneous texture

Homogeneous texture has emerged as an important primitive for searching and browsing through large collections of similar looking patterns. In this descriptor, the texture features associated with the regions of an image can be used to index the image. Texture features are extracted by filtering the image at various scales and orientations. For example, using the wavelet transform with Gabor filters at, say, six orientations and four levels of decomposition, one can create 24 subimages. Each subimage reflects a particular image pattern at a certain frequency and resolution. The mean and the standard deviation of each subimage are then calculated, and the image is indexed with a 48-dimensional vector (24 means and 24 standard deviations). In image retrieval, the distance between this 48-dimensional vector of the query image and those in the database is calculated, and the image giving the minimum distance is retrieved. The homogeneous texture descriptor thus provides a precise and quantitative description of a texture that can be used for accurate search and retrieval.

Texture browsing

Texture browsing is useful for representing homogeneous texture in browsing-type applications, and is defined by at most 12 bits. It provides a perceptual characterisation of texture, similar to a human characterisation, in terms of regularity, coarseness and directionality. This descriptor is derived in a similar way to the homogeneous texture descriptor: the image is filtered with a bank of orientation- and scale-tuned filters, using Gabor functions. From the filtered image, the two dominant texture orientations are selected; three bits are needed to represent each dominant orientation (out of, say, six). The image projections along the dominant orientations are then analysed to determine the regularity (quantified by two bits) and coarseness (two bits × 2). The second dominant orientation and second scale feature are optional. This descriptor, combined with the homogeneous texture descriptor, provides a scalable solution for representing homogeneous texture regions in images.

Edge histogram

The edge histogram represents the spatial distribution of five types of edge, namely four directional edges and one nondirectional edge. It captures the distribution of edges in each of these categories. Since edges are important in image perception, they can be used to retrieve images with similar semantic meaning. The primary use of this descriptor is image-to-image matching, especially for natural images with a nonuniform edge distribution. The retrieval reliability of this descriptor is increased when it is combined with other descriptors, such as the colour histogram descriptor.
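The idea behind the edge histogram can be sketched as follows. This simplified version classifies each 2 × 2 block of an image into one of the five edge types using the small edge operators commonly cited for this descriptor; the strength threshold is an assumption for illustration, and the real descriptor aggregates such counts over 16 subimages of the picture.

```python
import math

# 2x2 edge operators: (top-left, top-right, bottom-left, bottom-right)
FILTERS = {
    "vertical":       (1, -1, 1, -1),
    "horizontal":     (1, 1, -1, -1),
    "45_degree":      (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "135_degree":     (0, math.sqrt(2), -math.sqrt(2), 0),
    "nondirectional": (2, -2, -2, 2),
}

def edge_histogram(image, threshold=10):
    """image: 2-D list of grey levels with even width and height.
    Returns counts of the five edge types over all 2x2 blocks."""
    hist = {name: 0 for name in FILTERS}
    for y in range(0, len(image) - 1, 2):
        for x in range(0, len(image[0]) - 1, 2):
            block = (image[y][x], image[y][x + 1],
                     image[y + 1][x], image[y + 1][x + 1])
            # response of each operator to the block
            strengths = {name: abs(sum(c * p for c, p in zip(coef, block)))
                         for name, coef in FILTERS.items()}
            name, strength = max(strengths.items(), key=lambda kv: kv[1])
            if strength >= threshold:   # weak blocks carry no edge
                hist[name] += 1
    return hist
```

For a block with a strong left/right intensity step, the vertical operator dominates and the block is counted in the vertical bin.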

Shape descriptors

Humans normally describe objects by their shapes, and hence shape descriptors are instrumental in finding similar shapes. An important property of a shape descriptor is invariance to rotation, scaling and displacement. MPEG-7 identifies three shape descriptors, which are defined as follows.

Region-based shapes

The shape of an object may consist of either a single region or a set of regions. Since a region-based shape descriptor makes use of all pixels constituting the shape within a picture frame, it can describe shapes of any complexity.

The shape is described as a binary plane, with black pixels within the object corresponding to 1 and the white background corresponding to 0. To reduce the amount of data needed to represent the shape, its size is reduced; MPEG-7 recommends that shapes be described at a fixed size of 17.5 bytes. The feature extraction and matching processes are straightforward enough to have a low order of computational complexity, making the descriptor suitable for tracking shapes in video data processing.

Contour-based shapes

The contour-based shape descriptor captures the characteristics of the shapes and is more similar to the human notion of understanding shapes. It is the most popular method for shape-based image retrieval. In section 11.2 we will demonstrate some of its practical applications in image retrieval.

The contour-based shape descriptor is based on the so-called curvature scale space (CSS) representation of the contour. That is, by filtering the shape at various scales (various degrees of the smoothness of the filter), a contour is smoothed at various levels. The smoothed contours are then used for matching. This method has several important properties:

  • it captures important characteristics of the shapes, enabling similarity-based retrieval
  • it reflects properties of the perception of the human visual system
  • it is robust to nonrigid motion or partial occlusion of the shape
  • it is robust to various transformations on shapes, such as rotation, scaling, zooming etc.

Three-dimensional shapes

Advances in multimedia technology have brought three-dimensional contents into today's information systems in the forms of virtual worlds and augmented reality. The three-dimensional objects are normally represented as polygonal meshes, such as those used in MPEG-4 for synthetic images and image rendering. Within the MPEG-7 framework, tools for intelligent content-based access to three-dimensional information are needed. The main applications for three-dimensional shape description are search, retrieval and browsing of three-dimensional model databases.

Motion descriptors

Motion is a feature that discriminates video from still images. It can be used as a descriptor for video segments, and the MPEG-7 standard recognises four motion-based descriptors.

Camera motion

Camera motion is a descriptor that characterises the three-dimensional camera motion parameters. These parameters can be automatically extracted or generated by the capturing devices.

The camera motion descriptor supports the following well known basic camera motions:

  • fixed: camera is static
  • panning: horizontal rotation
  • tracking: horizontal traverse movement, also known as travelling in the film industry
  • tilting: vertical rotation
  • booming: vertical traverse movement
  • zooming: change of the focal length
  • dollying: translation along the optical axis
  • rolling: rotation around the optical axis.

Subshots in which all frames are characterised by a particular camera motion, which can be single or mixed, form the building blocks of the camera motion descriptor. Each building block is described by its start time, its duration, the speed of the induced image motion, the fraction of a given temporal window that it occupies, and the focus of expansion or focus of contraction.

Motion trajectory

The motion trajectory of an object is a simple high level feature defined as the localisation in time and space of one representative point of this object. This descriptor can be useful for content-based retrieval in object-oriented visual databases. If a priori knowledge is available, the trajectory motion can be very useful. For example, in surveillance, alarms can be triggered if an object has a trajectory that looks unusual (e.g. passing through a forbidden area).
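The surveillance example above is easy to make concrete. In this sketch, a trajectory is a list of key points (time instant plus two-dimensional coordinates, as the descriptor specifies), and an alarm is raised if any key point falls inside a forbidden rectangle; the rectangle coordinates are arbitrary example values.

```python
def trajectory_alarm(key_points, forbidden):
    """key_points: list of (t, x, y) samples of an object's trajectory.
    forbidden: (xmin, ymin, xmax, ymax) rectangle.
    Returns the first time instant at which the object is inside the
    forbidden area, or None if it never enters it."""
    xmin, ymin, xmax, ymax = forbidden
    for t, x, y in key_points:
        if xmin <= x <= xmax and ymin <= y <= ymax:
            return t
    return None
```

A fuller implementation would also evaluate the optional interpolating functions between key points, since an object may cross the area between two samples.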

The descriptor is essentially a list of key points along with a set of optional interpolating functions that describe the path of the object between the key points in terms of acceleration. The key points are specified by their time instant and either two- or three-dimensional Cartesian coordinates, depending on the intended application.

Parametric motion

Parametric motion models (affine, perspective etc.) have been extensively used in image processing, including motion-based segmentation, motion estimation and global motion estimation. Some of these we have seen in Chapter 9 for motion estimation and in Chapter 10 for the global motion estimation used in the sprite of MPEG-4. Within the MPEG-7 framework, motion is a highly relevant feature, related to the spatio-temporal structure of a video and relevant to several MPEG-7-specific applications, such as storage and retrieval of video databases and hyperlinking.

The basic underlying principle is to describe the motion of an object in a video sequence in terms of its model parameters. Specifically, the affine model includes translation, rotation, scaling and combinations of them. Planar perspective models can take into account global deformations associated with perspective projections. More complex movements can be described with the quadratic motion model. Such an approach leads to a very efficient description of several types of motion, from simple translations, rotations and zooming to combinations of these.

Motion activity

Video scenes are usually classified in terms of their motion activity. For example, sports programmes are highly active and newsreader video shots represent low activity. Therefore motion can be used as a descriptor to express the activity of a given video segment.

An activity descriptor can be useful for applications such as surveillance, fast browsing, dynamic video summarisation and movement-based query. For example, if the activity descriptor shows a high activity, then during the playback the frame rate can be slowed down to make highly active scenes viewable. Another example of an application is finding all the high action shots in a news video programme.


Localisation

This descriptor specifies the position of a query object within the image, and is defined with two types of description.

Region locator

The region locator is a descriptor that enables localisation of regions within images by specifying them with a brief and scalable representation of a box or a polygon.

Spatio-temporal locator

This descriptor defines the spatio-temporal locations of regions in a video sequence, such as moving object regions, and provides localisation functionality. Its main application is hypermedia, which displays related information when the designated point is inside the object. Another major application is object retrieval: checking whether an object has passed through particular points, which can be used in, say, surveillance. The spatio-temporal locator can describe both spatially connected and nonconnected regions.


Others

MPEG-7 is an ongoing process and in the future other descriptors will be added to the above list. In this category, the most notable descriptor is face recognition.

Face recognition

The face recognition descriptor can be used to retrieve face images. The descriptor represents the projection of a face vector onto a set of basis vectors, which are representative of possible face vectors. The face recognition feature set is extracted from a normalised face image. This normalised face image contains 56 lines, each line with 46 intensity values. At the 24th row, the centres of the two eyes in each face image are located at the 16th and 31st column for the right and left eye, respectively. This normalised image is then used to extract the one-dimensional face vector that consists of the luminance pixel values from the normalised face image arranged into a one-dimensional vector in the scanning direction. The face recognition feature set is then calculated by projecting the one-dimensional face vector onto the space defined by a set of basis vectors.
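The extraction described above amounts to scanning the 56 × 46 normalised image into a 2576-element vector and taking inner products with a set of basis vectors. A minimal sketch follows; the basis vectors here are placeholders supplied by the caller, whereas in MPEG-7 they are fixed by the standard.

```python
def face_vector(image):
    """Scan a normalised 56 x 46 face image (list of rows of luminance
    values) into a one-dimensional vector in scanning order."""
    assert len(image) == 56 and all(len(row) == 46 for row in image)
    return [p for row in image for p in row]

def face_features(image, basis):
    """Project the face vector onto each basis vector (inner products)
    to obtain the face recognition feature set."""
    v = face_vector(image)
    return [sum(a * b for a, b in zip(v, vec)) for vec in basis]
```

Matching two faces then reduces to comparing their (much shorter) feature sets rather than the raw 2576-pixel vectors.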

Practical examples of image retrieval

In this section some of the methods described in the previous sections are used to demonstrate practical visual information retrieval from image databases. I have chosen texture- and shape-based retrieval for demonstration purposes, since colour- and motion-based methods are not easy to demonstrate in black and white pictures and within the limited space of this book.

Texture based image retrieval

Spatial frequency analysis of textures provides an excellent way of classifying them. The complex Gabor wavelet, which is a Gaussian function modulated by a complex exponential, is ideal for this purpose [3]. A two-dimensional complex Gabor wavelet is defined as:

(11.1)  g(x, y) = [1/(2πσxσy)] exp{-½[(x²/σx²) + (y²/σy²)] + 2πjf0x}

where σx and σy are the horizontal and vertical standard deviations of the Gaussian and f0 is the modulating (centre) frequency of the filter. Its frequency domain representation (Fourier transform) is given by:

(11.2)  G(u, v) = exp{-½[((u - f0)²/σu²) + (v²/σv²)]}

where u and v are the horizontal and vertical spatial frequencies and σu = 1/(2πσx) and σv = 1/(2πσy) are their respective standard deviations.

The g(x, y) of eqn. 11.1 can be used as the mother wavelet to decompose a signal into various levels and orientations. In Chapter 4 we showed how mother wavelets could be used in the design of discrete wavelet transform filters. The same procedure can be applied to the mother Gabor wavelet.

Figure 11.3 shows the spectrum of the Gabor filter at four levels and six orientations. It is derived by setting the lowest and highest horizontal spatial frequencies to ul = 0.05 and uh = 0.4, respectively. The intermediate frequencies are derived by constraining the bands to touch each other.
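The constraint that adjacent bands touch leads to geometric spacing of the centre frequencies, a common design for Gabor filter dictionaries. A small sketch:

```python
def gabor_centre_frequencies(u_low, u_high, scales):
    """Centre frequencies of a Gabor filter bank, spaced geometrically
    between the lowest and highest frequencies so that adjacent
    bands touch."""
    a = (u_high / u_low) ** (1.0 / (scales - 1))  # ratio between scales
    return [u_low * a ** m for m in range(scales)]
```

With the parameters of Figure 11.3 (ul = 0.05, uh = 0.4, M = 4) the scale ratio is 2, giving centre frequencies 0.05, 0.1, 0.2 and 0.4 cycles/pixel.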

Figure 11.3: Gabor filter spectrum; the contours indicate the half-peak magnitude of the filter responses in the Gabor filter dictionary. The filter parameters used are uh = 0.4, ul = 0.05, M = 4 and L = 6
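One common way to make the bands touch, following [3], is to space the centre frequencies geometrically between ul and uh; a short sketch under that assumption:

```python
# Centre frequencies for the Gabor filter bank of Figure 11.3. The bands
# touch when the centre frequencies form a geometric progression between
# the lowest and highest frequencies (a sketch following [3]).
u_l, u_h, M, L = 0.05, 0.4, 4, 6
a = (u_h / u_l) ** (1.0 / (M - 1))                # ratio between adjacent levels
centres = [u_l * a ** m for m in range(M)]        # 0.05, 0.1, 0.2, 0.4
orientations = [k * 180.0 / L for k in range(L)]  # 0, 30, 60, 90, 120, 150 degrees
```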

The above set of filters can decompose an image into 4 × 6 = 24 subimages. As we saw in Chapter 4, each subimage reflects the characteristics of the image at a specific direction and spatial resolution; hence it can analyse the textures of an image and describe them at these orientations and resolutions. For the example given above, one can calculate the mean and standard deviation of each subimage and use them as a 48-dimensional feature vector for indexing the texture of an image, given by:

(11.3)  \bar{f} = \left[\mu_{00}, \sigma_{00}, \mu_{01}, \sigma_{01}, \ldots, \mu_{35}, \sigma_{35}\right]

where μmn and σmn are the mean and standard deviation of the subimage at the mth scale and nth orientation.

To retrieve a query texture, its feature vector is compared against the feature vectors of the textures in the database. The similarity measure is the Euclidean distance between the feature vectors, and the one that gives the least distance is the most similar texture.
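A minimal sketch of this indexing and matching scheme, using random arrays in place of real Gabor-filtered subimages:

```python
import numpy as np

# Sketch of the 48-dimensional texture index (mean and standard deviation
# of each of the 24 subimages) and Euclidean-distance retrieval. Random
# arrays stand in for real Gabor filter outputs.

def feature_vector(subimages):
    """subimages: 24 arrays (4 scales x 6 orientations) of filter outputs."""
    feats = []
    for sub in subimages:
        feats.extend([np.mean(sub), np.std(sub)])
    return np.array(feats)                    # length 2 * 24 = 48

def rank_by_similarity(query, database):
    """Order database indices from the most to the least similar texture."""
    dists = [np.linalg.norm(query - f) for f in database]
    return np.argsort(dists)

rng = np.random.default_rng(1)
db = [feature_vector([rng.random((16, 16)) for _ in range(24)]) for _ in range(5)]
order = rank_by_similarity(db[2], db)   # query with database texture 2
# the query itself gives zero distance, so it is retrieved first
```

This mirrors the behaviour seen in Figure 11.4, where the query texture D5 is returned as its own closest match.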

Figure 11.4 demonstrates retrieving a texture from a texture database of 112 images. In this database, images are identified as D1 to D112. The query image, D5, is shown at the top left, along with the 12 most similar retrieved images, in the order of their similarity measure. As we see, the query texture itself, D5, is found as the closest match, followed by a visually similar texture, D54, and so on. In the Figure the similarity distance, sm, of each retrieved candidate texture is also given.

Figure 11.4: An example of texture-based image retrieval; query texture— D5

Shape based retrieval

Shapes are best described by the strength of their curvatures along their contours. A useful way of describing this strength is the curvature function, k(s, θ), defined as the instantaneous rate of change of the angle of the curve (tangent) θ over its arc length s:

(11.4)  k(s, \theta) = \frac{d\theta}{ds}
At sharp edges, where the rate of change of the angle is fast, the curvature function, k(s, θ), has large values. Hence contours can be described by some values of their curvature functions as feature points. For example, feature points can be defined as the positions of large curvature points or their zero crossings. However, since contours are normally noisy, direct derivation of the curvature function from the contour can lead to false feature points.

To eliminate these unwanted feature points, contours should be denoised with smoothing filters. Care should be taken over the degree of smoothing, since heavily filtered contours lose their feature points, while lightly filtered ones cannot get rid of the false ones. A large number of feature points also demands more storage and heavier processing for retrieval.

Filtering a contour with a set of Gaussian filters of varying degrees of smoothness is the answer to this question, the so-called scale-space representation of curves [4]. Figure 11.5 shows four smoothed contours of a shape at scaling (smoothing) factors of 16, 64, 256 and 1024. The positions of the curvature extremes on the contour at each scale are also shown. These points exhibit very well the most important structure of each contour.
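A rough sketch of this idea: smooth the x and y coordinate sequences of a closed contour with a Gaussian and estimate k = dθ/ds by discrete differences. The circular-padding details and the noisy-circle test data are assumptions for this sketch, not the exact method of [4]:

```python
import numpy as np

# Illustrative scale-space smoothing of a closed contour, followed by a
# discrete estimate of the curvature k = d(theta)/ds.

def smooth_closed(values, sigma):
    """Gaussian smoothing of a periodic coordinate sequence."""
    n = len(values)
    t = np.arange(n) - n // 2
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    # pad circularly so the contour stays closed after filtering
    padded = np.concatenate([values[-(n // 2):], values, values[:n // 2]])
    return np.convolve(padded, kernel, mode="same")[n // 2 : n // 2 + n]

def curvature(x, y):
    """k = d(theta)/ds along the contour, via discrete differences."""
    dx, dy = np.gradient(x), np.gradient(y)
    theta = np.unwrap(np.arctan2(dy, dx))
    ds = np.hypot(dx, dy)
    return np.gradient(theta) / ds

# a noisy circle of radius 10: after smoothing, the curvature should be
# close to the constant 1/10 everywhere
rng = np.random.default_rng(2)
t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
x = 10.0 * np.cos(t) + 0.1 * rng.standard_normal(400)
y = 10.0 * np.sin(t) + 0.1 * rng.standard_normal(400)
k = curvature(smooth_closed(x, 8.0), smooth_closed(y, 8.0))
```

Increasing the smoothing factor suppresses more of the noise-induced curvature extremes, exactly as the four scales of Figure 11.5 illustrate.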

Figure 11.5: A contour and the positions of its curvature extremes at four different scales

For indexing and retrieval applications, these feature points can be joined together to approximate each smoothed contour with a polygon [5]. Moving round the contour, the angle of every polygon line with the horizontal is recorded, and this set of angles is called the turning function. The turning function is the index of the shape that can be used for retrieval.
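The turning function of a polygon can be sketched as follows (angles measured against the horizontal, moving round the vertices in order):

```python
import numpy as np

# Sketch: the turning function is the sequence of angles the polygon edges
# make with the horizontal, collected while moving round the contour.

def turning_function(vertices):
    """vertices: (N, 2) polygon corners in order; returns N edge angles (radians)."""
    edges = np.roll(vertices, -1, axis=0) - vertices   # edge i joins vertex i to i+1
    return np.arctan2(edges[:, 1], edges[:, 0])

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
angles = turning_function(square)       # 0, 90, 180, -90 degrees
```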

The similarity measure is based on the minimisation of the Euclidean distance between the query turning function and the turning functions of the shapes in the database. That is:

(11.5)  d_j = \sqrt{\sum_{i=1}^{N}\left(\theta_i^{q} - \theta_i^{j}\right)^2}

where θ_i^q is the ith angle of the query turning function, θ_i^j is the ith angle of the jth shape's turning function in the database and N is the number of angles in the turning function (e.g. the number of vertices of the polygons, or of feature points on the contours). Calculation of the Euclidean distance necessitates that all the indices have an equal number of turning angles in their turning functions. This is achieved by inserting additional feature points on the contours such that all polygons have N vertices.
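The distance computation itself is a one-liner; the assertion reflects the equal-N requirement discussed above:

```python
import numpy as np

# Sketch of the similarity measure: Euclidean distance between two turning
# functions, which must share the same number of angles N.

def turning_distance(theta_q, theta_j):
    """Euclidean distance between two equal-length turning functions."""
    theta_q, theta_j = np.asarray(theta_q), np.asarray(theta_j)
    assert theta_q.shape == theta_j.shape, "turning functions must have equal N"
    return float(np.linalg.norm(theta_q - theta_j))

# identical turning functions give zero distance; the database shape with
# the smallest distance to the query is retrieved
d = turning_distance([0.0, 1.57, 3.14], [0.0, 1.57, 3.14])
```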

The new feature points should be inserted such that they too represent important curvature extremes. This is done by inserting points on the contour, one at a time, at the position where the contour has its largest distance from the polygon. Figure 11.6 shows the polygons of Figure 11.5 with the newly added vertices. The total number of vertices is chosen such that the distortion in approximating the original contour with a polygon is less than an acceptable value.
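A sketch of one insertion step under these assumptions (contour given as an ordered point list, polygon as indices into it; a real implementation would repeat this until N vertices are reached):

```python
import numpy as np

# Sketch of the vertex insertion step: find the contour point furthest
# from the approximating polygon and add it as a new vertex.

def seg_dist(p, a, b):
    """Distance from point p to the segment a-b."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def insert_vertex(contour, vertex_idx):
    """Add to vertex_idx the contour point with the largest polygon distance."""
    poly = contour[vertex_idx]
    edges = list(zip(poly, np.roll(poly, -1, axis=0)))
    dists = [min(seg_dist(p, a, b) for a, b in edges) for p in contour]
    return sorted(set(vertex_idx) | {int(np.argmax(dists))})

# a square contour with one bump at index 1; the polygon uses the corners
contour = np.array([[0, 0], [1, -1], [2, 0], [2, 1],
                    [2, 2], [1, 2], [0, 2], [0, 1]], dtype=float)
new_idx = insert_vertex(contour, [0, 2, 4, 6])
# the bump, lying furthest from the polygon, becomes a new vertex
```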

Figure 11.6: New polygons of Figure 11.5 with some added vertices

To show the retrieval efficiency of this method of shape description, Figure 11.7a shows a query shape to be searched for in a database of almost 1100 marine creatures. The query shape itself is found as the closest match, followed by the three next closest marine shapes in the order of their closeness, as shown in Figure 11.7.

Figure 11.7: a the query shape, b, c and d the three closest shapes in order

Sketch based retrieval

In the above two examples of image retrieval, it was assumed that the query image (texture or shape) is available to the user. However, there are occasions when these query images may not be available. Users might have a verbal description of objects of interest, or have in their minds an idea of a visual object.

One way of searching for visual objects without the query images is to sketch the object of interest and submit it to the search engine. Then, based on a set of retrieved objects, the user may iteratively modify his sketch, until he finds the desired object. Assume that the fish in Figure 11.8 C1 is the desired shape in a database. The user first draws a rough sketch of this fish, as he imagines it, like the one shown in Figure 11.8 A0. Based on shape similarity, the three best similar shapes in the order of their similarity to the drawn shape are shown in A1, A2 and A3.

Figure 11.8: Sketch-based shape retrieval

By inspecting these outputs, the user then realises that none of the matched fish has any fins. Adding a dorsal and a ventral fin to the sketched fish, the new query fish of B0 is created. With this new query shape, the new set of best matched shapes, in the order of their similarity to the refined sketch, becomes B1, B2 and B3.

Finally, adding an anal fin and a small adipose fin to the refined sketch, a further refined query shape, C0, is created. The new set of retrieved shapes, in the order of their similarity to the refined sketch, now becomes C1, C2 and C3. This is the last iteration step, since the desired shape, C1, comes up as the best matched shape.

This technique can also be extended to the other retrieval methods. For example, one might paint or add some texture to the above drawings. This would certainly improve the reliability of the retrieval system.

MPEG-21 multimedia framework

Today multimedia technology is so advanced that access to the vast amount of information and services from almost anywhere at any time, through ubiquitous terminals and networks, is possible. However, no complete picture exists of how different communities can best interact with each other in a complex infrastructure. Examples of these communities are the content providers, financial, communication, computer and consumer electronics sectors and their customers. Developing a common multimedia framework will facilitate cooperation between these sectors and support a more efficient implementation and integration of different models, rules, interests and content formats. This is the task given to the multimedia framework project under the name of MPEG-21. The name is chosen to signify the coincidence of the start of the project with the 21st century.

The chain of multimedia content delivery encompasses content creation, production, delivery and consumption. To support this, the content has to be identified, described, managed and protected. The transport and delivery of content will undoubtedly be over a heterogeneous set of terminals and networks. Reliable delivery of contents, management of personal data, financial transactions and user privacy are some of the issues that the multimedia framework should take into account. In the following sections the seven architectural key elements that the multimedia framework considers instrumental in the realisation of the task are explained.

Digital item declaration

In multimedia communication to facilitate a wide range of actions involving digital items, there is a strong need for a concrete description for defining exactly what constitutes such an item. Clearly, there are many kinds of content, and nearly as many possible ways of describing it. This presents a strong challenge to lay out a powerful and flexible model for a digital item, from which the content can be described more accurately. Such a model is only useful if it yields a format that can be used to represent any digital item defined within the model unambiguously, and communicate it successfully.

Consider a simple web page as a digital item. This web page typically consists of an HTML (hypertext markup language) document with embedded links or dependencies to various image files (e.g. JPEG images) and possibly some layout information (e.g. style sheet). In this simple case, it is a straightforward exercise to inspect the HTML document and deduce that this digital item consists of the HTML document itself plus all the other resources upon which it depends.

Now let us complicate the above example by requiring that the web page be viewed with embedded JavaScript. The presence of the language logic now raises the question of what constitutes this digital item and how it can be unambiguously determined. The first problem is that the addition of the scripting code changes the declaration of the links, since the links can be determined only by running the embedded script on the specific platform. This could still work as a method of deducing the structure of a digital item, assuming that the author intended each translated version of the web page to be a separate and distinct digital item. This assumption creates a second problem, as it is ambiguous whether the author actually intends each translation of the page to be a standalone digital item, or whether the intention is for the digital item to consist of the page with the language choice left unresolved. If the latter is the case, it is impossible to deduce the exact set of resources of which this digital item consists, which leads back to the first problem. In the course of standardisation, MPEG-21 aims to come up with a standard way of defining and declaring digital items.

Digital item identification and description

Currently, the majority of content lacks identification and description. Moreover, there is no mechanism to ensure that this identity and description are persistently associated with the content, which hinders any kind of efficient content storage.

However, some identifiers have been successfully implemented and in common use for several years, although each is defined for a single media type. ISBN, the International Standard Book Number, and URN, the Uniform Resource Name, are two examples of how a digital item can be identified. This is just the beginning, and in the future we will see more of these.

There are many examples of businesses which have requirements for the deployment of a unique identification system on a global scale. Proprietary solutions, such as labelling and watermarking for the insertion, modification and extraction of IDs, have emerged in the past. However, no international standard is available today for the deployment of such technologies, and it is the second task of MPEG-21 to identify and describe them.

Content handling and usage

The availability of, and access to, content within networks is increasing exponentially over time. With the goal of MPEG-21 being to enable transparent use of this content over a variety of networks and devices, it becomes extremely important that standards exist to facilitate searching, locating, caching, archiving, routing, distributing and using content. In addition, the content has to be relevant to customer needs and provide a better return on investment for the business.

Thus the goal of the MPEG-21 multimedia framework is to provide interfaces and protocols that enable creation, manipulation, search, access, storage, delivery and use of content across the content creation and consumption chain. Emphasis should be placed on improving the interaction model for users, with personalisation and content handling.

Intellectual property and management

MPEG-21 should provide a uniform framework that enables users to express their rights and interests in, and agreements related to, digital items. They should be assured that those rights, interests and agreements will be persistently and reliably managed and protected across a wide range of networks and devices.

Terminals and networks

Access to heterogeneous content from many networked devices is becoming widespread. Today we receive a variety of information through set-top boxes for terrestrial/cable/satellite networks, personal digital assistants, mobile phones etc. Additionally, these access devices are used in different locations and environments. This makes it difficult for service providers to ensure that content is available anywhere, at any time, and can be used and rendered in a meaningful way.

The goal of MPEG-21 is to enable transparent use of multimedia resources across a wide range of networked devices. This inevitably has an impact on the way network and terminal resources themselves are being dealt with.

Users accessing content should be offered services with a known subjective quality, perhaps at a known or agreed price. They should be shielded from network and terminal installation, management and implementation issues.

From the network point of view, it is desirable that the application serving the user can translate the user requirements into a network quality of service (QOS) contract. This contract, containing a summary of negotiated network parameters, is handled between the user or his agents and the network. It guarantees the delivery of service over the network for a given QOS. However, the actual implementation of network QOS does not fall within the scope of MPEG-21. The intent is to make use of these mechanisms and propose requirements to network QOS functionality extensions to fulfil the overall MPEG-21 QOS demands.

Content representation

Content is the most important element of a multimedia framework. Within the framework, content is coded, identified, described, stored, delivered, protected, transacted, consumed etc.

Although MPEG-21 assumes that content is available in digital form, it must still be represented in a form that fulfils certain requirements. For example, digital video as a digital item needs to be compressed and converted into a format in which it can be stored more economically. Although there are several standards for efficient compression (representation) of image and video, they have been devised for specific purposes. Throughout the book we have seen that JPEG and JPEG2000 are among the standards for coding of still images. For video, H.261, H.263, H.26L, MPEG-1 and MPEG-2 are used for frame-based video, and MPEG-4 for coding of arbitrarily shaped objects. But this is not enough for unique and unambiguous representation of digital video items. The same is true for audio.

In fact users are becoming more mobile and have a need to access information on multiple devices in different contexts at different locations. Currently, content providers and authors have to create multiple formats of content and deploy them in a multitude of networks. Also, no satisfactory automated configurable way of delivering and consuming content exists that scales automatically to different network characteristics and device profiles. This is despite the introduction of various scalable coding of video to alleviate some of these problems.

The content representation element of the framework is intended to address the technology needed in order that content is represented in a way adequate for pursuing the general objectives of MPEG-21. In this regard MPEG-21 assumes that content consists of one or a combination of:

  1. content represented by MPEG standards (e.g. MPEG-4 video)
  2. content used by MPEG but not covered by MPEG standards (e.g. plain text, HTML etc.)
  3. content that can be represented by 1 and 2 but is represented by different standards or proprietary specifications
  4. future standards for other sensory media.

Event reporting

Every interaction is an event, and events need to be reported. However, there are a number of difficulties in providing an accurate report about an event: different observers of the event may have different perspectives, needs and focuses, and currently there exists no standardised means of event reporting.

In a multimedia environment there are many events that need reporting: for example, accurate product cost, consumer cost, channel cost or profitability information. Such reports allow users to understand operational processes and simulate their dynamics in order to optimise efficiency and output.

Moreover, every industry reports information about its performance to other users, but a number of issues make it difficult for the receiving users to process this information: different reporting formats, different standards from country to country, different currencies, different languages and so on. As the last key architectural element, MPEG-21 intends to standardise event reporting, to eliminate these shortfalls.


1 ISO/IEC JTC1/SC29/WG11/N4031: 'Overview of the MPEG-7 standard'. Singapore, March 2001

2 ISO/IEC JTC1/SC29/WG11/N4040: 'Study on MPEG-21 (digital audiovisual framework) part 1'. Singapore, March 2001

3 LEE, T.S.: 'Image representation using 2D Gabor wavelets', IEEE Trans. Pattern Anal. Mach. Intell., 1996, 18:10, pp.959–971

4 IZQUIERDO, E., and GHANBARI, M.: 'Nonlinear Gaussian filtering approach for object segmentation', IEE Proc. Vis. Image Signal Process., 1999, 146:3, pp. 137–143

5 ARKIN, E., CHEW, L., HUTTENLOCHER, D., and MITCHELL, J.: 'An efficiently computable metric for comparing polygonal shapes', IEEE Trans. Pattern Anal. Mach. Intell., 1991, PAMI-13, pp.209–216

Standard Codecs: Image Compression to Advanced Video Coding (IET Telecommunications Series)
ISBN: 0852967101
Year: 2005
Authors: M. Ghanbari