11.1 MPEG-7: multimedia content description interface

The main goal of MPEG-7 is to specify a standard set of descriptors that can be used to describe various types of multimedia information, whether coded with the standard codecs, stored in other databases or even held as analogue audio-visual material. The standard defines description schemes (structures of descriptors) and the relationships between the various descriptors. Descriptors and description schemes are associated with the content itself, to allow fast and efficient searching for material of interest to the user. Audio-visual material that has MPEG-7 data associated with it can thus be indexed and searched for. This material may include still pictures, graphics, three-dimensional models, audio, speech, video and information about how these elements are combined in a multimedia presentation.

Figure 11.1 shows a highly abstract block diagram of the MPEG-7 mission. In this Figure, object features are extracted and described in a manner that is meaningful to the search engine. As usual, MPEG-7 specifies neither how features should be extracted nor how they should be searched for, but only the format in which the extracted features should be described.

Figure 11.1: Scope of MPEG-7

11.1.1 Description levels

Since the description features must be meaningful in the context of the application, they may be defined in different ways for different applications. Hence a specific audio-visual event might be described with different sets of features, depending on the application. Visual events are first described at a low level of abstraction, in terms of shape, size, texture, colour, movement and position inside the picture frame. At this level, audio material may be described by key, mood, tempo, tempo changes and position in the sound space.

The higher level of abstraction then describes the semantic relations between the above low level abstractions. For the earlier example of a piece of Pavarotti's concert, the low level abstractions for the picture would be his portrait, a picture of an orchestra, the shapes of musical instruments etc. For the audio, this level of abstraction could of course be his song, as well as the background music. All these descriptions are coded in a form that can be searched as efficiently as possible.

The level of abstraction is related to the way the required features are extracted. Many low level features can be extracted in a fully automatic manner. High level features, however, need more human interaction to define the semantic relations between the lower level features.

In addition to the description of contents, it may also be required to include other types of information about the multimedia data. For example:

  • The form: an example of the form is the coding scheme used (e.g. JPEG, MPEG-2), or the overall data size.

  • Conditions for accessing material: this would include copyright information, price etc.

  • Classification: this could include parental rating, and content classification into a number of predefined categories.

  • Links to other relevant material: this information will help the users to speed up the search operation.

  • The context: for some recorded events, it is very important to know the occasion of recording (e.g. World Cup 2002, final between Brazil and Germany).

In many cases the addition of textual information to the descriptors may be useful. Care must be taken that the usefulness of the descriptors is as independent of the language as possible, for example when giving the names of authors, films and places. However, handling text-only documents is not among the goals of MPEG-7.

The MPEG group has also defined a laboratory reference model for MPEG-7. This time it is called the eXperimentation model (XM), and it plays the same role as the reference model (RM), test model (TM) and verification model (VM) of H.261, MPEG-2 and MPEG-4, respectively.

11.1.2 Application area

The elements that MPEG-7 standardises will support a broad range of applications. MPEG-7 will also make the web as searchable for multimedia content as it is for text today. This applies especially to large content archives, as well as to multimedia catalogues that enable people to identify content for purchase. The information used for content retrieval may also be used by agents for the selection and filtering of broadcast material, or for personalised advertising.

All application domains that make use of multimedia will benefit from MPEG-7. Among the domains that may find MPEG-7 useful are:

  • architecture, real estate and interior design (e.g. searching for ideas)

  • broadcast multimedia selection (e.g. radio and TV channels)

  • cultural service (e.g. history museums, art galleries etc.)

  • digital libraries (e.g. image catalogue, musical dictionary, biomedical imaging catalogues, film, video and radio archives)

  • e-commerce (e.g. personalised advertising, online catalogues, directories of electronic shops)

  • education (e.g. repositories of multimedia courses, multimedia search for support material)

  • home entertainment (e.g. systems for the management of personal multimedia collections, including manipulation of content, e.g. home video editing, searching a game)

  • investigation services (e.g. human characteristics recognition, forensics)

  • journalism (e.g. searching speeches of a certain politician using his name, his voice or his face)

  • multimedia directory services (e.g. yellow pages, tourist information, geographical information systems)

  • multimedia editing (e.g. personalised electronic news services, media authoring)

  • remote sensing (e.g. cartography, ecology, natural resources management)

  • shopping (e.g. searching for clothes that you like)

  • surveillance (e.g. traffic control, surface transportation, nondestructive testing in a hostile environment)

and many more.

The way MPEG-7 data will be used to answer user queries is outside the scope of the standard. In principle, any type of audio-visual material may be retrieved by means of any type of query material. For example, video material may be queried using video, music, speech etc. It is up to the search engine to match the query data and the MPEG-7 audio-visual description. A few query examples are:

  • play a few notes on a keyboard and retrieve a list of musical pieces similar to the required tune

  • draw a few lines on a screen and find a set of images containing similar graphics, logos, and ideograms

  • sketch objects, including colour patches or textures and retrieve examples among which you select the interesting objects to compose your design

  • on a given set of multimedia objects, describe movements and relations between the objects and so search for animations fulfilling the described temporal and spatial relations

  • describe actions and get a list of scenarios containing such actions

  • using an excerpt of Pavarotti's voice, obtain a list of Pavarotti's records, video clips in which Pavarotti is singing and photographic material portraying Pavarotti.

11.1.3 Indexing and query

Research at this stage of MPEG-7 development is concentrated on two interrelated areas: indexing and query. In the former, significant events of video shots are indexed; in the latter, given a description of an event, the video shot for that event is sought. Figure 11.2 shows how the indices for a video clip can be generated.

Figure 11.2: Index generation for a video clip

In the Figure a video programme (normally 30-90 minutes) is temporally segmented into video shots. A shot is a piece of video in which the picture content does not change significantly from one frame to the next, and in general there is no scene cut within a shot. Therefore, any single frame in a shot has a high correlation with all the other pictures in the shot. One of these frames is chosen as the key frame. Selection of the key frame is an interesting research issue: an ideal key frame has maximum similarity with all the pictures within its own shot, but minimum similarity with those of the other shots. The key frame is then spatially segmented into objects with meaningful features. These may include colour, shape and texture, where a semantic relation between these individual features defines an object of interest. As mentioned, depending on the type of application, the same features might be described differently. In extracting the features, other information such as the motion of the objects, the background sound or sometimes text might also be useful. The features are then indexed, and the indexed data, along with the key frames, is stored in the database; such descriptive data is sometimes called metadata.
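
As an illustration of this indexing chain, the following minimal Python sketch segments a frame sequence into shots by thresholding the histogram difference between consecutive frames, and then picks as key frame the frame closest to the shot's average histogram. The grey-level histogram and the threshold value are illustrative choices, not part of the standard.

```python
import numpy as np

def grey_histogram(frame, bins=64):
    """Normalised grey-level histogram of one frame (H x W uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def segment_into_shots(frames, threshold=0.4):
    """Split a frame sequence into shots wherever the histogram difference
    between consecutive frames is large (a crude scene-cut detector)."""
    shots, current = [], [0]
    prev = grey_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = grey_histogram(frames[i])
        if np.abs(hist - prev).sum() > threshold:  # likely scene cut
            shots.append(current)
            current = []
        current.append(i)
        prev = hist
    shots.append(current)
    return shots  # each shot is a list of frame indices

def key_frame(frames, shot):
    """Pick the frame most similar to all others in the shot: the one whose
    histogram is closest to the shot's mean histogram."""
    hists = np.array([grey_histogram(frames[i]) for i in shot])
    mean = hists.mean(axis=0)
    return shot[int(np.argmin(np.abs(hists - mean).sum(axis=1)))]
```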

The query process is the opposite of indexing: the database is searched for a specific visual content. Depending on how the query is presented to the search engine, the process can be very complex. For instance, in our earlier example of Pavarotti's singing, the simplest form of query is a single frame (picture) of him, or a piece of his song. This picture (or song) is then matched against all the key frames in the database. If such a picture is found, then, through its index relation with the actual shot and video clip, that piece of video is located. Matching the query picture with the key frames is an area of active research, since it is very different from conventional pixel-to-pixel matching of pictures. For example, due to motion, occlusion of objects, shading, shearing etc., the physical dimensions of the objects of interest may change, so that pixel-to-pixel matching does not necessarily find the right object. For instance, with pixel-to-pixel matching a circle can appear more similar to a hexagon of almost the same number of pixels and intensity than to a smaller or larger circle, which is not a desired match.

The query is at its most complex when the event is defined verbally or in text, like the text of Pavarotti's song. Here, the data has to be converted into audio-visual objects to be matched with the key frames. There is no doubt that much of the future MPEG-7 activity will be focused on this extremely complex audio and image processing task.

In the following, some of the description tools used for indexing and retrieval are described. For brevity we ignore speech and audio and consider only the visual description tools. Currently there are five classes of visual description tools that can be used for indexing. During the search, any of them, or a combination, as well as other data, say from the audio description tools, might be used for retrieval.

11.1.4 Colour descriptors

Colour is the most important descriptor of an object. MPEG-7 defines seven colour descriptors, to be used in combination for describing an object. These are defined in the following sections.

11.1.4.1 Colour space

Colour space is the feature that defines how the colour components are used by the other colour descriptors. For example, R, G, B (red, green, blue); Y, Cr, Cb (luminance and chrominance); HSV (hue, saturation, value); or monochrome are the types describing colour space.

11.1.4.2 Colour quantisation

Once the colour space is defined, the colour components of each pixel are quantised so that they are represented with a small (manageable) number of levels or bins. These bins can then be used to form the colour histogram of the object, that is, the distribution of the colour components over the various levels.
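
As a concrete example, the sketch below builds a uniformly quantised RGB histogram. The choice of eight levels per channel (8^3 = 512 bins) is illustrative; the number of bins depends on the chosen operating point.

```python
import numpy as np

def quantised_histogram(image, bins_per_channel=8):
    """Quantised colour histogram of an RGB image (H x W x 3, uint8).

    Each channel is uniformly quantised to bins_per_channel levels,
    giving bins_per_channel**3 bins in total."""
    q = (image.astype(np.uint32) * bins_per_channel) // 256  # 0..bins-1 per channel
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()  # normalised so the bins sum to one
```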

11.1.4.3 Dominant colour(s)

This colour descriptor is most suitable for representing local (object or image region) features where a small number of colours is enough to characterise the colour information in the region of interest; it can also be defined for the whole image. To define the dominant colours, colour quantisation is used to extract a small number of representative colours in each region or image. The percentage of each quantised colour in the region then indicates the degree of dominance of that colour. A spatial coherency measure over the entire descriptor is also defined and is used in similarity retrieval (finding objects having similar dominant colours).
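
A common (though not normative) way of extracting dominant colours is to cluster the pixels and report each cluster centre together with its share of the region, as in this plain k-means sketch:

```python
import numpy as np

def dominant_colours(image, k=4, iters=20, seed=0):
    """Approximate the k dominant colours of an RGB image and the
    percentage of pixels assigned to each, using plain k-means."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    pixels = pixels[::max(1, len(pixels) // 20000)]  # subsample for speed
    rng = np.random.default_rng(seed)
    centres = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(iters):
        # assign every pixel to its nearest colour centre
        d = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    share = np.bincount(labels, minlength=k) / len(pixels)
    return centres.astype(np.uint8), share  # colours and their dominance
```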

11.1.4.4 Scalable colour

The scalable colour descriptor is a colour histogram in the HSV colour space, encoded by a Haar transform. Its binary representation is scalable in terms of the number of bins and the bit accuracy of the representation over a wide range of data rates. The scalable colour descriptor is useful for image-to-image matching and retrieval based on colour features; retrieval accuracy increases with the number of bits used in the representation.
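
The sketch below shows the underlying idea: a one-dimensional Haar transform packs the coarse information of the histogram into the low-index coefficients, so a scalable representation is obtained simply by keeping fewer coefficients. The actual MPEG-7 coefficient selection and quantisation are more elaborate.

```python
import numpy as np

def haar_transform(hist):
    """One-dimensional Haar transform of a histogram whose length is a
    power of two; low-index coefficients carry the coarse shape."""
    coeffs = hist.astype(np.float64).copy()
    n = len(coeffs)
    while n > 1:
        half = n // 2
        sums = (coeffs[0:n:2] + coeffs[1:n:2]) / 2.0   # local averages
        diffs = (coeffs[0:n:2] - coeffs[1:n:2]) / 2.0  # local details
        coeffs[:half], coeffs[half:n] = sums, diffs
        n = half
    return coeffs

# Scalability: a coarse descriptor keeps only the first few coefficients.
hist = np.random.default_rng(1).random(256)
coarse = haar_transform(hist)[:16]  # 16-coefficient representation
```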

11.1.4.5 Colour structure

The colour structure descriptor is a colour feature that captures both colour content (similar to a colour histogram) and information about the structure of this content (e.g. the colours of neighbouring regions). The extraction method embeds colour structure information into the descriptor by taking into account the colours in a local neighbourhood of pixels, instead of considering each pixel separately. Its main usage is image-to-image matching, and it is intended for still image retrieval. Compared with the ordinary histogram, the colour structure descriptor provides additional functionality and improved similarity-based retrieval performance for natural images.

11.1.4.6 Colour layout

This descriptor specifies the spatial distribution of colours for high speed retrieval and browsing. It can be used not only for image-to-image and video-clip-to-video-clip matching, but also for layout-based colour retrieval, such as sketch-to-image matching, which is not supported by the other colour descriptors. For example, to find an object, one may sketch the object and paint it with the colour of interest. The descriptor can be applied either to a whole image or to any part of it, including arbitrarily shaped regions.

11.1.4.7 GOP colour

The group of pictures (GOP) colour descriptor extends the scalable colour descriptor, defined for still images, to a video segment or a collection of still images. Before the Haar transform is applied, the way the colour histogram is derived over the group must be defined. MPEG-7 considers three ways of defining the colour histogram for the GOP colour descriptor, namely the average, median and intersection histogram methods.

The average histogram refers to averaging the counter value of each bin across all pictures, which is equivalent to computing the aggregate colour histogram of all pictures with proper normalisation. The median histogram refers to computing the median of the counter value of each bin across all pictures. It is more robust to roundoff errors and the presence of outliers in image intensity values compared with the average histogram. The intersection histogram refers to computing the minimum of the counter value of each bin across all pictures to capture the least common colour traits of a group of images. The same similarity/distance measures that are used to compare scalable colour descriptions can be employed to compare GOP colour descriptors.
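
Assuming the per-picture histograms have already been computed (one row per picture), the three combination rules can be sketched as follows:

```python
import numpy as np

def gop_histogram(histograms, method="average"):
    """Combine per-picture colour histograms (one row each) into a single
    group-of-pictures histogram, prior to the Haar transform."""
    h = np.asarray(histograms, dtype=np.float64)
    if method == "average":         # aggregate colour content
        gop = h.mean(axis=0)
    elif method == "median":        # robust to outlier pictures
        gop = np.median(h, axis=0)
    elif method == "intersection":  # least common colour traits
        gop = h.min(axis=0)
    else:
        raise ValueError(method)
    return gop / gop.sum()
```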

11.1.5 Texture descriptors

Texture is an important structural descriptor of objects. MPEG-7 defines three texture-based descriptors, which are explained in the following sections.

11.1.5.1 Homogeneous texture

Homogeneous texture has emerged as an important primitive for searching and browsing through large collections of similar looking patterns. In this descriptor, the texture features associated with the regions of an image are used to index the image. Texture features are extracted by filtering the image at various scales and orientations. For example, using a wavelet transform with Gabor filters at, say, six orientations and four levels of decomposition, one can create 24 subimages. Each subimage reflects a particular image pattern at a certain frequency and resolution. The mean and the standard deviation of each subimage are then calculated, and the image is indexed with a 48-dimensional vector (24 means and 24 standard deviations). In image retrieval, the distance between this 48-dimensional vector of the query image and those in the database is calculated, and the image giving the minimum distance is retrieved. The homogeneous texture descriptor thus provides a precise and quantitative description of a texture that can be used for accurate search and retrieval.
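
The sketch below illustrates the idea with a small bank of real-valued Gabor filters; the kernel size, wavelengths and bandwidths are illustrative and do not reproduce the normative MPEG-7 frequency layout.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, wavelength, theta, sigma):
    """Real-valued Gabor kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

def texture_vector(image, orientations=6, scales=4):
    """48-dimensional texture vector: mean and standard deviation of the
    image filtered at each orientation and scale (6 x 4 = 24 channels)."""
    feats = []
    for s in range(scales):
        wavelength = 4.0 * 2 ** s  # coarser patterns at higher scales
        for o in range(orientations):
            theta = o * np.pi / orientations
            k = gabor_kernel(31, wavelength, theta, sigma=0.5 * wavelength)
            resp = fftconvolve(image.astype(np.float64), k, mode="same")
            feats += [resp.mean(), resp.std()]
    return np.array(feats)  # 24 means interleaved with 24 standard deviations
```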

11.1.5.2 Texture browsing

Texture browsing is useful for representing homogeneous texture in browsing type applications, and is defined by at most 12 bits. It provides a perceptual characterisation of texture, similar to a human characterisation, in terms of regularity, coarseness and directionality. This descriptor is derived in a similar way to the homogeneous texture descriptor of section 11.1.5.1: the image is filtered with a bank of orientation- and scale-tuned filters, using Gabor functions, and the two dominant texture orientations are selected from the filtered image. Three bits are needed to represent each of the dominant orientations (out of, say, six). The filtered image projections along the dominant orientations are then analysed to determine the regularity (quantified by two bits) and the coarseness (two bits for each dominant orientation). The second dominant orientation and the second scale feature are optional. This descriptor, combined with the homogeneous texture descriptor, provides a scalable solution for representing homogeneous texture regions in images.

11.1.5.3 Edge histogram

The edge histogram represents the spatial distribution of five types of edge, namely four directional edges (horizontal, vertical and the two diagonals) and one nondirectional edge. It captures the local distribution of edges across the image. Since edges are important in image perception, they can be used to retrieve images with similar semantic meaning. The primary use of this descriptor is image-to-image matching, especially for natural images with nonuniform edge distributions. The retrieval reliability of this descriptor increases when it is combined with other descriptors, such as the colour histogram descriptor.
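
A sketch of a block-based extraction follows. The 2x2 filters are a common choice for the five edge categories, and the threshold is illustrative; the full descriptor computes such a histogram separately over subimages of the picture rather than globally as here.

```python
import numpy as np

# 2x2 edge detectors for the five edge categories (illustrative values)
FILTERS = {
    "vertical":       np.array([[1.0, -1.0], [1.0, -1.0]]),
    "horizontal":     np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "diagonal_45":    np.array([[2 ** 0.5, 0.0], [0.0, -2 ** 0.5]]),
    "diagonal_135":   np.array([[0.0, 2 ** 0.5], [-2 ** 0.5, 0.0]]),
    "nondirectional": np.array([[2.0, -2.0], [-2.0, 2.0]]),
}

def edge_histogram(image, threshold=11.0):
    """Five-bin edge histogram of a grey-level image: each 2x2 block is
    labelled with the edge type giving the strongest filter response,
    provided that response exceeds the threshold."""
    img = image.astype(np.float64)
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    blocks = img[:h, :w].reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    hist = dict.fromkeys(FILTERS, 0)
    for block in blocks.reshape(-1, 2, 2):
        responses = {name: abs((block * f).sum()) for name, f in FILTERS.items()}
        best = max(responses, key=responses.get)
        if responses[best] > threshold:
            hist[best] += 1
    return hist
```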

11.1.6 Shape descriptors

Humans normally describe objects by their shapes, and hence shape descriptors are very instrumental in finding similar shapes. An important requirement on a shape descriptor is that it be invariant to rotation, scaling and displacement. MPEG-7 identifies three shape descriptors, which are defined as follows.

11.1.6.1 Region-based shapes

The shape of an object may consist of either a single region or a set of regions. Since a region-based shape descriptor makes use of all pixels constituting the shape within a picture frame, it can describe shapes of any complexity.

The shape is described as a binary plane, with pixels inside the object set to 1 and the background set to 0. To reduce the data required to represent the shape, its size is reduced; MPEG-7 recommends that shapes be described at a fixed data size of 17.5 bytes. The feature extraction and matching processes are of low computational complexity, making the descriptor suitable for tracking shapes in video data processing.

11.1.6.2 Contour-based shapes

The contour-based shape descriptor captures the characteristic features of a shape's contour and is closer to the human notion of shape. It is the most popular approach to shape-based image retrieval. In section 11.2 we will demonstrate some of its practical applications in image retrieval.

The contour-based shape descriptor is based on the so-called curvature scale space (CSS) representation of the contour. That is, by filtering the shape at various scales (various degrees of smoothness of the filter), the contour is smoothed at various levels, and the smoothed contours are then used for matching (a sketch of this multiscale smoothing follows the list below). This method has several important properties:

  • it captures important characteristics of the shapes, enabling similarity-based retrieval

  • it reflects properties of the perception of the human visual system

  • it is robust to nonrigid motion or partial occlusion of the shape

  • it is robust to various transformations on shapes, such as rotation, scaling, zooming etc.
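
A minimal sketch of the multiscale contour smoothing that underlies CSS, assuming the contour is given as N uniformly spaced points; a full CSS implementation would additionally track the curvature zero crossings of each smoothed contour as the scale grows.

```python
import numpy as np

def smooth_contour(points, sigma):
    """Smooth a closed contour (N x 2 array of x, y points) by circular
    convolution with a Gaussian of width sigma (in samples)."""
    n = len(points)
    t = np.arange(n)
    t = np.minimum(t, n - t)  # circular distance from sample 0
    g = np.exp(-t.astype(float) ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    # circular convolution via FFT, applied to x and y independently
    return np.real(np.fft.ifft(np.fft.fft(points, axis=0) *
                               np.fft.fft(g)[:, None], axis=0))

# The contour evolves towards a convex shape as sigma grows; the CSS image
# records where curvature zero crossings appear and vanish along the way.
angles = np.linspace(0, 2 * np.pi, 128, endpoint=False)
contour = np.column_stack([np.cos(angles) + 0.3 * np.cos(3 * angles),
                           np.sin(angles) + 0.3 * np.sin(3 * angles)])
scales = [smooth_contour(contour, s) for s in (1, 2, 4, 8)]
```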

11.1.6.3 Three-dimensional shapes

Advances in multimedia technology have brought three-dimensional content into today's information systems, in the form of virtual worlds and augmented reality. Three-dimensional objects are normally represented as polygonal meshes, such as those used in MPEG-4 for synthetic images and image rendering. Within the MPEG-7 framework, tools for intelligent content-based access to three-dimensional information are needed. The main applications of three-dimensional shape description are search, retrieval and browsing of three-dimensional model databases.

11.1.7 Motion descriptors

Motion is a feature which discriminates video from still images. It can be used as a descriptor for video segments and the MPEG-7 standard recognises four motion-based descriptors.

11.1.7.1 Camera motion

Camera motion is a descriptor that characterises the three-dimensional camera motion parameters. These parameters can be automatically extracted or generated by the capturing devices.

The camera motion descriptor supports the following well known basic camera motion:

  • fixed: camera is static

  • panning: horizontal rotation

  • tracking: horizontal traverse movement, also known as travelling in the film industry

  • tilting: vertical rotation

  • booming: vertical traverse movement

  • zooming: change of the focal length

  • dollying: translation along the optical axis

  • rolling: rotation around the optical axis.

The building blocks of the camera motion descriptor are subshots in which all frames are characterised by a particular camera motion, which can be single or mixed. Each building block is described by its start time, its duration, the speed of the induced image motion, the fraction of time of its duration relative to a given temporal window size, and the focus of expansion or focus of contraction.

11.1.7.2 Motion trajectory

The motion trajectory of an object is a simple high level feature, defined as the localisation in time and space of one representative point of the object. This descriptor can be useful for content-based retrieval in object-oriented visual databases. If a priori knowledge is available, the trajectory can be very useful: in surveillance, for example, alarms can be triggered if an object follows a trajectory that looks unusual (e.g. passing through a forbidden area).

The descriptor is essentially a list of key points along with a set of optional interpolating functions that describe the path of the object between the key points in terms of acceleration. The key points are specified by their time instant and either two- or three-dimensional Cartesian coordinates, depending on the intended application.
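
A minimal sketch of such a descriptor, with piecewise linear interpolation standing in for the optional higher-order (accelerated) interpolating functions; the class and method names are illustrative.

```python
import numpy as np

class Trajectory:
    """Motion trajectory stored as key points (time, x, y); key point
    times must be increasing."""
    def __init__(self, times, xs, ys):
        self.t = np.asarray(times, dtype=float)
        self.x = np.asarray(xs, dtype=float)
        self.y = np.asarray(ys, dtype=float)

    def position(self, t):
        """Interpolated position(s) at time(s) t."""
        return np.interp(t, self.t, self.x), np.interp(t, self.t, self.y)

    def passes_through(self, box, samples=200):
        """True if the interpolated path enters the (x0, y0, x1, y1) box,
        e.g. a forbidden area in a surveillance application."""
        ts = np.linspace(self.t[0], self.t[-1], samples)
        x, y = self.position(ts)
        x0, y0, x1, y1 = box
        return bool(np.any((x >= x0) & (x <= x1) & (y >= y0) & (y <= y1)))
```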

11.1.7.3 Parametric motion

Parametric motion models (affine, perspective etc.) have been extensively used in image processing, including motion-based segmentation, motion estimation and global motion estimation. Some of these we have seen in Chapter 9 for motion estimation, and in Chapter 10 for the global motion estimation used in the sprites of MPEG-4. Within the MPEG-7 framework, motion is a highly relevant feature, related to the spatio-temporal structure of a video and relevant to several MPEG-7 specific applications, such as storage and retrieval of video databases and hyperlinking.

The basic underlying principle is to describe the motion of an object in a video sequence in terms of its model parameters. Specifically, the affine model covers translations, rotations, scaling and combinations of them; planar perspective models can additionally account for the global deformations associated with perspective projection; and more complex movements can be described with the quadratic motion model. This approach leads to a very efficient description of several types of motion, from simple translations, rotations and zooms to more complex combinations of these.
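
As a sketch, the six-parameter affine model expresses the displacement (vx, vy) of a pixel as a linear function of its position; the parameter names a1..a6 follow common usage rather than the normative descriptor syntax.

```python
import numpy as np

def affine_motion(params, x, y):
    """Six-parameter affine motion model:
        vx = a1 + a2*x + a3*y
        vy = a4 + a5*x + a6*y
    Pure translation is (a1, a4) with the rest zero; setting a2 = a6 and
    a3 = -a5 gives a rotation combined with a zoom."""
    a1, a2, a3, a4, a5, a6 = params
    vx = a1 + a2 * x + a3 * y
    vy = a4 + a5 * x + a6 * y
    return vx, vy

# Example: a small rotation-plus-zoom motion field about the origin
vx, vy = affine_motion((0.0, 0.05, -0.02, 0.0, 0.02, 0.05),
                       np.arange(5.0), np.arange(5.0))
```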

11.1.7.4 Motion activity

Video scenes are usually classified in terms of their motion activity. For example, sports programmes are highly active and newsreader video shots represent low activity. Therefore motion can be used as a descriptor to express the activity of a given video segment.

An activity descriptor can be useful for applications such as surveillance, fast browsing, dynamic video summarisation and movement-based query. For example, if the activity descriptor shows a high activity, then during the playback the frame rate can be slowed down to make highly active scenes viewable. Another example of an application is finding all the high action shots in a news video programme.
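
The intensity of motion activity is typically derived from the spread of the motion vector magnitudes, quantised to a five-level grade. In the sketch below the motion vectors are assumed to be available (e.g. taken from the compressed stream), and the thresholds are illustrative, not normative.

```python
import numpy as np

def motion_activity(motion_vectors, levels=(2.0, 4.0, 8.0, 12.0)):
    """Map a frame's motion vectors (N x 2 array) to an activity grade
    from 1 (very low) to 5 (very high), using the standard deviation of
    the motion vector magnitudes."""
    mags = np.linalg.norm(np.asarray(motion_vectors, dtype=float), axis=1)
    sigma = mags.std()
    return 1 + int(np.searchsorted(levels, sigma))  # grade in 1..5
```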

11.1.8 Localisation

This descriptor specifies the position of a query object within the image, and it is defined with two types of description.

11.1.8.1 Region locator

The region locator is a descriptor that enables localisation of regions within images by specifying them with a brief and scalable representation of a box or a polygon.

11.1.8.2 Spatio-temporal locator

This descriptor defines the spatio-temporal locations of regions in a video sequence, such as moving object regions, and provides localisation functionality. Its main application is hypermedia, where related information is displayed when the designated point falls inside the object. Another major application is object retrieval by checking whether an object has passed through particular points, which can be used in, say, surveillance. The spatio-temporal locator can describe both spatially connected and nonconnected regions.

11.1.9 Others

MPEG-7 is an ongoing process and in the future other descriptors will be added to the above list. In this category, the most notable descriptor is face recognition.

11.1.9.1 Face recognition

The face recognition descriptor can be used to retrieve face images. It represents the projection of a face vector onto a set of basis vectors that are representative of possible face vectors. The face recognition feature set is extracted from a normalised face image of 56 lines, each line with 46 intensity values. In the 24th row, the centres of the two eyes in each face image are located at the 16th and 31st columns for the right and left eye, respectively. This normalised image is used to form a one-dimensional face vector, consisting of the luminance pixel values arranged in raster-scan order. The face recognition feature set is then calculated by projecting the one-dimensional face vector onto the space defined by a set of basis vectors.
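
In other words, the descriptor is a PCA-style projection. A minimal sketch, assuming a (k x 2576) matrix of basis vectors and a 2576-element mean face vector obtained from a training set; the normative basis is specified by the standard.

```python
import numpy as np

def face_descriptor(face_56x46, basis, mean_face):
    """Project a normalised face image (56 rows x 46 columns, grey level)
    onto the basis vectors; the projection coefficients form the compact
    face descriptor used for matching."""
    v = face_56x46.astype(np.float64).reshape(-1)  # raster scan, 56*46 = 2576
    return basis @ (v - mean_face)  # mean subtraction as in standard PCA

def face_distance(desc_a, desc_b):
    """Smaller distance means more similar faces."""
    return float(np.linalg.norm(desc_a - desc_b))
```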


