2. Video Semantics

2.1. Annotation-Based Semantics

An annotation is any symbolic description of a video or of an excerpt of a video. Such symbolic descriptions cannot always be extracted automatically. In contexts where enough knowledge is available, as in sports videos [6] or news videos [7][8], automatic processes are able to extract abstract semantics; in the general case, however, systems can only help users describe the content of videos [9] or provide low-level representations of the video content.

Many approaches have been proposed for modeling video semantics. Because a video is linear and continuous, many models are based on a specification of strata. Strata can be supported by different knowledge or data representations: V-STORM [10] and Carlos et al. [11] adopt an object or prototype approach, and, more recently, the MPEG committee has chosen the family of XML languages for the definition and extension of the MPEG-7 standard for audiovisual (AV) descriptions. We give below the principles of stratum-based models and the different representations of strata.

2.1.1. Strata-based fundamentals

A stratum is a list of still images or frames that share common semantics. A stratum can therefore be associated with descriptions of the video excerpts it encompasses. Some models allow strata to overlap.

In [12][13], the authors propose a stratification approach inspired by [14] to represent high-level descriptions of videos. A stratum is a list of non-intersecting temporal intervals, each expressed using frame numbers. A video is thus described by a set of strata. Two types of strata are proposed: dialog strata and entity strata. An entity stratum reflects the occurrence of an object, a concept, a text, etc., and has a Boolean representation. For instance, an entity stratum can be used to store the fact that a person occurs, to express the mood of a sequence in the video (e.g., 'sadness'), or to represent the structure of the video in terms of shots or scenes. A dialog stratum describes the dialog content for each interval considered; speech recognition may also be used to extract such strata. Retrieval is based on Boolean expressions for entity strata and on vector space retrieval [15] for dialog strata. The automatic extraction of semantic features such as Action/Close-Up/Crowd/Setting [16] also allows the definition of strata characterized by these features.
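As an illustration of the entity strata described above, the following minimal sketch represents a stratum as a list of non-intersecting, inclusive frame intervals and evaluates Boolean AND/OR combinations of two strata. It is not the implementation of [12][13]; the names EntityStratum, normalize, intersect and union, as well as the labels and frame numbers, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: an inclusive (first_frame, last_frame) pair.
Interval = Tuple[int, int]


def normalize(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge overlapping or adjacent ones so none intersect."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Boolean AND: frames covered by both normalized interval lists."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # Advance the list whose current interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Boolean OR: frames covered by at least one of the two lists."""
    return normalize(a + b)


@dataclass
class EntityStratum:
    label: str
    intervals: List[Interval]

    def __post_init__(self) -> None:
        # Enforce the non-intersecting property when the stratum is created.
        self.intervals = normalize(self.intervals)


# Illustrative data only: one stratum for a person, one for a mood.
person = EntityStratum("person:Alice", [(300, 450), (0, 120)])
sadness = EntityStratum("mood:sadness", [(100, 350)])

both = intersect(person.intervals, sadness.intervals)
either = union(person.intervals, sadness.intervals)
```

With these illustrative values, the query "Alice AND sadness" selects the frames in which both strata hold, and "Alice OR sadness" merges the intervals of the two strata into a single non-intersecting list.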

2.1.2. Strata structures and representations

In the stratification work of [13], the strata are not structured and no explicit link between strata exists. In [17], the authors define a video algebra that defines nested strata. Here, the structure is defined in a top-down way to refine the content description of video parts. Such nesting ensures consistency in the description of strata, because the nested strata correspond necessarily to nested time intervals. From a user perspective, browsing in a tree structure is certainly easier than viewing a flat structure. In that case, however, the question related to the retrieval of the video parts according to a query is far more complex than with flat strata.
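The consistency property of nested strata can be sketched as a tree in which a child is accepted only if its time interval lies within that of its parent. This is a simplified illustration under our own assumptions, not the video algebra of [17]; the class NestedStratum and its methods add and find are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NestedStratum:
    label: str
    start: int  # first frame, inclusive
    end: int    # last frame, inclusive
    children: List["NestedStratum"] = field(default_factory=list)

    def add(self, child: "NestedStratum") -> "NestedStratum":
        """Attach a child stratum, rejecting intervals not nested in this one."""
        if not (self.start <= child.start and child.end <= self.end):
            raise ValueError(
                f"{child.label} [{child.start}, {child.end}] is not nested in "
                f"{self.label} [{self.start}, {self.end}]"
            )
        self.children.append(child)
        return child

    def find(self, frame: int) -> List[str]:
        """Return the labels from this node down to the deepest stratum
        containing the frame (the first matching child is followed)."""
        if not (self.start <= frame <= self.end):
            return []
        for child in self.children:
            path = child.find(frame)
            if path:
                return [self.label] + path
        return [self.label]


# Illustrative top-down refinement of a description.
video = NestedStratum("video", 0, 999)
scene = video.add(NestedStratum("scene 1", 0, 499))
shot = scene.add(NestedStratum("shot 1.2", 100, 199))
```

Browsing then amounts to walking the tree, while a query on a frame returns the whole chain of enclosing descriptions, which hints at why retrieval over nested strata is more involved than over flat strata.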

Such a tree-based content representation is also proposed in [18]. The approach there consists in evaluating queries using database techniques. A tree structure describes when objects occur, and an SQL-like query language (using specific predicates or functions dedicated to the management of the object representation) allows the retrieval of the video excerpts that correspond to the query. Other database approaches, such as [19], have been proposed to provide database modeling and retrieval on strata-based video representations. An interesting proposal, AI-Strata [20], formalizes strata and the relations among strata. The AI-Strata formalism is a graph-based AV documentation model. Root elements are AV units or strata to which annotation elements are attached. These annotation elements derive from a knowledge base in which abstract annotation elements (classes) are described and organized in a specialization hierarchy. The exploitation of this graph structure is based on a sub-graph matching algorithm, a query being formulated in the shape of a so-called potential graph. In this approach, one annotation graph describes how the video document decomposes into scenes, shots and frames, as well as the objects and events involved in the AV units and their relationships.

In V-STORM [10], a video database management system written using O2, we proposed to annotate a video at each level of its decomposition into sequences, scenes, shots and frames. An O2 class represents an annotation and is linked to a sub-network of other O2 classes describing objects or events organized in specialization and composition hierarchies. Using an object-based model to describe the content of a stratum has several advantages: complex objects can be represented, objects are identified, and attributes and methods can be inherited. Following a similar approach to the representation formalism, the authors of [11] have proposed a video description model based on a prototype-instance model. Prototypes can be seen as objects that can play the roles of both instances and classes. Here, the user describes video stories by creating or adapting prototypes. Queries are formulated in the shape of new prototypes, which are classified in the hierarchies of prototypes describing the video in order to determine the existing prototype(s) that best match the query; these are delivered as results.

Among the family of MPEG standards dedicated to video, the MPEG-7 standard specifically addresses the annotation problem [3][21]. The general objective of MPEG-7 is to provide standard descriptions for the indexing, searching and retrieval of AV content. MPEG-7 Descriptors can be expressed either in an XML form (and are then human-readable, searchable and filterable) or in a binary form when efficient storage, transmission and streaming are required. MPEG-7 Descriptors (which are representations of features) can describe low-level features that can be automatically extracted or determined (such as color, texture, sound and motion, or location, duration and format), but also higher-level features (such as regions, segments and their spatial and temporal structure; objects, events and their interactions; author, copyright and date of creation; or user preferences and summaries). MPEG-7 predefines Description Schemes, which are structures made of descriptors and of the relationships between descriptors or other description schemes. Descriptors and Description Schemes are specified, and can be defined, using the MPEG-7 Description Definition Language, which is based on the XML Schema Language with some extensions concerning vectors, matrices and references. The Semantic Description Scheme is dedicated to providing data about objects, concepts, places, time in the narrative world, and abstraction. The descriptions can be very complex: using trees or graphs, the MPEG-7 Semantic Description Scheme can express actions or relations between objects, states of objects, abstraction relationships, and abstract concepts (like 'happiness'). Such Description Schemes can be related to temporal intervals, as in the strata-based approach.

To conclude, annotation-based semantics generally relies on a data structure that expresses a relationship between a continuous list of frames and its abstract representation. Approaches differ in the capabilities that the underlying model offers to formalize strata, the links between strata, and the structure of the annotations. In our opinion, this description is a key point for generating video summaries. A consistent and sound formalism will thus help avoid inconsistencies and fuzzy interpretations of annotations.

2.2. Low Level Content-Based Semantics

We define the low-level content-based semantics of videos as the elements that can be extracted automatically from the video flow without considering knowledge related to a specific context. This means that the video is processed by an algorithm that captures various signal information about frames and proposes an interpretation of these features in terms of color, shape, structure and object motion.

We can roughly separate the different content-based semantics extracted from videos into two categories: single frame-based and multiple frame-based. Single frame-based extractions consider only one frame at a time, while multiple frame-based extractions use several frames, mostly sequences of frames.

Generally, single frame extraction is performed using segmentation and region feature extraction based on colors, textures and shapes (like those used in still image retrieval systems such as QBIC [22] and Netra [23]). MPEG-7 proposes, in its visual part, description schemes that support color descriptions of images, groups of images and/or image regions (different color spaces, using dominant colors or histograms), texture descriptions of image regions (low level based on Gabor filters, high level based on three labels, namely regularity, main direction and coarseness), and shape descriptions of image regions based on curvature scale spaces and histograms of shapes. Retrieval is then based on similarity measures between the query and the features extracted from the images. However, applying usual still image retrieval systems to each frame of a video document is not adequate in terms of processing time and accuracy: consecutive frames in videos are usually quite similar, and this property has to be taken into account.
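To make the notion of a similarity measure between color features concrete, the sketch below computes a normalized RGB histogram for a frame and compares two frames by histogram intersection. This is one common choice, used here only for illustration; it is not the measure prescribed by MPEG-7, QBIC or Netra. Frames are assumed to be given as flat lists of (r, g, b) tuples with 8-bit channels, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # assumed 8-bit (r, g, b) values


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 bins."""
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = len(pixels)
    return [count / total for count in hist] if total else hist


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1] for normalized histograms (1 means identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Tiny synthetic "frames" for illustration.
red_frame = [(255, 0, 0)] * 4
blue_frame = [(0, 0, 255)] * 4
mixed_frame = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2

sim_same = histogram_intersection(color_histogram(red_frame), color_histogram(red_frame))
sim_diff = histogram_intersection(color_histogram(red_frame), color_histogram(blue_frame))
sim_half = histogram_intersection(color_histogram(red_frame), color_histogram(mixed_frame))
```

Because consecutive frames of a shot tend to yield nearly identical histograms, computing and indexing such a feature for every frame is largely redundant, which is the inadequacy noted above.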

Features related to sequences of frames can be extracted by averaging over the sequence, for example to compute, as in [24], the ratio of saturated colors in commercials.
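A sequence-level feature of this kind can be sketched as follows. The exact definition used in [24] is not given here, so the sketch assumes that a pixel counts as saturated when its HSV saturation reaches a threshold; the threshold value of 0.8 and the function names are assumptions made for this example.

```python
import colorsys
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # assumed 8-bit (r, g, b) values


def saturated_ratio(pixels: Sequence[Pixel], threshold: float = 0.8) -> float:
    """Fraction of pixels in one frame whose HSV saturation is >= threshold."""
    if not pixels:
        return 0.0
    count = 0
    for r, g, b in pixels:
        _, saturation, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if saturation >= threshold:
            count += 1
    return count / len(pixels)


def sequence_saturated_ratio(frames: Sequence[Sequence[Pixel]], threshold: float = 0.8) -> float:
    """Average of the per-frame saturated ratios over a sequence of frames."""
    if not frames:
        return 0.0
    return sum(saturated_ratio(f, threshold) for f in frames) / len(frames)


# Two synthetic frames: half red / half gray, then fully red.
red, gray = (255, 0, 0), (128, 128, 128)
frames = [[red, gray], [red, red]]
ratio = sequence_saturated_ratio(frames)
```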

Other approaches propose to define and use the motion of visible objects and the motion of the camera. MPEG-7 defines motion trajectory, based on key points and interpolation techniques, and parametric motion, based on parametric estimations using either optical flow techniques or the usual MPEG-1 and MPEG-2 motion vectors. In VideoQ [25], the authors describe ways to extract object motion from videos and to process queries involving the motion of objects. Work in the field of databases also considers the modeling of object motion [26]. In this case, the concern is not how to extract the objects and their motion, but how to represent them in a database so as to allow fast retrieval (on an object-oriented database system) of a part of a video according to SQL-like queries based on object motion.

The indexing formalisms used at this level are too low-level to be used directly in a query process, by consumers for instance. On the other hand, such approaches are not tightly linked to specific contexts (for instance, motion vectors can be extracted from any video). A current trend in this domain is therefore to merge content-based information with other semantic information in order to provide usable information.

2.3. Structure-Based Semantics

It is widely accepted that video documents are hierarchically structured into clips, scenes, shots and frames. Such structure usually reflects the creation process of the videos. Shots are usually defined as continuous sequences of frames taken without stopping the camera. Usually, scenes are defined as sequences of contiguous shots that are semantically related, but in [27] [28] the shots of a sequence may not be contiguous. A clip is a list of scenes.

Since the seminal work of Nagasaka and Tanaka in 1990 [29], much work has been done on detecting shot boundaries in a video flow. Many researchers [30][31][32][33] have focused on detecting the different kinds of shot transitions that occur in video. The TREC video track 2001 [34] compared different temporal segmentation approaches: while the detection of cuts between shots is usually successful, the detection of fades does not achieve very high success rates.
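As a baseline illustration of cut detection, the sketch below declares a shot boundary whenever the distance between the gray-level histograms of two consecutive frames exceeds a fixed threshold. It is a generic histogram-difference scheme under our own assumptions, not the specific method of any of the cited works; frames are assumed to be flat lists of 8-bit gray values, and the threshold of 0.5 and the function names are hypothetical.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Normalized histogram of 8-bit gray values."""
    hist = [0.0] * bins
    for value in frame:
        hist[value * bins // 256] += 1.0
    total = len(frame)
    return [count / total for count in hist] if total else hist


def detect_cuts(frames: Sequence[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Return each index i where a cut is declared between frame i-1 and frame i."""
    cuts: List[int] = []
    previous = None
    for i, frame in enumerate(frames):
        current = gray_histogram(frame)
        if previous is not None:
            # L1 distance between normalized histograms lies in [0, 2].
            distance = sum(abs(a - b) for a, b in zip(previous, current))
            if distance > threshold:
                cuts.append(i)
        previous = current
    return cuts


# Synthetic sequence: two dark frames, two bright frames, one dark frame.
dark = [10] * 16
bright = [240] * 16
cuts = detect_cuts([dark, dark, bright, bright, dark])
```

A gradual transition such as a fade spreads the change over many consecutive frame pairs, so each pairwise distance may stay below the threshold; this gives some intuition for why fades are harder to detect than cuts with such pairwise schemes.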

Other approaches focus on scene detection. For example, [27] uses shot information together with multiple cues such as audio consistency between shots and the closed captions of the speech in the video. Bolle et al. [35] use types of shots and predefined rules to define scenes. The authors of [36] extend the work of Bolle et al. by including vocal emotion changes.

Once the shots and sequences are extracted, it is possible to describe the content of each structural element [37], so that video excerpts can be retrieved using queries or by navigating in a synthetic video graph.

The video structure provides a sound view of a video document. From our point of view, a video summary must keep this view of the video. Like the table of contents of a book, the video structure provides a direct access to video segments.