A defining characteristic of the video medium is that it is temporal: a video has an inherent duration, and the time needed to find information in it depends in part on that duration. Without any prior knowledge about the video, one must resort to Video Cassette Recorder-style facilities in order to reduce the search time. Fortunately, the digitization of video has overcome the material constraints that formerly forced a sequential reading of the medium. Technological progress now makes it possible to apply various treatments to the images and segments that compose a video. For instance, it is now very easy to edit particular images or sequences of images, or to create non-destructive excerpts of videos. It is also possible to reorder images so as to form semantic groupings according to a particular center of interest. These treatments are extensively used by film-makers through digital editing suites such as Final Cut Pro and Adobe Premiere.
In such software, users have to intervene at most stages of the process in order to point out the sequences of images to be treated and to specify the treatment to be applied by choosing appropriate operators. Such treatments mainly correspond to manipulation operations such as, for instance, cutting a segment and pasting it in another place. This software offers simple interfaces that fit the requirements of applications like personal video computing, and, more generally, the technology is fully adapted to the treatment of a single video. However, in the case of large video databases used and manipulated simultaneously by various kinds of users, this technology is less well suited. In this case, video software has to offer a set of tools able to (i) handle and manage multiple video documents at the same time, (ii) find segments in a collection of videos corresponding to given search criteria, and (iii) dynamically and automatically (i.e., with no user intervention) create a video comprising the montage of the results of this search.
In order to fulfill these three requirements, video software must model the semantics conveyed by the video, as well as its structure, which can be determined from the capturing or editing processes. Depending on the nature of the semantics, the indexing process in charge of extracting the video semantics can be manual, semi-automated or automated. For instance, some attempts have been made to automatically extract physical features such as color, shape and texture in a given frame and to extend this description to a video segment. Obviously, it remains quite impossible to automatically extract high-level information such as the name of an actor or the location of a sequence. Thus, automatic indexing is generally complemented by manual indexing.
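As an illustration of the automated, low-level side of this indexing process, the following sketch computes a quantized color histogram per frame and extends it to a segment by averaging. The representation (pixel lists, bucket counts) is a deliberately simplified stand-in for a real feature extractor, not the method of any particular system.

```python
from collections import Counter

def frame_histogram(pixels, bins=4):
    """Quantized RGB colour histogram of one frame (list of (r, g, b) tuples)."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

def segment_histogram(frames, bins=4):
    """Extend the per-frame description to a segment by averaging the histograms."""
    merged = Counter()
    for frame in frames:
        for bucket, weight in frame_histogram(frame, bins).items():
            merged[bucket] += weight / len(frames)
    return dict(merged)

# Two tiny synthetic "frames" stand in for decoded video frames.
frames = [
    [(255, 0, 0), (250, 10, 5), (0, 0, 255)],   # mostly red
    [(255, 5, 0), (0, 255, 0), (0, 250, 10)],   # red and green
]
desc = segment_histogram(frames)
print(round(sum(desc.values()), 6))  # 1.0
```

Comparing such segment descriptors (e.g., by histogram intersection) is what makes purely automatic retrieval of visually similar segments possible, while the high-level semantics mentioned above still call for manual annotation.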
The expected result of this indexing process is a description of the content of the video in a formalism that allows accessing, querying, filtering, classifying and reusing the whole or some parts of the video. Offering metadata structures for describing and annotating audiovisual (AV) content is the goal of the MPEG-7 standard , which supports a range of descriptions from low-level signal features (shape, size, texture, color, movement, position, etc.) to the highest semantic level (author, date of creation, format, objects and characters involved, their relationships, spatial and temporal constraints, etc.). MPEG-7 descriptions are defined using the XML Schema language, and Description Schemes can then be instantiated as XML documents.
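The following sketch instantiates a description of this kind as an XML document. The element names echo MPEG-7 Description Schemes (video segment, media time, free-text annotation), but namespaces and most required attributes are omitted, so this is an illustrative fragment rather than schema-valid MPEG-7.

```python
import xml.etree.ElementTree as ET

# Simplified MPEG-7-style description of one annotated video segment.
root = ET.Element("Mpeg7")
segment = ET.SubElement(root, "VideoSegment", id="seg1")
time = ET.SubElement(segment, "MediaTime")
ET.SubElement(time, "MediaTimePoint").text = "T00:01:30"
ET.SubElement(time, "MediaDuration").text = "PT45S"
annotation = ET.SubElement(segment, "TextAnnotation")
ET.SubElement(annotation, "FreeTextAnnotation").text = "Car chase in the city"

print(ET.tostring(root, encoding="unicode"))
```

Because such instances are plain XML, standard tooling can be used to query or filter them, which is precisely what makes the description reusable across applications.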
Using such descriptions, it is possible to dynamically create summaries of videos. Briefly, a video summary is an excerpt that is supposed to keep the relevant parts of a video while dropping the less interesting ones. This notion of relevance is subjective, interest-driven and context-dependent; it is therefore hard to specify and requires preference criteria to be given. In our opinion, the semantics previously captured by the semantic data model can be exploited to create such a summary, while taking the preference criteria into account.
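A minimal sketch of this idea, under the assumption that the preference criteria have already been reduced to a per-segment relevance score: greedily keep the highest-scoring segments until a target duration is reached, then restore temporal order for the montage. The scores and durations are illustrative, not output of any real model.

```python
def summarize(segments, max_duration):
    """segments: list of (start, duration, relevance); returns kept segments in video order."""
    kept, used = [], 0.0
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if used + seg[1] <= max_duration:
            kept.append(seg)
            used += seg[1]
    return sorted(kept)  # restore temporal order for the montage

segments = [(0, 10, 0.2), (10, 20, 0.9), (30, 15, 0.7), (45, 25, 0.4)]
summary = summarize(segments, max_duration=40)
print(summary)  # [(10, 20, 0.9), (30, 15, 0.7)]
```

Greedy selection is only one possible policy; duration-constrained selection can also be cast as a knapsack-style optimization when segments are many and scores fine-grained.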
In this paper, we present the VISU model, which is the result of our ongoing research in this domain. This model capitalizes on work done in the field of information retrieval, notably on the use of Conceptual Graphs . We show how the VISU model adapts and extends these results in order to satisfy the constraints inherent to the video medium. Our objective is to annotate videos using Conceptual Graphs so as to represent complex descriptions associated with frames or segments of the video. We then take advantage of the material implication of Conceptual Graphs, on which an efficient graph matching algorithm is based , to formulate queries. We propose a query formalism that provides a way to specify retrieval criteria; these criteria allow users to adapt the summary to their specific requirements. The principles of query processing are also presented. Finally, we discuss the time constraints that must be satisfied to create a summary of a given duration. The paper is organized as follows. In the next section, the different approaches used to capture video semantics are presented. Section 3 gives an overview of the main works in the field of video summarization. Section 4 presents the video data model and the query model of the VISU system we are currently developing. Section 5 gives some concluding remarks and perspectives on this work.
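Before moving on, the graph-matching intuition behind conceptual-graph querying can be sketched very simply: a query graph matches an annotation graph when every query relation is subsumed by an annotation relation under a concept-type hierarchy. The triples, types and hierarchy below are hypothetical; full Conceptual Graph projection also handles coreference and relation-type subsumption, which this sketch omits.

```python
# Hypothetical concept-type hierarchy: child -> parent.
TYPE_HIERARCHY = {
    "Actor": "Person",
    "Person": "Entity",
    "Car": "Vehicle",
    "Vehicle": "Entity",
}

def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = TYPE_HIERARCHY.get(specific)
    return False

def matches(query, annotation):
    """Every (subject, relation, object) triple of the query must be
    subsumed by some triple of the annotation graph."""
    return all(
        any(rel == r and subsumes(s_t, s) and subsumes(o_t, o)
            for (s, r, o) in annotation)
        for (s_t, rel, o_t) in query
    )

annotation = {("Actor", "agent", "Car")}   # e.g., "an actor drives a car"
query = {("Person", "agent", "Vehicle")}   # "a person acts on a vehicle"
print(matches(query, annotation))  # True
```

The direction of the test mirrors material implication: the more specific annotation entails the more general query, which is why the more specific graph can be returned as an answer.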