The general framework is composed of three tasks: temporal structure extraction, meta-data generation, and probabilistic prediction.
Figure 18.1: General framework
The first task, "temporal structure extraction," collects temporal data and structure from video bases. The temporal structure of a video document induces its partitioning into basic parts, defined at four different levels.
Frame level: A frame is the low-level unit of composition. There is no significant temporal structure at this level.
Shot level: A shot is a sequence of frames acquired through a continuous camera recording. Partitioning a video into shots generally does not require any semantic analysis; it is the first level of temporal structure.
Scene level: A scene is a sequence of shots sharing a common semantic meaning. Our approach considers this level as the unit of exploration.
Video level: The video level represents the whole document.
The first key level that includes temporal structure is the shot level, and specific transitions (cut, dissolve, and wipe) characterize shot boundaries. A cut is a sharp boundary between contiguous shots. It generally implies a peak in the difference between the color or motion histograms of the two frames surrounding the cut; cut detection may therefore simply consist in detecting such peaks, and adding some form of temporal smoothing further improves the robustness of the detection process. A dissolve is a fuzzy boundary between contiguous shots: the content of the last images of the first shot continuously blends into the first images of the second shot. The major issue here is to distinguish dissolve effects from changes induced by global motion. Fade-in and fade-out effects are special cases of dissolve transitions in which the first or the second shot, respectively, is a dark frame. A wipe is another fuzzy boundary between contiguous shots: the images of the second shot continuously cover, or push off the display, the images of the first shot.
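The peak-detection idea for cuts can be sketched as follows. This is a minimal illustration, not the method described in the text: the histogram bin count, the threshold value, and the use of an L1 distance are all assumptions made for the example, and a production detector would add the temporal smoothing mentioned above.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Normalized per-channel color histogram of an H x W x 3 uint8 frame."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def detect_cuts(frames, threshold=0.4):
    """Flag a cut between frames i and i + 1 when the histogram difference
    is a local peak exceeding the threshold (threshold is illustrative)."""
    hists = [color_histogram(f) for f in frames]
    # L1 distance between consecutive histograms, scaled to [0, 1].
    diffs = np.array([np.abs(b - a).sum() / 2 for a, b in zip(hists, hists[1:])])
    padded = np.concatenate(([0.0], diffs, [0.0]))  # zero-pad for the peak test
    return [i for i, d in enumerate(diffs)
            if d > threshold and d > padded[i] and d > padded[i + 2]]
```

On a synthetic sequence of ten black frames followed by ten white frames, the single histogram-difference peak sits between frames 9 and 10, and `detect_cuts` reports exactly that boundary.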
While shot extraction through cut detection is relatively easy owing to the abrupt nature of the transition, dissolves and wipes are more difficult to detect. Some efficient solutions exploit the compressed structure of MPEG files, based on global motion estimation and segmentation. More elaborate gradual effects, such as mosaics and whirls, mix frames on a larger scale; depending on the particular frame-mixing technique, dissolve detectors may be misled by the apparent motion such effects induce. We are not aware of any technique specialized in detecting these elaborate gradual effects.
A scene is defined by a deeper understanding of the content of the shots. Automated scene annotation relies on a high-level clustering of shots, where the indexing data derived from each shot forms a feature vector. Depending on the video, shot segmentation may lead to a small and manageable set of objects (shot representations); in that case, a human operator can reliably define the scenes. It is important to note, however, that shot segmentation performed by a human operator may not be fully reliable, because the task is tedious and calls for constant concentration. The development of semi-automated segmentation and annotation tools is therefore important in this context. In our framework, the system semi-automatically extracts the scenes, which represent the Markov states.
The second task is "meta-data generation." It is triggered when the temporal segmentation of the video ends, and its objective is to support quick reference.
The content of the meta-data varies depending on the application toward which the video base is oriented. For generic video documents, this data generally includes scene and shot boundaries, along with some characteristic visual representation.
One common representation is the choice of one or more key frames within the shot or the scene. The assumptions underlying temporal segmentation normally imply that all frames within a basic shot are visually consistent; the system may therefore rely on heuristics for choosing one or more key frames. The simplest heuristic relies on the position of the frames within the shot. The system may also use other characteristics, such as the corresponding audio stream, for more effective key-frame detection. Video micro-segmentation refers to the process of re-segmenting the shots with respect to heuristics reflecting a comprehension of the video content.
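The position-based heuristic mentioned above can be made concrete with a small sketch. The even-spread rule is one possible reading of "global position within the shot," assumed here for illustration:

```python
def key_frames(shot_start, shot_end, n=1):
    """Position-based key-frame selection.

    Picks n frame indices evenly spread inside the shot
    [shot_start, shot_end]; with n = 1 this is simply the middle frame.
    """
    length = shot_end - shot_start + 1
    return [shot_start + (length * (2 * k + 1)) // (2 * n) for k in range(n)]
```

For a shot spanning frames 0 to 9, `key_frames(0, 9)` yields the middle frame, and `key_frames(0, 9, n=2)` yields two frames placed at the quarter points.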
The third task, "probabilistic prediction," is composed of four major actions. The first is the Markov model, which consists of a sparse matrix of state transition probabilities, stored with a suitable low-level representation, together with the initial state probability vector. The second is the user access memory. All of a user's requests are temporarily saved into this memory and flushed once a minimum sample threshold is exceeded or the session times out; the system assigns a separate memory to each user, holding the sequence of that user's requests. The third is the update action, which updates the Markov model with the available user path traces: the system typically smooths the current count matrix with the counts derived from the additional path sequences. The fourth is the exploration path generator: given a start scene identifier, the system outputs a sequence of states predicting the chain of scenes the user may follow.
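The four actions above can be sketched together as a small class. This is a minimal illustration under stated assumptions: a dense count matrix stands in for the sparse representation, additive smoothing stands in for the unspecified smoothing scheme, and the path generator greedily follows the most probable transition (sampling from the transition distribution would be an equally valid choice).

```python
import numpy as np

class ExplorationModel:
    """First-order Markov model over scene identifiers (illustrative sketch)."""

    def __init__(self, n_scenes, prior=1.0):
        # Additive (Laplace-style) smoothing: every transition starts
        # with a small pseudo-count instead of zero.
        self.counts = np.full((n_scenes, n_scenes), prior)
        self.initial = np.full(n_scenes, prior)

    def update(self, path):
        """Fold one recorded user path (a sequence of scene ids) into the
        counts, as the update action does when a user memory is flushed."""
        self.initial[path[0]] += 1
        for a, b in zip(path, path[1:]):
            self.counts[a, b] += 1

    def predict_path(self, start, length=5):
        """Exploration path generator: from a start scene identifier,
        output a chain of scenes by following the most probable transition."""
        path = [start]
        for _ in range(length - 1):
            row = self.counts[path[-1]]
            probs = row / row.sum()  # normalized transition probabilities
            path.append(int(np.argmax(probs)))
        return path
```

After a few recorded traces such as `[0, 1, 2]`, `predict_path(0, length=3)` reproduces the dominant chain of scenes observed in the user memories.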