2. Background

This section reviews prior art in semantic video modeling, description, and analysis.

2.1 Semantic Video Modeling

Semantic video models are those that capture the structure and semantics of video programs. We can classify them as textual models, which employ only keywords and structured annotations to represent semantics, and integrated models, which employ both textual and content-based low-level features to support mixed-level queries. A mixed-level query is one that includes both high-level semantic concepts and low-level content-based features, e.g., "Object 1 is on the left of Object 2 and participates in Event A," or "Object 1 participates in Event A and follows trajectory T."
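
As an illustration only, the first example query above can be sketched in a few lines of Python. The `TrackedObject` record, its `centroid_x` field, and the function names are hypothetical and are not taken from any of the systems surveyed here; the sketch simply shows one predicate on a low-level feature (horizontal position) combined with one predicate on a semantic annotation (event participation).

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TrackedObject:
    """Hypothetical record pairing semantic labels with one low-level feature."""
    name: str
    centroid_x: float                               # low-level: horizontal position (pixels)
    events: Set[str] = field(default_factory=set)   # semantic: events the object takes part in


def is_left_of(a: TrackedObject, b: TrackedObject) -> bool:
    """Low-level spatial predicate: a's centroid lies to the left of b's."""
    return a.centroid_x < b.centroid_x


def mixed_level_query(objects: Dict[str, TrackedObject],
                      first: str, second: str, event: str) -> bool:
    """Evaluate 'first is on the left of second and participates in event'."""
    a, b = objects[first], objects[second]
    return is_left_of(a, b) and event in a.events


# Made-up values, used only to exercise the query.
objects = {
    "Object 1": TrackedObject("Object 1", centroid_x=120.0, events={"Event A"}),
    "Object 2": TrackedObject("Object 2", centroid_x=340.0),
}
print(mixed_level_query(objects, "Object 1", "Object 2", "Event A"))
```

A purely textual model could answer only the second half of this conjunction, since it stores no positional information for the objects.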

Among the textual models, Oomoto et al. [4] have designed an object-based model for their video database system, OVID. A video object in OVID refers to a meaningful scene in terms of an object identifier, an interval, and a collection of attribute-value pairs. Hjelsvold and Midtstraum [5] have developed a generic video model that captures video structure at various levels, so that the video structure, the annotations, and their sharing and reuse can be specified in one model. Thematic indexing is achieved by annotations defined for video segments, and by specific annotation entities corresponding to persons, locations, and events. Adali et al. [6] introduce AVIS with a formal video model in terms of interesting video objects. A video object in AVIS refers to a semantic entity that attracts attention in the scene. The model includes events, as instantiations of activity types, and the roles of objects in the events. Al Safadi and Getta [7] have designed a semantic video model to express various human interpretations. Their conceptual model comprises semantic units, descriptions of semantic units, associations between semantic units, and abstraction mechanisms over semantic units. These textual models, in general, do not include low-level features; for example, they lack object motion modeling.
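
To make the notion of a textual video object concrete, the following minimal sketch mirrors the three components attributed to OVID above: an object identifier, an interval, and attribute-value pairs. It is not OVID's actual schema or query language; the class name, the frame-based inclusive interval, and the sample attributes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class VideoObject:
    """Illustrative stand-in for an OVID-style video object (assumed layout)."""
    oid: str
    interval: Tuple[int, int]                       # assumed inclusive (start_frame, end_frame)
    attributes: Dict[str, Any] = field(default_factory=dict)

    def covers(self, frame: int) -> bool:
        """True if the given frame falls inside this object's interval."""
        start, end = self.interval
        return start <= frame <= end


def objects_at(frame: int, catalog: List[VideoObject]) -> List[str]:
    """Return the identifiers of all video objects whose interval contains the frame."""
    return [obj.oid for obj in catalog if obj.covers(frame)]


# Made-up catalog entries; the attribute names are hypothetical.
catalog = [
    VideoObject("oid-1", (0, 250), {"location": "stadium", "event": "kick-off"}),
    VideoObject("oid-2", (200, 480), {"event": "goal"}),
]
print(objects_at(225, catalog))
```

Every field in this record is symbolic, so a query about how an object moves within its interval has nothing to operate on, which is the limitation noted above.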

In contrast, Hacid et al. [8] have extended a textual database model to include low-level features. Their model consists of two layers: 1) a feature and content (audiovisual) layer that contains low-level features, and 2) a semantic layer that provides conceptual information. The first layer is based on the information obtained from QBIC [9]. This work is a good starting point for a system to handle mixed-level queries, but the framework supports neither object-based motion description nor mid-level semantics for motion.

As a sign of the maturity reached in the field of content-based retrieval, the ISO MPEG-7 standard (formally, the Multimedia Content Description Interface) provides normative tools for describing multimedia content in the form of descriptors (Ds) and description schemes (DSs). One of these DSs, the SemanticDS, introduces a generic semantic model to enable semantic retrieval in terms of objects, events, places, and semantic relations [1]. MPEG-7 also provides low-level descriptors, such as color, texture, shape, and motion, under a separate SegmentDS. Thus, in order to perform mixed-level queries with spatio-temporal relations between objects, one needs to instantiate both a SemanticDS and a SegmentDS, with two separate graph structures. This unduly increases representation complexity, resulting in inefficient query resolution and in problems that arise when either DS is manipulated independently of the other. Furthermore, there is no context-dependent classification of object attributes in the MPEG-7 SemanticDS. For example, if an object appears in multiple events, all attributes (relations) of the object related to all events are listed within the same object entity. This aggravates the inefficiency problem.
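
The effect of listing all relations of an object in one entity can be illustrated with a small sketch. This is neither MPEG-7 description syntax nor the model proposed in Section 3; the relation labels and function names are hypothetical. With a flat listing, retrieving the relations that belong to one event requires scanning every relation of the object, whereas a listing grouped by event (the context) allows a direct lookup.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Flat listing: all (event, relation) pairs of one object sit in a single list.
# The labels below are made up for illustration.
flat_relations: List[Tuple[str, str]] = [
    ("Event A", "participates as scorer"),
    ("Event B", "participates as defender"),
    ("Event A", "left of Object 2"),
]


def relations_in_event_flat(relations: List[Tuple[str, str]], event: str) -> List[str]:
    """Scan every relation of the object and keep those tied to the given event."""
    return [rel for ev, rel in relations if ev == event]


def group_by_event(relations: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Reorganize the flat list so that relations are keyed by their event."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ev, rel in relations:
        grouped[ev].append(rel)
    return dict(grouped)


grouped_relations = group_by_event(flat_relations)
print(relations_in_event_flat(flat_relations, "Event A"))
print(grouped_relations["Event A"])
```

Both calls return the same relations; the difference is that the flat form touches every entry of the object for each query, and that cost grows with the number of events in which the object appears.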

In Section 3, we propose a new generic integrated semantic-syntactic model that addresses the above deficiencies.

2.2 Automatic Sports Video Analysis

Semantic analysis of sports video involves the use of cinematic features to detect shot boundaries, shot types, and replays, as well as object-based features to detect and track the players and the referee and to detect certain game-specific events, such as goals and plays around the penalty box in soccer videos.

The earliest works on sports video processing used object color and texture features to generate highlights [10] and to parse TV soccer programs [11]. Object motion trajectories and interactions are used for football play classification [12] and for soccer event detection [13]. Both [12] and [13], however, rely on pre-extracted accurate object trajectories, which are obtained manually in [12]; hence, they are not practical for real-time applications. LucentVision [14] and ESPN K-Zone [15] track only specific objects for tennis and baseball, respectively. The former analyzes trajectory statistics of two tennis players and the ball. The latter tracks the ball during pitches to show, in replays, whether the strike and ball decisions are correct. The real-time tracking in both systems is achieved by extensive use of a priori knowledge about the system setup, such as camera locations and their coverage, which limits their application.

Cinematic descriptors are also commonly employed. The plays and breaks in soccer games are detected by frame view types in [16]. Li and Sezan summarize football video by play/break and slow-motion replay detection using both cinematic and object descriptors [17]. Scene cuts and camera motion parameters are used for soccer event detection in [18], where the use of only a few cinematic features prevents reliable detection of multiple events. A mixture of cinematic and object descriptors is employed in [19]. Motion activity features are proposed for golf event detection [20]. Text information from closed captions and visual features are integrated in [21] for event-based football video indexing. Audio features alone are proposed to detect hits and generate baseball highlights [22].

In this chapter, we present a framework to process soccer video using both cinematic and object-based features. The output of this processing can be used to automatically generate summaries that capture the essential semantics of the game, or to instantiate the proposed new model for mixed-level (semantic and low-level) search applications.