J. Assfalg, M. Bertini, C. Colombo, and A. Del Bimbo
Università di Firenze
Dipartimento di Sistemi e Informatica
Via S.Marta 3, 50139 Firenze, Italy
The enormous quantity of video generated by digital technologies has created the need for automatic annotation of these videos, and the consequent need for techniques supporting their retrieval. Content-based video annotation and retrieval is therefore an active research topic. While many of the results in content-based image retrieval can be successfully applied to videos, additional techniques have to be developed to address their peculiarities. In fact, videos add the temporal dimension, thus requiring the representation of object dynamics. Furthermore, while we often think of a video simply as a sequence of images, it is actually a compound medium, integrating such elementary media as realistic images, graphics, text and audio, each showing characteristic presentation affordances. Finally, application contexts for videos differ from those for images, and therefore call for different approaches in the way in which users may annotate, query for, and exploit archived video data.
As a result, video streams go through a complex processing chain, which comprises a variable number of processing steps. These steps include, but are not limited to, temporal segmentation of the stream into shots, detection and recognition of text appearing in captions [11, 9], extraction and interpretation of the audio track (including speech recognition) [1, 19, 36], visual summarization of shot content, and semantic annotation [2, 13]. In general, a bottom-up approach is followed, moving from low-level perceptual features to high-level semantic descriptions. While this general framework roughly meets the requirements of a variety of application domains, the specificity of each domain also has to be addressed when developing systems that are expected to effectively support users in the accomplishment of their tasks. This specificity affects different stages in the development of content-based video retrieval systems, including the selection of relevant low-level features, models of specific domain knowledge supporting the representation of semantic information, and querying and visualization interfaces.
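The first step of such a chain, temporal segmentation into shots, is commonly based on discontinuities between consecutive frame descriptors. The following is a minimal sketch of this idea using a coarse grey-level histogram difference; the function names, frame representation, and threshold are illustrative only and do not correspond to any specific system described here.

```python
# Illustrative shot-boundary detection by histogram difference.
# Frames are modeled as flat lists of grey values in 0-255; all names
# and thresholds below are hypothetical choices for this sketch.

def color_histogram(frame, bins=4):
    """Coarse normalized grey-level histogram of one frame."""
    hist = [0] * bins
    for v in frame:
        hist[min(v * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [h / total for h in hist]

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def segment_into_shots(frames, threshold=0.5):
    """Start a new shot wherever consecutive frame histograms
    differ by more than `threshold`; return (start, end) pairs."""
    shots, start = [], 0
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        if histogram_distance(prev, cur) > threshold:
            shots.append((start, i - 1))
            start = i
        prev = cur
    shots.append((start, len(frames) - 1))
    return shots

# Two dark frames followed by two bright frames: one cut expected.
frames = [[10] * 16, [12] * 16, [240] * 16, [250] * 16]
print(segment_into_shots(frames))  # [(0, 1), (2, 3)]
```

Real systems replace the grey-level histogram with richer per-frame descriptors and adaptive thresholds, but the structure of the step is the same.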
The huge amount of data delivered by a video stream requires the development of techniques supporting an effective description of the content of a video.
This necessarily results in higher levels of abstraction in the annotation of the content, and therefore requires the investigation and modeling of video semantics. It further suggests that general-purpose approaches are likely to fail, as semantics inherently depends on the specific application context. Semantic modeling of the content of multimedia databases has been addressed by many researchers. From a theoretical viewpoint, the semantic treatment of a video requires the construction of a hierarchical data model including, at increasing levels of abstraction, four main layers: raw data, feature, object and knowledge. For each layer, the model must specify both the elements of representation (what) and the algorithms used to compute them (how). Upper layers are typically constructed by combining the elements of the lower layers according to a set of rules (however they are implemented). Concrete applications of video retrieval by high-level semantics have been reported in specific contexts such as movies, news and commercials [5, 7].
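The four-layer model above can be made concrete with a small sketch in which each layer names its elements of representation (what) and the algorithm computing them from the layer below (how). Everything in this example, from the layer contents to the classification rules, is a hypothetical illustration, not the model of any particular system.

```python
# Illustrative four-layer hierarchy: raw data -> feature -> object -> knowledge.
# Each upper layer combines the output of the layer below via a rule.

from dataclasses import dataclass
from typing import Callable, Any

@dataclass
class Layer:
    name: str                       # what: the elements of representation
    extract: Callable[[Any], Any]   # how: the algorithm that computes them

# Raw data layer: pixel values of one frame (flat list, hypothetical).
raw = [30, 32, 200, 210]

# Feature layer: a low-level descriptor (here, mean brightness).
feature_layer = Layer("feature", lambda pixels: sum(pixels) / len(pixels))

# Object layer: a rule mapping the feature to a mid-level label.
object_layer = Layer("object",
                     lambda mean: "playfield" if mean > 100 else "crowd")

# Knowledge layer: a domain rule mapping objects to semantic annotations.
knowledge_layer = Layer("knowledge",
                        lambda obj: "action shot" if obj == "playfield"
                        else "other")

# Bottom-up evaluation, from raw data to semantic annotation.
feature = feature_layer.extract(raw)
obj = object_layer.extract(feature)
annotation = knowledge_layer.extract(obj)
print(annotation)  # action shot
```

The point of the sketch is the strict bottom-up flow: no layer inspects the raw data directly except the feature layer, mirroring the layered model described above.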
Due to its enormous commercial appeal, sports video represents another important application domain, where most research efforts so far have been devoted to the characterization of single, specific sports. Miyamori et al. proposed a method to annotate videos with human behavior. Ariki et al. proposed a method for the classification of TV sports news videos using DCT features. Among the sports that have been analyzed so far, we can cite soccer [24, 17, 25, 26], tennis, basketball [28, 29, 15], baseball, and American football.
This chapter illustrates an approach to semantic video annotation in the specific context of sports videos. Within the scope of the EU ASSAVID (Automatic Segmentation and Semantic Annotation of Sports Videos) project, a number of tools supporting the automatic annotation of sports videos were developed.
In the following sections, methodological aspects of the top-down and bottom-up approaches followed within the project will be addressed, considering generic sports videos. Videos are automatically annotated according to elements of visual content at different layers of semantic significance. Unlike previous approaches, videos can include several different sports and can also be interleaved with non-sport shots. In fact, studio/interview shots can be recognized and distinguished from sports action shots; the latter are then further decomposed into their main visual and graphic content elements, including sport type, foreground vs. background, text captions, and so on. Relevant semantic elements are extracted from videos by suitably combining several low-level visual primitives, such as image edges, corners, segments, curves and color histograms, according to context-specific aggregation rules. From Section 4 to the end, the detection of highlights will be discussed, analyzing soccer videos in particular. Soccer-specific semantic elements are recognized using highlight models based on low- and medium-level cues, such as playfield zone and motion parameters; application to different kinds of sports can thus be easily envisioned.
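The idea of combining low-level primitives through context-specific aggregation rules can be sketched as a simple rule-based shot classifier. The primitive names (green-color ratio, motion magnitude, caption presence) and the thresholds below are invented for illustration; they stand in for whatever measurements and rules a concrete system would use.

```python
# Hypothetical aggregation rules mapping low-level primitives to a
# shot-level semantic label; all feature names and thresholds are
# illustrative, not taken from any specific system.

def classify_shot(primitives):
    """primitives: dict of low-level measurements for one shot."""
    # A dominant green area with sustained motion suggests playfield
    # action (e.g. a soccer action shot).
    if primitives["green_ratio"] > 0.5 and primitives["motion_magnitude"] > 0.3:
        return "sport action"
    # A detected text caption with little motion suggests a
    # studio/interview shot.
    if primitives["has_caption"] and primitives["motion_magnitude"] < 0.1:
        return "studio/interview"
    return "unclassified"

action = {"green_ratio": 0.7, "motion_magnitude": 0.6, "has_caption": False}
studio = {"green_ratio": 0.1, "motion_magnitude": 0.05, "has_caption": True}
print(classify_shot(action))  # sport action
print(classify_shot(studio))  # studio/interview
```

Highlight detection extends the same pattern: the cues mentioned above (playfield zone, motion parameters) would feed rules or models at a higher semantic layer.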
This work was partially supported by the ASSAVID EU Project. The ASSAVID consortium comprises ACS SpA (I), BBC R&D (UK), Institut Dalle Molle D'Intelligence Artificielle Perceptive (CH), Sony BPE (UK), University of Florence (I), and University of Surrey (UK).