1. Introduction

An important current trend in multimedia information management is towards web-based/enabled multimedia search and management systems. Video is a rich and colorful medium widely used in many daily-life applications such as education, entertainment and news broadcasting. Digital videos come from diverse sources, such as video cassette and tape recorders, home video cameras, DVDs and the Internet. The expressiveness of video documents secures their dominant position in next-generation multimedia information systems: unlike traditional, static types of data, digital video can disseminate information more effectively through its rich content. Collectively, a (digital) video can have several information descriptors: (1) meta data - the actual video frame stream, including its encoding scheme and frame rate; (2) media data - information about the characteristics of the video content, such as visual features, scene structure and spatio-temporal features; (3) semantic data - the mapping between media data and high-level meanings, such as text annotations or concepts relevant to the content of the video, obtained by manual or automatic understanding.
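To make the three descriptor layers concrete, the following minimal sketch models them as plain Python data classes. All class and field names here are our illustrative assumptions for exposition; they are not part of the VideoMAP framework itself.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MetaData:
    """The actual video frame stream and its technical parameters."""
    frame_stream_uri: str   # location of the encoded frame stream
    encoding_scheme: str    # e.g. "MPEG-2" (illustrative value)
    frame_rate: float       # frames per second

@dataclass
class MediaData:
    """Characteristics of the video content."""
    # e.g. color histogram per key frame, keyed by scene id (assumed layout)
    visual_features: Dict[str, List[float]] = field(default_factory=dict)
    scene_structure: List[str] = field(default_factory=list)  # ordered scene ids
    spatio_temporal: Dict[str, object] = field(default_factory=dict)  # e.g. trajectories

@dataclass
class SemanticData:
    """Mapping between media data and high-level meanings."""
    # scene id -> text annotations or concept labels
    annotations: Dict[str, List[str]] = field(default_factory=dict)

@dataclass
class Video:
    meta: MetaData
    media: MediaData
    semantics: SemanticData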

Video meta data is created independently of how its contents are later described and how its database structure is organized. It is thus natural to define "video" and other meaningful constructs such as "scene" and "frame" as objects corresponding to their inherent semantic and visual contents. Meaningful video scenes are identified and associated with their description data incrementally. However, the semantic gap between the user's understanding and the video content remains a major problem: depending on the user's viewpoint, the same video or scene may be given different descriptions or interpretations. It is therefore extremely difficult (if not impossible) to describe the whole content of a video, owing to the difficulty of automatically detecting salient features from, and interpreting, the visual content.

In order to develop an effective web-based video retrieval system, one should go beyond the traditional query-based or purely content-based retrieval (CBR) paradigm. Our standpoint is that videos are multi-faceted data objects, and an effective retrieval system should accommodate all of the above complementary information descriptors and bridge the gap between them when retrieving video. In this chapter we therefore advocate a web-based hybrid approach to video retrieval that integrates the query-based (database) approach with the CBR paradigm.
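The hybrid idea can be illustrated by a simple scoring function that filters on semantic annotations while ranking by visual similarity. This is only a minimal sketch of the general principle, not the actual VideoMAP*/CAROL/ST implementation; the weighting scheme, the histogram-intersection measure and the Video objects (as sketched earlier) are all our illustrative assumptions.

import numpy as np

def histogram_intersection(h1, h2):
    """Normalized histogram intersection between two feature vectors."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return float(np.minimum(h1, h2).sum() / max(h1.sum(), 1e-9))

def hybrid_retrieve(videos, concept, query_histogram, alpha=0.5, top_k=10):
    """Rank videos by a weighted mix of semantic match and visual similarity.

    `concept` is a query term matched against semantic annotations;
    `query_histogram` is compared against each video's key-frame features.
    """
    scored = []
    for v in videos:
        # Semantic score: 1.0 if any scene carries the queried concept.
        sem = 1.0 if any(concept in tags
                         for tags in v.semantics.annotations.values()) else 0.0
        # Content score: best key-frame similarity to the query example.
        cbr = max((histogram_intersection(query_histogram, h)
                   for h in v.media.visual_features.values()), default=0.0)
        scored.append((alpha * sem + (1 - alpha) * cbr, v))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]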

1.1 Related Work

Historically, text-based information retrieval and content-based image retrieval have been two separate topics. The former relies on manual annotation/specification of the high-level semantics of images, while the latter requires accurate extraction and recognition of low-level image features. Little research has focused on integrating these two classes of information, even though practical applications demonstrate its importance; this aspect is now fast becoming an area of intensive research. Recently, Tang, Hanka and Ip [17] provided an overall architectural framework for combining iconic and semantic content for image retrieval and developed techniques for extracting semantic information from images in the medical domain. Simone and Ramesh [13] attempted to integrate both text and visual features for image retrieval: they used labels attached to HTML files as the textual information and developed several search engines that analyze low-level image features separately. As the labels extracted from the HTML files may not be relevant to the images, information derived from these two sources is not necessarily consistent. To reduce this potential semantic inconsistency and the time-consuming task of manual textual annotation of images, and to support semantic retrieval of media content, recent work has focused on automatic extraction of semantic information directly from images [15, 19].

In the context of videos, previous research focused on video structuring, such as shot detection and key-frame extraction based on visual features in the video. Its application areas have extended from commercial video databases to home videos (e.g., [9]). Most practical systems, however, annotate videos with textual information manually. For example, in OVID [12], each video object has a set of manually annotated attributes and attribute values describing its content. In [1], both the video objects and their spatial relationships are manually annotated in order to support complex spatial queries.

In order to index video automatically, not only the visual content but also the audio content should be used (see [10,18,21]). Recent work has also proposed the use of closed captions and other collateral textual information. Ip and Chan combined lecture scripts with video text analysis to achieve automatic shot segmentation and indexing of lecture videos [7,8]. The approach assumes a correspondence between text-script segments and video segments, and can be applied to other types of structured video with collateral text, such as news videos. An obvious trend in video indexing and retrieval is to use media features derived from a variety of sources, such as audio, closed captions and textual scripts, rather than visual features alone. Another trend is towards web-enabled/based video management systems.
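The script-to-video correspondence assumption can be made concrete with a simple temporal-overlap alignment, sketched below. The actual technique of Ip and Chan [7,8] is more involved; this fragment merely assumes time-stamped script segments and detected shot boundaries, both of which are hypothetical inputs here.

def align_script_to_shots(script_segments, shots):
    """Attach each script segment to the shots it overlaps in time.

    `script_segments`: list of (start_sec, end_sec, text) tuples.
    `shots`: list of (start_sec, end_sec) tuples from shot detection.
    Returns one annotation string per shot.
    """
    annotations = ["" for _ in shots]
    for seg_start, seg_end, text in script_segments:
        for i, (shot_start, shot_end) in enumerate(shots):
            # Any temporal overlap links the script text to the shot.
            if seg_start < shot_end and seg_end > shot_start:
                annotations[i] = (annotations[i] + " " + text).strip()
    return annotations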

Over the last few years we have been developing a generic Video Management and Application Processing (VideoMAP) framework [2,3,20]. A central component of VideoMAP is a query-based video retrieval mechanism called CAROL/ST, which supports spatio-temporal queries [2,3]. While the original CAROL/ST was designed to work with video semantic data based on an extended object-oriented approach, it provided little support for video retrieval using visual features. To build a more effective video retrieval system through a hybrid approach, we have been extending the VideoMAP framework, and particularly the CAROL/ST mechanism [5]. At the same time, we have developed a framework for extracting and combining iconic and semantic features for image retrieval [15,17]; the underlying approach to semantic extraction and automatic image annotation can be extended to video data.
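CAROL/ST's actual query syntax is described in [2,3]; the fragment below only illustrates, in Python, the kind of spatio-temporal predicate such queries evaluate. The trajectory representation (frame number mapped to centroid coordinates) and the predicate itself are hypothetical examples, not CAROL/ST constructs.

def left_of_during(traj_a, traj_b, interval):
    """Check whether object A stays left of object B throughout an interval.

    `traj_a`, `traj_b`: dicts mapping frame number -> (x, y) centroid.
    `interval`: (first_frame, last_frame) pair.
    """
    first, last = interval
    frames = [f for f in traj_a if first <= f <= last and f in traj_b]
    # True only if the two objects co-occur and A's x-coordinate is
    # smaller than B's on every shared frame in the interval.
    return bool(frames) and all(traj_a[f][0] < traj_b[f][0] for f in frames)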

In this chapter, we present our web-based version of VideoMAP (termed VideoMAP*) and discuss the main issues involved in developing such a web-based video database management system supporting hybrid retrieval. Furthermore, we present an approach to extracting and representing semantic information from video data in support of hybrid retrieval.

1.2 Chapter Organization

The rest of this chapter is organized as follows. In section 2 we discuss the philosophical and technical issues in defining and extracting semantics from video key frames and collateral texts such as video scripts, and present a representation scheme for video semantics that can support semantic query processing. Section 3 reviews the main extensions we have made to VideoMAP, the result of which is a comprehensive video database management system supporting both CBR and query-based retrieval methods. In section 4 we detail the VideoMAP* framework and its web-based hybrid approach to video retrieval. The specific user interface facilities are described in section 5, where example queries are also given to illustrate the expressive power of this mechanism. Finally, we conclude the chapter and offer further research directions in section 6.



