2. Semantics in Text and Images Extracted From Video Data

Locating and representing the semantic content of video data is the key to enabling intelligent queries. This section presents the methodology for knowledge elicitation and, in particular, introduces a preliminary semantic representation scheme that bridges the information gap not only between different media but also between different levels of content within the media.

Semantics is always desirable in intelligent information systems. However, extracting semantics from video data is not the only difficulty: merely defining an appropriate set of semantics already presents a significant challenge. Semantics is closely related to contextual information. There are several levels of semantics-bearing units in a video system. Roughly, these units can be divided into two levels: static semantics and dynamic semantics. Static semantics is defined in the possible data sources that may appear in the application, while dynamic semantics is generated according to the application context. In fact, the generation of dynamic semantics is based on static semantics, and dynamic semantics normally contains high-level information about a wider range of data.

In our work, we focus on two specific elements of the video content: the textual script that accompanies a video and the image content of key frames. Text itself is not the semantics of the video content; rather, it is one possible pointer to the video semantics. Text analysis opens a window onto the video semantics, and the information contained in the text should be integrated with the visual content. Generally, the static semantics in textual form is represented by the following:

Meanings in words and phrases.
Collocations between words and phrases.
The relationships between words and phrases.
The concept space.

Traditionally, most of the above information is represented in dictionaries and thesauri, or contained within a statistical corpus. Dynamic semantics is generated when the information analysis system makes use of the static semantics within the current context while analyzing text. Disambiguation processing paves the way for the rest of the analysis. Examples of dynamic semantics are information about sentences and paragraphs, indexing contents for full-text documents, abstracts or summaries of full texts, etc. At the representation level, a semantic network can be deployed for general purposes.

Compared with textual semantics extraction, visual semantics extraction poses even more challenges, both because of the complexity of the data and because the techniques of computational visual perception are still far from mature. In past years, researchers have mainly focused on techniques at the perceptual level, in other words, on the primitive iconic features that can be extracted from key frames or frame sequences, and have developed techniques for detecting texture, colour, shapes, spatial relationships, motion fields, etc. However, high-level semantic analysis and interpretation are more desirable for complicated queries. In fact, traditional content-based methods and semantic analysis are both indispensable and complementary to each other; they should and can be combined to represent the meaning that the underlying visual data depicts. This requires efforts to bridge the gap between the low-level syntactic/iconic features that can be automatically detected by conventional image processing tools and the high-level semantics that captures the meaning of visual content at different conceptual levels.

2.1 Processes of Semantics Extraction

2.1.1 Textual Semantic Feature Extraction

Text in video can be found in the form of the video script, video segment annotations or a textual description of the video. Text semantics extraction is based on a large-scale comprehensive dictionary, a complete control rule system and a concept space. The dictionary stores static syntax and semantics about words and phrases, which is the basic information for text segmentation and the rest of the text analysis. Each lexical entry contains information about its category, its classification point in the concept space, its inflection, its frequency and its possible visual attributes. For example, the word "roundish" implies the shape of an object, so "shape" is taken as the visual information in the entry. The rule system describes the possible collocation situations of different words and phrases in a text and which information should be obtained in each situation; at the same time, it suggests possible solutions for the processing. The collocation situations are described based on collocation statistics that capture the relationships between concepts, words and phrases, or even sentences and pieces of text.
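As an illustration only, a lexical entry of the kind described above could be modelled as follows. This is a minimal sketch: the class name LexicalEntry, its field names, the DICTIONARY table and the numeric values are assumptions made for the example and are not taken from the actual dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One dictionary entry; all field names are illustrative assumptions."""
    word: str
    category: str                 # syntactic category, e.g. "adjective"
    concept_id: int               # classification point in the concept space
    inflections: list[str] = field(default_factory=list)
    frequency: int = 0            # corpus frequency
    visual_attributes: list[str] = field(default_factory=list)


# Toy dictionary; the concept id and the frequency are made-up values.
DICTIONARY = {
    "roundish": LexicalEntry(
        word="roundish",
        category="adjective",
        concept_id=1024,
        frequency=12,
        visual_attributes=["shape"],
    ),
}


def visual_attributes_of(word: str) -> list[str]:
    """Return the visual attributes a word implies, or [] if it is unknown."""
    entry = DICTIONARY.get(word.lower())
    return entry.visual_attributes if entry else []
```

With this toy table, looking up "roundish" yields the visual attribute "shape", which is the link between the textual entry and the image features.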

A semantic parser takes control of all this information and generates dynamic semantics for unknown input text. It executes the overall analysis strategy: it controls the pace of the analysis, decides when to execute which kind of analysis and which rule base to call, and invokes the algorithms that are suitable for each analysis. In short, it decides how to approach the analysis of a piece of text. The analysis procedure includes automatic segmentation, ambiguity processing, morphological analysis, syntactic analysis, semantic analysis and complex text context processing. Within our framework, the result is a semantic network, Papillon, which is discussed later in section 2.3.
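The control role of the parser can be pictured as a staged pipeline. The sketch below is a deliberately simplified assumption: it keeps only three toy stages (segmentation, morphological analysis and a semantic step), omits ambiguity processing, syntactic analysis and context processing, and uses a made-up concept list. The names State, PIPELINE and parse are hypothetical.

```python
from typing import Callable

State = dict  # assumed container for intermediate analysis results


def segment(state: State) -> State:
    """Toy segmentation: split on whitespace and strip punctuation."""
    state["tokens"] = [t.strip(".,;:!?").lower() for t in state["text"].split()]
    return state


def morphology(state: State) -> State:
    """Toy morphological analysis: strip a final 's' from longer tokens."""
    state["lemmas"] = [
        t[:-1] if len(t) > 3 and t.endswith("s") else t for t in state["tokens"]
    ]
    return state


def semantic_analysis(state: State) -> State:
    """Toy semantic step: keep lemmas found in a tiny, made-up concept list."""
    known_concepts = {"car", "road", "tree"}
    state["concepts"] = [lemma for lemma in state["lemmas"] if lemma in known_concepts]
    return state


# The controller fixes which analyses run and in what order.
PIPELINE: list[Callable[[State], State]] = [segment, morphology, semantic_analysis]


def parse(text: str) -> State:
    """Run every stage in order, passing the accumulated state along."""
    state: State = {"text": text}
    for stage in PIPELINE:
        state = stage(state)
    return state


result = parse("Cars pass the trees.")
```

Keeping the stages in a list makes the order of analysis explicit and lets a controller swap or skip stages, which is the kind of decision the text attributes to the semantic parser.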

2.1.2 Semantics Extraction and Semantic Definition for Key Frame Images

The semantics of a video segment can be defined via its key frame images by mapping semantic meanings to low-level image properties of the key frames. The two levels of image content should complement each other and be integrated to support intelligent queries as well as the tasks in an indexing cycle. The first step is to locate the semantics in the images by defining the visual appearances using semantic labels. Then feature extraction techniques are applied and classifiers are trained to extract and classify the low-level image features that have already been associated with semantic labels. This is followed by a process of semantic analysis, which aims to ensure consistency among the semantic labels on different parts of the image by exploiting contextual and high-level domain knowledge. In summary, the video semantics is analyzed through:

Defining the semantic label set, which is associated with a set of typical visual appearances that can be found in the video segment
Building up domain knowledge for interpreting the label set
Designing the visual detectors to recognize and classify the visual appearances that have been associated with the semantic label set
Designing a semantic analyzer that can exploit the domain knowledge and dynamic information that is produced by the visual detectors to
- rectify the results from the visual detectors which are normally not able to provide 100% accuracy for mapping visual content to the set of semantic labels, and
- generate high level semantics for the whole image and represent them in a semantic representation called Papillon that is introduced later.

To this end, a knowledge elicitation subsystem was developed to support the acquisition of domain knowledge.

Semantic Label Definition

In order to associate semantic meaning to different visual appearances found in a video, we partition a key frame image into a number of sub-images. The size and the shape of sub-images can be varied according to their suitability to capture the image features. The sets of sub-images serve two purposes: (a) they form the basic units for image analysis such as texture and colour analyses, as well as for semantic analysis; and (b) they form the basic unit for static semantic definition.
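A minimal sketch of the partitioning step is given below. It assumes a fixed rectangular grid, which is a simplification: as noted above, the size and shape of the sub-images may vary. The function name partition and the representation of an image as a list of pixel rows are assumptions for the example.

```python
def partition(image, tile_h, tile_w):
    """Split a 2-D image (a list of rows) into tiles keyed by (top, left)."""
    height, width = len(image), len(image[0])
    tiles = {}
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Slicing clips automatically at the image border.
            tiles[(top, left)] = [
                row[left:left + tile_w] for row in image[top:top + tile_h]
            ]
    return tiles


# Toy 4 x 6 "image" whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]
tiles = partition(image, tile_h=2, tile_w=3)
```

Each tile can then be passed to texture or colour analysis and can carry its own semantic label, matching the two purposes listed above.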

Semantic Interpretation of Visual Content and the Semantic Label Set

Generally speaking, key frame images and the objects they contain are likely to have different semantics at different scales if we classify the semantics in great detail. Image content at different scales shows different prominent visual features. Even within the same scale, there are still different semantic (descriptive or interpretive) levels. The choice of semantic level depends on what kind of information the research aims to obtain from the images and how the information is to be categorized. Several levels of interpretation can be defined correspondingly for the features that may appear in a key frame image. The features at each level have their respective knowledge held in different knowledge bases, where such knowledge serves the purposes of reasoning, semantic description and annotation generation.

We have implemented a subsystem for knowledge elicitation that allows a user to interactively assign semantic labels to a large subset of the objects in key frame images obtained from a video database. These associations between semantic labels and the sub-images depicting the various visual characteristics of the labels were taken as the ground truth and formed the initial set of training samples for designing the feature extraction and recognition processes. The approach was tested by first giving the visual feature detector a set of training samples containing semantic labels, in the form of a mapping between the labels and the corresponding sub-images. The visual feature detector then executes the related feature extraction routines and assigns semantic labels to other, unknown key frame images.
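The train-then-label cycle can be sketched with a nearest-centroid classifier. This classifier is a stand-in chosen for brevity and is not the detector used in this work; the feature vectors (mean RGB values of a sub-image) and the labels are made-up examples.

```python
import math
from collections import defaultdict


def train(samples):
    """samples: list of (feature_vector, label). Returns label -> centroid."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Ground truth from interactive labelling: (mean R, G, B) of a sub-image, label.
training_samples = [
    ((90, 150, 230), "sky"),
    ((100, 160, 240), "sky"),
    ((60, 170, 70), "grass"),
    ((50, 160, 60), "grass"),
]
centroids = train(training_samples)
label = classify(centroids, (95, 155, 220))  # an unlabelled sub-image
```

The mapping from labels to sub-images supplied by the user plays the role of training_samples here; any supervised classifier could take the place of the nearest-centroid rule.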

For a large-scale database that contains a wide range of videos, no single feature detector is capable of extracting all the salient image features. In our research, a set of visual feature detectors was developed for the texture, colour and shape measurements that have been associated with the semantic labels.

Using the contextual information in the knowledge base, the semantic analyzer examines the labelling results that the visual detectors produce for the images, confirms or refutes them, and explains any unknown regions according to the domain knowledge.
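One simple way to picture the confirm-or-refute step is a neighbourhood consistency check. The sketch below is an assumption for illustration, not the analyzer itself: the table EXPECTED_NEIGHBOURS, the 4-connected neighbourhood and the "at least half of the neighbours must be compatible" rule are all hypothetical choices.

```python
# Hypothetical domain knowledge: which labels are expected next to each label.
EXPECTED_NEIGHBOURS = {
    "sky": {"sky", "cloud", "tree"},
    "cloud": {"sky", "cloud"},
    "tree": {"sky", "tree", "grass"},
    "grass": {"tree", "grass"},
}


def neighbours(grid, r, c):
    """Labels of the 4-connected neighbours of cell (r, c)."""
    cells = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [
        grid[i][j] for i, j in cells
        if 0 <= i < len(grid) and 0 <= j < len(grid[0])
    ]


def review(grid):
    """Confirm a label if at least half of its neighbours are compatible."""
    verdicts = {}
    for r, row in enumerate(grid):
        for c, label in enumerate(row):
            allowed = EXPECTED_NEIGHBOURS.get(label, set())
            around = neighbours(grid, r, c)
            compatible = sum(1 for n in around if n in allowed)
            ok = 2 * compatible >= len(around)
            verdicts[(r, c)] = "confirmed" if ok else "refuted"
    return verdicts


# Detector output for a 3 x 3 grid of sub-images; the centre label is suspect.
detected = [
    ["sky", "sky", "sky"],
    ["sky", "grass", "sky"],
    ["tree", "tree", "tree"],
]
verdicts = review(detected)
refuted = [cell for cell, verdict in verdicts.items() if verdict == "refuted"]
```

In this toy grid only the isolated "grass" label in the centre is refuted, because three of its four neighbours are "sky", which the assumed domain knowledge does not expect next to grass.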

2.2 Content of the Knowledge Base for Visual Contents

Depending on the application domain, each label in the defined set of semantic labels should be described further in the knowledge base, forming the static image semantics. This description includes:

Domain attributes, such as the label's logical or expected location, logical or expected neighbours, etc.
Visual attributes, such as colour, shape, size and quantity, as well as the similarity with any other features and the various relationships among them, including spatial relationships.
Measurement attributes, indicating which detector best suits this feature.
Contextual attributes, e.g., the special attributes a video may have when the semantic labels are combined with some situations or other semantic labels.
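The four attribute groups above could be stored per label along the following lines. This is a sketch under stated assumptions: the class LabelKnowledge, its field names, the KNOWLEDGE_BASE table and every value in it are hypothetical and only mirror the grouping in the list.

```python
from dataclasses import dataclass, field


@dataclass
class LabelKnowledge:
    """Static image semantics for one semantic label (illustrative fields)."""
    label: str
    # Domain attributes
    expected_location: str
    expected_neighbours: set[str]
    # Visual attributes
    visual: dict[str, str]
    # Measurement attribute: name of the best-suited detector
    best_detector: str
    # Contextual attributes: meaning in combination with other labels
    contextual: dict[str, str] = field(default_factory=dict)


# Toy knowledge base with a single made-up entry.
KNOWLEDGE_BASE = {
    "sky": LabelKnowledge(
        label="sky",
        expected_location="top",
        expected_neighbours={"cloud", "tree"},
        visual={"colour": "blue", "shape": "irregular"},
        best_detector="colour",
        contextual={"with:sun": "daytime scene"},
    ),
}


def detector_for(label: str, default: str = "texture") -> str:
    """Look up which detector suits a label; fall back to a default."""
    entry = KNOWLEDGE_BASE.get(label)
    return entry.best_detector if entry else default
```

A lookup such as detector_for("sky") shows how the measurement attribute can route a label to one of several detectors.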

2.3 Papillon - An Intermediate Semantic Representation

We have devised a semantic representation scheme to represent the semantics of both full text and key frame images. We call this scheme Papillon. The content of Papillon comes from either text analysis or image analysis, including its semantic analysis. It contains the static and the generated dynamic semantics of the text or images. More importantly, Papillon focuses on fusing the high-level information inherent in the different media (image and textual information), and serves as a semantic or conceptual representation for them. In a sense, it can be considered an intermediate semantic representation for the two types of video elements. Furthermore, in our application, textual annotations for unknown key frame images can be automatically generated through Papillon.

Papillon consists of a semantic network, which is a graph whose nodes represent concepts and whose arcs represent binary relationships between concepts. The nodes, or concepts, are the objects or features in the image or text together with their descriptions; each node contains all the information and attributes of the entity, including its semantic code and its concept number in the concept space. The arcs express the inherent semantic relationships between concepts that have been derived from the analysis of text and key frame images. Examples of the semantic relationships are agent, analogue, compose of, degree, direction, duration, focus, frequency, goal, instrument, location, manner, modification, object, origin, parallel, possession, quantity, colour, size, shape, produce, reason, equal, reference, result, scope, attribute, time, value, capacity, condition, comparison, consequence. Details of this representation scheme can be found in [16].
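The graph structure described above can be sketched as a small labelled graph. The classes Concept and SemanticNetwork, their methods and the concept numbers are assumptions made for illustration; the actual representation is the one specified in [16].

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node: an object or feature plus its attributes (illustrative)."""
    name: str
    concept_number: int                      # made-up position in the concept space
    attributes: dict = field(default_factory=dict)


class SemanticNetwork:
    """Nodes are concepts; arcs are labelled binary relationships."""

    def __init__(self):
        self.nodes = {}
        self.arcs = []  # list of (source, relation, target) triples

    def add_concept(self, concept):
        self.nodes[concept.name] = concept

    def relate(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both concepts must be added before relating them")
        self.arcs.append((source, relation, target))

    def related(self, source, relation):
        """Targets reachable from source through arcs with the given label."""
        return [t for s, r, t in self.arcs if s == source and r == relation]


net = SemanticNetwork()
for name, number in [("car", 1), ("red", 2), ("road", 3)]:
    net.add_concept(Concept(name, number))
net.relate("car", "colour", "red")      # relation names come from the list above
net.relate("car", "location", "road")
```

Because text analysis and image analysis both produce triples of this form, the same network can hold information from either medium, which is what makes it usable as an intermediate representation.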

Semantic information extracted from the key frames and represented within Papillon can then be incorporated into the object framework for video retrieval, which is described in the following section.