2. Video Analysis for Metadata Extraction

2.1 Content-Based Annotation for Video Metadata

Since the information in videos is quite "raw" and dispersed, content-based access to videos is almost impossible unless some additional information is available. To enable flexible and intelligent access to videos, we need to extract "keywords" that describe the contents of videos. Typical "keywords" useful for content-based access to videos include information on:

  • what/who appears in the video,

  • when the video is broadcast or recorded,

  • where the video is recorded,

  • what the video is about, etc.

Video metadata provide these kinds of additional information. MPEG-7, standardized in part as of 2002, serves as the video metadata standard and can describe a wide variety of content information for videos. Sufficiently in-depth metadata for large volumes of videos are indispensable for content-based access to video archives.

The problem here is how to produce the necessary video metadata for large volumes of videos. A conservative and reliable solution is manual annotation, since a person can easily perceive the contents of videos. On the other hand, some types of information for video metadata are available from the production stage of videos, such as shot boundaries, object contours in shots using chroma key, and video captions. Even so, automatic extraction of video metadata by video analysis remains important, especially in the following cases:

  • providing metadata which are not available at the production stage,

  • providing metadata which would be extremely expensive to produce by manual annotation,

  • providing metadata of live events (almost) instantaneously,

  • more generally, whenever manual labor can be saved, especially in annotating large volumes of video archives.

We focused on the importance of face information in news videos. The primary role of news is to provide information on news topics in terms of 5W1H (who, what, when, where, why, and how), and among these, the "who" information carries the largest share of the information in a news topic. Our goal is to annotate faces appearing in news videos with their corresponding names. This task requires intensive labor when done manually: locating every occurrence of a face in video segments, identifying the faces, and naming them. Instead, we developed Name-It, a system that associates faces and names in news videos automatically by integrating image understanding, natural language processing, and artificial intelligence technologies. As an example of automated extraction of video metadata, we briefly describe the Name-It system in this section.

2.2 Face and Name Association in News Videos

The purpose of Name-It is to associate names and faces in news videos [1]. Potential applications of Name-It include: (1) a news video viewer that interactively provides a text description of a displayed face, (2) a news text browser that provides facial information for names, and (3) automated video annotation generation by naming faces.

To realize the Name-It system, we employ the architecture shown in Figure 27.1. Since we use closed-captioned CNN Headline News as our target, the given news consists of a video portion along with a transcript portion in the form of closed-caption text. From the video images, the system extracts faces of persons who might be mentioned in the transcripts. Meanwhile, from the transcripts, the system extracts words corresponding to persons who might appear in the videos. The system then evaluates the association between the extracted names and faces. Since names and faces are both extracted from videos, they carry additional timing information, i.e., at what time in the videos they appear. The association of names and faces is evaluated with a "co-occurrence" factor using this timing information. Co-occurrence of a name and a face expresses how often and how well the name coincides with the face in the given news video archives. In addition, the system extracts video captions from the video images. Extracted video captions are recognized to obtain text information, which is then used to enhance the quality of face-name association.

Figure 27.1: Architecture of Name-It.

The component technologies used to obtain face, name, and video caption information are state of the art; they are briefly described here, and related approaches are covered elsewhere, including in other chapters of this handbook. These components do not necessarily achieve perfect analysis results, yet properly integrating their results can still yield useful and meaningful content information. We therefore place emphasis on the technique for integrating imperfect analysis results to demonstrate its effectiveness in video analysis. We then describe the relation between Name-It's integration method and mining the multimedia feature space.

2.3 Image Processing

The image processing portion of Name-It extracts faces of persons who might be mentioned in the transcripts. Such faces are typically shown under the following conditions: (a) frontal, (b) close-up, (c) centered, (d) of long duration, and (e) frequently. Given a video as input, the system outputs a two-tuple list for each occurrence of a face: timing information (start ~ end frame) and face identification information. Some of the conditions above are used to generate the list; the others are evaluated later using the information the list provides. The image processing portion also contributes to video caption recognition, which provides rich information for face-name association.

Face Tracking

To extract face sequences from image sequences, Name-It applies face tracking to videos. Face tracking consists of three components: face detection, skin color model extraction, and skin color region tracking.

First, Name-It applies face detection at a fixed frame interval, e.g., every 10 frames. The system uses a neural network-based face detector [2], which detects mostly frontal faces at various sizes and locations. The face detector can also detect eyes; we use only faces in which the eyes are successfully detected, to ensure that the faces are frontal and close-up.

Once a face region is detected in a frame, the system extracts a skin color model of the face: the skin color of the face region is captured as a Gaussian model in (R, G, B) space. The model is applied to the subsequent frames to detect skin candidate regions, and face region tracking continues until a scene change is encountered or no succeeding face region is found.
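
The following is a minimal sketch, in Python with NumPy, of the skin color modeling step described above. The Gaussian model in RGB space follows the text, but the Mahalanobis-distance threshold and the bounding-box interface are illustrative assumptions rather than the parameters actually used by Name-It.

    import numpy as np

    def fit_skin_color_model(frame, box):
        """Fit a Gaussian over the RGB pixels inside a detected face box (x0, y0, x1, y1)."""
        x0, y0, x1, y1 = box
        pixels = frame[y0:y1, x0:x1].reshape(-1, 3).astype(np.float64)
        mean = pixels.mean(axis=0)
        cov = np.cov(pixels, rowvar=False) + 1e-6 * np.eye(3)   # regularize
        return mean, np.linalg.inv(cov)

    def skin_candidate_mask(frame, mean, inv_cov, threshold=9.0):
        """Mark pixels whose Mahalanobis distance to the skin model is small."""
        diff = frame.reshape(-1, 3).astype(np.float64) - mean
        d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
        return (d2 < threshold).reshape(frame.shape[:2])

In a tracking loop, the mask computed for each subsequent frame would be grouped into connected regions, and the region nearest the previous face location would be kept until a scene change or a frame without a plausible skin region is reached.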

Face Identification

To infer the "frequent" occurrence of a face, face identification is necessary. Namely, we need to determine whether one face sequence is identical to another.

To make face identification work effectively, we need to use frontal faces, so the best frontal view is chosen from each face sequence. We first apply face skin region clustering to all detected faces. Then, the center of gravity of the face skin region is calculated and compared with the eye locations to evaluate a frontal factor. The system chooses the face having the largest frontal factor as the most frontal face in the face sequence.
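
A minimal sketch of the frontal-factor idea follows. The text does not give the exact scoring function, so the symmetry-based score below, comparing the skin region's centroid with the midpoint between the eyes, is an illustrative assumption.

    import numpy as np

    def frontal_factor(skin_mask, left_eye, right_eye):
        """Higher when the skin region's centroid lies midway between the detected eyes."""
        ys, xs = np.nonzero(skin_mask)
        centroid_x = xs.mean()
        eyes_mid_x = (left_eye[0] + right_eye[0]) / 2.0
        eye_dist = abs(right_eye[0] - left_eye[0]) + 1e-6
        # 1.0 means the face is horizontally centered on the eye axis.
        return 1.0 - min(1.0, abs(centroid_x - eyes_mid_x) / eye_dist)

The most frontal face of a sequence is then simply the one maximizing this factor over all detected faces in the sequence.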

We use the eigenface-based method for face identification [3]. Each of the most frontal faces is converted into a point in a 16-dimensional eigenface space, and face identification is evaluated by the face distance, i.e., the Euclidean distance between the two corresponding points in the eigenface space.
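
The sketch below illustrates the eigenface-based face distance, assuming face images have already been cropped and resized to a common shape. The 16-dimensional subspace follows the text; the PCA is done here with plain NumPy rather than the actual eigenface implementation used by Name-It.

    import numpy as np

    def build_eigenface_space(face_images, dim=16):
        """PCA over flattened training faces; returns (mean, top eigenvectors)."""
        X = np.stack([f.ravel().astype(np.float64) for f in face_images])
        mean = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:dim]                # dim x D projection basis

    def project(face, mean, basis):
        return basis @ (face.ravel().astype(np.float64) - mean)

    def face_distance(face_a, face_b, mean, basis):
        """Euclidean distance between eigenface coordinates (used for Sf)."""
        return np.linalg.norm(project(face_a, mean, basis) - project(face_b, mean, basis))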

Video Caption Recognition

Video captions are superimposed directly on the image sequences and provide text information. In many cases they are attached to faces and usually give persons' names. Video caption recognition thus provides rich information for face-name association, although captions do not necessarily appear for every face of interest.

To achieve video caption recognition [4], the system first detects text regions in video frames. Several filters, including differential and smoothing filters, are employed for this task, and clusters whose bounding regions satisfy several size constraints are selected as text regions. The detected text regions are then preprocessed to enhance image quality. First, a filter that takes the minimum intensity over consecutive frames is applied; this filter suppresses complicated and moving background while enhancing the characters, because the characters stay at exactly the same position over a sequence of frames. Next, a linear interpolation filter is applied to quadruple the resolution. Finally, template-based character recognition is applied. The current system can recognize only uppercase letters, but it achieved a 76% character recognition rate.
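
A minimal sketch of the two preprocessing steps follows: a pixel-wise minimum over co-located caption regions from consecutive frames (a moving background is suppressed while static, light-colored characters survive), and linear interpolation to increase the resolution. The text says the resolution is quadrupled; the sketch takes this as a factor of 2 per axis, which is an interpretation rather than a stated fact, and the template-matching OCR stage is omitted.

    import numpy as np

    def temporal_min_filter(gray_caption_regions):
        """Pixel-wise minimum over co-located grayscale caption regions from consecutive frames."""
        return np.min(np.stack(gray_caption_regions, axis=0), axis=0)

    def upscale_bilinear(image, scale=2):
        """Simple linear interpolation; scale=2 per axis quadruples the pixel count."""
        h, w = image.shape
        ys = np.linspace(0, h - 1, h * scale)
        xs = np.linspace(0, w - 1, w * scale)
        y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
        y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
        wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
        img = image.astype(np.float64)
        top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
        bottom = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
        return top * (1 - wy) + bottom * wy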

Since character recognition results are not perfect, inexact matching between these results and character strings is essential for utilizing them in face-name association. We extended the edit distance method [5] to cope with this problem. Assume that C is a character recognition result and N is a word. The similarity Sc(C, N) is defined using the edit distance so that it is high when C approximately equals N.
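
The sketch below shows one way to turn an edit distance into a caption similarity. The normalization used here (one minus the distance divided by the longer string's length) is an illustrative choice; the extension of the edit distance described in [5] may weight operations differently.

    def edit_distance(a, b):
        """Classic Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def caption_similarity(C, N):
        """Sc(C, N): close to 1 when the recognized text approximately equals N."""
        C, N = C.upper(), N.upper()     # the OCR stage outputs uppercase letters only
        longest = max(len(C), len(N)) or 1
        return max(0.0, 1.0 - edit_distance(C, N) / longest)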

2.4 Natural Language Processing

The system extracts name candidates from transcripts using natural language processing technologies. The system is expected not only to extract name candidates but also to associate them with scores, where a score represents the likelihood that the associated name candidate appears in the video. To achieve this, a combination of lexical and grammatical analysis and knowledge of news video structure is employed.

First, a dictionary and a parser are used to extract proper nouns as name candidates. A candidate that is the agent of an act such as speaking or attending a meeting obtains a higher score; for this, the parser and a thesaurus are essential. In a typical news video, an anchor person appears first, gives an overview of the news, and mentions the name of the person of interest; the system also uses news structure knowledge of this kind. Several such conditions are employed for score evaluation. The system evaluates these conditions for each word in the transcripts by using a dictionary (the Oxford Advanced Learner's Dictionary [6]), a thesaurus (WordNet [7]), and a parser (the Link Parser [8]). The system then outputs a three-tuple list: a word, timing information (frame), and a normalized score.
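
As a rough illustration of how several weak cues can be combined into normalized scores, the sketch below uses simple rule-based stand-ins. The real system relies on the dictionary, WordNet, and the Link Parser; here the sets `proper_nouns` and `SPEECH_ACT_VERBS`, the score bonuses, and the treatment of the anchor's opening sentences are all hypothetical placeholders introduced only for this sketch.

    SPEECH_ACT_VERBS = {"said", "told", "announced", "met", "visited"}   # assumed list

    def score_name_candidates(sentences, proper_nouns, anchor_intro_sentences=2):
        """Return {(word, sentence_index): normalized_score} for name candidates."""
        raw = {}
        for idx, tokens in enumerate(sentences):
            for pos, word in enumerate(tokens):
                if word not in proper_nouns:
                    continue                         # keep proper nouns only
                score = 1.0
                if any(v in tokens[pos + 1:pos + 4] for v in SPEECH_ACT_VERBS):
                    score += 1.0                     # likely the agent of a speech act
                if idx < anchor_intro_sentences:
                    score += 0.5                     # mentioned in the anchor's overview
                raw[(word, idx)] = max(raw.get((word, idx), 0.0), score)
        total = sum(raw.values()) or 1.0
        return {key: s / total for key, s in raw.items()}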

2.5 Integration of Processing Results

In this section, the algorithm for retrieving face candidates given a name is described. We use a co-occurrence factor to integrate the image and natural language analyses. Let N and F be a name and a face, respectively. The co-occurrence factor C(N, F) is expected to express the degree to which the face F is likely to have the name N. Consider faces Fa and Fb and names Np and Nq, where Fa corresponds to Np. Then C(Np, Fa) should have the largest value among the co-occurrence factors of any combination of Fa with the other names (e.g., C(Nq, Fa)) or of the other faces with Np (e.g., C(Np, Fb)). Retrieval of face candidates given a name is realized as follows using the co-occurrence factor:

  1. Calculate the co-occurrence of the given name with every face candidate.

  2. Sort the co-occurrences in descending order.

  3. Output the faces that correspond to the N largest co-occurrences.

Retrieval of name candidates given a face is realized in the same way.
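
Both retrieval directions reduce to ranking by the co-occurrence factor, as in the sketch below. The function `cooccurrence(name, face)` is assumed here; a possible form is sketched in the next subsection. Faces and names are treated as opaque identifiers.

    def faces_for_name(name, face_candidates, cooccurrence, top_n=4):
        """Rank all candidate faces for a given name by their co-occurrence factor."""
        scored = [(cooccurrence(name, face), face) for face in face_candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_n]

    def names_for_face(face, name_candidates, cooccurrence, top_n=5):
        """The symmetric retrieval: rank candidate names for a given face."""
        scored = [(cooccurrence(name, face), name) for name in name_candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_n]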

Integration by Co-occurrence Calculation

In this section, the co-occurrence factor C(N, F) of a face F and a name N is defined. Figure 27.2 depicts this process. Assume that we have the two-tuple list of face sequences (timing, face identification), the three-tuple list of name candidates (word, timing, score), and the two-tuple list of video captions (timing, recognition result) described above. Note that face sequences and video captions have duration, i.e., start and end times, so a duration function can be defined over them. Also note that a name Nj may occur several times in the video archives, so each occurrence is indexed by k. We define the face similarity between faces Fi and Fj as Sf(Fi, Fj), using the Euclidean distance in the eigenface space. The caption similarity Sc(C, N) between a video caption recognition result C and a word N, and the timing similarity St(ti, tj) between times ti and tj, are also defined: the caption similarity is based on the edit distance, and the timing similarity represents the coincidence of events in time. The co-occurrence factor C(N, F) of the face F and the name candidate N is then defined as follows:

(27.1) [Equation defining the co-occurrence factor C(N, F)]

Figure 27.2: Co-occurrence factor calculation.

Intuitively, the numerator of C(N, F) becomes larger if F is identical to Fi while, at the same time, Fi coincides with N with a large score. To prevent the "anchor person problem" (an anchor person coincides with almost any name; a face or name that coincides with every name or face should correspond to no particular name or face), C(N, F) is normalized by the denominator. wc is the weight factor for caption recognition results: roughly speaking, when a name matches a caption and the caption coincides with a face at the same time, the face is treated as coinciding with wc occurrences of that name. We use 1 for the value of wc.
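
The sketch below computes a co-occurrence factor in the spirit of this description: it combines face similarity Sf, timing similarity St, name-candidate scores, and caption similarity Sc weighted by wc, and normalizes by how strongly the face coincides with any name and how strongly the name coincides with any face. The exact functional form of Equation (27.1) is not reproduced above, so this is an illustrative approximation, not the published definition; Sf, St, and Sc are assumed to return similarities in [0, 1].

    import math

    def cooccurrence(N, F, faces, names, captions, Sf, St, Sc, wc=1.0):
        """
        faces:    list of (t_i, F_i)         -- face-sequence timing and identity
        names:    list of (N_j, t_jk, s_jk)  -- name occurrences with scores
        captions: list of (t_l, C_l)         -- caption timing and OCR result
        Sf, St, Sc are the face, timing, and caption similarities from the text.
        """
        eps = 1e-9
        num = 0.0       # coincidence of F specifically with N
        face_any = 0.0  # coincidence of F with any name  (large for anchor faces)
        name_any = 0.0  # coincidence of N with any face  (large for ubiquitous names)
        for t_i, F_i in faces:
            sim_f = Sf(F, F_i)
            for N_j, t_jk, s_jk in names:
                # A caption that matches the name and coincides with the face
                # counts as wc extra occurrences of that name.
                weight = s_jk * St(t_i, t_jk) + wc * sum(
                    Sc(C_l, N_j) * St(t_i, t_l) for t_l, C_l in captions)
                face_any += sim_f * weight
                if N_j == N:
                    num += sim_f * weight
                    name_any += weight
        return num / math.sqrt((face_any + eps) * (name_any + eps))

The normalization damps both anchor faces (which coincide with almost every name) and names that coincide with almost every face, mirroring the role of the denominator described above.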

2.6 Experiments and Discussions

The Name-It system was first developed on an SGI workstation and now also runs on a Windows PC. We processed 10 CNN Headline News videos (30 minutes each), i.e., a total of 5 hours of video, from which the system extracted 556 face sequences. Name-It performs name candidate retrieval given a face, and face candidate retrieval given a name, over this 5-hour news video archive. In face-to-name retrieval, the system is given a face and outputs name candidates with their co-occurrence factors in descending order. Likewise, in name-to-face retrieval, the system outputs face candidates for a given name with their co-occurrence factors in descending order.

Figure 27.3 (a) through (d) show the results of face-to-name retrieval. Each result shows an image of the given face and the ranked name candidates with their co-occurrence factors; a correct answer is marked with a circled ranking number. Figure 27.3 (e) through (h) show the results of name-to-face retrieval; the top four face candidates are shown from left to right with their corresponding co-occurrence factors. These results demonstrate that Name-It achieves effective face-to-name and name-to-face retrieval on actual news videos.

Figure 27.3: Face and name association results.

Note that some faces are not mentioned in the transcripts but are described only in video captions. Such faces can be named only by incorporating video caption recognition (e.g., Figure 27.3 (d) and (h)). Although these faces are not always the most important ones in terms of news topics, being, so to speak, "next to the most important," video caption recognition clearly enhances the performance of Name-It. The overall accuracy with which the correct answer appears among the top-5 candidates is 33% for face-to-name retrieval and 46% for name-to-face retrieval.

2.7 Name-It as Mining in Feature Space

In the Name-It process, we assume that a frequently coincident face-name pair is likely to be a corresponding face-name pair. A face-name pair is frequently coincident if the face (together with faces identical to it) appears frequently, the name occurs frequently, and the face and the name coincide in many segments of the news video archives. Frequency of occurrence, e.g., of faces, can be regarded as density in a feature space. For example, each face is converted to a point in the eigenface space; by converting all faces in the news video archives to points, the faces form a scatter in the eigenface space, and frequently appearing faces correspond to high-density regions of that space. The same observation holds for names and video captions. For faces, the feature space is the eigenface space, which is a Euclidean space, and similarity between faces is evaluated by Sf. For names, the feature space is discrete, and similarity is evaluated by identity. For video captions, the feature space is a metric space whose similarity Sc is based on the edit distance. Even though these feature spaces have different metric systems, the density of points can be evaluated in each of them based on the similarity (i.e., distance) defined in that space. The co-occurrence factor of a face-name pair corresponding to coincident high-density regions in the face space, name space, and video caption space will take a larger value (see Equation (27.1)). This observation is depicted in Figure 27.4. From this viewpoint, the Name-It process is closely related to mining in the multimedia feature space.
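
The point that density can be evaluated in each space from its own similarity alone can be illustrated with a single generic routine; the averaging kernel below is an illustrative choice, not part of the Name-It definition.

    def similarity_density(item, all_items, similarity):
        """Average similarity of `item` to every other item in the archive."""
        others = [x for x in all_items if x is not item]
        if not others:
            return 0.0
        return sum(similarity(item, x) for x in others) / len(others)

    # The three metrics mentioned in the text would plug in as:
    #   faces:    similarity = Sf                              (eigenface distance)
    #   names:    similarity = lambda a, b: 1.0 if a == b else 0.0   (identity)
    #   captions: similarity = Sc                              (edit-distance based)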

Figure 27.4: Relating Feature Spaces by co-occurrence.



