5. Multimodal Integration Bayesian Engine

The Multimodal Integration Framework (MIF) is the most vital part of the archiving module. The MIF (Figure 45.4) consists of the Unimodal Analysis Engine and the Multimodal Bayesian Engine from the archiving module. This three-layered framework separates the visual, audio, and transcript streams and describes feature extraction at the low, mid, and high levels. In the low-level layer, features are extracted from each of the three domains. At the mid-level layer, identifiable objects and segmentation features are extracted. At the high-level layer, video information is obtained through the integration of mid-level features across the different domains. The Unimodal Analysis Engine extracts features from the low-level layer and produces probabilities for the mid-level features. The Multimodal Bayesian Engine takes the mid-level features and their probabilities and combines them with context information to generate features at the high-level layer.

Figure 45.4: Three-layered Multimodal Integration Framework.

The low-level layer describes signal-processing parameters, as shown in Figure 45.4. In the current implementation, the visual features include color, edge, and shape; the audio features consist of twenty audio parameters [12] derived from average energy, mel-frequency cepstral coefficients, and delta spectrum magnitude; and the transcript features are given by the closed-captioning (CC) text. All the features are typically extracted via signal-processing operations. They result in high-granularity, low-abstraction information [13]. The granularity is high because the information is pixel and/or frame based; the abstraction is low because the features correspond to low-level signal-processing parameters. Within each domain, information can be combined. For example, in the visual domain, color, edge, and shape information can be combined to define whole image regions associated with parts of, or whole, 2-D/3-D objects.

The mid-level features are associated with whole frames or collections of frames and whole image regions. In order to achieve this, information is combined within and/or across each of the three domains. The mid-level features in the current implementation are: (i) visual: keyframes (first frame of a "new" shot), faces, and videotext, (ii) audio: silence, noise, speech, music, speech plus noise, speech plus speech, and speech plus music, and (iii) transcript: keywords and twenty predefined transcript categories. Mid-level feature extraction is described in more detail in section 5.1.

The high-level features describe semantic video information obtained through the integration of mid-level features across the different domains. In the current implementation, Scout classifies segments as either part of a talk show, financial news, or a commercial.

The information obtained through the first two layers represents video content information. By this we mean "objects" such as pixels, frames, time intervals, image regions, video shots, 3-D surfaces, faces, audio sound, melodies, words, or sentences. In contrast, we define video context information. Video context information is multimodal; it is defined for the audio, visual, and/or transcript parts of the video. Video context information denotes the circumstance, situation, or underlying structure of the information being processed. The role of context is to circumscribe or disambiguate video information in order to achieve a precise segmentation, indexing, and classification. This is analogous to the role of linguistic context information [14] [15] in the interpretation of sentences according to their meanings.

Contextual information in multimedia processing circumscribes the application domain and therefore reduces the number of possible interpretations. For example, to classify a TV program as news, sports, or talk show, we can use audio, visual, and/or transcript context information. The majority of news programs are characterized by indoor scenes of head/shoulder images with superimposed videotext and speech (or speech with background music). Sports programs are characterized by frequent camera panning/zooming, outdoor scenes, and speech (announcer) with background noise (public cheering, car engine noise, etc.). This circumstantial information about the underlying structure of TV programs is not evident when using strict video content information in terms of "objects". However, when it is taken together with the video content information, the accuracy of classification tasks can be increased. In Figure 45.4, we indicate the combination of video content and context information with the ⊗ symbol.

Using the MIF we can perform two crucial tasks needed for the retrieval of segments: classification and segmentation. The inferences made at the high-level layer of the MIF correspond to the first task, topic classification. In our benchmarking test we classified financial, celebrity, and TV commercial segments. The MIF produces probability information associated with each of the segments. The second task, segmentation, determines boundaries according to mid-level features. The initial segment boundaries are determined based on the transcript information. These segments are often too short to make sense for users, since they frequently mark a change of speaker, which does not necessarily mean a change of topic. The MIF uses the output of the visual, audio, and transcript engines to accurately determine the boundaries of the segments. Consecutive segments are then merged if they have the same category. The output of the classification, segmentation, and associated indexing information is used in the retrieval module. We will describe multimodal integration in section 5.2.
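The merging of consecutive segments that share a category can be sketched as follows. This is a minimal illustration, not the Scout implementation: the `(start_ms, end_ms, category)` tuple layout and the function name `merge_segments` are assumptions introduced here.

```python
from typing import List, Tuple

# (start_ms, end_ms, category); this layout is an assumption for illustration.
Segment = Tuple[int, int, str]


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Merge runs of consecutive segments that carry the same category."""
    merged: List[Segment] = []
    for start, end, category in segments:
        if merged and merged[-1][2] == category:
            # Extend the previous segment instead of starting a new one.
            prev_start = merged[-1][0]
            merged[-1] = (prev_start, end, category)
        else:
            merged.append((start, end, category))
    return merged
```

Only adjacent segments are merged, so a category that reappears after a different one starts a new segment.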

5.1 Unimodal Analysis Engine

In the visual domain, the Unimodal Analysis Engine (UAE) searches for boundaries between sequential I-frames using the DCT information from MPEG-2 encoded video stream [16]. The UAE then outputs both the uncompressed keyframes and a list detailing each keyframe's probability and its location in the TV program. The keyframe's probability reflects the relative confidence that it represents the beginning of a new shot. The UAE examines these uncompressed keyframes for videotext using an edge-based method [17] and examines the keyframes for faces [18]. It then updates the keyframe list indicating which keyframes appear to have faces and/or videotext. We will describe in more detail the visual analysis in section 5.1.1.

In the audio domain, the UAE extracts low-level acoustic features such as short-time energy, mel-frequency cepstral coefficients, and delta spectrum magnitude coefficients. These were selected after testing a comprehensive set of audio features including bandwidth, pitch, linear prediction coding coefficients, and zero-crossings. The features are extracted from the audio stream (PCM .wav files sampled at 44.1 kHz) frame by frame along the time axis using a sliding window of 20 ms with no overlap. The audio stream is then divided into small, homogeneous signal segments based on the detection of onsets and offsets. Finally, a multidimensional Gaussian maximum a posteriori estimator is adopted to classify each audio segment into one of the seven audio categories: speech, noise, music, silence, speech plus noise, speech plus music, and speech plus speech. We will describe audio analysis in more detail in section 5.1.2.
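The frame-by-frame extraction with a 20 ms non-overlapping window can be sketched as below. The function names are hypothetical, and short-time energy is taken here to be the mean squared amplitude of a frame, which is a common definition but an assumption about the UAE.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], sample_rate: int = 44100,
                 window_ms: int = 20) -> List[Sequence[float]]:
    """Split samples into consecutive non-overlapping windows.

    A trailing partial window is dropped.
    """
    frame_len = sample_rate * window_ms // 1000  # 882 samples at 44.1 kHz
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (assumed definition)."""
    return sum(s * s for s in frame) / len(frame)
```

At 44.1 kHz a 20 ms window holds 882 samples, so one second of audio yields 50 frames.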

The transcript stream is extracted from the closed caption (CC) data found in the MPEG-2 user data field. The UAE generates a timestamp for each line of text from the frame information. These timestamps allow alignment of the extracted CC data with the stored video. In addition, the UAE looks for the single, double, and triple arrows in the closed captions text to identify events such as a change in topic or speaker. We will describe in more detail transcript extraction and analysis in section 5.1.3.

5.1.1 Visual Analysis Engine

Visual analysis is one of the basic and most frequently used ways to extract useful information from video. We analyze the visual signal in order to find important temporal information such as shot cuts and visual objects such as faces and videotext. The extracted features are used later on for multimodal video classification.

Keyframe Extraction

Consecutive video frames are compared to find the abrupt scene changes (hard cuts) or soft transitions (dissolve, fade-in and out). We have experimented with several methods for cut detection. The method that performed the best uses the number of changed macroblocks to measure differences between consecutive frames. We have experimented with 2 hours of U.S. TV broadcast video and video segments from the MPEG-7 evaluation content set. The experiments on the subset of MPEG-7 test set show that we can obtain 66% precision and 80% recall.
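A cut detector based on the number of changed macroblocks, together with the precision and recall measures used to evaluate it, might look like the sketch below. The threshold value, the function names, and the assumption that per-frame change counts are already available are all hypothetical; the text does not state how the counts are thresholded.

```python
from typing import List, Set, Tuple


def detect_cuts(changed_counts: List[int], total_macroblocks: int,
                ratio_threshold: float = 0.6) -> List[int]:
    """Return indices of frames whose fraction of changed macroblocks
    exceeds the threshold.

    changed_counts[i] is the number of macroblocks that differ between
    frame i and frame i - 1. The 0.6 default is an illustrative value.
    """
    return [i for i, count in enumerate(changed_counts)
            if count / total_macroblocks > ratio_threshold]


def precision_recall(detected: Set[int], truth: Set[int]) -> Tuple[float, float]:
    """Precision and recall of detected cut positions against ground truth."""
    hits = len(detected & truth)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```

For a CIF frame (352 by 288 pixels) with 16 by 16 macroblocks, `total_macroblocks` would be 396.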

Text Detection

The goal of videotext detection is to find regions in video frames that correspond to text overlay (superimposed text, e.g., the anchor name) and scene text (e.g., "hotel" signs, street signs). We developed a method for text detection that uses edge information in the input images to detect character regions and text regions [17]. The method achieves 85% recall and precision at CIF resolution.

The first step of our videotext detection method is to separate the color information that will be processed for detecting text: the R frame of an RGB image or the Y component of an image in the YUV format. The frame is enhanced and edge detection is performed. Edge filtering is performed next in order to eliminate frames with too many edges. The remaining edges are then merged to form connected components. If the connected components satisfy size and area restrictions, they are accepted for the next level of processing. If connected components lie within row and column thresholds of other connected components, they are merged to form text boxes. The text boxes are extracted from the original image and local thresholding is performed to obtain text as black on white. This is then passed to an OCR engine to generate text transcripts.
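The step that merges nearby connected components into text boxes can be illustrated with bounding boxes. This is a sketch under stated assumptions: components are represented only by their bounding boxes, the threshold values are invented for the example, and the greedy pairwise merge is one simple strategy rather than the method of [17].

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def gap(a0: int, a1: int, b0: int, b1: int) -> int:
    """Distance between intervals [a0, a1] and [b0, b1]; 0 if they overlap."""
    return max(0, max(a0, b0) - min(a1, b1))


def is_close(a: Box, b: Box, col_thresh: int, row_thresh: int) -> bool:
    """True if the boxes lie within both the column and row thresholds."""
    return (gap(a[0], a[2], b[0], b[2]) <= col_thresh and
            gap(a[1], a[3], b[1], b[3]) <= row_thresh)


def merge_components(boxes: List[Box], col_thresh: int = 10,
                     row_thresh: int = 3) -> List[Box]:
    """Repeatedly merge any two close boxes until no pair is close."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if is_close(boxes[i], boxes[j], col_thresh, row_thresh):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan with the updated box list
    return boxes
```

Characters on the same text line end up in one box, while a component far from the others is left alone.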

Figure 45.5 shows examples of edge images and extracted text areas. The text detection can be further used to string together text detected across multiple frames to generate the text pattern: scrolling, fading, flying, static, etc. Scrolling text could possibly mean ending credits and flying text could mean that the text is a part of a commercial.

Figure 45.5: Edge image and extracted image areas.

Text analysis is related to detecting and characterizing the superimposed text on video frames. We can apply OCR to the detected regions, which results in a transcript of the superimposed text on the screen. This can be used for video annotation, indexing, semantic video analysis, and search. For example, the origin of a broadcast is indicated by a graphic station logo in the top or bottom right-hand corner of the screen. Such a station logo, if present, can be automatically recognized and used as annotation. From the transcript, a look-up database of important TV names and public figures can be created. These names are associated with topics and categories, e.g., Bill Clinton is associated with "president of the U.S.A." and "politics". We will link names to faces if there is a single face detected in an image. Naming can be solved by using a "name" text region under the face that has certain characteristics of text length and height. Names can also be inferred from discourse analysis (e.g., in news, the anchors pass the token to each other: "and Jim now back to you in New York"). Anchor/correspondent names and locations in a news program are often displayed on the screen and can be recognized by extracting the text shown in the bottom one-third of the video frame. Further, musician and music group names in music videos, and hosts, guests, and other TV personalities in talk shows, are introduced and identified in a similar fashion. So, by detecting the text box and recognizing the text, the video can be indexed based on a TV personality or a location. This information can then be used for retrieving news clips based on proper names or locations. For example, if a personal profile indicates a preference for news read by a particular newscaster, information obtained using superimposed text analysis can help in detecting and tagging the news for later retrieval. Sports programs can be indexed by extracting the scores and team or player names.

Face Detection

We used the face detection method described in [18]. The system employs a feature-based top-down scheme. It consists of the following four stages:

Skin-tone region extraction. Through manually labeling the skin-tone pixels of a large set of color images, a distribution graph of skin-tone pixels in YIQ color coordinate is generated. A half ellipse model is used to simulate the distribution and filter skin-tone pixels of a given color image.
Pre-processing. Morphological operations are applied to skin-tone regions to smooth each region, break narrow isthmuses, and remove thin protrusions and small isolated regions.
Face candidate selection. Shape analysis is applied to each region and those with elliptical shapes are accepted as candidates. Iterative partition process based on k-means clustering is applied to rejected regions to decompose them into smaller convex regions and see if more candidates can be found.
Decision making. Possible facial features are extracted and their spatial configuration is checked to decide if the candidate is truly a face.

5.1.2 Audio Analysis Engine

We have observed that audio domain analysis is crucial for multimodal analysis and classification of video segments. The background audio information defines the context of the program. For example, a segment in a news program is distinguished from a comedy program depicting a newscaster (e.g., Saturday Night Live) by the fact that there is background laughter in the comedy program. We perform audio feature extraction, pause detection, segmentation, and classification.

Audio Feature Extraction

Audio Segmentation and Classification

The audio segmentation and classification stage follows the feature extraction stage. The first step is pause detection, which eliminates the silence segments so that further processing is performed only on non-silent segments. The audio stream is divided into small, homogeneous signal segments based on the detection of onsets and offsets. Finally, a multidimensional Gaussian maximum a posteriori estimator is adopted to classify each audio segment into one of the remaining six audio categories: speech, noise, music, speech plus noise, speech plus music, and speech plus speech.
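A Gaussian maximum a posteriori classifier of the kind mentioned above can be sketched as follows. The sketch assumes a diagonal covariance for each class and uses made-up priors, means, and variances; the actual model parameters and covariance structure are not given in the text.

```python
import math
from typing import Dict, List, Tuple

# Per-class model: (prior, means, variances).
# A diagonal covariance is assumed here to keep the example short.
Model = Tuple[float, List[float], List[float]]


def log_posterior(x: List[float], model: Model) -> float:
    """Unnormalized log posterior: log prior plus Gaussian log likelihood."""
    prior, means, variances = model
    log_likelihood = sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )
    return math.log(prior) + log_likelihood


def map_classify(x: List[float], models: Dict[str, Model]) -> str:
    """Return the label whose posterior is largest for feature vector x."""
    return max(models, key=lambda label: log_posterior(x, models[label]))
```

Because the normalizing constant is the same for every class, comparing unnormalized log posteriors is enough to pick the most probable category.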

Figure 45.6: a) Original image, b) Skin tone detection, c) Pre-processing result, d) Candidate selection, e) Decision making, f) Final detection.

5.1.3 Transcript Processing and Analysis

Transcript processing begins with the extraction of closed captions from the video stream. Once the transcript is available, the transcript is analyzed to obtain coarse segmentation and category information for each segment.

Transcript Extraction

Transcripts can be generated using multiple methods, including speech-to-text conversion, closed captions (CC), or third-party program transcription. In the United States, FCC regulations mandate insertion of closed captions in the broadcast. In the analog domain, CCs are inserted in the Vertical Blanking Interval. In the digital domain there are different standards for encoding closed captions. For this paper we used CC from digital broadcast. The transcripts are generated using the CC from either High Definition Television (HDTV) broadcasts or Standard Definition (SD) program streams. For the HDTV broadcasts we extract the CC from MPEG-2 transport streams using the ATSC Standard EIA-608 format on a Philips TriMedia TM1100. The transport stream is demultiplexed and the program stream is selected for decoding. During the decoding process the decoding chip delivers the CC packets to the TriMedia processor. The CC packets are intercepted before reaching the processor. Each packet contains 32 bits of data. This 32-bit word must be parsed into four bytes of information: data type, valid/invalid bit, CC byte 1, and CC byte 2. The EIA-608 packet structure is depicted in Figure 45.7 with the ASCII values shown. If the valid/invalid bit is zero then the third and fourth bytes contain CC characters in EIA-608 format; if the valid/invalid bit is one then these bytes contain the EIA-708 wrapper information. The third and fourth bytes carry the CC data as 7-bit characters plus a parity bit, so the most significant bit must be cleared before extracting the characters. Once the characters are extracted, all the control characters must be identified to find the carriage returns, line feeds, or spaces in order to build words from the characters. The control information is also used to insert the same information into the transcript. A time stamp generated from the frame information is inserted into the transcript for each line of text.
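Parsing one 32-bit CC word into its four fields can be sketched as below. The byte order (most significant byte first) and the names `CCPacket` and `parse_cc_word` are assumptions for illustration; the actual layout is the one shown in Figure 45.7.

```python
from typing import NamedTuple


class CCPacket(NamedTuple):
    data_type: int
    valid_flag: int   # 0: bytes 3 and 4 hold EIA-608 characters (per the text)
    char1: str
    char2: str


def parse_cc_word(word: int) -> CCPacket:
    """Split a 32-bit word into four bytes, most significant byte first.

    The byte order is an assumption made for this sketch.
    """
    data_type = (word >> 24) & 0xFF
    valid_flag = (word >> 16) & 0xFF
    cc1 = (word >> 8) & 0x7F   # mask 0x7F clears the most significant bit
    cc2 = word & 0x7F
    return CCPacket(data_type, valid_flag, chr(cc1), chr(cc2))
```

For example, the bytes 0xC8 and 0xE9 are the characters "H" and "i" with the most significant bit set, and masking recovers the plain 7-bit codes.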

Figure 45.7: EIA-608 Packet Structure with the ASCII values shown.

In our experiments, standard analog US broadcasts are encoded to create the SD MPEG-2 program streams. While encoding, the CC data found on line 21 of these broadcasts are inserted into the user-data field of the MPEG-2 program stream. The current encoding process uses an OptiVison VSTOR-150. This encoder inserts the CC data into the user-data field in a proprietary format rather than the ATSC EIA-608 standard. Consequently, the TriMedia does not recognize the packets and the CC extraction must be done completely in software. First the MPEG-2 header information is read to locate the user-data field. Then the next 32 bits are extracted. To get the first byte of CC data, the first 24 bits are "ANDed" with the ASCII value 127; to get the second byte, the full 32 bits are "ANDed" with this value. This extraction process is shown in Figures 45.8a and 45.8b for the first (CC1) and second (CC2) characters, respectively. Next the words must be assembled and the timestamps generated to create the transcript. This process is similar to that used in the EIA-608 case; however, a different set of control characters is used and these are identified and handled accordingly.
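The two masking operations can be written out as a short sketch. It assumes that "the first 24 bits" means the 24 most significant bits of the 32-bit word, so that the first character sits in the third byte and the second in the fourth; the function name is hypothetical.

```python
from typing import Tuple


def extract_cc_pair(word: int) -> Tuple[str, str]:
    """Extract CC1 and CC2 from a 32-bit word of the proprietary format.

    Assumes the first 24 bits are the most significant ones.
    """
    first_24_bits = (word >> 8) & 0xFFFFFF
    cc1 = first_24_bits & 127   # low 7 bits of the third byte
    cc2 = word & 127            # low 7 bits of the fourth byte
    return chr(cc1), chr(cc2)
```

The two leading bytes do not affect the result, since the mask 127 keeps only the lowest seven bits in each case.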

Figure 45.8: CC1 and CC2 extraction from the proprietary format.

The result of either of the above processes is a complete time-stamped program transcript. The time-stamps are used to align the CC data with the related portion of the program. The transcripts also contain all the standard ASCII printable characters (ASCII value 32–127) that are not used as control characters. This means that the transcript then contains the standard single (">"), double (">>"), and triple arrows (">>>") that can be used to identify specific portions of the transcript, for example the change in topic or speaker. Figure 45.9 shows an example of an extracted closed caption text. The first column is the time stamp in milliseconds relative to the beginning of the program and the second is the closed captions extracted from the input video.

29863 an earlier decision and voted
31264 to require background checks
32567 to buy guns at gun shows.
34970 But did the senate really make
36141 the gun laws tougher?
38205 No one seems to agree on that.
40549 linda douglass is
41549 on capitol hill.
43613 >>Reporter:The senate debate
44815 dissolved into a frenzy

Figure 45.9: Sample extracted closed caption text.
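A transcript in the two-column layout of Figure 45.9 can be read back with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the timestamp and caption text are separated by a single space.

```python
from typing import List, Tuple


def parse_transcript(lines: List[str]) -> List[Tuple[int, str]]:
    """Turn 'timestamp text' lines into (milliseconds, text) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        stamp, _, text = line.partition(" ")
        entries.append((int(stamp), text))
    return entries


def speaker_changes(entries: List[Tuple[int, str]]) -> List[int]:
    """Timestamps of lines that start with a double arrow (not a triple one)."""
    return [ms for ms, text in entries
            if text.startswith(">>") and not text.startswith(">>>")]
```

Applied to the sample above, the only line flagged is the one at 43613 ms, where the reporter begins speaking.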

Transcript Analysis

Figure 45.10 displays an overview of the transcript analysis. The UAE begins by extracting a high-level table of contents (summary) using known cues for the program's genre. The knowledge database and the temporal database embody the domain knowledge and include the temporal sequence of the cues [19]. For example: when analyzing a ski race, the system attempts to create segments for each racer. The phrases "now at the gate" or "contestant number" can be searched for in the transcript to index when a contestant is starting. Scout also employs a database of categories with the associated keywords/key-phrases for different topics. This database helps find the main topic for a given segment. Finally, the non-stop words (generally nouns and verbs) in each segment are indexed.
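Finding the main topic of a segment from a database of categories and keywords can be sketched as a simple count of keyword occurrences. The scoring rule, the function name, and the example categories are assumptions for illustration; the text does not say how Scout weighs keyword matches.

```python
import re
from typing import Dict, List, Optional


def main_topic(segment_text: str,
               categories: Dict[str, List[str]]) -> Optional[str]:
    """Return the category with the most keyword hits, or None if no hits."""
    text = segment_text.lower()
    scores = {
        name: sum(len(re.findall(r"\b" + re.escape(kw.lower()) + r"\b", text))
                  for kw in keywords)
        for name, keywords in categories.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The same whole-word search could be used to locate cue phrases such as "now at the gate" when building the table of contents.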

Figure 45.10: Story segmentation and summarization.

5.2 Multimodal Bayesian Engine

The Bayesian Engine (BE), a probabilistic framework, performs multimodal integration [13]. We chose to use a probabilistic framework for the following reasons: (i) it allows for the precise handling of certainty/uncertainty, (ii) it describes a general method for the integration of information across modalities, and (iii) it has the power to perform recursive updating of information. The probabilistic framework we use is a combination of Bayesian networks [20] with hierarchical priors [21].

Bayesian networks are directed acyclic graphs (DAGs) in which the nodes correspond to (stochastic) variables. The arcs describe a direct causal relationship between the linked variables. The strength of these links is given by conditional probability distributions (cpds). More formally, let the set Ω = {x₁, …, x_N} of N variables define a directed acyclic graph (DAG). For each variable x_i, i = 1, …, N, there exists a subset of variables of Ω, Π_i, the parent set of x_i, describing the predecessors of x_i in the DAG, such that

P(x_i \mid x_1, \ldots, x_{i-1}) = P(x_i \mid \Pi_i)    (45.1)

where P(· | ·) is a strictly positive cpd. Now, the chain rule [23] tells us that the joint probability density function (jpdf) P(x₁, …, x_N) is decomposed as:

P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1}) = \prod_{i=1}^{N} P(x_i \mid \Pi_i)    (45.2)

According to this, the parent set Π_i has the property that x_i and {x₁, …, x_{i-1}} \ Π_i are conditionally independent given Π_i.
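The factorization of the joint distribution into one conditional distribution per node, each conditioned only on that node's parents, can be checked numerically on a toy network. The three-node chain x1 -> x2 -> x3 and all probability values below are invented for illustration and are unrelated to the MIF.

```python
from itertools import product

# Hypothetical binary network x1 -> x2 -> x3 with strictly positive cpds.
p_x1 = {1: 0.3, 0: 0.7}
# Keys are (child value, parent value).
p_x2_given_x1 = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}
p_x3_given_x2 = {(1, 1): 0.6, (0, 1): 0.4, (1, 0): 0.05, (0, 0): 0.95}


def joint(x1: int, x2: int, x3: int) -> float:
    """Joint probability as the product of each node's cpd given its parents."""
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x2[(x3, x2)]


# Summing the joint over all eight configurations should give 1.
total = sum(joint(*values) for values in product((0, 1), repeat=3))
```

Here the parent set of x3 is {x2}, so x3 is conditionally independent of x1 given x2, and the joint needs only three small tables instead of one table with eight entries.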

In Figure 45.4 we showed a conceptual realization of the MIF. As explained at the beginning of section 5, the MIF is divided into three layers: low, mid, and high-level. These layers are made up of nodes (shown as filled circles) and arrows connecting the nodes. Each node and arrow is associated with a probability distribution. Each node's probability indicates how likely the property represented by that node is to occur. For example, the probability associated with the keyframe node represents the likelihood that a given frame is a cut. The probability for each arrow indicates how likely it is that two nodes (one from each layer) are related. Taken together, the nodes and arrows constitute a directed acyclic graph (DAG). When combined with the probability distributions, the DAG describes a Bayesian network.

The low-level layer encompasses low-level attributes for the visual (color, edges, shape), audio (20 different signal processing parameters), and transcript (CC text) domains, one for each node. In the process of extracting these attributes, probabilities are generated for each node. The mid-level layer encompasses nodes representing mid-level information in the visual (faces, keyframes, video text), audio (7 mid-level audio categories), and transcript (20 mid-level CC categories) domains. These nodes are causally related to nodes in the low-level layer by the arrows. Finally, the high-level layer represents the outcome of high-level (semantic) inferences about TV program topics.

In the actual implementation of MIF, we used the mid- and high-level layers to perform high-level inferences. This means we compute and use probabilities only for these two layers. The nodes in the mid-level layer represent video content information, while the nodes of the high-level layer represent the results of inferences.

Figure 45.4 shows a ⊗ symbol between the mid- and high-level layers. This indicates where multimodal context information is input. We associate a probability with each item of multimodal context information. The multimodal context information generates circumstantial structural information about the video. This is modeled via hierarchical priors [21], a probabilistic method of integrating this circumstantial information with that generated by Bayesian networks. For example, a TV talk show contains a lot of speech, noise (clapping, laughter, etc.), background music (with or without speech), and faces, and it has a very structured story line: typically, each talk show has a host and 2 or 3 guests.

In Video Scout, we discover context patterns by processing multiple TV programs. For example, by examining the mid-level audio features of several financial news programs, Scout learned that financial news has a high probability of speech and speech plus music. In the visual domain, Scout looks for context patterns associated with keyframe rate (cut rate) and the co-occurrence of faces and videotext. In the audio domain, Scout looks for context patterns based on the relative probability of six of the seven mid-level audio features. The silence category was not used. In the textual domain, the context information is given by the different categories that are related to talk show and financial news programs.

Initial segmentation and indexing is done using closed caption data to divide the video into program and commercial segments. Next the closed captions of the program segments are analyzed for single, double, and triple arrows. Double arrows indicate a speaker change. The system marks text between successive double arrows with a start and end time in order to use it as an atomic closed captions unit. Scout uses these units as the indexing building blocks. In order to determine a segment's context (whether it is financial news or a talk show) Scout computes two joint probabilities. These are defined as:

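p-FIN-TOPIC = p-AUDIO-FIN × p-CC-FIN × p-FACETEXT-FIN    (45.3)

p-TALK-TOPIC = p-AUDIO-TALK × p-CC-TALK × p-FACETEXT-TALK    (45.4)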

The audio probabilities p-AUDIO-FIN for financial news and p-AUDIO-TALK for talk shows are created by combining the individual audio category probabilities. The closed captions probabilities p-CC-FIN for financial news and p-CC-TALK for talk shows are chosen as the largest probability out of the list of twenty probabilities. The face and videotext probabilities p-FACETEXT-FIN and p-FACETEXT-TALK are obtained by comparing the face and videotext probabilities p-FACE and p-TEXT, which give, for each individual closed caption unit, the probability of face and text occurrence. One heuristic used builds on the fact that talk shows are dominated by faces while financial news has both faces and text. The high-level indexing is done on each closed captions unit by computing a new pair of probabilities: p-FIN-TOPIC and p-TALK-TOPIC. The highest value dictates the classification of the segment as either financial news or talk show.
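The final decision for one closed captions unit can be sketched as follows. The sketch assumes that each topic probability is the plain product of its audio, closed captions, and face/videotext probabilities, which treats the three modalities as independent; this product form, the function names, and the numbers in the usage note are assumptions for illustration.

```python
from typing import List


def topic_probability(p_audio: float, cc_category_probs: List[float],
                      p_facetext: float) -> float:
    """Combine the per-modality probabilities for one topic.

    The closed captions probability is the largest of the category
    probabilities; the product form is an assumption of this sketch.
    """
    p_cc = max(cc_category_probs)
    return p_audio * p_cc * p_facetext


def classify_unit(p_fin_topic: float, p_talk_topic: float) -> str:
    """Pick the topic with the higher probability (ties go to financial news)."""
    return "financial news" if p_fin_topic >= p_talk_topic else "talk show"
```

For instance, with an audio probability of 0.8, a best caption category probability of 0.6, and a face/videotext probability of 0.7, the financial news score is 0.336, which would win against a talk show score of 0.108.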