2. Multi-Modal Processing for Clip Segmentation

2.1 Problem Definition

The goal of the clip segmentation algorithm is to break long, multi-topic video programs into smaller segments, each corresponding to a single topic. A problem arises, however, because the notion of a topic is ill-defined. Consider, for example, retrieving a news program that aired on the day that John Glenn returned to space aboard the space shuttle as the oldest astronaut in history. Suppose that the news program covered the launch and several background stories, including one about John Glenn's first trip to space, one about his career as a U.S. Senator, and so on. If a user requests all video related to John Glenn, then it would be correct to include all three segments, whereas if they specified that they were interested in John Glenn's politics, the correct system response would be to return the much shorter segment confined to that topic. Further, the desired topic might be "John Glenn's entrance into politics," in which case the correct segment is shorter still.

2.2 A Two-Phased Approach

Given that it cannot be determined a priori for which topics users will query the system, we cannot define a single, static topic segmentation for each video stored in the database. Instead, we use a two-phase scheme that combines both preprocessing (per content) and dynamic (per query) processing. The preprocessing segmentation is done once, as a particular video program is inserted into the database. In contrast, the dynamic phase occurs at query time and uses the query terms as inputs. The two phases are independent, so that in different applications, or for different content types within an application, different optimized algorithms may be employed for the first phase. By storing the results of the phase-one processing in a persistent data store in a standard extensible markup language (XML) format, this approach also supports applications where metadata indicating the topic segmentation accompanies the content. For example, in a fully digital production process the edit decision list as well as higher-level metadata may be recorded and maintained with the content data. Alternatively, the topic segmentation may be entered manually in a post-production operation. The second phase of the processing may be viewed as dynamically (at query time) splitting or merging segments from the first phase. As explained further below, the second phase also has the property that reasonable system performance is achieved even in the absence of story segmentation from phase one. This may be necessary because multimodal clip segmentation algorithms are typically domain specific and do not generalize well to genres other than those for which they are designed.
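As a rough illustration of how phase-one results might be persisted, the following Python sketch writes story segments to an XML file. The element and attribute names and the write_segmentation helper are hypothetical; the chapter specifies only that an XML format is used.

```python
# A minimal sketch of persisting phase-one story segments as XML.
# The schema (element and attribute names) below is hypothetical.
import xml.etree.ElementTree as ET

def write_segmentation(program_id, segments, path):
    """segments: list of (start_seconds, end_seconds, keywords) tuples."""
    root = ET.Element("storySegmentation", programId=program_id)
    for start, end, keywords in segments:
        seg = ET.SubElement(root, "segment",
                            start=f"{start:.1f}", end=f"{end:.1f}")
        ET.SubElement(seg, "keywords").text = " ".join(keywords)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Example: two hypothetical stories from a news program.
write_segmentation("nightly-news-2001-06-07",
                   [(0.0, 185.0, ["glenn", "shuttle"]),
                    (185.0, 410.0, ["senate", "election"])],
                   "segmentation.xml")
```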

2.3 Content Preprocessing

We first describe general-purpose (largely genre-independent) media preprocessing that is necessary for subsequent processing stages or for content representation in multiple applications.

2.3.1 Video Segmentation

In comparison to unimodal information sources that are limited to textual or auditory information, multimodal information sources, such as video programs, provide a much richer environment for the identification of topic/story boundaries. The information contained in the video frames can be combined with information from the text and audio to detect story transition points more accurately. Such transition points in the video can be detected using video segmentation techniques.

Video segmentation is the process of partitioning the video program into segments (scenes) with similar visual content. Once individual scenes have been identified, one or more images are selected from the large number of frames comprising each scene to represent the visual information that is contained in the scene. The set of representative images is often used to provide a visual index for navigating the video program [30], or is combined with other information sources, such as text, to generate compact representations of the video content [24]. In the context of multimodal story segmentation, the transition points are used to detect and adjust the story boundaries, while the representative images are used to generate visual representations, or storyboards for individual stories.

Ideally the "similarity" measures that are used in the segmentation process should be based on high-level semantic information extracted from the video frames. Such high-level information about the visual contents of the scene would go a long way towards extracting video clips that are relevant to a particular topic. However, most of the algorithms that possess sufficient computational efficiency to be applicable to large volumes of video information are only concerned with the extraction of information pertaining to the structure of the video program. At the lowest level, such algorithms detect well-pronounced, sudden changes in the visual contents of the video frames that are indicative of editing cuts in the video. For many informational video programs (e.g., television news broadcasts) the beginning of a new story or topic is often coincident with a transition in the video content. Therefore, video segmentation can serve as a powerful mechanism for providing candidate points for topic/story boundaries. Video programs also include other transitions that involve gradual changes in the visual contents of the scene. These can be put into two categories. The first category consists of the gradual transitions that are inserted during the video editing process and include fade, dissolve, and a large number of digital transitions. The second category consists of the changes in the image contents due to camera operations such as pan, tilt, and zoom.

Gradual transitions, such as fade, are often systematically used to transition between main programming and other segments (e.g., commercials) that are inserted within the program. Therefore, detection and classification of these transitions would help isolate such segments. The detection of visual changes induced by camera operations further divides the program into segments from which representative images are taken. The transition points that are generated by this process are usually not coincident with story boundaries, and therefore do not contribute to the story segmentation process. However, the representative images retained by this process generate a more complete representation of the visual contents of the program from which a better visual presentation can be generated.

Many algorithms for shot boundary detection have been reported in the literature [1][8][21][23]. We use a method similar to the one reported in [23] that combines the detection of abrupt and gradual transitions between video shots with the detection of intra-shot changes in the visual contents resulting from camera motions to detect and classify the shot transitions. The method also performs a content-based sampling of the video frames. The subset of frames selected by this process is used to generate the pictorial representation of the stories. The sampling process also serves as a data-reduction process to provide a small subset of video frames on which more computationally expensive processing algorithms (e.g., face detection) operate.
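The full method of [23] handles abrupt cuts, gradual transitions, and camera motion; the following Python sketch shows only the simplest ingredient, abrupt-cut detection from color-histogram differences between consecutive frames. The threshold and histogram parameters are illustrative, and OpenCV is assumed to be available.

```python
# A simplified sketch of abrupt shot-cut detection using color-histogram
# differences between consecutive frames (not the full method of [23]).
import cv2

def detect_cuts(video_path, threshold=0.4):
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance: near 0 for similar frames, larger at cuts.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                cuts.append(frame_idx)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return cuts
```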

2.3.2 Video OCR

If the system can identify video frames containing text regions, it can provide additional metadata for the query mechanism to take into account. Using this type of text as part of the query search mechanism is not well studied, but it can be used to improve the clip selection process. The general process would be as follows [10]:

  • Identify video frames containing probable text regions, in part through horizontal differential filters with binary thresholding (a simplified sketch of this step appears after the list),

  • Filter the probable text region across multiple video frames where that region is identified as containing approximately the same data. This should improve the quality of the image used as input for OCR processing,

  • Use OCR software to process the filtered image into text.
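The sketch below illustrates the first step in this list, flagging frames that probably contain text using a horizontal differential filter and binary thresholding. The window sizes and thresholds are illustrative and are not taken from [10].

```python
# A rough sketch of text-region detection: text tends to produce dense
# vertical edges, which a horizontal differential (Sobel) filter highlights.
import cv2
import numpy as np

def has_text_region(gray_frame, grad_thresh=80, density_thresh=0.15):
    # Horizontal differential filter followed by binary thresholding.
    grad = cv2.Sobel(gray_frame, cv2.CV_64F, dx=1, dy=0, ksize=3)
    binary = (np.abs(grad) > grad_thresh).astype(np.uint8)
    # Slide a coarse window; a high density of edge pixels suggests text.
    h, w = binary.shape
    for y in range(0, h - 32, 16):
        for x in range(0, w - 128, 32):
            if binary[y:y + 32, x:x + 128].mean() > density_thresh:
                return True
    return False
```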

2.3.3 Closed Caption Alignment

Many television programs are closed captioned in "real time": as they are being broadcast, a highly skilled stenographer transcribes the dialog, and the output of the stenographic keypad is fed to a video caption insertion system. As a result of this arrangement, there is a variable delay of roughly three to ten seconds between the time a word is uttered and the time the corresponding closed caption text appears on the video. Further, it is often the case that the dialog is not transcribed verbatim. The timing delay presents a problem for clip extraction, since the clip start and end times are derived from the timing information contained in the closed caption text stream.

We correct the inherent closed caption timing errors using speech processing techniques. The method operates on small segments of the closed caption text stream corresponding to sentences, as defined by punctuation, and the goal is to determine sentence (not word) start and end times. An audio segment is extracted corresponding to the closed caption sentence start and end times, with the start extended ten seconds back in time to allow for the closed caption delay. The phonetic transcription of the text is obtained using text-to-speech (TTS) techniques, and the speech processing engine attempts to find the best match of this phoneme string to the given audio segment. The method works well when the closed caption transcription is accurate (it is even tolerant of acoustic distortions such as background music). When the transcription is poor, the method fails to find the proper alignment. We detect failures with a second-pass consistency check, which ensures that all sentence start and end times are sequential. If overlaps are detected, we use the mean closed caption delay as an estimate.
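The second-pass consistency check lends itself to a short sketch. The function below is a hypothetical illustration, assuming that per-sentence caption times and (possibly failed) alignment results are available, together with an estimate of the mean caption delay.

```python
# A sketch of the second-pass consistency check: aligned sentence times must
# be sequential; where overlaps indicate an alignment failure, fall back to
# shifting the caption times back by the mean caption delay.
def repair_alignment(caption_times, aligned_times, mean_delay):
    """caption_times, aligned_times: lists of (start, end) per sentence, in seconds.
    aligned_times entries may be None when the aligner found no match."""
    repaired, prev_end = [], 0.0
    for (cap_start, cap_end), aligned in zip(caption_times, aligned_times):
        if aligned is None:
            start, end = cap_start - mean_delay, cap_end - mean_delay
        else:
            start, end = aligned
        if start < prev_end:            # overlap: alignment judged unreliable
            start, end = cap_start - mean_delay, cap_end - mean_delay
        start = max(start, prev_end)    # keep sentence times sequential
        end = max(end, start)
        repaired.append((start, end))
        prev_end = end
    return repaired
```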

If a manual transcription is available, it can be used to yield a higher-quality presentation than the closed caption text, by applying parallel text alignment against the best transcription of a very large vocabulary speech recognizer [7].

2.3.4 Closed Caption Case Restoration

Closed captioning is typically all upper-case to make the characters more readable when displayed on low-resolution television displays and viewed from a distance. When the text is to be used in other applications, however, we must generate the proper capitalization to improve readability and appearance. We employ three linguistic sources for this: 1) a set of deterministic syntactic rules, 2) a dictionary of words that are always capitalized, and 3) a source that gives the likelihood of capitalization for ambiguous cases [24]. The generation of the dictionary and the statistical analysis required to generate the likelihood scores are performed using a training corpus similar to the content to be processed. Processed newswire feeds are suitable for this purpose.
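A minimal sketch of combining the three sources is shown below. The dictionary entries, likelihood values, and the 0.5 decision threshold are illustrative placeholders rather than values from a trained corpus.

```python
# A sketch of case restoration from three sources: syntactic rules,
# an always-capitalize dictionary, and corpus-derived likelihoods.
ALWAYS_CAPITALIZED = {"nbc": "NBC", "john": "John", "glenn": "Glenn"}   # trained dictionary
CAP_LIKELIHOOD = {"senate": 0.85, "shuttle": 0.10}                      # corpus statistics

def restore_case(tokens):
    restored, sentence_start = [], True
    for tok in tokens:
        word = tok.lower()
        if word in ALWAYS_CAPITALIZED:
            out = ALWAYS_CAPITALIZED[word]
        elif sentence_start or CAP_LIKELIHOOD.get(word, 0.0) > 0.5:
            out = word.capitalize()     # rule: capitalize sentence starts
        else:
            out = word
        restored.append(out)
        sentence_start = word.rstrip('"\'').endswith((".", "!", "?"))
    return restored

# Example: "John Glenn returned to the Senate ."
print(" ".join(restore_case("JOHN GLENN RETURNED TO THE SENATE .".split())))
```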

2.3.5 Speaker Segmentation and Clustering

In several genres, including broadcast news, speaker boundaries provide landmarks for detecting content boundaries, so it is important to identify speaker segments during automatic content-based indexing [18]. Speaker segmentation finds the speaker boundaries within an audio stream, and speaker clustering groups the resulting segments into clusters that correspond to different speakers.

Mel-Frequency Cepstral Coefficients (MFCC) are widely used audio features in the speech domain; they provide a smoothed version of the spectrum that accounts for the non-linear properties of human hearing. The degree of smoothing depends on the order of the MFCC employed. In our study, we employed 13th-order MFCC features and their temporal derivatives.

The speaker segmentation algorithm consists of two steps: splitting and merging. During splitting, we identify possible speaker boundaries; during merging, neighboring segments are merged if their acoustic characteristics are similar. In the first step, low-energy frames, which are local minimum points on the volume contour, are located as boundary candidates. For each boundary candidate, the difference between its two neighboring windows is computed. The neighbors of a frame are the two adjacent windows immediately before and after the frame, each with duration L seconds (typically L = 3 seconds). If the difference is higher than a certain threshold and is the maximum in the surrounding range, we declare the corresponding frame a possible speaker boundary.

We adopt divergence to measure the difference [25]. In general, divergence must be computed by numerical integration, which is computationally expensive. However, in some special cases it can be simplified. For example, when F and G are one-dimensional Gaussians, the divergence between F and G can be computed directly from the Gaussian parameters. If the means and standard deviations of F and G are (μF, σF) and (μG, σG), respectively, the divergence is computed as

$$D(F,G) = \frac{1}{2}\left(\frac{\sigma_F^2}{\sigma_G^2} + \frac{\sigma_G^2}{\sigma_F^2}\right) + \frac{1}{2}\,(\mu_F - \mu_G)^2\left(\frac{1}{\sigma_F^2} + \frac{1}{\sigma_G^2}\right) - 1.$$

To simplify the computation, we assume that the different audio features are independent and that each follows a Gaussian distribution. The overall difference between two audio windows is then simply the sum of the divergences of the individual features.
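The following sketch computes the difference between two neighboring audio windows under these assumptions: each feature is modeled as an independent one-dimensional Gaussian, and the per-feature divergences from the formula above are summed. The function name and the small variance floor are our own.

```python
# A sketch of the window-difference computation: per-feature 1-D Gaussian
# divergences between two neighboring L-second windows, summed over features.
import numpy as np

def window_divergence(win_a, win_b, eps=1e-6):
    """win_a, win_b: arrays of shape (num_frames, num_features) of MFCC features."""
    mu_a, mu_b = win_a.mean(axis=0), win_b.mean(axis=0)
    var_a = win_a.var(axis=0) + eps
    var_b = win_b.var(axis=0) + eps
    # Symmetric divergence between 1-D Gaussians (see the formula above).
    d = 0.5 * (var_a / var_b + var_b / var_a) \
        + 0.5 * (mu_a - mu_b) ** 2 * (1.0 / var_a + 1.0 / var_b) - 1.0
    return d.sum()
```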

The splitting process described above generally yields over-segmentation. A merging step is necessary to group similar neighboring segments together and reduce the amount of false segmentation. This is achieved by comparing the statistical properties of adjacent segments: the divergence is computed from the mean and standard deviation vectors of adjacent segments, and if it is lower than a preset threshold, the two segments are merged.

Speaker clustering is realized by agglomerative hierarchical clustering [13]. Each segment is initially treated as a cluster on its own. In each iteration, two clusters with the minimum dissimilarity value are merged. This procedure continues until the minimum cluster dissimilarity exceeds a preset threshold. In general, the clustering can also be performed across different programs. This is especially useful in some scenarios in which the speech segments of a certain speaker (e.g., President Bush) on different broadcasts can be clustered together so that users can easily retrieve such content-based clusters.
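A sketch of the clustering step, using SciPy's hierarchical clustering as a stand-in for the implementation in [13], is given below. The dissimilarity threshold and the average-linkage choice are illustrative, and the window_divergence helper from the earlier sketch is reused as the segment dissimilarity.

```python
# A sketch of agglomerative hierarchical clustering of speaker segments,
# stopping when the minimum cluster dissimilarity exceeds a preset threshold.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_speakers(segments, threshold=25.0):
    """segments: list of (num_frames, num_features) MFCC arrays, one per segment."""
    n = len(segments)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # window_divergence is defined in the earlier sketch.
            dist[i, j] = dist[j, i] = window_divergence(segments[i], segments[j])
    # Average-linkage clustering, cut at the dissimilarity threshold.
    z = linkage(squareform(dist), method="average")
    return fcluster(z, t=threshold, criterion="distance")
```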

2.3.6 Anchorperson Detection

The appearance of anchorpersons in broadcast news and other informational programs often indicates a semantically meaningful boundary for reported news stories. Therefore, detecting anchorperson speech segments in news is desirable for indexing news content. Previous efforts have mostly relied on audio information (e.g., acoustic speaker models) or visual information (e.g., visual anchor models such as faces) alone, using either model-based methods with off-line trained models or unsupervised clustering methods. The inflexibility of the off-line model-based approach (it allows only a fixed target) and the difficulty of achieving reliable detection with clustering approaches led us to adopt an adaptive approach in our system [17].

To adaptively detect an unspecified anchor, we use the scheme depicted in Figure 43.1. There are two main parts in this scheme: one is visual-based detection (the top part) and the other is integrated audio/visual-based detection. The former serves as a mechanism for initial on-line training data collection, in which possible anchor video frames are identified by assuming that the personal appearance (excluding the background) of the anchor remains constant within the same program.

Figure 43.1: Diagram of integrated algorithm for adaptive anchorperson detection.

Two different methods of visual-based detection are depicted in this diagram. One, shown along the right column, first exploits audio cues to identify the theme music segment of the given news program. From that, an anchor frame can be reliably located, from which a feature block is extracted to build an on-line visual model for the anchor. A feature block is a rectangular block covering the neck-down clothing of a person, which captures both the style and the color of the clothes. By properly scaling the features extracted from such blocks, the on-line anchor visual model built from them is invariant to location, scale, and background. Once the model has been generated, all other anchor frames can be identified by matching against it.

The other method for visual-based anchor detection is used when no acoustic cue such as theme music is present, so that no initial anchor frame can be reliably identified to build an on-line visual model. In this scenario, we exploit the common properties of human facial color across different anchors. Face detection [2] is applied, and feature blocks are then identified in a similar fashion for every detected face. Once invariant features are extracted from all the feature blocks, similarity measures are computed among all possible pairs of detected persons. Agglomerative hierarchical clustering is applied to group faces into clusters that possess similar features (the same clothing with similar colors). Given the nature of the anchor's function, the largest cluster with the most scattered appearance times corresponds to the anchor class. Either of the above methods can be used to detect the anchorperson. When the theme music is present, the two methods can be combined to detect the anchorperson with higher confidence.

Visual-based anchor detection by itself is not adequate because there are situations where the anchor's speech is present but the anchor does not appear on screen. To identify all anchor segments precisely, we need to recover these segments as well. This is achieved by adding audio-based anchor detection. The visually detected anchor keyframes in the video stream identify the locations of the anchor's speech in the audio stream. Acoustic data at these locations is gathered to serve as training data for an on-line speaker model of the anchor, which is then applied, together with the visual detection results, to extract all segments of the given video where the anchor is present.

In our system, we use a Gaussian Mixture Model (GMM) with 13 MFCC features and their first- and second-order derivatives (for a total of 39 features, as opposed to the 26 used for speaker segmentation) to model the acoustic properties of anchorperson and non-anchorperson audio. A maximum likelihood method is applied to detect the anchorperson segments.
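As an illustration, the sketch below uses scikit-learn's GaussianMixture as a stand-in for the GMMs described here; the number of mixture components and the helper names are assumptions.

```python
# A sketch of maximum-likelihood anchor detection with two GMMs (anchor vs.
# non-anchor) over 39-dimensional MFCC + delta + delta-delta features.
from sklearn.mixture import GaussianMixture

def train_models(anchor_feats, other_feats, n_components=8):
    """anchor_feats, other_feats: arrays of shape (num_frames, 39)."""
    anchor_gmm = GaussianMixture(n_components).fit(anchor_feats)
    other_gmm = GaussianMixture(n_components).fit(other_feats)
    return anchor_gmm, other_gmm

def is_anchor_segment(segment_feats, anchor_gmm, other_gmm):
    # Average per-frame log-likelihood under each model; pick the larger.
    return anchor_gmm.score(segment_feats) > other_gmm.score(segment_feats)
```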

2.3.7 Commercial Detection

Detection of commercials is an important task in news broadcast analysis, and many techniques have been developed for this purpose [9]. For example, to obtain a transcription from automatic speech recognition (ASR) or to recognize a speaker from speech signals, it is more efficient to filter out non-speech or noisy speech signals prior to recognition. Since most commercials are accompanied by a prominent music background, it is necessary to detect commercials and remove them before ASR and speaker identification. Conventionally, this task is performed by relying on auditory properties alone. In our system, we explore a solution that utilizes information from different media in the classification [16].

Commercials in national broadcast news can usually be characterized as louder, faster, and more condensed in time than the rest of the broadcast. In our current solution for distinguishing between commercials and news reporting, eighteen clip-based audio and visual features are extracted and used in the classification. Each clip is about 3 seconds long; fourteen of the features are from the audio domain and the remaining four are from the visual domain. The audio features capture acoustic characteristics including volume, zero crossing rate, pitch, and spectrum; the visual features capture color- and motion-related properties. This combination of audio/visual features is designed to capture the discriminating characteristics between commercials and news reporting.

A Gaussian Mixture Model classifier is applied to detect commercial segments. Based on a collected training dataset, we train GMM models for both news reporting and commercials using the expectation maximization (EM) method. At detection time, we compute the likelihood of each video clip under the two GMM models and assign the clip to the category with the larger likelihood. News reporting may also contain music, noisy backgrounds, and fast motion, especially in live reporting, which can lead to misclassification because such segments are aurally or visually similar to commercials. This type of error is easily corrected by smoothing the clip-wise classification decisions using contextual information from neighboring clips.
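One simple way to realize this contextual smoothing is a median filter over the clip-wise labels, as in the sketch below; the window size is illustrative.

```python
# A sketch of contextual smoothing: each ~3-second clip's commercial/news
# label is replaced by the majority label in a small neighborhood.
import numpy as np
from scipy.ndimage import median_filter

def smooth_labels(labels, window=5):
    """labels: array of 0 (news) / 1 (commercial) per clip."""
    return median_filter(np.asarray(labels), size=window, mode="nearest")
```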

2.4 Multi-Modal Story Segmentation

We now present a phase-one algorithm suitable for broadcast news applications. A typical national news program is composed of several headline stories, each of which is usually introduced and summarized by the anchorperson before and after the detailed reporting conducted by correspondents and others. Since news stories normally start with an anchorperson introduction, anchorperson segments provide a set of hypothesized story boundaries [11]. Text information, from the closed captions or from automatic speech recognition transcripts, is then used to verify whether adjacent hypothesized stories cover the same topic.

Formally, our input data for text analysis is T = {T1, T2, ..., TM}, where each Ti, 1 ≤ i ≤ M, is a text block that begins with the anchorperson's speech and ends before the next anchorperson's speech, and M is the number of hypothesized stories. To verify story boundaries, we evaluate a similarity measure sim() between every pair (Ta, Tb) of adjacent blocks:

$$\mathrm{sim}(T_a, T_b) = \frac{\sum_{w} f_{w,a}\, f_{w,b}}{\sqrt{\sum_{w} f_{w,a}^2}\;\sqrt{\sum_{w} f_{w,b}^2}}$$

Here, w enumerates all the token words in each text block, and f_{w,a} and f_{w,b} are the weighted frequencies of word w in the corresponding blocks. Stop words (e.g., "a," "the") are excluded from the token word list, and stemming is applied to reduce the number of token words [22]. Each word frequency is weighted by the standard frequency of the same word computed from a database of broadcast news data collected from NBC Nightly News from 1997 to 2001. The higher the frequencies of the words common to the two blocks, the more similar their content. We experimentally determined a threshold on this measure to verify the story boundaries.
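The sketch below illustrates one way to compute such a similarity between adjacent text blocks. The stop-word list, the interpretation of the corpus-frequency weighting (dividing raw counts by the standard frequency), and the decision threshold are all assumptions for illustration.

```python
# A sketch of the block-similarity test: stop words removed, tokens stemmed,
# frequencies weighted against a corpus, and adjacent blocks compared.
import math
from collections import Counter

STOP_WORDS = {"a", "the", "and", "of", "to", "in"}   # abbreviated list
CORPUS_FREQ = {}                                     # word -> standard corpus frequency

def weighted_freqs(text, stem=lambda w: w):
    words = [stem(w) for w in text.lower().split() if w not in STOP_WORDS]
    return {w: c / CORPUS_FREQ.get(w, 1.0) for w, c in Counter(words).items()}

def sim(block_a, block_b):
    fa, fb = weighted_freqs(block_a), weighted_freqs(block_b)
    num = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    den = math.sqrt(sum(v * v for v in fa.values()) * sum(v * v for v in fb.values()))
    return num / den if den else 0.0

def same_story(block_a, block_b, threshold=0.2):     # threshold is illustrative
    return sim(block_a, block_b) >= threshold
```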

For each news story, we extract a list of keywords and key images as textual and visual representations of the story. The keywords are the ten token words with the highest weighted frequency within the story. Choosing key images is more involved: we need to choose the most representative keyframes among those extracted by video shot detection, and anchorperson keyframes should be excluded since they carry no story-dependent information. The method we use is to select the top five non-anchorperson keyframes that cover the maximum number of the chosen keywords in the corresponding text blocks.

After story segmentation, a higher level news abstract can be extracted, for example, a news summary of the program. We can group all the anchorperson introduction segments, which are the leading anchorperson segments within corresponding stories, as the news summary since they are normally the most concise snapshot of the entire program.

2.5 Query-Based Segmentation

As mentioned previously, the detected story boundaries may be inappropriate for some user queries. We may need to split long stories into sub-segments, or on the other hand, join stories together to form suitable video presentations.

The second phase of processing to determine the clip start and end times for a given program takes place at query time. This processing is independent of the multimodal story segmentation described above. That method will have limited success for genres that are very different from broadcast news, in which case this second phase may be used by itself.

The sentences from the text for a given program are compared against a query string formed from a user's profile (see Section 3). The query string contains single-word terms as well as phrase terms, and typically contains from three to five such items. If phrase terms are used, the entire phrase must match in order to be counted as a hit. A hit is defined as a location in the text that matches the query string. The match need not be limited to a lexical match; it may include synonyms or alternate forms obtained via stemming. Typically the query string contains logical 'or' operators, so hits correspond to occurrences of individual terms from the query string.

The algorithm uses a three-pass method to find the largest cluster of hits, which determines the clip start and end times for a given program. In the first pass, a vector containing the times of the hits is formed. Next, a one-dimensional temporal-domain morphological closing operation is performed on the hit vector. Assuming that time is sampled in units of seconds, the size of the structuring element for the closing operation corresponds to the maximum length of time in seconds between hits that can be assigned to a particular candidate clip. We have found that a value of 120 seconds provides reasonable performance for a wide range of content genres. In the final pass, a run-length encoding of the closed hit vector is performed and the longest run is selected. The tentative clip start time is set to the start time of the sentence containing the first hit of the longest run, and the clip end time is set to the end time of the sentence containing the last hit in the longest run. Note that no maximum clip length is imposed.
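The three passes can be sketched as follows. The code below builds the hit vector, applies the 120-second closing, and selects the longest run; snapping the boundaries to sentence start and end times and the minimum-length rule described in the next paragraph are omitted for brevity.

```python
# A sketch of the three-pass clip-boundary method: per-second hit vector,
# 1-D morphological closing with a 120-second structuring element, then
# run-length selection of the longest run of hits.
import numpy as np
from scipy.ndimage import binary_closing

def extract_clip(hit_times, program_length, close_seconds=120):
    """hit_times: query-hit times in seconds; returns (start, end) of the densest run."""
    hits = np.zeros(int(program_length) + 1, dtype=bool)
    hits[np.asarray(hit_times, dtype=int)] = True                     # pass 1
    closed = binary_closing(hits, structure=np.ones(close_seconds))   # pass 2
    # Pass 3: run-length encode and keep the longest run of True samples.
    best_start = best_len = cur_start = cur_len = 0
    for t, v in enumerate(closed):
        if v:
            if cur_len == 0:
                cur_start = t
            cur_len += 1
            if cur_len > best_len:
                best_start, best_len = cur_start, cur_len
        else:
            cur_len = 0
    # In practice these bounds are snapped to the enclosing sentence times.
    return best_start, best_start + best_len
```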

It is possible for a program to contain only one instance of the search term, in which case the clip start and end times may be set to those of the sentence containing the hit. A minimum clip length (typically 20 seconds) is imposed; it is usually needed only when the term occurs just once in a program and in a short sentence. If the clip length is less than the minimum, the clip end time is extended to the end time of the next sentence. In other applications it may be desirable to impose a minimum number of sentences instead of a minimum clip duration.

2.5.1 Example

Figure 43.2 represents a timeline of a 30-minute news program (the NBC Nightly News on 6/7/01). The large ticks on the scale on the bottom are in units of 100 seconds. The top row shows occurrences of an example key word, "heart". The second row indicates start times and durations of the sentences. Note that commercial segments have been removed as described above, and sentences or hits are not plotted during commercial segments.

Figure 43.2: Timeline plot for a news program.

The bar in the second row indicates the extracted clip, in this case a story about heart disease. Notice that the keyword "heart" occurred five times before the story. Most of these are due to "teasers" advertising the upcoming story. The third occurrence of the term is from the mention of the term "heart" in a story about an execution (the first occurrence is from the end of the preceding program which happened to be a local news program, the second is from the beginning of the Nightly News program, and the last two before the clip are just before a commercial break).

2.6 Text-Based Information Retrieval

Text-based information retrieval (IR) is an important part of the overall process of creating personalized multimedia presentations. An in-depth review of IR techniques is beyond the scope of this chapter and further information can be found in the literature [14]. This section will describe how some IR methods could be used to improve the overall accuracy and correctness of the video selections.

Many IR systems attempt to find the best document matches given a single query and a large set of documents. This method will work with reasonable success in this system. We assume that a text transcript of the broadcast is available and that the query will return not just a set of documents, but also locations of the hits within the documents. Each document will relate to a single program. Thus the text processing will return a set of video programs that are candidates for the video segmentation and the additional processing required to create a set of personalized video clips.

In the simplest view of text processing the individual query hits on a document would be analyzed within each program to determine the best candidate areas for a video clip that would satisfy the query. Additional processing on the video and audio would then create the exact bounds of the video clip before it was presented to the user of the system.

A more advanced view would look at the entire profile of the user and use that as input to algorithms for query expansion or topic detection. In addition, suitable user interfaces (e.g., see Section 3.2) could provide the system with excellent information for relevance feedback in making future choices for this topic. The system could log the actual choices that the user makes in viewing the videos and incorporate information from those choices into the algorithms for making future choices of video clips. This integrated view is the right one for the future of multimedia information management, and there is much work to do in combining standard text-based tools with tools for handling multimedia.

2.6.1 Query Expansion

Query expansion should lead to less variation in the responses to similar but distinct queries [15]. For example, if a query were "home run," then the set of video clips returned should be similar to those returned for the query "homer." This holds only if the query expansion is clever enough to group these terms in the sports category; otherwise the query "homer" would be likely to return video clips from a documentary on the artist Winslow Homer.

This context is not available in a system that handles requests with no history or logging, but it is readily available in a system that not only includes a profile but also maintains and uses logs of previous queries. The system should be designed to adapt over time, and all the relevant information should be stored with the user's individual profile.

2.6.2 Text Summarization and Keyword Extraction

These techniques, described elsewhere [6,12,20], help the system determine appropriate clip selections by narrowing the text of a multimedia event to its essential and important components. They can be used in isolation or in combination with relevance feedback to improve overall clip selection and generation.

2.6.3 Relevance Feedback

Relevance feedback goes beyond the realm of text processing but is worth mentioning in the standard IR context, and this application lends itself to it. Since we maintain profile information for each user, we can log which videos were viewed, for how long, and which were considered important enough to archive or e-mail to others. Using this log information, we can provide relevance feedback to the overall system on the effectiveness of particular searches, and we can adapt the system over time to return more clips similar to those the user viewed and to reject clips similar to those that were not viewed [28,29].



