6. Audio-Visual Summarization

In developing the audio-visual summarization system, we strive to produce a motion video summary for the original video that:

- provides a natural, effective audio and visual content overview;
- maximizes the coverage for both audio and visual contents of the original video without having to sacrifice either of them.

To accomplish these goals, we compose an audio-visual summary by decoupling the audio and the image track of the input video, summarizing the two tracks separately, and then integrating the two summaries with a loose alignment.

6.1 System Overview

Figure 10.6 is the block diagram of the audio-visual summarization system. It is a systematic combination of the audio-centric and the image-centric summarization systems. The summarization process starts by receiving the user's input of the two summarization parameters: the summary length T_len, and the minimum time length of each image segment T_min in the summary. The meaning and the intended use of these two parameters are the same as described in Section 5.1.

Figure 10.6: The block diagram of the audio-visual summarization system.

The audio content summarization is accomplished using the same process as described in Section 4. It consists of conducting speech recognition on the audio track of the original video to obtain a speech transcript, applying the text summarization method described in Section 4.2 to obtain an importance rank for each sentence in the transcript, and selecting audio segments of the sentences in descending order of their importance ranks until the user-specified summary length is reached.
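The selection step at the end of this process can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Sentence` type, the convention that a higher `rank` value means a more important sentence, and the final re-sorting into original time order are all assumptions made for the example. Because a sentence is added whole, the result may overshoot the requested length.

```python
from typing import List, NamedTuple


class Sentence(NamedTuple):
    start: float     # start time t_i in the original audio track (seconds)
    duration: float  # length tau_i of the spoken sentence (seconds)
    rank: float      # importance score; higher = more important (assumed convention)


def select_audio_summary(sentences: List[Sentence], t_len: float) -> List[Sentence]:
    """Pick sentences by descending rank until their total length reaches t_len."""
    chosen: List[Sentence] = []
    total = 0.0
    for s in sorted(sentences, key=lambda s: s.rank, reverse=True):
        if total >= t_len:
            break
        chosen.append(s)
        total += s.duration
    # Assumption: the selected sentences are played back in original time order.
    return sorted(chosen, key=lambda s: s.start)
```

With seven 10-second sentences and T_len = 20, the two highest-ranked sentences are returned; with T_len = 25, three sentences (30 seconds) are returned, since whole sentences are never truncated in this sketch.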

The first two steps of the visual content summarization, shot boundary detection and scene shot clustering, are the same as described in Section 5, while the visual summary composition is conducted based on both the audio summarization and the shot clustering results. The whole visual content summarization process consists of the following major steps. First, shot boundary detection is conducted to segment the image track into individual scene shots. Next, shot clustering is performed to group scene shots into the required number N = T_len / T_min of clusters based on their visual similarities. The summary length T_len is then divided into N time slots, each of which lasts for T_min seconds, and each time slot is assigned to a suitable shot from an appropriate shot cluster. The assignment of a time slot to an appropriate shot cluster is made by the alignment process to fulfill the predefined alignment constraints (see Section 6.2 for detailed descriptions). Once a shot is assigned a time slot, its beginning segment (T_min seconds long) is collected, and a visual summary is created by concatenating these collected segments in their original time order. Moreover, face detection is conducted for each scene shot to detect the most salient frontal face that appears steadily in the shot. Such a face is considered to be a speaker's face, and plays an important role in the alignment operation.

For the alignment task, to achieve the summarization goals listed at the beginning of this section, we partially align the spoken sentences in the audio summary with the associated image segments in the original video. With video programs such as news and documentaries, a sentence spoken by an anchor person or a reporter lasts for ten to fifteen seconds on average. If a full alignment is made between each spoken sentence in the audio summary and its corresponding image segment, what we may get in the worst case is a video summary whose image part consists mostly of anchor persons and reporters. A summary created this way may look natural and smooth, but it comes at a great sacrifice of the visual content. To create a content-rich audio-visual summary, we developed the following alignment operations: for each spoken sentence in the audio summary, if the corresponding image segment in the original video displays scenes rather than the speaker's face, perform no alignment operations. The visual summary associated with this spoken sentence can be created by selecting image segments from any appropriate shot clusters. If the corresponding image segment does display the speaker's face, align the spoken sentence with its corresponding image segment for the first T_min seconds, and then fill the remaining portion of the associated visual summary with image segments from appropriate shot clusters. The decision of which shot to select from which cluster is made by the alignment process to fulfill the predefined alignment constraints.

6.2 Alignment Operations

Let A(t_i, τ_i) and I(t_i, τ_i) denote the audio segment and the image segment, respectively, that start at time instant t_i and last for τ_i seconds. The alignment operation consists of the following two main steps.

1. For a spoken sentence A(t_i, τ_i) in the audio summary, check the content of its corresponding image segment I(t_i, τ_i) in the original video. If I(t_i, τ_i) shows a close-up face, and this face has not been aligned with any other component in the audio summary, align A(t_i, τ_i) with I(t_i, τ_i) for T_min seconds. Otherwise, do not perform the alignment operation for A(t_i, τ_i). This T_min seconds alignment between A(t_i, τ_i) and I(t_i, τ_i) is called an alignment point.
2. Once all the alignment points are identified, evenly assign the remaining time period of the summary among the shot clusters which have not received any playback time slot. This assignment must ensure the following two constraints:
- Single assignment constraint: Each shot cluster can receive only one time slot assignment.
- Time order constraint: All the image segments forming the visual summary must be in original time order.
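The first step, identifying the alignment points, can be sketched as below. This is an illustration under stated assumptions rather than the system's actual code: the `speaker_face` lookup stands in for the face detection result (a hypothetical label per image segment, or `None` when no close-up face is shown), and "this face has not been aligned" is interpreted here as "this face label has not been used yet".

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[float, float]  # (t_i, tau_i): start time and duration in seconds


def find_alignment_points(
    audio_summary: List[Segment],
    speaker_face: Dict[Segment, Optional[str]],
    t_min: float,
) -> List[Segment]:
    """Return one (start, t_min) alignment point per newly seen speaker face."""
    used_faces = set()
    points: List[Segment] = []
    for seg in audio_summary:
        face = speaker_face.get(seg)  # hypothetical face-detection result
        if face is not None and face not in used_faces:
            used_faces.add(face)
            # Align A(t_i, tau_i) with I(t_i, tau_i) for the first t_min seconds only.
            points.append((seg[0], t_min))
    return points
```

Sentences whose image segment shows no face, or a face that was already aligned, produce no alignment point and are left to the cluster assignment in step 2.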

The following subsections explain our approach to realizing the above alignment requirements.

6.2.1 Alignment Based on Bipartite Graph

Assume that the whole time span T_len of the video summary is divided by the alignment points into P partitions, and the time length of partition i is T_i (see Figure 10.7).

Figure 10.7: An example of the audio-visual alignment and the corresponding bipartite graph.

Because each image segment forming the visual summary must be at least T_min seconds long (a time duration of T_min seconds is called a time slot), partition i can provide S_i = ⌊T_i / T_min⌋ time slots, and hence the total number of available time slots becomes S_total = Σ_{i=1}^{P} S_i. The problem can now be stated as follows: given a total of N shot clusters and S_total time slots, determine a best matching between the shot clusters and the time slots which satisfies the above two constraints. With some reformulation, this problem can be converted into the following maximum-bipartite-matching (MBM) problem [15]. Let G = (V, E) represent an undirected graph where V is a finite set of vertices and E is an edge set on V. A bipartite graph is an undirected graph G = (V, E) in which V can be partitioned into two sets L and R such that (u, v) ∈ E implies either u ∈ L and v ∈ R or u ∈ R and v ∈ L. That is, all edges go between the two sets L and R. A matching is a subset of edges M ⊆ E such that each vertex in V is an endpoint of at most one edge of M. A maximum matching is a matching M such that for any matching M', we have |M| ≥ |M'|.

To apply the MBM algorithm to our alignment problem, we use each vertex u ∈ L to represent a shot cluster, and each vertex v ∈ R to represent a time slot. An edge (u, v) exists if a shot cluster u is able to take time slot v without violating the time order constraint. If a shot cluster consists of multiple scene shots, this cluster may have multiple edges that leave from it and enter different vertices in R. A maximum-bipartite-matching solution is a best assignment between all the shot clusters and the time slots. Note that a best assignment is not necessarily unique.
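For readers who want to experiment, the sketch below finds a maximum matching with the classic augmenting-path approach. It is one standard way to solve MBM and is not necessarily the algorithm used in the system or described in [15]; the function name and the dictionary-based graph representation are choices made for this illustration.

```python
from typing import Dict, Hashable, List, Set


def maximum_bipartite_matching(
    edges: Dict[Hashable, List[Hashable]]
) -> Dict[Hashable, Hashable]:
    """Return a maximum matching as {time_slot: shot_cluster}.

    `edges` maps each left vertex (shot cluster) to the right vertices
    (time slots) it is allowed to take.
    """
    match: Dict[Hashable, Hashable] = {}  # time slot -> shot cluster

    def try_assign(u: Hashable, visited: Set[Hashable]) -> bool:
        for v in edges.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current owner can move elsewhere.
            if v not in match or try_assign(match[v], visited):
                match[v] = u
                return True
        return False

    for u in edges:
        try_assign(u, set())
    return match
```

Each cluster appears at most once among the values of the returned dictionary, which corresponds to the single assignment constraint; time slots missing from the dictionary are the ones left unassigned.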

6.2.2 Alignment Process Illustration

Figure 10.7(a) illustrates the alignment process using a simple example. In this figure, the original video program is 70 seconds long and consists of 7 scene shots and 7 spoken sentences, each of which lasts for 10 seconds. The user has set T_len = 20 seconds, and T_min = 3 seconds. Assume that the audio summarization has selected two spoken sentences A(0,10) and A(30,10), and that the shot clustering process has generated five shot clusters as shown in the figure. As the audio summary is formed by A(0,10) and A(30,10), we must first examine the contents of the corresponding image segments I(0,10) and I(30,10) to determine whether the alignment operations are required. Suppose that I(0,10) and I(30,10) display the faces of the speakers of sentences A(0,10) and A(30,10), respectively, and that I(0,10) and I(30,10) have not been aligned with other audio segments yet. Then, according to the alignment rules, I(0,10) will be aligned with A(0,10), and I(30,10) with A(30,10), for T_min (= 3) seconds. Because I(0,10) and I(30,10) have been used once, they will not be used in other parts of the visual summary. These two alignment points divide the remaining time period of the visual summary into two partitions, each lasting for 7 seconds and providing at most 2 time slots. Because there are three shot clusters and four time slots left for the alignment, we have the bipartite graph for the alignment task shown in Figure 10.7(b). Since shot cluster 2 consists of two shots, I(10,10) and I(50,10), it could take a time slot in either partition 1 or partition 2. If I(10,10) is selected from cluster 2, it can take either time slot 2 or 3 in partition 1. On the other hand, if I(50,10) is selected, it can take either time slot 5 or 6 in partition 2. Therefore, we have four edges leaving from cluster 2, entering time slots 2, 3, 5, and 6, respectively. Similarly, there are four edges leaving from cluster 4, and two edges leaving from shot cluster 5.
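The numbers in this example can be reproduced with a short script. Two things in it are assumptions, since the text does not state them: the membership of clusters 4 and 5 (chosen here so that the edge counts match the four and two edges described above), and the reading of the time order constraint as "a shot may only fill the partition that follows the last alignment point preceding it in the original video", which is how I(10,10) and I(50,10) are treated in the example. The slot count per partition is the integer part of T_i / T_min, which gives the 2 slots per 7-second partition stated above.

```python
from typing import Dict, List, Tuple

T_MIN = 3                    # seconds per time slot
PARTITION_LENGTHS = [7, 7]   # seconds left after each 3-second alignment point
ALIGN_TIMES = [0, 30]        # original start times of aligned segments I(0,10), I(30,10)

# Shot start times per remaining cluster. Cluster 2 is given in the text;
# the membership of clusters 4 and 5 is an assumption for illustration.
CLUSTERS: Dict[int, List[int]] = {2: [10, 50], 4: [20, 40], 5: [60]}


def build_example_graph() -> Tuple[List[List[int]], Dict[int, List[int]]]:
    # Each partition offers floor(T_i / T_min) time slots.
    counts = [t // T_MIN for t in PARTITION_LENGTHS]

    # Number slots as in the text: one slot for the alignment point,
    # then the slots of the partition that follows it.
    slot_ids: List[List[int]] = []
    next_id = 1
    for n in counts:
        next_id += 1  # slot occupied by the alignment point
        slot_ids.append(list(range(next_id, next_id + n)))
        next_id += n

    def partition_of(start: int) -> int:
        # Index of the last alignment point that precedes this shot.
        return max(i for i, t in enumerate(ALIGN_TIMES) if start > t)

    edges = {
        cluster: sorted(s for st in starts for s in slot_ids[partition_of(st)])
        for cluster, starts in CLUSTERS.items()
    }
    return slot_ids, edges


slot_ids, edges = build_example_graph()
print(slot_ids)  # [[2, 3], [5, 6]]
print(edges)     # {2: [2, 3, 5, 6], 4: [2, 3, 5, 6], 5: [5, 6]}
```

The resulting edge lists have four, four, and two entries for clusters 2, 4, and 5, in agreement with the description of Figure 10.7(b).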

There are several possible maximum matching solutions for the bipartite graph in Figure 10.7(b). Figure 10.8(a) shows one solution where the coarse lines represent the assignment of the shots to the time slots. Note that in this solution time slot 3 remains unassigned. This example illustrates that, although the MBM algorithm will find a best matching between the available shot clusters and time slots, it may leave some time slots unassigned, especially when the number of available shot clusters is less than the number of available time slots. To fill these unassigned time slots, we loosen the single assignment constraint, examine those clusters with multiple scene shots, and select an appropriate shot that has not been used yet and that satisfies the time order constraint. In the above example, the blank time slot 3 is filled using the shot I(20,10) in cluster 4 (coarse dashed line in Figure 10.8(b)).
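The filling step can be sketched as follows. The data layout is hypothetical: `assignment` maps each time slot to the start time of the shot placed in it (or `None` if the slot is empty), and `candidates` lists, for each slot, the shots from multi-shot clusters that the time order constraint would allow there. How the candidates are ranked when several qualify is not specified in the text; this sketch simply takes the first unused one.

```python
from typing import Dict, List, Optional


def fill_unassigned_slots(
    assignment: Dict[int, Optional[int]],
    candidates: Dict[int, List[int]],
) -> Dict[int, Optional[int]]:
    """Fill empty slots with unused shots, relaxing the single assignment constraint."""
    used = {shot for shot in assignment.values() if shot is not None}
    filled = dict(assignment)
    for slot in sorted(filled):
        if filled[slot] is not None:
            continue
        for shot in candidates.get(slot, []):
            if shot not in used:  # each shot is still used at most once
                filled[slot] = shot
                used.add(shot)
                break
    return filled


# Slot 3 is empty; the shots placed in slots 5 and 6 are assumed for illustration.
before = {2: 10, 3: None, 5: 40, 6: 60}
after = fill_unassigned_slots(before, {3: [10, 20]})
print(after)  # {2: 10, 3: 20, 5: 40, 6: 60}
```

Here I(10,10) is already in use, so the empty slot 3 receives I(20,10), as in Figure 10.8(b).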

Figure 10.8: Alignment solutions. The coarse lines represent the assignment of the shot clusters to the time slots; the notation I(j,k) on each coarse line tells which shot from the cluster has been selected and assigned to the time slot.

When the number of available shot clusters is greater than the number of available time slots, some shot clusters will not be assigned time slots within the visual summary. The MBM algorithm determines which clusters to discard and which clusters to take during its process of finding the best matching solution.

Note that the MBM algorithm may generate some false solutions; Figure 10.8(c) shows such an example. Here, because shot I(60,10) has been placed before shot I(50,10), the solution violates the time order constraint. However, this kind of false solution can be easily detected, and can be corrected by sorting the image segments assigned to each partition into their original time order. In the above example, the time order violation can be corrected by exchanging the two image segments assigned to time slots 5 and 6 in Partition 2.
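The correction described here amounts to a per-partition sort. A minimal sketch, assuming the matching is stored as a mapping from time slot to the original start time of the assigned shot (a representation chosen for this example):

```python
from typing import Dict, List


def restore_time_order(
    partitions: Dict[int, List[int]],
    assignment: Dict[int, int],
) -> Dict[int, int]:
    """Re-sort the shots inside each partition into their original time order.

    partitions: partition id -> time slots belonging to that partition
    assignment: time slot -> original start time of the assigned shot
    """
    fixed = dict(assignment)
    for slots in partitions.values():
        taken = sorted(s for s in slots if s in assignment)  # occupied slots
        shots = sorted(assignment[s] for s in taken)         # their shots, by start time
        for slot, shot in zip(taken, shots):
            fixed[slot] = shot
    return fixed


# The false solution of Figure 10.8(c): I(60,10) precedes I(50,10) in partition 2.
false_solution = {2: 10, 5: 60, 6: 50}
print(restore_time_order({1: [2, 3], 2: [5, 6]}, false_solution))
# {2: 10, 5: 50, 6: 60}
```

Sorting only permutes shots within a partition, so the set of clusters that received a slot is unchanged.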

In summary, Step 2 of the alignment operation (Section 6.2) can be described as follows:

1. After the alignment points have been identified, determine the number of shot clusters and time slots that are left for the assignment, and construct a bipartite graph accordingly.
2. Apply the MBM algorithm to find a solution.
3. Examine the solution with the time order constraint; if necessary, sort the image segments assigned to each partition into their original time order.
4. If there exist unassigned time slots, examine those shot clusters with multiple scene shots, and select an appropriate shot that has not been used yet, and that satisfies the time order constraint.

6.3 Summarization Performances

Conducting an objective and meaningful evaluation of an audio-visual content summarization method is difficult and challenging, and is an open issue deserving more research. The challenge arises mainly from the fact that research on audio-visual content summarization is still at an early stage, and there are no agreed-upon metrics for performance evaluation. This challenge is further compounded by the fact that different people hold different opinions and requirements regarding the summarization of the audio-visual contents of a given video, making the creation of any agreed-upon performance metrics even more difficult.

Our audio-visual content summarization system has the following characteristics:

- The audio content summarization is achieved by using the latent semantic analysis technique to select representative spoken sentences from the audio track.
- The visual content summarization is performed by eliminating duplicates/redundancies and preserving visually rich contents from the image track.
- The alignment operation ensures that the generated audio-visual summary maximizes the coverage for both audio and visual contents of the original video without having to sacrifice either of them.

In Section 4.3, systematic performance evaluations were conducted on the text summarization method. The evaluations were carried out by comparing the machine-generated summaries with the manual summaries created by three independent human evaluators, and the F-value was used to measure the degree of overlap between the two types of summaries. The text summarization method achieved F-values between 0.57 and 0.61 over multiple test runs, a performance comparable to that of the top-ranking state-of-the-art text summarization techniques [13], [15].
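This section does not restate how the F-value is computed. As a reminder, the sketch below assumes the usual balanced F-measure (the harmonic mean of precision and recall) over sets of selected sentence indices; the exact formulation used in Section 4.3 may differ, and the sentence indices in the example are invented.

```python
from typing import Set


def f_value(machine: Set[int], manual: Set[int]) -> float:
    """Balanced F-measure between a machine summary and a manual summary."""
    overlap = len(machine & manual)
    if overlap == 0:
        return 0.0
    precision = overlap / len(machine)  # share of machine picks that are correct
    recall = overlap / len(manual)      # share of manual picks that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence indices, for illustration only.
print(round(f_value({1, 3, 5, 7}, {1, 3, 4}), 3))  # 0.571
```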

With respect to the visual content summarization, as a visual summary is composed by first grouping visually similar shots into the same clusters, and then selecting at most one shot segment from each cluster, this visual summarization method ensures that duplicates and redundancies are diminished and visually distinct contents are preserved within the visual summary.

The alignment operation partially aligns each spoken sentence in the audio summary to the image segment displaying the speaker's face, and fills the remaining period of the visual summary with other image segments. In fact, this alignment method mimics a news video production technique commonly used by major TV stations. A common pattern for news programs is that an anchor person appears on the screen and reports the news for several seconds. After that, the anchor person continues his/her report, but the image part of the news video switches to either field scenes or some related interesting scenes. By doing so, the visual contents of news broadcasts are remarkably enriched, and viewers will not get bored. By mimicking this news video production technique in our summarization process, we get an audio-visual content summary which provides richer visual content and a more natural audio-visual content overview. Such audio-visual summaries dramatically increase the information intensity and depth, and lead to a more effective video content overview.

Figure 10.9 illustrates the process of summarizing a 12-minute CNN news program reporting the Anthrax threat after the September 11 terrorist attack. The news program consists of 45 scene shots in its image track, and 117 spoken sentences in its audio track. Keyframes of all the shots are displayed at the left-hand side of Figure 10.9 in their original time order. Obviously, this news program contains many duplicated shots which arise from video cameras switching back and forth among anchor persons and field reporters.

Figure 10.9: Audio-visual summarization of a 12-minute news program.

The user has set T_len = 60 seconds, and T_min = 2.5 seconds. The shot clustering process has generated 25 distinct shot clusters, and the audio summarization process has selected sentences 1, 3, 38, 69, 103 for composing the audio summary. The clustering result is shown by the table in the middle of the figure. It is clear from the clustering result that all the duplicated shots have been properly grouped into the appropriate clusters, and there is no apparent misplacement among the resultant clusters. Because the total time length of these five sentences equals 70 seconds, the actual length of the produced audio-visual summary exceeds the user-specified summary length by 10 seconds.

Among the five sentences comprising the audio summary, four sentences have their corresponding image segments containing the speakers' faces. Each of these four audio sentences has been aligned with its corresponding image segment for T_min = 2.5 seconds. The dashed lines between the audio and the visual summaries denote these alignments.

With the actual T_len = 70 seconds, and T_min = 2.5 seconds, the audio-visual summary can accommodate 28 time slots. As there are a total of only 25 distinct shot clusters, some shot clusters were assigned more than one time slot (e.g., clusters 0, 15, 17, 20, 21). To find an alignment solution that fulfills the two alignment constraints listed in Section 6.2, several shot clusters were not assigned any time slots, and were consequently discarded by the alignment algorithm (e.g., clusters 6, 7, 8, 11). The configuration of the audio and the visual summaries is displayed at the right-hand side of the figure.
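The slot count quoted above follows directly from the two parameters. As a quick check, written with the symbols of Section 6.2.1 and with N denoting the 25 clusters produced for this program:

```latex
S_{\mathrm{total}} = \frac{T_{len}}{T_{min}} = \frac{70}{2.5} = 28,
\qquad
S_{\mathrm{total}} - N = 28 - 25 = 3 .
```

Since 28 slots must be filled from 25 clusters, at least 3 slots would need a repeated cluster even if every cluster were used; discarding some clusters to respect the time order constraint raises that number further.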

Video summarization examples can be viewed at http://www.ccrl.com/~ygong/VSUM/VSUM.html.