5. Image-Centric Summarization

We set the following goals for our image-centric summarization system:

The user is able to affect the summarization outcome;
Within the user specified time length, the summarization minimizes the redundancy while preserving visually rich content of the original video program.

The first goal aims to meet different users' content overview requirements. The second goal strives to turn the difficult, subjective visual content summarization problem into a feasible and objective one. By common sense, an ideal visual content summary should be the one that retains only visually or semantically important segments of a given video. However, finding such video segments requires an overall understanding of the video content, which is beyond our reach given the state of the art of current computer vision and image understanding techniques. On the other hand, it is relatively easy to measure the level of visual content change, and to identify duplicates and redundancies in a video sequence. For the purpose of visual content overviews, the video watching time will be largely shortened, and the visual content of the original video will not be dramatically lost if we eliminate duplicate shots and curtail lengthy and static shots. Therefore, instead of relying on heuristically picking visually "important" segments for generating summaries, we choose to eliminate duplicates and redundancies while preserving visually rich contents in the given video. The following subsections present the system overview, and describe the image-centric summarization process in more detail.

5.1 System Overview

We enable the user to affect the summarization process by specifying two parameters: the summary length T_len, and the minimum time length of each shot T_min in the summary. Given T_len and T_min, the maximum number of scene shots a video summary can incorporate equals N = T_len / T_min. Here the parameter T_min can be considered as a control knob for the user to select between depth-oriented and breadth-oriented summaries. A smaller value for T_min will produce a breadth-oriented summary that consists of more shots that each is shorter in length, while a larger value for T_min will produce a depth-oriented summary that consists of less shots that each is longer in length.

Figure 10.4 shows the block diagram of the image-centric summarization system. For the image track of the input video, shot boundary detection is first conducted to segment the video into individual scene shots (Module 1). Shot clustering is then conducted to group all the shots into N = T_len /T_min clusters based on their visual similarities (Module 2). The clustering process generates shot clusters that meet the following conditions:

All the shots within the same cluster are visually similar.
Any pair of shots from two different clusters are remarkably different with respect to their visual contents.

Figure 10.4: Block diagram of the image-centric summarization system.

Using this shot cluster set, an image-centric summary is created by selecting the longest shot from each cluster as the representative shot, taking a chunk from each of the representative shots, and then concatenating them in their original time order (Module 3).

Image-centric summaries are particularly useful for summarizing entertainment video programs whose visual contents play a more important role than the corresponding audio contents. Examples of such videos include action and combat movies, cartoons, TV broadcasted sports games, etc.

5.2 Scene Shot Clustering

Scene shot clustering is the most challenging and important part of the image-centric summarization system. Here, the challenges are:

We have to generate shot clusters of the requested number N;
The generated N shot clusters have to all satisfy the two conditions listed in Section 5.1.

The task is challenging because in many cases, satisfying one requirement will require breaking the other.

In the literature, hierarchical clustering is a commonly used technique for dynamically grouping scene shots into a specified number of clusters. However, this technique does not meet the above requirements for the following reasons:

The computation time is O(n³) for building up the entire hierarchy, where n is the number of scene shots.
When the clustering hierarchy reaches certain heights, some major, visually different scene shots will be merged together.
At the lower layers of the clustering hierarchy, many visually similar shots have yet to be merged properly, and there are still some shot clusters that are visually similar to one another.

If the summarization process has to take clusters from layers described in either (b) or (c), this summary will either drop some major shots, or contain many duplicates.

To overcome the problems associated with the hierarchical clustering methods, we developed a novel method that performs scene shot clustering using the minimum spanning tree (MST) algorithm together with the upper-bound threshold T_high and the lower-bound threshold T_low. This method produces shot clusters that meet the two conditions listed in Section 5.1. Under the extreme circumstance where the required number of clusters N is either very high or very low, and can not be generated without breaking the two conditions, we use either T_high or T_low to generate clusters of number N' that best approaches N. If N' > N, we sort all the shot clusters in descending order of their combined time lengths, and discard those clusters at the bottom of the list. If N' < N, we define a visual content richness metric, and assign playback time to each shot cluster according to its content richness measure.

5.2.1 Visual Content Richness Metric

For many video programs, the degree of visual change is a good indicator of the amount of visual content conveyed by a video. From the viewpoint of visual content richness, a dynamic video with many changes contains richer visual content than a static video with almost no changes. Based on this observation, a simple visual content richness metric R(S) for a video segment S can be defined as follows:

(10.4)

where f_i is the i^th frame of the video segment S, K is the total number of frames in S, and d(f_i₊₁ - f_i) is the distance between frames i+1 and i in the image feature space.

5.2.2 Clustering Scene Shots Using Minimum Spanning Tree

Minimum spanning tree (MST) is a special kind of graph that connects all its vertices using the shortest path [15]. Let G = (V, E) be a connected, undirected graph where V is the set of vertices, and E is the set of possible interconnections between pairs of vertices. For each edge (u, v) ∊ E , we define a weight w(u, v) specifying the cost to connect u and v. The MST of the graph G = (V, E) is defined as T = (V, P) where P is an acyclic subset P ⊆ E that connects all the vertices, and minimizes the total weight

(10.5)

Construction of MST T for a given graph G is not unique, and could have multiple solutions.

For our scene shot clustering problem, an MST T is constructed for such a graph G = (V, E) where each vertex v ∊ V represents a scene shot of the original video, and each edge (u, v)∊ E represents the distance between shots u and v in the image feature space. By definition, T = (V, P) connects the shot set V with the shortest edge set P. Given a graph G = (V, E), its MST T can be built in time O(|E| lg|V|). The computation time is a dramatic improvement compared to the time O(|V|³) required for the hierarchical clustering.

Once an MST T = (V, P) is constructed for the shot set V, we sort all the edges in P in descending order of their lengths. As T is a tree structure, which means that any two vertices in T are connected by a unique simple path, a cut at any edge will break the tree into two subtrees. Therefore, if N shot clusters are required to compose the video summary, we can cut T at its top N - 1 longest edges to obtain N subtrees, and to use each subtree to form a shot cluster. Let L_N_-1 denote the length of the N - 1'th longest edge of T. All of the N subtrees obtained above have the property that the distance between an arbitrary vertex and its nearest neighbour is less than, or equal to L_N_-1. Because the N - 1'th longest edge of T defines the upper-bound edge of the subsequent N subtrees, to highlight its special role in the shot clustering process, we call it the thresholding edge for cutting the MST.

To achieve a shot clustering meeting the two conditions listed in Section 5.1 using MST, we must put certain restrictions on cutting the tree. Our experiments have shown that, using image features with a reasonable discrimination power, one can easily find an upper-bound threshold T_high and a lower-bound threshold T_low that divide pairs of scene shots into three categories:

The two shots are completely different when their distance in the feature space exceeds T_high.
The two shots are similar when their distance in the feature space is smaller than T_low.
When the distance is between T_high and T_low, the two shots could be judged as either similar or different depending on how strict the similarity criterion is.

The upper-bound and the lower-bound thresholds create an ambiguous zone of judgment. This ambiguous zone provides us with a range of thresholds for cutting the MST. To ensure that the clustering process does not separate visually similar shots into different clusters, nor merge completely different shots into the same clusters, the length of the thresholding edge for cutting the MST must be between T_high and T_low.

5.2.3 Creating Video Summaries

When the required number N of shot clusters is either very high or very low, and can not be generated without breaking the two conditions listed in Section 5.1, we use either T_high or T_low to generate clusters of number N' that best approaches N, and perform the following procedures for the three different cases:

N' = N: From each cluster, take the starting portion of the representative shot together with its corresponding audio segment for T_min seconds, and concatenate these segments in time order to create the video summary.
N' > N : Sort the N' clusters in descending order of their combined time lengths. A cluster's combined time length is the sum of time lengths of all the scene shots contained in the cluster. Select the top N clusters in the sorted list, and perform the operations same to the case N' = N to create the video summary.
N' < N: Compute the average visual content richness (Φ_i) for each cluster Φ_i, and assign playback time t_i defined below to Φ_i:

(10.6)

where τ_i is the combined time length of Φ_i. If t_i is shorter than T_min , set t_i = T_min. If t_i is longer than the longest shot in Φ_i, and Φ_i consists of multiple scene shots, sort the shots in descending order of their lengths, take the longest shot, the second longest shot, etc., until the total length reaches t_i. Once the video segments have been collected from all the shot clusters, concatenate them in time order to create the video summary.

The above operations accomplish the following two things: when the video summary can incorporate less shots (N' > N), those trivial shot clusters with short combined time lengths will be discarded, while redundancies and duplicates will be kept to a minimum; when the video summary can accommodate more shots (N' < N), those shot clusters with a richer visual content metric will be given more exposures within the summary.

In summary, the whole video summarization process consists of the following major operations:

Prompt the user to specify the video summary length T_len, and the minimum time length of visual segments T_min.
Segment the video sequence into individual scene shots.
Perform scene shot clustering using the MST algorithm together with the upper-bound threshold T_high and the lower-bound threshold T_low. The clustering process strives to generate clusters of the required number N; if impossible, then generate clusters of a number N' which best approaches N within the given threshold range.
Create the video summary using the procedures in accordance to the cases N' = N, N' > N, and N' < N .

5.3 Experimental Evaluation

We have tested the image-centric summarization system using five different video programs. The test video programs consist of news reports, documentary, political debate, and live coverage of the tally dispute for the 2000 presidential election, each of which last from 5 to 22 minutes.

Figure 10.5 shows the summarization result on a six-minute news report covering recent Israeli-Palestinian conflict (News report 1 in Table 10.4). The sequence consists of 28 shots, and Figure 10.5 displays the 15 major shots. Each row in the left hand rectangle represents a shot in the original video, and the number of frames in each row is proportional to the time length of the corresponding shot. The same row in the right hand rectangle depicts the video segment(s) taken from the corresponding shot cluster. The user has specified the two parameters T_len = 60 seconds, and T_min = 2 seconds (N = 30 clusters). In this sequence, the field reporter appeared four times, which are at row 2, 4, 6, and 8, respectively. As these four shots are quite static and visually similar, they were clustered together, and were assigned one unit of playback time T_min (row 2 in the right hand rectangle). The similar situation occurs for shot 1 and 15 as well. Shot 9 is a long, dynamic shot that contains many visual changes. It was grouped into one cluster, and was assigned approximately three units of playback time ( 3 T_min ). Similarly, as shots 10 and 14 contain high degrees of visual change, they were also assigned longer than one unit of playback time.

Table 10.4: Experimental Evaluation.
Video Contents	Time Length	Summary Length T_len	Total Shots	Shots Merged	Properly Merged	Shots Assigned More time
Documentary	5 min.	1 min.	17	0	0	12
Political debase	10 min.	2 min.	62	16	16	14
News report 1	6 min.	1 min.	28	6	6	7
News report 2	13 min.	1.5 min.	48	13	13	10
Live event coverage	22 min.	2 min.	42	13	13	28

click to expand
Figure 10.5: An image-centric summarization result.

Table 10.4 shows the detailed evaluation results on five different video sequences. The political debate video contains many shots that display either the same speakers, or the same global view of the studio. As shown in the table, sixteen visually similar shots have been properly merged by the summarization system. However, the merge of the similar shots has yielded less than the requested number of clusters for creating a two-minute summary (case N' < N). Therefore, fourteen shots with higher degrees of visual content change have received longer than one unit of playback time. On the other hand, the documentary program consists of no duplicate shots but many long, dynamic shots that gradually unveil the ongoing event in the field. Twelve dynamic shots have been assigned longer than one unit of playback time, while no shots have been merged in generating the summary.