In this section, we present a sports video analysis framework for video summarization and model instantiation. We discuss the application of the algorithms to soccer video, but the framework is not limited to soccer and can be extended to other sports. Figure 6.6 shows the flowchart of the proposed summarization and analysis framework for soccer video. In the following, we first introduce algorithms that use cinematic features, such as shot boundary detection, shot classification, and slow-motion replay detection, for low-level processing of soccer video. The output of these algorithms serves two purposes: 1) Generation of video summaries defined solely by those features, e.g., summaries of all slow-motion replays. 2) Detection of interesting segments for higher-level video processing, such as event and object detection (shown by the segment selection box in Figure 6.6). In Section 4.2, we present automatic algorithms for the detection of soccer events, and in Section 4.3, we explain object detection and tracking algorithms for object motion descriptor extraction. Then, in Section 4.4, the generation of summaries is discussed. Finally, we elaborate on the complete instantiation of the model for querying by web and professional users.
Figure 6.6:
The flowchart of the proposed summarization and model instantiation framework for soccer video
As mentioned in Section 2.2, semantic analysis of sports video
In the proposed framework, the first low-level operation is the detection of the dominant color region, i.e., grass region, in each frame. Based on the difference in grass colored pixel ratio between two
Long Shot: A long shot displays the global view of the field as shown in Figure 6.7 (a); hence, a long shot serves for accurate localization of the events on the field.
Figure 6.7:
The shot classes in soccer: (a) long shot, (b) in-field medium shot, (c) close-up shot, and (d) out-of-field shot
In-Field Medium Shot: A medium shot, where a whole human body is usually visible as in Figure 6.7 (b), is a zoomed-in view of a specific part of the field.
Close-Up or Out-of-Field Shot: A close-up shot usually shows an above-waist view of one person, as in Figure 6.7 (c). The audience (Figure 6.7 (d)), coach, and other shots are denoted as out-of-field shots. We analyze out-of-field and close-up shots in the same category because of their similar semantic meaning.
Shot classes are useful in several aspects. They can be used for segmentation of a soccer video into plays and breaks [16]. In general, long shots
Classification of a shot into one of the above three classes is based on spatial features. Therefore, the class of a shot can be determined from a single keyframe or from a set of keyframes selected according to certain criteria. To determine the view type of a frame, its grass colored pixel ratio, G, is computed. Intuitively, a low G value corresponds to a close-up or out-of-field view, a high G value indicates a long view, and values in between indicate a medium view. Using only the grass colored pixel ratio, however, mislabels medium shots with a high G value as long shots. The error rate due to this approach depends on the specific broadcasting style and it usually
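G_R2: The grass colored pixel ratio in the second region
R_diff: The average of the sum of the absolute grass colored pixel ratio differences between R_1 and R_2, and between R_2 and R_3:
$$R_{\mathrm{diff}} = \frac{1}{2}\left( \left| G_{R_1} - G_{R_2} \right| + \left| G_{R_2} - G_{R_3} \right| \right) \tag{6.1}$$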
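The grass-ratio features described above can be illustrated with a minimal sketch. It assumes a binary grass mask (1 = grass colored pixel) is already available, and it assumes that R_1, R_2, and R_3 are three vertical strips obtained by dividing the frame width in 3:5:3 proportion; the text above does not specify the layout of the regions, so this is only one plausible reading of the Golden Section rule. The function names and the thresholds `t_low` and `t_high` are hypothetical and are not taken from the framework.

```python
def grass_ratio(mask):
    """Fraction of grass pixels in a binary mask given as a list of rows of 0/1."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


def golden_section_strips(mask):
    """Split the mask into three vertical strips in 3:5:3 proportion (assumed layout)."""
    width = len(mask[0])
    c1 = round(width * 3 / 11)
    c2 = round(width * 8 / 11)
    return (
        [row[:c1] for row in mask],
        [row[c1:c2] for row in mask],
        [row[c2:] for row in mask],
    )


def region_features(mask):
    """Return (G_R2, R_diff), with R_diff computed as in Eq. (6.1)."""
    r1, r2, r3 = golden_section_strips(mask)
    g1, g2, g3 = grass_ratio(r1), grass_ratio(r2), grass_ratio(r3)
    r_diff = 0.5 * (abs(g1 - g2) + abs(g2 - g3))
    return g2, r_diff


def view_from_ratio(g, t_low=0.1, t_high=0.4):
    """Coarse view type from G alone; both thresholds are hypothetical."""
    if g < t_low:
        return "close-up/out-of-field"
    if g > t_high:
        return "long"
    return "medium"
```

As noted above, G alone cannot separate medium shots with a high grass ratio from long shots, which is why the region features are computed in addition to it. How the features are combined into a final decision is not shown in this sketch.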
Figure 6.8:
Grass/non-grass segmented long and medium views and the regions determined by Golden Section spatial composition rule
Then, we
Figure 6.9:
The flowchart of the shot classification algorithm
In the proposed framework, all goal events and the events in and around penalty boxes are detected. Goal events are detected in real-time by using only cinematic features. Each segment can be further classified according to whether it contains events in and around the penalty box. These events may be free kicks, saves, penalties, and so on.
A goal is scored when the whole of the ball
Duration of the break: A break due to a goal lasts no less than 30 and no more than 120 seconds.
The occurrence of at least one close-up/out-of-field shot: This shot may either be a close-up of a player or out-of-field view of the audience.
The existence of at least one slow-motion replay shot: The goal play is always replayed one or more times.
The relative position of the replay shot(s): The replay shot(s) follow the close-up/out-of-field shot(s).
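The four conditions of the cinematic template listed above can be sketched as a simple predicate over the shots of a candidate break. This is a minimal illustration, not the framework's implementation: the `Shot` record, its field names, and the reading of the fourth condition (at least one close-up/out-of-field shot occurs before the first replay) are assumptions made for the example. Only the 30 to 120 second bounds come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    kind: str                # "long", "medium", or "closeup_or_out" (hypothetical labels)
    start: float             # seconds
    end: float               # seconds
    is_replay: bool = False  # True for slow-motion replay shots


def matches_goal_template(break_shots: List[Shot],
                          min_len: float = 30.0,
                          max_len: float = 120.0) -> bool:
    """Check the four cinematic conditions on the shots that make up one break."""
    if not break_shots:
        return False
    # 1) Duration of the break.
    duration = break_shots[-1].end - break_shots[0].start
    if not (min_len <= duration <= max_len):
        return False
    # 2) At least one close-up/out-of-field shot.
    closeups = [i for i, s in enumerate(break_shots)
                if s.kind == "closeup_or_out" and not s.is_replay]
    # 3) At least one slow-motion replay shot.
    replays = [i for i, s in enumerate(break_shots) if s.is_replay]
    if not closeups or not replays:
        return False
    # 4) Replay shot(s) follow the close-up/out-of-field shot(s).
    return min(replays) > min(closeups)
```

For instance, a 53-second break made of a close-up, an out-of-field shot, and three replay shots, in that order, satisfies the predicate; the same shot pattern compressed into 20 seconds does not.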
Figure 6.10 shows the instantiation of the cinematic goal template for the first goal in the Spain sequence of the MPEG-7 test set. The break due to this goal lasts 53 seconds, and three slow-motion replay shots are broadcast during this break. The segment selection for goal event templates starts with the detection of the slow-motion replay shots. For every slow-motion replay shot, we find the long shots that define the start and the end of the corresponding break. These long shots must indicate a play, which is determined by a simple duration constraint, i.e., long shots of short duration are discarded as breaks. Finally, the conditions of the template are
Figure 6.10:
The occurrence of a goal and its break: (left to right) goal play as a long shot, close-up of the scorer, out-of-field view of the fans (middle), 3rd slow-motion replay shot, the restart of the game as a long shot
The events occurring in and around penalty boxes, such as saves, shots wide, shots on goal, penalties, free kicks, and so on, are important in soccer. To classify a summary segment as consisting of such events, penalty boxes are detected. As explained in Section 4.1, field lines in a long view can be used to localize the view and/or register the current frame on the standard field model. In this section, we reduce the penalty box detection problem to the search for three parallel lines. In Figure 6.11, a model of the whole soccer field is shown, and three parallel field lines, shown in bold on the right, become visible when the action occurs around one of the penalty boxes.
Figure 6.11:
Soccer field model (left) and the highlighted three parallel lines of a penalty box
In order to detect the three lines, we use the grass detection result of Section 4.1. The edge response of non-grass pixels is used to separate line pixels from other non-grass pixels, where the edge response of a pixel is computed by a 3x3 Laplacian mask [34]. The pixels with the highest edge response, the threshold of which is automatically determined from the histogram of the gradient
Figure 6.12:
Penalty Box detection by three parallel lines
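The edge-response step that precedes the line search can be sketched as follows. The text only states that a 3x3 Laplacian mask is used; the 4-neighbour kernel below is one common choice and is an assumption, as are the function name and the list-of-rows image representation. The automatic threshold selection and the search for three parallel lines are not shown.

```python
# A common 4-neighbour 3x3 Laplacian kernel (assumed; the exact mask is not given).
LAPLACIAN_3X3 = (
    (0, 1, 0),
    (1, -4, 1),
    (0, 1, 0),
)


def edge_response(gray, grass_mask):
    """Absolute Laplacian response at non-grass pixels; grass and border pixels get 0.

    gray       -- list of rows of intensity values
    grass_mask -- list of rows of 0/1, where 1 marks a grass colored pixel
    """
    height, width = len(gray), len(gray[0])
    response = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            if grass_mask[y][x]:
                continue  # only non-grass pixels are candidates for line pixels
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += LAPLACIAN_3X3[dy + 1][dx + 1] * gray[y + dy][x + dx]
            response[y][x] = abs(acc)
    return response
```

Thin, bright field lines on a darker grass background produce a strong response, while large uniform non-grass regions, such as a player's jersey interior, produce a weak one.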
In this section, we first describe a referee detection and tracking algorithm, since the referee is a significant object in sports video, and the presence of the referee in a shot may indicate an event, such as a red/yellow card or a penalty. Then, we present a player tracking algorithm that uses feature point correspondences between frame pairs.
Referees in soccer
Figure 6.13:
Referee Detection by horizontal and vertical projections
The decision about the existence of the referee in the current frame is based on the following size-invariant shape descriptors:
The ratio of the area of the MBR_ref to the frame area: A low value indicates that the current frame does not contain a referee.
MBR_ref aspect ratio (width/height): It determines if the MBR_ref corresponds to a human region.
Feature pixel ratio in the MBR_ref: This feature approximates the compactness of the MBR_ref; higher compactness values are favored.
The ratio of the number of feature pixels in the MBR_ref to that of the outside: It measures the correctness of the single referee assumption. When this ratio is low, the single referee assumption does not hold, and the frame is discarded.
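The four shape descriptors above can be computed from a binary feature-pixel mask and a candidate bounding rectangle, as in the following minimal sketch. The mask representation, the `(top, left, bottom, right)` box convention, the function names, and all threshold values in `looks_like_referee` are hypothetical; the text does not give the actual decision thresholds.

```python
def referee_descriptors(feature_mask, mbr):
    """Compute the four size-invariant descriptors for a candidate MBR_ref.

    feature_mask -- list of rows of 0/1, where 1 marks a feature pixel
    mbr          -- (top, left, bottom, right), bottom and right exclusive
    """
    top, left, bottom, right = mbr
    frame_h, frame_w = len(feature_mask), len(feature_mask[0])
    box_h, box_w = bottom - top, right - left
    inside = sum(feature_mask[y][x]
                 for y in range(top, bottom)
                 for x in range(left, right))
    outside = sum(sum(row) for row in feature_mask) - inside
    return {
        "area_ratio": (box_h * box_w) / (frame_h * frame_w),
        "aspect_ratio": box_w / box_h,
        "fill_ratio": inside / (box_h * box_w),
        "inside_outside_ratio": inside / outside if outside else float("inf"),
    }


def looks_like_referee(d, min_area=0.01, aspect=(0.2, 0.8),
                       min_fill=0.4, min_in_out=3.0):
    """Combine the descriptors with hypothetical thresholds."""
    return (d["area_ratio"] >= min_area
            and aspect[0] <= d["aspect_ratio"] <= aspect[1]
            and d["fill_ratio"] >= min_fill
            and d["inside_outside_ratio"] >= min_in_out)
```

A frame with many feature pixels outside the rectangle yields a low inside/outside ratio, which corresponds to the case in which the single referee assumption does not hold and the frame is discarded.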
Tracking of the referee is achieved by region correspondence; that is, the referee template, which is found by the referee detection algorithm, is tracked in the other frames. In Figure 6.14, the output of the tracker is shown for a medium shot of the MPEG-7 Spain sequence.
Figure 6.14:
Referee detection and tracking
The players are tracked in the long shot segments that precede slow-motion replay shots of interesting events. These long shots consist of the normal-motion action of the corresponding replay shots. For example, for the replay shots of the goal event in Figure 6.10, we find the long shot whose keyframe is shown as the leftmost frame in Figure 6.10. The aim of tracking objects and registering each frame onto a standard field is to extract EMU and ERU descriptors by the algorithms in [25] and [35].
The tracking algorithm takes the object position and object ID (or
Figure 6.15:
The flowchart of the tracking algorithm
The bounding box location is corrected by integrating the information about the region bounding box, the initial estimate of the bounding box, and the motion history of the object. If the
Figure 6.16:
Example tracking of a player in the Spain sequence
The registration of each frame of a long shot involves field line detection. Low-level image processing operations, such as color segmentation, edge detection, and thinning, are applied to each frame before the Hough transform. The integration of prior knowledge about field line locations as a set of constraints, such as the number of lines and parallelism,
Figure 6.17:
Line detection examples in the Spain sequence
The proposed framework includes three types of summaries: 1) All slow-motion replay shots in a game, 2) all goals in a game, and 3) extensions of both with detected events. The first two types of summaries are based solely on cinematic features, and are generated in real-time; hence they are particularly
Slow-motion summaries are generated by shot boundary, shot class, and slow-motion replay features, and consist of slow-motion shots. Depending on the requirements, they may also include all shots in a predefined time window around each replay, or, instead, they can include only the closest long shot before each replay in the summary, since the
Model instantiation is necessary for the resolution of model-based queries
The interactive part of the instantiation process is