Following the above description of a typical sports video, the annotation task is organized into four distinct subtasks: i) shot pre-classification (aimed at extracting the actual sports actions from the video stream); ii) classification of graphic features (which, in sports videos, are mainly text captions that are not synchronized with shot changes); iii) camera and object motion analysis and mosaicing; and iv) classification of visual shot features. The first three subtasks are sport independent, while the fourth requires embedding some domain knowledge, e.g., information on a playfield. This subtask can be further specialized for different kinds of sports, such as detecting specific parts of a playfield, e.g., the soccer goal box area. Video shot segmentation will not be described here; readers can find a thorough description of several video segmentation algorithms in [6] and [18]. While this general framework meets the requirements of a variety of sport domains, the specificity of each domain has to be addressed when developing systems that are expected to detect highlights specific to different sports. Section 4 shows how to combine the results of these subtasks to recognize specific highlights.
This section expounds on the contextual analysis of the application domain and on the implementations of modules supporting each of the subtasks: contextual analysis examines specificity of data and provides an overview on the rationale underlying selection of relevant features; the implementation describes how to compute the features, and how the feature combination rules are implemented.
Contextual analysis. The anchorman/interview shot classification module provides a simple preliminary classification of shot content, which can be further exploited and enhanced by subsequent modules. The need for this type of classification stems from the fact that some video feeds contain interviews and studio scenes featuring anchormen and athletes. An example is that of the Olympic Games, where the material that must be logged is often pre-edited by the hosting broadcaster, and may contain such kinds of shots. The purpose of this module is to roughly separate shots that may contain sport scenes from shots that do not. To this end, a statistical approach can be followed to analyze visual content similarity and motion features of the anchorman shots, without requiring any predefined shot content model to be used as a reference. In fact, this independence from a predefined model is required in order to correctly manage interviews, which do not feature a standard studio set-up: athletes are usually interviewed near the playfield, and each interview has a different background and location. The detection of studio scenes also requires independence from a shot content model, since the "style" changes often, and each program has its own style, which would require the creation and maintenance of a database of shot content models.
Studio scenes show a well defined syntax: shot location is consistent within the video, the number of cameras and their view field is limited, the sequence of shot content can be represented by a repeating pattern. An example of such a structure is shown in Figure 5.2, where the first frames of the five successive shots comprising a studio scene are shown.
Figure 5.2: Studio scene with alternating anchorman shots.
Implementation. Shots of the studio/interview are repeated at intervals of variable length throughout the sequence. The first step for the classification of these shots stems from this assumption and is based on the computation, for each video shot $S_k$, of its shot lifetime $L(S_k)$. The shot lifetime measures the shortest temporal interval that includes all the occurrences of shots with similar visual content within the video. Given a generic shot $S_k$, its lifetime is computed by considering the set $T_k = \{t_i \mid \sigma(S_k, S_i) < \tau_s\}$, where $\sigma(S_k, S_i)$ is a similarity measure applied to keyframes of shots $S_k$ and $S_i$, $\tau_s$ is a similarity threshold, and $t_i$ is the value of the time variable corresponding to the occurrence of the keyframe of shot $S_i$. The lifetime of shot $S_k$ is defined as $L(S_k) = \max(T_k) - \min(T_k)$. Shot classification is based on fitting a bimodal distribution to the values of $L(S_k)$ for all the video shots. This allows for the determination of a threshold value $t_l$ that is used to classify shots into the sport and studio/interview categories. In particular, all the shots $S_k$ such that $L(S_k) > t_l$ are classified as studio/interview shots, where $t_l$ was determined according to the statistics of the test database, and set to 5 sec. The remaining shots are classified as sport shots. Typical videos in the target domain do not contain complete studio shows and, in feeds produced on location, interviews have a limited duration and shot length. This makes it possible to reduce false detections caused by the repetition of similar sport scenes (e.g., as in the case of edited magazine programs or summaries) by limiting the search for similar shots to a window of shots. The adopted similarity metric is a histogram intersection of the mean color histograms of shots. Using the mean histogram takes into account the dynamics of sport scenes.
In fact, even if some scenes take place in the same location, so that the color histograms of their first frames may be similar, the actions that follow yield different color histograms. For studio/interview shots, where lighting changes are much more limited and the reduced camera and object movement does not introduce new objects, the mean histogram is instead stable.
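The lifetime computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of `1 - intersection` as the quantity compared against the threshold (so that smaller values mean more similar shots, matching the inequality in the definition), and the way the search window is expressed as a number of neighboring shots are all assumptions.

```python
import numpy as np

def mean_histogram(frame_histograms):
    """Average the per-frame color histograms of a shot and normalize to sum 1."""
    h = np.mean(np.asarray(frame_histograms, dtype=float), axis=0)
    return h / h.sum()

def dissimilarity(h1, h2):
    """1 - histogram intersection (assumption): 0 for identical normalized histograms."""
    return 1.0 - np.minimum(h1, h2).sum()

def shot_lifetimes(histograms, times, tau_s, window):
    """L(S_k) = max(T_k) - min(T_k), searching only +/- `window` shots around S_k."""
    n = len(histograms)
    lifetimes = np.zeros(n)
    for k in range(n):
        lo, hi = max(0, k - window), min(n, k + window + 1)
        # T_k: keyframe times of the shots judged similar to S_k (S_k itself included).
        t_k = [times[i] for i in range(lo, hi)
               if dissimilarity(histograms[k], histograms[i]) < tau_s]
        lifetimes[k] = max(t_k) - min(t_k)
    return lifetimes

def classify(lifetimes, t_l=5.0):
    """Shots whose lifetime exceeds t_l (seconds) are studio/interview candidates."""
    return ["studio/interview" if l > t_l else "sport" for l in lifetimes]
```

With alternating anchorman shots, each shot finds a similar shot a few seconds away and receives a long lifetime, whereas a sport shot that is never repeated within the window receives a lifetime of zero.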
Although the mean color histogram accounts for minor variations due to camera and object movement, it does not take into account spatial information. Results of the first classification step are therefore refined by considering motion features of the studio/interview shots. This builds on the assumption that in an anchorman shot both the camera and the anchorman are almost steady. In contrast, for sport shots, background objects and camera movements (persons, free-hand shots, camera panning and zooming, changes in scene lighting) cause relevant motion components throughout the shot. Classification refinement is performed by computing an index of the quantity of motion, $Q_M$, for each possible anchorman shot. Only those shots whose $Q_M$ does not exceed a threshold $\tau_M$ are definitively classified as studio/interview shots.
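The text does not specify how the quantity of motion is computed. As a rough sketch only, the refinement step could look as follows, assuming (hypothetically) that the index is the mean absolute luminance difference between consecutive frames of a shot; the function names are illustrative.

```python
import numpy as np

def quantity_of_motion(frames):
    """Assumed index: mean absolute difference between consecutive luminance frames."""
    frames = np.asarray(frames, dtype=float)
    return float(np.abs(np.diff(frames, axis=0)).mean())

def refine(candidates, tau_m):
    """Keep only the candidate shots whose quantity of motion does not exceed tau_m.

    candidates: dict mapping a shot identifier to its sequence of luminance frames.
    """
    return [shot_id for shot_id, frames in candidates.items()
            if quantity_of_motion(frames) <= tau_m]
```

Any other motion index (for example one derived from tracked corners) could be substituted without changing the thresholding logic.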
Contextual analysis. In sports videos, graphic objects (GO) may appear anywhere within the frame, even if most of the time they are placed in the lower third or quarter of the image. The aspect ratio of the GO zones also varies, e.g., the roster of a team occupies a vertical box, while the name of a single athlete usually occupies a horizontal box (see Figure 5.3). For text graphics, character fonts may vary in size and typeface, and may be superimposed either on an opaque background or directly on the image captured by the camera. GOs often appear and disappear gradually, through dissolve or fade effects. These properties call for automatic GO localization algorithms with the least amount of heuristics and possibly no training.
Figure 5.3: Examples of superimposed graphic objects.
Several features such as edges and textures have been used in past research as cues of superimposed GOs [14] [31]. Such features represent global properties of images, and require the analysis of large frame patches. Moreover, natural objects such as woods and leaves, or man-made objects such as buildings and cars, may present a local combination of such features that can be wrongly classified as a GO [9].
In order both to reduce the visual information to a minimum and to preserve local saliency, we have elected to work with image corners, extracted from the luminance information of images. Relying on luminance alone is very appealing for the purpose of GO detection and localization, in that it avoids many of the misclassification problems that arise with color-based approaches. This is particularly important when considering the characteristics of television standards, which require a spatial sub-sampling of the chromatic information, so that the borders of captions are affected by color aliasing. For this reason, to enhance the readability of characters, producers typically exploit luminance contrast, since luminance is not spatially subsampled and human vision is more sensitive to it than to color contrast. Another appealing aspect of a corner-based GO detection algorithm is that it does not require any knowledge of, or training on, the features of superimposed captions.
Implementation. The salient points of each frame, which are analyzed in the following steps, are extracted by applying the Harris algorithm to the luminance map of the frame. Corner extraction greatly reduces the amount of spatial data to be processed by the GO detection and localization system. The most basic property of GOs is that they must remain stable for a certain amount of time, in order to let people read and understand them. This property is used in the first step of GO detection. Each corner is checked to determine if it is still present in the same position in at least 2 more frames within a sliding window of 4 frames.
Each corner that complies with this property is marked as persistent, and is kept for further analysis, while all the others are discarded. Only every 8th frame is processed to extract its corners, thus further reducing the computational resources needed to process a whole video. This choice builds on the assumption that, in order to be perceived and understood by the viewer, a GO must be stable on the screen for 1 second. The patch surrounding each corner is inspected, and if there are not enough neighboring corners (i.e., corners whose patches intersect it), the corner is not considered in further processing.
This process, which is repeated a second time in order to eliminate corners that become isolated after the first pass, prevents isolated high-contrast background objects in static scenes from being recognized as possible GO zones.
An unsupervised clustering is performed on the corners that comply with the temporal and spatial features described above. This is aimed at determining bounding boxes for GOs (Figures 5.4 and 5.5). For each bounding box the percentage of pixels that belong to the corner patches is calculated, and if it is below a predefined threshold the corners are discarded. This strategy reduces the noise due to high contrast background during static scenes, which typically produce small scattered zones of corners that cannot be eliminated by the spatial feature analysis. An example of GO detection is provided in Figure 5.5.
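The bounding-box density check can be illustrated with the short sketch below. It assumes that the clustering step (not reproduced here) has already grouped the corners, that corner patches are squares of half-size `half`, and that `min_fill` stands for the predefined threshold; all of these names are hypothetical.

```python
import numpy as np

def box_fill_ratio(corners, half, frame_shape):
    """Return the bounding box (x0, y0, x1, y1) of a corner cluster and the
    fraction of its pixels that are covered by corner patches."""
    h, w = frame_shape
    mask = np.zeros(frame_shape, dtype=bool)
    for x, y in corners:
        mask[max(0, y - half):min(h, y + half + 1),
             max(0, x - half):min(w, x + half + 1)] = True
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x0, x1 = max(0, min(xs) - half), min(w, max(xs) + half + 1)
    y0, y1 = max(0, min(ys) - half), min(h, max(ys) + half + 1)
    return (x0, y0, x1, y1), float(mask[y0:y1, x0:x1].mean())

def keep_dense_boxes(clusters, half, frame_shape, min_fill):
    """Keep only the bounding boxes that are sufficiently covered by patches."""
    boxes = []
    for corners in clusters:
        box, ratio = box_fill_ratio(corners, half, frame_shape)
        if ratio >= min_fill:
            boxes.append(box)
    return boxes
```

Text captions produce tightly packed corners and therefore well-filled boxes, while a few scattered background corners yield a large, mostly empty box that is rejected.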
Figure 5.4: a) Source frame; b) Detected captions with noise removal.
Figure 5.5: Detection results for the frames in Figure 5.3.
Evaluation of results takes into account GO detection (whether the appearance of a GO is correctly detected) and correct detection of the GO's bounding box. Typical results of GO detection have a precision of 80.6% and a recall of 92%. Missed detections occur mostly in VHS videos, and only a few in DV videos. The GO bounding box miss rate is 5%. Results also included false detections due to scene text.
Contextual analysis. Camera and object motion parameters have been used to detect and recognize highlights in sports videos. The use of camera operations, and of the lack of motion, has been described in the specific context of soccer in [4] and [17]. The use of camera motion (based on a 3-parameter model) for the purpose of video classification has been discussed in [22]. Mosaicing is an intra-shot video summarization technique, applied to a single video shot. It is based on the visual tracking of all motions in the shot, whether due to camera action or to scene objects moving independently of the camera. Image pixels whose motion is due to camera action are labeled as background, and the others as foreground. A single mosaic image summarizing the whole visual content of the shot is then incrementally obtained from all new background pixels as they enter the field of view of the camera; camera action is also recorded for every shot frame. Foreground information is stored separately, by tracing frame by frame the image shape and trajectory of each of the objects in individual motion. Figures 5.6 and 5.7 show a mosaic image and the pixels associated with the diver in an intermediate frame of the diving shot. The use of mosaic images for highlight presentation has been described in [24] and [26]. Mosaic images are also used in [17] to show soccer highlights.
Figure 5.6: Mosaic image of a dive.
Figure 5.7: Intermediate diver frame.
Two of the main paradigms for motion estimation are feature-based and correlation-based, each complementing the other [10]. Feature-based techniques compute image motion by first tracing features of interest (edges, corners, etc.) from frame to frame, and then inferring a set of motions compatible with the set of feature correspondences. Feature-based motion computation is known to be robust (since it works with a selected number of salient image locations), to work well even with large displacements, and to be quite fast, thanks to the limited number of image locations being analyzed. Yet, in applications requiring that each image pixel be labeled as belonging to a motion class, the feature-based approach loses much of its appeal, since passing from sparse computations to dense estimates is both an ambiguous and a slow task. Correlation-based techniques compute the motions of whole image patches through block matching. Unlike feature-based methods, these approaches are intrinsically slow, due to the high number of elementary operations to be performed at each pixel. Moreover, block matching works only for small image displacements, thus requiring the use of a multiresolution scheme in the presence of large displacements. Nonetheless, correlation-based motion computation techniques automatically produce dense estimates, and can be optimized in order to work reasonably fast.
Implementation. In this paragraph a basic feature-based algorithm for mosaicing using corner features is first outlined [32]. The image motion between successive shot frames is evaluated through a three-step analysis: (1) corner detection; (2) corner tracking; (3) motion clustering and segmentation. A fourth step, namely mosaic updating, concludes the processing of a generic shot frame.
Corner Detection. An image location is defined as a corner if the intensity gradient in a patch around it is distributed along two preferred directions. Corner detection is based on the Harris algorithm (see Figure 5.8).
Figure 5.8: Image corners.
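As an illustration of the corner definition above, the following sketch computes a Harris-style response from the local gradient structure of a luminance image. It is a simplified version written for clarity: a plain box window replaces the usual Gaussian weighting, and the constant `k`, the window radius `r` and the relative threshold are common but assumed values.

```python
import numpy as np

def box_filter(a, r):
    """Sum of `a` over a (2r+1) x (2r+1) window, with zero padding (simple, not fast)."""
    p = np.pad(a, r, mode="constant")
    out = np.zeros(a.shape, dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(lum, r=2, k=0.04):
    """det(M) - k * trace(M)^2, where M is the windowed gradient structure matrix."""
    lum = np.asarray(lum, dtype=float)
    iy, ix = np.gradient(lum)          # derivatives along rows (y) and columns (x)
    sxx = box_filter(ix * ix, r)
    syy = box_filter(iy * iy, r)
    sxy = box_filter(ix * iy, r)
    return sxx * syy - sxy ** 2 - k * (sxx + syy) ** 2

def harris_corners(lum, r=2, k=0.04, rel_thresh=0.1):
    """Return (x, y) locations whose response exceeds a fraction of the maximum."""
    resp = harris_response(lum, r, k)
    ys, xs = np.nonzero(resp > rel_thresh * resp.max())
    return list(zip(xs.tolist(), ys.tolist()))
```

The response is positive where the gradient is spread along two directions (corners), negative along straight edges, and close to zero in flat regions. A practical detector would normally also apply non-maximum suppression, which is omitted here.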
Corner Tracking. To perform intra-shot motion parameters estimation, corners are tracked from frame to frame, according to an algorithm originally proposed in [35] and modified by the authors to enhance tracking robustness. The algorithm optimizes performance according to three distinct criteria, namely:
Frame similarity: The image content in the neighborhood of a corner is virtually unchanged in two successive frames; hence, the matching score between image points can be measured via a local correlation operator.
Proximity of Correspondence: As frames go by, corner points follow smooth trajectories in the image plane, which allows the search space for each corner to be reduced to a small neighborhood of its expected location, as inferred from previous tracking results.
Corner Uniqueness: Corner trajectories cannot overlap, i.e., it is not possible that at the same time two corners share the same image location. Should this happen, only the corner point with higher correlation would be maintained, while the other would be discarded.
Since the corner extraction process is heavily affected by image noise (the number and individual locations of corners vary significantly in successive frames; also, a corner extracted in one frame, albeit still visible, could be ignored in the next one), the modified algorithm implements three different corner matching strategies, ensuring that the above tracking criteria are fulfilled:
strong match, taking place between pairs of locations classified as corners in two consecutive frames;
forced match, image correlation within the current frame, in the neighborhood of a previously extracted corner;
backward match, image correlation within the previous frame, in the neighborhood of a currently extracted corner.
These matching strategies ensure that a corner trajectory continues to be traced even if, in some instants, the corresponding corner fails to be detected.
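A minimal sketch of the correlation-based part of this scheme is given below. It is not the algorithm of [35] nor the authors' modification: normalized cross-correlation is an assumed choice for the local correlation operator, and the function names, patch half-size and search radius are illustrative. The corner passed in is assumed to lie at least `half` pixels away from the image border.

```python
import numpy as np

def patch(img, x, y, half):
    """Square patch of side 2*half+1 centered on (x, y)."""
    return img[y - half:y + half + 1, x - half:x + half + 1]

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def forced_match(prev, curr, corner, expected, half=3, radius=4):
    """Search `curr` around the expected location for the patch that best
    correlates with the patch around `corner` in `prev`."""
    h, w = curr.shape
    ref = patch(prev, corner[0], corner[1], half)
    ex, ey = expected
    best, best_score = None, -1.0
    for y in range(max(half, ey - radius), min(h - half, ey + radius + 1)):
        for x in range(max(half, ex - radius), min(w - half, ex + radius + 1)):
            score = ncc(ref, patch(curr, x, y, half))
            if score > best_score:
                best, best_score = (x, y), score
    return best, best_score

def enforce_uniqueness(matches):
    """matches: corner id -> ((x, y), score). When two corners land on the
    same location, keep only the one with the higher correlation."""
    best = {}
    for cid, (loc, score) in matches.items():
        if loc not in best or score > matches[best[loc]][1]:
            best[loc] = cid
    return {cid: matches[cid] for cid in best.values()}
```

A backward match can be obtained with the same function by swapping the roles of the two frames, i.e., searching the previous frame around a corner extracted in the current one.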
Motion clustering and segmentation. After corner correspondences have been established, a motion clustering technique is used to obtain the most relevant motions present in the current frame. Each individual 2D motion of the scene is detected and described by means of the 6-parameter affine motion model. Motion clustering takes place starting from the set of corner correspondences found for each frame. A robust estimation method is adopted, guaranteeing on the one hand an effective motion clustering, and on the other a good rejection of false matches (clustering outliers). The actual motion-based segmentation is performed by introducing spatial constraints on the classes obtained in the previous motion clustering phase. Compact image regions featuring homogeneous motion parameters (thus corresponding to single, independently moving objects) are extracted by region growing. The motion segmentation algorithm is based on the computation of an a posteriori error, obtained from plain pixel differences between pairs of frames realigned according to the extracted affine transformations. Figure 5.9 shows the results of the segmentation of the frame of Figure 5.8 into its independent motions.
Figure 5.9: Segmented image (all black pixels are considered part of the background).
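The text does not name the robust estimator used for motion clustering. As a sketch of the general idea only, the example below fits the 6-parameter affine model to corner correspondences by least squares and uses RANSAC (a swapped-in, commonly used robust scheme, not necessarily the one adopted by the authors) to reject false matches; tolerance, number of trials and function names are assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 6-parameter affine model: dst ~ A @ src + t, as a 2x3 matrix."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return params.T

def residuals(M, src, dst):
    """Distance between each transformed source point and its match."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    pred = src @ M[:, :2].T + M[:, 2]
    return np.linalg.norm(pred - dst, axis=1)

def ransac_affine(src, dst, tol=1.0, trials=200, seed=0):
    """Fit the dominant affine motion, treating other correspondences as outliers."""
    rng = np.random.default_rng(seed)
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(trials):
        idx = rng.choice(len(src), size=3, replace=False)  # minimal sample
        M = fit_affine(src[idx], dst[idx])
        inliers = residuals(M, src, dst) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all the correspondences that agree with the best model.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers
```

Running the same procedure again on the rejected correspondences would yield the next most relevant motion, which is one simple way to obtain several motion classes per frame.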
Mosaic updating. All the black-labeled pixels of Figure 5.9 are considered as part of the background, and are thus used to create the mosaic image of Figure 5.10. In this algorithm, to obtain the corresponding mosaic pixel, all background pixels present in multiple shot frames were simply averaged together.
Figure 5.10: Mosaic image.
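The averaging step can be sketched as follows. For brevity, this sketch assumes that each frame has already been registered to the mosaic by an integer translation `(dx, dy)`; in the approach described above the alignment would come from the estimated affine background motion. Function and parameter names are illustrative.

```python
import numpy as np

def build_mosaic(frames, masks, offsets, mosaic_shape):
    """Average background pixels of registered frames into one mosaic.

    frames:  list of 2D luminance arrays
    masks:   list of boolean arrays, True where the pixel is background
    offsets: list of (dx, dy) positions of each frame's top-left corner
             in mosaic coordinates (assumed integer translations)
    """
    total = np.zeros(mosaic_shape)
    count = np.zeros(mosaic_shape)
    for frame, mask, (dx, dy) in zip(frames, masks, offsets):
        h, w = frame.shape
        total[dy:dy + h, dx:dx + w] += np.where(mask, frame, 0.0)
        count[dy:dy + h, dx:dx + w] += mask
    # Pixels never observed as background stay at zero.
    mosaic = np.divide(total, count, out=np.zeros(mosaic_shape), where=count > 0)
    return mosaic, count
```

Foreground pixels are simply excluded from the sums, so a moving object does not leave a trace in the mosaic as long as the background behind it is visible in at least one other frame.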
Contextual analysis. Generic sports videos feature a number of different scene types, intertwined with each other in a live video feed reporting on a single event, or edited into a magazine summarizing highlights of different events.
A preliminary analysis of videos reveals that 3 types of scenes prevail: playfield, player and audience (see Figure 5.11). Most of the action of a sports game takes place on the playfield; hence the relevance of playfield scenes, which show mutual interactions among subjects (e.g., players, referees, etc.) and objects (e.g., ball, goal, hurdles, etc.). However, along with playfield scenes, a number of scenes appear in videos that feature close-ups of players, or frame the audience. The former typically show a player who had a relevant role in the most recent action (e.g., the athlete who just failed a javelin throw, or the player who took the penalty). The latter occur at the beginning and at the end of an event, when nothing is happening on the playfield, or just after a highlight (e.g., in soccer, when a player scores a goal, audience shots are often shown immediately after). It is thus possible to use these scenes as cues for the detection of highlights.
Figure 5.11: Although sports events reported on in a video may vary significantly, distinguishing features are shared across this variety. For example, playfield lines are a concept that is explicitly present in some outdoor and indoor sports (e.g., athletics or swimming), but can also be extended to other sports (e.g., cycling on public roads). Similarly, player and audience scenes appear in most sports videos.
In a sample of 1267 keyframes, extracted randomly from our data set (obtained from the BBC Sports Library), approximately 9% were audience scenes, whereas player scenes represented up to 28% of the video material. To address the specificity of such a variety of content types, we devised a hierarchical classification scheme. The first stage performs a classification in terms of the categories of playfield, player and audience, with a twofold aim: on the one hand, this provides an annotation of video material that is meaningful for users' tasks; on the other hand it is instrumental for further processing, such as identification of sports type and highlight detection.
Inspection of video material reveals that: i) playfield shots typically feature large homogeneous color regions and distinct long lines; ii) in player shots, the shape of the player appears distinctly in the foreground, and the background of the image tends to be homogeneous, or blurred (either because of camera motion or lens effects); iii) in audience shots, individuals in the audience do not always appear clearly, but the audience as a whole appears as a texture. These observations suggest that basic edge and shape features could significantly help in differentiating among playfield, player and audience scenes. It is also worth pointing out that models for these classes do not vary significantly across different sports, events and sources.
We propose here that identification of the type of sport represented in a shot relies on playfield detection. In fact, we can observe that: i) most sports events take place on a playfield, with each sport having its own playfield; ii) each playfield has a number of distinguishing features, the most relevant of which is color; iii) the playfield appears in a large number of frames of a video shot, and often covers a large part of the camera frame (i.e., a large area of the single images comprising the video). Hence, the playfield, and the objects that populate it, may effectively support identification of sports types. Therefore, in our approach, sports type identification is applied to playfield shots output by the previous classification stage.
Playfield shape and playfield lines can also be used to perform highlight detection, besides sport classification. For example, detection of a tennis player near the net may be considered a cue of a volley; tracking swimmers near the turning wall helps identify the backstroke turn, etc. In Section 4, playfield shape and playfield line features will be used to detect soccer highlights.
Implementation. A feature vector comprising edge, segment and color features was devised. Some of the selected features are represented in Figure 5.12 for representatives of the three classes (playfield, player and audience, respectively).
Figure 5.12: Edge, segment length and orientation, and hue distribution for the three representative sample images in the first row of Figure 5.11. Synthetic indices derived from these distributions make it possible to differentiate among the three classes of playfield, player, and audience. (Note that hue histograms are scaled to the maximum value.)
Edge detection is first performed, and a successive growing algorithm is applied to edges to identify segments in the image [20]. The distribution of edge intensities is analyzed to evaluate the degree of uniformity. Distributions of lengths and orientations of segments are also analyzed, to extract the maximum length of segments in an image, as well as to detect whether peaks exist or not in the distribution of orientations. This choice was driven by the following observations: playfield lines are characteristic segments in playfield scenes [17], and determine peaks in the orientation histogram, and also feature longer segments than other types of scenes; audience scenes are typically characterized by more or less uniform distributions for edge intensities, segments orientation and hue; player scenes typically feature fewer edges, a uniform segment orientation distribution, and short segments.
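As an illustration of the segment-based indices, the sketch below derives the maximum segment length and a length-weighted orientation histogram from a list of segments, and flags the presence of a peak. The segment representation `(x1, y1, x2, y2)`, the number of bins, the length weighting and the peak criterion (largest bin exceeding a multiple of the mean bin value) are assumptions made for the example; the text does not specify how peaks are detected.

```python
import numpy as np

def segment_features(segments, n_bins=18, peak_ratio=3.0):
    """Return (maximum length, orientation histogram, peak flag) for a set
    of segments given as (x1, y1, x2, y2) tuples."""
    seg = np.asarray(segments, dtype=float)
    dx = seg[:, 2] - seg[:, 0]
    dy = seg[:, 3] - seg[:, 1]
    lengths = np.hypot(dx, dy)
    # Orientation in [0, 180): a segment and its reverse share the same angle.
    angles = np.degrees(np.arctan2(dy, dx)) % 180.0
    hist, _ = np.histogram(angles, bins=n_bins, range=(0.0, 180.0), weights=lengths)
    has_peak = bool(hist.max() > peak_ratio * hist.mean())
    return float(lengths.max()), hist, has_peak
```

A playfield keyframe with a few long parallel lines yields a large maximum length and a pronounced peak, whereas an audience keyframe with many short segments in all directions yields a nearly flat histogram.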
Color features were also considered, both to increase the robustness of the first classification stage (e.g., audience scenes display more uniform color distributions than playfield or player scenes), and to support sports type identification.
In fact, the playfield on which a sport takes place typically features a few dominant colors (one or two, in most cases). This is particularly the case in long and mid-range camera takes, where the frame area occupied by players is only a fraction of the whole area. Further, for each sport type, the color of the playfield is fixed, or varies within a very small set of possibilities. For example, for soccer the playfield is always green, while for swimming it is blue. Color content is described through color histograms. We selected the HSI color space, and quantized it into 64 levels for hue, 3 levels for saturation and 3 levels for intensity. Indices describing the distribution (i.e., degree of uniformity, number of peaks) were also derived from these histograms.
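The quantized color description can be sketched as follows. The conversion uses the standard geometric RGB-to-HSI formulas; uniform bin boundaries and the 8-bit RGB input format are assumptions, since the text only states the number of levels per channel.

```python
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an (..., 3) array of 8-bit RGB values to hue (degrees),
    saturation and intensity (both in [0, 1])."""
    rgb = np.asarray(rgb, dtype=float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i = (r + g + b) / 3.0
    s = np.where(i > 0, 1.0 - rgb.min(axis=-1) / np.where(i > 0, i, 1.0), 0.0)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt(np.maximum((r - g) ** 2 + (r - b) * (g - b), 0.0))
    theta = np.degrees(np.arccos(np.clip(num / np.where(den > 0, den, 1.0), -1.0, 1.0)))
    h = np.where(b > g, 360.0 - theta, theta)
    h = np.where(den > 0, h, 0.0)      # hue is undefined for gray pixels; use 0
    return h, s, i

def hsi_histogram(rgb, bins=(64, 3, 3)):
    """Normalized joint histogram with 64 hue, 3 saturation and 3 intensity levels."""
    h, s, i = rgb_to_hsi(rgb)
    hq = np.minimum((h / 360.0 * bins[0]).astype(int), bins[0] - 1)
    sq = np.minimum((s * bins[1]).astype(int), bins[1] - 1)
    iq = np.minimum((i * bins[2]).astype(int), bins[2] - 1)
    hist = np.zeros(bins)
    np.add.at(hist, (hq.ravel(), sq.ravel(), iq.ravel()), 1)
    return hist / hist.sum()
```

For a keyframe dominated by a green playfield, most of the histogram mass falls into a handful of neighboring hue bins, which is the kind of peaked distribution that the uniformity and peak indices are meant to capture.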
Two neural network classifiers are used to perform the classification tasks. To evaluate their performance, over 600 frames were extracted from a wide range of video shots, and were manually annotated to define a ground truth.
Frames were then subdivided into three sets to perform training, testing and evaluation of the classifiers. The aforementioned edge, segment and color features were computed for all of the frames. Results for the scene type classification are summarized in Table 5.1. It is worth pointing out that extending this classification scheme to shots, rather than limiting it to keyframes, is expected to yield even better performance, as integrating the results for keyframes belonging to the same shot reduces error rates. For instance, some keyframes of a playfield shot may not contain playfield lines (e.g., because of a zoom-in), but others will. Hence, the whole shot can be classified as a playfield shot.
Table 5.1: Results for scene type classification.

Class | Correct | Missed | False detections |
---|---|---|---|
Playfield | 80.4% | 19.6% | 9.8% |
Player | 84.8% | 15.2% | 15.1% |
Audience | 92.5% | 7.5% | 9.8% |
Results on sports type identification are shown in Table 5.2. The first column of figures refers to an experiment carried out on a data set that also included player and audience scenes, whereas the second column summarizes an experiment carried out on the output of a filtering process that kept only playfield frames. As expected, in the former case we obtained lower success rates. By comparing the results in the two columns, we can observe that the introduction of the playfield, player and audience classes is instrumental in improving identification rates for sports types. On average, these improve by about 16 percentage points, with a maximum of 26 percentage points. The largest improvements are observed for those sports where the playfield is shown only for short time intervals (e.g., high diving), or where only one athlete takes part in the competition, so that videos frequently show close-ups of the athlete (e.g., javelin).
Table 5.2: Results for sports type identification, on all frames and on playfield frames only.

Sports type | All frames | Playfield only |
---|---|---|
High diving | 56.9% | 83.2% |
Floor | 78.7% | 97.4% |
Field hockey | 85.0% | 95.1% |
Long horse | 53.4% | 64.3% |
Javelin | 37.8% | 58.8% |
Judo | 80.6% | 96.9% |
Soccer | 80.3% | 93.2% |
Swimming | 77.4% | 96.1% |
Tennis | 69.1% | 94.5% |
Track | 88.2% | 92.7% |
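The average and maximum improvements quoted in the text can be checked directly from the figures of Table 5.2. The short script below only restates the table values and computes the per-sport gain in percentage points; the variable names are illustrative.

```python
# (all frames, playfield only) identification rates from Table 5.2, in percent
rates = {
    "High diving": (56.9, 83.2),
    "Floor": (78.7, 97.4),
    "Field hockey": (85.0, 95.1),
    "Long horse": (53.4, 64.3),
    "Javelin": (37.8, 58.8),
    "Judo": (80.6, 96.9),
    "Soccer": (80.3, 93.2),
    "Swimming": (77.4, 96.1),
    "Tennis": (69.1, 94.5),
    "Track": (88.2, 92.7),
}

# Gain in percentage points when only playfield frames are considered.
gains = {sport: round(playfield - all_frames, 1)
         for sport, (all_frames, playfield) in rates.items()}
mean_gain = sum(gains.values()) / len(gains)
best_sport = max(gains, key=gains.get)

print(f"mean gain: {mean_gain:.1f} points; "
      f"largest: {best_sport} ({gains[best_sport]:.1f} points)")
```

The smallest gain is obtained for track, whose identification rate is already high when all frames are used.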