The techniques for single-modality image, audio and language analysis have been studied in a number of applications. Video may contain all of these modalities, and its characterization must address them individually or in combination. This requires integrating these techniques to extract significant information, such as specific objects, audio keywords and relevant video structure. To characterize video, features may be visual, audible or some text-based interpretation.
A feature is defined as a descriptive parameter that is extracted from a video stream. Features may be used to interpret visual content, or as a measure of similarity in image and video databases. In this section, features are described as 1. Statistical - features extracted from an image or video sequence without regard to content; 2. Compressed Domain - features extracted directly from compressed data; and 3. Content-Based - features that attempt to describe content.
Certain features may be extracted from image or video without regard to content. These features include such analytical features as shot changes, motion flow and video structure in the image domain, and sound discrimination in the audio domain. In this section we describe techniques for image difference and motion analysis as statistical features. For more information on statistical features, see Section V.
A difference measure between images serves as a feature to measure similarity. In the sections below, we describe two fundamental methods for image difference: Absolute difference and Histogram difference. The absolute difference requires less computation, but is generally more susceptible to noise and other imaging artifacts, as described below.
This difference is the sum of the absolute difference at each pixel. The first image, I_t, is compared with a second image, I_{t-T}, at a temporal distance T. The difference value is defined as

D(t) = (1/M) Σ_{i=1}^{M} | I_t(i) − I_{t−T}(i) |
where M is the resolution, or number of pixels in the image. This method for image difference is noisy and extremely sensitive to camera motion and image degradation. When applied to sub-regions of the image, D(t) is less noisy and may be used as a more reliable parameter for image difference.
D_S(t) is the corresponding sum of the absolute difference over a sub-region of the image, where S represents the starting position of a particular region and n represents the number of sub-regions. H and W are the image height and width, respectively.
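The full-frame and sub-region difference measures can be sketched as follows, assuming 8-bit grayscale frames stored as 2-D lists of pixel values; the function names and the n x n grid layout are illustrative choices, not from the text.

```python
def frame_difference(frame_a, frame_b):
    """D(t): mean absolute pixel difference between two equal-size frames."""
    h, w = len(frame_a), len(frame_a[0])
    total = sum(abs(frame_a[y][x] - frame_b[y][x])
                for y in range(h) for x in range(w))
    return total / (h * w)          # normalize by M, the number of pixels

def subregion_differences(frame_a, frame_b, n=2):
    """D_S(t): mean absolute difference per cell of an n x n grid of sub-regions."""
    h, w = len(frame_a), len(frame_a[0])
    rh, rw = h // n, w // n         # sub-region height and width
    diffs = []
    for gy in range(n):
        for gx in range(n):
            total = sum(abs(frame_a[y][x] - frame_b[y][x])
                        for y in range(gy * rh, (gy + 1) * rh)
                        for x in range(gx * rw, (gx + 1) * rw))
            diffs.append(total / (rh * rw))
    return diffs
```

A spike in only a few sub-region values suggests localized change (object motion or noise), while a jump in all cells is more consistent with a shot change.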
We may also apply some form of filtering to eliminate excess noise in the image and subsequent difference. For example, the image on the right in Figure 9.1 represents the output of a Gaussian filter on the original image on the left.
Figure 9.1: Left— original; Right— filtered.
A histogram difference is less sensitive to subtle motion, and is an effective measure for detecting similarity in images. By detecting significant changes in the weighted color histogram of two images, we form a more robust measure for image correspondence. The histogram difference may also be used in sub-regions to limit distortion due to noise and motion.
The histogram difference is defined as

D_H(t) = Σ_{j=1}^{N} | H_t(j) − H_{t−Di}(j) |

where N represents the number of bins in the histogram, typically 256. The difference value, D_H(t), will rise during shot changes, image noise and camera or object motion. Two adjacent images may be processed, although this algorithm is less sensitive to error when images are separated by a spacing interval, Di, typically on the order of 5 to 10 frames for video encoded at the standard 30 fps. An empirical threshold may be set to detect values of D_H(t) that correspond to shot changes. For inputs from multiple categories of video, an adaptive threshold for D_H(t) should be used.
If the histogram actually comprises three separate sets for R, G and B, the three differences may simply be summed. An alternative to summing the separate histograms is to convert the RGB histograms to a single color space, such as Munsell or LUV.
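A minimal sketch of histogram-based cut detection, assuming grayscale frames as 2-D lists; the fixed threshold and spacing values are placeholders for the empirical tuning the text describes.

```python
def histogram(frame, bins=256):
    """Intensity histogram of a frame of integer pixel values."""
    h = [0] * bins
    for row in frame:
        for p in row:
            h[p] += 1
    return h

def histogram_difference(frame_a, frame_b, bins=256):
    """D_H(t): sum of absolute bin-wise histogram differences."""
    ha, hb = histogram(frame_a, bins), histogram(frame_b, bins)
    return sum(abs(a - b) for a, b in zip(ha, hb))

def detect_cuts(frames, spacing=5, threshold=100):
    """Report frame indices where the spaced histogram difference spikes."""
    cuts = []
    for t in range(spacing, len(frames)):
        if histogram_difference(frames[t - spacing], frames[t]) > threshold:
            cuts.append(t)
    return cuts
```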
An important application of image difference in video is the separation of visual shots. Zhang et al. define a shot as a set of contiguous frames representing a continuous action in time or space [23, 60]. A simple image difference represents one of the more common methods for detecting shot changes. The difference measures, D(t) and D_H(t), may be used to determine the occurrence of a shot change. By monitoring the difference of two images over some time interval, a threshold may be set to detect significant differences or changes in scenery. This method provides a useful tool for detecting shot cuts, but is susceptible to errors during transitions. A block-based approach may be used to reduce errors in difference calculations, although it is still subject to errors when subtle object or camera motion occurs.
The most fundamental shot change is the video cut. For most cuts, the difference between image frames is so distinct that accurate detection is not difficult. Cuts between similar shots, however, may be missed when using only static properties such as image difference. Several research groups have developed working techniques for detecting shot changes through variations in image and histogram difference [5, 10, 11, 12, 23, 35].
A histogram difference is less sensitive to subtle motion, and is an effective measure for detecting shot cuts and gradual transitions. By detecting significant changes in the weighted color histogram of each successive frame, video sequences can be separated into shots. This technique is simple, and yet robust enough to maintain high levels of accuracy.
There are a variety of complex shot changes used in video production, but the basic premise is a change in visual content. Certain shot changes are used to imply different themes, and their detection is useful in characterization. Various video cuts, as well as other shot change procedures, are listed below.
Fast Cut - A sequence of video cuts, each very short in duration.
Distance Cut - A camera cut from one perspective of a shot to another.
Inter-cutting - Shots that change back and forth from one subject to another.
Fades and Dissolves - A shot that fades over time to a black background and shots that change over time by dissolving into another shot.
Wipe - A transition in which the new shot gradually replaces the old along a moving boundary; the exact format may change from one genre to the next.
An analysis of the global motion of a video sequence may also be used to detect changes in scenery. For example, a high error in the optical flow estimate usually indicates that a majority of the motion vectors could not be tracked from one frame to the next; such errors can be used to identify shot changes. A motion-controlled temporal filter may also be used to detect dissolves and fades, as well as to separate video sequences that contain long pans. The use of motion as a statistical feature is discussed in the following section. The methods for shot detection described in this section may be used individually or combined for more robust segmentation.
Motion characteristics represent an important feature in video indexing. One aspect is based on interpreting camera motion [2, 53]. Many video shots have dynamic camera effects, but offer little in the description of a particular segment. Static shots, such as interviews and still poses, contain essentially identical video frames. Knowing the precise location of camera motion can also provide a method for video parsing. Rather than simply parse a video by shots, one may also parse a video according to the type of motion.
An analysis of optical flow can be used to detect camera and object motion. Most algorithms for computing optical flow require extensive computation, so researchers are increasingly exploring methods to extract optical flow directly from video compressed with some form of motion compensation. Section 2.2 describes the benefits of using compressed video for optical flow and other image features.
Statistics from optical flow may also be used to detect shot changes. Optical flow is computed from one frame to the next. When the motion vectors for a frame are randomly distributed without coherency, this may suggest the presence of a shot change. In this sense, the quality of the camera motion estimate is used to segment video. Video segmentation algorithms often yield false shot changes in the presence of extreme camera or object motion. An analysis of optical flow quality may also be used to avoid false detection of shot changes.
Optical flow fields may be interpreted in many ways to estimate the characteristics of motion in video. Two such interpretations are the camera motion and object motion.
An affine model was used in our experiments to approximate the flow patterns consistent with all types of camera motion.
u(xi, yi) = axi + byi + c
v(xi, yi) = dxi + eyi + f
Affine parameters a, b, c, d, e and f are calculated by minimizing the least squares error of the motion vectors.
We also compute the average flow components ū and v̄, where

ū = (1/N) Σ_i u(x_i, y_i)    and    v̄ = (1/N) Σ_i v(x_i, y_i),

with N the number of motion vectors.

Using the affine flow parameters and the average flow, we classify the flow pattern. To determine if a pattern is a zoom, we first check whether there is a convergence or divergence point (x0, y0), where u(x0, y0) = 0 and v(x0, y0) = 0. This system has a unique solution when the determinant ae − bd of the 2x2 coefficient matrix [a b; d e] is nonzero.

If the above relation holds and (x0, y0) is located inside the image, then it represents the focus of expansion (or contraction) and the camera is zooming. If (x0, y0) is outside the image, and |ū| or |v̄| is large, then the camera is panning in the direction of the dominant vector.

If the above determinant is approximately 0, then (x0, y0) does not exist and the camera is panning or static. If |ū| or |v̄| is large, the motion is a pan in the direction of the dominant vector. Otherwise, there is no significant motion and the flow is static. We may eliminate fragmented motion by averaging the results over a W-frame window in time. An example of the camera motion analysis results is shown in Figure 9.2.
Figure 9.2: Camera and object motion detection.
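The affine fit and the zoom/pan/static tests above can be sketched as follows. This is a hedged illustration, assuming flow vectors (u, v) sampled at points (x, y); the least-squares fit solves the standard normal equations, and the pan threshold and determinant tolerance are illustrative assumptions rather than values from the text.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col and m[col][col] != 0:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def fit_affine(points, flows):
    """Least-squares fit of u = a*x + b*y + c and v = d*x + e*y + f."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    syy = sum(y * y for _, y in points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    n = float(len(points))
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    bu = [sum(u * x for (x, _), (u, _) in zip(points, flows)),
          sum(u * y for (_, y), (u, _) in zip(points, flows)),
          sum(u for u, _ in flows)]
    bv = [sum(v * x for (x, _), (_, v) in zip(points, flows)),
          sum(v * y for (_, y), (_, v) in zip(points, flows)),
          sum(v for _, v in flows)]
    return solve3(A, bu), solve3(A, bv)

def classify_motion(pu, pv, width, height, ubar, vbar,
                    pan_thresh=1.0, det_eps=1e-6):
    """Label the flow pattern as zoom, pan or static from the affine fit."""
    a, b, c = pu
    d, e, f = pv
    det = a * e - b * d                 # determinant of [a b; d e]
    if abs(det) > det_eps:
        x0 = (b * f - c * e) / det      # point where u = v = 0 (Cramer's rule)
        y0 = (c * d - a * f) / det
        if 0 <= x0 < width and 0 <= y0 < height:
            return "zoom"               # focus of expansion inside the frame
    if abs(ubar) > pan_thresh or abs(vbar) > pan_thresh:
        return "pan"                    # dominant average flow direction
    return "static"
```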
Object motion typically exhibits flow fields in specific regions of an image, while camera motion is characterized by flow throughout the entire image. The global distribution of motion vectors thus distinguishes object from camera motion. The flow field is partitioned into a grid as shown in Figure 9.2. If the average velocity of the vectors in a particular grid cell is high, that cell is designated as containing motion. When the number of connected motion grids, Gm, is high (typically Gm > 7), the flow represents some form of camera motion. Gm(i) represents the status of the motion grid at position i, and M represents the number of neighbors. A motion grid should consist of at least a 4x4 array of motion vectors. If Gm is not high, but greater than some small value (typically 2 grids), the motion is isolated in a small region of the image and the flow is probably caused by object motion. This result is averaged over a frame window of width WA, just as with camera motion, but the fraction of object-motion frames required is typically on the order of 60%. Examples of the object motion analysis results are shown in Figure 9.2.
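The grid-based object/camera discrimination can be sketched as below, assuming per-cell mean flow speeds have already been computed. The speed threshold is a hypothetical value; the cluster-count thresholds follow the text (more than ~7 connected grids for camera motion, more than ~2 for object motion).

```python
def label_motion(grid_speeds, speed_thresh=1.0,
                 camera_grids=7, object_grids=2):
    """Classify a frame as camera, object or static motion from grid speeds."""
    rows, cols = len(grid_speeds), len(grid_speeds[0])
    moving = [[s > speed_thresh for s in row] for row in grid_speeds]
    seen = [[False] * cols for _ in range(rows)]
    largest = 0
    for r in range(rows):
        for c in range(cols):
            if moving[r][c] and not seen[r][c]:
                # flood-fill one 4-connected cluster of motion grids
                stack, size = [(r, c)], 0
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    size += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols and
                                moving[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                largest = max(largest, size)
    if largest > camera_grids:
        return "camera"     # coherent motion across much of the frame
    if largest > object_grids:
        return "object"     # motion isolated in a small region
    return "static"
```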
Analysis of image texture is useful in discriminating low-interest video from video containing complex features. A low-interest image may contain uniform texture, as well as uniform color or low contrast. Perceptual features for individual video frames can be computed using common textural measures such as coarseness, contrast, directionality and regularity.
The shape and appearance of objects may also be used as a feature for image correspondence. Color and texture properties will often change from one image to the next, making image difference and texture features less useful.
In addition to image features, certain audio features may be extracted from video to assist in the retrieval task. Loud sound, silence and single-frequency sound markers may be detected analytically without actual knowledge of the audio content.
Loud sounds imply a heightened state of emotion in video, and are easily detected by measuring a number of audio attributes, such as signal amplitude or power. Silent video may signify an area of lesser importance, and can also be detected with straightforward analytical estimates. A video producer will often use a single-frequency sound marker, typically a 1000 Hz tone, to mark a particular point near the beginning of a video. This tone may be detected to determine the exact point at which a video will start.
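A minimal sketch of the loud/silence classification by short-time signal power, assuming samples normalized to [-1, 1]. The 267-sample frame length matches the 8 kHz / 30 fps figure quoted later in this section; the power thresholds are assumptions, not values from the text.

```python
def frame_power(samples):
    """Mean signal power of one audio frame."""
    return sum(s * s for s in samples) / len(samples)

def classify_audio_frames(signal, frame_len=267,
                          silence_thresh=1e-4, loud_thresh=0.25):
    """Label each fixed-size frame of samples as silence, loud or normal."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        p = frame_power(signal[start:start + frame_len])
        if p < silence_thresh:
            labels.append("silence")
        elif p > loud_thresh:
            labels.append("loud")
        else:
            labels.append("normal")
    return labels
```

A narrow band-pass filter or FFT peak test around 1000 Hz could be added in the same loop to detect the producer's tone marker.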
Most video is produced with a particular format and structure. This structure may be taken into consideration when analyzing particular video content. News segments are typically 30 minutes in duration and follow a rigid pattern from day to day. Commercials are also of fixed duration, making detection less difficult.
Another key element in video is the use of the black frame. In most broadcast video, a black frame is shown between a transition of two segments. In news broadcast this usually occurs between a story and a commercial. By detecting the location of black frames in video, a hierarchical structure may be created to determine transitions between segments. A black frame or any single intensity image may be detected by summing the total number of pixels in a particular color space.
In the detection of the black frame, Ihigh, the maximum allowable pixel intensity, is on the order of 20% of the maximum color resolution (51 for a 256-level, 8-bit image), and Ilow, the minimum allowable pixel intensity, is 0. There are a number of ways to detect this feature in video, the simplest being to count the pixels in an image that fall within a given tolerance of being black. The separation of segments in video is crucial in retrieval systems, where a user will most likely request a small segment of interest and not an entire full-length video.
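The simplest detector described above can be sketched as follows, assuming 8-bit grayscale frames; the 95% pixel fraction is an illustrative assumption, while the Ihigh default of 51 follows the text.

```python
def is_black_frame(frame, i_high=51, min_fraction=0.95):
    """True if at least min_fraction of pixels are at or below i_high."""
    total = dark = 0
    for row in frame:
        for p in row:
            total += 1
            if p <= i_high:
                dark += 1
    return dark / total >= min_fraction
```

Runs of consecutive black frames can then be grouped to locate segment transitions such as story-to-commercial boundaries.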
In typical applications of multimedia databases, the materials, especially the images and video, are often in a compressed format [3, 59]. To deal with these materials, a straightforward approach is to decompress all the data and use the same features as in the previous section. Doing so, however, has some disadvantages. First, the decompression implies extra computation. Second, the process of decompression and re-compression, often referred to as recoding, results in further loss of image quality. Finally, since decompressed data are much larger than the compressed form, more memory and CPU cycles are needed to process and store them. The solution to these problems is to extract features directly from the compressed data. Below are a number of commonly used compressed-domain features.
The motion vectors available in all video compressed using standards such as H.261, H.263, MPEG-1 and MPEG-2 are very useful. Analysis of motion vectors can be used to detect shot changes and other special effects such as dissolves, fade-ins and fade-outs. For example, if the motion vectors for a frame are randomly distributed without coherency, that may suggest the presence of a shot change. Motion vectors represent a low-resolution optical flow for the video, and can be used to extract much of the information available through full optical flow analysis.
The percentage of each type of block in a picture is also a good indicator of shot changes. For a P-frame, a large percentage of intra-coded blocks implies a great deal of new information in the current frame that cannot be predicted from the previous frame. Such a P-frame therefore indicates the beginning of a new shot, right after a shot change.
The DCT (Discrete Cosine Transform) provides a decomposition of the original image in the frequency domain. Therefore, DCT coefficients form a natural representation of texture in the original image. In addition to texture analysis, DCT coefficients can also be used to match images and to detect shot changes. The DC components are a low-resolution representation of the original image, averaged over 8x8 blocks. This implies much less data to manipulate, and for some applications, DC components already contain sufficient information. For color analysis, usually only the DC components are used to estimate the color histogram. For shot change detection, usually only the DC components are used to compare the content in two consecutive frames.
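A hedged sketch of DC-based frame comparison, assuming the 8x8-block DC values have already been parsed out of the bitstream (the parsing itself is format specific and omitted here); the bin count is an illustrative choice.

```python
def dc_histogram(dc_values, bins=64, max_val=255):
    """Coarse histogram over a frame's per-block DC coefficients."""
    h = [0] * bins
    for v in dc_values:
        h[min(v * bins // (max_val + 1), bins - 1)] += 1
    return h

def dc_difference(dc_a, dc_b, bins=64):
    """Histogram difference between two frames' DC images."""
    ha, hb = dc_histogram(dc_a, bins), dc_histogram(dc_b, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb))
```

Because each DC value summarizes an 8x8 block, this compares a 64x-reduced version of each frame, which is why it is attractive for fast shot change detection.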
Parameters of the compression process that are not explicitly specified in the bitstream can be very useful as well. One example is the bit rate, i.e., the number of bits used for each picture. For intra-coded video (i.e., no motion compensation), the number of bits per picture should remain roughly constant within a shot and change when the shot changes. For example, a shot with simple color variation and texture requires fewer bits per picture than a shot with detailed texture. For inter-coded video, the number of bits per picture is proportional to the amount of change between the current picture and the previous one. Therefore, if the number of bits for a certain picture is unusually high, we can often conclude that there is a shot cut.
The compressed-domain approach does not solve all problems, though. To identify useful features from compressed data is typically difficult because each compression technique poses additional constraints, e.g., non-linear processing, rigid data structure syntax, and resolution reduction.
Another issue is that compressed-domain features depend on the underlying compression standard. For different compression standards, different feature extraction algorithms have to be developed. Ultimately, we would like to have new compression standards with maximal content accessibility. MPEG-4 and MPEG-7 already have considered this aspect. In particular, MPEG-7 is a standard that goes beyond the domain of "compression" and seeks efficient representation of image and video content. The compressed-domain approach provides significant advantages but also brings new challenges. For more details on compressed-domain video processing, please see Chapter 43. For more details on MPEG-7 standard, please see Chapter 28.
Section 2 described a number of features that can be extracted using well-known techniques in image and audio processing. Section 2.3 described how many of these features are computed or approximated using encoded parameters in image and video compression. Although in both cases there is considerable understanding of the structure of the video, the features in no way estimate the actual image or video content.
In this section we describe several methods to approximate the actual content of an image or video. For many users, the query of interest is text based, and therefore, the content is essential. The desired result has less to do with analytical features such as color, or texture, and more with the actual objects within the image or video.
Identifying significant objects that appear in the video frames is one of the key components of video characterization. Several working systems have generated reasonable results for the detection of particular objects, such as human faces, text or automobiles. These limited-domain systems achieve much greater accuracy than broad-domain systems that attempt to identify all objects in an image. Recent work in perceptual grouping and low-level object recognition has yielded systems capable of recognizing small groups of categories, such as certain four-legged mammals, flowers, specific terrain, clothing and buildings [18, 20, 29].
Face detection and recognition are necessary elements for video characterization. The "talking head" image is common in interviews and news clips, and illustrates a clear example of video production focusing on an individual of interest. The detection of a human subject is particularly important in the analysis of news footage. An anchorperson will often appear at the start and end of a news broadcast, which is useful for detecting segment boundaries. In sports, anchorpersons will often appear between plays or commercials. A close-up is used in documentaries to introduce a person of significance.
The detection of humans in video is possible using a number of algorithms. The Eigenfaces work from MIT is one of the earliest and most widely used face detection algorithms. Figure 9.3 shows examples of faces detected using the Neural Network Arbitration method. Schneiderman and Kanade extended this method to detect faces at 90-degree rotation. Most techniques are dependent on scale, and rely heavily on favorable lighting conditions, limited occlusion and limited facial rotation. Recent work on biometric and security-oriented face recognition has also contributed to better characterization systems for video summarization.
Figure 9.3: Recognition of video captions and faces.
Text and graphics are used in a variety of ways to convey content to the viewer. They are most commonly used in broadcast news, where information must be absorbed in a short time. Examples of text and graphics in video are discussed below.
Text in video provides significant information as to the content of a shot. For example, statistical numbers and titles are not usually spoken but are included in captions for viewer inspection. Moreover, this information does not always appear in closed captions so detection in the image is crucial for identifying potentially important regions.
In news video, captions of the broadcasting company are often shown at low opacity as a watermark in a corner without obstructing the actual video. A ticker tape is widely used in broadcast news to display information such as the weather, sports scores or the stock market. In some broadcast news, graphics such as weather forecasts are displayed in a ticker-tape format with the news logo in the lower right corner at full opacity. Captions that appear in the lower third portion of a frame are almost always used to describe a location, person of interest, title or event in news video. In Figure 9.3, the anchorperson's location is listed.
Video captions are used less frequently in video domains other than broadcast news. In sports, a score or some information about an ensuing play is often shown in a corner or border at low opacity. Captions are sometimes used in documentaries to describe a location, person of interest, title or event. Almost all commercials use some form of captions to describe a product or institution, because their time is limited to less than a minute in most cases.
A producer will seldom use fortuitous text in the actual video unless the wording is noticeable and easy to read in a short time. A typical text region can be characterized as a horizontal rectangular structure of clustered sharp edges, because characters usually form regions of high contrast against the background. By detecting these properties, we can extract potentially important regions from video frames that contain textual information. Most captions are high contrast text such as the black and white chyron commonly found in news video. Consistent detection of the same text region over a period of time is probable since text regions remain at an exact position for many video frames. This may also reduce the number of false detections that occur when text regions move or fade in and out between shots.
Detection proceeds by exploiting these properties. We first apply a global horizontal differential filter, F_HD, to the image. There are many variations of the differential filter, but a common form is

E(x, y) = | I(x+1, y) − I(x−1, y) |
An appropriate binary threshold should be set for extraction of vertical edge features. A smoothing filter, Fs, is then used to eliminate extraneous fragments, and to connect character sections that may have been detached. Individual regions must be identified through cluster detection. A bounding box, BB, should be computed for selection of text regions. We now select clusters with bounding regions that satisfy constraints in cluster size, Cs, cluster fill-factor, CFF, and horizontal-vertical aspect ratio.
A cluster's bounding region must have a small vertical-to-horizontal aspect ratio as well as satisfy various limits in height and width. The fill factor of the region should be high to insure dense clusters. The cluster size should also be relatively large to avoid small fragments. Other controlling parameters are listed below.
Finally, we examine the intensity histogram of each region to test for high contrast. This is because certain textures and shapes appear similar to text but exhibit low contrast when examined in a bounded region.
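The clustering and bounding-box constraints above can be sketched as follows, assuming a binary map of horizontal-gradient edges as input; the size, fill-factor and aspect-ratio limits are illustrative assumptions, not the chapter's tuned values.

```python
def horizontal_edges(frame, thresh=40):
    """Binary map of strong horizontal intensity differences (F_HD + threshold)."""
    h, w = len(frame), len(frame[0])
    return [[1 if 0 < x < w - 1 and
             abs(frame[y][x + 1] - frame[y][x - 1]) > thresh else 0
             for x in range(w)] for y in range(h)]

def text_regions(edge_map, min_size=20, min_fill=0.3, min_aspect=2.0):
    """Bounding boxes (x, y, w, h) of edge clusters that look like text lines."""
    rows, cols = len(edge_map), len(edge_map[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if edge_map[r][c] and not seen[r][c]:
                # flood-fill one 8-connected cluster of edge pixels
                stack, cluster = [(r, c)], []
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    cluster.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols and
                                    edge_map[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                ys = [y for y, _ in cluster]
                xs = [x for _, x in cluster]
                hgt = max(ys) - min(ys) + 1
                wid = max(xs) - min(xs) + 1
                fill = len(cluster) / (hgt * wid)   # cluster fill factor C_FF
                if (len(cluster) >= min_size and fill >= min_fill
                        and wid / hgt >= min_aspect):
                    boxes.append((min(xs), min(ys), wid, hgt))
    return boxes
```

The final contrast check from the text would then examine the intensity histogram inside each surviving box.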
For some fonts, a generic optical character recognition (OCR) package may accurately recognize video captions. Most OCR systems, however, expect an individual character as input. This presents a problem in digital video, since most characters experience some degradation during recording, digitization and compression. For a simple font, we can search for blank spaces between characters and assume a fixed width for each letter.
A graphic is usually a recognizable symbol, which may contain text. Graphic illustrations or symbolic logos are used to represent many institutions, locations and organizations. They are used extensively in news video, where it is important to describe the subject matter as efficiently as possible. A logo representing the subject is often placed in a corner next to an anchorperson during dialogue. Detection of graphics is a useful method for finding changes in semantic content. In this sense, its appearance may serve as a shot break. Recognition of corner regions for graphics detection may be possible through an extension of the shot detection technology. Histogram difference analysis of isolated image regions instead of the entire image can provide a simple method for detecting corner graphics. In Figure 9.4, a change is detected by the appearance of a graphics logo in the upper corner, although no shot is detected.
Figure 9.4: Graphics detection through sub-region histogram difference.
A particular object is usually the emphasis of a query in image and video retrieval. Recognition of articulated objects poses a great challenge, and represents a significant step in content-based feature extraction. Many working systems have demonstrated accurate recognition of animal objects, segmented objects and rigid objects such as planes or automobiles.
The recognition of a single object is only one potential use of image-based recognition systems. Discriminating synthetic from natural backgrounds, or animated from mechanical motion, would yield a significant improvement in content-based feature extraction.
An important element in video indexing is the audio track. Audio is an enormous source of information for describing video content. Words specific to the actual content, or keywords, can be extracted using a number of language processing techniques [37, 38]. Keywords may be used to reduce the indexing effort and to provide abstraction for video sequences. There are many possibilities for language processing in video, but the audio track must first exist as an ASCII transcript; otherwise speech recognition is necessary.
Audio segmentation is needed to distinguish spoken words from music, noise and silence. Further analysis through speech recognition is necessary to align and translate these words into text. Audio selection is made on a frame-by-frame basis, so it is important to achieve the highest possible accuracy. At a sampling rate of 8 kHz, one video frame corresponds to roughly 267 samples of audio. Techniques in language understanding are used to select the most significant words and phrases.
Audio segmentation is also used to parse or separate audio into distinct units. The units may be described as phrases or used as input for a speech recognition system. The duration of the phrase can be controlled and modified based on the duration and genre of the summary. Shorter phrases improve speech recognition efficiency when the duration of the audio input is long.
For documentaries, a digital ASCII version of the transcript is usually provided with an analog version of the video. Language analysis then operates on this transcript to identify keywords and phrases. We use the well-known TF-IDF (Term Frequency Inverse Document Frequency) measure to estimate the relative importance of words in the video document.
A high TF-IDF value signifies relatively high importance. Words that appear often in a particular segment, but relatively infrequently in the standard corpus, receive the highest weights. Punctuation in the transcript provides a means to identify sentences. With the sentence structure, we can parse smaller important regions, such as noun phrases and conjunction phrases. The link grammar parser developed at Carnegie Mellon was used to parse noun phrases. With audio alignment from speech recognition, we can extract the phrases with the highest TF-IDF values as the audio portion of the summary. We can also look for conjunction phrases, slang words and question words to alter the frequency weighting for the TF-IDF analysis.
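A minimal TF-IDF sketch over a toy corpus of token lists, showing how words frequent in one segment but rare elsewhere get the highest weight; this uses the plain tf * log(N/df) form, one of several common variants.

```python
import math

def tf_idf(segment_tokens, corpus):
    """TF-IDF weight for each distinct word of a segment.

    corpus is a list of token lists, one per document/segment,
    and segment_tokens is one of them (or a new segment).
    """
    n_docs = len(corpus)
    scores = {}
    for word in set(segment_tokens):
        tf = segment_tokens.count(word) / len(segment_tokens)
        df = sum(1 for doc in corpus if word in doc)   # document frequency
        idf = math.log(n_docs / df)
        scores[word] = tf * idf
    return scores
```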
In order to use the audio track, we must isolate each individual word. To transcribe the content of the video material, we recognize spoken words using a speech recognition system. Speaker independent recognition systems have made great strides as of late and offer promise for application in video indexing [24, 23]. Speech recognition works best when closed-captioned data are available.
Captions usually occur in United States broadcast material, such as sitcoms, sports and news; documentaries and movies may not necessarily contain them. Closed-captions have become increasingly common in video material in the United States since 1985, and most televisions provide standard caption display.
A final approach to content-based feature extraction is the use of known procedures for creating video. Video production manuals provide insight into the procedures used during video editing and creation, and many textbooks and journals describe these editing and production procedures. Pryluck published one of the most well-known works in this area [45, 46].
One of the most common elements in video production is the ability to convey climax or suspense. Producers use a variety of effects, ranging from camera positioning and lighting to special effects, to convey this mood to an audience. Detection of such procedures is beyond the realm of present image and language understanding technology. However, many of the important features described in sections 2, 3 and 4 were derived from research in the video production industry.
Structural information as to the content of a video is a useful tool for indexing video. For example, the type of video being used (documentaries, news footage, movies and sports) and its duration may offer suggestions to assist in object recognition. In news footage, the anchorperson will generally appear in the same pose and background at different times. The exact locations of the anchorperson can then be used to delineate story breaks. In documentaries, a person of expertise will appear at various points throughout the story when topical changes take place. There are also many visual effects introduced during video editing and creation that may provide information for video content. For example, in documentaries the shots prior to the introduction of a person usually describe their accomplishments and often precede shots with large views of the person's face.
A producer will often create production notes that describe in detail action and scenery of a video, shot by shot. If a particular feature is needed for an application in image or video databases, the description may have already been documented during video production.
Another source of descriptive information may be embedded in the video stream in the form of timecode and geospatial (GPS/GIS) data. These features are useful in indexing precise segments in video or a particular location in spatial coordinates. Aeronautic and automobile surveillance video will often contain GPS data that may be used as a source for indexing.