Figure 1.1: Visual Information Retrieval blends together many research disciplines.
Figure 1.2: The three levels of the ANSI/SPARC architecture.
Figure 1.3: Block diagram of a VDBMS.
Figure 1.4: Classification of video data models.
Figure 1.5: Classification of queries to a VDBMS.
Chapter 2: Modeling Video using Input/Output Markov Models with Application to Multi-Modal Event Detection
Figure 2.1: Organizing a Video with a Table of Contents (ToC) and a Semantic Index (SI). The ToC gives a top-down break-up in terms of scenes, shots and key frames. The SI lists key-concepts occurring in the video. The links indicate the exact location of these concepts and the confidence measure.
Figure 2.2: A probabilistic multimedia object (multiject).
Figure 2.3: Consider the multimodal concept represented by node E and multimodal features represented by nodes A and V in this Bayesian network. The network implies that features A and V are independent given concept E.
Figure 2.4: Consider the multimodal concept represented by node E and multimodal features represented by nodes A and V in this Bayesian network. The network implies that features A and V are dependent given concept E.
Figure 2.5: This figure illustrates the Markovian transition of the input/output Markov model. Random variable E can be in one of two states. The model characterizes the transition probabilities and their dependence on the input sequence y.
Figure 2.6: This figure illustrates the Markovian transition of the input/output Markov model along with duration models for the stay in each state. Random variable E can be in one of two states. The model characterizes the transition probabilities and their dependence on the input sequence y and on the duration of stay in any state. Note the missing self-transition arrows.
Figure 2.7: Hierarchical multimedia fusion using HMM. The media HMMs are responsible for mapping media observations to state sequences.
Figure 2.8: Hierarchical multimedia fusion. The media HMMs are responsible for mapping media observations to state sequences. The fusion model, which lies between the dotted lines, uses these state sequences as inputs and maps them to output decision sequences.
Figure 2.9: Some typical frames from a video clip.
Figure 2.10: Classification error for the nine video clips using the leave-one-clip-out evaluation strategy. The maximum error of the DDIOMM is the lowest among the three schemes.
Figure 2.11: Comparing detection and false alarms. The DDIOMM yields the best detection performance.
Chapter 3: Statistical Models of Video Structure and Semantics
Figure 3.1: Shot activity vs. duration features. The genre of each movie is identified by the symbol used to represent the movie in the plot (© 2000 IEEE).
Figure 3.2: Shot duration histogram, and maximum likelihood fit obtained with the Erlang (left) and Weibull (right) distributions (© 2000 IEEE).
Figure 3.3: Left: conditional activity histogram for regular frames, and best fit by a mixture with three Erlang components and a uniform component. Right: conditional activity histogram for shot transitions, and best fit by a mixture with a Gaussian and a uniform component (© 2000 IEEE).
Figure 3.4: Temporal evolution of the Bayesian threshold for the Erlang (left) and Weibull (right) priors (© 2000 IEEE).
Figure 3.5: Total number of errors, false positives, and missed boundaries achieved with the different shot duration priors (© 2000 IEEE).
Figure 3.6: Evolution of the thresholding operation for a challenging trailer. Top: Bayesian segmentation; the likelihood ratio and the Weibull threshold are shown. Bottom: fixed threshold; the histogram distances and the optimal threshold are presented. In both graphs, misses are represented as circles and false positives as stars (© 2000 IEEE).
Figure 3.7: A generic Bayesian architecture for content characterization. Even though only three layers of variables are represented in the figure, the network could contain as many as desired (© 1998 IEEE).
Figure 3.8: A simple Bayesian network for the classification of sports (© 1998 IEEE).
Figure 3.9: Bayesian network implemented in BMoViES (© 1998 IEEE).
Figure 3.10: The duality between classification and retrieval (© 1998 IEEE).
Figure 3.11: Classification errors in BMoViES. People were not detected in the three clips on the left; the crowd was not recognized in the clip on the right (© 1998 IEEE).
Figure 3.12: Example-based retrieval in BMoViES. The top-left image is a key frame of the clip submitted to the retrieval system. The remaining images are key frames of the best seven matches found by the system (© 1998 IEEE).
Figure 3.13: Relevance feedback in the Bayesian setting (© 1998 IEEE).
Figure 3.14: Relevance feedback in BMoViES. Each row presents the response of the system to the query on the left. The action (A), crowd (C), natural set (N), and close-up (D) attributes are instantiated with yes (y), no (n), or don't care (x). The system's confidence in each of the retrieved clips is shown on top of the corresponding key frame (© 1998 IEEE).
Figure 3.15: Semantic time-lines for the trailers of the movies "Circle of Friends" (top) and "The River Wild" (bottom) (© 1998 IEEE).
Figure 3.16: Key frames of the shots in the highlighted area of the time-line at the bottom of Figure 3.15. The shot (correctly) classified as not containing action is omitted (© 1998 IEEE).
Chapter 4: Flavor: A Language for Media Representation
Figure 4.1: HelloBits. (a) Representation using the MPEG-1/2 methodology. (b) Representation using C++ (a similar construct would also be used for Java). (c) Representation using Flavor.
Figure 4.2: Parsable variable declaration syntax.
Figure 4.3: (a) Parsable variable declaration. (b) Look-ahead parsing. (c) Parsable variable declaration with an expected value.
Figure 4.4: Array. (a) Declaration with dynamic size specification. (b) Declaration with initialization. (c) Declaration with dynamic array and parse sizes. (d) Declaration of partial arrays.
Figure 4.5: An example of a conditional expression.
Figure 4.6: Class. (a) A simple class definition. (b) A simple class variable declaration. (c) A simple class definition with parameter types. (d) A simple class variable declaration with parameter types.
Figure 4.7: Inheritance. (a) Derived class declaration. (b) Derived class declaration with object identifiers.
Figure 4.8: A class with id range.
Figure 4.9: Scoping rules example.
Figure 4.10: A simple map declaration.
Figure 4.11: A map with defined output type.
Figure 4.12: A map declaration with extension.
Figure 4.13: The %include directive. (a) The other.fl file is included by the main.fl file. (b) The other.h and other.cpp files are generated from the other.fl file whereas the main.h file is generated from the main.fl file.
Figure 4.14: The %import directive. (a) The main.fl file using the %import directive. (b) The main.h and main.cpp files generated from the main.fl file defined in (a).
Figure 4.15: Some examples of using pragma statements to set the translator options at specific locations.
Figure 4.16: A simple Flavor example: the GIF87a header. The usage of verbatim code is illustrated.
Figure 4.17: An illustration of XML features offered by XFlavor that are used for different applications.
Chapter 5: Integrating Domain Knowledge and Visual Evidence to Support Highlight Detection in Sports Videos
Figure 5.1: Typical sequence of shots in a sports video.
Figure 5.2: Studio scene with alternating anchorman shots.
Figure 5.3: Examples of superimposed graphic objects.
Figure 5.4: (a) Source frame; (b) detected captions with noise removal.
Figure 5.5: Detection results for the frames in Figure 5.3.
Figure 5.6: Mosaic image of a dive.
Figure 5.7: Intermediate diver frame.
Figure 5.8: Image corners.
Figure 5.9: Segmented image (all black pixels are considered part of the background).
Figure 5.10: Mosaic image.
Figure 5.11: Although the sports events reported on in a video may vary significantly, distinguishing features are shared across this variety. For example, playfield lines is a concept that is explicitly present in some outdoor and indoor sports (e.g., athletics or swimming), but can also be extended to other sports (e.g., cycling on public roads). Similarly, player and audience scenes appear in most sports videos.
Figure 5.12: Edge, segment length and orientation, and hue distributions for the three representative sample images in the first row of Figure 5.11. Synthetic indices derived from these distributions allow the three classes of playfield, player, and audience to be differentiated. (Note that the hue histograms are scaled to the maximum value.)
Figure 5.13: Example of frames taken from the main camera.
Figure 5.14: Playfield zones; Z7 to Z12 are symmetrical.
Figure 5.15: Original image, playfield shape and lines in soccer.
Figure 5.16: (a) Playfield shape descriptor F. (b) Playfield line orientation descriptor O.
Figure 5.17: (a) Playfield corner position descriptor C. (b) Midfield line descriptor M.
Figure 5.18: Naïve Bayes networks: Z1 and Z6 zone classifiers.
Figure 5.19: Typical shot action: at time t_start the ball is kicked toward the goal post. The symbols on the right identify low, medium, and high motion.
Figure 5.20: Shot model: the arcs are labeled with the camera motion and playfield zones required for each state transition. The upper branch describes a shot at the left goal post, the lower branch a shot at the right goal post. If state OK is reached, the highlight is recognized.
Figure 5.21: Restart model: the final transition requires a minimum time length.
Chapter 6: A Generic Event Model and Sports Video Processing for Summarization and Model-Based Search
Figure 6.1: The graphical representation of the model.
Figure 6.2: The keyframes of an example free kick event.
Figure 6.3: The description of the free kick event in the example clip.
Figure 6.4: The low-level descriptions in the example free kick event.
Figure 6.5: The description of the composite goal event.
Figure 6.6: The flowchart of the proposed summarization and model instantiation framework for soccer video.
Figure 6.7: The shot classes in soccer: (a) long shot, (b) in-field medium shot, (c) close-up shot, and (d) out-of-field shot.
Figure 6.8: Grass/non-grass segmented long and medium views and the regions determined by the Golden Section spatial composition rule.
Figure 6.9: The flowchart of the shot classification algorithm.
Figure 6.10: The occurrence of a goal and its break: (left to right) goal play as a long shot, close-up of the scorer, out-of-field view of the fans (middle), third slow-motion replay shot, and the restart of the game as a long shot.
Figure 6.11: Soccer field model (left) and the highlighted three parallel lines of a penalty box.
Figure 6.12: Penalty box detection by three parallel lines.
Figure 6.13: Referee detection by horizontal and vertical projections.
Figure 6.14: Referee detection and tracking.
Figure 6.15: The flowchart of the tracking algorithm.
Figure 6.16: Example tracking of a player in the Spain sequence.
Figure 6.17: Line detection examples in the Spain sequence.
Figure 6.18: The query graph pattern for "find the events where the ball object has a similar trajectory to the example and the event precedes header and score events"
Chapter 7: Temporal Segmentation of Video Data
Figure 7.1: General structure of a video sequence.
Figure 7.3: Sample frame from a sequence and corresponding DC image.
Figure 7.4: Representation of the shot boundary detector proposed in the cited work.
Figure 7.5: An example of multilayer perceptron with one hidden layer.
Figure 7.6: Interframe bin-to-bin metric for a soccer sequence.
Figure 7.7: Performance of abrupt detection technique T1 for the soccer test sequence.
Figure 7.8: Performance of abrupt detection technique T1 for the nature test sequence.
Figure 7.9: Interframe bin-to-bin metric and local thresholds for a soccer sequence.
Figure 7.10: Interframe bin-to-bin metric and local thresholds for a soccer sequence.
Chapter 8: A Temporal Multi-Resolution Approach to Video Shot Segmentation
Figure 8.1: Structure modeling of video.
Figure 8.2: Stratification modeling of video.
Figure 8.3: Transitions on the video trajectory in RGB color space.
Figure 8.4: Canny wavelets in R0–R3.
Figure 8.5: Transitions shown at two resolutions.
Figure 8.6: Schematic of TMRA system for shot boundary detection.
Figure 8.7: Sample frames from the sequence Echosystem.mpg.
Figure 8.8: Post transition frames of the video sequence Echosystem.mpg.
Figure 8.9: Examples of coefficients after the transformation (WDC64).
Figure 8.10: Selecting the local maxima.
Figure 8.11: Finding the start and end frame number of a transition.
Figure 8.12: Adaptive threshold for wavelet coefficients.
Figure 8.13: Decreasing strength of WMA64 with decreasing resolution, corresponding to a flash.
Figure 8.14: Mean absolute differences for DC64 and MA64 features.
Chapter 9: Video Summaries Through Multimodal Analysis
Figure 9.1: Left: original; right: filtered.
Figure 9.2: Camera and object motion detection.
Figure 9.3: Recognition of video captions and faces.
Figure 9.4: Graphics detection through sub-region histogram difference.
Figure 9.5: Thumbnail selection from shot key frames.
Figure 9.6: Audio with unsynchronized imagery.
Figure 9.7a: Inter-cut shot detection for image and audio selection.
Figure 9.7b: Short successive effects.
Figure 9.8: Informedia results list, with thumbnail surrogates and title shown for third video result.
Figure 9.9: Snapshots of the 3 treatments used in thumbnail empirical study with 30 subjects.
Figure 9.10: Storyboard, with overlaid match "notches" following query on "man walking on the moon."
Figure 9.11: Reduced storyboard display for same video represented by full storyboard in Figure 9.10.
Figure 9.12: Scaled-down view of storyboard with full transcript text aligned by image row.
Figure 9.13: Storyboard plus concise text surrogate for same video clip represented by Figure 9.12.
Figure 9.14: Skim treatments used in empirical study on skim utility as informative summary.
Figure 9.15: Timeline overview of "air crash" query.
Figure 9.16: Map visualization for results of "air crash" query, with dynamic query sliders for control and feedback.
Figure 9.17: Filtered set of video documents from Figure 9.15, with added annotations.
Chapter 10: Audio and Visual Content Summarization of a Video Program
Figure 10.1: Block diagram of the audio-centric summarization system.
Figure 10.2: The projection of the above terms-by-sentences matrix A into the singular vector space.
Figure 10.3: The influence of different weighting schemes on the summarization performance. (a), (b), (c), (d): evaluation using the manual summarization results from evaluators 1, 2, and 3, and the one determined by a majority vote, respectively. The notation for the weighting schemes is the same as that of the SMART system: each weighting scheme is denoted by three letters, where the first, second, and third letters represent the local weighting, the global weighting, and the vector normalization, respectively. The meanings of the letters are as follows: N: no weight; B: binary; L: logarithm; A: augmented; T: inverse document frequency; C: vector normalization.
Figure 10.4: Block diagram of the image-centric summarization system.
Figure 10.5: An image-centric summarization result.
Figure 10.6: The block diagram of the audio-visual summarization system.
Figure 10.7: An example of the audio-visual alignment and the corresponding bipartite graph.
Figure 10.8: Alignment solutions: the coarse lines represent the assignment of the shot clusters to the time slots; the notation I(j,k) on each coarse line indicates which shot from the cluster has been selected and assigned to the time slot.
Figure 10.9: Audio-visual summarization of a 12-minute news program.
Chapter 11: Adaptive Video Summarization
Figure 11.1: An example of annotation-based structure for a video. A video of 300 frames is annotated using two initial strata, Strata1 and Strata2. The video segments between frames 87 and 165 are related only to the description of Strata2; the frame intervals [166, 177] and [193, 229] are described by a generated stratum, Strata3, corresponding to the conjunction of the annotations of Strata1 and Strata2, while the frame intervals [178, 192] and [230, 246] are described only by Strata1.
Figure 11.2: An example of a concept type lattice.
Figure 11.3: An example of a relation lattice.
Figure 11.4: An example of a conceptual graph.
Figure 11.5: Join of two graphs (g1 and g2) on a concept [Man: #John].
Figure 11.6: Query syntax.
Chapter 12: Adaptive Video Segmentation and Summarization
Figure 12.1: Multiple Methods
Figure 12.2: Multiple Probability Methods
Figure 12.3: Multiple Inputs
Chapter 13: Augmented Imagery for Digital Video Applications
Figure 13.1: The Milgram virtuality continuum (VC)
Figure 13.2: Architecture of an augmented imagery system
Figure 13.3: A simple polygon mesh description of a pyramid with vertex, normal and face list
Figure 13.4: A ball represented as a tessellated polygon mesh
Figure 13.5: A ball rendered using Gouraud shading
Figure 13.6: A Bézier curve with 4 control points
Figure 13.7: A Bézier surface
Figure 13.8: Scene graph of a highway billboard
Figure 13.9: Aliased and anti-aliased versions of the character "A"
Figure 13.10: A ray tracing model
Figure 13.11: Direct and indirect lighting in radiosity
Figure 13.12: Diagram of a virtual studio system
Figure 13.13: An augmented video frame
Figure 13.14: Sportvision virtual first-down line system
Figure 13.15: Virtual advertising
Figure 13.16: Example of computation of an alpha map
Chapter 14: Video Indexing and Summarization Service for Mobile Users
Figure 14.1: Current problems in accessing video libraries.
Figure 14.2: The goal: delivering the video message in an adapted form.
Figure 14.3: Incorrect frames: change decision.
Figure 14.4: Correct frames: unchanged decision.
Figure 14.5: A Camera Cut Detection Scenario Using the Binary Penetration.
Figure 14.6: The Binary Penetration Algorithm.
Figure 14.7: Consecutive shot cut possibilities.
Figure 14.8: Video segmentation process.
Figure 14.9: The use of the media key framing as a batch process.
Figure 14.10: Adopted video indexing service structure.
Figure 14.11: Samples of the user interface to browse tourist destinations in Egypt. (a) In the case of the highest quality level: video streaming; (b) in the case of the medium quality level: color and high-resolution key frames; (c) in the case of the lowest quality level: gray and low-resolution key frames.
Chapter 15: Video Shot Detection using Color Anglogram and Latent Semantic Indexing: From Contents to Semantics
Figure 15.1: A Delaunay Triangulation Example
Figure 15.2: A Color Anglogram Example
Figure 15.3: A Sample Query Result of Color Anglogram
Figure 15.4: Abrupt Shot Transitions of a Sample Video Clip
Figure 15.5: Gradual Shot Transition of a Sample Video Clip
Figure 15.6: Video Shot Detection Evaluation System
Figure 15.7: Shot Detection Result of a Sample Video Clip
Chapter 16: Tools and Technologies for Providing Interactive Video Database Applications
Figure 16.1: A Sample Interactive Video Application Interface (Content Courtesy of the Deutsche Bank)
Figure 16.2: TEMA used to Transform Single Media Call Centers
Figure 16.3: The Interactive Video Content Model
Figure 16.4: The Presentation Generation Process
Figure 16.5: The HotStreams™ Content Management Tool
Figure 16.6: Alternative Ways to Deliver Content in the Right Form
Figure 16.7: Device-dependent View of Interactive Content
Figure 16.8: Multi-format Video Content Adaptation
Figure 16.9: Adaptive Personalization Workflow
Figure 16.10: SMIL and ASX Scripts for Personalized Video Content
Figure 16.11: Carry-along Workflow
Figure 16.12: Video Node Filtering Examples
Figure 16.13: Hyperlink Filtering Example
Figure 16.14: The Experience Sharing Workflows
Chapter 17: Animation Databases
Figure 17.1: An animation reuse example: applying a walking sequence of a woman to a man
Figure 17.2: Direct and inverse kinematics
Figure 17.3: Proposed XML mediator for databases of animations
Figure 17.4: An example of scene graph illustrating the model
Figure 17.5: The ER diagram of the XML Mediator DTD
Figure 17.6: Block diagram for the Pre-processing phase
Figure 17.7: The original motion sequence of a 2D articulated figure with 3 DOFs in joint space
Figure 17.8: Retargeting to generate a new motion
Figure 17.9: XML Mediator-based Animation Toolkit
Figure 17.10: Detailed object model with views
Figure 17.11: Query view
Figure 17.12: Scene Graph View
Figure 17.13: Query (motion) GUI and pseudo-code for inserting an object from the database into the scene graph
Figure 17.14: The scene graph GUI and the pseudo-code to convert the scene to file
Figure 17.15: Model metadata view
Figure 17.16: Model metadata GUI
Figure 17.17: Motion Adjustment View
Figure 17.18: Motion adjustment GUI
Figure 17.19: Retarget Motion View
Figure 17.20: Motion retargeting GUI and pseudo-code for motion mapping in VRML
Figure 17.21: Motion mapping view
Figure 17.22: Motion mapping GUI
Figure 17.23: Operation sequences for solar systems animation
Figure 17.24: Crosses scene with the rotation and revolution motions
Figure 17.25: Solar system model as viewed in toolkit
Figure 17.26: Generated VRML file for the above queries
Figure 17.27: The above generated VRML file inserted into PowerPoint
Chapter 18: Video Intelligent Exploration
Figure 18.1: General framework
Chapter 19: A Video Database Algebra
Figure 19.1: Visual and Textual Categories
Figure 19.2a: A stork
Figure 19.2b: Regions of interest detected by the human eye
Figure 19.2c: Regions of interest and eye movement with active vision techniques
Figure 19.3: A video block showing a lecture
Figure 19.4: Color histograms in RGB space
Figure 19.5: Example of Cartesian product
Figure 19.6: AVE! key parts
Figure 19.7: The AVE! system screendump
Chapter 20: Audio Indexing and Retrieval
Figure 20.1: Decomposition of an audio signal into clips and frames.
Figure 20.2: Illustration of speaker segmentation.
Figure 20.3: Content and scene-change-index for one audio stream.
Figure 20.4: The SCANMail architecture.
Figure 20.5: Overview of audio framework in MPEG-7 audio.
Chapter 21: Relevance Feedback in Multimedia Databases
Chapter 22: Organizational Principles of Video Data
Figure 22.1: Semantic view of a graph.
Figure 22.2: A general scheme that shows the dependence of events.
Figure 22.3: The system with two primitive events.
Figure 22.4: Examples of the verification of a relaxed sequence.
Figure 22.5: State machine.
Figure 22.6: Diagram of domain specification.
Chapter 23: Segmentation and Classification of Moving Video Objects
Figure 23.1: Architecture of the video object classification system.
Figure 23.2: Projection of background plane in world coordinates to image coordinate systems of images i and j.
Figure 23.3: Different plane transformations. While transformations (a)-(d) are affine, perspective deformations (e) can only be modeled by the perspective motion model.
Figure 23.4: Feature points on edges (a) cannot be tracked reliably because a high uncertainty about the position along the edge remains. Feature points at corners (b), crossings (c), or points where several regions overlap (d) can be tracked very reliably.
Figure 23.5: Scatter plots of gradient components for a selected set of windows.
Figure 23.6: Pixel classification based on Harris corner detector response function. The dashed lines are the isolines of r.
Figure 23.7: Sub-pel feature point localization by fitting a quadratic function through the feature point at xp and its two neighbors.
Figure 23.8: Estimation of camera parameters. (a) motion estimates, (b) estimation by least squares, (c) motion estimates (white vectors are excluded from the estimation), (d) estimation by least-trimmed squares regression.
Figure 23.9: Difference frame after Levenberg-Marquardt minimization. (a) shows residuals using squared differences as the error function; (b) shows residuals with saturated squared differences. The robust estimation visibly achieves better compensation; note especially the text in the right part of the image.
Figure 23.10: Reconstruction of background based on compensation of camera motion between video frames. The original video frames are indicated with borders.
Figure 23.11: Aligned input frames are stacked and a pixel-wise median filter is applied in the temporal direction to remove foreground objects.
Figure 23.12: Computing the difference between successive frames results in unwanted artifacts. The first two pictures show two input frames with foreground objects; the right picture shows the difference. Two kinds of artifacts can be observed. First, the circle appears twice, since the algorithm cannot distinguish between appearing and disappearing objects. Second, part of the inner area of the polygon is not filled, because the pixels in this area do not change their brightness.
Figure 23.13: Difference frames using squared error and SSD-3. Note that SSD-3 shows considerably fewer errors at edges caused by aliasing in the sub-sampling process.
Figure 23.14: Definition of pixel neighborhood. Picture (a) shows the two classes of pixel neighbors: straight (1) and diagonal (2). These two classes are used to define the second-order cliques: straight cliques (b) and diagonal cliques (c).
Figure 23.15: Segmentation results for per-pixel decision between foreground and background object and MRF based segmentation.
Figure 23.16: Construction of the CSS image. Left: object view (a) and iteratively smoothed contour (b)–(d). Right: resulting CSS image.
Figure 23.17: Weights for transitions between object classes. Thicker arrows represent more probable transitions.
Figure 23.18: Extraction of object behavior.
Figure 23.19: Segmentation results of "stefan" test sequence (frames 40, 80, 120, 160).
Figure 23.20: Reconstructed background from "stefan" sequence. Note that the player is not visible even though there is no input video frame without the player.
Figure 23.21: Segmentation results of "road" test sequence (frames 20, 40, 70, 75).
Figure 23.22: CSS matching results for selected frames of the "stefan" sequence.
Chapter 24: A Web-Based Video Retrieval System: Architecture, Semantic Extraction, and Experimental Development
Figure 24.1: A "Stand-alone" VRS architecture.
Figure 24.2: Architecture of VideoMAP*: a web-based video retrieval system.
Figure 24.3: Data distribution of Global VPDB.
Figure 24.4: Data distribution of Global VQDB.
Figure 24.5: Annotating the segments after video segmentation.
Figure 24.6: Query facilities.
Figure 24.7: Query by semantic information.
Figure 24.8: Query by visual information.
Figure 24.9: GUI of the Web-based video query system (VideoMAP*).
Figure 24.10: A client's query processed by the server (VideoMAP*).
Chapter 26: Video Indexing and Retrieval using MPEG-7
Figure 26.1: MPEG-7 applications include pull-type applications such as multimedia database searching, push-type applications such as multimedia content filtering, and universal multimedia access.
Figure 26.2: Overview of the normative scope of MPEG-7 standard. The methods for extraction and use of MPEG-7 descriptions are not standardized.
Figure 26.3: The MPEG-7 classification schemes organize terms that are used by the description tools.
Figure 26.4: Example showing the temporal decomposition of a video into two shots and the spatio-temporal decomposition of each shot into moving regions.
Figure 26.5: Example image showing two people shaking hands.
Figure 26.6: The MPEG-7 video search engine allows searching based on features, models, and semantics of the video content.
Chapter 27: Indexing Video Archives: Analyzing, Organizing, and Searching Video Information
Figure 27.1: Architecture of Name-It.
Figure 27.2: Co-occurrence factor calculation.
Figure 27.3: Face and name association results.
Figure 27.4: Relating Feature Spaces by co-occurrence.
Figure 27.5: Structure of the R-tree.
Figure 27.6: Structure of the SS-tree.
Figure 27.7: Structure of the SR-tree.
Figure 27.8: Similarity retrieval of video frame sequences based on the closest pair.
Figure 27.9: Performance of the SR-tree with face sequence matching.
Figure 27.10: Distances among 100,000 points generated at random in a unit hypercube.
Figure 27.11: Relative difference between the first and the second nearest neighbors.
Figure 27.12: Definition of the indistinctive NN.
Figure 27.13: Rejection probability when Rp is 1.84471 and Nc is 48.
Figure 27.14: Search result of the distinctiveness-sensitive NN search.
Figure 27.15: Performance of the distinctiveness-sensitive NN search.
Figure 27.16: Examples of search results.
Figure 27.17: Examples of results when no distinctive NN is found except for the query.
Figure 27.18: Examples of the distribution of 99 nearest neighbors.
Chapter 28: Efficient Video Similarity Measurement using Video Signatures
Figure 28.1: (a) Two video sequences with IVS equal to 1/3. (b) The Voronoi Diagram of a 3-frame video X. (c) The shaded area, normalized by the area of the entire space, is equal to the VVS between the two sequences shown.
Figure 28.2: Three examples of VVS.
Figure 28.3: (a) The error probability for the Hamming cube at different values of ε and distances k between the frames in the video. (b) Values of ranking function Q() for a three-frame video sequence. Lighter colors correspond to larger values.
Figure 28.4: Quantization of the HSV color space.
Figure 28.5: Precision-recall plots for web video experiments: (a) comparisons between the Basic (broken-line) and Ranked (solid) ViSig methods for four different ViSig sizes, m = 2, 6, 10, 14; (b) Ranked ViSig methods for the same set of ViSig sizes; (c) Ranked ViSig methods with m = 6 based on the l1 metric (broken) and the modified l1 distance (solid) on color histograms; (d) comparison between the Ranked ViSig method with m = 6 (solid) and k-medoid with 7 representative frames (broken).
Chapter 29: Similarity Search in Multimedia Databases
Figure 29.1: Abstraction levels describing the feature spaces for image databases.
Figure 29.2: An example of feature extraction: grey-scale histograms with 256 bins.
Figure 29.3: Nodes in a two-dimensional space defined by the Euclidean distance. The nodes have a maximum capacity of 8 objects. (a) The representative of the node is object A. When object I is inserted, the node exceeds its capacity and must be split. (b) After the split, there are two nodes: the first with four objects and representative B, and the second with five objects and representative C.
Figure 29.4: A metric tree indexing 17 objects. The objects are organized in three levels. The node capacity is 3.
Figure 29.5: Operations executed over each logical database by each command in SQL. The definition (D), update (U) and query (Q) operations are performed to process the issued commands. The grey rows of the table refer to image related operations.
Figure 29.6: The five logical databases required to support the application data and content-based image retrieval in an RDBMS.
Figure 29.7: Structure of the operations needed to support content-based retrieval operations in a set of images.
Figure 29.8: Architecture of the image support core engine.
Chapter 30: Small Sample Learning Issues for Interactive Video Retrieval
Figure 30.1: Content-based multimedia retrieval: system diagram.
Figure 30.2: Video contents can be represented by a combination of audio, visual, and textual features.
Figure 30.3: Real queries are often specific and personal. A building detector may declare these two images "similar"; however, a user often looks for more specific features such as shape, curvature, materials, and color.
Figure 30.4: Comparison of FDA, MDA, and BDA for dimensionality reduction from 2-D to 1-D.
Figure 30.5: Test results on synthetic data.
Chapter 31: Cost Effective and Scalable Video Streaming Techniques
Figure 31.1: A multicast delivery.
Figure 31.2: Server architecture.
Figure 31.3: Client architecture.
Figure 31.4: Skyscraper downloading scheme.
Figure 31.5: Broadcast schedule when N=5 and K=6.
Figure 31.6: Server software— main window.
Figure 31.7: Client software— JukeBox.
Figure 31.8: Playback of a video.
Figure 31.9: A broadcast example in Pagoda Broadcast.
Figure 31.10: Patching.
Figure 31.11: A range multicast example.
Figure 31.12: Play point position in the client buffer.
Figure 31.13: A Heterogeneous Receiver-Oriented (HeRO) broadcast example.
Chapter 32: Design and Development of a Scalable End-to-End Streaming Architecture
Figure 32.1: The Yima multi-node hardware architecture. Each node is based on a standard PC and connects to one or more disk drives and the network.
Figure 32.2: Smoothing effectiveness of the MTFC algorithm with the movie "Twister."
Figure 32.3: Block movement and coefficient of variation.
Figure 32.4: Disk bandwidth when scaling from 2 to 7 disks. One disk is added every 120 seconds (10 clients each receiving 5.33 Mbps).
Figure 32.5: Yima single node server performance with 100 Mbps versus 1 Gbps network connection.
Figure 32.6: Yima server 2 node versus 4 node performance.
Chapter 33: Streaming Multimedia Presentations in Distributed Database Environments
Figure 33.1: Server design for audio and video streams.
Figure 33.2: Multiple streams synchronized at client.
Figure 33.3: Server flow control buffer design.
Figure 33.4: Client video buffer.
Figure 33.5: Definition of DSD in a stream.
Figure 33.6: PLUS probing scheme.
Figure 33.7: The relationships among receiver, controller and actor objects.
Chapter 34: Video Streaming: Concepts, Algorithms, and Systems
Figure 34.1: Example of the prediction dependencies between frames.
Figure 34.2: Effect of playout buffer on reducing the number of late packets.
Figure 34.3: Example of error propagation that can result from a single error.
Figure 34.4: Multiple description video coding and path diversity for reliable communication over lossy packet networks.
Figure 34.5: An MD-CDN uses MD coding and path diversity to provide improved reliability for packet losses, link outages, and server failures.
Chapter 35: Continuous Display of Video Objects using Heterogeneous Disk Subsystems
Figure 35.1: Taxonomy of techniques.
Figure 35.2: OLT1.
Figure 35.3: OLT2.
Figure 35.4: HetFIXB.
Figure 35.5: Disk grouping.
Figure 35.6: Disk merging.
Figure 35.7: One Seagate disk model (homogeneous).
Figure 35.8: Two Seagate disk models (heterogeneous).
Figure 35.9: Three Seagate disk models (heterogeneous).
Figure 35.10: Wasted disk bandwidth.
Figure 35.11: Maximum startup latency.
Figure 35.12: One disk model, type 0 (homogeneous).
Figure 35.13: Two disk models, type 0 and 1 (heterogeneous).
Figure 35.14: Three disk models, type 0, 1, 2 (heterogeneous).
Figure 35.15: Wasted disk space with three disk models.
Chapter 36: Technologies and Standards for Universal Multimedia Access
Figure 36.1: Capability mismatch and adaptation for universal access.
Figure 36.2: Example of an object-based scene.
Figure 36.3: Composition parameters.
Figure 36.4: Stream Association in MPEG-4 Systems.
Figure 36.5: Generic illustration of the transcoding hints application.
Figure 36.6: Reference transcoding architecture for bit rate reduction.
Figure 36.7: Simplified transcoding architectures for bitrate reduction. (a) open loop: partial decoding to DCT coefficients, then re-quantization; (b) closed loop: drift compensation for re-quantized data.
Figure 36.8: Frame-based comparison of PSNR quality for reference, open-loop and closed-loop architectures for bitrate reduction.
Figure 36.9: Illustration of intra-refresh architecture for reduced spatial resolution transcoding.
Figure 36.10: Concept of Digital Item Adaptation.
Figure 36.11: Graphical representation of CC/PP profile components.
Chapter 37: Server-Based Service Aggregation Schemes for Interactive Video-on-Demand
Figure 37.1: Aggregation of streams.
Figure 37.2: Minimum bandwidth clustering.
Figure 37.3: Merging with more than two rates.
Figure 37.4: Clustering with predicted arrivals.
Figure 37.5: Clustering under perfect prediction.
Figure 37.6: Event model and controls.
Figure 37.7: Static snapshot case.
Figure 37.8: Dynamic clustering in steady state.
Figure 37.9: Dependence of channel gains on W.
Figure 37.10: Comparison— with best W for each policy.
Figure 37.11: Two variants of RSMA.
Figure 37.12: Effect of varying degrees of interactivity.
Chapter 38: Challenges in Distributed Video Management and Delivery
Figure 38.1: A distributed video management and delivery system. Every user can be a video producer/provider, an intermediate service provider (e.g., for transcoding), and a video consumer.
Figure 38.2: System architecture.
Figure 38.3: Block diagram of multi-rate encoder.
Figure 38.4: High-level representation of the differential video encoder (top) and decoder (bottom).
Figure 38.4: VISDOM system block diagram.
Figure 38.5: MDFEC— The arrangement of FECs and video layers enables successful decoding of layer 1 with the reception of any one packet, layers 1 and 2 with the reception of any two packets, and so on. The amount of data and protection for each layer can be matched to the transmission channel at hand.
Figure 38.6: Plots (a), (c), (e) show the delivered quality for the Football video sequence (measured in PSNR (dB)) as a function of time for one, two and three servers, respectively. Plots (b), (d), (f) show the total transmission rate as a function of time for one, two and three servers, respectively. Even though the transmission rate is nearly the same in the three cases, the delivered quality becomes increasingly smooth as the number of content servers is increased.
Figure 38.7: Media data ordering in the adaptive/dynamic storage and caching management system based on recall values. The total cache size is NN+1.
Figure 38.8: Example of text extraction in videos.
Figure 38.9: Sample results of color-based similarity search. Similar videos are shown as small icons surrounding the video displayed in the center.
Figure 38.10: Example of transcoding/path selection graph construction for delivery of a video stream "example" to a client Peer A possessing only an MPEG-4 decoder. The content is available in MPEG-2 format from Peer C, or directly as an MPEG-4 sequence from Peer D. The requesting node compares the cost of obtaining the clip along several possible paths: directly from D, which involves a cost ct(D, A) of transmission, and from C to B to A, which involves the cost of transmission, ct(C, B) + ct(B, A), plus the cost of transcoding at B, cx("example", MPEG-2 to MPEG-4). Because Peer A can only decode MPEG-4, it is useless to deliver an MPEG-2 version of the object to Peer A; edges along paths that cannot satisfy the required constraints are therefore marked with infinite cost.
Figure 38.11: Mapping progressive stream to packets using MDFEC codes. Progressive stream (top) is partitioned by Ri and encoded with (N,i) codes. Ri corresponds to layer i in Figure 38.5.
Figure 38.12: Hybrid ARQ algorithm data flow.
Figure 38.13: PSNR vs. packet erasure characteristics.
Figure 38.14: The system overview (PW— perceptual weighting; F— filtering).
Chapter 39: Video Compression: State of the Art and New Trends
Figure 39.1: MPEG-7 standard.
Figure 39.2: Five training views of the image Ana.
Figure 39.3: Five training views of the image José Mari.
Figure 41.7: Illustration of Minkowski error pooling.
Figure 41.8: Evaluation of "Lena" images with different types of noise. Top-left— Original "Lena" image, 512×512, 8 bits/pixel; Top-right— Impulsive salt-and-pepper noise contaminated image, MSE=225, Q=0.6494; Bottom-left— Additive Gaussian noise contaminated image, MSE=225, Q=0.3891; Bottom-right— Multiplicative speckle noise contaminated image, MSE=225, Q=0.4408.
Figure 41.9: Evaluation of "Lena" images with different types of distortions. Top-left— Mean shifted image, MSE=225, Q=0.9894; Top-right— Contrast stretched image, MSE=225, Q=0.9372; Bottom-left— Blurred image, MSE=225, Q=0.3461; Bottom-right— JPEG compressed image, MSE=215, Q=0.2876.
Figure 41.10: Evaluation of blurred image quality. Top-left— Original "Woman" image; Top-right— Blurred "Woman" image, MSE=200, Q=0.3483; Middle-left— Original "Man" image; Middle-right— Blurred "Man" image, MSE=200, Q=0.4123; Bottom-left— Original "Barbara" image; Bottom-right— Blurred "Barbara" image, MSE=200, Q=0.6594.
Figure 41.11: Evaluation of JPEG compressed image quality. Top-left— Original "Tiffany" image; Top-right— compressed "Tiffany" image, MSE=165, Q=0.3709; Middle-left— Original "Lake" image; Middle-right— compressed "Lake" image, MSE=167, Q=0.4606; Bottom-left— Original "Mandrill" image; Bottom-right— compressed "Mandrill" image, MSE=163, Q=0.7959.
Figure 41.12: Deployment of a reduced-reference video quality assessment metric. Features extracted from the reference video are sent to the receiver to aid in quality measurements. The video transmission network may be lossy but the RR channel is assumed to be lossless.