2. Preliminaries: Video Data Model

Throughout this paper, we assume that a video v is divided into a sequence b1, …, blen(v) of blocks. The video database administrator can choose what a block is: it could, for instance, be a single frame, the set of frames between two consecutive I-frames (in the case of MPEG video), or something else. The number len(v) is called the length of video v. If 1 ≤ l ≤ u ≤ len(v), then we use the expression block sequence to refer to the closed interval [l,u], which denotes the set of all blocks b such that l ≤ b ≤ u. Associated with any block sequence [l,u] is a set of objects. These objects fall into four categories, as shown in Figure 19.1.




Figure 19.1: Visual and Textual Categories. (Visual side: color histograms, texture maps, shape descriptors, spatial position descriptors. Textual side: object annotations, e.g., vehicle tracking and activity annotation.)

  • Visual Entities of Interest (Visual EOIs for short): An entity of interest is a region of interest in a block sequence (usually, when identifying entities of interest, a single block, i.e., a block sequence of length one, is considered). Visual EOIs can be identified using appropriate image processing algorithms. For example, Figure 19.2 shows a photograph of a stork and identifies three regions of interest in the picture using active vision techniques [1], [2]. Each such region may have various attributes associated with it, such as an id, a color histogram, a texture map, etc.

  • Visual Activities of Interest (Visual AOIs): An activity of interest is a motion of interest in a video segment. For example, a dance motion, or a flying bird ("flight"), might be an activity of interest. There are numerous techniques in the image processing literature to extract visual AOIs, including techniques for dancing [3], gestures [4], and many other motions [5].

  • Textual Entities of Interest: Many videos are annotated with textual information. For example, news videos often have textual streams associated with the video stream. There are also numerous projects [25,26,27] that allow textual annotation of video; such annotations may explicitly mark up a video with objects occurring in a block sequence, or may provide a text stream from which such objects can be derived. A recent commercial example is IBM's AlphaWorks system, in which a user can annotate a video while watching it.

  • Textual Activities of Interest: The main difference between Textual EOIs and Textual AOIs is that the latter pertains to activities, while the former pertains to entities. Both are textually marked up.

Figure 19.2a: A stork

Figure 19.2b: Regions of interest detected by the human eye

The algebra and implementation we propose for querying and otherwise manipulating video include elements of all of the above. Ours is the first extension of the relational algebra to video databases that includes generic capabilities for handling both annotated video and image processing algorithms.

Throughout the rest of this paper, we assume the existence of some set A whose elements are called attribute names.

Definition (Textual object type) A textual object type (TOT for short) is inductively defined as follows:

  • real, int, and string are TOTs.

  • If A1, …, An are attribute names and τ1, …, τn are TOTs, then [A1 : τ1, …, An : τn] is a TOT.

  • If τ is a TOT, then {τ} is a TOT.

As usual, every type τ has a domain, dom(τ). The domains of the basic types (real, int, string) are defined in the usual way. The domain of the type [A1 : τ1, …, An : τn] is dom(τ1) × … × dom(τn). The domain of {τ} is the power set 2^dom(τ) of τ's domain. We return to our stork example to illustrate various TOTs and their domains.

Example. Consider the stork shown in Figure 19.2. A textual object type for Figure 19.2(c) might be:

  • Feature: {string} which may specify a set of features.

Figure 19.2c: Regions of interest and eye movement detected with active vision techniques

Definition (Textual object) A textual object of type τ is any member of dom(τ).

Note that we will often abuse notation and write a textual object of type [A1 : τ1, …, An : τn] as [A1 : v1, …, An : vn] instead of [v1, …, vn], so that the association of values vi with attributes Ai is explicit (here, the vi's are all in dom(τi)).

Example. Continuing the preceding example, consider the image shown in Figure 19.2(c). Here, the textual objects involved might be:

  • Feature:{stork,water}: This may be a human-created annotation reflecting the fact that the image shows a stork and water.
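The membership of such objects in a type's domain can be tested mechanically. Below is a minimal sketch of such a tester; the encoding of TOTs as Python strings and tuples is our own assumption, not notation from this chapter.

```python
# Hypothetical encoding of textual object types (TOTs):
#   "real" | "int" | "string"             -- base types
#   ("record", [(A1, t1), ..., (An, tn)]) -- the record type [A1:t1, ..., An:tn]
#   ("set", t)                            -- the set type {t}

def is_member(value, tot):
    """Check whether `value` lies in dom(tot)."""
    if tot == "real":
        return isinstance(value, float)
    if tot == "int":
        return isinstance(value, int)
    if tot == "string":
        return isinstance(value, str)
    kind = tot[0]
    if kind == "record":
        fields = tot[1]
        return (isinstance(value, dict)
                and set(value) == {a for a, _ in fields}
                and all(is_member(value[a], t) for a, t in fields))
    if kind == "set":
        return (isinstance(value, (set, frozenset))
                and all(is_member(v, tot[1]) for v in value))
    raise ValueError(f"unknown TOT: {tot!r}")
```

For instance, the annotation Feature:{stork,water} corresponds to the value `{"stork", "water"}`, which is a member of the domain of the type `("set", "string")`, i.e., {string}.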

While the concept of a textual object is well suited to human-created annotations, some additional information may be required for image processing algorithms. We now define the concept of a visual type.

Definition (Visual type) A visual type is any of the following

  • real and int are Visual Types (VTs for short).

  • If τ is a VT, then {τ} is a VT.

  • If A1, …, An are strings (attribute names) and τ1, …, τn are VTs, then [A1 : τ1, …, An : τn] is a VT.

Remark: The reader will note that visual types are a special case of textual types. The reason is that after image processing algorithms are applied to an image or video, the result may be stored in one or more image files which are then referenced by their file name (which is a string). Similarly, other outputs of image matching algorithms are textual (e.g., the name of a person seen in an image, etc.). The key reason we distinguish between visual and textual types is that the former are obtained using image processing algorithms while the latter are obtained using human annotations.

Definition (Visual object) A visual object is a special kind of textual object over the type

[LLx : real, LLy : real, URx : real, URy : real, A1 : τ1, …, An : τn]

where τ1,,τn are visual types.

Example. Intuitively, a visual type could be:

  • red:int, specifying a number from 0 to 255 denoting the level of redness. We could have types green:int and blue:int in a similar way.

  • [red:int, green:int, blue:int] is also a type. This may specify average RGB values for a given region of Figure 19.2(c).

An example visual object could be [LLx:50, LLy:0, URx:60, URy:10, red:0, green:6, blue:10]. If we look at the lowest rectangle in Figure 19.2(c), this object might describe the average color values associated with that rectangle.
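A visual object can be modeled as a bounding box plus visual-type attributes. The following is a small sketch under our own conventions: the helper name is hypothetical, the field names follow the type definition above, and the lower-left/upper-right validation is an assumption.

```python
def make_visual_object(llx, lly, urx, ury, **attrs):
    """Build a visual object: a bounding box [LLx, LLy, URx, URy]
    together with attributes of visual type (e.g., average RGB values)."""
    if not (llx <= urx and lly <= ury):
        raise ValueError("lower-left corner must not exceed upper-right corner")
    obj = {"LLx": float(llx), "LLy": float(lly),
           "URx": float(urx), "URy": float(ury)}
    obj.update(attrs)  # e.g., red/green/blue averages for the region
    return obj

# The average-color object for the lowest rectangle in Figure 19.2(c):
stork_box = make_visual_object(50, 0, 60, 10, red=0, green=6, blue=10)
```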

The preceding definitions describe objects of interest, but not activities. In the following, we describe activities of interest in a video.

Definition (Textual activity object) A textual activity object is a special kind of textual object over the type [ActName: string, Roles: {[RoleName: string, Player: string]}].

Definition (Visual activity object) A visual activity object w.r.t. video v is a special kind of textual object over the type [ActName: string, Start: int, End: int].

Example. Let us return to the stork of Figure 19.2. Here, the textual activity object for the stork is just [ActName:standing, Roles:{ }], describing the fact that the stork is standing. It has no roles, and hence no players associated with roles, because the stork by itself is not playing any role.

The following example shows a lecture where there are roles.

Example. Figure 19.3 below shows selected frames from a lecture video.

Figure 19.3: A video block showing a lecture

In this case, the visual activity object would be:

[ActName: lecturing,Start:5,End:200]

describing the fact that the activity in question is a lecturing activity and that it starts in video frame 5 and ends in video frame 200.

On the other hand, if a human were to do the annotation, rather than a program, then we would have a textual activity object which may look like this:

[ActName:lecturing, Roles:{ [RoleName:speaker, Player: vs]}].

Again, we emphasize that visual activities and objects are identified using image processing algorithms, while human annotations are used to identify non-visual textual objects and activities.
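The two activity object types above can be sketched as plain dictionaries. The helper names below are our own; only the field names (ActName, Roles, RoleName, Player, Start, End) come from the definitions in this section.

```python
def textual_activity(act_name, roles=()):
    """[ActName: string, Roles: {[RoleName: string, Player: string]}]
    Roles is a set of (RoleName, Player) pairs, possibly empty."""
    return {"ActName": act_name,
            "Roles": {(role, player) for role, player in roles}}

def visual_activity(act_name, start, end):
    """[ActName: string, Start: int, End: int]"""
    return {"ActName": act_name, "Start": start, "End": end}

# The stork example: an activity with no roles.
standing = textual_activity("standing")
# The lecture example, annotated by a human and by a program respectively.
lecture_t = textual_activity("lecturing", [("speaker", "vs")])
lecture_v = visual_activity("lecturing", 5, 200)
```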

Throughout the rest of this paper, we assume the existence of some arbitrary but fixed set T of types including all the types needed for activities and visual objects. We assume that T is closed under subtypes. We do not give a formal definition of subtype (as this is fairly obvious) and instead appeal to the reader's intuition.

Definition (Video association map) Suppose T is some set of types (closed under subtypes). An association map ρ is a mapping from block sequences to sets of objects drawn from ⋃τ∈T dom(τ) such that if bs ⊆ bs′ are block sequences, then ρ(bs) ⊆ ρ(bs′).

Intuitively, ρ(bs) specifies objects that occur somewhere in the block sequence bs. Note that the video association map ρ can be specified through the use of image processing algorithms. In our AVE! system, ρ can be specified using a variety of image processing algorithms, including adaptive filtering, entropy-based detection, Gabor filtering, and active vision techniques [2], [21], as well as through human-created video annotations.
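A minimal sketch of an association map with the required monotonicity property is given below. The storage scheme (a list of interval/object-set entries, with lookup by interval containment) is our own assumption; because ρ(bs) is computed as a union over stored intervals contained in bs, enlarging bs can only add objects, so ρ(bs) ⊆ ρ(bs′) whenever bs ⊆ bs′ holds by construction.

```python
class AssociationMap:
    """Hypothetical association map rho from block sequences [l, u]
    to sets of objects occurring somewhere in that sequence."""

    def __init__(self):
        self._entries = []  # list of ((l, u), frozenset_of_objects)

    def add(self, l, u, objects):
        """Record that `objects` occur somewhere within blocks [l, u]."""
        self._entries.append(((l, u), frozenset(objects)))

    def rho(self, l, u):
        """All objects attached to block sequences contained in [l, u]."""
        result = set()
        for (el, eu), objs in self._entries:
            if l <= el and eu <= u:   # stored interval lies inside the query
                result |= objs
        return result
```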

Example. Suppose we have a 900 frame video clip of our favorite stork. The video clip may show the stork walking and perhaps catching a fish. In this case, the set T of types used may include all the types listed in the preceding examples. The association map ρ may say that:

  • ρ([0,900]) contains {Feature:{stork,water}, [red:10, green:15, blue:20]}.

  • ρ([0,101]) = {[ActName:standing, Roles:{ }]}.

  • ρ([101,250]) = {[ActName:fishing, Roles:{[RoleName:fisher, Player:stork]}]}, denoting the fact that the stork is playing the role of the fisher (the entity catching the fish).

  • By monotonicity, ρ([0,900]) consists of all of the above: the feature annotation {Feature:{stork,water}, [red:10, green:15, blue:20]} together with the standing and fishing activity objects.

This does not mean that the stork is continuously fishing during frames 101–250. It merely means that at some time in this interval, it is fishing. If the user wants to specify that the stork is fishing at each time instant in the [101,250] interval, this can be stated explicitly as ρ([101,101]) = {[ActName:fishing, Roles:{[RoleName:fisher, Player:stork]}]}, ρ([102,102]) = {[ActName:fishing, Roles:{[RoleName:fisher, Player:stork]}]}, and so on.

As the AND-interpretation of the association map ρ is clearly expressible in terms of the OR-interpretation we have used above, we will assume throughout the rest of this paper that the OR-interpretation is used. Some syntactic sugar can easily be created to make the AND-interpretation explicit for use in a deployed system.
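The syntactic sugar alluded to above can be sketched in a few lines: an AND-annotation over an interval [l,u] simply expands into one OR-entry per single-block sequence. The function name is hypothetical.

```python
def expand_and_annotation(l, u, objects):
    """Expand an AND-interpretation annotation over [l, u] into the
    equivalent OR-interpretation entries rho([i, i]) for each block i."""
    return {(i, i): set(objects) for i in range(l, u + 1)}

entries = expand_and_annotation(101, 103, {"fishing"})
# entries == {(101, 101): {"fishing"}, (102, 102): {"fishing"}, (103, 103): {"fishing"}}
```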

Handbook of Video Databases: Design and Applications (Internet and Communications), ISBN 084937006X, 2003, 393 pages.
