2. Metadata

Metadata plays an essential role in multimedia content management: it describes key aspects of multimedia content, including main topics, author, language, events, scenes, objects, times, places, rights, packaging, access control, content adaptation, and so forth. Conformance with open metadata standards is vital for multimedia content management systems because it allows faster design and implementation, interoperability with a broad field of competitive standards-based tools and systems, and leveraging of a rich set of standards-based technologies for critical functions such as content extraction, advanced search, and personalization.

2.1. MPEG-7 Standard Scope

The scope of the MPEG-7 standard is shown in Figure 26.2. The normative scope of MPEG-7 includes Description Schemes (DSs), Descriptors (Ds), the Description Definition Language (DDL), and Coding Schemes (CSs). MPEG-7 standardizes the syntax and semantics of each DS and D to allow interoperability. The DDL, which is based on the XML Schema language, is used to define the syntax of the MPEG-7 DSs and Ds. The DDL also allows the standard MPEG-7 schema to be extended for customized applications.

[Figure 26.2 diagram: extraction tools (content analysis, feature extraction, annotation tools, authoring) produce descriptions within the MPEG-7 scope (Description Schemes, Descriptors, the DDL, and Coding Schemes), which in turn support uses such as searching and filtering, classification, complex querying, and indexing.]

Figure 26.2: Overview of the normative scope of MPEG-7 standard. The methods for extraction and use of MPEG-7 descriptions are not standardized.

The MPEG-7 standard is "open" on both sides in that the methods for extraction and for use of MPEG-7 descriptions are not defined by the standard. As a result, methods, algorithms, and systems for content analysis, feature extraction, annotation, and authoring of MPEG-7 descriptions are open to industry competition and future innovation. Likewise, methods, algorithms, and systems for searching and filtering, classification, complex querying, indexing, and personalization are open to industry competition and future innovation.

2.2. Multimedia Description Schemes

The MPEG-7 Multimedia Content Description Interface standard specifies generic description tools pertaining to multimedia, including audio and visual content. The Multimedia Description Schemes (MDS) description tools are categorized as (1) basic elements, (2) tools for describing content and related metadata, (3) tools for describing content organization, navigation and access, and user interaction, and (4) classification schemes [28][32].

2.2.1. Basic Elements

The basic elements form the building blocks for the higher-description tools. The following basic elements are defined:

  • Schema tools. Specifies the base type hierarchy of the description tools, the root element and top-level tools, the multimedia content entity tools, and the package and description metadata tools.

  • Basic datatypes. Specifies the basic datatypes such as integers, reals, vectors, and matrices, which are used by description tools.

  • Linking and media localization tools. Specifies the basic datatypes that are used for referencing within descriptions and linking of descriptions to multimedia content, such as spatial and temporal localization.

  • Basic description tools. Specifies basic tools that are used as components for building other description tools such as language, text, and classification schemes.

2.2.2. Content Description Tools

The content description tools describe the features of the multimedia content and the immutable metadata related to the multimedia content. The following description tools for content metadata are defined:

  • Media description. Describes the storage of the multimedia data. The media features include the format, encoding, and storage media. The tools allow multiple media description instances for the same multimedia content.

  • Creation & production. Describes the creation and production of the multimedia content. The creation and production features include title, creator, classification, purpose of the creation, and so forth. The creation and production information is typically not extracted from the content but corresponds to metadata related to the content.

  • Usage. Describes the usage of the multimedia content. The usage features include access rights, publication, and financial information. The usage information may change during the lifetime of the multimedia content.

The following description tools for content description are defined:

  • Structure description tools. Describes the structure of the multimedia content. The structural features include spatial, temporal, or spatio-temporal segments of the multimedia content.

  • Semantic description tools. Describes the "real-world" semantics related to or captured by the multimedia content. The semantic features include objects, events, concepts, and so forth.

The content description and metadata tools are related in the sense that the content description tools use the content metadata tools. For example, a description of creation and production or media information can be attached to an individual video or video segment in order to describe the structure and creation and production of the multimedia content.

2.2.3. Content Organization, Navigation, and User Interaction

The tools for organization, navigation and access, and user interaction are defined as follows:

  • Content organization. Describes the organization and modeling of multimedia content. The content organization tools include collections, probability models, analytic models, cluster models, and classification models.

  • Navigation and Access. Describes the navigation and access of multimedia such as multimedia summaries and abstracts; partitions, views and decompositions of image, video, and audio signals in space, time and frequency; and relationships between different variations of multimedia content.

  • User Interaction. Describes user preferences pertaining to multimedia content and usage history of users of multimedia content.

2.2.4. Classification Schemes

A classification scheme is a list of defined terms and their meanings. The MPEG-7 classification schemes organize terms that are used by the description tools. Applications need not use the classification schemes defined in the MPEG-7 standard; they can use proprietary or third-party ones. However, if they choose to use the classification schemes defined in the MPEG-7 standard, no modifications or extensions are allowed. Furthermore, MPEG-7 has defined requirements for a registration authority for MPEG-7 classification schemes, which allows third parties to define and register classification schemes for use by others. All of the MPEG-7 classification schemes are specified using the ClassificationScheme DS, that is, they are themselves MPEG-7 descriptions.

2.2.5. MPEG-7 Schema and Values

While MPEG-7 fully specifies the syntax and semantics of the multimedia content description metadata, and allows for extensibility, another important aspect of metadata is the domain of values that populate the structures. We view this as the distinction of the syntax (schema definition) and terms (values). As shown in Figure 26.3, MPEG-7 defines a core set of Descriptors and Description Schemes that form the syntax of the standard schema. This core set is also extensible in that third parties can define new Descriptors and Description Schemes. On the other side, values and terms are required for instantiating the Descriptors and Description Schemes. It is possible for the values to be restricted or managed through MPEG-7 Classification Schemes or Controlled Terms. MPEG-7 specifies a core set of Classification Schemes. However, this core set is also extensible. A registration authority allows for the registration of classification schemes and controlled terms.
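The split between schema (structures) and terms (values) can be sketched in code: a classification scheme is, at heart, a registered mapping from term identifiers to meanings, against which instance values are validated. The scheme URN and term identifiers below are illustrative, not the normative contents of any MPEG-7 classification scheme.

```python
# A classification scheme modeled as a mapping from term identifiers to names.
# The term set below is illustrative, not the normative MPEG-7 RoleCS.
role_cs = {
    "PUBLISHER": "Publisher",
    "AUTHOR": "Author",
    "EDITOR": "Editor",
}

# Registry keyed by scheme URN, as a registration authority might maintain.
schemes = {"urn:mpeg:mpeg7:cs:RoleCS:2002": role_cs}

def resolve_term(scheme_urn, term_id, schemes):
    """Resolve a controlled term against a registered classification scheme."""
    scheme = schemes.get(scheme_urn)
    if scheme is None or term_id not in scheme:
        raise ValueError(f"unknown term {term_id!r} in scheme {scheme_urn!r}")
    return scheme[term_id]

print(resolve_term("urn:mpeg:mpeg7:cs:RoleCS:2002", "PUBLISHER", schemes))
```

A description instance whose term does not resolve against the referenced scheme can then be rejected at validation time, which is the practical payoff of managing values through classification schemes.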

Figure 26.3: The MPEG-7 classification schemes organize terms that are used by the description tools.

2.2.6. Classification Scheme Examples

The following example gives an MPEG-7 description that uses controlled terms from MPEG-7 classification schemes. The example describes the creation information and semantics associated with a video. The creation information describes a creator who has the role of "publisher." The term "publisher" is a controlled term from an MPEG-7 classification scheme. The semantic information describes a sport depicted in the video. The sport "Baseball" is a controlled term, identified as term 1.3.4 in the sports classification scheme referenced by the corresponding URN.

 <Mpeg7>
   <Description xsi:type="CreationDescriptionType">
     <CreationInformation>
       <Creation>
         <Title type="popular">All Star Game</Title>
         <Creator>
           <Role href="urn:mpeg:mpeg7:cs:RoleCS:2002:PUBLISHER"/>
           <Agent xsi:type="OrganizationType">
             <Name>The Baseball Channel</Name>
           </Agent>
         </Creator>
       </Creation>
     </CreationInformation>
   </Description>
   <Description xsi:type="SemanticDescriptionType">
     <Semantics>
       <Label href="urn:sports:usa:2002:Sports:1.3.4">
         <Name xml:lang="en">Baseball</Name>
       </Label>
       <MediaOccurrence>
         <MediaLocator>
           <MediaUri>video.mpg</MediaUri>
         </MediaLocator>
       </MediaOccurrence>
     </Semantics>
   </Description>
 </Mpeg7>
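Because MPEG-7 descriptions are XML, controlled terms like these can be extracted with any XML toolkit. A minimal sketch using Python's standard library; the embedded fragment is a simplified version of the description above, with the xsi namespace declared so that it parses standalone:

```python
import xml.etree.ElementTree as ET

# Simplified fragment of the semantic description above, with the
# xsi namespace declared so the string is well-formed on its own.
doc = """<Mpeg7 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="SemanticDescriptionType">
    <Semantics>
      <Label href="urn:sports:usa:2002:Sports:1.3.4">
        <Name xml:lang="en">Baseball</Name>
      </Label>
    </Semantics>
  </Description>
</Mpeg7>"""

root = ET.fromstring(doc)
label = root.find(".//Label")        # first Label element in the description
term_urn = label.get("href")         # URN identifying the controlled term
term_name = label.find("Name").text  # human-readable name of the term
print(term_urn, term_name)
```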

2.2.7. Example MDS Descriptions

The following examples illustrate the use of different MPEG-7 Multimedia Description Schemes in describing multimedia content.

2.2.7.1. Creation Information

The following example gives an MPEG-7 description of the creation information for a sports video.

 <Mpeg7>
   <Description xsi:type="CreationDescriptionType">
     <CreationInformation>
       <Creation>
         <Title type="popular">Subway series</Title>
         <Abstract>
           <FreeTextAnnotation>Game among city rivals</FreeTextAnnotation>
         </Abstract>
         <Creator>
           <Role href="urn:mpeg:mpeg7:cs:RoleCS:2001:PUBLISHER"/>
           <Agent xsi:type="OrganizationType">
             <Name>Sports Channel</Name>
           </Agent>
         </Creator>
       </Creation>
     </CreationInformation>
   </Description>
 </Mpeg7>

Figure 26.4: Example showing the temporal decomposition of a video into two shots and the spatio-temporal decomposition of each shot into moving regions.

2.2.7.2. Free Text Annotation

The following example gives an MPEG-7 description of a car that is depicted in an image.

 <Mpeg7>
   <Description xsi:type="SemanticDescriptionType">
     <Semantics>
       <Label>
         <Name>Car</Name>
       </Label>
       <Definition>
         <FreeTextAnnotation>Four wheel motorized vehicle</FreeTextAnnotation>
       </Definition>
       <MediaOccurrence>
         <MediaLocator>
           <MediaUri>image.jpg</MediaUri>
         </MediaLocator>
       </MediaOccurrence>
     </Semantics>
   </Description>
 </Mpeg7>

2.2.7.3. Collection Model

The following example gives an MPEG-7 description of a collection model of "sunsets" that contains two images depicting sunset scenes.

 <Mpeg7>
   <Description xsi:type="ModelDescriptionType">
     <Model xsi:type="CollectionModelType" confidence="0.75"
       reliability="0.5" function="described">
       <Label>
         <Name>Sunsets</Name>
       </Label>
       <Collection xsi:type="ContentCollectionType">
         <Content xsi:type="ImageType">
           <Image>
             <MediaLocator xsi:type="ImageLocatorType">
               <MediaUri>sunset1.jpg</MediaUri>
             </MediaLocator>
           </Image>
         </Content>
         <Content xsi:type="ImageType">
           <Image>
             <MediaLocator xsi:type="ImageLocatorType">
               <MediaUri>sunset2.jpg</MediaUri>
             </MediaLocator>
           </Image>
         </Content>
       </Collection>
     </Model>
   </Description>
 </Mpeg7>

2.2.7.4. Video Segment

The following example gives an MPEG-7 description of the decomposition of a video segment. The video segment is first decomposed temporally into two video segments. The first video segment is decomposed into a single moving region. The second video segment is decomposed into two moving regions.

 <Mpeg7>
   <Description xsi:type="ContentEntityType">
     <MultimediaContent xsi:type="VideoType">
       <Video>
         <TemporalDecomposition gap="false" overlap="false">
           <VideoSegment>
             <SpatioTemporalDecomposition>
               <MovingRegion>
                 <!-- more elements -->
               </MovingRegion>
             </SpatioTemporalDecomposition>
           </VideoSegment>
           <VideoSegment>
             <SpatioTemporalDecomposition>
               <MovingRegion>
                 <!-- more elements -->
               </MovingRegion>
             </SpatioTemporalDecomposition>
             <SpatioTemporalDecomposition>
               <MovingRegion>
                 <!-- more elements -->
               </MovingRegion>
             </SpatioTemporalDecomposition>
           </VideoSegment>
         </TemporalDecomposition>
       </Video>
     </MultimediaContent>
   </Description>
 </Mpeg7>
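The gap="false" and overlap="false" attributes on the TemporalDecomposition assert that the child segments exactly tile the parent segment: no gaps between them and no overlapping intervals. A small sketch of what such a consistency check amounts to, with segments represented as hypothetical (start, end) pairs in media time:

```python
def tiles_exactly(parent, segments):
    """True if the segments cover parent with no gaps and no overlaps."""
    segs = sorted(segments)
    if not segs or segs[0][0] != parent[0] or segs[-1][1] != parent[1]:
        return False
    # Each segment must end exactly where the next one begins.
    return all(e1 == s2 for (_, e1), (s2, _) in zip(segs, segs[1:]))

# Two shots tiling a 100-unit video, as in the decomposition above
print(tiles_exactly((0, 100), [(0, 40), (40, 100)]))  # True
```

A validator for MPEG-7 descriptions could run such a check whenever a decomposition declares gap="false" and overlap="false".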

Figure 26.5: Example image showing two people shaking hands.

2.2.7.5. Semantic Event

The following example gives an MPEG-7 description of the event of a handshake between people. Person A is described as the agent or initiator of the handshake event and Person B is described as the accompanier or joint agent of the handshake.

 <Mpeg7>
   <Description xsi:type="SemanticDescriptionType">
     <Semantics>
       <Label>
         <Name>Shake hands</Name>
       </Label>
       <SemanticBase xsi:type="AgentObjectType" id="A">
         <Label href="urn:example:acs">
           <Name>Person A</Name>
         </Label>
       </SemanticBase>
       <SemanticBase xsi:type="AgentObjectType" id="B">
         <Label href="urn:example:acs">
           <Name>Person B</Name>
         </Label>
       </SemanticBase>
       <SemanticBase xsi:type="EventType">
         <Label>
           <Name>Handshake</Name>
         </Label>
         <Definition>
           <FreeTextAnnotation>
             Clasping of right hands by two people</FreeTextAnnotation>
         </Definition>
         <Relation
           type="urn:mpeg:mpeg7:cs:SemanticRelationCS:2001:agent"
           target="#A"/>
         <Relation
           type="urn:mpeg:mpeg7:cs:SemanticRelationCS:2001:accompanier"
           target="#B"/>
       </SemanticBase>
     </Semantics>
   </Description>
 </Mpeg7>

2.3. Audio Description Tools

The MPEG-7 Audio description tools describe audio data [25]. The audio description tools are categorized as low-level and high-level. The low-level tools describe features of audio segments [24]. The high-level tools describe structure of audio content or provide application-level descriptions of audio.

2.3.1. Low-level Audio Tools

The following low-level audio tools are defined in MPEG-7:

  • Audio Waveform: Describes the audio waveform envelope for display purposes.

  • Audio Power: Describes temporally smoothed instantaneous power, which is equivalent to the square of the waveform values.

  • Audio Spectrum: Describes features such as audio spectrum envelope (spectrum of the audio according to a logarithmic frequency scale), spectrum centroid (center of gravity of the log-frequency power spectrum), spectrum spread (second moment of the log-frequency power spectrum), spectrum flatness (flatness properties of the spectrum of an audio signal within a given number of frequency bands), and spectrum basis (basis functions that are used to project high-dimensional spectrum descriptions into a low-dimensional representation).

  • Harmonicity: Describes the degree of harmonicity of an audio signal.

  • Silence: Describes a perceptual feature of a sound track capturing the fact that no significant sound is occurring.
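As a rough illustration of two of the spectrum features, the centroid and spread can be computed as the first and second moments of a power spectrum. The sketch below uses a plain linear frequency axis for simplicity; the MPEG-7 Descriptors are defined on a logarithmic frequency scale:

```python
import math

def spectrum_centroid_spread(freqs, power):
    """Centroid (first moment) and spread (second moment) of a power spectrum."""
    total = sum(power)
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    spread = math.sqrt(sum((f - centroid) ** 2 * p
                           for f, p in zip(freqs, power)) / total)
    return centroid, spread

# A spectrum symmetric about 200 Hz has its centroid at 200 Hz
c, s = spectrum_centroid_spread([100, 200, 300], [1.0, 2.0, 1.0])
print(c, s)  # 200.0, ~70.7
```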

2.3.2. High-level Audio Tools

The following high-level audio tools are defined in MPEG-7:

  • Audio Signature: Describes a signature extracted from the audio signal that is designed to provide a unique content identifier for purposes of robust identification of the audio signal.

  • Timbre: Describes the perceptual feature of an instrument that makes two sounds having the same pitch and loudness sound different. The Timbre Descriptors relate to notions such as "attack," "brightness," or "richness" of a sound.

  • Sound Recognition and Indexing: Supports applications that involve audio classification and indexing. The tools include Description Schemes for Sound Model and Sound Classification Model, and allow description of finite state models using Description Schemes for Sound Model State Path and Sound Model State Histogram.

  • Spoken Content: Describes the output of an automatic speech recognition (ASR) engine including the lattice and speaker information.

  • Melody: Describes monophonic melodic information that facilitates efficient, robust, and expressive similarity matching of melodies.

2.3.3. Example Audio Descriptions

The following example describes a melody contour of a song:

 <Mpeg7>
   <Description xsi:type="ContentEntityType">
     <MultimediaContent xsi:type="AudioType">
       <Audio>
         <AudioDescriptionScheme xsi:type="MelodyType">
           <Meter>
             <Numerator>3</Numerator>
             <Denominator>4</Denominator>
           </Meter>
           <MelodyContour>
             <Contour>2 -1 -2 1 -1 1 -1</Contour>
             <Beat>1 4 5 7 8 9 9 10</Beat>
           </MelodyContour>
         </AudioDescriptionScheme>
       </Audio>
     </MultimediaContent>
   </Description>
 </Mpeg7>
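The Contour values quantize the direction and size of successive pitch intervals into five levels. The sketch below uses one common 5-level scheme (0 for a repeated note, ±1 for steps of one or two semitones, ±2 for larger leaps); the exact quantization thresholds used by the MPEG-7 MelodyContour are defined in the Audio part of the standard:

```python
def melody_contour(pitches):
    """5-level contour of successive intervals; pitches in semitones (e.g. MIDI)."""
    contour = []
    for a, b in zip(pitches, pitches[1:]):
        step = b - a
        if step == 0:
            contour.append(0)           # repeated note
        elif abs(step) <= 2:
            contour.append(1 if step > 0 else -1)  # small step
        else:
            contour.append(2 if step > 0 else -2)  # leap
    return contour

print(melody_contour([60, 64, 62, 57]))  # [2, -1, -2]
```

Because the contour discards absolute pitch, two performances of the same tune in different keys yield the same contour, which is what makes it useful for robust melody matching.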

The following example describes a continuous hidden Markov model of audio sound effects. Each continuous hidden Markov model has three states and represents a sound effect class. The parameters of the continuous density state model can be estimated via training, for example, using the Baum-Welch algorithm. After training, the continuous HMM consists of a 3x3 state transition matrix, a 3x1 initial state probability vector, and 3 multi-dimensional Gaussian distributions defined in terms of mean and variance parameters. Each multi-dimensional Gaussian distribution has six dimensions corresponding to the audio features described by the AudioSpectrumFlatness D.

 <Mpeg7>
   <Description xsi:type="ModelDescriptionType">
     <Model xsi:type="ContinuousHiddenMarkovModelType" numOfStates="3">
       <Initial mpeg7:dim="3">0.1 0.2 0.1</Initial>
       <Transitions mpeg7:dim="3 3">
         0.2 0.2 0.6 0.1 0.2 0.1 0.4 0.2 0.1</Transitions>
       <State>
         <Label>
           <Name>State 1</Name>
         </Label>
       </State>
       <State>
         <Label>
           <Name>State 2</Name>
         </Label>
       </State>
       <State>
         <Label>
           <Name>State 3</Name>
         </Label>
       </State>
       <DescriptorModel>
         <Descriptor xsi:type="AudioSpectrumFlatnessType"
           loEdge="250" highEdge="1600">
           <Vector>1 2 3 4 5 6</Vector>
         </Descriptor>
         <Field>Vector</Field>
       </DescriptorModel>
       <ObservationDistribution xsi:type="GaussianDistributionType" dim="6">
         <Mean mpeg7:dim="6">0.5 0.5 0.25 0.3 0.5 0.3</Mean>
         <Variance mpeg7:dim="6">0.25 0.75 0.5 0.45 0.75 0.3</Variance>
       </ObservationDistribution>
       <ObservationDistribution xsi:type="GaussianDistributionType" dim="6">
         <Mean mpeg7:dim="6">0.25 0.4 0.25 0.3 0.2 0.1</Mean>
         <Variance mpeg7:dim="6">0.5 0.25 0.5 0.45 0.5 0.2</Variance>
       </ObservationDistribution>
       <ObservationDistribution xsi:type="GaussianDistributionType" dim="6">
         <Mean mpeg7:dim="6">0.2 0.5 0.35 0.3 0.5 0.5</Mean>
         <Variance mpeg7:dim="6">0.5 0.5 0.5 0.5 0.75 0.5</Variance>
       </ObservationDistribution>
     </Model>
   </Description>
 </Mpeg7>
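The observation distributions in this description can be evaluated directly: for a diagonal-covariance Gaussian, the log density is a sum of independent one-dimensional terms. A minimal sketch using the state-1 mean and variance from the description above; a sound classifier would pick the sound-effect model whose HMM assigns the highest overall likelihood to an observed feature sequence:

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance multivariate Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

# State-1 observation distribution from the description above
mean = [0.5, 0.5, 0.25, 0.3, 0.5, 0.3]
var = [0.25, 0.75, 0.5, 0.45, 0.75, 0.3]

# The density is maximized when the observation equals the mean
print(diag_gauss_logpdf(mean, mean, var))
```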

2.4. MPEG-7 Visual Description Tools

The MPEG-7 Visual description tools describe visual data such as images and video. The tools describe features such as color, texture, shape, motion, localization, and faces [5][6][7][8][11][15][16][17][18][21][22][46].

2.4.1. Color

The color description tools describe color information including color spaces and quantization of color spaces. Different color descriptors are provided to describe different features of visual data. The DominantColor D describes a set of dominant colors of an arbitrarily shaped region of an image. The ScalableColor D describes the histogram of colors of an image in HSV color space. The ColorLayout D describes the spatial distribution of colors in an image. The ColorStructure D describes local color structure in an image by means of a structuring element. The GoFGoPColor D describes the color histogram aggregated over multiple images or frames of video.
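The histogram underlying a descriptor like the ScalableColor D can be sketched as a coarse quantization of HSV pixel values; the actual Descriptor then applies a Haar transform to the histogram for scalability. The bin counts and index layout below are illustrative, not the normative MPEG-7 quantization:

```python
def hsv_histogram(pixels, h_bins=4, s_bins=2, v_bins=2):
    """Coarse HSV histogram; h in [0, 360), s and v in [0, 1]."""
    hist = [0] * (h_bins * s_bins * v_bins)
    for h, s, v in pixels:
        # Clamp to the top bin so s == 1.0 or v == 1.0 stays in range
        hi = min(int(h / 360.0 * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist

# Two reddish pixels land in the same bin; one cyan pixel lands elsewhere
print(hsv_histogram([(10, 0.9, 0.8), (15, 0.8, 0.9), (180, 0.5, 0.9)]))
```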

2.4.2. Texture

The HomogeneousTexture D describes texture features of images or regions based on energy of spatial-frequency channels computed using Gabor filters. The TextureBrowsing D describes texture features in terms of regularity, coarseness, and directionality. The EdgeHistogram D describes the spatial distribution of five types of edges in image regions.

2.4.3. Shape

The RegionShape D describes the region-based shape of an object using Angular Radial Transform (ART). The ContourShape D describes a closed contour of a 2D object or region in an image or video based on Curvature Scale Space (CSS) representation. The 3DShape D describes an intrinsic shape description for 3D mesh models based on a shape index value.

2.4.4. Motion

The CameraMotion D describes 3-D camera motion parameters, which includes camera track, boom, and dolly motion modes; and camera pan, tilt and roll motion modes. The MotionTrajectory D describes motion trajectory of a moving object based on spatio-temporal localization of representative trajectory points. The ParametricMotion D describes motion in video sequences including global motion and object motion by describing the evolution of arbitrarily shaped regions over time in terms of a 2-D geometric transform. The MotionActivity D describes the intensity of motion in a video segment.
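The representative points of a MotionTrajectory are interpolated to recover object positions between keypoints. A sketch using linear interpolation (the standard also allows higher-order interpolation); the keypoints are hypothetical (time, x, y) triples:

```python
def position_at(keypoints, t):
    """Linearly interpolate (x, y) at time t from sorted (time, x, y) keypoints."""
    for (t0, x0, y0), (t1, x1, y1) in zip(keypoints, keypoints[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)  # fractional position inside the interval
            return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    raise ValueError("t lies outside the trajectory")

# An object moving right and up, then straight down
track = [(0, 0.0, 0.0), (10, 100.0, 50.0), (20, 100.0, 0.0)]
print(position_at(track, 5))  # (50.0, 25.0)
```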

2.4.5. Localization

The Localization Descriptors describe the location of regions of interest in space and jointly in space and time. The RegionLocator describes the localization of regions using a box or polygon. The SpatioTemporalLocator describes the localization of spatio-temporal regions in a video sequence using a set of reference regions and their motions.

2.4.6. Face

The FaceRecognition D describes the projection of a face vector onto a set of 48 basis vectors that span the space of possible face vectors.
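Such a projection amounts to a set of inner products between the (preprocessed) face vector and the basis vectors. A minimal sketch with a toy basis standing in for the 48 MPEG-7 basis vectors; the vectors here are hypothetical:

```python
def project(face, basis):
    """Coefficients of a face vector on each basis vector (inner products)."""
    return [sum(f * b for f, b in zip(face, vec)) for vec in basis]

# Toy orthonormal 2-vector basis in place of the 48 MPEG-7 basis vectors
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(project([3.0, 4.0, 5.0], basis))  # [3.0, 4.0]
```

Two faces can then be compared by the distance between their coefficient vectors rather than between raw images, which is what makes the Descriptor compact.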

2.4.7. Example Video Descriptions

The following example uses the ScalableColor D to describe a photographic image depicting a sunset.

 <Mpeg7>
   <Description xsi:type="ContentEntityType">
     <MultimediaContent xsi:type="ImageType">
       <Image>
         <MediaLocator>
           <MediaUri>image.jpg</MediaUri>
         </MediaLocator>
         <TextAnnotation>
           <FreeTextAnnotation>Sunset scene</FreeTextAnnotation>
         </TextAnnotation>
         <VisualDescriptor xsi:type="ScalableColorType" numOfCoeff="16"
           numOfBitplanesDiscarded="0">
           <Coeff>1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6</Coeff>
         </VisualDescriptor>
       </Image>
     </MultimediaContent>
   </Description>
 </Mpeg7>

The following example uses the GoFGoPColor D to describe a video segment.

 <VideoSegment>
   <VisualDescriptor xsi:type="GoFGoPColorType" aggregation="Average">
     <ScalableColor numOfCoeff="16" numOfBitplanesDiscarded="0">
       <Coeff>1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6</Coeff>
     </ScalableColor>
   </VisualDescriptor>
 </VideoSegment>

The following example uses the RegionShape D to describe a probability model that characterizes oval shapes.

 <Mpeg7>
   <Description xsi:type="ModelDescriptionType">
     <Model xsi:type="ProbabilityModelClassType" confidence="0.75"
       reliability="0.5">
       <Label relevance="0.75">
         <Name>Ovals</Name>
       </Label>
       <DescriptorModel>
         <Descriptor xsi:type="RegionShapeType">
           <MagnitudeOfART>3 5 2 5 . . . 6</MagnitudeOfART>
         </Descriptor>
         <Field>MagnitudeOfART</Field>
       </DescriptorModel>
       <ProbabilityModel xsi:type="ProbabilityDistributionType"
         confidence="1.0" dim="35">
         <Mean dim="35">4 8 6 9 . . . 5</Mean>
         <Variance dim="35">1.3 2.5 5.0 4.5 . . . 3.2</Variance>
       </ProbabilityModel>
     </Model>
   </Description>
 </Mpeg7>




Handbook of Video Databases: Design and Applications (Internet and Communications)
ISBN: 084937006X
Year: 2003
Pages: 393
