Rich description of video content is essential for effective video indexing and retrieval applications [42]. Video indexing typically involves the following processing steps: (1) shot boundary detection, in which the video is partitioned into temporal units; (2) key-frame selection, in which representative frames are selected from each temporal unit; (3) textual annotation or speech transcription, in which textual information is associated with each temporal unit based on human description or speech transcription; (4) feature description, in which feature descriptors are extracted for each temporal unit, possibly from selected key-frame images (e.g., [35]); and (5) semantic annotation, in which labels are assigned to describe the semantic content of each temporal unit (e.g., [36]). MPEG-7 provides a rich set of description tools for describing these aspects of video content. The following examples show the use of MPEG-7 for video indexing and retrieval.
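The five steps above can be sketched as a simple pipeline. The code below is purely illustrative: the function names are hypothetical, and the "frames" are scalar stand-ins for real frame signatures, not an actual shot-boundary detector.

```python
# Minimal sketch of the five-stage indexing pipeline described above.
# All function bodies are illustrative placeholders, not real detectors.

def detect_shot_boundaries(frames, threshold=0.5):
    """(1) Partition a frame sequence into shots by thresholding the
    difference between consecutive frame 'signatures' (scalars here)."""
    shots, start = [], 0
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > threshold:
            shots.append((start, i))  # [start, end) in frame indices
            start = i
    shots.append((start, len(frames)))
    return shots

def select_key_frame(shot):
    """(2) Pick a representative frame -- here simply the middle one."""
    start, end = shot
    return (start + end) // 2

def index_video(frames):
    """Run stages (1)-(2) and leave slots for annotation (3),
    feature description (4), and semantic labels (5)."""
    return [{"shot": s,
             "key_frame": select_key_frame(s),
             "transcript": None,   # stage (3)
             "features": None,     # stage (4)
             "labels": []}         # stage (5)
            for s in detect_shot_boundaries(frames)]

shots = detect_shot_boundaries([0.0, 0.1, 0.9, 1.0])  # → [(0, 2), (2, 4)]
```

An MPEG-7 description such as the ones below would then be generated from the resulting records.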
The following example describes shot boundary segmentation results for a video using MPEG-7. In this example, the video has been partitioned into five video segments. The MPEG-7 description gives the media time of each segment as a media time point (start) and a media duration, and indicates that there are no gaps or overlaps among the segments.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <MediaTime>
          <MediaTimePoint>T00:00:00:0F30000</MediaTimePoint>
          <MediaDuration>PT10M13S22394N30000F</MediaDuration>
        </MediaTime>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:00:0F30000</MediaTimePoint>
              <MediaDuration>PT2S19079N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:02:19079F30000</MediaTimePoint>
              <MediaDuration>PT3S2092N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:05:21171F30000</MediaTimePoint>
              <MediaDuration>PT4S1121N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:09:22292F30000</MediaTimePoint>
              <MediaDuration>PT2S26086N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:12:18378F30000</MediaTimePoint>
              <MediaDuration>PT9S24294N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
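The time values above use MPEG-7's media time grammar, where a trailing `nnnFNNN` (or `nnnN NNNF` in durations) counts `nnn` fractions of `1/NNN` seconds. A small helper, handling only the basic forms that appear in this example, can convert these values to seconds and verify the `gap="false"`/`overlap="false"` claim:

```python
import re
from fractions import Fraction

def parse_time_point(tp):
    """Parse a basic MediaTimePoint such as 'T00:00:02:19079F30000'
    into seconds; nnnFNNN counts nnn fractions of 1/NNN seconds."""
    m = re.fullmatch(r"T(\d+):(\d+):(\d+):(\d+)F(\d+)", tp)
    h, mi, s, n, f = map(int, m.groups())
    return h * 3600 + mi * 60 + s + Fraction(n, f)

def parse_duration(d):
    """Parse a basic MediaDuration such as 'PT2S19079N30000F'
    (days/hours omitted -- enough for the values in this example)."""
    m = re.fullmatch(r"PT(?:(\d+)M)?(?:(\d+)S)?(?:(\d+)N(\d+)F)?", d)
    mi, s, n, f = (int(g) if g else 0 for g in m.groups())
    return mi * 60 + s + (Fraction(n, f) if f else 0)

# Segment start times and durations from the description above.
segments = [
    ("T00:00:00:0F30000",     "PT2S19079N30000F"),
    ("T00:00:02:19079F30000", "PT3S2092N30000F"),
    ("T00:00:05:21171F30000", "PT4S1121N30000F"),
    ("T00:00:09:22292F30000", "PT2S26086N30000F"),
    ("T00:00:12:18378F30000", "PT9S24294N30000F"),
]

# No gaps, no overlaps: each segment starts exactly where the
# previous one ends.
for (tp1, d1), (tp2, _) in zip(segments, segments[1:]):
    assert parse_time_point(tp1) + parse_duration(d1) == parse_time_point(tp2)
```

Using exact fractions (rather than floats) keeps the 1/30000-second frame arithmetic free of rounding error.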
The following example describes the speech transcription for temporal segments of a video. In this example, the video has been partitioned into three video segments. The MPEG-7 description attaches a free text annotation to each segment, with annotation type "transcript" to indicate that the text results from speech transcription of the video.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>Once upon a time.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>There were three bears.</FreeTextAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:10</MediaTimePoint>
              <MediaDuration>PT1M20S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>They were hungry bears.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
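A retrieval application would typically pull these transcripts back out of the description for full-text search. A minimal sketch with Python's standard `xml.etree.ElementTree` (the `xsi` namespace declaration, omitted in the excerpt above, is added so the fragment parses standalone):

```python
import xml.etree.ElementTree as ET

# Two segments of the transcript description above, with the xsi
# declaration added so the fragment is a self-contained document.
doc = """<Mpeg7 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>Once upon a time.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>There were three bears.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>"""

root = ET.fromstring(doc)
# Collect the transcript text of every video segment, in document order.
transcript = [ann.text.strip()
              for seg in root.iter("VideoSegment")
              for ann in seg.iterfind(
                  "TextAnnotation[@type='transcript']/FreeTextAnnotation")]
print(" ".join(transcript))  # → Once upon a time. There were three bears.
```

Filtering on `type='transcript'` keeps speech text separate from other annotation types (such as the "scene description" and "label" annotations in the later examples).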
The following example uses the ScalableColor D to describe the color distribution of a key-frame image from a video segment. In this example, the key-frame image depicts a sunset.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <SpatioTemporalDecomposition>
              <StillRegion>
                <MediaIncrTimePoint timeUnit="PT1001N30000F">0</MediaIncrTimePoint>
                <SpatialDecomposition>
                  <StillRegion>
                    <TextAnnotation>
                      <FreeTextAnnotation>Sunset scene</FreeTextAnnotation>
                    </TextAnnotation>
                    <VisualDescriptor xsi:type="ScalableColorType"
                        numOfCoeff="16" numOfBitplanesDiscarded="0">
                      <Coeff>1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6</Coeff>
                    </VisualDescriptor>
                  </StillRegion>
                </SpatialDecomposition>
              </StillRegion>
            </SpatioTemporalDecomposition>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
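ScalableColor descriptors are typically matched with the L1 (city-block) distance over their Haar coefficients. A minimal sketch of such a comparison, using the coefficients from the description above against a made-up query image:

```python
# Sketch of ScalableColor matching by L1 distance over the Haar
# coefficient vectors (the usual similarity measure for this D).

def scalable_color_distance(coeff_a, coeff_b):
    """L1 (sum of absolute differences) distance between two
    ScalableColor coefficient vectors of equal length."""
    if len(coeff_a) != len(coeff_b):
        raise ValueError("descriptors must have the same numOfCoeff")
    return sum(abs(a - b) for a, b in zip(coeff_a, coeff_b))

# Coefficients of the sunset key frame from the description above,
# and a hypothetical query image's coefficients (made up here).
key_frame = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6]
query     = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 1, 2, 3, 4, 5, 6]
print(scalable_color_distance(key_frame, query))  # → 1
```

A small distance indicates similar color distributions, so ranking key frames by this distance yields a simple query-by-example search.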
The following example uses the GoFGoPColor D to describe the overall color content of a video segment.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <VisualDescriptor xsi:type="GoFGoPColorType" aggregation="Average">
              <ScalableColor numOfCoeff="16" numOfBitplanesDiscarded="0">
                <Coeff>1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6</Coeff>
              </ScalableColor>
            </VisualDescriptor>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
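With `aggregation="Average"`, the GoFGoPColor D summarizes the group of frames by averaging the per-frame color histograms before the ScalableColor (Haar) coding stage. A simplified sketch of just the averaging step, over hypothetical 4-bin histograms:

```python
# Sketch of the "Average" aggregation of GoFGoPColor: element-wise
# mean of the per-frame color histograms (ScalableColor coding of the
# averaged histogram is omitted here).

def average_aggregation(histograms):
    """Element-wise average of per-frame histograms of equal length."""
    n = len(histograms)
    return [sum(col) / n for col in zip(*histograms)]

frames = [
    [4, 0, 2, 2],  # hypothetical 4-bin histograms of three frames
    [2, 2, 2, 2],
    [0, 4, 2, 2],
]
print(average_aggregation(frames))  # → [2.0, 2.0, 2.0, 2.0]
```

The averaged histogram is then encoded exactly like a single-image ScalableColor descriptor, which is why the `ScalableColor` element appears nested inside the `GoFGoPColorType` descriptor above.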
The following example describes the semantics of temporal segments of video using MPEG-7. This example shows a single temporal segment. The MPEG-7 description attaches free text annotations to the segment in which the annotation type is "scene description" to indicate that the text describes the contents of the video scene.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition>
          <VideoSegment>
            <TextAnnotation type="scene description" relevance="1" confidence="1">
              <FreeTextAnnotation>Sky</FreeTextAnnotation>
              <FreeTextAnnotation>Water_Body</FreeTextAnnotation>
              <FreeTextAnnotation>Boat</FreeTextAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:00:0F30000</MediaTimePoint>
              <MediaIncrDuration timeUnit="PT1001N30000F">486</MediaIncrDuration>
            </MediaTime>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
The following example describes the semantics of temporal segments of video using MPEG-7 in which there is some uncertainty in the assigned semantic labels. This uncertainty may result from the use of automatic methods for labeling the content [36]. The example shows three temporal segments. The MPEG-7 description gives a keyword annotation for each segment, together with confidence and relevance attributes; the keywords may belong to a set of controlled terms defined by an MPEG-7 classification scheme. In this example, the first video segment has been labeled "Face" with a confidence of 0.5, the second "Outdoors" with a confidence of 0.2, and the third "Indoors" with a confidence of 0.9.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <MediaTime>
          <MediaTimePoint>T00:00:00</MediaTimePoint>
          <MediaDuration>PT1M45S</MediaDuration>
        </MediaTime>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation confidence="0.5" relevance="0.75" type="label">
              <KeywordAnnotation>
                <Keyword>Face</Keyword>
              </KeywordAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:00</MediaTimePoint>
              <MediaDuration>PT0M10S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.2" relevance="0.5" type="label">
              <KeywordAnnotation>
                <Keyword>Outdoors</Keyword>
              </KeywordAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:10</MediaTimePoint>
              <MediaDuration>PT0M20S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.9" relevance="1.0" type="label">
              <KeywordAnnotation>
                <Keyword>Indoors</Keyword>
              </KeywordAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:30</MediaTimePoint>
              <MediaDuration>PT1M15S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <!-- insert more video shots and labels -->
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
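A retrieval system consuming such a description might index only labels whose confidence meets a threshold. A minimal sketch over a reduced copy of the description above (the `xsi` declaration, omitted in the excerpt, is added so the fragment parses standalone):

```python
import xml.etree.ElementTree as ET

# The keyword labels with confidences from the description above,
# reduced to the TextAnnotation content needed for this sketch.
doc = """<Mpeg7 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation confidence="0.5" relevance="0.75" type="label">
              <KeywordAnnotation><Keyword>Face</Keyword></KeywordAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.2" relevance="0.5" type="label">
              <KeywordAnnotation><Keyword>Outdoors</Keyword></KeywordAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.9" relevance="1.0" type="label">
              <KeywordAnnotation><Keyword>Indoors</Keyword></KeywordAnnotation>
            </TextAnnotation>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>"""

def confident_labels(xml_text, threshold):
    """Return (keyword, confidence) pairs whose confidence meets the
    threshold -- the labels a retrieval system would actually index."""
    root = ET.fromstring(xml_text)
    out = []
    for ann in root.iter("TextAnnotation"):
        conf = float(ann.get("confidence", "1.0"))
        for kw in ann.iterfind("KeywordAnnotation/Keyword"):
            if conf >= threshold:
                out.append((kw.text, conf))
    return out

print(confident_labels(doc, 0.5))  # → [('Face', 0.5), ('Indoors', 0.9)]
```

Here the low-confidence "Outdoors" label (0.2) is dropped, while "Face" and "Indoors" survive the 0.5 threshold; the `relevance` attribute could be weighted in similarly when ranking results.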