Rich description of video content is essential for effective video indexing and retrieval applications [42]. Video indexing typically involves the following processing steps: (1) shot boundary detection, in which the video is partitioned into temporal units; (2) key-frame selection, in which representative frames are selected from each temporal unit; (3) textual annotation or speech transcription, in which textual information is associated with each temporal unit based on human description or speech transcription; (4) feature description, in which feature descriptors are extracted for each temporal unit, possibly from selected key-frame images (e.g., [35]); and (5) semantic annotation, in which labels are assigned to describe the semantic content of each temporal unit (e.g., [36]). MPEG-7 provides a rich set of description tools for describing these aspects of video content. The following examples show the use of MPEG-7 for video indexing and retrieval.
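The five steps above can be sketched as a simple pipeline. The code below is purely illustrative: the function names are hypothetical, and the "frames" are scalar stand-ins for real frame signatures, not an actual shot-boundary detector.

```python
# Minimal sketch of the five-stage indexing pipeline described above.
# All function bodies are illustrative placeholders, not real detectors.

def detect_shot_boundaries(frames, threshold=0.5):
    """(1) Partition a frame sequence into shots by thresholding the
    difference between consecutive frame 'signatures' (scalars here)."""
    shots, start = [], 0
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > threshold:
            shots.append((start, i))  # [start, end) in frame indices
            start = i
    shots.append((start, len(frames)))
    return shots

def select_key_frame(shot):
    """(2) Pick a representative frame -- here simply the middle one."""
    start, end = shot
    return (start + end) // 2

def index_video(frames):
    """Run stages (1)-(2) and leave slots for annotation (3),
    feature description (4), and semantic labels (5)."""
    return [{"shot": s,
             "key_frame": select_key_frame(s),
             "transcript": None,   # stage (3)
             "features": None,     # stage (4)
             "labels": []}         # stage (5)
            for s in detect_shot_boundaries(frames)]

shots = detect_shot_boundaries([0.0, 0.1, 0.9, 1.0])  # → [(0, 2), (2, 4)]
```

An MPEG-7 description such as the ones below would then be generated from the resulting records.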
The following example describes shot boundary segmentation results for a video using MPEG-7. In this example, the video has been partitioned into five video segments. The MPEG-7 description gives the media time of each segment as a media time point (start) and a media duration, and indicates that there are no gaps or overlaps among the segments.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <MediaTime>
          <MediaTimePoint>T00:00:00:0F30000</MediaTimePoint>
          <MediaDuration>PT10M13S22394N30000F</MediaDuration>
        </MediaTime>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:00:0F30000</MediaTimePoint>
              <MediaDuration>PT2S19079N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:02:19079F30000</MediaTimePoint>
              <MediaDuration>PT3S2092N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:05:21171F30000</MediaTimePoint>
              <MediaDuration>PT4S1121N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:09:22292F30000</MediaTimePoint>
              <MediaDuration>PT2S26086N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <MediaTime>
              <MediaTimePoint>T00:00:12:18378F30000</MediaTimePoint>
              <MediaDuration>PT9S24294N30000F</MediaDuration>
            </MediaTime>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
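The time values above use MPEG-7's media time grammar, where a trailing `nnnFNNN` (or `nnnN NNNF` in durations) counts `nnn` fractions of `1/NNN` seconds. A small helper, handling only the basic forms that appear in this example, can convert these values to seconds and verify the `gap="false"`/`overlap="false"` claim:

```python
import re
from fractions import Fraction

def parse_time_point(tp):
    """Parse a basic MediaTimePoint such as 'T00:00:02:19079F30000'
    into seconds; nnnFNNN counts nnn fractions of 1/NNN seconds."""
    m = re.fullmatch(r"T(\d+):(\d+):(\d+):(\d+)F(\d+)", tp)
    h, mi, s, n, f = map(int, m.groups())
    return h * 3600 + mi * 60 + s + Fraction(n, f)

def parse_duration(d):
    """Parse a basic MediaDuration such as 'PT2S19079N30000F'
    (days/hours omitted -- enough for the values in this example)."""
    m = re.fullmatch(r"PT(?:(\d+)M)?(?:(\d+)S)?(?:(\d+)N(\d+)F)?", d)
    mi, s, n, f = (int(g) if g else 0 for g in m.groups())
    return mi * 60 + s + (Fraction(n, f) if f else 0)

# Segment start times and durations from the description above.
segments = [
    ("T00:00:00:0F30000",     "PT2S19079N30000F"),
    ("T00:00:02:19079F30000", "PT3S2092N30000F"),
    ("T00:00:05:21171F30000", "PT4S1121N30000F"),
    ("T00:00:09:22292F30000", "PT2S26086N30000F"),
    ("T00:00:12:18378F30000", "PT9S24294N30000F"),
]

# No gaps, no overlaps: each segment starts exactly where the
# previous one ends.
for (tp1, d1), (tp2, _) in zip(segments, segments[1:]):
    assert parse_time_point(tp1) + parse_duration(d1) == parse_time_point(tp2)
```

Using exact fractions (rather than floats) keeps the 1/30000-second frame arithmetic free of rounding error.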
The following example describes the speech transcription for temporal segments of a video. In this example, the video has been partitioned into three video segments. The MPEG-7 description attaches a free text annotation to each segment, with annotation type "transcript" to indicate that the text results from speech transcription of the video.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>Once upon a time.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>There were three bears.</FreeTextAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:10</MediaTimePoint>
              <MediaDuration>PT1M20S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>They were hungry bears.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
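A retrieval application would typically pull these transcripts back out of the description for full-text search. A minimal sketch with Python's standard `xml.etree.ElementTree` (the `xsi` namespace declaration, omitted in the excerpt above, is added so the fragment parses standalone):

```python
import xml.etree.ElementTree as ET

# Two segments of the transcript description above, with the xsi
# declaration added so the fragment is a self-contained document.
doc = """<Mpeg7 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>Once upon a time.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation type="transcript">
              <FreeTextAnnotation>There were three bears.</FreeTextAnnotation>
            </TextAnnotation>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>"""

root = ET.fromstring(doc)
# Collect the transcript text of every video segment, in document order.
transcript = [ann.text.strip()
              for seg in root.iter("VideoSegment")
              for ann in seg.iterfind(
                  "TextAnnotation[@type='transcript']/FreeTextAnnotation")]
print(" ".join(transcript))  # → Once upon a time. There were three bears.
```

Filtering on `type='transcript'` keeps speech text separate from other annotation types (such as the "scene description" and "label" annotations in the later examples).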
The following example uses the ScalableColor D to describe the color distribution of a key-frame image from a video segment. In this example, the key-frame image depicts a sunset.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <SpatioTemporalDecomposition>
              <StillRegion>
                <MediaIncrTimePoint timeUnit="PT1001N30000F">0</MediaIncrTimePoint>
                <SpatialDecomposition>
                  <StillRegion>
                    <TextAnnotation>
                      <FreeTextAnnotation>Sunset scene</FreeTextAnnotation>
                    </TextAnnotation>
                    <VisualDescriptor xsi:type="ScalableColorType"
                        numOfCoeff="16" numOfBitplanesDiscarded="0">
                      <Coeff>1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6</Coeff>
                    </VisualDescriptor>
                  </StillRegion>
                </SpatialDecomposition>
              </StillRegion>
            </SpatioTemporalDecomposition>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
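ScalableColor descriptors are typically matched with the L1 (city-block) distance over their Haar coefficients. A minimal sketch of such a comparison, using the coefficients from the description above against a made-up query image:

```python
# Sketch of ScalableColor matching by L1 distance over the Haar
# coefficient vectors (the usual similarity measure for this D).

def scalable_color_distance(coeff_a, coeff_b):
    """L1 (sum of absolute differences) distance between two
    ScalableColor coefficient vectors of equal length."""
    if len(coeff_a) != len(coeff_b):
        raise ValueError("descriptors must have the same numOfCoeff")
    return sum(abs(a - b) for a, b in zip(coeff_a, coeff_b))

# Coefficients of the sunset key frame from the description above,
# and a hypothetical query image's coefficients (made up here).
key_frame = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6]
query     = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 1, 2, 3, 4, 5, 6]
print(scalable_color_distance(key_frame, query))  # → 1
```

A small distance indicates similar color distributions, so ranking key frames by this distance yields a simple query-by-example search.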
The following example uses the GoFGoPColor D to describe the overall color content of a video segment.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <VisualDescriptor xsi:type="GoFGoPColorType" aggregation="Average">
              <ScalableColor numOfCoeff="16" numOfBitplanesDiscarded="0">
                <Coeff>1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6</Coeff>
              </ScalableColor>
            </VisualDescriptor>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
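With `aggregation="Average"`, the GoFGoPColor D summarizes the group of frames by averaging the per-frame color histograms before the ScalableColor (Haar) coding stage. A simplified sketch of just the averaging step, over hypothetical 4-bin histograms:

```python
# Sketch of the "Average" aggregation of GoFGoPColor: element-wise
# mean of the per-frame color histograms (ScalableColor coding of the
# averaged histogram is omitted here).

def average_aggregation(histograms):
    """Element-wise average of per-frame histograms of equal length."""
    n = len(histograms)
    return [sum(col) / n for col in zip(*histograms)]

frames = [
    [4, 0, 2, 2],  # hypothetical 4-bin histograms of three frames
    [2, 2, 2, 2],
    [0, 4, 2, 2],
]
print(average_aggregation(frames))  # → [2.0, 2.0, 2.0, 2.0]
```

The averaged histogram is then encoded exactly like a single-image ScalableColor descriptor, which is why the `ScalableColor` element appears nested inside the `GoFGoPColorType` descriptor above.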
The following example describes the semantics of temporal segments of video using MPEG-7. This example shows a single temporal segment. The MPEG-7 description attaches free text annotations to the segment in which the annotation type is "scene description" to indicate that the text describes the contents of the video scene.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition>
          <VideoSegment>
            <TextAnnotation type="scene description" relevance="1" confidence="1">
              <FreeTextAnnotation>Sky</FreeTextAnnotation>
              <FreeTextAnnotation>Water_Body</FreeTextAnnotation>
              <FreeTextAnnotation>Boat</FreeTextAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:00:0F30000</MediaTimePoint>
              <MediaIncrDuration timeUnit="PT1001N30000F">486</MediaIncrDuration>
            </MediaTime>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
The following example describes the semantics of temporal segments of video using MPEG-7 in which there is some uncertainty in the assigned semantic labels. This uncertainty may result from the use of automatic methods for labeling the content [36]. The example shows three temporal segments. The MPEG-7 description gives a keyword annotation for each segment, together with confidence and relevance attributes; the keywords may belong to a set of controlled terms defined by an MPEG-7 classification scheme. In this example, the first video segment has been labeled "Face" with a confidence of 0.5, the second "Outdoors" with a confidence of 0.2, and the third "Indoors" with a confidence of 0.9.
<Mpeg7>
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <MediaTime>
          <MediaTimePoint>T00:00:00</MediaTimePoint>
          <MediaDuration>PT1M45S</MediaDuration>
        </MediaTime>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation confidence="0.5" relevance="0.75" type="label">
              <KeywordAnnotation>
                <Keyword>Face</Keyword>
              </KeywordAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:00</MediaTimePoint>
              <MediaDuration>PT0M10S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.2" relevance="0.5" type="label">
              <KeywordAnnotation>
                <Keyword>Outdoors</Keyword>
              </KeywordAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:10</MediaTimePoint>
              <MediaDuration>PT0M20S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.9" relevance="1.0" type="label">
              <KeywordAnnotation>
                <Keyword>Indoors</Keyword>
              </KeywordAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:30</MediaTimePoint>
              <MediaDuration>PT1M15S</MediaDuration>
            </MediaTime>
          </VideoSegment>
          <!-- insert more video shots and labels -->
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
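A retrieval system consuming such a description might index only labels whose confidence meets a threshold. A minimal sketch over a reduced copy of the description above (the `xsi` declaration, omitted in the excerpt, is added so the fragment parses standalone):

```python
import xml.etree.ElementTree as ET

# The keyword labels with confidences from the description above,
# reduced to the TextAnnotation content needed for this sketch.
doc = """<Mpeg7 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment>
            <TextAnnotation confidence="0.5" relevance="0.75" type="label">
              <KeywordAnnotation><Keyword>Face</Keyword></KeywordAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.2" relevance="0.5" type="label">
              <KeywordAnnotation><Keyword>Outdoors</Keyword></KeywordAnnotation>
            </TextAnnotation>
          </VideoSegment>
          <VideoSegment>
            <TextAnnotation confidence="0.9" relevance="1.0" type="label">
              <KeywordAnnotation><Keyword>Indoors</Keyword></KeywordAnnotation>
            </TextAnnotation>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>"""

def confident_labels(xml_text, threshold):
    """Return (keyword, confidence) pairs whose confidence meets the
    threshold -- the labels a retrieval system would actually index."""
    root = ET.fromstring(xml_text)
    out = []
    for ann in root.iter("TextAnnotation"):
        conf = float(ann.get("confidence", "1.0"))
        for kw in ann.iterfind("KeywordAnnotation/Keyword"):
            if conf >= threshold:
                out.append((kw.text, conf))
    return out

print(confident_labels(doc, 0.5))  # → [('Face', 0.5), ('Indoors', 0.9)]
```

Here the low-confidence "Outdoors" label (0.2) is dropped, while "Face" and "Indoors" survive the 0.5 threshold; the `relevance` attribute could be weighted in similarly when ranking results.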