2.8 Audio Part of MPEG-7

The audio part of the MPEG-7 standard relies on two basic structures: the AudioSegment DS, which is derived from the Segment DS and allows the temporal structure of an audio signal to be defined, and a set of low-level descriptors.

The elements provided by the AudioSegmentType are shown in Exhibit 2.17. They resemble those of the VideoSegment with respect to the SegmentDecomposition and the definition of the time interval, but differ in the way low-level descriptors are defined. Note that a tool is also proposed for combined AV content, the AudioVisualSegment DS, which in principle combines the definitions of the Video- and AudioSegment. It relies on the AudioVisualRegion DS to describe an arbitrary spatio-temporal region of AV content.

Exhibit 2.17: AudioSegmentType as subtype of the SegmentType, and their elements.

MPEG-7 audio distinguishes two classes of structures: the generic audio description framework and the application-related tools. The first class defines generic descriptions that may be built for any signal. It includes the so-called Low-Level Audio Descriptors (LLDs), the scalable series scheme, and the silence segment. The second class includes sound recognition, instrumental timbre description, spoken content description, and melody description tools. These descriptors are referred to as High-Level Audio Descriptors (HLDs).

For instance, the Melody DS describes melody as a sequence of pitches or contour values, plus some information about scale, meter, beat, and key. The Melody DS includes tools (MelodyContour DS) for melody contour representation and tools (MelodySequence DS) for a more complete melody representation. Both tools support matching between melodies.

Exhibit 2.18 shows a small melody containing four notes (three intervals) with a time signature of 4/4. The following MPEG-7 document extract describes the melody using the MelodyContour DS.

Exhibit 2.18: Small melody containing four notes.

 <!-- MelodyContour description of -->
 <!-- (3 intervals=4 notes) -->
 <AudioDescriptionScheme xsi:type="MelodyType">
   <Meter>
     <Numerator>4</Numerator>
     <Denominator>4</Denominator>
   </Meter>
   <MelodyContour>
     <Contour>-2 2 -1</Contour>
     <Beat>1 1 1 1</Beat>
   </MelodyContour>
 </AudioDescriptionScheme>

The Meter element describes the time signature, and the MelodyContour specifies the contour and the beat. The contour has as many values as there are intervals in the melody. Each value ranges from -2 to +2 and represents the interval change, measured in cents, quantized to five levels. The beat has as many values as there are notes. The first note of a melody is labeled with its beat truncated to a whole beat. Successive notes increment the beat number according to their position relative to the first note. The MelodySequence DS avoids all quantization and therefore provides a more verbose, but also more precise, description.
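The five-level quantization of the contour can be illustrated with a short Python sketch. It is not taken from the standard's text: the function names are hypothetical, and the boundaries at ±50 and ±250 cents are an assumption based on the commonly cited definition of the contour levels; the normative values are given in ISO/IEC 15938-4. The note numbers are a hypothetical four-note melody chosen for illustration, not the notes of Exhibit 2.18.

```python
def quantize_interval(cents):
    """Map an interval change in cents to a five-level contour value."""
    # Assumed boundaries: +/-50 cents and +/-250 cents.
    if cents <= -250:
        return -2
    if cents <= -50:
        return -1
    if cents < 50:
        return 0
    if cents < 250:
        return 1
    return 2


def melody_contour(midi_pitches):
    """Return one contour value per interval between successive notes."""
    return [
        quantize_interval(100.0 * (nxt - cur))  # one semitone = 100 cents
        for cur, nxt in zip(midi_pitches, midi_pitches[1:])
    ]


# Hypothetical four-note melody as MIDI note numbers (G4, E4, G4, F4).
notes = [67, 64, 67, 65]
contour = melody_contour(notes)
print(contour)  # [-2, 2, -1]
```

Four notes give three contour values, which is why the Contour element above holds three entries while the Beat element holds four.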

A simple and robust descriptor for audio retrieval is the AudioSignature DS. It summarizes the spectral flatness as a unique content identifier for an audio signal. The spectral flatness expresses the deviation of the audio signal's power spectrum over frequency from a flat shape (corresponding to a noiselike or an impulselike signal). A high deviation from a flat shape may indicate the presence of tonal components. The AudioSignature DS summarizes the spectral flatness frame by frame, and the mean and variance values of the summarized flatness are retained. The spectral flatness is calculated for a number of frequency bands, which leads to a feature vector for robust matching between pairs of audio signals. AudioSignatures of two signals must have common band definitions to be compared.
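To make the idea concrete, the following Python sketch computes a flatness value for one band as the ratio of the geometric mean to the arithmetic mean of its power-spectrum coefficients, then summarizes a run of frames by mean and variance per band. It is a minimal illustration under stated assumptions, not the extraction procedure of the standard: the function names are hypothetical, the band layout and windowing are omitted, and the Euclidean distance used for matching is merely one plausible choice.

```python
import math


def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of power-spectrum values."""
    if min(power) <= 0.0:
        # Assumption: a band containing a zero coefficient is reported as 0.
        return 0.0
    n = len(power)
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return geometric / arithmetic


def summarize(frames):
    """Per-band mean and (population) variance over a list of frames."""
    n = len(frames)
    bands = range(len(frames[0]))
    means = [sum(f[b] for f in frames) / n for b in bands]
    variances = [sum((f[b] - means[b]) ** 2 for f in frames) / n for b in bands]
    return means, variances


def signature_distance(sig_a, sig_b):
    """Euclidean distance; both vectors must use the same band definitions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures need common band definitions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


flat_band = [1.0, 1.0, 1.0, 1.0]       # noiselike: flatness close to 1
tonal_band = [100.0, 0.01, 0.01, 0.01]  # one dominant peak: flatness near 0
print(spectral_flatness(flat_band), spectral_flatness(tonal_band))
```

A flat band yields a value close to 1, whereas a band dominated by a single spectral peak yields a value close to 0, which is the tonal case mentioned above.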

Example: The AudioSignature DS and its extraction mechanisms are the base technology for the music search scenario described in Chapter 1, which is reconsidered in Section 2.2. The following audio description shows what an MPEG-7 document using the AudioSegmentType and the AudioSignatureType looks like. The sample is a 3.5-second-long Wave (wav) encoded audio clip. The description was generated with the MPEG-7 Audio-Encoder provided by the Institut für Nachrichtentechnik, RWTH Aachen, which is available, for MPEG-7 developers only, at http://www.ient.rwth-aachen.de/forschung/mpeg7audio/.

 <Mpeg7>
   <Description xsi:type="ContentEntityType">
     <MultimediaContent xsi:type="AudioType">
       <Audio xsi:type="AudioSegmentType">
         <MediaTime>
           <MediaTimePoint>T00:00:00</MediaTimePoint>
           <MediaDuration>PT3S500N1000F</MediaDuration>
         </MediaTime>
         <AudioDescriptionScheme xsi:type="AudioSignatureType">
           <Flatness hiEdge="4000.0" loEdge="250.0">
             <SeriesOfVector hopSize="PT30N1000F" vectorSize="16" totalNumOfSamples="96">
               <Scaling ratio="32" numOfElements="3"/>
               <Mean mpeg7:dim="3 16">
                 0.730755 0.763238 0.610380 0.557224 0.630316 0.558102 0.500190 0.375704
                 0.459458 0.355976 0.438826 0.449985 0.661204 0.581780 0.395361 0.601645
                 0.756943 0.748773 0.675011 0.697741 0.648728 0.640419 0.467208 0.365971
                 0.376813 0.578655 0.610981 0.668466 0.727113 0.586444 0.490055 0.657457
                 0.741402 0.741298 0.710179 0.695944 0.678572 0.663125 0.599361 0.569031
                 0.578450 0.629878 0.626648 0.665652 0.749214 0.737452 0.628266 0.692530
               </Mean>
               <Variance mpeg7:dim="3 16">
                 0.039768 0.031566 0.029853 0.020935 0.032923 0.049108 0.031126 0.035234
                 0.045049 0.050080 0.056384 0.066359 0.019457 0.035216 0.083071 0.011095
                 0.039203 0.029778 0.040931 0.026889 0.040051 0.017069 0.018748 0.023640
                 0.020661 0.013789 0.013072 0.011485 0.016243 0.032908 0.040397 0.015138
                 0.041668 0.020187 0.013026 0.021801 0.032787 0.036691 0.022090 0.023869
                 0.016559 0.010244 0.010880 0.011096 0.014046 0.011034 0.016143 0.009134
               </Variance>
             </SeriesOfVector>
           </Flatness>
         </AudioDescriptionScheme>
       </Audio>
     </MultimediaContent>
   </Description>
 </Mpeg7>
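The time and dimension values in this description can be read as follows: in a duration such as PT3S500N1000F, the N part counts fractions of a second and the F part gives the number of fractions per second, so the clip lasts 3 + 500/1000 = 3.5 seconds and the hop size PT30N1000F is 30 ms. The Scaling element states that 3 summarized elements each cover 32 original samples (3 × 32 = 96), which matches the three rows of 16 band values in Mean and Variance. The Python sketch below illustrates this arithmetic. Its function names are hypothetical, and the regular expression covers only the hour, minute, second, and fraction parts used here, not the full duration syntax of the standard.

```python
import re

# Simplified pattern: optional hours, minutes, seconds, then nN and mF parts.
_DURATION = re.compile(
    r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?(?:(\d+)N)?(?:(\d+)F)?$"
)


def media_duration_seconds(text):
    """Convert a duration such as 'PT3S500N1000F' to seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError("unsupported duration: " + text)
    hours, minutes, seconds, count, per_second = (
        int(g) if g is not None else 0 for g in match.groups()
    )
    total = 3600 * hours + 60 * minutes + seconds
    if count:
        if per_second == 0:
            raise ValueError("fraction count given without fractions per second")
        total += count / per_second
    return total


def reshape(values, rows, cols):
    """Split a flat list into rows, as described by a dim attribute."""
    if len(values) != rows * cols:
        raise ValueError("value count does not match dimensions")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


clip_seconds = media_duration_seconds("PT3S500N1000F")  # 3.5
hop_seconds = media_duration_seconds("PT30N1000F")      # 0.03
summarized_rows = 96 // 32                              # 3 rows of 16 bands
print(clip_seconds, hop_seconds, summarized_rows)
```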

Precise details of all AudioDescriptor(Scheme)Types can be found in the audio specification, ISO/IEC 15938-4.
