Audio Indexing Algorithms | Handbook of Video Databases: Design and Applications (Internet and Communications)

The first step in an audio content analysis task is to parse an audio stream into segments such that the content within each segment is homogeneous. The segmentation criteria are determined by the specific domain. In a phone conversation, the segment boundaries correspond to speaker turns; in a TV broadcast, the segments may be in-studio reporting, live reporting, and commercials. Depending on the application, different tasks follow the segmentation stage. One important task is the classification of a segment into some predefined category, which can be high level (an opera performance in the Metropolitan Opera House), middle level (a music performance), or low level (a clip in which audio is dominated by music). Such semantic-level classification is key to generating audio indexes. Beyond such labelled indexes, some audio descriptors may also be useful as low-level indexes, so that a user can retrieve an audio clip that is aurally similar to an example clip. Finally, audio summarization is essential in building an audio retrieval system, because it enables a user to quickly skim through a large set of items retrieved in response to a query.

In this section, we first introduce some effective audio features that represent audio characteristics well, and then present audio segmentation and categorization methods. We briefly discuss the audio summarization task at the end of the section.

1.7 Audio Feature Extraction

There are many features that can be used to characterize audio signals. Usually audio features are extracted at two levels: the short-term frame level and the long-term clip level. Here a frame is defined as a group of neighboring samples that lasts about 10 to 40 milliseconds (ms), within which we can assume that the audio signal is stationary and short-term features such as volume and Fourier transform coefficients can be extracted. The concept of an audio frame comes from traditional speech signal processing, where analysis over a very short time interval has been found to be most appropriate.

For a feature to reveal the semantic meaning of an audio signal, analysis over a much longer period is necessary, usually from one second to several tens of seconds. Here we call such an interval an audio clip. ^[1] A clip consists of a sequence of frames, and clip-level features usually characterize how frame-level features change over a clip. The clip boundaries may be the result of audio segmentation such that the content within each clip is similar. Alternatively, fixed-length clips, usually 2 to 3 seconds (s), may be used. Both frames and clips may overlap with their predecessors, and the overlapping lengths depend on the underlying application. Figure 20.1 illustrates the relation between frames and clips. In the following, we first describe frame-level features, and then move on to clip-level features.

Figure 20.1: Decomposition of an audio signal into clips and frames.

1.7.1 Frame-level features

Most of the frame-level features are inherited from traditional speech signal processing. Generally they can be separated into two categories: time-domain features, which are computed from the audio waveforms directly, and frequency-domain features, which are derived from the Fourier transform of samples over a frame. In the following, we use N to denote the frame length, and s_n(i) to denote the i-th sample in the n-th audio frame.

Volume

The most widely used and easy-to-compute frame feature is volume. ^[2] Volume is a reliable indicator for silence detection, which may help to segment an audio sequence and to determine clip boundaries. Normally volume is approximated by the root mean square (RMS) ^[3] of the signal magnitude within each frame. Specifically, the volume of frame n is calculated by

$$v(n) = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} s_n^2(i)}$$

Note that the volume of an audio signal depends on the gain value of the recording and digitizing device. To eliminate the influence of such device-dependent conditions, we may normalize the volume for a frame by the maximum volume of some previous frames.
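The RMS volume and the normalization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `history` parameter (how many previous frames to look back over) are assumptions, and the current frame is included in the maximum so that the normalized value stays at or below one.

```python
import math

def rms_volume(frame):
    """Root mean square of the samples in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def normalized_volumes(frames, history=50):
    """Divide each frame's volume by the largest volume seen over the
    current frame and up to `history` previous frames (assumed window)."""
    volumes = [rms_volume(f) for f in frames]
    normalized = []
    for n, v in enumerate(volumes):
        peak = max(volumes[max(0, n - history):n + 1])
        normalized.append(v / peak if peak > 0 else 0.0)
    return normalized
```

Because every value is a ratio to a recent maximum, a uniform change in recording gain cancels out.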

Zero crossing rate

Besides the volume, zero crossing rate (ZCR) is another widely used temporal feature. To compute the ZCR of a frame, we count the number of times that the audio waveform crosses the zero axis. Formally,

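$$Z(n) = \frac{1}{2} \left[ \sum_{i=1}^{N-1} \left| \operatorname{sgn}(s_n(i)) - \operatorname{sgn}(s_n(i-1)) \right| \right] \frac{f_s}{N}$$

where sgn(·) is the sign function and f_s represents the sampling rate. ZCR is one of the most indicative and robust measures to discern unvoiced speech. Typically, unvoiced speech has a low volume but a high ZCR. By using ZCR and volume together, one can prevent low-energy unvoiced speech frames from being classified as silent.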
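A short sketch of the zero-crossing count and of a silence test that combines it with volume is given below. The thresholds `vol_thresh` and `zcr_thresh` are hypothetical placeholders chosen only for illustration. A sample of exactly zero is treated as positive, which is also an assumption.

```python
import math

def sign(x):
    # Convention assumed here: zero counts as positive.
    return 1 if x >= 0 else -1

def zero_crossing_rate(frame, fs):
    """Zero crossings per second for one frame sampled at fs Hz."""
    changes = sum(abs(sign(frame[i]) - sign(frame[i - 1]))
                  for i in range(1, len(frame)))
    return 0.5 * changes * fs / len(frame)

def is_silent(frame, fs, vol_thresh=0.01, zcr_thresh=3000.0):
    """A frame is called silent only if BOTH volume and ZCR are low, so
    quiet but noisy (unvoiced) speech frames are not discarded."""
    volume = math.sqrt(sum(x * x for x in frame) / len(frame))
    return volume < vol_thresh and zero_crossing_rate(frame, fs) < zcr_thresh
```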

Pitch

Pitch is the fundamental frequency of an audio waveform, and is an important parameter in the analysis and synthesis of speech and music. Normally only voiced speech and harmonic music have well-defined pitch, but we can still use pitch as a low-level feature to characterize the fundamental frequency of any audio signal. The typical pitch frequency of the human voice is between 50 Hz and 450 Hz, whereas the pitch range for music is much wider. It is difficult to estimate the pitch of an audio signal robustly and reliably. Depending on the required accuracy and complexity constraints, different methods for pitch estimation can be applied.

One can extract pitch information by using either temporal or frequency analysis. Temporal estimation methods rely on computation of the short time auto-correlation function R_n(l) or average magnitude difference function (AMDF) A_n(l), where

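$$R_n(l) = \sum_{i=0}^{N-l-1} s_n(i)\, s_n(i+l), \qquad A_n(l) = \sum_{i=0}^{N-l-1} \left| s_n(i+l) - s_n(i) \right|$$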

For typical voiced speech, there exist periodic peaks in the auto-correlation function. Similarly, there are periodic valleys in the AMDF. Here peaks and valleys are defined as local extrema that satisfy additional constraints in terms of their values relative to the global extremum and their curvatures. Such peaks or valleys exist in voiced and music frames, and they vanish in noise or unvoiced frames.
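As a rough illustration of the temporal approach, the sketch below picks the lag with the largest auto-correlation inside an assumed pitch range and rejects the frame when that peak is weak relative to the zero-lag value. The names, the search range, and the `min_ratio` threshold are assumptions for illustration. The curvature constraints mentioned above are omitted for brevity.

```python
def autocorrelation(frame, lag):
    """Short-time auto-correlation of one frame at the given lag."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

def amdf(frame, lag):
    """Average magnitude difference function (unnormalized sum)."""
    return sum(abs(frame[i + lag] - frame[i]) for i in range(len(frame) - lag))

def estimate_pitch(frame, fs, fmin=50.0, fmax=450.0, min_ratio=0.3):
    """Return the pitch in Hz, or None when no clear periodic peak exists."""
    min_lag = int(fs / fmax)
    max_lag = min(int(fs / fmin), len(frame) - 1)
    r0 = autocorrelation(frame, 0)
    if r0 <= 0 or max_lag <= min_lag:
        return None
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    # Simplified peak test: compare the peak with the global maximum R(0).
    if autocorrelation(frame, best_lag) / r0 < min_ratio:
        return None
    return fs / best_lag
```

An AMDF-based variant would search for the deepest valley of `amdf` over the same lag range instead of the highest peak.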

In frequency-based approaches, pitch is determined from the periodic structure in the magnitude of the Fourier transform or the cepstral coefficients of a frame. For example, we can determine the pitch by finding the greatest common divisor of the frequencies of all the local peaks in the magnitude spectrum. When the required accuracy is high, a large Fourier transform needs to be computed, which is time consuming.

Spectral features

The spectrum of an audio frame refers to the Fourier transform of the samples in this frame. The difficulty of using the spectrum itself as a frame-level feature lies in its very high dimension. For practical applications, it is necessary to find a more succinct description. Let S_n(ω) denote the power spectrum (i.e., magnitude square of the spectrum) of frame n. If we think of ω as a random variable, and S_n(ω) normalized by the total power as the probability density function of ω, we can define mean and standard deviation of ω. It is easy to see that the mean measures the frequency centroid (FC), whereas the standard deviation measures the bandwidth (BW) of the signal. They are defined as

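$$FC(n) = \frac{\int_0^\infty \omega\, S_n(\omega)\, d\omega}{\int_0^\infty S_n(\omega)\, d\omega}, \qquad BW^2(n) = \frac{\int_0^\infty (\omega - FC(n))^2\, S_n(\omega)\, d\omega}{\int_0^\infty S_n(\omega)\, d\omega}$$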

It has been found that FC is related to the human sensation of the brightness of a sound we hear.
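In discrete form, the centroid and bandwidth are the power-weighted mean and standard deviation of the bin frequencies. The sketch below uses a naive DFT so that it needs only the standard library. A real system would use an FFT. All names are illustrative.

```python
import cmath
import math

def power_spectrum(frame):
    """One-sided power spectrum (bins 0..N/2) via a naive O(N^2) DFT."""
    n = len(frame)
    return [abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))) ** 2
            for k in range(n // 2 + 1)]

def centroid_and_bandwidth(frame, fs):
    """Frequency centroid and bandwidth in Hz, treating the normalized
    power spectrum as a probability distribution over frequency."""
    power = power_spectrum(frame)
    freqs = [k * fs / len(frame) for k in range(len(power))]
    total = sum(power)
    if total == 0:
        return 0.0, 0.0
    fc = sum(f * p for f, p in zip(freqs, power)) / total
    bw = math.sqrt(sum((f - fc) ** 2 * p for f, p in zip(freqs, power)) / total)
    return fc, bw
```

For a pure tone the centroid sits at the tone frequency and the bandwidth is close to zero. Broadband noise gives a large bandwidth.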

In addition to FC and BW, Liu et al. proposed to use the ratio of the energy in a frequency subband to the total energy as a frequency domain feature [19], which is referred to as energy ratio of subband (ERSB). Considering the perceptual property of human ears, the entire frequency band is divided into four subbands, each consisting of the same number of critical bands, where the critical bands correspond to cochlear filters in the human auditory model [12].

Specifically, when the sampling rate is 22050 Hz, the frequency ranges for the four subbands are 0-630 Hz, 630-1720 Hz, 1720-4400 Hz, and 4400-11025 Hz. Because the four ERSBs sum to one, only the first three ratios were used as audio features, referred to as ERSB1, ERSB2, and ERSB3, respectively.
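The subband ratios can be computed from a one-sided power spectrum as sketched below. The band edges are the ones quoted above for a 22050 Hz sampling rate. The function name and the assumption that bins are evenly spaced from 0 to fs/2 are illustrative.

```python
SUBBAND_EDGES_HZ = [0.0, 630.0, 1720.0, 4400.0, 11025.0]

def energy_ratios(power, fs=22050.0):
    """Return [ERSB1, ERSB2, ERSB3] for a one-sided power spectrum whose
    bins are evenly spaced from 0 Hz to fs/2 (assumed layout)."""
    step = (fs / 2) / (len(power) - 1)
    energy = [0.0, 0.0, 0.0, 0.0]
    for k, p in enumerate(power):
        f = k * step
        for band in range(4):
            # The last band also absorbs the bin at exactly fs/2.
            if f < SUBBAND_EDGES_HZ[band + 1] or band == 3:
                energy[band] += p
                break
    total = sum(energy)
    if total == 0:
        return [0.0, 0.0, 0.0]
    # The fourth ratio is redundant because the four ratios sum to one.
    return [e / total for e in energy[:3]]
```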

Scheirer et al. used the spectral rolloff point as a frequency domain feature [33], which is defined as the 95th percentile of the power spectrum. This is useful to distinguish voiced from unvoiced speech. It is a measure of the "skewness" of the spectral shape, with a right-skewed distribution having a higher value. Lu et al. [21] used spectrum flux, which is the average variation of the spectrum between adjacent frames in an audio clip.
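The rolloff point can be read off the cumulative power spectrum, as in this small sketch. The bin layout (evenly spaced from 0 to fs/2) and the function name are assumptions.

```python
def spectral_rolloff(power, fs, fraction=0.95):
    """Lowest bin frequency (Hz) at which the cumulative power reaches
    `fraction` of the total power of a one-sided spectrum."""
    step = (fs / 2) / (len(power) - 1)
    target = fraction * sum(power)
    cumulative = 0.0
    for k, p in enumerate(power):
        cumulative += p
        if cumulative >= target:
            return k * step
    return (len(power) - 1) * step
```

A spectrum with most of its energy at high frequencies pushes the rolloff point upward, which is why the feature helps separate unvoiced from voiced speech.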

Mel-frequency cepstral coefficients (MFCC) and cepstral coefficients (CC) are widely used for speech recognition and speaker recognition. While both provide a smoothed representation of the original spectrum of an audio signal, MFCC further accounts for the non-linear sensitivity of the human hearing system to different frequencies. Based on the temporal change of MFCC, an audio sequence can be divided into segments so that each segment contains music of the same style, or speech from one person.

1.7.2 Clip-level features

As described before, frame-level features are designed to capture the short-term characteristics of an audio signal. To extract the semantic content, we need to observe the temporal variation of frame features on a longer time scale. This consideration leads to the development of various clip-level features, which characterize how frame-level features change over a clip. Therefore, clip-level features can be grouped by the type of frame-level features that they are based on.

Volume-based

Considering the difference of gain values in audio digitization systems, the mean volume of a clip does not necessarily reflect the scene content, but the temporal variation of the volume in a clip does. To measure the variation of volume, Liu et al. proposed several clip-level features [19]. The volume standard deviation (VSTD) is the standard deviation of the volume over a clip, normalized by the maximum volume in the clip. The volume dynamic range (VDR) is defined as (max(v) - min(v))/max(v), where min(v) and max(v) are the minimum and maximum volume within an audio clip. Obviously these two features are correlated, but they do carry some independent information about the audio scene content.

Another feature is volume undulation (VU), which is the accumulation of the differences between neighboring peaks and valleys of the volume contour within a clip. Scheirer et al. proposed to use the percentage of "low-energy" frames [33], which is the proportion of frames with RMS volume less than 50% of the mean volume within one clip. Liu et al. used the non-silence ratio (NSR), the ratio of the number of non-silent frames to the total number of frames in a clip, where silence detection is based on both volume and ZCR [19].
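Given the per-frame volumes of one clip, several of these statistics follow directly from their definitions. The sketch below covers VSTD, VDR, and the low-energy frame percentage. The function name and dictionary keys are illustrative, and the population (not sample) standard deviation is an assumption.

```python
import math

def volume_clip_features(volumes):
    """Clip-level volume statistics from a list of frame volumes."""
    n = len(volumes)
    vmax, vmin = max(volumes), min(volumes)
    mean = sum(volumes) / n
    if vmax == 0:
        return {"vstd": 0.0, "vdr": 0.0, "low_energy": 0.0}
    std = math.sqrt(sum((v - mean) ** 2 for v in volumes) / n)
    return {
        "vstd": std / vmax,               # std normalized by max volume
        "vdr": (vmax - vmin) / vmax,      # volume dynamic range
        # Fraction of frames below 50% of the clip's mean volume.
        "low_energy": sum(1 for v in volumes if v < 0.5 * mean) / n,
    }
```

All three values are ratios, so they are unaffected by a constant gain applied to the whole clip.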

The modulation spectrum of the volume contour of a speech waveform typically peaks at 4 Hz. To discriminate speech from music, Scheirer et al. proposed a feature called 4 Hz modulation energy (4ME) [33], which is calculated based on the energy distribution in 40 subbands. Liu et al. proposed a different definition that can be directly computed from the volume contour. Specifically, it is defined as [19]

$$4ME = \frac{\int_0^\infty W(\omega)\, |C(\omega)|^2\, d\omega}{\int_0^\infty |C(\omega)|^2\, d\omega}$$

where C(ω) is the Fourier transform of the volume contour of a given clip and W(ω) is a triangular window function centered at 4Hz. Speech clips usually have higher values of 4ME than music or noise clips.
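One way to compute such a feature from a sampled volume contour is sketched below. It assumes that 4ME is the window-weighted energy of the contour's spectrum divided by its total energy, and that W is a triangle that equals one at 4 Hz and falls to zero at 0 Hz and 8 Hz. The window width, like the function names, is an assumption and is not taken from [19].

```python
import cmath
import math

def triangular_window(f, center=4.0, half_width=4.0):
    """Triangle with peak 1 at `center` Hz (half_width is an assumption)."""
    return max(0.0, 1.0 - abs(f - center) / half_width)

def modulation_energy_4hz(volumes, frame_rate):
    """Share of the volume contour's spectral energy near 4 Hz.
    `frame_rate` is the number of volume values per second."""
    n = len(volumes)
    weighted = 0.0
    total = 0.0
    for k in range(n // 2 + 1):
        c = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                for i, v in enumerate(volumes))
        energy = abs(c) ** 2
        weighted += triangular_window(k * frame_rate / n) * energy
        total += energy
    return weighted / total if total > 0 else 0.0
```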

ZCR-based

The ZCR contours of different types of audio signals differ. For a speech signal, low and high ZCR periods are interlaced, because voiced and unvoiced sounds often alternate in speech. In contrast, mild music has a relatively smooth contour.

Liu et al. used the standard deviation of ZCR (ZSTD) within a clip to classify different audio contents [19]. Saunders proposed to use four statistics of the ZCR as features [32]. They are 1) standard deviation of first order difference, 2) third central moment about the mean, 3) total number of zero crossing exceeding a threshold, and 4) difference between the number of zero crossings above and below the mean values. Combined with the volume information, the proposed algorithm can discriminate speech and music at a high accuracy of 98%.

Pitch-based

The patterns of pitch tracks of different audio contents vary a lot. For a speech clip, voiced segments have smoothly changing pitch values, while no pitch information is detected in silent or unvoiced segments. For audio with a prominent noisy background, no pitch information is detected either. For a gentle music clip, since there are always dominant tones within a short period of time, many of the pitch tracks are flat with constant values. The pitch frequency in a speech signal is primarily influenced by the speaker (male or female), whereas the pitch of a music signal is dominated by the strongest note that is being played. It is not easy to derive the scene content directly from the pitch level of isolated frames, but the dynamics of the pitch contour over successive frames appear to reveal more about the scene content.

Liu et al. utilized three clip-level features to capture the variation of pitch [19]: standard deviation of pitch (PSTD), smooth pitch ratio (SPR), and non-pitch ratio (NPR). SPR is the percentage of frames in a clip that have a pitch similar to that of the previous frame. This feature is used to measure the percentage of voiced or music frames within a clip, since only voiced speech and music have smooth pitch. On the other hand, NPR is the percentage of frames without pitch. This feature can measure how many frames are unvoiced speech or noise within a clip.
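These three statistics can be sketched from a per-frame pitch track in which frames without a detectable pitch are marked `None`. The similarity tolerance `tol_hz` and the choice to compute PSTD over pitched frames only are assumptions made for this example.

```python
import math

def pitch_clip_features(pitches, tol_hz=5.0):
    """Return (PSTD, SPR, NPR) for a list of pitch values in Hz, where
    None marks a frame with no detected pitch."""
    n = len(pitches)
    voiced = [p for p in pitches if p is not None]
    if voiced:
        mean = sum(voiced) / len(voiced)
        pstd = math.sqrt(sum((p - mean) ** 2 for p in voiced) / len(voiced))
    else:
        pstd = 0.0
    # A frame is "smooth" if it and its predecessor both have a pitch
    # and the two values differ by at most tol_hz (assumed criterion).
    smooth = sum(1 for prev, cur in zip(pitches, pitches[1:])
                 if prev is not None and cur is not None
                 and abs(cur - prev) <= tol_hz)
    spr = smooth / n
    npr = (n - len(voiced)) / n
    return pstd, spr, npr
```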

Frequency-based

Given frame-level features that reflect frequency distribution, such as FC, BW, and ERSB, one can compute their mean values over a clip to derive corresponding clip-level features. Since a frame with high energy has more influence on the sound perceived by the human ear, Liu et al. proposed to use a weighted average of the corresponding frame-level features, where the weight for a frame is proportional to the energy of the frame [19]. This is especially useful when there are many silent frames in a clip, because the frequency features in silent frames are almost random. With energy-based weighting, their detrimental effects are largely removed.
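The energy-weighted average is simple enough to show in a few lines. The function name is illustrative, and the caller is assumed to supply one non-negative energy value per frame.

```python
def energy_weighted_mean(values, energies):
    """Average of frame-level feature values, weighted by frame energy."""
    total = sum(energies)
    if total == 0:
        return 0.0
    return sum(v * e for v, e in zip(values, energies)) / total
```

For example, with frame centroids of 1000, 5000, and 1200 Hz and energies 4, 0, and 4, the silent middle frame is ignored and the result is 1100 Hz, whereas the plain mean would be 2400 Hz.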

Zhang and Kuo used spectral peak tracks (SPT's) in a spectrogram to classify audio signals [42]. First, SPT is used to detect music segments. If there are tracks which stay at about the same frequency level for a certain period of time, this period is considered a music segment. Then, SPT is used to further classify music segments into three subclasses: song, speech with music, and environmental sound with music background. Song segments have one of three features: ripple-shaped harmonic peak tracks due to voice sound, tracks with longer duration than speech, and tracks with fundamental frequency higher than 300 Hz. Speech with music background segment has SPT's concentrating in the lower to middle frequency bands and has lengths within a certain range. Those segments without certain characteristics are classified as environmental sound with music background.

There are other useful clip features, and some researchers have studied audio features in the compressed domain. Due to space limits, we cannot include all of them here. Interested readers are referred to [2][3][15][23][28].

1.8 Audio Segmentation

Audio segmentation locates abrupt changes along the audio stream. As indicated before, this task is domain specific and calls for different approaches for different requirements. In this section, we present two segmentation tasks that we investigated at two different levels. One is to detect speaker boundaries at the frame level, and the other is to segment audio scenes, for example commercials and news reporting in broadcast programs, at the clip level.

1.8.1 Speaker segmentation

In our study, we employed 13 MFCCs and their first-order derivatives as audio features. The segmentation algorithm consists of two steps: splitting and merging. During splitting, we identify possible speaker change boundaries. During merging, neighboring segments are merged if their contents are similar. In the first step, low-energy frames, which are local minimum points on the volume contour, are located as boundary candidates. Figure 20.2 shows the volume contour of an audio file, where all low-energy frames are indicated by circles. For each boundary candidate, the difference between its neighbors (both left and right) is computed. The definition of neighbors is illustrated in the figure, where for frame X, the two dotted rectangular windows W1 and W2 are the neighbors of X, each with a length of L seconds. If the difference is higher than a certain threshold and is the maximum within the surrounding range, we declare the corresponding frame a speaker boundary.

Figure 20.2: Illustration of speaker segmentation.

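Divergence [13] is adopted to measure the difference between two windows. Assume the features in W1 follow a Gaussian distribution N(μ₁, Σ₁), where μ₁ is the mean vector and Σ₁ is the covariance matrix. Similarly, the features in W2 follow N(μ₂, Σ₂). Then the divergence can be simplified as

$$D = \frac{1}{2} \operatorname{tr}\left[ (\Sigma_1 - \Sigma_2)(\Sigma_2^{-1} - \Sigma_1^{-1}) \right] + \frac{1}{2} \operatorname{tr}\left[ (\Sigma_1^{-1} + \Sigma_2^{-1})(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \right]$$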
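The sketch below evaluates the divergence between two Gaussian windows under two stated assumptions: the divergence is taken to be the symmetric Kullback-Leibler form for Gaussians, and the covariance matrices are restricted to be diagonal so that no matrix inversion is needed. Neither simplification is claimed to match the authors' implementation, and the function names are illustrative.

```python
def window_stats(vectors):
    """Per-dimension mean and variance of the feature vectors in a window."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    variances = [sum((v[d] - means[d]) ** 2 for v in vectors) / n
                 for d in range(dim)]
    return means, variances

def divergence_diag(mean1, var1, mean2, var2):
    """Symmetric KL divergence between two diagonal-covariance Gaussians.
    All variances must be strictly positive."""
    total = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        total += 0.5 * (v1 - v2) * (1.0 / v2 - 1.0 / v1)
        total += 0.5 * (1.0 / v1 + 1.0 / v2) * (m1 - m2) ** 2
    return total
```

The first term reacts to a change in spread and the second to a shift in the mean, so a boundary candidate scores high when either the average spectrum or its variability differs between the two windows.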

Such a splitting process generally yields over-segmentation. A merging step is necessary to group similar neighboring segments together and so reduce the number of false speaker boundaries. This is done by comparing the statistical properties of adjacent segments. The same difference measure can be used, computed over the longer segments (compared to the fixed windows in the splitting stage), and a lower threshold is applied.

Such an algorithm actually detects changes in the acoustic channel properties; for example, if the same speaker moves to a different environment, a segment boundary will be declared, although it is not a real speaker change. In tests on two half-hour news sequences, 93% of the true speaker boundaries were detected, with a false alarm rate of 22%.

1.8.2 Audio scene segmentation

Here, the audio scenes we considered are different types of TV programs, including news reporting, commercials, basketball, football, and weather forecasts. To detect audio scene boundaries, a 14-dimensional audio feature vector is computed over each audio clip. The audio features consist of VSTD, VDR, VU, ZSTD, NSR, 4ME, PSTD, SPR, NPR, FC, BW, ERSB1, ERSB2, and ERSB3. For a clip to be declared a scene change, it must be similar to all the neighboring future clips, and different from all the neighboring previous clips. Based on this criterion, we propose using the following measure:

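$$\text{scene-change-index} = \frac{\left\| \frac{1}{N} \sum_{i=-N}^{-1} f(i) - \frac{1}{N} \sum_{i=0}^{N-1} f(i) \right\|^2}{\left( c + \operatorname{var}(f(-N), \ldots, f(-1)) \right) \left( c + \operatorname{var}(f(0), \ldots, f(N-1)) \right)}$$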

where f(i) is the feature vector of the i-th clip, with i=0 representing the current clip, i>0 a future clip, and i<0 a previous clip, ||·|| is the L2 norm, var( ... ) is the average of the squared Euclidean distances between each feature vector and the mean vector of the N clips considered, and c is a small constant to prevent division by zero. When the feature vectors are similar within the previous N clips and within the following N clips, respectively, but differ significantly between the two groups, a scene break is declared. If two breaks are less than N clips apart, the one with the smaller scene-change-index value is removed. The selection of the window length N is critical: if N is too large, this strategy may fail to detect scene changes between short audio shots, and it also adds unnecessary delay to the processing. Through trial and error, we have found that N=6 gives satisfactory results.
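The index can be sketched directly from the verbal definitions above. The code assumes the measure is the squared distance between the mean feature vectors of the previous N clips and of the current-plus-following N clips, divided by the product of the two (c + var) terms. That exact form, the default value of `c`, and the function names are assumptions for illustration.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def spread(vectors):
    """Average squared Euclidean distance to the mean vector (var)."""
    m = mean_vector(vectors)
    return sum(sum((x - mx) ** 2 for x, mx in zip(v, m))
               for v in vectors) / len(vectors)

def scene_change_index(features, t, n=6, c=1e-3):
    """Index for clip t: large when clips t-n..t-1 and t..t+n-1 are each
    homogeneous but differ from one another."""
    if t - n < 0 or t + n > len(features):
        raise ValueError("not enough clips on one side of t")
    previous = features[t - n:t]
    following = features[t:t + n]
    gap = sum((a - b) ** 2
              for a, b in zip(mean_vector(previous), mean_vector(following)))
    return gap / ((c + spread(previous)) * (c + spread(following)))
```

Candidate breaks would then be the clips whose index exceeds a threshold, with the weaker of any two candidates less than N clips apart discarded.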

Figure 20.3 (a) shows the content of one test audio sequence used in segmentation. This sequence is digitized from a TV program that contains seven different semantic segments. The first and the last segments are both football games, between which are the TV station's logo shot and four different commercials. The duration of each segment is also shown in the graph. Figure 20.3 (b) shows the scene-change-index computed for this sequence. Scene changes are detected by identifying those clips for which the scene-change-index is higher than a threshold, D_min. We used D_min=3, which was found to yield good results through trial and error. In these graphs, the mark "o" indicates real scene changes and "*" detected scene changes. All the real scene changes are detected using this algorithm. Note that there are two falsely detected scene changes in the first segment of the sequence. They correspond to the sudden appearance of the commentator's voice and the audience's cheering.

Figure 20.3: Content and scene-change-index for one audio stream.

1.9 Audio Content Classification

After audio segmentation, we need to classify each segment into predefined categories. The categories are normally semantically meaningful high-level labels that must be determined from low-level features. Pattern recognition bridges this gap by mapping the distribution of low-level features to high-level semantic concepts. In this section, we present three different audio classification tasks: speaker recognition, audio scene detection, and music genre classification.

1.9.1 Speaker recognition

Besides the message carried by the words, the speaker's identity is additional information conveyed in a speech signal. Speaker identities are useful in audio content indexing and retrieval. For example, occurrences of the anchorpersons in broadcast news often indicate semantically meaningful boundaries for reported news stories. Speaker recognition aims to detect speaker identities, and it generally encompasses two fundamental tasks: speaker identification is the task of determining who is talking from a set of known voices or speakers, and speaker verification is the task of determining whether a person is who he/she claims to be.

Acoustic features for speaker recognition should have high speaker discrimination power, which means high inter-speaker variability and low intra-speaker variability. Adopted features include linear prediction coefficients (LPC), cepstrum coefficients, log-area ratio (LAR), MFCC, etc., among which MFCC is the most prevalent due to its effectiveness [29]. Depending on the specific application, different speaker models and corresponding pattern matching methods can be applied. Popular speaker models are dynamic time warping (DTW), hidden Markov models, artificial neural networks, and vector quantization (VQ).

Huang et al. studied anchorperson detection, which can be categorized as a speaker verification problem [10]. Detection of anchorperson segments is carried out using text independent speaker recognition techniques. The target speaker (anchorperson) and background speakers are represented by a 64-component Gaussian mixture model with diagonal covariance matrices. The utilized audio features are 13 MFCC coefficients and their first and second order derivatives. A maximum likelihood classifier is applied to detect the target speaker segments. Testing on a dataset of 4 half hour news sequences, this approach successfully detects 91.3% of real anchorperson speech, and the false alarm rate is 1%.

1.9.2 Audio scene detection

Audio scenes are segments with homogeneous content in an audio stream. For example, a broadcast news program generally consists of two different audio scenes: news reporting and commercials. Discriminating them is very useful for indexing news content. One obvious use is to create a summary of a news program in which the commercial segments are removed.

Depending on the application, different categories of audio scenes and different approaches are adopted. In [32], Saunders considered the discrimination of speech from music. Saraceno and Leonardi further classified audio into four groups: silence, speech, music, and noise [31]. The addition of the silence and noise categories is appropriate, since a long silence interval can be used as a segment boundary, and the characteristics of noise are much different from those of speech or music.

A more elaborate audio content categorization was proposed by Wold et al. [38], which divides audio content into ten groups: animal, bells, crowds, laughter, machine, instrument, male speech, female speech, telephone, and water. To characterize the difference among these audio groups, Wold et al. used the mean, variance, and auto-correlation of loudness, pitch, brightness (i.e., frequency centroid), and bandwidth as audio features. A nearest neighbor classifier based on a weighted Euclidean distance measure was employed. The classification accuracy is about 81% over an audio database with 400 sound files.

Liu et al. [17][19] studied the problem of classifying TV broadcast into five different categories: news reporting, commercial, weather forecast, basketball game, and football game. Based on a set of 14 audio features extracted from audio energy, zero crossing rate, pitch, and spectrogram, a 3 layer feed forward neural network classifier achieves 72.5% accuracy. A classifier based on a hidden Markov model further increases the accuracy by 12%.

Another interesting work related to general audio content classification is by Zhang and Kuo [41]. They explored five kinds of audio features: energy, ZCR, fundamental frequency, timbre, and rhythm. Based on these features, a hierarchical system for audio classification and retrieval was built. In the first step, audio data is classified into speech, music, environmental sounds, and silence using a rule-based heuristic procedure. In the second step, environmental sounds are further classified into applause, rain, birds' sound, etc., using an HMM classifier. These two steps provide the so-called coarse-level and fine-level classification. The coarse-level classification achieves 90% accuracy and the fine-level classification achieves 80% accuracy in a test involving ten sound classes.

1.9.3 Music genre classification

Digital music, in all kinds of formats including MPEG Layer 3 (MP3), Microsoft media format, RealAudio, MIDI, etc., is a very popular type of traffic on the Internet. When music pieces are created, producers or distributors normally assign them related metadata, for example, title, music category, author name, and date. Unfortunately, much of this metadata is unavailable or is lost during music manipulation and format conversion. Music genre, as a specific piece of metadata, is important and indispensable for music archiving and querying. For example, a simple query to find all pop music in a digital music database requires the category information. Since manual relabelling is time consuming and inconsistent, we need an automatic way to classify music genre.

Music genre classification has attracted a lot of research effort in recent years. Tzanetakis et al. [36] explored the automatic classification of audio signals into a hierarchy of music genres. On the first level, there are ten categories: classical, country, disco, hiphop, jazz, rock, blues, reggae, pop, and metal. On the second level, classical music is further separated into choir, orchestra, piano, and string quartet, and jazz is further split into bigband, cool, fusion, piano, quartet, and swing. Three sets of audio features are proposed, which reflect the timbral texture, rhythmic content, and pitch content of the audio signal, respectively. Timbral texture features include spectral centroid, spectral rolloff, spectral flux, zero crossing rate, and MFCC. Rhythmic content features are calculated based on the wavelet transform, from which the main beat, the sub-beats, and their periods and strengths are extracted. Pitch content features are extracted based on multiple pitch detection techniques. The pitch features used include the amplitude and period of the maximum peaks of the pitch histogram, the pitch interval between the two most prominent peaks of the pitch histogram, and the sum of the histogram. Tzanetakis et al. tested different classifiers, including a simple Gaussian classifier, a Gaussian mixture model (GMM) classifier, and a K-nearest neighbor classifier. Among them, the GMM with four mixture components achieves the best classification accuracy, 61%. Considering that human beings make 20% to 30% errors in classifying musical genre in a similar task, the performance of automatic music genre classification is reasonably good.

Lambrou et al. [14] investigated the task of classifying audio signal into three different music styles: rock, piano, and jazz. They used zero crossing rate and statistical signal features in wavelet transform domains as acoustic features. Overall, seven statistics are computed, which are first order statistics: mean, variance, skewness, and kurtosis, and second order statistics: angular second moment, correlation, and entropy. Lambrou et al. benchmarked four different classifiers: minimum distance classifier, K-nearest neighbors classifier, least squares minimum distance classifier (LSMDC), and quadrature classifier. Simulation results show that LSMDC gives the best performance with an accuracy of 91.67%.

1.10 Audio Summarization

The goal of audio summarization is to provide a compact version of the original audio signal in a way that most significant information is kept within a minimum duration. Effective audio summarization can save a tremendous amount of time for the user to digest the audio content without missing any important information. To further reduce the time that users need to skim audio content, the original audio data or the summary can be played back at a faster speed in a way that pitch information is kept. Normally when the speedup is less than two times real time, human beings do not lose much listening comprehension capability.

Huang et al. [11] studied how to summarize news broadcasts at different levels of detail. The first level of summary filters out commercials, and the second level is composed of all anchorperson speech, which covers the introductions and summaries of all news stories. The top-level summary consists of a reduced set of anchorperson speech segments chosen so that they cover all reported content and are least redundant. For more information on audio summarization, please refer to Zechner's survey paper [40] on spoken language summarization.

^[1]In the literature, the term "window" is sometimes used.

^[2]Volume is also referred to as loudness, although strictly speaking, loudness is a subjective measure that depends on the frequency response of the human listener.

^[3]The RMS volume is also referred to as energy.