Brief Review of Audio Indexing and Retrieval


The volume of information accessible over the Internet or within personal digital multimedia collections has exceeded users' capability to efficiently sift through it and find relevant information. The last decade has seen an explosion of research and development in automatic multimedia information retrieval (IR), driven by the growth of the Internet and improvements in digital storage capacity. A series of powerful Internet search engines, from AltaVista and Lycos to Google, has brought new information retrieval technologies into everyday use. Their ability to search billions of documents within seconds has brought enormous benefits and changed how we work with information; Google and other popular search engines have begun to replace libraries as the starting point for information gathering. While classic IR has focused mainly on the retrieval of text, recent efforts addressing content-based retrieval of other media, including audio and video, are starting to show promise.

A video sequence is a rich multimedia information source, containing speech, music, text (if closed captions are available), image sequences, animation, and so on. Although humans can quickly and effectively interpret its semantic content (recognizing imaged objects, distinguishing types of audio, and understanding the linguistic meaning of speech), computer understanding of a video sequence is still at a primitive stage. There is strong demand for efficient tools that make audiovisual information easier to disseminate, and computer understanding of a video scene is a crucial step in building such tools. Other applications requiring scene understanding include spotting and tracking special events in surveillance video, active tracking of objects in unmanned vision systems, and video editing and composition.

Research in video content analysis has focused on the use of speech and image information. More recently, researchers have begun to investigate the potential of analyzing the audio signal. This is feasible because audio conveys discriminative information for identifying different video content; for example, the audio characteristics of a football game are easily distinguishable from those of a news report. Audio information alone may not be sufficient for understanding scene content, and in general both audio and visual information should be analyzed. However, because audio-based analysis requires significantly less computation, it can be applied in a pre-processing stage before a more comprehensive analysis involving visual information. The audio signal is highly dynamic in both waveform amplitude and spectral distribution, which makes audio content analysis challenging. In the following, we briefly introduce several sub-areas of this field.

1.1 Audio Content Segmentation

The fundamental approach to audio content analysis is to break the audio stream into smaller segments whose content is homogeneous, and then process each segment individually with a suitable method. The reason is that audio processing is domain specific: no single algorithm is applicable to all kinds of audio content, and different approaches are adopted for, say, speech and music signals. Audio content segmentation is therefore normally the first step in an audio analysis system, and different systems may choose different methods depending on the application. Siegler et al. [34] proposed the symmetric Kullback-Leibler (KL) distance as an effective metric for segmenting broadcast news at speaker or channel changes. Chen et al. detected changes in speaker identity, environment, or channel conditions in the audio stream using the Bayesian Information Criterion (BIC) [4]. Nam and Tewfik [26] detected sharp temporal variations in the power of subband signals to segment visual streams. Liu et al. investigated how to segment broadcast news at the scene level so that each segment contains a single type of TV program, such as news reporting, commercials, or games [19].
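A minimal sketch of distance-based change detection in the spirit of the symmetric KL approach: fit a diagonal-Gaussian model to two adjacent feature windows and score each candidate boundary by their symmetric KL distance, with peaks marking change points. The window size and the synthetic MFCC-like features are illustrative assumptions, not the setup of the cited systems.

```python
import numpy as np

def symmetric_kl(x, y, eps=1e-6):
    """Symmetric KL divergence between two diagonal-Gaussian models
    fit to feature windows x and y (shape: frames x dims)."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    vx = x.var(axis=0) + eps
    vy = y.var(axis=0) + eps
    d2 = (mx - my) ** 2
    # sum over dims of KL(p||q) + KL(q||p) for diagonal Gaussians
    return 0.5 * np.sum((vx + d2) / vy + (vy + d2) / vx - 2.0)

def change_scores(features, win=50):
    """Slide a pair of adjacent windows over the feature stream and
    score each boundary candidate by the symmetric KL distance."""
    scores = []
    for t in range(win, len(features) - win):
        scores.append(symmetric_kl(features[t - win:t], features[t:t + win]))
    return np.array(scores)

# toy stream: two acoustically different regions, change at frame 200
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 13))   # e.g. "speech" MFCC-like frames
b = rng.normal(3.0, 0.5, size=(200, 13))   # e.g. "music" frames
stream = np.vstack([a, b])

scores = change_scores(stream, win=50)
boundary = np.argmax(scores) + 50          # offset by left-window size
print(boundary)                            # peaks near the true change, frame 200
```

In a real system the scores would be smoothed and thresholded rather than taking a single global maximum, and the features would be cepstral coefficients computed from the waveform.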

1.2 Audio Content Categorization

Audio content categorization assigns each audio segment to one of a set of predefined categories. It is normally formulated as a pattern recognition problem, so a wide range of pattern classification algorithms can be applied to different sets of acoustic features. Depending on the specified audio categories, various combinations of audio features and classification methods are utilized. Saunders [32] presented a method to separate speech from music by tracking changes in the zero-crossing rate. Wold et al. [38] used a nearest-neighbor classifier based on a weighted Euclidean distance to classify audio into ten categories. Liu et al. [16][17] studied the effectiveness of artificial neural network and Gaussian mixture model (GMM) classifiers on the task of differentiating five types of news broadcast. Tzanetakis et al. [36] classified music into ten genres using k-nearest-neighbor (k-NN) and GMM classifiers. In some situations it is more effective to segment and categorize audio content jointly: Huang et al. [10] proposed a hidden Markov model (HMM) approach that simultaneously segments and classifies video content.
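To make the zero-crossing-rate feature concrete, the sketch below computes per-frame ZCR; a pure tone crosses zero at twice its frequency, while wideband noise (a crude stand-in for unvoiced speech) crosses far more often. The frame sizes and test signals are illustrative assumptions; Saunders' actual discriminator uses statistics of the ZCR distribution over time, not a single mean.

```python
import numpy as np

def zero_crossing_rate(signal, frame_len=400, hop=200):
    """Per-frame zero-crossing rate: fraction of adjacent-sample
    sign changes within each frame."""
    zcrs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        changes = np.sum(np.abs(np.diff(np.signbit(frame).astype(int))))
        zcrs.append(changes / (frame_len - 1))
    return np.array(zcrs)

# toy illustration: a steady 440 Hz tone vs. wideband noise
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)            # ZCR ~ 2 * 440 / 8000 = 0.11
rng = np.random.default_rng(1)
noise = rng.normal(size=sr)                   # ZCR ~ 0.5 for white noise

zcr_tone = zero_crossing_rate(tone)
zcr_noise = zero_crossing_rate(noise)
print(zcr_tone.mean(), zcr_noise.mean())      # noise crosses far more often
```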

Audio content segmentation and categorization usually serve as the first steps in audio content analysis. Specific analysis methods can then be applied to each type of audio; for example, speech and speaker recognition algorithms can be applied to speech, and note/pitch detection algorithms to music.

1.3 Speech Signal Processing

Although the speech signal is normally narrowband, with its energy concentrated below 4 kHz, it carries rich semantics and a wealth of side information, including language, speaker identity, age, gender, and emotion. Such side information is important for audio indexing and significantly expands audio indexing and query capabilities.

Automatic speech recognition (ASR) transcribes the speech signal into a text stream. After decades of work, the technique has matured from research labs into commercial products. Almost all state-of-the-art large-vocabulary recognition systems are built on the hidden Markov model framework, along with sophisticated methods to improve both accuracy and speed. The last decade has witnessed substantial progress in ASR technology: systems achieve near-perfect results in restricted domains, e.g., recognizing numbers in phone conversations, and reasonable results in unrestricted domains, e.g., transcribing broadcast news. Even at an accuracy of about 50% on noisy audio, ASR still produces half of the correct words for indexing, and the content remains reasonably easy to understand. Gauvain et al. [7] provided an overview of recent advances in large-vocabulary speech recognition and explored feasible application domains.

Extracting this side information has also attracted substantial research effort. Besides improving ASR performance through acoustic model adaptation, side information provides additional audio query functionality; for example, we can retrieve the speech of Kennedy if speaker identities are available. Reynolds presented an overview of speaker recognition technologies in [29]. Backfried et al. studied automatic identification of four languages in broadcast news based on GMMs [1]. Dellaert et al. [6] explored several statistical pattern recognition techniques for emotion recognition in speech. Parris and Carey described a language-independent gender identification technique based on HMMs [27]; the average error rate on a test database of 11 languages is 2.0%.
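The statistical pattern of these side-information classifiers can be sketched very simply: fit one generative model per class and decide by maximum likelihood. The sketch below uses a single diagonal Gaussian per class over synthetic two-dimensional features (a hypothetical pitch-plus-spectral pair); the cited systems use GMMs or HMMs over cepstral features, but the decision rule is the same in spirit.

```python
import numpy as np

class GaussianClassifier:
    """One diagonal Gaussian per class, maximum-likelihood decision.
    A minimal stand-in for the GMM/HMM classifiers used in practice."""

    def fit(self, features_by_class):
        self.models = {}
        for label, feats in features_by_class.items():
            mu = feats.mean(axis=0)
            var = feats.var(axis=0) + 1e-6   # floor to avoid division by zero
            self.models[label] = (mu, var)
        return self

    def log_likelihood(self, frames, label):
        mu, var = self.models[label]
        return -0.5 * np.sum(np.log(2 * np.pi * var)
                             + (frames - mu) ** 2 / var, axis=-1)

    def predict(self, frames):
        # average per-frame log-likelihood under each class model
        scores = {lbl: self.log_likelihood(frames, lbl).mean()
                  for lbl in self.models}
        return max(scores, key=scores.get)

# synthetic training features for two classes (hypothetical values)
rng = np.random.default_rng(2)
train = {
    "female": rng.normal([210.0, 1.0], [25.0, 0.3], size=(500, 2)),
    "male":   rng.normal([120.0, 0.6], [20.0, 0.3], size=(500, 2)),
}
clf = GaussianClassifier().fit(train)

test_utt = rng.normal([205.0, 1.0], [25.0, 0.3], size=(50, 2))
print(clf.predict(test_utt))  # drawn near the "female" model
```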

1.4 Music Signal Processing

In contrast to the narrowband speech signal, music occupies the wideband portion of the audio signal. Given the huge quantity of music available and the worldwide interest in it, musical information access is a crucial problem. As the amount of musical content increases and the Web becomes an important mechanism for distributing music, we expect to see surging demand for music search services. Many currently available music search engines rely on file names, song titles, and the names of composers or performers; these systems do not use the music content directly. To index music based on content, we need to address at least three issues: music categorization, music transcription, and music melody matching.

Approaches similar to those introduced in Section 1.2 may be applied to music genre classification. Parallel to speech recognition, music transcription converts the music signal into a sequence of notes. The Musical Instrument Digital Interface (MIDI) is a powerful tool for composers and performers, serving as a communication protocol that allows electronic musical instruments to interact with each other, and transforming music into MIDI format is a natural choice for indexing a music archive. Query by example and query by humming are natural ways to retrieve pieces from a large music database. To realize such query capabilities, melody matching is the key issue determining the effectiveness and efficiency of music search. Recent work on query by humming can be found in [8][20].
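One common baseline for melody matching, sketched below under illustrative assumptions (the tiny "database" and tune names are hypothetical), reduces each note sequence to its up/down/repeat contour (the Parsons code) and compares contours by edit distance. Because only pitch direction is kept, the match is invariant to transposition, which suits hummed queries sung in an arbitrary key.

```python
def parsons_code(pitches):
    """Reduce a note sequence to its Up/Down/Repeat pitch contour."""
    code = []
    for prev, cur in zip(pitches, pitches[1:]):
        code.append("U" if cur > prev else "D" if cur < prev else "R")
    return "".join(code)

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (one row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

# tiny hypothetical "database" of melodies as MIDI note numbers
database = {
    "tune_a": [60, 62, 64, 62, 60, 60],
    "tune_b": [67, 65, 64, 62, 60, 62],
}
# a hummed query: the contour of tune_a, transposed up a fourth
query = [65, 67, 69, 67, 65, 65]

q_code = parsons_code(query)
best = min(database,
           key=lambda k: edit_distance(q_code, parsons_code(database[k])))
print(best)  # contour matching ignores the transposition: "tune_a"
```

Real query-by-humming systems such as those in [8][20] refine this idea with pitch intervals, note durations, and approximate matching tuned to singing errors.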

1.5 Audio Data Mining

Audio data mining looks for hidden patterns in a collection of audio data. For example, in a telemarketing conversation, certain words, sentences, or speech styles indicate the customer's preferences, personality, and shopping patterns, which help a human salesperson quickly decide the best selling strategy. With large collections of thousands of such conversations, audio data mining techniques can automatically discover associations between speech styles and shopping patterns, and the discoveries can be integrated into an automatic customer service system to serve future customers with humanlike intelligence and flexibility. Audio data mining combines speech recognition, language processing, knowledge discovery, and indexing and search algorithms to transform audio data into useful intelligence. Promising application areas include customer care call centers, knowledge gathering, law enforcement, and security operations.

Audio mining is not the same as audio query, although they share fundamental techniques, including audio segmentation and categorization, speech transcription, and audio search. The major difference is that audio query finds something we know exists, whereas audio mining discovers new patterns in an audio archive through statistical analysis. As the technology matures, ever greater volumes of audio will be captured from television, radio, telephone calls, meetings, conferences, and presentations, and audio mining techniques will turn these archives into valuable knowledge as readily as text mining does for text.

1.6 Multi-Modality Approach

Audio often coexists with other modalities, including text and video streams. While audio content analysis is itself very useful for video indexing, it is more desirable to combine information from other modalities, since semantics are embedded in multiple forms that are usually complementary. For example, live TV coverage of an earthquake conveys information far beyond what we hear from the reporter: we can see the effects of the earthquake while listening to the reporter recite the statistics. Many efforts have been devoted to this field. Wang et al. reported recent work on combining audio and visual information for multimedia analysis and surveyed related work in [39]. The Informedia project at Carnegie Mellon University combined speech recognition, natural language understanding, image processing, and text processing for content analysis in creating a terabyte digital video library [37].

With worldwide interest in multimedia indexing and retrieval, a single standard providing a simple, flexible, interoperable solution to multimedia indexing and retrieval problems is extremely valuable. The MPEG-7 standard [25] serves this need by providing a rich set of standardized tools to describe multimedia content. MPEG-7 standardizes the content descriptions and the way they are structured, but leaves their extraction and usage open. Consequently, instead of hindering the evolution of multimedia content indexing and retrieval technologies, the standard both stimulates and benefits from new progress.

The organization of this chapter is as follows. Section 3 describes audio indexing algorithms, including audio feature extraction, segmentation, and classification. Section 4 presents different types of audio query methods. Three representative audio retrieval systems are presented in Section 5. Given the high relevance of MPEG-7 audio to this chapter, we briefly introduce it in Section 6. Finally, Section 7 indicates some future research directions and concludes the chapter.




Handbook of Video Databases: Design and Applications (Internet and Communications)
ISBN: 084937006X
Year: 2003
Pages: 393
