Audio Query Methods | Handbook of Video Databases: Design and Applications (Internet and Communications)

Besides audio content indexing, the query mechanism is the other important component of an audio retrieval system. Here, we briefly introduce the three most commonly used query methods: query by keywords, query by examples, and query by humming. We also describe the relevance feedback technique, which further improves audio retrieval performance through user interaction.

1.11 Query by Keywords

Query by keywords or key phrases for audio content basically follows the same approach used in traditional text-based information retrieval. The user provides several keywords or key phrases as search terms, and the query engine compares them with the textual information attached to the audio files in the database to determine the list of returns. Query by text requires few computational resources, yet it requires that the content of each audio file be labelled with semantically meaningful metadata. For example, songs may have titles, authors, performers, and genres, and speech signals may have transcriptions. Some of these metadata can only be created manually, while others can be produced automatically; for example, a speech signal can be transcribed by automatic speech recognition (ASR) technologies.

Query by keywords for audio takes advantage of classical information retrieval technologies. For example, word stemming can be applied so that different inflections of the same word are also retrieved. Sometimes, straightforward keyword matching is not adequate, especially when the query is short. In such a situation, the keywords in the query may not be the words used in the archive, although they share the same meaning. Automatic query expansion [5][24] adds more useful words to the query to improve the recall rate. For example, if the query is "AT&T," possible expansions are "AT and T," "telecommunication giant," and "Mike Armstrong (the CEO)". Simple expansion methods include adding the acronyms that correspond to the query keywords, or using the full names of acronyms that appear in the query. More sophisticated approaches utilize statistical relationships among words and relevance feedback from users.
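The simple, table-driven form of query expansion can be sketched as follows. This is a minimal illustration only: the table `EXPANSIONS` and the functions `expand_query` and `matches` are hypothetical names introduced here, and the table contents are taken from the "AT&T" example above. A practical system would derive expansions from word statistics or user feedback rather than a hand-written table.

```python
# Hypothetical expansion table (illustrative only); a real system would
# derive it from word co-occurrence statistics or relevance feedback.
EXPANSIONS = {
    "at&t": ["at and t", "telecommunication giant", "mike armstrong"],
}


def expand_query(query, table=EXPANSIONS):
    """Return the normalized query term followed by its known expansions."""
    key = query.strip().lower()
    return [key] + table.get(key, [])


def matches(document_text, query, table=EXPANSIONS):
    """True if any expanded form of the query occurs in the document text."""
    text = document_text.lower()
    return any(term in text for term in expand_query(query, table))
```

With this table, a transcript that mentions only "the telecommunication giant" is still returned for the query "AT&T", which is the recall improvement that expansion aims for.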

Query by keywords for audio has its own shortcomings. The transcripts produced by ASR may be inaccurate due to recognition errors or out-of-vocabulary words. If the query terms in the audio samples are not correctly recognized, the information retrieval techniques mentioned above no longer work. A promising alternative is to perform retrieval on sub-word acoustic units (e.g., phones or syllables) [5]. For example, a user query is first translated into a set of phone trigrams, and the query is then conducted by searching for all possible three-phone sequences in the lattice output by the recognizer.
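The phone-trigram idea can be illustrated with a small sketch. It makes a strong simplifying assumption: instead of a recognizer lattice, each document is represented by a single recognized phone sequence. The function names and the phone symbols in the usage note are hypothetical.

```python
def phone_trigrams(phones):
    """Return all consecutive three-phone sequences of a phone list."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def trigram_match_score(query_phones, doc_phones):
    """Fraction of the query's phone trigrams that occur in the document.

    Assumption: the document is a single phone sequence, not a lattice.
    """
    query_trigrams = phone_trigrams(query_phones)
    if not query_trigrams:
        return 0.0  # query shorter than three phones
    doc_trigrams = set(phone_trigrams(doc_phones))
    hits = sum(1 for t in query_trigrams if t in doc_trigrams)
    return hits / len(query_trigrams)
```

For a query with phones `["k", "ae", "t", "s"]`, a document in which the final phone was misrecognized still shares the trigram `("k", "ae", "t")` and receives a partial score, whereas word-level matching would miss it entirely.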

1.12 Query by Examples

Very often, an audio query is hard to formulate explicitly in words. For example, it is difficult to explain in text what audio we are looking for if we don't know its title. Query by example is a more natural way of retrieving audio content. Suppose we are looking for a musical masterpiece. We have no clue about the title, but we have a short portion of it, for example, a 10-second clip. We can then use this audio sample, normally in the form of a file, as the query object. The search engine analyzes the content of the query example, computes acoustic features, compares them with those of the audio files in the database, and then generates returns accordingly.

Because the same audio clip may exist in different formats (for example, with different sampling rates or compression methods), the search engine must extract robust acoustic features that survive various kinds of distortion. Another issue is how to speed up the search, which is especially important for large audio databases. Besides optimizing the structure of the database, a more fundamental issue is how to make the comparison between two audio files faster. Liu et al. addressed this problem in [18]. Instead of directly comparing two sets of audio features extracted from the audio files, the search engine first builds a Gaussian mixture model (GMM) for each set of audio features and then computes the distance between the two models. Models of archived audio can be built offline, where processing time is not a major concern, and the cost of training a GMM for a short query sample is negligible. Another advantage is that the comparison cost between GMMs is independent of the duration of the corresponding audio files. The number of mixture components can be chosen to trade off speed against model accuracy. Liu et al. also proposed a new parametric distance metric for efficiently determining the distance between two GMMs, which makes this approach promising for query by examples in large audio databases.
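The benefit of comparing models rather than raw feature sets can be seen in a deliberately simplified sketch. It is not the metric of Liu et al. [18], whose details are not given here. As a stand-in, each clip is modelled by a single diagonal-covariance Gaussian (a one-component GMM) and two models are compared with the symmetric Kullback-Leibler divergence, which has a closed form in this case. The function names are hypothetical.

```python
def fit_diag_gaussian(frames, var_floor=1e-6):
    """Fit per-dimension mean and variance to a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, var_floor)
        for d in range(dim)
    ]
    return means, variances


def symmetric_kl(model_a, model_b):
    """KL(a||b) + KL(b||a) for two diagonal Gaussians (closed form)."""
    (mu_a, var_a), (mu_b, var_b) = model_a, model_b
    total = 0.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        diff_sq = (ma - mb) ** 2
        total += 0.5 * (va / vb + vb / va - 2.0)
        total += 0.5 * diff_sq * (1.0 / va + 1.0 / vb)
    return total
```

Fitting touches every frame once, but `symmetric_kl` loops only over feature dimensions, so the comparison cost does not grow with clip duration. That is the property the model-based approach relies on.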

1.13 Query by Humming

Going one step further than query by examples, query by humming is the most effective way to search audio content. Without forming any query keywords or having an audio sample at hand, the user hums a short clip as the query. The challenges are twofold: robustly extracting the melodic information from the query, and efficiently matching melodies.

Ghias et al. proposed a solution for specifying a humming query and for executing queries quickly in a music database [8]. The hummed signal is first digitized, and its pitch is tracked. The melodic contour, which is the sequence of relative differences in pitch between successive notes, is then extracted. The query is transformed into a string over a three-letter alphabet (U, D, S), where U means that a note is higher than the previous note, D means that it is lower than the previous note, and S means that the two are the same. Similarly, all songs in the music database are pre-processed to convert their melodies into streams of (U, D, S) characters. A fast approximate string matching algorithm, which allows a certain number of mistakes, is then used to search for similar songs in the database.
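The contour encoding and the error-tolerant search can be sketched as below. The sketch assumes that pitch tracking and note segmentation have already produced one pitch value per note. For matching, it uses a plain dynamic-programming approximate substring search as a stand-in; it is not the specific fast algorithm adopted in [8]. The function names and the `tolerance` parameter are hypothetical.

```python
def contour(pitches, tolerance=0.0):
    """Encode a note-pitch sequence as a string over the alphabet U, D, S."""
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur - prev > tolerance:
            symbols.append("U")  # higher than the previous note
        elif prev - cur > tolerance:
            symbols.append("D")  # lower than the previous note
        else:
            symbols.append("S")  # same as the previous note
    return "".join(symbols)


def best_match_errors(pattern, text):
    """Minimum edit distance between pattern and any substring of text."""
    # Row 0 is all zeros: a match may start at any position in the text.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        cur = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            cur[j] = min(
                prev[j] + 1,             # pattern symbol missing from text
                cur[j - 1] + 1,          # extra symbol in text
                prev[j - 1] + (p != t),  # match or substitution
            )
        prev = cur
    return min(prev)
```

A song is reported as a candidate when `best_match_errors(query_contour, song_contour)` does not exceed the number of mistakes the system is willing to tolerate.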

Recently, Lu et al. proposed a new method for query by humming in music retrieval [20]. They used a triplet of pitch contour, pitch interval, and duration to represent melody information. A hierarchical matching method was used to make the matching fast and accurate. First, approximate string matching and dynamic programming are applied to align the pitch contours of the query and the candidate music segments. Then, the similarities of pitch interval and duration along the matched path are computed. The final ranking score of a candidate is a weighted sum of the two similarities. Simulations show that 74% of 42 test queries retrieve the correct song among the top three matches from a database of 1000 MIDI songs. The performance is encouraging for commercial applications of query by humming.
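Only the final step of this scheme, the weighted combination of two similarities, is illustrated below. The similarity function, the weights, and the assumption that contour alignment has already produced equal-length, non-empty lists of paired intervals and durations are all hypothetical choices made for this sketch; they are not the definitions used in [20].

```python
def similarity(query_vals, cand_vals, scale=1.0):
    """Map the mean absolute difference of aligned values into (0, 1].

    Assumes both lists are non-empty, aligned, and of equal length.
    """
    diffs = [abs(q - c) for q, c in zip(query_vals, cand_vals)]
    mean_diff = sum(diffs) / len(diffs)
    return 1.0 / (1.0 + mean_diff / scale)


def rank_score(q_intervals, c_intervals, q_durations, c_durations,
               w_interval=0.7, w_duration=0.3):
    """Weighted sum of pitch-interval and duration similarities."""
    s_interval = similarity(q_intervals, c_intervals)
    s_duration = similarity(q_durations, c_durations)
    return w_interval * s_interval + w_duration * s_duration
```

Candidates that pass the contour alignment stage would then be sorted by `rank_score` in descending order, so that the cheap contour match prunes the database and the finer comparison decides the final ranking.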

1.14 Relevance Feedback

The three query mechanisms above accept the user's query in different formats, yet they do not involve further user input in the retrieval process. Two reasons suggest that keeping the user's feedback in the retrieval loop is desirable. First, the internal low-level features used for audio retrieval, whether textual or acoustic, normally have no clear mapping to high-level query concepts, so it is difficult to capture the user's intent accurately in feature space from one short query. Second, human perception varies. Even with the same query and the same returns, different users may have different opinions of the results. When the user provides relevance feedback on a set of retrieved results by labelling them as either positive (wanted) or negative (others), the search engine has a better chance of capturing the user's search goal by analyzing these additional audio samples. The internal query term or search mechanism can then be adapted based on the feedback to refine the query results in the next iteration.

Rui et al. [30] investigated relevance feedback techniques in content-based image retrieval. Their approach is to dynamically assign weights, based on the user's relevance feedback, to the different sets of visual features that determine the difference between two images. In this way, users no longer bear the burden of specifying their queries precisely; the computer infers the user's need and adjusts the search scheme accordingly. A similar approach is also applicable to audio query systems.
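One way to realize feedback-driven feature weighting is sketched below. It assumes a common heuristic: a feature that varies little across the examples the user marked as positive is taken to be important and receives a large weight, here the inverse of its standard deviation. Whether this matches the exact update rule of [30] is not confirmed by this text, and the function names are hypothetical.

```python
import math


def feedback_weights(positive_examples, eps=1e-6):
    """Weight each feature by the inverse of its spread among positives."""
    n = len(positive_examples)
    dim = len(positive_examples[0])
    raw = []
    for d in range(dim):
        mean = sum(x[d] for x in positive_examples) / n
        var = sum((x[d] - mean) ** 2 for x in positive_examples) / n
        raw.append(1.0 / (math.sqrt(var) + eps))  # eps avoids division by zero
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))
```

After each round of feedback, the database would be re-ranked with `weighted_distance` using the updated weights, so that features on which the positive examples agree dominate the next iteration.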