2.2 The Impact of Speech Technology on Design Decisions


Why have we taken the time to describe speech technology in such detail? As you will see, this understanding influences design decisions throughout the process of creating a voice user interface. In this section, we review a few of the concrete ways you can apply this knowledge to design.

Three primary areas stand out:

  1. Performance challenges: If you understand what the technology can handle easily and what affects performance, you can best leverage the technology's strengths and design around its weaknesses.

  2. Problem solving: Problems (such as recognition errors) can arise while a caller is interacting with an application. An understanding of the nature of potential problems, and how to detect them, will help you design dialog strategies for rapid and graceful recovery, as well as help you tune the system to avoid problems.

  3. Definition files: The designer must create or modify a number of definition files (e.g., grammars, dictionaries) that are used by the recognition and natural language understanding modules. Your knowledge of how these definitions figure into the recognition and understanding process will guide you in creating them.

2.2.1 Performance Challenges

Three of the biggest challenges for recognition performance are ambiguity, limited acoustic information, and noise.

Ambiguity

As you observed in Figure 2-14, the recognizer's ability to correctly determine the spoken word string depends on finding a better match (between feature vectors and acoustic models) along the correct path than any other path. The recognizer's biggest enemy is similar-sounding paths because they can easily be confused.

A classic example is the pair of sentences "Wreck a nice beach" and "Recognize speech." Although the word strings look very different, when spoken quickly they can sound the same. A more practical example of a difficult recognition task is the alphabet. Many letters (such as B and D) rhyme with one another. Add to this the fact that the duration of the initial consonant when we say "bee" or "dee" is quite short (i.e., most of the feature vectors represent the vowel, which does nothing to distinguish the two words), and you have a challenging recognition problem. Chapter 13 covers design guidelines for domains that have recognition ambiguities.
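
To make the design response concrete, here is a small sketch (in Python; the confusable sets, threshold, and prompt wording are illustrative assumptions, not recommendations from any particular recognizer) of one common strategy: when a low-confidence result falls in a known confusable set, follow up with a disambiguating prompt.

    # Sketch: a fallback strategy for confusable letters. The confusable
    # sets, threshold, and prompt wording are illustrative assumptions.

    CONFUSABLE_SETS = [
        {"B", "C", "D", "E", "G", "P", "T", "V", "Z"},  # the "ee" rhyming set
        {"A", "J", "K"},
        {"F", "S", "X"},
    ]

    DISAMBIGUATION_WORDS = {
        "B": "boy", "D": "david", "P": "peter", "T": "tom", "V": "victor",
    }

    def needs_disambiguation(letter: str, confidence: float,
                             threshold: float = 0.80) -> bool:
        """A low-confidence letter from a confusable set warrants a follow-up."""
        in_confusable_set = any(letter in s for s in CONFUSABLE_SETS)
        return in_confusable_set and confidence < threshold

    def follow_up_prompt(letter: str) -> str:
        word = DISAMBIGUATION_WORDS.get(letter, letter)
        return f"Was that {letter} as in {word}?"

    if needs_disambiguation("B", confidence=0.62):
        print(follow_up_prompt("B"))   # -> "Was that B as in boy?"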

In general, as the vocabulary and grammar get larger, the potential for ambiguity increases. However, you cannot simply keep your grammar small or delete ambiguous items. The grammar must cover the things callers are expected to say. What's more, the worst ambiguities are often not obvious: You can judge only by actual system performance on real data. Chapter 16 provides guidelines for creating and tuning grammars.

Limited Acoustic Information

In general, shorter words and phrases are harder to recognize than longer ones. Longer words and phrases provide more acoustic information that can help in differentiating paths through the recognition model.

Let's consider an example from nationwide directory assistance. Such applications cover an extremely long list of cities from all across the country. To ease the task of recognizing cities, these applications typically begin by asking for "city and state" rather than asking only for the city name. As a result, the system gets more acoustic information to help it differentiate paths through the recognition model associated with different cities.

To see why this is helpful, consider the problem of differentiating the spoken city names "Boston" and "Austin." The only acoustic information that can distinguish these two cities is the B in "Boston." In general, realizations of B are quite short. If we imagine the utterance was 0.75 seconds long (therefore comprising 75 feature vectors[3]), it is quite likely that the B, if it was there, was represented by only 3 or 4 of the 75 vectors; in other words, it played a small role in the overall score. Conversely, if the caller said "Austin," any slight distortion at the beginning of the word (such as a lip smack) could easily have matched the B model. Clearly, differentiating "Boston, Massachusetts" from "Austin, Texas" is far easier than differentiating "Boston" from "Austin."

[3] Assuming one feature vector for every 10-millisecond segment of speech.
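
The arithmetic is worth making explicit. Here is a minimal sketch (in Python; the 35-millisecond duration assumed for the B is an illustration, not a measurement):

    # Sketch of the frame arithmetic behind the Boston/Austin example.
    # The 35 ms duration assumed for the initial B is illustrative.
    FRAME_MS = 10                  # one feature vector per 10 ms of speech
    utterance_ms = 750             # "Boston" spoken in 0.75 seconds
    b_duration_ms = 35             # a short initial B (assumed value)

    total_frames = utterance_ms // FRAME_MS       # 75 feature vectors
    b_frames = round(b_duration_ms / FRAME_MS)    # only ~4 of them carry the B

    print(f"{b_frames} of {total_frames} frames "
          f"({100 * b_frames / total_frames:.0f}%) separate the two cities")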

Noise

Noise and distortion can come from numerous sources, including environmental noise and distortions created by the phone line. Noise makes recognition harder: it adds a random factor to the feature vectors so that they no longer represent the caller's actual speech as accurately, and as a result they may match the acoustic models along the correct path less closely. The noise may also mask features that are important for matching. In general, anything that changes the feature vectors so that they are less like the data used to train the acoustic models will make recognition less accurate.
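
You can see the effect in miniature in the following sketch, where made-up numbers stand in for real feature vectors and acoustic models: random noise is added to a feature vector, and we measure how far it drifts from the model it should match.

    # Sketch: noise pushes feature vectors away from the acoustic models
    # they should match. All numbers are made up for illustration.
    import math
    import random

    def distance(a, b):
        """Euclidean distance between two feature vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    model_mean   = [1.0, -0.5, 0.3, 2.1]   # stands in for the correct model
    clean_vector = [1.1, -0.4, 0.2, 2.0]   # the caller's actual speech

    random.seed(0)
    noisy_vector = [x + random.gauss(0, 0.5) for x in clean_vector]

    print(f"clean vector to model: {distance(clean_vector, model_mean):.2f}")
    print(f"noisy vector to model: {distance(noisy_vector, model_mean):.2f}")
    # The noisy vector typically scores worse against the correct model,
    # which is exactly how noise degrades the best-path match.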

In short, all the recognition challenges we have raised (ambiguity, limited acoustic information, and noise) can be understood with reference to Figure 2-14. The better you can differentiate between the correct path and all possible others, the better the recognition performance. The more feature vectors that match well to acoustic models along the correct path, the easier an utterance is to recognize. You can apply this basic concept to help understand and ameliorate recognition challenges in new situations.

2.2.2 Problem Solving

Recognition does not always succeed. When it fails, there are a number of messages the recognizer may return to the application. For example, a reject indicates that the recognizer did not find any path that matched the input well; even the best path had a very low confidence measure. A reject may have a number of causes, such as noise or the caller saying something not included in the grammar.

A no-speech timeout indicates that the endpointer did not detect speech. This could mean that the caller said nothing, or it may be caused by poorly tuned endpointer parameters. For example, perhaps the endpointer was set to listen for only one second before giving up, and the caller hesitated longer than that before beginning to speak.

To design effective dialog strategies to recover from problems, you must understand all the possible failure messages from the recognizer and what they may indicate about the problem. (These messages are covered in Chapter 13.) During the tuning stage, you must be able to track down the underlying cause of observed problems so that you find the right solutions (e.g., tune the endpointing parameters versus reprompt the caller).
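
As a sketch of the kind of logic involved (the status names, prompts, and handler structure are illustrative assumptions, not any vendor's actual API), a dialog might branch on the failure type and the retry count:

    # Sketch: branching on the recognizer's failure type. The status names
    # and prompts are illustrative assumptions, not any vendor's actual API.

    def recovery_prompt(status: str, retries: int) -> str:
        """Choose a recovery strategy from the failure type and retry count."""
        if retries >= 2:
            # Repeated failures: escalate rather than looping forever.
            return "Let me get someone to help you."
        if status == "reject":
            # No path matched well: the caller may have said something
            # outside the grammar, or noise corrupted the input.
            return "Sorry, I didn't catch that. You can say, for example, ..."
        if status == "no_speech_timeout":
            # The endpointer heard nothing: the caller may have stayed silent,
            # or the timeout parameter may be tuned too aggressively.
            return "I didn't hear anything. When you're ready, just say ..."
        return ""  # success: no recovery needed

    print(recovery_prompt("reject", retries=0))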

2.2.3 Definition Files

Figure 2-8 shows three files used by the recognition model: the acoustic models, the dictionary, and the grammar. As you have seen, these three files play crucial roles in defining how the recognizer works. In addition, there is usually some type of configuration file that sets various parameters of the process (e.g., how long the endpointer waits for a sound before it gives up). In this section, we discuss the role of the VUI designer with respect to each of these definition files.

Grammar

Much of what has been written about VUI design treats mainly the output side of the interface: everything that controls what the system will say to the caller (e.g., prompts and call flow). However, when you deploy a system, you must also design the input side, defining everything that the caller can say to the system (the grammar). These are the two sides of the conversation: input and output, caller speech and system speech.

Grammar is perhaps the place where VUI design and technology are most intimately entwined. The output cannot be designed in isolation from the input. In fact, there is a close relationship between what a prompt says and what the caller ends up saying to the system. Many people have noted the correlation between the wording of a prompt and the words chosen by the caller in response (Baber 1997). Even when designers of prompts and call flow are lucky enough to have someone else write the grammar, they still must understand the issues of grammar: how prompt wording and dialog strategy determine grammar needs, how grammar possibilities may constrain the design of prompts, and so on. As discussed earlier, even the choice of grammar type (rule-based versus a statistical language model) will have a significant impact on VUI design decisions. You can make that choice only by combining insights about your end users, application, and business needs with an understanding of the technology.
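
To illustrate the coupling, here is a deliberately tiny sketch (the prompt, the phrasings, and the exact-match scheme are illustrative assumptions; real grammars are written in vendor-specific formats with far broader coverage):

    # Sketch: how the prompt's wording shapes the grammar that must cover
    # the replies. Prompt, phrasings, and matching scheme are illustrative.

    PROMPT = "Would you like checking or savings?"

    # A tiny rule-based grammar: each slot value lists phrasings callers
    # are likely to use in response to this particular prompt.
    GRAMMAR = {
        "checking": ["checking", "checking account", "my checking",
                     "checking please"],
        "savings":  ["savings", "savings account", "my savings",
                     "savings please"],
    }

    def parse(utterance: str):
        """Return the slot value whose phrasing matches the utterance."""
        text = utterance.lower().strip()
        for value, phrasings in GRAMMAR.items():
            if text in phrasings:
                return value
        return None  # out of grammar: this will surface as a reject

    print(parse("my checking"))    # -> "checking"
    print(parse("the first one"))  # -> None

The second call returns None: nothing in this prompt invites "the first one." But if the prompt were reworded as a list ("Which would you like: first, checking, or second, savings?"), callers would say exactly that, and the grammar would have to anticipate it.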

Chapter 5 discusses the choice of grammar type, and Chapter 12 examines the implications of that choice on the details of prompt design. Chapter 16 discusses the development and tuning of grammars.

Dictionary

Most speech technology vendors supply large dictionaries with their products, covering the vast majority of words and pronunciations in the languages they handle. For most applications, the designer need not touch the dictionary. Chances are, the words in your grammar will already be there, with correct pronunciations.

In a few cases, however, it is appropriate to supplement the dictionary. Some applications may have unusual words that must be recognized. For example, a system that provides driving directions may need to recognize all the street names in a big city. Some street names are unusual words and may not be in the default dictionary. They must be added. (Some technology vendors provide tools to automatically determine the pronunciation of new words based on their spelling.) In other cases, callers may use an unexpected pronunciation for some words, warranting the addition of pronunciations to the dictionary. These pronunciation additions should be made only during the tuning stage, based on observations of real usage patterns. Chapter 15 discusses how to tune dictionary entries.
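
As an illustration (the street names, phoneme symbols, and helper function below are assumptions made for the sketch; real dictionary formats are vendor specific):

    # Sketch: supplementing the vendor dictionary with unusual street names.
    # The entries and phoneme symbols below are illustrative assumptions.

    custom_pronunciations = {
        # word           list of pronunciations (as phoneme strings)
        "gough":        ["g ao f"],                  # spelled like "go,"
                                                     # said like "goff"
        "kosciuszko":   ["k aa z iy ah s k ow",      # common local rendering
                         "k ao sh ch uw sh k ow"],   # closer to the Polish
    }

    def pronunciations(word: str, vendor_dictionary: dict) -> list:
        """Custom entries extend, rather than replace, the vendor dictionary."""
        return (vendor_dictionary.get(word, [])
                + custom_pronunciations.get(word, []))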

Acoustic Models

It is almost never necessary to alter the acoustic models supplied by the vendor. In general, the vendor has trained a set of acoustic models on a large data set, covering many domains. The default acoustic models should work well out of the box. In fact, vendors do not usually provide a means for designers or developers to alter (retrain) acoustic models. That is a good thing. Done wrongly, altering acoustic models is more likely to hurt than help performance.

Some vendors supply a mechanism for acoustic model adaptation, which automatically refines the acoustic models for the application and domain without the need for intervention by the designer or developer of the application. If offered, that is the best way to ensure optimal performance.

Configuration Files

There may be other definition files that are used to control the recognition and understanding process. For example, you may choose the confidence level below which the recognizer will reject the input rather than return an answer. You may also choose parameters for the endpointer (e.g., how long it should listen before timing out). In general, default values are supplied for all these parameters, and they need to be changed only in special cases. If, for example, the system is about to execute a stock transaction, you want to be very sure the caller has confirmed it. In that case, you may want to raise the rejection threshold so that only high-confidence recognitions are accepted.
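
As a final sketch (the parameter names and values are illustrative assumptions, not a vendor's actual configuration format), a designer might tighten the threshold for just the high-stakes dialog state:

    # Sketch: recognition parameters a designer might adjust per dialog
    # state. Names and values are illustrative, not a vendor's actual API.

    DEFAULT_PARAMS = {
        "rejection_confidence_threshold": 0.45,  # below this, return a reject
        "no_speech_timeout_ms": 5000,            # endpointer gives up after 5 s
    }

    # A high-stakes confirmation (e.g., executing a stock trade) raises the
    # threshold, so borderline recognitions are rejected and re-asked rather
    # than acted on.
    CONFIRM_TRADE_PARAMS = {**DEFAULT_PARAMS,
                            "rejection_confidence_threshold": 0.70}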


