15.3 Tuning


After testing is complete, you are ready to deploy the system for your target user population. This is typically done in two phases. The first phase is referred to as the pilot phase, in which the system is released to a limited number of users (a few hundred to a few thousand) in the actual caller population. Data are collected, and the system is tuned using these data. Often, as many as three iterations of data collection and tuning may happen during the pilot phase. When the pilot phase is complete, the system is rolled out to the entire user base. Tuning may continue with data collected following the full rollout.

The pilot is the first opportunity to evaluate the system with real in-service data: data from real callers performing real tasks. Such data are essential for tuning recognition performance and grammar coverage. There is no way to evaluate the great variety of ways people will talk to the system until you collect data from users engaged in real, task-oriented behavior, performing tasks that matter to them.

The same is true for evaluation of the dialog design. Users will not respond in exactly the same way in an experimental situation (such as when performing an assigned task in a usability study) as they will when they are dealing with their own money or booking a real flight. In the latter, the difference between flying to Boston or Austin matters a great deal! In general, users do not exhibit the same impatience with obvious inefficiencies when they are experimental subjects as they do when they call a company trying to accomplish a specific goal.

On the other hand, there are important advantages to usability studies. In particular, the ability to talk with participants about their experience and the reasons for any problems they might have had will help you identify the root cause behind observed problems. When evaluating in-service data, despite the advantage of the authenticity of the data, you can only guess at what was going on in the callers' minds when they experienced problems. As a result, we recommend a combination of usability studies (during design and just before pilot) and analysis of in-service data, as described here.

To support the tuning process, you collect data from all calls to the pilot system, every input utterance from every caller. The data are transcribed; the actual word strings spoken by the callers are notated. These transcriptions can then be used to measure recognition accuracy and to tune grammars. In later phases, such as during tuning after full deployment, you may decide to sample a subset of call data for tuning.

Three aspects of the system are tuned: dialog, recognition, and grammar. You must analyze observed problems to figure out the root cause. For example, if a high out-of-grammar rate is observed in one particular dialog state, the cause may be missing items in the grammar, or it may be a poorly worded prompt that is leading to confusion or a poorly designed dialog strategy that leads callers into the wrong state. Poor recognition may be caused by pronunciations that are missing from the dictionary, poorly tuned endpointing parameters, or a variety of other problems.

This chapter covers dialog tuning and recognition tuning, and Chapter 16 covers grammar tuning.

15.3.1 Dialog Tuning

There are three primary approaches to dialog tuning: call monitoring, call log analysis, and user experience research. Let's look at each approach.

Call Monitoring

An essential step in understanding the performance of a system is to monitor calls. It can be extremely instructive to take the time to listen to 100 randomly selected calls. There are a number of ways to do that, depending on the platform and tools that are available.

In some cases, you can actually listen while calls are taking place. In other cases, you can make whole-call recordings: digital recordings of both sides of a call (system prompts and caller utterances) that can be played back later. If neither of these approaches is available, you can use software to re-create calls by playing back all the collected utterances from a call, interspersed with the appropriate system prompts.

The advantage of monitoring or whole-call recording is that, given the ability to hear both sides of the conversation continuously, you can more easily diagnose problems with barge-in or endpointing. However, you can also identify many dialog issues from re-created calls. In the early days of Nuance, one of the authors (MC) spent all his commutes to and from work listening to recordings of calls to deployed systems. As the traffic problems in Silicon Valley worsened, our performance improved.

When listening to re-creations of calls or recordings, you can be selective about which calls to review based on various criteria. As mentioned, you should start by listening to randomly selected calls to get a sense of how the system is doing. You can then select calls with lots of problems, calls with problems in particular areas, or calls that exercised particular features. Be careful not to choose simply "the 100 calls with the most problems" because you will likely get only the calls with babies screaming and dogs barking in the background.
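
If your platform exports a per-call index, the selection itself is easy to script. The following is a minimal sketch, assuming a hypothetical index file with call IDs and per-call problem counts: it draws a random sample for a general health check plus a separate random sample restricted to calls that had problems, rather than simply the calls with the most problems.

    # Sketch of selecting calls for review from a hypothetical call index
    # ("call_id" and "problem_count" fields): a random sample for a general
    # health check, plus a random sample restricted to calls with problems,
    # rather than simply the calls with the most problems.
    import csv
    import random

    def select_calls(index_path, n_random=100, n_problem=50):
        with open(index_path, newline="") as f:
            calls = list(csv.DictReader(f))

        random_sample = random.sample(calls, min(n_random, len(calls)))

        problem_calls = [c for c in calls if int(c["problem_count"]) > 0]
        problem_sample = random.sample(problem_calls,
                                       min(n_problem, len(problem_calls)))

        return ([c["call_id"] for c in random_sample],
                [c["call_id"] for c in problem_sample])

    random_ids, problem_ids = select_calls("call_index.csv")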

Earlier, we discussed the trade-offs between in-service data and usability data: the representative nature of real use versus the ability to probe the subject to gain insight into the reasons for problems. There is a call monitoring approach that combines the advantages of both. It consists of callbacks to users right after they have made a call to the application. Using a callback technique, you can combine the realism of in-service data with the ability to interview the caller. This approach can be very valuable, especially if it can be focused on callers who experience a particular problem that you are trying to understand.

Call Log Analysis

Most platforms allow the collection of call logs: data records about the performance of each call to the system. You can make a number of performance measurements from call logs that are useful in identifying dialog problem areas.

A task completion analysis looks at all tasks that callers attempt to perform, measuring how often they successfully complete the task and noting the reason for failure when they do not complete the task. Tasks with high noncompletion rates are candidates for more detailed analysis.
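
As an illustration, here is a minimal sketch of a task completion analysis. The log format (one task attempt per record, with task, completed, and failure_reason fields) is an assumption; adapt the loader to whatever your platform actually logs.

    # Minimal task completion analysis over a hypothetical call-log export.
    # Assumes one task attempt per record with "task", "completed", and
    # "failure_reason" fields.
    import csv
    from collections import Counter, defaultdict

    def task_completion_report(log_path):
        attempts = defaultdict(int)
        completions = defaultdict(int)
        failure_reasons = defaultdict(Counter)

        with open(log_path, newline="") as f:
            for row in csv.DictReader(f):
                task = row["task"]
                attempts[task] += 1
                if row["completed"].lower() == "true":
                    completions[task] += 1
                else:
                    failure_reasons[task][row["failure_reason"]] += 1

        # Report tasks from lowest to highest completion rate.
        for task in sorted(attempts, key=lambda t: completions[t] / attempts[t]):
            rate = completions[task] / attempts[task]
            print(f"{task}: {rate:.1%} completed ({attempts[task]} attempts)")
            for reason, count in failure_reasons[task].most_common(3):
                print(f"    failure: {reason} ({count})")

    task_completion_report("call_log.csv")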

A dropout analysis looks at where in the dialog calls end, whether by hang-ups or by requests to transfer to a live agent. States with a high dropout rate that are not normally associated with logical places to end a call are candidates for further analysis.

A hotspot analysis identifies those dialog states with a high rate of problems. A "problem" can be defined in a number of ways. For example, you may define a problem as all cases when the recognizer returned something other than a recognition result (e.g., rejects or timeouts) plus requests for help. You can even run a number of hotspot analyses using different criteria.
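
The sketch below illustrates one such hotspot analysis, using the problem definition above (rejects, timeouts, and help requests). The per-recognition-event log format is again an assumption.

    # Hotspot analysis sketch: rank dialog states by problem rate, where a
    # "problem" here is defined as a reject, a timeout, or a help request.
    # The per-event log format ("state", "event") is an assumption.
    import csv
    from collections import defaultdict

    PROBLEM_EVENTS = {"reject", "timeout", "help"}

    def hotspots(log_path, min_events=50):
        totals = defaultdict(int)
        problems = defaultdict(int)
        with open(log_path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["state"]] += 1
                if row["event"] in PROBLEM_EVENTS:
                    problems[row["state"]] += 1
        # Ignore states with too little traffic to give a meaningful rate.
        ranked = [
            (problems[s] / totals[s], s, totals[s])
            for s in totals if totals[s] >= min_events
        ]
        for rate, state, n in sorted(ranked, reverse=True):
            print(f"{state}: {rate:.1%} problem rate over {n} events")

    hotspots("recognition_events.csv")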

Once a problem area is identified, you need to find the root cause. Listening to the interaction in only that dialog state for a number of calls may shed light on the problem. It may also be useful to begin listening at a point a couple of interchanges before the problematic state. You can also look at a long list of transcriptions of things callers said when in that state.

Looking at misrecognitions and out-of-grammar rates may also be useful. If the out-of-grammar rate is high, see whether it is dominated by in-domain utterances (reasonable answers that were not covered by the grammar) or out-of-domain utterances (inappropriate answers). A high rate of out-of-domain out-of-grammars may indicate a confusing prompt or dialog strategy, whereas many in-domain out-of-grammars may be a stronger indicator that certain items are missing from the grammar.
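
The in-domain versus out-of-domain split requires a human pass over the transcriptions, but once the out-of-grammar utterances are labeled, tallying the split per dialog state is straightforward. The field names and label values in this sketch are assumptions.

    # Tally out-of-grammar utterances per dialog state by a manually assigned
    # label: "in_domain" (reasonable answer the grammar missed) versus
    # "out_of_domain" (inappropriate answer). Field names are assumptions.
    import csv
    from collections import Counter, defaultdict

    def oog_breakdown(labels_path):
        by_state = defaultdict(Counter)
        with open(labels_path, newline="") as f:
            for row in csv.DictReader(f):
                by_state[row["state"]][row["label"]] += 1
        for state, counts in by_state.items():
            total = sum(counts.values())
            in_dom = counts["in_domain"] / total
            print(f"{state}: {total} out-of-grammar, {in_dom:.0%} in-domain")

    oog_breakdown("oog_labels.csv")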

You can use pilot data to tune error recovery approaches. Review every state with more than a minimal number of rejects or timeouts. For a number of the problem calls, listen to the relevant dialog state (or start a couple of states earlier for context). Often, the predominant causes of problems in that state will become clear, and you can tune the error recovery messages (or even the main prompt for the state) accordingly.

User Experience Research

User experience research is focused on getting qualitative input about the performance of the system. A common approach is to send a survey to callers who used the system during the pilot period. On a number of projects, we have tacked a telephone survey onto the end of the application. When callers completed their task, they were asked whether they were willing to participate in a brief survey. The survey was automated. They were asked a few questions in a directed dialog that used the speech system to understand and tabulate their answers and then asked for open-ended feedback, which was recorded for later transcription. Interviews, typically over the telephone, can also be used to get qualitative input about the experience callers are having with the system.

Longitudinal studies are sometimes performed for systems that expect lots of repeat callers. These studies are designed to track the usage experience of individual users over time. Longitudinal studies may use periodic surveys or interviews, correlated with usage and performance data tracked over time for the participating users. You should use performance data to help choose appropriate participants in a longitudinal study. For example, you can purposely include callers who have used particular features, have neglected particular features, or have stopped using the system after one or two tries.

15.3.2 Recognition Tuning

The goal of recognition tuning is to optimize the accuracy of recognition for every grammar. Typically, each dialog state has its own grammar, although some grammars (e.g., yes/no grammars) may be shared among a number of dialog states.

The first step in recognition tuning is to measure the performance for every grammar. You measure recognition accuracy by comparing the transcriptions to the recognizer output for each utterance. The data are typically divided into in-grammar and out-of-grammar sets. For the in-grammar data, you measure the correct-accept, false-accept, and false-reject rates (these are defined in Chapter 13). For the out-of-grammar data, you measure the correct-reject and false-accept rates. Additionally, you measure recognition latency for each grammar. Latency is measured as the time between the end of the caller's utterance and the beginning of the next prompt played by the system. Excessive latency may indicate that the system is underprovisioned; you may need to add more servers.
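
Vendor tuning tools typically report these rates for you; the sketch below shows the underlying bookkeeping, assuming a hypothetical results file in which each record carries the transcription, the recognizer's hypothesis (empty if the utterance was rejected), and an in-grammar flag.

    # Sketch of per-grammar accuracy measurement from transcribed pilot data.
    # Assumes each record carries the transcription, the recognizer's
    # hypothesis (empty string if rejected), and an in-grammar flag.
    import csv
    from collections import Counter

    def accuracy_report(results_path):
        ig = Counter()   # in-grammar outcomes
        oog = Counter()  # out-of-grammar outcomes
        with open(results_path, newline="") as f:
            for row in csv.DictReader(f):
                rejected = row["hypothesis"] == ""
                if row["in_grammar"].lower() == "true":
                    if rejected:
                        ig["false_reject"] += 1
                    elif row["hypothesis"] == row["transcript"]:
                        ig["correct_accept"] += 1
                    else:
                        ig["false_accept"] += 1
                else:
                    oog["correct_reject" if rejected else "false_accept"] += 1

        ig_total = sum(ig.values()) or 1
        oog_total = sum(oog.values()) or 1
        for name, count in ig.items():
            print(f"in-grammar {name}: {count / ig_total:.1%}")
        for name, count in oog.items():
            print(f"out-of-grammar {name}: {count / oog_total:.1%}")

    accuracy_report("grammar_results.csv")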

Efforts to improve recognition performance are typically focused on a number of parameters that control the recognition process. Most commercially deployed recognition engines have some recognition parameters the developer can set, including the following:

  • Rejection threshold: This is the confidence level below which the recognizer will return a <reject> rather than a recognition result. As the rejection threshold is lowered, the recognizer will reject less often, thereby lowering the false-reject rate (the rejection of utterances that should have been accepted). However, at the same time the false-reject rate is lowered, the false-accept rate is raised (the choice, by the recognizer, of an incorrect recognition hypothesis). You must choose an appropriate trade-off between false accept and false reject.

  • Start-of-speech timeout: This is the maximum length of time the endpointer will listen for the beginning of speech before timing out. As the start-of-speech timeout gets longer, the system is less likely to miss a caller's response, but it takes longer for the system to realize that the caller is not responding and take appropriate action.

  • End-of-speech timeout: This is the length of silence the endpointer will listen to, after detecting speech, to decide that the caller has finished speaking. As the end-of-speech timeout gets longer, the system is less likely to mistakenly believe callers are finished speaking when, in fact, they are only pausing. However, the longer the end-of-speech timeout, the greater the delay between the end of caller speech and the system response.

  • Pruning threshold: Most commercial recognizers use some sort of pruning approach to limit the recognition search as it progresses so that it stops pursuing paths through the recognition model that are very unlikely to end up as the best-matching path when the search is finished. This speeds recognition, at the cost of occasionally introducing an error. (An error can be introduced if a path that seemed unlikely early in the search is actually the best path when the entire utterance is considered.) As the pruning threshold gets higher (i.e., fewer paths are pruned out), recognition accuracy improves, at the cost of more time (greater computational cost for recognition).

There may be other parameters associated with the recognition engines from different vendors, such as those used to optimize the choice of acoustic models or the recognition search. We do not cover such vendor-specific details here. They are covered in each vendor's documentation.

To optimize the value of recognition parameters, you run a series of recognition experiments in batch mode. Most commercial recognition vendors provide a batch mode in which you can supply the system with a set of utterances, along with their transcriptions, plus a set of parameter settings (e.g., the reject threshold). The system then runs the entire set of utterances through the recognizer, compares the recognized word strings to the transcriptions, and reports the results. By running a number of experiments for each parameter to be tuned, testing a series of values for the parameter, you can choose the best trade-off for recognition performance and behavior of your application.
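
As an illustration of the trade-off analysis itself (the batch runs are done with the vendor's tools), the sketch below sweeps a rejection threshold over confidence-scored results and reports the false-accept and false-reject rates at each setting. The result-file format is hypothetical.

    # Sketch of a rejection-threshold sweep over confidence-scored results.
    # Each record is assumed to carry the recognizer's confidence score and
    # whether the hypothesis was correct; vendor batch tools report similar
    # numbers directly from utterances plus transcriptions.
    import csv

    def threshold_sweep(results_path, thresholds=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
        with open(results_path, newline="") as f:
            rows = [(float(r["confidence"]), r["correct"].lower() == "true")
                    for r in csv.DictReader(f)]
        total = len(rows) or 1
        for t in thresholds:
            # Below threshold -> reject; at or above -> accept the hypothesis.
            false_reject = sum(1 for conf, ok in rows if conf < t and ok)
            false_accept = sum(1 for conf, ok in rows if conf >= t and not ok)
            print(f"threshold {t:.2f}: "
                  f"FR {false_reject / total:.1%}, FA {false_accept / total:.1%}")

    threshold_sweep("batch_results.csv")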

When recognition performance is lower than expected, you can make a number of adjustments in addition to the tuning of recognition parameters. Let's discuss the two most common: dictionary tuning and grammar probabilities.

Dictionary Tuning

If specific words are consistently missed by the recognizer, it is worth checking the dictionary pronunciation and listening to the data to make sure the pronunciations in the dictionary cover the pronunciations callers are using. Most vendors provide a means of adding pronunciations to the dictionary. Their documentation should describe the phonetic alphabet they use for specifying dictionary pronunciations.

Some application grammars include very long lists of items: for example, lists of company names, person names, street names, city names, and so on. Some words, especially company names, may be made-up words (e.g., "Microsoft" was not a word before the company was founded) and have nonstandard pronunciations. Many person names are foreign and therefore use different pronunciation rules than those of standard North American English. In both cases, it may be necessary to take the time to listen to the data and update the dictionary to accurately reflect the most common pronunciations being used.
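
Finding which words to review can be scripted. The sketch below compares transcriptions with recognizer output word by word and flags the words most often missing from the output; these are candidates for a dictionary check. The results-file format is an assumption, and adding the pronunciations themselves is done with the vendor's dictionary tools.

    # Sketch of flagging dictionary-review candidates: words that appear in
    # transcriptions but are frequently absent from the recognizer's output.
    # Assumes a results file with "transcript" and "hypothesis" text fields.
    import csv
    from collections import Counter

    def missed_words(results_path, min_occurrences=20, top_n=25):
        seen = Counter()
        missed = Counter()
        with open(results_path, newline="") as f:
            for row in csv.DictReader(f):
                hyp_words = set(row["hypothesis"].lower().split())
                for word in row["transcript"].lower().split():
                    seen[word] += 1
                    if word not in hyp_words:
                        missed[word] += 1
        candidates = [
            (missed[w] / seen[w], w, seen[w])
            for w in seen if seen[w] >= min_occurrences
        ]
        for rate, word, n in sorted(candidates, reverse=True)[:top_n]:
            print(f"{word}: missed in {rate:.0%} of {n} occurrences")

    missed_words("grammar_results.csv")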

Grammar Probabilities

Recognition accuracy for grammars that include long lists, such as the examples in the preceding section, can often be improved by embedding probabilities in the grammar. The probabilities represent the a priori probability that callers will say each of the items in the list.

For example, suppose a stock quote grammar includes 15,000 names of companies, indexes, and mutual funds. Users ask about some companies far more often than others. Embedding appropriate probabilities in the grammar can significantly improve recognition performance. Of course, if the probabilities are wrong, they will do more harm than good. The best way to obtain accurate probabilities is to estimate them based on a large set of data, such as the data collected during a pilot or from a deployed system. Probabilities for certain types of grammars (e.g., stocks) may change over time and therefore must be updated periodically. Most commercial recognizers facilitate probabilities, even for rule-based grammars.
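
Estimating the probabilities is largely a matter of counting how often each item was said in the pilot data. The sketch below computes relative frequencies with a small floor so that unseen items keep a nonzero probability; the resulting weights could then be attached to grammar items in whatever form your recognizer supports (for example, the SRGS weight attribute). The file formats here are assumptions.

    # Sketch of estimating a priori item probabilities from pilot data.
    # Counts how often each grammar item (e.g., company name) was actually
    # said, adds a small floor so unseen items keep nonzero probability, and
    # prints weights that could be attached to grammar items.
    import csv
    from collections import Counter

    def item_weights(transcripts_path, item_list_path, floor_count=0.5):
        with open(item_list_path) as f:
            items = [line.strip() for line in f if line.strip()]

        counts = Counter()
        with open(transcripts_path, newline="") as f:
            for row in csv.DictReader(f):
                counts[row["item"]] += 1

        total = sum(counts[i] + floor_count for i in items)
        return {i: (counts[i] + floor_count) / total for i in items}

    weights = item_weights("stock_requests.csv", "company_names.txt")
    for name, w in sorted(weights.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{name}\t{w:.6f}")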


