18.3 Tuning


The goal of the tuning phase is to have real customers interact with the system and provide enough data to help us tune the grammars (both rule-based and SLM), recognition parameters, dictionary, and of course the dialog flow. Pilot studies often present clients with a Catch-22: they are reluctant to let customers interact with an untuned system, but to tune the system you need real customer data. The Lexington executives are comfortable with the results of the recognition test and the evaluative usability test, so they are ready to move forward with the pilot. We do manage their expectations and explain that SLM accuracy in the early pilot will not be as high as what we will ultimately achieve once we have application-specific training data.

The pilot is scheduled to run for eight weeks. For the first four weeks, Lexington will redirect 10 percent of its touchtone traffic to the speech system, resulting in approximately 20,000 calls per week. If all goes well, we will increase it to 20 percent for the next four weeks. We will send out user surveys at the end of week 3 and again, to different users, at the end of week 8. At that point, if both our quantitative measures of system performance and the subjective results of our user survey achieve our success metrics, the system will be fully rolled out.

All the pilot data are logged, and waveforms are saved for all user utterances. Our transcription team is geared up, so there is only a one-day lag from the time a phone call comes in until it is transcribed and ready for our analysis.

18.3.1 Dialog Tuning

We begin the tuning process by evaluating the performance of the dialog. We start by using a call monitoring tool to listen to about 100 randomly selected calls. Next, we use a tool that lets us set criteria for which calls to review; it reconstructs each call from the logs, playing the recorded caller utterances interleaved with recordings of the prompts.
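In practice, the selection step is little more than a filter over the call logs, and reconstruction is a matter of playing the logged audio in order. The sketch below shows the idea; the log format, field names, and audio player are illustrative assumptions, not the actual tool. As an example criterion, it selects calls in which more than one trade was placed.

  # Sketch of call selection and reconstruction from logs.  The log format,
  # field names, and the audio player are illustrative assumptions.
  import json
  import subprocess

  def load_calls(log_path):
      """Each line of the log is one call: a JSON object with a list of events."""
      with open(log_path) as f:
          return [json.loads(line) for line in f]

  def placed_multiple_trades(call):
      """Example selection criterion: the caller completed more than one trade."""
      return sum(1 for e in call["events"] if e.get("action") == "trade_confirmed") > 1

  def play(path):
      subprocess.run(["play", path])   # any command-line audio player

  def reconstruct(call):
      """Play the call back as the caller heard it: prompt recordings
      interleaved with the logged caller utterances."""
      for event in call["events"]:
          if event["type"] == "prompt":
              play(event["prompt_audio"])
          elif event["type"] == "utterance":
              play(event["waveform_path"])

  selected = [c for c in load_calls("pilot_week1.log") if placed_multiple_trades(c)]
  for call in selected[:30]:
      reconstruct(call)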

We select calls in which multiple trades were placed. The result of our usability test is corroborated on real call data: First-time callers often place their first trade step-by-step, allowing the system to lead them through the series of requests for the number of shares, the price, and so on. Then, after hearing the just-in-time instruction describing the more efficient way to specify their trades, they place their next trade more efficiently using a more complex sentence. We listen to 30 such calls. Of these, 28 go step-by-step on their first trade. Of those 28, 22 use more complex sentences on their second trade.

By the third week of the pilot, we have 8,142 repeat callers (customers who have called at least twice). Of these, 863 do trades on both their first and second call. Again, the just-in-time instruction has worked; 748 of those callers use complex sentences for their trades on their second call (whereas only 41 did on the first trade of their first call). We conclude that the just-in-time instruction is working both within and across calls.[1]

[1] These and other results on the Lexington Brokerage sample application have been fashioned to illustrate the application design and tuning process, along with ways of thinking about how to solve problems when they are discovered. These results should not be interpreted as specific results of actual valid scientific studies or a specific real deployment.
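Both the within-call and across-call comparisons come straight out of the same logs once each trade has been tagged during transcription as step-by-step or single-sentence. Here is a sketch of the bookkeeping; the "style" tag and the other field names are assumptions, not the actual log format.

  # Sketch of the repeat-caller analysis; assumes each confirmed trade in the
  # log carries a "style" tag ("step_by_step" or "single_sentence") added
  # during transcription.
  from collections import defaultdict

  def first_two_calls(calls):
      """Yield (first call, second call) for every caller with at least two calls."""
      by_caller = defaultdict(list)
      for call in calls:
          by_caller[call["caller_id"]].append(call)
      for caller_calls in by_caller.values():
          caller_calls.sort(key=lambda c: c["start_time"])
          if len(caller_calls) >= 2:
              yield caller_calls[0], caller_calls[1]

  def trades(call):
      return [e for e in call["events"] if e.get("action") == "trade_confirmed"]

  def just_in_time_effect(calls):
      traded_both = complex_on_first = complex_on_second = 0
      for first, second in first_two_calls(calls):
          if trades(first) and trades(second):
              traded_both += 1
              complex_on_first += trades(first)[0]["style"] == "single_sentence"
              complex_on_second += trades(second)[0]["style"] == "single_sentence"
      return traded_both, complex_on_first, complex_on_second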

Next, we do a hotspot analysis. We look for dialog states with a high rate of rejects, timeouts, or requests for help. It turns out that the GetPIN state has a high reject rate. In fact, many callers are hanging up after one or two rejects in the GetPIN state. We listen to the login dialog for 25 calls that have rejects followed by hang-ups in this state. It turns out that 21 of them seem to be cases in which the caller does not remember the PIN. For example, some of the inputs include, "I don't know," "Um, I can't think of it," and "I don't remember." Some others seem to be guessing; they try one or two numbers that do not match and then hang up.
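The hotspot analysis itself is a simple aggregation over the logs: for every dialog state, tally how often recognition attempts end in a reject, a timeout, or a help request, and flag the worst offenders. A sketch, with illustrative log fields:

  # Sketch of the hotspot analysis: for each dialog state, tally how often a
  # recognition attempt ends in a reject, a timeout, or a help request, and
  # how often a reject is immediately followed by a hang-up.  Field names are
  # illustrative assumptions.
  from collections import Counter, defaultdict

  def hotspots(calls):
      visits = Counter()
      outcomes = defaultdict(Counter)
      for call in calls:
          events = call["events"]
          for i, e in enumerate(events):
              if e["type"] != "recognition":
                  continue
              state = e["state"]                  # e.g. "GetPIN", "GetAccountNumber"
              visits[state] += 1
              outcomes[state][e["outcome"]] += 1  # "accept", "reject", "timeout", "help"
              if (e["outcome"] == "reject" and i + 1 < len(events)
                      and events[i + 1]["type"] == "hangup"):
                  outcomes[state]["reject_then_hangup"] += 1
      return {state: {k: v / n for k, v in outcomes[state].items()}
              for state, n in visits.items()}

  # States sorted by reject rate, worst first:
  # sorted(hotspots(calls).items(), key=lambda kv: -kv[1].get("reject", 0))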

You can see the problem with our design by looking back at the call flow design for the Login subdialog (Figure 14-2 in Chapter 14). We accommodated callers who did not know their account numbers by recognizing inputs such as "I don't know" and transferring them to an agent who could provide an alternative means to get into their account, but we did not provide the same capability for those who did not know their PIN. Our pilot data show that this is a problem for PINs as well.

Therefore, we change the call flow for the Login subdialog as shown in Figure 18-1, adding a path that transfers callers who do not know their PIN to an agent who can help them. We add the Unknown subgrammar, shown earlier in the grammar for the GetAccountNumber state, to the GetPIN state. Additionally, in the first reject message in the GetPIN state, in response to a request for help, and when the caller reenters the state after a nonmatching PIN, we tell callers they can say "I don't know" to be connected to someone who can help them. After the change is implemented and deployed in the pilot system, we collect more data and repeat the hotspot analysis. The data show that the problem is fixed.

Figure 18-1. The revised call flow for the Login subdialog shows a new way of handling unknown PINs.

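As behavior, the revised GetPIN state can be summarized with the following Python sketch; this is not the platform's actual call-flow notation, and the helper names and prompt wording are illustrative.

  # Sketch of the revised GetPIN behavior; the real application expresses this
  # in the dialog platform's call-flow format.  Helper names are illustrative.
  def get_pin_state(recognize, play, transfer_to_agent, pin_matches,
                    reentering_after_mismatch=False, max_rejects=2):
      hint = ("If you don't know your PIN, you can say 'I don't know' "
              "and I'll connect you to someone who can help.")
      prompt = "Please say your PIN."
      play(prompt + (" " + hint if reentering_after_mismatch else ""))
      rejects = 0
      while True:
          result = recognize(grammar="GetPIN")        # grammar now includes Unknown
          if result.interpretation == "unknown":      # e.g. "I don't know"
              return transfer_to_agent(reason="unknown_pin")
          if result.interpretation == "help":
              play(prompt + " " + hint)               # hint added to the help message
              continue
          if result.status == "accept" and pin_matches(result.value):
              return result.value
          rejects += 1
          if rejects > max_rejects:
              return transfer_to_agent(reason="pin_trouble")
          play("Sorry, I didn't get that. " + hint)   # hint in the first reject message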

We are surprised to learn how large a percentage of callers know their account number but do not remember their PIN. The Lexington team is also surprised. We agree to make speaker verification a high-priority feature for the next version of the system. In that way, customers will not have to remember a PIN; instead, their voiceprint will provide security. It is interesting to note that the lesson about the shortcomings of the GetPIN state could have been learned only from real usage by real callers. A usability study with made-up tasks would not have uncovered such a problem.

The next step in our dialog tuning process is an evaluation of our error recovery and help messages. We use a hotspot analysis to find all states with more than a 10 percent reject rate. For each such state, we listen to a number of the interactions that had rejects. One thing we learn is that in the directed-dialog portion of trading, when callers are answering the question about the number of shares they want to buy, they often try to check their balance or say things like "Oh, I'd better see how much I have in my account . . . " Therefore, we add phrases for balance queries to the grammar for that state. Additionally, we add to the help and reject messages the sentence, "You can also ask, 'What is my balance?'" Logic is added to the call flow to respond to such a request with the amount of money in the caller's default account, followed by a reprompt for the number of shares.
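The added logic for the number-of-shares question looks roughly like the following Python sketch; the grammar name matches the application's, but the helper functions and prompt wording are illustrative.

  # Sketch of the added call-flow logic for the number-of-shares question:
  # a balance query is answered and then the caller is reprompted.
  def get_number_of_shares(recognize, play, default_account_balance):
      play("How many shares?")
      while True:
          result = recognize(grammar="GetNumberofShares")  # now also covers balance queries
          if result.status == "reject" or result.interpretation == "help":
              play("Please tell me the number of shares you'd like to trade. "
                   "You can also ask, 'What is my balance?'")
          elif result.interpretation == "balance_query":
              play("You have %s in your account. Now, how many shares?"
                   % default_account_balance())
          else:
              return result.value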

18.3.2 Recognition Tuning

For each of the states that use rule-based grammars, we divide the data into in-grammar and out-of-grammar sets. On the in-grammar sets, we measure the rates of correct accepts, false accepts, and false rejects. On the out-of-grammar sets, we measure correct rejects and false accepts.
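The bookkeeping behind these measures is straightforward once each logged utterance has been transcribed and labeled with its correct (reference) interpretation or marked out-of-grammar. Here is a sketch; the field names are assumptions rather than the format of any particular tool.

  # Sketch of the accuracy bookkeeping for one state.  Each test utterance is
  # assumed to carry a transcription-based reference (or the marker "OOG"),
  # plus the recognizer's hypothesis and accept/reject decision.
  def score(utterances):
      counts = {"in_grammar": {"correct_accept": 0, "false_accept": 0, "false_reject": 0},
                "out_of_grammar": {"correct_reject": 0, "false_accept": 0}}
      for u in utterances:
          if u["reference"] == "OOG":
              bucket = counts["out_of_grammar"]
              bucket["correct_reject" if u["decision"] == "reject" else "false_accept"] += 1
          else:
              bucket = counts["in_grammar"]
              if u["decision"] == "reject":
                  bucket["false_reject"] += 1
              elif u["hypothesis"] == u["reference"]:
                  bucket["correct_accept"] += 1
              else:
                  bucket["false_accept"] += 1   # accepted, but with the wrong interpretation
      return counts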

To adjust the recognition parameters, we run parameter sweeps: a series of recognition tests, in batch mode, varying the parameter we are tuning. For example, we run a series of recognition tests for the GetNumberofShares state in the Trading subdialog, each with a different pruning threshold. We look at the result in terms of the rate of correct accepts and the average time for recognition in order to choose a value that optimizes the trade-off between accuracy and time. We use a parameter sweep of the reject threshold to adjust the trade confirmation state, choosing a threshold that makes it extremely unlikely we will have a false accept of "yes."
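A parameter sweep is simply the batch recognition test run repeatedly with one parameter varied. The sketch below shows the shape of such a sweep; the batch-test tool itself is passed in as run_batch, a stand-in rather than a real API, and the example threshold values in the comments are arbitrary.

  # Sketch of a parameter sweep: rerun the batch recognition test at several
  # values of one parameter and record accuracy against average recognition
  # time.  run_batch stands in for the recognizer's batch-test tool.
  import time

  def sweep(run_batch, utterances, parameter, values):
      results = []
      for value in values:
          start = time.perf_counter()
          hypotheses = run_batch(utterances, **{parameter: value})
          seconds_per_utt = (time.perf_counter() - start) / len(utterances)
          correct = sum(h["decision"] == "accept" and h["hypothesis"] == u["reference"]
                        for h, u in zip(hypotheses, utterances))
          results.append({parameter: value,
                          "correct_accept_rate": correct / len(utterances),
                          "seconds_per_utterance": seconds_per_utt})
      return results

  # e.g. sweep(run_batch, shares_testset, "pruning", [800, 1000, 1200, 1400])
  # or   sweep(run_batch, confirm_testset, "reject_threshold", [0.3, 0.45, 0.6, 0.75])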

We find that recognition accuracy for company names is not as high as we expected. Further analysis shows that there are 12 companies we almost never get right, all of them names based on foreign words. We listen to a sample of the utterances and adjust the dictionary entries accordingly. Further testing shows that the fix has worked.

18.3.3 Grammar Tuning

We begin our grammar tuning by looking at the out-of-grammar (OOG) utterances for each of our rule-based grammars. The GetAccountNumber state has quite a few OOGs. Analysis of the data shows that the OOGs are a result of three problems (a simple categorization of the transcriptions, sketched after the list, makes the breakdown easy to quantify):

  1. Many callers say "dash" between digit positions 4 and 5, and again between digit positions 8 and 9. A quick check reveals that callers are simply replicating the format they see on their account statements. The 12-digit account numbers are printed in three groups of four digits each, separated by dashes.

  2. Callers have a tendency to pause after every group of four digits. In some cases, the pauses are long enough to trigger the end-of-speech detector, causing the recognizer to stop listening and therefore missing the last four or last eight digits.

  3. Some callers are using natural numbers. For example, for the four-digit sequence 2347 some callers say "twenty-three forty-seven," for the sequence 3400 some callers say "thirty-four hundred," and for 5000 some say "five thousand." Our grammar accommodates only digits.
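A breakdown like this one can be quantified by categorizing the OOG transcriptions automatically. Here is a sketch; the word lists and patterns are illustrative, not exhaustive.

  # Sketch of categorizing the GetAccountNumber out-of-grammar transcriptions
  # into the three problems above.
  import re
  from collections import Counter

  DIGIT_WORDS = {"zero", "oh", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine"}
  NATURAL_NUMBER_WORDS = {"twenty", "thirty", "forty", "fifty", "sixty",
                          "seventy", "eighty", "ninety", "hundred", "thousand"}

  def categorize(transcription):
      words = re.findall(r"[a-z']+", transcription.lower())
      categories = set()
      if "dash" in words:
          categories.add("says dash")
      if set(words) & NATURAL_NUMBER_WORDS:
          categories.add("natural numbers")
      if len([w for w in words if w in DIGIT_WORDS]) in (4, 8):
          categories.add("cut off at a pause")   # endpointer stopped after 4 or 8 digits
      return categories or {"other"}

  def breakdown(oog_transcriptions):
      counts = Counter()
      for t in oog_transcriptions:
          counts.update(categorize(t))
      return counts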

To solve the problem of OOGs in which callers say "dash," we add the word "dash" optionally to the grammar between digit positions 4 and 5 and between digit positions 8 and 9. Here is the new version of the AccountNumber subgrammar for the .GetAccountNumber grammar:

  AccountNumber
     (Digit:d1 Digit:d2 Digit:d3 Digit:d4 ?dash
      Digit:d5 Digit:d6 Digit:d7 Digit:d8 ?dash
      Digit:d9 Digit:d10 Digit:d11 Digit:d12)
     {return (strcat($d1 strcat($d2 strcat($d3 strcat($d4
              strcat($d5 strcat($d6 strcat($d7 strcat($d8
              strcat($d9 strcat($d10 strcat($d11 $d12))))))))))))}

To solve the problem of pauses causing the endpointer to detect end-of-speech before the caller is actually finished, we adjust the end-of-speech timeout parameter to listen for a longer pause before assuming that the caller has finished speaking.

We consider two solutions to the problem that some callers use natural numbers. The first possible solution is to augment the grammar to include natural numbers. The second solution is to leave the grammar as is and make it clear in the first reject prompt that callers should "say the digits" of their account number. We decide to use the second approach. Given our desire to achieve very high recognition accuracy for account numbers, we don't want to add the natural numbers to the grammar, something that would make recognition more challenging.

After the changes are implemented and installed in the pilot system, we collect more data. We are able to verify that our three fixes for the account number OOG problems are effective.

The handling of the OOG problems for the GetAccountNumber state is a good illustration of the variety of approaches that can be brought to bear to fix what initially appear to be "grammar" problems: In one case we extended the grammar, in another we adjusted a recognition parameter, and in the final case we changed the wording of an error prompt. In general, when OOGs are observed, you must first determine the root cause of the problem. In some cases, once you understand the cause, the solution will be obvious. In other cases, you may need to consider trade-offs between solutions.

The remainder of our grammar tuning goes smoothly. We discover a few alternative ways to refer to certain companies. For example, some callers refer to IBM as "Big Blue." That gets added to the grammar.

When we have the data from the first week of the pilot, we measure recognition accuracy for states that use the SLM. It is roughly what we expect. For the remainder of the pilot, we train a new SLM every two weeks, with increasing amounts of pilot data, and install it on the system. In the first month we see substantial improvements given the increased training data for the SLMs. In the second month, the improvements continue, although they are smaller.
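The retraining cycle itself is mechanical: every two weeks we pool all the transcribed pilot data collected so far, train a new model, and check it against a held-out test set before installing it. A sketch, with the trainer and the batch test passed in as stand-ins for the toolkit's actual commands:

  # Sketch of the SLM retraining cycle.  train_slm and evaluate stand in for
  # the toolkit's actual training and batch-test commands.
  def retraining_cycles(train_slm, evaluate, transcripts_by_week, heldout, step=2):
      history = []
      for week in range(step, max(transcripts_by_week) + 1, step):
          training_data = [sentence
                           for w in range(1, week + 1)
                           for sentence in transcripts_by_week.get(w, [])]
          model = train_slm(training_data)
          history.append({"week": week,
                          "training_sentences": len(training_data),
                          "accuracy": evaluate(model, heldout)})
      return history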

18.3.4 User Survey

Lexington plans to use a survey to gather subjective data from its customers in two phases. The first set of surveys is sent at the end of week 3 of the pilot. The second set, with identical questions, is sent at the end of week 8 of the pilot to a different set of customers.

The surveys include a set of Likert-scaled questions related to the success criteria, as well as a few questions designed to gather open-ended feedback. As discussed in Chapter 7, this survey is based on the survey the firm has used to evaluate its touchtone system, augmented with questions about accuracy. The goal is to meet all success criteria by the end of the pilot and final survey and then proceed with the full deployment.

At the end of pilot week 3, Lexington sends the survey to 5,000 of its customers who called the system for the first time during that week. In that way, they will have experienced the system after the first iteration of improvements and retraining of the SLM. Two weeks later, responses have been received from 1,254 customers.

The responses, in general, are quite positive, with one exception: The subjective impression of accuracy is not as high as we expected. This is surprising given the high accuracy we are measuring on the data we are gathering from the system. Many of the respondents who rated accuracy low commented on it in the open-ended questions. Their comments help uncover the problem. The following example illustrates the issue:

I got a quote on a company and then placed a market order and realized that the market price had changed by a whopping 8%. The quote was delayed, but the trade was based on real time prices.

The quotes the system provides are delayed by as much as 20 minutes. As a result, customers are not placing trades at the market price they expect. The important lesson from this result is that, to the end user, accuracy is a whole-service concept: a system seems accurate when it consistently does what the caller expects. Measurements we make in the lab, such as the correct-accept rate on in-grammar utterances, are only part of the story. Every other piece that determines whether the system faithfully fulfills a caller's request also shapes the caller's perception of accuracy and reliability.

We discuss the problem with Lexington. It turns out that the company from which it licenses its quote feeds offers two pricing schemes: one for real-time quotes, and a cheaper one for delayed quotes. Lexington has been using real-time quotes at its brokers' desks and using delayed quotes for the touchtone system. It has never been a problem in the past because trades are not available from the touchtone system. Luckily, the integration is the same whether you license real-time or delayed quotes. Within a week, the speech system is running with real-time quotes.

At the end of week 8 of the pilot, Lexington sends the survey to another 5,000 customers. This time, it chooses customers who used the system for the first time in week 8. Therefore, these respondents have experienced the fully tuned system. Two weeks later, responses have been received from 1,139 customers. The results are very positive.

While waiting for the survey results, we transcribe the data collected during the final week of the pilot and run all our tuning tests. The final tuning reports are submitted to Lexington. At a follow-up meeting, we all agree that both the objective measures in our tuning reports and the subjective measures in the survey show that the system is performing well. We have met all the success criteria established during requirements definition. Lexington feels ready to fully deploy the system. We all go out to dinner to celebrate and discuss future projects!


