18.2 Testing


The code for the application is finally complete and goes through several rounds of bug fixes. As a part of this process, our dialog designer performs a dialog traversal test, calling into the system and exhaustively traversing every dialog state and identifying any areas where the application behavior does not match the logic defined in the dialog specification. The application also undergoes QA tests and load tests to ensure that there are no problems and that the fully integrated system is provisioned correctly.

After application testing, we have a handful of employees call into the system to check speech recognition performance. We supply the volunteers with a list of things to say to the system and also let them talk to it without a structured script. To test recognition performance for the states that use SLMs, we use the test set we set aside earlier. We judge recognition performance to be within the expected range; no special problems are noted.
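
To make the batch evaluation concrete, here is a minimal sketch of scoring a held-out test set by word error rate, the usual metric for SLM states. It is illustrative only: the (reference, hypothesis) pairs are assumed to come from whatever batch recognition tool is in use, and the function names are our own.

```python
# Illustrative sketch: scoring a held-out SLM test set by word error rate (WER).
# The (reference, hypothesis) pairs are assumed to come from a recognizer
# running in batch mode; nothing here depends on a particular tool.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def corpus_wer(pairs):
    """Average WER across all (reference, hypothesis) pairs in the test set."""
    return sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
```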

After we have written grammars for all the dialog states in the specification, we are ready to test them. First, we use the test suites we created to ensure that the grammars cover everything we intended. We also make sure that each test utterance returns the correct slot values. Using other grammar testing tools, we check that the grammars do not unintentionally include utterances a caller would never say, such as "I want quotes for my the holdings." We also eliminate any ambiguity in the grammar that the system is not designed to handle.
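
The kind of check the test suites perform can be sketched as follows. The parse function is a stand-in for whichever grammar tool is in use, and the utterances and slot values are invented for illustration, not taken from the actual grammars.

```python
# Illustrative sketch of a grammar test harness. parse() stands in for
# whatever grammar tool is in use; it is assumed to return a dict of slot
# values for an in-grammar utterance and None for an out-of-grammar one.

TEST_SUITE = [
    # (utterance, expected slot values); None means the utterance should
    # be rejected, which catches over-generation.
    ("get a quote for cisco",           {"action": "quote", "company": "CSCO"}),
    ("buy two hundred shares of intel", {"action": "buy", "quantity": 200,
                                         "company": "INTC"}),
    ("i want quotes for my the holdings", None),
]

def run_suite(parse, suite=TEST_SUITE):
    """Return every utterance whose parse disagrees with the expectation."""
    failures = []
    for utterance, expected in suite:
        actual = parse(utterance)
        if actual != expected:
            failures.append((utterance, expected, actual))
    return failures
```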

Finally, we run a pronunciation test on company names to make sure that the dictionary entries are reasonable. Given the large number of company names, we decide to use the company-name prompt recordings for this test. We feed them to the recognizer, running in batch mode, using the company-name grammar. A few company names are rejected, and a quick fix to the dictionary solves the problem.
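
In outline, the batch test looks something like the sketch below. The recognize function is a hypothetical stand-in for the recognizer running in batch mode with the company-name grammar loaded; the directory layout, file naming, and confidence threshold are all assumptions made for illustration.

```python
# Illustrative sketch of the batch pronunciation check. recognize() is a
# hypothetical stand-in for the recognizer running in batch mode with the
# company-name grammar; it is assumed to return (company, confidence),
# or None when the utterance is rejected.

import glob
import os

CONFIDENCE_THRESHOLD = 0.5  # assumed rejection threshold

def batch_pronunciation_test(recognize, recordings_dir="company_prompts"):
    """Flag recordings the recognizer rejects or misrecognizes; each flag
    points at a dictionary entry that probably needs fixing."""
    flagged = []
    for path in sorted(glob.glob(os.path.join(recordings_dir, "*.wav"))):
        expected = os.path.splitext(os.path.basename(path))[0]  # "cisco.wav" -> "cisco"
        result = recognize(path)
        if result is None or result[1] < CONFIDENCE_THRESHOLD or result[0] != expected:
            flagged.append((path, result))
    return flagged
```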

18.2.1 Evaluative Usability Testing

We run an evaluative usability test. This time, rather than using a Wizard of Oz approach, we run the tests on the real, fully integrated system. This is our first opportunity to observe people using the actual system.

The format of the tests is similar to that of the earlier iterative usability tests. We design a set of task scenarios for participants to run. This time, we have them exercise additional functionality, including setting up a watch list and changing an existing open order. We create fake customer accounts for them to use. We use the same debriefing questions as in the earlier tests, with the addition of questions about accuracy. Again, we run the test over the phone, calling participants at home.

On the first day of testing, we run seven participants. We discover a timing issue that causes problems for five of the seven. It turns out that there is significant latency in the system response after a caller enters an account number. The latency comes from the checksum test applied to each item on the N-best list, combined with the database lookup; the delay is often close to five seconds. Our participants are confused by the delay; they are not sure the system has heard them. After a few seconds of silence, some of them say "Hello?" and others repeat their account number. This kind of timing problem could not have been found until we were testing with the actual system.
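
The validation step behind the delay works roughly as sketched below: every hypothesis on the N-best list is checksum-tested, and the survivors are looked up in the account database, so the lookups dominate the response time. A Luhn-style check digit is assumed here purely for illustration; the brokerage's actual checksum scheme may differ.

```python
# Illustrative sketch of the account-number validation behind the delay.
# A Luhn-style check digit is assumed; the actual checksum may differ.

def checksum_ok(account_number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, mod 10."""
    if not account_number.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(account_number)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def pick_account(n_best, db_lookup):
    """Return the first N-best hypothesis that passes the checksum and
    exists in the database. Each db_lookup() call adds latency, which is
    why the total response time approaches five seconds."""
    for hypothesis in n_best:
        if checksum_ok(hypothesis) and db_lookup(hypothesis):
            return hypothesis
    return None
```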

We decide to add the following prompt, to play immediately after a caller finishes saying the account number: "Please hold while I access your account." The voice actor comes in the next day to record it, and we quickly make the change to the system. Two days later, we run seven more participants. None of them has the problem; the extra prompt successfully clarifies what the system is doing. Furthermore, all participants complete all tasks successfully, and responses to our survey questions are very positive.
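
The fix amounts to masking the latency rather than removing it: the lookup starts as soon as the account number is captured, and the hold prompt fills the silence while it runs. A sketch of that ordering, with hypothetical play_prompt and validate_account functions standing in for the platform's prompt-playback and database calls:

```python
# Illustrative sketch of the latency-masking fix. play_prompt() and
# validate_account() are hypothetical stand-ins for the platform's
# prompt-playback and database calls.

from concurrent.futures import ThreadPoolExecutor

def handle_account_number(n_best, play_prompt, validate_account):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the checksum/database work immediately...
        future = pool.submit(validate_account, n_best)
        # ...and fill the silence while it runs.
        play_prompt("Please hold while I access your account.")
        return future.result()
```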

The application is now in good shape. Implementation bugs have been fixed, recognition accuracy is in the expected range, and the system has achieved good scores on the evaluative usability test. We agree with the client that the system is ready for the pilot test.


