8.4 User Testing


As we discussed in Chapter 3, designers are, in some ways, the worst people to judge the usability of their designs. Given their design role, they are thoroughly aware of how to use the system. End-user testing is key to evaluating and refining design decisions.

The previous two sections describe call flow and prompt design as two steps in the design process. The third important step is end-user testing. Keep in mind that these steps do not happen in strict sequence. A typical project goes through a number of iterations of testing and design refinement. One common approach is to begin by designing the primary paths through the dialog and do some early user testing. It is also advisable to design the riskiest parts of the system (from a usability point of view) early, leaving time for iterative testing and refinement.

The primary means of user testing at this phase is formal usability testing, as discussed in the next section. We also cover card sorting, a technique that helps you design menu hierarchies.

8.4.1 Formal Usability Testing

Usability testing merits a full book of its own. Luckily, a number of excellent references are available. We recommend Rubin (1994), Dumas and Redish (1999), and Nielsen (1993). In this section, we review at a high level the basics of usability testing and point out some of the special issues that arise in testing voice user interfaces.

Basic Approaches

Usability testing should begin early in the design process. One common approach, which enables early testing even before a prototype exists, is referred to as a Wizard of Oz (WOZ) test (Fraser and Gilbert 1991). The key idea is to simulate the behavior of a working system by having a human (the "wizard") stand in for the system, performing the speech recognition and understanding and generating appropriate responses and prompts. In some cases, the wizard actually says the prompts in his or her own voice. In other cases, a software system allows the wizard to choose the appropriate response from prerecorded speech files. The latter approach is preferable because it leads to more consistent system behavior across test participants.
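
As an illustration of the prompt-selection approach, the sketch below shows a bare-bones wizard console: the wizard listens to the caller and presses a key to play the matching prerecorded prompt, and every selection is logged with a timestamp so the session can be reconstructed later. This is a hypothetical design, not a tool described in this book; the prompt file names are placeholders, and the playback command (afplay) stands in for whatever command-line audio player is available on the test machine.

    # Minimal Wizard of Oz prompt console (hypothetical sketch).
    # The wizard presses a key to play a prerecorded prompt; each choice
    # is logged with a timestamp for later analysis of the session.
    import subprocess
    import time

    PROMPTS = {
        "1": ("welcome",   "prompts/welcome.wav"),
        "2": ("main_menu", "prompts/main_menu.wav"),
        "3": ("balance",   "prompts/balance.wav"),
        "4": ("no_match",  "prompts/reject_1.wav"),  # simulated recognition error
        "5": ("goodbye",   "prompts/goodbye.wav"),
    }

    def play(path):
        # Substitute any command-line audio player available on the test machine.
        subprocess.run(["afplay", path], check=False)

    def wizard_loop(log_path="woz_session.log"):
        with open(log_path, "a") as log:
            while True:
                for key, (name, _) in PROMPTS.items():
                    print(f"  {key}: {name}")
                choice = input("prompt to play (q to quit)> ").strip()
                if choice == "q":
                    break
                if choice in PROMPTS:
                    name, path = PROMPTS[choice]
                    log.write(f"{time.time():.1f}\t{name}\n")
                    play(path)

    if __name__ == "__main__":
        wizard_loop()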

The WOZ approach has a number of advantages:

  • Early testing: You can begin testing as soon as you design the first pieces of dialog you wish to test.

  • No bugs: A WOZ system is not subject to software and integration bugs. If you test using an early version of a working system or a prototype, you run the risk of hitting bugs that will interfere with your ability to test usability.

  • High grammar coverage: Grammar coverage is a key ingredient of recognition performance. Given that it takes substantial effort to develop grammars, an early prototype is unlikely to have high grammar coverage. Poor grammar coverage interferes with usability testing because users will experience many recognition errors that do not reflect the recognition performance you ultimately expect to achieve. A WOZ, by virtue of using a human as the speech recognizer, has high grammar coverage. (This can also be a disadvantage: a human wizard will understand more than the final grammar will, and it is difficult to simulate the grammar's behavior precisely before it is developed.)

  • Quick and easy changes: It is far easier to change a WOZ than a working prototype. After a day of testing, it is often desirable to make some quick changes and be ready to run more participants the following day.

The primary disadvantage of a WOZ is the difficulty of simulating realistic recognition accuracy and grammar coverage.

Another approach is to run usability tests using a prototype. Typically, you do this later in the cycle than WOZ testing. The main advantage of a prototype is realism. The behavior of the system is likely to more accurately reflect the behavior of the final system than a version simulated by a human (assuming that you have reasonable grammar coverage and no destructive bugs).

Usability tests have traditionally been run in usability labs. Participants come to the lab, perform various tasks, and are interviewed by the moderator. However, usability tests of VUIs are increasingly being run over the telephone, with the participant at home. Telephone-based usability testing offers a number of advantages:

  • You can more easily reach a geographically dispersed pool of participants, thereby possibly getting a sample that is more representative of the caller population.

  • The participants are at home, in an environment and state of mind that may be more representative of real system use.

  • It is less expensive than bringing participants into a laboratory and more convenient for the participants.

The advantage of running a usability test in a lab is the ability to see the participant. Often, you can read body language that indicates the caller is unhappy or confused, something you miss if you are listening over the phone.

We run many of our usability studies using a Wizard of Oz system over the telephone. This allows us to get quick, early feedback inexpensively. Chapter 15 describes evaluative usability tests, which are run after a complete working system has been created. They are run using the real working system.

Task Design and Measurements

In a typical usability test, the participant is presented with a number of tasks, which are designed to exercise the parts of the system you wish to test. Given that it is seldom possible to test a system exhaustively, the tests are focused on primary dialog paths (i.e., features that are likely to be used frequently), tasks in areas of high risk, and tasks that address the major design goals and design criteria identified during requirements definition.

You should write the task definitions carefully to avoid biasing the participant in any way. You should describe the goal of the task without mentioning command words or strategies for completing the task.

In addition to the subjective measures described later (e.g., the perceptions and opinions of the participant), a number of performance measures should be tracked during the test. These should include task completion (does the participant successfully complete the assigned task?) and efficiency (does the participant take the most direct path to completion, or end up going through error recovery procedures, restarts, and help commands?). Note the participant's specific path through the dialog, and record whether each error message and help message succeeded in getting the participant back on track. Additional performance measures should cover, where possible, the measurements identified during requirements definition as success metrics. Any other measures that indicate relevant performance issues or provide design guidance will also be useful.
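
A lightweight way to keep these measures consistent across observers is to record one structured entry per participant per task. The sketch below shows one possible layout in Python; the field names and example values are illustrative, not taken from this book.

    # One possible record format for per-task performance measures
    # (hypothetical field names, for illustration only).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TaskResult:
        participant_id: str
        task_id: str
        completed: bool            # did the participant finish the assigned task?
        turns: int                 # total dialog turns taken
        error_recoveries: int = 0  # times an error-recovery prompt was triggered
        help_requests: int = 0     # times the participant asked for help
        restarts: int = 0          # times the participant started the task over
        path: List[str] = field(default_factory=list)  # dialog states visited, in order
        notes: str = ""            # observer notes (confusions, workarounds, etc.)

    # Example record for one participant/task pair:
    result = TaskResult(
        participant_id="P03",
        task_id="check_balance",
        completed=True,
        turns=9,
        error_recoveries=2,
        help_requests=1,
        path=["main_menu", "help", "main_menu", "billing", "balance"],
        notes="Said 'account' at the main menu; expected a billing keyword.",
    )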

Selecting and Recruiting Participants

A typical test includes 10 to 15 participants. The most important rule of thumb about participant selection is that the participants must be representative of the ultimate end-user population. One way to enlist participants is through a recruiting firm. These firms typically have huge lists of potential participants, with significant demographic and other information about each person. You can define criteria for the participants to meet. Recruiting firms typically charge a fee for each participant they provide. Additionally, it is often useful to provide a financial incentive to participants.

In some cases, the company deploying the system can provide you with a list of its customers who can act as usability study participants. This can be very useful, especially if the system targets a specific group, such as users of the Web site. Whether you are using a recruiter or getting suggested participants from the client company, you should create a screening questionnaire that includes questions about demographics, level of education, experience using various systems, and any other criteria that are relevant to the test. The results of the questionnaire can be used to narrow the choice of participants.

Running the Test

Before running the test with your first participant, test it with one or two people to make sure everything is working and that the instructions and task descriptions are clear. When you carry out the tests with the real participants, either audiotape them (with their permission) if the test is conducted over the telephone, or videotape them if they are in a usability lab.

Begin by describing the purpose and nature of the test. It is important to assure participants that they themselves are not being tested; rather, the goal is to get their help assessing and improving the design of the system. Assure the participant that you are not personally invested in the current design but are trying to find any problems it has, and that you therefore will not be offended by negative feedback.

It is often useful to do some debriefing with the participant after each task, in addition to a more general debriefing at the end. While participants perform the tasks, note issues and problems that arise so that you can ask questions later to understand how they were thinking. Don't interrupt and help as soon as you see a participant having trouble with a task; often you will learn the most by listening to how people react in those situations.

The general debriefing, after the participant has completed all tasks, usually consists of a questionnaire designed to elicit subjective reactions, followed by a more open-ended discussion of the experience with the system. The questionnaire consists of a number of statements about the experience, each rated on a scale of 1 to 7 indicating how strongly the participant agrees or disagrees (where 1 is "strongly disagree" and 7 is "strongly agree"). These are referred to as Likert scales. Some of the statements may be general, such as, "The system was quick and efficient," whereas others may be specific to the features tested. Questionnaires may also include questions requesting short answers and comments.

The questionnaire is typically followed by a more open-ended discussion. Begin with a general question, such as, "What did you think?" From there, drill down into specifics. Try to learn about how participants were thinking, what their mental models were, what assumptions they had about how the tasks should be performed, and so on. Focus on understanding problems and issues for the user, not on coming up with solutions.

Analysis of Data

The purpose of data analysis is to identify and prioritize problems. Often, problems with an interface will surface in a number of ways during a test. For example, a lack of clarity about how to perform a certain task may result in numerous error messages and requests for help, lack of task completion for some participants, lower Likert scores for certain questions, and negative subjective feedback.

Results should be compiled. Responses to Likert statements should be summarized, showing individual values and means, as well as the mean across all Likert statements. For each task, you should compute completion rates and use of error recovery procedures and help; summarize this information in a table, along with summaries of problems with the tasks and comments from participants. Similar participant comments should be grouped and noted in a table, with counts.
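
As a rough sketch of what that compilation can look like, the snippet below computes per-statement Likert means, the overall Likert mean, and per-task completion, error-recovery, and help rates. The data layout and example values are made up for illustration.

    # Summarizing Likert responses and task outcomes (illustrative data).
    from statistics import mean

    # likert[statement] -> list of 1-7 ratings, one per participant
    likert = {
        "The system was quick and efficient": [5, 6, 4, 7, 5, 6],
        "I always knew what I could say":     [3, 4, 2, 5, 4, 3],
    }

    # tasks[task_id] -> list of (completed, used_error_recovery, used_help) per participant
    tasks = {
        "check_balance": [(True, False, False), (True, True, True), (False, True, True)],
    }

    for statement, ratings in likert.items():
        print(f"{statement}: mean {mean(ratings):.1f}  (n={len(ratings)})")
    print(f"Overall Likert mean: "
          f"{mean(r for ratings in likert.values() for r in ratings):.1f}")

    for task_id, runs in tasks.items():
        n = len(runs)
        completed = sum(c for c, _, _ in runs)
        recovery  = sum(e for _, e, _ in runs)
        helped    = sum(h for _, _, h in runs)
        print(f"{task_id}: completion {completed}/{n}, "
              f"error recovery {recovery}/{n}, help {helped}/{n}")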

Part of the purpose of analyzing the data is to look for trends. Sometimes, a basic underlying problem will have symptoms that show up in a number of places. For example, a problem with the general error recovery strategy may result in problems in a number of tasks. In that case, you want to fix the general problem rather than deal with each of the symptoms.

You should prioritize problems with a number of issues in mind:

  • Scope: Is the problem local (e.g., an ambiguous prompt), or does it affect many parts of the application (e.g., an ineffective error strategy)?

  • Frequency: How many participants had the problem?

  • Recoverability: When participants experience the problem, do they fail to complete the task, or do they ultimately recover and successfully complete the task?

  • Success metrics: How much will this problem keep you from meeting the success metrics agreed upon in requirements definition?
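
If it helps to make the ranking explicit, these criteria can be folded into a rough priority score, as in the sketch below. The weights and scales are arbitrary illustrations, not a method prescribed here, and any such ranking should be sanity-checked against judgment.

    # Rough priority ranking from the criteria above (weights are arbitrary).
    problems = [
        # (description,                       scope 1-3, freq 0-1, recoverable?, blocks success metric?)
        ("Error strategy loops on no-match",          3,      0.6,        False,  True),
        ("Ambiguous prompt at billing menu",          1,      0.4,        True,   False),
    ]

    def priority(scope, freq, recoverable, blocks_metric):
        score = scope * 2 + freq * 4
        if not recoverable:
            score += 3   # participants fail the task outright
        if blocks_metric:
            score += 3   # threatens an agreed-upon success metric
        return score

    for desc, scope, freq, rec, blocks in sorted(
            problems, key=lambda p: -priority(*p[1:])):
        print(f"{priority(scope, freq, rec, blocks):>4.1f}  {desc}")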

Proposals to fix each problem should be worked out with the designer. Otherwise, you risk making suggestions that do not work in the larger design context.

8.4.2 Card Sorting

Card sorting is a technique used to gain insight into how people categorize information and to find out what labels they use for the categories (Balogh 2002). It is useful when you design applications with complex menu hierarchies. The results can help you design an intuitive menu structure and choose useful names for menu choices.

There are two primary approaches to card sorting tests. In the first approach, you give participants a set of index cards labeled with the names of various items. The participants' job is to sort the cards into groups of items that go together. It is a good idea to limit them to some maximum number of groups. When they have finished sorting, have them create a name for each group, and ask them to explain their thinking behind the groups chosen.

Here are examples of the types of items that may appear on the index cards for a hypothetical telecom application:

I'm having a problem with my handset.

I want to add call forwarding to my service.

How much do I owe on last month's bill?

I don't remember my voice mail PIN.

How does call waiting work?

Can I check my balance?

I need a duplicate bill.

I'm being charged for calls I didn't make.

I'm moving, and I want to cancel my service.

I'm not getting a dial tone at home.

Do you have a voice dialing option?

I was in the middle of a call, and it got dropped.

The second approach is similar to the first, except that you begin with a set of existing category names and ask the participants to sort the items into those categories. This is a good way to test an existing design for a menu structure.
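
A common way to analyze the resulting sorts, whichever approach you use, is to count how often each pair of items lands in the same group across participants; pairs that frequently co-occur are strong candidates for the same menu branch. The sketch below shows this co-occurrence count in Python, with made-up groupings of a few of the items above.

    # Co-occurrence analysis of open card-sort data (illustrative groupings).
    from itertools import combinations
    from collections import Counter

    # Each participant's sort: a list of groups, each group a set of item labels.
    sorts = [
        [{"handset problem", "no dial tone"}, {"last month's bill", "duplicate bill"}],
        [{"handset problem", "no dial tone", "dropped call"}, {"last month's bill"}],
    ]

    together = Counter()
    for groups in sorts:
        for group in groups:
            for a, b in combinations(sorted(group), 2):
                together[(a, b)] += 1

    # Pairs grouped together most often are candidates for the same menu branch.
    for (a, b), count in together.most_common():
        print(f"{count}/{len(sorts)}  {a}  +  {b}")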


