3.2 Steps of the Methodology | Voice User Interface Design 2004

This section describes a six-step methodology for designing and deploying voice applications. Our emphasis is on those steps typically performed by a VUI designer, although we also cover other steps in deployment.

As a designer, even if you are lucky enough to be part of a team in which others are responsible for some phases such as software development, you must have an understanding of what is involved in the entire process. Such an understanding will facilitate teamwork and help you avoid design decisions that cause problems for those involved in other parts of the project.

The six steps are as follows:

1. Requirements definition
2. High-level design
3. Detailed design
4. Development
5. Testing
6. Tuning

3.2.1 Requirements Definition

The purpose of requirements definition is to achieve a detailed understanding of the application. The goal goes beyond understanding the features and desired functionality. You must also understand the target users of the application: What motivates them to use the system? What expectations do they bring? What are the typical usage scenarios? For example, will callers usually use the system while driving? What are the most common reasons for calling?

You also must understand the business context: the primary goal for deploying the system, how the system functions in concert with other ways that the company communicates with its customers, the image and branding goals, and so on. You also must understand the full application context, including the other systems it integrates with and all possible failure modes.

A thorough understanding of requirements not only helps you understand the features you will design but also helps you make appropriate trade-offs, design the system persona, and create a usable system for the target audience. One output of the requirements process is a set of success metrics: performance attributes that can be measured to assess success and identify areas for tuning and improvement.

3.2.2 High-Level Design

A high-level design encapsulates what has been learned from the requirements phase in a concrete form that creates the framework for the detailed design. Additionally, by making high-level design decisions up front, before diving into the multitude of application details, you can achieve a consistency and unity of structure that would never emerge from the details without planning.

A high-level design includes a number of decisions. You determine the basic dialog structure (e.g., simple menu versus complex natural language), grammar needs (rule-based versus statistical language model, or SLM), use of nonverbal audio, and system persona. Chapter 4 covers the details and approaches to both requirements definition and high-level design.

3.2.3 Detailed Design

For the VUI designer, the detailed design is the phase that receives the bulk of the effort. The detailed design includes the complete specification of the call flow and all the prompts. You specify every detail of every application feature, and you design dialogs to handle all possible scenarios and problems (e.g., recognition errors, rejects, system problems, and caller requests to speak to an operator). You define ways to offer callers help and instruction when needed, to confirm recognition results when necessary, and to support callers when they change their minds and need to start over or redo tasks. You must test your design decisions on subjects who are representative of users, typically with iterative usability studies, beginning early in the detailed design process. If you're planning to use nonverbal audio, you must design the specific sounds.

Much of this book (Chapters 8–14) covers the detailed design process. We present a set of fundamental design principles and show how you can apply them in concrete ways. Additionally, we present examples of designs for many of the basic types of actions that happen in dialogs (e.g., confirmations and recovery from rejects).
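To make the idea of designing for rejects and operator requests more concrete, the following is a minimal Python sketch of a single dialog state. It is not a design taken from this book: the prompt wording, the limit of two rejects, the `run_state` and `scripted` helpers, and the convention that a recognizer returns `None` for a reject are all assumptions made for illustration.

```python
# Hypothetical prompts for one dialog state; the wording is illustrative only.
PROMPTS = {
    "initial": "Which city are you leaving from?",
    "reject_1": "Sorry, I didn't get that. Which city are you leaving from?",
    "reject_2": "Sorry, I still didn't understand. Please say just the city name.",
    "transfer": "Let me get someone to help you.",
}
MAX_REJECTS = 2  # assumed limit before handing the caller to an operator


def run_state(recognize, prompts=PROMPTS, max_rejects=MAX_REJECTS):
    """Run one dialog state until it is filled or the call is transferred.

    `recognize(prompt)` is a stand-in for a real recognizer. It returns the
    recognized value, the string "operator" when the caller asks for a
    person, or None when the utterance is rejected.
    """
    rejects = 0
    prompt = prompts["initial"]
    while True:
        result = recognize(prompt)
        if result == "operator":
            return ("transfer", prompts["transfer"])
        if result is not None:
            return ("filled", result)
        rejects += 1
        if rejects > max_rejects:
            return ("transfer", prompts["transfer"])
        # Escalate: each successive reject gets a more detailed reprompt.
        prompt = prompts[f"reject_{rejects}"]


def scripted(results):
    """Build a fake recognizer that replays `results` and records prompts."""
    remaining = iter(results)
    heard = []

    def recognize(prompt):
        heard.append(prompt)
        return next(remaining)

    return recognize, heard


recognize, heard = scripted([None, "boston"])
print(run_state(recognize))  # one reject, then a successful recognition
print(heard)                 # the prompts the caller would have heard
```

The point of the sketch is only that every path out of the state, including repeated rejects and a request for an operator, is specified in advance rather than left to the implementation.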

3.2.4 Development

The development phase includes implementation of the call flow as software, specification of grammars, voice recording and audio production, and creation of interfaces to backend databases, Web services, and other software systems that interact with the application. In some cases, you implement the software in proprietary languages that run on particular interactive voice response (IVR) platforms, often using tools provided by the IVR vendor. Increasingly, systems are implemented using VoiceXML, a markup language in the same vein as HTML (the language typically used to design Web pages). VoiceXML is quickly becoming a standard. Chapter 15 discusses VoiceXML in detail.

You develop a grammar by specifying expected inputs from callers: all the words, phrases, and sentences callers are commonly expected to say. Your approach will depend on whether you're creating a rule-based or a statistical grammar. Chapter 16 discusses grammar development in detail.
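As a rough illustration of what specifying expected inputs means for a rule-based grammar, here is a small Python sketch. It does not use any real grammar format or recognizer API. The slot-based representation, the `YES_NO_GRAMMAR` contents, and the `expand`, `build_lookup`, and `interpret` functions are hypothetical, and matching is done on text rather than audio.

```python
import itertools

# Hypothetical toy grammar. Each rule is a sequence of slots, and each slot
# lists the alternatives allowed there. An empty string marks an optional slot.
YES_NO_GRAMMAR = {
    "yes": [["", "uh"], ["yes", "yeah", "yep"], ["", "please", "that's right"]],
    "no": [["", "uh"], ["no", "nope"], ["", "thanks", "that's wrong"]],
}


def expand(rule):
    """Return the set of every phrase that a rule accepts."""
    phrases = set()
    for combo in itertools.product(*rule):
        phrases.add(" ".join(word for word in combo if word))
    return phrases


def build_lookup(grammar):
    """Map each accepted phrase to the meaning (rule name) it returns."""
    lookup = {}
    for meaning, rule in grammar.items():
        for phrase in expand(rule):
            lookup[phrase] = meaning
    return lookup


def interpret(utterance, lookup):
    """Return the meaning of an in-grammar utterance, or None otherwise."""
    normalized = " ".join(utterance.lower().split())
    return lookup.get(normalized)


LOOKUP = build_lookup(YES_NO_GRAMMAR)
print(interpret("Yeah please", LOOKUP))  # in grammar
print(interpret("I guess so", LOOKUP))   # out of grammar
```

Even this toy grammar shows the coverage problem: "I guess so" is a plausible answer to a yes/no question, but it is out of grammar until someone adds it.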

3.2.5 Testing

The testing we refer to here happens after development is complete and before deployment or pilot rollout. You run a number of types of tests at this stage, each with a different goal. Some tests are run to make sure the implementation faithfully follows the design specification. Other tests help find good initial values for recognition parameters and make sure that recognition performance is reasonable before the system is exposed to the public. Also, you typically run evaluative usability tests to validate that the system meets basic usability goals. Evaluative usability testing may also help find problems that could not be detected before the availability of a fully implemented system (e.g., timing problems due to poorly chosen endpointing parameters or latencies).
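One way to picture the search for good initial recognition parameters is a sweep over a single parameter, such as a confidence threshold below which results are rejected. The Python sketch below assumes a hypothetical set of labeled test utterances, each with a confidence score and a flag saying whether the recognizer's answer was correct. The data, the candidate thresholds, and the choice to weight false accepts and false rejects equally are all assumptions, not recommendations.

```python
# Hypothetical test data: (confidence score, recognition result was correct).
TEST_RESULTS = [
    (0.92, True), (0.85, True), (0.40, False), (0.55, False),
    (0.60, False), (0.30, False), (0.75, True), (0.35, True),
]


def error_counts(results, threshold):
    """Count false accepts and false rejects at a given rejection threshold.

    A result is accepted when its confidence is at or above the threshold.
    """
    false_accepts = sum(1 for conf, ok in results if conf >= threshold and not ok)
    false_rejects = sum(1 for conf, ok in results if conf < threshold and ok)
    return false_accepts, false_rejects


def best_threshold(results, candidates):
    """Pick the candidate with the fewest total errors (equal weighting assumed)."""
    return min(candidates, key=lambda t: sum(error_counts(results, t)))


for threshold in (0.3, 0.5, 0.7):
    print(threshold, error_counts(TEST_RESULTS, threshold))
print("chosen:", best_threshold(TEST_RESULTS, (0.3, 0.5, 0.7)))
```

In practice the two error types rarely cost the same. A false accept on a funds transfer is far worse than an extra reprompt, so the weighting is itself a design decision.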

3.2.6 Tuning

Before full system rollout, you typically run pilot tests using the fully implemented system. In pilot mode, the system is deployed to a subset of the caller population. This is your first opportunity to measure the system's performance as it handles real callers using the system to perform the real tasks for which it was designed.

You collect the pilot data (including all the audio data from calls to the system) along with data on system behavior. Then you use the data to measure recognition performance, tune recognition parameters, improve the grammar's coverage (add the things spoken by real callers that were missed in the original grammar), and tune the performance of the VUI. Tuning the VUI includes finding dialog states that have problems, such as high hang-up rates, and then tracking down and fixing those problems. For example, it may be necessary to improve the wording of a prompt so that callers speak "out of grammar" less often. If you're using a statistical grammar, you will add the pilot data to the grammar training set.
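As a sketch of how problem dialog states might be found in pilot data, the Python below computes hang-up and out-of-grammar rates per state from a hypothetical log. The log format, the state names, the outcome labels, and the thresholds in `problem_states` are assumptions chosen for illustration. Real platforms log far richer data.

```python
from collections import Counter, defaultdict

# Hypothetical pilot log: one (dialog_state, outcome) pair per caller turn.
PILOT_LOG = [
    ("main_menu", "recognized"),
    ("main_menu", "recognized"),
    ("main_menu", "out_of_grammar"),
    ("main_menu", "recognized"),
    ("get_account", "hangup"),
    ("get_account", "out_of_grammar"),
    ("get_account", "recognized"),
    ("get_account", "hangup"),
]


def state_rates(log):
    """Compute visit counts, hang-up rates, and out-of-grammar rates per state."""
    counts = defaultdict(Counter)
    for state, outcome in log:
        counts[state][outcome] += 1
    rates = {}
    for state, outcomes in counts.items():
        total = sum(outcomes.values())
        rates[state] = {
            "visits": total,
            "hangup_rate": outcomes["hangup"] / total,
            "oog_rate": outcomes["out_of_grammar"] / total,
        }
    return rates


def problem_states(rates, max_hangup=0.2, max_oog=0.3):
    """List states whose rates exceed the (assumed) acceptable limits."""
    return sorted(
        state
        for state, r in rates.items()
        if r["hangup_rate"] > max_hangup or r["oog_rate"] > max_oog
    )


RATES = state_rates(PILOT_LOG)
print(RATES)
print(problem_states(RATES))
```

A flagged state is only a starting point. You still need to listen to the calls to learn whether the cause is the prompt wording, the grammar, or something upstream in the dialog.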

You often continue data collection and tuning after full system rollout. This is especially true if you're using a statistical grammar, because larger and more task-specific training data are likely to improve performance.