4.2 High-Level Design | Voice User Interface Design 2004

Once the requirements are understood, there is one more step before detailed design can begin: high-level design. The high-level design is often brief, but it plays a crucial role in achieving the goals of the application while creating a consistent and effective user experience and a unity of structure in the design. The goal is to encapsulate the requirements in a concrete way that can guide design and to make decisions about dialog strategies and dialog elements that permeate the design, thereby achieving consistency.

The elements of a high-level design include definitions of the following:

Key design criteria
Dialog strategy and grammar type
Pervasive dialog elements
Recurring terminology
Metaphor
Persona
Nonverbal audio

4.2.1 Key Design Criteria

The first element of the high-level design is a list of the key design criteria for the application. You must keep these criteria in mind throughout high-level and detailed design. This list will provide guidance when it comes to making difficult trade-offs.

The list of key criteria should be very short, generally one to three items. Naturally, there is a long list of attributes that every system tries to achieve. However, the purpose of the list of key criteria is to guide design trade-offs and the major focus of the design, so a very short list of the highest-priority criteria is best. By the time requirements definition is complete, the key criteria should be clear, and they should be covered by the success metrics defined during the requirements process.

4.2.2 Dialog Strategy and Grammar Type

A dialog strategy is a basic scheme for structuring the dialog of an application or task. For example, if you are designing a travel application, you might choose to go step-by-step, as if filling in a form, asking the caller for one piece of information at a time: first the origin city, then the destination, then the travel date, and so on. Alternatively, you might choose to accommodate greater flexibility in how callers can describe their travel plans. You might start with a prompt such as, "What are your travel plans?" and handle a wide range of inputs such as, "I wanna go from New York to San Francisco next Tuesday," "I need a flight to San Francisco," and "I need to be in San Francisco by three p.m. Tuesday."

In the latter case, the application must choose its next dialog step based on the caller's input. Furthermore, the grammar must be flexible enough to handle the great variety of possible caller utterances. In this case, you would typically use a statistical language model for the recognition grammar.
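The difference between the two strategies can be illustrated with a minimal sketch. It assumes that recognition and interpretation have already turned the caller's utterance into a dictionary of filled slots; the slot names, prompt wording, and the function `next_prompt` are hypothetical and are not taken from any particular platform.

```python
from typing import Dict, Optional

# Hypothetical slots for the travel example. Their order defines the
# step-by-step (form-filling) sequence.
SLOTS = ["origin", "destination", "date"]

# Hypothetical prompt wording for each slot.
PROMPTS = {
    "origin": "What city are you leaving from?",
    "destination": "What city are you going to?",
    "date": "What day do you want to travel?",
}


def next_prompt(filled: Dict[str, str]) -> Optional[str]:
    """Return the prompt for the first unfilled slot, or None if all are filled."""
    for slot in SLOTS:
        if slot not in filled:
            return PROMPTS[slot]
    return None


# Step-by-step strategy: nothing is filled yet, so the dialog starts at the top.
print(next_prompt({}))

# Flexible strategy: after "I need a flight to San Francisco," the destination
# is already known, so the next step depends on what the caller said.
print(next_prompt({"destination": "San Francisco"}))
```

In the step-by-step case the same function simply walks the slots in order. In the flexible case it is the caller's opening utterance that determines which slots remain, which is why the next dialog step cannot be fixed in advance.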

In general, the choice of dialog strategy will strongly influence the choice between rule-based and statistical grammars. Chapter 5 covers the various dialog strategies and discusses how to decide which grammar type is best for your application.

4.2.3 Pervasive Dialog Elements

A number of operations happen frequently throughout a VUI. For example, in every dialog state, you must provide handling for typical problems that may occur, such as recognition rejects. Making decisions about the approach to such operations up-front, before detailed design begins, will help ensure consistency in how these elements are designed throughout the system.

Pervasive elements include error handling (e.g., handling of recognition rejects and timeouts) and universals (i.e., commands always available to callers in every dialog state, such as "Help"). Particular applications may have other elements that either happen repeatedly or influence many dialog states and therefore should also be included in the high-level design. One such element is the login strategy: the means by which callers are identified. Chapter 5 describes the design choices for error handling, universals, and login.
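One way to picture why these decisions belong in the high-level design is a single handler shared by every dialog state. The sketch below is illustrative only: the set of universals, the error limit, the action labels, and the function `handle_turn` are assumptions, and the actual policies are the design choices discussed in Chapter 5.

```python
from typing import Tuple

# Hypothetical universals: commands available in every dialog state.
UNIVERSALS = {"help", "main menu", "goodbye"}

# Hypothetical limit on consecutive errors before the caller is transferred.
MAX_ERRORS = 2


def handle_turn(kind: str, text: str, errors: int) -> Tuple[str, int]:
    """Decide what happens after one recognition attempt in any dialog state.

    kind is "ok", "reject", or "timeout"; text is the recognized phrase;
    errors is the number of consecutive errors so far in this state.
    Returns an action label and the updated error count.
    """
    if kind == "ok":
        if text in UNIVERSALS:
            return ("universal:" + text, 0)
        return ("state:" + text, 0)

    # Rejects and timeouts are both errors, but each gets its own prompt.
    errors += 1
    if errors > MAX_ERRORS:
        return ("transfer", errors)
    if kind == "timeout":
        return ("prompt:timeout", errors)
    return ("prompt:reject", errors)
```

Because every state routes through the same function, a change to the error limit or to the list of universals applies consistently across the whole application.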

4.2.4 Recurring Terminology

It is important to use terminology consistently throughout an application. Violating this rule will lead to confusion for callers. Furthermore, if the application is used in concert with other systems, such as Web sites, it is important to ensure consistency in terminology between them. Deciding on terminology up-front will help ensure consistency.
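Once the terminology has been decided, it can be checked mechanically. The following sketch assumes that prompt texts are available as a dictionary keyed by prompt name; the glossary entries, the prompt names, and the function `find_inconsistent_terms` are hypothetical examples rather than part of any real tool.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical glossary: the preferred term, followed by variants to avoid.
GLOSSARY = {
    "PIN": ["passcode", "password", "access code"],
}


def find_inconsistent_terms(prompts: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (prompt name, variant found, preferred term) for each violation."""
    findings = []
    for name, text in prompts.items():
        for preferred, variants in GLOSSARY.items():
            for variant in variants:
                # Match whole words only, ignoring case.
                pattern = r"\b" + re.escape(variant) + r"\b"
                if re.search(pattern, text, re.IGNORECASE):
                    findings.append((name, variant, preferred))
    return findings


prompts = {
    "login": "Please say your PIN.",
    "login_retry": "Sorry, I didn't get your passcode.",
}
print(find_inconsistent_terms(prompts))
```

A check like this catches only wording that has been listed in advance, so it supplements a careful review of the prompts and does not replace it.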

The topics of the next three sections (metaphor, persona, and nonverbal audio) are sometimes referred to as the consistent character of the application because they present a layer that plays a key role in creating the look and feel of the system. Keep in mind, however, that these elements are there not only on the surface, for look and feel, but also to play key roles in achieving your usability goals.

4.2.5 Metaphor

Metaphors are used in many user interfaces to help create a useful mental model for the user. A metaphor uses knowledge about one concept to understand or comment on a second concept. In a user interface, metaphors can be used as a tangible analogy for the abstract organizational scheme of an application.

Familiar examples of metaphors in user interfaces include the desktop metaphor in Microsoft Windows and the shopping cart metaphor at Amazon.com. Metaphors can be overarching, influencing the entire application (much like the desktop metaphor), or they can be specific to a subset of operations (like the shopping cart). The metaphor for a voice system can be as simple as a definition of the role of the system with respect to the caller (e.g., a personal assistant). If you use an overarching metaphor, you should choose it in advance of detailed design, because it may influence many aspects of the design. At the very least, some thought should be given to the role the system plays for the caller.

4.2.6 Persona

When people engage in conversations, even over the telephone and even if they know they are conversing with a machine, they make many inferences about the kind of "person" they are speaking with. These inferences are driven by numerous characteristics, including voice quality, the words spoken, how the words are spoken, and so on. This phenomenon provides an opportunity for a company to create a specific image and extend its brand by carefully designing the persona behind its speech system. If you define the persona before detailed design, you will be able to consistently apply it as you craft the interface.

Many companies invest significantly in creating and extending their brands. When you work with such a company, it may be worth a significant effort to design a persona that supports its brand and corresponds to how it presents itself in other media (e.g., television ads). You may even create more than one persona (with voice recordings of sample dialogs) and compare them in focus groups.

Other companies may have little interest in the branding opportunity offered by a speech system. Even in those cases, it is worth at least a small effort to describe a persona before detailed design begins. Doing so will result in far greater consistency throughout the design. Furthermore, it will provide guidance in choosing a voice actor to record the prompts and will help the actor achieve a consistent delivery.

Some technology and platform vendors offer prepackaged or "standard" personas. These are a set of carefully designed personas, with voices already chosen, and a set of sample dialogs you can listen to in order to understand the look and feel of the persona. Although using a standard persona does not provide differentiation in terms of brand, it does provide an inexpensive way for a company to take advantage of a well-designed persona. Additionally, some vendors create TTS voices using the same voice actors who voice their standard personas. You can take advantage of this when designing systems that combine recorded voice with TTS, thereby achieving a smoother integration between recorded and TTS segments of prompts.

Some practitioners claim that the reason for designing a persona is to fool the caller into thinking the system is human. We advise against that. Design choices should never mislead the caller about the capabilities of the system. The reason for designing a persona is to better meet user needs by creating a more engaging and familiar experience and a more usable system, and to better meet business needs by extending the company brand and creating a favorable and appealing image.

The issues surrounding persona design are complicated, especially considering particular user and business goals, so we devote all of Chapter 6 to this topic. Furthermore, because deploying companies need to be aware of the value of explicit persona design for their applications, Chapter 6 also includes information on a number of studies that support our contentions about the role a persona can play in the perceptions callers have of a system.

4.2.7 Nonverbal Audio

Nonverbal audio (NVA) includes all sounds, other than speech, that you design as part of your system. There are three primary goals for using nonverbal audio in applications:

To help create a particular look and feel
To solve usability problems
To communicate particular types of information

Look and Feel

The look and feel of an application can be enhanced with careful use of nonverbal sounds. To achieve a uniformity of look and feel, the set of sounds should be carefully designed. There are three types of NVA specifically designed for look and feel: background music, environmental sounds, and branding sounds.

When you use background music, you should carefully choose it for relevance to the application or piece of the application it is associated with. Background music has been used most effectively in voice browsers, with different background music associated with different "voice sites" (e.g., sports news, restaurant guide). Keep in mind that extra sounds over the telephone, in addition to speech, can lead to distortion and make the speech difficult to understand. When you use background music, it is often most effective to have it play for only a few seconds to set the mood, and end before the spoken interaction begins.

Environmental sounds are used in a similar way. For example, a visit to the restaurant guide might begin with a few seconds of the noises you typically hear in a busy restaurant. The same precautions mentioned for background music apply to the use of environmental sounds.

Branding sounds are earcons specifically designed to identify a particular company or service. Many companies already have readily identified branding sounds. The typical place to use these is right at the beginning of a call, as part of a welcome message.

Usability

You can solve or mitigate a number of usability challenges by using NVA. One such issue is latency. Some applications access other systems, such as backend databases, with unpredictable amounts of latency before connecting or returning a result. Complete silence may be frustrating for callers. They may wonder whether the system has disconnected or is still working on their problem. To fill the space, you can use latency sounds, such as music or a specifically designed repeating sound. In addition to creating more interest than silence, the sound indicates to callers that they are still connected to the system and it is working.
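The latency pattern can be sketched as a wrapper around the slow call. Audio playback depends on the platform, so the sketch takes the start and stop operations as parameters; `with_latency_sound`, `start_sound`, and `stop_sound` are hypothetical names, not functions from any real telephony API.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_latency_sound(
    backend_call: Callable[[], T],
    start_sound: Callable[[], None],
    stop_sound: Callable[[], None],
) -> T:
    """Play a latency sound for as long as the backend call takes."""
    start_sound()
    try:
        return backend_call()
    finally:
        # Stop the sound even if the backend call fails, so that it never
        # overlaps with the prompt that follows.
        stop_sound()


# Example with stand-in functions that only record what happened.
events = []
balance = with_latency_sound(
    backend_call=lambda: events.append("query") or 42,
    start_sound=lambda: events.append("start sound"),
    stop_sound=lambda: events.append("stop sound"),
)
print(events, balance)
```

Stopping the sound in the `finally` clause reflects the precaution noted earlier for background music: extra sounds should not run on into the spoken interaction.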

Another usability issue that can be aided with nonverbal audio is landmarking. Systems that move between different services (e.g., voice browsers) can be confusing to the caller as the context switches. You can landmark each service with a sound that identifies it. These sounds can be background music, environmental sounds, or specifically designed identifying earcons. As noted earlier, such sounds should last only a few seconds and should not overlap with spoken interaction.

Communication

Nonverbal audio can be used to communicate specific messages. For example, you can design a specific earcon to indicate a specific meaning. We are familiar with many such sounds in our daily lives, the most common being the sound of a telephone ringing.

Earcons are sometimes used to communicate the occurrence of an asynchronous event; for example, a tone that indicates that voice mail has arrived while you are in the middle of a telephone conversation. Earcons have also been used to indicate voice hyperlinks in some voice browsers.

General Considerations

You should not assume that all your designs should use nonverbal audio. Many designs work quite well without it. When using NVA, you should keep in mind a number of general considerations.

The most important guideline is to use sounds sparingly. If you use too many sounds, they lose their effectiveness and may lead to confusion. If you decide to use multiple sounds, design them carefully to work well together. For example, if you use a number of earcons with different meanings, make sure that each sound is distinctive. Although trained musicians may be sensitive to subtle differences, most callers will differentiate only those sounds that have significant contrast. Furthermore, think about the complete set of sounds and design each of them with some sense of how they will work together to create a particular look and feel.

Always test NVA over the telephone. The sound over the phone will be very different from the sound over high-quality speakers in the recording studio. Some examples that sound great in the studio may not work at all over the phone.

During high-level design, decide whether you will use any nonverbal audio and, if so, where. This is important because other design decisions will depend on placements of NVA. The specific designs can be left for the detailed design phase, and final audio production left to the development phase. In general, the design of NVA involves a very different set of skills than the other VUI elements, so it is often best to bring in someone with appropriate expertise in sound design.

For more information on the design of NVA, see Raman (1997) and Kramer (1994).