1.1 What Is a Voice User Interface?


A voice user interface (or VUI) is what a person interacts with when communicating with a spoken language application. The elements of a VUI include prompts, grammars, and dialog logic (also referred to as call flow). The prompts, or system messages, are all the recordings or synthesized speech played to the user during the dialog. Grammars define the possible things callers can say in response to each prompt; the system can understand only those words, sentences, or phrases that are included in the grammar. The dialog logic defines the actions taken by the system: for example, responding to what the caller has just said or reading out information retrieved from a database.[1]

[1] Note that when we use the word "dialog" throughout the book, we mean a spoken exchange of words and not "dialog" in the sense it is used by some software designers to refer to a box containing written words.
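To make these three elements concrete, here is a minimal sketch in Python. The data structures and the interpret function are illustrative assumptions for this book's flight example, not the API of any real speech platform.

```python
# Illustrative sketch of the three VUI elements; all names are invented.

# Prompts: the recordings or synthesized messages played to the caller.
PROMPTS = {
    "ask_flight_number": "Now do you know what the flight number is?",
    "ask_departure_city": "No problem. What's the departure city?",
}

# Grammar: the phrases the system can understand after a given prompt,
# each mapped to a semantic tag. Anything else is out of grammar.
FLIGHT_NUMBER_GRAMMAR = {
    "no": "NO",
    "no, i don't": "NO",
    "yes": "YES",
    "more options": "MORE_OPTIONS",
}

def interpret(utterance: str, grammar: dict[str, str]) -> str | None:
    """Dialog-logic step: map a caller utterance to a semantic tag,
    returning None for out-of-grammar input."""
    return grammar.get(utterance.strip().lower())

print(interpret("No, I don't", FLIGHT_NUMBER_GRAMMAR))  # -> "NO"
```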

The following example is an interaction between a caller and a flight information application:


(1)

SYSTEM: Hello, and thanks for calling BlueSky Airlines. Our new automated system lets you ask for the flight information you need. For anything else, including reservations, just say "more options." Now do you know what the flight number is?

CALLER: No, I don't.

SYSTEM: No problem. What's the departure city?

CALLER: San Francisco.

. . .


In this application, a voice actor has prerecorded everything the system says. Following the prompt "Now do you know what the flight number is?" the system listens, using a grammar that accommodates caller inputs such as "No," "No, I don't," "Yes," "Yeah, it's flight two twenty seven," and so on. The dialog logic then decides what to do next, depending on the answer; in this case, it prompts the caller for the departure city. If the dialog succeeds, the system ultimately provides the desired flight information.
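A sketch of how such dialog logic might branch on the recognizer's result is shown below. The state names, semantic tags, and slot handling are assumptions made for illustration, not the implementation behind this example.

```python
# Hypothetical dialog-logic sketch for the flight information example.
# State names, tags, and slots are invented for illustration.

def handle_flight_number_answer(tag: str, slots: dict) -> tuple[str, str]:
    """Decide the next dialog state and prompt from the caller's answer."""
    if tag == "NO":
        # Caller doesn't know the flight number: collect the route instead.
        return "ASK_DEPARTURE_CITY", "No problem. What's the departure city?"
    if tag == "YES" and "flight_number" in slots:
        # "Yeah, it's flight two twenty seven" fills the slot directly.
        return "LOOKUP_FLIGHT", f"Flight {slots['flight_number']}. One moment."
    # Out-of-grammar or incomplete answers lead to a reprompt.
    return "ASK_FLIGHT_NUMBER", "Sorry, do you know the flight number?"

state, prompt = handle_flight_number_answer("NO", {})
print(state, "->", prompt)  # ASK_DEPARTURE_CITY -> No problem. ...
```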

The methodologies and design principles for VUI design overlap substantially with those used for other types of user interface design. However, there are a number of characteristics of voice user interfaces that pose unique design challenges and opportunities. Two primary characteristics stand out: the modality is auditory, and the interaction is through spoken language.

1.1.1 Auditory Interfaces

An auditory interface is one that interacts with the user purely through sound: typically, speech input from the user and speech and nonspeech output from the system. Nonspeech output (often referred to as nonverbal audio, or NVA[2]) may include earcons (auditory icons, or sounds designed to communicate a specific meaning), background music, and environmental or other background sounds.

[2] A term introduced by Wally Brill.

Auditory interfaces present unique design challenges in that they depend on the ability to communicate with a user through transient, or nonpersistent, messages. The user hears something, and then it is gone. There is no screen to display information, instructions, or commands, as is the case with a visual Web interface, where items can be accessed over time at will. Users do not have the opportunity to review the system's output or state their wishes at their own pace. Rather, the pacing is largely controlled by the system.

The ephemeral nature of output in auditory interfaces places significant cognitive demands on the user. There are a number of guidelines you can use to make sure your interface designs do not overload the user, do not unduly challenge short-term memory or learning ability, and provide a means for the user to influence the pacing of the interaction. We cover these guidelines in Chapter 9.

Multimodal interfaces that combine speech with other modalities have the potential to mitigate some of these problems. Even a small screen, when combined effectively with a speech interface, can significantly reduce the cognitive demands on the user, thereby changing some of the design criteria and trade-offs.[3] However, given the immature state of the multimodal device industry and the large number of spoken language systems that are currently being designed and deployed for use in traditional telephony networks, this book focuses on purely auditory interfaces. Many of the same design principles, with appropriate refinement, can be applied to multimodal design. Consideration of multimodal interfaces that include speech will be left for a future volume.

[3] See Oviatt 1996 for a review of studies showing the complementary power of speech and other modalities.

Despite the challenges, auditory interfaces offer a number of unique design opportunities. People rely on their auditory systems for many levels of communication. Listeners derive semantic and other information not only from word choice and sentence structure but also from the way a message is delivered: from prosody (intonation, stress, and rhythm), voice quality, and other characteristics. By carefully choosing a voice actor (for recording the prompts the system will play to the user) and effectively coaching the way the prompts are spoken, you can help create a consistent system persona, or personality. This offers opportunities for branding and for creating a user experience appropriate to the application and user population.[4] We discuss the crafting of a persona in Chapter 6.

[4] Clearly, other features such as the wording of prompts also play a role in persona creation.

Auditory interfaces offer an additional opportunity based on effective use of nonverbal audio. You can use earcons to deliver information (e.g., a sound indicating "voice mail has arrived") without interrupting the flow of the application. Distinctive sounds can be used to landmark different parts of an application, thus making it easier to navigate. Additionally, nonverbal audio such as background music and natural sounds can create an auditory environment for the user, thereby creating a unique sound and feel associated with a particular business or message. Designers, through effective use of nonverbal audio, can solve user interface problems in new ways as well as exploit opportunities for added value.
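As a rough illustration of delivering information without interrupting the flow, the sketch below queues an earcon ahead of the next prompt. The file names and the queue model are assumptions, not a real audio API.

```python
# Illustration only: an earcon delivers a notification ("voice mail has
# arrived") without breaking the dialog flow. File names and the queue
# model are assumptions, not a real audio API.

from collections import deque

output_queue: deque[str] = deque(["prompts/ask_departure_city.wav"])

def notify_new_voicemail(queue: deque[str]) -> None:
    """Schedule a short chime before the next prompt is played."""
    queue.appendleft("earcons/voicemail_chime.wav")

notify_new_voicemail(output_queue)
print(list(output_queue))
# ['earcons/voicemail_chime.wav', 'prompts/ask_departure_city.wav']
```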

1.1.2 Spoken Language Interfaces

Voice user interfaces are unique in that they are based on spoken language. Spoken communication plays a big role in everyday life. From an early age, we spend a substantial portion of our waking hours engaged in conversations. An understanding of human-to-human conversation can be brought to bear to improve the conversations we design between humans and machines.

Humans share many conversational conventions, assumptions, and expectations that support spoken communication; some are universal, and others apply in specific language communities. These conventions, assumptions, and expectations operate at many levels, from the pronunciation, meaning, and use of words to expectations about such things as turn-taking in conversations. Some expectations people bring to conversation are conscious, but many operate outside our awareness. Although largely unconscious, these shared expectations are key to effective communication.

An understanding of these shared expectations is essential to the design of a successful spoken language interface. Violation of expectations leads to interfaces that feel less comfortable, flow less easily, are more difficult to comprehend, and are more prone to induce errors. Effective leverage of shared expectations can lead to richer communication and streamlined interaction. In Chapters 10 and 11 we cover many of the expectations speakers bring to conversations and show you how to leverage that understanding in the design of VUIs.

Two other realities of spoken language have major impacts on VUI design choices and design methodology. First, humans learn spoken language implicitly at a very young age rather than through explicit education. In contrast, most other user interfaces depend on specific learned actions designed to accomplish the task at hand (e.g., choosing operations from a toolbar, dragging and dropping icons). Therefore, the VUI designer must work on the user's terms, with an understanding of the user's conversational conventions. As designers, we don't get to create the underlying elements of conversation.

Second, communication through language is largely an unconscious activity in the sense that speakers usually do not explicitly think about word choice, pronunciation, sentence structure, or even turn-taking. Their conscious attention is instead on the meaning of the message they wish to communicate. Therefore, despite the fact that we all engage in conversation, designers are at risk for violating the user's conversational expectations unless they have some explicit knowledge of conversational structure, as discussed in Chapters 10 and 11. Furthermore, design approaches that make explicit the role of each prompt in the conversational context in which it occurs will maximize the ability of designers to bring to bear their own unconscious conversational skills as they design dialogs. The detailed design methodology discussed in Chapter 8 will show how this can be done.


