11.8 TTS Guidelines


The epitome of science fiction is a human having an intelligent conversation with a computer. The image is commonplace in movies and television as humans talk to the holodeck on the Enterprise, to the android C-3PO, and to HAL, the definitive supercomputer. What makes these interactions notable is the computer's apparently effortless understanding of human speech, as well as the clear, intelligible, natural, humanlike quality of the computer's speech.

The availability of high-quality text-to-speech (TTS) engines brings us ever closer to the worlds of our imagination, where computers speak like humans. As the process of generating natural-sounding spoken language from text becomes increasingly automated, we can expect to find many new practical applications of the technology to everyday life.

In the world of voice user interfaces, TTS offers a number of conveniences. Because there is no need for prerecorded audio files, implementation is relatively easy. Developers can produce prompts quickly and edit them easily. More importantly, the content of TTS messages can be spontaneous and can be changed dynamically. In contrast, applications that rely on prerecorded audio files alone are best suited for conveying information that is static. When the information changes often or needs editing, the voice talent must be called in to record either whole utterances or concatenation units that will in turn form whole utterances. But no amount of this kind of prerecording can handle situations in which information cannot be predicted or constrained, as in the case of e-mail playback. TTS renders these problems moot because it lets you easily integrate spontaneous and dynamic texts, such as e-mail and late-breaking news stories, into speech applications.

Despite these conveniences, you should consider certain human factors issues when using TTS in your VUI. Even though the quality of TTS has greatly improved over the past ten years, users are still very much aware that TTS is not human speech, and many if not most users prefer recordings of human speech. One area of dissatisfaction is simply that TTS output is more difficult to understand. This is not surprising. As you have seen in this chapter, the prosody of messages is highly dependent on contextual cues. When a native speaker reads aloud an e-mail or news story, he or she gathers contextual cues over the course of the text, which can be lengthy. So in real life, the prosody of a particular sentence is often determined by information that was presented several sentences earlier. The current state of TTS technology does not acknowledge these kinds of contextual cues, let alone gather them and in turn realize them in natural-sounding prosody. The added cognitive burden required for comprehending TTS is especially problematic for nonnative speakers and elderly users.

Some of these issues, however, are mitigated to the extent that listeners seem to be able to adapt to the sound of TTS over time. There is evidence that repeated exposure to TTS improves comprehension. Presumably, listeners learn to "turn off" their implicit expectations for phonetic and prosodic naturalness. Moreover, if you use a high-quality TTS engine and if the content of your output is easy (that is, predictable and constrained), then many comprehension difficulties can be avoided.

When you're deciding how to integrate TTS into an application, consider the following guidelines to help optimize usability.

11.8.1 Analyze Application Usage

Estimate how often and in what environments callers will be using the application. If most callers are one-time users in noisy environments, TTS will be more difficult to comprehend. Also, if callers are performing another task, such as driving, while listening to the system, then TTS will impose more cognitive demands on the caller and serve as a distraction.

Ideally, you should use TTS in applications that will be accessed by repeat callers in quiet, undemanding environments. Repeated exposure to the system will give callers the opportunity to become accustomed to the voice, something that has been shown to facilitate comprehension.

11.8.2 Choose an Appropriate Voice

Consider the target audience and type of application when you choose the gender of the TTS voice. The voice itself will affect users' impression of the system. Unsurprisingly, it has been found that the gender of TTS voices elicits cultural connotations, just as human voices do.

Also consider how the TTS voice will sound next to the voice of your prerecorded audio files. Actually, you may want to exploit the distinction between the two so that any negative impressions of TTS, which are likely beyond your control, do not "contaminate" users' favorable evaluation of the application's principal persona, over which you should have total control. For example, if the persona of a voice portal that reads e-mail and news is a woman, then the TTS rendering of headers and content should perhaps be in a man's voice. Unless you are using a very high-quality, sophisticated TTS engine to read items whose form is predictable and constrained, do not attempt to pass off TTS as natural, persona-rich recordings.

11.8.3 When Possible, Use Audio Recordings

Because human speech is generally preferred over TTS, you should use audio recordings of a professional voice actor whenever possible. In some cases a single sentence may have sections that are dynamic, whereas other parts are always static, as in (34) and (35).

(34) First message: | "meeting time."

(35) The street address is: | 1313 Mockingbird Lane.

The question in these cases is whether the consistency of TTS would be better than combining TTS with recorded speech. Research at the University of Edinburgh (CCIR-5 1999) and British Telecom has shown that users prefer prompts that use both TTS and recorded speech rather than TTS for the entire prompt. So in these examples, "First message" and "The street address is" would be recorded by a voice actor, and "meeting time" and "1313 Mockingbird Lane" would be in TTS.

Note that the use of the colon in (34) and (35) is intentional. The voice actor should read the text so as to mindfully "announce" the entry of a new breath group delivered in a different voice, that is, the voice of TTS. The first part of both (34) and (35) should be read with a slight sense of suspension, and following the colon there should be a brief pause, as recommended here. Without the colon, we have found that professional voice talents are inclined to deliver these sentence fragments as concatenation units that wrongly suggest only the first part of a breath group and only the first part of an intonation contour. When these prosodically incomplete recordings are concatenated with their TTS complements, the result is jarring. (It is as if two voices are somehow working off one set of lungs!)

As always, set the context for your voice actors. Let them know that they are introducing TTS.
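The split described for (34) and (35) can be sketched as a simple prompt plan: a recorded carrier phrase, a brief pause, and then the dynamic content rendered in TTS. This is only an illustration. The `Segment` type, the `build_prompt` helper, the file name, and the 250 ms pause are all hypothetical; the text recommends a brief pause but does not give a duration, and a real platform would have its own way of queuing audio and TTS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str     # "recorded", "pause", or "tts"
    content: str  # file name, pause length in ms, or text to synthesize


def build_prompt(carrier_file: str, dynamic_text: str, pause_ms: int = 250) -> List[Segment]:
    """Plan a prompt: recorded carrier phrase, brief pause, then TTS content.

    The default pause length is an assumed value, not one taken from the text.
    """
    return [
        Segment("recorded", carrier_file),
        Segment("pause", str(pause_ms)),
        Segment("tts", dynamic_text),
    ]


# Example (35): the voice actor records the carrier; TTS reads the address.
plan = build_prompt("street_address_is.wav", "1313 Mockingbird Lane.")
for segment in plan:
    print(f"{segment.kind:>8}: {segment.content}")
```

Keeping the pause as its own segment makes it easy to tune the handoff between the two voices without rerecording the carrier phrase.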

11.8.4 Make Content Easy to Understand

As stated earlier, TTS allows you to include information in your application that could not possibly have been recorded in the studio; in other words, content you have little or no control over. When you do have control, however, try to use simple vocabulary and grammar. The simpler the text, the greater the intelligibility and usability of the application. Text should provide ample context to ground the information in concepts that are generally known to most users. Repeating important ideas will also help users retain them.

If the content appears somewhat difficult, try slowing the TTS speaking rate. A comfortable rate for most listeners is between 150 and 200 words per minute.
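As a rough aid, the following sketch keeps a requested speaking rate inside the 150 to 200 words-per-minute range mentioned above and estimates how long a passage will take to play at that rate. The function names are hypothetical, and counting words by splitting on whitespace is a simplifying assumption; how a rate is actually passed to a TTS engine depends on the engine.

```python
COMFORT_MIN_WPM = 150  # lower end of the comfortable range given in the text
COMFORT_MAX_WPM = 200  # upper end of the comfortable range given in the text


def clamp_rate(wpm: float) -> float:
    """Pull a requested rate back into the comfortable range."""
    return max(COMFORT_MIN_WPM, min(COMFORT_MAX_WPM, wpm))


def estimated_seconds(text: str, wpm: float) -> float:
    """Estimate playback time, counting whitespace-separated words."""
    words = len(text.split())
    return words * 60.0 / wpm


message = "Your meeting with the design team has moved to three o'clock."
rate = clamp_rate(240)  # too fast, so it is reduced to the upper bound
print(rate, round(estimated_seconds(message, rate), 1))
```

For difficult content, you would request a rate near the lower bound and check that the resulting prompt length is still acceptable.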

11.8.5 Use Appropriate Formats

Often, we take for granted the way certain types of information are expressed. For example, the number 1313 in example (35) should be read "thirteen thirteen" instead of "one thousand three hundred thirteen." Similarly, the zip code "94536" should be read as "nine four five three six," never "ninety-four thousand, five hundred thirty-six."

Make sure that all abbreviations, too, are read back in the appropriate format. For example, "St. Andrews St." should be read back as "Saint Andrews Street"; otherwise, it will likely be unintelligible. Aberrant delivery will distract the listener and divert attention from the surrounding content.
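The formatting rules in this section can be approximated by normalizing the text before it reaches the TTS engine. The sketch below is a minimal illustration, not a complete normalizer: all function names are hypothetical, the house-number rule handles only four-digit numbers, and the "St." rule is a simple heuristic (a following capitalized word means "Saint") that will misfire on some inputs. Many TTS engines do some of this normalization themselves, so check what yours already handles.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def digit_by_digit(digits: str) -> str:
    """Read each digit separately, as for a zip code."""
    return " ".join(ONES[int(ch)] for ch in digits if ch.isdigit())


def two_digit_words(pair: str) -> str:
    """Read a two-digit group such as "13" or "05"."""
    n = int(pair)
    if n < 10:
        return "oh " + ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def house_number_words(number: str) -> str:
    """Read a four-digit house number as two pairs; otherwise digit by digit."""
    if len(number) != 4 or not number.isdigit() or number[0] == "0":
        return digit_by_digit(number)
    head = two_digit_words(number[:2])
    tail = "hundred" if number[2:] == "00" else two_digit_words(number[2:])
    return f"{head} {tail}"


def expand_st(text: str) -> str:
    """Heuristic: "St." before a capitalized word is "Saint", otherwise "Street"."""
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    return re.sub(r"\bSt\.", "Street", text)


print(house_number_words("1313"))
print(digit_by_digit("94536"))
print(expand_st("St. Andrews St."))
```

Whatever rules you adopt, test them against real samples of your data; address and abbreviation formats vary widely.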

11.8.6 Mark Up Text for Naturalness

There are also strategies for marking up the text to be delivered by TTS in order to get the most natural results possible. For example, some researchers have shown that inserting pauses between sentences and between major phrases and adding breath intake sounds before sentences can improve the naturalness of the text being presented and make it easier to remember. In addition to the use of pauses to facilitate intelligibility, there are prosodic mark-up strategies so that TTS will simulate natural, humanlike stress and intonation patterns, as described throughout this chapter. You can also improve pronunciation by adding phonetic spellings to the TTS dictionary for words it consistently gets wrong. For an overview of methods for optimizing the quality of TTS for your application, see Ishihara (2003).
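One way to apply pause and rate markup is through SSML, the W3C Speech Synthesis Markup Language. SSML is not named in this section, so treat it as an assumption; engines differ in which elements they support and how they render them. The sketch below wraps each sentence in an `<s>` element, inserts a `<break>` between sentences, and sets an overall `<prosody>` rate. The `to_ssml` function, the 400 ms pause, and the naive sentence splitter are all illustrative choices.

```python
import re
from xml.sax.saxutils import escape

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def to_ssml(text: str, pause_ms: int = 400, rate: str = "medium") -> str:
    """Wrap sentences in <s>, separated by <break>, inside one <prosody>.

    Sentence splitting here is naive (whitespace after ., !, or ?), so
    abbreviations in the middle of a sentence will be split incorrectly.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    parts = [f"<s>{escape(s)}</s>" for s in sentences]
    body = f'<break time="{pause_ms}ms"/>'.join(parts)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
        f'<prosody rate="{rate}">{body}</prosody>'
        "</speak>"
    )


ssml = to_ssml("You have one new message. It is from Bob & Co.", rate="slow")
print(ssml)
```

Escaping the text matters in practice: e-mail and news content often contains characters such as `&` and `<` that would otherwise produce invalid markup.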

Human factors issues become important at many levels when you're thinking about how to integrate TTS, from the high-level goals of your application to the details of number formats. Carefully considering all these issues will help make TTS an important part of a human-centered, user-friendly interface. As the technology that enables computers to understand and speak becomes more widely available and gains acceptance, the possibilities of how and where it can be used are almost endless.



Voice User Interface Design 2004
ISBN: 321185765
EAN: N/A
Year: 2005
Pages: 117
