17.1 Scripting for Success | Voice User Interface Design 2004

If one of your design goals is to create an engaging experience for the user, one that is linguistically and socially familiar, then an essential step is to prepare a script that will enable the voice actor to deliver messages that flow smoothly in the context of an actual interaction between the user and the system. A well-devised recording script can prevent prompts from being delivered with the wrong prosody or from sounding fatigued, insecure, or confused. The success of one of your design goals therefore depends on two general areas of concern:

Voice actors should have access to the information they need in order to deliver recordings that capture the right persona and prosody. That is, they should know the intonation, stress patterns, phrasing, and rate of speech that are appropriate to the prompt's function in the context of an actual interaction between the system and a user.
The voice actor's script should be formatted for ease of use. It should be clear and legible and should not be visually or mentally fatiguing to the actor.

17.1.1 An Introductory Case Study

As we've mentioned, the two main ideas in this chapter on script preparation are to provide adequate context for the voice actor and to format the material with a view to the actor's comfort and convenience. These concerns are requisites for high-quality, good-sounding results, but they are frequently ignored in practice, as we see in the script excerpt shown in Figure 17-1. (The actual prompt text has been changed in the interest of anonymity.)

Figure 17-1. This excerpt from an actual script reflects more concern for technicians than the actor.

The script does not explicitly say so, but (from left to right) the first column refers to the item number on the script. The second is the name of the audio file. The third is the text to be recorded. The last column is reserved for notes to the engineer about individual audio files.

The excerpt in Figure 17-1 reflects more concern for technicians than actors. In general, the script supplies no information illuminating the function of any of the recordings in context. As it happens, in this particular application the vehicle make and model should be recorded as in (1).

(1)

Your vehicle, a Chrysler LeBaron, will be ready for pick-up on . . .

That is, the make and model information should be recorded as a breath group all its own, standing in nonfinal position in the sentence. Another possibility would be to record the make and model as the second half of what is intended to pass as a single, continuous breath group that stands in final position, as in (2).

(2)

Your vehicle is a Chrysler LeBaron.

In either case, the intonation of "a Chrysler LeBaron" is dictated by its contextually determined function as well as placement in the utterance.

This basic lack of concern for contextualization manifests itself in a number of other ways throughout this script. For example, another section lists small function words that are generally unstressed such as "a," "an," "the," "or," "and," and "if" in isolation, out of context. In addition, the script seems to have been generated automatically from another document and sorted according to strict alphabetical order by filename. Consequently, portions of multipart messages, such as the vehicle type fragment, are dispersed throughout the script, often separated by dozens of pages. But the user, of course, is going to experience multipart messages as a unit, so they are best grouped in the same vicinity on the script.

Another oddity is that the type of vehicle in this excerpt occupies nine lines of this script, eight of which are consecutive. Does this mean that all of them are to be inflected identically? Or do the nine occurrences suggest that they are somehow different? Why is so much space and ink, not to mention time and effort, devoted to redundancy? (As it turns out, only one recording was intended; the others were to be copies of the audio file.) Where we would hope to find some contextual clues or direction notes for the voice actor, who should be the center of attention during the recording session, we instead find notes ordering file copying and naming operations for the sound engineer. Presumably, the voice actor or director is charged with reading these file management notes to find out whether the item to the left should be recorded.
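One way to keep file-management detail out of the actor's view is to collapse repeated prompt text into a single recording item and carry the copy targets separately. The following is a minimal sketch, not the tool that produced the script in Figure 17-1. The row format, the filenames, and the function name are hypothetical, and the sketch assumes that identical text really is meant to be a single recording, which the designer would need to confirm.

```python
def collapse_duplicates(rows):
    """Group (filename, text) rows so that each distinct text is recorded once.

    Returns one entry per distinct text, with the first filename as the
    recording target and any later filenames listed as copies.
    """
    grouped = {}
    for filename, text in rows:
        # Normalize whitespace and case so trivial differences do not split a group.
        key = " ".join(text.split()).lower()
        if key not in grouped:
            grouped[key] = {"text": text, "record_as": filename, "copy_to": []}
        else:
            grouped[key]["copy_to"].append(filename)
    return list(grouped.values())


# Hypothetical rows standing in for the repeated vehicle-type lines.
rows = [
    ("veh_type_01.wav", "a Chrysler LeBaron"),
    ("veh_type_02.wav", "a Chrysler LeBaron"),
    ("veh_type_03.wav", "a Chrysler  LeBaron"),
]
items = collapse_duplicates(rows)
```

The actor's script would then show one line for the vehicle type, and the copy list would go to the sound engineer.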

The format of this script presents additional problems. For example, it lists 493 prompts printed in 10-point type. This is more likely to give the voice actor a headache, eyestrain, and fatigue than inspiration to voice an interface that sounds upbeat and engaging.

There are many ways to improve this script so that it gives the voice actor the information needed to deliver a prosodically accurate, natural-sounding read. This is the subject of the following section.

17.1.2 Scripting Tips

What can you do to build the voice actor's contextual knowledge and ensure the best delivery possible? The following sections describe techniques and protocols for helping the voice actor understand the context and therefore the prosody requirements of the messages to be recorded.

The voice actor should also receive direction regarding persona, as well as information about the target users of the application, its areas of functionality, high-level business goals, and so on. This is discussed in section 17.3.1.

Create Useful Direction Notes

Direction notes can be placed in a special "Comments" column. They explain contextual cues: for example, what the preceding prompt was, what the user has just said, and any other information that will help the voice actor deliver the prompt with appropriate prosody. See Figure 17-2.

Figure 17-2. Good direction notes (shown in the "Comments" column) give the actor and director the context for each prompt.

Many voice actors prefer that you not write out numbers as words (e.g., "twenty-seven"), especially for digit strings whose formats adhere to familiar conventions, such as phone numbers and times of day. In such cases, rely on direction notes to help the voice actor interpret potentially ambiguous forms. For example, does the numeral "0" mean "oh" or "zero"? Does "1200" mean "one two zero zero" or "twelve hundred"? Does "727" mean "seven two seven" or "seven twenty-seven"? These are important questions, especially considering that VUI users adopt certain linguistic forms and behaviors that the application itself evidences, and presumably reinforces, through prompting.
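If the script is generated from a prompt list, a small check can remind the script writer which prompts contain numerals and therefore may need a direction note about how to read them. This is only a sketch under assumed conventions: the function name is hypothetical, and the pattern simply finds runs of digits (optionally joined by commas, periods, colons, or hyphens) without deciding which reading is correct.

```python
import re

# A digit, optionally followed by more digits or separators, ending on a digit.
DIGIT_RUN = re.compile(r"\d[\d,.:\-]*\d|\d")


def numerals_needing_notes(prompt_text):
    """Return each numeral in the prompt so a reading can be noted for it."""
    return DIGIT_RUN.findall(prompt_text)


found = numerals_needing_notes("Flight 727 departs at 1200.")
```

Each numeral returned is a candidate for a note in the "Comments" column, such as a request to read "1200" as "twelve hundred".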

Another use of the "Comments" column is for pronunciation tips. The "Comments" column in Figure 17-3 uses an informal system to show how some unusual city names should be pronounced.

CONTEXT:	"Sure, here's the weather for Bangor, Maine."

Figure 17-3. A "Comments" column might include pronunciations of unfamiliar terms.

The phonetic transcriptions in this example are informal, so make sure that your transcriptions do not yield ambiguous or confusing interpretations. You may want to try out your phonetic transcriptions on a few colleagues before finalizing the script.

Group Related Items and Contextualize

Keep prompts together that are similar in function. Order them according to how the user might hear them in an actual interaction with the system. That is, the initial or top-level prompt should be presented first, error prompts in the order they will be played, and so on. Provide notes for context where appropriate (see Figure 17-4).

Figure 17-4. Group like items, and order them according to the order callers will hear them.

It is typical to find such items, along with everything else to be recorded that day, arranged on scripts in strict alphabetical order by filename. Because of alphabetical sorting, we have seen many scripts in which error prompts actually precede the initial or top-level prompt. (To make matters worse, on many of these scripts the filenames themselves fail to reveal the prompt's function, such as "initial" versus "first-time error" versus "help.") Instead, decide on a more intuitive ordering of prompts by state, and be consistent.

It also helps to give prompts names that voice actors will find intuitive, such as get_time_err2 instead of 20.wav. The more intuitive and consistent the presentation of material in the script, the less coaching will be needed in the long run.
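The ordering described above can be sketched as a sort on state and prompt role rather than on filename. This is an illustration under stated assumptions: the role labels, the dictionary layout, and every prompt name other than get_time_err2 are hypothetical, and the order of states would come from the dialog design.

```python
# Assumed order in which a caller hears the prompts within one state.
ROLE_ORDER = {"initial": 0, "err1": 1, "err2": 2, "help": 3}


def order_for_script(prompts, state_order):
    """Sort prompts by dialog state, then by the order they are played."""
    state_rank = {state: rank for rank, state in enumerate(state_order)}
    return sorted(
        prompts,
        key=lambda p: (state_rank[p["state"]], ROLE_ORDER[p["role"]]),
    )


# Listed here in alphabetical order by name, as an extraction tool might emit them.
prompts = [
    {"name": "get_date_err1", "state": "get_date", "role": "err1"},
    {"name": "get_date_initial", "state": "get_date", "role": "initial"},
    {"name": "get_time_err1", "state": "get_time", "role": "err1"},
    {"name": "get_time_err2", "state": "get_time", "role": "err2"},
    {"name": "get_time_help", "state": "get_time", "role": "help"},
    {"name": "get_time_initial", "state": "get_time", "role": "initial"},
]
ordered = order_for_script(prompts, ["get_date", "get_time"])
ordered_names = [p["name"] for p in ordered]
```

Note that the alphabetical input places both error prompts before the initial prompt of the same state, which is the problem the sort corrects.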

Especially where concatenation items are concerned, keep prosodically related items together in the script. For example, if your application requires recordings of digits with a falling, final intonation contour (%1; see Chapter 11), record these digits in sequence. Similarly, record all digits with rising intonation (%3) as a separate group, and all digits with prepausal, nonfinal intonation (%2) as yet another group.

A popular alternative, which we do not endorse, is to interleave these three intonation types, for example, by recording "one" (rising), "one" (prepausal, nonfinal), and "one" (falling, final), "two," "two," "two," and so on. This approach is highly error-prone because the interleaved sequence makes most voice actors automatically revert to list intonation, and this means that recordings intended as nonfinal/prepausal (%2) end up sounding identical or at least very similar to the rising versions (%3). In this case, list intonation is an inadvertent and undesirable artifact of the material's presentation.
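The grouped arrangement recommended here, one block per contour rather than three takes per digit, might be generated along the following lines. This is a sketch only: the data layout and function name are hypothetical, and the contour labels simply restate the %1, %2, and %3 descriptions given above.

```python
# Contour codes as used in this chapter, with a direction note for each.
CONTOURS = [
    ("%1", "falling, final"),
    ("%2", "prepausal, nonfinal"),
    ("%3", "rising"),
]
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def grouped_digit_script(digits=DIGITS):
    """Build one recording group per contour, each listing every digit once."""
    return [
        {"contour": code, "direction": direction, "items": list(digits)}
        for code, direction in CONTOURS
    ]


script = grouped_digit_script()
```

Because each group carries a single direction note, the actor can hold one contour in mind for the whole block instead of switching on every item.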

Consider an application that reads back cardinal numbers in dollar amounts. The cardinal numbers are zero through 99, 100 to 900 by hundreds, 1,000 to 99,000 by thousands, and so on. (As it happens, the application also reads back dates consisting of names of months followed by ordinal numbers, but we will deal with this case separately because it constitutes a distinct prosodic and semantic module.)

In any case, there are several ways that all this material can be scripted. As always, keep prosodically and semantically related items together, and provide adequate context, as in Figure 17-5.

CONTEXT:	"Here's your account balance: five hundred eighty-two dollars and sixty-five cents."

Figure 17-5. Keep prosodically and semantically related items together, and provide adequate context.

The underlined words in the context header are the recording targets for this portion of the script. The pipe symbol ( | ) indicates a concatenation break; at these points, the voice actor should allow the slightest of pauses so that the sound engineer has just enough "room" to crop out and save the desired take. There are two reasons for preferring the pipe symbol over the more usual indicator of concatenation breaks, the ellipsis (. . .). First, ellipses are often used to suggest a more prominent, more drawn-out pause, which in turn has a significant effect on the prosody of the preceding material. Second, the pipe symbol simply takes up less horizontal space.
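For the sound engineer, a script line marked with pipe symbols maps directly onto the list of takes to crop. The sketch below assumes that a script line is available as plain text with " | " at each concatenation break. The example line is hypothetical and is not copied from Figure 17-5.

```python
def split_takes(script_line):
    """Split a script line at each pipe symbol into the takes to be cropped."""
    return [segment.strip() for segment in script_line.split("|") if segment.strip()]


takes = split_takes("five hundred | eighty-two dollars")
```

Here the single line read by the actor yields two segments for the engineer to crop and save.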

In Figure 17-5, each item yields at least two files to be cropped and saved. Although it may seem time-consuming to organize a script in this way, this example shows that scripts can be prepared with efficiency in mind and still provide all the benefits of recording in context. Of course, it is not always practical to massage a script to this degree of leanness and economy, especially under time pressure. The excerpt in Figure 17-6 is a simpler version of Figure 17-5 but will consequently require a longer completed script.

CONTEXT:	"Here's your account balance: five hundred dollars and sixty-five cents."

Figure 17-6. This simpler version of Figure 17-5 will require a longer completed script.

Most developers seem to prefer the quicker, more straightforward preparation of Figure 17-6 to the more time-consuming, linguistically strategized script of Figure 17-5.

The same scripting techniques hold for dates. For example, "first" to "thirty-first" can be scripted and captured along with the months, as in Figure 17-7. To facilitate cropping, make sure the voice actor leaves a little space between the month and the ordinal. Alternatively, the ordinals and months can be captured more simply in separate lists (not shown here). Either way, you should always provide context to ensure natural-sounding results.

CONTEXT:	". . . And in interest, you've earned: one dollar and sixty-nine cents on October seventh."

Figure 17-7. Recording months and ordinals together will make for more natural-sounding concatenation.

Figure 17-8 shows how you might script a set of numbers that will figure as winning scores in a sports application.

CONTEXT:	The Mets defeated the Padres thirteen to seven.

Figure 17-8. Here is how you might script sports scores.

Winning-losing score sequences more or less conform to a rising-falling intonation contour, assuming they fall at the end of declarative sentences (statements). By capturing these pairs in their naturally occurring context, you can attain more natural results.

As an aside, we recommend recording "to" with the number that follows rather than concatenating it as needed. When the preposition "to" is recorded by itself and concatenated with a number, it often sounds like the number "two," which is confusing in this context. This is because the preposition "to" is conversationally pronounced with a reduced vowel (schwa), whereas "two" is never reduced.

Again, many developers prefer to record the high and low scores in separate recording groups. In this case, make sure that the voice actor is familiar with the context to deliver the target files with the appropriate prosody. This scripting technique applies to a number of other scenarios, such as high-and-low temperature sequences, for example, ". . . with a high of 72 and a low of 63."

Indicate Contrastive Stress

If the context calls for special emphasis on a particular word (contrastive stress), this should be indicated on the script. Figure 17-9 shows how contrastive stress can be indicated with italics. (These prompts have been culled from various applications.)

Figure 17-9. The script indicates contrastive stress using italics.

Special or unusual stress patterns can also be indicated with accent marks, underlining, or capital letters, as in (3).

(3)

Is thát right?

Is _that_ right?

Is THAT right?

Use Punctuation Wisely

Most voice actors take their punctuation fairly seriously. Observing details such as punctuation, along with phonetic tips and direction notes, is emblematic of professionalism in this industry.

About Alphabetization

Scripts that are organized according to alphabetical order by filename obscure context, which is essential for delivering appropriate prosody. In theory, alphabetization does not necessarily imply undesirable context insensitivity, but it always seems to be the case in practice. The director and actor thus are faced with having to figure out the context in spite of the script, and that imposes an undue cognitive burden on both.

Consider the excerpt in example (4), which was taken from an actual script.

(4)

Eighty-three thousand

Eighty-two

Eighty-two thousand

Eleven

Eleventh

eleven thousand

February

Fifteen

fifteenth

fifteen thousand

fifty

fifty thousand

fifty-eight

fifty-eight thousand

fifty-five

fifty-five thousand

This hodgepodge of cardinal numbers, ordinal numbers, and months is mentally taxing if the goal is to ensure prosodic appropriateness. It is highly unlikely that these items are to be recorded with identical intonation contours, so the voice director will have to direct and redirect as the actor progresses from item to item. The voice actor, in turn, will have to internalize a different context for each item to produce a context-sensitive, prosodically appropriate delivery.

In other words, the presentation of material in (4) requires a lot of unnecessary mental gear-shifting. And in case you're wondering, there appears to be no obvious reason why some items begin with a capital letter whereas others do not, but this is a formatting issue. Compare this unwieldy list with the more manageable arrangements of this same material recommended in Figures 17-5, 17-6, and 17-7.

Often, scripts are alphabetized because of the way certain tools are built. These tools are designed to extract filenames and the associated text from a document, such as a dialog specification document, and then sort the results. Alphabetization, however, does not always promote the voice actor's contextual knowledge of how items relate to each other.

Returning to punctuation, let's first consider commas. For example, the prosody of the message in (5) is different from that of (6).

(5)

Sorry, I'm having trouble understanding. Let's start over.

(6)

Sorry I'm having trouble understanding. Let's start over.

Specifically, in (5) the statement, "I'm having trouble understanding" is prefaced with a distinct transitional element, "sorry," similar in use to sentence-initial disjunct adverbs such as "unfortunately" or "regretfully." So there are three prosodic groupings here: "Sorry," "I'm having trouble understanding," and "Let's start over." In contrast, (6) consists of only two such groupings; the first means "I apologize for this difficulty," and the second is "Let's start over."

This point may seem minor, but it is only an example to illustrate how the comma can influence delivery and the listener's perception of thought groups, which are an essential aspect of prosody. Omitting commas, or inserting them where they don't belong, can also render a prompt unintelligible or ungrammatical to the reader or listener. For example, the prompt, "Tell me what's the registration number?" seems to be ungrammatical, as if it were written or spoken by a nonnative student of English. What the writer actually intended was, "Tell me, what's the registration number?" Without the comma, the question is ungrammatical.

Periods suggest "final" intonation (a melodic fall to a relatively low pitch level), so make sure you are using them appropriately. For example, each of the items in (7) is followed by a period and thus is delivered with the falling-final contour.

(7)

. . . 2001.

. . . 2002.

. . . a Boeing 727.

. . . an MD-11.

Perhaps owing to force of habit, there seems to be a tendency to put a period at the end of every line of a script, even when the final contour is not intended. We recommend against that practice. Use periods only where appropriate.
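A script check for stray periods could look like the following sketch. It assumes each script item records its intended contour using the chapter's %1 (falling, final), %2, and %3 codes. The item layout and function name are hypothetical, and an abbreviation such as "a.m." at the end of a nonfinal item would be flagged and need a manual look.

```python
def misplaced_periods(items):
    """Return items that end in a period but are not meant to sound final."""
    flagged = []
    for item in items:
        text = item["text"].rstrip()
        # An ellipsis at the end of a line is not a sentence-final period.
        is_ellipsis = text.endswith(". . .") or text.endswith("...")
        if text.endswith(".") and not is_ellipsis and item["contour"] != "%1":
            flagged.append(item)
    return flagged


items = [
    {"text": ". . . a Boeing 727.", "contour": "%1"},  # final: period is right
    {"text": "a Boeing 727.", "contour": "%2"},        # nonfinal: period misleads
    {"text": "Your vehicle, a Chrysler LeBaron . . .", "contour": "%2"},
]
flagged = misplaced_periods(items)
```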

Colons, too, are prosodically meaningful. Compare the effect of the colon in (9) and (11) with the lack of a colon in (8) and (10), respectively. Square brackets mark text-to-speech material in (10) and (11).

(8)

Your choices are The Pimco Low Duration Fund, The Pimco Stable Value Fund, or The Pimco Foreign Fund.

(9)

Your choices are: The Pimco Low Duration Fund, The Pimco Stable Value Fund, or The Pimco Foreign Fund.

(10)

The street address is [1313 Mockingbird Lane].

(11)

The street address is: [1313 Mockingbird Lane].

In these examples, the colon at the end of each opening phrase ensures a more graceful hand-off from a piece of naturally recorded audio to a concatenated list or to TTS. This colon cues the actor to treat the item as a single breath group all its own and invites a natural pause, which in turn will coincide with the first concatenation break.

Perhaps it seems obvious, but parentheses are the best way to indicate information that serves a parenthetical function and therefore calls for a parenthetical delivery, as in (12), (13), and (14).

(12)

I need to record your voice a few times so I'll be able to recognize your voice next time you call. (This is just for today.)

(13)

Got it. Here's your financial report . . . (Remember, all market data is delayed by at least twenty minutes.)

(14)

Whenever you like, you can also say "main menu" or "customer service." (Or if you're finished, feel free to hang up.) So, what would you like?

Parentheses should be used consistently, so don't use them to bracket direction notes, for example, "(Friendly)." To avoid confusion, you can place these notes in, for example, square brackets. Alternatively, you can separate them from the prompt text altogether and into their own column, as we have done throughout this chapter.

Avoid symbols that can be read aloud in more than one way, for example, the dashes (–) in "Our business hours are Monday–Friday, 8:30 a.m.–8:30 p.m., EST." This message can be more clearly scripted as, for example, "Our business hours are Monday through Friday, 8:30 a.m. to 8:30 p.m., Eastern Standard Time," where the first dash has been replaced with "through" and the second with "to." Note also that "EST" has been rewritten as "Eastern Standard Time," in the interest of clarity.

We sometimes find prompts that make sense in print but do not lend themselves to being read aloud. For example, "To be connected to a representative/agent, press 0." How does the author of this prompt intend the voice actor to read "representative/agent"? Is the voice actor supposed to say "representative 'slash' agent"? "Representative and/or agent"? Or was this simply some sort of mental Post-it that accidentally escaped revision? In any case, we recommend something like, "To speak with a customer service representative, press zero."
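Prompts like these can be caught before the recording session with a simple text check. The sketch below flags only two of the patterns discussed here, a slash between words and a dash used as a range. The check names and function name are hypothetical, and the patterns are deliberately narrow, so a hyphenated name such as "MD-11" is left alone.

```python
import re

CHECKS = {
    # A slash joining two words, as in "representative/agent".
    "slash": re.compile(r"\w/\w"),
    # An en or em dash anywhere, or a hyphen set off by spaces.
    "dash": re.compile(r"[\u2013\u2014]|\s-\s"),
}


def ambiguous_symbols(prompt_text):
    """Return the names of the checks that the prompt text fails."""
    return [name for name, pattern in CHECKS.items() if pattern.search(prompt_text)]


slash_result = ambiguous_symbols("To be connected to a representative/agent, press 0.")
dash_result = ambiguous_symbols(
    "Our business hours are Monday\u2013Friday, 8:30 a.m.\u20138:30 p.m., EST."
)
```

A flagged prompt still needs a human rewrite, because only the designer knows whether a dash was meant as "to," "through," or something else.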

Follow Practical Guidelines

Finally, there are some practical guidelines to keep in mind when you prepare the script:

Use a large font.
Use 1.5 or double spacing.
Print the script in landscape mode (wider than it is long). This will accommodate the columns we have recommended. In particular, try to make the prompt text column, which is the voice actor's center of attention, as wide as possible so that the actor's eyes don't tire from constantly having to jump from line to line after every few words.
Break pages between prompts rather than in the middle of a prompt.
Number the items on the script as well as the pages.
Don't staple. Most voice actors prefer to work with one page at a time. Stapled scripts are unwieldy.
Keep scripts current and accurate. This will help avoid version-control problems later on.
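The guideline about page breaks can be sketched as a simple pagination pass that never splits a prompt across pages. This is a minimal illustration under assumed inputs: each prompt is given as an item number and the number of printed lines it occupies, and both the function name and the page capacity are hypothetical.

```python
def paginate(prompts, lines_per_page):
    """Assign numbered prompts to pages without splitting any prompt.

    prompts: list of (item_number, line_count) pairs in script order.
    Returns a list of pages, each a list of item numbers.
    """
    pages, current, used = [], [], 0
    for number, line_count in prompts:
        # Start a new page if this prompt would not fit on the current one.
        if current and used + line_count > lines_per_page:
            pages.append(current)
            current, used = [], 0
        current.append(number)
        used += line_count
    if current:
        pages.append(current)
    return pages


pages = paginate([(1, 3), (2, 4), (3, 2), (4, 5)], lines_per_page=8)
```

A prompt longer than a full page is placed alone on its own page in this sketch. In practice such a prompt would probably be a candidate for splitting into shorter recordings.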