18.1 Development | Voice User Interface Design 2004

The development phase takes all the thinking that has gone into definition and design and transforms these ideas into concrete deliverables, including code, grammars, and audio recordings.

18.1.1 Application Development

Lexington decides that the application should be developed in VoiceXML. The executives want to avoid proprietary languages to ensure flexibility in the future, while taking advantage of the speech-centric nature of VoiceXML. Ultimately, they want to be able to update the application themselves, so they are training some members of their development team on VoiceXML.

Based on the grammar slot definitions and sample phrases for each dialog state, the developers create simple stub grammars that they can use to unit-test modules before grammar development is complete. They also create stubs for database return values so that testing can proceed before all the backend integrations are finished.
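As a rough illustration of the database stubs described above, the following Python sketch shows one way a stand-in backend could return canned values so that dialog logic can be tested before the real integration exists. The class, field, and function names (StubAccountBackend, call_count, is_first_time_caller) and the account numbers are hypothetical and are not taken from the Lexington project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccountRecord:
    account_number: str
    call_count: int  # hypothetical field: prior calls to the system


class StubAccountBackend:
    """Stand-in for the real customer account database (hypothetical interface)."""

    def __init__(self) -> None:
        # Canned records; the values are invented purely for testing.
        self._records = {
            "123456789012": AccountRecord("123456789012", call_count=0),
            "000000000001": AccountRecord("000000000001", call_count=7),
        }

    def lookup(self, account_number: str) -> Optional[AccountRecord]:
        """Return the canned record, or None for an unknown account."""
        return self._records.get(account_number)


def is_first_time_caller(backend: StubAccountBackend, account_number: str) -> bool:
    """Example of dialog-side logic that can be unit-tested against the stub."""
    record = backend.lookup(account_number)
    return record is not None and record.call_count == 0
```

Because the dialog logic depends only on the lookup method, the stub can later be swapped for the real integration without changing the code under test.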

The backend and CTI integrations are straightforward and are implemented by Lexington. The firm decides to reuse the integrations built for its touchtone system to the greatest extent possible. The customer account database is updated to accommodate personalization features, such as the tracking of calls to the system, so that the necessary information is available for just-in-time instruction. To accommodate future additions for greater personalization, the Lexington developers make the changes in an easily extensible framework.

From time to time during development, members of the development team meet with the VUI designer to make sure that they fully understand the dialog spec. One of the developers suggests eliminating the need for multiple entry prompts to dialog states by rewording those prompts to be more generic. He estimates that such a change would save two days, counting both development and testing effort. This idea is rejected because the change would significantly diminish the naturalness of the dialog flow and would risk reducing clarity for callers.

In general, the application development proceeds smoothly, according to schedule.

18.1.2 Grammar Development

A number of types of grammars must be developed. Rule-based grammars are needed for those dialog states that are part of a directed dialog. SLMs are needed for states that accept more flexible input. Robust natural language grammars are needed for semantic interpretation for those states that use SLMs for a recognition grammar.

As we begin working on rule-based grammars for specific dialog states, we look at the slot definitions and sample phrases defined for that state. We also look at the prompt wording. These give us a starting point for imagining the variety of ways a caller may respond to our prompt.

Before beginning to flesh out the grammar, we create an initial test suite, which is simply a list of phrases and sentences we want to make sure the grammar covers. We then flesh out the grammar definition and add items to the test suite. The following is our first version of the grammar for the GetAccountNumber state:

    .GetAccountNumber
        (?Uh
         [(?Prefiller AccountNumber:number)
                          {<account_number $number>}
          Unknown {<unknown true>}
         ]
        )

    Uh [uh hm um]

    Prefiller [( [my the] ?account number is )
               [ it's (it is) ]
              ]

    AccountNumber
        (Digit:d1 Digit:d2 Digit:d3 Digit:d4 Digit:d5 Digit:d6
         Digit:d7 Digit:d8 Digit:d9 Digit:d10 Digit:d11 Digit:d12)
        {return (strcat($d1 strcat($d2 strcat($d3 strcat($d4
                 strcat($d5 strcat($d6 strcat($d7 strcat($d8
                 strcat($d9 strcat($d10 strcat($d11 $d12))))))))))))}

    Unknown [( ?i [ (do not) don't ] know ?[it (what it is) ] )
             ( ?i dunno )
             ( ?i [ (do not) don't ] remember ?it )
             ( ?i [ (do not) don't ] have it ?(with me) )
             ( [i'm (i am)] not sure )
             ( i have no idea )
            ]

    Digit [  [oh zero]  {return(0)}
             one        {return(1)}
             two        {return(2)}
             three      {return(3)}
             four       {return(4)}
             five       {return(5)}
             six        {return(6)}
             seven      {return(7)}
             eight      {return(8)}
             nine       {return(9)}
          ]

(Note that strcat is a function in GSL that concatenates strings. The AccountNumber subgrammar demonstrates a way to return a continuous string of 12 digits.)
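To make the relationship between the grammar and its test suite concrete, here is a minimal Python sketch that approximates the GetAccountNumber grammar with regular expressions and checks it against a small list of phrases. It is not GSL and is not how a recognizer evaluates a grammar; the function name interpret, the TEST_SUITE entries, and the dictionary return format are assumptions made for illustration.

```python
import re

DIGITS = {"oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9"}
FILLERS = {"uh", "hm", "um"}  # mirrors the Uh subgrammar

# Mirrors the Prefiller subgrammar, including the space before the digits.
PREFILLER = re.compile(r"(?:(?:my|the) (?:account )?number is|it's|it is) ")

NEG = r"(?:do not|don't)"
# Mirrors the six alternatives of the Unknown subgrammar.
UNKNOWN = re.compile(
    r"(?:"
    rf"(?:i )?{NEG} know(?: it| what it is)?"
    r"|(?:i )?dunno"
    rf"|(?:i )?{NEG} remember(?: it)?"
    rf"|(?:i )?{NEG} have it(?: with me)?"
    r"|(?:i'm|i am) not sure"
    r"|i have no idea"
    r")"
)


def interpret(utterance):
    """Return the filled slots as a dict, or None if the phrase is out of grammar."""
    words = utterance.lower().split()
    if words and words[0] in FILLERS:  # optional leading filler (?Uh)
        words = words[1:]
    text = " ".join(words)
    if UNKNOWN.fullmatch(text):
        return {"unknown": True}
    match = PREFILLER.match(text)  # optional prefiller (?Prefiller)
    if match:
        text = text[match.end():]
    tokens = text.split()
    if len(tokens) == 12 and all(t in DIGITS for t in tokens):
        # Equivalent of the nested strcat calls: join 12 digits into one string.
        return {"account_number": "".join(DIGITS[t] for t in tokens)}
    return None


# Hypothetical test suite: each phrase paired with the expected interpretation.
TEST_SUITE = [
    ("one two three four five six seven eight nine oh one two",
     {"account_number": "123456789012"}),
    ("um my account number is " + " ".join(["zero"] * 11 + ["one"]),
     {"account_number": "000000000001"}),
    ("i don't know what it is", {"unknown": True}),
    ("uh i dunno", {"unknown": True}),
    ("one two three", None),  # too few digits: should be rejected
]

failures = [(p, interpret(p)) for p, expected in TEST_SUITE
            if interpret(p) != expected]
```

An empty failures list means every phrase in the suite received the expected interpretation. Rerunning the suite after each grammar change guards against losing coverage.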

In a similar fashion, we create grammars for all the states using directed dialogs.

The next issue is the bootstrapping of the SLM so that we have a good starting point. Because we have done a number of similar projects in the past, with the same functionality available at the main menu, we have data we can use for training an initial SLM. Although we expect improved performance after we train with application-specific data, this initial model should be good enough to get us to pilot, at which point we will collect a lot of data with the working application. We split off a portion of the data to use as a test set.
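Holding out a test set, as described above, can be sketched in a few lines of Python. The 10 percent fraction, the fixed seed, and the function name split_corpus are assumptions chosen for illustration; the text does not say how large a portion was split off.

```python
import random


def split_corpus(utterances, test_fraction=0.1, seed=0):
    """Shuffle transcribed utterances and hold out a fraction as a test set.

    Returns (training_set, test_set). The fraction and seed are assumptions.
    """
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Keeping the held-out utterances out of training matters because the test set is only a fair measure of the model if the model has never seen it.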

Next, we create the robust natural language grammars for the states that use SLMs. In addition to looking at the definitions of slots and sample phrases for these states, we look at the transcriptions used for training the SLM; this gives us an idea of the phrases we should cover in our grammars. The development of these grammars is straightforward, given that we need to cover only the core slot-filling phrases and not the fillers. We develop a test suite for the robust grammars by hand, in the same way we did for the rule-based grammars, and then add the transcriptions from the SLM test set.
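The idea of covering only the core slot-filling phrases and ignoring the fillers can be illustrated with a small Python sketch. This is a simplified phrase-spotting approach, not the robust parsing mechanism of any particular recognizer; the slot name action, its values, and the phrases are hypothetical.

```python
import re

# Hypothetical core phrases, each mapped to a value for a hypothetical
# "action" slot. Everything else in the utterance is treated as filler.
CORE_PHRASES = [
    (re.compile(r"\b(?:buy|purchase)\b"), "buy"),
    (re.compile(r"\bsell\b"), "sell"),
    (re.compile(r"\b(?:quote|price)s?\b"), "quote"),
]


def fill_slots(recognized_text):
    """Spot the first core phrase in the recognized text and fill the slot."""
    text = recognized_text.lower()
    for pattern, action in CORE_PHRASES:
        if pattern.search(text):
            return {"action": action}
    return {}  # no core phrase found: no slots filled
```

For example, fill_slots("um yeah i'd like to buy some stock please") fills the slot with "buy" even though most of the words are not covered by any pattern.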

18.1.3 Audio Production

Audio production includes selecting the voice actor, coaching recording sessions, postprocessing prompt recordings, and creating nonverbal audio. As discussed in Chapter 14, for this project we selected the voice actor early so that we could get feedback on the persona from the iterative usability tests we ran during the detailed design phase. We also began recording company name prompts as soon as the voice actor was chosen. In this way, we had time to record all 15,000 names in a series of recording sessions over the course of a few weeks.

To select the voice actor, we get a CD containing samples from 10 male voice actors from a talent agency we often work with. With the persona definition in mind, we narrow the choice to two possible voices. Lexington listens to the CD and agrees with these choices. We decide to bring in both actors for live auditions.

Both auditions follow the same process. When the voice actor arrives, we give him a copy of the persona definition and discuss the application and persona with him. We then go through a brief recording session. The session begins by recording the prompts for a few of the sample dialogs, including login, trades, and disambiguation of company names. We include a number of complex help prompts and just-in-time instructions. We then test each actor's coachability on prosody for complex concatenated prompts, focusing on confirmation prompts for trades. We have each actor record a few company names, numbers to be used for the number of shares, numbers for stock prices, and the other pieces of trade confirmations (e.g., "Confirming: You want to buy"). After the session, we use the recordings to splice together some confirmation prompts. Finally, we listen to all the material over the telephone.

After the auditions, one candidate stands out as the clear choice. The Lexington team listens to the audition recordings and agrees. We sign a contract and set up a schedule of recording sessions. The first session covers the material needed for the first usability test. We then begin a series of recording sessions, gradually covering the 15,000 company names over the following few weeks. At the beginning of each recording session, we have the voice actor listen to recordings from previous sessions to get back into character and match the voice.