Prompt Creation: Text-to-Speech and Recorded Voices | The Art and Business of Speech Recognition: Creating the Noble Voice

Prompt Creation: Text-to-Speech and Recorded Voices

There are two ways to create audio prompts. One is to record a real person, called a voice talent, saying phrases; the other is to use text-to-speech (TTS) software, which converts text stored in digital form (for example, an e-mail message) into a spoken utterance in real time. TTS is generally used as a cost-effective way to read dynamic information that would otherwise be difficult or impossible to prerecord, such as the daily news or the weather. There are two popular types of TTS engine: formant TTS, which synthesizes the sound, and concatenative TTS, which takes thousands of small pieces of prerecorded human speech and concatenates them (strings them together). The following are the primary differences among the three methods of producing the audio (recorded phrases and the two types of TTS).

Recorded prompts sound great and convey the most precise meaning, since voice talents can vary every aspect of how they speak according to the desired direction. However, each prompt takes up disk space, though rarely enough to be much of an issue.
Formant TTS engines sound the worst; they don't sound like any particular voice talent, since they generate the speech signal from scratch, using a noise generator and a series of filters that shape the noise until it sounds like speech. However, they can sometimes be a good choice because they require very little computing power, disk space, or memory.
Concatenative TTS engines, when built properly, are able to sound nearly like the person from whom the audio files were recorded (allowing seamless blending between the recorded prompts and the TTS-generated ones), though they can't convey the rich meaning that the recorded prompts can. However, these systems require faster computers and much more disk space and memory.

Most often it's a good idea to use recorded prompts, since they sound the most natural and the time needed to record them is generally a fraction of the total development time. I don't advocate using only TTS prompts for an entire application, because doing so could compromise the ability to express the endless variation that the human voice can produce to convey particular thoughts.
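As a rough illustration of this preference, the sketch below plans playback for a message by using a recorded file wherever one exists and falling back to TTS only for dynamic text. The `Segment` class, the `plan_playback` function, the `RECORDED` catalogue, and the file path are all hypothetical names invented for this example; no particular speech platform's API is implied.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    text: str
    dynamic: bool = False  # True for content that cannot be prerecorded (e.g. an e-mail body)


def plan_playback(segments: List[Segment], recorded: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source, payload) pairs: ("file", path) for recorded prompts, ("tts", text) otherwise."""
    plan = []
    for seg in segments:
        path = recorded.get(seg.text)
        if not seg.dynamic and path is not None:
            plan.append(("file", path))      # prefer the voice talent's recording
        else:
            plan.append(("tts", seg.text))   # dynamic or unrecorded text goes to the TTS engine
    return plan


# Hypothetical catalogue mapping fixed phrases to recorded audio files.
RECORDED = {
    "You have new mail. The first message says:": "prompts/new_mail_intro.wav",
}

segments = [
    Segment("You have new mail. The first message says:"),
    Segment("Lunch at noon?", dynamic=True),
]
plan = plan_playback(segments, RECORDED)
```

In this sketch the fixed introduction is served from a recording, while the e-mail text, which cannot be known in advance, is handed to TTS.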

The preferred and more traditional method is to record a real person (the voice talent) saying phrases that are stored digitally on a computer, with each phrase saved as a separate file and played to the caller as appropriate. Even though callers know that they're not listening to a live person, they are much more comfortable interacting with something that sounds more like a fellow human being ^[1] and less like the somewhat emotionally removed HAL 9000 from 2001: A Space Odyssey.

^[1] See Byron Reeves and Clifford Nass, The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places (New York: Cambridge University Press, 1996), pp. 106–107.

Production of effective audio prompts requires three tasks:

Casting: choosing the appropriate voice talent
Directing: guiding the voice talent in how to say the words
Concatenative recording: ensuring that the phrases spoken by the voice talent are captured and can be joined together for smooth playback
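To make the third task concrete, here is a minimal sketch of the playback side: joining separately stored phrase recordings end to end with Python's standard `wave` module. The helper names `make_clip` and `concatenate`, the silent stand-in clips, and the 8000 Hz mono 16-bit format are assumptions chosen for illustration. Code like this only splices the audio; whether the result sounds smooth depends on how the phrases were recorded.

```python
import io
import wave
from typing import List


def make_clip(n_frames: int, sample_rate: int = 8000) -> bytes:
    """Build a silent 16-bit mono WAV clip in memory, standing in for a recorded phrase."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def concatenate(clips: List[bytes]) -> bytes:
    """Join WAV clips end to end; every clip must share the same audio format."""
    if not clips:
        raise ValueError("need at least one clip")
    out = io.BytesIO()
    fmt = None
    with wave.open(out, "wb") as w:
        for clip in clips:
            with wave.open(io.BytesIO(clip), "rb") as r:
                this_fmt = (r.getnchannels(), r.getsampwidth(), r.getframerate())
                if fmt is None:
                    # The first clip fixes the output format.
                    fmt = this_fmt
                    w.setnchannels(fmt[0])
                    w.setsampwidth(fmt[1])
                    w.setframerate(fmt[2])
                elif this_fmt != fmt:
                    raise ValueError("clips must share channels, sample width, and rate")
                w.writeframes(r.readframes(r.getnframes()))
    return out.getvalue()


# Two stand-in phrases (0.1 s and 0.2 s at 8000 Hz) joined into one prompt.
joined = concatenate([make_clip(800), make_clip(1600)])
with wave.open(io.BytesIO(joined), "rb") as r:
    total_frames = r.getnframes()
```

The format check matters in practice: phrases recorded in different sessions or at different sample rates cannot simply be appended to one another.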