Concatenative Prompt Recording | The Art and Business of Speech Recognition: Creating the Noble Voice

Concatenative prompt recording refers to the recording of a voice talent saying particular words that will later be spliced together. Let's use the phrase "I'll transfer 50 dollars from your checking account to your savings account" as an example.

Most touchtone systems deliver this phrase poorly. Here's an example. (Read this aloud to understand what I mean.)

"Transferring. Fifty, dollars. From your, savings. Account. To your, checking. Account."

It sounds stilted and artificial, particularly when we hear the first instance of "account," where the system, using the only recording it has of "account," sounds as if it has finished its thought. This problem arises when designers do the minimal amount of intellectual work, recording one version of each word they'll use without regard to context.

If we examine the word "account" when spoken naturally by a person, we see that it is used in two different contexts: when it refers to the originating account and will be followed by some more information, and at the end of the sentence (when it refers to the destination account).

When we speak, we all vary the intonation of our words or phrases to suit their context. In audio prompt recordings, we refer to three basic types of intonation:

Rising: when the pitch of the word or phrase rises at the end, as in the beginning of a sentence.
Medial: when the pitch of the word or phrase remains steady, as in mid-sentence.
Falling: when the pitch of the word or phrase descends at the end, as in the end of a sentence.

If there were five types of accounts, the bad touchtone example would require the recording of ten prompts (plus all the numbers).

"Transferring" / "< list of numbers >" / "dollars" / "from your" / "checking" / "savings" / "money market" / "brokerage" / "retirement" / "account" / "to your"

The system can be greatly, and easily, improved to sound like this (again, read it aloud)

"Transferring, 50, dollars from your checking account, to your money market account."

by recording only 11 prompts: the right 11 (plus the numbers). The basic idea is to record larger phrases with regard to the context in which they'll be played, and then program the application to concatenate them correctly. We only have to create two recordings for each type of account: one using a medial intonation when it is the originating account, and one with falling intonation when it is the destination account. For five types of accounts, here is the recording list.

"Transferring"

"< list of numbers >" (e.g., "31," "32," "33," "34," and so on)

"dollars, from your checking account" / "dollars, from your savings account" / "dollars, from your money market account" / "dollars, from your brokerage account" / "dollars, from your retirement account"

"to your checking account." / "to your savings account." / "to your money market account." / "to your brokerage account." / "to your retirement account."

Remember the old TV game show Password? The object of the game was for players to get their partners to say a word by giving them clues. One common tactic was to say the beginning of a common phrase using a medial tone in order to get the partner to complete the phrase. Player A might say "Hammer and ..." (with a leading tone to the word "and") to get player B to finish the phrase: "nail."
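The same medial-then-falling pattern drives the transfer prompt. As a rough sketch of how an application might assemble the 11-recording list described earlier, the Python below returns the recordings to play in order. The function names and the idea of identifying each recording by its spoken text are assumptions made for illustration; a real system would map each entry to an audio file.

```python
# Hypothetical illustration: each string stands in for one recorded audio prompt.
ACCOUNT_TYPES = ["checking", "savings", "money market", "brokerage", "retirement"]


def transfer_prompts(amount, source, destination):
    """Return the recordings to play, in order, for a transfer confirmation."""
    for account in (source, destination):
        if account not in ACCOUNT_TYPES:
            raise ValueError(f"unknown account type: {account}")
    return [
        "Transferring",
        str(amount),  # placeholder for the recorded number prompt(s)
        f"dollars, from your {source} account",  # medial: more is coming
        f"to your {destination} account.",       # falling: end of sentence
    ]


def recording_count():
    """One 'Transferring' prompt plus a medial and a falling phrase per account."""
    return 1 + 2 * len(ACCOUNT_TYPES)
```

Calling `transfer_prompts(50, "checking", "money market")` yields the four pieces of the improved example, and `recording_count()` gives 11 for five account types (numbers excluded, as in the text).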

But it's not just whole words that can be concatenated to form phrases. Parts of words can also be concatenated, most often when the application calls for numbers. The minimal number of recordings to express North American formatted telephone numbers is ten (one recording for each digit, 0 to 9). However, the result sounds like this.

"Six. One. Seven. Four. Two. Eight. Four. Four. Four. Four." (with each "4" sounding exactly the same)

To achieve a significantly more natural sound in an application that uses numbers, we would want to record a total of 30 prompts: each of the numbers 0 to 9 in all three intonations (rising, medial, and falling).

This technique would enable the system to concatenate the prompts so that a number like 555 ("five, five, five") sounds like a complete thought, with a rising sound to the first five, a medial sound to the second five, and a falling sound to the final five. And while it will sound less "robotic" than the minimalist approach, most listeners will be able to detect that the numbers have been pasted together (unless the recording and the splicing are done carefully).
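One way to picture the 30-prompt approach is as a small selection rule: first digit rising, last digit falling, everything between medial. The Python sketch below is only an illustration of that rule. The function name, the (digit, intonation) pairs, and the choice to give a lone digit the falling recording are assumptions, not something the text specifies.

```python
INTONATIONS = ("rising", "medial", "falling")


def digit_prompts(digits):
    """Pair each digit in a group with the intonation of the recording to play."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty string of digits 0-9")
    last = len(digits) - 1
    prompts = []
    for position, digit in enumerate(digits):
        if position == last:
            tone = "falling"  # closes the thought (assumed for a lone digit too)
        elif position == 0:
            tone = "rising"   # opens the group
        else:
            tone = "medial"   # steady pitch mid-group
        prompts.append((digit, tone))
    return prompts


# Ten digits, each recorded in three intonations.
TOTAL_RECORDINGS = 10 * len(INTONATIONS)
```

For the group "555" this selects the rising, medial, and falling recordings of "five" in that order, and `TOTAL_RECORDINGS` works out to the 30 prompts mentioned above.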

Alternatively, it is possible to achieve exceptional, natural-sounding quality, but it requires the recording of 1,200 prompts: every complete three-digit prompt from 000 to 999, plus two sets of two-digit prompts (from 00 to 99), one rising and one falling. This means that phone numbers like "(617) 428-4444" will be played from audio files that are recordings of someone saying the complete phrases: "617," "428," "44" (rising), and "44" (falling). Piece them together and it sounds totally natural, but the obvious cost is the amount of time to record the additional 1,170 phrases.
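To make the arithmetic and the splitting concrete, here is a minimal Python sketch of how a ten-digit number might be cut into the four recordings described above. The function name and the labels used for each prompt set ("three-digit", "rising", "falling") are hypothetical; only the 3-3-2-2 split and the prompt counts come from the text.

```python
def phone_prompts(number):
    """Split a North American phone number into the four recordings to play."""
    digits = "".join(ch for ch in number if ch in "0123456789")
    if len(digits) != 10:
        raise ValueError("expected exactly ten digits")
    return [
        (digits[0:3], "three-digit"),  # area code, from the 000-999 set
        (digits[3:6], "three-digit"),  # exchange, from the 000-999 set
        (digits[6:8], "rising"),       # first pair, from the rising 00-99 set
        (digits[8:10], "falling"),     # last pair, from the falling 00-99 set
    ]


THREE_DIGIT_PROMPTS = 1000      # 000 to 999
TWO_DIGIT_PROMPTS = 2 * 100     # 00 to 99, rising and falling
TOTAL_PROMPTS = THREE_DIGIT_PROMPTS + TWO_DIGIT_PROMPTS
ADDITIONAL_PROMPTS = TOTAL_PROMPTS - 30  # beyond the 30-prompt approach
```

With "(617) 428-4444" as input, the result is "617", "428", "44" (rising), and "44" (falling), and the constants reproduce the 1,200 total and the 1,170 additional recordings.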