Chapter Two. Technology Primer: About Speech Recognizers | The Art and Business of Speech Recognition: Creating the Noble Voice

Port de bras is the foundation of the great science of the use of arms in classical ballet. The arms, legs and body are developed separately through special exercises. But only the ability to find the proper position for her arms lends a finesse to the artistic expression of the dancer, and renders full harmony to her dance.

AGRIPPINA VAGANOVA

A speech recognizer listens to people say something, and then attempts to match what they've said against a list of known (or expected) words or phrases. It sounds simple, but consider the size of the challenge. In an over-the-phone speech-recognition system, we're talking about a single computer that has to listen to as many as 48 or 96 people talking simultaneously over low-fidelity telephone microphones. All of these are people with a variety of regional accents, people who often don't enunciate clearly, people expecting a seemingly immediate response. The speech recognizer must process all these data quickly enough to respond accurately and make the interactions feel like natural conversations.

It's clearly a daunting task, which is why all over-the-phone speech-recognition systems use a kind of electronic shorthand. Instead of listening and trying to understand every word a person might say, they listen for a small set of key words: as few as two or as many as several hundred thousand, depending on the system and its constraints.

Take, for example, a speech-recognition system that asks callers to say their U.S. ZIP codes. It can expect to hear a series of five digits, each drawn from the ten possible digits (zero to nine), with "oh" accepted as an alternative to "zero." Because the vocabulary of the responses is constrained to a fixed-length string of just five digits (no letters), algorithms enable the computer to quickly figure out the ZIP code spoken by each caller. On the other hand, if the system had to listen to bank account numbers, which often vary in length (certain companies have account numbers that can be anywhere between 8 and 16 digits) and may include letters and dashes in addition to the numbers, it would have a harder time breaking down the sounds due to the lack of constraints on the length of the string and the wider scope of valid utterances.
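To make the effect of the constraint concrete, here is a minimal Python sketch. It is an illustration only, not how any particular recognizer works: the word table, the function names, and the 37-symbol account alphabet (10 digits, 26 letters, and a dash) are all assumptions chosen for the example. It maps recognized digit words to a five-digit ZIP code and counts how many distinct strings each kind of prompt allows.

```python
# Hypothetical word-to-digit table; "oh" is accepted as a spoken form of zero.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def parse_zip(words):
    """Return the five-digit ZIP code for a list of recognized words, or None."""
    if len(words) != 5:           # fixed-length constraint
        return None
    digits = [DIGIT_WORDS.get(word.lower()) for word in words]
    if None in digits:            # a word outside the digit vocabulary
        return None
    return "".join(digits)


def count_strings(alphabet_size, min_len, max_len):
    """Count the distinct strings with lengths from min_len to max_len."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


# Five digits, fixed length: 10**5 possible ZIP strings.
zip_space = count_strings(10, 5, 5)

# Assumed account format: 8 to 16 symbols drawn from digits, letters, and dash.
account_space = count_strings(37, 8, 16)

print(parse_zip(["oh", "two", "one", "three", "nine"]))
print(zip_space, account_space)
```

The variable-length account format allows vastly more valid strings than the ZIP code prompt, which is one way to picture why the recognizer has a harder time with it.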

The more the system designer can constrain the vocabulary for each response, the greater the speed and accuracy of the speech recognition. That's why most over-the-phone speech systems collect street addresses by first asking for the ZIP code: the system dynamically loads only the list of valid streets for that area, so that when it asks for the street name (or street address) it can match the caller's utterance against the limited list of street names in that ZIP code. If the system were to ask a caller for the street address first (in the way that people do with each other), and a caller said "1221 Gray Street," the system would have to match this against a significantly larger list of possible street names, compromising the speed and accuracy of the recognition.
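The ZIP-first strategy can be sketched in a few lines of Python. Everything here is hypothetical: the street lists are invented sample data, `match_street` is a name chosen for this example, and the text similarity from the standard library's `difflib` is only a stand-in for the acoustic matching a real recognizer performs.

```python
import difflib

# Hypothetical sample data: valid street names keyed by ZIP code.
STREETS_BY_ZIP = {
    "02139": ["Gray Street", "Main Street", "Pearl Street"],
    "94103": ["Grace Street", "Howard Street", "Mission Street"],
}


def candidate_streets(zip_code=None):
    """Return the street list for one ZIP code, or every street if none is given."""
    if zip_code is not None:
        return STREETS_BY_ZIP.get(zip_code, [])
    return [s for streets in STREETS_BY_ZIP.values() for s in streets]


def match_street(utterance, zip_code=None):
    """Return the closest street name among the candidates, or None."""
    matches = difflib.get_close_matches(
        utterance, candidate_streets(zip_code), n=1, cutoff=0.6
    )
    return matches[0] if matches else None


# With the ZIP code known, only three candidates are considered.
print(match_street("Grey Street", zip_code="02139"))
```

With the ZIP code supplied, the similar-sounding "Grace Street" never enters the comparison; without it, every street in the table is a candidate, and a real system would be comparing against a far larger list.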

Designers have to be prepared to make trade-offs between sounding natural ("What's your street address?") and ensuring accuracy ("In which ZIP code is that address located?"). And because the capabilities of speech recognizers vary greatly, designers also have to consider the strengths or limitations of their speech recognizer as they design an application.

There are several different recognizers available today, each of which has its own strengths and weaknesses. Designers need to know how to take advantage of the strengths while working around the weaknesses. While all recognizers listen to spoken utterances and attempt to understand them, many recognizers do other things as well. For example, some recognizers perform speaker verification, a security technology that matches a caller's voiceprint to an utterance recorded earlier. While this is a great way to reduce unauthorized use, it is not yet 100% accurate. Therefore, designers must supplement it with other methods of authentication.

Some (but not all) recognizers report statistics that can later be used to analyze the performance of the application. If a system asks, "Where are you flying from?" the recognizer can report many things, including the percentage of times it needed to ask a question to confirm that it correctly understood what the caller said, as well as the number of callers who hung up the phone after listening to that question. The more reports the recognizer provides, the easier it generally is to diagnose problems, as in a case where statistics provided by the recognizer indicate that 10% of calls are being transferred to an operator.

Good recognizers provide a method to analyze the data so that the reason(s) for the transfers can be determined. For example, are the transfers all occurring in one state where the caller is asked a particular question? Are they spread over several states, uniformly or randomly? Do they occur after the application attempts to query a database? Or are the transfers due to callers' requests? And while these statistics don't necessarily indicate the reason for the transfer, they do help to point the developer in the right direction.
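As a rough sketch of this kind of analysis, the Python below groups transferred calls by the state in which they occurred and by their recorded cause. The record layout, the field names (`state`, `outcome`, `cause`), and the sample log are all invented for illustration, and "state" is assumed here to mean a point in the call flow; actual recognizers report their statistics in their own formats.

```python
from collections import Counter

# Hypothetical call log: one record per call, noting where the call ended.
CALL_LOG = (
    [{"state": "ask_origin", "outcome": "completed", "cause": None}] * 7
    + [{"state": "ask_origin", "outcome": "transfer",
        "cause": "recognition_failure"}] * 2
    + [{"state": "db_lookup", "outcome": "transfer",
        "cause": "caller_request"}]
)


def transfer_report(log):
    """Return the transfer rate plus transfer counts by state and by cause."""
    transfers = [rec for rec in log if rec["outcome"] == "transfer"]
    rate = len(transfers) / len(log) if log else 0.0
    by_state = Counter(rec["state"] for rec in transfers)
    by_cause = Counter(rec["cause"] for rec in transfers)
    return rate, by_state, by_cause


rate, by_state, by_cause = transfer_report(CALL_LOG)
print(f"transfer rate: {rate:.0%}")
print("by state:", by_state.most_common())
print("by cause:", by_cause.most_common())
```

A breakdown that concentrates in one state suggests a problem with that prompt or its vocabulary, while an even spread points elsewhere; as the text notes, the counts narrow the search rather than name the cause.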

Also, it is worth noting that some recognizers are speaker dependent and require users to train the recognizer for their particular voice (generally used on personal computers for dictation applications), while other recognizers are speaker independent and can be used by almost anyone, without any training on the part of the caller. The speaker-independent recognizer is by far the most common recognizer for large-scale, telephony-based systems.