Speech-Recognition Applications: A Typical Example
Before we get into the particulars of the design process, let's examine what a typical speech-recognition application looks like. Here's an example of a person calling the AirTran flight information toll-free number. The following exchange ensues.
Let's examine some elements of this conversation.
When the system answers the telephone call, it plays an
, a recorded, spoken prompt. The caller then responds. At this and each succeeding
However, when it asks for the flight number, it's expecting to hear one of many, many different responses. For example, the caller might say, "Three twenty-one," "Three, two, one," "Flight three twenty-one," "Flight three, two, one," or any other number within a particular range of a few thousand possible flight
Where We've Been, Where We're Going
The power of speech-recognition technology enables designers to help people in new ways. And just as the underlying principles of design don't change, neither do the underlying mechanics of how a speech recognizer functions.
Throughout this book we will make references to the recognizer, the technology that enables a system to recognize spoken words. Although system designers needn't become experts on technology to do their jobs well, it's important that they know enough about the workings of the recognizer to understand what it can and cannot do.
Chapter Two. Technology Primer: About Speech Recognizers
A speech recognizer listens to people say something, and then attempts to match what they've said against a list of known (or expected) words or phrases. It sounds simple, but consider the size of the challenge. In an over-the-phone speech-recognition system, we're talking about a single computer that has to listen to as many as 48 or 96 people talking
It's clearly a daunting task, which is why all over-the-phone speech-recognition systems use a kind of electronic shorthand. Instead of listening and trying to understand every word a person might say, they listen for a small set of key words: as few as two and as many as several hundred thousand, depending on the system and its constraints.
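To make the "electronic shorthand" idea concrete, here is a minimal sketch in Python. It is an illustration only: a real recognizer works on audio, not text, so the text string below merely stands in for what the caller said. The vocabulary table and the function name `spot_keyword` are hypothetical, invented for this example.

```python
import re

# Hypothetical active vocabulary for a yes/no question.
# Each key word the system listens for maps to the meaning it carries.
YES_NO_VOCABULARY = {
    "yes": "yes",
    "yeah": "yes",
    "yep": "yes",
    "no": "no",
    "nope": "no",
}


def spot_keyword(utterance_text, vocabulary):
    """Return the meaning of the first key word found, or None.

    Assumption: utterance_text is a text stand-in for the caller's
    speech. Every word outside the vocabulary is simply ignored.
    """
    words = re.findall(r"[a-z']+", utterance_text.lower())
    for word in words:
        if word in vocabulary:
            return vocabulary[word]
    return None


if __name__ == "__main__":
    print(spot_keyword("Um, yeah, I think so", YES_NO_VOCABULARY))
    print(spot_keyword("What did you say?", YES_NO_VOCABULARY))
```

The point of the sketch is that the system never tries to understand "Um" or "I think so"; it only checks whether one of a handful of expected words was present.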
Take, for example, when a speech-recognition system asks
The more the system designer can constrain the vocabulary for each response, the greater the speed and accuracy of the speech recognition. That's why most over-the-phone speech systems collect street addresses by first asking for the ZIP codes: the system dynamically loads only the list of valid
Designers have to be prepared to make trade-offs between sounding natural ("What's your street address?") and ensuring accuracy ("In which ZIP code is that address located?"). And because the capabilities of speech recognizers vary greatly, designers also have to consider the strengths or limitations of their speech recognizer as they design an application.
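The effect of constraining the vocabulary can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular recognizer works: the street lists, the `STREETS_BY_ZIP` table, and the `match_street` function are all hypothetical, and approximate text matching from the standard library's `difflib` stands in for acoustic matching.

```python
from difflib import get_close_matches

# Hypothetical data: the valid street names for each ZIP code.
STREETS_BY_ZIP = {
    "02139": ["Massachusetts Avenue", "Main Street", "Magazine Street"],
    "94103": ["Market Street", "Mission Street", "Howard Street"],
}


def match_street(heard, zip_code, cutoff=0.6):
    """Match what was heard against only the streets in one ZIP code.

    Assumption: `heard` is a rough text rendering of the caller's
    speech. Loading just one ZIP code's list keeps the candidate set
    small, which is the constraint described in the text above.
    """
    candidates = STREETS_BY_ZIP.get(zip_code, [])
    # Compare in lowercase, but return the properly cased name.
    by_lowercase = {name.lower(): name for name in candidates}
    hits = get_close_matches(heard.lower(), list(by_lowercase), n=1, cutoff=cutoff)
    return by_lowercase[hits[0]] if hits else None


if __name__ == "__main__":
    print(match_street("mishun street", "94103"))
```

With only three candidates loaded, even a poorly heard "mishun street" has one clearly best match. Against every street name in the country, the same input would have many near neighbors, which is why asking for the ZIP code first trades some naturalness for accuracy.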
There are several different recognizers available today, each of which has its own strengths and weaknesses. Designers need to know how to take advantage of the strengths while working around the weaknesses. While all recognizers listen to spoken utterances and attempt to understand them, many recognizers do other things as well. For example, some recognizers perform
, a security technology that matches a caller's voiceprint to an utterance recorded earlier. While this is a great way to reduce unauthorized use, it is not yet 100% accurate. Therefore, designers must supplement it with other
Some (but not all) recognizers report statistics that can later be used to analyze the performance of the application. If a system asks, "Where are you flying from?" the recognizer can report many things, some of which include the percentage of times the recognizer needed to ask a question to confirm that it correctly
Good recognizers provide a method to analyze the data so that the reason(s) for the transfers can be determined. For example, are the transfers all occurring in one state where the caller is asked a particular question? Spread over several states uniformly or
Also, it is worth noting that some recognizers are speaker-dependent and require users to train the recognizer for their particular voice (generally used on personal computers for dictation applications), while other recognizers are speaker-independent and can be used by almost