13.2 Dialog Design Guidelines for Maximizing Accuracy

Many activities during design and development contribute to accuracy. Some we have covered earlier (e.g., the use of acoustic adaptation, mentioned in Chapter 2). Others are covered in Chapter 15 (tuning recognition parameters) and Chapter 16 (tuning grammar coverage). In this chapter, we cover some of the design choices you can make during detailed design that can help optimize the accuracy of difficult recognition tasks.

A number of common recognition tasks are extremely challenging. One example is the alphabet. The names of 25 of the 26 letters in the English alphabet have one syllable and thus provide limited acoustic information. Furthermore, 9 of the letter names rhyme with E; these are often referred to as the E-set. Another 4 have a vowel rhyming with A. Another two, F and S, are hard to distinguish over the phone: telephones don't transmit frequencies higher than about 3,500 cycles per second, and most of the information that distinguishes F from S lies above that range. Recognition of alphabetics comes up in applications requiring spelling (e.g., of person names or street names) and often as part of account IDs.

Another common and challenging recognition task is digit strings. Most of the names of the digits have one syllable. Even if recognition of individual digits is extremely high, when you string many of them together, the recognition rate on the entire string (getting every digit right) may be low. Many common recognition tasks require digit-string recognition: for example, account numbers, PINs, telephone numbers, credit card numbers, and social security numbers.
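To see how quickly per-digit errors compound, the following sketch computes whole-string accuracy under a simplifying assumption that is not part of the text: errors on each digit are independent and equally likely. The function name and the 99% per-digit figure are illustrative only, not measured values.

```python
def string_accuracy(per_digit_accuracy: float, length: int) -> float:
    """Probability that every digit in a string is recognized correctly,
    assuming (hypothetically) independent, equally likely errors per digit."""
    return per_digit_accuracy ** length


# Illustrative per-digit accuracy of 99%; not a measured figure.
for name, length in [("PIN", 4), ("phone number", 10), ("credit card", 16)]:
    accuracy = string_accuracy(0.99, length)
    print(f"{name:>12} ({length:2d} digits): {accuracy:.1%}")
```

Under these assumptions, a recognizer that gets 99 digits in 100 right would get a 16-digit string entirely right only about 85% of the time.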

When handling tough recognition problems such as strings of alphabetics, digits, or a combination of both (alphanumerics), you should look for all possible ways to constrain which strings are valid, as well as all possible sources of knowledge that can be brought to bear to limit the possibilities the recognizer must consider.

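In general, you can apply these constraints either by building them into the grammar or by postprocessing an N-best list (i.e., choose the first item on the N-best list that fulfills the constraint). It is best to build these constraints into the grammar, if at all possible. This approach will result in higher accuracy, because the recognizer never has to consider invalid strings during its search, whereas postprocessing can only choose among the hypotheses that reached the N-best list.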
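The postprocessing option can be sketched in a few lines. This is a minimal illustration, assuming the recognizer returns its N-best list as plain strings ordered best-first; the function name `first_valid` and the sample hypotheses are hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_valid(nbest: Iterable[str],
                is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the highest-ranked hypothesis that satisfies the constraint,
    or None if no hypothesis on the list does."""
    for hypothesis in nbest:  # assumed to be ordered best-first
        if is_valid(hypothesis):
            return hypothesis
    return None


# Hypothetical N-best list for a six-digit account number, best-first.
nbest = ["12345", "123456", "123956"]
result = first_valid(nbest, lambda s: len(s) == 6 and s.isdigit())
print(result)  # 123456
```

The top hypothesis is skipped because it has only five digits. A `None` result means nothing on the list satisfied the constraint, which the application would need to handle separately.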

Here are examples of encoding structure in the grammar:

Fixed-length digit strings: If the string will always be the same length, write a grammar that accepts only strings of that length. It is far easier to accurately recognize fixed-length strings than variable-length strings.
Combinations of fixed-length strings: In some cases, you may be working with more than one valid string length; for example, phone numbers may be local (seven-digit) or long distance (ten-digit). In that case, write a grammar that accepts exactly those lengths.
Constraints on particular character positions: Account IDs may have constraints on various positions; for example, the first two positions may contain one of the 50 U.S. state codes.
Constraints on relationships between character positions or groups: An example of this, for phone numbers, would be a grammar that is constrained to capture only the valid exchanges for each area code.
Spelling of valid words: If the recognition task is spelling and if the vocabulary being spelled is from a known set (e.g., person names, English words), you can train a statistical language model (SLM) so that it learns the probabilities of letter sequences. This is likely to be a useful source of constraint, because in English certain letter combinations are far more common than others.
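The structural constraints in this list can be illustrated with regular expressions, which describe the same sets of strings a finite-state grammar would accept. This is a stand-in only: a real application would express these constraints in its recognizer's grammar format, and every pattern, state-code subset, and exchange table below is made up for illustration.

```python
import re

# Fixed length: exactly four digits (e.g., a hypothetical PIN).
PIN = re.compile(r"\d{4}")

# Combination of fixed lengths: seven-digit local or ten-digit long distance.
PHONE = re.compile(r"\d{7}|\d{10}")

# Positional constraint: a state code followed by six digits.
# Only a few codes are listed here; a real grammar would list all 50.
STATE_CODES = ["CA", "NY", "TX", "WA"]
ACCOUNT_ID = re.compile(r"(?:%s)\d{6}" % "|".join(STATE_CODES))

# Relationship between groups: only the exchanges valid for each area code.
# This table is invented for the example.
EXCHANGES = {"415": ["555", "864"], "650": ["253", "555"]}
LONG_DISTANCE = re.compile("|".join(
    r"%s(?:%s)\d{4}" % (area, "|".join(exchanges))
    for area, exchanges in EXCHANGES.items()))


def accepts(pattern: re.Pattern, text: str) -> bool:
    """True if the whole string belongs to the language the pattern describes."""
    return pattern.fullmatch(text) is not None


print(accepts(PIN, "1234"), accepts(PIN, "12345"))
print(accepts(PHONE, "5551234"), accepts(PHONE, "55512345"))
print(accepts(ACCOUNT_ID, "CA123456"), accepts(ACCOUNT_ID, "ZZ123456"))
print(accepts(LONG_DISTANCE, "4155551234"), accepts(LONG_DISTANCE, "4152531234"))
```

Each pair prints one accepted and one rejected string. In a recognition grammar, the rejected strings would never be hypothesized at all.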

Here are examples of encoding structure by postprocessing an N-best list:

Checksums: Many account numbers are designed with some form of checksum digit. The checksum digit is computed as a function of the other digits in the account number. The checksum computation is usually designed so that there is a roughly uniform distribution of checksum values (over the ten digits) given the expected distribution of account numbers. As a result, nine times out of ten, if there are any recognition errors in the account number, the checksum will not be valid. You can perform the checksum test on each item in the N-best list, choosing the first one that matches.
Database of valid items: Often, only a small subset of the possible account numbers is actually in use. If you have access to the database of existing account numbers, you can go down the N-best list, testing each item for validity.
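Both checks can be combined in one pass over the N-best list. The sketch below uses the Luhn algorithm (the checksum used for payment card numbers) purely as an example; the text does not say which checksum a given account-number scheme uses. The function names, the sample hypotheses, and the in-memory set standing in for the account database are all assumptions.

```python
from typing import Iterable, Optional, Set


def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, subtract 9 from
    any result above 9, and require the total to be divisible by 10."""
    if not number.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def pick_from_nbest(nbest: Iterable[str],
                    known_accounts: Optional[Set[str]] = None) -> Optional[str]:
    """Return the first hypothesis (best-first order assumed) that passes the
    checksum and, if a database is supplied, is an existing account."""
    for hypothesis in nbest:
        if not luhn_valid(hypothesis):
            continue
        if known_accounts is not None and hypothesis not in known_accounts:
            continue
        return hypothesis
    return None


# Hypothetical N-best list: the top item has one misrecognized digit.
print(pick_from_nbest(["79927398718", "79927398713"]))  # 79927398713

# A two-digit error can still pass the checksum; the database lookup catches it.
nbest = ["79927398614", "79927398713"]
print(pick_from_nbest(nbest))                                  # 79927398614
print(pick_from_nbest(nbest, known_accounts={"79927398713"}))  # 79927398713
```

The last two calls show why the database test is the stronger filter when it is available: a checksum rejects most erroneous strings, but not all of them.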

Another difficult recognition problem is recognition from an extremely long list, such as street names. One approach for maximizing accuracy is to constrain the grammar dynamically (as the application is running) based on information previously supplied by the caller. For example, if the zip code is known, you can dynamically load a grammar with only the streets for that zip code. Dynamic grammars are discussed in Chapter 16. In general, recognition from long lists can be improved by incorporating probabilities into the grammar based on in-service data from the application.
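As a rough sketch of both ideas, the function below selects only the streets for a known zip code and attaches probabilities estimated from usage counts. The street directory, the counts, the function name, and the add-one smoothing (used so that streets not yet seen in the data keep a nonzero probability) are all assumptions for illustration. How the result is loaded into a recognizer is platform-specific and not shown.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical street directory keyed by zip code.
STREETS_BY_ZIP: Dict[str, List[str]] = {
    "94025": ["Middlefield Road", "Santa Cruz Avenue", "Willow Road"],
    "94301": ["University Avenue", "Middlefield Road", "Emerson Street"],
}


def build_street_grammar(zip_code: str,
                         usage_counts: Counter,
                         smoothing: float = 1.0) -> List[Tuple[str, float]]:
    """Return (street, probability) pairs for one zip code only.

    Probabilities are smoothed relative frequencies of how often callers
    asked for each street; an unknown zip code yields an empty list."""
    streets = STREETS_BY_ZIP.get(zip_code, [])
    total = sum(usage_counts[s] + smoothing for s in streets)
    return [(s, (usage_counts[s] + smoothing) / total) for s in streets]


# Made-up in-service counts of requested streets.
counts = Counter({"Middlefield Road": 7, "Willow Road": 1})
grammar = build_street_grammar("94025", counts)
for street, probability in grammar:
    print(f"{street}: {probability:.3f}")
```

With these counts, the three streets receive weights of 8/11, 1/11, and 2/11, so the most frequently requested street is favored without ruling out the others.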