16.3 Grammar Tuning


Chapter 15 discusses the importance of pilot data in assessing and tuning the performance of the system. The pilot is the first opportunity to collect real, in-service data. Such data are just as important for tuning grammar coverage as for tuning recognition accuracy and dialog performance. People will choose different words to express themselves when they are engaged in real, task-oriented behavior.

16.3.1 Tuning Rule-Based Grammars

The same data that are collected and transcribed for tuning recognition accuracy and dialog performance can be used to tune grammar coverage. The utility described earlier, for running the coverage test of a rule-based grammar, can be used on the transcriptions of the pilot (or deployment) data set. The result will be a list of those utterances that are handled by the grammar (in-grammar) and those that are not handled (out-of-grammar).
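As a rough sketch of such a utility, the coverage test amounts to running each transcription through the grammar's parse test and splitting the results into two lists. The `parse_fn` parameter below is a stand-in for whatever accept/reject test your grammar tool provides; Python is used here purely for illustration:

```python
from collections.abc import Callable, Iterable

def coverage_report(
    parse_fn: Callable[[str], bool],   # your grammar tool's parse test
    transcriptions: Iterable[str],
) -> tuple[list[str], list[str], float]:
    """Split transcriptions into in-grammar and out-of-grammar lists."""
    in_grammar: list[str] = []
    out_of_grammar: list[str] = []
    for utt in transcriptions:
        (in_grammar if parse_fn(utt) else out_of_grammar).append(utt)
    total = len(in_grammar) + len(out_of_grammar)
    return in_grammar, out_of_grammar, (len(in_grammar) / total if total else 0.0)
```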

You can evaluate the out-of-grammar utterances to find candidates for addition to the grammar, as well as indications of general refinements that are needed. Each out-of-grammar utterance should be evaluated. In general, those phrases and sentences that happen often and are missing from the grammar are the most important candidates for addition. Even a single observation, though, may be enough to warrant inclusion if it makes obvious sense for the grammar.
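One simple way to prioritize candidates is to count how often each out-of-grammar utterance recurs in the pilot data. A minimal sketch, assuming the out-of-grammar list produced by the coverage test above:

```python
from collections import Counter

def rank_candidates(out_of_grammar: list[str]) -> list[tuple[str, int]]:
    """Rank out-of-grammar utterances by frequency, most common first."""
    counts = Counter(utt.strip().lower() for utt in out_of_grammar)
    return counts.most_common()
```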

Out-of-grammar examples should be considered as candidates for addition to the grammar if they are in-domain; that is, the caller is responding appropriately to the system prompt. They should not contain a substantial amount of extraneous material that is unrelated to the application. For example, in response to a prompt asking for a desired travel date, "My aunt Gertrude is having her sixtieth birthday party next Tuesday, so I would like to fly on Monday" would not be reasonable to try to cover in the grammar, although it is in-domain (it answers the question asked in the prompt). On the other hand, it would be reasonable to add the phrase "I would like to fly on Monday," especially if it happens numerous times.

Each example that is added to the grammar should be considered for generalization. For example, if you add "I want to go in the afternoon," you should also consider "morning" and "evening."
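One lightweight way to capture such a generalization, sketched here as simple phrase expansion rather than in any particular grammar formalism:

```python
# Instead of covering only the observed "I want to go in the afternoon,"
# generalize the time-of-day slot to its natural alternatives.
TIME_OF_DAY = ["morning", "afternoon", "evening"]

def time_phrases() -> list[str]:
    return [f"I want to go in the {t}" for t in TIME_OF_DAY]
```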

If out-of-domain examples happen repeatedly, it may indicate that the prompt is unclear, that callers are reaching that dialog state by mistake, or that the notion of what is in-domain needs to be reconsidered. In general, there is a gray area between clearly in-domain and clearly out-of-domain utterances. If utterances in that gray area happen repeatedly, you should reconsider what counts as in-domain and cover them in the grammar.

After the grammar is rewritten to expand its coverage, you should rerun the original set of tests, using the original test suite, to make sure that bugs have not been introduced. This is referred to as regression testing. Once the tests pass, the test suite can be fleshed out to cover the new paths that have been added to the grammar. The new test suite should then be run and, after it passes, the results saved for future regression testing.
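A regression harness can be as simple as replaying the saved test suite and flagging any utterance whose parse result no longer matches the stored baseline. A sketch, assuming each test case pairs an utterance with its expected result:

```python
from collections.abc import Callable

def regression_test(
    parse_fn: Callable[[str], object],   # current grammar's parse function
    baseline: dict[str, object],         # utterance -> expected parse result
) -> list[str]:
    """Return the utterances whose current result differs from the baseline."""
    return [utt for utt, expected in baseline.items() if parse_fn(utt) != expected]
```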

16.3.2 Tuning Statistical Language Models

With each phase of pilot testing and early deployment, the new data sets of transcribed utterances should be added to the training set used for SLM training. In general, the larger the training set, the better the probability estimates. Furthermore, data from the working system may be more representative of speech from the real caller population than the data used for bootstrapping the SLM.
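As a toy illustration of why the extra data helps: an n-gram SLM's probabilities are estimated from counts, so pooling the pilot transcriptions with the bootstrap corpus directly strengthens those estimates. The sketch below shows plain bigram counting (real toolkits add smoothing on top of this):

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def bigram_counts(corpus: list[str]) -> Counter:
    """Count bigrams, with sentence-boundary markers, across a corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        tokens = ["<s>", *sentence.split(), "</s>"]
        counts.update(pairwise(tokens))
    return counts

# Retrain on the enlarged set by pooling old and new transcriptions.
# Keep duplicates: the probability estimates depend on raw counts.
# counts = bigram_counts(bootstrap_corpus + pilot_corpus)
```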

Hold out some of the new data to add to the test set. If the original bootstrap data were less realistic than the new data set (e.g., based on laboratory-style data collection rather than real in-service data), you may want to replace the old test set. After a new SLM is trained, you can measure recognition accuracy with both the old and the new models to make sure performance is improving.
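The recognition runs themselves come from your recognizer; the scoring needs no special tooling. A self-contained word error rate (WER) scorer for the old-versus-new comparison might look like this:

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER: total word errors over total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / max(words, 1)

# Run the held-out test set through both recognizers (old SLM and new SLM),
# then score each hypothesis list against the same reference transcriptions.
# A lower WER for the new model confirms that the retraining helped.
```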

16.3.3 Tuning Robust Natural Language Grammars

Tuning a robust natural language grammar is similar to tuning a standard rule-based grammar, except that it is simpler: you need only add new core items (e.g., slot-filling phrases). There is no need to worry about extraneous words and phrases. Otherwise, the same approaches and tests apply.

16.3.4 Tuning Statistical Natural Language Grammars

Tuning a statistical natural language grammar is similar to tuning an SLM: You add the new data to the training set. In this case, the data will need to be both transcribed and labeled with the appropriate slot value.
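Concretely, each training example pairs the transcription with a human-assigned slot label. The record format below is illustrative only; the field and class names are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class LabeledUtterance:
    transcription: str   # what the caller actually said
    slot_value: str      # class label assigned by a human labeler

new_training_data = [
    LabeledUtterance("i want to check my balance", "account_balance"),
    LabeledUtterance("how much do i owe you", "account_balance"),
    LabeledUtterance("let me talk to a person", "agent_request"),
]
```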

As with the SLM tuning, hold out some of the new data to add to the test set. Be sure to test performance of both the old and the new models to make sure performance is improving.

It can be informative to look at false accepts, false rejects, and even correct rejects. Repeated rejects can indicate that there is a missing class (a route or service that people are asking for) that should be added to the system. Rejects and false accepts may also indicate that a better classification scheme exists. Perhaps the way the business has divided its services doesn't match callers' needs or mental models. A card-sort approach (see Chapter 8) can be used to test whether there is a better way to organize the set of available services.
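A first pass at spotting a missing class is simply to count recurring rejected utterances; requests that cluster around the same wording are candidates for a new class. A minimal sketch, with the threshold chosen arbitrarily for illustration:

```python
from collections import Counter

def frequent_rejects(rejected: list[str], min_count: int = 3) -> list[tuple[str, int]]:
    """Surface rejected utterances that recur, as candidates for a new class."""
    counts = Counter(utt.strip().lower() for utt in rejected)
    return [(utt, n) for utt, n in counts.most_common() if n >= min_count]
```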


