16.2 Grammar Testing


Grammars can be quite complex. Just as complex software can have bugs and must be tested, grammars should be tested before they are released for a pilot. The purpose of testing is both to find and fix bugs and to ensure reasonable out-of-the-box performance.

16.2.1 Testing Rule-Based Grammars

There are six standard tests to run on rule-based grammars:

  1. Coverage

  2. Overcoverage

  3. Natural language

  4. Ambiguity

  5. Spelling

  6. Pronunciation

Speech technology vendors provide a variety of tools to accommodate these tests. For the specific details, see the vendor documentation. Here, we describe each test in a general way and explain how to use it.

Coverage Testing

A coverage test uses a test suite consisting of phrases and sentences the grammar is designed to cover and makes sure that all of them are covered. Typically, you build up the test suite while developing the grammar. Start with the sample grammar expressions supplied for each dialog state during detailed design. Before writing any grammar rules, supplement the test suite with items you know you intend to cover in the grammar. As you write the grammar, add more items to the test suite to make sure that the various grammar paths are adequately exercised.

When the grammar is complete, run the test suite and make sure all items are parsed by the grammar. Whenever the grammar is changed, you can rerun the test suite to make sure that bugs have not been introduced. When enhancing the grammar, such as during the tuning phase, you can add items to the test suite to cover the grammar paths you are adding.
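To make the mechanics concrete, here is a minimal sketch of a coverage-test harness in Python. The toy grammar, the grammar_parses check, and the test-suite items are all illustrative stand-ins; in practice, you would call the parse utility supplied by your speech vendor.

# Minimal coverage-test harness (a sketch, not a vendor tool).
# grammar_parses is a stand-in for the vendor's parse utility; here
# it is faked with a tiny hand-written flight grammar.
import itertools

PREFILLERS = ["", "please get me a flight to", "i want to fly to"]
CITIES = ["boston", "san francisco", "denver"]
POSTFILLERS = ["", "please"]

# Enumerate every sentence the toy grammar covers.
COVERED = {
    " ".join(part for part in (pre, city, post) if part)
    for pre, city, post in itertools.product(PREFILLERS, CITIES, POSTFILLERS)
}

def grammar_parses(sentence):
    """Stand-in for the vendor's parse check."""
    return sentence.lower().strip() in COVERED

def run_coverage_test(test_suite):
    """Return the test-suite items the grammar fails to parse."""
    return [s for s in test_suite if not grammar_parses(s)]

if __name__ == "__main__":
    suite = [
        "please get me a flight to boston",
        "i want to fly to denver please",
        "book me on a flight to boston",  # not yet covered: should be reported
    ]
    for failure in run_coverage_test(suite):
        print("NOT COVERED:", failure)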

The primary purpose of the coverage test is to find bugs as well as to ensure a certain minimum level of coverage going into the pilot. The ultimate refinement of the grammar to maximize coverage will happen in the tuning phase, based on actual spoken input from real callers.

Overcoverage Testing

Overcoverage refers to sentences or phrases the grammar covers unintentionally: nonsense phrases that the grammar parses either because of a bug or as a side effect of combining various partial intended paths through the grammar. Overcoverage can therefore signal a bug in a written grammar, but even when it is not due to one, excessive overcoverage makes recognition harder: the extra grammar paths can lead to ambiguities and confusion during the recognition search.

In some cases, a certain amount of overcoverage is either unavoidable or a sensible trade-off against increased grammar-writing complexity. For example, a sentence such as "Please get me a flight to San Francisco, please" may be very unlikely, but it may end up as a legal sentence in a grammar because the prefiller and postfiller subgrammars are separate; neither can restrict its paths based on the path taken in the other. Although it is possible to write the grammar in a way that avoids this problem, doing so might add significant complexity and might not be worthwhile.

The standard way to test for overcoverage is to generate strings from the grammar using a utility that enumerates valid word strings the grammar parses. For relatively small, finite grammars, you can generate all possible strings, but most grammars are too complex for that. In fact, many grammars represent an infinite number of possible strings, for example, grammars using the + or * operator, or recursive grammars in which a grammar refers to itself as a subgrammar or to other subgrammars that ultimately refer back to it. For large or infinite grammars, you can randomly generate a set of strings from the grammar (around 100 is a reasonable number), as in the sketch below. In either case, after the strings are generated, read through them; when you find unintended strings (overcoverage), either fix the grammar or make an explicit decision to accept them.
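Here is a minimal sketch of random string generation from a rule-based grammar. The grammar representation (a dictionary mapping each rule name to a list of alternatives, each alternative a sequence of terminal words or rule names) is an illustrative stand-in for a vendor format such as GSL, whose toolkits provide their own generation utilities.

import random

GRAMMAR = {
    "S": [["Prefiller", "City", "Postfiller"]],
    "Prefiller": [[], ["please", "get", "me", "a", "flight", "to"],
                  ["i", "want", "to", "fly", "to"]],
    "City": [["boston"], ["san", "francisco"], ["denver"]],
    "Postfiller": [[], ["please"]],
}

def generate(rule="S"):
    """Randomly expand one derivation of rule into a word string."""
    words = []
    for symbol in random.choice(GRAMMAR[rule]):
        if symbol in GRAMMAR:   # nonterminal: expand recursively
            words.extend(generate(symbol))
        else:                   # terminal word
            words.append(symbol)
    return words

if __name__ == "__main__":
    # Generate about 100 strings and read them for unintended coverage,
    # e.g., "please get me a flight to boston please".
    for _ in range(100):
        print(" ".join(generate()))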

Natural Language Testing

The purpose of the natural language test is to make sure that the correct meaning is assigned to input utterances (e.g., in the case of GSL, that each slot is assigned the correct value). Typically, you use the same test suite for natural language testing that you used for coverage testing. You run a utility that returns the meaning for each input string in the test set. The first time the test is run, you must check the results by eye to make sure the meanings are correct. If they are, save a copy of the utility's output so that when you rerun the test in the future (e.g., after a change to the grammar), you can automatically compare the new results to the old ones and confirm that the change has not introduced natural language understanding bugs.
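The following sketch automates that comparison against a saved "golden" file of approved meanings. The get_meaning function is a stand-in for the vendor utility that returns slot values, and the golden-file format is an assumption for illustration.

import json

def get_meaning(sentence):
    """Stand-in: return the slot assignments for a parsed sentence."""
    slots = {}
    for city in ("boston", "denver"):
        if city in sentence.lower().split():
            slots["destination"] = city
    return slots

def run_nl_test(test_suite, golden_path="nl_golden.json"):
    results = {s: get_meaning(s) for s in test_suite}
    try:
        with open(golden_path) as f:
            golden = json.load(f)
    except FileNotFoundError:
        # First run: review the meanings by eye, then keep this file.
        with open(golden_path, "w") as f:
            json.dump(results, f, indent=2)
        print("Golden file written; check it by hand before trusting it.")
        return
    for sentence, slots in results.items():
        if golden.get(sentence) != slots:
            print(f"MISMATCH: {sentence!r}: {golden.get(sentence)} -> {slots}")

if __name__ == "__main__":
    run_nl_test(["i want to fly to boston", "get me a flight to denver"])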

Ambiguity Testing

An ambiguity is a case in which there are multiple possible natural language interpretations of an input. Vendors provide utilities that search for ambiguities in grammars. Often, an ambiguity is caused by a bug in the grammar, which you must fix. In some cases, the ambiguity reflects actual ambiguity in the application domain (e.g., the same input utterance may be spoken whether the caller wants a stock quote for Cisco Systems or for Sysco Foods). Any ambiguity that is not fixed in the grammar must be explicitly handled in the application (e.g., adding the prompt, "Did you want Cisco Systems or Sysco Foods?").
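As a sketch of what such a utility looks for, the fragment below enumerates the (word string, meaning) pairs a small grammar can produce and flags any string that receives more than one distinct meaning. The grammar entries are illustrative; a real utility works from the compiled grammar itself.

from collections import defaultdict

# Each entry: (word string the grammar covers, slot values it assigns).
# Both companies below are spoken the same way, so one input string
# legitimately carries two meanings.
GRAMMAR_PAIRS = [
    ("quote for cisco", {"company": "Cisco Systems"}),
    ("quote for cisco", {"company": "Sysco Foods"}),
    ("quote for cisco systems", {"company": "Cisco Systems"}),
    ("quote for sysco foods", {"company": "Sysco Foods"}),
]

def find_ambiguities(pairs):
    meanings = defaultdict(list)
    for phrase, slots in pairs:
        if slots not in meanings[phrase]:
            meanings[phrase].append(slots)
    return {p: m for p, m in meanings.items() if len(m) > 1}

if __name__ == "__main__":
    for phrase, interpretations in find_ambiguities(GRAMMAR_PAIRS).items():
        print(f"AMBIGUOUS: {phrase!r}")
        for slots in interpretations:
            print("   ->", slots)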

Spelling Testing

A misspelled word in a grammar can cause a degradation in recognition performance that is very hard to track down. For example, a misspelled word can lead to an automatically generated pronunciation that is wrong, making recognition difficult. This is especially problematic for grammars that include long lists, such as a travel application that handles travel between 2,000 cities. If five of those cities are misspelled, leading to poor pronunciation models that slightly degrade recognition performance, the problem may not be caught until you are tuning pilot data, if at all. Related problems can be caused by misspellings in the values used in slot-filling commands.

A simple check with a standard spell checker can detect many of these problems. Of course, some words in grammars will not be in the spell check dictionary (e.g., company names that are not standard words). Therefore, the list of misspellings should be reviewed, and only those that are mistakes should be fixed.
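A sketch of such a check follows, assuming the grammar's vocabulary has already been extracted into a word list; the dictionary here is a toy stand-in for a real spelling dictionary.

KNOWN_WORDS = {"please", "get", "me", "a", "flight", "to",
               "boston", "denver", "san", "francisco"}

def check_spelling(grammar_vocabulary):
    """Return words absent from the dictionary, for manual review."""
    return sorted(w for w in grammar_vocabulary
                  if w.lower() not in KNOWN_WORDS)

if __name__ == "__main__":
    vocab = ["please", "flight", "Bostn",  # misspelling: should be flagged
             "Albuquerque"]                # legitimate word missing from dict
    for word in check_spelling(vocab):
        print("REVIEW:", word)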

Pronunciation Testing

Pronunciation testing is useful for applications that have long lists of items, such as company names, city names, and street names, that may not be in the initial dictionary and are added either by hand or with automatic pronunciation facilities. A pronunciation test uses a facility that lets you test a grammar by speaking to it: simply go down the list, saying each item, and for any misrecognition, take a look at the dictionary pronunciation. An alternative approach is to feed the pronunciation of each word in the list to a text-to-speech engine and listen to the result.
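One preparatory step can be automated: flagging list items whose words have no entry in the pronunciation dictionary, since those items will receive automatically generated pronunciations and deserve the closest listening. This sketch assumes a simple dictionary format (word mapped to phoneme strings); real dictionary formats vary by vendor.

PRONUNCIATIONS = {
    "boston": ["b ao s t ax n"],
    "denver": ["d eh n v er"],
}

def words_needing_review(item_list):
    """Map each list item to its words that lack a dictionary entry."""
    missing = {}
    for item in item_list:
        absent = [w for w in item.lower().split() if w not in PRONUNCIATIONS]
        if absent:
            missing[item] = absent
    return missing

if __name__ == "__main__":
    cities = ["Boston", "Denver", "Poughkeepsie"]
    for item, absent in words_needing_review(cities).items():
        print(f"{item}: no dictionary entry for {absent}")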

16.2.2 Testing Statistical Language Models

Given that statistical language models are automatically trained from data, they are far less prone to bugs than rule-based models. However, before pilot, they should be tested to make sure they provide reasonable out-of-the-box recognition accuracy.

When you test statistical language models, the most important principle is that the data used for the test must not be part of the training set; to perform a valid test, you must use held-out data. Otherwise, you are likely to end up with unrealistically optimistic results that do not reflect real-world performance.

When you collect data to train the initial SLM, some of the data (e.g., 2,000 utterances) should be held out of the training set and used as a test set to measure the performance of the model. You can assess the performance of the model either by running recognition on the test set or by measuring perplexity, a measure of the predictive power of a model: the lower the perplexity, the greater the predictive power, and greater predictive power is likely to lead to more accurate recognition. (Utilities to measure perplexity, given a test set, are provided by speech recognition vendors that offer SLMs.) It is often most straightforward to assess SLMs with recognition tests; most vendors provide utilities for running recognition in batch mode, that is, from a list of prerecorded utterances.
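To show what a perplexity measurement involves, here is a toy sketch using a maximum-likelihood bigram model with add-one smoothing. Real SLM toolkits use more sophisticated smoothing and ship their own perplexity utilities; the training and test sentences are illustrative.

import math
from collections import Counter

def train_bigram(sentences):
    """Return an add-one-smoothed bigram probability function."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    vocab_size = len({w for s in sentences for w in s.split()} | {"</s>"})
    def prob(prev, word):  # smoothed P(word | prev)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

def perplexity(prob, test_sentences):
    """Perplexity = 2 ** (-average log2 probability per word)."""
    log_sum, n = 0.0, 0
    for s in test_sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(words[:-1], words[1:]):
            log_sum += math.log2(prob(prev, word))
            n += 1
    return 2 ** (-log_sum / n)

if __name__ == "__main__":
    train = ["i want a flight to boston", "get me a flight to denver"]
    held_out = ["i want a flight to denver"]  # must not appear in training
    print("perplexity:", round(perplexity(train_bigram(train), held_out), 2))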

16.2.3 Testing Robust Natural Language Grammars

Testing robust natural language grammars is very similar to testing standard rule-based grammars. The only difference is that the former are simpler; they include only the slot-filling phrase-grammars. Therefore, you need not worry about coverage of the many possible filler and other extraneous words and phrases that callers may use. All the tests listed for rule-based grammars can be used for robust natural language grammars.

16.2.4 Testing Statistical Natural Language Grammars

A statistical natural language grammar is generally tested for accuracy. You want to know how accurately it assigns the correct slot value (e.g., the correct destination service for a call router) to input word strings. To perform a valid test, you need a test set that is not part of the training set. The same test set used for testing an SLM can be used for testing the statistical natural language grammar. In addition to transcriptions, each item in the test set should be labeled with the correct slot value. You measure performance by running the statistical natural language engine on the test set.
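A minimal sketch of such an accuracy measurement follows. The classify function is a stand-in for the statistical natural language engine, and the labeled utterances are illustrative; as noted above, the test set must be held out of training.

def classify(utterance):
    """Stand-in: map an utterance to a destination slot value."""
    text = utterance.lower()
    if "bill" in text or "charge" in text:
        return "billing"
    if "broken" in text or "not working" in text:
        return "repair"
    return "operator"

def slot_accuracy(labeled_test_set):
    """Fraction of test utterances assigned the correct slot value."""
    correct = sum(1 for utterance, label in labeled_test_set
                  if classify(utterance) == label)
    return correct / len(labeled_test_set)

if __name__ == "__main__":
    test_set = [
        ("there is a mistake on my bill", "billing"),
        ("my phone is not working", "repair"),
        ("i want to talk to someone", "operator"),
    ]
    print(f"slot accuracy: {slot_accuracy(test_set):.0%}")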


