16.1 Grammar Development


The process of grammar development is substantially different for rule-based than for statistical approaches. This section covers the development of rule-based grammars, statistical language models for recognition, and two approaches for developing natural language grammars for systems that use SLMs for recognition.

16.1.1 Developing Rule-Based Grammars

The primary challenge for the developer of a rule-based grammar is to anticipate what callers will say in each dialog state. To begin, you look at the grammar definition for the dialog state you are ready to work on. The designer will have specified two things: the set of information items, or slots, to be returned by the grammar; and a set of sample expressions (described in Chapter 8).

The slots will tell you about the core of the grammar: items to include that actually carry the salient information. For example, one slot for a travel application might be destination-city. This tells you that city names will be included in the grammar and that they represent part of the core, or meaning-bearing, elements in the grammar. It also tells you that the information about the city chosen by the caller will be communicated to the application in a slot named destination-city.

The sample expressions supplied by the designer will give you information about both the core and the fillers. Fillers are the words and phrases that may surround the core. If one sample expression is "I want to go to the Big Apple," you know that "the Big Apple," as a synonym for New York, is part of the core grammar and that "I want to go to" is an example of the kind of filler that can be expected. Obviously, all the sample expressions should be covered by the grammar. However, they are just meant to be examples. You should flesh out the grammar with appropriate related core and filler items.

Another important source of information about the expected caller utterances is the prompt wording itself. Callers often choose the same wording as the prompt. For example, if the prompt says, "Where are you going?" callers are likely to give an answer such as, "I am going to San Francisco," whereas if the prompt says, "What's the destination city?" an answer such as, "My destination is San Francisco" is more likely.

A number of languages have been developed for specifying grammars, and all of them have roughly the same capabilities. For the examples presented here, we use Grammar Specification Language (GSL), a language created by Nuance and available on a number of platforms.

Table 16-1. Grammar Operators for GSL

OPERATOR             EXPRESSION     MEANING
( ) concatenation    (A B C … D)    A followed by B followed by C followed by … D
[ ] disjunction      [A B C … D]    One of A or B or C or … D
? optional           ?A             A is optional
+ positive closure   +A             A occurs one or more times
* Kleene closure     *A             A occurs zero or more times

A GSL grammar is a set of rules of the form:

GrammarName GrammarDescription

A grammar description uses the operators defined in Table 16-1 to combine basic grammar elements. The elements, or operands, are either lowercase strings (representing the actual words in the grammar) or strings that include uppercase characters, which represent subgrammars. Subgrammars are themselves grammar rules defined elsewhere.

To make it more concrete, let's look at an example of a simple grammar for the travel application mentioned earlier, specifically the grammar for the GetDestination dialog state. The first grammar name is .GETDESTINATION. The dot operator (".") at the beginning indicates that it is the top-level grammar, the grammar name referenced in the application:

 .GETDESTINATION (?PREFILLER CITY ?POSTFILLER)

 PREFILLER  [(i want to go to)
             (i am going to)
             (i need a flight to)
             (?i'm going to)
            ]

 CITY       [[(new york) (the big apple)]
             (san francisco)
             boston
            ]

 POSTFILLER [please]

The top-level grammar, .GETDESTINATION, is defined as the concatenation of three subgrammars: PREFILLER, CITY, and POSTFILLER. Two of the subgrammars, PREFILLER and POSTFILLER, are defined as optional; the ? operator means that grammatical word strings may or may not have such fillers.

This grammar allows inputs such as, "I want to go to New York, please," "I'm going to San Francisco," "Going to Boston," and "Boston." The CITY subgrammar is the core. Isolating it in a separate module as a subgrammar makes it easy to update (for example, to add more cities). As a separate module, it is also easy to reuse. You could, for example, reuse it in another dialog state as part of a grammar for getting the traveler's origin city.

One thing missing from the example is the semantic specification: the instructions for filling slots given the caller's path through the grammar. A simple approach is shown in the following example:

 .GETDESTINATION (?PREFILLER CITY ?POSTFILLER)

 PREFILLER  [(i want to go to)
             (i am going to)
             (i need a flight to)
             (?i'm going to)
            ]

 CITY       [[(new york) (the big apple)] {<destination-city ny>}
             (san francisco)              {<destination-city sf>}
             boston                       {<destination-city boston>}
            ]

 POSTFILLER [please]

The slot-filling commands (e.g., <destination-city ny>) are executed if the preceding grammar construct is traversed. In this example, if the caller said, "I want to go to New York," "Going to the Big Apple," "I need a flight to New York," or "New York," the destination-city slot will be filled with the value ny, indicating to the application the intended meaning, despite the variety of ways the caller may have expressed that meaning.
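To illustrate conceptually how such slot filling works, here is a minimal Python sketch (not GSL or any vendor runtime) that mirrors the grammar above: it strips an optional prefiller and postfiller, matches the core city phrase, and fills a destination-city slot with a normalized value. The phrase lists and slot name are taken from the example; everything else is an illustrative assumption.

```python
# Illustrative sketch of rule-based slot filling (not a vendor GSL engine).
PREFILLERS = ["i want to go to", "i am going to", "i'm going to",
              "going to", "i need a flight to"]
CITIES = {"new york": "ny", "the big apple": "ny",
          "san francisco": "sf", "boston": "boston"}
POSTFILLERS = ["please"]

def parse_destination(utterance):
    """Return {'destination-city': value} or None if out of grammar."""
    words = utterance.lower().strip()
    # Strip an optional prefiller (longest match first).
    for pre in sorted(PREFILLERS, key=len, reverse=True):
        if words.startswith(pre):
            words = words[len(pre):].strip()
            break
    # Strip an optional postfiller.
    for post in POSTFILLERS:
        if words.endswith(post):
            words = words[:-len(post)].strip().rstrip(",")
            break
    # Match the core and fill the slot with the normalized value.
    value = CITIES.get(words)
    return {"destination-city": value} if value else None

print(parse_destination("I want to go to New York, please"))
# {'destination-city': 'ny'}
```

Note that every in-grammar path for New York, including "the Big Apple," maps to the single normalized value ny, just as the slot-filling commands do in the GSL example.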

An alternative approach, for a different but related grammar, is shown in the following example. In this case, the subgrammar returns values to the higher-level grammar that referenced it, rather than directly filling slots. This approach is useful if the subgrammar is to be used multiple times by the higher-level grammar to fill multiple slots, as is the case in this example.

 .GETCITIES  (?PREFILLER
              [(from CITY:orig to CITY:dest)
               (to CITY:dest from CITY:orig)
              ] {<origin-city $orig> <destination-city $dest>}
              ?POSTFILLER
             )

 PREFILLER   [(i want to go)
              (i am going)
              (i need a flight)
              (?i'm going)
             ]

 CITY        [[(new york) (the big apple)] {return(ny)}
              (san francisco)              {return(sf)}
              boston                       {return(boston)}
             ]

 POSTFILLER  [please]

This grammar accepts inputs such as, "I want to go from New York to San Francisco" or "I'm going to Boston from New York." Rather than directly fill slots, the CITY subgrammar returns values. Those values get assigned to the variables dest and orig (by the expressions CITY:dest and CITY:orig), which are then referenced, as appropriate, by the slot-filling commands. For example, <origin-city $orig> causes the origin-city slot to be filled with the value of the variable orig. Using this approach makes it possible for the higher-level grammar .GETCITIES to use the CITY subgrammar in two places, returning values to be used in two different slots.

In general, grammars are tuned and refined iteratively, and therefore they should be organized for easy maintenance. Here are some guidelines:

  • Break the grammar into a logical, modular structure of subgrammars.

  • Place the core and the fillers in separate subgrammars.

  • Choose descriptive names for subgrammars, slots, and variables.

  • Format the grammars to make the structure obvious. For example, use a clear indenting scheme to offset logical groupings within the grammar.

In some cases, the grammar cannot be fully specified before runtime. For example, imagine an application for paying bills. Each subscriber to the service will select the companies to be paid. When subscribers call the system, their specific list of companies must get loaded into a grammar. This is called a dynamic grammar. Each speech technology vendor has its own approach for handling dynamic grammars. Therefore, we do not cover it here; see the vendor-specific documentation for details.

16.1.2 Developing Grammars for Statistical Language Models

Statistical language models (SLMs) are used when the amount of expected variation in spoken inputs is hard to capture with explicit grammar rules. The basic approach is to automatically learn what word strings occur, and with what likelihood, from real caller data. In this way, the grammar developer no longer needs to imagine all the variations. Instead, you need only collect a data set, transcribe it, and feed it to the software utility that creates the SLM.

The creation of the statistical language model is referred to as training the language model. The data set used by the training utility, consisting of a list of transcriptions of caller utterances (the actual word strings spoken), is called the training set. The basic approach used by the training utility is to estimate the probability of the occurrence of each word in the vocabulary, given its context. The context used for these estimates is the most recent few words spoken.

The order of the model determines how much context is considered. An Nth-order model (called an N-gram) considers the N-1 predecessor words as context. In other words, the model provides an estimate of the probabilities of what the next word to be spoken may be, given the previous N-1 words. A first-order N-gram, called a unigram, consists of estimates that a word will occur, without regard to context. A second-order N-gram is called a bigram. It consists of estimates of word occurrence, given the most recent predecessor word. A trigram, or third-order N-gram, provides estimates given the most recent two preceding words.

Theoretically, the higher the order of the model, the more predictive power you will have about which word will occur next. However, the higher the order of the model, the more training data you need to come up with reliable estimates, given the larger number of probabilities that must be estimated. Assuming a vocabulary size of 1,000 words, a unigram model consists of 1,000 probabilities. By contrast, a bigram model consists of 1,000² (one million) probabilities (i.e., the probability of each of the 1,000 words in the vocabulary, given each of the 1,000 possible predecessor words).
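As a concrete illustration, the following Python sketch estimates bigram probabilities from a tiny, invented training set using raw maximum-likelihood counts. Production SLM training utilities also apply smoothing and back-off, which are omitted here.

```python
# Illustrative bigram estimation from transcribed utterances
# (maximum-likelihood counts only; no smoothing or back-off).
from collections import Counter

training_set = [
    "i want to go to boston",
    "i want a flight to san francisco",
    "going to boston please",
]

unigrams, bigrams = Counter(), Counter()
for utterance in training_set:
    words = ["<s>"] + utterance.split()   # <s> marks the utterance start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("to", "want"))   # "want" is followed by "to" in 1 of its 2 occurrences
# 0.5
```

With a real training set of tens of thousands of utterances, the same counting procedure yields the probability estimates the recognizer uses to constrain its search.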

In typical applications that use SLMs, a vocabulary size of a few thousand words is common. Trigram models are often used. A rule of thumb about the size of the training set for such a model is a minimum of 20,000 transcribed utterances.

After an application goes to pilot, it is easy to collect and transcribe data, thereby increasing the size of the training set. The challenge with SLM-based systems is bootstrapping: creating an initial model so that you can achieve reasonable performance at the start of the pilot. There are a number of approaches you can use to create the initial model. One approach is to use a Wizard of Oz system to collect data (see Chapter 8). If the wizard is equipped to help callers complete their task, you can handle real callers. Otherwise, you need to solicit callers to use the system and give them tasks to complete.

Alternatively, if there are live agents currently handling the task, you can use them as wizards. The caller will hear the recorded prompts played by the system, but the "recognition" and "understanding" will be performed by the live agents.

Another possibility is to develop a GSL-based grammar for the first phase of the pilot. In that case, you should still use the prompting targeted for the SLM system, even though the wording of the prompts may encourage language from the callers that leads to a high out-of-grammar rate for the GSL. When a recognition reject occurs, the back-off prompting should be more appropriately constraining, to help the caller succeed with the GSL grammars. In that way, you will be able to collect data, especially from the first interchange with the system, that will be useful for training the SLM. However, when there is a problem, the system will quickly back off to a directed dialog, and in this way the caller is unlikely to experience more than one reject due to the data collection setup. As soon as enough data have been collected, the GSL grammar can be replaced with an SLM.

The SLMs we have discussed in this section fulfill the role of the syntactic side of the grammar. They are used to create the search space for the recognizer (the recognition model described in Chapter 2), resulting in the recognition of word strings. The next two sections describe the two approaches commonly used for the semantic role of the grammar. These are the methods for assigning a meaning to the word strings recognized by an SLM-based recognizer.

16.1.3 Developing Robust Natural Language Grammars

In the late 1980s and early 1990s, researchers at a number of sites worked on projects that combined speech recognition technology and natural language understanding technology, resulting in the early spoken language understanding systems (Cohen, Rivlin, and Bratt 1995). Before those projects, most natural language understanding research was applied to text rather than speech.

One of the lessons of these first applications of natural language technology to spoken language was that spoken and written language differ dramatically. Beyond the structural and word-choice differences discussed in Chapter 10, spoken language is different from written language in that it is often "ungrammatical," includes disfluencies (e.g., "Um, I want the sec no, the third flight"), and often includes extraneous information that is not directly needed to answer a particular question ("I have a meeting in the afternoon, so I want to arrive by eleven a.m.").

Robust parsing approaches were developed to deal with these problems (Jackson et al. 1991). The basic idea is to search for meaning-bearing words and phrases without trying to parse and understand the entire word string that was spoken. For example, consider the following dialog:

SYSTEM:

What time do you want to arrive?

CALLER:

I have a meeting in the afternoon, so I want to arrive by eleven a.m.


Here, a robust grammar might search only for phrases that specify a time, disregarding everything else. The grammar specification would include phrase-grammars with slot-filling commands, but no grammars to cover fillers. In this example, the grammar need only cover "eleven AM." There is no need to write a grammar that can cover all the other, hard-to-predict material.
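The phrase-spotting idea can be sketched in a few lines of Python. The time pattern and the arrival-time slot name below are illustrative assumptions, not part of any vendor's grammar format; the point is that only the meaning-bearing phrase is modeled, and everything around it is ignored.

```python
# Illustrative robust-parsing sketch: spot a time phrase, ignore the rest.
import re

TIME_PATTERN = re.compile(
    r"\b(one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve)"
    r"\s+(a\.?m\.?|p\.?m\.?)\b", re.IGNORECASE)

def extract_arrival_time(utterance):
    """Return {'arrival-time': ...} or None; all filler words are ignored."""
    m = TIME_PATTERN.search(utterance.lower())
    if not m:
        return None
    hour, meridiem = m.group(1), m.group(2).replace(".", "")
    return {"arrival-time": f"{hour} {meridiem}"}

print(extract_arrival_time(
    "i have a meeting in the afternoon, so i want to arrive by eleven a.m."))
# {'arrival-time': 'eleven am'}
```

The extraneous material ("I have a meeting in the afternoon, so...") never has to be covered by the grammar at all.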

As the language you must interpret becomes more variable and flexible (as with applications using SLMs), the ability to write only the slot-filling phrase-grammars, disregarding everything else, is a tremendous simplification of what would otherwise be a daunting grammar-writing task. When applications accommodate the expression of multiple slot values in a single utterance, not only the surrounding fillers but also the actual order in which slot values get expressed may vary. Luckily, when writing a robust natural language grammar, you need not specify the order in which the slot-filling phrases happen. As a result, a very simple grammar could cover variations such as the following:

"I want to get a flight from Boston to San Francisco."

"I need to get to San Francisco right away, starting from Boston."

"I have a very important meeting, and I need the next flight from Boston to San Francisco."

"Tomorrow is my aunt's birthday party in San Francisco. When's the next flight from Boston?"
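A rough Python sketch of such an order-independent robust grammar follows; the phrase patterns and slot names are invented for illustration. Each slot-filling phrase ("from CITY," "to CITY") is searched for independently, wherever it occurs in the utterance, and all other words are ignored.

```python
# Illustrative order-independent phrase spotting for two slots.
import re

CITY = r"(new york|the big apple|san francisco|boston)"
FROM_CITY = re.compile(r"\bfrom\s+" + CITY)
TO_CITY = re.compile(r"\bto\s+" + CITY)
NORMALIZE = {"new york": "ny", "the big apple": "ny",
             "san francisco": "sf", "boston": "boston"}

def parse_cities(utterance):
    """Fill whichever of origin-city / destination-city appear, in any order."""
    slots = {}
    text = utterance.lower()
    if (m := FROM_CITY.search(text)):
        slots["origin-city"] = NORMALIZE[m.group(1)]
    if (m := TO_CITY.search(text)):
        slots["destination-city"] = NORMALIZE[m.group(1)]
    return slots

print(parse_cities("I need the next flight from Boston to San Francisco"))
# {'origin-city': 'boston', 'destination-city': 'sf'}
```

Because each phrase is spotted independently, "from Boston to San Francisco" and "to San Francisco ... from Boston" fill the same two slots without any extra grammar rules.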

The method for developing phrase grammars for robust parsing is the same as that described earlier for rule-based grammars, but simpler. The specification syntax is usually similar, although you will have to look at vendor-specific documentation for the details.

16.1.4 Developing Statistical Natural Language Grammars

Sometimes, SLM-based grammars are used to fill a single slot. One example is call routing, discussed in earlier chapters. In call routing, a tremendous variation is expected in the ways callers express their needs, but the result is the filling of a single slot: the identity of the service to which the system must forward the call.

Some speech technology vendors let you use a statistical approach to create the natural language grammar (the slot-filling grammar) for such single-slot applications. The result is that there is no need to develop handwritten grammar rules. Instead, you simply collect data for a training set. Then a training utility automatically learns, from the data, appropriate mappings from words, phrases, and combinations of words and phrases to the appropriate slot value. This capability makes it far easier to develop call routing applications, given the difficulty of writing the slot-filling rules by hand for input that has so much natural variation. In effect, all the knowledge to be encapsulated in the grammar is learned automatically.

To use a utility for statistical natural language training, you need a training set. Typically, the same training set used for the SLM is used for training the natural language grammar. However, in addition to transcriptions, each utterance must be labeled with the appropriate slot value (e.g., the appropriate route, given the caller request). You can collect data appropriate for natural language training by using the same methods for collecting data discussed in section 16.1.2.
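As a hypothetical illustration of what such a training utility learns, the following Python sketch trains a simple Naive Bayes classifier from labeled transcriptions and routes new utterances to a single slot value. Real vendor utilities are considerably more sophisticated; the routes and training data here are invented.

```python
# Illustrative single-slot (call routing) classifier trained from
# labeled transcriptions: word-count Naive Bayes with add-one smoothing.
import math
from collections import Counter

training_set = [
    ("i lost my card", "card-services"),
    ("my card was stolen", "card-services"),
    ("what is my balance", "account-info"),
    ("how much money do i have", "account-info"),
]

routes = {label for _, label in training_set}
word_counts = {r: Counter() for r in routes}
route_counts = Counter()
for text, label in training_set:
    word_counts[label].update(text.split())
    route_counts[label] += 1

vocab = {w for c in word_counts.values() for w in c}

def route(utterance):
    """Pick the route maximizing P(route) * prod P(word | route)."""
    def score(r):
        total = sum(word_counts[r].values())
        s = math.log(route_counts[r] / len(training_set))
        for w in utterance.split():
            s += math.log((word_counts[r][w] + 1) / (total + len(vocab)))
        return s
    return max(routes, key=score)

print(route("i think my card is lost"))
# card-services
```

Notice that "i think my card is lost" matches no training utterance exactly; the learned word statistics still route it correctly, which is exactly what handwritten slot-filling rules struggle to do at scale.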

The specific details of developing such systems are vendor-specific. You should consult the vendor directly.



Voice User Interface Design 2004
ISBN: 321185765
Year: 2005
Pages: 117