Creating a Speech Application

The SASDK provides a template for creating new speech applications with Visual Studio .NET. It also provides visual editors for building the prompts (words spoken to the user) and grammars (words spoken by the user). This section will examine the basics of creating a speech application with the SASDK.

To use the template provided with the SASDK, open Visual Studio .NET and complete the following steps:

1.	Click File, New, and Project. From the New Project dialog box, select the desired Project Type and click the Speech Web Template icon in the Templates window. This template was created when you installed the SASDK. Change the value in the location dropdown box to the desired project name and click OK.
2.	You can either accept the setting defaults and click Finish, or select Application Settings and Application Resources to specify custom settings.
3.	The default application mode is voice-only, so if you want to create a multimodal application, you can change the mode from the Application Settings tab.
4.	The Application Resources tab lets you specify whether a default grammar library file is created and what it is named. From here you can also indicate whether a new prompt project is created and specify its name.
5.	Click Finish at any time to build the new project.

If you choose to build a voice-only application, the project will include a Web page named Default.aspx. This page contains two speech controls, AnswerCall and SemanticMap. These are basic controls used in every voice-only application. Their specific functions will be covered in the section titled "Using Speech Controls." The default project will also include a folder named Grammars that contains two grammar files, Library.grxml and SpeechWebApplication1.grxml. For voice-only applications the prompt project and Grammars folder are included by default.

If you choose to build a multimodal application, the Default.aspx page is included, but it will contain no controls. There will be a Grammars folder, but no prompt project will be created.

By default, the Manifest.xml file is included for both project types. It is an XML-based file that contains references to the resources used by the project. References include grammar files and prompt projects. Speech Server will preload and cache these resources to help improve performance.

The Prompt Editor

Microsoft recommends that you prerecord static prompts because voice recordings are more natural than the result of the text to speech engine. The prompt editor (see Figure 2.4) is a tool that allows you to specify potential prompts and record wave files associated with each prompt.

Figure 2.4. Screenshot of the prompt editor in the prompt database project. The prompt editor is used to record the wave files associated with each prompt. The screenshot includes four different prompts.

The utterance "Welcome to my speech application" represents a single prompt. For voice-only applications, you need to make sure you include a wide range of prompts. Since the user relies on these prompts to understand how the application works, they need to be clear and meaningful.

The Prompt Database

An application built with the Speech SDK wizard adds a prompt database project by default. If you choose to add another prompt database, it can be done by using the File menu and selecting Add Project and New Project (see Figure 2.5). The new project will be based on the Prompt Project template. Once the project is added, a new Prompt Database can be added by right-clicking the prompt project and then selecting Add and Add New Item. The Prompt Database item opens up a data grid style screen that allows you to specify all the potential prompts.

Figure 2.5. Screenshot of the dialog used to add a new prompt project to your speech application. This dialog is accessed by clicking Add project from the File menu and then clicking New Project.

The prompt database contains all the prerecorded utterances used to communicate with the user. An application can reference more than one prompt database. One reason for doing this is ease of maintenance. Prompts that change often can be placed in a separate prompt database. By restricting the size of the prompt database, the amount of time needed to recompile is minimized.

If you followed the instructions in the last section to create a new speech project, you can now open the default prompt database by double-clicking the prompts file from Solution Explorer.

Transcriptions and Extractions

Figure 2.6 is a screenshot of the recording pane in the prompt project database. There are two grids in a prompt project. The top one contains transcriptions, and the bottom one extractions. Transcriptions are the individual pieces of speech that relate to a single utterance. Extractions combine transcription elements to form phrases. Extractions are formed when you place square brackets around the transcription elements.

Figure 2.6. Contents of the Recording pane in the prompt database project. Transcriptions are the individual pieces of speech that can be prerecorded. No utterances have been recorded for prompts with a red X in the Has Wave column.
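The bracket notation can be sketched in a few lines of Python. This is only an illustration of the idea that each bracketed span of a transcription becomes one extraction; the function name `extractions` is hypothetical, and the prompt editor's real parsing rules may differ.

```python
import re

def extractions(transcription: str) -> list[str]:
    """Return the bracketed segments of a transcription, in order.

    Assumption: brackets are not nested and each pair marks one extraction.
    """
    return [seg.strip() for seg in re.findall(r"\[([^\[\]]+)\]", transcription)]

# A transcription split into two reusable elements.
print(extractions("[I heard you say] [Sara Rea]"))

# A transcription with no brackets yields no extractions in this sketch.
print(extractions("Welcome to my speech application"))
```

Splitting a transcription this way is what lets a single recording session produce several elements that can later be recombined into different prompts.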

Sometimes a prompt involves more than one transcription element, such as "I heard you say Sara Rea." In this case, the two elements are "I heard you say" and "Sara Rea." In some cases employee names may also be prerecorded in the prompt database. This adds a burden, because every time a new employee is added to the database, someone needs to record the employee's name. However, doing so keeps the speech engine from falling back on text-to-speech (TTS) to render the prompt. This is preferred because recordings produce a more natural-sounding prompt.

Prompts are controlled from prompt functions. These functions programmatically indicate what phrases are spoken to the user. When the speech engine is passed a phrase from the function, it first searches the prompt database to see if any prerecorded utterances are present. It searches the entire database for matches and will string together as many transcription elements as necessary to retrieve the entire phrase.

Because the speech engine strings transcription elements together to form phrases, you can break phrases up to prevent redundancy. For instance, the phrase "Sorry, I am having trouble hearing you. If you need help, say help" may be spoken when an application encounters silence. The phrase "Sorry, I am having trouble understanding you. If you need help, say help" is used whenever the speech engine does not recognize the user's response. Therefore, the subphrase "If you need help, say help" can be recorded as a separate phrase in the prompt database. This means that the subphrase has to be recorded only once. In addition, the size of the prompt database is minimized.
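The lookup described above can be sketched in Python. This is a minimal model, not Speech Server's actual algorithm: it assumes a greedy longest-match over words, and the names `normalize` and `plan_prompt` are hypothetical. Words that match no recorded element are marked for TTS.

```python
def normalize(text: str) -> list[str]:
    """Lower-case the text and replace punctuation with spaces so matching is word-based."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() or ch == "'" else " "
        for ch in text.lower()
    )
    return cleaned.split()

def plan_prompt(phrase: str, recorded: list[str]) -> list[tuple[str, str]]:
    """Split a phrase into ("wave", text) and ("tts", text) segments.

    Assumption: at each position the longest recorded element that matches
    the upcoming words wins; unmatched words are collected for TTS.
    """
    words = normalize(phrase)
    elements = sorted((normalize(r) for r in recorded), key=len, reverse=True)
    plan: list[tuple[str, str]] = []
    pending: list[str] = []
    i = 0
    while i < len(words):
        match = next((e for e in elements if e and words[i:i + len(e)] == e), None)
        if match:
            if pending:  # flush unmatched words before the recorded element
                plan.append(("tts", " ".join(pending)))
                pending = []
            plan.append(("wave", " ".join(match)))
            i += len(match)
        else:
            pending.append(words[i])
            i += 1
    if pending:
        plan.append(("tts", " ".join(pending)))
    return plan

recorded = [
    "Sorry, I am having trouble hearing you",
    "Sorry, I am having trouble understanding you",
    "If you need help, say help",
]
for segment in plan_prompt(
    "Sorry, I am having trouble understanding you. If you need help, say help",
    recorded,
):
    print(segment)
```

In this model, both the silence prompt and the no-recognition prompt reuse the single recording of "If you need help, say help," which is the saving the paragraph above describes.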

The Recording Tool

The Recording Tool can be accessed by clicking the red circle icon above the Transcription pane or by clicking Prompt and then Record All. The text from the transcription item selected is displayed in the Display Text textbox (see Figure 2.7). After clicking Record, the person making the recording should speak clearly into the microphone. Click Stop as soon as the entire phrase is spoken. Try to select a recording location where background noise is minimized.

Figure 2.7. The Recording tool allows you to directly record each prompt associated with a transcription. Prompts can also be recorded by professional voice talent in a studio, made into wave files, and imported.

In some cases, you may want to utilize professional voice talent to make recordings. There are third-party vendors, such as ScanSoft (see the "ScanSoft" profile box), that can provide professional voice talent and assistance with recordings. Wave files created in a recording studio can be associated with a specific transcription element by clicking Import and browsing to the file's location.

If the speech engine is unable to find a match in any of the prompt databases, it utilizes TTS. The result is a machine-like voice that may go against the natural interface you are trying to create. Speech Server comes bundled with ScanSoft's Speechify TTS engine (see the "ScanSoft" profile box), but at present the results from a text-to-speech engine are not as natural-sounding as a recorded human voice. On the other hand, it will not always be possible or manageable to prerecord all utterances. You will have to weigh these options when designing your speech application.

ScanSoft, Inc.

In 2003, ScanSoft, Inc. (www.scansoft.com) merged with SpeechWorks to become one of the largest suppliers of speech-related applications and services. SpeechWorks, one of the original founders of the SALT Forum, offered ScanSoft expertise in the areas of speech recognition, text-to-speech (TTS), and speaker verification.

ScanSoft, a publicly traded company (NASDAQ:SSFT), was already a large supplier of popular products, such as Dragon NaturallySpeaking, which allows users to dictate into any Windows-based application at up to 160 words per minute. The merger made the company a dominant force in the area of speech technology.

ScanSoft offers a broad range of products, but also offers such services as assistance in deployment and configuration of speech solutions. In addition, it provides voice talent for companies that wish to have their prompts professionally recorded.

ScanSoft has deployed speech-based solutions to a broad range of industries, including financial services, government, health care, and retail.

Microsoft first licensed the Speechify text-to-speech engine with its Microsoft Speech API (SAPI) 5.0 product in 2001. This was done because of the high quality of the Speechify text-to-speech voice.

In 2002, SpeechWorks and Microsoft formed a strategic alliance which ensured that the Speechify text-to-speech engine would be included with the Microsoft Speech Server product. The Speechify TTS engine is SALT-based and is included out-of-the-box with Speech Engine Services.

Including the Speechify product was important not only because it provides a natural-sounding voice, but also because it supports multiple languages and performs well. Utilizing ScanSoft's TTS engine, Microsoft was able to focus more on perfecting the speech-recognition engine.

At the time of the alliance, SpeechWorks agreed to add support for its OpenSpeech Recognizer (OSR) product. The OSR product is a high-performance recognition engine that can be purchased separately from ScanSoft and then included with the Enterprise Edition of Speech Server.

ScanSoft and the ScanSoft logo are registered trademarks of ScanSoft, Inc.

The recording of prompts is a major consideration when designing a speech-enabled application. If professional talent is used, you will want to try to minimize the need for multiple recording sessions. If the application requires the utilization of text-to-speech for most prompts, you may want to consider purchasing a third-party TTS add-in.

The Grammar Editor

Grammar, the reverse of prompts, represents what the user says to the application. This is a key element of voice-only applications because they rely completely on accurate understanding of the user's commands. The grammar editor builds Extensible Markup Language (XML) files that are used by the speech-recognition engine to understand the user's speech. What is nice about the grammar editor is that you drag-and-drop controls to build the XML instead of having to type it in directly. This helps to reduce the time spent building grammars.

A grammar is stored as an XML file with a .grxml extension. Each Question/Answer (QA) control in the application, representing an interaction with the user, is associated with one or more grammars. A single grammar file contains one or more rules that the application uses to interpret the user's response.

Tip

In order to increase the speed of the application, grammars can be compiled into .cfg files using the grammar compiler. This is typically done after the application is deployed, because that is when you will obtain the most benefit. Using compiled grammar files instead of text files reduces the size of files and thus the amount of time required to download them. Most important, it allows the speech engine to load files into memory faster.

Grammars are compiled with a command-line utility named SrGSGc.exe. It is installed by default in Program Files\Microsoft Speech Application SDK 1.0\SDKTools\bin. The tool can be accessed by clicking Start|Programs|Microsoft Speech Application SDK 1.0|Debugging Tools|Microsoft Speech Application SDK Command Prompt. From the command prompt, type SrGSGc.exe followed by the name of the output file (.cfg) and then the name of the input file (.grxml).
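When several grammar files need compiling, the command line can be assembled by a small script. The Python sketch below only builds the argument list in the order described above (output file, then input file); the helper name `compile_command` is hypothetical, and any switches the tool may require are not shown, so check the tool's own help output before relying on it.

```python
from pathlib import PureWindowsPath

def compile_command(grammar_file: str) -> list[str]:
    """Build the SrGSGc.exe argument list: output (.cfg) first, then input (.grxml)."""
    source = PureWindowsPath(grammar_file)
    if source.suffix.lower() != ".grxml":
        raise ValueError(f"expected a .grxml file, got {grammar_file!r}")
    output = source.with_suffix(".cfg")  # same name, compiled extension
    return ["SrGSGc.exe", str(output), str(source)]

# Print the command for one of the default grammar files.
print(" ".join(compile_command(r"Grammars\Library.grxml")))
```

On a machine with the SASDK installed, the returned list could be passed to `subprocess.run` from the SDK command prompt environment.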

To add a grammar file, click Add New Item from the Project menu. From there, select the category Grammar File and name the file accordingly. Existing grammars can be viewed by expanding the Grammars folder within Solution Explorer. By default, two grammar files are added when you create a voice-only or multimodal application. The first file, named Library.grxml, contains common grammar rules you may need to utilize. For instance, it includes a rule for collecting yes/no responses (see Figure 2.8). It also includes rules for handling numbers, dates, and even credit card information. Rules embedded within the library grammar file can be referenced in other grammar files through the RuleRef control.

Figure 2.8. Screenshot displaying the yes/no rule inside the grammar editor. This is one of several rules included by default with the Library.grxml file.

The second grammar file is named the same as the project file by default. This is where you will place the grammar rules associated with your application. Although you could store all the rules in a single file, you may want to consider adding subfolders within the main Grammars folder. You can then create multiple grammar files to group similar types of grammar rules. This helps to organize code and makes referencing grammar rules easier.

Grammar rules are built by dragging elements onto the page. Controls are available in the Grammar tab of the toolbox. Figure 2.9 is a screenshot of these grammar controls. Most rules will consist of one or more of the following:

Phrase: Represents the actual phrase spoken by the user.
List: Contains multiple phrase elements that all relate to the same thing. For instance, a yes response could be spoken as "yeah," "ok," or "yes please." A list control allows you to indicate that all these responses are the same as yes.
RuleRef: Used to reference other rules through the URI property. This is useful when you have multiple grammar files and want to reuse the logic in existing rules.
Group: Used to group related elements. It can contain any element, such as a List, Phrase, or RuleRef.
Wildcard: Used to specify which words in a phrase can be ignored.
Halt: Used to stop the recognition path.
Skip: Used to indicate that a recognition path is optional.
Script Tag: Used to get semantic information from the grammar.

Figure 2.9. Screenshot of the Grammar tab, available in the toolbox when creating a new grammar. The elements you will use most often are the List, Phrase, RuleRef, and Script Tag elements.
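To give a feel for the XML the editor builds, the Python sketch below embeds a small hand-written grammar and summarizes its rules. It assumes the .grxml file follows the W3C Speech Recognition Grammar Specification (SRGS) XML form, where a List would presumably appear as `one-of`, a Phrase as `item`, and a RuleRef as `ruleref`. The rule names `Yes` and `Confirm` and the function `describe` are hypothetical, and the editor's actual output may contain more attributes.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hand-written illustration; not copied from Library.grxml.
GRAMMAR = """
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="Confirm">
  <rule id="Yes" scope="public">
    <one-of>
      <item>yes</item>
      <item>yeah</item>
      <item>ok</item>
      <item>yes please</item>
    </one-of>
  </rule>
  <rule id="Confirm" scope="public">
    <item>I said</item>
    <ruleref uri="#Yes"/>
  </rule>
</grammar>
""".strip()

def describe(grammar_xml: str) -> dict[str, dict[str, list[str]]]:
    """Map each rule id to the phrases it contains and the rules it references."""
    root = ET.fromstring(grammar_xml)
    summary = {}
    for rule in root.findall(f"{{{SRGS_NS}}}rule"):
        phrases = [
            item.text.strip()
            for item in rule.iter(f"{{{SRGS_NS}}}item")
            if item.text and item.text.strip()
        ]
        refs = [ref.get("uri") for ref in rule.iter(f"{{{SRGS_NS}}}ruleref")]
        summary[rule.get("id")] = {"phrases": phrases, "refs": refs}
    return summary

for rule_id, parts in describe(GRAMMAR).items():
    print(rule_id, parts)
```

Here the `Confirm` rule reuses the `Yes` rule through a reference rather than repeating its phrases, which is the kind of reuse the RuleRef element is meant for.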

The grammar editor (see Figure 2.8) contains a textbox called Recognition String. When dealing with complex rules, it can be used to test the rule without actually running the application. This is very useful when you are building the initial grammar set. To use this feature, just enter text that you would expect the user to say and click Check. The output window will display the Semantic Markup Language (SML), which is the XML generated by the speech engine and sent to the application. If the text was recognized, you will see "Check Path test successfully complete" at the bottom of the output window.

Tip

Do not use quotation marks when entering text in the Recognition String textbox. Doing so will cause the speech engine not to recognize the text.

The Script Tag element is used to assign the user's response to a semantic item. The properties for a script tag include an ellipsis button that opens the Semantic Script Editor. This editor helps you to create an assignment so that the correct SML result is returned. You can also switch to the Script tab and edit the script directly. Figure 2.10 is a screenshot of the Semantic Script Editor.

Figure 2.10. Screenshot of the Semantic Script Editor that is available when you use a Script Tag element. The Script Tag is used whenever you need to assign the user's response to a semantic item.

When building grammars you will probably not anticipate all the responses on an initial pass. Therefore, grammars require fine-tuning to make the application as efficient and accurate as possible. This process is eased since grammar files are not compiled and instead are available as XML reference files. For this reason, you would not want to compile grammar files until after the application has been thoroughly tested and is ready to deploy.

Using Speech Controls

A voice-only application has no visible interface. It runs on IIS as a Web page and is accessed with a telephone. When developing and debugging the application, it is executed within the Web browser, and the Speech Debugging Console is used to provide the developer with information about the application dialog. The user will never see the page created, so it is not important what is placed on it visually. Therefore, the only elements on the page will be speech controls, and they will be seen only by the developer.

The Speech Application SDK includes several speech controls that are visible from the Speech tab in the Toolbox. These controls will be dragged onto the startup form as the application is built. Figure 2.11 is a screenshot of the speech controls available in the speech tab of the toolbox. Speech controls are the basic units for computer-to-human interaction, and the SASDK contains two varieties of controls: dialog and application speech controls.

Figure 2.11. Screenshot of all the speech controls available in the speech tab of the toolbox. The QA control is the most basic unit and is utilized in every interaction with the user. SmexMessage, AnswerCall, TransferCall, MakeCall, RecordSound, and DisconnectCall are only applicable for telephony applications.

Dialog Speech Controls

Table 2.1 is a listing of the dialog speech controls used for controlling the conversational flow with the user. A QA control, the most commonly used control, represents a single interaction with the user in the form of a prompt and a response.

Table 2.1. Dialog Speech Controls are used for controlling the conversational flow with the user.
Control Name	Description
SemanticMap	Collection of SemanticItem controls, where each SemanticItem control represents a single piece of information collected from the user, such as a last name.
QA	Question/Answer control. This represents one interaction with the user in the form of a question and then a response.
Command	Often used to navigate the application with unprompted commands such as Help or Main Menu.
SpeechControlSettings	Specify common settings for a group of controls.
SmexMessage	Sends messages to and receives messages from a telephony application that complies with the Computer Supported Telecommunications Applications (CSTA) standard published by the European Computer Manufacturers Association (ECMA).
AnswerCall	Answer calls from a telephony device. Used for inbound telephony applications.
TransferCall	Transfers a call.
MakeCall	Initiates a new call. Used for outbound telephony applications.
DisconnectCall	Ends a call.
CompareValidator	Compares what the user says with some value.
CustomValidator	Validates data with client-side script.
RecordSound	Records what the user says and copies it to the Web server so it can be played back later.
Listen	Represents the listen element from the SALT specification. Considered a basic speech control.
Prompt	Represents the prompt element from the SALT specification. Considered a basic speech control.

Speech Application Controls

Speech Application Controls are extensions of the basic speech controls. They are used to anticipate common user interaction scenarios. Refer to Table 2.2 for a listing of the application controls included with the SASDK. For instance, the Date control is a speech application control that expands on the basic QA control. It is used to retrieve a date and allows for a wide range of input possibilities. Application controls can reduce development time because much of the user interaction is built directly into them.

Table 2.2. Speech Application Controls available in the Speech tab of the toolbox. These controls can reduce development time by building in typical user interactions.
Control Name	Description
ListSelector	Databound control that presents the user with a list of items and asks user to select one.
DataTableNavigator	Databound control that the user navigates with commands such as Next, Previous, and Read.
AlphaDigit	Collects an alphanumeric string.
CreditCardDate	Collects a credit card expiration date (month and year); does not ensure that it is a future date.
CreditCardNumber	Collects a credit card number and type. Although it does not validate the number, it ensures that the number matches the format for the particular type of credit card.
Currency	Collects an amount in U.S. dollars that falls within a specified range.
Date	Used to collect either a complete date or one broken out into month, day, and year.
NaturalNumber	Collects a natural number that falls within a specified range.
Phone	Collects a U.S. phone number where area code is three numeric digits, number is seven numeric digits, and extension is zero to five numeric digits.
SocialSecurityNumber	Collects a U.S. Social Security number.
YesNo	Collects a yes or no answer.
ZipCode	Collects a U.S. zip code where the zip code is five numeric digits and the extension is four numeric digits.

Creating Custom Controls

If no control does everything you need, you have the option of creating a custom control. Custom controls allow you to expand on the functionality already available with the built-in speech controls. Utilizing the concept of inheritance, custom controls are created using the ApplicationControl class and the IDtmf interface. The developer will create a project file that is compiled into a separate DLL for each custom control.

The Samples solution file, installed with the SASDK, includes a project titled ColorChooserControl. The ColorChooserControl project by itself is installed by default in the C:\Program Files\Microsoft Speech Application SDK 1.0\Applications\Samples\ColorChooserControl directory. This project can serve as a template for any custom control you wish to create. The Color Chooser control is a complex control that consists of child QA controls used to prompt the user for a color and then confirm their selection. The grammar and prompts associated with the control are built directly in. This particular control supports voice-only mode.

The ColorChooserControl is a custom control used to control the dialog flow with the user. It demonstrates what considerations must be made when building these types of controls. It is an excellent starting point for anyone wanting to create custom controls.