The SASDK provides a template for creating new speech applications with Visual Studio .NET. It also provides visual editors for building the prompts (words spoken to the user) and grammars (words spoken by the user). This section will examine the basics of creating a speech application with the SASDK.
To utilize the template provided with the SASDK, open Visual Studio .NET and execute the following steps:
If you choose to build a voice-only application, the project will include a Web page named Default.aspx. This page contains two speech controls, AnswerCall and SemanticMap. These are basic controls used in every voice-only application. Their specific functions will be covered in the section titled "Using Speech Controls." The default project will also include a folder named Grammars that contains two grammar files, Library.grxml and SpeechWebApplication1.grxml. For voice-only applications the prompt project and Grammars folder are included by default.
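As a sketch of what the generated Default.aspx might contain (the tag prefix, namespace, and assembly names below are assumptions for illustration; the exact markup depends on your SDK version):

```aspx
<%@ Page Language="C#" %>
<%@ Register TagPrefix="speech" Namespace="Microsoft.Speech.Web.UI"
    Assembly="Microsoft.Speech.Web" %>
<html>
  <body>
    <form id="Form1" runat="server">
      <!-- Answers the incoming telephone call -->
      <speech:AnswerCall id="AnswerCall1" runat="server" />
      <!-- Holds the semantic items collected during the dialog -->
      <speech:SemanticMap id="SemanticMap1" runat="server" />
    </form>
  </body>
</html>
```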
If you choose to build a multimodal application, the Default.aspx page is included, but it will contain no controls. There will be a Grammars folder, but no prompt project will be created.
By default, the Manifest.xml file is included for both project types. It is an XML-based file that contains references to the resources used by the project. References include grammar files and prompt projects. Speech Server will preload and cache these resources to help improve performance.
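A Manifest.xml file might look something like the following; the element and attribute names here are illustrative assumptions, so consult the generated file for the exact schema:

```xml
<!-- Sketch only; the real schema is defined by Speech Server. -->
<manifest>
  <!-- Grammar files to preload and cache -->
  <resource uri="Grammars/SpeechWebApplication1.grxml" />
  <resource uri="Grammars/Library.grxml" />
  <!-- Compiled prompt database -->
  <resource uri="Prompts/SpeechWebApplication1.prompts" />
</manifest>
```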
The Prompt Editor
Microsoft recommends that you prerecord static prompts because voice recordings sound more natural than the output of the text-to-speech engine. The prompt editor (see Figure 2.4) is a tool that allows you to specify potential prompts and record the wave files associated with each prompt.
Figure 2.4. Screenshot of the prompt editor in the prompt database project. The prompt editor is used to record the wave files associated with each prompt. The screenshot includes four different prompts.
The utterance "Welcome to my speech application" represents a single prompt. For voice-only applications, you need to make sure you include a wide range of prompts. Since the user relies on these prompts to understand how the application works, they need to be clear and meaningful.
The Prompt Database
An application built with the Speech SDK wizard includes a prompt database project by default. If you need another prompt database, you can add one by selecting Add Project and then New Project from the File menu (see Figure 2.5). The new project will be based on the Prompt Project template. Once the project is added, a new prompt database can be added by right-clicking the prompt project and selecting Add and then Add New Item. The Prompt Database item opens a data grid style screen that allows you to specify all the potential prompts.
Figure 2.5. Screenshot of the dialog used to add a new prompt project to your speech application. This dialog is accessed by clicking Add project from the File menu and then clicking New Project.
The prompt database contains all the prerecorded utterances used to communicate with the user. An application can reference more than one prompt database. One reason for doing this is ease of maintenance. Prompts that change often can be placed in a separate prompt database. By restricting the size of the prompt database, the amount of time needed to recompile is minimized.
If you followed the instructions in the last section to create a new speech project, you can now open the default prompt database by double-clicking the prompts file from Solution Explorer.
Transcriptions and Extractions
Figure 2.6 is a screenshot of the recording pane in the prompt project database. There are two grids in a prompt project. The top one contains transcriptions, and the bottom one extractions. Transcriptions are the individual pieces of speech that relate to a single utterance. Extractions combine transcription elements to form phrases; they are created by placing square brackets around the transcription elements, for example, [I heard you say] [Sara Rea].
Figure 2.6. Contents of the Recording pane in the prompt database project. Transcriptions are the individual pieces of speech that can be prerecorded. No utterances have been recorded for prompts with a red X in the Has Wave column.
Sometimes a prompt involves more than one transcription element, as in "I heard you say Sara Rea." In this case, the two elements are "I heard you say" and "Sara Rea." In some cases employee names may also be prerecorded in the prompt database. This adds an additional burden, because every time a new employee is added to the database, someone must record the employee's name. Doing so, however, prevents the speech engine from falling back on text-to-speech (TTS) to render the prompt, which is preferable because recordings result in a more natural-sounding prompt.
Prompts are controlled by prompt functions. These functions programmatically indicate what phrases are spoken to the user. When the speech engine is passed a phrase from a function, it first searches the prompt database to see whether any prerecorded utterances are present. It searches the entire database for matches and will string together as many transcription elements as necessary to assemble the entire phrase.
Because the speech engine pieces transcription elements together to form phrases, you can break phrases up to prevent redundancy. For instance, the phrase "Sorry, I am having trouble hearing you. If you need help, say help" may be spoken when an application encounters silence. The phrase "Sorry, I am having trouble understanding you. If you need help, say help" is used whenever the speech engine does not recognize the user's response. The subphrase "If you need help, say help" can therefore be recorded as a separate phrase in the prompt database, so it only has to be recorded once. In addition, the size of the prompt database is minimized.
The Recording Tool
The Recording Tool can be accessed by clicking the red circle icon above the Transcription pane or by clicking Prompt and then Record All. The text from the transcription item selected is displayed in the Display Text textbox (see Figure 2.7). After clicking Record, the person making the recording should speak clearly into the microphone. Click Stop as soon as the entire phrase is spoken. Try to select a recording location where background noise is minimized.
Figure 2.7. The Recording tool allows you to directly record each prompt associated with a transcription. Prompts can also be recorded by professional voice talent in a studio, made into wave files, and imported.
In some cases, you may want to utilize professional voice talent to make recordings. There are third-party vendors, such as ScanSoft (see the "ScanSoft" profile box), that can provide professional voice talent and assistance with recordings. Wave files created in a recording studio can be associated with a specific transcription element by clicking Import and browsing to the file's location.
If the speech engine is unable to find a match in any of the prompt databases, it falls back on TTS. The result is a machine-like voice that may work against the natural interface you are trying to create. Speech Server comes bundled with ScanSoft's Speechify TTS engine (see the "ScanSoft" profile box), but at present the results from a text-to-speech engine are not as natural-sounding as a recorded human voice. On the other hand, it will not always be possible or practical to prerecord all utterances. You will have to weigh these options when designing your speech application.
The recording of prompts is a major consideration when designing a speech-enabled application. If professional talent is used, you will want to try to minimize the need for multiple recording sessions. If the application requires the utilization of text-to-speech for most prompts, you may want to consider purchasing a third-party TTS add-in.
The Grammar Editor
Grammars are the reverse of prompts: they represent what the user says to the application. They are a key element of voice-only applications, which rely entirely on accurate understanding of the user's commands. The grammar editor builds Extensible Markup Language (XML) files that the speech-recognition engine uses to interpret the user's speech. What is nice about the grammar editor is that you drag and drop controls to build the XML instead of typing it in directly, which reduces the time spent building grammars.
A grammar is stored as an XML file with a .grxml extension. Each Question/Answer (QA) control, representing an interaction with the user, is associated with one or more grammars. A single grammar file contains one or more rules that the application uses to interpret the user's response.
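A .grxml file follows the W3C Speech Recognition Grammar Specification (SRGS) XML format. A minimal hand-written rule might look like the following sketch; the rule name and word list are invented for illustration:

```xml
<grammar version="1.0" xml:lang="en-US" root="YesNo"
         xmlns="http://www.w3.org/2001/06/grammar">
  <!-- Rule that recognizes a simple yes/no answer -->
  <rule id="YesNo" scope="public">
    <one-of>
      <item>yes</item>
      <item>yeah</item>
      <item>no</item>
      <item>nope</item>
    </one-of>
  </rule>
</grammar>
```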
In order to increase the speed of the application, grammars can be compiled into .cfg files using the grammar compiler. This is typically done after the application is deployed, because that is when you will obtain the most benefit. Using compiled grammar files instead of text files reduces the size of files and thus the amount of time required to download them. Most important, it allows the speech engine to load files into memory faster.
Grammars are compiled with a command-line utility named SrGSGc.exe, installed by default in Program Files\Microsoft Speech Application SDK 1.0\SDKTools\bin. The tool can be accessed by clicking Start|Programs|Microsoft Speech Application SDK 1.0|Debugging Tools|Microsoft Speech Application SDK Command Prompt. From the command prompt, type SrGSGc.exe followed by the name of the output file (.cfg) and then the name of the input file (.grxml), for example, SrGSGc.exe MyGrammar.cfg MyGrammar.grxml.
To add a grammar file, click Add New Item on the Project menu, select the Grammar File category, and name the file accordingly. Existing grammars can be viewed by expanding the Grammars folder within Solution Explorer. By default, two grammar files are added when you create a voice-only or multimodal application. The first, Library.grxml, contains common grammar rules you may need to utilize. For instance, it includes a rule for collecting yes/no responses (see Figure 2.8), as well as rules for handling numbers, dates, and even credit card information. Rules embedded within the library grammar file can be referenced in other grammar files through the RuleRef control.
Figure 2.8. Screenshot displaying the yes/no rule inside the grammar editor. This is one of several rules included by default with the Library.grxml file.
The second grammar file is named the same as the project file by default. This is where you will place the grammar rules associated with your application. Although you could store all the rules in a single file, you may want to consider adding subfolders within the main Grammars folder. You can then create multiple grammar files to group similar types of grammar rules. This helps to organize code and makes referencing grammar rules easier.
Grammar rules are built by dragging elements onto the page. Controls are available in the Grammar tab of the toolbox. Figure 2.9 is a screenshot of these grammar controls. Most rules will consist of one or more of the following:
Figure 2.9. Screenshot of the Grammar tab, available in the toolbox when creating a new grammar. The elements you will use most often are the List, Phrase, RuleRef, and Script Tag elements.
The grammar editor (see Figure 2.8) contains a textbox called Recognition String. When dealing with complex rules, it can be used to test the rule without actually running the application. This is very useful when you are building the initial grammar set. To use this feature, just enter text that you would expect the user to say and click Check. The output window will display the Semantic Markup Language (SML), which is the XML generated by the speech engine and sent to the application. If the text was recognized, you will see "Check Path test successfully complete" at the bottom of the output window.
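For example, checking the text "yes" against a yes/no rule might produce SML along these lines. The element name and attribute values here are illustrative assumptions; the actual output depends on the rule's semantic script:

```xml
<SML confidence="1.000" text="yes" utteranceConfidence="1.000">
  <YesNo confidence="1.000">Yes</YesNo>
</SML>
```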
Do not use quotation marks when entering text in the Recognition String textbox. Doing so will cause the speech engine not to recognize the text.
The Script Tag element is used to assign the user's response to a semantic item. The properties for a Script Tag include an ellipsis button that opens the Semantic Script Editor. This editor helps you create an assignment so that the correct SML result is returned. You can also switch to the Script tab and edit the script directly. Figure 2.10 is a screenshot of the Semantic Script Editor.
Figure 2.10. Screenshot of the Semantic Script Editor that is available when you use a Script Tag element. The Script Tag is used whenever you need to assign the user's response to a semantic item.
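As a hypothetical illustration of the kind of assignment the editor produces, a Script Tag inside a grammar rule might attach a property to the SML result. The "$" notation, property name, and word list below are assumptions for illustration; the exact script syntax is generated by the SASDK and may differ:

```xml
<!-- Sketch of a Script Tag assignment inside a grammar rule. -->
<rule id="Color">
  <one-of>
    <item>red <tag>$.Color = "red"</tag></item>
    <item>blue <tag>$.Color = "blue"</tag></item>
  </one-of>
</rule>
```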
When building grammars, you will probably not anticipate all possible responses on the first pass. Grammars therefore require fine-tuning to make the application as efficient and accurate as possible. This process is easier while grammar files remain uncompiled XML, so you should not compile grammar files until the application has been thoroughly tested and is ready to deploy.
Using Speech Controls
A voice-only application has no visible interface. It runs on IIS as a Web page and is accessed with a telephone. When developing and debugging the application, it is executed within the Web browser, and the Speech Debugging Console provides the developer with information about the application dialog. The user will never see the page, so its visual appearance is unimportant; the only elements on the page will be speech controls, and only the developer will ever see them.
The Speech Application SDK includes several speech controls that are visible from the Speech tab in the Toolbox. These controls will be dragged onto the startup form as the application is built. Figure 2.11 is a screenshot of the speech controls available in the Speech tab of the Toolbox. Speech controls are the basic units for computer-to-human interaction, and the SASDK contains two varieties of controls: dialog and application speech controls.
Figure 2.11. Screenshot of all the speech controls available in the speech tab of the toolbox. The QA control is the most basic unit and is utilized in every interaction with the user. SmexMessage, AnswerCall, TransferCall, MakeCall, RecordSound, and DisconnectCall are only applicable for telephony applications.
Dialog Speech Controls
Table 2.1 is a listing of the dialog speech controls used for controlling the conversational flow with the user. A QA control, the most commonly used control, represents a single interaction with the user in the form of a prompt and a response.
Speech Application Controls
Speech Application Controls are extensions of the basic speech controls. They are used to anticipate common user interaction scenarios. Refer to Table 2.2 for a listing of the application controls included with the SASDK. For instance, the Date control is a speech application control that expands on the basic QA control. It is used to retrieve a date and allows for a wide range of input possibilities. Application controls can reduce development time because much of the user interaction is built directly into them.
Creating Custom Controls
If no control does everything you need, you can create a custom control. Custom controls allow you to expand on the functionality of the built-in speech controls. Using inheritance, a custom control derives from the ApplicationControl class and implements the IDtmf interface. The developer creates a project that is compiled into a separate DLL for each custom control.
The Samples solution file, installed with the SASDK, includes a project titled ColorChooserControl. The ColorChooserControl project by itself is installed by default in the C:\Program Files\Microsoft Speech Application SDK 1.0\Applications\Samples\ColorChooserControl directory. This project can serve as a template for any custom control you wish to create. The Color Chooser control is a complex control that consists of child QA controls used to prompt the user for a color and then confirm their selection. The grammar and prompts associated with the control are built directly in. This particular control supports voice-only mode.
The ColorChooserControl is a custom control used to control the dialog flow with the user. It demonstrates what considerations must be made when building these types of controls. It is an excellent starting point for anyone wanting to create custom controls.