Developing Telephony/Speech Applications

Speech applications use speech as the main medium for user input and output. Users provide input (or select choices) in these applications, typically through their own natural speech, which is recognized against a speech grammar by a speech recognition (ASR) engine. For responses, a combination of synthesized text-to-speech (TTS) output and prerecorded prompts is used. Speech applications are generally used in the telephony world, where the user interacts with an interactive voice response (IVR) application through a telephone. Apart from speech, the user can also use the 12-key keypad of the phone as an additional input method.
With the advent of mobile phones, telephony applications can now be used from anywhere. In a number of scenarios, speech-based mobile applications have greater applicability, adoption, and usage rates than mobile device and Web applications. For instance, consider mobile salespeople. They typically travel a lot to meet customers and spend a great amount of time in the car. Speech applications provide an intuitive, hands-free user interface for interacting with company systems and data: for instance, to connect with a customer, to get directions to a customer's site, or to execute a stock trade.
Developing Speech Applications Using .NET

Why .NET? Interactive voice response (IVR) applications and touch-tone-based telephony systems have been around for some time, providing basic connectivity (typically through touch-tone and prerecorded-speech interfaces) to telephony users. However, these systems are typically proprietary in nature and have required specialized skills for customization and integration with back-end applications. Based on the open SALT standard and a large collection of prebuilt speech controls, the Microsoft Speech Application SDK enables .NET application developers to develop telephony applications easily using their ASP.NET development skills. In a number of ways, developing speech applications using .NET is similar to developing mobile applications; the major difference is that the speech controls are focused on delivering speech-based interactions rather than data-centric mobile Web interactions.

Hello Speech World

Like Web applications, speech applications can be developed natively using SALT tags embedded in static HTML documents or ASP.NET documents. However, to accelerate the development of SALT applications, the Microsoft Speech Application SDK (SASDK) provides a rich set of controls (Figure 10.11). The SDK can be accessed from the Microsoft Speech site at www.microsoft.com/speech. The controls are also available with Visual Studio .NET 2003 for design-time support. To develop a basic Hello World-style interactive speech application, you create a new project in Visual Studio .NET using the Speech Web Application project type, which is available for Visual C# and Visual Basic .NET in the current release (use the default properties, which set the project to be a voice-only application, enable the Telephony Application Simulator [Figure 10.12], and create an empty grammar). Next, drag and drop a panel control onto the page.
To the panel, add the QA control, which is the core control used for enabling interaction. Set the QA control's inline prompt to "Hello Speech World". Add a DisconnectCall control to disconnect the call. Your basic telephony application is done. Listing 10.9 shows a simple speech-enabled page.

Listing 10.9 Speech-enabled Web Page

    <%@ Register TagPrefix="speech" Namespace="Microsoft.Speech.Web.UI"
        Assembly="Microsoft.Speech.Web" %>
    <%@ Page language="c#" %>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
    <HTML>
    <body xmlns:speech="http://schemas.microsoft.com/speech/WebControls">
      <form id="Form1" method="post" runat="server">
        <speech:AnswerCall id="AnswerCall1" runat="server"></speech:AnswerCall>
        <asp:Panel id="HelloWorldPanel" runat="server">
          <speech:QA id="HelloQA" runat="server" PlayOnce="True">
            <Prompt InlinePrompt="Hello to the World of Speech Applications">
            </Prompt>
          </speech:QA>
        </asp:Panel>
        <speech:DisconnectCall id="DisconnectCall1" runat="server"/>
      </form>
    </body>
    </HTML>

Figure 10.11. Building speech applications using Microsoft Speech Application SDK.
Figure 10.12. Using the Telephony Application Simulator.
After you have developed your speech-enabled application, you are ready to test and debug it. The SASDK provides a set of tools you can use. By default, a speech-enabled Web page is opened in Microsoft Internet Explorer with the speech add-in, which is installed with the SASDK. You should hear the magical "Hello world..." in a machine-generated voice (because TTS was used). The ultimate objective of developing a telephony/speech application is to use it with a phone. For that, you need a good deal of software and hardware, including connectivity with the plain old telephone service (POTS). Before you get to that stage, you should test in an environment that closely resembles the telephony world. The Telephony Application Simulator is such a tool for testing telephony applications. As with a normal phone, you dial the application (using a dummy number) and are connected to it. If any user input is required, the Telephony Application Simulator provides two options: a shortcut text-entry mechanism, or invoking the desktop microphone to recognize spoken audio. I highly recommend working with a good-quality external microphone if you want to test with spoken input; I used a USB microphone with my laptop.

Speech Controls

.NET application development is really the assembly of controls, and this is true even for developing telephony applications. The SASDK provides a rich set of controls for basic speech recognition, text-to-speech synthesis, and sound recording, as well as predeveloped controls for getting user input such as Social Security numbers, credit card numbers, phone numbers, and so on. It is important to realize that although getting similar input is a relatively straightforward task in desktop, Web, and even mobile device applications, recognizing structured speech input is quite a challenging task. Table 10.2 shows a set of common speech controls.

Table 10.2. Speech Controls
Grammar

A speech grammar specifies the set of utterances that a user may speak to perform an action or supply information, and provides a corresponding string value or set of attribute-value pairs to describe that information or action (see Section 10.1 of the specification). The SASDK and SALT allow developers to create grammars both for spoken input and, via DTMF grammars, for input through touch-tone key presses. The grammar for an input element is specified by a <grammar> and/or <dtmf> tag. As part of the effort to develop standards for speech-based interaction with Web-based applications, the W3C has established a Voice Browser Activity. One of the standards delivered by the group is the Speech Recognition Grammar Specification of the W3C Speech Interface Framework, which defines grammar syntax for use in speech recognition systems. The syntax of the grammar is described in two formats: an XML-based syntax (with an associated DTD) and a traditional augmented BNF (ABNF) syntax. This grammar specification has been used and extended by SALT and the SASDK. For instance, Listing 10.10 is the grammar for the main menu of a customer service application that allows the user to say one of three phrases: order entry, order status, or customer service.

Listing 10.10 Grammar for a Main Menu

    <grammar xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
             xml:lang="en-US" tag-format="semantics-ms/1.0" version="1.0"
             mode="voice" xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="CommandRule" scope="public">
        <one-of>
          <item>Order Entry</item>
          <item>Order Status</item>
          <item>Customer Service</item>
        </one-of>
        <tag>$.Cmd = $recognized.text</tag>
      </rule>
    </grammar>

Because possible user inputs are split into a series of tokens, each with its own set of possible utterances, speech grammars can become quite complex.
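To see how such a grammar connects back to the page, the Listing 10.10 grammar could be saved to a file and attached to a QA control's recognition element. The following is a hedged sketch only: the control ids and the grammar file path (Grammars/MainMenu.grxml) are illustrative assumptions, and the exact attribute set of the SASDK controls may differ slightly from what is shown.

```xml
<!-- Illustrative sketch (not a verbatim SDK sample): a QA control that
     plays a menu prompt and listens using the Listing 10.10 grammar.
     The ids and the path Grammars/MainMenu.grxml are assumptions. -->
<speech:QA id="MainMenuQA" runat="server">
  <Prompt InlinePrompt="Say order entry, order status, or customer service.">
  </Prompt>
  <Reco id="MainMenuReco">
    <Grammars>
      <speech:Grammar Src="Grammars/MainMenu.grxml" runat="server"/>
    </Grammars>
  </Reco>
</speech:QA>
```

At runtime, a successful recognition yields a semantic (SML) result in which the CommandRule's <tag> has set the Cmd property to the recognized phrase, which the application can then inspect to branch the dialog.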
For instance, a flight-booking application could have a complex grammar that recognizes full sentences such as "I want to go from Newark, NJ to San Francisco, CA on September 30," the cities and dates being the important tokens to be recognized. The SASDK also provides a set of prebuilt grammars for commonly used scenarios. Also included with the SASDK is design-time support for visual development and testing of speech grammars within the Visual Studio .NET environment (Figure 10.13).

Figure 10.13. Developing speech grammar.
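A sentence-level grammar of that kind is usually built by composing smaller rules with rule references, which the SRGS XML syntax supports via <ruleref>. The sketch below uses the same grammar syntax as Listing 10.10; the rule names and the short city and date lists are assumptions for illustration, and a real application would reference much larger (or prebuilt) rules instead.

```xml
<!-- Illustrative composite grammar: a sentence rule that references
     smaller City and Date rules. Rule names and item lists are
     assumptions for this sketch. -->
<grammar version="1.0" xml:lang="en-US" mode="voice" root="FlightRequest"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="FlightRequest" scope="public">
    <item>I want to go from</item>
    <ruleref uri="#City"/>
    <item>to</item>
    <ruleref uri="#City"/>
    <item>on</item>
    <ruleref uri="#Date"/>
  </rule>
  <rule id="City">
    <one-of>
      <item>Newark New Jersey</item>
      <item>San Francisco California</item>
    </one-of>
  </rule>
  <rule id="Date">
    <!-- In practice this would reference a prebuilt date grammar -->
    <one-of>
      <item>September thirtieth</item>
      <item>October first</item>
    </one-of>
  </rule>
</grammar>
```

Factoring the tokens into separate rules keeps each rule small and lets the City and Date rules be reused wherever those tokens can appear in a sentence.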
Multimodal Applications: Beyond Standalone Speech and Web Applications

Multimodality means that you can use more than one mode of user interface with an application, much like normal human communication. For instance, consider an application that provides driving directions. Although it is typically easier to speak the start and destination addresses (or, even better, shortcuts like "my home," "my office," or "my doctor's office," based on a previously established profile), turn-by-turn directions are best viewed on a map, similar to what you are used to seeing in MapPoint. In essence, a multimodal application, when executed on a desktop device, would be very similar to MapPoint but would also allow the user to talk and listen to the system for parts of the application's input and output, for example, the starting and destination addresses. That's multimodal. Imagine the same application using the same interface on a wirelessly connected PDA. Now you're talking about a true mobile multimodal application. If you let your imagination go a little wilder, you can easily extend the same application to the dashboard of your car or any other device. That's really the vision, which, given the current state of technology, isn't far away. Another modality that can be added to the example application is a pointing device that zooms the map to focus on a particular location. The Microsoft SASDK supports building both telephony and multimodal applications. Building a multimodal application really means creating a new speech Web application using the multimodal application model. Put another way, developing a multimodal application involves developing a normal Web application and then adding the speech controls to introduce speech interactivity. The Speech Application SDK includes a plug-in for Microsoft Pocket Internet Explorer that can be used to run multimodal applications on a Pocket PC.
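Since SALT tags can also be embedded directly in HTML, the tap-and-talk pattern at the heart of a multimodal page can be sketched at the SALT level: a standard page element is paired with a <listen> element whose <bind> copies the recognized value into it. This is a hedged illustration based on the SALT model, not SDK-generated output; the element ids, the grammar path, and the /SML/City path are assumptions.

```xml
<!-- Illustrative SALT fragment: tapping the button starts recognition,
     and <bind> copies the recognized city from the SML result into the
     text box. Ids, grammar path, and /SML/City are assumptions. -->
<input type="text" id="txtDestination"/>
<input type="button" value="Speak" onclick="recoCity.Start()"/>
<salt:listen id="recoCity">
  <salt:grammar src="Grammars/Cities.grxml"/>
  <salt:bind targetelement="txtDestination" targetattribute="value"
             value="/SML/City"/>
</salt:listen>
```

The user can always type into the text box directly; the speech path is an additional modality layered onto the same page element rather than a replacement for it.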
So, for instance, a multimodal flight reservation application would allow the user to enter dates either by selecting them in a calendar control or by responding to the appropriate prompts using speech recognition.

Microsoft Speech Server 2004

Now that you have developed speech applications and tested them using the desktop simulators, how do you start actually using them over a telephony device, such as an ordinary phone or a mobile phone? That is where deployment of speech applications comes into play. Microsoft Speech Server 2004 (which was in limited beta release at the time of this writing) is the answer (see Figure 10.14). Speech Server provides the key server components for deploying telephony and multimodal applications. It runs on top of the Windows Server 2003 platform and provides the required speech recognition, speech synthesis (TTS), and connectivity to PBXs and telephony lines.

Figure 10.14. Deploying speech applications using Microsoft Speech Server.
Application Scenarios

A number of applications are good candidates for speech application development. For instance, most of us use email as a basic collaboration medium, and many companies have invested in collaboration platforms such as Microsoft Exchange Server. In addition to email, Microsoft Exchange Server also provides calendar, address book, and task capabilities. Through a speech-enabled corporate portal, employees can access their critical email, contacts, and calendar anytime and anywhere over an ordinary phone. A number of medium-to-large businesses use the power and flexibility of ERP systems from vendors such as SAP, PeopleSoft, Oracle, Baan, J.D. Edwards, and Microsoft Great Plains Software. For instance, a company that has invested in an HRMS system such as PeopleSoft could provide tremendous flexibility to employees through a speech-enabled, self-service application that lets them review their personnel profile, participate in benefits enrollment, review pay stubs, and even submit time sheets and expenses. Many of these applications are now exposed as Web services, and with .NET Web services support, they can be leveraged from speech-enabled applications as well.