Developing Telephony/Speech Applications

Speech applications use speech as the primary medium for user input and output. Users provide input (or select among choices) typically through their own natural speech, which is recognized against a speech grammar by an automatic speech recognition (ASR) engine. For output, a combination of synthesized text-to-speech (TTS) and prerecorded prompts is used. Speech applications are most commonly found in the telephony world, where the user interacts with an interactive voice response (IVR) application through a telephone. In addition to speech, the user can employ the phone's 12-key keypad as a secondary input method.

WHAT ARE 3G NETWORKS?

3G (third-generation) networks aim to provide fast, reliable wireless connectivity and will make the mobile Web application delivery model ubiquitous. By enabling faster, more reliable communications for mobile devices, 3G networks open the door to richer interactivity and media delivery. (For instance, you could build a health-monitoring application that delivers on-demand video of a patient to a specialist.)


With the advent of mobile phones, telephony applications can now be used from virtually anywhere. In a number of scenarios, speech-based mobile applications achieve greater applicability, adoption, and usage than mobile device and Web applications. Consider mobile salespeople: they typically travel a lot to meet customers and spend a great amount of time in the car. Speech applications give them an intuitive, hands-free interface to company systems and data: for instance, to connect with a customer, to get directions to a customer's site, or to execute a stock trade.

A PINCH OF SALT

Speech Application Language Tags (SALT) is a set of XML-based tags that can be added to existing Web-based applications to create interactive speech applications. SALT can be used to develop pure telephony-style applications as well as next-generation multimodal applications. Development of the SALT standard has been spearheaded by the SALT Forum, an organization founded by Microsoft, Cisco, SpeechWorks, Philips, Comverse, and Intel. The 1.0 release of the specification can be found at http://www.saltforum.org. Additionally, the SALT specification has been submitted to the W3C (http://www.w3.org) Voice Browser and Multimodal Interaction Working Groups for further standardization.


Developing Speech Applications Using .NET

Why .NET? Interactive voice response (IVR) and touch-tone telephony applications have been around for some time, providing basic connectivity (typically through touch-tone input and prerecorded speech) to telephony users. However, these systems are typically proprietary and have required specialized skills for customization and for integration with back-end applications. Based on the open SALT standard and a large collection of prebuilt speech controls, the Microsoft Speech Application SDK enables .NET developers to build telephony applications using their existing ASP.NET skills. In many ways, developing speech applications in .NET resembles developing mobile applications; the major difference is that the speech controls deliver speech-based interactions rather than data-centric mobile Web interactions.

Hello Speech World

As with Web applications, speech applications can be developed natively using SALT tags embedded in static HTML documents or ASP.NET pages. To accelerate the development of SALT applications, however, the Microsoft Speech Application SDK (SASDK) provides a rich set of controls (Figure 10.11). The SDK can be downloaded from the Microsoft Speech site at www.microsoft.com/speech, and the controls are also available with Visual Studio .NET 2003 for design-time support. To develop a basic Hello World-style interactive speech application, create a new project in Visual Studio .NET using the Speech Web Application project type, which is available for Visual C# and Visual Basic .NET in the current release (accept the default properties, which make the project a voice-only application, enable the Telephony Application Simulator [Figure 10.12], and create an empty grammar). Next, drag and drop a Panel control onto the page. To the panel, add a QA control, which is really the core control for enabling interaction, and set its inline prompt to "Hello Speech World". Finally, add a DisconnectCall control to end the call. Your basic telephony application is done. Listing 10.9 shows the resulting speech-enabled page.

Listing 10.9 Speech-enabled Web Page
<%@ Register TagPrefix="speech" Namespace="Microsoft.Speech.Web.UI"
    Assembly="Microsoft.Speech.Web" %>
<%@ Page language="c#" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
   <body xmlns:speech="http://schemas.microsoft.com/speech/WebControls">
      <form id="Form1" method="post" runat="server">
         <speech:AnswerCall id="AnswerCall1" runat="server"></speech:AnswerCall>
         <asp:Panel id="HelloWorldPanel" runat="server">
            <speech:QA id="HelloQA" runat="server" PlayOnce="True">
               <Prompt
                 InlinePrompt="Hello to the World of Speech Applications">
               </Prompt>
            </speech:QA>
         </asp:Panel>
         <speech:DisconnectCall id="DisconnectCall1" runat="server"/>
      </form>
   </body>
</HTML>
Figure 10.11. Building speech applications using Microsoft Speech Application SDK.

Figure 10.12. Using the Telephony Application Simulator.

After you have developed your speech-enabled application, you are ready to test and debug it, and the SASDK provides a set of tools for doing so. By default, a speech-enabled Web page opens in Microsoft Internet Explorer with the speech add-in, which is installed with the SASDK. You should hear the magical "Hello world..." in a machine-generated voice (because TTS was used).

The ultimate objective of developing a telephony/speech application is to use it with a phone. That requires a good deal of software and hardware, including connectivity with the Plain Old Telephone Service (POTS). Before you get to that stage, you should test in an environment that closely resembles the telephony world. The Telephony Application Simulator provides such a tool for testing telephony applications. As with a normal phone, you dial the application (using a dummy number) and are then connected to it. If any user input is required, the simulator provides two options: a shortcut text-entry mechanism, or invoking the desktop microphone to recognize spoken audio. I highly recommend working with a good-quality external microphone if you want to test with actual user speech; I used a USB microphone with my laptop.

Speech Controls

.NET application development is really the assembly of controls, and this is true even for telephony applications. The SASDK provides a rich set of controls for basic speech recognition, text-to-speech synthesis, and sound recording, as well as prebuilt controls for capturing user input such as Social Security numbers, credit card numbers, phone numbers, and so on. It is important to realize that although capturing similar input is a relatively straightforward task in desktop, Web, and even mobile device applications, recognizing structured speech input is quite a challenging task. Table 10.2 shows a set of common speech controls.

Table 10.2. Speech Controls

Basic Controls

SpeechControl: Base class of all speech controls.
Prompt: Implements SALT's <prompt> element; plays TTS or prerecorded audio.
Listen: Implements SALT's <listen> element; performs speech recognition/recording.

Dialog Controls

AnswerCall: Answers incoming telephony calls.
CallInfo: Provides information about the current call, such as Caller ID.
Command: Recognizes unprompted user speech; typically used for Help and Cancel phrases.
DialogPrompt: Defines prompts (either TTS speech or prerecorded prompts).
DisconnectCall: Disconnects the current telephony call.
DTMF: Accepts DTMF (Dual Tone Multi Frequency) input, that is, input from a phone's 12-key keypad; typically used for entering numbers accurately.
Grammar: Provides an inline or external URL-based speech grammar, used by a recognition control (QA/Command).
MakeCall: Initiates a telephony call; used in scenarios where the application locates people (such as a telephone directory) or connects the caller with a human customer service representative.
QA: Primary dialog control; typically associated with a Grammar/DTMF control to recognize user input and provide the result to SemanticItem controls.
Reco: Provides speech recognition/recording capability; typically used in the context of a QA control.
Record: Records user input instead of recognizing it; typically used to take a message and send it to another system or user.
SemanticMap, SemanticItem: SemanticMap is a container of SemanticItem controls, which hold the results of speech recognition from associated controls.
SmexMessage: Sends messages specific to the underlying telephony platform.
TransferCall: Transfers the current telephony call.

Validation Controls

BaseValidator: Base control for all validators.
CompareValidator: Compares the value of a control with that of another control or with a constant.
CustomValidator: Provides a mechanism for custom server-side validation.

Application Speech Controls

AlphaDigit: Provides a prebuilt dialog for alphanumeric string input.
CreditCardDate: Provides a prebuilt dialog for credit card expiration date input.
CreditCardNumber: Provides a prebuilt dialog for credit card number input.
Currency: Provides a prebuilt dialog for currency (U.S. dollars) input.
Date: Provides a prebuilt dialog for date input.
NaturalNumber: Provides a prebuilt dialog for number input.
Phone: Provides a prebuilt dialog for phone number input.
SocialSecurityNumber: Provides a prebuilt dialog for U.S. Social Security number (SSN) input.
YesNo: Provides a prebuilt dialog for yes/no input.
ZipCode: Provides a prebuilt dialog for U.S. ZIP code input.
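To illustrate how the basic controls compose, the following sketch wires a QA control to a grammar and a SemanticItem, following the markup style of Listing 10.9. The grammar file name (MainMenu.grxml), the XPathTrigger path, and the control IDs are illustrative assumptions and may differ in your SDK release:

```xml
<!-- SemanticItem receives the recognized value (IDs are hypothetical) -->
<speech:SemanticMap id="TheSemanticMap" runat="server">
   <speech:SemanticItem id="siChoice" runat="server" />
</speech:SemanticMap>

<!-- QA: prompt the caller, recognize against a grammar file,
     and bind the recognition result into siChoice -->
<speech:QA id="MenuQA" runat="server">
   <Prompt InlinePrompt="Say order entry, order status, or customer service." />
   <Reco>
      <Grammars>
         <speech:Grammar Src="MainMenu.grxml" runat="server" />
      </Grammars>
   </Reco>
   <Answers>
      <speech:Answer SemanticItem="siChoice"
         XPathTrigger="/SML/Cmd" runat="server" />
   </Answers>
</speech:QA>
```

The XPathTrigger corresponds to the Cmd property that a grammar such as the one in Listing 10.10 assigns to the recognition result.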

Grammar

A speech grammar specifies a set of utterances that a user may speak to perform an action or supply information, and it provides a corresponding string value or set of attribute-value pairs describing that information or action (see Section 10.1 of the specification). SALT and the SASDK allow developers to create grammars both for spoken input and, via DTMF, for input through touch-tone key presses.

The grammar for an input element is specified by a <grammar> and/or <dtmf> tag.
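As a sketch, a DTMF grammar uses the same SRGS syntax as a voice grammar but declares mode="dtmf"; the rule name and digit choices below are illustrative. This fragment would let a caller press 1, 2, or 3 on the keypad:

```xml
<!-- DTMF grammar: mode="dtmf" instead of mode="voice" -->
<grammar mode="dtmf" version="1.0"
    xmlns="http://www.w3.org/2001/06/grammar">
   <rule id="MenuKeys" scope="public">
      <one-of>
         <item>1</item>
         <item>2</item>
         <item>3</item>
      </one-of>
   </rule>
</grammar>
```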

As part of its effort to develop standards for speech-based interaction with Web-based applications, the W3C has established a Voice Browser Activity. One of the standards delivered by this group is the Speech Recognition Grammar Specification for the W3C Speech Interface Framework, which defines a grammar syntax for use in speech recognition systems. The syntax is described in two formats: an XML-based syntax (with an associated DTD) and a traditional augmented BNF (ABNF) syntax. This grammar specification has been used and extended by SALT and the SASDK. For instance, Listing 10.10 shows the grammar for the main menu of a customer service application that allows the user to say one of three phrases: order entry, order status, or customer service.

Listing 10.10 Grammar for a Main Menu
<grammar
    xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
    xml:lang="en-US" tag-format="semantics-ms/1.0" version="1.0"
    mode="voice" xmlns="http://www.w3.org/2001/06/grammar">
   <rule id="CommandRule" scope="public">
      <one-of>
         <item>Order Entry</item>
         <item>Order Status</item>
         <item>Customer Service</item>
      </one-of>
      <tag>$.Cmd = $recognized.text</tag>
   </rule>
</grammar>

Because possible user inputs are split into a series of tokens, each with its own set of possible utterances, a speech grammar can become quite complex. For instance, a flight-booking application could have a grammar that recognizes full sentences such as "I want to go from Newark, NJ to San Francisco, CA on September 30," with the cities and the date being the important tokens to recognize. The SASDK also provides a set of prebuilt grammars for commonly used scenarios. Also included with the SASDK is design-time support for visually developing and testing speech grammars within the Visual Studio .NET environment (Figure 10.13).
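As a simplified illustration of this tokenization, the fragment below (hypothetical rule names, and only two cities) composes a sentence rule from a reusable City rule via <ruleref>; a real grammar would also attach semantic tags, as in Listing 10.10, to extract the city and date values:

```xml
<grammar xml:lang="en-US" version="1.0" mode="voice"
    xmlns="http://www.w3.org/2001/06/grammar">
   <!-- top-level sentence rule: fixed phrases around city tokens -->
   <rule id="FlightRequest" scope="public">
      <item>I want to go from</item>
      <ruleref uri="#City" />
      <item>to</item>
      <ruleref uri="#City" />
   </rule>
   <!-- reusable token rule listing the recognizable cities -->
   <rule id="City">
      <one-of>
         <item>Newark New Jersey</item>
         <item>San Francisco California</item>
      </one-of>
   </rule>
</grammar>
```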

Figure 10.13. Developing speech grammar.

Multimodal Applications: Beyond Standalone Speech and Web Applications

Multimodality means that you can use more than one mode of user interface with an application, much like normal human communication. For instance, consider an application that provides driving directions. Although it is typically easier to speak the start and destination addresses (or, even better, shortcuts such as "my home," "my office," or "my doctor's office," based on a previously established profile), the turn-by-turn directions are best viewed on a map, similar to what you are used to seeing on MapPoint.

In essence, a multimodal application, when executed on a desktop device, would be very similar to MapPoint but would also let the user talk and listen to the system for parts of the application's input/output, for example, the starting and destination addresses. That's multimodal. Imagine the same application using the same interface on a wirelessly connected PDA. Now you're talking about a true mobile multimodal application. If you let your imagination go a little wilder, you can easily extend the same application to the dashboard of your car or to any other device. That is really the vision, and given the current state of technology, it isn't far away. Yet another modality that could be added to this example is a pointing device that zooms the map in on a particular location.

The Microsoft SASDK supports building both telephony and multimodal applications. Building a multimodal application amounts to creating a new speech Web application using the multimodal application model: you develop the normal Web application and then add speech controls to introduce speech interactivity. The Speech Application SDK includes a plug-in for Microsoft Pocket Internet Explorer that can be used to run multimodal applications on a Pocket PC. So, for instance, a multimodal flight reservation application would let the user enter travel dates either by selecting them in a calendar control or by responding to the appropriate prompts using speech recognition.
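To make this concrete, the sketch below pairs an ordinary ASP.NET textbox with a QA control in a tap-and-talk style, binding the recognized value back into the textbox through a SemanticItem. The event wiring (StartEvent/StopEvent), the TargetElement/TargetAttribute binding, the grammar file name, and the XPath are all assumptions about the multimodal flavor of the SASDK markup and may differ in practice:

```xml
<!-- hypothetical multimodal fragment: press the textbox, speak a date,
     and the recognized text appears in the textbox -->
<asp:TextBox id="DepartureDate" runat="server" />
<speech:SemanticMap id="MMSemanticMap" runat="server">
   <!-- bind the recognized value to the textbox (attribute names assumed) -->
   <speech:SemanticItem id="siDate" runat="server"
      TargetElement="DepartureDate" TargetAttribute="value" />
</speech:SemanticMap>
<speech:QA id="DateQA" runat="server">
   <!-- recognition starts/stops on mouse events instead of a dialog flow -->
   <Reco StartEvent="onmousedown" StopEvent="onmouseup">
      <Grammars>
         <speech:Grammar Src="Dates.grxml" runat="server" />
      </Grammars>
   </Reco>
   <Answers>
      <speech:Answer SemanticItem="siDate"
         XPathTrigger="/SML/Date" runat="server" />
   </Answers>
</speech:QA>
```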

Microsoft Speech Server 2004

Now that you have developed speech applications and tested them with the desktop simulators, how do you actually use them over a telephony device, such as an ordinary phone or a mobile phone? This is where deployment of speech applications comes into play. Microsoft Speech Server 2004 (in limited beta release at the time of this writing) is the answer (see Figure 10.14). Speech Server provides the key server components for deploying telephony and multimodal applications. It runs on top of the Windows Server 2003 platform and provides the required speech recognition, speech synthesis (TTS), and connectivity to PBXs and telephony lines.

Figure 10.14. Deploying speech applications using Microsoft Speech Server.

Application Scenarios

A number of applications are good candidates for speech enablement. For instance, most of us use email as a basic collaboration medium, and many companies have invested in systems such as Microsoft Exchange Server as a collaboration platform. In addition to email, Exchange Server provides calendar, address book, and task capabilities. Through a speech-enabled corporate portal, employees can access their critical emails, contacts, and calendar entries anytime and anywhere over an ordinary phone.

A number of medium-to-large businesses use the power and flexibility of ERP systems from vendors such as SAP, PeopleSoft, Oracle, Baan, J. D. Edwards, and Microsoft Great Plains Software. For instance, a company that has invested in an HRMS such as PeopleSoft could give employees tremendous flexibility through a speech-enabled self-service application that lets them review their personnel profiles, participate in benefits enrollment, review pay stubs, and even submit time sheets and expenses. Many of these applications now expose Web services, and through .NET's Web services support, speech-enabled applications can leverage them as well.

SHOP TALK: EMPLOYEE DIRECTORY, YOUR FIRST TELEPHONY APPLICATION

A very useful yet basic telephony speech-enabled application that you can develop for your own company or your customers is an employee directory that also serves as a global assistant, connecting people to people. An employee can dial an 800 number, connect to the speech-based employee directory application, and get connected to anyone in the company ("connect me with Hitesh on his cell phone") instead of searching for an individual in a corporate directory by entering the last four digits of the person's last name. In fact, the Speech Application SDK includes a prebuilt Contacts example that implements most of the employee directory functions. I have always used an employee directory as a starter application to illustrate the benefits of interactive telephony applications and highly recommend that you build or customize one for your own organization. Also, in my experience, even though multimodal applications sound great and have potential uses, it is telephony applications that really bring out the technology and business benefits of speech platforms.




Microsoft .NET Kick Start
ISBN: 0672325748
Year: 2003
Pages: 195
Author: Hitesh Seth