The Microsoft Speech Application SDK (SASDK)


The Microsoft Speech Application SDK (SASDK), version 1.0, enables developers to create two basic types of applications: telephony (voice-only) and multimodal (text, voice, and visual). This is not the first speech-based SDK Microsoft has developed. However, it is fundamentally different from the earlier ones because it is the first to comply with an emerging standard known as Speech Application Language Tags, or SALT (refer to the "SALT Forum" profile box). Run from within the Visual Studio .NET environment, the SASDK is used to create Web-based applications only.

Speech-based applications offer more than just touch-tone access to account information or call center telephone routing. They give the user a natural interface to a vast amount of information. Interactions with the user involve both the recognition of speech and the reciting of static and dynamic text. Existing applications can be enhanced by offering the user the choice of either traditional input methods or speech.

Development time is significantly reduced with the use of a familiar interface inside Visual Studio .NET. Streamlined wizards allow developers to quickly build grammars and prompts. In addition, applications developed for telephony access can utilize the same code base as those accessed with a Web browser.

The SASDK makes it easy for developers to utilize speech technology. Graphical interfaces and drag-and-drop capabilities mask all the complexities behind the curtain. All the .NET developer needs to know about speech recognition is how to interpret the resulting confidence score.

SALT Forum

Founded in 2001 by Cisco, Comverse, Intel, Microsoft, Philips, and ScanSoft, the SALT Forum now has over seventy contributors and adopters. Their collaboration has resulted in the creation and refinement of the Speech Application Language Tags (SALT) 1.0 specification. SALT is a lightweight extension of other markup languages such as HTML and XML. The specification standardizes the way devices such as laptops, personal digital assistants (PDAs), phones, and Tablet PCs access information using speech. It enables multimodal (text, voice, and visual) and telephony (voice-only) access to applications.

The forum operates a Web site at www.saltforum.org, which provides information and allows individuals to subscribe to a SALT Forum newsletter. The site also provides a developer forum that contains a download of the latest specification along with tutorials and sample code. Membership is open to all, and interested companies can download a SALT Adopters agreement.

SALT is based on a small set of XML elements. The main top-level elements for output and input are as follows:

  • <prompt> A prompt is a message that the system plays to the user to ask for input. This tag is used to specify the content of audio output and can point to prerecorded audio files. One or more prompt objects can be queued and played back through the PromptQueue object.

  • <listen> Input element used for speech recognition or audio recording. It contains the grammar element, used to specify the different things a user can say, and the record element, used to configure the recording process.

  • <dtmf> Short for Dual Tone Multi-Frequency. Element used in telephony applications. It is similar to the listen element in that it specifies possible inputs. Like the <listen> element, its main child elements are grammar and bind.

  • <smex> Short for Simple Messaging Extension. Asynchronous element used to communicate with the device. It can be used to receive XML messages from the device through the received property.

SALT supports different browsers by allowing for two modes of operation, object and declarative. The object mode exposes the full interface for each SALT element but is only supported by browsers with event and scripting capabilities. The declarative mode provides a limited interface and is supported by browsers such as those found on portable devices.
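
For example, a minimal declarative-mode page might combine these elements as in the following sketch. The grammar file name, field name, result path, and confidence threshold are illustrative only; the exact shape of the recognition result depends on the platform.

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
      <body>
        <!-- Ordinary HTML field that will receive the recognized value -->
        <input name="txtCity" type="text" />

        <!-- Ask the user a question; the text is synthesized or played
             from a prerecorded audio file -->
        <salt:prompt id="askCity">
          Which city would you like the weather for?
        </salt:prompt>

        <!-- Listen for the answer; the grammar constrains what can be said,
             and bind copies the result into the HTML field, but only when
             the confidence score is high enough (illustrative threshold) -->
        <salt:listen id="recoCity">
          <salt:grammar src="./cities.grxml" />
          <salt:bind targetelement="txtCity" value="//city"
                     test="/@confidence &gt;= 0.6" />
        </salt:listen>
      </body>
    </html>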


Note

VoiceXML 2.x is a simple markup language introduced by the World Wide Web Consortium (W3C) (http://www.w3.org). Like SALT, it is used to create dialogs with a user using computer-generated speech. Both are based on W3C standards.

The VoiceXML specification was created before SALT and was designed to support telephony applications. SALT was designed to run on a wide variety of devices, including PDAs, smartphones, and Tablet PCs.

SALT has a low-level API, whereas VoiceXML has a high-level one. This gives SALT finer-grained control over the interface with the user.

VoiceXML does not natively support multimodal applications and is used primarily for limited IVR applications. Because of this, Microsoft Speech Server does not support VoiceXML. However, everything that can be accomplished with VoiceXML can also be accomplished with SALT.


Telephony Applications

The Microsoft Speech Application SDK enables developers to create telephony applications, in which data can be accessed over a phone. Prior to the Speech Application SDK, one option for creating voice-only applications was the Telephony API (TAPI), version 3.0, which shipped with Windows 2000. This COM-based API allowed developers to build interactive voice systems that communicated over the Public Switched Telephone Network (PSTN) or over existing data networks and the Internet, and it was responsible for handling the communication between telephone and computer.

Telephony application development would further incorporate the Speech Application Programming Interface (SAPI), version 5.1, to provide speech recognition and speech synthesis services. This API is also COM based and was designed primarily for desktop applications. Like TAPI, it does not offer the tools and controls available with the new .NET version. Most important, SAPI is not SALT compliant and therefore does not utilize a common platform.

Telephony applications built with the SASDK are accessed by clients using telephones, mobile phones, or smartphones. They require a third-party Telephony Interface Manager (TIM) to interpret signals sent from the telephone to the telephony card. The TIM then communicates with Telephony Application Services (TAS), the Speech Server component responsible for handling incoming telephony calls (see Figure 2.1). Depending on which version of Speech Server is used, TAS can handle up to ninety-six telephony ports per node, with the ability to add an unlimited number of additional nodes.

Figure 2.1. The main components involved when telephony calls are received. The user's telephone communicates directly with the server's telephony card across the public telephone network. The third-party Telephony Interface Manager (TIM) then communicates with Telephony Application Services (TAS), a key component of Speech Server 2004.


Telephony applications can be voice-only, DTMF (Dual Tone Multi-Frequency) only, or a mixture of the two. DTMF applications involve the user pressing keys on the telephone keypad. This is useful when the user is required to enter sensitive numerical sequences such as passwords or account numbers. In some cases, speaking these sequences aloud could pose a security risk, because someone might overhear the user.
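
To collect such a sequence from the keypad, a page can use the dtmf element in much the same way as listen. In the sketch below, the field name, grammar file, and result path are illustrative only:

    <!-- Keypad input: the grammar defines the accepted key sequences,
         and bind copies the collected digits into an HTML field -->
    <input name="txtPin" type="password" />

    <salt:dtmf id="dtmfPin">
      <salt:grammar src="./pin-digits.grxml" />
      <salt:bind targetelement="txtPin" value="//pin" />
    </salt:dtmf>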

Call centers typically use telephony applications to route calls to appropriate areas or to automate some basic function. For instance, a telephony application can be used to reset passwords or request certain information. By automating tasks handled by telephone support employees, telephony applications can offer significant cost savings.

Telephony applications can also be useful when the user needs to iterate through a large list of information. The user hears a shortened version of the item text and can navigate through the list by speaking certain commands. For example, if the telephony application is used to recite e-mail, the user can listen as the e-mail subjects of all unread e-mails are recited. A user who wants to hear the text of a specific e-mail can speak a command such as "Read e-mail." The user can then navigate through the list by speaking commands such as "Next" or "Previous."
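
The commands accepted at each point are defined in a grammar. A simple navigation grammar, written in the W3C XML grammar format used by the SASDK, might look like the following sketch (the rule name is illustrative):

    <grammar version="1.0" xml:lang="en-US" root="Navigate"
             xmlns="http://www.w3.org/2001/06/grammar">
      <!-- The phrases the user may speak while browsing the list -->
      <rule id="Navigate" scope="public">
        <one-of>
          <item>next</item>
          <item>previous</item>
          <item>read e-mail</item>
        </one-of>
      </rule>
    </grammar>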

Multimodal Applications

Multimodal applications allow the user to choose the appropriate input method, whether speech or traditional Web controls. Because the choice is left to the user, the application can serve a larger customer base. Since not all customers will have access to microphones, a multimodal application is an ideal way to offer speech functionality without forcing it on anyone.

Multimodal applications are accessed via Microsoft Internet Explorer (IE) on the user's PC or with IE for the Pocket PC (see Figure 2.2). Both versions of IE require the installation of a speech add-in. Users indicate that they wish to utilize speech by triggering an event, such as clicking an icon or button.

Figure 2.2. The high-level process by which multimodal applications communicate with Speech Server. The ASP.NET application is accessed either by a computer running Internet Explorer (IE) with the speech add-in or by Pocket IE with the speech add-in.


The speech add-in for IE, necessary for interpreting SALT, is provided with the SASDK. It should be installed on any computer or Pocket PC device accessing the speech application. In addition to providing SALT recognition, the add-in displays an audio meter that visually indicates the volume level of the audio input.
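
In a multimodal page, the event that starts recognition is typically wired up in script, using the object mode described earlier. A minimal sketch (element and field names are illustrative) might start a listen object from a button's click handler:

    <!-- Clicking the button starts recognition; the recognized value is
         then bound into the adjacent text box -->
    <input name="txtCity" type="text" />
    <input type="button" value="Speak" onclick="recoCity.Start();" />

    <salt:listen id="recoCity">
      <salt:grammar src="./cities.grxml" />
      <salt:bind targetelement="txtCity" value="//city" />
    </salt:listen>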
