35.8 VOICEXML

The VoiceXML Forum (http://www.voicexmlforum.org) was founded by AT&T, IBM, Lucent Technologies, and Motorola to promote Voice eXtensible Markup Language (VoiceXML). VoiceXML was designed to make Internet content available by voice over telephones; in short, it makes a voice-enabled Web possible. VoiceXML Version 1.0 was released in March 2000, and a working draft of Version 2.0 followed in October 2001.

Web access is normally through desktop PCs, where the information obtained is rich in content and graphics. But PC penetration is very low in developing countries, and computer literacy is a must to access Web services. Accessing the Web through mobile phones using WAP protocols is an alternative, but WAP-enabled mobile phones are costly and not within the reach of many. Moreover, because of the limited display on mobile phones, WAP services are not user friendly.

If Web services are accessible through normal telephones or mobile phones, with the output in voice form, Web reach can be much greater, and services will be very user friendly because speech is a very natural way of communicating among people. VoiceXML provides this possibility.

Consider a simple example of obtaining weather information from an Internet Web server. The dialogue between the computer (C) and the human (H) can take one of two forms: (a) directed dialogue and (b) mixed initiative dialogue.

VoiceXML provides the capability of accessing Internet content through telephones. VoiceXML is derived from XML.

Directed dialogue: In this approach, the interaction between C and H can be as follows:
- C: Please say the state for which you want the weather information.
- H: Indiana.
- C: Please say the city.
- H: Fort Wayne.
- C: The maximum temperature in Fort Wayne is 63 degrees Fahrenheit.

Mixed initiative dialogue: In this approach, the interaction between C and H can be as follows:
- C: Please say the state and the city for which you want the weather information.
- H: Fort Wayne Indiana.
- C: The maximum temperature in Fort Wayne Indiana is 63 degrees Fahrenheit.
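The difference between the two dialogue styles can be illustrated with a small sketch. It is not how a VoiceXML platform is implemented; the weather table, the name sets, and the function names are assumptions made for illustration, and the recognized replies are passed in as plain text in place of a real speech recognizer.

```python
# Hypothetical data; the single entry mirrors the example dialogue.
WEATHER = {("indiana", "fort wayne"): 63}
STATES = {"indiana"}
CITIES = {"fort wayne"}


def directed_dialogue(answers):
    """Ask one question per slot; `answers` stands in for recognized replies."""
    replies = iter(answers)
    print("C: Please say the state for which you want the weather information")
    state = next(replies).lower()
    print("C: Please say the city")
    city = next(replies).lower()
    return state, city


def mixed_initiative(utterance):
    """Fill both slots from a single utterance by scanning for known names."""
    text = utterance.lower().strip(". ")
    state = next((s for s in STATES if s in text), None)
    city = next((c for c in CITIES if c in text), None)
    return state, city


def report(state, city):
    """Build the sentence that would be handed to text-to-speech conversion."""
    temperature = WEATHER[(state, city)]
    return (f"The maximum temperature in {city.title()} "
            f"is {temperature} degrees Fahrenheit")
```

In the directed form the system controls the order in which the slots are filled, so each reply needs only a small vocabulary. In the mixed initiative form one utterance may fill several slots, which places a heavier burden on speech recognition.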

This kind of interaction, carried out completely through speech, makes it possible to obtain information available on the Web. It calls for interfacing a text-to-speech conversion system, a speech recognition system, and, if required, an IVR system to the Web servers. It is possible to provide voice-enabled Web service without VoiceXML, but because all these components are built around proprietary hardware and software, it is difficult to port the application to different platforms.

Accessing Internet content through telephones is done either through directed dialogue or mixed initiative dialogue.

Note

VoiceXML separates the service logic from the user interaction code. Hence, it is possible to port the application from one platform to another because VoiceXML is an industry standard for content development.

VoiceXML has been designed with the following goals:

To integrate data services and voice services.
To separate the service logic (CGI scripts) to access databases and interface with legacy databases from the user interaction code (VoiceXML).
To facilitate portability of applications from one platform to another; VoiceXML is based on an industry standard for content development.
To shield application developers from the low-level platform-dependent details such as hardware and software for text-to-speech conversion, IVR digit recognition, and speech recognition.

The operation of voice-enabled Web is shown in Figure 35.9. The VoiceXML server contains the necessary hardware and software for: (a) telephone interface; (b) speech recognition; (c) text-to-speech conversion; and (d) audio play/record. The Web server contains the information required for the specific application in the form of VoiceXML documents, along with the service logic in the form of CGI scripts and the necessary database interfaces. When a user calls an assigned telephone number through the PSTN or PLMN (to access weather information, for instance), the call reaches the VoiceXML server, and this server converts the telephone number to a URL. The server then obtains the weather information corresponding to the URL from the Web server in the form of a VoiceXML document. The VoiceXML server converts the content into speech form and plays it to the user. When the user utters some words (the state and city names for obtaining weather information), the VoiceXML server recognizes those words and, based on the information available in the database, plays the information to the user.
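The call flow described above can be traced with a short sketch. The telephone number, the URL, and the in-memory dictionary standing in for the Web server are all hypothetical, and a real VoiceXML server would fetch the document over HTTP and pass the text to a speech synthesizer rather than return it.

```python
import xml.etree.ElementTree as ET

# Hypothetical routing table: dialed number -> URL of the VoiceXML document.
NUMBER_TO_URL = {"5550100": "http://www.example.com/weather.vxml"}

# Stand-in for the Web server: URL -> VoiceXML text (no network access here).
WEB_SERVER = {
    "http://www.example.com/weather.vxml":
        "<vxml version='1.0'><form><block>Welcome to the weather "
        "information service</block></form></vxml>",
}


def handle_call(dialed_number):
    """Return the text the server would speak for a call, or None if unknown."""
    url = NUMBER_TO_URL.get(dialed_number)   # step 1: telephone number -> URL
    if url is None:
        return None
    document = WEB_SERVER[url]               # step 2: fetch the VoiceXML document
    root = ET.fromstring(document)           # step 3: parse the document
    # Step 4: collect the text that would go to text-to-speech conversion.
    return " ".join("".join(root.itertext()).split())
```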

Figure 35.9: Operation of voice-enabled Web.

The human-machine interaction is carried out using the following:

DTMF digits dialed by the user from the telephone.
Text-to-speech conversion.
Speech recognition.
Recording of speech input of the user.
Playing of already recorded speech to the user.

A VoiceXML server has the necessary hardware and software to facilitate this human-computer interaction. It gets the user inputs in the form of DTMF digits or voice commands and gives the output in speech format. The dialogues are of two types: menus and forms. Menus provide the user with a list of choices, and forms collect values for a set of variables (such as an account number). When a user does not respond or requests help, events are thrown.

The implementation platform of the VoiceXML server generates events in response to user actions (such as a spoken word or a pressed key) and system events (such as a timeout, in case the user does not respond). The implementation platform is different from the content, and the content is independent of the hardware used for developing voice-enabled Web.
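The way events drive reprompting can be sketched as follows. The event names noinput, nomatch, and help are standard VoiceXML events; everything else (the function names, the handler messages, and the use of None to represent a timeout) is an assumption made for illustration.

```python
# Hypothetical messages played when each event is thrown.
HANDLERS = {
    "noinput": "Sorry, I did not hear anything.",
    "help": "Please say the name of a state, for example Indiana.",
    "nomatch": "Sorry, I did not understand.",
}


def classify_input(user_input, vocabulary):
    """Map raw input (None means a timeout) to an event or a recognized value."""
    if user_input is None:
        return ("event", "noinput")
    word = user_input.strip().lower()
    if word == "help":
        return ("event", "help")
    if word not in vocabulary:
        return ("event", "nomatch")
    return ("filled", word)


def collect_field(inputs, vocabulary, max_tries=3):
    """Reprompt on events until a value is recognized or the tries run out."""
    spoken = []
    for user_input in inputs[:max_tries]:
        kind, value = classify_input(user_input, vocabulary)
        if kind == "filled":
            return value, spoken
        spoken.append(HANDLERS[value])   # play the message for this event
    return None, spoken
```

For example, a caller who first stays silent, then asks for help, and finally says "Indiana" causes two handler messages to be played before the field is filled.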

A typical VoiceXML document is shown below. It contains tags to generate prompts and obtain user responses, and grammars that specify the inputs the platform should accept; the service logic itself stays on the Web server.

A VoiceXML document contains tags to generate prompts and obtain user responses, and grammars that specify the acceptable user inputs.

    <?xml version="1.0"?>
    <vxml version="1.0">
      <form id="weather">
        <block> Welcome to the weather information service </block>
        <field name="state">
          <prompt> Please tell for which state you want weather information </prompt>
          <grammar src="/books/4/329/1/html/2/state.gram" type="application/x-jsgf"/>
        </field>
        <field name="city">
          <prompt> What city </prompt>
          <grammar src="city.gram" type="application/x-jsgf"/>
        </field>
        <block>
          <submit next="/servlet/weather" namelist="city state"/>
        </block>
      </form>
    </vxml>

As this code shows, VoiceXML provides a simple way to develop content for voice-enabled Web applications. In the next decade, these services are expected to catch on as a user-friendly way of browsing the Web through telephones.
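To see what a VoiceXML server has to extract from such a document, the following sketch reads a form with Python's standard xml.etree.ElementTree module. The embedded document is a well-formed copy of the weather form with the grammar file names shortened for illustration, and the function name describe_form is an assumption; a real VoiceXML interpreter does far more than this.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the weather form; grammar file names are illustrative.
VXML = """<?xml version="1.0"?>
<vxml version="1.0">
  <form id="weather">
    <block> Welcome to the weather information service </block>
    <field name="state">
      <prompt> Please tell for which state you want weather information </prompt>
      <grammar src="state.gram" type="application/x-jsgf"/>
    </field>
    <field name="city">
      <prompt> What city </prompt>
      <grammar src="city.gram" type="application/x-jsgf"/>
    </field>
    <block>
      <submit next="/servlet/weather" namelist="city state"/>
    </block>
  </form>
</vxml>"""


def describe_form(vxml_text):
    """List the fields to collect and where the collected values are sent."""
    root = ET.fromstring(vxml_text)
    form = root.find("form")
    fields = []
    for field in form.findall("field"):
        fields.append({
            "name": field.get("name"),
            "prompt": field.findtext("prompt").strip(),
            "grammar": field.find("grammar").get("src"),
        })
    submit = form.find(".//submit")          # submit sits inside a block
    return {
        "form": form.get("id"),
        "fields": fields,
        "next": submit.get("next"),
        "namelist": submit.get("namelist").split(),
    }
```

Each field pairs a prompt (what the system says) with a grammar (what the caller may say), and the submit element names the server-side script that receives the collected values.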

Using VoiceXML, content can be developed without bothering about the implementation details of the various components, such as text-to-speech conversion, speech recognition, or an interactive voice response system.

Summary

Computer telephony integration facilitates accessing the information available in computers through telephones. CTI technology has become very popular in recent years, particularly in developing countries because telephone density is very high compared to computer density. The three technology components of CTI are text-to-speech conversion, speech recognition, and interactive voice response (IVR). The details of all these technology components are presented in this chapter.

Text-to-speech conversion involves converting the text into its phonetic equivalent and then applying speech synthesis techniques. For English, a set of pronunciation rules is required to convert the text into its phonetic equivalent. For Indian languages, this step is very easy because there is a one-to-one correspondence between the written form and spoken form. For generating speech, the basic units can be words, syllables, diphones, or phonemes.

Speech recognition is a very complex task because speaker characteristics vary widely. Present commercial speech recognition systems can recognize a limited vocabulary from a limited number of speakers very accurately. Unlimited-vocabulary, speaker-independent speech recognition is still an active research area. Speech recognition is a pattern recognition problem in which prestored templates obtained during the training phase are compared with the test patterns.

Interactive voice response systems are now being widely deployed to provide information to consumers such as in railway/airline reservation systems, banking and so on. An IVR system consists of a hardware module to take care of the telephony functions and software to access the database and convert the text into speech.

CTI technology is now being used effectively in call centers by service organizations to provide efficient customer service. CTI also is very useful for accessing the Web services of the Internet through voice-enabled Web.

Questions

List the various technology components used in computer telephony integration.
One way of developing a text-to-speech conversion system in English is to record and store about 200,000 words of English and then concatenate the required words to generate speech. For example, if the input sentence in text is "she is a beautiful woman", the software has to pick up the speech data corresponding to all the words and concatenate these words and play through the sound card. Which search algorithm is good to search for the words in the database? Study the algorithmic complexity and the storage requirement for the search algorithm.
For text-to-speech conversion, is it easier to handle English or the Indian languages (such as Hindi, Telugu, Kannada, Bengali, Marathi, Gujarati, and Malayalam)? Why?
If you have to develop an automatic speech recognition system that recognizes any word spoken by a person, what are the issues to be addressed? If this system has to recognize anybody's speech, what are the issues?
What are the different categories of speech recognition systems? List the potential applications for each category.
Is it possible to communicate with computers the way we communicate with each other in a natural language such as English or Hindi? If not, why not?
What is an interactive voice response system? What are its potential applications?
Describe the architecture of a call center.
Call centers are now being set up in major Indian cities for foreign clients. Study the various market segments for which such call centers are being set up.
What is unified messaging? What are its advantages?
What is the need for a new standard for a markup language (VoiceXML) for CTI applications? What are the salient features of VoiceXML?
Study the Unicode representation of different Indian languages.

Exercises

1.	Using the sound card of your PC, record about 100 words in English and store them in different files. Write a program that takes an English sentence as input and speaks out the sentence by concatenating the words. If the gap between two successive words is high, the speech does not sound good. Try to edit the voice files to reduce the silence at the beginning and end of each word and then try to do the text-to-speech conversion.
2.	Design an IVR system for a telebanking application. Create a database that contains bank account number, password, type of account, and balance amount. Simulate a telephone keypad on the monitor using a Java applet. Design the conversation between the IVR system and the user.
3.	Search for freely available text-to-speech conversion and speech recognition software packages available on the Internet. Experiment with these packages.
4.	For any Indian language, find out the number of words, syllables, diphones, and phonemes required to achieve unlimited vocabulary text-to-speech conversion.
5.	List the various components required for developing a call center.
6.	Study the commercial equipment available for setting up a call center.
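As a starting point for exercise 1, the word-concatenation idea can be sketched as follows. This is one possible approach, not a complete solution: each recorded word is represented as a plain list of integer samples rather than a sound file, and the abbreviation table, the silence threshold of 500, and the function names are assumptions made for illustration.

```python
import string

# Hypothetical abbreviation table; extend it for your own text.
ABBREVIATIONS = {"mr.": "mister", "prof.": "professor", "rs.": "rupees"}


def normalize(sentence):
    """Lower-case the text, expand abbreviations, and strip punctuation."""
    words = []
    for token in sentence.lower().split():
        token = ABBREVIATIONS.get(token, token)
        token = token.strip(string.punctuation)
        if token:
            words.append(token)
    return words


def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


def synthesize(sentence, recorded, gap=0):
    """Concatenate the trimmed samples of each word, with `gap` zeros between."""
    output = []
    for word in normalize(sentence):
        output.extend(trim_silence(recorded[word]))   # KeyError if not recorded
        output.extend([0] * gap)
    return output
```

Trimming the silence before concatenation addresses the problem noted in the exercise: long gaps between successive words make the speech sound unnatural. The gap parameter lets you reintroduce a short, controlled pause.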

Answers

1.	You can develop text-to-speech conversion software using words as the basic units. You can record a large number of words and store each word in a separate file. The program for text-to-speech conversion has to do the following:
	- Read the input sentence.
	- Remove punctuation marks.
	- Convert all capital letters to small letters.
	- Expand abbreviations such as Mr., Prof., and Rs.
	- Scan each word in the sentence and pick up the corresponding sound file from the database of spoken words.
	- Create a new sound file that is a concatenation of all the sound files of the words.
	- Play the new sound file through the sound card.
2.	To design an IVR system for a telebanking application, you need to create a database (in MS Access, MS SQL, or Oracle, for instance) that contains bank account number, password, type of account, and balance amount. The information that needs to be stored in the database is: account holder name, address, account number, type of account, and present bank balance. You also need to design the dialogues for interaction between the account holder and the IVR system. A typical dialogue is as follows:
	- IVR: Welcome to ABC Bank's IVR system. Please dial 1 for information in English, dial 2 for information in Hindi.
	- User: Dials 1.
	- IVR: Please dial your account number.
	- User: Dials 2346.
	- IVR: The account number you dialed is two three four six. Please dial your password.
	- User: Dials 4567.
	- IVR: You have a savings bank account. The present balance is Rupees Ten Thousand Four Hundred. Thank you for calling the IVR.
3.	You can experiment with Microsoft's Speech SDK to develop text-to-speech and speech recognition–based applications. IBM's WebSphere can also be used for development of such applications.
4.	For most Indian languages, the number of phonemes is about 60. The number of diphones is about 1500. The number of syllables is about 20,000. You need to store nearly 200,000 words if you want to use the word as the basic unit for text-to-speech conversion. It is better to use syllables.
5.	A call center consists of the following components:
	- A local area network in which one node (computer) is given to each agent.
	- A server that runs the customer relations management software.
	- A PBX with one extension to each agent.
	- Automatic call distribution (ACD) software.
	- Fax-on-demand software.
	- An interactive voice response system.
6.	Nortel and Cisco are the two major suppliers of call center equipment. You can get the details from their Web sites http://www.cisco.com and http://www.nortelcommuncations.com.
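The telebanking dialogue in answer 2 can be prototyped before any telephony hardware is involved. In the sketch below, the database is a hypothetical in-memory dictionary holding the account from the sample dialogue, DTMF input is passed in as a list of strings, and the error messages are assumptions; a real system would read digits from the telephone interface and speak the prompts through text-to-speech conversion.

```python
# Hypothetical account table using the values from the sample dialogue.
ACCOUNTS = {"2346": {"password": "4567", "type": "savings", "balance": 10400}}

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def digits_to_words(digits):
    """Spell out a digit string so that it can be read back to the caller."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def run_ivr(dtmf_inputs):
    """Drive the dialogue with a list of DTMF entries; return the prompts played."""
    keys = iter(dtmf_inputs)
    played = ["Welcome to ABC Bank's IVR system. Please dial 1 for information "
              "in English, dial 2 for information in Hindi."]
    if next(keys) != "1":
        played.append("Only English is implemented in this sketch.")
        return played
    played.append("Please dial your account number.")
    account_number = next(keys)
    account = ACCOUNTS.get(account_number)
    if account is None:
        played.append("Invalid account number.")
        return played
    played.append(f"The account number you dialed is "
                  f"{digits_to_words(account_number)}. Please dial your password.")
    if next(keys) != account["password"]:
        played.append("Incorrect password.")
        return played
    played.append(f"You have a {account['type']} bank account. "
                  f"The present balance is Rupees {account['balance']}.")
    return played
```

Calling run_ivr(["1", "2346", "4567"]) walks through the successful path of the sample dialogue. Converting the balance into words, as in the sample dialogue, is left as an extension.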

Projects

Develop a full-fledged text-to-speech conversion system for your native language. You can store the speech data for a reasonably large number of words, say 500. Using this database of words, create a database of syllables. Write the software that takes the text as input and converts it into speech using concatenation of words and syllables. If you develop a good database of syllables, you will achieve very good quality text-to-speech conversion.
Using Microsoft's Speech SDK, create a voice browsing application. Microsoft's Speech SDK can be used to recognize words. When a particular word is recognized, the system has to jump to a specific link. You can use VoiceXML to create the content.
Using a voice/data/fax modem connected to a PC, develop an IVR system. You can use Microsoft's Telephony API (TAPI) to control the modem and generate the responses based on the user's input of DTMF digits.
Using IBM's WebSphere Voice Server Software Developers Kit, develop a telebanking application that facilitates banking through voice commands.
Develop fax-on-demand software. A set of five documents (MS Word files) should be stored in the PC. These files can be brochures of five products. When a user dials a telephone number to which the voice/data/fax modem is connected, the user should get the response "Please dial 1 to get the brochure of TV, dial 2 to get the brochure of refrigerator, dial 3 to get the brochure of microwave oven, dial 4 to get the brochure of DVD player and dial 5 to get the brochure of washing machine". When the user dials a number, the user should hear the message "Please enter your fax number". When the user dials the fax number, the corresponding brochure should be faxed to the user's fax machine.
