35.3 TEXT-TO-SPEECH CONVERSION

Speech is the most natural and convenient mode of communication among human beings, and it enriches interaction between computers and humans as well. In the past few decades, significant progress has been made in achieving interaction between humans and computers through speech, though unrestricted speech communication between humans and computers still remains an elusive goal. Text-to-speech conversion and speech recognition are the two pillars on which human interaction with computers through speech rests.

Text-to-speech conversion is gaining importance in many information processing systems, such as information retrieval from databases, computer-aided instruction, conversion of e-mail into speech form so that e-mail can be accessed from a telephone, reading machines for the blind, and access to Web pages from telephones. As shown in Figure 35.1, text-to-speech conversion involves three main steps: machine representation of the text, transliteration, through which the text is converted into its corresponding sound symbols, and synthesis of the speech signal from those sound symbols.

Figure 35.1: Text-to-speech conversion.
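
These three steps can be sketched in code. The following is a minimal sketch in Python; every function body here is a hypothetical stub standing in for a full stage of Figure 35.1, not an implementation from this chapter.

def represent_text(text):
    # Step 1: machine representation. Python strings are already Unicode,
    # so this stub simply passes the text through.
    return text

def transliterate(text):
    # Step 2: convert text to sound symbols. A real system applies
    # pronunciation rules; this stub treats each letter as a symbol.
    return list(text.lower())

def synthesize(symbols):
    # Step 3: generate the speech signal from the symbols. A real system
    # concatenates stored waveforms; this stub returns placeholder bytes.
    return b"".join(s.encode("ascii", "ignore") for s in symbols)

def text_to_speech(text):
    return synthesize(transliterate(represent_text(text)))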

Note 

Text-to-speech conversion has many practical applications such as retrieval of information stored in databases through telephones, reading machines for the blind, and accessing Internet content using telephones.

35.3.1 Machine Representation of Text

Text can be input into the computer either through a keyboard or through an optical character recognition (OCR) system. When the text is input through the keyboard, the characters are encoded in the American Standard Code for Information Interchange (ASCII) format, in which 7 bits represent each character. For Indian language representation, ISCII (Indian Standard Code for Information Interchange) is used; both 7-bit and 8-bit ISCII formats are available. If an OCR system is used to input the text, an orthographic knowledge source that contains the knowledge of the written symbols is required. The OCR software interprets the scanned text and converts it into machine code.

start example

Typed text is represented in the computer using ASCII or Unicode. Alternatively, optical character recognition software is used to convert the written or typed text into machine code.

end example

Unicode is now used extensively for machine representation of text in all the world's languages. Unicode characters are commonly represented as 16-bit code units. Programming languages such as Java and markup languages such as XML support Unicode.
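
The difference between these representations is easy to see in code. The following lines are a small illustration rather than anything specific to this chapter; they show how the same word occupies memory under ASCII and under a 16-bit Unicode encoding.

text = "are"
print(text.encode("ascii"))      # b'are' -- ASCII: one byte per character
print(text.encode("utf-16-be"))  # b'\x00a\x00r\x00e' -- 16-bit code units
print(ord("a"))                  # 97 -- the ASCII/Unicode code point of 'a'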

Note 

Optical character recognition (OCR) will be highly accurate for typed text but not for handwritten text. Many pattern recognition algorithms are used in OCR.

35.3.2 Transliteration

Conversion of text into its corresponding pronunciation is called transliteration. This process is complicated for languages such as English, because in English, 26 letters are mapped onto 42 phonemes—a phoneme is the smallest speech sound in a language. These phonemes are represented by special symbols (such as a, aa, i, ii, u, uu). For English, about 350 pronunciation rules are required to convert text into the corresponding pronunciation. Even with these 350 rules, not all words are pronounced properly. Hence, a dictionary of exceptions is also required. This dictionary contains words and their corresponding pronunciations in the form [are] = aa r.

start example

The process of converting text into its equivalent pronunciation is called transliteration. To convert English text into its equivalent pronunciation, nearly 350 pronunciation rules are required.

end example

When a word is given as input to the transliteration algorithm, the word is first looked up in the dictionary of exceptions. If the word is found there, the corresponding pronunciation is assigned to it. If it is not found, the pronunciation rules are applied to obtain the pronunciation.
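
This lookup order is straightforward to express in code. Below is a minimal sketch: the exceptions dictionary holds the [are] = aa r entry from above, while apply_rules() is a hypothetical stand-in for the roughly 350 letter-to-sound rules of English.

EXCEPTIONS = {
    "are": ["aa", "r"],   # the [are] = aa r entry described above
}

def apply_rules(word):
    # Hypothetical placeholder for the ~350 pronunciation rules; here each
    # letter is naively mapped to a phoneme symbol of the same name.
    return list(word)

def transliterate(word):
    word = word.lower()
    if word in EXCEPTIONS:       # check the exceptions dictionary first
        return EXCEPTIONS[word]
    return apply_rules(word)     # otherwise fall back to the rules

print(transliterate("are"))   # ['aa', 'r'] -- found in the exceptions dictionary
print(transliterate("cat"))   # ['c', 'a', 't'] -- handled by the rules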

Unlike English, most Indian languages are phonetic languages, meaning that there is a one-to-one correspondence between the written form and the spoken form. Hence, for most Indian languages, there is no need for a transliteration algorithm. An exception is Tamil, in which the pronunciation of a letter may depend on the following letter, so a one-symbol look-ahead algorithm is required for transliteration of Tamil, as sketched below.
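
A one-symbol look-ahead simply peeks at the next letter before deciding the pronunciation of the current one. The sketch below shows the mechanism only; the rule table is purely hypothetical and does not reflect actual Tamil phonology.

# Hypothetical (current letter, next letter) -> phoneme rules.
LOOKAHEAD_RULES = {
    ("k", "a"): "k",    # hypothetical: plain sound before 'a'
    ("k", "i"): "kj",   # hypothetical: palatalized sound before 'i'
}

def transliterate_with_lookahead(letters):
    phonemes = []
    for i, ch in enumerate(letters):
        nxt = letters[i + 1] if i + 1 < len(letters) else None  # peek one symbol ahead
        phonemes.append(LOOKAHEAD_RULES.get((ch, nxt), ch))     # default: the letter itself
    return phonemes

print(transliterate_with_lookahead(["k", "i", "k", "a"]))  # ['kj', 'i', 'k', 'a']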

Note 

Transliteration of most Indian languages is easy because there is a one-to-one correspondence between the written form and the spoken form. An exception is the Tamil language.

35.3.3 Speech Synthesis

From the pronunciation of the text obtained from the transliteration algorithm, speech has to be generated by the computer. Conceptually, the speech data for each phoneme can be stored in the computer, and speech can be generated by combining the data for all the phonemes in a given word. For example, if the speech data for the phonemes aa and r are combined, we can produce the sound for the word "are". Unfortunately, this simple approach does not work well, and the quality of speech produced this way is very poor.

To produce good quality speech, various techniques have been tried. These techniques are based on using different basic units of speech—words, syllables, diphones, and phonemes.

Words: A simple mechanism is to use the word as the basic unit: each word in the language is spoken and recorded in the computer. When a sentence is to be spoken, the speech data corresponding to all the words in the sentence is concatenated and played. This approach gives very good quality speech. The only problem is that the number of words in any language is very high.

Assume that we store about 100,000 words of English and that each word takes about 0.4 seconds on average. We need to store 40,000 seconds of speech, which is 320 Mbytes of data if 64 kbps PCM coding is used. This used to be a prohibitive storage requirement, but it no longer is: one CD-ROM can hold the speech data for nearly 200,000 words. The only remaining requirement is to pick up the data corresponding to the required words quickly from the database, so a fast searching mechanism is all that is needed. Nowadays, many text-to-speech conversion systems follow this approach.
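
The arithmetic behind the 320 Mbytes figure, and the concatenation itself, can be checked with a few lines of code. The waveform dictionary below is a hypothetical stand-in for a database of recorded words.

# Storage estimate: 100,000 words x 0.4 s/word at 64 kbps PCM.
words = 100_000
seconds_per_word = 0.4
pcm_bytes_per_second = 64_000 // 8                    # 64 kbps = 8,000 bytes/s
total_bytes = words * seconds_per_word * pcm_bytes_per_second
print(total_bytes / 1e6)                              # 320.0 -- i.e., 320 Mbytes

# Word-level synthesis: concatenate the stored recordings in order.
WORD_WAVEFORMS = {"good": b"...", "morning": b"..."}  # hypothetical recordings

def speak_sentence(sentence):
    return b"".join(WORD_WAVEFORMS[w] for w in sentence.lower().split())

print(speak_sentence("Good morning"))                 # b'......'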

Syllables: A syllable is a combination of two phonemes. The symbol for ka in any language is a syllable consisting of two phonemes—k and a. Similarly, kaa, ki, and kii are syllables. Each language has about 10,000 to 30,000 syllables. If a syllable is taken as the basic unit of speech, we can store the speech data corresponding to these syllables. From the transliteration algorithm, we obtain the pronunciation, and from the pronunciation and the syllable speech data, we can synthesize the speech. This approach gives good quality speech and is recommended if there is a constraint on the storage space.
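
Grouping a phoneme sequence into consonant-vowel syllables of this kind is simple to sketch. The vowel set and the grouping rule below are simplified assumptions for illustration.

VOWELS = {"a", "aa", "i", "ii", "u", "uu"}

def to_syllables(phonemes):
    syllables, pending = [], ""
    for p in phonemes:
        if p in VOWELS:
            syllables.append(pending + p)   # a vowel closes the syllable
            pending = ""
        else:
            pending += p                    # a consonant waits for its vowel
    if pending:
        syllables.append(pending)           # trailing consonant, if any
    return syllables

print(to_syllables(["k", "aa", "k", "i"]))  # ['kaa', 'ki']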

Note 

Using the syllable as the basic unit is the best approach to obtain good quality speech in text-to-speech conversion systems. The number of syllables in a language will be between 10,000 and 30,000.

Diphones: The sound from the middle of one phoneme to the middle of the next phoneme is called a diphone. For instance, in the word put there are four diphones: #p, pu, ut, and t#, where # stands for blank. The diphone is considered an attractive choice of basic unit because the transition from one phoneme to another is important for obtaining good quality speech. The number of diphones in any language is limited to about 1,500, so the storage requirement is very small.
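
Extracting the diphones of a word is a short sliding window over the symbol sequence with boundary markers added, as this small sketch reproducing the put example shows.

def diphones(word):
    symbols = ["#"] + list(word) + ["#"]    # mark the word boundaries with '#'
    return [symbols[i] + symbols[i + 1]     # each adjacent pair is a diphone
            for i in range(len(symbols) - 1)]

print(diphones("put"))   # ['#p', 'pu', 'ut', 't#']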

start example

For synthesizing speech, the waveforms corresponding to basic units of speech need to be stored in the computer. These basic units can be words, syllables, diphones, or phonemes. Depending on the storage space and quality of speech required, the best basic unit can be chosen.

end example

Phonemes: The number of phonemes in any language is very small—fewer than 63. Hence, if the phoneme is used as the basic unit, the storage requirement is very small. However, it is difficult to get good quality speech this way, because the sound of a phoneme varies subtly with the surrounding phonemes—for example, the phoneme b sounds slightly different in the words "bat" and "but". How to obtain this effect in speech synthesis is still an active research area. The advantage of using phonemes as the basic unit is that the sound can be manipulated in software, for example to add stress or vary the pitch, which is required to obtain natural-sounding speech.

To summarize, text-to-speech conversion using words and syllables as the basic units gives good quality speech. If the vocabulary is limited, as in the case of most systems (discussed in the next section), it is better to use words; if the vocabulary is very large, it is better to use syllables.

Making computers generate natural-sounding speech is still very difficult. Speech produced by most text-to-speech conversion systems sounds rather artificial. This is because when people speak, they vary the pitch, the stress on different sounds, and the timing (some sounds are elongated). "She is a beautiful woman" has different meanings depending on how one pronounces the word "beautiful". Machines currently cannot do that kind of job.

To develop text-to-speech conversion systems that produce natural-sounding speech, we need to introduce three special effects (refer to Figure 35.1): intonation (variation of pitch with time), rhythm (variation of stress with time), and quantity (variation of the duration of the phonemes). This calls for the various knowledge components shown in Figure 35.2. The knowledge of written symbols is represented by the orthographic component, and the knowledge of phonemes by the phonological component. Using phonemes, words are formed with the lexical knowledge component. Using the syntactic component (grammar), words are combined to form sentences. The semantic component represents the meaning associated with the words, and the prosodic component represents the context-dependent meaning of the sentences. Generation and understanding of speech using all these knowledge components is still so complicated that it will take the Ph.D. work of thousands of researchers to make computers talk like human beings. Developing these knowledge sources to produce natural-sounding speech is an active research topic in artificial intelligence.

start example

To produce natural-sounding speech, intonation, rhythm, and quantity are important. Intonation is variation of pitch with time. Rhythm is variation of stress with time. Quantity is variations in the duration of the phonemes.

end example

Figure 35.2: Components of language.

Note 

Producing natural-sounding speech is still an active research area, because many artificial intelligence concepts need to be brought in to generate speech.

35.3.4 Issues in Text-to-Speech Conversion System Design

To develop practical text-to-speech conversion systems that give very high quality speech is still a challenging task. The following issues need to be considered while designing commercial text-to-speech conversion systems:

start example

To develop commercial text-to-speech conversion systems, the design issues are vocabulary size, basic speech units, number of languages, and the low bit rate coding technique to be used for storing the basic speech units.

end example

Vocabulary size: If the vocabulary is limited (as in most IVR systems), the speech data corresponding to the words and phrases can be recorded directly. Text-to-speech conversion is then basically a matter of concatenating the speech data files and replaying the concatenated file to the user.

If the vocabulary size is not limited, the text first has to be converted to its phonetic form using a transliteration algorithm, and the speech then has to be synthesized from basic units.

Basic speech units: Depending on the quality required and the storage space available, the basic speech unit has to be chosen—it can be the word, syllable, diphone, or phoneme. As discussed earlier, the quality of the speech varies with the unit chosen.

Number of languages: In some applications, multilingual support is required. Each language has to be considered separately, and the text-to-speech conversion mechanism has to be worked out.

Low bit rate coding of speech: Storing the speech data in PCM format (64 kilobits per second) requires a lot of storage space. Though storage is not a major issue for many applications, some applications demand conserving storage space, as in the case of talking toys. In such cases, low bit rate coding schemes such as ADPCM or LPC can be used; of course, quality and storage space are trade-off parameters, as the comparison below illustrates.
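
A quick calculation shows the scale of the trade-off. The rates used here are typical values (64 kbps for PCM, 32 kbps for ADPCM, about 2.4 kbps for an LPC vocoder) rather than figures from this chapter.

RATES_KBPS = {"PCM": 64, "ADPCM": 32, "LPC": 2.4}
seconds = 3600                                  # one hour of speech

for name, kbps in RATES_KBPS.items():
    mbytes = kbps * 1000 * seconds / 8 / 1e6    # kbps -> Mbytes for one hour
    print(f"{name:5s}: {mbytes:5.1f} Mbytes/hour")
# PCM  :  28.8 Mbytes/hour
# ADPCM:  14.4 Mbytes/hour
# LPC  :   1.1 Mbytes/hour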

Note 

If you can store the speech waveforms of about 200,000 words of any language, you can get good quality speech by concatenating the words to form sentences. Though the storage requirement is high, storage is not a constraint on desktops.


