Analyzing Speech Synthesis

As lifelong users of spoken language, we find speech synthesis (or "generation") a simple task. Many of us probably don't remember a time when we weren't capable of conversing with the people around us. But getting a computer to speak intelligibly is a surprisingly complicated process.

The simplest approach is to record audio clips of the desired words or phrases and then to play them back as required by the application. This is the approach used by answering machines, voice mail, and VRU systems. It has the advantage of not requiring much processing power, but it doesn't scale to the case of arbitrary phrases. Every conceivable phrase that the system can utter would have to be recorded by a human voice.

A trade-off can be made between flexibility and smoothness: Instead of recording every possible phrase, a human records every word that can be uttered by the system. The words are then strung together to create the appropriate phrases. However, without the natural rise and fall of pitch through the phrase, the result sounds jerky or stilted and can sometimes be comical: (female voice) "Please wait by your car for parking officer Steve" (deep male voice) "Grebowski" (female voice again) "between the hours of 9 and 5."

As computing power has increased through the years, dynamic synthesis of voice waveforms has become a less expensive proposition. Nevertheless, there are still many different programmatic steps required to convert a string of words to an output waveform. Speech scientists have identified these steps in great detail, but for our purposes four will suffice. In order, from the human-centric level to the machine-centric level, they are: tokenization, phrasing and accenting, phonetics and intonation, and waveform generation.

Not all speech synthesis packages provide all the higher-level components. For example, a simple synthesizer may perform only the waveform-generation step, which requires the application to specifically create the appropriate string of sounds and embedded intonation hints.

Tokenization

Tokenization is the process of breaking up an input string into words, identifying punctuation marks, and, often, converting special characters into their alphabetic equivalents. For example, a typical tokenization would be:

 The time is 8:04 PM, July 24th. 
 the time is eight oh four P M <comma> july twenty fourth <end> 

Sometimes some of these steps are part of the phrasing and intonation step, to be described shortly.

Tokenization rules can quickly become complicated. Even in our short example, we can see that the following rules must be defined somewhere:

· Numbers are expanded to their word equivalents (8).

· A colon surrounded by numbers is skipped.

· A zero following a colon becomes "oh."

· Unrecognized words are spelled out (PM).

· A number followed by an ordinal suffix is expanded to the corresponding ordinal word (24th).
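
To make these rules concrete, here is a minimal sketch in Python of how such a tokenizer might be written. The rule set, word lists, and helper names are invented for illustration and are far simpler than what a real synthesizer uses:

import re

# Small word lists; a real synthesizer would use a full number-expansion library.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}
ORDINALS = {"one": "first", "two": "second", "three": "third", "four": "fourth",
            "five": "fifth", "eight": "eighth", "nine": "ninth",
            "twelve": "twelfth", "twenty": "twentieth", "thirty": "thirtieth"}

def number_to_words(n):
    """Expand 0-99 to words; enough for times and days of the month."""
    if n < 20:
        return [ONES[n]]
    words = [TENS[n // 10]]
    if n % 10:
        words.append(ONES[n % 10])
    return words

def ordinal_words(n):
    """24 -> ['twenty', 'fourth'], by converting the last word to an ordinal."""
    words = number_to_words(n)
    words[-1] = ORDINALS.get(words[-1], words[-1] + "th")
    return words

def tokenize(text):
    tokens = []
    for raw in re.findall(r"[A-Za-z0-9:]+|[,.!?]", text):
        if raw == ",":
            tokens.append("<comma>")
        elif raw in ".!?":
            tokens.append("<end>")
        elif re.fullmatch(r"\d{1,2}:\d{2}", raw):        # a time: skip the colon
            hours, minutes = raw.split(":")
            tokens += number_to_words(int(hours))
            if minutes.startswith("0"):                   # 8:04 -> eight oh four
                tokens.append("oh")
                tokens += number_to_words(int(minutes[1]))
            else:
                tokens += number_to_words(int(minutes))
        elif re.fullmatch(r"\d+(st|nd|rd|th)", raw):      # ordinal suffix: 24th
            tokens += ordinal_words(int(raw[:-2]))
        elif raw.isdigit():
            tokens += number_to_words(int(raw))
        elif raw.isupper() and raw not in ("A", "I"):     # unknown abbreviation: spell it out
            tokens += list(raw)
        else:
            tokens.append(raw.lower())
    return tokens

print(" ".join(tokenize("The time is 8:04 PM, July 24th.")))
# the time is eight oh four P M <comma> july twenty fourth <end>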

Phrasing and Intonation

The next step in speech synthesis is to take the list of tokens, identify the parts of speech, and divide the tokens into plausible phrases. Punctuation marks, such as commas and periods, are critical to this process. If the machine also has knowledge of the grammar, the dependent and independent clauses can be recognized to provide additional hints. In our example, the phrasing rules would break the sequence of tokens into two parts:

 the time is eight oh four P M 
 july twenty fourth 

Usually commas and semicolons indicate the end of a phrase, followed by a brief pause, while periods indicate the end of a phrase, followed by a longer pause.
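
Continuing the sketch from the previous section, a phrase-splitting step might group the tokens and attach a pause to each phrase. The pause lengths here are arbitrary illustrative values, not figures from any real synthesizer:

def split_phrases(tokens):
    """Group tokens into phrases, attaching a pause length (in seconds) to each.

    A <comma> (or a semicolon token, if the tokenizer produced one) ends a
    phrase with a brief pause; an <end> token ends it with a longer pause.
    """
    phrases, current = [], []
    for tok in tokens:
        if tok in ("<comma>", "<semicolon>"):
            if current:
                phrases.append((current, 0.25))   # brief pause
            current = []
        elif tok == "<end>":
            if current:
                phrases.append((current, 0.6))    # longer pause
            current = []
        else:
            current.append(tok)
    if current:                                    # trailing tokens with no punctuation
        phrases.append((current, 0.6))
    return phrases

tokens = "the time is eight oh four P M <comma> july twenty fourth <end>".split()
for words, pause in split_phrases(tokens):
    print(" ".join(words), f"(pause {pause}s)")
# the time is eight oh four P M (pause 0.25s)
# july twenty fourth (pause 0.6s)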

Now, the words within each phrase must be assigned accent, which describes variations in the pitch and speed of the spoken word. Accented words tend to have a heavier, slower intonation, and other words are spoken more quietly and quickly. Phrase-level accenting is a highly language-specific property.

Here is an example of the importance of phrase-level accenting (in English, of course); the accented word in each sentence is marked with asterisks:

 Alice is telling Bob about her appointment: 
 A: I'm *supposed* to be there at eight. (flat tone, falling at end) 
 
 Alice thought her appointment was at seven, but Bob has just told her differently: 
 A: I'm supposed to be there at *eight*? (rising sharply at end) 
 
 Alice, reminding herself of the right time: 
 A: I'm supposed to *be* there at eight. (rising a little at end) 

All three sentences contain the same words, but they imply very different things, depending on the intonation. It's difficult to tell the appropriate intonation just from the written words. Some speech synthesizers allow special marker symbols to provide accenting hints at this step.
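
As an illustration only, an application might adopt the convention used in the examples above of wrapping an accented word in asterisks, and then parse that hint out before handing the phrase to the synthesizer. This marker syntax is made up for the sketch; it is not a standard:

def parse_accents(text):
    """Split a marked-up phrase into (word, accented) pairs.

    A word wrapped in asterisks, e.g. *eight*, is flagged as accented.
    The asterisk convention is a made-up example, not a standard.
    """
    pairs = []
    for word in text.split():
        stripped = word.strip(".,?!")
        if stripped.startswith("*") and stripped.endswith("*"):
            pairs.append((stripped.strip("*"), True))
        else:
            pairs.append((stripped, False))
    return pairs

print(parse_accents("I'm supposed to be there at *eight*?"))
# [("I'm", False), ('supposed', False), ('to', False), ('be', False),
#  ('there', False), ('at', False), ('eight', True)]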

Phonetics

Next, each word needs to be decomposed into its specific phonemes. Phonemes are the specific, small units of sound that make up the characteristic sound of a language. The International Phonetic Alphabet describes a universal way of writing each of these sounds and all their subtleties. One of the more familiar uses of phonemes is probably on your bookshelf: The pronunciation guides in dictionaries spell each word phonetically.

Decomposition into phonemes is difficult for an irregular language like English, where the spelling of a word does not bear as much relation to its pronunciation as in other languages. Most speech synthesizers use both a dictionary of known words and a set of rules for unknown words to derive the correct pronunciations.
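
A toy version of that lookup might combine a tiny exception dictionary with a deliberately naive letter-to-sound fallback. Both tables below are illustrative assumptions; real pronunciation dictionaries hold tens of thousands of entries and use far subtler rules:

# A tiny exception dictionary: words whose spelling is a poor guide to their sound.
PRONUNCIATIONS = {
    "the":   ["th", "<weak vowel>"],
    "time":  ["t", "ah", "ee", "m"],
    "is":    ["ih", "z"],
    "eight": ["ay", "t"],
    "oh":    ["oh"],
    "four":  ["f", "aw", "r"],
}

# Extremely naive letter-to-sound rules for words not in the dictionary.
LETTER_SOUNDS = {"a": "ah", "b": "b", "c": "k", "d": "d", "e": "eh",
                 "f": "f", "g": "g", "h": "h", "i": "ih", "j": "jh",
                 "k": "k", "l": "l", "m": "m", "n": "n", "o": "oh",
                 "p": "p", "q": "k", "r": "r", "s": "s", "t": "t",
                 "u": "uh", "v": "v", "w": "w", "x": "ks", "y": "y", "z": "z"}

def to_phonemes(word):
    """Look the word up in the dictionary; fall back to letter-by-letter rules."""
    if word in PRONUNCIATIONS:
        return PRONUNCIATIONS[word]
    return [LETTER_SOUNDS[ch] for ch in word if ch in LETTER_SOUNDS]

for w in ["the", "time", "is", "eight", "oh", "four"]:
    print(w, "->", to_phonemes(w))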

There can be more than one pronunciation dictionary for a given language, corresponding to the quality commonly known as accent but more properly termed dialect. For example, the Festival speech synthesis package comes with several phoneme databases, including a British English male speaker, two American English male speakers, and a Castilian Spanish male speaker.

Once the appropriate phonemes and their accents are looked up in the pronunciation dictionary, the information is combined with the overall intonation of the containing phrase to derive the appropriate pitch and duration for each phoneme. This process provides continuity between words and avoids a monotone effect.

After this step is completed, our time-and-date example is reduced to the following sequence of phonemes:

  th (pitch normal)  
  <weak vowel>  
  t  
  ah  
  ee  
  m (longer)  
  <stop>  
  ih (pitch decreasing)  
  z  
  ay (pitch higher than normal)  
  t(s)  
  <stop>  
  oh (long)  
  f (pitch normal)  
  aw  
  r  

Of course, the description of each phoneme is not as complete as the software would generate, but many of the more important aspects are shown.
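
In code, the output of this step might be represented as a list of phoneme records carrying pitch and duration targets, with the phrase's intonation spread across them. The record layout and the simple falling contour below are assumptions made purely for illustration:

from dataclasses import dataclass

@dataclass
class PhonemeSpec:
    name: str        # phoneme symbol, e.g. "th" or "<stop>"
    pitch: float     # multiplier relative to the voice's base pitch
    duration: float  # seconds

def apply_contour(phonemes, start_pitch=1.05, end_pitch=0.9, base_duration=0.08):
    """Spread a simple falling pitch contour across a declarative phrase.

    Real synthesizers use far richer intonation models; this only shows how
    phrase-level intonation can be folded into per-phoneme targets.
    """
    n = max(len(phonemes) - 1, 1)
    specs = []
    for i, name in enumerate(phonemes):
        pitch = start_pitch + (end_pitch - start_pitch) * i / n
        duration = 0.04 if name == "<stop>" else base_duration
        specs.append(PhonemeSpec(name, round(pitch, 3), duration))
    return specs

phrase = ["th", "<weak vowel>", "t", "ah", "ee", "m", "<stop>", "ih", "z"]
for spec in apply_contour(phrase):
    print(spec)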

Waveform Generation

Finally, the waveforms for each phoneme are generated, either by looking them up in a file or through a more sophisticated wave-envelope synthesis. This is the point at which different voice qualities can be simulated. For example, the voice of a man, a woman, or a child, as well as gravelly, smooth, and whispering voices, are all distinguished and generated at this step. The waveforms for each phoneme are adjusted for pitch and duration and then blended into a final waveform, which can be stored in a file or sent directly to an audio device.
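
A bare-bones sketch of this final step follows, with pitched sine bursts standing in for real stored phoneme waveforms. The phoneme table, sample rate, and crossfade length are all invented for illustration:

import numpy as np

RATE = 16000  # samples per second

# Stand-in "waveform table": a base frequency per phoneme instead of stored samples.
PHONEME_HZ = {"ah": 120.0, "ee": 160.0, "oh": 110.0, "m": 100.0, "<stop>": 0.0}

def render(name, pitch=1.0, duration=0.08):
    """Generate one phoneme's waveform, scaled for pitch and duration.

    A real synthesizer would look up a recorded or modelled waveform and
    resample it; a pitched sine burst stands in for that here.
    """
    t = np.linspace(0.0, duration, int(RATE * duration), endpoint=False)
    hz = PHONEME_HZ.get(name, 130.0) * pitch
    if hz == 0.0:                      # silence for stops
        return np.zeros_like(t)
    return 0.3 * np.sin(2.0 * np.pi * hz * t)

def blend(chunks, overlap=0.01):
    """Concatenate phoneme waveforms with a short crossfade to avoid clicks."""
    n = int(RATE * overlap)
    out = chunks[0]
    for chunk in chunks[1:]:
        fade = np.linspace(1.0, 0.0, n)
        mixed = out[-n:] * fade + chunk[:n] * fade[::-1]
        out = np.concatenate([out[:-n], mixed, chunk[n:]])
    return out

phrase = [("oh", 1.0, 0.15), ("<stop>", 1.0, 0.05), ("f", 1.0, 0.08),
          ("aw", 0.95, 0.12), ("r", 0.9, 0.10)]
wave = blend([render(*p) for p in phrase])
print(f"{len(wave)} samples, {len(wave) / RATE:.2f} seconds of audio")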

 


