Speech Synthesis
April 12, 2013
Speech Synthesis: A Basic Overview
• Speech synthesis is the generation of speech by machine.
• The reasons for studying synthetic speech have evolved over the years:
1. Novelty
2. To control acoustic cues in perceptual studies
3. To understand the human articulatory system
• “Analysis by Synthesis”
4. Practical applications
• Reading machines for the blind, navigation systems
Speech Synthesis: A Basic Overview
• There are four basic types of synthetic speech:
1. Mechanical synthesis
2. Formant synthesis
• Based on Source/Filter theory
3. Concatenative synthesis
• = stringing bits and pieces of natural speech together
4. Articulatory synthesis
• = generating speech from a model of the vocal tract.
1. Mechanical Synthesis
• The very first attempts to produce synthetic speech were made without electricity.
• = mechanical synthesis
• In the late 1700s, models were produced which used:
• reeds as a voicing source
• differently shaped tubes for different vowels
Mechanical Synthesis, part II
• Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device…
• with independently manipulable source and filter mechanisms.
Mechanical Synthesis, part III
• An interesting historical footnote:
• Alexander Graham Bell and his “questionable” experiments with his dog.
• Mechanical synthesis has largely gone out of style ever since.
• …but check out Mike Brady’s talking robot.
The Voder
• The next big step in speech synthesis was to generate speech electronically.
• This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder.
• The Voder was a manually controlled speech synthesizer.
• (operated by highly trained young women)
Voder Principles
• The Voder basically operated like a vocoder.
• Voicing and fricative source sounds were filtered by 10 different resonators…
• each controlled by an individual finger!
• Only about 1 in 10 had the ability to learn how to play the Voder.
The Pattern Playback
• Shortly after the invention of the spectrograph, the pattern playback was developed.
• = basically a reverse spectrograph.
• Idea at this point was still to use speech synthesis to determine what the best cues were for particular sounds.
2. Formant Synthesis
• The next synthesizer was PAT (Parametric Artificial Talker).
• PAT was a parallel formant synthesizer.
• Idea: three formants are good enough for intelligible speech.
• Subtitles: What did you say before that? Tea or coffee? What have you done with it?
PAT Spectrogram
2. Formant Synthesis, part II
• Another formant synthesizer was OVE, built by the Swedish phonetician Gunnar Fant.
• OVE was a cascade formant synthesizer.
• In the ‘50s and ‘60s, people debated whether parallel or cascade synthesis was better.
• Weeks and weeks of tuning each system could get much better results:
Synthesis by rule
• The ultimate goal was to get machines to generate speech automatically, without any manual intervention.
• synthesis by rule
• A first attempt, on the Pattern Playback:
(I painted this by rule without looking at a spectrogram. Can you understand it?)
• Later, from 1961, on a cascade synthesizer:
• Note: first use of a computer to calculate rules for synthetic speech.
• Compare with the HAL 9000:
Parallel vs. Cascade
• The rivalry between the parallel and cascade camps continued into the ‘70s.
• Cascade synthesizers were good at producing vowels and required fewer control parameters…
• but were bad with nasals, stops and fricatives.
• Parallel synthesizers were better with nasals and fricatives, but not as good with vowels.
• Dennis Klatt proposed a synthesis (sorry):
• and combined the two…
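The cascade/parallel distinction can be made concrete with a small sketch. This is only an illustration under stated assumptions: the two-pole resonator below is a common textbook form, and the formant frequencies, bandwidths, amplitudes, sample rate, and function names are all hypothetical values chosen for the example, not settings from PAT, OVE, or Klatt's synthesizer.

```python
import math


def resonator(x, freq, bandwidth, sample_rate):
    """Two-pole digital resonator, scaled to have unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth * t)
    b = 2 * math.exp(-math.pi * bandwidth * t) * math.cos(2 * math.pi * freq * t)
    a = 1.0 - b - c
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = a * sample + b * y1 + c * y2
        y.append(out)
        y1, y2 = out, y1
    return y


def cascade(x, formants, sample_rate):
    """Cascade: the output of each resonator feeds the next one."""
    for freq, bw in formants:
        x = resonator(x, freq, bw, sample_rate)
    return x


def parallel(x, formants, amplitudes, sample_rate):
    """Parallel: every resonator sees the same source; outputs are weighted and summed."""
    branches = [resonator(x, f, bw, sample_rate) for f, bw in formants]
    return [sum(amp * br[n] for amp, br in zip(amplitudes, branches))
            for n in range(len(x))]


FS = 10000                                        # assumed sample rate (Hz)
FORMANTS = [(500, 60), (1500, 90), (2500, 150)]   # hypothetical (frequency, bandwidth) pairs
source = [1.0] * 3000                             # constant input, used only to check the 0 Hz gain
cascade_out = cascade(source, FORMANTS, FS)
parallel_out = parallel(source, FORMANTS, [1.0, 0.5, 0.25], FS)
```

Note that `cascade` needs only a frequency and bandwidth per formant, while `parallel` also needs one amplitude per formant, which is one way to see why cascade synthesizers required fewer control parameters.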
KlattTalk
• KlattTalk has since become the standard for formant synthesis. (DECTalk)
http://www.asel.udel.edu/speech/tutorials/synthesis/vowels.html
KlattVoice
• Dennis Klatt also made significant improvements to the artificial voice source waveform.
• Perfect Paul:
• Beautiful Betty:
• Female voices have remained problematic.
• Also note: lack of jitter and shimmer
LPC Synthesis
• Another method of formant synthesis, developed in the ‘70s, is known as Linear Predictive Coding (LPC).
• Here’s an example:
• To recapitulate childhood: http://www.speaknspell.co.uk/
• As a general rule, LPC synthesis is pretty lousy.
• But it’s cheap!
• LPC synthesis greatly reduces the amount of information in speech…
Filters + LPC
• One way to understand LPC analysis is to think about a moving average filter.
• A moving average filter reduces noise in a signal by making each point equal to the average of the points surrounding it.
$y_n = (x_{n-2} + x_{n-1} + x_n + x_{n+1} + x_{n+2}) / 5$
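As a minimal sketch of the five-point moving average (plain Python; the function name `moving_average` and the choice to leave the two samples at each edge unsmoothed are assumptions made for illustration, since the slides do not say how edges are handled):

```python
def moving_average(x):
    """Five-point moving average: y[n] is the mean of x[n-2] .. x[n+2]."""
    y = list(x)  # edge samples lack a full window, so they are copied unchanged
    for n in range(2, len(x) - 2):
        y[n] = (x[n - 2] + x[n - 1] + x[n] + x[n + 1] + x[n + 2]) / 5
    return y


# A single spike gets spread out and flattened by the averaging.
spike = [0, 0, 0, 5, 0, 0, 0]
smoothed = moving_average(spike)
```

Here the spike of height 5 becomes a plateau of height 1 across the three interior samples, which is the noise-reducing effect described above.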
Filters + LPC
• Another way to write the smoothing equation is
• $y_n = 0.2\,x_{n-2} + 0.2\,x_{n-1} + 0.2\,x_n + 0.2\,x_{n+1} + 0.2\,x_{n+2}$
• Note that we could weight the different parts of the equation differently.
• Ex: $y_n = 0.1\,x_{n-2} + 0.2\,x_{n-1} + 0.4\,x_n + 0.2\,x_{n+1} + 0.1\,x_{n+2}$
• Another trick: try to predict future points in the waveform on the basis of only previous points.
• Objective: find the combination of weights that predicts future points as perfectly as possible.
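To illustrate what "finding the weights" means, here is a small sketch that fits just two prediction weights by least squares. The restriction to two weights, the function name `fit_two_weights`, and the sine-wave test signal are assumptions for the sake of a short example; real LPC analysis uses more weights and more efficient solvers.

```python
import math


def fit_two_weights(x):
    """Least-squares weights (a1, a2) for the prediction x[n] ~ a1*x[n-1] + a2*x[n-2]."""
    s11 = s12 = s22 = s01 = s02 = 0.0
    for n in range(2, len(x)):
        s11 += x[n - 1] * x[n - 1]
        s12 += x[n - 1] * x[n - 2]
        s22 += x[n - 2] * x[n - 2]
        s01 += x[n] * x[n - 1]
        s02 += x[n] * x[n - 2]
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    a1 = (s01 * s22 - s02 * s12) / det
    a2 = (s11 * s02 - s12 * s01) / det
    return a1, a2


# Hypothetical test signal: a sine wave at 0.05 cycles per sample.
omega = 2 * math.pi * 0.05
wave = [math.sin(omega * n) for n in range(200)]
a1, a2 = fit_two_weights(wave)
```

A pure sine wave satisfies x[n] = 2·cos(ω)·x[n-1] − x[n-2] exactly, so the fitted weights come out very close to 2·cos(ω) and −1: two previous points are enough to predict the next one perfectly.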
Deriving the Filter
• Let’s say that minimizing the prediction errors for a certain waveform yields the following equation:
• $y_n = 0.5\,x_n - 0.3\,x_{n-1} + 0.2\,x_{n-2} - 0.1\,x_{n-3}$
• The weights in the equation define a filter.
• Example: how would the values of y change if the input to the equation was a transient where:
• at time n, x = 1
• at all other times, x = 0
• Graph y at times n to n+3.
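The transient exercise can be checked with a few lines of Python. This is a sketch only; the function name `apply_filter` and the convention that samples before the start of the input count as zero are assumptions made for the example.

```python
WEIGHTS = [0.5, -0.3, 0.2, -0.1]  # weights on x[n], x[n-1], x[n-2], x[n-3]


def apply_filter(x, weights=WEIGHTS):
    """y[n] = sum over k of weights[k] * x[n-k]; samples before the start count as 0."""
    y = []
    for n in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            if n - k >= 0:
                total += w * x[n - k]
        y.append(total)
    return y


# Transient: x = 1 at time n (index 0 here) and 0 at all other times.
transient = [1, 0, 0, 0, 0, 0]
response = apply_filter(transient)
```

The response reads out the weights in order (0.5, -0.3, 0.2, -0.1) and is zero afterwards, so the output waveform for a transient is a direct picture of the filter itself.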
Decomposing the Filter
• Putting a transient into the weighted filter equation yields a new waveform:
• The new waveform reflects the weights in the equation.
• We can apply Fourier Analysis to the new waveform to determine its spectral characteristics.
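A sketch of that Fourier step, following the slides' simplified picture: take the waveform produced by the transient, pad it with zeros, and compute the magnitude of its discrete Fourier transform. The function name `magnitude_spectrum` and the 64-point transform length are assumptions for illustration.

```python
import cmath
import math


def magnitude_spectrum(waveform, n_points=64):
    """Magnitude of the DFT of a zero-padded waveform, from 0 Hz up to the Nyquist frequency."""
    padded = list(waveform) + [0.0] * (n_points - len(waveform))
    mags = []
    for k in range(n_points // 2 + 1):
        total = sum(padded[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                    for n in range(n_points))
        mags.append(abs(total))
    return mags


# The waveform produced by the transient: the filter weights themselves.
impulse_response = [0.5, -0.3, 0.2, -0.1]
spectrum = magnitude_spectrum(impulse_response)
```

Because the waveform has only four non-zero samples, its spectrum varies slowly with frequency, which is why the result looks smooth compared with the spectrum of the original speech.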
LPC Spectrum
• When we perform a Fourier Analysis on this waveform, we get a very smooth-looking spectrum function:
• This function is a good representation of what the vocal tract filter looks like.
[Figure: LPC spectrum vs. original spectrum]
LPC Applications
• Remember: the LPC spectrum is derived from the weights of a linear predictive equation.
• One thing we can do with the LPC-derived spectrum is estimate formant frequencies of a filter.
• (This is how Praat does it)
• Note: the more weights in the original equation, the more formants are assumed to be in the signal.
• We can also use that LPC-derived filter, in conjunction with a voice source, to create synthetic speech.
• (Like in the Speak & Spell)
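The formant-estimation idea can be sketched end to end in plain Python. Several things here are assumptions for illustration: the weights are found with the Levinson-Durbin recursion (one common method), the LPC-derived spectrum is computed in its usual all-pole form 1/|A|, peaks are picked on a 5 Hz grid, and the test signal is a made-up single resonance at 500 Hz. This is not a description of how Praat or any particular tool implements it.

```python
import cmath
import math


def lpc_weights(x, order):
    """Levinson-Durbin recursion on the autocorrelation of x.

    Returns weights a[1..order] such that x[n] is approximated by
    a[1]*x[n-1] + ... + a[order]*x[n-order].
    """
    r = [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def spectral_peaks(weights, sample_rate, step_hz=5.0):
    """Frequencies (Hz) where the LPC-derived spectrum 1/|A| has a local maximum."""
    freqs, mags = [], []
    f = 0.0
    while f <= sample_rate / 2:
        w = 2 * math.pi * f / sample_rate
        a_of_w = 1 - sum(a * cmath.exp(-1j * w * (k + 1))
                         for k, a in enumerate(weights))
        freqs.append(f)
        mags.append(1 / abs(a_of_w))
        f += step_hz
    return [freqs[i] for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]]


# Hypothetical test signal: the ringing of a single resonance at 500 Hz.
fs, f0, bw = 8000, 500.0, 100.0
radius = math.exp(-math.pi * bw / fs)
b1, b2 = 2 * radius * math.cos(2 * math.pi * f0 / fs), -radius ** 2
signal = [1.0, b1]
for n in range(2, 800):
    signal.append(b1 * signal[-1] + b2 * signal[-2])

weights = lpc_weights(signal, order=2)
formants = spectral_peaks(weights, fs)
```

Two weights are enough to capture one resonance, in line with the note that more weights correspond to more assumed formants; analysing real speech would call for a higher `order`.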
3. Concatenative Synthesis
• Formant synthesis dominated the synthetic speech world up until the ‘90s…
• Then concatenative synthesis started taking over.
• Basic idea: string together recorded samples of natural speech.
• Most common option: “diphone” synthesis
• Concatenated bits stretch from the middle of one phoneme to the middle of the next phoneme.
• Note: inventory has to include all possible phoneme sequences
• = only possible with lots of computer memory.
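A small sketch of how an utterance breaks down into diphones, and why the inventory grows quickly. The phoneme labels, the "#" silence marker at the utterance edges, and the 40-phoneme figure are hypothetical choices for the example.

```python
def to_diphones(phonemes, boundary="#"):
    """Split a phoneme sequence into diphones, padding with a silence marker at each edge."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]


def inventory_size(n_phonemes):
    """Rough upper bound: every phoneme followed by every phoneme (ignoring the silence marker)."""
    return n_phonemes ** 2


units = to_diphones(["k", "ae", "t"])   # a three-phoneme word, e.g. "cat"
size = inventory_size(40)               # 40 is an illustrative phoneme count
```

Each recorded unit in the inventory would span from the middle of the first phoneme in a pair to the middle of the second; with an illustrative 40 phonemes the bound is already 1600 units, which is the memory cost the slide refers to.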
Concatenated Samples
• Concatenative synthesis tends to sound more natural than formant synthesis.
• (basically because of better voice quality)
• Early (1977) combination of LPC + diphone synthesis:
• LPC + demisyllable-sized chunks (1980):
• More recent efforts with the MBROLA synthesizer:
• Also check out the Macintalk Pro synthesizer!
Recent Developments
• Contemporary concatenative speech synthesizers use variable unit selection.
• Idea: record a huge database of speech…
• And play back the largest unit of speech you can, whenever you can.
• Interesting development #2: synthetic voices tailored to particular speakers.
• Check it out:
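Returning to variable unit selection: the "play back the largest unit you can" idea can be sketched as a greedy longest-match search. This is a deliberate simplification under stated assumptions; the function name `select_units` and the tiny word-level database are hypothetical, and a practical system would also have to weigh how well neighbouring units join together.

```python
def select_units(target, database):
    """Cover the target sequence with the longest recorded units available, left to right."""
    units = []
    i = 0
    while i < len(target):
        for length in range(len(target) - i, 0, -1):
            candidate = tuple(target[i:i + length])
            if candidate in database:
                units.append(candidate)
                i += length
                break
        else:
            raise KeyError(f"no recorded unit covers {target[i]!r}")
    return units


# Hypothetical database of recorded stretches of speech, stored as word tuples.
DATABASE = {
    ("turn", "left"), ("turn",), ("left",), ("right",),
    ("in", "one", "mile"), ("in",), ("one",), ("mile",),
}
plan = select_units(["turn", "right", "in", "one", "mile"], DATABASE)
```

Here the three-word stretch is played back as a single recording, while "turn" and "right" fall back to single-word units because no longer recording covers them.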
4. Articulatory Synthesis
• Last but not least, there is articulatory synthesis.
• Generation of acoustic signals on the basis of models of the vocal tract.
• This is the most complicated of all synthesis paradigms.
• (we don’t understand articulations all that well)
• Some early attempts:
• Paul Boersma built his own articulatory synthesizer…
• and incorporated it into Praat.
Synthetic Speech Perception
• In the early days, speech scientists thought that synthetic speech would lead to a form of “super speech”
• = ideal speech, without any of the extraneous noise of natural productions.
• However, natural speech is always more intelligible than synthetic speech.
• And more natural sounding!
• But: perceptual learning is possible.
• Requires lots and lots of practice.
• And lots of variability. (words, phonemes, contexts)
• An extreme example: blind listeners.
More Perceptual Findings
1. Reducing the number of possible messages dramatically increases intelligibility.
More Perceptual Findings
2. Formant synthesis produces better vowels;
• Concatenative synthesis produces better consonants (and transitions)
3. Synthetic speech perception uses up more mental resources.
• memory and recall of number lists
4. Synthetic speech perception is a lot easier for native speakers of a language.
• And also adults.
5. Older listeners prefer slower rates of speech.
Audio-Visual Speech Synthesis
• The synthesis of audio-visual speech has primarily been spearheaded by Dominic Massaro, at UC-Santa Cruz.
• “Baldi”
• Basic findings:
• Synthetic visuals can induce the McGurk effect.
• Synthetic visuals improve perception of speech in noise
• …but not as well as natural visuals.
• Check out some samples.
Further Reading
• In case you’re curious:
• http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html
• http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/contents.html
Wait a minute…
• (Classical) Categorical perception really does occur…
• But only in limited circumstances.
• Works best for:
1. Sounds with rapid transitions
• (consonants, not vowels)
2. Tasks that require retaining more than one sound in memory.
• Ex: AXB discrimination induces more categoriality than AX discrimination.
• In these circumstances, sounds are stored in memory with less acoustic detail.
CP Results
[Figure: Responses to different pairs. Two panels (Experienced Listeners, New Listeners) plot % Different Responses (0% to 100%) against Different Pair (1-3 through 9-11), with Observed and Predicted series.]
• Generally: more “correct” different responses than predicted.
• Experienced listeners gave more different responses than new listeners.
CP Results
[Figure: Responses to same pairs. Two panels (Experienced Listeners, New Listeners) plot % Same Responses (0% to 100%) against Same Pair (1-1 through 11-11), with Observed and Predicted series.]
• Experienced listeners also gave more “different” responses in this condition.
• = Indicative of response bias
• A (pretend) example: traces = vowels from the Peterson & Barney data set.
• Activation of each trace depends on its distance (in vowel space) from the probe: traces near the probe are highly activated, distant traces only weakly.
[Figure labels: probe (*), highly activated traces, low activation]
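A toy version of the probe-and-traces picture, read here as "the closer a trace is to the probe, the more strongly it is activated" (as the figure labels suggest). Everything specific in it is an assumption: the exponential fall-off, the `sensitivity` constant, and the (F1, F2) values, which are round illustrative numbers and not actual Peterson & Barney measurements.

```python
import math

# Hypothetical stored traces: vowel label -> (F1, F2) in Hz, for illustration only.
TRACES = {"i": (300, 2300), "a": (700, 1100), "u": (300, 900)}


def activation(trace, probe, sensitivity=0.005):
    """Activation falls off exponentially with distance in (F1, F2) space."""
    distance = math.hypot(trace[0] - probe[0], trace[1] - probe[1])
    return math.exp(-sensitivity * distance)


probe = (320, 2200)  # an incoming vowel token close to the stored "i" trace
activations = {label: activation(point, probe) for label, point in TRACES.items()}
best = max(activations, key=activations.get)
```

With these made-up numbers the "i" trace is by far the most strongly activated, so the probe would be heard as that vowel.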
Filters + LPC
• In LPC analysis, we only look at previous points in the waveform to predict the current point in the waveform.
• Objective: reduce noise as much as possible
• Weights need to be adjusted to get the best possible prediction.
• (sort of like reducing the waveform to 0)
• Basic principle of LPC analysis:
• any point in a waveform can be regarded as the sum of a number of previous points,
• each of which has been multiplied by a suitable positive or negative number.
• = the LPC coefficients
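Written as a formula, the principle looks roughly as follows. The symbols are introduced here for illustration and do not appear in the slides: $p$ is the number of previous points used, $a_k$ are the LPC coefficients, $\hat{x}_n$ is the predicted point, and $e_n$ is the prediction error.

```latex
\hat{x}_n = \sum_{k=1}^{p} a_k \, x_{n-k},
\qquad
e_n = x_n - \hat{x}_n,
\qquad
\text{choose } a_1, \dots, a_p \text{ to minimize } \sum_n e_n^2 .
```

In this notation, "getting the best possible prediction" amounts to driving the error $e_n$ as close to zero as the chosen number of coefficients allows.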
Formant Synthesis
• Strategies, successes and problems
• Rule-based synthesis
• Enables an infinite number of sounds.
Some TTS Systems?
TTS Problems
• Homophones and all that.
• Names
• Numbers
• Interpretation
• Prosody
Reading Machines for the Blind?
• Maybe something about perception and rate.
KlattTalk
• Combination of cascade and parallel synthesizers, apparently.
KlattRules