Chapter 10 Speech Synthesis
Chapter 11 Automatic Speech Recognition
An Engineer’s Perspective
Speech production
Speech analysis
Speech coding
Speech quality assessment
Speech recognition
Speech synthesis
Speaker recognition
Speech enhancement
History
Long before modern electronic signal processing was invented, speech researchers tried to build machines to create human speech. Early examples of 'speaking heads' were made by Gerbert of Aurillac (d. 1003), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).
In 1779, the Danish scientist Christian Kratzenstein, working at the time at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (a, e, i, o, and u).
Kratzenstein's resonators
Engineering the vocal tract: Riesz 1937
Homer Dudley 1939 VODER
Synthesizing speech by electrical means
1939 World’s Fair
• Manually controlled through complex keyboard
• Operator training was a problem
Cooper’s Pattern Playback
Haskins Labs for investigating speech perception
Works like an inverse of a spectrograph
Light from a lamp goes through a rotating disk, then through a spectrogram, into photovoltaic cells
Thus the amount of light that gets transmitted at each frequency band corresponds to the amount of acoustic energy at that band
Modern TTS systems
1960’s
• First full TTS: Umeda et al (1968)
1970’s
• Joe Olive 1977 concatenation of linear-prediction diphones
• Speak and Spell
1980’s
• 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
1990’s-present
• Diphone synthesis
• Unit selection synthesis
Types of Modern Synthesis
Articulatory Synthesis: Model movements of articulators and acoustics of vocal tract
Formant Synthesis: Start with acoustics, create rules/filters to create each formant
Concatenative Synthesis: Use databases of stored speech to assemble new utterances.
Text from Richard Sproat slides
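The formant approach above can be illustrated with a small sketch: an impulse train (a crude stand-in for the glottal source) is passed through a cascade of second-order resonators, one per formant. The function names, the 16 kHz sample rate, and the formant frequencies and bandwidths are illustrative assumptions, not values from the slides.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator that shapes one formant peak."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # scales the filter so its gain at 0 Hz is 1
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, duration_s, sample_rate):
    """One unit impulse per pitch period: a crude voiced excitation."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Filter the excitation through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Assumed (frequency Hz, bandwidth Hz) pairs for an /a/-like vowel.
A_LIKE = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]
samples = synthesize_vowel(A_LIKE)
```

Changing the formant list changes the vowel quality, while changing `f0_hz` changes the pitch; a rule-based formant synthesizer would vary both over time.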
Articulatory Synthesis
aiueo (MPEG, 1850KB)
sasisuseso (MPEG, 2490KB)
Human Vocal Mimicry "hassei" (MPEG, 2167KB)
TTS Architecture
The three types of TTS
• Concatenative
• Formant
• Articulatory
Only cover the segments+f0+duration to waveform part.
A full system needs to go all the way from random text to sound.
TTS Architecture
Text in
Text Analysis: Text Normalization, Part-of-Speech tagging, Homonym Disambiguation
Phonetic Analysis: Dictionary Lookup, Grapheme-to-Phoneme (LTS)
Prosodic Analysis: Boundary placement, Pitch accent assignment, Duration computation
Waveform synthesis
Speech out
Dictionaries aren’t always sufficient
Unknown words
• Their number seems to grow linearly with the number of words in unseen text
• Mostly person, company, product names
• But also foreign words, etc.
So commercial systems have a 3-part system:
• Big dictionary
• Special code for handling names
• Machine-learned LTS system for other unknown words
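The 3-part lookup above can be sketched as a fallback chain. Everything in this example is a hypothetical stand-in: the tiny lexicons, the ARPAbet-style phone labels, and the one-letter-per-phone table that takes the place of a machine-learned LTS model.

```python
# Toy pronunciation dictionary (a real system would load a large lexicon).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "text": ["T", "EH", "K", "S", "T"],
}

# Toy stand-in for the special name-handling code.
NAME_LEXICON = {
    "klatt": ["K", "L", "AE", "T"],
}

# Naive letter-to-phone table standing in for a learned LTS model.
LETTER_TO_SOUND = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}

def pronounce(word):
    """Return (phones, source), trying dictionary, then names, then LTS."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "dictionary"
    if key in NAME_LEXICON:
        return NAME_LEXICON[key], "names"
    phones = []
    for letter in key:
        # Characters without a rule (digits, punctuation) are skipped.
        phones.extend(LETTER_TO_SOUND.get(letter, []))
    return phones, "lts"

for w in ["Speech", "Klatt", "zab"]:
    print(w, pronounce(w))
```

The ordering matters: the dictionary is the most reliable source, so the rule-based guess is only used when both lookups fail.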
Speech Recognition
Fundamentally, how the human auditory system works still largely remains a mystery
Existing approaches
• Template matching with dynamic time warping (DTW)
• Stochastic recognition with Hidden Markov Model (HMM)
State-of-the-art
• Small vocabularies (<100 words)
• Large vocabularies (>10000 words) but spoken in isolation
• Large and continuous but constrained to a certain task domain (e.g., only work for office correspondence at a particular company)
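Template matching with DTW can be sketched as follows. This is a minimal version under simplifying assumptions: each frame is a single number rather than a feature vector, the local cost is the absolute difference, and the template values are made up for illustration.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against b[j-1]
                cost[i][j - 1],      # frame of b repeated against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]

def recognize(utterance, templates):
    """Pick the template word with the smallest DTW distance to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical one-dimensional templates, one per vocabulary word.
TEMPLATES = {
    "yes": [1, 3, 4, 3, 1],
    "no": [5, 5, 2, 1, 1],
}

# A slower rendition of "yes": same shape, stretched in time.
slow_yes = [1, 1, 3, 3, 4, 4, 3, 1]
print(recognize(slow_yes, TEMPLATES))
```

The warping absorbs differences in speaking rate, which is why the stretched utterance still matches its template; the cost is a table of size len(a) × len(b) per template.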
Known Dimensions of Difficulty
• Speaker-dependent or speaker-independent
• Size of vocabulary
• Discrete vs. continuous
• The extent of ambiguity and acoustic confusability (e.g., “know” vs. “no”)
• Quiet vs. noisy environment
• Linguistic constraints and knowledge
Example: “Wreck a nice beach” (an acoustic near-match for “recognize speech”).
Vocabulary Size
Rule of thumb
• Small: 1-99 words (e.g., credit card and telephone number)
• Medium: 100-999 words (experimental lab systems for continuous recognition)
• Large: >1000 words (commercial products such as office correspondence and document retrieval)
Relevant to linguistic constraints
• Those constraints (e.g., grammar) help reduce the search space when vocabulary size increases
Speaker Dependency
Speaker dependent recognition
• You will be asked to use “speech tools” offered by Windows XP in a future assignment
• It requires retraining when the system is used by a new user
Speaker independent recognition
• Trained for multiple users and used by the same population
• Trained for some users but might be used by others (outside the training population)
Isolated vs. Continuous
Isolated Word Recognition (IWR)
• Discrete utterance of each word (minimum pause of 200ms is required)
Continuous Speech Recognition (CSR)
• User utters the message in a relatively (or completely) unconstrained manner
• Challenges: deal with unknown temporal boundaries; handle cross-word coarticulation effects and sloppy articulation (e.g., St. Louis Zoo vs. San Diego Zoo)
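The 200ms pause requirement is what makes word boundaries easy to find in IWR. A minimal sketch, assuming the input is a list of per-frame energies at 10 ms per frame and a fixed energy threshold (both assumptions, as are the function and parameter names):

```python
def segment_words(frame_energy, frame_ms=10, threshold=0.1, min_pause_ms=200):
    """Split an utterance into words wherever silence lasts at least min_pause_ms.

    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    min_pause_frames = min_pause_ms // frame_ms
    segments = []
    start = None        # first frame of the word in progress
    last_voiced = None  # most recent frame above the threshold
    for i, energy in enumerate(frame_energy):
        if energy > threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_pause_frames:
            # Silence has lasted long enough: close the current word.
            segments.append((start, last_voiced + 1))
            start = None
    if start is not None:
        segments.append((start, last_voiced + 1))
    return segments

# Synthetic energies: a word with a short 50 ms dip inside it,
# then a 250 ms pause, then a second word.
energies = [0.0] * 5 + [1.0] * 30 + [0.0] * 5 + [1.0] * 10 + [0.0] * 25 + [1.0] * 20 + [0.0] * 3
print(segment_words(energies))
```

Short dips below the threshold stay inside a word, so only pauses of 200ms or more create a boundary. In continuous speech no such pauses are guaranteed, which is the "unknown temporal boundaries" challenge listed above.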
Linguistic Constraints
Closely related to natural language processing
What are they?
• Grammatical constraints, lexical constraints, syntactic constraints
Examples
• Colorless paper packages crackle loudly
• Colorless yellow ideas sleep furiously (grammatically correct, semantically incorrect)
• Sleep roses dangerously young colorless (grammatically incorrect)
• Begn burea sferewrtet aweqwrq (lexically incorrect)
Acoustic Ambiguity and Confusability
Ambiguity
• Acoustically ambiguous words are indistinguishable in their spoken renditions
• Examples: “Know” vs. “No”, “Two” vs. “Too”
Confusability
• Refers to the extent to which words can be easily confused due to partial acoustic similarity
• Examples: “one” vs. “nine”, “B” vs. “D”
Environmental Noise
Background noise
• Other speakers, equipment sounds, air conditioners, construction noise, etc.
Speaker’s own action
• Lip smacks, breath noises, coughs, sneezes
Communication noise
• Channel errors, quantization noise
Unusual form of noise
• Deep-sea divers breathing a mixture of helium and oxygen
Sources of verification errors
Speech-Language-Hearing