Chapter 10 Speech Synthesis, Chapter 11 Automatic Speech Recognition


Page 1: Chapter 10 Speech Synthesis, Chapter 11 Automatic Speech Recognition · zhanglab.wdfiles.com/local--files/summer/SLHS1301_week8.pdf

Chapter 10 Speech Synthesis
Chapter 11 Automatic Speech Recognition

Page 2

An Engineer’s Perspective

Speech production
Speech analysis
Speech coding
Speech quality assessment
Speech recognition
Speech synthesis
Speaker recognition
Speech enhancement

Page 3

Page 4

Page 5

History

Long before modern electronic signal processing was invented, speech researchers tried to build machines to create human speech. Early examples of 'speaking heads' were made by Gerbert of Aurillac (d. 1003), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the time at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (a, e, i, o and u).

Page 6

Kratzenstein's resonators

Page 7

Engineering the vocal tract: Riesz 1937

Page 8

Homer Dudley 1939 VODER

Synthesizing speech by electrical means
1939 World’s Fair

• Manually controlled through complex keyboard
• Operator training was a problem

Page 9

Cooper’s Pattern Playback

• Haskins Labs, for investigating speech perception
• Works like an inverse of a spectrograph
• Light from a lamp goes through a rotating disk, then through a spectrogram, into photovoltaic cells
• Thus the amount of light that gets transmitted at each frequency band corresponds to the amount of acoustic energy at that band
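The "inverse of a spectrograph" idea can be sketched digitally: treat each column of a spectrogram as amplitudes for a bank of fixed-frequency tones and sum the tones. This is only an illustration of the principle. The band frequencies, frame length, sample rate and the function name `playback` are assumptions, and the real device was optical hardware, not software.

```python
import math

def playback(spectrogram, band_hz, frame_s=0.01, sample_rate=8000):
    """Sum one sinusoid per frequency band, scaled by that band's amplitude.

    spectrogram[t][k] is the (hypothetical) amplitude of band k in frame t,
    playing the role of the light transmitted through the painted pattern.
    """
    samples_per_frame = round(frame_s * sample_rate)
    out = []
    for t, frame in enumerate(spectrogram):
        for n in range(samples_per_frame):
            time = (t * samples_per_frame + n) / sample_rate
            out.append(sum(a * math.sin(2 * math.pi * f * time)
                           for a, f in zip(frame, band_hz)))
    return out

# Assumed band centres and a hand-"painted" pattern: a low tone, then a higher one.
bands = [120.0 * k for k in range(1, 6)]
pattern = [[0, 1, 0, 0, 0]] * 10 + [[0, 0, 0, 1, 0]] * 10
wave = playback(pattern, bands)
```

Each frame here lasts 10 ms, so the 20-frame pattern yields 0.2 s of audio samples.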

Page 10

Cooper’s Pattern Playback

Page 11

Modern TTS systems

• 1960’s
  • First full TTS: Umeda et al. (1968)
• 1970’s
  • Joe Olive 1977 concatenation of linear-prediction diphones
  • Speak and Spell
• 1980’s
  • 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1990’s-present
  • Diphone synthesis
  • Unit selection synthesis

Page 12

Types of Modern Synthesis

• Articulatory Synthesis: Model movements of articulators and acoustics of vocal tract
• Formant Synthesis: Start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis: Use databases of stored speech to assemble new utterances.

Text from Richard Sproat slides
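As a rough illustration of the formant approach, the sketch below passes an impulse-train source through one second-order resonator per formant. The resonator is the standard two-pole form y[n] = A·x[n] + B·y[n-1] + C·y[n-2]. The function names, the pitch, the formant frequencies and the bandwidths are illustrative assumptions, not values taken from the slides.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1 - b - c                      # normalises the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0=120, duration_s=0.3, sample_rate=16000):
    """Impulse-train source filtered by a cascade of formant resonators."""
    n = round(duration_s * sample_rate)
    period = sample_rate // f0         # samples between glottal pulses
    wave = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bandwidth in formants:
        wave = resonator(wave, freq, bandwidth, sample_rate)
    return wave

# Assumed (frequency, bandwidth) pairs in Hz for an /a/-like vowel.
vowel_a = synthesize_vowel([(730, 90), (1090, 110), (2440, 170)])
```

Changing only the formant list changes the vowel quality, which is the sense in which the rules and filters "create each formant".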

Page 13

Articulatory Synthesis

Page 14

Articulatory Synthesis

Page 15

Articulatory Synthesis

• aiueo (MPEG, 1850KB)
• sasisuseso (MPEG, 2490KB)
• Human Vocal Mimicry "hassei" (MPEG, 2167KB)

Page 16

TTS Architecture

• The three types of TTS
  • Concatenative
  • Formant
  • Articulatory
• These only cover the segments+f0+duration to waveform part.
• A full system needs to go all the way from arbitrary text to sound.

Page 17

TTS Architecture

Text in

• Text Analysis
  • Text Normalization
  • Part-of-Speech tagging
  • Homonym Disambiguation
• Phonetic Analysis
  • Dictionary Lookup
  • Grapheme-to-Phoneme (LTS)
• Prosodic Analysis
  • Boundary placement
  • Pitch accent assignment
  • Duration computation
• Waveform synthesis

Speech out
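The front-end stages above can be sketched as a chain of small functions that turn text into the segments+f0+duration stream a waveform synthesizer would consume. Everything here is a toy stand-in: the tables, the flat prosody, and all function names are assumptions made for illustration, and part-of-speech tagging and homonym disambiguation are left out.

```python
import re

ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}        # toy table
DIGITS = "zero one two three four five six seven eight nine".split()
LEXICON = {"doctor": ["D", "AA", "K", "T", "ER"],         # toy dictionary
           "two": ["T", "UW"]}

def text_analysis(text):
    """Text normalization only: expand abbreviations, read digits one by one."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token):
            words.extend(DIGITS[int(d)] for d in token)
        else:
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]

def phonetic_analysis(words):
    """Dictionary lookup; unknown words are marked rather than guessed."""
    return [LEXICON.get(w, ["<unk:" + w + ">"]) for w in words]

def prosodic_analysis(pronunciations, phone_ms=80, f0_hz=120.0):
    """Flat prosody: every phone gets the same duration and pitch."""
    return [(p, phone_ms, f0_hz) for phones in pronunciations for p in phones]

def front_end(text):
    """Text in, list of (segment, duration in ms, f0 in Hz) out."""
    return prosodic_analysis(phonetic_analysis(text_analysis(text)))

segments = front_end("Dr. 2")
```

A real system would replace each function with a far richer component, but the data flow between the stages stays the same.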

Page 18

Dictionaries aren’t always sufficient

• Unknown words
  • Their number seems to grow linearly with the number of words in unseen text
  • Mostly person, company, product names
  • But also foreign words, etc.
• So commercial systems have a 3-part system:
  • Big dictionary
  • Special code for handling names
  • Machine-learned LTS system for other unknown words
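The three-part lookup can be sketched as a simple fallback chain. The two tiny tables and the one-phone-per-letter map are hypothetical placeholders: a real system would use a large dictionary, a dedicated name module, and a trained letter-to-sound model instead of the naive map below.

```python
DICTIONARY = {"speech": ["S", "P", "IY", "CH"]}           # stands in for the big dictionary
NAME_LEXICON = {"dudley": ["D", "AH", "D", "L", "IY"]}    # stands in for the name handler

# Naive one-phone-per-letter map, standing in for a machine-learned LTS system.
LETTER_TO_PHONE = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W KS Y Z".split()))

def pronounce(word):
    """Return (phones, tier) from the first tier that knows the word."""
    w = word.lower()
    if w in DICTIONARY:
        return DICTIONARY[w], "dictionary"
    if w in NAME_LEXICON:
        return NAME_LEXICON[w], "names"
    return [LETTER_TO_PHONE[ch] for ch in w if ch in LETTER_TO_PHONE], "lts"

for word in ["Speech", "Dudley", "zog"]:
    phones, tier = pronounce(word)
    print(word, tier, " ".join(phones))
```

The ordering matters: the letter-to-sound fallback always returns something, so it has to come last.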

Page 19

Roadmap

Speech production
Speech analysis
Speech coding
Speech quality assessment
Speech recognition
Speech synthesis
Speaker recognition
Speech enhancement

Page 20

Speech Recognition

• Fundamentally speaking, how the human auditory system works still largely remains a mystery
• Existing approaches
  • Template matching with dynamic time warping (DTW)
  • Stochastic recognition with Hidden Markov Model (HMM)
• State-of-the-art
  • Small vocabularies (<100 words)
  • Large vocabularies (>10000) but spoken in isolation
  • Large and continuous but constrained to a certain task domain (e.g., only works for office correspondence at a particular company)
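Template matching with DTW can be illustrated with a minimal sketch. Scalar values stand in for the per-frame spectral feature vectors a real recognizer would compare, and the two templates and the helper name `recognize` are made up for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences, local cost |x - y|."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, templates):
    """Pick the template word with the smallest DTW cost to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 4, 1, 1]}   # hypothetical feature tracks
slow_yes = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]                  # same shape, spoken at half speed
best = recognize(slow_yes, templates)
```

Because the warping path may repeat frames, the slowed-down utterance still aligns with its template even though the two differ in length.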

Page 21

Known Dimensions of Difficulty

• Speaker-dependent or speaker-independent
• Size of vocabulary
• Discrete vs. continuous
• The extent of ambiguity and acoustic confusability (e.g., “know” vs. “no”)
• Quiet vs. noisy environment
• Linguistic constraints and knowledge
  • Example: “Wreck a nice beach” (a classic mishearing of “recognize speech”)

Page 22

Vocabulary Size

• Rule of thumb
  • Small: 1-99 words (e.g., credit card and telephone number)
  • Medium: 100-999 words (experimental lab systems for continuous recognition)
  • Large: 1000+ words (commercial products such as office correspondence and document retrieval)
• Relevant to linguistic constraints
  • Those constraints (e.g., grammar) help reduce the search space when vocabulary size increases
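How much a grammar shrinks the search space can be shown by counting candidate word sequences with and without a successor constraint. The digit vocabulary and the "same digit or next digit" grammar are invented purely for illustration.

```python
def count_sequences(vocabulary, length, allowed_next=None):
    """Count word sequences of `length` words (length >= 1).

    allowed_next maps a word to the words that may follow it; None means
    any word may follow any other (no linguistic constraint).
    """
    counts = {w: 1 for w in vocabulary}        # sequences of one word ending in w
    for _ in range(length - 1):
        new_counts = {w: 0 for w in vocabulary}
        for prev, c in counts.items():
            followers = vocabulary if allowed_next is None else allowed_next.get(prev, ())
            for nxt in followers:
                new_counts[nxt] += c
        counts = new_counts
    return sum(counts.values())

digits = [str(d) for d in range(10)]
# Hypothetical grammar: a digit may be followed only by itself or the next digit up.
grammar = {d: [d, str((int(d) + 1) % 10)] for d in digits}

unconstrained = count_sequences(digits, 4)
constrained = count_sequences(digits, 4, grammar)
```

With ten words and four positions, the unconstrained count is 10^4, while the toy grammar leaves 10 × 2^3 candidates.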

Page 23

Speaker Dependency

• Speaker-dependent recognition
  • You will be asked to use “speech tools” offered by Windows XP in a future assignment
  • It requires retraining when the system is used by a new user
• Speaker-independent recognition
  • Trained for multiple users and used by the same population
  • Trained for some users but might be used by others (outside the training population)

Page 24

Isolated vs. Continuous

• Isolated Word Recognition (IWR)
  • Discrete utterance of each word (minimum pause of 200ms is required)
• Continuous Speech Recognition (CSR)
  • User utters the message in a relatively (or completely) unconstrained manner
  • Challenges
    • Deal with unknown temporal boundaries
    • Handle cross-word coarticulation effects and sloppy articulation (e.g., St. Louis Zoo vs. San Diego Zoo)
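The 200 ms pause requirement is what makes word boundaries easy to find in isolated word recognition. A minimal sketch, assuming per-frame energy values, a 10 ms frame and an arbitrary silence threshold (all hypothetical choices, as is the function name):

```python
def split_on_pauses(energies, frame_ms=10, threshold=0.1, min_pause_ms=200):
    """Return (start, end) frame ranges of words separated by long pauses.

    Frames with energy below `threshold` count as silence. A silence shorter
    than `min_pause_ms` is kept inside the current word; `end` is exclusive.
    """
    min_pause_frames = min_pause_ms // frame_ms
    words = []
    start = last_speech = None
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            elif i - last_speech - 1 >= min_pause_frames:
                words.append((start, last_speech + 1))   # close the previous word
                start = i
            last_speech = i
    if start is not None:
        words.append((start, last_speech + 1))
    return words

# Speech, a 250 ms pause, speech, a 50 ms dip, more speech, trailing silence.
energies = [0] * 5 + [1] * 30 + [0] * 25 + [1] * 20 + [0] * 5 + [1] * 10 + [0] * 10
boundaries = split_on_pauses(energies)
```

Only the 250 ms gap splits the signal; the 50 ms dip stays inside the second word. Continuous speech offers no such gaps, which is why its temporal boundaries are unknown.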

Page 25

Linguistic Constraints

• Closely related to natural language processing
• What are they?
  • Grammatical constraints, lexical constraints, syntactic constraints
• Examples
  • Colorless paper packages crackle loudly
  • Colorless yellow ideas sleep furiously (grammatically correct, semantically incorrect)
  • Sleep roses dangerously young colorless (grammatically incorrect)
  • Begn burea sferewrtet aweqwrq (lexically incorrect)

Page 26

Acoustic Ambiguity and Confusability

• Ambiguity
  • Acoustically ambiguous words are indistinguishable in their spoken renditions
  • Examples: “Know” vs. “No”, “Two” vs. “Too”
• Confusability
  • Refers to the extent to which words can be easily confused due to partial acoustic similarity
  • Examples: “one” vs. “nine”, “B” vs. “D”
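One crude way to make the distinction concrete is to compare phone sequences: identical sequences are ambiguous, and a small edit distance hints at confusability. The pronunciations below are hand-written ARPAbet-style assumptions, and edit distance is only a rough proxy, since true confusability depends on how acoustically similar the differing phones are.

```python
PRONUNCIATIONS = {                     # hand-written, illustrative entries
    "know": ["N", "OW"], "no": ["N", "OW"],
    "two": ["T", "UW"], "too": ["T", "UW"],
    "one": ["W", "AH", "N"], "nine": ["N", "AY", "N"],
    "b": ["B", "IY"], "d": ["D", "IY"],
}

def phone_edit_distance(p, q):
    """Levenshtein distance computed over phone symbols."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]

def word_distance(w1, w2):
    """0 means the two words are acoustically ambiguous under this lexicon."""
    return phone_edit_distance(PRONUNCIATIONS[w1], PRONUNCIATIONS[w2])

for pair in [("know", "no"), ("two", "too"), ("one", "nine"), ("b", "d")]:
    print(pair, word_distance(*pair))
```

The ambiguous pairs cannot be separated by acoustics at all, so only linguistic context can resolve them, whereas the confusable pairs differ in at least one phone.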

Page 27

Environmental Noise

• Background noise
  • Other speakers, equipment sounds, air conditioners, construction noise, etc.
• Speaker’s own action
  • Lip smacks, breath noises, coughs, sneezes
• Communication noise
  • Channel errors, quantization noise
• Unusual form of noise
  • Deep-sea divers breathing a mixture of helium and oxygen

Page 28

Sources of verification errors

Page 29

Roadmap

Speech production
Speech analysis
Speech coding
Speech quality assessment
Speech recognition
Speech synthesis
Speaker recognition
Speech enhancement

Page 30

Speech-Language-Hearing