Chapter 10 Speech Synthesis
Chapter 11 Automatic Speech Recognition

An Engineer’s Perspective

Speech production

Speech analysis

Speech coding

Speech quality assessment

Speech recognition

Speech synthesis

Speaker recognition

Speech enhancement

History

Long before modern electronic signal processing was invented, speech researchers tried to build machines to create human speech. Early examples of 'speaking heads' were made by Gerbert of Aurillac (d. 1003), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the time at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (a, e, i, o, and u).

Kratzenstein's resonators

Engineering the vocal tract: Riesz 1937

Homer Dudley 1939 VODER

Synthesizing speech by electrical means
1939 World’s Fair

• Manually controlled through complex keyboard
• Operator training was a problem

Cooper’s Pattern Playback

• Built at Haskins Labs for investigating speech perception
• Works like an inverse of a spectrograph
• Light from a lamp goes through a rotating disk, then through a spectrogram into photovoltaic cells
• Thus the amount of light that gets transmitted at each frequency band corresponds to the amount of acoustic energy at that band

Cooper’s Pattern Playback

Modern TTS systems

• 1960’s
  • First full TTS: Umeda et al (1968)
• 1970’s
  • Joe Olive 1977: concatenation of linear-prediction diphones
  • Speak and Spell
• 1980’s
  • 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1990’s-present
  • Diphone synthesis
  • Unit selection synthesis

Types of Modern Synthesis

• Articulatory Synthesis: Model movements of articulators and acoustics of vocal tract
• Formant Synthesis: Start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis: Use databases of stored speech to assemble new utterances

Text from Richard Sproat slides
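As a rough illustration of the formant approach, the sketch below passes an impulse train through a cascade of two-pole digital resonators, one per formant. This is a minimal sketch and not the method of any particular synthesizer: the function names, the formant frequencies and bandwidths, the 100 Hz source, and the 10 kHz sample rate are all assumptions chosen for illustration.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of a two-pole resonator normalized to unity gain at DC."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Apply the difference equation y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0_hz=100.0, sample_rate_hz=10000, n_samples=2000):
    """Filter an impulse train (a crude voicing source) through cascaded resonators."""
    period = int(sample_rate_hz / f0_hz)
    signal = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal

# Assumed (frequency, bandwidth) pairs in Hz for an /a/-like vowel.
wave = synthesize_vowel([(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)])
```

Changing the formant list changes the vowel quality, which is the sense in which rules and filters "create each formant"; a usable synthesizer would also need a better source model and time-varying formant targets.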

Articulatory Synthesis

Human Vocal Mimicry: "hassei"
[Demo clips: aiueo (MPEG, 1850KB); sasisuseso (MPEG, 2490KB); one further clip (MPEG, 2167KB)]

TTS Architecture

• The three types of TTS
  • Concatenative
  • Formant
  • Articulatory
• These only cover the segments+f0+duration to waveform part.
• A full system needs to go all the way from arbitrary text to sound.

TTS Architecture

Text in
• Text Analysis
  • Text Normalization
  • Part-of-Speech tagging
  • Homonym Disambiguation
• Phonetic Analysis
  • Dictionary Lookup
  • Grapheme-to-Phoneme (letter-to-sound, LTS)
• Prosodic Analysis
  • Boundary placement
  • Pitch accent assignment
  • Duration computation
• Waveform synthesis
Speech out
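The front-end stages above can be sketched as a chain of small functions. This is only a toy outline under stated assumptions: the abbreviation table, the tiny lexicon, the ARPAbet-style phone symbols, and the flat 80 ms duration are all invented for illustration, and part-of-speech tagging, homonym disambiguation, and waveform synthesis are left out.

```python
import re

# Toy resources; every entry here is an illustrative assumption.
ABBREVIATIONS = {"dr.": "doctor", "st.": "saint"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
LEXICON = {"saint": ["S", "EY", "N", "T"], "louis": ["L", "UW", "IH", "S"],
           "zoo": ["Z", "UW"], "two": ["T", "UW"]}

def text_analysis(text):
    """Text normalization: lowercase, expand abbreviations, spell out digits."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif all(ch in DIGITS for ch in token):
            words.extend(DIGITS[ch] for ch in token)
        else:
            cleaned = re.sub(r"[^a-z']", "", token)
            if cleaned:
                words.append(cleaned)
    return words

def phonetic_analysis(words):
    """Dictionary lookup; unknown words get None instead of a guess."""
    return [(word, LEXICON.get(word)) for word in words]

def prosodic_analysis(phonetic):
    """Attach a flat placeholder duration in ms to every phone."""
    return [(word, [(p, 80) for p in phones] if phones else None)
            for word, phones in phonetic]

def front_end(text):
    """Run text analysis, phonetic analysis, and prosodic analysis in order."""
    return prosodic_analysis(phonetic_analysis(text_analysis(text)))
```

Note that this toy always expands "St." to "saint"; deciding between "saint" and "street" is exactly the kind of case the disambiguation stage exists for.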

Dictionaries aren’t always sufficient

• Unknown words
  • Their number seems to grow linearly with the number of words in unseen text
  • Mostly person, company, product names
  • But also foreign words, etc.
• So commercial systems have a 3-part system:
  • Big dictionary
  • Special code for handling names
  • Machine learned LTS system for other unknown words
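The three-part lookup can be sketched as a simple fallback chain. The dictionary entries, the name table, and the one-letter-to-one-phone map below are hypothetical stand-ins; in particular, the letter map merely takes the place of a machine-learned LTS model and is far too crude for real use.

```python
# Part 1: big dictionary (tiny toy version, assumed entries).
DICTIONARY = {"speech": ["S", "P", "IY", "CH"], "zoo": ["Z", "UW"]}

# Part 2: special handling for names (here just a second table, assumed entries).
NAME_PRONUNCIATIONS = {"klatt": ["K", "L", "AE", "T"]}

# Part 3: stand-in for a learned letter-to-sound model: one phone per letter.
LETTER_TO_PHONE = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W K Y Z".split(),
))

def pronounce(word):
    """Return (phones, source), trying dictionary, then names, then LTS."""
    key = word.lower()
    if key in DICTIONARY:
        return DICTIONARY[key], "dictionary"
    if key in NAME_PRONUNCIATIONS:
        return NAME_PRONUNCIATIONS[key], "names"
    phones = [LETTER_TO_PHONE[ch] for ch in key if ch in LETTER_TO_PHONE]
    return phones, "lts"
```

The ordering reflects a design choice: hand-checked entries win when available, and the least reliable component is consulted only when the others have nothing to offer.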

Roadmap

Speech production

Speech analysis

Speech coding

Speech quality assessment

Speech recognition

Speech synthesis

Speaker recognition

Speech enhancement

Speech Recognition

• Fundamentally speaking, how the human auditory system works still largely remains a mystery
• Existing approaches
  • Template matching with dynamic time warping (DTW)
  • Stochastic recognition with Hidden Markov Model (HMM)
• State-of-the-art
  • Small vocabularies (<100 words)
  • Large vocabularies (>10000) but spoken in isolation
  • Large and continuous but constrained to a certain task domain (e.g., only works for office correspondence at a particular company)
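Template matching with DTW can be illustrated with a short sketch. It uses the standard dynamic-programming recurrence over scalar features; a real recognizer would compare vectors of spectral features per frame, and the templates and the slowed-down test utterance here are made-up numbers.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best time-warped alignment of sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b's frame
                                     cost[i][j - 1],      # stretch a's frame
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the template word with the smallest DTW distance to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical one-dimensional templates, one per vocabulary word.
TEMPLATES = {"yes": [1.0, 3.0, 4.0, 2.0], "no": [4.0, 4.0, 1.0, 1.0]}

# The same "yes" contour spoken at half speed.
slow_yes = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 2.0, 2.0]
best = recognize(slow_yes, TEMPLATES)
```

Because the warping path may repeat frames, the half-speed utterance still aligns with the "yes" template at zero cost, which is the point of warping time before comparing.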

Known Dimensions of Difficulty

• Speaker-dependent or speaker-independent
• Size of vocabulary
• Discrete vs. continuous
• The extent of ambiguity and acoustic confusability (e.g., “know” vs. “no”)
• Quiet vs. noisy environment
• Linguistic constraints and knowledge
  • Example: “Wreck a nice beach” (sounds like “Recognize speech”)

Vocabulary Size

• Rule of thumb
  • Small: 1-99 words (e.g., credit card and telephone number)
  • Medium: 100-999 words (experimental lab systems for continuous recognition)
  • Large: 1000 or more words (commercial products such as office correspondence and document retrieval)
• Relevant to linguistic constraints
  • Those constraints (e.g., grammar) help reduce the search space when vocabulary size increases
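To make the search-space point concrete, the sketch below counts three-word hypotheses over a five-word vocabulary, first with no constraints and then under a word-pair grammar. The vocabulary and the grammar are hypothetical, and brute-force enumeration is used only because the example is tiny.

```python
from itertools import product

VOCABULARY = ["call", "dial", "home", "office", "now"]

# Hypothetical word-pair grammar: the set of words allowed to follow each word.
ALLOWED_NEXT = {
    "call": {"home", "office"},
    "dial": {"home", "office"},
    "home": {"now"},
    "office": {"now"},
    "now": set(),
}

def count_unconstrained(vocabulary, length):
    """Every word may follow every word: V ** length hypotheses."""
    return len(vocabulary) ** length

def count_constrained(vocabulary, allowed_next, length):
    """Count only the sequences in which each adjacent pair is allowed."""
    count = 0
    for seq in product(vocabulary, repeat=length):
        if all(nxt in allowed_next[prev] for prev, nxt in zip(seq, seq[1:])):
            count += 1
    return count

free = count_unconstrained(VOCABULARY, 3)                      # 125
legal = count_constrained(VOCABULARY, ALLOWED_NEXT, 3)         # 4
```

With this toy grammar only 4 of the 125 possible sequences survive, and the gap widens quickly as the vocabulary or the utterance length grows.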

Speaker Dependency

• Speaker dependent recognition
  • You will be asked to use “speech tools” offered by Windows XP in a future assignment
  • It requires retraining when the system is used by a new user
• Speaker independent recognition
  • Trained for multiple users and used by the same population
  • Trained for some users but might be used by others (outside the training population)

Isolated vs. Continuous

• Isolated Word Recognition (IWR)
  • Discrete utterance of each word (minimum pause of 200ms is required)
• Continuous Speech Recognition (CSR)
  • User utters the message in a relatively (or completely) unconstrained manner
  • Challenges
    • Deal with unknown temporal boundaries
    • Handle cross-word coarticulation effects and sloppy articulation (e.g., St. Louis Zoo vs. San Diego Zoo)
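The 200 ms pause requirement is what makes word boundaries easy to find in isolated word recognition. A minimal sketch of pause-based segmentation follows, assuming the input is a list of per-frame energies with 10 ms frames and a fixed energy threshold; the function name, frame length, and threshold are illustrative assumptions.

```python
def segment_words(frame_energies, frame_ms=10, threshold=0.1, min_pause_ms=200):
    """Return (start, end) frame ranges, end exclusive, split at long pauses."""
    min_pause_frames = min_pause_ms // frame_ms
    segments = []
    start = None        # first frame of the word in progress, if any
    last_voiced = None  # most recent frame above the threshold
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_pause_frames:
            # Silence has lasted long enough to count as a word boundary.
            segments.append((start, last_voiced + 1))
            start = None
    if start is not None:
        segments.append((start, last_voiced + 1))
    return segments

# 300 ms word, 250 ms pause, then a word containing a short 50 ms dip.
energies = ([0.0] * 5 + [1.0] * 30 + [0.0] * 25 +
            [1.0] * 20 + [0.0] * 5 + [1.0] * 10 + [0.0] * 30)
words = segment_words(energies)
```

The 50 ms dip does not split the second word because it is shorter than the required pause. Continuous speech offers no such guaranteed silences, which is why its temporal boundaries are unknown.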

Linguistic Constraints

• Closely related to natural language processing
• What are they?
  • Grammatical constraints, lexical constraints, syntactic constraints
• Examples
  • Colorless paper packages crackle loudly
  • Colorless yellow ideas sleep furiously (grammatically correct, semantically incorrect)
  • Sleep roses dangerously young colorless (grammatically incorrect)
  • Begn burea sferewrtet aweqwrq (lexically incorrect)
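The lexical and grammatical checks in these examples can be sketched with a toy lexicon and a single sentence pattern. Both are assumptions made up for illustration: the part-of-speech tags are hand-assigned, the pattern covers only "adjectives, nouns, verb, optional adverb", and semantic plausibility is not modeled at all.

```python
import re

# Hand-assigned part-of-speech tags for the words in the slide examples.
LEXICON = {
    "colorless": "ADJ", "yellow": "ADJ", "young": "ADJ",
    "paper": "NOUN", "packages": "NOUN", "ideas": "NOUN", "roses": "NOUN",
    "crackle": "VERB", "sleep": "VERB",
    "loudly": "ADV", "furiously": "ADV", "dangerously": "ADV",
}

# Toy grammar: any adjectives, one or more nouns, a verb, an optional adverb.
SENTENCE_PATTERN = re.compile(r"(ADJ )*(NOUN )+VERB( ADV)?")

def check(sentence):
    """Classify a sentence using the lexical check first, then the grammar."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if any(w not in LEXICON for w in words):
        return "lexically incorrect"
    tags = " ".join(LEXICON[w] for w in words)
    if SENTENCE_PATTERN.fullmatch(tags) is None:
        return "grammatically incorrect"
    return "accepted"
```

Under this sketch "Colorless yellow ideas sleep furiously" is accepted, since rejecting it would require a semantic constraint that neither the lexicon nor the pattern encodes.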

Acoustic Ambiguity and Confusability

• Ambiguity
  • Acoustically ambiguous words are indistinguishable in their spoken renditions
  • Examples: “Know” vs. “No”, “Two” vs. “Too”
• Confusability
  • Refers to the extent to which words can be easily confused due to partial acoustic similarity
  • Examples: “one” vs. “nine”, “B” vs. “D”
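One way to make the distinction concrete is to compare phone sequences: ambiguous words share an identical sequence, while confusable words differ in only a few phones. The sketch below uses ARPAbet-style pronunciations written out by hand and a phone-level edit distance as a crude proxy; real confusability depends on acoustic similarity between phones, which this measure ignores.

```python
# Hand-written ARPAbet-style pronunciations (stress marks omitted).
PRONUNCIATIONS = {
    "know": ("N", "OW"), "no": ("N", "OW"),
    "two": ("T", "UW"), "too": ("T", "UW"),
    "one": ("W", "AH", "N"), "nine": ("N", "AY", "N"),
    "B": ("B", "IY"), "D": ("D", "IY"),
}

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete a phone
                           cur[j - 1] + 1,             # insert a phone
                           prev[j - 1] + (pa != pb)))  # substitute or match
        prev = cur
    return prev[-1]

def is_ambiguous(w1, w2):
    """True when the two words have identical phone sequences."""
    return PRONUNCIATIONS[w1] == PRONUNCIATIONS[w2]

def confusability(w1, w2):
    """Phone-level edit distance; smaller values mean more confusable."""
    return edit_distance(PRONUNCIATIONS[w1], PRONUNCIATIONS[w2])
```

Here "know" and "no" come out as ambiguous (distance 0), "B" and "D" differ by one phone, and "one" and "nine" differ by two while sharing the final nasal.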

Environmental Noise

• Background noise
  • Other speakers, equipment sounds, air conditioners, construction noise, etc.
• Speaker’s own action
  • Lip smacks, breath noises, coughs, sneezes
• Communication noise
  • Channel errors, quantization noise
• Unusual form of noise
  • Deep-sea divers breathing a mixture of helium and oxygen

Sources of verification errors

Roadmap

Speech production

Speech analysis

Speech coding

Speech quality assessment

Speech recognition

Speech synthesis

Speaker recognition

Speech enhancement

Speech-Language-Hearing
