speech signal processing lizy

SPEECH SIGNAL PROCESSING
KERALA UNIVERSITY M-TECH 1ST SEMESTER

Lizy Abraham, Assistant Professor, Department of ECE
LBS Institute of Technology for Women (A Govt. of Kerala Undertaking)
Poojappura, Trivandrum 695012, Kerala, India
[email protected] | +91 9495123331

Upload: lizy-abraham

Post on 24-Apr-2015

DESCRIPTION

Based on the Kerala University M-Tech 1st Semester Speech Signal Processing syllabus (Signal Processing branch).

TRANSCRIPT

Page 1: Speech signal processing lizy

SPEECH SIGNAL PROCESSING
KERALA UNIVERSITY M-TECH 1ST SEMESTER

Lizy Abraham
[email protected]
+91 9495123331
Assistant Professor
Department of ECE
LBS Institute of Technology for Women
(A Govt. of Kerala Undertaking)
Poojappura
Trivandrum 695012
Kerala, India

1

Page 2: Speech signal processing lizy

SYLLABUS

TSC 1004 SPEECH SIGNAL PROCESSING 3-0-0-3

Speech Production: Acoustic theory of speech production (excitation, vocal tract model for speech analysis, formant structure, pitch). Articulatory phonetics (articulation, voicing, articulatory model). Acoustic phonetics (basic speech units and their classification).

Speech Analysis: Short-time speech analysis; time domain analysis (short-time energy, short-time zero-crossing rate, ACF); frequency domain analysis (filter banks, STFT, spectrogram, formant estimation and analysis); cepstral analysis.

Parametric representation of speech: AR model, ARMA model. LPC analysis (LPC model, autocorrelation method, covariance method, Levinson-Durbin algorithm, lattice form). LSF, LAR, MFCC, sinusoidal model, GMM, HMM.

Speech coding: Phase vocoder, LPC, sub-band coding, adaptive transform coding, harmonic coding, vector quantization based coders, CELP.

Speech processing: Fundamentals of speech recognition, speech segmentation, text-to-speech conversion, speech enhancement, speaker verification, language identification, issues of voice transmission over the Internet.

2

Page 3: Speech signal processing lizy

REFERENCES

1. Douglas O'Shaughnessy, Speech Communications: Human and Machine, IEEE Press, 2nd edition (hardcover), 1999. ISBN 0780334493.

2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, July 1999. ISBN 0471351547.

3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.

4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.

5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, 1st edition. ISBN 013242942X.

6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, September 1999. ISBN 0471349593.

For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, out of which any five shall be answered. It shall have 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each, and 10 marks for assignments (minimum two) or a term project.

3

Page 4: Speech signal processing lizy

Speech processing means the processing of discrete-time speech signals.

4

Page 5: Speech signal processing lizy

Speech Processing draws on several fields:

Signal processing: Fourier transforms, discrete-time filters, AR(MA) models
Information theory: entropy, communication theory, rate-distortion theory
Phonetics: speech production
Acoustics: psychoacoustics, room acoustics
Algorithms (programming)
Statistical SP: stochastic models

5

Page 6: Speech signal processing lizy

6

Page 7: Speech signal processing lizy

7

Page 8: Speech signal processing lizy

HOW IS SPEECH PRODUCED?

Speech can be defined as "a pressure acoustic signal that is articulated in the vocal tract."

Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract.

8

Page 9: Speech signal processing lizy

This air flow is referred to as the "excitation signal".

This excitation signal causes the vocal cords to vibrate and propagate the energy to excite the oral and nasal openings, which play a major role in shaping the sound produced.

Vocal tract components:
- Oral tract: from the lips to the vocal cords.
- Nasal tract: from the velum to the nostrils.

9

Page 10: Speech signal processing lizy

10

Page 11: Speech signal processing lizy

11

Page 12: Speech signal processing lizy

• Larynx: the source of speech

• Vocal cords (folds): the two folds of tissue in the larynx. They

can open and shut like a pair of fans.

• Glottis: the gap between the vocal cords. As air is forced

through the glottis the vocal cords will start to vibrate and

modulate the air flow.

• The frequency of vibration determines the pitch of the voice (for

a male, 50-200Hz; for a female, up to 500Hz).

12

Page 13: Speech signal processing lizy

SPEECH PRODUCTION MODEL

13

Page 14: Speech signal processing lizy

Places of articulation

labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal

14

Page 15: Speech signal processing lizy

Classes of speech sounds

Voiced sounds: The vocal cords vibrate open and closed, producing quasi-periodic pulses of air. The rate of the opening and closing gives the pitch.

Unvoiced sounds: Produced by forcing air at high velocities through a constriction, giving noise-like turbulence. They show little long-term periodicity, though short-term correlations are still present. E.g. "S", "F".

Plosive sounds: A complete closure in the vocal tract; air pressure is built up and released suddenly. E.g. "B", "P".

15

Page 16: Speech signal processing lizy

Speech Model

16

Page 17: Speech signal processing lizy

SPEECH SOUNDS

Coarse classification is done with phonemes.

A phone is the acoustic realization of a phoneme.

Allophones are context-dependent realizations of phonemes.

17

Page 18: Speech signal processing lizy

PHONEME HIERARCHY

Phonemes are language dependent; there are about 50 in English.

Speech sounds
- Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
- Diphthongs: ay, ey, oy, aw
- Consonants:
  - Plosive: p, b, t, d, k, g
  - Nasal: m, n, ng
  - Fricative: f, v, th, dh, s, z, sh, zh, h
  - Retroflex liquid: r
  - Lateral liquid: l
  - Glide: w, y

18

Page 19: Speech signal processing lizy

19

Page 20: Speech signal processing lizy

20

Page 21: Speech signal processing lizy

Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic.

These differences result from the

distinctively different ways that these

sounds are produced.

21

Page 22: Speech signal processing lizy

22

Page 23: Speech signal processing lizy

Vowel Chart

        Front    Center    Back
High    i, ɪ               u, ʊ
Mid     e, ɛ     ə, ʌ      o
Low     æ        a

Page 24: Speech signal processing lizy

24

Page 25: Speech signal processing lizy

SPEECH WAVEFORM CHARACTERISTICS

Loudness

Voiced/Unvoiced.

Pitch.

Fundamental frequency.

Spectral envelope.

Formants.

25

Page 26: Speech signal processing lizy

ACOUSTIC CHARACTERISTICS OF SPEECH

Pitch: The signal within each voiced interval is periodic. The period T is called the "pitch period"; it depends on the vowel being spoken and changes over time (T ≈ 70 samples in this example).

f0 = 1/T is the fundamental frequency (also called the pitch frequency).

26

Page 27: Speech signal processing lizy

FORMANTS

Formants can be recognized in the frequency content

of the signal segment.

Formants are best described as high energy peaks in the

frequency spectrum of speech sound.

27

Page 28: Speech signal processing lizy

The resonant frequencies of the vocal tract are called formant frequencies, or simply formants.

The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

Under the linear time-invariant all-pole

assumption, each vocal tract shape is

characterized by a collection of formants.

28

Page 29: Speech signal processing lizy

Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form:

29
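The equation itself did not survive extraction; the standard all-pole forms referred to here (following, e.g., Quatieri) would be:

```latex
H(z) = \frac{A}{\prod_{k=1}^{N}\left(1 - c_k z^{-1}\right)}
     = \sum_{k=1}^{N} \frac{A_k}{1 - c_k z^{-1}},
```

with all poles \(c_k\) inside the unit circle.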

Page 30: Speech signal processing lizy

30

Page 31: Speech signal processing lizy

A detailed acoustic theory must consider the effects of the following:

• Time variation of the vocal tract shape

• Losses due to heat conduction and viscous friction at the vocal tract walls

• Softness of the vocal tract walls

• Radiation of sound at the lips

• Nasal coupling

• Excitation of sound in the vocal tract

Let us begin by considering a simple case of a lossless tube:

31

Page 32: Speech signal processing lizy

28 December 2012

MULTI-TUBE APPROXIMATION OF THE VOCAL

TRACT

We can represent the vocal tract as a concatenation of N lossless tubes with areas A_k and equal length Δx = l/N.

The wave propagation time through each tube is τ = Δx/c = l/(Nc).

32

Page 33: Speech signal processing lizy

33

Page 34: Speech signal processing lizy

Consider an N-tube model as in the previous figure. Each tube has length l_k and cross-sectional area A_k.

Assume:

- No losses

- Planar wave propagation

The wave equations for section k hold for 0 ≤ x ≤ l_k.

34
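The wave equations themselves were lost in extraction; for a lossless tube section k they take the standard form (pressure p_k, volume velocity u_k, air density ρ, sound speed c):

```latex
-\frac{\partial p_k}{\partial x} = \frac{\rho}{A_k}\,\frac{\partial u_k}{\partial t},
\qquad
-\frac{\partial u_k}{\partial x} = \frac{A_k}{\rho c^2}\,\frac{\partial p_k}{\partial t},
\qquad 0 \le x \le l_k .
```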

Page 35: Speech signal processing lizy

35

Page 36: Speech signal processing lizy


SOUND PROPAGATION IN THE CONCATENATED

TUBE MODEL

Boundary conditions:

Physical principle of continuity: pressure and volume velocity must be continuous both in time and in space everywhere in the system.

At the junction between the kth and (k+1)st tubes we have:

36
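The junction conditions themselves were lost in extraction; continuity of volume velocity and pressure at the kth/(k+1)st junction gives the standard relations:

```latex
u_k(l_k, t) = u_{k+1}(0, t), \qquad p_k(l_k, t) = p_{k+1}(0, t).
```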

Page 37: Speech signal processing lizy


ANALOGY WITH ELECTRICAL CIRCUIT

TRANSMISSION LINE

37

Page 38: Speech signal processing lizy


PROPAGATION OF SOUND IN A UNIFORM TUBE

The vocal tract transfer function of volume velocities is

38
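The equation itself did not survive extraction; the standard result for a lossless uniform tube excited at the glottis and open at the lips (see Quatieri) is:

```latex
V_a(j\Omega) = \frac{U(l, j\Omega)}{U_G(j\Omega)} = \frac{1}{\cos(\Omega l / c)} .
```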

Page 39: Speech signal processing lizy


PROPAGATION OF SOUND IN A UNIFORM TUBE

Using the boundary conditions U(0, s) = U_G(s) and P(−l, s) = 0

(derivation in the Quatieri text, pages 122–125)

39

The poles of the transfer function T(jΩ) are where cos(Ωl/c) = 0.

(See Quatieri, pages 119–124; the derivation of eqn. 4.18 is important.)

Page 40: Speech signal processing lizy


PROPAGATION OF SOUND IN A UNIFORM TUBE

(CON’T)

For c =34,000 cm/sec, l =17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, …

40

The transfer function of a tube with no side branches, excited at one end and response measured at another, only has poles

The formant frequencies will have finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat)

The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, …, where λi is the wavelength of the ith natural frequency

Page 41: Speech signal processing lizy


UNIFORM TUBE MODEL

Example

Consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz. Compare its resonances with those of a tube of length l = 17.5 cm.

The poles lie where cos(Ωl/c) = 0, so

Ω_k = kπc/(2l), k = 1, 3, 5, ...

f_k = Ω_k/(2π) = kc/(4l) = k × 350/(4 × 0.35) = k × 250 Hz

f = 250, 750, 1250, ... Hz

41

Page 42: Speech signal processing lizy


UNIFORM TUBE MODEL

For the 17.5 cm tube:

f_k = kc/(4l) = k × 350/(4 × 0.175) = k × 500 Hz, k = 1, 3, 5, ...

f = 500, 1500, 2500, ... Hz

42
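The two worked examples above can be checked numerically. A minimal sketch of the quarter-wavelength resonator formula f_k = kc/(4l), k odd (function name is illustrative):

```python
def tube_resonances(length_m, c=350.0, n=3):
    """Resonances (Hz) of a lossless tube closed at one end and open
    at the other: f_k = k*c/(4*l) for odd k = 1, 3, 5, ..."""
    return [(2 * i + 1) * c / (4.0 * length_m) for i in range(n)]

print(tube_resonances(0.35))   # ≈ [250, 750, 1250] Hz
print(tube_resonances(0.175))  # ≈ [500, 1500, 2500] Hz
```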

Page 43: Speech signal processing lizy

43

Page 44: Speech signal processing lizy

APPROXIMATING VOCAL TRACT SHAPES

44

Page 45: Speech signal processing lizy

45

Page 46: Speech signal processing lizy

VOWELS

Modeled as a tube closed at one end and open at the other

the closure is a membrane with a slit in it

the tube has uniform cross sectional area

membrane represents the source of energy (vocal folds)

the energy travels through the tube

the tube generates no energy on its own

the tube represents an important class of resonators

odd quarter length relationship

F_n = (2n−1)c/(4l)

Page 47: Speech signal processing lizy
Page 48: Speech signal processing lizy

VOWELS

Filter characteristics for vowels

the vocal tract is a dynamic filter

it is frequency dependent

it has, theoretically, an infinite number of resonances

each resonance has a center frequency, an amplitude and a bandwidth

for speech, these resonances are called formants

formants are numbered in succession from the lowest

F1, F2, F3, etc.

Page 49: Speech signal processing lizy

Fricatives

Modeled as a tube with a very severe constriction

The air exiting the constriction is turbulent

Because of the turbulence, there is no periodicity

unless accompanied by voicing

Page 50: Speech signal processing lizy

When a fricative constriction is tapered

the back cavity is involved

this resembles a tube closed at both ends

F_n = nc/(2l)

such a situation occurs primarily for articulation

disorders

Page 51: Speech signal processing lizy

Introduction to Digital Speech Processing (Rabiner & Schafer), pages 20–23

51

Page 52: Speech signal processing lizy

52

Page 53: Speech signal processing lizy

Rabiner & Schafer, pages 98–105

53

Page 54: Speech signal processing lizy

54

Page 55: Speech signal processing lizy


SOUND SOURCE: VOCAL FOLD VIBRATION

Modeled as a volume velocity source at the glottis, U_G(jΩ).

55

Page 56: Speech signal processing lizy

56

Page 57: Speech signal processing lizy

SHORT-TIME SPEECH ANALYSIS

Segments (or frames, or vectors) are typically of length 20 ms.

Within a segment, speech characteristics are approximately constant, which allows for relatively simple modeling.

Often overlapping segments are extracted.

57

Page 58: Speech signal processing lizy

SHORT-TIME ANALYSIS OF SPEECH

58

Page 59: Speech signal processing lizy

The system is an all-pole system with a system function of the form:

For all-pole linear systems, the input and output are related by a difference equation of the form:
59
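Both equations were lost in extraction; in the standard notation (e.g., Rabiner & Schafer) they read:

```latex
H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}},
\qquad
s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + G\, e[n].
```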

Page 60: Speech signal processing lizy

60

Page 61: Speech signal processing lizy

The operator T defines the nature of the short-time analysis function, and w[ˆn − m] represents a time-shifted window sequence.
61
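The slide's defining equation was lost in extraction; in Rabiner & Schafer's notation the general short-time analysis takes the form:

```latex
X_{\hat{n}} = \sum_{m=-\infty}^{\infty} T\big(x[m]\big)\, w[\hat{n}-m].
```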

Page 62: Speech signal processing lizy

62

Page 63: Speech signal processing lizy

SHORT-TIME ENERGY

Short-time energy is simple to compute, and useful for estimating properties of the excitation function in the model.

In this case the operator T is simply squaring of the windowed samples.

63
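A minimal sketch of the short-time energy, E_ˆn = Σ (x[m] w[ˆn−m])², assuming the 401-sample Hamming window and a 100-sample hop used later in these slides (the test signals are illustrative):

```python
import numpy as np

def short_time_energy(x, win_len=401, hop=100):
    """Short-time energy: sum of squared windowed samples per frame."""
    w = np.hamming(win_len)
    frames = range(0, len(x) - win_len + 1, hop)
    return np.array([np.sum((x[i:i + win_len] * w) ** 2) for i in frames])

# A louder signal should give a larger short-time energy
fs = 16000
t = np.arange(fs) / fs
quiet = 0.1 * np.sin(2 * np.pi * 120 * t)
loud = 1.0 * np.sin(2 * np.pi * 120 * t)
print(short_time_energy(loud).mean() > short_time_energy(quiet).mean())  # True
```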

Page 64: Speech signal processing lizy

SHORT-TIME ZERO-CROSSING RATE

Weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to:
64

Page 65: Speech signal processing lizy

Since ½|sgn x[m] − sgn x[m − 1]| is equal to 1 if x[m] and x[m − 1] have different algebraic signs and 0 if they have the same sign, it follows that the short-time zero-crossing rate is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[ˆn − m].

65
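The sign-difference formulation above can be sketched directly (window length, hop, and the test tones are illustrative choices):

```python
import numpy as np

def zero_crossing_rate(x, win_len=401, hop=100):
    """Short-time ZCR: mean of 0.5*|sgn x[m] - sgn x[m-1]| per window."""
    s = np.sign(x)
    s[s == 0] = 1                       # treat exact zeros as positive
    crossings = 0.5 * np.abs(np.diff(s))
    frames = range(0, len(crossings) - win_len + 1, hop)
    return np.array([crossings[i:i + win_len].mean() for i in frames])

fs = 16000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 100 * t)     # few sign changes per window
high = np.sin(2 * np.pi * 2000 * t)   # many sign changes per window
print(zero_crossing_rate(high).mean() > zero_crossing_rate(low).mean())  # True
```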

Page 66: Speech signal processing lizy

The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech.

In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate).

Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.

66

Page 67: Speech signal processing lizy

Short-time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore they can be sampled at a much lower rate than that of the original speech signal.

For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position ˆn in jumps of more than one sample.
67

Page 68: Speech signal processing lizy

During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval.

Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region.
68

Page 69: Speech signal processing lizy

SHORT-TIME AUTOCORRELATION FUNCTION (STACF)

The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods.

The STACF is defined as the deterministic autocorrelation function of the sequence xˆn[m] = x[m]w[ˆn − m] that is selected by the window shifted to time ˆn, i.e.,

69
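A sketch of the STACF and its use for periodicity detection; the window length, test period, and peak-search range are illustrative choices:

```python
import numpy as np

def stacf(x, n_hat, win_len=400):
    """Deterministic autocorrelation of the windowed segment at n_hat."""
    seg = x[n_hat:n_hat + win_len] * np.hamming(win_len)
    full = np.correlate(seg, seg, mode="full")
    return full[win_len - 1:]              # lags k = 0, 1, ..., win_len - 1

# Quasi-periodic test signal with period 80 samples
period = 80
x = np.tile(np.sin(2 * np.pi * np.arange(period) / period), 10)
R = stacf(x, 0)
k = 40 + np.argmax(R[40:160])              # strongest peak away from lag 0
print(k)  # ≈ 80, the period
```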

Page 70: Speech signal processing lizy

70

Page 71: Speech signal processing lizy

e[n] is the excitation to the linear system with impulse response h[n]. A well-known, and easily proved, property of the autocorrelation function is that the autocorrelation function of s[n] = e[n] * h[n] is the convolution of the autocorrelation functions of e[n] and h[n].
71

Page 72: Speech signal processing lizy

72

Page 73: Speech signal processing lizy

SHORT-TIME FOURIER TRANSFORM (STFT)

The expression for the discrete-time STFT at time n:

where w[n] is assumed to be non-zero only in the interval [0, Nw − 1] and is referred to as the analysis window, or sometimes as the analysis filter.

73
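The expression itself was lost in extraction; the standard Fourier-transform-view definition (as in Quatieri) is:

```latex
X(n, \omega) = \sum_{m=-\infty}^{\infty} x[m]\, w[n - m]\, e^{-j\omega m}.
```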

Page 74: Speech signal processing lizy

74

Page 75: Speech signal processing lizy

FILTERING VIEW

75

Page 76: Speech signal processing lizy

76

Page 77: Speech signal processing lizy

77

Page 78: Speech signal processing lizy

SHORT TIME SYNTHESIS

The problem is that of obtaining a sequence back from its discrete-time STFT.

This equation represents a synthesis equation for the discrete-time STFT.

78

Page 79: Speech signal processing lizy

FILTER BANK SUMMATION (FBS) METHOD

the discrete STFT is considered to be the set of

outputs of a bank of filters.

the output of each filter is modulated with a

complex exponential, and these modulated complex exponential, and these modulated

filter outputs are summed at each instant of

time to obtain the corresponding time sample

of the original sequence

That is, given a discrete STFT, X (n, k), the FBS

method synthesize a sequence y(n) satisfying

the following equation: 79
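The FBS relation can be sanity-checked numerically. A sketch, assuming a window shorter than the number of channels N, in which case the synthesis sum reduces to w[0]·x[n] (all sizes are illustrative):

```python
import numpy as np

N = 16                       # number of filter-bank channels (DFT length)
w = np.hamming(8)            # analysis window, shorter than N
x = np.random.default_rng(0).standard_normal(64)

n = 30                       # synthesis instant
m = np.arange(n - len(w) + 1, n + 1)       # samples under the window
seg = x[m] * w[n - m]                      # x[m] w[n-m]
X = np.array([np.sum(seg * np.exp(-2j * np.pi * k * m / N)) for k in range(N)])

# FBS synthesis: (1/N) * sum_k X(n,k) e^{j 2 pi k n / N} = w[0] * x[n]
y = np.sum(X * np.exp(2j * np.pi * np.arange(N) * n / N)) / N
print(np.allclose(y, w[0] * x[n]))  # True
```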

Page 80: Speech signal processing lizy

80

Page 81: Speech signal processing lizy

81

Page 82: Speech signal processing lizy

82

Page 83: Speech signal processing lizy

83

Page 84: Speech signal processing lizy

OVERLAP-ADD METHOD

Just as the FBS method was motivated from the

filteling view of the STFT, the OLA method is motivated

from the Fourier transform view of the STFT.

In this method, for each fixed time, we take the

inverse DFT of the corresponding frequency function inverse DFT of the corresponding frequency function

and divide the result by the analysis window.

However, instead of dividing out the analysis window

from each of the resulting short-time sections, we

perform an overlap and add operation between the

short-time sections.

84
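A numerical sketch of the overlap-add idea: with a periodic Hann window and hop N/2, the shifted windows sum to one, so overlap-adding the windowed sections reconstructs the interior samples of the signal (window length and hop are illustrative):

```python
import numpy as np

N, L = 64, 32                              # window length, hop (N/2)
n = np.arange(N)
w = 0.5 * (1 - np.cos(2 * np.pi * n / N))  # periodic Hann: COLA at hop N/2

x = np.random.default_rng(1).standard_normal(512)
y = np.zeros_like(x)
for start in range(0, len(x) - N + 1, L):
    frame = x[start:start + N] * w         # analysis (windowed section)
    y[start:start + N] += frame            # overlap-add synthesis

# Away from the edges, sum_r w[n - rL] = 1, so y matches x
print(np.allclose(y[N:-N], x[N:-N]))  # True
```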

Page 85: Speech signal processing lizy

Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by

85

Page 86: Speech signal processing lizy

86

Page 87: Speech signal processing lizy

Furthermore, if the discrete STFT has been decimated in time by a factor L, it can be similarly shown that the result holds if the analysis window satisfies

87

Page 88: Speech signal processing lizy

88

Page 89: Speech signal processing lizy

DESIGN OF DIGITAL FILTER BANKS

Rabiner & Schafer, pages 282–297

89

Page 90: Speech signal processing lizy

90

Page 91: Speech signal processing lizy

91

Page 92: Speech signal processing lizy

92

Page 93: Speech signal processing lizy

USING IIR FILTER

93

Page 94: Speech signal processing lizy

94

Page 95: Speech signal processing lizy

USING FIR FILTER

95

Page 96: Speech signal processing lizy

96

Page 97: Speech signal processing lizy

97

Page 98: Speech signal processing lizy

98

Page 99: Speech signal processing lizy

99

Page 100: Speech signal processing lizy

100

Page 101: Speech signal processing lizy

FILTER BANK ANALYSIS AND SYNTHESIS

101

Page 102: Speech signal processing lizy

102

Page 103: Speech signal processing lizy

103

Page 104: Speech signal processing lizy

FBS synthesis results in multiple copies of the

input:

104

Page 105: Speech signal processing lizy

PHASE VOCODER

The Fourier series is computed over a sliding window of a single pitch period duration and provides a measure of the amplitude and frequency trajectories of the musical tones.

105

Page 106: Speech signal processing lizy

106

Page 107: Speech signal processing lizy

107

Page 108: Speech signal processing lizy

which can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" of the latter being the kth filter's center frequency.

The STFT of a continuous-time signal is,

108

Page 109: Speech signal processing lizy

109

Page 110: Speech signal processing lizy

where is an initial condition.

The signal is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with generally a time-varying amplitude and frequency modulation.

An alternative expression is,

110

Page 111: Speech signal processing lizy

which is the time-domain counterpart to the

frequency-domain phase derivative.

111

Page 112: Speech signal processing lizy

We can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT.

112

Page 113: Speech signal processing lizy

113

Page 114: Speech signal processing lizy

114

Page 115: Speech signal processing lizy

115

Page 116: Speech signal processing lizy

116

Page 117: Speech signal processing lizy

117

Page 118: Speech signal processing lizy

SPEECH MODIFICATION

118

Page 119: Speech signal processing lizy

119

Page 120: Speech signal processing lizy

120

Page 121: Speech signal processing lizy

121

Page 122: Speech signal processing lizy

122

Page 123: Speech signal processing lizy

HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS

Use of the short-time cepstrum as a representation of speech, and as a basis for estimating the parameters of the speech generation model.

The cepstrum of a discrete-time signal is,

123
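The defining equation was lost in extraction; the (real) cepstrum is the inverse DFT of the log magnitude spectrum, which can be sketched as follows (the small floor constant guards against log of zero):

```python
import numpy as np

def real_cepstrum(x):
    """c[n] = IDFT{ log|DFT{x}| } -- the real cepstrum."""
    spectrum = np.fft.fft(x)
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real

x = np.random.default_rng(2).standard_normal(256)
c = real_cepstrum(x)
print(c.shape)  # (256,)
```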

Page 124: Speech signal processing lizy

124

Page 125: Speech signal processing lizy

That is, the complex cepstrum operator transforms convolution into addition.

This property is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.

125

Page 126: Speech signal processing lizy

The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., the computation of the phase angle arg[X(ejω)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.

126

Page 127: Speech signal processing lizy

THE SHORT-TIME CEPSTRUM

The short-time cepstrum is a sequence of

cepstra of windowed finite-duration segments

of the speech waveform.

127

Page 128: Speech signal processing lizy

128

Page 129: Speech signal processing lizy

RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM

Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are all inside the unit circle.

An example would be the impulse response of an all-pole vocal tract model with system function

129
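The recursion itself did not survive extraction. For a minimum-phase sequence x with x[0] > 0, the standard recursion (Oppenheim & Schafer) is x̂[0] = log x[0] and, for n > 0, x̂[n] = x[n]/x[0] − Σ_{k=1}^{n−1} (k/n) x̂[k] x[n−k]/x[0]. A sketch, checked against the known cepstrum of X(z) = 1 − a z⁻¹:

```python
import numpy as np

def complex_cepstrum_minphase(x, n_terms=16):
    """Recursive complex cepstrum of a minimum-phase sequence (x[0] > 0)."""
    xhat = np.zeros(n_terms)
    xhat[0] = np.log(x[0])
    for n in range(1, n_terms):
        xn = x[n] if n < len(x) else 0.0
        acc = sum((k / n) * xhat[k] * (x[n - k] if n - k < len(x) else 0.0)
                  for k in range(1, n))
        xhat[n] = xn / x[0] - acc / x[0]
    return xhat

# X(z) = 1 - a z^{-1} (zero at a, inside the unit circle) has complex
# cepstrum xhat[n] = -a^n / n for n >= 1
a = 0.5
xhat = complex_cepstrum_minphase(np.array([1.0, -a]), 6)
print(xhat[1:4])  # ≈ [-0.5, -0.125, -0.0417]
```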

Page 130: Speech signal processing lizy

In this case, all the poles ck must be inside the unit circle for stability of the system.

130

Page 131: Speech signal processing lizy

SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH (page 63, Rabiner & Schafer)

131

Page 132: Speech signal processing lizy

The low-quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high-quefrency components correspond to the more rapid fluctuations of the log spectrum.

132

Page 133: Speech signal processing lizy

The spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech.

This periodic structure in the log spectrum manifests itself in the cepstrum peak at a quefrency of about 9 ms.

The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech. Furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval.

The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as does the cepstrum.

The rapid variations of the unvoiced spectra, by contrast, appear random with no periodic structure. As a result, there is no strong peak indicating periodicity as in the voiced case.

133

Page 134: Speech signal processing lizy

These slowly varying log spectra clearly retain the general spectral shape, with peaks corresponding to the formant resonance structure for the segment of speech under analysis.

134

Page 135: Speech signal processing lizy

APPLICATION TO PITCH DETECTION

The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model.

The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec).

135

Page 136: Speech signal processing lizy

For positions 1 through 5, the window includes only unvoiced speech.

For positions 6 and 7, the signal within the window is partly voiced and partly unvoiced.

For positions 8 through 15, the window includes only voiced speech.

The rapid variations of the unvoiced spectra appear random with no periodic structure.

The spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech.

136

Page 137: Speech signal processing lizy

137

Page 138: Speech signal processing lizy

The cepstrum peak at a quefrency of about 11–12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval.

Presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period.

138
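A sketch of cepstral pitch detection on a synthetic quasi-periodic signal; the pitch period, window, decay constant, and quefrency search range are all illustrative choices:

```python
import numpy as np

fs = 8000
period = 80                              # pitch period in samples (100 Hz)
n = np.arange(400)                       # 50 ms window at 8 kHz
excitation = np.zeros(len(n))
excitation[::period] = 1.0               # impulse train
h = 0.98 ** n                            # crude "vocal tract" decay
x = np.convolve(excitation, h)[:len(n)] * np.hamming(len(n))

# Real cepstrum: IDFT of the log magnitude spectrum
c = np.fft.ifft(np.log(np.abs(np.fft.fft(x, 1024)) + 1e-12)).real

# Search for the pitch peak in a plausible quefrency range (2.5-15 ms)
lo, hi = 20, 120
q_peak = lo + np.argmax(c[lo:hi])
print(q_peak)  # ≈ 80 samples, i.e. 10 ms → 100 Hz
```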

Page 139: Speech signal processing lizy

MELMELMELMEL----FREQUENCY FREQUENCY FREQUENCY FREQUENCY CEPSTRUMCEPSTRUMCEPSTRUMCEPSTRUM COEFFICIENTS COEFFICIENTS COEFFICIENTS COEFFICIENTS

((((MFCCMFCCMFCCMFCC))))

The idea is to compute a frequency analysis based

upon a filter bank with approximately critical band

spacing of the filters and bandwidths.

For 4 kHz bandwidth, approximately 20 filters are used.

A short-time Fourier analysis is done first, resulting in a DFT Xn̂[k] for analysis time n̂.

Then the DFT values are grouped together in critical bands and weighted by a triangular weighting function.

139

Page 140: Speech signal processing lizy

The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate of 4 kHz, resulting in a total of 22 filters.

The mel-frequency spectrum at analysis time n̂ is defined for r = 1, 2, ..., R as
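The defining equation was an image in the original slides; the standard form (following Rabiner and Schafer, with Vr[k] the rth triangular weighting function defined on DFT bins Lr ≤ k ≤ Ur — an assumed notation) is:

```latex
MF_{\hat{n}}[r] = \frac{1}{A_r}\sum_{k=L_r}^{U_r}\left|V_r[k]\,X_{\hat{n}}[k]\right|^{2},
\qquad
A_r = \sum_{k=L_r}^{U_r}\left|V_r[k]\right|^{2}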

140

Page 141: Speech signal processing lizy

141

Page 142: Speech signal processing lizy

A_r is a normalizing factor for the rth mel-filter.

For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfccn̂[m].

142
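The chain described above (DFT, mel filter bank, log, DCT) can be sketched as follows. The mel-spaced triangular filters and the 440 Hz test tone are illustrative assumptions, not the slide's exact 22-filter bank with constant spacing below 1 kHz:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale (an assumption;
    this approximates the critical-band spacing described in the slides)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for r in range(1, n_filters + 1):
        l, c, u = bins[r - 1], bins[r], bins[r + 1]
        fb[r - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[r - 1, c:u] = (u - np.arange(c, u)) / max(u - c, 1)   # falling edge
    return fb

def mfcc(frame, fs, n_filters=20, n_ceps=13):
    n_fft = len(frame)
    X = np.fft.rfft(frame * np.hamming(n_fft))        # short-time DFT
    fb = mel_filterbank(n_filters, n_fft, fs)
    mel_spec = fb @ np.abs(X) ** 2                    # energy in each mel band
    log_mel = np.log(mel_spec + 1e-10)
    # DCT of the log filter-bank outputs -> cepstral coefficients
    m = np.arange(n_ceps)[:, None]
    r = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * m * (2 * r + 1) / (2 * n_filters))
    return dct @ log_mel

fs = 8000
t = np.arange(512) / fs
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), fs)        # 13 coefficients
```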

Page 143: Speech signal processing lizy

143

Page 144: Speech signal processing lizy

The figure shows the result of mfcc analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, LPC spectrum, and a homomorphically smoothed spectrum.

All these spectra are different, but they have in common that they have peaks at the formant resonances.

At higher frequencies, the reconstructed mel-spectrum has more smoothing due to the structure of the filter bank.

144

Page 145: Speech signal processing lizy

THE SPEECH SPECTROGRAM

Simply a display of the magnitude of the STFT.

Specifically, the images in the figure are plots of 20 log10 |Xr[k]| (in dB), where the plot axes are labeled in terms of analog time and frequency through the relations tr = rRT and fk = k/(NT), where T is the sampling period of the discrete-time signal x[n] = xa(nT).

145

Page 146: Speech signal processing lizy

In order to make the image smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than the window length L.

Such a function of two variables can be plotted

on a two dimensional surface as either a gray-

scale or a color-mapped image.

The bars on the right calibrate the color map (in dB).

146

Page 147: Speech signal processing lizy

147

Page 148: Speech signal processing lizy

If the analysis window is short, the spectrogram is called a wide-band spectrogram, which is characterized by good time resolution and poor frequency resolution.

When the window length is long, the spectrogram is a narrow-band spectrogram, which is characterized by good frequency resolution and poor time resolution.
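The wide-band versus narrow-band trade-off can be demonstrated with `scipy.signal.spectrogram`. The pulse-like test signal and the specific window lengths are assumptions for illustration:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 8000
t = np.arange(fs) / fs                           # 1 s of a toy "voiced" signal
x = np.sign(np.sin(2 * np.pi * 120 * t))         # 120 Hz pulse-like waveform

# Wide-band: short window (~5 ms) -> good time, poor frequency resolution
f_wb, t_wb, S_wb = spectrogram(x, fs, window='hamming', nperseg=40, noverlap=30)

# Narrow-band: long window (~45 ms) -> good frequency, poor time resolution
f_nb, t_nb, S_nb = spectrogram(x, fs, window='hamming', nperseg=360, noverlap=300)

S_db = 10 * np.log10(S_nb + 1e-12)               # what is typically displayed
```

The narrow-band analysis yields many more frequency bins but far fewer time frames than the wide-band one, which is the resolution trade-off described above.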

148

Page 149: Speech signal processing lizy

THE SPECTROGRAM

•A classic analysis tool.

– Consists of DFTs of overlapping, windowed frames.

•Displays the distribution of energy in time and frequency.

– 10 log10 |Xm(f)|² is typically displayed.

149

Page 150: Speech signal processing lizy

THE SPECTROGRAM CONT.

150

Page 151: Speech signal processing lizy

151

Page 152: Speech signal processing lizy

Note the three broad peaks in the spectrum slice at time tr = 430 ms, and observe that similar slices would be obtained at other times around tr = 430 ms.

These large peaks are representative of the

underlying resonances of the vocal tract at the

corresponding time in the production of the

speech signal.

152

Page 153: Speech signal processing lizy

The lower spectrogram is not as sensitive to

rapid time variations, but the resolution in the

frequency dimension is much better.

This window length is on the order of several pitch periods of the waveform during voiced intervals.

As a result, the spectrogram no longer displays

vertically oriented striations since several

periods are included in the window.

153

Page 154: Speech signal processing lizy

SHORT TIME ACF

[Figure: short-time ACF of the segments /m/, /ow/, and /s/]

154

Page 155: Speech signal processing lizy

CEPSTRUM

SPEECH WAVE (S) = EXCITATION (E) . FILTER (H)

[Figure: glottal excitation (E), from the vocal cords (glottis), drives the vocal tract filter (H) to produce the speech wave (S)]

155

http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif

Page 156: Speech signal processing lizy

CEPSTRAL ANALYSIS

Signal (s) = convolution (*) of glottal excitation (e) and vocal tract filter (h):

s(n) = e(n) * h(n), where n is the time index

After the Fourier transform, FT{s(n)} = FT{e(n) * h(n)}, convolution (*) becomes multiplication (.):

n (time) → w (frequency),

S(w) = E(w).H(w)

Find the magnitude of the spectrum:

|S(w)| = |E(w)|.|H(w)|

log10 |S(w)| = log10 |E(w)| + log10 |H(w)|

156

Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
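A quick numerical check of the convolution-to-multiplication property above; the toy excitation and filter are assumptions for illustration:

```python
import numpy as np

# Toy excitation (impulse train) and vocal-tract-like impulse response
e = np.zeros(128)
e[::32] = 1.0
h = 0.8 ** np.arange(16)
s = np.convolve(e, h)                      # s(n) = e(n) * h(n)

# Zero-pad everything to a common DFT length N >= len(s)
N = 256
S, E, H = (np.fft.rfft(v, N) for v in (s, e, h))

# Convolution in time is multiplication in frequency: S(w) = E(w).H(w)
assert np.allclose(S, E * H)

# ...so the log-magnitudes add: log10|S(w)| = log10|E(w)| + log10|H(w)|
log_S = np.log10(np.abs(S) + 1e-12)
```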

Page 157: Speech signal processing lizy

CEPSTRUM

c(n) = IDFT[log10 |S(w)|] = IDFT[log10 |E(w)| + log10 |H(w)|]

s(n) → windowing → DFT → X(w) → log|X(w)| → IDFT → c(n)

In c(n), you can see the excitation e(n) and the filter h(n) at two different positions.

Application: useful for (i) glottal excitation analysis and (ii) vocal tract filter analysis.

n = time index
w = frequency
IDFT = inverse discrete Fourier transform

157

Page 158: Speech signal processing lizy

EXAMPLE OF CEPSTRUM

sampling frequency 22.05 kHz

158

Page 159: Speech signal processing lizy

SUBBAND CODING

159

Page 160: Speech signal processing lizy

The time-decimated subband outputs are quantized and encoded, then decoded at the receiver.

In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually.

Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.

160

Page 161: Speech signal processing lizy

Quadrature mirror filters are one such filter class;

the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters.

Quadrature mirror filters can be further subdivided from high to low filters by splitting the fullband into two, then the resulting lower band into two, and so on.

161

Page 162: Speech signal processing lizy

This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect reconstruction filter bank.

Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.
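A minimal two-band perfect-reconstruction sketch using the length-2 (Haar) QMF pair, with one stage of the octave-band splitting described above; the specific filter choice is an illustrative assumption:

```python
import numpy as np

def analysis(x):
    """Split into low/high bands and decimate by 2 (Haar QMF pair:
    the highpass is the lowpass with alternating signs)."""
    pairs = x.reshape(-1, 2)
    lo = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # lowpass branch, decimated
    hi = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # highpass "mirror" branch
    return lo, hi

def synthesis(lo, hi):
    """Upsample, filter, and sum; the aliasing introduced by the two
    decimated branches cancels exactly -> perfect reconstruction."""
    y = np.empty(2 * len(lo))
    y[0::2] = (lo + hi) / np.sqrt(2)
    y[1::2] = (lo - hi) / np.sqrt(2)
    return y

x = np.random.default_rng(0).standard_normal(64)
lo, hi = analysis(x)
xr = synthesis(lo, hi)
assert np.allclose(xr, x)        # perfect reconstruction

# Octave-band splitting: keep dividing the *lower* band in two
ll, lh = analysis(lo)            # three bands total: ll, lh, hi
```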

162

Page 163: Speech signal processing lizy

163

Page 164: Speech signal processing lizy

164

LINEAR PREDICTION (INTRODUCTION):

The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both:

\hat{y}(n) = \sum_{j=0}^{q} b(j)\,x(n-j) - \sum_{i=1}^{p} a(i)\,y(n-i)

The factors a(i) and b(j) are called predictor coefficients.

Page 165: Speech signal processing lizy

165

LINEAR PREDICTION (INTRODUCTION):

Many systems of interest to us are describable by a linear, constant-coefficient difference equation:

\sum_{i=0}^{p} a(i)\,y(n-i) = \sum_{j=0}^{q} b(j)\,x(n-j)

If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then

N(z) = \sum_{j=0}^{q} b(j)\,z^{-j} \quad and \quad D(z) = \sum_{i=0}^{p} a(i)\,z^{-i}

Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).

Page 166: Speech signal processing lizy

166

LINEAR PREDICTION (TYPES OF SYSTEM MODEL):

There are two important variants:

All-pole model (in statistics, autoregressive (AR) model):

The numerator N(z) is a constant.

All-zero model (in statistics, moving-average (MA) model):

The denominator D(z) is equal to unity.

The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.

Page 167: Speech signal processing lizy

167

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):

Given a zero-mean signal y(n), in the AR model:

\hat{y}(n) = -\sum_{i=1}^{p} a(i)\,y(n-i)

The error is:

e(n) = y(n) - \hat{y}(n) = \sum_{i=0}^{p} a(i)\,y(n-i), \quad with \ a(0) = 1

To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those which make the error orthogonal to the samples y(n-1), y(n-2), ..., y(n-p).

Page 168: Speech signal processing lizy

168

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):

Thus we require that

\langle y(n-j)\,e(n) \rangle = 0 \quad for \ j = 1, 2, ..., p

Or,

\left\langle y(n-j) \sum_{i=0}^{p} a(i)\,y(n-i) \right\rangle = 0

Interchanging the operation of averaging and summing, and representing \langle \cdot \rangle by summing over n, we have

\sum_{i=0}^{p} a(i) \sum_{n} y(n-i)\,y(n-j) = 0, \quad j = 1, ..., p

The required predictors are found by solving these equations.

Page 169: Speech signal processing lizy

169

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):

The orthogonality principle also states that the resulting minimum error is given by

E = \langle e^{2}(n) \rangle = \langle e(n)\,y(n) \rangle

Or,

\sum_{i=0}^{p} a(i)\,\langle y(n-i)\,y(n) \rangle = E

We can minimize the error over all time:

\sum_{i=0}^{p} a(i)\,r_i = E, \quad where \quad r_i = \sum_{n=-\infty}^{\infty} y(n)\,y(n-i)

\sum_{i=0}^{p} a(i)\,r_{|i-j|} = 0, \quad j = 1, 2, ..., p
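The normal equations above can be solved directly. This sketch uses the autocorrelation method with a synthetic AR(2) test process; the test signal and tolerances are assumptions for illustration:

```python
import numpy as np

def lp_coefficients(y, p):
    """Solve the normal equations
       sum_{i=1}^{p} a(i) r(|i-j|) = -r(j),  j = 1..p
    (a(0) = 1 convention) for the predictor coefficients."""
    r = np.array([np.dot(y[: len(y) - i], y[i:]) for i in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])               # a(1) .. a(p)
    return np.concatenate(([1.0], a))

# Synthetic AR(2) signal: y(n) = 0.9 y(n-1) - 0.5 y(n-2) + e(n),
# i.e. a(0)=1, a(1)=-0.9, a(2)=0.5 in the convention above
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
y = np.zeros_like(e)
for n in range(2, len(y)):
    y[n] = 0.9 * y[n - 1] - 0.5 * y[n - 2] + e[n]

a = lp_coefficients(y, 2)                        # roughly [1, -0.9, 0.5]
```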

Page 170: Speech signal processing lizy

170

LINEAR PREDICTION (APPLICATIONS):

Autocorrelation matching:

We have a signal y(n) with known autocorrelation r_{yy}(n). We model this with the AR system shown below:

H(z) = \frac{\sigma}{A(z)} = \frac{\sigma}{1 - \sum_{i=1}^{p} a_i z^{-i}}

[Figure: white noise e(n), scaled by the gain \sigma, drives the all-pole filter 1/A(z) to produce y(n) with autocorrelation r_{yy}(n)]

Page 171: Speech signal processing lizy

171

LINEAR PREDICTION (ORDER OF LINEAR PREDICTION):

The choice of predictor order depends on the analysis bandwidth. The rule of thumb is:

p = \frac{2 \cdot BW}{1000} + c

For a normal vocal tract, there is an average of about one formant per kilohertz of BW.

One formant requires two complex conjugate poles.

Hence for every formant we require two predictor coefficients, or two coefficients per kilohertz of bandwidth, plus a small constant c for the remaining spectral shaping. For example, a 4 kHz analysis bandwidth gives p = 8 + c.

Page 172: Speech signal processing lizy

172

LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):

True model:

[Block diagram: for voiced speech, a DT impulse generator (controlled by the pitch period) drives the glottal filter G(z), producing the voiced volume velocity; for unvoiced speech, an uncorrelated noise generator supplies the input. A voiced/unvoiced (V/U) switch selects the excitation u(n), which is scaled by a gain and passed through the vocal tract filter H(z) and the lip radiation filter R(z) to produce the speech signal s(n).]

Page 173: Speech signal processing lizy

173

LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):

Using LP analysis:

[Block diagram: a DT impulse generator (voiced, controlled by the pitch estimate) and a white noise generator (unvoiced) feed a V/U switch; the selected excitation is scaled by a gain estimate and passed through a single all-pole (AR) filter H(z) to produce the estimate of the speech signal s(n).]