speech signal processing

Speech Signal Processing Murtadha Al-Sabbagh


Post on 12-Aug-2015


Page 1: Speech Signal Processing

Speech Signal Processing

Murtadha Al-Sabbagh

Page 2: Speech Signal Processing

• Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals.

• Speech processing can generally be divided into:
• 1. Recognition (discussed here).
• 2. Synthesis (not discussed here).

Page 3: Speech Signal Processing

Disciplines related to speech processing

• 1. Signal processing
• The process of extracting information from speech in an efficient manner.

• 2. Physics
• The science of understanding the relationship between the speech signal and physiological mechanisms.

• 3. Pattern recognition
• The set of algorithms that create patterns and match data to them according to their degree of likeness.

• 4. Computer science
• Making efficient algorithms that implement the methods of a speech recognition system in hardware or software.

• 5. Linguistics
• The relationship between the sounds and words of a language, the meaning of those words, and the overall meaning of sentences.

Page 4: Speech Signal Processing

Speech (phonemes)

• Sentences consist of words, which consist of phonemes.

• A phoneme is a basic unit of a language's spoken sounds.

Page 5: Speech Signal Processing

Speech Waveform Characteristics

• Loudness.
• Voiced/unvoiced.
• Voiced: vocal cords vibrating (periodic).
• Unvoiced: vocal cords not vibrating (aperiodic).
• Pitch.
• Spectral envelope.
• Formants: the spectral peaks of the sound spectrum.

Page 6: Speech Signal Processing

Aspects of speech processing covered here

• Pre-processing
• Feature extraction (analysis)
• Recognition

Page 7: Speech Signal Processing

Pre-processing

Page 8: Speech Signal Processing

Pre-processing
• After the speech signal has been received as an analog signal, we can treat (pre-process) it in three general ways:
• 1. In the time domain (speech wave)
• 2. In the frequency domain (spectral envelope)
• 3. As a combination of both (spectrogram)

[Figure: spectral envelope, energy versus frequency, with axis ticks at 1 kHz and 2 kHz]

Page 9: Speech Signal Processing

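Time domain
• Speech is captured by a microphone and sampled periodically (e.g. at 16 kHz) by an analogue-to-digital converter (ADC).
• Each converted sample is a 16-bit value.
• If the sampling rate is too low, the signal cannot be recovered correctly: by the Nyquist theorem, the sampling rate must be at least twice the highest frequency present in the signal.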

Page 10: Speech Signal Processing

• A sound is sampled at 22 kHz with a resolution of 16 bits. How many bytes are needed to store the sound wave for 10 seconds?

• Answer:
• One second holds 22K samples, so 10 seconds need 22K × 2 bytes × 10 seconds = 440K bytes.
• Note: 2 bytes are used per sample because 16 bits = 2 bytes.
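The storage calculation above can be sketched in a few lines of Python. The function name `storage_bytes` is an illustrative choice, not something from the slides, and the sketch assumes a single (mono) channel with no compression.

```python
def storage_bytes(sample_rate_hz, bits_per_sample, seconds):
    """Bytes needed to store uncompressed single-channel audio."""
    bytes_per_sample = bits_per_sample // 8  # 16 bits -> 2 bytes
    return sample_rate_hz * bytes_per_sample * seconds


# The slide's example: 22 kHz, 16-bit samples, 10 seconds.
print(storage_bytes(22_000, 16, 10))  # 440000
```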

Page 11: Speech Signal Processing

Time framing
• Since our ears cannot respond to very fast changes in speech content, we normally cut the speech data into frames before analysis (similar to watching rapidly changing still pictures to perceive motion).

• The frame size is 10-30 ms.
• Frames can overlap; the overlapping region normally ranges from 0 to 75% of the frame size.

Page 12: Speech Signal Processing

Time framing: continued
• For a speech wave sampled at 22 kHz with 16 bits per sample, the frame size is 15 ms and the frame overlapping period is 40% of the frame size. Draw the frame block diagram.
• Answer:
• Number of samples in one frame: N = 15 ms / (1/22k) = 330.
• Overlapping samples = 40% of 330 = 132, so m = N - 132 = 198.
• Overlapping time: x = 132 × (1/22k) = 6 ms.
• Time in one frame = 330 × (1/22k) = 15 ms.

[Figure: frame block diagram of the signal s_n against time n, showing the first window (i=1, length N), the second window (i=2) starting m samples later, and the overlapping time x]
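The framing arithmetic above can be checked with a short Python sketch. The helper names `frame_params` and `split_frames` are hypothetical, and dropping a trailing partial frame is an assumption made here for simplicity.

```python
def frame_params(sample_rate_hz, frame_ms, overlap_fraction):
    """Return (N, overlap, m): frame length, overlapping samples, frame step."""
    n = round(sample_rate_hz * frame_ms / 1000)
    overlap = round(n * overlap_fraction)
    return n, overlap, n - overlap


def split_frames(samples, n, m):
    """Cut samples into frames of length n whose starts are m samples apart.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    return [samples[start:start + n]
            for start in range(0, len(samples) - n + 1, m)]


print(frame_params(22_000, 15, 0.40))  # (330, 132, 198)
```

With N = 330 and m = 198, the second frame starts at sample 198 and shares its first 132 samples with the end of the first frame.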

Page 13: Speech Signal Processing

The frequency domain
• Use the DFT or FFT to transform the wave from the time domain to the frequency domain (i.e. to the spectral envelope).

$$\{X_m\}_{m=0,1,\ldots,N/2} = FT\{S_k\}_{k=0,1,\ldots,N-1}$$

$$X_m = \sum_{k=0}^{N-1} S_k \, e^{-j 2\pi k m / N}, \quad m = 0, 1, 2, \ldots, N/2$$

$$e^{j\theta} = \cos(\theta) + j\sin(\theta), \quad j = \sqrt{-1}$$

• Input (time domain): $S_0, S_1, \ldots, S_{N-1}$ (N samples in total, real numbers).
• Output (frequency domain) after the FT: $X_0, X_1, \ldots, X_{N/2}$, which are $(N/2)+1$ complex numbers.
• $X_m$ is complex, so $X_m = \text{real} + j\,\text{imaginary}$ and

$$|X_m| = \sqrt{\text{real}^2 + \text{imaginary}^2}$$
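As a rough illustration of the transform above, the following sketch evaluates the DFT sum directly and returns |X_m| for m = 0..N/2. It is a slow O(N²) version meant only to mirror the formula; a real system would use an FFT. The function name `dft_magnitudes` is an assumption, and the negative exponent sign follows the usual DFT convention.

```python
import cmath


def dft_magnitudes(samples):
    """Return |X_m| for m = 0 .. N/2, computed straight from the DFT sum."""
    n = len(samples)
    magnitudes = []
    for m in range(n // 2 + 1):
        x_m = sum(s * cmath.exp(-2j * cmath.pi * k * m / n)
                  for k, s in enumerate(samples))
        magnitudes.append(abs(x_m))  # sqrt(real^2 + imaginary^2)
    return magnitudes


# One cycle of a cosine over 4 samples puts all its energy in bin m = 1.
print(dft_magnitudes([1, 0, -1, 0]))
```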

Page 14: Speech Signal Processing

The frequency domain: continued

[Figure: spectral envelope, energy versus frequency, with axis ticks at 1 kHz and 2 kHz]

Page 15: Speech Signal Processing

The spectrogram: to see the spectral envelope as time moves forward

[Figure: spectrogram (specgram). The white bands are the formants, which represent the high-energy frequency content of the speech signal.]

Page 16: Speech Signal Processing

Feature Extraction (Analysis)

Page 17: Speech Signal Processing

feature extraction techniques

Page 18: Speech Signal Processing

(A) Filtering
• Ways to find the spectral envelope:
• Filter banks: uniform
• Filter banks can also be non-uniform
• LPC and cepstral LPC parameters

[Figure: spectral envelope, energy versus frequency, built from the outputs of filter 1, filter 2, filter 3, ...]

Page 19: Speech Signal Processing

Speech recognition idea using 4 linear filters, each with a bandwidth of 2.5 kHz

• Two sounds with two spectral envelopes, SEar and SEei, e.g. spectral envelope (SE) "ar" and spectral envelope "ei".

[Figure: two panels of filter output energy versus frequency, from 0 to 10 kHz. Spectrum A: spectral envelope SEar = "ar", with outputs v1, v2, v3, v4 from filters 1 to 4. Spectrum B: spectral envelope SEei = "ei", with outputs w1, w2, w3, w4 from filters 1 to 4.]

Page 20: Speech Signal Processing

Difference between two sounds (or spectral envelopes SE, SE')
• Difference between two sounds, e.g.
• SEar = {v1, v2, v3, v4} = "ar",
• SEei = {w1, w2, w3, w4} = "ei".
• A simple measure of the difference is
• $Dist = \sqrt{|v_1-w_1|^2 + |v_2-w_2|^2 + |v_3-w_3|^2 + |v_4-w_4|^2}$
• where |x| is the magnitude of x.
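The distance measure above is the ordinary Euclidean distance between the two vectors of filter outputs. A minimal Python sketch follows; the function name `spectral_distance` and the numeric filter outputs in the example are made up for illustration and do not come from the slides.

```python
import math


def spectral_distance(v, w):
    """Euclidean distance between two equal-length filter-bank output vectors."""
    if len(v) != len(w):
        raise ValueError("both envelopes need the same number of filters")
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(v, w)))


# Hypothetical outputs of 4 filters for two sounds.
se_ar = [1.0, 2.0, 3.0, 4.0]
se_ei = [1.0, 2.0, 3.0, 6.0]
print(spectral_distance(se_ar, se_ei))  # 2.0
```

A smaller distance means the two spectral envelopes are more alike, so an unknown sound can be assigned to the stored envelope it is closest to.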

Page 21: Speech Signal Processing

(B) Linear predictive coding (LPC)

• The concept is to find a set of parameters, i.e. a1, a2, a3, a4, ..., ap (here p = 8), to represent the same waveform (typical values of p are 8 to 13).

• For example, the input waveform is cut into 30 ms time frames (time frame y, time frame y+1, time frame y+2, ...).
• Each time frame y has 512 samples (S0, S1, S2, ..., Sn, ..., SN-1, with N-1 = 511), i.e. 512 integer numbers of 16 bits each.
• Each frame is represented by a set of 8 floating-point numbers (the data is compressed): a1, a2, a3, a4, ..., a8 for frame y; a'1, a'2, a'3, a'4, ..., a'8 for frame y+1; a''1, a''2, a''3, a''4, ..., a''8 for frame y+2; and so on.
• The waveform can be reconstructed from these LPC codes.

Page 22: Speech Signal Processing

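The LPC coefficients $a_1, \ldots, a_p$ are found from the auto-correlation values $r_0, \ldots, r_p$ by solving:

$$\begin{bmatrix} r_0 & r_1 & r_2 & \cdots & r_{p-1} \\ r_1 & r_0 & r_1 & \cdots & r_{p-2} \\ r_2 & r_1 & r_0 & \cdots & r_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & r_{p-3} & \cdots & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix}$$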

Page 23: Speech Signal Processing

Example

• A speech waveform S has the values s0, s1, s2, s3, s4, s5, s6, s7, s8 = [1, 3, 2, 1, 4, 1, 2, 4, 3]. The frame size is 4.

• Find the auto-correlation parameters r0, r1, r2 for the first frame.
• If we use LPC of order 2 for our feature extraction system, find the LPC coefficients a1, a2.

Page 24: Speech Signal Processing

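Answer:
• Frame size = 4, so the first frame is [1, 3, 2, 1].

• r0 = 1×1 + 3×3 + 2×2 + 1×1 = 15
• r1 = 3×1 + 2×3 + 1×2 = 11
• r2 = 2×1 + 1×3 = 5

$$\begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} 15 & 11 \\ 11 & 15 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 11 \\ 5 \end{bmatrix}$$

$$\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \mathrm{inv}\left( \begin{bmatrix} 15 & 11 \\ 11 & 15 \end{bmatrix} \right) \begin{bmatrix} 11 \\ 5 \end{bmatrix} = \begin{bmatrix} 1.0577 \\ -0.4423 \end{bmatrix}$$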
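The worked answer can be reproduced with a small Python sketch. The function names are illustrative, and the order-2 system is solved here with Cramer's rule rather than a general matrix inverse, which is a simplification chosen for this example.

```python
def autocorrelation(frame, max_lag):
    """r[m] = sum over n of frame[n] * frame[n + m], for m = 0 .. max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_order2(r):
    """Solve [[r0, r1], [r1, r0]] [a1, a2]^T = [r1, r2]^T by Cramer's rule."""
    r0, r1, r2 = r
    det = r0 * r0 - r1 * r1
    a1 = (r1 * r0 - r1 * r2) / det
    a2 = (r0 * r2 - r1 * r1) / det
    return a1, a2


signal = [1, 3, 2, 1, 4, 1, 2, 4, 3]
first_frame = signal[:4]              # frame size 4 -> [1, 3, 2, 1]
r = autocorrelation(first_frame, 2)
print(r)                              # [15, 11, 5]
print(lpc_order2(r))                  # approximately (1.0577, -0.4423)
```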

Page 25: Speech Signal Processing

(C) Cepstrum
• "Cepstrum" is a new word made by reversing the first 4 letters of "spectrum".
• It is the spectrum of a spectrum of a signal.

Page 26: Speech Signal Processing

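Glottis and cepstrum
• Speech wave (S) = Excitation (E) . Filter (H)
• The glottal excitation (E) from the vocal cords (glottis) passes through the vocal tract filter (H) to give the output (S).
• So the voice has a strong glottal excitation frequency content.
• In the cepstrum we can easily identify and remove the glottal excitation.

[Figure: block diagram of the glottal excitation (E) from the vocal cords feeding the vocal tract filter (H), producing the output (S)]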

Page 27: Speech Signal Processing

Cepstral analysis

• The signal (s) is the convolution (*) of the glottal excitation (e) and the vocal tract filter (h):
• s(n) = e(n) * h(n), where n is the time index.

• After the Fourier transform (FT): FT{s(n)} = FT{e(n) * h(n)}
• Convolution (*) becomes multiplication (.), and n (time) becomes w (frequency):
• S(w) = E(w) . H(w)
• Find the magnitude of the spectrum:
• |S(w)| = |E(w)| . |H(w)|
• log10 |S(w)| = log10 |E(w)| + log10 |H(w)|

Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1

Page 28: Speech Signal Processing

Cepstrum

• C(n) = IDFT[log10 |S(w)|] = IDFT[log10 |E(w)| + log10 |H(w)|]

• In C(n), you can see the contributions of e(n) and h(n) at two different positions.
• Application: useful for the analysis of (i) the glottal excitation and (ii) the vocal tract filter.

• Processing chain: s(n) → windowing → x(n) → DFT → X(w) → log|X(w)| → IDFT → C(n)
• n = time index, w = frequency, IDFT = inverse discrete Fourier transform.

Page 29: Speech Signal Processing

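[Figure: the cepstrum computed step by step, with the vocal tract cepstrum and the glottal excitation cepstrum marked on the final plot]

• s(n): time domain signal
• x(n) = windowed(s(n)): the window suppresses the two sides of the frame
• |X(w)| = magnitude of the DFT of x(n)
• log(|X(w)|)
• C(n) = IDFT(log(|X(w)|)) gives the cepstrum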
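The DFT, log-magnitude and inverse DFT steps above can be sketched as follows. This is a slow, direct-sum illustration only: the function name `real_cepstrum` is an assumption, the input is assumed to be a frame that has already been windowed, and the small `floor` value is added here purely to avoid taking the logarithm of zero.

```python
import cmath
import math


def real_cepstrum(windowed_frame, floor=1e-12):
    """C(n) = IDFT(log10 |DFT(x(n))|) for an already windowed frame x(n)."""
    n = len(windowed_frame)
    # Forward DFT of the frame.
    spectrum = [
        sum(x * cmath.exp(-2j * cmath.pi * k * m / n)
            for k, x in enumerate(windowed_frame))
        for m in range(n)
    ]
    # Log magnitude; the floor guards against log10(0).
    log_mag = [math.log10(max(abs(x_m), floor)) for x_m in spectrum]
    # Inverse DFT; the result is real because log_mag is symmetric.
    return [
        sum(l_m * cmath.exp(2j * cmath.pi * k * m / n)
            for m, l_m in enumerate(log_mag)).real / n
        for k in range(n)
    ]


# A unit impulse has a flat spectrum (|X| = 1), so its cepstrum is all zeros.
print(real_cepstrum([1.0, 0.0, 0.0, 0.0]))
```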

Page 30: Speech Signal Processing

Liftering (to remove glottal excitation)
• Low-time liftering:
• Magnify (or inspect) the low-time region to find the vocal tract filter cepstrum, which is used for speech recognition.

• High-time liftering:
• Magnify (or inspect) the high-time region to find the glottal excitation cepstrum (remove this part for speech recognition).

• The cut-off between the two regions is found by experiment.
• Frequency = FS / quefrency, where FS is the sample frequency (here FS = 22050).

[Figure: cepstrum plot with the vocal tract cepstrum (used for speech recognition) on the low-time side of the cut-off and the glottal excitation cepstrum (useless for speech recognition) on the high-time side]

Page 31: Speech Signal Processing

Reasons for liftering the cepstrum of speech
• Why do we need this?
• Answer: to remove the ripples of the spectrum caused by glottal excitation.
• There are too many ripples in the spectrum caused by vocal cord vibrations (glottal excitation), but we are more interested in the speech envelope for recognition and reproduction.

[Figure: input speech signal x and, after the Fourier transform, the spectrum of x]

http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/files/notes10.pdf

Page 32: Speech Signal Processing

Speech Recognition

Page 33: Speech Signal Processing

Speech Recognition

• Speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT).

Page 34: Speech Signal Processing

Speech recognition procedure

We will now apply all the methods covered so far, to connect the dots and clarify how the recognition system works. Note that only step (4) belongs to the recognition process itself; the other steps draw on the parts covered earlier.

Steps
1. End-point detection
2. (2a) Frame blocking and (2b) windowing
3. Feature extraction: find cepstral coefficients by LPC
   3.1 Auto-correlation analysis
   3.2 LPC analysis
   3.3 Find cepstral coefficients
4. Distortion measure calculations

Page 35: Speech Signal Processing

Step 1: Get one frame and execute end-point detection
• To determine the start and end points of the speech sound.
• It is not always easy, since the energy at the start of the sound is always low.
• Determined by energy and zero-crossing rate.

[Figure: the recorded signal s(n) against n with the detected end-points marked; in this example the detected speech is about 1 second long]
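The two measures named above, short-time energy and zero-crossing rate, can be computed per frame as in the sketch below. The slides do not give the decision rule or any thresholds, so this sketch only computes the measures; the function names are illustrative.

```python
def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


print(short_time_energy([1, -2, 3]))       # 14
print(zero_crossing_rate([1, -1, 1, -1]))  # 1.0
```

Voiced speech tends to show high energy, while unvoiced sounds tend to show low energy but a high zero-crossing rate, which is why the two measures are used together.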

Page 36: Speech Signal Processing

Step 2(a): Frame blocking and windowing

• Choose the frame size (N samples) and the separation between adjacent frames (m samples).
• E.g. for a signal sampled at 16 kHz, a 10 ms window has N = 160 samples, with m = 40 samples.

[Figure: the signal s_n against n, with window l=1 (length N) and window l=2 (length N) starting m samples apart]

Page 37: Speech Signal Processing

Step 2(b): Windowing

• To smooth out the discontinuities at the beginning and end of each frame.
• Hamming or Hanning windows can be used.

• Hamming window:

$$\tilde{S}(n) = S(n)\,W(n), \quad W(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

• Tutorial: write a program segment to find the result of passing a speech frame, stored in an array int s[1000], through the Hamming window.
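One possible answer to the tutorial is sketched below in Python rather than C, so the `int s[1000]` array becomes a plain list. The function names are illustrative choices.

```python
import math


def hamming_window(n_samples):
    """W(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)) for n = 0 .. N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame):
    """Multiply each sample S(n) by the matching window value W(n)."""
    window = hamming_window(len(frame))
    return [s * w for s, w in zip(frame, window)]


s = [100] * 1000                 # stand-in for the speech frame s[1000]
windowed = apply_window(s)
print(windowed[0], windowed[-1])  # both close to 8.0 (the frame edges)
```

The window is close to 0.08 at both ends of the frame and close to 1 in the middle, so the edges of the frame are strongly attenuated.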

Page 38: Speech Signal Processing

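Effect of Hamming window

$$\tilde{S}(n) = S(n)\,W(n), \quad W(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

[Figure: a frame S(n), the Hamming window W(n), and the windowed result $\tilde{S}(n)$]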

Page 39: Speech Signal Processing

Step 3.1: Auto-correlation analysis

• The auto-correlation of every frame (l = 1, 2, ...) of the windowed signal is calculated.
• If the required output is a p-th order LPC, the auto-correlation for the l-th frame is

$$r_l(m) = \sum_{n=0}^{N-1-m} \tilde{S}_l(n)\,\tilde{S}_l(n+m), \quad m = 0, 1, \ldots, p$$

Page 40: Speech Signal Processing

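Step 3.2: LPC calculation
• To calculate the LPC coefficient vector $(a_1, a_2, \ldots, a_p)$, solve

$$\begin{bmatrix} r_0 & r_1 & r_2 & \cdots & r_{p-1} \\ r_1 & r_0 & r_1 & \cdots & r_{p-2} \\ r_2 & r_1 & r_0 & \cdots & r_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & r_{p-3} & \cdots & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix}$$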
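The slides solve this system with a matrix inverse. Because the matrix is symmetric Toeplitz, the Levinson-Durbin recursion is a common alternative that avoids forming the inverse; the sketch below uses that swapped-in technique, not the method shown in the slides. The function name is an illustrative choice.

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz system above for a_1 .. a_order.

    r holds the auto-correlation values r[0] .. r[order].
    """
    a = [0.0] * (order + 1)   # a[1..order]; a[0] is unused
    error = r[0]              # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1 - k * k)
    return a[1:]


# Same numbers as the order-2 worked example: r0=15, r1=11, r2=5.
print(levinson_durbin([15, 11, 5], 2))  # approximately [1.0577, -0.4423]
```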

Page 41: Speech Signal Processing

Step 3.3: LPC to cepstral coefficients conversion

• Cepstral coefficients are more accurate in describing the characteristics of the speech signal.
• Normally cepstral coefficients of order 1 <= m <= p are enough to describe the speech signal.
• Calculate c1, c2, c3, ..., cp from the LPC coefficients a1, a2, a3, ..., ap:

$$c_0 = r_0$$

$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad 1 \le m \le p$$

$$c_m = \sum_{k=m-p}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad m > p \ (\text{if needed})$$
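The recursion above can be sketched in Python as follows. The function name `lpc_to_cepstrum` is an assumption, c0 is left out because it is taken directly from r0, and the LPC values in the example are made-up round numbers chosen so the result is easy to check by hand.

```python
def lpc_to_cepstrum(a, n_coeffs):
    """Convert LPC coefficients a_1..a_p into cepstral coefficients c_1..c_n."""
    p = len(a)
    c = [0.0] * (n_coeffs + 1)            # c[1..n_coeffs]; c[0] is unused
    for m in range(1, n_coeffs + 1):
        total = a[m - 1] if m <= p else 0.0   # the a_m term exists only for m <= p
        for k in range(max(1, m - p), m):
            total += (k / m) * c[k] * a[m - k - 1]   # a[m-k-1] is a_(m-k)
        c[m] = total
    return c[1:]


# Hypothetical order-2 LPC coefficients a1 = 1.0, a2 = -0.5.
print(lpc_to_cepstrum([1.0, -0.5], 3))
```

For these values c1 = a1 = 1.0, c2 = a2 + (1/2) c1 a1 = 0.0, and c3 = (1/3) c1 a2 + (2/3) c2 a1 = -1/6.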

Page 42: Speech Signal Processing

Step (4) Matching method: dynamic programming (DP)
• Correlation is a simple method for pattern matching, BUT:
• The most difficult problem in speech recognition is time alignment. No two speech sounds are exactly the same, even when produced by the same person.
• Align the speech features by an elastic matching method: DP.

Page 43: Speech Signal Processing

(B) Dynamic programming algorithm

• Step 1: calculate the distortion matrix dist(i, j).
• Step 2: calculate the accumulated matrix D(i, j) by using

$$D(i,j) = dist(i,j) + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$

[Figure: cell D(i, j) with its three neighbours D(i-1, j), D(i-1, j-1) and D(i, j-1)]

Page 44: Speech Signal Processing

To find the optimal path in the accumulated matrix (and the minimum accumulated distortion/distance)

• Starting from the top row and right-most column, find the lowest cost D(i,j)_t: it is found to be the cell at (i,j) = (3,5), with D(3,5) = 7, in the top row. (This cost is called the "minimum accumulated distance" or "minimum accumulated distortion".)

• From the lowest cost position p(i,j)_t, find the next position p(i,j)_{t-1} = argument_min_i,j{D(i-1,j), D(i-1,j-1), D(i,j-1)}.
• E.g. p(i,j)_{t-1} = argument_min_i,j{11, 5, 12}: the cell containing 5 is selected.
• Repeat the above until the path reaches the left-most column or the lowest row.
• Note: argument_min_i,j{cell1, cell2, cell3} means that the argument i,j of the cell with the lowest value is selected.

Page 45: Speech Signal Processing

Optimal path

• It should run from any element in the top row or right-most column to any element in the bottom row or left-most column.

• The reason is that noise may corrupt elements at the beginning or the end of the input sequence.

• However, in actual processing the path should be restrained to stay near the 45-degree diagonal (from bottom left to top right): see the attached diagram, where the path cannot pass through the restricted regions. The user can set these regions manually. This is a way to prohibit unrecognizable matches. See next page.

Page 46: Speech Signal Processing

Optimal path and restricted regions

Page 47: Speech Signal Processing

Example: DP

• The cepstrum codes of the speech sounds 'YES' and 'NO' and of an unknown 'input' are shown. Is the 'input' 'YES' or 'NO'?

YES:   2 4 6 9 3 4 5 8 1
NO:    7 6 2 4 7 6 10 4 5
Input: 3 5 5 8 4 2 3 7 2

• Distortion: $dist = (x - x')^2$

Page 48: Speech Signal Processing

• Answer
• Starting from the top row and right-most column, find the lowest cost D(i,j)_t: it is found to be the cell at (i,j) = (9,9), with D(9,9) = 13.

• From the lowest cost position (i,j)_t, find the next position (i,j)_{t-1} = argument_min_i,j{D(i-1,j), D(i-1,j-1), D(i,j-1)}. E.g. position (i,j)_{t-1} = argument_min_i,j{48, 12, 47} = (9-1, 9-1) = (8,8), the cell that contains "12", is selected.

• Repeat the above until the path reaches the left-most column or the lowest row.

• Note: argument_min_i,j{cell1, cell2, cell3} means that the argument i,j of the cell with the lowest value is selected.
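The accumulated matrix and the backward trace can be sketched in Python as below, using the 'YES' template and the 'input' codes from the example. Several things are assumptions of this sketch: the function names, the zero-based indices, letting cells in the first row and column accumulate along that row or column (the slides do not spell out the boundary handling), and starting the trace from the last cell. No restricted regions are applied.

```python
def accumulated_matrix(template, unknown):
    """D[i][j] = dist(i, j) + min of the three previously computed neighbours."""
    rows, cols = len(template), len(unknown)
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            cost = (unknown[j] - template[i]) ** 2   # dist = (x - x')^2
            neighbours = []
            if i > 0:
                neighbours.append(D[i - 1][j])
            if j > 0:
                neighbours.append(D[i][j - 1])
            if i > 0 and j > 0:
                neighbours.append(D[i - 1][j - 1])
            D[i][j] = cost + (min(neighbours) if neighbours else 0)
    return D


def backtrack(D):
    """Trace back from the last cell until the first row or column is reached."""
    i, j = len(D) - 1, len(D[0]) - 1
    path = [(i, j)]
    while i > 0 and j > 0:
        i, j = min(((i - 1, j), (i - 1, j - 1), (i, j - 1)),
                   key=lambda cell: D[cell[0]][cell[1]])
        path.append((i, j))
    return path[::-1]


YES = [2, 4, 6, 9, 3, 4, 5, 8, 1]
INPUT = [3, 5, 5, 8, 4, 2, 3, 7, 2]
D_yes = accumulated_matrix(YES, INPUT)
print(D_yes[-1][-1])  # 13, matching D(9,9) = 13 in the answer
```

Running the same function on the 'NO' template and comparing the two accumulated distances is how the unknown input would be assigned to one of the two words.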

Page 49: Speech Signal Processing

Thank you ^_~