
A 12-WEEK PROJECT IN

Speech Coding and Recognition

by Fu-Tien Hsiao and Vedrana Andersen

Overview
- An Introduction to Speech Signals (Vedrana)
- Linear Prediction Analysis (Fu)
- Speech Coding and Synthesis (Fu)
- Speech Recognition (Vedrana)


AN INTRODUCTION TO SPEECH SIGNALS

Speech Production
- Flow of air from the lungs
- Vibrating vocal cords
- Speech production cavities
- Lips
- Sound wave
- Vowels (a, e, i), fricatives (f, s, z) and plosives (p, t, k)

Speech Signals
- Sampling frequency 8-16 kHz
- Short-time stationarity assumption (frames of 20-40 ms)
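The short-time assumption is usually implemented by slicing the signal into overlapping frames. A minimal Python sketch (the 25 ms frame and 10 ms hop are illustrative values within the range above, not parameters fixed by the project):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping short-time frames."""
    frame_len = int(fs * frame_ms / 1000)    # e.g. 200 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)            # frame shift, e.g. 80 samples
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

fs = 8000
x = np.random.randn(fs)          # one second of stand-in "speech"
frames = frame_signal(x, fs)     # shape: (n_frames, frame_len)
```

Each row of `frames` is then treated as a stationary snippet for analysis.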

Model for Speech Production
- Excitation (periodic or noisy)
- Vocal tract filter (nasal cavity, oral cavity, pharynx)

Voiced and Unvoiced Sounds
- Voiced sounds: periodic excitation, pitch period
- Unvoiced sounds: noise-like excitation
- Short-time measures: power and zero-crossing rate
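Both short-time measures are one-liners per frame. A sketch (the 8 kHz rate, frame length, and test signals are illustrative): a low-frequency periodic frame crosses zero rarely, a noise-like frame crosses often.

```python
import numpy as np

def short_time_power(frame):
    # Average energy per sample within the frame.
    return np.mean(frame ** 2)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose sign differs.
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

n = np.arange(200)
voiced = np.sin(2 * np.pi * 100 * n / 8000)   # 100 Hz tone at 8 kHz
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(200)           # noise-like frame

# Noise crosses zero far more often than a low-frequency periodic frame.
assert zero_crossing_rate(unvoiced) > zero_crossing_rate(voiced)
```

Thresholding these two measures per frame gives a simple voiced/unvoiced decision.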

Frequency Domain
- Pitch and harmonics (excitation)
- Formants and spectral envelope (vocal tract filter)
- Harmonic product spectrum
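The harmonic product spectrum estimates pitch by multiplying downsampled copies of the magnitude spectrum so that the harmonics line up on the fundamental. A sketch (the synthetic three-harmonic frame, FFT size, and search limits are all illustrative choices):

```python
import numpy as np

def harmonic_product_spectrum(frame, fs, n_harmonics=3):
    """Estimate pitch by multiplying compressed copies of the spectrum."""
    n_fft = 4096
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    hps = spec.copy()
    for h in range(2, n_harmonics + 1):
        dec = spec[::h]                  # spectrum compressed by factor h
        hps[:len(dec)] *= dec
    lo = int(50 * n_fft / fs)            # ignore bins below 50 Hz
    hi = len(spec) // n_harmonics        # region where all factors applied
    peak = lo + np.argmax(hps[lo:hi])
    return peak * fs / n_fft

fs = 8000
n = np.arange(400)
f0 = 200.0
# Harmonic-rich synthetic "vowel": fundamental plus two harmonics.
frame = (np.sin(2 * np.pi * f0 * n / fs)
         + 0.6 * np.sin(2 * np.pi * 2 * f0 * n / fs)
         + 0.3 * np.sin(2 * np.pi * 3 * f0 * n / fs))
pitch = harmonic_product_spectrum(frame, fs)
```

The product suppresses formant peaks that are not part of a harmonic series, which is why HPS is more robust than picking the largest spectral peak.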

Speech Spectrograms
- Time-varying formant structure
- Narrowband / wideband
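The narrowband/wideband distinction is just the analysis window length. Assuming SciPy is available, a sketch with two typical (illustrative) window lengths:

```python
import numpy as np
from scipy import signal

fs = 8000
x = np.random.randn(2 * fs)          # stand-in for a speech recording

# Wideband: short window (~5 ms) -> good time resolution, smeared harmonics.
f_wb, t_wb, S_wb = signal.spectrogram(x, fs, nperseg=int(0.005 * fs))

# Narrowband: long window (~30 ms) -> resolves individual pitch harmonics.
f_nb, t_nb, S_nb = signal.spectrogram(x, fs, nperseg=int(0.030 * fs))

# The longer window gives finer frequency sampling but fewer time slices.
assert len(f_nb) > len(f_wb) and len(t_nb) < len(t_wb)
```

Plotting `S_wb` shows vertical pitch striations; `S_nb` shows horizontal harmonic lines.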


LINEAR PREDICTION ANALYSIS

Categories
- Vocal Tract Filter
- Linear Prediction Analysis
- Error Minimization
- Levinson-Durbin Recursion
- Residual sequence u(n)

Vocal Tract Filter (1)
Vocal tract filter: input is a periodic impulse train ug(n), output is the speech s(n).
What if we assume an all-pole filter?

    H(z) = S(z) / Ug(z)

Vocal Tract Filter (2)
Autoregressive model (all-pole filter):

    H(z) = S(z) / Ug(z) = A / (1 - a1z^-1 - a2z^-2 - … - apz^-p)

so that

    S(z) = a1z^-1 S(z) + a2z^-2 S(z) + … + apz^-p S(z) + A·Ug(z)
    s(n) = a1s(n-1) + a2s(n-2) + … + aps(n-p) + A·ug(n)

where p is called the model order. Speech is a linear combination of past samples and an extra part, A·ug(n).

Linear Prediction Analysis (1)
Goal: how do we find the coefficients ak in this all-pole model?

Physical model vs. analysis system:
- Physical model: impulse A·ug(n) → all-pole model → speech s(n)
- Analysis system: speech s(n) → ? → error e(n)

The ak here are fixed but unknown; we try to find αk to estimate the ak.

Linear Prediction Analysis (2)
What is really inside the "?" box? A predictor P(z) (an FIR filter), where

    ŝ(n) = α1s(n-1) + α2s(n-2) + … + αps(n-p)

The prediction error is e(n) = s(n) - ŝ(n), i.e. the error filter is A(z) = 1 - P(z).
If αk ≈ ak, then e(n) ≈ A·ug(n).

Linear Prediction Analysis (3)
If we can find the predictor giving the smallest error e(n), which is then close to A·ug(n), we can use A(z) to estimate the filter coefficients: feeding e(n) ≈ A·ug(n) into 1/A(z) reproduces ŝ(n), very similar to the vocal tract model.

Error Minimization (1)
Problem: how do we find the minimum error? Energy of the error:

    E = Σn e(n)²,  where e(n) = s(n) - ŝ(n) is a function of the αi

Since E is a quadratic function of the αi, we find its smallest value by setting

    ∂E/∂αi = 0  for each i

Differentiating leads to sums of the form Σn s(n-k)s(n-i), which is actually an autocorrelation of s(n).

Error Minimization (2)
Differentiation gives, for i = 1, …, p:

    Σn s(n)s(n-i) = Σk=1..p αk Σn s(n-k)s(n-i)

Defining

    φ(k,i) = Σn s(n-k)s(n-i),  1 ≤ k ≤ p,  1 ≤ i ≤ p

this is a set of p linear equations in the αk.

Error Minimization (3)
Hence, let us write the linear equations in matrix form. With the autocorrelation method, φ(k,i) = r(|k-i|), so the equations become

    [ r(0)    r(1)   …  r(p-1) ] [ α1 ]   [ r(1) ]
    [ r(1)    r(0)   …  r(p-2) ] [ α2 ] = [ r(2) ]
    [  ⋮                  ⋮    ] [  ⋮ ]   [  ⋮   ]
    [ r(p-1)  r(p-2) …  r(0)   ] [ αp ]   [ r(p) ]

The linear prediction coefficients αk are our goal. How do we solve this efficiently?
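A small numerical check of these normal equations (the AR(2) coefficients and signal length are made up for illustration): synthesize a signal from a known all-pole model, estimate the autocorrelation r(k), and solve the Toeplitz system for α.

```python
import numpy as np
from scipy.linalg import toeplitz

# Synthetic AR(2) "speech": s(n) = a1 s(n-1) + a2 s(n-2) + u(n)
rng = np.random.default_rng(1)
a_true = np.array([1.5, -0.7])           # stable all-pole model
u = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = a_true[0] * s[n-1] + a_true[1] * s[n-2] + u[n]

p = 2
# Autocorrelation r(0..p) of the observed signal.
r = np.array([np.dot(s[:len(s)-k], s[k:]) for k in range(p + 1)]) / len(s)
# Normal equations: R alpha = [r(1), ..., r(p)], R symmetric Toeplitz.
alpha = np.linalg.solve(toeplitz(r[:p]), r[1:p+1])
```

With enough samples, `alpha` lands close to the true model coefficients.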

Levinson-Durbin Recursion (1)
The L-D recursion method is based on the structure of the matrix, which is
- Symmetric
- Toeplitz

Hence we can solve the system in O(p²) instead of O(p³).

Don't forget our objective, which is to find the αk that simulate the vocal tract filter.

Levinson-Durbin Recursion (2)
In the exercise we solve the system both by 'brute force' and by L-D recursion; the resulting parameters are identical.

(Figure: error energy vs. predictor order.)

Residual sequence u(n)
Knowing the filter coefficients, we can find the residual sequence u(n) by inverse filtering: s(n) → A(z) → u(n).

Compare the original s(n) with the residual u(n).
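A sketch of the inverse filtering step (the AR(2) coefficients are hypothetical): synthesize a signal through 1/A(z), then run it back through A(z) to recover the excitation.

```python
import numpy as np
from scipy.signal import lfilter

# Hypothetical predictor coefficients alpha from LP analysis of one frame.
alpha = np.array([1.5, -0.7])
A = np.concatenate(([1.0], -alpha))       # A(z) = 1 - sum alpha_k z^-k

rng = np.random.default_rng(2)
u_true = rng.standard_normal(1000)
s = lfilter([1.0], A, u_true)             # synthesis: s(n) = u(n) / A(z)
u = lfilter(A, [1.0], s)                  # inverse filter: residual u(n)

# Inverse filtering recovers the excitation exactly here.
assert np.allclose(u, u_true)
```

For real speech the residual is not recovered noise, of course; it is the estimated excitation whose structure the coders below try to represent compactly.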


SPEECH CODING AND SYNTHESIS

Categories
- Analysis-by-Synthesis
- Perceptual Weighting Filter
- Linear Predictive Coding
- Multi-Pulse Linear Prediction
- Code-Excited Linear Prediction (CELP)
- CELP Experiment
- Quantization

Analysis-by-Synthesis (1)
- Analyze the speech by estimating an LP synthesis filter
- Compute a residual sequence as an excitation signal to reconstruct the signal
- Encoder/decoder: the parameters (LP synthesis filter, gain, pitch) are coded, transmitted, and decoded

Analysis-by-Synthesis (2)
Frame by frame, without or with error minimization.

(Encoder diagram: s(n) → LP analysis → LP parameters; the excitation generator drives the LP synthesis filter to produce ŝ(n); the error e(n) = s(n) - ŝ(n) feeds the error minimization block, which tunes the excitation parameters; LP and excitation parameters go to the channel.)

Perceptual Weighting Filter (1)
Perceptual masking effect: within the formant regions, we are less sensitive to noise.
Idea: design a filter that de-emphasizes the error in the formant regions.
Result: synthetic speech with more error near the formant peaks but less error elsewhere.

Perceptual Weighting Filter (2)
In the frequency domain: LP synthesis filter vs. PW filter

    Q(z) = (1 - P(z)) / (1 - P(z/α)) = A(z) / A(z/α)

Perceptual weighting coefficient α: with α = 1 there is no filtering; as α decreases, the weighting gets stronger. The best α depends on perception.

Perceptual Weighting Filter (3)
In the z domain, LP filter vs. PW filter:
- Numerator: generates zeros at the original poles of the LP synthesis filter
- Denominator: places the poles closer to the origin; α determines the distance
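Building A(z/α) from A(z) is just a per-coefficient scaling. A sketch (the filter coefficients are hypothetical; the slide's weighting coefficient α is written `alpha_w` in code):

```python
import numpy as np
from scipy.signal import freqz

# Hypothetical LP error filter A(z) = 1 - 1.5 z^-1 + 0.7 z^-2 for one frame.
A = np.array([1.0, -1.5, 0.7])
alpha_w = 0.8                            # perceptual weighting coefficient

# A(z/alpha): multiply the k-th coefficient by alpha^k.
# This pulls each pole of 1/A(z/alpha) toward the origin by the factor alpha.
A_w = A * alpha_w ** np.arange(len(A))

# Q(z) = A(z) / A(z/alpha): zeros at the LP poles, poles moved inward.
w, Q = freqz(A, A_w, worN=512)

# With alpha = 1 numerator and denominator cancel: no filtering at all.
w1, Q1 = freqz(A, A * 1.0 ** np.arange(len(A)), worN=512)
assert np.allclose(np.abs(Q1), 1.0)
```

Plotting |Q| shows dips at the formant frequencies, which is exactly the de-emphasis the slide describes.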

Linear Predictive Coding (1)
Based on the methods above: the PW filter and analysis-by-synthesis.
If the excitation signal is close to an impulse train during voicing, we get a reconstructed signal very close to the original.
More often, however, the residual is far from an impulse train.

Linear Predictive Coding (2)
Hence there are many coding schemes that try to improve on this, differing primarily in the type of excitation signal. Two kinds:
- Multi-Pulse Linear Prediction
- Code-Excited Linear Prediction (CELP)

Multi-Pulse Linear Prediction (1)
Concept: represent the residual sequence by a set of impulses, placed so as to make ŝ(n) closer to s(n).

(Diagram: the multi-pulse excitation u(n) drives the LP synthesis filter to give ŝ(n); the difference s(n) - ŝ(n) passes through the PW filter to the error minimization block; LP analysis of s(n) supplies the filter.)

Multi-Pulse Linear Prediction (2)
s1 Estimate the LPC filter, without excitation
s2 Place one impulse (position and amplitude)
s3 Determine the new error
s4 Repeat s2-s3 until a desired minimum error is reached

(Figure: original, synthetic multi-pulse signal, and error at steps s1-s4.)

Code-Excited Linear Prediction (1)
The differences from multi-pulse:
- Represent the residual v(n) by codewords (found by exhaustive search) from a codebook of zero-mean Gaussian sequences, instead of a multi-pulse generator
- Consider the primary pitch pulses, which are predictable over consecutive periods
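A toy sketch of the exhaustive codebook search (codebook size, subframe length, and filter coefficients are made-up small values; the pitch synthesis stage and PW filter are omitted): for each Gaussian codeword, synthesize, fit the optimal gain, and keep the codeword with the smallest error.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(3)

# Zero-mean Gaussian codebook: K codewords of length L (toy sizes).
K, L = 64, 40
codebook = rng.standard_normal((K, L))

A = np.array([1.0, -1.5, 0.7])                 # LP analysis filter of the frame
s = lfilter([1.0], A, rng.standard_normal(L))  # target subframe

best = (np.inf, 0, 0.0)
for k in range(K):
    y = lfilter([1.0], A, codebook[k])         # synthesize candidate
    g = np.dot(s, y) / np.dot(y, y)            # optimal gain for this codeword
    err = np.sum((s - g * y) ** 2)
    if err < best[0]:
        best = (err, k, g)

err, k_best, gain = best
```

Only the index `k_best` and the gain need to be transmitted, which is where CELP's bitrate savings come from.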

Code-Excited Linear Prediction (2)
(Diagram: the Gaussian excitation codebook produces v(n), which passes through the pitch synthesis filter, driven by a pitch estimate, to give u(n); u(n) drives the LP synthesis filter, whose LP parameters come from LP analysis of s(n), to give ŝ(n); the difference s(n) - ŝ(n) goes through the PW filter to the error minimization block.)

CELP Experiment (1)
An experiment with CELP. (Figure: original signal in blue, excitation signal below, reconstructed signal in green.)

CELP Experiment (2)
Test the quality for different settings:
1. LPC model order: initial M = 10, test M = 2
2. PW coefficient

CELP Experiment (3)
3. Codebook (L, K)
- K: codebook size. K strongly influences the computation time: reducing K from 1024 to 256 cuts the time from 13 to 6 seconds
- L: length of the random signal. L determines the number of subblocks in the frame
Initial (L, K) = (40, 1024); test (40, 16)

Quantization
With quantization: 16000 bps CELP vs. 9600 bps CELP.
Trade-off: bandwidth efficiency vs. speech quality.


SPEECH RECOGNITION

Dimensions of Difficulty
- Speaker dependent / independent
- Vocabulary size (small, medium, large)
- Discrete words / continuous utterance
- Quiet / noisy environment

Feature Extraction
- Overlapping frames
- Feature vector for each frame: mel-cepstrum, difference cepstrum, energy, difference energy

Vector Quantization
- Vector quantization with the K-means algorithm
- Observation sequence for the whole word
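A minimal K-means vector quantizer sketch: train a codebook on feature vectors, then use the per-frame codeword indices as the observation sequence. The two-cluster synthetic data and dimensions are invented for illustration.

```python
import numpy as np

def kmeans_vq(features, n_codewords, n_iter=20, seed=0):
    """Toy K-means codebook training; features is (n_frames, dim)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iter):
        # Assign each feature vector to its nearest codeword.
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Move each codeword to the mean of its assigned vectors.
        for c in range(n_codewords):
            if np.any(labels == c):
                codebook[c] = features[labels == c].mean(axis=0)
    return codebook, labels

rng = np.random.default_rng(4)
# Two well-separated clusters of 12-dimensional "cepstral" vectors.
feats = np.vstack([rng.standard_normal((50, 12)) + 5.0,
                   rng.standard_normal((50, 12)) - 5.0])
codebook, obs = kmeans_vq(feats, 2)      # obs is the observation sequence
```

In the recognizer, `obs` (one symbol per frame) is what the discrete-observation HMM consumes.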

Hidden Markov Model (1)
Changing states, emitting symbols; the model is described by π, A and B.

(Figure: five-state left-to-right model, states 1-5.)

Hidden Markov Model (2)
Probability of transition:
- State transition matrix A
- State probability vector π(t)
- State equation: π(t+1) = π(t) A
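The state equation can be illustrated directly (the transition matrix values are hypothetical): repeatedly multiplying the state probability vector by A propagates it through time.

```python
import numpy as np

# A 3-state left-to-right transition matrix (rows sum to 1); made-up values.
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])

pi = np.array([1.0, 0.0, 0.0])       # start in state 1

# State equation: the state distribution at time t+1 is pi_t A.
for t in range(3):
    pi = pi @ A

assert np.isclose(pi.sum(), 1.0)     # it remains a probability vector
```

After three steps the mass has started flowing toward the final state, as expected for a left-to-right model.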

Hidden Markov Model (3)
Probability of observing:
- Observation probability matrix B
- Observation probability vector
- Observation equation

Hidden Markov Model (4)
Discrete observation hidden Markov model.

Two HMM problems:
- Training problem
- Recognition problem

Recognition using HMM (1)
Determining the probability that a given HMM produced the observation sequence.

Using straightforward computation over all possible state paths costs on the order of S^T operations for S states and T time steps.

(Figure: trellis of states over time.)

Recognition using HMM (2)
Forward-backward algorithm, using only the forward part. The forward probability is the probability of the partial observation sequence up to time t, ending in state i:

    αt(i) = P(o1 … ot, state i at time t | model)

Recognition using HMM (3)
Initialization:

    α1(i) = πi bi(o1)

Recursion:

    αt+1(j) = [ Σi αt(i) aij ] bj(ot+1)

Termination:

    P(O | model) = Σi αT(i)
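A sketch of the forward algorithm for a made-up two-state, two-symbol model, checked against the straightforward sum over all S^T paths:

```python
import numpy as np
from itertools import product

def forward(A, B, pi, obs):
    """P(observation sequence | HMM) via the forward recursion, O(S^2 T)."""
    T, S = len(obs), A.shape[0]
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]                 # initialization
    for t in range(1, T):                        # recursion
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                       # termination

# Hypothetical 2-state model emitting 2 symbols.
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 0]

p_forward = forward(A, B, pi, obs)

# Brute force over all S^T state paths gives the same probability.
p_brute = 0.0
for path in product(range(2), repeat=len(obs)):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t-1], path[t]] * B[path[t], obs[t]]
    p_brute += p

assert np.isclose(p_forward, p_brute)
```

For realistic T the products underflow, which is why the scaling mentioned under training is needed in practice.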

Training HMM
- No known analytical way
- Forward-backward (Baum-Welch) reestimation, a hill-climbing algorithm
- Reestimates the HMM parameters in such a way that the probability of the observation sequence does not decrease

Method: uses α and β to compute the forward and backward probabilities, calculates state transition probabilities and observation probabilities, and reestimates the model to improve the probability.

Need for scaling.

Experiments
- Matrices A and B
- Observation sequences for the words 'one' and 'two'

Thank you!