TRANSCRIPT
Codebook-based Feature Compensation for Robust Speech Recognition
2007/02/08
Shih-Hsiang Lin (林士翔)
Graduate Student
National Taiwan Normal University, Taipei, Taiwan
Outline
• Introduction
• Codebook-based Cepstral Compensation
  – With Both Clean and Noisy (Stereo) Data
    • Stereo-based Piecewise Linear Compensation (SPLICE)
    • SNR Dependent Cepstral Normalization (SDCN)
    • Codeword Dependent Cepstral Normalization (CDCN)
    • Probabilistic Optimum Filtering (POF)
  – With Only Noisy Data
    • Maximum Likelihood based Stochastic Vector Mapping (ML-SVM)
    • Maximum Mutual Information-SPLICE (MMI-SPLICE)
    • Minimum Classification Error based SVM (MCE-SVM)
• Stochastic Matching
• Conclusions
Introduction
• Speech recognition performance degrades severely when there is a mismatch between training and test acoustic conditions
  – However, training systems for all noise conditions is impractical
• A Simplified Distortion Framework
  – Channel effects (convolutional noise) are usually assumed to be constant while uttering
  – Additive background noises can be either stationary or non-stationary

  y_t = x_t * h_t + n_t

  where x_t is the clean speech, h_t the channel impulse response, n_t the additive background noise, and y_t the resulting noisy speech
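The simplified distortion framework above can be simulated directly. A minimal numpy sketch, with purely illustrative signal values (the channel taps and SNR are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative signals: x stands in for clean speech, h is a short
# channel impulse response, n is additive background noise.
x = rng.standard_normal(16000)      # 1 s of "clean speech" at 16 kHz
h = np.array([1.0, 0.5, 0.25])      # channel, assumed constant while uttering
snr_db = 10.0                       # target SNR for the additive noise

xh = np.convolve(x, h)              # convolutional (channel) distortion
n = rng.standard_normal(xh.size)
# Scale the noise so that 10*log10(power(xh)/power(n)) == snr_db.
n *= np.sqrt(np.mean(xh**2) / (np.mean(n**2) * 10**(snr_db / 10)))
y = xh + n                          # y_t = x_t * h_t + n_t
```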
Introduction (cont.)
• Non-linear Environmental Distortions
  – Clean speech corrupted by 10 dB subway noise
  – Not only linear but also non-linear distortions are involved
[Figure: scatter plots of clean vs. noisy Mel filter bank outputs, cepstral coefficients, and log energy]
Introduction (cont.)
• Two main approaches to improving noise robustness
  – Model Compensation: adapt the acoustic models to match the corrupted speech features
  – Feature Compensation: restore the corrupted speech features to the corresponding clean ones
[Figure: clean speech under training conditions vs. noisy speech under test conditions; feature compensation operates in the feature space, model compensation in the model space, mapping clean acoustic models to noisy acoustic models]
Introduction (cont.)
• Model Compensation
  – Adapt acoustic models to match the corrupted speech features
  – Representative approaches: MAP, MLLR, MAPLR, etc.
• Feature Compensation
  – Restore corrupted speech features to the corresponding clean ones
  – Representative approaches: SS, CMN, CMVN, etc.
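As a concrete instance of the simple feature-compensation approaches listed above, cepstral mean and variance normalization (CMVN) fits in a few lines; a minimal utterance-level sketch, not tied to any particular toolkit:

```python
import numpy as np

def cmvn(c):
    """Cepstral mean and variance normalization over one utterance.

    c: (T, D) array of cepstral feature vectors. Subtracting the
    per-dimension mean removes a stationary convolutional (channel) bias;
    dividing by the standard deviation normalizes the dynamic range.
    """
    return (c - c.mean(axis=0)) / (c.std(axis=0) + 1e-8)
```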
[Figure: model compensation maps the clean acoustic models to corrupted models; feature compensation maps corrupted speech features through a compensation model to clean speech features]
Theme of Presented Compensation Approaches
• Codebook-based Cepstral Compensation
  – All the presented approaches involve the use of vector quantization (VQ): a universal codebook is used to find a correction vector for each feature vector, which is then applied for compensation
Theme of Presented Compensation Approaches (cont.)
With Both Clean and Noisy (Stereo) Data:
• SDCN (Data-Driven), CDCN (Model-based) (Acero and Stern, 1990)
• FCDCN (Data-Driven) (Acero and Stern, 1991)
• POF (Data-Driven) (Neumeyer and Weintraub, 1994)
• SPLICE (Data-Driven) (Deng and Acero et al., 2000)

With Only Noisy Data:
• Stochastic Matching (Model-based) (Sankar and Lee, 1996)
• ML-SVM (Model-based) (Wu, Huo and Zhu, 2005)
• MMI-SPLICE (Model-based) (Droppo and Acero, 2005)
• MCE-SVM (Model-based) (Wu and Huo, 2006)
• Unsupervised ML-SVM (Model-based) (Zhu and Huo, 2007)
Stereo-based Piecewise Linear Compensation (SPLICE)
• Estimation of the correction vector (bias) (Deng et al., 2000)
  – A mixture of Gaussians is trained on the noisy data
  – For each Gaussian component (codeword), a linear relationship between the clustered noisy data and its corresponding clean counterpart is assumed
SPLICE (cont.)
• The success of SPLICE rests on two assumptions
  1. The noisy cepstral feature vectors follow a mixture-of-Gaussians distribution, which can be thought of as a "codebook" with a total of K codewords

     p(y) = Σ_k p(y|k) p(k), where p(y|k) = N(y; μ_k, Σ_k)

  2. The conditional probability density function of a clean vector x given its corresponding noisy vector y and the cluster index k is a Gaussian whose mean vector is a linear transformation of y

     p(x|y, k) = N(x; y + r_k, Γ_k)

     Only a mean shift (without rotation) is assumed here
• The correction vectors (biases) r_k can be estimated in the MMSE sense given the clean training speech vectors and their noisy counterparts
SPLICE (cont.)
• Therefore, the clean feature vector can be restored from its corresponding noisy feature vector by adding a linear weighted sum of all codeword-dependent bias vectors

  x̂ = E[x|y] = Σ_k p(k|y) E[x|y, k] = Σ_k p(k|y) (y + r_k) = y + Σ_k p(k|y) r_k

  since y is an (observed) known constant vector here
• Alternatively, we can simply use the single best codeword to obtain the restored clean feature

  x̂ = y + r_k̂, where k̂ = argmax_k p(k|y)
SPLICE (cont.)
• The MMSE estimate of the correction vector (bias) r_k can be expressed as

  r_k = Σ_{y∈Y} p(k|y) (x - y) / Σ_{y∈Y} p(k|y), where p(k|y) = p(y|k) p(k) / Σ_{k'} p(y|k') p(k')

  – The use of stereo (both clean and noisy) training data provides an accurate estimate of the correction vectors
• SPLICE has good potential to effectively handle a wide range of distortions
  – Nonstationary distortion, jointly additive and/or convolutional distortion, and even nonlinear distortion of the original speech signal
• However, in many ASR applications, stereo data are too expensive to collect
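The SPLICE training and restoration equations above can be sketched in numpy. This is a minimal illustration assuming the codebook GMM parameters (diagonal covariances) are already trained; all variable names are hypothetical:

```python
import numpy as np

def gmm_posterior(y, means, var, weights):
    """p(k|y) under a diagonal-covariance Gaussian-mixture "codebook".

    y: (T, D) frames; means, var: (K, D); weights: (K,).
    """
    d2 = (y[:, None, :] - means[None]) ** 2 / var[None]
    logp = -0.5 * (d2.sum(-1) + np.log(2 * np.pi * var).sum(-1)[None])
    logp += np.log(weights)[None]
    logp -= logp.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def splice_train(noisy, clean, means, var, weights):
    """MMSE biases: r_k = sum_t p(k|y_t)(x_t - y_t) / sum_t p(k|y_t)."""
    post = gmm_posterior(noisy, means, var, weights)
    return post.T @ (clean - noisy) / post.sum(axis=0)[:, None]

def splice_apply(noisy, means, var, weights, r):
    """Restoration: x_hat = y + sum_k p(k|y) r_k."""
    return noisy + gmm_posterior(noisy, means, var, weights) @ r
```

With stereo data whose mismatch is a constant offset, the estimated biases recover that offset exactly, since the posteriors of each frame sum to one.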
SNR Dependent Cepstral Normalization (SDCN)
• Notice that in the simplified distortion framework, the noisy cepstral feature vector can be expressed as (Acero et al., 1990)

  y = x + h + g(n - x - h), where g(z) = C log(1 + exp(C^{-1} z))

  with
  y = C [log Y(f_0), log Y(f_1), ..., log Y(f_M)]^T
  x = C [log X(f_0), log X(f_1), ..., log X(f_M)]^T
  h = C [log H(f_0), log H(f_1), ..., log H(f_M)]^T
  n = C [log N(f_0), log N(f_1), ..., log N(f_M)]^T

  where C is the discrete cosine transform matrix and f_m is the m-th filter bank bin
  – g(·) is a non-linear function and is very difficult to estimate
• Therefore, SDCN attempts to restore the clean cepstral vector using a compensation vector that approximates h + g(n - x - h) for different SNR levels
SDCN (cont.)
• A schematic depiction of SDCN
  – Each codeword of the universal codebook represents a specific SNR level; for each SNR level (codeword) k, a compensation vector w[SNR_k] is estimated
  – The (instantaneous) SNR can be calculated in proportion to the difference between the C[0] of the input frame and the C[0] of the noise at a reference time
SDCN (cont.)
• Therefore, a compensation vector that depends only on the instantaneous SNR of the observed feature vector can be used to restore the clean one

  x̂ = y - w[SNR_k], where w[SNR_k] ≈ h + g(n - x - h)

• Two extreme cases
  – High SNR (x + h ≫ n): C log(1 + exp(C^{-1}(n - x - h))) ≈ 0, so y ≈ x + h and w[SNR_k] ≈ h → removal of the channel effect
  – Low SNR (n ≫ x + h): C log(1 + exp(C^{-1}(n - x - h))) ≈ n - x - h, so y ≈ n and w[SNR_k] ≈ n - x → removal of the additive noise effect
SDCN (cont.)
• The compensation vectors were estimated in the MMSE sense using

  w[SNR_k] = Σ_{x∈X} (x_channel1 - x_channel2) δ(SNR(x_channel1), SNR_k) / Σ_{x∈X} δ(SNR(x_channel1), SNR_k)

  where
  x_channel1: cepstral vectors for the test (noisy) condition
  x_channel2: cepstral vectors for the standard acoustical (clean) condition
  δ(·,·): Kronecker delta function
  SNR(x_channel1): the instantaneous SNR level of x_channel1
• Disadvantage
  – For a new test environment, the compensation vectors have to be re-estimated using a new, sufficiently large set of stereo data
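The per-SNR estimation and compensation steps above can be sketched as follows, assuming the instantaneous SNR of each frame has already been quantized into a bin index (the binning itself is outside this sketch):

```python
import numpy as np

def estimate_sdcn_vectors(noisy, clean, snr_bins):
    """Estimate one compensation vector per instantaneous-SNR bin from
    stereo data: w[SNR_k] = mean over frames in bin k of (y - x).

    noisy, clean: (T, D) time-aligned cepstra; snr_bins: (T,) int bin index.
    """
    K = snr_bins.max() + 1
    w = np.zeros((K, noisy.shape[1]))
    for k in range(K):
        sel = snr_bins == k
        if sel.any():
            w[k] = (noisy[sel] - clean[sel]).mean(axis=0)
    return w

def apply_sdcn(noisy, snr_bins, w):
    """Restoration: x_hat = y - w[SNR_k] using each frame's SNR bin."""
    return noisy - w[snr_bins]
```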
Codeword Dependent Cepstral Normalization (CDCN)
• A schematic depiction of CDCN (Acero et al., 1990)
  – The distribution of clean speech is modeled by a mixture of Gaussians, which can be regarded as a kind of phone-dependent distribution
  – Given a noisy speech utterance Y, we can estimate the corresponding noise and channel vectors for each Gaussian via

    x = y - h - g(n - x - h)
CDCN (cont.)
• CDCN first models the distribution of the cepstral feature vectors of clean speech by a mixture of K Gaussian distributions

  p(x) = Σ_{k=0}^{K-1} c_k p(x|k) = Σ_{k=0}^{K-1} c_k N(x; μ_k, Σ_k)

• Secondly, CDCN assumes the conditional probability density function of a clean vector x given its corresponding noisy vector y and the Gaussian index k is a Gaussian distribution

  p(x|y, k) = N(x; y - h - r_k, Γ_k)
CDCN (cont.)
• Then, the CDCN algorithm proceeds in two steps
  1. The noise and channel vectors are estimated by MLE

     (r̂, ĥ) = argmax_{r,h} log p(Y|r, h), with r = [r_0, r_1, ..., r_{K-1}]

     where we assume r is phone-dependent and h is phone-independent
  2. The restored cepstral vector is calculated as a linear weighted sum of all Gaussian-dependent expected clean cepstral vectors

     x̂_MMSE = E[x|y] = Σ_{k=0}^{K-1} p(k|y) x̂_k
CDCN (cont.)
• The ML estimate of the noise and channel vectors

  (r̂, ĥ) = argmax_{r,h} log p(Y|r, h) = argmax_{r,h} Σ_{y∈Y} log p(y|r, h)

  where

  p(y|r, h) = Σ_{k=0}^{K-1} c_k (2π)^{-d/2} |Σ_k|^{-1/2} exp(-(1/2)(x̂_k - μ_k)^T Σ_k^{-1} (x̂_k - μ_k)), with x̂_k = y - h - r_k

• To obtain the optimum values r̂ and ĥ, we take the derivatives w.r.t. r and h respectively and set them to zero

  ∂ ln p(Y|r, h)/∂h = 0 and ∂ ln p(Y|r, h)/∂r_k = 0
Probabilistic Optimum Filtering (POF)
• POF is based on a probabilistic piecewise-linear transform of the acoustic space (Neumeyer et al., 1994)
  – A VQ algorithm partitions the clean feature space into K clusters
  – Each VQ region k is assigned a multidimensional transversal filter

    x̂_{t,k} = W_k^T Y_t

    where W_k^T = [A_{k,-p}, ..., A_{k,-1}, A_{k,0}, A_{k,1}, ..., A_{k,p}, b_k] is the multidimensional transversal filter of cluster k, and Y_t = [y_{t-p}^T, ..., y_t^T, ..., y_{t+p}^T, 1]^T is the extended observation vector built from a context window of 2p+1 frames
POF (cont.)
• The error between the clean vector and the estimate produced by the k-th filter is given by

  e_{t,k} = x_t - x̂_{t,k} = x_t - W_k^T Y_t

• The conditional error in each region is defined as

  E_k = Σ_{t=1}^{T_p} p(k|z_t) ||e_{t,k}||^2

  – where p(k|z_t) is the probability that the clean vector x_t belongs to cluster k given an arbitrary characteristic vector z_t
• The characteristic vector z_t can be any acoustic information cue generated from each frame of the speech utterance
  – E.g., instantaneous SNR, energy, cepstral coefficients
POF (cont.)
• To compute the optimum filters in the MMSE sense, we minimize the error E_k in each cluster
  – W_k can be obtained by taking the gradient of E_k with respect to it and equating the gradient to zero

    ∂E_k/∂W_k = 0 ⇒ W_k = R_k^{-1} r_k

    where R_k = Σ_{t=1}^{T_p} p(k|z_t) Y_t Y_t^T and r_k = Σ_{t=1}^{T_p} p(k|z_t) Y_t x_t^T

• The run-time estimate of the clean feature vector is computed by combining the outputs of all the filters as follows

    x̂_t = Σ_{k=0}^{K-1} p(k|z_t) W_k^T Y_t
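The normal-equation solution W_k = R_k^{-1} r_k above can be sketched with numpy. This is a minimal illustration assuming the cluster posteriors p(k|z_t) are given; a small ridge term is added for numerical safety (an assumption, not part of the original method):

```python
import numpy as np

def pof_train(noisy, clean, post, p=1):
    """Per-cluster MMSE transversal filters: W_k = R_k^{-1} r_k.

    noisy, clean: (T, D) frames; post: (T, K) cluster posteriors p(k|z_t);
    p: context half-width. Returns W of shape (K, (2p+1)*D + 1, D) and the
    extended observation matrix Y for reuse at run time.
    """
    T, D = noisy.shape
    K = post.shape[1]
    # Build extended vectors Y_t = [y_{t-p}; ...; y_t; ...; y_{t+p}; 1].
    pad = np.pad(noisy, ((p, p), (0, 0)), mode="edge")
    Y = np.hstack([pad[t:t + T] for t in range(2 * p + 1)] + [np.ones((T, 1))])
    W = np.empty((K, Y.shape[1], D))
    for k in range(K):
        wts = post[:, k:k + 1]
        R = (Y * wts).T @ Y          # R_k = sum_t p(k|z_t) Y_t Y_t^T
        r = (Y * wts).T @ clean      # r_k = sum_t p(k|z_t) Y_t x_t^T
        W[k] = np.linalg.solve(R + 1e-8 * np.eye(R.shape[0]), r)
    return W, Y

def pof_apply(Y, post, W):
    """Run-time estimate: x_hat_t = sum_k p(k|z_t) W_k^T Y_t."""
    return sum(post[:, k:k + 1] * (Y @ W[k]) for k in range(W.shape[0]))
```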
Maximum Likelihood based Stochastic Vector Mapping (ML-SVM)
• A schematic depiction of ML-SVM (Wu et al., 2005)
  – The distribution of the noisy speech is partitioned into E environmental clusters, each modeled by a GMM
  – For each environment cluster, a set of correction vectors is estimated and added to the noisy features, jointly with acoustic models trained on noisy speech
ML-SVM (cont.)
• Suppose the (noisy) training data can be partitioned into E environmental clusters, each modeled as a mixture of Gaussians

  p(y|e) = Σ_{k=1}^{K} p(k|e) p(y|k, e) = Σ_{k=1}^{K} p(k|e) N(y; μ_{e,k}, Σ_{e,k})

• Given a set of acoustic models, the aim of SVM is to estimate the restored clean feature vector x from the noisy one y by applying an environment-dependent transformation

  x̂ = F(y, Θ_e)
ML-SVM (cont.)
• The SVM function can take one of the following forms

  F_1(y, Θ_e) = y + Σ_{k=1}^{K} p(k|y, e) b_{e,k}
  F_2(y, Θ_e) = A_e y + Σ_{k=1}^{K} p(k|y, e) b_{e,k}
  F_3(y, Θ_e) = y + b_{e,k̂}
  F_4(y, Θ_e) = A_e y + b_{e,k̂}
  F_5(y, Θ_e) = A_e y + b_e

  where k̂ = argmax_{k'=1,...,K} p(k'|y, e) and p(k|y, e) = p(k|e) p(y|k, e) / Σ_{j=1}^{K} p(j|e) p(y|j, e)

• During recognition, given an unknown utterance Y
  – The most proximal environmental cluster e is first identified
  – Then the corresponding GMM and mapping function are used to derive a compensated version x of each noisy vector y
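The mapping forms above differ mainly in whether the codeword decision is soft (F_1) or hard (F_3), and whether an affine matrix A_e is included. A minimal sketch of the soft and hard variants, with the posteriors assumed precomputed from the environment's GMM:

```python
import numpy as np

def svm_map(y, post, b, A=None, form=1):
    """Stochastic vector mapping, soft (F_1/F_2) and hard (F_3/F_4) variants.

    y: (T, D) noisy features; post: (T, K) posteriors p(k|y,e) under the
    environment's GMM; b: (K, D) environment-dependent bias vectors;
    A: optional (D, D) matrix (passing A turns F_1 into F_2, F_3 into F_4).
    """
    Ay = y if A is None else y @ A.T
    if form == 1:                      # soft: Ay + sum_k p(k|y,e) b_ek
        return Ay + post @ b
    if form == 3:                      # hard: Ay + b_{e,k_hat}
        return Ay + b[post.argmax(axis=1)]
    raise ValueError("only the soft and hard variants are sketched here")
```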
ML-SVM (cont.)
• A flowchart of the joint maximum likelihood training of SVM
ML-SVM (cont.)
• The detailed procedure depicted in the above flowchart is as follows

Step 1: Initialization
A set of HMM acoustic models Λ with diagonal covariance matrices, trained on multi-condition training data, is used as the initial model, and the initial bias vectors are set to zero vectors

Step 2: Estimating SVM Function Parameters
Given the HMM acoustic model parameters Λ, for each environmental class e, Nb iterations of EM training are performed to estimate the environment-dependent mapping function parameters Θ_e = (b_{e,1}^T, ..., b_{e,K}^T)^T so as to increase the likelihood function

  L(Θ, Λ) = Σ_{i=1}^{I} log p(F(Y_i; Θ) | Λ)
ML-SVM (cont.)
Let us consider a particular environmental class e with the mapping F_1(y, Θ_e) = y + Σ_{k=1}^{K} p(k|y, e) b_{e,k}. The auxiliary (Q) function for e becomes

  Q_e = Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) log N(y_{it} + Σ_k p(k|y_{it}, e) b_{e,k}; μ_{sm}, Σ_{sm}) + Const

where γ_{it}(s, m) is the occupation probability of Gaussian component m of state s at time t. Setting the derivatives of Q_e w.r.t. the b_{e,k}'s to zero yields

  Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) p(k|y_{it}, e) Σ_{sm}^{-1} (y_{it} + Σ_{k'} p(k'|y_{it}, e) b_{e,k'} - μ_{sm}) = 0, for k = 1, ..., K
ML-SVM (cont.)
Since the above equation holds for all k, it is equivalent to solving, for each dimension d, for the vector B_d^(e) = (b_{e,1,d}, ..., b_{e,K,d})^T in the linear system

  A_d^(e) B_d^(e) = C_d^(e)

where A_d^(e) is a K × K matrix with (k, k')-th element

  a_d^(e)(k, k') = Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) p(k|y_{it}, e) p(k'|y_{it}, e) / σ_{sm,d}^2

and C_d^(e) is a K-dimensional vector with k-th element

  c_d^(e)(k) = Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) p(k|y_{it}, e) (μ_{sm,d} - y_{it,d}) / σ_{sm,d}^2

The estimation requires inverting the K × K matrix A_d^(e)
ML-SVM (cont.)
If the SVM function F_3(y, Θ_e) = y + b_{e,k̂} is used for feature compensation, the EM training formula for b_{e,k} can be derived similarly, with a much simpler updating form:

  b_{e,k,d} = [Σ_{i∈I(e)} Σ_t Σ_s Σ_m δ(k, argmax_{k'} p(k'|y_{it}, e)) γ_{it}(s, m) (μ_{sm,d} - y_{it,d}) / σ_{sm,d}^2] / [Σ_{i∈I(e)} Σ_t Σ_s Σ_m δ(k, argmax_{k'} p(k'|y_{it}, e)) γ_{it}(s, m) / σ_{sm,d}^2]

Step 3: Estimating HMM Acoustic Model Parameters
We transform each training utterance using its respective mapping function with parameters Θ_e. With the environment-compensated utterances, Nh EM iterations are then performed to re-estimate the HMM acoustic model parameters Λ to increase the likelihood function L(Θ, Λ)

Step 4: Repeat Step 2 and Step 3 Ne times
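The simpler F_3 bias update can be sketched directly, since each frame contributes only to its winning codeword. A minimal single-environment illustration with hypothetical inputs (GMM posteriors and HMM occupation probabilities assumed precomputed):

```python
import numpy as np

def f3_bias_update(y, post, gamma, mu, var):
    """One EM update of the F_3 biases b_{e,k,d} (hard codeword decision).

    y: (T, D) noisy frames of one environment; post: (T, K) GMM posteriors
    p(k|y, e); gamma: (T, M) HMM Gaussian occupation probabilities; mu, var:
    (M, D) means and diagonal variances of the HMM Gaussians.
    """
    K = post.shape[1]
    khat = post.argmax(axis=1)              # argmax_k' p(k'|y_t, e)
    b = np.zeros((K, y.shape[1]))
    for k in range(K):
        sel = khat == k                     # frames won by codeword k
        if not sel.any():
            continue
        g = gamma[sel]                      # (Tk, M)
        # numerator: sum_t,m gamma (mu_md - y_td) / var_md
        num = np.einsum("tm,tmd->d", g, (mu[None] - y[sel, None]) / var[None])
        # denominator: sum_t,m gamma / var_md
        den = (g @ (1.0 / var)).sum(axis=0)
        b[k] = num / den
    return b
```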
Maximum Mutual Information-SPLICE (MMI-SPLICE)
• MMI-SPLICE is much like SPLICE, but without the need for target clean feature vectors (i.e., no stereo data) (Droppo et al., 2005)
  – MMI-SPLICE learns to increase recognition accuracy directly, using a maximum mutual information (MMI) objective function
  – The feature mapping retains the SPLICE form

    x̂ = f(y, Θ) = y + Σ_{k=0}^{K-1} p(k|y) b_k, with Θ = {b_0, b_1, ..., b_{K-1}}

  – MMI objective function: the global objective is a sum of the objective functions of the individual training utterances

    F(Θ) = Σ_r F_r(Θ), where F_r(Θ) = ln p(w_r | X̂_r, Θ) = ln [ p(X̂_r, w_r | Θ) / Σ_w p(X̂_r, w | Θ) ]
MMI-SPLICE (cont.)
• The transformation parameters Θ = {b_0, b_1, ..., b_{K-1}} can be estimated by gradient-based methods
  – Any gradient ascent method can be used (e.g., conjugate gradient or BFGS)
• Since every F_r is a function of many conditional probabilities of HMM states, the chain rule gives

  ∂F_r/∂b_k = Σ_t Σ_s [∂F_r/∂ ln p(x̃_t|s)] · [∂ ln p(x̃_t|s)/∂x̃_t] · [∂x̃_t/∂b_k]
MMI-SPLICE (cont.)
• The required partial derivatives are
  – The objective w.r.t. the state log-likelihoods:

    ∂F_r/∂ ln p(x̃_t|s) = γ_{ts}^num - γ_{ts}^den

    where γ_{ts}^num = p(s_t = s | X̃_r, w_r, Θ) and γ_{ts}^den = p(s_t = s | X̃_r, Θ) are the numerator (forced-alignment) and denominator (recognition) state posteriors
  – The state log-likelihood w.r.t. the transformed feature (Gaussian state model):

    ∂ ln p(x̃_t|s)/∂x̃_t = Σ_s^{-1}(μ_s - x̃_t)

  – The transformed feature w.r.t. the bias (the codeword posteriors are computed from the noisy y_t and do not depend on the biases):

    ∂x̃_t/∂b_k = p(k|y_t)

• Combining the three factors,

  ∂F_r/∂b_k = Σ_t Σ_s (γ_{ts}^num - γ_{ts}^den) p(k|y_t) Σ_s^{-1}(μ_s - x̃_t)
Minimum Classification Error based SVM (MCE-SVM)
• Classification error function (Wu et al., 2006)

  d(Y; F, Θ, Λ) = -g_c(Y; F, Θ, Λ) + log [ (1/M) Σ_{n=1}^{M} exp(g_n(Y; F, Θ, Λ)) ]

  – g_c(Y; F, Θ, Λ) = LL(F(Y); Θ, Λ_c): the discriminant function for recognition decision making, i.e., the log-likelihood of the current enhanced feature vector sequence generated by the current HMMs of the correct word string Z_c
  – g_n(Y; F, Θ, Λ) = LL(F(Y); Θ, Λ_n): the anti-discriminant function, i.e., the log-likelihood of the current enhanced feature vector sequence generated by the HMMs of the competing word strings
• A continuous loss function is defined as

  l(Y; F, Θ, Λ) = 1 / (1 + exp(-d(Y; F, Θ, Λ)))

• Objective function

  l(Θ, Λ) = (1/R) Σ_{r=1}^{R} l(Y_r; F, Θ, Λ)

• The SVM function again takes the form F(y, Θ) = y + Σ_{k=1}^{K} p(k|y) b_k
MCE-SVM (cont.)
• Let Θ denote generically the parameters to be estimated; Θ is updated by generalized probabilistic descent as follows:

  Θ_{j+1} = Θ_j - ε_j V_j ∇l(Y_j; F, Θ, Λ)|_{Θ=Θ_j}

• In order to find the gradient ∇l, the following partial derivatives are used

  ∂l/∂Θ = (∂l/∂d)(∂d/∂Θ), with ∂l/∂d = l(1 - l)

  ∂d/∂Θ = -∂g_c/∂Θ + Σ_{n=1}^{M} [exp(g_n) / Σ_{n'=1}^{M} exp(g_{n'})] ∂g_n/∂Θ
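The misclassification measure d and the sigmoid loss l above can be sketched in a few lines; the slope parameter gamma is an assumed smoothing constant (commonly added in MCE formulations, gamma = 1 recovers the slide's form), and the log-sum-exp is computed stably:

```python
import numpy as np

def mce_loss(g_correct, g_competitors, gamma=1.0):
    """Misclassification measure d, sigmoid loss l, and dl/dd.

    g_correct: discriminant (log-likelihood) of the correct word string;
    g_competitors: array of competitor log-likelihoods (anti-discriminants).
    """
    g_comp = np.asarray(g_competitors, dtype=float)
    # d = -g_c + log( (1/M) * sum_n exp(g_n) ), computed stably.
    m = g_comp.max()
    d = -g_correct + m + np.log(np.mean(np.exp(g_comp - m)))
    l = 1.0 / (1.0 + np.exp(-gamma * d))       # smooth 0/1 loss
    dl_dd = gamma * l * (1.0 - l)              # needed for the GPD chain rule
    return d, l, dl_dd
```

A well-recognized utterance (g_correct far above all competitors) gives d ≪ 0 and a loss near 0; a misrecognized one gives d > 0 and a loss near 1.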
MCE-SVM (cont.)
• The remaining partial derivative ∂g_c/∂Θ or ∂g_n/∂Θ is formulated differently depending on the parameters to be optimized

  ∂LL(F(Y_j); Θ, Λ)/∂Θ = Σ_t Σ_s Σ_m γ_t(s, m) Σ_{sm}^{-1} (μ_{sm} - F(y_{jt})) ∂F(y_{jt})/∂Θ

• For each b_{e,k}, it follows that

  ∂F(y_{jt})/∂b_{e,k} = p(k|y_{jt}, e) if y_{jt} belongs to environment class e, and 0 otherwise

  Therefore

  ∂LL/∂b_{e,k} = Σ_t Σ_s Σ_m γ_t(s, m) p(k|y_{jt}, e) Σ_{sm}^{-1} (μ_{sm} - y_{jt} - Σ_{k'} p(k'|y_{jt}, e) b_{e,k'})

• Updating of the HMM acoustic model parameters is the same as in standard MCE training
Stochastic Matching
• The mismatch between the corrupted speech Y and the HMM acoustic models Λ_X can be reduced in two ways (Sankar et al., 1996)
  – Feature-space transformation: find an inverse distortion function F_ν that maps Y into X̂ = F_ν(Y), which matches better with the models Λ_X
  – Model-space transformation: find a model transformation function G_η that maps Λ_X to the transformed models Λ_Y, which match better with Y
• The stochastic matching algorithm operates only on the given test utterance and the given set of HMM acoustic models
  – No additional training data is required to estimate the mismatch prior to actual testing
Stochastic Matching (cont.)
• In the feature space, we need to find ν such that

  (ν̂, Ŵ) = argmax_{ν,W} p(Y|ν, W, Λ_X) p(W)

• We are interested in the problem of finding the parameters ν, so

  ν̂ = argmax_ν p(Y|ν, Λ_X)

• Let S = (s_1, ..., s_T) be the set of all possible state sequences and M = (m_1, ..., m_T) the set of all possible mixture sequences; the equation can then be written as

  ν̂ = argmax_ν Σ_S Σ_M p(Y, S, M|ν, Λ_X)

• In general, it is not easy to estimate ν directly, but for some forms of F_ν we can use the EM algorithm to estimate it
Stochastic Matching (cont.)
• Expectation (E-step)

  Q(ν'|ν) = Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) [ -(1/2)(g_{ν'}(y_t) - μ_{sm})^T Σ_{sm}^{-1} (g_{ν'}(y_t) - μ_{sm}) + log |J_{ν'}(y_t)| ]

• Maximization (M-step)

  ν̂ = argmax_{ν'} Q(ν'|ν)

• Here the inverse distortion function is x_t = g_ν(y_t) = F_ν^{-1}(y_t), so that

  p(y_t | s_t, c_t, ν, Λ_X) = N(g_ν(y_t); μ_{s_t,c_t}, Σ_{s_t,c_t}) |J_ν(y_t)|

  where |J_ν(y_t)| is the Jacobian of the transformation
• For simplification, we assume
  – The Jacobian |J_ν(y_t)| = 1
  – The covariance matrices are diagonal
Stochastic Matching (cont.)
• The auxiliary function can now be written as (with diagonal covariances and unit Jacobian)

  Q(ν'|ν) = -(1/2) Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) Σ_d (g_{ν'}(y_t)_d - μ_{sm,d})^2 / σ_{sm,d}^2 + Const

• For the estimation of the transformation parameters, we take the derivative of Q w.r.t. ν' and set it to zero:

  ∂Q/∂ν' = -Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) Σ_d [(g_{ν'}(y_t)_d - μ_{sm,d}) / σ_{sm,d}^2] ∂g_{ν'}(y_t)_d/∂ν' = 0
Stochastic Matching (cont.)
• We now consider a special case of F_ν: an additive cepstral bias, i.e., y_t = x_t + b_t, so that g_ν(y_t) = y_t - b_t, |J_ν(y_t)| = 1, and x̂_t = F_ν^{-1}(y_t) = y_t - b_t
• We may model the bias as either fixed for an utterance or varying with time (depending on the state)
  – Fixed bias:

    b_d = [Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) (y_{t,d} - μ_{s,m,d}) / σ_{s,m,d}^2] / [Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) / σ_{s,m,d}^2]

  – State-dependent bias:

    b_{s,d} = [Σ_{t=0}^{T-1} Σ_{m=0}^{M-1} γ_t(s, m) (y_{t,d} - μ_{s,m,d}) / σ_{s,m,d}^2] / [Σ_{t=0}^{T-1} Σ_{m=0}^{M-1} γ_t(s, m) / σ_{s,m,d}^2]
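The fixed-bias M-step above has a closed form that can be sketched directly; this is a minimal illustration assuming the occupation probabilities are given (a full EM would recompute them from x̂ = y - b on each iteration):

```python
import numpy as np

def ml_fixed_bias(y, gamma, mu, var):
    """Closed-form M-step for a fixed utterance-level cepstral bias b
    under y_t = x_t + b (unit Jacobian, diagonal covariances).

    y: (T, D) noisy frames; gamma: (T, M) occupation probabilities of the
    HMM Gaussians; mu, var: (M, D) Gaussian means and diagonal variances.
    """
    # numerator: sum_{t,m} gamma_t(m) (y_td - mu_md) / var_md
    num = np.einsum("tm,tmd->d", gamma, (y[:, None] - mu[None]) / var[None])
    # denominator: sum_{t,m} gamma_t(m) / var_md
    den = (gamma @ (1.0 / var)).sum(axis=0)
    return num / den
```

With a single unit-variance Gaussian and uniform occupancy this reduces, as expected, to the mean offset between the noisy frames and the model mean.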
Conclusions
• In this presentation, we have surveyed several codebook-based cepstral normalization methods
  – Using either stereo data or noisy data only
  – With various optimization criteria: Maximum Likelihood (ML), Minimum Classification Error (MCE), Maximum Mutual Information (MMI)
• Further studies
  – Exploitation of other optimization criteria, e.g., the Minimum Phone/Word Error (MPE/MWE) criteria: would they have similar effectiveness in LVCSR applications?
  – Utilization of the distribution characteristics of speech feature vectors, e.g., combination with Histogram EQualization (HEQ) approaches
Conclusions (cont.)
• Comparison of various codebook-based feature compensation methods

Method              | Stereo Data | Codebook              | Optimization Criterion | Bias Estimation | Training Complexity | Test Complexity
SPLICE              | Yes         | Noisy Data            | MMSE                   | Data Driven     | Low                 | Low
SDCN                | Yes         | SNR                   | MMSE                   | Data Driven     | Low                 | Low
CDCN                | Yes         | Clean Data            | ML & MMSE              | Model Driven    | N/A                 | High
POF                 | Yes         | Characteristic Vector | MMSE                   | Data Driven     | High                | Low
ML-SVM              | No          | Noisy Data            | ML                     | Model Driven    | High                | Low
MMI-SPLICE          | No          | Noisy Data            | MMI                    | Model Driven    | High                | Low
MCE-SVM             | No          | Noisy Data            | MCE                    | Model Driven    | High                | Low
Stochastic Matching | No          | N/A                   | ML                     | Model Driven    | High                | Low
References
• SDCN, CDCN, FCDCN
  – A. Acero and R. M. Stern, "Environmental Robustness in Automatic Speech Recognition," in Proc. ICASSP 1990
  – A. Acero and R. M. Stern, "Robust Speech Recognition by Normalization of the Acoustic Space," in Proc. ICASSP 1991
• POF
  – L. Neumeyer and M. Weintraub, "Probabilistic Optimum Filtering for Robust Speech Recognition," in Proc. ICASSP 1994
• SPLICE
  – L. Deng, A. Acero, M. Plumpe and X. Huang, "Large-Vocabulary Speech Recognition under Adverse Acoustic Environments," in Proc. ICSLP 2000
  – L. Deng, A. Acero, L. Jiang, J. Droppo and X.-D. Huang, "High-Performance Robust Speech Recognition Using Stereo Training Data," in Proc. ICASSP 2002
References (cont.)
  – J. Droppo, A. Acero and L. Deng, "Evaluation of the SPLICE Algorithm on the Aurora2 Database," in Proc. EuroSpeech 2001
• Stochastic Matching
  – A. Sankar and C.-H. Lee, "A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Trans. Speech and Audio Processing, 1996
• ML-SVM
  – J. Wu, Q. Huo and D. Zhu, "An Environment Compensated Maximum Likelihood Training Approach Based on Stochastic Vector Mapping," in Proc. ICASSP 2005
  – Q. Huo and D. Zhu, "A Maximum Likelihood Training Approach to Irrelevant Variability Compensation Based on Piecewise Linear Transformations," in Proc. ICSLP 2006
  – D. Zhu and Q. Huo, "A Maximum Likelihood Approach to Unsupervised Online Adaptation of Stochastic Vector Mapping Function for Robust Speech Recognition," in Proc. ICASSP 2007
References (cont.)
• MCE-SVM
  – J. Wu and Q. Huo, "An Environment-Compensated Minimum Classification Error Training Approach Based on Stochastic Vector Mapping," IEEE Trans. Audio, Speech and Language Processing 14(6), 2006
• MMI-SPLICE
  – J. Droppo and A. Acero, "Maximum Mutual Information SPLICE Transform for Seen and Unseen Conditions," in Proc. EuroSpeech 2005
  – J. Droppo, M. Mahajan, A. Gunawardana and A. Acero, "How to Train a Discriminative Front End with Stochastic Gradient Descent and Maximum Mutual Information," in Proc. ASRU 2005
• Others
  – H. Liao and M. J. F. Gales, "Joint Uncertainty Decoding for Noise Robust Speech Recognition," in Proc. EuroSpeech 2005