TRANSCRIPT
Codebook-based Feature Compensation for Robust Speech Recognition
2007/02/08
Shih-Hsiang Lin (林士翔)
Graduate Student
National Taiwan Normal University, Taipei, Taiwan
Outline
• Introduction
• Codebook-based Cepstral Compensation
  – With Both Clean and Noisy (Stereo) Data
    • Stereo-based Piecewise Linear Compensation (SPLICE)
    • SNR Dependent Cepstral Normalization (SDCN)
    • Codeword Dependent Cepstral Normalization (CDCN)
    • Probabilistic Optimum Filtering (POF)
  – With Only Noisy Data
    • Maximum Likelihood based Stochastic Vector Mapping (ML-SVM)
    • Maximum Mutual Information-SPLICE (MMI-SPLICE)
    • Minimum Classification Error based SVM (MCE-SVM)
• Stochastic Matching
• Conclusions
Introduction
• Speech recognition performance degrades severely when there is a mismatch between training and test acoustic conditions
  – However, training systems for all noise conditions is impractical
• A Simplified Distortion Framework
  – Channel effects (convolutional noise) are usually assumed to be constant while uttering
  – Additive background noises can be either stationary or non-stationary

  y_t = x_t * h_t + n_t

  where x_t is the clean speech, h_t the channel impulse response, n_t the additive background noise, and y_t the resulting noisy speech
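The simplified distortion framework above can be simulated directly. A minimal numpy sketch, with purely illustrative signal values (the channel taps and SNR are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative signals: x stands in for clean speech, h is a short
# channel impulse response, n is additive background noise.
x = rng.standard_normal(16000)      # 1 s of "clean speech" at 16 kHz
h = np.array([1.0, 0.5, 0.25])      # channel, assumed constant while uttering
snr_db = 10.0                       # target SNR for the additive noise

xh = np.convolve(x, h)              # convolutional (channel) distortion
n = rng.standard_normal(xh.size)
# Scale the noise so that 10*log10(power(xh)/power(n)) == snr_db.
n *= np.sqrt(np.mean(xh**2) / (np.mean(n**2) * 10**(snr_db / 10)))
y = xh + n                          # y_t = x_t * h_t + n_t
```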
Introduction (cont.)
• Non-linear Environmental Distortions
  – Clean speech corrupted by 10 dB subway noise
  – Not only linear but also non-linear distortions are involved
[Figure: scatter plots of clean vs. noisy Mel filter bank outputs, cepstral coefficients, and log energy]
Introduction (cont.)
• Two main approaches to improving noise robustness
  – Model Compensation: adapt the acoustic models to match the corrupted speech features
  – Feature Compensation: restore the corrupted speech features to the corresponding clean ones
[Figure: clean speech under training conditions vs. noisy speech under test conditions; feature compensation operates in the feature space, model compensation in the model space, mapping clean acoustic models to noisy acoustic models]
Introduction (cont.)
• Model Compensation
  – Adapt acoustic models to match the corrupted speech features
  – Representative approaches: MAP, MLLR, MAPLR, etc.
• Feature Compensation
  – Restore corrupted speech features to the corresponding clean ones
  – Representative approaches: SS, CMN, CMVN, etc.
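As a concrete instance of the simple feature-compensation approaches listed above, cepstral mean and variance normalization (CMVN) fits in a few lines; a minimal utterance-level sketch, not tied to any particular toolkit:

```python
import numpy as np

def cmvn(c):
    """Cepstral mean and variance normalization over one utterance.

    c: (T, D) array of cepstral feature vectors. Subtracting the
    per-dimension mean removes a stationary convolutional (channel) bias;
    dividing by the standard deviation normalizes the dynamic range.
    """
    return (c - c.mean(axis=0)) / (c.std(axis=0) + 1e-8)
```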
[Figure: model compensation maps the clean acoustic models to corrupted models; feature compensation maps corrupted speech features through a compensation model to clean speech features]
Theme of Presented Compensation Approaches
• Codebook-based Cepstral Compensation
  – All the presented approaches involve the use of vector quantization (VQ): a universal codebook is used to find a correction vector for each feature vector, which is then applied for compensation
Theme of Presented Compensation Approaches (cont.)
With Both Clean and Noisy (Stereo) Data:
• SDCN (Data-Driven), CDCN (Model-based) (Acero and Stern, 1990)
• FCDCN (Data-Driven) (Acero and Stern, 1991)
• POF (Data-Driven) (Neumeyer and Weintraub, 1994)
• SPLICE (Data-Driven) (Deng and Acero et al., 2000)

With Only Noisy Data:
• Stochastic Matching (Model-based) (Sankar and Lee, 1996)
• ML-SVM (Model-based) (Wu, Huo and Zhu, 2005)
• MMI-SPLICE (Model-based) (Droppo and Acero, 2005)
• MCE-SVM (Model-based) (Wu and Huo, 2006)
• Unsupervised ML-SVM (Model-based) (Zhu and Huo, 2007)
Stereo-based Piecewise Linear Compensation (SPLICE)
• Estimation of the correction vector (bias) (Deng et al., 2000)
  – A mixture of Gaussians is trained on the noisy data
  – For each Gaussian component (codeword), a linear relationship between the clustered noisy data and its corresponding clean counterpart is assumed
SPLICE (cont.)
• The success of SPLICE rests on two assumptions
  1. The noisy cepstral feature vectors follow a mixture-of-Gaussians distribution, which can be thought of as a "codebook" with a total of K codewords

     p(y) = Σ_k p(y|k) p(k), where p(y|k) = N(y; μ_k, Σ_k)

  2. The conditional probability density function of a clean vector x given its corresponding noisy vector y and the cluster index k is a Gaussian whose mean vector is a linear transformation of y

     p(x|y, k) = N(x; y + r_k, Γ_k)

     Only a mean shift (without rotation) is assumed here
• The correction vectors (biases) r_k can be estimated in the MMSE sense given the clean training speech vectors and their noisy counterparts
SPLICE (cont.)
• Therefore, the clean feature vector can be restored from its corresponding noisy feature vector by adding a linear weighted sum of all codeword-dependent bias vectors

  x̂ = E[x|y] = Σ_k p(k|y) E[x|y, k] = Σ_k p(k|y) (y + r_k) = y + Σ_k p(k|y) r_k

  since y is an (observed) known constant vector here
• Alternatively, we can simply use the single best codeword to obtain the restored clean feature

  x̂ = y + r_k̂, where k̂ = argmax_k p(k|y)
SPLICE (cont.)
• The MMSE estimate of the correction vector (bias) r_k can be expressed as

  r_k = Σ_{y∈Y} p(k|y) (x - y) / Σ_{y∈Y} p(k|y), where p(k|y) = p(y|k) p(k) / Σ_{k'} p(y|k') p(k')

  – The use of stereo (both clean and noisy) training data provides an accurate estimate of the correction vectors
• SPLICE has good potential to effectively handle a wide range of distortions
  – Nonstationary distortion, jointly additive and/or convolutional distortion, and even nonlinear distortion of the original speech signal
• However, in many ASR applications, stereo data are too expensive to collect
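The SPLICE training and restoration equations above can be sketched in numpy. This is a minimal illustration assuming the codebook GMM parameters (diagonal covariances) are already trained; all variable names are hypothetical:

```python
import numpy as np

def gmm_posterior(y, means, var, weights):
    """p(k|y) under a diagonal-covariance Gaussian-mixture "codebook".

    y: (T, D) frames; means, var: (K, D); weights: (K,).
    """
    d2 = (y[:, None, :] - means[None]) ** 2 / var[None]
    logp = -0.5 * (d2.sum(-1) + np.log(2 * np.pi * var).sum(-1)[None])
    logp += np.log(weights)[None]
    logp -= logp.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def splice_train(noisy, clean, means, var, weights):
    """MMSE biases: r_k = sum_t p(k|y_t)(x_t - y_t) / sum_t p(k|y_t)."""
    post = gmm_posterior(noisy, means, var, weights)
    return post.T @ (clean - noisy) / post.sum(axis=0)[:, None]

def splice_apply(noisy, means, var, weights, r):
    """Restoration: x_hat = y + sum_k p(k|y) r_k."""
    return noisy + gmm_posterior(noisy, means, var, weights) @ r
```

With stereo data whose mismatch is a constant offset, the estimated biases recover that offset exactly, since the posteriors of each frame sum to one.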
SNR Dependent Cepstral Normalization (SDCN)
• Notice that in the simplified distortion framework, the noisy cepstral feature vector can be expressed as (Acero et al., 1990)

  y = x + h + g(n - x - h), where g(z) = C log(1 + exp(C^{-1} z))

  with
  y = C [log Y(f_0), log Y(f_1), ..., log Y(f_M)]^T
  x = C [log X(f_0), log X(f_1), ..., log X(f_M)]^T
  h = C [log H(f_0), log H(f_1), ..., log H(f_M)]^T
  n = C [log N(f_0), log N(f_1), ..., log N(f_M)]^T

  where C is the discrete cosine transform matrix and f_m is the m-th filter bank bin
  – g(·) is a non-linear function and is very difficult to estimate
• Therefore, SDCN attempts to restore the clean cepstral vector using a compensation vector that approximates h + g(n - x - h) for different SNR levels
SDCN (cont.)
• A schematic depiction of SDCN
  – Each codeword of the universal codebook represents a specific SNR level; for each SNR level (codeword) k, a compensation vector w[SNR_k] is estimated
  – The (instantaneous) SNR can be calculated in proportion to the difference between the C[0] of the input frame and the C[0] of the noise at a reference time
SDCN (cont.)
• Therefore, a compensation vector that depends only on the instantaneous SNR of the observed feature vector can be used to restore the clean one

  x̂ = y - w[SNR_k], where w[SNR_k] ≈ h + g(n - x - h)

• Two extreme cases
  – High SNR (x + h ≫ n): C log(1 + exp(C^{-1}(n - x - h))) ≈ 0, so y ≈ x + h and w[SNR_k] ≈ h → removal of the channel effect
  – Low SNR (n ≫ x + h): C log(1 + exp(C^{-1}(n - x - h))) ≈ n - x - h, so y ≈ n and w[SNR_k] ≈ n - x → removal of the additive noise effect
SDCN (cont.)
• The compensation vectors were estimated in the MMSE sense using

  w[SNR_k] = Σ_{x∈X} (x_channel1 - x_channel2) δ(SNR(x_channel1), SNR_k) / Σ_{x∈X} δ(SNR(x_channel1), SNR_k)

  where
  x_channel1: cepstral vectors for the test (noisy) condition
  x_channel2: cepstral vectors for the standard acoustical (clean) condition
  δ(·,·): Kronecker delta function
  SNR(x_channel1): the instantaneous SNR level of x_channel1
• Disadvantage
  – For a new test environment, the compensation vectors have to be re-estimated using a new, sufficiently large set of stereo data
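The per-SNR estimation and compensation steps above can be sketched as follows, assuming the instantaneous SNR of each frame has already been quantized into a bin index (the binning itself is outside this sketch):

```python
import numpy as np

def estimate_sdcn_vectors(noisy, clean, snr_bins):
    """Estimate one compensation vector per instantaneous-SNR bin from
    stereo data: w[SNR_k] = mean over frames in bin k of (y - x).

    noisy, clean: (T, D) time-aligned cepstra; snr_bins: (T,) int bin index.
    """
    K = snr_bins.max() + 1
    w = np.zeros((K, noisy.shape[1]))
    for k in range(K):
        sel = snr_bins == k
        if sel.any():
            w[k] = (noisy[sel] - clean[sel]).mean(axis=0)
    return w

def apply_sdcn(noisy, snr_bins, w):
    """Restoration: x_hat = y - w[SNR_k] using each frame's SNR bin."""
    return noisy - w[snr_bins]
```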
Codeword Dependent Cepstral Normalization (CDCN)
• A schematic depiction of CDCN (Acero et al., 1990)
  – The distribution of clean speech is modeled by a mixture of Gaussians, which can be regarded as a kind of phone-dependent distribution
  – Given a noisy speech utterance Y, we can estimate the corresponding noise and channel vectors for each Gaussian via

    x = y - h - g(n - x - h)
CDCN (cont.)
• CDCN first models the distribution of the cepstral feature vectors of clean speech by a mixture of K Gaussian distributions

  p(x) = Σ_{k=0}^{K-1} c_k p(x|k) = Σ_{k=0}^{K-1} c_k N(x; μ_k, Σ_k)

• Secondly, CDCN assumes the conditional probability density function of a clean vector x given its corresponding noisy vector y and the Gaussian index k is a Gaussian distribution

  p(x|y, k) = N(x; y - h - r_k, Γ_k)
CDCN (cont.)
• Then, the CDCN algorithm proceeds in two steps
  1. The noise and channel vectors are estimated by MLE

     (r̂, ĥ) = argmax_{r,h} log p(Y|r, h), with r = [r_0, r_1, ..., r_{K-1}]

     where we assume r is phone-dependent and h is phone-independent
  2. The restored cepstral vector is calculated as a linear weighted sum of all Gaussian-dependent expected clean cepstral vectors

     x̂_MMSE = E[x|y] = Σ_{k=0}^{K-1} p(k|y) x̂_k
CDCN (cont.)
• The ML estimate of the noise and channel vectors

  (r̂, ĥ) = argmax_{r,h} log p(Y|r, h) = argmax_{r,h} Σ_{y∈Y} log p(y|r, h)

  where

  p(y|r, h) = Σ_{k=0}^{K-1} c_k (2π)^{-d/2} |Σ_k|^{-1/2} exp(-(1/2)(x̂_k - μ_k)^T Σ_k^{-1} (x̂_k - μ_k)), with x̂_k = y - h - r_k

• To obtain the optimum values r̂ and ĥ, we take the derivatives w.r.t. r and h respectively and set them to zero

  ∂ ln p(Y|r, h)/∂h = 0 and ∂ ln p(Y|r, h)/∂r_k = 0
Probabilistic Optimum Filtering (POF)
• POF is based on a probabilistic piecewise-linear transform of the acoustic space (Neumeyer et al., 1994)
  – A VQ algorithm partitions the clean feature space into K clusters
  – Each VQ region k is assigned a multidimensional transversal filter

    x̂_{t,k} = W_k^T Y_t

    where W_k^T = [A_{k,-p}, ..., A_{k,-1}, A_{k,0}, A_{k,1}, ..., A_{k,p}, b_k] is the multidimensional transversal filter of cluster k, and Y_t = [y_{t-p}^T, ..., y_t^T, ..., y_{t+p}^T, 1]^T is the extended observation vector built from a context window of 2p+1 frames
POF (cont.)
• The error between the clean vector and the estimate produced by the k-th filter is given by

  e_{t,k} = x_t - x̂_{t,k} = x_t - W_k^T Y_t

• The conditional error in each region is defined as

  E_k = Σ_{t=1}^{T_p} p(k|z_t) ||e_{t,k}||^2

  – where p(k|z_t) is the probability that the clean vector x_t belongs to cluster k given an arbitrary characteristic vector z_t
• The characteristic vector z_t can be any acoustic information cue generated from each frame of the speech utterance
  – E.g., instantaneous SNR, energy, cepstral coefficients
POF (cont.)
• To compute the optimum filters in the MMSE sense, we minimize the error E_k in each cluster
  – W_k can be obtained by taking the gradient of E_k with respect to it and equating the gradient to zero

    ∂E_k/∂W_k = 0 ⇒ W_k = R_k^{-1} r_k

    where R_k = Σ_{t=1}^{T_p} p(k|z_t) Y_t Y_t^T and r_k = Σ_{t=1}^{T_p} p(k|z_t) Y_t x_t^T

• The run-time estimate of the clean feature vector is computed by combining the outputs of all the filters as follows

    x̂_t = Σ_{k=0}^{K-1} p(k|z_t) W_k^T Y_t
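The normal-equation solution W_k = R_k^{-1} r_k above can be sketched with numpy. This is a minimal illustration assuming the cluster posteriors p(k|z_t) are given; a small ridge term is added for numerical safety (an assumption, not part of the original method):

```python
import numpy as np

def pof_train(noisy, clean, post, p=1):
    """Per-cluster MMSE transversal filters: W_k = R_k^{-1} r_k.

    noisy, clean: (T, D) frames; post: (T, K) cluster posteriors p(k|z_t);
    p: context half-width. Returns W of shape (K, (2p+1)*D + 1, D) and the
    extended observation matrix Y for reuse at run time.
    """
    T, D = noisy.shape
    K = post.shape[1]
    # Build extended vectors Y_t = [y_{t-p}; ...; y_t; ...; y_{t+p}; 1].
    pad = np.pad(noisy, ((p, p), (0, 0)), mode="edge")
    Y = np.hstack([pad[t:t + T] for t in range(2 * p + 1)] + [np.ones((T, 1))])
    W = np.empty((K, Y.shape[1], D))
    for k in range(K):
        wts = post[:, k:k + 1]
        R = (Y * wts).T @ Y          # R_k = sum_t p(k|z_t) Y_t Y_t^T
        r = (Y * wts).T @ clean      # r_k = sum_t p(k|z_t) Y_t x_t^T
        W[k] = np.linalg.solve(R + 1e-8 * np.eye(R.shape[0]), r)
    return W, Y

def pof_apply(Y, post, W):
    """Run-time estimate: x_hat_t = sum_k p(k|z_t) W_k^T Y_t."""
    return sum(post[:, k:k + 1] * (Y @ W[k]) for k in range(W.shape[0]))
```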
Maximum Likelihood based Stochastic Vector Mapping (ML-SVM)
• A schematic depiction of ML-SVM (Wu et al., 2005)
  – The distribution of the noisy speech is partitioned into E environmental clusters, each modeled by a GMM
  – For each environment cluster, a set of correction vectors is estimated and added to the noisy features, jointly with acoustic models trained on noisy speech
ML-SVM (cont.)
• Suppose the (noisy) training data can be partitioned into E environmental clusters, each modeled as a mixture of Gaussians

  p(y|e) = Σ_{k=1}^{K} p(k|e) p(y|k, e) = Σ_{k=1}^{K} p(k|e) N(y; μ_{e,k}, Σ_{e,k})

• Given a set of acoustic models, the aim of SVM is to estimate the restored clean feature vector x from the noisy one y by applying an environment-dependent transformation

  x̂ = F(y, Θ_e)
ML-SVM (cont.)
• The SVM function can take one of the following forms

  F_1(y, Θ_e) = y + Σ_{k=1}^{K} p(k|y, e) b_{e,k}
  F_2(y, Θ_e) = A_e y + Σ_{k=1}^{K} p(k|y, e) b_{e,k}
  F_3(y, Θ_e) = y + b_{e,k̂}
  F_4(y, Θ_e) = A_e y + b_{e,k̂}
  F_5(y, Θ_e) = A_e y + b_e

  where k̂ = argmax_{k'=1,...,K} p(k'|y, e) and p(k|y, e) = p(k|e) p(y|k, e) / Σ_{j=1}^{K} p(j|e) p(y|j, e)

• During recognition, given an unknown utterance Y
  – The most proximal environmental cluster e is first identified
  – Then the corresponding GMM and mapping function are used to derive a compensated version x of each noisy vector y
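The mapping forms above differ mainly in whether the codeword decision is soft (F_1) or hard (F_3), and whether an affine matrix A_e is included. A minimal sketch of the soft and hard variants, with the posteriors assumed precomputed from the environment's GMM:

```python
import numpy as np

def svm_map(y, post, b, A=None, form=1):
    """Stochastic vector mapping, soft (F_1/F_2) and hard (F_3/F_4) variants.

    y: (T, D) noisy features; post: (T, K) posteriors p(k|y,e) under the
    environment's GMM; b: (K, D) environment-dependent bias vectors;
    A: optional (D, D) matrix (passing A turns F_1 into F_2, F_3 into F_4).
    """
    Ay = y if A is None else y @ A.T
    if form == 1:                      # soft: Ay + sum_k p(k|y,e) b_ek
        return Ay + post @ b
    if form == 3:                      # hard: Ay + b_{e,k_hat}
        return Ay + b[post.argmax(axis=1)]
    raise ValueError("only the soft and hard variants are sketched here")
```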
ML-SVM (cont.)
• A flowchart of the joint maximum likelihood training of SVM
ML-SVM (cont.)
• The detailed procedure depicted in the above flowchart is as follows

Step 1: Initialization
A set of HMM acoustic models Λ with diagonal covariance matrices, trained on multi-condition training data, is used as the initial model, and the initial bias vectors are set to zero vectors

Step 2: Estimating SVM Function Parameters
Given the HMM acoustic model parameters Λ, for each environmental class e, Nb iterations of EM training are performed to estimate the environment-dependent mapping function parameters Θ_e = (b_{e,1}^T, ..., b_{e,K}^T)^T so as to increase the likelihood function

  L(Θ, Λ) = Σ_{i=1}^{I} log p(F(Y_i; Θ) | Λ)
ML-SVM (cont.)
Let us consider a particular environmental class e with the mapping F_1(y, Θ_e) = y + Σ_{k=1}^{K} p(k|y, e) b_{e,k}. The auxiliary (Q) function for e becomes

  Q_e = Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) log N(y_{it} + Σ_k p(k|y_{it}, e) b_{e,k}; μ_{sm}, Σ_{sm}) + Const

where γ_{it}(s, m) is the occupation probability of Gaussian component m of state s at time t. Setting the derivatives of Q_e w.r.t. the b_{e,k}'s to zero yields

  Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) p(k|y_{it}, e) Σ_{sm}^{-1} (y_{it} + Σ_{k'} p(k'|y_{it}, e) b_{e,k'} - μ_{sm}) = 0, for k = 1, ..., K
ML-SVM (cont.)
Since the above equation holds for all k, it is equivalent to solving, for each dimension d, for the vector B_d^(e) = (b_{e,1,d}, ..., b_{e,K,d})^T in the linear system

  A_d^(e) B_d^(e) = C_d^(e)

where A_d^(e) is a K × K matrix with (k, k')-th element

  a_d^(e)(k, k') = Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) p(k|y_{it}, e) p(k'|y_{it}, e) / σ_{sm,d}^2

and C_d^(e) is a K-dimensional vector with k-th element

  c_d^(e)(k) = Σ_{i∈I(e)} Σ_t Σ_s Σ_m γ_{it}(s, m) p(k|y_{it}, e) (μ_{sm,d} - y_{it,d}) / σ_{sm,d}^2

The estimation requires inverting the K × K matrix A_d^(e)
ML-SVM (cont.)
If the SVM function F_3(y, Θ_e) = y + b_{e,k̂} is used for feature compensation, the EM training formula for b_{e,k} can be derived similarly, with a much simpler updating form:

  b_{e,k,d} = [Σ_{i∈I(e)} Σ_t Σ_s Σ_m δ(k, argmax_{k'} p(k'|y_{it}, e)) γ_{it}(s, m) (μ_{sm,d} - y_{it,d}) / σ_{sm,d}^2] / [Σ_{i∈I(e)} Σ_t Σ_s Σ_m δ(k, argmax_{k'} p(k'|y_{it}, e)) γ_{it}(s, m) / σ_{sm,d}^2]

Step 3: Estimating HMM Acoustic Model Parameters
We transform each training utterance using its respective mapping function with parameters Θ_e. With the environment-compensated utterances, Nh EM iterations are then performed to re-estimate the HMM acoustic model parameters Λ to increase the likelihood function L(Θ, Λ)

Step 4: Repeat Step 2 and Step 3 Ne times
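The simpler F_3 bias update can be sketched directly, since each frame contributes only to its winning codeword. A minimal single-environment illustration with hypothetical inputs (GMM posteriors and HMM occupation probabilities assumed precomputed):

```python
import numpy as np

def f3_bias_update(y, post, gamma, mu, var):
    """One EM update of the F_3 biases b_{e,k,d} (hard codeword decision).

    y: (T, D) noisy frames of one environment; post: (T, K) GMM posteriors
    p(k|y, e); gamma: (T, M) HMM Gaussian occupation probabilities; mu, var:
    (M, D) means and diagonal variances of the HMM Gaussians.
    """
    K = post.shape[1]
    khat = post.argmax(axis=1)              # argmax_k' p(k'|y_t, e)
    b = np.zeros((K, y.shape[1]))
    for k in range(K):
        sel = khat == k                     # frames won by codeword k
        if not sel.any():
            continue
        g = gamma[sel]                      # (Tk, M)
        # numerator: sum_t,m gamma (mu_md - y_td) / var_md
        num = np.einsum("tm,tmd->d", g, (mu[None] - y[sel, None]) / var[None])
        # denominator: sum_t,m gamma / var_md
        den = (g @ (1.0 / var)).sum(axis=0)
        b[k] = num / den
    return b
```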
Maximum Mutual Information-SPLICE (MMI-SPLICE)
• MMI-SPLICE is much like SPLICE, but without the need for target clean feature vectors (i.e., no stereo data) (Droppo et al., 2005)
  – MMI-SPLICE learns to increase recognition accuracy directly, using a maximum mutual information (MMI) objective function
  – The feature mapping retains the SPLICE form

    x̂ = f(y, Θ) = y + Σ_{k=0}^{K-1} p(k|y) b_k, with Θ = {b_0, b_1, ..., b_{K-1}}

  – MMI objective function: the global objective is a sum of the objective functions of the individual training utterances

    F(Θ) = Σ_r F_r(Θ), where F_r(Θ) = ln p(w_r | X̂_r, Θ) = ln [ p(X̂_r, w_r | Θ) / Σ_w p(X̂_r, w | Θ) ]
MMI-SPLICE (cont.)
• The transformation parameters Θ = {b_0, b_1, ..., b_{K-1}} can be estimated by gradient-based methods
  – Any gradient ascent method can be used (e.g., conjugate gradient or BFGS)
• Since every F_r is a function of many conditional probabilities of HMM states, the chain rule gives

  ∂F_r/∂b_k = Σ_t Σ_s [∂F_r/∂ ln p(x̃_t|s)] · [∂ ln p(x̃_t|s)/∂x̃_t] · [∂x̃_t/∂b_k]
MMI-SPLICE (cont.)
• The required partial derivatives are
  – The objective w.r.t. the state log-likelihoods:

    ∂F_r/∂ ln p(x̃_t|s) = γ_{ts}^num - γ_{ts}^den

    where γ_{ts}^num = p(s_t = s | X̃_r, w_r, Θ) and γ_{ts}^den = p(s_t = s | X̃_r, Θ) are the numerator (forced-alignment) and denominator (recognition) state posteriors
  – The state log-likelihood w.r.t. the transformed feature (Gaussian state model):

    ∂ ln p(x̃_t|s)/∂x̃_t = Σ_s^{-1}(μ_s - x̃_t)

  – The transformed feature w.r.t. the bias (the codeword posteriors are computed from the noisy y_t and do not depend on the biases):

    ∂x̃_t/∂b_k = p(k|y_t)

• Combining the three factors,

  ∂F_r/∂b_k = Σ_t Σ_s (γ_{ts}^num - γ_{ts}^den) p(k|y_t) Σ_s^{-1}(μ_s - x̃_t)
Minimum Classification Error based SVM (MCE-SVM)
• Classification error function (Wu et al., 2006)

  d(Y; F, Θ, Λ) = -g_c(Y; F, Θ, Λ) + log [ (1/M) Σ_{n=1}^{M} exp(g_n(Y; F, Θ, Λ)) ]

  – g_c(Y; F, Θ, Λ) = LL(F(Y); Θ, Λ_c): the discriminant function for recognition decision making, i.e., the log-likelihood of the current enhanced feature vector sequence generated by the current HMMs of the correct word string Z_c
  – g_n(Y; F, Θ, Λ) = LL(F(Y); Θ, Λ_n): the anti-discriminant function, i.e., the log-likelihood of the current enhanced feature vector sequence generated by the HMMs of the competing word strings
• A continuous loss function is defined as

  l(Y; F, Θ, Λ) = 1 / (1 + exp(-d(Y; F, Θ, Λ)))

• Objective function

  l(Θ, Λ) = (1/R) Σ_{r=1}^{R} l(Y_r; F, Θ, Λ)

• The SVM function again takes the form F(y, Θ) = y + Σ_{k=1}^{K} p(k|y) b_k
MCE-SVM (cont.)
• Let Θ denote generically the parameters to be estimated; Θ is updated by generalized probabilistic descent as follows:

  Θ_{j+1} = Θ_j - ε_j V_j ∇l(Y_j; F, Θ, Λ)|_{Θ=Θ_j}

• In order to find the gradient ∇l, the following partial derivatives are used

  ∂l/∂Θ = (∂l/∂d)(∂d/∂Θ), with ∂l/∂d = l(1 - l)

  ∂d/∂Θ = -∂g_c/∂Θ + Σ_{n=1}^{M} [exp(g_n) / Σ_{n'=1}^{M} exp(g_{n'})] ∂g_n/∂Θ
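The misclassification measure d and the sigmoid loss l above can be sketched in a few lines; the slope parameter gamma is an assumed smoothing constant (commonly added in MCE formulations, gamma = 1 recovers the slide's form), and the log-sum-exp is computed stably:

```python
import numpy as np

def mce_loss(g_correct, g_competitors, gamma=1.0):
    """Misclassification measure d, sigmoid loss l, and dl/dd.

    g_correct: discriminant (log-likelihood) of the correct word string;
    g_competitors: array of competitor log-likelihoods (anti-discriminants).
    """
    g_comp = np.asarray(g_competitors, dtype=float)
    # d = -g_c + log( (1/M) * sum_n exp(g_n) ), computed stably.
    m = g_comp.max()
    d = -g_correct + m + np.log(np.mean(np.exp(g_comp - m)))
    l = 1.0 / (1.0 + np.exp(-gamma * d))       # smooth 0/1 loss
    dl_dd = gamma * l * (1.0 - l)              # needed for the GPD chain rule
    return d, l, dl_dd
```

A well-recognized utterance (g_correct far above all competitors) gives d ≪ 0 and a loss near 0; a misrecognized one gives d > 0 and a loss near 1.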
MCE-SVM (cont.)
• The remaining partial derivative ∂g_c/∂Θ or ∂g_n/∂Θ is formulated differently depending on the parameters to be optimized

  ∂LL(F(Y_j); Θ, Λ)/∂Θ = Σ_t Σ_s Σ_m γ_t(s, m) Σ_{sm}^{-1} (μ_{sm} - F(y_{jt})) ∂F(y_{jt})/∂Θ

• For each b_{e,k}, it follows that

  ∂F(y_{jt})/∂b_{e,k} = p(k|y_{jt}, e) if y_{jt} belongs to environment class e, and 0 otherwise

  Therefore

  ∂LL/∂b_{e,k} = Σ_t Σ_s Σ_m γ_t(s, m) p(k|y_{jt}, e) Σ_{sm}^{-1} (μ_{sm} - y_{jt} - Σ_{k'} p(k'|y_{jt}, e) b_{e,k'})

• Updating of the HMM acoustic model parameters is the same as in standard MCE training
Stochastic Matching
• The mismatch between the corrupted speech Y and the HMM acoustic models Λ_X can be reduced in two ways (Sankar et al., 1996)
  – Feature-space transformation: find an inverse distortion function F_ν that maps Y into X̂ = F_ν(Y), which matches better with the models Λ_X
  – Model-space transformation: find a model transformation function G_η that maps Λ_X to the transformed models Λ_Y, which match better with Y
• The stochastic matching algorithm operates only on the given test utterance and the given set of HMM acoustic models
  – No additional training data is required to estimate the mismatch prior to actual testing
Stochastic Matching (cont.)
• In the feature space, we need to find ν such that

  (ν̂, Ŵ) = argmax_{ν,W} p(Y|ν, W, Λ_X) p(W)

• We are interested in the problem of finding the parameters ν, so

  ν̂ = argmax_ν p(Y|ν, Λ_X)

• Let S = (s_1, ..., s_T) be the set of all possible state sequences and M = (m_1, ..., m_T) the set of all possible mixture sequences; the equation can then be written as

  ν̂ = argmax_ν Σ_S Σ_M p(Y, S, M|ν, Λ_X)

• In general, it is not easy to estimate ν directly, but for some forms of F_ν we can use the EM algorithm to estimate it
Stochastic Matching (cont.)
• Expectation (E-step)

  Q(ν'|ν) = Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) [ -(1/2)(g_{ν'}(y_t) - μ_{sm})^T Σ_{sm}^{-1} (g_{ν'}(y_t) - μ_{sm}) + log |J_{ν'}(y_t)| ]

• Maximization (M-step)

  ν̂ = argmax_{ν'} Q(ν'|ν)

• Here the inverse distortion function is x_t = g_ν(y_t) = F_ν^{-1}(y_t), so that

  p(y_t | s_t, c_t, ν, Λ_X) = N(g_ν(y_t); μ_{s_t,c_t}, Σ_{s_t,c_t}) |J_ν(y_t)|

  where |J_ν(y_t)| is the Jacobian of the transformation
• For simplification, we assume
  – The Jacobian |J_ν(y_t)| = 1
  – The covariance matrices are diagonal
Stochastic Matching (cont.)
• The auxiliary function can now be written as (with diagonal covariances and unit Jacobian)

  Q(ν'|ν) = -(1/2) Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) Σ_d (g_{ν'}(y_t)_d - μ_{sm,d})^2 / σ_{sm,d}^2 + Const

• For the estimation of the transformation parameters, we take the derivative of Q w.r.t. ν' and set it to zero:

  ∂Q/∂ν' = -Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) Σ_d [(g_{ν'}(y_t)_d - μ_{sm,d}) / σ_{sm,d}^2] ∂g_{ν'}(y_t)_d/∂ν' = 0
Stochastic Matching (cont.)
• We now consider a special case of F_ν: an additive cepstral bias, i.e., y_t = x_t + b_t, so that g_ν(y_t) = y_t - b_t, |J_ν(y_t)| = 1, and x̂_t = F_ν^{-1}(y_t) = y_t - b_t
• We may model the bias as either fixed for an utterance or varying with time (depending on the state)
  – Fixed bias:

    b_d = [Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) (y_{t,d} - μ_{s,m,d}) / σ_{s,m,d}^2] / [Σ_{t=0}^{T-1} Σ_{s=0}^{N-1} Σ_{m=0}^{M-1} γ_t(s, m) / σ_{s,m,d}^2]

  – State-dependent bias:

    b_{s,d} = [Σ_{t=0}^{T-1} Σ_{m=0}^{M-1} γ_t(s, m) (y_{t,d} - μ_{s,m,d}) / σ_{s,m,d}^2] / [Σ_{t=0}^{T-1} Σ_{m=0}^{M-1} γ_t(s, m) / σ_{s,m,d}^2]
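The fixed-bias M-step above has a closed form that can be sketched directly; this is a minimal illustration assuming the occupation probabilities are given (a full EM would recompute them from x̂ = y - b on each iteration):

```python
import numpy as np

def ml_fixed_bias(y, gamma, mu, var):
    """Closed-form M-step for a fixed utterance-level cepstral bias b
    under y_t = x_t + b (unit Jacobian, diagonal covariances).

    y: (T, D) noisy frames; gamma: (T, M) occupation probabilities of the
    HMM Gaussians; mu, var: (M, D) Gaussian means and diagonal variances.
    """
    # numerator: sum_{t,m} gamma_t(m) (y_td - mu_md) / var_md
    num = np.einsum("tm,tmd->d", gamma, (y[:, None] - mu[None]) / var[None])
    # denominator: sum_{t,m} gamma_t(m) / var_md
    den = (gamma @ (1.0 / var)).sum(axis=0)
    return num / den
```

With a single unit-variance Gaussian and uniform occupancy this reduces, as expected, to the mean offset between the noisy frames and the model mean.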
Conclusions
• In this presentation, we have surveyed several codebook-based cepstral normalization methods
  – Using either stereo data or noisy data only
  – With various optimization criteria: Maximum Likelihood (ML), Minimum Classification Error (MCE), Maximum Mutual Information (MMI)
• Further studies
  – Exploitation of other optimization criteria, e.g., the Minimum Phone/Word Error (MPE/MWE) criteria: would they have similar effectiveness in LVCSR applications?
  – Utilization of the distribution characteristics of speech feature vectors, e.g., combination with Histogram EQualization (HEQ) approaches
Conclusions (cont.)
• Comparison of various codebook-based feature compensation methods

Method              | Stereo Data | Codebook              | Optimization Criterion | Bias Estimation | Training Complexity | Test Complexity
SPLICE              | Yes         | Noisy Data            | MMSE                   | Data Driven     | Low                 | Low
SDCN                | Yes         | SNR                   | MMSE                   | Data Driven     | Low                 | Low
CDCN                | Yes         | Clean Data            | ML & MMSE              | Model Driven    | N/A                 | High
POF                 | Yes         | Characteristic Vector | MMSE                   | Data Driven     | High                | Low
ML-SVM              | No          | Noisy Data            | ML                     | Model Driven    | High                | Low
MMI-SPLICE          | No          | Noisy Data            | MMI                    | Model Driven    | High                | Low
MCE-SVM             | No          | Noisy Data            | MCE                    | Model Driven    | High                | Low
Stochastic Matching | No          | N/A                   | ML                     | Model Driven    | High                | Low
References
• SDCN, CDCN, FCDCN
  – A. Acero and R. M. Stern, "Environmental Robustness in Automatic Speech Recognition," in Proc. ICASSP 1990
  – A. Acero and R. M. Stern, "Robust Speech Recognition by Normalization of the Acoustic Space," in Proc. ICASSP 1991
• POF
  – L. Neumeyer and M. Weintraub, "Probabilistic Optimum Filtering for Robust Speech Recognition," in Proc. ICASSP 1994
• SPLICE
  – L. Deng, A. Acero, M. Plumpe and X. Huang, "Large-Vocabulary Speech Recognition under Adverse Acoustic Environments," in Proc. ICSLP 2000
  – L. Deng, A. Acero, L. Jiang, J. Droppo and X.-D. Huang, "High-Performance Robust Speech Recognition Using Stereo Training Data," in Proc. ICASSP 2002
References (cont.)
  – J. Droppo, A. Acero and L. Deng, "Evaluation of the SPLICE Algorithm on the Aurora2 Database," in Proc. EuroSpeech 2001
• Stochastic Matching
  – A. Sankar and C.-H. Lee, "A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Trans. Speech and Audio Processing, 1996
• ML-SVM
  – J. Wu, Q. Huo and D. Zhu, "An Environment Compensated Maximum Likelihood Training Approach Based on Stochastic Vector Mapping," in Proc. ICASSP 2005
  – Q. Huo and D. Zhu, "A Maximum Likelihood Training Approach to Irrelevant Variability Compensation Based on Piecewise Linear Transformations," in Proc. ICSLP 2006
  – D. Zhu and Q. Huo, "A Maximum Likelihood Approach to Unsupervised Online Adaptation of Stochastic Vector Mapping Function for Robust Speech Recognition," in Proc. ICASSP 2007
References (cont.)
• MCE-SVM
  – J. Wu and Q. Huo, "An Environment-Compensated Minimum Classification Error Training Approach Based on Stochastic Vector Mapping," IEEE Trans. Audio, Speech and Language Processing 14(6), 2006
• MMI-SPLICE
  – J. Droppo and A. Acero, "Maximum Mutual Information SPLICE Transform for Seen and Unseen Conditions," in Proc. EuroSpeech 2005
  – J. Droppo, M. Mahajan, A. Gunawardana and A. Acero, "How to Train a Discriminative Front End with Stochastic Gradient Descent and Maximum Mutual Information," in Proc. ASRU 2005
• Others
  – H. Liao and M. J. F. Gales, "Joint Uncertainty Decoding for Noise Robust Speech Recognition," in Proc. EuroSpeech 2005