a novel parallel model method for noise speech recognition_正式投稿_


A Novel Parallel Model Method for Noise Speech Recognition

ZHANG Mingxin 1,2, CHEN Guoping 1,2, NI Hong 1, ZHANG Dongbin 1

1 (Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080, China)
2 (Graduate School of the Chinese Academy of Sciences, Beijing 100039, China)

Abstract ─ In noise-robust speech recognition, the parallel model combination (PMC) method is suitable for non-stationary environmental noise, and in theory the performance of the combined model can approach that of a model matched to the noisy environment, which makes PMC an important and popular research topic. In this paper, a new feature, MFCC_FWD_BWD, based on forward and backward difference dynamic parameters, is presented to make PMC simpler and more direct. Building on this feature, a novel parallel sub-state hidden Markov model (PSSHMM) is also presented for PMC. Its topology differs from that of the standard hidden Markov model (HMM): in PSSHMM each state has parallel sub-states with transitions between them. In experiments, PSSHMM with the MFCC_FWD_BWD feature achieves good results for every kind of noise and every SNR tested, and its robustness is particularly strong for non-stationary noise.

Key words ─ Parallel Model, Speech recognition, Noise robust, PMC

1 Introduction

The recognition rate of large vocabulary continuous speech recognition (LVCSR) systems has reached a fairly high level in laboratory environments. However, when such a system works in a noisy environment, its performance degrades seriously, and this degradation has greatly hindered the application of LVCSR. Robust speech recognition in noisy environments is therefore becoming increasingly important.

Current research on noise-robust speech recognition focuses mainly on three approaches. The first uses robust feature representations, such as relative spectral (RASTA) processing [1], perceptual linear prediction (PLP) and cepstral mean normalization (CMN). The second modifies the test speech features so that they better match the conditions of the pre-trained recognition model; methods based on spectral subtraction [2] and speech enhancement belong to this group. The third compensates the pre-trained model so that it matches the noisy background. Such model-based compensation schemes include parallel model combination (PMC) [3][4] and maximum likelihood linear regression (MLLR). PMC has received great attention because it is suitable for non-stationary noise and because the combined model, without retraining, can approximate the performance of a matched model trained on noisy speech from the corresponding environment. PMC is the subject of this paper.

The rest of the paper is organized as follows. The basic PMC method is introduced first. Then the new feature MFCC_FWD_BWD is described, whose dynamic parameters are based on forward and backward differences. Next, PSSHMM is presented as the noise-robust recognition model for PMC, and the model parameter combination algorithm is explained. Finally, the evaluation and conclusion are given.


2 Basic PMC Method

When an LVCSR system works in an additive noise environment, the matched model is the model retrained on noisy speech, either sampled in that environment or obtained by adding the noise to clean speech in the time-domain waveform. This model gives the best performance in that noise environment. However, retraining the matched model online for every environment is impractical because of its high computational cost. The PMC method does not need retraining. It assumes that the clean speech model contains enough information about the speech features and the noise model contains enough information about the noise features, so the speech and noise models can be combined to match the noisy background [4].

In this paper, the speech model is the standard HMM. The noise model is a full-state-transition model with a single Gaussian component per state, obtained by clustering noise feature vectors. Unlike the HMM speech model, it has no starting or ending state.

To describe the effect of the noise on the clean speech, the following assumptions are required [3]:

1) Speech and noise are independent.
2) Speech and noise are additive in the time domain.
3) A model with a single Gaussian component or multiple Gaussian components contains sufficient information to represent the distribution of the observation feature vectors in the cepstral or log-spectral domain.
4) The frame/state alignment used to generate the speech models from the clean speech data is not altered by the addition of noise.

Under these assumptions, speech and noise are treated as additive in the power spectral domain [5]. Since the features used in recognition are usually in the cepstral domain, the model parameters of speech and noise must be transformed to the spectral domain. After model combination, the combined model is transformed back to the cepstral domain. The procedure is shown in Fig.1.

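[Figure: the clean speech model and the noise model are each mapped out of the cepstral domain by $\mathbf{C}^{-1}$ and exp{}, combined by PMC, and the result is mapped back by log{} and $\mathbf{C}$.]

Fig.1 Parallel model combination procedure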
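As a small numerical illustration of why the combination is carried out in the spectral domain, the sketch below adds a speech mean and a noise mean after exponentiation and takes the logarithm again. The channel values are hypothetical and chosen only to show the limiting behaviour; this is not the authors' implementation.

```python
import math

def log_add(speech_log_mean, noise_log_mean):
    """Combine two log-spectral mean vectors by adding them in the linear spectral domain."""
    return [math.log(math.exp(s) + math.exp(n))
            for s, n in zip(speech_log_mean, noise_log_mean)]

# Hypothetical log-spectral means for three filter-bank channels:
# speech dominates channel 0, both are equal in channel 1, noise dominates channel 2.
speech = [5.0, 2.0, 0.0]
noise = [0.0, 2.0, 5.0]
combined = log_add(speech, noise)
```

Where one source dominates a channel, the combined mean stays close to the dominant one; where both are equal, it rises by log 2. This is the behaviour a matched model would be expected to show.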

3 Feature Vector Construction Method for PMC

The PMC method requires that the feature vector used in recognition can regenerate the raw parameter vectors that are combined in the power spectral domain [3]. This requirement is especially important for the dynamic parameters in the feature vector. For this reason, we present a novel feature vector, MFCC_FWD_BWD. Its static part is the same as that of MFCC_D_A, while its dynamic part uses forward and backward difference parameters in place of the difference and acceleration parameters. The MFCC_FWD_BWD feature is constructed as follows:

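$$\mathbf{O}^{c}_{NFVec}(\tau) = \left[\, \mathbf{O}^{c}(\tau)^{T} \;\; \Delta\mathbf{O}^{c}_{Fw}(\tau)^{T} \;\; \Delta\mathbf{O}^{c}_{Bw}(\tau)^{T} \,\right]^{T} \qquad (1)$$

where $\Delta\mathbf{O}^{c}_{Fw}(\tau) = \mathbf{O}^{c}(\tau + w_{FB}) - \mathbf{O}^{c}(\tau)$ is the forward difference part and $\Delta\mathbf{O}^{c}_{Bw}(\tau) = \mathbf{O}^{c}(\tau) - \mathbf{O}^{c}(\tau - w_{FB})$ is the backward difference part. In matrix form,

$$\mathbf{O}^{c}_{NFVec}(\tau) = \begin{bmatrix} \mathbf{O}^{c}(\tau) \\ \Delta\mathbf{O}^{c}_{Fw}(\tau) \\ \Delta\mathbf{O}^{c}_{Bw}(\tau) \end{bmatrix} = \begin{bmatrix} \mathbf{0} & \mathbf{I} & \mathbf{0} \\ \mathbf{0} & -\mathbf{I} & \mathbf{I} \\ -\mathbf{I} & \mathbf{I} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{O}^{c}(\tau - w_{FB}) \\ \mathbf{O}^{c}(\tau) \\ \mathbf{O}^{c}(\tau + w_{FB}) \end{bmatrix} = \mathbf{A}_{N}\, \mathbf{O}^{c}_{NTVec}. \qquad (2)$$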

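However, MFCC_D_A is constructed as

$$\mathbf{O}^{c}_{FVec}(\tau) = \begin{bmatrix} \mathbf{O}^{c}(\tau) \\ \Delta\mathbf{O}^{c}(\tau) \\ \Delta^{2}\mathbf{O}^{c}(\tau) \end{bmatrix} = \begin{bmatrix} \mathbf{0} & \mathbf{0} & \mathbf{I} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & -\mathbf{I} & \mathbf{0} & \mathbf{I} & \mathbf{0} \\ \mathbf{I} & \mathbf{0} & -2\mathbf{I} & \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{O}^{c}(\tau - 2w) \\ \mathbf{O}^{c}(\tau - w) \\ \mathbf{O}^{c}(\tau) \\ \mathbf{O}^{c}(\tau + w) \\ \mathbf{O}^{c}(\tau + 2w) \end{bmatrix} = \mathbf{A}\, \mathbf{O}^{c}_{TVec}. \qquad (3)$$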

Comparing formulas (2) and (3), we can see that the MFCC_FWD_BWD construction matrix $\mathbf{A}_{N}$ is square and invertible, so the static time series $\mathbf{O}^{c}_{NTVec}$ can be recovered from the feature vector $\mathbf{O}^{c}_{NFVec}(\tau)$. In contrast, the MFCC_D_A construction matrix $\mathbf{A}$ maps five static frames onto three blocks and is not invertible, so $\mathbf{O}^{c}_{TVec}$ cannot be obtained from $\mathbf{O}^{c}_{FVec}(\tau)$. An invertible construction matrix is necessary for PMC.
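The invertibility argument can be checked on a single cepstral coefficient, because the blocks of both construction matrices are identity matrices and act on each coefficient independently. The following is a minimal sketch under that simplification; the function names and sample values are illustrative and do not come from the paper.

```python
def fwd_bwd_features(prev, cur, nxt):
    """Apply A_N of formula (2): returns (static, forward diff, backward diff)."""
    return cur, nxt - cur, cur - prev

def recover_static_series(static, fwd, bwd):
    """Apply the inverse of A_N: returns (O(tau-w), O(tau), O(tau+w))."""
    return static - bwd, static, static + fwd

def delta_accel_features(o_m2, o_m1, o_0, o_p1, o_p2):
    """Apply A of formula (3): returns (static, delta, acceleration)."""
    return o_0, o_p1 - o_m1, o_p2 - 2 * o_0 + o_m2

# Round trip through A_N and its inverse recovers the three static frames.
round_trip = recover_static_series(*fwd_bwd_features(1.0, 3.0, 2.0))

# Two different five-frame windows that give the same MFCC_D_A vector,
# so the static frames cannot be recovered from it.
window_a = (1.0, 0.0, 3.0, 2.0, 5.0)
window_b = (2.0, 1.0, 3.0, 3.0, 4.0)
same_feature = delta_accel_features(*window_a) == delta_accel_features(*window_b)
```

The round trip is exact for MFCC_FWD_BWD, whereas the two distinct windows collapse onto one MFCC_D_A vector, which is what non-invertibility of $\mathbf{A}$ means in practice.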

4 Parallel Sub-State HMM for PMC

4.1 Speech model and noise model used in PMC

In the system of this paper, the clean speech model is the commonly used standard HMM, a finite state machine that is well suited to describing the speech generation process. The topology of the HMM is shown in Fig.2. An HMM is characterized by the following three parameters:

$N_{S}$, the number of states in the model;

$\mathbf{T}_{S}$, the state-transition probability matrix;

$b_{j}(\mathbf{o}_{t})$, $j = 2, \ldots, N_{S} - 1$, the output observation probability distribution.

In the model, both the starting state and the ending state are non-emitting states, which are used for connecting HMMs. Here $b_{j}(\mathbf{o}_{t})$ is usually described by one or more Gaussian components, i.e. $b_{j}(\mathbf{o}_{t}) = \sum_{m=1}^{M_{s}} c_{jm} N(\boldsymbol{\mu}_{jm}, \mathbf{S}_{jm})$, where $M_{s} \geq 1$ (for convenience of explanation, we let $M_{s} = 1$ in the following, so that $b_{j}(\mathbf{o}_{t}) = N(\boldsymbol{\mu}_{j}, \mathbf{S}_{j})$).

The noise model in this paper is defined as a full-state-transition model, whose topology is shown in Fig.3. It is composed of several states, whose parameters are obtained by clustering background noise features. The noise model is also characterized by three parameters:

$N_{Noi}$, the number of states in the noise model;

$\mathbf{T}_{Noi}$, the full-state-transition probability matrix of the noise model;

$b_{k}(\mathbf{o}_{t})$, $k = 1, \ldots, N_{Noi}$, the output observation probability distribution;

where $b_{k}(\mathbf{o}_{t})$ is described by a single Gaussian component, i.e. $b_{k}(\mathbf{o}_{t}) = N(\tilde{\boldsymbol{\mu}}_{k}, \tilde{\mathbf{S}}_{k})$.
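The paper does not say which clustering algorithm is used or how the transition matrix is estimated. As one possible sketch, assuming each noise frame has already been assigned a cluster label by some clustering step, the state Gaussians and $\mathbf{T}_{Noi}$ could be estimated by simple counting. Diagonal variances and the function name are assumptions made for illustration.

```python
def estimate_noise_model(frames, labels, n_states):
    """Estimate per-state means, diagonal variances and a transition matrix.

    frames: list of feature vectors; labels: cluster index of each frame.
    Assumes every state owns at least one frame and has an outgoing transition.
    """
    dim = len(frames[0])
    counts = [0] * n_states
    sums = [[0.0] * dim for _ in range(n_states)]
    sq_sums = [[0.0] * dim for _ in range(n_states)]
    trans = [[0.0] * n_states for _ in range(n_states)]
    for t, (frame, k) in enumerate(zip(frames, labels)):
        counts[k] += 1
        for d, x in enumerate(frame):
            sums[k][d] += x
            sq_sums[k][d] += x * x
        if t > 0:
            # Count the transition from the previous frame's state to this one.
            trans[labels[t - 1]][k] += 1.0
    means = [[s / counts[k] for s in sums[k]] for k in range(n_states)]
    variances = [[sq / counts[k] - m * m for sq, m in zip(sq_sums[k], means[k])]
                 for k in range(n_states)]
    t_noi = [[c / sum(row) for c in row] for row in trans]
    return means, variances, t_noi

# Toy one-dimensional example with two noise states.
toy_frames = [[0.0], [2.0], [10.0], [12.0], [1.0]]
toy_labels = [0, 0, 1, 1, 0]
toy_means, toy_vars, toy_t_noi = estimate_noise_model(toy_frames, toy_labels, 2)
```

Because every state may follow every other state, the estimated matrix has no structurally forbidden entries, matching the full-state-transition topology described above.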

Fig.2 HMM topology

Fig.3 Noise model topology

4.2 PSSHMM

PMC combines the clean speech model and the noise model to obtain the matched model. This subsection describes the PSSHMM used for the combined matched model. The topology of PSSHMM is shown in Fig.4. PSSHMM is a complex HMM in which each state has several parallel sub-states. These sub-states are generated by combining the corresponding clean speech state with each noise state. PSSHMM has two kinds of transition. One is the transition of the global HMM, as shown in Fig.5; the other is the transition among the sub-states, which obeys the noise state transition matrix. In the time-synchronous expanded state series shown in Fig.6, the sub-states are arranged in parallel and only one sub-state can emit an observation at each time point. Sub-state transitions take place between the states at consecutive time points.

Fig.4 PSSHMM topology

Fig.5 PSSHMM is a complex HMM

Fig.6 PSSHMM time synchronous expanded state series (time points $t_{1}$ to $t_{4}$)

The PSSHMM can be described using the following five parameters:

$N_{pm}$, the number of states in the model;

$\mathbf{T}_{pm}$, the state-transition probability matrix;

$D_{sub}$, the number of sub-states in each model state;

$\mathbf{T}_{sub}$, the sub-state-transition probability matrix;

$b_{jk}(\mathbf{o}_{t})$, $j = 2, \ldots, N_{pm} - 1$, $k = 1, \ldots, D_{sub}$, the output observation probability distribution of each sub-state.

Here $b_{jk}(\mathbf{o}_{t})$ is described by a Gaussian component, i.e. $b_{jk}(\mathbf{o}_{t}) = N(\hat{\boldsymbol{\mu}}_{jk}, \hat{\mathbf{S}}_{jk})$, where $\hat{\boldsymbol{\mu}}_{jk}$ and $\hat{\mathbf{S}}_{jk}$ are obtained by the parallel model combination algorithm discussed in the following section. It should be emphasized that the output probability of a parallel model state is related to that of its sub-states. The relation is $b_{j}(\mathbf{o}_{t}) = \max_{k} \{ b_{jk}(\mathbf{o}_{t}) \cdot P(k \mid k_{t-1}) \}$, which directly affects recognition decoding, where $k_{t-1}$ is the previous optimal sub-state label and $P(k \mid k_{t-1})$ is the sub-state-transition probability, i.e. $P(k \mid k_{t-1}) = \mathbf{T}_{sub}[k_{t-1}, k]$.
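The state output rule above can be sketched in the log domain as follows. Diagonal covariances, the row-equals-previous-sub-state indexing of the transition table, and all names and numbers are assumptions made for illustration; the paper's decoder is built inside HTK and is not reproduced here.

```python
import math

def log_gaussian_diag(obs, mean, var):
    """Log density of a diagonal-covariance Gaussian at obs."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (o - m) ** 2 / v
                      for o, m, v in zip(obs, mean, var))

def state_log_output(obs, sub_means, sub_vars, log_t_sub, prev_sub):
    """Return (log b_j(o_t), best sub-state k) for one PSSHMM state.

    log_t_sub[i][k] holds log P(k | i); prev_sub is the previous optimal
    sub-state label k_{t-1}.
    """
    scores = [log_gaussian_diag(obs, m, v) + log_t_sub[prev_sub][k]
              for k, (m, v) in enumerate(zip(sub_means, sub_vars))]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best

# Toy state with two sub-states and uniform sub-state transitions.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
score, k_best = state_log_output([0.0], [[0.0], [3.0]], [[1.0], [1.0]], uniform, 0)
```

The returned sub-state label would be carried forward as $k_{t-1}$ for the next frame, so only one sub-state emits at each time point.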

5 Parallel Model Parameter Combination Algorithm

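Using the MFCC_FWD_BWD feature, we combine the clean speech model and the noise model to obtain the parallel model that matches the noisy environment. In this paper the log-add algorithm [4] is used for model parameter combination. The log-add algorithm combines only the means and does not combine the variances. It is assumed that a clean speech model state is described by the Gaussian component $N([\boldsymbol{\mu}^{c} \;\; \Delta\boldsymbol{\mu}^{c}_{Fw} \;\; \Delta\boldsymbol{\mu}^{c}_{Bw}], [\mathbf{S}^{c} \;\; \Delta\mathbf{S}^{c}_{Fw} \;\; \Delta\mathbf{S}^{c}_{Bw}])$ and a noise model state is described by the Gaussian component $N([\tilde{\boldsymbol{\mu}}^{c} \;\; \Delta\tilde{\boldsymbol{\mu}}^{c}_{Fw} \;\; \Delta\tilde{\boldsymbol{\mu}}^{c}_{Bw}], [\tilde{\mathbf{S}}^{c} \;\; \Delta\tilde{\mathbf{S}}^{c}_{Fw} \;\; \Delta\tilde{\mathbf{S}}^{c}_{Bw}])$. The parameter combination steps are: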

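1) Transform the clean speech model parameters from MFCC_FWD_BWD to static time series parameters in the cepstral domain, i.e.

$$\left[ (\boldsymbol{\mu}^{c}_{\tau-w})^{T} \;\; (\boldsymbol{\mu}^{c}_{\tau})^{T} \;\; (\boldsymbol{\mu}^{c}_{\tau+w})^{T} \right]^{T} = \mathbf{A}_{N}^{-1} \left[ (\boldsymbol{\mu}^{c})^{T} \;\; (\Delta\boldsymbol{\mu}^{c}_{Fw})^{T} \;\; (\Delta\boldsymbol{\mu}^{c}_{Bw})^{T} \right]^{T} \qquad (4)$$

where the offset $w$ is the $w_{FB}$ of formula (2). The same is done for the noise model:

$$\left[ (\tilde{\boldsymbol{\mu}}^{c}_{\tau-w})^{T} \;\; (\tilde{\boldsymbol{\mu}}^{c}_{\tau})^{T} \;\; (\tilde{\boldsymbol{\mu}}^{c}_{\tau+w})^{T} \right]^{T} = \mathbf{A}_{N}^{-1} \left[ (\tilde{\boldsymbol{\mu}}^{c})^{T} \;\; (\Delta\tilde{\boldsymbol{\mu}}^{c}_{Fw})^{T} \;\; (\Delta\tilde{\boldsymbol{\mu}}^{c}_{Bw})^{T} \right]^{T} \qquad (5)$$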

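2) Using the IDCT, transform the static time series parameters from the cepstral domain to the log-spectral domain, i.e.

$$\left[ (\boldsymbol{\mu}^{l}_{\tau-w})^{T} \;\; (\boldsymbol{\mu}^{l}_{\tau})^{T} \;\; (\boldsymbol{\mu}^{l}_{\tau+w})^{T} \right]^{T} = \left[ (\mathbf{C}^{-1}\boldsymbol{\mu}^{c}_{\tau-w})^{T} \;\; (\mathbf{C}^{-1}\boldsymbol{\mu}^{c}_{\tau})^{T} \;\; (\mathbf{C}^{-1}\boldsymbol{\mu}^{c}_{\tau+w})^{T} \right]^{T} \qquad (6)$$

$$\left[ (\tilde{\boldsymbol{\mu}}^{l}_{\tau-w})^{T} \;\; (\tilde{\boldsymbol{\mu}}^{l}_{\tau})^{T} \;\; (\tilde{\boldsymbol{\mu}}^{l}_{\tau+w})^{T} \right]^{T} = \left[ (\mathbf{C}^{-1}\tilde{\boldsymbol{\mu}}^{c}_{\tau-w})^{T} \;\; (\mathbf{C}^{-1}\tilde{\boldsymbol{\mu}}^{c}_{\tau})^{T} \;\; (\mathbf{C}^{-1}\tilde{\boldsymbol{\mu}}^{c}_{\tau+w})^{T} \right]^{T} \qquad (7)$$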

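3) Combine the parameters of the clean speech model and the noise model using the log-add algorithm, i.e.

$$\hat{\boldsymbol{\mu}}^{l}_{\tau} = \log\{\exp\{\boldsymbol{\mu}^{l}_{\tau}\} + \exp\{\tilde{\boldsymbol{\mu}}^{l}_{\tau}\}\} \qquad (8)$$

$$\hat{\boldsymbol{\mu}}^{l}_{\tau+w} = \log\{\exp\{\boldsymbol{\mu}^{l}_{\tau+w}\} + \exp\{\tilde{\boldsymbol{\mu}}^{l}_{\tau+w}\}\} \qquad (9)$$

$$\hat{\boldsymbol{\mu}}^{l}_{\tau-w} = \log\{\exp\{\boldsymbol{\mu}^{l}_{\tau-w}\} + \exp\{\tilde{\boldsymbol{\mu}}^{l}_{\tau-w}\}\} \qquad (10)$$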

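4) Using the DCT, transform the combined model parameters from the log-spectral domain to the cepstral domain, i.e.

$$\left[ (\hat{\boldsymbol{\mu}}^{c}_{\tau-w})^{T} \;\; (\hat{\boldsymbol{\mu}}^{c}_{\tau})^{T} \;\; (\hat{\boldsymbol{\mu}}^{c}_{\tau+w})^{T} \right]^{T} = \left[ (\mathbf{C}\hat{\boldsymbol{\mu}}^{l}_{\tau-w})^{T} \;\; (\mathbf{C}\hat{\boldsymbol{\mu}}^{l}_{\tau})^{T} \;\; (\mathbf{C}\hat{\boldsymbol{\mu}}^{l}_{\tau+w})^{T} \right]^{T} \qquad (11)$$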

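5) Transform the static time series combined model parameters to MFCC_FWD_BWD, i.e.

$$\left[ (\hat{\boldsymbol{\mu}}^{c})^{T} \;\; (\Delta\hat{\boldsymbol{\mu}}^{c}_{Fw})^{T} \;\; (\Delta\hat{\boldsymbol{\mu}}^{c}_{Bw})^{T} \right]^{T} = \mathbf{A}_{N} \left[ (\hat{\boldsymbol{\mu}}^{c}_{\tau-w})^{T} \;\; (\hat{\boldsymbol{\mu}}^{c}_{\tau})^{T} \;\; (\hat{\boldsymbol{\mu}}^{c}_{\tau+w})^{T} \right]^{T} \qquad (12)$$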

Thus, the sub-state output observation probability components of the combined parallel model states, $b_{jk}(\mathbf{o}_{t})$, $j = 2, \ldots, N_{pm} - 1$, $k = 1, \ldots, D_{sub}$, can be calculated by combining each clean speech model state $b_{j}(\mathbf{o}_{t})$, $j = 1, \ldots, N_{S}$, with each noise model state $b_{k}(\mathbf{o}_{t})$, $k = 1, \ldots, N_{Noi}$, using the above log-add algorithm.
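The five steps can be sketched end to end for one speech state and one noise state. This is a minimal illustration, not the authors' HTK code: it assumes a square orthonormal DCT-II matrix for $\mathbf{C}$ (so that $\mathbf{C}^{-1}$ is its transpose, with as many cepstral coefficients as filter-bank channels), uses the closed-form inverse of $\mathbf{A}_{N}$, and combines means only, as the log-add algorithm prescribes. All function names are hypothetical.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); its inverse is its transpose."""
    rows = []
    for i in range(n):
        scale = math.sqrt((1.0 if i == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * i * (2 * j + 1) / (2 * n))
                     for j in range(n)])
    return rows

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def to_static_series(static, fwd, bwd):
    """Step 1: apply the inverse of A_N; returns means at tau-w, tau, tau+w."""
    return ([s - b for s, b in zip(static, bwd)],
            list(static),
            [s + f for s, f in zip(static, fwd)])

def to_fwd_bwd(prev, cur, nxt):
    """Step 5: apply A_N; returns static, forward and backward means."""
    return (list(cur),
            [n - c for n, c in zip(nxt, cur)],
            [c - p for c, p in zip(cur, prev)])

def combine_means(speech, noise):
    """Log-add combination of two (static, forward, backward) mean triples."""
    c = dct_matrix(len(speech[0]))
    c_inv = [list(col) for col in zip(*c)]
    combined_series = []
    for s_cep, n_cep in zip(to_static_series(*speech), to_static_series(*noise)):
        s_log = matvec(c_inv, s_cep)                      # step 2, speech
        n_log = matvec(c_inv, n_cep)                      # step 2, noise
        mix = [math.log(math.exp(a) + math.exp(b))        # step 3, log-add
               for a, b in zip(s_log, n_log)]
        combined_series.append(matvec(c, mix))            # step 4
    return to_fwd_bwd(*combined_series)

# Toy two-dimensional example: combining a state with an identical "noise" state.
toy_state = ([1.0, 0.5], [0.2, -0.1], [0.1, 0.3])
toy_static, toy_fwd, toy_bwd = combine_means(toy_state, toy_state)
```

In this toy case every log-spectral mean rises by log 2, so the static cepstral mean shifts while the forward and backward differences are unchanged. Building all sub-states of a PSSHMM state would repeat `combine_means` once per noise state.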

6 Evaluation

Our experiments are based on the HTK 3.0 [6] speech recognition platform, which has been modified for PMC. The acoustic models are context-dependent models with a single Gaussian component. The clean speech database is the Mandarin 863 Speech Database. We select 9515 sentences from 16 female speakers as the training set and 400 sentences from 4 speakers outside the training set as the test set. The noise database is NOISEX92, from which we select four representative kinds of noise: babble, f16, machinegun and white. Each kind of noise is added to the clean speech at a fixed ratio to form noisy test speech at four SNRs: 30 dB, 20 dB, 10 dB and 0 dB. All feature vectors have 39 dimensions.
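The paper does not give the exact mixing procedure. One common way to reach a target SNR, shown here as a sketch under the assumption that SNR is defined on average power over the whole utterance and that both signals have the same length, is to scale the noise before adding it sample by sample. The function name and sample values are illustrative.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then add."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy signals; real use would read waveform samples from the two databases.
toy_speech = [1.0, -1.0, 1.0, -1.0]
toy_noise = [0.5, 0.5, -0.5, -0.5]
toy_mixed = mix_at_snr(toy_speech, toy_noise, 20.0)
```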


In the experiment, we first test the recognition performance on clean speech using the MFCC_D_A and MFCC_FWD_BWD features. As shown in Table 1, the word accuracies of the two features are almost the same, differing by less than 0.5% absolute. The new feature MFCC_FWD_BWD therefore simplifies the parameter combination procedure at the cost of only a slight decrease in recognition rate.

Table 1. Accuracy comparison on clean speech (MFCC_D_A vs MFCC_FWD_BWD)

Feature kind     Acc (%)
MFCC_D_A         75.08
MFCC_FWD_BWD     74.65

In the noise-robust experiments, the baseline system uses the MFCC_D_A feature and the standard HMM, and the PMC test system uses the MFCC_FWD_BWD feature and PSSHMM. In PMC, the noise model has 3 states. For comparison, we also test the spectral subtraction method, which is commonly used in noise-robust speech recognition. The results are given in Tables 2-4.

Table 2. Baseline system performance

Baseline system (MFCC_D_A & HMM), Acc (%)

SNR      babble   f16     machinegun   white   Avg.
30 dB    66.52    69.10   68.46        55.77   64.96
20 dB    47.80    51.34   61.47        18.92   44.88
10 dB     5.70    10.41   48.34         5.34   17.45
0 dB      0.36     0.00   33.46         0.00    8.46
Avg.     30.10    32.71   52.93        20.01   33.94

Table 3. Spectral subtraction method performance

Spectral subtraction (MFCC_D_A & HMM), Acc (%)

SNR      babble   f16     machinegun   white   Avg.
30 dB    65.67    69.10   66.70        65.83   66.83
20 dB    51.62    57.42   58.47        34.97   50.62
10 dB    13.69    24.44   37.31         5.46   20.23
0 dB      0.00     1.67   11.62         2.11    3.85
Avg.     32.75    38.16   43.53        27.09   35.38

Table 4. PMC method performance

PMC (MFCC_FWD_BWD & PSSHMM), Acc (%)

SNR      babble   f16     machinegun   white   Avg.
30 dB    76.08    74.63   75.08        67.87   73.42
20 dB    69.65    63.78   73.67        50.05   64.29
10 dB    46.65    31.63   71.47        21.66   42.85
0 dB     12.46     5.80   65.63         5.68   22.39
Avg.     51.21    43.96   71.46        36.32   50.74

Table 2 shows that the recognition rate of the baseline system, which has no robust processing, decreases sharply as the SNR falls. With spectral subtraction feature processing (Table 3), the average recognition rate increases by 4.2% relative to the baseline system. Table 4 shows the performance of the PMC method using the MFCC_FWD_BWD feature and PSSHMM. This method achieves excellent noise robustness: its average recognition rate is far higher than that of the baseline system and the spectral subtraction method, with relative increases of 49.5% and 43.4% respectively.

Comparing the recognition rates for machinegun noise in Tables 2-4, we find that the spectral subtraction method fails to improve robustness: its average recognition rate decreases by 17.8% relative to the baseline system. The PMC method, however, remains robust and achieves a relative increase of 35.0% over the baseline system.
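The relative figures quoted above follow from the average accuracies in Tables 2-4. The short check below recomputes them; the dictionary layout and function name are only for illustration.

```python
def relative_change(new, ref):
    """Relative change of accuracy `new` with respect to `ref`, in percent."""
    return 100.0 * (new - ref) / ref

# Overall averages and machinegun column averages taken from Tables 2-4.
overall = {"baseline": 33.94, "spectral_subtraction": 35.38, "pmc": 50.74}
machinegun = {"baseline": 52.93, "spectral_subtraction": 43.53, "pmc": 71.46}

ss_vs_base = relative_change(overall["spectral_subtraction"], overall["baseline"])
pmc_vs_base = relative_change(overall["pmc"], overall["baseline"])
pmc_vs_ss = relative_change(overall["pmc"], overall["spectral_subtraction"])
mg_ss_vs_base = relative_change(machinegun["spectral_subtraction"], machinegun["baseline"])
mg_pmc_vs_base = relative_change(machinegun["pmc"], machinegun["baseline"])
```

Rounded to one decimal place, these reproduce the 4.2%, 49.5%, 43.4%, 17.8% (as a decrease) and 35.0% figures in the text.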

7 Conclusion

In this paper, the PMC method using the presented MFCC_FWD_BWD feature and PSSHMM achieves excellent noise-robust performance for each kind of noise and each SNR level. Its recognition rate improves by 49.5% relative to the baseline system and by 43.4% relative to the spectral subtraction method. For machinegun noise in particular, spectral subtraction brings no improvement, whereas PMC achieves a 35.0% increase relative to the baseline system. Our planned work is to further study the model parameter combination algorithm to improve recognition performance.

References

[1] B.E.D. Kingsbury, N. Morgan, Recognizing Reverberant Speech with RASTA-PLP, ICASSP-97, pp. 1259-1262, Munich, Germany, 1997.

[2] Randy Gomez, Akinobu Lee, Hiroshi Saruwatari, et al., Robust Speech Recognition with Spectral Subtraction in low SNR, ICSLP-04, pp. 2077-2080, Jeju Island, Korea, 2004.

[3] Mark J. F. Gales, Steve Young, Robust Continuous Speech Recognition Using Parallel Model Combination, IEEE Trans. Speech and Audio Processing, vol. 4, pp. 352-359, 1996.

[4] Jeih-weih Huang, Jia-lin Shen, Lin-shan Lee, New Approach for Domain Transformation and Parameter Combination for Improved Accuracy in Parallel Model Combination (PMC) Techniques, IEEE Trans. Speech and Audio Processing, vol. 9, pp. 842-855, 2001.

[5] Febe de Wet, Jhan de Veth, Loe Boves, et al., Additive Background Noise as a source of non-linear mismatch in the cepstral and log-energy domain, Computer Speech and Language, vol. 19, pp. 31-54, 2005.

[6] Steve Young, Dan Kershaw, Julian Odell, et al., The HTK book (for HTK v3.0), Cambridge University, 2000.
