quantile based histogram equalization for noise robust speech recognition von diplom-physiker...

Quantile Based Histogram Equalization for Noise Robust Speech Recognition

by Diplom-Physiker Florian Erich Hilger

from Bonn - Bad Godesberg

Reviewer: Univ.-Prof. Dr.-Ing. Hermann Ney

Presenter: Chen Hung_Bin

December 2004


Outline

Histogram Normalization
Quantile Based Histogram Equalization
Experiment
Conclusion


Histogram Normalization

Histogram normalization is a general non-parametric method to make the cumulative distribution function (CDF) of some given data match a reference distribution.

It is used to reduce a possible mismatch between the distribution of the incoming test data and the distribution of the training data, which serves as the reference.



The mismatch between the test and the training data distributions is caused by the different acoustic conditions.

If $P$ is the CDF of the current test data and $P_{train}^{-1}$ is the inverse reference CDF of the training data, the two CDFs can be used directly to define a transformation:

$$\hat{Y} = P_{train}^{-1}(P(Y))$$



Example for the cumulative distribution functions of a clean and noisy signal.

The arrows show how an incoming noisy value is transformed based on these two cumulative distribution functions.
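The mapping through the two cumulative distribution functions can be sketched in a few lines. This is a minimal illustration that assumes empirical CDFs built by sorting the data; the function names `empirical_cdf`, `inverse_cdf`, and `histogram_normalize` are hypothetical and not taken from the thesis.

```python
import math
from bisect import bisect_right


def empirical_cdf(sorted_data, value):
    """Fraction of samples in sorted_data that are <= value."""
    return bisect_right(sorted_data, value) / len(sorted_data)


def inverse_cdf(sorted_data, p):
    """Smallest sample whose empirical CDF value is at least p."""
    n = len(sorted_data)
    index = min(n - 1, max(0, math.ceil(p * n) - 1))
    return sorted_data[index]


def histogram_normalize(test_data, train_data):
    """Map each test value through P (test CDF) and then the inverse training CDF."""
    test_sorted = sorted(test_data)
    train_sorted = sorted(train_data)
    return [
        inverse_cdf(train_sorted, empirical_cdf(test_sorted, y))
        for y in test_data
    ]
```

Each test value keeps its rank but takes on the training value found at the same cumulative probability, which is what the arrows in the figure depict.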



Two-pass method: Two separate histograms, one for silence and the other for speech, can be estimated on the training data. A first recognition pass can then be used to determine the amount of silence in the recognition utterances. Based on that percentage, the appropriate target histogram can be determined.

This requires a sufficiently large amount of data from the same recording environment or noise condition to get reliable estimates for the high-resolution histograms.



Two-pass method: It cannot be used when a real-time response of the recognizer is required, as in command and control applications or spoken dialog systems.

A straightforward solution to this problem is to reduce the number of histogram bins in order to get reliable estimates even with little data. Quantile equalization takes this approach.


Quantile Based Histogram Equalization

Quantiles are very easy to determine by just sorting the sample data set.

Cumulative distributions can be approximated using quantiles. Example: two cumulative distribution functions with four 25% quantiles, NQ = 4.



With NQ = 4, as shown in the example, about one second of data (100 time frames) is already sufficient to get a rough estimate of the cumulative distribution.

Another advantage of the quantiles: even if the data set to be considered consists of only very few samples or, in an extreme case, just one sample, the quantiles can be calculated without any special modification of the algorithm.
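Determining quantiles by sorting can be sketched as follows. The nearest-rank rule used to pick an index is an assumption made for this illustration, as are the function name `quantiles` and the choice to return the minimum as an extra element in front.

```python
def quantiles(samples, nq=4):
    """Return [Q_0, ..., Q_nq], where Q_0 is the minimum and Q_nq the maximum.

    The samples are sorted and the value nearest to each i/nq position
    is picked (nearest-rank rule, an assumption of this sketch).
    """
    ordered = sorted(samples)
    last = len(ordered) - 1
    return [ordered[round(i * last / nq)] for i in range(nq + 1)]
```

The same code path handles a window of 100 frames and a window holding a single sample; in the latter case every quantile simply equals that sample.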



The quantiles of the incoming data and the corresponding reference quantiles of the training data define a set of points that can be used to determine the parameters $\theta$ of a transformation function $T$. This function transforms the incoming data $Y$ to $\tilde{Y}$ and thus reduces the mismatch between the test and training data quantiles:

$$\tilde{Y} = T(Y, \theta)$$

Applying a transformation function to make the four training and recognition quantiles match.



Within the context of this work, the transformation is applied to the output of the Mel-scaled filter bank after applying a 10th root to reduce the dynamic range. In the following, $Y$ denotes the output vector of the filter bank and $Y_k$ correspondingly denotes its $k$-th component.

The incoming filter output values are scaled down to the interval [0, 1] by dividing by the largest quantile $Q_{k,N_Q}$. After the power function transformation is applied, the values are scaled back to the original range:

$$\tilde{Y}_k = T_k(Y_k, \theta_k) = Q_{k,N_Q} \left( \frac{Y_k}{Q_{k,N_Q}} \right)^{\gamma_k}$$



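With the pure power function

$$\tilde{Y}_k = T_k(Y_k, \theta_k) = Q_{k,N_Q} \left( \frac{Y_k}{Q_{k,N_Q}} \right)^{\gamma_k}$$

small values are scaled down even further towards zero, so small amplitude differences will be enhanced considerably if a logarithm is applied afterwards. This contradicts the desired compression of the signal to a smaller range.

A linear term is therefore added, so the transformation function that will always be used within the context of this work is:

$$\tilde{Y}_k = T_k(Y_k, \theta_k) = Q_{k,N_Q} \left( \alpha_k \left( \frac{Y_k}{Q_{k,N_Q}} \right)^{\gamma_k} + (1 - \alpha_k) \frac{Y_k}{Q_{k,N_Q}} \right)$$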
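The power function with the added linear term can be written as a small function. This is a sketch for a single filter channel; the name `transform` and the argument names are hypothetical, and `q_max` stands for the largest quantile of the channel, which is assumed to be positive.

```python
def transform(y, q_max, alpha, gamma):
    """Scale y to [0, 1], apply alpha * x**gamma + (1 - alpha) * x, scale back.

    q_max is the largest quantile of the channel (assumed > 0).
    alpha = 1 gives the pure power function, alpha = 0 leaves y unchanged.
    """
    scaled = y / q_max
    return q_max * (alpha * scaled ** gamma + (1.0 - alpha) * scaled)
```

For example, with `q_max = 4.0` and `gamma = 2.0`, the value 2.0 is mapped to 1.0 by the pure power function (`alpha = 1.0`) and stays at 2.0 when only the linear term is active (`alpha = 0.0`). The largest quantile is always mapped onto itself.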



Both transformation parameters $\theta_k = \{\alpha_k, \gamma_k\}$ are jointly optimized to minimize the squared distance between the current quantiles $Q_{k,i}$ and the training quantiles $Q_i^{train}$:

$$\theta_k = \arg\min_{\theta'_k} \sum_{i=1}^{N_Q - 1} \left( T_k(Q_{k,i}, \theta'_k) - Q_i^{train} \right)^2$$

The minimum is determined with a simple grid search over the ranges

$$\alpha_k \in [0, 1], \qquad \gamma_k \in [1, \gamma_{max}]$$

The step size for the grid search can be set to a value in the order of 0.01.
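The grid search can be sketched as two nested loops over the parameter ranges. The default `gamma_max=3.0` is a placeholder chosen for this illustration only, since the upper limit is not stated here; the function names are hypothetical as well.

```python
def transform(y, q_max, alpha, gamma):
    """Power function with a linear term, applied to a value y in [0, q_max]."""
    scaled = y / q_max
    return q_max * (alpha * scaled ** gamma + (1.0 - alpha) * scaled)


def grid_search(test_q, train_q, gamma_max=3.0, step=0.01):
    """Find (alpha, gamma) minimizing the squared quantile distance.

    test_q and train_q hold Q_1 ... Q_NQ in ascending order. The largest
    quantile is left out of the sum because the transformation maps it
    onto itself. gamma_max is an assumed placeholder value.
    """
    q_max = test_q[-1]
    best_err, best_alpha, best_gamma = float("inf"), 0.0, 1.0
    n_alpha = round(1.0 / step)
    n_gamma = round((gamma_max - 1.0) / step)
    for a in range(n_alpha + 1):
        alpha = a * step
        for g in range(n_gamma + 1):
            gamma = 1.0 + g * step
            err = sum(
                (transform(q, q_max, alpha, gamma) - t) ** 2
                for q, t in zip(test_q[:-1], train_q[:-1])
            )
            if err < best_err:
                best_err, best_alpha, best_gamma = err, alpha, gamma
    return best_alpha, best_gamma
```

With a step size of 0.01 and this placeholder range the search evaluates about 20,000 parameter pairs per filter channel, each on only NQ - 1 quantiles, so the cost stays small.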



Example: output of the 6th Mel-scaled filter over time for a sentence from the Aurora 4 test set, together with the cumulative distributions of the signals.

The grid search yields $\gamma_k = 1.4$ and $\alpha_k = 1.0$ in this case.



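Combine neighboring filter channels: a linear combination of a filter with its left and right neighbor can be used to further reduce the remaining difference between the quantiles.

$\tilde{Y}_k$ are the filter output values and $\tilde{Q}_{k,i}$ the recognition quantiles after the preceding power function transformation. The combination factors are denoted $\lambda_k$ for the left neighbors and $\rho_k$ for the right neighbors. With $\tilde{\theta}_k = \{\lambda_k, \rho_k\}$ the transformation step can be written as:

$$\hat{Y}_k = \tilde{T}_k(\tilde{Y}, \tilde{\theta}_k) = (1 - \lambda_k - \rho_k)\, \tilde{Y}_k + \lambda_k \tilde{Y}_{k-1} + \rho_k \tilde{Y}_{k+1}$$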
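The combination of a filter channel with its two neighbors can be sketched for one time frame as follows. How the lowest and highest channel are treated is not stated here; this sketch assumes that a missing neighbor is replaced by the channel itself. The names `combine_neighbors`, `lam`, and `rho` are hypothetical.

```python
def combine_neighbors(y, lam, rho):
    """Linear combination of each channel with its left and right neighbor.

    y   : filter outputs of one frame after the power function transformation
    lam : per-channel factor for the left neighbor
    rho : per-channel factor for the right neighbor
    Edge channels reuse their own value for the missing neighbor (assumption).
    """
    n = len(y)
    out = []
    for k in range(n):
        left = y[k - 1] if k > 0 else y[k]
        right = y[k + 1] if k < n - 1 else y[k]
        out.append((1.0 - lam[k] - rho[k]) * y[k] + lam[k] * left + rho[k] * right)
    return out
```

Setting all factors to zero leaves the frame unchanged, so this step can only refine the result of the preceding transformation when the factors are chosen by the same kind of quantile-distance minimization.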



Comparison of the RWTH baseline feature extraction front-end


Experiment

Car Navigation: isolated German words recorded in cars. The vocabulary consists of 2100 equally probable words. The training data was recorded in a quiet office environment.

Aurora 3 – SpeechDat Car: continuous digit strings recorded in cars. Four languages are available: Danish, Finnish, German, and Spanish.

Aurora 4 – noisy WSJ 5k: utterances read from the Wall Street Journal with various artificially added noises. The vocabulary consists of 5000 words.


Comparison of Logarithm and Root Functions

Comparison of the logarithm with different root functions on the isolated word Car Navigation database.

LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.



Comparison of logarithm and 10th root on Aurora 3 database

WM: well matched, MM: medium mismatch, HM: high mismatch, FMN: filter mean normalization



Comparison of the logarithm and root functions on the Aurora 4 noisy WSJ 16 kHz database.

LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.


Experiment - Quantile Equalization

Recognition results on the Car Navigation database with quantile equalization

LOG: logarithm, CMN: cepstral mean normalization, 10th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF(2): quantile equalization with filter combination (2 neighbors).



Comparison of quantile equalization with histogram normalization on the Car Navigation database.

QE train: applied during training and recognition. HN: speaker session wise histogram normalization, HN sil: histogram normalization dependent on the amount of silence, ROT: feature space rotation.


Comparison of QE and HN

Cumulative distribution function of the 6th filter output.

HN: after histogram normalization, QE: after quantile equalization. clean: data from test set 1, noisy: test set 12.



Recognition results on the Car Navigation database for different numbers of quantiles.

10th: root instead of logarithm, FMN: filter mean (and variance) normalization, QE: quantile equalization with NQ quantiles, QEF: quantile equalization with filter combination.



Comparison of the logarithm in the feature extraction with different root functions on the Car Navigation database.

2nd - 20th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF: quantile equalization with filter combination.


Conclusion

Replacing the logarithm in the feature extraction by a root function significantly increased the recognition performance on noisy data.

Using four quantiles (NQ = 4) can be recommended as the standard setup. It can be used on short windows as well as on complete utterances.


Spectral Entropy Feature in Full-Combination Multi-Stream for Robust ASR

Hemant Misra, Hervé Bourlard
IDIAP Research Institute, Martigny, Switzerland

Presenter: Chen Hung_Bin

INTERSPEECH 2005


Introduction

Spectral entropy features are computed from sub-bands of the spectrum in order to capture the location of the spectral peaks.

The spectral entropy features are used along with PLP features in a multi-stream framework.

A separate multi-layered perceptron (MLP) is trained for the PLP features.

The result is a 9.2% relative error reduction compared to the baseline.


Spectral entropy feature

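Entropy measures can be used to capture the “peakiness” of a distribution: a distribution with a sharp peak will have low entropy, while a flat distribution will have high entropy.

The spectrum is converted into a probability mass function (PMF) like function by normalizing it:

$$x_i = \frac{X_i}{\sum_{i=1}^{N} X_i}, \qquad i = 1, \dots, N$$

where $X_i$ is the energy of the $i$-th frequency component of the spectrum. The entropy is then

$$H = -\sum_{i=1}^{N} x_i \log_2 x_i$$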
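The normalization and entropy computation can be sketched as follows. The function name `spectral_entropy` is hypothetical, and skipping zero-energy bins (treating 0 log 0 as 0) is the usual convention, assumed here.

```python
import math


def spectral_entropy(spectrum):
    """Entropy (in bits) of a spectrum normalized to sum to one.

    spectrum holds non-negative energies with a positive total.
    Bins with zero energy contribute nothing (0 * log 0 is taken as 0).
    """
    total = sum(spectrum)
    pmf = [x / total for x in spectrum]
    return -sum(p * math.log2(p) for p in pmf if p > 0.0)
```

A flat four-bin spectrum such as `[1.0, 1.0, 1.0, 1.0]` gives the maximum of 2 bits, while a spectrum with all energy in one bin gives 0, which matches the low-entropy and high-entropy cases described above.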


Spectral entropy feature

We observe that entropy computed on the full-band spectrum can be used as an estimate for speech/silence detection.

Entropy computed from the full-band spectrum. (a) Clean speech waveform, (b) Entropy contour for clean speech, (c) Speech corrupted with factory noise at 6 dB SNR, and (d) Entropy contour for speech corrupted with factory noise at 6 dB SNR.


Multi-band/multi-resolution spectral entropy feature

The full-band spectral entropy feature can capture only the gross peakiness of the spectrum.

The best results were obtained by dividing the normalized full-band spectrum into 24 overlapping sub-bands defined by the Mel scale and computing the entropy of each sub-band.
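Computing one entropy value per sub-band can be sketched as below. Two things are assumptions of this sketch: the band edges are uniform with 50% overlap as a stand-in for the Mel-scale bands, and the sub-band values are taken from the full-band normalized spectrum without renormalizing inside each band. All function names are hypothetical.

```python
import math


def uniform_overlapping_bands(n_bins, n_bands):
    """Equal-width bands with 50% overlap (stand-in for Mel-scale bands)."""
    hop = n_bins // (n_bands + 1)
    return [(j * hop, j * hop + 2 * hop) for j in range(n_bands)]


def subband_entropies(spectrum, bands):
    """One entropy value per (start, stop) band of the normalized spectrum.

    The spectrum is normalized once over the full band; each band then
    sums -x_i * log2(x_i) over its own bins (no per-band renormalization,
    which is an assumption of this sketch).
    """
    total = sum(spectrum)
    pmf = [x / total for x in spectrum]
    return [
        -sum(p * math.log2(p) for p in pmf[start:stop] if p > 0.0)
        for start, stop in bands
    ]
```

A call such as `subband_entropies(spectrum, uniform_overlapping_bands(len(spectrum), 24))` then yields a 24-dimensional feature vector per frame.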


Entropy based full-combination multi-stream (FCMS)

Full-combination multi-stream:

All possible combinations of the two features are treated as separate streams. An MLP expert is trained for each stream. The posteriors at the output of the experts are weighted and combined. The combined posteriors thus obtained are passed to an HMM decoder.


Entropy based full-combination multi-stream

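The combined output posterior probability for the $k$-th class and the $n$-th frame is

$$\hat{P}(q_k \mid X_n, \Theta) = \sum_{i=1}^{I} w_n^i \, P(q_k \mid x_n^i, \theta_i)$$

where

$X_n = \{x_n^1, \dots, x_n^I\}$, with $x_n^i$ the feature vector of the $i$-th stream
$\Theta = \{\theta_1, \dots, \theta_I\}$: parameter set
$n$: frame number
$I$: number of streams (3 in this case)

The entropy of the $i$-th stream at frame $n$ is

$$h_n^i = -\sum_{k=1}^{K} P(q_k \mid x_n^i, \theta_i) \log_2 P(q_k \mid x_n^i, \theta_i)$$

and the average entropy over the streams is

$$\bar{h}_n = \frac{1}{I} \sum_{i=1}^{I} h_n^i$$

Streams whose entropy exceeds the average are assigned a large value:

$$\tilde{h}_n^i = \begin{cases} 10000, & h_n^i > \bar{h}_n \\ h_n^i, & h_n^i \le \bar{h}_n \end{cases}$$

The weights are the normalized inverse entropies:

$$w_n^i = \frac{1 / \tilde{h}_n^i}{\sum_{i=1}^{I} 1 / \tilde{h}_n^i}$$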
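Inverse entropy weighting with the average entropy as threshold can be sketched for a single frame as follows. The small floor `eps`, which avoids a division by zero when a stream outputs a one-hot posterior, is an addition of this sketch, and the function names are hypothetical.

```python
import math


def entropy(posteriors):
    """Entropy in bits of one stream's class posteriors."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0.0)


def combine_streams(stream_posteriors, large=10000.0, eps=1e-10):
    """Weight each stream by its inverse entropy and sum the posteriors.

    Streams with above-average entropy get the entropy value `large`,
    so their weight becomes almost zero. eps guards against a zero
    entropy (assumption of this sketch). Returns (combined, weights).
    """
    entropies = [entropy(p) for p in stream_posteriors]
    mean_h = sum(entropies) / len(entropies)
    adjusted = [large if h > mean_h else max(h, eps) for h in entropies]
    inverse = [1.0 / h for h in adjusted]
    norm = sum(inverse)
    weights = [v / norm for v in inverse]
    n_classes = len(stream_posteriors[0])
    combined = [
        sum(w * p[k] for w, p in zip(weights, stream_posteriors))
        for k in range(n_classes)
    ]
    return combined, weights
```

Because the weights sum to one and each stream's posteriors sum to one, the combined values again form a valid posterior distribution that can be passed on to the decoder.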


Spectral entropy feature in Tandem framework

The Tandem framework exploits the advantages of both HMM/ANN and HMM/GMM systems.

Multi-stream Tandem: Outputs from different experts are weighted and combined. The combined output undergoes a KL transform before being fed as features into HMM/GMM systems.


In the Tandem framework we have access to the ‘outputs before softmax’ rather than to the posteriors. Therefore we cannot use the entropy based weighting directly. To overcome this problem, we converted the ‘outputs before softmax’ into posteriors using the “softmax” nonlinearity (exponentials normalized to sum to 1):

$$P(q_k \mid x_n) = \frac{\exp(y_k \mid x_n)}{\sum_{k'} \exp(y_{k'} \mid x_n)}$$

where $q_k$ is the $k$-th class, $x_n$ is the feature vector, and $n$ is the time instant.
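Converting the outputs before softmax into posteriors can be sketched as follows. Subtracting the maximum before exponentiating is a standard numerical safeguard added in this sketch; it does not change the result. The function name `softmax` is chosen for illustration.

```python
import math


def softmax(outputs):
    """Exponentials of the pre-softmax outputs, normalized to sum to 1.

    The maximum is subtracted first to avoid overflow for large inputs;
    mathematically the result is the same.
    """
    shift = max(outputs)
    exps = [math.exp(y - shift) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting values can then be used in the entropy computation in place of the MLP posteriors.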


Experimental

The Numbers95 database of US English connected digits telephone speech is used.

There are 30 words in the database, represented by 27 phonemes.

Noise from the Noisex92 database is added at different signal-to-noise ratios (SNRs).

3,330 utterances were used for training and 2,250 utterances for testing the system.


Results

Hybrid system under different noise conditions:

WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.


Results

Tandem system under different noise conditions:

WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.


Conclusion

We demonstrated that better performance can be achieved by FCMS as compared to appending the multi-resolution entropy feature vector to the PLP feature vector.


