Improving Utterance Verification Using a Smoothed Naïve Bayes Model
Reporter : CHEN, TZAN HWEI
Authors : Alberto Sanchis, Alfons Juan and Enrique Vidal
Reference
A. Sanchis, A. Juan, and E. Vidal, “Improving Utterance Verification Using a Smoothed Naïve Bayes Model”, ICASSP 2003.
A. Sanchis, A. Juan, and E. Vidal, “Estimating Confidence Measures for Speech Recognition Verification Using a Smoothed Naïve Bayes Model”, IbPRIA 2003.
Introduction
Current speech recognition systems are not error-free. Utterance verification (UV) aims to predict the reliability of each hypothesized word. It can be seen as a conventional pattern recognition problem in which a hypothesized word is transformed into a feature vector and then classified as either correct or incorrect.
Smoothed Naïve Bayes Model

We denote the class variable by $c$: $c = 0$ for correct and $c = 1$ for incorrect. Given a hypothesized word $w$ and a $D$-dimensional vector of (discrete) features $x$, the class posteriors can be calculated via Bayes' rule as

$$P(c \mid x, w) = \frac{P(c \mid w)\,P(x \mid c, w)}{\sum_{c'} P(c' \mid w)\,P(x \mid c', w)} \qquad (1)$$

For simplicity, we make the naive Bayes assumption that the features are mutually independent given a class-word pair:

$$P(x \mid c, w) = \prod_{d=1}^{D} P(x_d \mid c, w) \qquad (2)$$
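To make the two formulas concrete, here is a minimal Python sketch of the posterior computation; the table layout (`priors`, `likelihoods`) and the function name are illustrative assumptions, not the authors' implementation:

```python
def class_posterior(c, x, priors, likelihoods):
    """Posterior P(c | x, w) for one hypothesized word, per eqs. (1)-(2).

    priors[c]         -- P(c | w) for c in {0, 1}
    likelihoods[c][d] -- dict mapping feature value x_d to P(x_d | c, w)
    """
    def joint(ci):
        # naive Bayes: multiply per-feature likelihoods (eq. 2)
        p = priors[ci]
        for d, xd in enumerate(x):
            p *= likelihoods[ci][d][xd]
        return p
    # Bayes' rule, normalizing over both classes (eq. 1)
    return joint(c) / (joint(0) + joint(1))
```

Here `x` is the discrete feature vector of the hypothesized word; classifying the word as incorrect then amounts to thresholding `class_posterior(1, x, ...)`.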
Smoothed Naïve Bayes Model (cont)

Given $N$ training samples $\{(x_n, c_n, w_n)\}_{n=1}^{N}$, we can estimate the unknown probabilities using the conventional frequencies

$$P(c \mid w) = \frac{N(c, w)}{N(w)} \qquad (3)$$

$$P(x_d \mid c, w) = \frac{N(x_d, c, w)}{N(c, w)}, \quad d = 1, \ldots, D \qquad (4)$$

where the $N(\cdot)$ are suitably defined event counts.
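A sketch of the count-based estimation, assuming training samples are given as `(x, c, w)` triples with discrete feature tuples `x` (this data layout is an assumption for illustration, not the paper's code):

```python
from collections import Counter

def estimate_tables(samples):
    """Relative-frequency estimates (3) and (4) from (x, c, w) samples."""
    n_w, n_cw, n_xcw = Counter(), Counter(), Counter()
    for x, c, w in samples:
        n_w[w] += 1                      # N(w)
        n_cw[(c, w)] += 1                # N(c, w)
        for d, xd in enumerate(x):
            n_xcw[(d, xd, c, w)] += 1    # N(x_d, c, w)
    p_c_given_w = lambda c, w: n_cw[(c, w)] / n_w[w]                         # eq. (3)
    p_xd_given_cw = lambda d, xd, c, w: n_xcw[(d, xd, c, w)] / n_cw[(c, w)]  # eq. (4)
    return p_c_given_w, p_xd_given_cw
```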
Smoothed Naïve Bayes Model (cont)

Unfortunately, these frequencies often underestimate the true probabilities of events involving rare training data. To circumvent this problem, we have considered an absolute discounting smoothing model: a small constant is discounted from every positive count, and the gained probability mass is then distributed among the null counts.
Smoothed Naïve Bayes Model (cont)

For each word $w$, if $N(c, w) = 0$ for $c = 1$ (or $c = 0$), (3) is replaced by

$$P(c \mid w) = \begin{cases} \dfrac{N(c, w) - b}{N(w)} & \text{if } N(c, w) > 0 \\[4pt] \dfrac{b}{N(w)} & \text{if } N(c, w) = 0 \end{cases} \qquad (5)$$

Similarly, for each $(c, w)$, if $N(x_d, c, w) = 0$ for one or more possible values of $x_d$, the probability function (4) becomes

$$P(x_d \mid c, w) = \begin{cases} \dfrac{N(x_d, c, w) - b}{N(c, w)} & \text{if } N(x_d, c, w) > 0 \\[6pt] M_b \, \dfrac{P(x_d \mid c)}{\sum_{x'_d : N(x'_d, c, w) = 0} P(x'_d \mid c)} & \text{if } N(x_d, c, w) = 0 \end{cases} \qquad (6)$$

where $M_b$ denotes the gained probability mass ($b$ times the number of seen events).
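The discounting step can be sketched as follows for a single count distribution; `backoff` plays the role of the generalized distribution $P(x_d \mid c)$, and the constant `b = 0.5` is only illustrative:

```python
def discounted_dist(counts, support, backoff, b=0.5):
    """Absolute discounting in the spirit of (5)-(6): subtract b from every
    positive count and share the gained mass M_b among unseen events,
    proportionally to a backoff distribution."""
    total = sum(counts.values())
    unseen = [v for v in support if counts.get(v, 0) == 0]
    if not unseen:                                       # nothing to redistribute
        return {v: counts[v] / total for v in support}
    gained = b * (len(support) - len(unseen)) / total    # M_b: freed mass
    z = sum(backoff[v] for v in unseen)                  # renormalize over unseen
    return {v: (counts[v] - b) / total if counts.get(v, 0) > 0
            else gained * backoff[v] / z
            for v in support}
```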
Smoothed Naïve Bayes Model (cont)

$P(x_d \mid c)$ is used as a generalized distribution to divide $M_b$ among the unseen events. To prevent null estimates, it is also smoothed by absolute discounting:

$$P(x_d \mid c) = \begin{cases} \dfrac{N(x_d, c) - b}{N(c)} & \text{if } N(x_d, c) > 0 \\[6pt] M_b \, \dfrac{1}{\sum_{x'_d : N(x'_d, c) = 0} 1} & \text{if } N(x_d, c) = 0 \end{cases} \qquad (7)$$

In practice, there are many $(c, w)$ pairs for which nearly all counts $N(x_d, c, w)$ are null and, therefore, even the smoothed model (6) gives inaccurate estimates. To deal with these extreme cases, for those $(c, w)$ pairs with counts below a global threshold, the generalized model (7) is used instead of (6). Similarly, if a word $w$ does not occur in the training data, $P(c \mid w)$ is approximated by $P(c)$.

Using the models trained as explained above, in the test phase UV is performed by classifying a word as incorrect if $P(c = 1 \mid x, w)$ is greater than a certain threshold.
Predictor Features

A set of well-known predictor features has been selected:
Acoustic stability : number of times that a hypothesized word appears at the same position (as computed by Levenshtein alignment) in K alternative outputs of the speech recognizer, obtained using different weights between the acoustic and language model scores.
LMprob : language model probability.
Hypothesis density (HD) : the average number of active hypotheses within the hypothesized word boundaries.
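The acoustic stability feature can be sketched as below: align the 1-best output against each of the K alternative outputs with a standard Levenshtein DP, and count position-wise matches. This is a simplified illustration under assumed data layouts, not the recognizer's own alignment code:

```python
def align(ref, hyp):
    """Levenshtein-align hyp to ref; return, for each ref position,
    the hyp word aligned to it (None for deletions)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j-1] + (ref[i-1] != hyp[j-1]),
                          d[i-1][j] + 1, d[i][j-1] + 1)
    out, i, j = [None] * n, n, m
    while i > 0 and j > 0:   # backtrace, preferring matches/substitutions
        if d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            out[i-1] = hyp[j-1]
            i, j = i - 1, j - 1
        elif d[i][j] == d[i-1][j] + 1:
            i -= 1
        else:
            j -= 1
    return out

def acoustic_stability(best, alternatives):
    """Per word of the 1-best output: in how many of the K alternative
    outputs does the same word appear at the aligned position?"""
    counts = [0] * len(best)
    for alt in alternatives:
        for i, w in enumerate(align(best, alt)):
            if w == best[i]:
                counts[i] += 1
    return counts
```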
Predictor Features (cont)

PrecPh : the percentage of hypothesized word phones that match the phones obtained in a “phone-only” decoding.
Duration : the word duration in frames divided by its number of phones.
ACscore : the acoustic log-score of the word divided by its number of phones.
Predictor Features (cont)

Word Trellis Stability (WTS) : the motivation for the WTS comes from the following observation: a word is most probably correct if it appears, within approximately the same time interval, in the majority of the most probable hypotheses.
Predictor Features (cont)

WTS : given a sequence of feature vectors $x_1, \ldots, x_n$, the path score for a partial hypothesis $h$ is computed as

$$Sc(h) = P_{AC}(h)\, P_{LM}(h)^{\alpha} \qquad (8)$$

where $\alpha$ is the Grammar Scale Factor (GSF). Clearly, equation (8) is a function of the GSF value; that is, a partial hypothesis can be the most probable or not depending on the $\alpha$ value (see Figure 1).

Figure 1. Path scores of three competing partial hypotheses.
Predictor Features (cont)

WTS (cont): let $w$ be a word of the recognized sentence, and let $s_w$ and $e_w$ be the starting and ending frames of $w$, $0 \le s_w \le e_w \le N_t$, where $N_t$ is the number of frames of the given utterance. The WTS of $w$ is computed as

$$WTS(w) = \frac{1}{e_w - s_w + 1} \sum_{t = s_w}^{e_w} \frac{C(w, t)}{\sum_{w'} C(w', t)} \qquad (9)$$

$$C(w, t) = \sum_{t'} |H_t(w, t')| \qquad (10)$$

where $H_t(w, t')$ is a set of word-boundary partial hypotheses that are most probable at time $t$ for a certain range of GSF values. In addition, in each hypothesis of $H_t(w, t')$, the word $w$ must be active at time $t$.
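Assuming the hypothesis counts $C(w, t)$ have already been collected per frame, eq. (9) reduces to a short average; the `counts[t]` dict layout is a hypothetical encoding for illustration:

```python
def wts(word, s, e, counts):
    """WTS of eq. (9): average, over the word's frames, of the fraction of
    hypothesis counts supporting the word; counts[t] maps word -> C(w, t)."""
    total = 0.0
    for t in range(s, e + 1):
        total += counts[t].get(word, 0) / sum(counts[t].values())
    return total / (e - s + 1)
```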
Experiments

Two different corpora were used:
Traveler task : a Spanish speech corpus of person-to-person communication utterances at the reception desk of a hotel.
FUB task : an Italian speech corpus of phone calls to the front desk of a hotel.

Table 1. Traveler and FUB speech corpora
Experiments (cont)

Traveler task : 24 context-independent Spanish phonemes were modeled by conventional left-to-right HMMs. A bigram language model was estimated using the whole training text corpus of the Traveler task. The test-set Word Error Rate was 5.5%.
FUB task : the HMMs were trained using linear discriminant analysis. Decision-tree clustered generalized triphones were used as phone units. A smoothed trigram LM was estimated using the transcriptions of the training utterances. The test-set Word Error Rate was 27.5%.
Experiments (cont)

In evaluating verification systems, two measures are of interest: the True Rejection Rate (TRR) and the False Rejection Rate (FRR):

TRR = (number of words that are incorrectly recognized and are classified as incorrect) / (number of words that are incorrectly recognized)

FRR = (number of words that are correctly recognized and are classified as incorrect) / (number of words that are correctly recognized)

A Receiver Operating Characteristic (ROC) curve represents TRR against FRR for different values of the classification threshold.
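The two rates follow directly from per-word flags; the parallel-list layout below is an assumption for illustration:

```python
def trr_frr(correct_flags, rejected_flags):
    """TRR and FRR from parallel lists: correct_flags[i] is True if word i
    was correctly recognized, rejected_flags[i] is True if the verifier
    classified it as incorrect."""
    n_inc = sum(1 for c in correct_flags if not c)
    n_cor = len(correct_flags) - n_inc
    true_rej = sum(1 for c, r in zip(correct_flags, rejected_flags) if not c and r)
    false_rej = sum(1 for c, r in zip(correct_flags, rejected_flags) if c and r)
    return true_rej / n_inc, false_rej / n_cor
```

Sweeping the decision threshold and collecting the resulting (FRR, TRR) pairs traces out the ROC curve.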
Experiments (cont)

The area under a ROC curve, divided by the area of a worst-case diagonal ROC curve, provides an adequate overall estimate of the classification accuracy. We denote this area ratio as AROC.
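Given the (FRR, TRR) points of a ROC curve (with the end points (0, 0) and (1, 1) included), the AROC ratio can be computed with the trapezoid rule. This is one plausible reading of the definition above, not necessarily the authors' exact procedure:

```python
def aroc(points):
    """Area under the ROC curve divided by the 0.5 area of the worst-case
    diagonal; points are (FRR, TRR) pairs."""
    pts = sorted(points)
    area = sum((x1 - x0) * (y0 + y1) / 2.0        # trapezoid rule
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return area / 0.5
```

Under this reading, a random classifier (the diagonal) gets AROC = 1 and a perfect ROC curve gets AROC = 2.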
Experiments (cont)
Table 2. AROC values for each individual feature
Table 3. AROC values for the best feature combinations
Experiments (cont)
Fig. 1. Comparative ROC curves for single features AS and HD versus the best feature combination (Traveler corpus)
Experiments (cont)
Fig. 2. Comparative ROC curves for single features AS and HD versus the best feature combination (FUB corpus)
Conclusion
The results show that the combination of different features performs significantly better than any of the well-known single features alone.
The WTS feature has proved particularly effective in improving the classification performance.
New Features based on Multiple Word Graphs for Utterance Verification
Reporter : CHEN, TZAN HWEI
Authors : Alberto Sanchis, Alfons Juan and Enrique Vidal
Reference
A. Sanchis, A. Juan, and E. Vidal, “New Features Based on Multiple Word Graphs for Utterance Verification”, ICSLP 2004.
A. Sanchis, A. Juan, and E. Vidal, “Improving Utterance Verification Using a Smoothed Naïve Bayes Model”, ICASSP 2003.
Features based on multiple word graphs
A word graph G is a directed, acyclic, weighted graph.
[Figure: an example word graph; edges between time-aligned nodes are hypothesized Chinese words such as 台東, 台中, 不斷, 良心, and SIL.]
Features based on multiple word graphs (cont)

Given the acoustic observations $x_1^T$, the posterior probability for a specific edge (word) $[w, s, e]$, where $w$ is the hypothesized word from node $s$ to node $e$, can be computed by summing up the posterior probabilities of all word graph hypotheses containing that edge:

$$P([w, s, e] \mid x_1^T) = \frac{1}{P(x_1^T)} \sum_{h \in G :\, [w, s, e] \in h} P(h, x_1^T) \qquad (1)$$

The probability of the sequence of acoustic observations $P(x_1^T)$ can be computed by summing over all word graph hypotheses:

$$P(x_1^T) = \sum_{h \in G} P(h, x_1^T) \qquad (2)$$

These posterior probabilities can be efficiently computed with the well-known forward-backward algorithm.
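A compact forward-backward sketch over a topologically sorted edge list; the graph encoding and the multiplicative weight semantics are assumptions for illustration:

```python
from collections import defaultdict

def edge_posteriors(edges, start, goal):
    """Edge posteriors of eqs. (1)-(2) in a DAG word graph.
    edges: topologically sorted list of (w, s, e, weight); a path's score
    is the product of its edge weights."""
    fwd = defaultdict(float)
    fwd[start] = 1.0
    for w, s, e, wt in edges:            # forward pass: path mass into each node
        fwd[e] += fwd[s] * wt
    bwd = defaultdict(float)
    bwd[goal] = 1.0
    for w, s, e, wt in reversed(edges):  # backward pass: mass from node to goal
        bwd[s] += wt * bwd[e]
    px = fwd[goal]                       # eq. (2): total mass of all paths
    return {(w, s, e): fwd[s] * wt * bwd[e] / px for w, s, e, wt in edges}, px
```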
Features based on multiple word graphs (cont)
The posterior probabilities can be based on different kinds of knowledge, depending on the weights associated with the word graph edges. In this work, three features are studied:
WgAC : the weights are acoustic scores.
WgLM : the weights are language model scores.
WgTOT : the weights are the combination of acoustic and language model scores.
Features based on multiple word graphs (cont)

The algorithm proposed to compute each of these features is described in the accompanying figure.
Experiments

To compare the proposed features with alternative predictor features, a set of well-known features has been selected:
Acoustic stability : number of times that a hypothesized word appears at the same position in K alternative outputs of the speech recognizer, obtained using different weights between the acoustic and language model scores.
LMprob : Language model probability.
Hypothesis density (HD) : the average number of the active hypotheses within the hypothesized word boundaries.
Experiments (cont)

PrecPh : the percentage of hypothesized word phones that match the phones obtained in a “phone-only” decoding.
Duration : the word duration in frames divided by its number of phones.
ACscore : the acoustic log-score of the word divided by its number of phones.
Experiments (cont)

Word Trellis Stability (WTS) : given a word $w$ and its start and end times $s_w$ and $e_w$, two variants of the WTS are computed as

$$WTS_{avg}(w) = \frac{1}{e_w - s_w + 1} \sum_{t = s_w}^{e_w} \frac{C(w, t)}{\sum_{w'} C(w', t)}$$

$$WTS_{max}(w) = \max_{s_w \le t \le e_w} \frac{C(w, t)}{\sum_{w'} C(w', t)}$$
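Both variants share the same per-frame ratio; a sketch assuming per-frame count dicts `counts[t]` (a hypothetical encoding):

```python
def wts_avg_max(word, s, e, counts):
    """WTS_avg and WTS_max over frames s..e; counts[t] maps word -> C(w, t)."""
    ratios = [counts[t].get(word, 0) / sum(counts[t].values())
              for t in range(s, e + 1)]
    return sum(ratios) / len(ratios), max(ratios)
```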
Experiments (cont)

Metrics for the evaluation of the classification accuracy:
A ROC curve represents TRR against FRR for different values of the classification threshold; AROC is the area under the ROC curve divided by the area of the worst-case diagonal.
Confidence Error Rate (CER) : defined as the number of classification errors divided by the total number of recognized words. A baseline CER is obtained by assuming that all recognized words are classified as correct.
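CER and its baseline can be sketched from the same kind of per-word flags used for TRR/FRR; the list layout is assumed for illustration:

```python
def cer(correct_flags, rejected_flags):
    """Confidence Error Rate, plus the baseline CER of accepting every word."""
    errors = sum(1 for c, r in zip(correct_flags, rejected_flags)
                 if (c and r) or (not c and not r))    # false + missed rejections
    baseline = sum(1 for c in correct_flags if not c)  # errors if nothing rejected
    n = len(correct_flags)
    return errors / n, baseline / n
```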
Experiments (cont)
Table 2: AROC, CER and relative reduction in baseline CER for each individual feature.
Experiments (cont)
Table 3: AROC, CER and relative reduction in baseline CER for the best feature combinations.
Experiments (cont)
Figure 2. Comparative ROC curves for the best single feature versus the best feature combination.
Experiments (cont)
Table 4 : Comparative AROC and CER values for each feature computed using a single or multiple word graphs.