Improving Utterance Verification Using a Smoothed Naïve Bayes Model
Reporter : CHEN, TZAN HWEI
Authors : Alberto Sanchis, Alfons Juan and Enrique Vidal
Reference
A. Sanchis, A. Juan, and E. Vidal, “Improving Utterance Verification Using a Smoothed Naïve Bayes Model”, ICASSP 2003.
A. Sanchis, A. Juan, and E. Vidal, “Estimating Confidence Measures for Speech Recognition Verification Using a Smoothed Naïve Bayes Model”, IbPRIA 2003.
Introduction
Current speech recognition systems are not error-free. Utterance verification (UV) aims to predict the reliability of each hypothesized word. It can be seen as a conventional pattern recognition problem in which a hypothesized word is transformed into a feature vector and then classified as either correct or incorrect.
Smoothed Naïve Bayes Model

We denote the class variable by $c$: $c = 0$ for correct and $c = 1$ for incorrect. Given a hypothesized word $w$ and a $D$-dimensional vector of (discrete) features $x$, the class posteriors can be calculated via Bayes' rule as

$$P(c \mid x, w) = \frac{P(c \mid w)\,P(x \mid c, w)}{\sum_{c'} P(c' \mid w)\,P(x \mid c', w)} \qquad (1)$$

For simplicity, we make the naive Bayes assumption that the features are mutually independent given a class-word pair:

$$P(x \mid c, w) = \prod_{d=1}^{D} P(x_d \mid c, w) \qquad (2)$$
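To make the two formulas concrete, here is a minimal Python sketch of the posterior computation; the table layout (`priors`, `likelihoods`) and the function name are illustrative assumptions, not the authors' implementation:

```python
def class_posterior(c, x, priors, likelihoods):
    """Posterior P(c | x, w) for one hypothesized word, per eqs. (1)-(2).

    priors[c]         -- P(c | w) for c in {0, 1}
    likelihoods[c][d] -- dict mapping feature value x_d to P(x_d | c, w)
    """
    def joint(ci):
        # naive Bayes: multiply per-feature likelihoods (eq. 2)
        p = priors[ci]
        for d, xd in enumerate(x):
            p *= likelihoods[ci][d][xd]
        return p
    # Bayes' rule, normalizing over both classes (eq. 1)
    return joint(c) / (joint(0) + joint(1))
```

Here `x` is the discrete feature vector of the hypothesized word; classifying the word as incorrect then amounts to thresholding `class_posterior(1, x, ...)`.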
Smoothed Naïve Bayes Model (cont)

Given $N$ training samples $\{(x_n, c_n, w_n)\}_{n=1}^{N}$, we can estimate the unknown probabilities using the conventional frequencies

$$P(c \mid w) = \frac{N(c, w)}{N(w)} \qquad (3)$$

$$P(x_d \mid c, w) = \frac{N(x_d, c, w)}{N(c, w)}, \quad d = 1, \ldots, D \qquad (4)$$

where the $N(\cdot)$ are suitably defined event counts.
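A sketch of the count-based estimation, assuming training samples are given as `(x, c, w)` triples with discrete feature tuples `x` (this data layout is an assumption for illustration, not the paper's code):

```python
from collections import Counter

def estimate_tables(samples):
    """Relative-frequency estimates (3) and (4) from (x, c, w) samples."""
    n_w, n_cw, n_xcw = Counter(), Counter(), Counter()
    for x, c, w in samples:
        n_w[w] += 1                      # N(w)
        n_cw[(c, w)] += 1                # N(c, w)
        for d, xd in enumerate(x):
            n_xcw[(d, xd, c, w)] += 1    # N(x_d, c, w)
    p_c_given_w = lambda c, w: n_cw[(c, w)] / n_w[w]                         # eq. (3)
    p_xd_given_cw = lambda d, xd, c, w: n_xcw[(d, xd, c, w)] / n_cw[(c, w)]  # eq. (4)
    return p_c_given_w, p_xd_given_cw
```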
Smoothed Naïve Bayes Model (cont)

Unfortunately, these frequencies often underestimate the true probabilities of events involving rare training data. To circumvent this problem, we have considered an absolute discounting smoothing model: a small constant is discounted from every positive count, and the gained probability mass is then distributed among the null counts.
Smoothed Naïve Bayes Model (cont)

For each word $w$, if $N(c, w) = 0$ for $c = 1$ (or $c = 0$), (3) is replaced by

$$P(c \mid w) = \begin{cases} \dfrac{N(c, w) - b}{N(w)} & \text{if } N(c, w) > 0 \\[4pt] \dfrac{b}{N(w)} & \text{if } N(c, w) = 0 \end{cases} \qquad (5)$$

Similarly, for each $(c, w)$, if $N(x_d, c, w) = 0$ for one or more possible values of $x_d$, the probability function (4) becomes

$$P(x_d \mid c, w) = \begin{cases} \dfrac{N(x_d, c, w) - b}{N(c, w)} & \text{if } N(x_d, c, w) > 0 \\[6pt] M_b \, \dfrac{P(x_d \mid c)}{\sum_{x'_d : N(x'_d, c, w) = 0} P(x'_d \mid c)} & \text{if } N(x_d, c, w) = 0 \end{cases} \qquad (6)$$

where $M_b$ denotes the gained probability mass ($b$ times the number of seen events).
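The discounting step can be sketched as follows for a single count distribution; `backoff` plays the role of the generalized distribution $P(x_d \mid c)$, and the constant `b = 0.5` is only illustrative:

```python
def discounted_dist(counts, support, backoff, b=0.5):
    """Absolute discounting in the spirit of (5)-(6): subtract b from every
    positive count and share the gained mass M_b among unseen events,
    proportionally to a backoff distribution."""
    total = sum(counts.values())
    unseen = [v for v in support if counts.get(v, 0) == 0]
    if not unseen:                                       # nothing to redistribute
        return {v: counts[v] / total for v in support}
    gained = b * (len(support) - len(unseen)) / total    # M_b: freed mass
    z = sum(backoff[v] for v in unseen)                  # renormalize over unseen
    return {v: (counts[v] - b) / total if counts.get(v, 0) > 0
            else gained * backoff[v] / z
            for v in support}
```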
Smoothed Naïve Bayes Model (cont)

$P(x_d \mid c)$ is used as a generalized distribution to divide $M_b$ among the unseen events. To prevent null estimates, it is also smoothed by absolute discounting:

$$P(x_d \mid c) = \begin{cases} \dfrac{N(x_d, c) - b}{N(c)} & \text{if } N(x_d, c) > 0 \\[6pt] M_b \, \dfrac{1}{\sum_{x'_d : N(x'_d, c) = 0} 1} & \text{if } N(x_d, c) = 0 \end{cases} \qquad (7)$$

In practice, there are many $(c, w)$ pairs for which nearly all counts $N(x_d, c, w)$ are null and, therefore, even the smoothed model (6) gives inaccurate estimates. To deal with these extreme cases, for those $(c, w)$ pairs with counts below a global threshold, the generalized model (7) is used instead of (6). Similarly, if a word $w$ does not occur in the training data, $P(c \mid w)$ is approximated by $P(c)$.

Using the models trained as explained above, in the test phase UV is performed by classifying a word as incorrect if $P(c = 1 \mid x, w)$ is greater than a certain threshold.
Predictor Features

A set of well-known predictor features has been selected:
Acoustic stability : number of times that a hypothesized word appears at the same position (as computed by Levenshtein alignment) in K alternative outputs of the speech recognizer, obtained using different weights between the acoustic and language model scores.
LMprob : language model probability.
Hypothesis density (HD) : the average number of active hypotheses within the hypothesized word boundaries.
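The acoustic stability feature can be sketched as below: align the 1-best output against each of the K alternative outputs with a standard Levenshtein DP, and count position-wise matches. This is a simplified illustration under assumed data layouts, not the recognizer's own alignment code:

```python
def align(ref, hyp):
    """Levenshtein-align hyp to ref; return, for each ref position,
    the hyp word aligned to it (None for deletions)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j-1] + (ref[i-1] != hyp[j-1]),
                          d[i-1][j] + 1, d[i][j-1] + 1)
    out, i, j = [None] * n, n, m
    while i > 0 and j > 0:   # backtrace, preferring matches/substitutions
        if d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            out[i-1] = hyp[j-1]
            i, j = i - 1, j - 1
        elif d[i][j] == d[i-1][j] + 1:
            i -= 1
        else:
            j -= 1
    return out

def acoustic_stability(best, alternatives):
    """Per word of the 1-best output: in how many of the K alternative
    outputs does the same word appear at the aligned position?"""
    counts = [0] * len(best)
    for alt in alternatives:
        for i, w in enumerate(align(best, alt)):
            if w == best[i]:
                counts[i] += 1
    return counts
```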
Predictor Features (cont)

PrecPh : the percentage of hypothesized word phones that match the phones obtained in a “phone-only” decoding.
Duration : the word duration in frames divided by its number of phones.
ACscore : the acoustic log-score of the word divided by its number of phones.
Predictor Features (cont)

Word Trellis Stability (WTS) : the motivation for the WTS comes from the following observation: a word is most probably correct if it appears, within approximately the same time interval, in the majority of the most probable hypotheses.
Predictor Features (cont)

WTS : given a sequence of feature vectors $x_1, \ldots, x_n$, the path score for a partial hypothesis $h$ is computed as

$$Sc(h) = P_{AC}(h)\, P_{LM}(h)^{\alpha} \qquad (8)$$

where $\alpha$ is the Grammar Scale Factor (GSF). Clearly, equation (8) is a function of the GSF value; that is, a partial hypothesis can be the most probable or not depending on the $\alpha$ value (see Figure 1).

Figure 1. Path scores of three competing partial hypotheses.
Predictor Features (cont)

WTS (cont): let $w$ be a word of the recognized sentence, and let $s_w$ and $e_w$ be the starting and ending frames of $w$, $0 \le s_w \le e_w \le N_t$, where $N_t$ is the number of frames of the given utterance. The WTS of $w$ is computed as

$$WTS(w) = \frac{1}{e_w - s_w + 1} \sum_{t = s_w}^{e_w} \frac{C(w, t)}{\sum_{w'} C(w', t)} \qquad (9)$$

$$C(w, t) = \sum_{t'} |H_t(w, t')| \qquad (10)$$

where $H_t(w, t')$ is a set of word-boundary partial hypotheses that are most probable at time $t$ for a certain range of GSF values. In addition, in each hypothesis of $H_t(w, t')$, the word $w$ must be active at time $t$.
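Assuming the hypothesis counts $C(w, t)$ have already been collected per frame, eq. (9) reduces to a short average; the `counts[t]` dict layout is a hypothetical encoding for illustration:

```python
def wts(word, s, e, counts):
    """WTS of eq. (9): average, over the word's frames, of the fraction of
    hypothesis counts supporting the word; counts[t] maps word -> C(w, t)."""
    total = 0.0
    for t in range(s, e + 1):
        total += counts[t].get(word, 0) / sum(counts[t].values())
    return total / (e - s + 1)
```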
Experiments

Two different corpora were used:
Traveler task : a Spanish speech corpus of person-to-person communication utterances at the reception desk of a hotel.
FUB task : an Italian speech corpus of phone calls to the front desk of a hotel.

Table 1. Traveler and FUB speech corpora
Experiments (cont)

Traveler task : 24 context-independent Spanish phonemes were modeled by conventional left-to-right HMMs. A bigram language model was estimated using the whole training text corpus of the Traveler task. The test-set Word Error Rate was 5.5%.
FUB task : the HMMs were trained using linear discriminant analysis. Decision-tree clustered generalized triphones were used as phone units. A smoothed trigram LM was estimated using the transcriptions of the training utterances. The test-set Word Error Rate was 27.5%.
Experiments (cont)

In evaluating verification systems, two measures are of interest: the True Rejection Rate (TRR) and the False Rejection Rate (FRR):

TRR = (number of words that are incorrectly recognized and are classified as incorrect) / (number of words that are incorrectly recognized)

FRR = (number of words that are correctly recognized and are classified as incorrect) / (number of words that are correctly recognized)

A Receiver Operating Characteristic (ROC) curve represents TRR against FRR for different values of the classification threshold.
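The two rates follow directly from per-word flags; the parallel-list layout below is an assumption for illustration:

```python
def trr_frr(correct_flags, rejected_flags):
    """TRR and FRR from parallel lists: correct_flags[i] is True if word i
    was correctly recognized, rejected_flags[i] is True if the verifier
    classified it as incorrect."""
    n_inc = sum(1 for c in correct_flags if not c)
    n_cor = len(correct_flags) - n_inc
    true_rej = sum(1 for c, r in zip(correct_flags, rejected_flags) if not c and r)
    false_rej = sum(1 for c, r in zip(correct_flags, rejected_flags) if c and r)
    return true_rej / n_inc, false_rej / n_cor
```

Sweeping the decision threshold and collecting the resulting (FRR, TRR) pairs traces out the ROC curve.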
Experiments (cont)

The area under a ROC curve, divided by the area of a worst-case diagonal ROC curve, provides an adequate overall estimate of the classification accuracy. We denote this area ratio as AROC.
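Given the (FRR, TRR) points of a ROC curve (with the end points (0, 0) and (1, 1) included), the AROC ratio can be computed with the trapezoid rule. This is one plausible reading of the definition above, not necessarily the authors' exact procedure:

```python
def aroc(points):
    """Area under the ROC curve divided by the 0.5 area of the worst-case
    diagonal; points are (FRR, TRR) pairs."""
    pts = sorted(points)
    area = sum((x1 - x0) * (y0 + y1) / 2.0        # trapezoid rule
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return area / 0.5
```

Under this reading, a random classifier (the diagonal) gets AROC = 1 and a perfect ROC curve gets AROC = 2.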
Experiments (cont)
Table 2. AROC values for each individual feature
Table 3. AROC values for the best feature combinations
Experiments (cont)
Fig. 1. Comparative ROC curves for single features AS and HD versus the best feature combination (Traveler corpus)
Experiments (cont)
Fig. 2. Comparative ROC curves for single features AS and HD versus the best feature combination (FUB corpus)
Conclusion
The results show that the combination of different features performs significantly better than any of the well-known single features alone.
The WTS feature has proved particularly effective in improving the classification performance.
New Features based on Multiple Word Graphs for Utterance Verification
Reporter : CHEN, TZAN HWEI
Authors : Alberto Sanchis, Alfons Juan and Enrique Vidal
Reference
A. Sanchis, A. Juan, and E. Vidal, “New Features Based on Multiple Word Graphs for Utterance Verification”, ICSLP 2004.
A. Sanchis, A. Juan, and E. Vidal, “Improving Utterance Verification Using a Smoothed Naïve Bayes Model”, ICASSP 2003.
Features based on multiple word graphs
A word graph G is a directed, acyclic, weighted graph.
[Figure: an example word graph; edges between time-aligned nodes are hypothesized Chinese words such as 台東, 台中, 不斷, 良心, and SIL.]
Features based on multiple word graphs (cont)

Given the acoustic observations $x_1^T$, the posterior probability for a specific edge (word) $[w, s, e]$, where $w$ is the hypothesized word from node $s$ to node $e$, can be computed by summing up the posterior probabilities of all word graph hypotheses containing that edge:

$$P([w, s, e] \mid x_1^T) = \frac{1}{P(x_1^T)} \sum_{h \in G :\, [w, s, e] \in h} P(h, x_1^T) \qquad (1)$$

The probability of the sequence of acoustic observations $P(x_1^T)$ can be computed by summing over all word graph hypotheses:

$$P(x_1^T) = \sum_{h \in G} P(h, x_1^T) \qquad (2)$$

These posterior probabilities can be efficiently computed with the well-known forward-backward algorithm.
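A compact forward-backward sketch over a topologically sorted edge list; the graph encoding and the multiplicative weight semantics are assumptions for illustration:

```python
from collections import defaultdict

def edge_posteriors(edges, start, goal):
    """Edge posteriors of eqs. (1)-(2) in a DAG word graph.
    edges: topologically sorted list of (w, s, e, weight); a path's score
    is the product of its edge weights."""
    fwd = defaultdict(float)
    fwd[start] = 1.0
    for w, s, e, wt in edges:            # forward pass: path mass into each node
        fwd[e] += fwd[s] * wt
    bwd = defaultdict(float)
    bwd[goal] = 1.0
    for w, s, e, wt in reversed(edges):  # backward pass: mass from node to goal
        bwd[s] += wt * bwd[e]
    px = fwd[goal]                       # eq. (2): total mass of all paths
    return {(w, s, e): fwd[s] * wt * bwd[e] / px for w, s, e, wt in edges}, px
```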
Features based on multiple word graphs (cont)
The posterior probabilities can be based on different kinds of knowledge, depending on the weights associated with the word graph edges. In this work, three features are studied:
WgAC : the weights are acoustic scores.
WgLM : the weights are language model scores.
WgTOT : the weights are the combination of acoustic and language model scores.
Features based on multiple word graphs (cont)

The algorithm proposed to compute each of these features is described in the accompanying figure.
Experiments

To compare the proposed features with alternative predictor features, a set of well-known features has been selected:
Acoustic stability : number of times that a hypothesized word appears at the same position in K alternative outputs of the speech recognizer, obtained using different weights between the acoustic and language model scores.
LMprob : Language model probability.
Hypothesis density (HD) : the average number of the active hypotheses within the hypothesized word boundaries.
Experiments (cont)

PrecPh : the percentage of hypothesized word phones that match the phones obtained in a “phone-only” decoding.
Duration : the word duration in frames divided by its number of phones.
ACscore : the acoustic log-score of the word divided by its number of phones.
Experiments (cont)

Word Trellis Stability (WTS) : given a word $w$ and its start and end times $s_w$ and $e_w$, two variants of the WTS are computed as

$$WTS_{avg}(w) = \frac{1}{e_w - s_w + 1} \sum_{t = s_w}^{e_w} \frac{C(w, t)}{\sum_{w'} C(w', t)}$$

$$WTS_{max}(w) = \max_{s_w \le t \le e_w} \frac{C(w, t)}{\sum_{w'} C(w', t)}$$
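Both variants share the same per-frame ratio; a sketch assuming per-frame count dicts `counts[t]` (a hypothetical encoding):

```python
def wts_avg_max(word, s, e, counts):
    """WTS_avg and WTS_max over frames s..e; counts[t] maps word -> C(w, t)."""
    ratios = [counts[t].get(word, 0) / sum(counts[t].values())
              for t in range(s, e + 1)]
    return sum(ratios) / len(ratios), max(ratios)
```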
Experiments (cont)

Metrics for the evaluation of the classification accuracy:
A ROC curve represents TRR against FRR for different values of the classification threshold; AROC is the area under the ROC curve divided by the area of the worst-case diagonal.
Confidence Error Rate (CER) : defined as the number of classification errors divided by the total number of recognized words. A baseline CER is obtained by assuming that all recognized words are classified as correct.
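CER and its baseline can be sketched from the same kind of per-word flags used for TRR/FRR; the list layout is assumed for illustration:

```python
def cer(correct_flags, rejected_flags):
    """Confidence Error Rate, plus the baseline CER of accepting every word."""
    errors = sum(1 for c, r in zip(correct_flags, rejected_flags)
                 if (c and r) or (not c and not r))    # false + missed rejections
    baseline = sum(1 for c in correct_flags if not c)  # errors if nothing rejected
    n = len(correct_flags)
    return errors / n, baseline / n
```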
Experiments (cont)
Table 2: AROC, CER and relative reduction in baseline CER for each individual feature.
Experiments (cont)
Table 3: AROC, CER and relative reduction in baseline CER for the best feature combinations.
Experiments (cont)
Figure 2. Comparative ROC curves for the best single feature versus the best feature combination.
Experiments (cont)
Table 4 : Comparative AROC and CER values for each feature computed using a single or multiple word graphs.