Presented by: Fang-Hui Chu
Discriminative MLE training using a product of Gaussian Likelihoods
T. Nagarajan and Douglas O’Shaughnessy
INRS-EMT, University of Quebec, Montreal, Canada
ICSLP 06
Outline
• Introduction
• Model selection methods
  – Bayesian information criterion
  – Product of Gaussian likelihoods
• Experimental setup
  – PoG-based model selection
• Performance analysis
• Conclusions
References
• [Padmanabhan and Bahl, IEEE Trans. SAP 2000], Model Complexity Adaptation Using a Discriminant Measure
• [Airey and Gales, ICASSP 2003], Product of Gaussians and Multiple Stream Systems
Introduction
• Defining the structure of an HMM is one of the major issues in acoustic modeling
  – The number of states is usually chosen based on either the number of acoustic variations one may expect across the utterance or the length of the utterance
  – The number of mixtures per state can be chosen based on the amount of training data available
    • Proportional to the number of data samples (PD)
  – Alternative model-selection criteria:
    • Bayesian Information Criterion (also referred to as MDL, Minimum Description Length)
    • Discriminative Model Complexity Adaptation (MAC)
Introduction
• The major difference between the proposed technique and others lies in the fact that the complexity of a model is adjusted by considering only the training examples of other classes, and not their corresponding models
• In this technique, the focus is on how well a given model can discriminate the training data of different classes
Model selection Methods
• The focus is on choosing a proper topology for the models to be generated, especially the number of mixtures per state
  – We compare the performance of the proposed technique with that of the conventional BIC-based technique
  – For this work, acoustic modeling is carried out using the MLE algorithm only
  – These systems are implemented using HTK
Bayesian Information Criterion
• The number of components of a model is chosen by maximizing an objective function
  – That is essentially the likelihood of the training examples of a model, penalized by the number of components in that model and the number of training examples
  – α is an additional penalty factor used to control the complexities of the resultant models

$$ m_i^* \;=\; \arg\max_{m = 1, 2, \ldots, M} \left[\, \sum_{k=1}^{K_i} \log p\big(s_i^k \mid \lambda_i^m\big) \;-\; \frac{\alpha}{2}\, \#\big(\lambda_i^m\big) \log K_i \right] $$

where
  s_i^k, k = 1, 2, ..., K_i : the utterances of a class C_i
  K_i : the number of training examples available for the class C_i
  λ_i^m : the acoustic model of the class C_i with m mixtures per state
  m : the number of mixtures per state
  S : the number of states in the model
  d : the dimension of the feature vector
  #(λ_i^m) : the number of free parameters in λ_i^m (a function of m, S, and d)
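The selection rule can be sketched in code. This is a minimal sketch, assuming the per-model log-likelihoods have already been computed, and approximating the parameter count as m·S·(2d+1) for diagonal-covariance Gaussians (one mean and one variance per dimension, plus a mixture weight); the exact count used in the paper may differ.

```python
import math

def bic_score(log_lik, n_params, n_examples, alpha=1.0):
    """BIC objective: data log-likelihood minus a penalty that grows with
    model size and (logarithmically) with the number of training examples."""
    return log_lik - 0.5 * alpha * n_params * math.log(n_examples)

def select_num_mixtures(log_liks, d, S, K_i, alpha=1.0):
    """Pick the mixture count m (1-based) that maximizes the BIC objective.
    log_liks[m-1] is the total log-likelihood of the K_i training utterances
    under the model with m mixtures per state."""
    scores = {m: bic_score(ll, m * S * (2 * d + 1), K_i, alpha)
              for m, ll in enumerate(log_liks, start=1)}
    return max(scores, key=scores.get)
```

With a larger α, the penalty line steepens and smaller models win, which is how the penalty factor controls the resultant model complexities.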
Bayesian Information Criterion
• Disadvantage
  – It does not consider information about other classes
  – This may lead to an increased error rate, especially for classes that closely resemble their most competitive classes
Product of Gaussian likelihoods
• Figure 1(b) is possible only when the model λj is well trained
  – During training of the model, the acoustic likelihoods of all the utterances of class Cj should be maximized to a great extent
  – To maximize the likelihood on the training data, the estimation procedure often tries to make the variances of all the mixture components very small
  – But this often provides poor matches to independent test data

For a given model λi, Gaussians are fitted to the per-utterance likelihood scores:

$$ N_{ii}(\mu_{ii}, \sigma_{ii}^2) \;\leftarrow\; p\big(s_i^k \mid \lambda_i\big), \qquad N_{ij}(\mu_{ij}, \sigma_{ij}^2) \;\leftarrow\; p\big(s_j^k \mid \lambda_i\big) $$
Product of Gaussian likelihoods
• Thus, it is always better to reduce the overlap between the likelihood Gaussians (say, Nii and Nij) of utterances of different classes (say, Ci and Cj) for a given model (λi)
• We assume that two Gaussians overlap with each other if either of the following conditions is met:
  – If μii = μij, irrespective of their corresponding variances
  – If σii or σij is wide enough that both Gaussians overlap considerably
• In order to quantify the amount of overlap between two Gaussians, we can use error bounds, like the Chernoff or Bhattacharyya bounds
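As a concrete instance of such a bound, the Bhattacharyya distance between two univariate Gaussians has a well-known closed form (a standard textbook formula, not taken from the slides):

```python
import math

def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians.
    A larger distance means less overlap; exp(-D_B) bounds the Bayes
    error for two equiprobable classes."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))
```

Note the second (variance) term: it is non-zero whenever the variances differ, even for identical means, which illustrates the variance sensitivity the next slide objects to.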
Product of Gaussian likelihoods
• However, these bounds are sensitive to the variances of the Gaussians
• Here, we use a similar “product of Gaussians” logic to estimate the amount of overlap between two Gaussians
The product of the two likelihood Gaussians is itself an (unnormalized) Gaussian:

$$ N_{ii}(\mu_{ii}, \sigma_{ii}^2)\, N_{ij}(\mu_{ij}, \sigma_{ij}^2) \;=\; \hat{K}\, N(\hat{\mu}, \hat{\sigma}^2), \qquad \hat{K} \;=\; \frac{1}{\sqrt{2\pi(\sigma_{ii}^2 + \sigma_{ij}^2)}}\, \exp\!\left( -\frac{(\mu_{ii} - \mu_{ij})^2}{2(\sigma_{ii}^2 + \sigma_{ij}^2)} \right) $$

where Nii and Nij are the Gaussians fitted to the likelihoods p(s_i^k | λi) and p(s_j^k | λi), respectively.
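The scaling factor of the product can be checked numerically: the pointwise product of the two pdfs integrates over the real line to exactly this factor, since the Gaussian part integrates to 1. The means and variances below are arbitrary illustrative numbers.

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def pog_scale(mu1, var1, mu2, var2):
    """Closed-form scaling factor K of the product of two Gaussian pdfs:
    N(mu1,var1) * N(mu2,var2) = K * N(mu_hat, var_hat)."""
    s = var1 + var2
    return math.exp(-(mu1 - mu2) ** 2 / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)

# Numerical check: Riemann sum of the product on a fine grid over [-20, 20].
step = 0.001
numeric = sum(gauss_pdf(i * step - 20.0, 0.5, 1.2) * gauss_pdf(i * step - 20.0, 2.0, 0.8)
              for i in range(40001)) * step
```

The closed form and the numerical integral agree to several decimal places, confirming that the product of two Gaussians only needs the mean gap and the summed variances.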
Product of Gaussian likelihoods
• In order to quantify the amount of overlap between two different Gaussians, we define the following ratio of peak values of products of Gaussians:

$$ O \;=\; \frac{\displaystyle \max_x\, N_{ii}(x)\, N_{ij}(x)}{\displaystyle \max_x\, N_{ii}(x)\, N_{ii}(x)} \;=\; \frac{\sigma_{ii}}{\sigma_{ij}}\, \exp\!\left( -\frac{(\mu_{ii} - \mu_{ij})^2}{2(\sigma_{ii}^2 + \sigma_{ij}^2)} \right) $$

• If μii = μij, then O = σii / σij, irrespective of the separation of the means.
Product of Gaussian likelihoods
• However, for this case (equal means but unequal variances) we expect the overlap O to be equal to 1, so the variance ratio is divided out:

$$ O_N \;=\; \frac{\sigma_{ij}}{\sigma_{ii}}\, O \;=\; \exp\!\left( -\frac{(\mu_{ii} - \mu_{ij})^2}{2(\sigma_{ii}^2 + \sigma_{ij}^2)} \right) $$

• The resultant ON is used as a measure to estimate the amount of overlap between two Gaussians
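A minimal sketch of the two overlap measures for a pair of univariate likelihood Gaussians. The closed forms follow from the peak ratio of products of Gaussians; the function and variable names are illustrative, not from the paper.

```python
import math

def overlap(mu_ii, var_ii, mu_ij, var_ij):
    """Peak of N_ii * N_ij relative to the peak of N_ii * N_ii."""
    s = var_ii + var_ij
    return math.sqrt(var_ii / var_ij) * math.exp(-(mu_ii - mu_ij) ** 2 / (2.0 * s))

def overlap_normalized(mu_ii, var_ii, mu_ij, var_ij):
    """Variance-ratio-normalized overlap O_N: equals 1 when the means
    coincide, regardless of the variances, and decays as they separate."""
    s = var_ii + var_ij
    return math.exp(-(mu_ii - mu_ij) ** 2 / (2.0 * s))
```

Unlike the Bhattacharyya distance, O_N depends on the variances only through the denominator of the exponent, so two coincident-mean Gaussians always score a full overlap of 1.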
Experimental setup
• The TIMIT corpus is considered for both training and testing
• The word lexicon is created only with the test words
  – For the words common to the train and test data, pronunciation variations are taken from the transcriptions provided with the corpus
  – For the rest of the test words, only one transcription is considered
• Syllables are extracted from the phonetic segmentation boundaries
  – Only the 200 syllables that have more than 50 examples are considered
  – The rest are replaced by their corresponding phonemes
• 200 syllable and 46 monophone models are initialized
Experimental setup
• For the initialized models
  – the number of states is fixed based on the number of phonemes for a given sub-word unit
  – the number of mixtures per state is set to one
• For the re-estimation of model parameters
  – A standard Viterbi alignment procedure is used
  – The number of mixtures per state for each model is then increased to 30, in steps of 1, by a conventional mixture-splitting procedure
  – Each time, the model parameters are re-estimated twice
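One mixture-splitting step can be sketched as follows. This is a guess at the “conventional” procedure: HTK-style mix-up clones the heaviest component, halves its weight, and perturbs the two copies’ means by ±0.2 standard deviations; the offset and the heaviest-component choice are assumptions, not stated on the slide.

```python
import math

def split_heaviest(weights, means, variances, eps=0.2):
    """Split the heaviest mixture component in place: halve its weight,
    clone it, and nudge the two means apart by eps standard deviations
    (variances are left unchanged). Lists are 1-D for simplicity."""
    i = max(range(len(weights)), key=lambda k: weights[k])
    w, mu, var = weights[i], means[i], variances[i]
    delta = eps * math.sqrt(var)
    weights[i] = w / 2.0
    means[i] = mu - delta
    weights.append(w / 2.0)
    means.append(mu + delta)
    variances.append(var)
    return weights, means, variances
```

Applying this once per step, followed by two re-estimation passes, reproduces the 1, 2, ..., 30 mixture schedule described above.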
PoG-based model selection
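A hypothetical sketch of PoG-based selection, assuming the normalized overlap drives the choice of mixture count: grow the model while its worst overlap with any competing class keeps shrinking. The stopping rule and the `stats` layout are illustrative assumptions, not the paper’s algorithm.

```python
import math

def pog_overlap(mu_ii, var_ii, mu_ij, var_ij):
    """Normalized PoG overlap between two likelihood Gaussians."""
    s = var_ii + var_ij
    return math.exp(-(mu_ii - mu_ij) ** 2 / (2.0 * s))

def select_model(stats, tol=1e-3):
    """stats[m] maps a mixture count m to a list of
    (mu_ii, var_ii, mu_ij, var_ij) likelihood-Gaussian pairs, one per
    competing class. Keep growing m while the worst (largest) overlap
    with any competitor still shrinks by more than tol; otherwise stop."""
    best_m, best_overlap = None, float("inf")
    for m in sorted(stats):
        worst = max(pog_overlap(*pair) for pair in stats[m])
        if worst < best_overlap - tol:
            best_m, best_overlap = m, worst
        else:
            break
    return best_m
```

This captures the idea stated in the introduction and conclusions: complexity is adjusted by how well the model separates the training utterances of other classes from its own, rather than by its own likelihood alone.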
Performance analysis
• Since the model size seems to grow uncontrollably, we fixed the maximum number of mixtures per state at 30
Conclusions
• In conventional techniques for model optimization, the topology of a model is optimized either without considering other classes, or by considering only a subset of competing models
• In this work, by contrast, we consider how well a given model can discriminate the training utterances of other classes from its own
Discriminative model complexity adaptation
• The discriminant measure is a two-dimensional vector