Presented by: Fang-Hui Chu
Discriminative MLE training using a product of Gaussian Likelihoods
T. Nagarajan and Douglas O’Shaughnessy
INRS-EMT, University of Quebec, Montreal, Canada
ICSLP 06
Outline
• Introduction
• Model selection methods
  – Bayesian information criterion
  – Product of Gaussian likelihoods
• Experimental setup
  – PoG-based model selection
• Performance analysis
• Conclusions
References
• [Padmanabhan and Bahl, IEEE Trans. SAP 2000], Model Complexity Adaptation Using a Discriminant Measure
• [Airey and Gales, ICASSP 2003], Product of Gaussians and Multiple Stream Systems
Introduction
• Defining the structure of an HMM is one of the major issues in acoustic modeling
  – The number of states is usually chosen based on either the number of acoustic variations one may expect across the utterance or the length of the utterance
  – The number of mixtures per state can be chosen based on the amount of training data available
    • Proportional to the number of data samples (PD)
  – Alternative model-selection criteria:
    • Bayesian Information Criterion (also referred to as MDL, Minimum Description Length)
    • Discriminative Model Complexity Adaptation (MAC)
Introduction
• The major difference between the proposed technique and others lies in the fact that the complexity of a model is adjusted by considering only the training examples of other classes, and not their corresponding models
• In this technique, the focus is on how well a given model can discriminate the training data of different classes
Model selection Methods
• The focus is on choosing a proper topology for the models to be generated, especially the number of mixtures per state
  – We compare the performance of the proposed technique with that of the conventional BIC-based technique
  – For this work, acoustic modeling is carried out using the MLE algorithm only
  – These systems are implemented using HTK
Bayesian Information Criterion
• The number of components of a model is chosen by maximizing an objective function
  – That is essentially the likelihood of the training examples of a model, penalized by the number of components in that model and the number of training examples
  – α is an additional penalty factor used to control the complexities of the resultant models

$$ m_i^* \;=\; \arg\max_{m = 1, 2, \ldots, M} \left[\, \sum_{k=1}^{K_i} \log p\big(s_i^k \mid \lambda_i^m\big) \;-\; \frac{\alpha}{2}\, \#\big(\lambda_i^m\big) \log K_i \right] $$

where
  s_i^k, k = 1, 2, ..., K_i : the utterances of a class C_i
  K_i : the number of training examples available for the class C_i
  λ_i^m : the acoustic model of the class C_i with m mixtures per state
  m : the number of mixtures per state
  S : the number of states in the model
  d : the dimension of the feature vector
  #(λ_i^m) : the number of free parameters in λ_i^m (a function of m, S, and d)
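The selection rule can be sketched in code. This is a minimal sketch, assuming the per-model log-likelihoods have already been computed, and approximating the parameter count as m·S·(2d+1) for diagonal-covariance Gaussians (one mean and one variance per dimension, plus a mixture weight); the exact count used in the paper may differ.

```python
import math

def bic_score(log_lik, n_params, n_examples, alpha=1.0):
    """BIC objective: data log-likelihood minus a penalty that grows with
    model size and (logarithmically) with the number of training examples."""
    return log_lik - 0.5 * alpha * n_params * math.log(n_examples)

def select_num_mixtures(log_liks, d, S, K_i, alpha=1.0):
    """Pick the mixture count m (1-based) that maximizes the BIC objective.
    log_liks[m-1] is the total log-likelihood of the K_i training utterances
    under the model with m mixtures per state."""
    scores = {m: bic_score(ll, m * S * (2 * d + 1), K_i, alpha)
              for m, ll in enumerate(log_liks, start=1)}
    return max(scores, key=scores.get)
```

With a larger α, the penalty line steepens and smaller models win, which is how the penalty factor controls the resultant model complexities.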
Bayesian Information Criterion
• Disadvantage
  – It does not consider information about other classes
  – This may lead to an increased error rate, especially for classes that closely resemble their most competitive classes
Product of Gaussian likelihoods
• Figure 1(b) is possible only when the model λj is well trained
  – During training of the model, the acoustic likelihoods of all the utterances of class Cj should be maximized to a great extent
  – To maximize the likelihood on the training data, the estimation procedure often tries to make the variances of all the mixture components very small
  – But this often provides poor matches to independent test data

For a given model λi, Gaussians are fitted to the per-utterance likelihood scores:

$$ N_{ii}(\mu_{ii}, \sigma_{ii}^2) \;\leftarrow\; p\big(s_i^k \mid \lambda_i\big), \qquad N_{ij}(\mu_{ij}, \sigma_{ij}^2) \;\leftarrow\; p\big(s_j^k \mid \lambda_i\big) $$
Product of Gaussian likelihoods
• Thus, it is always better to reduce the overlap between the likelihood Gaussians (say, Nii and Nij) of utterances of different classes (say, Ci and Cj) for a given model (λi)
• We assume that two Gaussians overlap with each other if either of the following conditions is met:
  – If μii = μij, irrespective of their corresponding variances
  – If σii or σij is wide enough that both Gaussians overlap considerably
• In order to quantify the amount of overlap between two Gaussians, we can use error bounds, like the Chernoff or Bhattacharyya bounds
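As a concrete instance of such a bound, the Bhattacharyya distance between two univariate Gaussians has a well-known closed form (a standard textbook formula, not taken from the slides):

```python
import math

def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians.
    A larger distance means less overlap; exp(-D_B) bounds the Bayes
    error for two equiprobable classes."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))
```

Note the second (variance) term: it is non-zero whenever the variances differ, even for identical means, which illustrates the variance sensitivity the next slide objects to.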
Product of Gaussian likelihoods
• However, these bounds are sensitive to the variances of the Gaussians
• Here, we use a similar “product of Gaussians” logic to estimate the amount of overlap between two Gaussians
The product of the two likelihood Gaussians is itself an (unnormalized) Gaussian:

$$ N_{ii}(\mu_{ii}, \sigma_{ii}^2)\, N_{ij}(\mu_{ij}, \sigma_{ij}^2) \;=\; \hat{K}\, N(\hat{\mu}, \hat{\sigma}^2), \qquad \hat{K} \;=\; \frac{1}{\sqrt{2\pi(\sigma_{ii}^2 + \sigma_{ij}^2)}}\, \exp\!\left( -\frac{(\mu_{ii} - \mu_{ij})^2}{2(\sigma_{ii}^2 + \sigma_{ij}^2)} \right) $$

where Nii and Nij are the Gaussians fitted to the likelihoods p(s_i^k | λi) and p(s_j^k | λi), respectively.
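The scaling factor of the product can be checked numerically: the pointwise product of the two pdfs integrates over the real line to exactly this factor, since the Gaussian part integrates to 1. The means and variances below are arbitrary illustrative numbers.

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def pog_scale(mu1, var1, mu2, var2):
    """Closed-form scaling factor K of the product of two Gaussian pdfs:
    N(mu1,var1) * N(mu2,var2) = K * N(mu_hat, var_hat)."""
    s = var1 + var2
    return math.exp(-(mu1 - mu2) ** 2 / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)

# Numerical check: Riemann sum of the product on a fine grid over [-20, 20].
step = 0.001
numeric = sum(gauss_pdf(i * step - 20.0, 0.5, 1.2) * gauss_pdf(i * step - 20.0, 2.0, 0.8)
              for i in range(40001)) * step
```

The closed form and the numerical integral agree to several decimal places, confirming that the product of two Gaussians only needs the mean gap and the summed variances.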
Product of Gaussian likelihoods
• In order to quantify the amount of overlap between two different Gaussians, we define the following ratio of peak values of products of Gaussians:

$$ O \;=\; \frac{\displaystyle \max_x\, N_{ii}(x)\, N_{ij}(x)}{\displaystyle \max_x\, N_{ii}(x)\, N_{ii}(x)} \;=\; \frac{\sigma_{ii}}{\sigma_{ij}}\, \exp\!\left( -\frac{(\mu_{ii} - \mu_{ij})^2}{2(\sigma_{ii}^2 + \sigma_{ij}^2)} \right) $$

• If μii = μij, then O = σii / σij, irrespective of the separation of the means.
Product of Gaussian likelihoods
• However, for this case (equal means but unequal variances) we expect the overlap O to be equal to 1, so the variance ratio is divided out:

$$ O_N \;=\; \frac{\sigma_{ij}}{\sigma_{ii}}\, O \;=\; \exp\!\left( -\frac{(\mu_{ii} - \mu_{ij})^2}{2(\sigma_{ii}^2 + \sigma_{ij}^2)} \right) $$

• The resultant ON is used as a measure to estimate the amount of overlap between two Gaussians
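A minimal sketch of the two overlap measures for a pair of univariate likelihood Gaussians. The closed forms follow from the peak ratio of products of Gaussians; the function and variable names are illustrative, not from the paper.

```python
import math

def overlap(mu_ii, var_ii, mu_ij, var_ij):
    """Peak of N_ii * N_ij relative to the peak of N_ii * N_ii."""
    s = var_ii + var_ij
    return math.sqrt(var_ii / var_ij) * math.exp(-(mu_ii - mu_ij) ** 2 / (2.0 * s))

def overlap_normalized(mu_ii, var_ii, mu_ij, var_ij):
    """Variance-ratio-normalized overlap O_N: equals 1 when the means
    coincide, regardless of the variances, and decays as they separate."""
    s = var_ii + var_ij
    return math.exp(-(mu_ii - mu_ij) ** 2 / (2.0 * s))
```

Unlike the Bhattacharyya distance, O_N depends on the variances only through the denominator of the exponent, so two coincident-mean Gaussians always score a full overlap of 1.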
Experimental setup
• The TIMIT corpus is considered for both training and testing
• The word lexicon is created only with the test words
  – For the words common to the train and test data, pronunciation variations are taken from the transcriptions provided with the corpus
  – For the rest of the test words, only one transcription is considered
• Syllables are extracted from the phonetic segmentation boundaries
  – Only the 200 syllables that have more than 50 examples are considered
  – The rest are replaced by their corresponding phonemes
• 200 syllable and 46 monophone models are initialized
Experimental setup
• For the initialized models
  – the number of states is fixed based on the number of phonemes for a given sub-word unit
  – the number of mixtures per state is set to one
• For the re-estimation of model parameters
  – A standard Viterbi alignment procedure is used
  – The number of mixtures per state for each model is then increased to 30, in steps of 1, by a conventional mixture-splitting procedure
  – Each time, the model parameters are re-estimated twice
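One mixture-splitting step can be sketched as follows. This is a guess at the “conventional” procedure: HTK-style mix-up clones the heaviest component, halves its weight, and perturbs the two copies’ means by ±0.2 standard deviations; the offset and the heaviest-component choice are assumptions, not stated on the slide.

```python
import math

def split_heaviest(weights, means, variances, eps=0.2):
    """Split the heaviest mixture component in place: halve its weight,
    clone it, and nudge the two means apart by eps standard deviations
    (variances are left unchanged). Lists are 1-D for simplicity."""
    i = max(range(len(weights)), key=lambda k: weights[k])
    w, mu, var = weights[i], means[i], variances[i]
    delta = eps * math.sqrt(var)
    weights[i] = w / 2.0
    means[i] = mu - delta
    weights.append(w / 2.0)
    means.append(mu + delta)
    variances.append(var)
    return weights, means, variances
```

Applying this once per step, followed by two re-estimation passes, reproduces the 1, 2, ..., 30 mixture schedule described above.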
PoG-based model selection
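A hypothetical sketch of PoG-based selection, assuming the normalized overlap drives the choice of mixture count: grow the model while its worst overlap with any competing class keeps shrinking. The stopping rule and the `stats` layout are illustrative assumptions, not the paper’s algorithm.

```python
import math

def pog_overlap(mu_ii, var_ii, mu_ij, var_ij):
    """Normalized PoG overlap between two likelihood Gaussians."""
    s = var_ii + var_ij
    return math.exp(-(mu_ii - mu_ij) ** 2 / (2.0 * s))

def select_model(stats, tol=1e-3):
    """stats[m] maps a mixture count m to a list of
    (mu_ii, var_ii, mu_ij, var_ij) likelihood-Gaussian pairs, one per
    competing class. Keep growing m while the worst (largest) overlap
    with any competitor still shrinks by more than tol; otherwise stop."""
    best_m, best_overlap = None, float("inf")
    for m in sorted(stats):
        worst = max(pog_overlap(*pair) for pair in stats[m])
        if worst < best_overlap - tol:
            best_m, best_overlap = m, worst
        else:
            break
    return best_m
```

This captures the idea stated in the introduction and conclusions: complexity is adjusted by how well the model separates the training utterances of other classes from its own, rather than by its own likelihood alone.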
Performance analysis
• Since the model size seems to grow uncontrollably, we fixed the maximum number of mixtures per state at 30
Conclusions
• In conventional techniques for model optimization, the topology of a model is optimized either without considering other classes, or by considering only a subset of competing models
• In this work, by contrast, we consider how well a given model can discriminate the training utterances of other classes from its own
Discriminative model complexity adaptation
• The discriminant measure is a two-dimensional vector