A Baseline System for Speaker Recognition

C. Mokbel, H. Greige, R. Zantout, H. Abi Akl

A. Ghaoui, J. Chalhoub, R. Bayeh

University Of Balamand - ELISA

C. Mokbel - UOB - NIST2002 2

Outline

• Introduction

• Baseline speaker recognition system

• NIST 2002 evaluation

• Conclusion and perspective


Introduction

• A baseline system was built and used in the NIST 2002 speaker recognition evaluation
  – GMM-based system
  – Score normalization using z-norm
  – Adaptation technique used to estimate the speaker model starting from the world model


Baseline Speaker Recognition System

• Feature extraction:
  – Feature vectors of the kind used in speech recognition
    • 13 MFCC coefficients, including the energy on a logarithmic scale
    • Plus first- and second-order derivatives
  – Leading to 39 feature parameters

• Preprocessing using cepstral mean normalization
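The 13 + 13 + 13 layout and the mean normalization can be sketched as follows. This is a minimal illustration, not the authors' front end: the function names, the regression window of two frames, and the choice to normalize the static coefficients before taking derivatives are all assumptions, and the MFCC extraction itself is taken as given.

```python
import numpy as np

def cepstral_mean_normalize(feats):
    """Subtract the per-utterance mean of every coefficient (CMN)."""
    return feats - feats.mean(axis=0, keepdims=True)

def deltas(feats, width=2):
    """Regression-based time derivative over +/- `width` frames.

    Edge frames are replicated so the output has the same length as the input.
    The window width is an assumed value, not taken from the slides.
    """
    n = len(feats)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(feats, dtype=float)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + n]
        behind = padded[width - k : width - k + n]
        out += k * (ahead - behind)
    return out / denom

def build_features(static):
    """Stack 13 static coefficients with their first and second derivatives."""
    static = cepstral_mean_normalize(static)
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])  # shape (frames, 39) for 13 static inputs
```

Subtracting a constant mean does not change the derivatives, so the order of normalization and differentiation only affects the static block.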



• GMM modeling for both hypotheses: speaker and non-speaker (world)
  – EM algorithm (Baum-Welch) to train the world model
    • Initialization using LBG VQ
  – Speaker model: mean vectors adapted from the world model
    • Approximation of the “unified adaptation approach” (“Online Adaptation of HMMs to Real-Life Conditions: A Unified Framework”, IEEE Trans. on SAP, Vol. 9, No. 4, May 2001)



• Speaker adaptation:
  – World model Gaussian distributions are grouped in a binary tree
  – The Gaussian classes are determined from the speaker data
  – MLLR is applied based on these classes: only the means of the Gaussian distributions are adapted
  – MAP is applied to the leaf Gaussian distributions



• Building the Gaussian tree bottom-up:
  – The closest Gaussian distributions are grouped two by two
  – The distance between two Gaussian distributions is the loss in likelihood of the associated data if the two Gaussians are merged into a single Gaussian
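One way to make this distance concrete is sketched below. It assumes diagonal covariances, that each Gaussian is the maximum-likelihood fit of the data assigned to it (with occupancy counts `n1`, `n2`), and that the merged Gaussian is obtained by moment matching. Under those assumptions the likelihood loss has a closed form that depends only on the counts and variances. The function names are hypothetical.

```python
import numpy as np

def merge(n1, mu1, var1, n2, mu2, var2):
    """Moment-matched merge of two diagonal Gaussians with occupancies n1, n2."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    var = (n1 * (var1 + (mu1 - mu) ** 2) + n2 * (var2 + (mu2 - mu) ** 2)) / n
    return n, mu, var

def merge_loss(n1, mu1, var1, n2, mu2, var2):
    """Drop in data log-likelihood when the two Gaussians are replaced by one.

    Assumes each Gaussian is the ML estimate of its own data, so the
    log-likelihood of n points is -n/2 * (D*log(2*pi) + sum(log var) + D)
    and the constant terms cancel in the difference.
    """
    n, _, var = merge(n1, mu1, var1, n2, mu2, var2)
    return 0.5 * (n * np.log(var).sum()
                  - n1 * np.log(var1).sum()
                  - n2 * np.log(var2).sum())
```

A bottom-up tree would then repeatedly merge the pair with the smallest `merge_loss`; the loss is zero for identical Gaussians and grows as the means move apart.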



• After the E-step of the EM algorithm, the weights associated with the leaves of the tree are propagated up through the tree to the root

• Going from the root to the leaves, a node is selected whenever one of its two children has a weight below a threshold
  – This defines a partition that is used by the MLLR algorithm
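The propagation and selection steps can be sketched on a small binary tree. The `Node` class, the function names, and the rule that a leaf reached during descent forms its own class are assumptions made for illustration; the slides only specify the upward propagation and the child-weight test.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Node:
    leaf_id: Optional[int] = None      # index of the Gaussian, leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    weight: float = 0.0                # accumulated occupancy

def propagate(node: Node, leaf_weights: Dict[int, float]) -> float:
    """Sum leaf weights (from the E-step) upward so every node holds its subtree total."""
    if node.leaf_id is not None:
        node.weight = leaf_weights[node.leaf_id]
    else:
        node.weight = propagate(node.left, leaf_weights) + propagate(node.right, leaf_weights)
    return node.weight

def leaves(node: Node) -> List[int]:
    """Gaussian indices under a node."""
    if node.leaf_id is not None:
        return [node.leaf_id]
    return leaves(node.left) + leaves(node.right)

def select_classes(node: Node, threshold: float) -> List[List[int]]:
    """Descend from the root; stop at a node when one child is too light to split further."""
    if node.leaf_id is not None:
        return [[node.leaf_id]]
    if node.left.weight < threshold or node.right.weight < threshold:
        return [leaves(node)]
    return select_classes(node.left, threshold) + select_classes(node.right, threshold)
```

Each returned list is one regression class: the Gaussians in it would share a single MLLR transform.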



• MAP algorithm:
  – The Gaussian mean parameters estimated at the leaves are smoothed, using a fixed weight, with the parameters of the corresponding world Gaussian



• Given a target speaker model s, the world model w and a test utterance X, the score for this utterance is computed as the log-likelihood ratio:

  s = log [ p(X|s) / p(X|w) ]

• This score must be normalized because the world model is not precise
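A minimal sketch of the log-likelihood ratio for two diagonal-covariance GMMs follows. The parameter layout (weights, means, variances as arrays) and the function names are assumptions; the score is summed over frames as in the formula above, with no per-frame averaging.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of frames X (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                       + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_comp = log_comp + np.log(weights)[None, :]
    # log-sum-exp over components, stabilized by the per-frame maximum
    m = log_comp.max(axis=1, keepdims=True)
    per_frame = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(per_frame.sum())

def llr_score(X, speaker, world):
    """log p(X|speaker) - log p(X|world); each model is (weights, means, variances)."""
    return gmm_loglik(X, *speaker) - gmm_loglik(X, *world)
```

A positive score favors the target speaker hypothesis and a negative one favors the world model, before any normalization.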



• Normalization using the z-norm:
  – A few impostor utterances are used
  – A score is computed for every utterance
  – These scores define a distribution per target speaker
  – The target speakers' distributions should be similar so that a decision can be made with a single threshold
    • Center and reduce the distribution:

      ns = a * s + b
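Centering and reducing the impostor score distribution fixes the two coefficients: with impostor mean mu and standard deviation sigma for a given target model, a = 1/sigma and b = -mu/sigma. A short sketch, with hypothetical function names:

```python
import numpy as np

def znorm_params(impostor_scores):
    """Return (a, b) such that a * s + b has zero mean and unit variance on impostor scores."""
    mu = float(np.mean(impostor_scores))
    sigma = float(np.std(impostor_scores))
    return 1.0 / sigma, -mu / sigma

def znorm(score, a, b):
    """Apply the normalization ns = a * s + b."""
    return a * score + b
```

The parameters are estimated once per target speaker model from impostor utterances scored against that model, then applied to every test score for that model.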



• Based on the data from the 2001 evaluation, a DET curve can be plotted
  – Find the optimal decision threshold that minimizes the cost defined for NIST 2002, i.e.:

    Cdet = Cmiss * Pr(miss | target) * Prtarget + CFalseAlarm * Pr(false alarm | non-target) * (1 - Prtarget)
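The threshold search can be sketched as a sweep over candidate thresholds on development scores. The slides do not state the cost parameters; the defaults below (Cmiss = 10, CFalseAlarm = 1, Prtarget = 0.01) are the values commonly quoted for NIST evaluations of that period and should be treated as an assumption, as should the function names.

```python
import numpy as np

def detection_cost(target_scores, nontarget_scores, threshold,
                   c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Cdet at a given threshold; a trial is accepted when score >= threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    p_miss = np.mean(target_scores < threshold)       # targets rejected
    p_fa = np.mean(nontarget_scores >= threshold)     # non-targets accepted
    return float(c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target))

def best_threshold(target_scores, nontarget_scores, **cost_params):
    """Sweep every observed score (plus +inf, i.e. reject everything) and keep the cheapest."""
    candidates = np.sort(np.concatenate([target_scores, nontarget_scores, [np.inf]]))
    costs = [detection_cost(target_scores, nontarget_scores, t, **cost_params)
             for t in candidates]
    i = int(np.argmin(costs))
    return float(candidates[i]), costs[i]
```

With these assumed parameters, rejecting every trial costs exactly Cmiss * Prtarget = 0.1, which is a useful reference point when reading a Cdet value.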


NIST 2002 evaluation

• Feature vector: 13 MFCCs + 13 Δ + 13 ΔΔ

• Cepstral Mean Normalization

• Gender-dependent GMMs with 256 Gaussian components for the world model
  – Trained on a subset of the cellular data of the NIST 2001 evaluation



• Target speaker model adapted from the world model
  – At every iteration, after the E-step:
    • A threshold (cumulative probability = 3.0) is used to select the tree nodes
    • MLLR is used to update the Gaussian means
    • An approximated MAP smooths the MLLR-estimated parameters: a linear combination of the MLLR-estimated mean (0.8) and the world (a priori) mean (0.2)
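The mean update can be sketched as an affine MLLR transform followed by the fixed-weight smoothing. Estimating the transform from the E-step statistics is outside this sketch: `transform_A` and `transform_b` are taken as given for one regression class, and the function name is hypothetical. Only the 0.8 / 0.2 weights come from the slides.

```python
import numpy as np

def adapt_means(world_means, transform_A, transform_b, mllr_weight=0.8):
    """Adapt the means of one regression class.

    world_means: (M, D) a priori means of the Gaussians in the class.
    transform_A: (D, D) and transform_b: (D,) form the class's MLLR mean transform.
    Returns mllr_weight * (A @ mu + b) + (1 - mllr_weight) * mu for every mean.
    """
    mllr_means = world_means @ transform_A.T + transform_b
    return mllr_weight * mllr_means + (1.0 - mllr_weight) * world_means
```

Keeping 20% of the world mean pulls poorly estimated transforms back toward the prior, which matters when the speaker's training data is short.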



• 16 male and 21 female speakers (NIST 2001) used as impostors (~8 test files from each)
  – The pseudo-impostor scores define a distribution used to z-normalize the score for a given target speaker

• Global threshold estimated on NIST 2001 data so as to minimize the cost



• System characteristics:
  – CPU time on a Pentium III 800 MHz:
    • 2.1 ms per frame and per speaker for speaker model adaptation
    • 0.92 ms per frame for the test
  – Memory usage:
    • ~360 Kbytes per test



• Results:
  – Cdet = 0.100292
  – Min Cdet = 0.097833

• DET curve: [figure not reproduced]


Conclusions and perspectives

• A new baseline system has been developed and evaluated

• A lot of work remains to be done, mainly:
  – Optimize the feature extraction module
  – Implement the complete Unified Adaptation approach
  – Investigate new normalization strategies
  – Integrate automatic labeling of speech segments