A Baseline System for Speaker Recognition

C. Mokbel, H. Greige, R. Zantout, H. Abi Akl

A. Ghaoui, J. Chalhoub, R. Bayeh

University Of Balamand - ELISA

C. Mokbel - UOB - NIST2002 2

Outline

• Introduction

• Baseline speaker recognition system

• NIST 2002 evaluation

• Conclusion and perspective

Introduction

• A baseline system was built and used in the NIST 2002 speaker recognition evaluation

– GMM-based system

– Score normalization using z-norm

– Adaptation technique used to estimate the speaker model starting from the world model

Baseline Speaker Recognition System

• Feature extraction:

– Speech-recognition-based feature vectors

• 13 MFCC coefficients, including the energy on a logarithmic scale

• + first- and second-order derivatives

– Leading to 39 feature parameters

• Preprocessing using cepstral mean normalization
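
The feature pipeline above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the derivatives are computed, so a simple two-frame central difference is assumed here, and the input is made-up numbers rather than real MFCCs. The function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def deltas(frames):
    """Assumed derivative: (x[t+1] - x[t-1]) / 2, repeating the edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def build_features(static):
    """Stack static, first- and second-order coefficients: 13 -> 39 per frame."""
    normalized = cepstral_mean_normalize(static)
    d1 = deltas(normalized)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(normalized, d1, d2)]


# Toy input: 5 frames of 13 made-up values standing in for MFCCs.
toy = [[float(t * 13 + d) for d in range(13)] for t in range(5)]
features = build_features(toy)
```

Each output frame holds 39 values, matching the dimensionality quoted on the slide.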

Baseline Speaker Recognition System

• GMM modeling for both hypotheses: speaker and non-speaker (world)

– EM algorithm to train the world model (Baum-Welch)

• Initialization using LBG VQ

– Speaker model: mean vectors adapted from the world model

• Approximation of the “unified adaptation approach” (“Online Adaptation of HMMs to Real-Life Conditions: A Unified Framework”, IEEE Trans. on SAP, Vol. 9, No. 4, May 2001)
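
As a rough illustration of what the EM training evaluates for each frame, the sketch below computes the GMM log-likelihood of one feature vector and the component posteriors used in the E-step. Diagonal covariances are an assumption (the slides do not state the covariance type), and the two-component model is purely hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def frame_log_likelihood(x, weights, means, variances):
    """Return log p(x | GMM) and the per-component posteriors (E-step)."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # log-sum-exp for numerical stability
    total = top + math.log(sum(math.exp(l - top) for l in logs))
    posteriors = [math.exp(l - total) for l in logs]
    return total, posteriors


# Hypothetical 2-component, 1-dimensional model, for illustration only.
weights = [0.5, 0.5]
means = [[-1.0], [1.0]]
variances = [[1.0], [1.0]]
ll, post = frame_log_likelihood([0.0], weights, means, variances)
```

The posteriors are the per-Gaussian weights that the adaptation steps accumulate over the speaker's frames.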

Baseline Speaker Recognition System

• Speaker adaptation:

– World-model Gaussian distributions are grouped in a binary tree

– The Gaussian classes are determined from the speaker data

– MLLR is applied based on these classes: only the means of the Gaussian distributions are adapted

– MAP is applied to the leaf Gaussian distributions

Baseline Speaker Recognition System

• Building the Gaussian tree bottom-up:

– The closest Gaussian distributions are grouped two by two

– The distance between two Gaussian distributions is equal to the loss in the likelihood of the associated data if the two Gaussians are merged into a single Gaussian
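
One way to compute such a merge distance is sketched below. It assumes diagonal covariances and that each Gaussian is the maximum-likelihood fit of its own data, summarized as a hypothetical tuple (count, mean, variance). Under those assumptions the constant terms cancel and the likelihood loss reduces to a difference of log-variances. The slides do not give the exact formula, so this is an illustration, not the authors' implementation.

```python
import math


def merge(g1, g2):
    """Moment-matched Gaussian covering the data of both inputs."""
    n1, m1, v1 = g1
    n2, m2, v2 = g2
    n = n1 + n2
    mean = [(n1 * a + n2 * b) / n for a, b in zip(m1, m2)]
    var = [
        (n1 * (va + a * a) + n2 * (vb + b * b)) / n - m * m
        for a, b, va, vb, m in zip(m1, m2, v1, v2, mean)
    ]
    return (n, mean, var)


def merge_loss(g1, g2):
    """Log-likelihood lost when g1 and g2 are replaced by their merge."""
    n, _, var = merge(g1, g2)
    n1, _, v1 = g1
    n2, _, v2 = g2
    return 0.5 * sum(
        n * math.log(vm) - n1 * math.log(va) - n2 * math.log(vb)
        for vm, va, vb in zip(var, v1, v2)
    )


def closest_pair(gaussians):
    """Return (loss, i, j) for the pair with the smallest merge loss."""
    best = None
    for i in range(len(gaussians)):
        for j in range(i + 1, len(gaussians)):
            d = merge_loss(gaussians[i], gaussians[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best


# Toy 1-dimensional Gaussians: (count, mean, variance).
g = [(10, [0.0], [1.0]), (10, [0.1], [1.0]), (10, [5.0], [1.0])]
pair = closest_pair(g)
```

Repeating the pairing on the merged nodes until one node remains yields the binary tree.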

Baseline Speaker Recognition System

• After the E-step of the EM algorithm, the weights associated with the leaves of the tree are propagated through the tree up to the root

• Going from the root to the leaves, nodes are selected whenever one of their two children has a weight less than a threshold

– This defines a partition that will be used in an MLLR algorithm
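
The propagation and top-down selection can be sketched as below. The `Node` class and function names are hypothetical, and treating a leaf as a class of its own when the descent reaches it is an assumption about a case the slide does not spell out.

```python
class Node:
    """Binary tree node; leaves carry the weight accumulated in the E-step."""

    def __init__(self, name, left=None, right=None, weight=0.0):
        self.name = name
        self.left = left
        self.right = right
        self.weight = weight

    def is_leaf(self):
        return self.left is None and self.right is None


def propagate(node):
    """Set each internal node's weight to the sum of its children's weights."""
    if not node.is_leaf():
        node.weight = propagate(node.left) + propagate(node.right)
    return node.weight


def select_classes(node, threshold):
    """Descend from the root; stop at a node if either child is below threshold."""
    if node.is_leaf():
        return [node]
    if node.left.weight < threshold or node.right.weight < threshold:
        return [node]
    return select_classes(node.left, threshold) + select_classes(node.right, threshold)


# Toy tree with made-up leaf weights.
a, b = Node("a", weight=5.0), Node("b", weight=4.0)
c, d = Node("c", weight=6.0), Node("d", weight=0.5)
root = Node("root", Node("ab", a, b), Node("cd", c, d))
total = propagate(root)
classes = [n.name for n in select_classes(root, threshold=3.0)]
```

In this toy case the leaves `a` and `b` each have enough data to form their own class, while `c` and `d` share the class rooted at `cd` because `d` falls below the threshold.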

Baseline Speaker Recognition System

• MAP algorithm:

– The estimated Gaussian mean parameters at the leaves are smoothed, using a fixed weight, with the parameters of the world Gaussians
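
A minimal sketch of this fixed-weight smoothing, read as a linear interpolation between the adapted mean and the world mean. The function name is hypothetical; the weight 0.8 is the value reported for the evaluation configuration in these slides, and the vectors are made up.

```python
def smooth_mean(adapted_mean, world_mean, weight):
    """Interpolate between the adapted mean and the world (prior) mean."""
    return [
        weight * a + (1.0 - weight) * w
        for a, w in zip(adapted_mean, world_mean)
    ]


# Toy 2-dimensional means, for illustration only.
smoothed = smooth_mean([1.0, 2.0], [0.0, 0.0], weight=0.8)
```

A weight of 1.0 would keep the adapted mean unchanged, and a weight of 0.0 would fall back to the world model.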

Baseline Speaker Recognition System

• Given a target speaker model $\lambda_s$, the world model $\lambda_w$ and a test utterance X, the score s for this utterance is computed as the log-likelihood ratio: $s = \log \left[ p(X \mid \lambda_s) / p(X \mid \lambda_w) \right]$

• This score should be normalized because the world model is imprecise
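
Assuming frames are treated as independent, as is usual for GMMs, the utterance log-likelihood is the sum of per-frame log-likelihoods, and the score is the difference of the two sums. The sketch below takes those per-frame values as given; the optional division by the number of frames is a common practice and an assumption here, not something the slides state.

```python
def llr_score(frame_loglik_speaker, frame_loglik_world, per_frame=False):
    """Log-likelihood ratio of an utterance from per-frame log-likelihoods."""
    total = sum(frame_loglik_speaker) - sum(frame_loglik_world)
    if per_frame:
        # Assumed length normalization, not taken from the slides.
        return total / len(frame_loglik_speaker)
    return total


# Made-up per-frame log-likelihoods for a 3-frame utterance.
score = llr_score([-1.0, -2.0, -3.0], [-2.0, -2.0, -4.0])
```

A positive score means the speaker model explains the utterance better than the world model.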

Baseline Speaker Recognition System

• Normalization using the z-norm:

– A few impostor utterances are used

– A score is computed for every utterance

– The scores define a distribution per target speaker

– Target speakers' distributions should be similar for a decision using a single threshold

• Center and scale the distribution (zero mean, unit variance):

ns = a * s + b
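
For a zero-mean, unit-variance impostor distribution, the coefficients follow from the impostor mean μ and standard deviation σ: a = 1/σ and b = −μ/σ. The sketch below uses the population standard deviation, which is an assumption, and made-up impostor scores.

```python
import statistics


def znorm_params(impostor_scores):
    """Return (a, b) such that a * s + b standardizes the impostor scores."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)
    return 1.0 / sigma, -mu / sigma


def normalize(score, a, b):
    """Apply the affine normalization ns = a * s + b."""
    return a * score + b


# Made-up impostor scores against one target model (mean 5, std 2).
impostors = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
a, b = znorm_params(impostors)
ns = normalize(9.0, a, b)
```

The pair (a, b) is estimated once per target speaker and then applied to every test score against that speaker's model.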

Baseline Speaker Recognition System

• Based on the data from the 2001 evaluation, a DET curve can be plotted

– Find the optimal decision threshold that minimizes the cost defined by NIST 2002, i.e.:

$C_{det} = C_{Miss} \cdot P_{Miss|Target} \cdot P_{Target} + C_{FalseAlarm} \cdot P_{FalseAlarm|NonTarget} \cdot (1 - P_{Target})$
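
The threshold search can be sketched as an exhaustive sweep over the development scores. The default cost parameters below (miss cost 10, false-alarm cost 1, target prior 0.01) are an assumption based on values commonly quoted for NIST evaluations of that period; the slides do not state them. The scores are made up and the function names are hypothetical.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost for given miss and false-alarm rates (assumed defaults)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def best_threshold(target_scores, nontarget_scores, **cost_params):
    """Return (cost, threshold) minimizing the cost; accept when score >= threshold."""
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    candidates.append(float("inf"))  # the "reject everything" operating point
    best = None
    for thr in candidates:
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        cost = detection_cost(p_miss, p_fa, **cost_params)
        if best is None or cost < best[0]:
            best = (cost, thr)
    return best


# Made-up, perfectly separable development scores.
cost, threshold = best_threshold([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```

Each candidate threshold corresponds to one point on the DET curve, so the sweep simply picks the point with the lowest cost.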

NIST 2002 evaluation

• Feature vector: 13 MFCCs + 13 Δ + 13 ΔΔ

• Cepstral mean normalization

• Gender-dependent GMM with 256 Gaussian components for the world model

– Trained on a subset of the cellular data of the NIST 2001 evaluation

NIST 2002 evaluation

• Target speaker model adapted from the world model

– For every iteration, after the E-step:

• Threshold (cumulative probability = 3.0) to select tree nodes

• MLLR used to update the Gaussian means

• Approximated MAP to smooth the MLLR-estimated parameters: linear combination of the MLLR-estimated mean (weight 0.8) and the world (a priori) mean (weight 0.2)

NIST 2002 evaluation

• 16 male and 21 female speakers (NIST 2001) used as impostors (~8 test files from each)

– The pseudo-impostor scores define a distribution used to z-normalize the score for a given target speaker

• Global threshold estimated on NIST 2001 data in order to minimize the cost

NIST 2002 evaluation

• System characteristics:

– CPU time on a Pentium III 800 MHz:

• 2.1 ms per frame and per speaker for speaker model adaptation

• 0.92 ms per frame for the test

– Memory usage:

• ~360 Kbytes per test

NIST 2002 evaluation

• Results:

– Cdet = 0.100292

– Min Cdet = 0.097833

• DET curve: [figure not reproduced in this transcript]

Conclusions and perspectives

• A new baseline system has been developed and evaluated

• A lot of work remains to be done, mainly:

– Optimize the feature extraction module

– Implement the complete Unified Adaptation approach

– Investigate new normalization strategies

– Integrate automatic labeling of speech segments