![Page 1: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/1.jpg)
An overview of Automatic Speaker Recognition
Gérard CHOLLET
chollet@tsi.enst.fr
GET-ENST/CNRS-LTCI
46 rue Barrault
75634 PARIS cedex 13
http://www.tsi.enst.fr/~chollet
![Page 2: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/2.jpg)
Outline
Motivations, Applications
Speech production background
Speaker characteristics in the speech signal
Automatic Speaker Verification:
  Decision theory
  Text dependent
  Text independent
Databases, Evaluation, Standardization
Audio-visual speaker verification
Conclusions
Perspectives
![Page 3: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/3.jpg)
Why should a computer recognize who is speaking?
Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA, ...)
Limited access (secured areas, databases)
Personalization (only respond to its master’s voice)
Locate a particular person in an audio-visual document (information retrieval)
Who is speaking in a meeting?
Is a suspect the criminal? (forensic applications)
![Page 4: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/4.jpg)
Domains of Automatic Speaker Recognition
Your voice is a signature
Speaker verification (Voice Biometric)
  Are you really who you claim to be?
Identification within an open set:
  Is this speech segment coming from a known speaker?
Identification within a closed set
Speaker detection, segmentation, indexing, retrieval:
  Looking for recordings of a particular speaker
Combining Speech and Speaker Recognition
  Adaptation to a new speaker
  Personalization in dialogue systems
![Page 5: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/5.jpg)
Applications
Access Control
  Physical facilities, Computer networks, Websites
Transaction Authentication
  Telephone banking, e-Commerce
Speech data Management
  Voice messaging, Search engines
Law Enforcement
  Forensics, Home incarceration
![Page 6: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/6.jpg)
Voice Biometric
Advantages
  Often the only modality over the telephone
  Low cost (microphone, A/D), Ubiquity
  Possible integration on a smart (SIM) card
  Natural bimodal fusion: speaking face
Disadvantages
  Lack of discretion
  Possibility of imitation and electronic imposture
  Lack of robustness to noise, distortion, …
  Temporal drift
![Page 7: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/7.jpg)
Speaker Identity in Speech
Differences in
  Vocal tract shapes and muscular control
  Fundamental frequency (typical values): 100 Hz (Male), 200 Hz (Female), 300 Hz (Child)
  Glottal waveform
  Phonotactics
  Lexical usage, idiolects
The difference between the voices of twins is a limiting case
Voices can also be imitated or disguised
![Page 8: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/8.jpg)
Speaker Identity
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes labeled f and A]
Segmental factors (~30 ms)
  Glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  Vocal tract: characterized by its transfer function and represented by MFCCs (Mel Frequency Cepstral Coefficients)
Suprasegmental factors
  Speaking speed (timing and rhythm of speech units)
  Intonation patterns
  Dialect, accent, pronunciation habits
![Page 9: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/9.jpg)
Speech production
![Page 10: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/10.jpg)
Speech analysis
![Page 11: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/11.jpg)
Inter-speaker Variability
[Figure: utterance "We were away a year ago."]
![Page 12: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/12.jpg)
Intra-speaker Variability
[Figure: utterance "We were away a year ago."]
![Page 13: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/13.jpg)
Mel Frequency Cepstral Coefficients
![Page 14: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/14.jpg)
Speaker Verification
Typology of approaches (EAGLES Handbook)
Text dependent
  Public password
  Private password
  Customized password
  Text prompted
Text independent
Incremental enrolment
Evaluation
![Page 15: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/15.jpg)
Automatic Speaker Verification
[Diagram: Claimed Identity → Automatic Speaker Verification System → Acceptance or Rejection]
Speech processing, Biometric Technology
![Page 16: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/16.jpg)
What are the sources of difficulty ?
Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …)
Recording conditions (filtering, noise, …)
Temporal drift
Intentional imposture
Voice disguise
![Page 17: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/17.jpg)
Decision theory for identity verification
Two types of errors:
  False rejection (a client is rejected)
  False acceptance (an impostor is accepted)
Decision theory: given an observation O and a claimed identity,
  H0 hypothesis: O comes from an impostor
  H1 hypothesis: O comes from our client
H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as
$$\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}$$
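The decision rule above compares a likelihood ratio with a ratio of priors. A minimal sketch in Python, working in the log domain, could look as follows. The function name `accept_client`, the `prior_client` parameter, and the default equal priors are illustrative assumptions, not part of the original slides.

```python
import math


def accept_client(log_lik_client, log_lik_impostor, prior_client=0.5):
    """Return True when H1 (client) is chosen over H0 (impostor).

    log_lik_client   -- log P(O | H1), assumed to come from a client model
    log_lik_impostor -- log P(O | H0), assumed to come from an impostor model
    prior_client     -- P(H1); P(H0) is taken as 1 - P(H1) (assumption)
    """
    if not 0.0 < prior_client < 1.0:
        raise ValueError("prior_client must lie strictly between 0 and 1")
    # Left-hand side: log of P(O|H1) / P(O|H0)
    log_likelihood_ratio = log_lik_client - log_lik_impostor
    # Right-hand side: log of P(H0) / P(H1)
    log_threshold = math.log((1.0 - prior_client) / prior_client)
    return log_likelihood_ratio > log_threshold
```

With equal priors the threshold is zero, so the claim is accepted whenever the client model explains the observation better than the impostor model. A small client prior raises the threshold and makes acceptance harder.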
![Page 18: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/18.jpg)
Decision
![Page 19: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/19.jpg)
Distribution of scores
![Page 20: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/20.jpg)
Receiver Operating Characteristic (ROC) curve
![Page 21: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/21.jpg)
Detection Error Tradeoff (DET) Curve
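The false rejection and false acceptance rates that ROC and DET curves plot can be estimated from two lists of scores. The sketch below is one possible way to do this, assuming that higher scores mean "more likely the client" and that a trial is accepted when its score is at or above the threshold. The function names and the simple threshold scan used to approximate the Equal Error Rate (EER) are illustrative choices.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold."""
    # A client trial scoring below the threshold is wrongly rejected.
    frr = sum(s < threshold for s in client_scores) / len(client_scores)
    # An impostor trial scoring at or above the threshold is wrongly accepted.
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far


def equal_error_rate(client_scores, impostor_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    Returns (eer, threshold) for the threshold where |FAR - FRR| is smallest.
    """
    best = None
    for t in sorted(set(client_scores) | set(impostor_scores)):
        frr, far = error_rates(client_scores, impostor_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]
```

Sweeping the threshold and collecting the (FAR, FRR) pairs gives the points of a DET curve; the EER is the point where the two rates meet.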
![Page 22: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/22.jpg)
History of Speaker Recognition
![Page 23: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/23.jpg)
Current approaches
![Page 24: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/24.jpg)
Text-dependent Speaker Verification
Uses Automatic Speech Recognition techniques (DTW, HMM, …)
Client model adaptation from speaker independent HMM (‘World’ model)
Synchronous alignment of client and world models for the computation of a score.
![Page 25: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/25.jpg)
Dynamic Time Warping (DTW)
![Page 26: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/26.jpg)
Dynamic Time Warping (DTW)
$$\mu(X, Y) = \sum_{\text{best path}} d^2(x_i, y_j)$$
[Diagram: the test utterance “Bonjour” of speaker Y is aligned against the reference utterances “Bonjour” of speaker 1, speaker 2, ..., speaker n (speaker X)]
DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
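The DTW score, a sum of squared local distances along the best alignment path, can be sketched with a small dynamic program. The version below assumes feature frames given as lists of floats, a squared Euclidean local distance, and the basic three-predecessor step pattern. These choices and the function names are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def dtw_distance(x, y):
    """Accumulated squared distance along the best warping path.

    x, y -- sequences of feature vectors (reference and test utterance).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = squared_distance(x[i - 1], y[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]
```

For identification, the test utterance would be compared with each speaker's reference and the smallest distance retained; for verification, the distance to the claimed speaker's reference would be compared with a threshold.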
![Page 27: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/27.jpg)
Vector Quantization (VQ)
$$\mu(X, Y) = \sum_{\text{best quant.}} d^2(C_i^X, y_j)$$
[Diagram: the test utterance “Bonjour” of speaker Y is quantized with the codebooks of speaker 1, speaker 2, ..., speaker n (codebook of speaker X)]
SOONG, ROSENBERG 1987
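The VQ score sums, over the test frames, the squared distance to the best (nearest) codeword of a speaker's codebook. A minimal sketch follows, assuming codebooks are already trained and stored as lists of centroid vectors. The names `vq_distortion` and `identify` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def vq_distortion(codebook, frames):
    """Total quantization distortion of the test frames for one codebook."""
    # Each frame is represented by its nearest codeword.
    return sum(min(squared_distance(c, y) for c in codebook) for y in frames)


def identify(codebooks, frames):
    """Return the label of the codebook with the lowest distortion.

    codebooks -- dict mapping a speaker label to a list of codewords.
    """
    return min(codebooks, key=lambda name: vq_distortion(codebooks[name], frames))
```

Unlike DTW, this score ignores the order of the frames, which is why a codebook can also be used when the text is not fixed.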
![Page 28: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/28.jpg)
Hidden Markov Models (HMM)
$$\mu(X, Y) = -\sum_{\text{best path}} \log P(y_j \mid S_i^X)$$
[Diagram: the test utterance “Bonjour” of speaker Y is scored against the “Bonjour” models of speaker 1, speaker 2, ..., speaker n (speaker X)]
ROSENBERG 1990, TSENG 1992
![Page 29: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/29.jpg)
Ergodic HMM
$$\mu(X, Y) = -\sum_{\text{best path}} \log P(y_j \mid S_i^X)$$
[Diagram: the test utterance “Bonjour” of speaker Y is scored against the HMMs of speaker 1, speaker 2, ..., speaker n (HMM of speaker X)]
PORITZ 1982, SAVIC 1990
![Page 30: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/30.jpg)
Gaussian Mixture Models (GMM)
REYNOLDS 1995
![Page 31: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/31.jpg)
An example of a Text-dependent Speaker Verification System: the PICASSO project
Sequences of digits
  Speaker-independent HMM of each digit
  Adaptation of these HMMs to the client's voice (during enrolment and incremental enrolment)
  EER of less than 1 % can be achieved
Customized password
  The client chooses his password using some feedback from the system
Deliberate imposture
![Page 32: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/32.jpg)
Deliberate imposture
The impostor has some recordings of the target client's voice. He can record the same sentences and align these speech signals with the recordings of the client.
A transformation (Multiple Linear Regression) is computed from these aligned data.
The impostor has heard the target client's password. He records that password and applies the transformation to this recording.
The PICASSO reference system, which has less than 1 % EER, is defeated by this procedure (more than 30 % EER).
![Page 33: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/33.jpg)
Incremental enrolment of customised password
The client chooses his password using some feedback from the system.
The system attempts a phonetic transcription of the password.
Incremental enrolment is achieved on further repetitions of that password.
Speaker-independent phone HMMs are adapted with the client enrolment data.
Synchronous alignment likelihood ratio scoring is performed on access trials.
![Page 34: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/34.jpg)
HMM structure depends on the application
![Page 35: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/35.jpg)
Speaker Verification (text independent)
The ELISA consortium
  ENST, LIA, IRISA, DDL, Uni-Fribourg, Uni-Balamand...
  http://elisa.ddl.ish-lyon.cnrs.fr/
NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm
Gaussian Mixture Models, Graphical models
Segmental approaches (ALISP)
![Page 36: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/36.jpg)
Gaussian Mixture Model
Parametric representation of the probability distribution of observations:
![Page 37: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/37.jpg)
Gaussian Mixture Models
8 Gaussians per mixture
![Page 38: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/38.jpg)
GMM speaker modeling
[Diagram: front-end → GMM modeling → world GMM model; front-end → GMM model adaptation → target GMM model]
![Page 39: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/39.jpg)
Baseline GMM method
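[Diagram: test speech → front-end → hypothesized target GMM model and world GMM model → LLR score]
The hypothesized target model gives $P(x \mid \lambda_{\text{target}})$ and the world model gives $P(x \mid \lambda_{\text{world}})$; the log-likelihood ratio (LLR) score is
$$\Lambda = \log\left[\frac{P(x \mid \lambda_{\text{target}})}{P(x \mid \lambda_{\text{world}})}\right]$$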
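The baseline score is a log-likelihood ratio between the hypothesized target GMM and the world GMM. A minimal sketch follows, assuming diagonal-covariance components stored as `(weight, mean, variance)` tuples, frames treated independently, and a score averaged over frames. The data layout, the averaging, and the function names are assumptions made for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """log p(x | model) for a GMM given as a list of (weight, mean, var)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)  # log-sum-exp keeps small likelihoods from underflowing
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio: target model vs world model."""
    total = sum(
        gmm_log_likelihood(x, target_gmm) - gmm_log_likelihood(x, world_gmm)
        for x in frames
    )
    return total / len(frames)
```

A positive score means the target model explains the test speech better than the world model; the accept or reject decision then compares the score with a threshold.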
![Page 40: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/40.jpg)
Support Vector Machines and Speaker Verification
A hybrid GMM-SVM system is proposed
The SVM scoring model is trained on development data to classify true-target speaker accesses and impostor accesses, using a new feature representation based on GMMs
Modeling: GMM
Scoring: SVM
![Page 41: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/41.jpg)
SVM principles
[Diagram: X in the input space is mapped to Φ(X) in the feature space; separating hyperplanes H, with the optimal hyperplane Ho, give Class(X)]
![Page 42: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/42.jpg)
Results
![Page 43: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/43.jpg)
State of the art – research directions (3)
World model: speaker independent, trained on all available speakers using the EM algorithm.
Client model: obtained as an adaptation of the world model
  MAP, with a prior distribution
  MLLR, with a transform function
  Unified approach
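To illustrate how a client model can be derived from the world model by MAP adaptation, the sketch below adapts only the component means, using the relevance-factor form that is common for GMM speaker models. The slides do not specify this variant: the `relevance` value, the choice to adapt means only, and the `(weight, mean, variance)` layout are all assumptions.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, gmm):
    """Posterior probability of each mixture component given frame x."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)
    unnormalized = [math.exp(value - peak) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def map_adapt_means(world_gmm, frames, relevance=16.0):
    """Return a client GMM whose means are shifted toward the client data."""
    dim = len(world_gmm[0][1])
    counts = [0.0] * len(world_gmm)             # soft frame count per component
    sums = [[0.0] * dim for _ in world_gmm]     # posterior-weighted frame sums
    for x in frames:
        for k, gamma in enumerate(posteriors(x, world_gmm)):
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    client_gmm = []
    for k, (weight, mean, var) in enumerate(world_gmm):
        # Interpolates between the data mean and the world (prior) mean;
        # components that saw little data stay close to the world model.
        new_mean = [
            (sums[k][d] + relevance * mean[d]) / (counts[k] + relevance)
            for d in range(dim)
        ]
        client_gmm.append((weight, new_mean, var))
    return client_gmm
```

With little enrolment data the client model stays close to the world model, which is what makes adaptation more robust than training a client model from scratch.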
![Page 44: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/44.jpg)
Adaptation
Variable degrees of freedom
Variable partitioning of the distributions
After each E step of EM, a partitioning that gives a sufficient amount of data per class
[Diagram: tree of classes labeled with data counts 12, 12, 9, 17, 6, 23, 21, 33, 56]
![Page 45: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/45.jpg)
Hierarchical - MLLR adapted System
![Page 46: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/46.jpg)
National Institute of Standards & Technology (NIST)
Speaker Verification Evaluations
• Annual evaluation since 1995
• Common paradigm for comparing technologies
![Page 47: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/47.jpg)
NIST evaluations: overview
Recognized standard for evaluating speaker verification systems
Several hundred different speakers, several tens of thousands of test accesses.
Participation of the world's leading laboratories: MIT, IBM, Nuance, ...
ENST has participated since 1997.
![Page 48: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/48.jpg)
NIST evaluations: protocol
Training phase
  2 minutes of spontaneous speech
  Telephone conditions, cellular network
Test phase
  File durations from 5 s to 50 s of spontaneous speech
![Page 49: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/49.jpg)
NIST evaluations: results
The results are presented and discussed at an annual workshop.
Steady improvement of ENST's performance (18% → 9%) despite increasing difficulty:
  Reduced training duration
  Switched network → cellular network
![Page 50: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/50.jpg)
NIST evaluations: results
ENST 2003
![Page 51: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/51.jpg)
Combining Speech Recognition and Speaker Verification
Speaker-independent phone HMMs
Selection of segments or segment classes which are speaker specific
Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)
![Page 52: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/52.jpg)
1.1 Speech Segmentation
Large Vocabulary Continuous Speech Recognition (LVCSR)
  needs a huge amount of transcribed speech data
  language (and task) dependent
  good results for a small set of languages (with existing AND available transcripts)
  we do not have such a system
Data-driven speech segmentation
  not yet usable for speech recognition purposes
  no annotated databases needed
  language and task independent
  we could use it to segment the speech data for a text-independent speaker verification task and for language identification
ALISP (Automatic Language Independent Speech Processing) method
![Page 53: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/53.jpg)
1.2 ALISP data-driven speech segmentation
![Page 54: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/54.jpg)
3. Data-driven Speech Segmentation for Speaker Verification
Current best speaker verification systems are based on Gaussian Mixture Models (each speech frame is treated independently, and no temporal information is taken into account)
Improvements are still necessary
Speech is composed of different sounds
Phonemes have different discriminant characteristics for speaker verification
  nasals and vowels convey more speaker characteristics than other speech classes
  we would like to exploit this idea, but with data-driven ALISP units
An automatic speech segmentation tool is needed
![Page 55: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/55.jpg)
3.1 Advantages and disadvantages of the speech segmentation step
Problems:
  Need for an automatic speech segmentation tool
  Speaker modeling per speech class => more data needed
  More classes => more complicated systems
Advantages:
  Possibility to use it in combination with dialogue-based systems
  Text-prompted speaker verification
  Better accuracy if enough speech data are available
![Page 56: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/56.jpg)
3.2 Proposed system: ALISP-based Segmental Speaker Verification using DTW
Speaker-specific information is extracted from the ALISP-based speech segments => Client Dictionary
Non-speaker (world speakers) ALISP-based speech segments => World Dictionary
Dynamic Time Warping (DTW) was already used for speaker verification, but in a text-dependent mode
  comparison of two speech data with a similar linguistic content
  the DTW distance measure between two speech segments conveys some speaker-specific characteristics
Originality: use DTW in text-independent mode
  The speech data are first segmented into ALISP classes, in order to remove the linguistic variability
  Measure the distances among speaker and non-speaker speech segments
![Page 57: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/57.jpg)
3.3 Searching in client and world speech dictionaries for speaker verification purposes
![Page 58: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/58.jpg)
3.4 Database and experimental setup for the speaker verification experiments
Development data: NIST 2001 cellular data (American English)
  world speakers (60 female + 59 male):
    train the ALISP speech segmenter
    model the non-speakers
Evaluated on
  small subset (14 female + 14 male speakers) from NIST 2001 cellular data
  full set of NIST 2002 cellular data (??? speakers)
Speech parameterization: LPCC for initial ALISP segmentation and MFCC afterward
64 ALISP speech classes
![Page 59: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/59.jpg)
3.5 Results: example of data-driven speech segmentation for speaker verification
Comparison of a manual transcription with the ALISP segmentation ("I think my my daughter")
2 occurrences of the English phone sequence m - ay; corresponding ALISP sequences: HM-Hf-Ha and HM-Hz-Ha-HC
![Page 60: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/60.jpg)
3.6 Results: another example of data-driven speech segmentation for speaker verification
Two other occurrences of the English phone ay; the corresponding ALISP sequences: HX-Hf and Hf-Ha (previous slide: Hf-Ha and Ha-Hz)
![Page 61: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/61.jpg)
3.7 Speaker Verification DET curves
![Page 62: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/62.jpg)
3.8 Conclusions
State-of-the-art NIST 2002 results for EER: best 8% to worst 28%
Problem with the small data set results: influence of the size of the test set and/or mismatched train/test conditions
What we have NOT done:
  exploit the speech classes (silence classes are also included)
  normalization (with pseudo-impostors)
  exploit the DTW distance value, not only the “preference” result
![Page 63: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/63.jpg)
SuperSID experiments
![Page 64: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/64.jpg)
GMM with cepstral features
![Page 65: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/65.jpg)
Selection of nasals in words ending in -ing
being, everything, getting, anything, thing, something, things, going
![Page 66: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/66.jpg)
Fusion
![Page 67: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/67.jpg)
Fusion results
![Page 68: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/68.jpg)
Talking faces and identity verification
Face and speech provide complementary information about a person's identity.
Many PCs, PDAs and telephones are, and will be, equipped with a camera and a microphone.
Imposture is harder to carry out.
Research topic developed at ENST within the IST-SecurePhone project.
![Page 69: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/69.jpg)
Talking faces and identity verification
Sequence of digits (PIN code)
Customized password
![Page 70: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/70.jpg)
Fusion of Speech and Face
(thesis of Conrad Sanderson, August 2002)
![Page 71: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/71.jpg)
Application example
1. Acquisition of the biometric signals for each modality
2. Computation of the decision score for each system
3. Computation of a final decision score based on the fusion of the single-modality scores
Insecure network
Remote server:
1. Access to secure services
2. Validation of transactions
3. Etc.
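The last step, a final decision score built from the single-modality scores, can be sketched as a weighted sum. The slides do not say which fusion rule is used, so the weighted sum, the modality names, the weights, and the zero threshold below are all assumptions, and the scores are assumed to be already normalized to a comparable range.

```python
def fuse_scores(scores, weights):
    """Weighted-sum fusion of per-modality decision scores.

    scores  -- dict mapping a modality name to its decision score
    weights -- dict mapping the same modality names to non-negative weights
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same modalities")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[m] * scores[m] for m in scores) / total_weight


def verify(scores, weights, threshold=0.0):
    """Accept the identity claim when the fused score exceeds the threshold."""
    return fuse_scores(scores, weights) > threshold
```

For example, `verify({"voice": 1.0, "face": -0.5}, {"voice": 0.5, "face": 0.5})` accepts the claim, while shifting most of the weight to the face score rejects it.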
![Page 72: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/72.jpg)
Conclusions and Perspectives
Speech allows identity verification over the telephone.
Combining text-dependent and text-independent approaches improves reliability.
If the face is used to verify identity, adding speech costs little (and pays off handsomely!).
More and more PCs, PDAs and telephones are equipped with a microphone and a camera. Audio-visual recognition should become widespread.
![Page 73: An overview of Automatic Speaker Recognition Gérard CHOLLET chollet@tsi.enst.fr@ GET-ENST/CNRS-LTCI 46 rue Barrault 75634 PARIS cedex 13 chollet](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551d9d82497959293b8bb7d7/html5/thumbnails/73.jpg)
Perspectives
Speech is often the only usable biometric modality (over the telephone network).
Fusion of modalities.
A number of R&D projects within the EU.