
An overview of Automatic Speaker Recognition

Gérard Chollet

GET-ENST/CNRS-LTCI
46 rue Barrault
75634 PARIS cedex 13
http://www.tsi.enst.fr/~chollet

Outline

• Motivations, Applications
• Speech production background
• Speaker characteristics in the speech signal
• Automatic Speaker Verification:
  • Decision theory
  • Text dependent
  • Text independent
• Databases, Evaluation, Standardization
• Audio-visual speaker verification
• Conclusions
• Perspectives

Why should a computer recognize who is speaking?

• Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA, ...)
• Limited access (secured areas, databases)
• Personalization (only respond to its master’s voice)
• Locate a particular person in an audio-visual document (information retrieval)
• Who is speaking in a meeting?
• Is a suspect the criminal? (forensic applications)

Domains of Automatic Speaker Recognition

• Your voice is a signature
• Speaker verification (voice biometric): are you really who you claim to be?
• Identification within an open set: is this speech segment coming from a known speaker?
• Identification within a closed set
• Speaker detection, segmentation, indexing, retrieval: looking for recordings of a particular speaker
• Combining speech and speaker recognition
  • Adaptation to a new speaker
  • Personalization in dialogue systems

Applications

• Access control: physical facilities, computer networks, websites
• Transaction authentication: telephone banking, e-commerce
• Speech data management: voice messaging, search engines
• Law enforcement: forensics, home incarceration

Voice Biometric

Advantages
• Often the only modality over the telephone
• Low cost (microphone, A/D), ubiquity
• Possible integration on a smart (SIM) card
• Natural bimodal fusion: speaking face

Disadvantages
• Lack of discretion
• Possibility of imitation and electronic imposture
• Lack of robustness to noise, distortion, …
• Temporal drift

Speaker Identity in Speech

• Differences in:
  • vocal tract shapes and muscular control
  • fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
  • glottal waveform
  • phonotactics
  • lexical usage, idiolects
• The difference between the voices of twins is a limiting case
• Voices can also be imitated or disguised

[Figure: spectral envelope of /i:/ (amplitude A versus frequency f) for Speaker A and Speaker B]

Speaker Identity

• Segmental factors (~30 ms)
  • Glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  • Vocal tract: characterized by its transfer function and represented by MFCCs (Mel Frequency Cepstral Coefficients)
• Suprasegmental factors
  • Speaking speed (timing and rhythm of speech units)
  • Intonation patterns
  • Dialect, accent, pronunciation habits

Speech production

Speech analysis

Inter-speaker Variability

[Figure: the utterance “We were away a year ago.”]

Intra-speaker Variability

[Figure: the utterance “We were away a year ago.”]

Mel Frequency Cepstral Coefficients

Speaker Verification

Typology of approaches (EAGLES Handbook)
• Text dependent
  • Public password
  • Private password
  • Customized password
  • Text prompted
• Text independent
• Incremental enrolment
• Evaluation

Automatic Speaker Verification

[Figure: a claimed identity is submitted to the Automatic Speaker Verification System (speech processing, biometric technology), which outputs acceptance or rejection]

What are the sources of difficulty ?

• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …)
• Recording conditions (filtering, noise, …)
• Temporal drift
• Intentional imposture
• Voice disguise

Decision theory for identity verification

• Two types of errors:
  • False rejection (a client is rejected)
  • False acceptance (an impostor is accepted)
• Decision theory: given an observation O and a claimed identity,
  • H0 hypothesis: O comes from an impostor
  • H1 hypothesis: O comes from our client
• H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as

$$\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}$$
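The decision rule can be sketched in a few lines of Python. This is only an illustration: the function name `accept_claim` and its parameters are hypothetical, the likelihoods are assumed to be available as log values, and `prior_client` stands for P(H1), so that P(H0) = 1 - P(H1).

```python
import math


def accept_claim(log_lik_client, log_lik_impostor, prior_client=0.5):
    """Return True when H1 (client) is chosen over H0 (impostor).

    Applies P(O|H1) / P(O|H0) > P(H0) / P(H1) in the log domain.
    prior_client is an assumed value for P(H1); P(H0) = 1 - P(H1).
    """
    if not 0.0 < prior_client < 1.0:
        raise ValueError("prior_client must lie strictly between 0 and 1")
    log_ratio = log_lik_client - log_lik_impostor
    log_threshold = math.log(1.0 - prior_client) - math.log(prior_client)
    return log_ratio > log_threshold
```

With equal priors the threshold on the log-likelihood ratio is 0. Lowering `prior_client` raises the threshold, which trades false acceptances for false rejections.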

Decision

Distribution of scores

Receiver Operating Characteristic (ROC) curve

Detection Error Tradeoff (DET) Curve
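The two error types (false rejection and false acceptance) are what ROC and DET curves plot against each other as the decision threshold moves. The sketch below is a minimal illustration, assuming that higher scores favour the client and that a trial is accepted when its score reaches the threshold. The function names are hypothetical, and the equal error rate (EER) is only approximated by scanning the observed scores.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at one threshold.

    Both score lists must be non-empty; a trial is accepted when
    its score is >= threshold.
    """
    false_rejections = sum(1 for s in client_scores if s < threshold)
    false_acceptances = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejections / len(client_scores),
            false_acceptances / len(impostor_scores))


def equal_error_rate(client_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Returns (eer_estimate, threshold) for the threshold where the two
    error rates are closest to each other.
    """
    best = None
    for threshold in sorted(set(client_scores) | set(impostor_scores)):
        frr, far = error_rates(client_scores, impostor_scores, threshold)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, threshold)
    return best[1], best[2]
```

Each threshold gives one (false acceptance, false rejection) point; plotting all of them yields the ROC curve, or the DET curve once both axes are warped to a normal-deviate scale.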

History of Speaker Recognition

Current approaches

Text-dependent Speaker Verification

Uses Automatic Speech Recognition techniques (DTW, HMM, …)

Client model adaptation from speaker independent HMM (‘World’ model)

Synchronous alignment of client and world models for the computation of a score.

Dynamic Time Warping (DTW)

[Figure: the test utterance “Bonjour” (test speaker Y) is aligned along the best path with the reference “Bonjour” of speaker X, one of the reference speakers 1 … n]

$$\mu(X, Y) = \sum_{\text{best path}} d^2(x_i, y_j)$$

DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
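The DTW measure accumulates squared local distances between feature vectors along the best alignment path. A minimal sketch follows, assuming the simplest step pattern (horizontal, vertical, diagonal) and no path constraints or length normalization, which the slide does not specify. The function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dtw_distance(x, y):
    """Accumulated squared distance along the best alignment path.

    x and y are sequences of feature vectors (lists of floats).
    """
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = best accumulated cost aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = squared_distance(x[i - 1], y[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # x advances
                                     cost[i][j - 1],      # y advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

In a text-dependent system, the test utterance would be compared with the enrolled utterance of the claimed speaker, and the accumulated distance used as the (inverse) score.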

Vector Quantization (VQ)

[Figure: the codebook of speaker X is compared with the codebooks of speakers 1, 2, … n]

$$\mu(X, Y) = \sum_{\text{best quant.}} d^2(C_i^X, y_j)$$

SOONG, ROSENBERG 1987
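With vector quantization, each speaker is represented by a codebook, and a test utterance is scored by the distortion obtained when every test vector is quantized to its nearest codeword. The sketch below assumes the codebooks are already trained (for instance by k-means, which is not shown); the function names and the dictionary layout are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def vq_distortion(codebook, test_vectors):
    """Sum over test vectors of the squared distance to the nearest codeword."""
    return sum(min(squared_distance(c, y) for c in codebook)
               for y in test_vectors)


def identify_speaker(codebooks, test_vectors):
    """Return the label of the codebook giving the lowest distortion.

    codebooks maps a speaker label to a list of codewords.
    """
    return min(codebooks,
               key=lambda name: vq_distortion(codebooks[name], test_vectors))
```

This is closed-set identification; for verification, the distortion against the claimed speaker's codebook would be compared with a threshold instead.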

Hidden Markov Models (HMM)

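[Figure: best-path alignment between a “Bonjour” utterance and the “Bonjour” model of speaker X (speakers up to n)]

$$\mu(X, Y) = -\sum_{\text{best path}} \log P(y_j \mid S_i^X)$$

ROSENBERG 1990, TSENG 1992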

Ergodic HMM

[Figure: the HMM of speaker X is compared with the HMMs of speakers 1, 2, … n]

$$\mu(X, Y) = -\sum_{\text{best path}} \log P(y_j \mid S_i^X)$$

PORITZ 1982, SAVIC 1990

Gaussian Mixture Models (GMM)

REYNOLDS 1995

An example of a text-dependent speaker verification system: the PICASSO project

• Sequences of digits
  • Speaker-independent HMM of each digit
  • Adaptation of these HMMs to the client voice (during enrolment and incremental enrolment)
  • An EER of less than 1 % can be achieved
• Customized password
  • The client chooses his password using some feedback from the system
• Deliberate imposture

Deliberate imposture

• The impostor has some recordings of the target client's voice. He can record the same sentences and align these speech signals with the recordings of the client.
• A transformation (Multiple Linear Regression) is computed from these aligned data.
• The impostor has heard the target client's password. He records that password and applies the transformation to this recording.
• The PICASSO reference system, with less than 1 % EER, is defeated by this procedure (more than 30 % EER).

Incremental enrolment of customised password

The client chooses his password using some feedback from the system.

The system attempts a phonetic transcription of the password.

Incremental enrolment is achieved on further repetitions of that password

Speaker-independent phone HMMs are adapted with the client enrolment data.

Synchronous alignment likelihood ratio scoring is performed on access trials.

HMM structure depends on the application

Speaker Verification (text independent)

• The ELISA consortium: ENST, LIA, IRISA, DDL, Uni-Fribourg, Uni-Balamand...
  http://elisa.ddl.ish-lyon.cnrs.fr/
• NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm
• Gaussian Mixture Models, graphical models
• Segmental approaches (ALISP)

Gaussian Mixture Model

Parametric representation of the probability distribution of observations:

Gaussian Mixture Models

8 Gaussians per mixture
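A Gaussian mixture model represents the density of an observation x as a weighted sum of Gaussians, p(x | λ) = Σ_i w_i N(x; μ_i, Σ_i). The sketch below evaluates the log of that density. It assumes diagonal covariance matrices (a common simplification that the slide does not state) and uses hypothetical function names.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var per dimension)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) = log sum_i w_i N(x; mu_i, var_i).

    Uses the log-sum-exp trick to stay numerically stable.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

A mixture with 8 Gaussians would simply pass 8 weights (summing to 1), 8 mean vectors and 8 variance vectors.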

GMM speaker modeling

[Figure: front-end → GMM modeling → world GMM model; front-end → GMM model adaptation → target GMM model]

Baseline GMM method

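[Figure: the test speech passes through the front-end and is scored against the hypothesized target GMM model and the world GMM model, giving the LLR score Λ]

$$\Lambda = \log \frac{P(x \mid \lambda_{\text{target}})}{P(x \mid \lambda_{\text{world}})}$$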
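The baseline score is a log-likelihood ratio between the hypothesized target model and the world model. The sketch below is an illustration under several assumptions: diagonal-covariance GMMs stored as lists of `(weight, mean, variance)` tuples, frames treated independently, and the ratio averaged over frames so that the score does not grow with utterance length. All names are hypothetical.

```python
import math


def gmm_log_density(x, gmm):
    """log p(x | gmm) for a diagonal-covariance GMM.

    gmm is a list of (weight, mean_vector, variance_vector) tuples.
    """
    terms = []
    for weight, mean, var in gmm:
        log_n = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        terms.append(math.log(weight) + log_n)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio, target versus world."""
    total = sum(
        gmm_log_density(x, target_gmm) - gmm_log_density(x, world_gmm)
        for x in frames
    )
    return total / len(frames)
```

The claim is accepted when the score exceeds a threshold chosen on development data.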

Support Vector Machines and Speaker Verification

A hybrid GMM-SVM system is proposed.

An SVM scoring model is trained on development data to separate true-target speaker accesses from impostor accesses, using a new feature representation based on GMMs.

[Figure: the GMM is used for modeling, the SVM for scoring]

SVM principles

[Figure: the input X is mapped from the input space to Φ(X) in the feature space, where separating hyperplanes H are shown together with the optimal hyperplane Ho; the output is Class(X)]

Results

State of the art – research directions (3)

• World model: speaker independent, trained on all available speakers using the EM algorithm.
• Client model: obtained as an adaptation of the world model
  • MAP, with a prior distribution
  • MLLR, with a transform function
  • Unified approach
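As an illustration of adapting a world model to a client, the sketch below implements one common form of MAP adaptation: only the means are updated, and a relevance factor controls how strongly the world means act as a prior. The slide does not say which variant is used, so the mean-only choice, the `relevance` parameter and all function names are assumptions. GMMs are lists of `(weight, mean, variance)` tuples with diagonal covariances.

```python
import math


def posteriors(x, gmm):
    """Posterior probability of each mixture component given frame x."""
    logs = []
    for weight, mean, var in gmm:
        log_n = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        logs.append(math.log(weight) + log_n)
    peak = max(logs)
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, world_gmm, relevance=16.0):
    """Return a client GMM whose means are MAP-adapted from the world GMM.

    New mean = (sum_t gamma_t * x_t + relevance * world_mean)
               / (sum_t gamma_t + relevance)
    Weights and variances are copied unchanged (assumption).
    """
    dim = len(frames[0])
    counts = [0.0] * len(world_gmm)
    sums = [[0.0] * dim for _ in world_gmm]
    for x in frames:
        for i, gamma in enumerate(posteriors(x, world_gmm)):
            counts[i] += gamma
            for d in range(dim):
                sums[i][d] += gamma * x[d]
    adapted = []
    for i, (weight, mean, var) in enumerate(world_gmm):
        denom = counts[i] + relevance
        if denom > 0.0:
            new_mean = [(sums[i][d] + relevance * mean[d]) / denom
                        for d in range(dim)]
        else:
            new_mean = list(mean)  # no data and no prior weight: keep as is
        adapted.append((weight, new_mean, list(var)))
    return adapted
```

Components that see little client data stay close to the world model, which is what makes adaptation workable with short enrolment recordings.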

Adaptation

• Variable number of degrees of freedom
• Variable partitioning of the distributions
• After each E step of EM, a partitioning that gives a sufficient amount of data per class

[Figure: tree partitioning of the distributions, with node values 56; 33, 23; 12, 21, 17, 6; 12, 9]

Hierarchical - MLLR adapted System

National Institute of Standards & Technology (NIST)

Speaker Verification Evaluations

• Annual evaluation since 1995
• Common paradigm for comparing technologies

NIST evaluations: overview

• Recognized standard for evaluating speaker verification systems
• Several hundred different speakers, several tens of thousands of test accesses
• Participation of the world's best laboratories: MIT, IBM, Nuance, …
• ENST has participated since 1997

NIST evaluations: protocol

• Training phase
  • 2 minutes of spontaneous speech
  • Telephone conditions, cellular network
• Test phase
  • File durations from 5 s to 50 s of spontaneous speech

NIST evaluations: results

• The results are presented and discussed at an annual workshop.
• Steady improvement in ENST's performance (18% to 9%) despite increasing difficulty: shorter training duration, and a move from the switched network to the cellular network.

NIST evaluations: results

ENST 2003

Combining Speech Recognition and Speaker Verification.

• Speaker-independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

1.1 Speech Segmentation

• Large Vocabulary Continuous Speech Recognition (LVCSR)
  • needs a huge amount of transcribed speech data
  • language (and task) dependent
  • good results for a small set of languages (with existing AND available transcripts)
  • we do not have such a system
• Data-driven speech segmentation
  • not yet usable for speech recognition purposes
  • no annotated databases needed
  • language and task independent
  • we could use it to segment the speech data for a text-independent speaker verification task and for language identification
• ALISP (Automatic Language Independent Speech Processing) method

1.2 ALISP data-driven speech segmentation

3. Data-driven Speech Segmentation for Speaker Verification

• Current best speaker verification systems are based on Gaussian Mixture Models (each speech frame is treated independently, and no temporal information is taken into account)
• Improvements are still necessary
• Speech is composed of different sounds
  • Phonemes have different discriminant characteristics for speaker verification
  • Nasals and vowels convey more speaker characteristics than other speech classes
  • We would like to exploit this idea, but with data-driven ALISP units
• An automatic speech segmentation tool is needed

3.1 Advantages and disadvantages of the speech segmentation step

• Problems:
  • Need for an automatic speech segmentation tool
  • Speaker modeling per speech class => more data needed
  • More classes => more complicated systems
• Advantages:
  • Possibility to use it in combination with dialogue-based systems
  • Text-prompted speaker verification
  • Better accuracy if enough speech data is available

3.2 Proposed system: ALISP-based Segmental Speaker Verification using DTW

• Speaker-specific information is extracted from the ALISP-based speech segments => Client Dictionary
• Non-speaker (world speakers) ALISP-based speech segments => World Dictionary
• Dynamic Time Warping (DTW) was already used for speaker verification, but in a text-dependent mode
  • comparison of two speech data with a similar linguistic content
  • the DTW distance measure between two speech segments conveys some speaker-specific characteristics
• Originality: use DTW in text-independent mode
  • The speech data are first segmented into ALISP classes, in order to remove the linguistic variability
  • Measure the distances among speaker and non-speaker speech segments

3.3 Searching in client and world speech dictionaries for speaker verification purposes

3.4 Database and experimental setup for the speaker verification experiments

• Development data: NIST 2001 cellular data (American English)
  • world speakers (60 female + 59 male):
    • train the ALISP speech segmenter
    • model the non-speakers
• Evaluated on
  • a small subset (14 female + 14 male speakers) from the NIST 2001 cellular data
  • the full set of NIST 2002 cellular data (??? speakers)
• Speech parameterization: LPCC for the initial ALISP segmentation and MFCC afterward
• 64 ALISP speech classes

3.5 Results: example of data-driven speech segmentation for speaker verification

• Comparison of a manual transcription with the ALISP segmentation (“I think my my daughter”)
• 2 occurrences of the English phone sequence m - ay; corresponding ALISP sequences: HM-Hf-Ha and HM-Hz-Ha-HC

3.6 Results: another example of data-driven speech segmentation for speaker verification

• 2 other occurrences of the English phone ay; the corresponding ALISP sequences: HX-Hf and Hf-Ha
• Previous slide: Hf-Ha and Ha-Hz

3.7 Speaker Verification DET curves

3.8 Conclusions

• State-of-the-art NIST 2002 results for EER: best 8% to worst 28%
• Problem with the small data set results: influence of the size of the test set and/or mismatched train/test conditions
• What we have NOT done:
  • exploit the speech classes (silence classes are also included)
  • normalization (with pseudo-impostors)
  • exploit the DTW distance value, not only the “preference” result

SuperSID experiments

GMM with cepstral features

Selection of nasals in words in -ing: being, everything, getting, anything, thing, something, things, going

Fusion

Fusion results

Talking faces and identity verification

• Face and speech offer complementary information about a person's identity.
• Many PCs, PDAs and telephones are, and will be, equipped with a camera and a microphone.
• Imposture is more difficult to carry out.
• Research topic developed at ENST within the IST-SecurePhone project.

Talking faces and identity verification

• Sequence of digits (PIN code)
• Customized password

Fusion of speech and face

(thesis of Conrad Sanderson, August 2002)

1. Acquisition of the biometric signals for each modality
2. Computation of the decision score for each system
3. Computation of a final decision score based on the fusion of the single-modality scores
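The last step, combining the single-modality scores into one decision score, can be sketched as a weighted sum. This is only one possible fusion rule and is not stated on the slide: the weighting scheme, the default weight, the threshold and the function names are all assumptions, and the two scores are assumed to have been normalized to comparable ranges beforehand.

```python
def fuse_scores(speech_score, face_score, speech_weight=0.5):
    """Weighted-sum fusion of two single-modality decision scores.

    speech_weight is a hypothetical tuning parameter in [0, 1];
    the face score receives the remaining weight.
    """
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie between 0 and 1")
    return speech_weight * speech_score + (1.0 - speech_weight) * face_score


def fused_decision(speech_score, face_score, threshold=0.0, speech_weight=0.5):
    """Accept the identity claim when the fused score exceeds the threshold."""
    return fuse_scores(speech_score, face_score, speech_weight) > threshold
```

In practice the weight and the threshold would be tuned on development data, for example by lowering `speech_weight` when the audio channel is noisy.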

Example application

[Figure: insecure network; remote server: 1. access to secured services, 2. validation of transactions, 3. etc.]

Conclusions and Perspectives

• Speech allows identity verification over the telephone.
• Combining text-dependent and text-independent approaches improves reliability.
• If the face is used to verify identity, adding speech costs little (and pays off handsomely!).
• More and more PCs, PDAs and telephones are equipped with a microphone and a camera. Audio-visual recognition should become widespread.

Perspectives

Speech is often the only usable biometric modality (over the telephone network).

Fusion of modalities.

A number of R&D projects within the EU.
