
An Intro to Speaker Recognition
Nikki Mirghafori

Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and A. Stolcke.


Today’s class

•Interactive

•Measures of success for today:

•You talk at least as much as I do

•You learn and remember the basics

•You feel you can do this stuff

•We all have fun with the material!



A 10-minute “Project Design”

• You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition.

• The VC funding is yours, if you come up with some kind of a coherent plan/list of issues:

• What is your proposed application?

• What will be the sources of error and variability, i.e., technology challenges?

• What types of features will you use?

• What sorts of statistical modeling tools/techniques?

• What will be your data needs?

• Any other issues you can think of along your path?



Extracting Information from Speech

[Diagram: the speech signal feeds three systems: speech recognition outputs words (“How are you?”), language recognition outputs a language name (English), and speaker recognition outputs a speaker name (James Wilson)]

Goal: Automatically extract information transmitted in the speech signal

• What’s noise? what’s signal?

• Orthogonal in many ways

• Use many of the same models and tools


Speaker Recognition Applications

• Access control
– Physical facilities
– Data and data networks

• Transaction authentication
– Telephone credit card purchases
– Bank wire transfers
– Fraud detection

• Monitoring
– Remote time and attendance logging
– Home parole verification

• Information retrieval
– Customer information for call centers
– Audio indexing (speech skimming device)
– Personalization

• Forensics
– Voice sample matching



Tasks

•Identification vs. verification

•Closed set vs. open set identification

•Also, segmentation, clustering, tracking...


Closed-set Speaker Identification

[Diagram: test speech is scored against a speaker model database to answer “Whose voice is it?”; the answer must be one of the enrolled speakers]


Open-set Speaker Identification

[Diagram: as above, but “none of the above” is also a possible answer]


Verification/Authentication/Detection

[Diagram: the claimant asserts “It’s me!”; the test speech is scored against the claimed speaker’s model to answer “Does the voice match?” with yes/no]

Verification requires a claimant ID.


Speech Modalities

• Text-dependent recognition

– Recognition system knows text spoken by person

– Examples: fixed phrase, prompted phrase

– Used for applications with strong control over user input

– Knowledge of spoken text can improve system performance

• Text-independent recognition

– Recognition system does not know text spoken by person

– Examples: User selected phrase, conversational speech

– Used for applications with less control over user input

– More flexible system but also more difficult problem

– Speech recognition can provide knowledge of spoken text

– Text-constrained recognition: exercise for the reader (see next slide)


Text-constrained Recognition

•Basic idea: build speaker models for words rich in speaker information

•Example:

•“What time did you say? um... okay, I_think that’s a good plan.”

•Text-dependent strategy in a text-independent context


Voice as a biometric

• Biometric: a human-generated signal or attribute for authenticating a person’s identity

• Voice is a popular biometric:

– natural signal to produce

– does not require a specialized input device

– ubiquitous: telephones and microphone-equipped PCs

• Voice biometric can be combined with other forms of security:

– Something you have, e.g., a badge

– Something you know, e.g., a password

– Something you are, e.g., your voice

[Diagram: Venn of have/know/are; combining all three gives the strongest security]


How to build a system?

• Feature choices:

• low level (MFCC, PLP, LPC, F0, ...) and high level (words, phones, prosody, ...)

• Types of models:

• HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets

• Making decisions: Log Likelihood Thresholds, threshold setting for desired operating point

• Other issues: normalization (znorm, tnorm), optimal data selection to match expected conditions, channel variability, noise, etc.


Verification Performance

• There are many factors to consider in the design of an evaluation of a speaker verification system:

• Speech quality
– Channel and microphone characteristics
– Noise level and type
– Variability between enrollment and verification speech

• Speech modality
– Fixed/prompted/user-selected phrases
– Free text

• Speech duration
– Duration and number of sessions of enrollment and verification speech

• Speaker population
– Size and composition

• Most importantly: the evaluation data and design should match the target application domain of interest


Verification Performance

[DET plot: probability of false accept (%) vs. probability of false reject (%), for the four conditions listed below]

Increasing constraints improve performance:

• Text-dependent (combinations): clean data, single microphone, large amount of train/test speech

• Text-dependent (digit strings): telephone data, multiple microphones, small amount of training data

• Text-independent (conversational): telephone data, multiple microphones, moderate amount of training data

• Text-independent (read sentences): military radio data, multiple radios and microphones, moderate amount of training data


Verification Performance

[DET curve: probability of false accept (%) vs. probability of false reject (%)]

Example Performance Curve, with Equal Error Rate (EER) = 1%:

• Wire transfer (high-security operating point): false acceptance is very costly; users may tolerate rejections for security

• Customization (high-convenience operating point): false rejections alienate customers; any customization is beneficial

• The application operating point depends on the relative costs of the two error types; a balance lies in between (see the sketch below)
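As a concrete illustration, here is a minimal sketch of how the operating point falls out of the relative costs when scores are log likelihood ratios; the cost values and target prior below are assumed for illustration, not taken from the slides.

```python
import math

def llr_threshold(c_miss: float, c_fa: float, p_target: float) -> float:
    """Bayes-optimal threshold on the log likelihood ratio:
    accept when llr > log(C_fa * (1 - P_target) / (C_miss * P_target))."""
    return math.log(c_fa * (1.0 - p_target) / (c_miss * p_target))

# Illustrative settings (assumed, not from the slides):
# wire transfer: false accepts are very costly -> high threshold (high security)
print(llr_threshold(c_miss=1, c_fa=100, p_target=0.01))   # ~9.2
# customization: false rejects alienate customers -> low threshold (convenience)
print(llr_threshold(c_miss=100, c_fa=1, p_target=0.01))   # ~0.0
```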


Human vs. Machine


• Motivation for comparing human to machine:
– evaluating speech coders and potential forensic applications

• Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000)

– Same amount of training data

– Matched Handset-type tests

– Mismatched Handset-type tests

– Used 3-sec conversational utterances from telephone speech

[Chart: error rates; humans ranged from 44% better to 15% worse than the machine across conditions]


Features

• Desirable attributes of features for an automatic system (Wolf ’72):

• Practical:
– occur naturally and frequently in speech
– easily measurable

• Robust:
– do not change over time or with the speaker’s health
– not affected by reasonable background noise, and do not depend on specific transmission characteristics

• Secure:
– not subject to mimicry

• No feature has all these attributes


Training & Test Phases

[Diagram, enrollment phase: training speech for each speaker passes through feature extraction and model training, producing a model for each speaker]

[Diagram, recognition phase (e.g., verification): the claimant says “It’s me!”; the test speech passes through feature extraction and verification against the claimed model, and the decision is accepted or rejected]


Decision making

Verification decision approaches have roots in signal detection theory.

• Two-class hypothesis test:
H0: the speaker is an impostor
H1: the speaker is indeed the claimed speaker

• Statistic computed on test utterance S as a log likelihood ratio:

Λ(S) = log [ p(S | speaker model) / p(S | impostor model) ]

Accept if Λ(S) exceeds a threshold; otherwise reject.

[Diagram: feature extraction feeds both the speaker model (+) and the impostor model (−); the decision module compares the difference of the two log likelihoods against a threshold to accept or reject]


Decision making

• Identification: pick the model (of N) with the best score

• Verification: the usual approach is a likelihood ratio test (hypothesis testing):

• By Bayes’ rule:

P(target | x) / P(nontarget | x) = [P(x | target) P(target)] / [P(x | nontarget) P(nontarget)]

• Accept if the ratio exceeds a threshold; reject otherwise

• We can’t sum over all non-target talkers (the whole world, for speaker verification!), so either:

• Use “cohorts” (a collection of impostors), or

• Train a “universal”/“world”/“background” model (speaker-independent, trained on many speakers); both decision rules are sketched below
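A minimal sketch of the two decision rules, assuming per-speaker models that expose a score(x) log likelihood and a pre-trained background model; all names here are placeholders, not a fixed API.

```python
def identify(x, speaker_models):
    """Closed-set identification: pick the model (of N) with the best score."""
    scores = {name: m.score(x) for name, m in speaker_models.items()}
    return max(scores, key=scores.get)

def verify(x, claimed_model, background_model, threshold):
    """Verification: log likelihood ratio test against a background model
    (the "universal"/"world" model standing in for all non-target talkers)."""
    llr = claimed_model.score(x) - background_model.score(x)
    return llr > threshold  # accept if True, reject otherwise
```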


Spectral Based Approach

• Traditional speaker recognition systems use:

• Cepstral features

• Gaussian Mixture Models (GMMs)

D.A. Reynolds, T.F. Quatieri, R.B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1-3), January/April/July 2000.

[Diagram: feature extraction = sliding window → Fourier transform → magnitude → log → cosine transform; the speaker model is adapted from a background model, and test speech is scored by the log likelihood ratio of the two; a front-end sketch follows]
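A minimal numpy sketch of the front end in the diagram (sliding window, FFT, magnitude, log, cosine transform); the frame sizes are assumed defaults, and note that a full MFCC front end would also insert a mel-scale filterbank.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(signal, sr=8000, frame_ms=25, hop_ms=10, n_ceps=13):
    """Sliding window -> FFT -> magnitude -> log -> cosine transform,
    as in the block diagram above. (A full MFCC front end would insert
    a mel-scale filterbank between the magnitude and log steps.)"""
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hamming(flen)
    frames = [signal[i:i + flen] * window
              for i in range(0, len(signal) - flen, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))   # magnitude spectrum
    logmags = np.log(mags + 1e-10)                         # compress dynamic range
    return dct(logmags, type=2, axis=1, norm='ortho')[:, :n_ceps]
```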


Features: Levels of Information

Hierarchy of perceptual cues, from high-level cues (learned behaviors) down to low-level cues (physical characteristics):

• Semantic, dialogic, idiolectal: semantics, idiolect, pronunciations, idiosyncrasies (shaped by socio-economic status, education, place of birth)

• Phonetic, prosodic: prosody, rhythm, speed, intonation, volume modulation (shaped by personality type, parental influence)

• Spectral: acoustic aspects of speech such as nasal, deep, breathy, or rough voice quality (determined by the anatomical structure of the vocal apparatus)


Low level features

•Speech production model: source-filter interaction

•Anatomical structure (vocal tract/glottis) conveyed in speech spectrum

[Diagram: glottal pulses (source) filtered by the vocal tract produce the speech signal]


Word N-gram Features


Idea (Doddington 2001):

•Word usage can be idiosyncratic to a speaker

•Model speakers by relative frequencies of word N-grams

•Reflects vocabulary AND grammar

•Cf. similar approaches for authorship and plagiarism detection on text documents.

•First (unpublished) use in speaker recognition: Heck et al. (1998)

Implementation:

•Get 1-best word recognition output

•Extract N-gram frequencies

•Model likelihood ratio OR

•Model frequency vectors by SVM

Example relative frequencies (sketched in code below):

I_shall 0.002
I_think 0.025
I_would 0.012
… …
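A minimal sketch of the implementation steps above: take 1-best word recognition output and extract bigram relative frequencies; the helper name and example input are illustrative.

```python
from collections import Counter

def bigram_frequencies(words):
    """Relative frequencies of word bigrams from 1-best recognition output,
    e.g. {'I_think': 0.025, ...}; reflects vocabulary AND grammar."""
    bigrams = ['_'.join(pair) for pair in zip(words, words[1:])]
    counts = Counter(bigrams)
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

print(bigram_frequencies("i think that i think is fine".split()))
```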


Phone N-gram features

[Diagram: open-loop phone recognition produces a phone lattice; relative frequencies of phone n-grams (e.g., jh 0.0254, zh_eh 0.0068, k 0.0198) form fixed-length vectors that a support vector machine (SVM) scores]

Model the pattern of phone usage or “short term pronunciation” for a speaker
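A hedged sketch of the SVM stage, using sklearn's LinearSVC on phone n-gram frequency vectors; the numbers are the illustrative values from the figure, not real data.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Rows are per-utterance phone n-gram relative-frequency vectors, as in the
# figure above; values here are illustrative stand-ins.
target = np.array([[0.0254, 0.0068, 0.0198]])              # one enrollment sample
impostors = np.array([[0.0001, 0.8827, 0.7264],
                      [0.0329, 0.2847, 0.2983]])            # background samples

X = np.vstack([target, impostors])
y = np.array([+1, -1, -1])
svm = LinearSVC(C=1.0).fit(X, y)

test = np.array([[0.0200, 0.0100, 0.0150]])
score = svm.decision_function(test)[0]   # signed distance to the hyperplane
print('accept' if score > 0 else 'reject', score)
```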


MLLR transform vectors as features

[Diagram: for each phone class, an affine MLLR transform maps the speaker-independent models to speaker-dependent ones; the transform parameters themselves are used as features]

MLLR Transforms = Features
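A sketch of the idea, under the assumption that each phone class contributes one affine transform (A, b); the shapes and the flattening scheme are illustrative, not the exact recipe from the slides.

```python
import numpy as np

def mllr_feature_vector(transforms):
    """Stack per-phone-class MLLR transforms [A | b] into one fixed-length
    vector usable as an SVM input; a sketch with assumed shapes."""
    return np.concatenate([np.hstack([A, b[:, None]]).ravel()
                           for A, b in transforms])

# e.g. two phone classes, 39-dimensional models (placeholder transforms):
d = 39
transforms = [(np.eye(d), np.zeros(d)) for _ in range(2)]
print(mllr_feature_vector(transforms).shape)   # (2 * 39 * 40,) = (3120,)
```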


Models

• HMMs:

• text-dependent (could use whole word/phone models)

• prompted (phone models)

• text-independent (use LVCSR), or GMMs!

• Templates: DTW (if text-dependent system)

• Nearest neighbor: frame level, training data as the “model”, non-parametric

• Neural nets: train explicitly discriminative models

• SVMs


Speaker Models -- HMM

• Speaker models (voiceprints) represent voice biometric in compact and generalizable form

[Diagram: left-to-right HMM for the word “had” (h-a-d)]

• Modern speaker verification systems use Hidden Markov Models (HMMs)

– HMMs are statistical models of how a speaker produces sounds

– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.

– Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties.


Speaker Models: HMM/GMM

The form of the HMM depends on the application:

• Fixed phrase (“Open sesame”): word/phrase models

• Prompted phrases/passwords (/s/ /i/ /x/): phoneme models

• General speech (text-independent): single-state HMM, i.e., a GMM (a training sketch follows)
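A minimal sketch of the text-independent case using sklearn's GaussianMixture (EM training, as noted on the previous slide); the feature matrix here is a random placeholder for real cepstral frames from the front end sketched earlier.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Text-independent case: a single-state HMM is just a GMM over frames.
rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 13))          # placeholder cepstral frames

speaker_gmm = GaussianMixture(n_components=32, covariance_type='diag')
speaker_gmm.fit(features)                       # EM training

test_frames = rng.normal(size=(300, 13))
print(speaker_gmm.score(test_frames))           # avg per-frame log likelihood
```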


Word N-gram Modeling: Likelihood Ratios

Score = (1/N) * Σ_j log [ P_speaker(token_j) / P_background(token_j) ]

• Model N-gram token log likelihood ratio

• Numerator: speaker language model estimated from enrollment data

• Denominator: background language model estimated from large speaker population

• Normalize by token count

• Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both (a scoring sketch follows)
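A minimal sketch of the score above; the language models are assumed to be plain token-to-probability dicts, already smoothed so every token has nonzero background probability.

```python
import math

def ngram_llr_score(tokens, speaker_lm, background_lm):
    """Per-token average log likelihood ratio over N-gram tokens:
    Score = (1/N) * sum_j log(P_speaker(t_j) / P_background(t_j)).
    The LM dicts are assumed inputs (smoothed probabilities)."""
    llr = sum(math.log(speaker_lm[t] / background_lm[t]) for t in tokens)
    return llr / len(tokens)   # normalize by token count
```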


Speaker Recognition with SVMs


• Each speech sample (training or test) generates a point in a derived feature space

• The SVM is trained to separate the target sample from the impostor (= UBM) samples

• Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point

• SVM training is biased against misclassifying positive examples (typically very few, often just one)

[Diagram: background, target, and test sample points in the derived feature space, separated by the SVM decision hyperplane]


Feature Transforms for SVMs

• SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research: they allow great flexibility in the choice of features

• However, we need a “sequence kernel”

• Dominant approach: transform variable-length feature stream into fixed, finite-dimensional feature space

• Then use linear kernel

• All the action is in the feature transform!
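A hedged sketch tying the last two slides together: mean-and-std pooling as one possible fixed-dimensional feature transform (not the only one), a linear kernel, and class weighting to bias training against misclassifying the single positive example. All data here are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pool(frames):
    """One common 'sequence kernel' trick: map a variable-length frame
    stream to a fixed-dimensional vector (here mean + std pooling),
    then use a linear kernel on the pooled vectors."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(0)
X = np.array([pool(rng.normal(size=(n, 13))) for n in (300, 250, 400, 320)])
y = np.array([+1, -1, -1, -1])                 # one target sample, rest UBM

# Bias training against misclassifying the (single) positive example:
svm = LinearSVC(class_weight={+1: 100.0, -1: 1.0}).fit(X, y)
print(svm.decision_function(X[:1]))            # distance-to-hyperplane score
```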



Combination of Systems

• Systems work best in combination, especially ones using “higher level” features

• Need to estimate the optimal combination weights, e.g., using a neural network

• Combination weights are trained on a held-out development dataset

[Diagram: GMM, MLLR, word HMM, and phone N-gram system scores feed a neural network combiner; a minimal stand-in is sketched below]
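A minimal stand-in for the combiner, using logistic regression (a one-layer network) rather than a full neural network; the development scores and labels here are random placeholders for a real held-out set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: scores from the component systems (GMM, MLLR, word HMM,
# phone N-gram) for one trial; y: 1 = target, 0 = impostor. The combiner
# weights are trained on a held-out development set, as noted above.
rng = np.random.default_rng(0)
dev_scores = rng.normal(size=(200, 4))
dev_labels = rng.integers(0, 2, size=200)      # placeholder dev labels

combiner = LogisticRegression().fit(dev_scores, dev_labels)
trial = rng.normal(size=(1, 4))
print(combiner.predict_proba(trial)[0, 1])     # combined target probability
```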


Variability: The Achilles Heel

• Variability (extrinsic and intrinsic) in the spectrum can cause error

• The focus has mainly been on extrinsic variability

• “Channel” mismatch:

• Microphone: carbon-button, hands-free, ...

• Acoustic environment: office, car, airport, ...

• Transmission channel: landline, cellular, VoIP, ...

• Three compensation approaches: feature-based, model-based, score-based

[Chart, ’96 vs. ’99 evaluations: error rates with mismatched handsets went from a factor of 20 worse than matched to a factor of 2.5 worse; compensation techniques help reduce error]


NIST Speaker Verification Evaluations

• Annual NIST evaluations of speaker verification technology (since 1996)

• Aim: Provide a common paradigm for comparing technologies

• Focus: Conversational telephone speech (text-independent)

[Diagram: NIST coordinates the evaluation, the Linguistic Data Consortium provides the data, and technology developers compare technologies on a common task in an evaluate/improve cycle]


The NIST Evaluation Task

• Conversational telephone speech, interviews

• Landline, cellular, hands-free, multiple microphones in the room

• 5 minutes of conversation between two speakers

• Various conditions, e.g.:

• Training: 8, 1, or another number of conversation sides

• Test: 1 conversation side, 30 seconds, etc.

• Evaluation metrics (sketched in code below):

• Equal Error Rate (EER)

• Decision Cost Function (DCF): C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target), with (C_miss, C_fa, P_target) = (10, 1, 0.01)
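A minimal sketch of both metrics computed from raw target and impostor scores; the threshold sweep is the naive version for clarity, not an optimized implementation.

```python
import numpy as np

def eer_and_dcf(target_scores, impostor_scores,
                c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep a threshold over all scores; report the equal error rate and
    the minimum DCF = C_miss*P_miss*P_tgt + C_fa*P_fa*(1 - P_tgt)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return (p_miss[eer_idx] + p_fa[eer_idx]) / 2, dcf.min()
```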


The End

• What’s one interesting thing you learned today that you might share with a friend over dinner?



Backup slides



Word Conditional Models: Example

• Boakye et al. (2004)

• 19 words and bigrams:

• Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}

• Filled pauses: {um, uh}

• Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}

• Trained whole-word HMMs, instead of GMMs, to model the evolution of speech in time

• Combines well with the low-level (i.e., cepstral GMM) system, especially with more training data


Phone N-Grams: Example

• Idea (Hatch et al., ’05): model the pattern of phone usage or “short-term pronunciation” for a speaker

• Use open-loop phone recognition to obtain phone hypotheses

•Create models of relative frequencies of phone n-grams of the speaker vs. “others”

•Use SVM for modeling

• Combines well, especially with increased data

•Works across languages