
An Intro to Speaker Recognition
Nikki Mirghafori

Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and A. Stolcke.


Today’s class

•Interactive

•Measures of success for today:

•You talk at least as much as I do

•You learn and remember the basics

•You feel you can do this stuff

•We all have fun with the material!



A 10-minute “Project Design”

• You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition.

• The VC funding is yours, if you come up with some kind of a coherent plan/list of issues:

• What is your proposed application?

• What will be the sources of error and variability, i.e., technology challenges?

• What types of features will you use?

• What sorts of statistical modeling tools/techniques?

• What will be your data needs?

• Any other issues you can think of along your path?



Extracting Information from Speech

[Diagram: the speech signal feeds three systems: speech recognition outputs words (“How are you?”), language recognition outputs a language name (English), and speaker recognition outputs a speaker name (James Wilson)]

Goal: Automatically extract information transmitted in the speech signal

• What’s noise? what’s signal?

• Orthogonal in many ways

• Use many of the same models and tools


Speaker Recognition Applications

• Access control
– Physical facilities
– Data and data networks

• Transaction authentication
– Telephone credit card purchases
– Bank wire transfers
– Fraud detection

• Monitoring
– Remote time and attendance logging
– Home parole verification

• Information retrieval
– Customer information for call centers
– Audio indexing (speech skimming device)
– Personalization

• Forensics
– Voice sample matching



Tasks

•Identification vs. verification

•Closed set vs. open set identification

•Also, segmentation, clustering, tracking...


Closed-set Speaker Identification

[Diagram: test speech is scored against a speaker model database to answer “Whose voice is it?”; the answer must be one of the enrolled speakers]


Open-set Speaker Identification

[Diagram: as above, but “none of the above” is also a possible answer]


Verification/Authentication/Detection

[Diagram: the claimant asserts “It’s me!”; the test speech is scored against the claimed speaker’s model to answer “Does the voice match?” with yes/no]

Verification requires a claimant ID.


Speech Modalities

• Text-dependent recognition

– Recognition system knows text spoken by person

– Examples: fixed phrase, prompted phrase

– Used for applications with strong control over user input

– Knowledge of spoken text can improve system performance

• Text-independent recognition

– Recognition system does not know text spoken by person

– Examples: User selected phrase, conversational speech

– Used for applications with less control over user input

– More flexible system but also more difficult problem

– Speech recognition can provide knowledge of spoken text

– Text-constrained recognition: exercise for the reader (see next slide)


Text-constrained Recognition

•Basic idea: build speaker models for words rich in speaker information

•Example:

•“What time did you say? um... okay, I_think that’s a good plan.”

•Text-dependent strategy in a text-independent context


Voice as a biometric

• Biometric: a human-generated signal or attribute for authenticating a person’s identity

• Voice is a popular biometric:

– natural signal to produce

– does not require a specialized input device

– ubiquitous: telephones and microphone-equipped PCs

• Voice biometric can be combined with other forms of security:

– Something you have, e.g., a badge

– Something you know, e.g., a password

– Something you are, e.g., your voice

[Diagram: Venn of have/know/are; combining all three gives the strongest security]


How to build a system?

• Feature choices:

• low level (MFCC, PLP, LPC, F0, ...) and high level (words, phones, prosody, ...)

• Types of models:

• HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets

• Making decisions: Log Likelihood Thresholds, threshold setting for desired operating point

• Other issues: normalization (znorm, tnorm), optimal data selection to match expected conditions, channel variability, noise, etc.


Verification Performance

• There are many factors to consider in the design of an evaluation of a speaker verification system:

• Speech quality
– Channel and microphone characteristics
– Noise level and type
– Variability between enrollment and verification speech

• Speech modality
– Fixed/prompted/user-selected phrases
– Free text

• Speech duration
– Duration and number of sessions of enrollment and verification speech

• Speaker population
– Size and composition

• Most importantly: the evaluation data and design should match the target application domain of interest


Verification Performance

[DET plot: probability of false accept (%) vs. probability of false reject (%), for the four conditions listed below]

Increasing constraints improve performance:

• Text-dependent (combinations): clean data, single microphone, large amount of train/test speech

• Text-dependent (digit strings): telephone data, multiple microphones, small amount of training data

• Text-independent (conversational): telephone data, multiple microphones, moderate amount of training data

• Text-independent (read sentences): military radio data, multiple radios and microphones, moderate amount of training data


Verification Performance

[DET curve: probability of false accept (%) vs. probability of false reject (%)]

Example Performance Curve, with Equal Error Rate (EER) = 1%:

• Wire transfer (high-security operating point): false acceptance is very costly; users may tolerate rejections for security

• Customization (high-convenience operating point): false rejections alienate customers; any customization is beneficial

• The application operating point depends on the relative costs of the two error types; a balance lies in between (see the sketch below)
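As a concrete illustration, here is a minimal sketch of how the operating point falls out of the relative costs when scores are log likelihood ratios; the cost values and target prior below are assumed for illustration, not taken from the slides.

```python
import math

def llr_threshold(c_miss: float, c_fa: float, p_target: float) -> float:
    """Bayes-optimal threshold on the log likelihood ratio:
    accept when llr > log(C_fa * (1 - P_target) / (C_miss * P_target))."""
    return math.log(c_fa * (1.0 - p_target) / (c_miss * p_target))

# Illustrative settings (assumed, not from the slides):
# wire transfer: false accepts are very costly -> high threshold (high security)
print(llr_threshold(c_miss=1, c_fa=100, p_target=0.01))   # ~9.2
# customization: false rejects alienate customers -> low threshold (convenience)
print(llr_threshold(c_miss=100, c_fa=1, p_target=0.01))   # ~0.0
```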


Human vs. Machine


• Motivation for comparing human to machine:
– evaluating speech coders and potential forensic applications

• Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000)

– Same amount of training data

– Matched Handset-type tests

– Mismatched Handset-type tests

– Used 3-sec conversational utterances from telephone speech

[Chart: error rates; humans ranged from 44% better to 15% worse than the machine across conditions]


Features

• Desirable attributes of features for an automatic system (Wolf ’72):

• Practical:
– occur naturally and frequently in speech
– easily measurable

• Robust:
– do not change over time or with the speaker’s health
– not affected by reasonable background noise, and do not depend on specific transmission characteristics

• Secure:
– not subject to mimicry

• No feature has all these attributes


Training & Test Phases

[Diagram, enrollment phase: training speech for each speaker passes through feature extraction and model training, producing a model for each speaker]

[Diagram, recognition phase (e.g., verification): the claimant says “It’s me!”; the test speech passes through feature extraction and verification against the claimed model, and the decision is accepted or rejected]


Decision making

Verification decision approaches have roots in signal detection theory.

• Two-class hypothesis test:
H0: the speaker is an impostor
H1: the speaker is indeed the claimed speaker

• Statistic computed on test utterance S as a log likelihood ratio:

Λ(S) = log [ p(S | speaker model) / p(S | impostor model) ]

Accept if Λ(S) exceeds a threshold; otherwise reject.

[Diagram: feature extraction feeds both the speaker model (+) and the impostor model (−); the decision module compares the difference of the two log likelihoods against a threshold to accept or reject]


Decision making

• Identification: pick the model (of N) with the best score

• Verification: the usual approach is a likelihood ratio test (hypothesis testing):

• By Bayes’ rule:

P(target | x) / P(nontarget | x) = [P(x | target) P(target)] / [P(x | nontarget) P(nontarget)]

• Accept if the ratio exceeds a threshold; reject otherwise

• We can’t sum over all non-target talkers (the whole world, for speaker verification!), so either:

• Use “cohorts” (a collection of impostors), or

• Train a “universal”/“world”/“background” model (speaker-independent, trained on many speakers); both decision rules are sketched below
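A minimal sketch of the two decision rules, assuming per-speaker models that expose a score(x) log likelihood and a pre-trained background model; all names here are placeholders, not a fixed API.

```python
def identify(x, speaker_models):
    """Closed-set identification: pick the model (of N) with the best score."""
    scores = {name: m.score(x) for name, m in speaker_models.items()}
    return max(scores, key=scores.get)

def verify(x, claimed_model, background_model, threshold):
    """Verification: log likelihood ratio test against a background model
    (the "universal"/"world" model standing in for all non-target talkers)."""
    llr = claimed_model.score(x) - background_model.score(x)
    return llr > threshold  # accept if True, reject otherwise
```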


Spectral Based Approach

• Traditional speaker recognition systems use:

• Cepstral features

• Gaussian Mixture Models (GMMs)

D.A. Reynolds, T.F. Quatieri, R.B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1-3), January/April/July 2000.

[Diagram: feature extraction = sliding window → Fourier transform → magnitude → log → cosine transform; the speaker model is adapted from a background model, and test speech is scored by the log likelihood ratio of the two; a front-end sketch follows]
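A minimal numpy sketch of the front end in the diagram (sliding window, FFT, magnitude, log, cosine transform); the frame sizes are assumed defaults, and note that a full MFCC front end would also insert a mel-scale filterbank.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(signal, sr=8000, frame_ms=25, hop_ms=10, n_ceps=13):
    """Sliding window -> FFT -> magnitude -> log -> cosine transform,
    as in the block diagram above. (A full MFCC front end would insert
    a mel-scale filterbank between the magnitude and log steps.)"""
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hamming(flen)
    frames = [signal[i:i + flen] * window
              for i in range(0, len(signal) - flen, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))   # magnitude spectrum
    logmags = np.log(mags + 1e-10)                         # compress dynamic range
    return dct(logmags, type=2, axis=1, norm='ortho')[:, :n_ceps]
```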


Features: Levels of Information

Hierarchy of perceptual cues, from high-level cues (learned behaviors) down to low-level cues (physical characteristics):

• Semantic, dialogic, idiolectal: semantics, idiolect, pronunciations, idiosyncrasies (shaped by socio-economic status, education, place of birth)

• Phonetic, prosodic: prosody, rhythm, speed, intonation, volume modulation (shaped by personality type, parental influence)

• Spectral: acoustic aspects of speech such as nasal, deep, breathy, or rough voice quality (determined by the anatomical structure of the vocal apparatus)


Low level features

•Speech production model: source-filter interaction

•Anatomical structure (vocal tract/glottis) conveyed in speech spectrum

[Diagram: glottal pulses (source) filtered by the vocal tract produce the speech signal]


Word N-gram Features


Idea (Doddington 2001):

•Word usage can be idiosyncratic to a speaker

•Model speakers by relative frequencies of word N-grams

•Reflects vocabulary AND grammar

•Cf. similar approaches for authorship and plagiarism detection on text documents.

•First (unpublished) use in speaker recognition: Heck et al. (1998)

Implementation:

•Get 1-best word recognition output

•Extract N-gram frequencies

•Model likelihood ratio OR

•Model frequency vectors by SVM

Example relative frequencies (sketched in code below):

I_shall 0.002
I_think 0.025
I_would 0.012
… …
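A minimal sketch of the implementation steps above: take 1-best word recognition output and extract bigram relative frequencies; the helper name and example input are illustrative.

```python
from collections import Counter

def bigram_frequencies(words):
    """Relative frequencies of word bigrams from 1-best recognition output,
    e.g. {'I_think': 0.025, ...}; reflects vocabulary AND grammar."""
    bigrams = ['_'.join(pair) for pair in zip(words, words[1:])]
    counts = Counter(bigrams)
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

print(bigram_frequencies("i think that i think is fine".split()))
```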


Phone N-gram features

[Diagram: open-loop phone recognition produces a phone lattice; relative frequencies of phone n-grams (e.g., jh 0.0254, zh_eh 0.0068, k 0.0198) form fixed-length vectors that a support vector machine (SVM) scores]

Model the pattern of phone usage or “short term pronunciation” for a speaker
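A hedged sketch of the SVM stage, using sklearn's LinearSVC on phone n-gram frequency vectors; the numbers are the illustrative values from the figure, not real data.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Rows are per-utterance phone n-gram relative-frequency vectors, as in the
# figure above; values here are illustrative stand-ins.
target = np.array([[0.0254, 0.0068, 0.0198]])              # one enrollment sample
impostors = np.array([[0.0001, 0.8827, 0.7264],
                      [0.0329, 0.2847, 0.2983]])            # background samples

X = np.vstack([target, impostors])
y = np.array([+1, -1, -1])
svm = LinearSVC(C=1.0).fit(X, y)

test = np.array([[0.0200, 0.0100, 0.0150]])
score = svm.decision_function(test)[0]   # signed distance to the hyperplane
print('accept' if score > 0 else 'reject', score)
```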


MLLR transform vectors as features

[Diagram: for each phone class, an affine MLLR transform maps the speaker-independent models to speaker-dependent ones; the transform parameters themselves are used as features]

MLLR Transforms = Features
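A sketch of the idea, under the assumption that each phone class contributes one affine transform (A, b); the shapes and the flattening scheme are illustrative, not the exact recipe from the slides.

```python
import numpy as np

def mllr_feature_vector(transforms):
    """Stack per-phone-class MLLR transforms [A | b] into one fixed-length
    vector usable as an SVM input; a sketch with assumed shapes."""
    return np.concatenate([np.hstack([A, b[:, None]]).ravel()
                           for A, b in transforms])

# e.g. two phone classes, 39-dimensional models (placeholder transforms):
d = 39
transforms = [(np.eye(d), np.zeros(d)) for _ in range(2)]
print(mllr_feature_vector(transforms).shape)   # (2 * 39 * 40,) = (3120,)
```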


Models

• HMMs:

• text-dependent (could use whole word/phone models)

• prompted (phone models)

• text-independent (use LVCSR), or GMMs!

• Templates: DTW (if text-dependent system)

• Nearest neighbor: frame level, training data as the “model”, non-parametric

• Neural nets: train explicitly discriminative models

• SVMs


Speaker Models -- HMM

• Speaker models (voiceprints) represent voice biometric in compact and generalizable form

[Diagram: left-to-right HMM for the word “had” (h-a-d)]

• Modern speaker verification systems use Hidden Markov Models (HMMs)

– HMMs are statistical models of how a speaker produces sounds

– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.

– Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties.


Speaker Models: HMM/GMM

The form of the HMM depends on the application:

• Fixed phrase (“Open sesame”): word/phrase models

• Prompted phrases/passwords (/s/ /i/ /x/): phoneme models

• General speech (text-independent): single-state HMM, i.e., a GMM (a training sketch follows)
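A minimal sketch of the text-independent case using sklearn's GaussianMixture (EM training, as noted on the previous slide); the feature matrix here is a random placeholder for real cepstral frames from the front end sketched earlier.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Text-independent case: a single-state HMM is just a GMM over frames.
rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 13))          # placeholder cepstral frames

speaker_gmm = GaussianMixture(n_components=32, covariance_type='diag')
speaker_gmm.fit(features)                       # EM training

test_frames = rng.normal(size=(300, 13))
print(speaker_gmm.score(test_frames))           # avg per-frame log likelihood
```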


Word N-gram Modeling: Likelihood Ratios

Score = (1/N) * Σ_j log [ P_speaker(token_j) / P_background(token_j) ]

• Model N-gram token log likelihood ratio

• Numerator: speaker language model estimated from enrollment data

• Denominator: background language model estimated from large speaker population

• Normalize by token count

• Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both (a scoring sketch follows)
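A minimal sketch of the score above; the language models are assumed to be plain token-to-probability dicts, already smoothed so every token has nonzero background probability.

```python
import math

def ngram_llr_score(tokens, speaker_lm, background_lm):
    """Per-token average log likelihood ratio over N-gram tokens:
    Score = (1/N) * sum_j log(P_speaker(t_j) / P_background(t_j)).
    The LM dicts are assumed inputs (smoothed probabilities)."""
    llr = sum(math.log(speaker_lm[t] / background_lm[t]) for t in tokens)
    return llr / len(tokens)   # normalize by token count
```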


Speaker Recognition with SVMs


• Each speech sample (training or test) generates a point in a derived feature space

• The SVM is trained to separate the target sample from the impostor (= UBM) samples

• Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point

• SVM training is biased against misclassifying positive examples (typically very few, often just one)

[Diagram: background, target, and test sample points in the derived feature space, separated by the SVM decision hyperplane]


Feature Transforms for SVMs

• SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research: they allow great flexibility in the choice of features

• However, we need a “sequence kernel”

• Dominant approach: transform variable-length feature stream into fixed, finite-dimensional feature space

• Then use linear kernel

• All the action is in the feature transform!
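A hedged sketch tying the last two slides together: mean-and-std pooling as one possible fixed-dimensional feature transform (not the only one), a linear kernel, and class weighting to bias training against misclassifying the single positive example. All data here are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pool(frames):
    """One common 'sequence kernel' trick: map a variable-length frame
    stream to a fixed-dimensional vector (here mean + std pooling),
    then use a linear kernel on the pooled vectors."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(0)
X = np.array([pool(rng.normal(size=(n, 13))) for n in (300, 250, 400, 320)])
y = np.array([+1, -1, -1, -1])                 # one target sample, rest UBM

# Bias training against misclassifying the (single) positive example:
svm = LinearSVC(class_weight={+1: 100.0, -1: 1.0}).fit(X, y)
print(svm.decision_function(X[:1]))            # distance-to-hyperplane score
```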



Combination of Systems

• Systems work best in combination, especially ones using “higher level” features

• Need to estimate the optimal combination weights, e.g., using a neural network

• Combination weights are trained on a held-out development dataset

[Diagram: GMM, MLLR, word HMM, and phone N-gram system scores feed a neural network combiner; a minimal stand-in is sketched below]
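A minimal stand-in for the combiner, using logistic regression (a one-layer network) rather than a full neural network; the development scores and labels here are random placeholders for a real held-out set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: scores from the component systems (GMM, MLLR, word HMM,
# phone N-gram) for one trial; y: 1 = target, 0 = impostor. The combiner
# weights are trained on a held-out development set, as noted above.
rng = np.random.default_rng(0)
dev_scores = rng.normal(size=(200, 4))
dev_labels = rng.integers(0, 2, size=200)      # placeholder dev labels

combiner = LogisticRegression().fit(dev_scores, dev_labels)
trial = rng.normal(size=(1, 4))
print(combiner.predict_proba(trial)[0, 1])     # combined target probability
```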


Variability: The Achilles Heel

• Variability (extrinsic and intrinsic) in the spectrum can cause error

• The focus has mainly been on extrinsic variability

• “Channel” mismatch:

• Microphone: carbon-button, hands-free, ...

• Acoustic environment: office, car, airport, ...

• Transmission channel: landline, cellular, VoIP, ...

• Three compensation approaches: feature-based, model-based, score-based

[Chart, ’96 vs. ’99 evaluations: error rates with mismatched handsets went from a factor of 20 worse than matched to a factor of 2.5 worse; compensation techniques help reduce error]


NIST Speaker Verification Evaluations

• Annual NIST evaluations of speaker verification technology (since 1996)

• Aim: Provide a common paradigm for comparing technologies

• Focus: Conversational telephone speech (text-independent)

[Diagram: NIST coordinates the evaluation, the Linguistic Data Consortium provides the data, and technology developers compare technologies on a common task in an evaluate/improve cycle]


The NIST Evaluation Task

• Conversational telephone speech, interviews

• Landline, cellular, hands-free, multiple microphones in the room

• 5 minutes of conversation between two speakers

• Various conditions, e.g.:

• Training: 8, 1, or another number of conversation sides

• Test: 1 conversation side, 30 seconds, etc.

• Evaluation metrics (sketched in code below):

• Equal Error Rate (EER)

• Decision Cost Function (DCF): C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target), with (C_miss, C_fa, P_target) = (10, 1, 0.01)
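A minimal sketch of both metrics computed from raw target and impostor scores; the threshold sweep is the naive version for clarity, not an optimized implementation.

```python
import numpy as np

def eer_and_dcf(target_scores, impostor_scores,
                c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep a threshold over all scores; report the equal error rate and
    the minimum DCF = C_miss*P_miss*P_tgt + C_fa*P_fa*(1 - P_tgt)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return (p_miss[eer_idx] + p_fa[eer_idx]) / 2, dcf.min()
```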


The End

• What’s one interesting thing you learned today that you might share with a friend over dinner?



Backup slides



Word Conditional Models: Example

• Boakye et al. (2004)

• 19 words and bigrams:

• Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}

• Filled pauses: {um, uh}

• Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}

• Trained whole-word HMMs, instead of GMMs, to model the evolution of speech in time

• Combines well with the low-level (i.e., cepstral GMM) system, especially with more training data


Phone N-Grams: Example

• Idea (Hatch et al., ’05): model the pattern of phone usage or “short-term pronunciation” for a speaker

• Use open-loop phone recognition to obtain phone hypotheses

•Create models of relative frequencies of phone n-grams of the speaker vs. “others”

•Use SVM for modeling

• Combines well, especially with increased data

•Works across languages