speaker recognition systems: paradigms and challenges

Center for Speech and Language Technologies, Tsinghua University

Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Plan 2012-2013

Speaker Recognition Systems: Paradigms and Challenges

Thomas Fang Zheng

Joint work with: Linlin Wang and Xiaojun Wu

<Date>, <Venue>


About APSIPA: Asia-Pacific Signal & Information Processing Association

An emerging association to promote a broad spectrum of research and education activities in signal and information processing (SIP)

Mission: a non-profit organization with the following objectives:
Providing education, research, and development exchange platforms for both academia and industry;
Organizing common-interest activities for researchers and practitioners;
Facilitating collaboration with region-specific focuses and promoting leadership for worldwide events;
Disseminating research results and educational material via publications, presentations, and electronic media;
Offering personal and professional career opportunities with development information and networking.

Established on October 5, 2009; officially registered in Hong Kong
APSIPA ASC (Annual Summit and Conference), starting from 2009
APSIPA Transactions on Signal & Information Processing
APSIPA Distinguished Lecture Program, starting from Jan. 2012

http://www.apsipa.org



Outline

Introduction

Creation of Time-varying Voiceprint Database

The Discrimination-emphasized Mel-frequency-warping Method

Experimental Results

Conclusions & Future Work


Biometric Recognition

Technologies for measuring and analyzing a person's physiological or behavioral characteristics. These can be used to verify or identify a person.

The term "biometrics" is derived from the Greek words bio (life) and metric (to measure).



Examples of Biometrics

Face, Fingerprint, Palmprint, Hand Geometry, Iris, Retina Scan, DNA, Signatures, Gait, Keystroke, Voiceprint


Rich Information Contained in Speech

Language Recognition

What language was spoken?

Accent Recognition

Where is he/she from?

Speech Recognition

What was spoken?

Gender Recognition

Male or Female?

Emotion Recognition

Positive? Negative? Happy? Sad?

Speaker Recognition

Who spoke?


Speaker Recognition / Voiceprint Recognition

Speaker recognition (or voiceprint recognition) is the process of automatically identifying or verifying the identity of a person from his/her voice, using the characteristic vocal information included in speech. It enables access control of various services by voice. [Kunzel 94][Furui 97]

Various applications:
Access control (e.g., security control for confidential information, remote access to computers, information and reservation services);
Transaction authentication (e.g., telephone banking, telephone shopping);
Security and forensics (e.g., public security, criminal verification);
Rich transcription of conference meetings (e.g., "Who Spoke When" and "Who Spoke What" speaker diarization); etc.


Speaker Recognition Categories

Speaker Identification
Determining which identity in a specified speaker set is speaking during a given speech segment. Closed-Set / Open-Set

Speaker Verification
Determining whether a claimed identity is speaking during a speech segment. It is a binary decision task.

Speaker Detection
Determining whether a specified target speaker is speaking during a given speech segment.

Speaker Tracking (Speaker Diarization = Who Spoke When)
Performing speaker detection as a function of time, giving the timing index of the specified speaker.


Performance Evaluation (for verification and open-set identification)

Detection Error Trade-off (DET) Curve
A plot of error rates for binary classification systems, plotting false rejection rate (FRR) vs. false acceptance rate (FAR).

Equal Error Rate (EER)
The error rate corresponding to the location on a DET curve where FAR and FRR are equal.

Minimum Detection Cost Function (MinDCF)
The minimum, over all decision thresholds, of
$C_{det} = C_{miss} \times P_{miss} \times P_{target} + C_{FalseAlarm} \times P_{FalseAlarm} \times (1 - P_{target})$
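The measures on this slide can be computed from lists of target and impostor trial scores. The following is a minimal sketch, not the evaluation code used in this work: the function names, the convention that a trial is accepted when its score is at least the threshold, and the cost parameters (C_miss = 10, C_FalseAlarm = 1, P_target = 0.01) are illustrative assumptions.

```python
import numpy as np

def error_rates(target_scores, impostor_scores):
    """Return (thresholds, FRR, FAR), sweeping the threshold over every observed score."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    # Assumed convention: a trial is accepted when score >= threshold.
    frr = np.array([(target < t).mean() for t in thresholds])     # missed targets
    far = np.array([(impostor >= t).mean() for t in thresholds])  # accepted impostors
    return thresholds, frr, far

def equal_error_rate(frr, far):
    """Approximate EER: the operating point where |FAR - FRR| is smallest."""
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

def min_dcf(frr, far, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum over thresholds of Cmiss*Pmiss*Ptarget + Cfa*Pfa*(1 - Ptarget)."""
    cost = c_miss * frr * p_target + c_fa * far * (1.0 - p_target)
    return cost.min()

# Toy scores, made up for illustration only.
target = [2.0, 3.0, 4.0, 5.0]
impostor = [0.0, 1.0, 1.5, 2.5]
_, frr, far = error_rates(target, impostor)
eer = equal_error_rate(frr, far)
dcf = min_dcf(frr, far)
print(f"EER = {eer:.3f}, MinDCF = {dcf:.3f}")
```

Plotting `frr` against `far` (usually on normal-deviate axes) gives the DET curve; the EER is the point where that curve crosses the diagonal.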



Open Issues for Speaker Recognition Research [Furui 1997]

1. How can human beings correctly recognize speakers?
2. Is it useful to study the mechanism of speaker recognition by human beings?
3. Is it useful to study the physiological mechanism of speech production to get new ideas for speaker recognition?
4. What feature parameters are appropriate for speaker recognition?
5. How can we fully exploit the clearly evident encoding of identity in prosody and other supra-segmental features of speech?
6. Is there any feature that can separate speakers whose voices sound identical, such as twins or imitators?
7. How do we deal with long-term variability in people's voices (ageing)?
8. How do we deal with short-term alteration due to illness, emotion, fatigue, …?
9. What are the conditions that speaker recognition must satisfy to be practical?
10. What about combining speech and speaker recognition?

Furui, S., "Recent Advances in Speaker Recognition," Pattern Recognition Letters 18 (1997) 859-872


Performance Factors for Speaker Recognition

Factors affecting the speaker recognition system performance:
The quality of the speech signal
The length of the training speech signal
The length of the testing speech signal
The size of the population tested by the system
The phonetic content of the speech signal


Key Issues for Robust Speaker Recognition

Cross Channel

Multiple Speakers

Background Noise

Emotions

Short Utterance

Time-Varying (or Ageing)


Time-Varying (or Ageing) Issue

In all these typical situations, training and testing are usually separated by some period of time, which poses a possible threat to speaker recognition systems.

TIME GAP



“Ever-newer waters flow on those who step into the same rivers.” —— Heraclitus



Open Questions

“Does the voice of an adult change significantly with time? If so, how?” [Kersta 1962]

“How to deal with the long-term variability in people’s voice? Whether there was any systematic long-term variation that helped update speaker models to cope with the gradual changes in people’s voices? ” [Furui 1997]

“Voice changes over time, either in the short-term (at different times of day), the medium-term (times of the year), or in the long-term (with age).” [Bonastre et al. 2003]


Observations

Performance degradation in the presence of time intervals
The longer the separation between the training and the testing recordings, the worse the performance. [Soong et al. 1985]

A significant loss in accuracy (4~5% in EER) between two sessions separated by 3 months was reported [Kato & Shimizu 2003], and ageing was considered to be the cause [Hebert 2008].

Few researchers have pinned down the exact reasons behind this time-varying phenomenon.


More enrollment data -- a solution?

Using training data with a larger time span [Markel 1979]
Performance can be improved.
The enrollment is quite time-consuming!
In some situations, it is impractical to obtain such data!

Adding accepted testing/recognition speech segments to the previous enrollment data to retrain the speaker model [Beigi 2009, Beigi 2010]
Performance can be improved.
The initial training data must be kept for later use (storage-consuming)!


Ageing-dependent decision boundary -- a solution?

Using an ageing-dependent decision boundary in the score domain [Kelly 2011, Kelly 2012]
Performance can be improved.
How can the time lapse be determined in practice?


Model-updating (adaptation) -- a solution?

A simple and straightforward way [Lamel 2000, Beigi 2009, Beigi 2010]: update speaker models from time to time.

It is effective in keeping the models representative.

However, it is costly, user-unfriendly, and sometimes perhaps unrealistic.

And features matter.


Efforts in frequency domain …

The most essential way to stabilize performance is to extract exactly those acoustic features that are speaker-specific and, further, stable across sessions.

This has long been more of a dream!

Building some findings into existing techniques…
NUFCC [Lu & Dang 2007]: assign frequency bands different resolutions according to their discrimination sensitivity for speaker-specific information.



The idea of mel-frequency-warping!

To emphasize frequency bands that are more sensitive to speaker-specific information, yet not so sensitive to time-related session-specific information.

Identify frequency bands that reveal high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information.

Once these frequency bands are identified, more features can be extracted within them by means of frequency warping.

The Discrimination-emphasized Mel-frequency-warping method.








MARP Corpus

A proper longitudinal database is necessary. Time-related variability is the only focus.
The MARP corpus has been the only one published so far [Lawson 2009], though it contains other kinds of variability as well.

The MARP corpus
32 participants, 672 sessions from June 2005 to March 2008
10 minutes of free-flowing conversation for each session
“While the impact on speaker recognition accuracy between any two sessions is considerable, the long-term trend is statistically quite small.”
“The detrimental impact is clearly not a function of ageing or of the voice changing within this timeframe.”


In free-flowing conversations, speech contents are not fixed and a speaker’s emotion, speaking style, or engagement can be easily influenced by his/her partner.

Hence, creating a voiceprint database that focuses specifically on the time-varying effect in speaker recognition is imperative for both research and practical applications.


Database Design Principles

The time-varying effect is the only focus; therefore, other factors (recording equipment, software, conditions, environment, and so on) should be kept as constant as possible throughout all recording sessions.

In the database design, two major factors were carefully considered:
prompt text design, and
time interval design.


Fixed Prompt Texts

Speakers were asked to read fixed prompt texts rather than hold free-style conversations.

Prompt texts were designed to remain unchanged throughout all recording sessions.
To avoid, or at least reduce, the impact of speech content on speaker recognition accuracy.
In the form of sentences and isolated words.


100 Chinese sentences and 10 isolated Chinese words

The length of each sentence ranges from 8 to 30 Chinese characters, with an average of 15.

Each isolated Chinese word contains 2 to 5 Chinese characters and was read five times in each session.
Of the 10 isolated words, 5 were unchanged throughout all sessions, just like the sentences, while the other 5 changed from session to session and are reserved for future research with other purposes.


Unit       Number covered in prompt texts   Total number   Percentage (%)
Initials   23                               23             100
Finals     38                               38             100
di-IFs     1,183                            1,523          78

Table 1. Acoustic coverage of prompt texts


Gradient Time Intervals

Gradient time intervals were used.
There was no precedent to refer to for time-interval design.
It is costly and perhaps unnecessary to record at a fixed-length time interval more than 10 times to obtain a possible trend.

Initial sessions can be separated by shorter time intervals, and later sessions by longer and longer ones.
The impact of different time intervals can then be easily analyzed.


16 sessions from January 2010 to 2012
Five different time intervals are used: one week, one month, two months, four months, and half a year, as illustrated in Figure 1.

The time intervals are designed specifically to avoid recording during summer or winter vacations.

In actual recording, it is unrealistic to make all speakers record on exactly one specific day, so each session day is relaxed to a session interval.

Figure 1. Illustration of different time intervals and session days (axis: time; marks: sessions)


Speakers

60 first-year students: 30 male and 30 female.

Born between 1989 and 1993, with the majority born in 1990.

From various departments, such as computer science, biology, English, humanities, and journalism.

All of them speak standard Chinese well.


Recording conditions

An ordinary room in the laboratory was used for recording.
No burst noise, only low-level environmental noise.

Speakers were asked to read the prompt texts at a normal speaking rate, while the volume could be controlled by the recording software.
Most of the speakers could complete a session smoothly in about 25 minutes.

Speech signals are digitized at 8 kHz and 16 kHz sampling rates simultaneously, with 16-bit precision.

10 recording sessions have been finished so far.


Database evaluation -- a first and quick look

Experimental setup
1024-mixture GMM-UBM system with 32-dim MFCCs

Experimental results
The system performs best when training and testing utterances are taken from the same session.
However, performance gets steadily worse as the recording date difference between training and testing grows.

Figure 2. EER curves when using different sessions for model training








How to find IMPORTANT frequency bands?

The proposed solution is to highlight, in feature extraction, the frequency bands that reveal high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information.

How to determine the discrimination sensitivity of each frequency band?
F-ratio serves as a criterion to produce the discrimination scores.

How to perform frequency warping to highlight target frequency bands?
Frequency warping on the basis of the mel scale.



F-ratio [Wolf 1972]

The ratio of the between-group variance to the within-group variance.

A higher F-ratio value means better feature selection for the target grouping.

That is to say, the feature selection with a higher F-ratio possesses higher discrimination sensitivity against the target grouping.
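As a small illustration of this definition, the sketch below computes the F-ratio of a one-dimensional feature for a given grouping. The function name `f_ratio` and the toy values are assumptions made for this example; the between-group variance is taken as the variance of the group means and the within-group variance as the average variance inside each group.

```python
import numpy as np

def f_ratio(groups):
    """Variance of the group means divided by the mean within-group variance."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    means = np.array([g.mean() for g in groups])
    between = means.var()                          # spread of the group means
    within = np.mean([g.var() for g in groups])    # average spread inside a group
    return between / within

# Toy values: feature A separates the two groups well, feature B does not.
feature_a = [[1.0, 1.2, 0.8], [3.0, 3.2, 2.8]]
feature_b = [[1.0, 3.0, 2.0], [1.5, 3.5, 2.5]]
print(f_ratio(feature_a), f_ratio(feature_b))
```

Feature A, whose groups are tight and far apart, gets the much larger F-ratio, which is the sense in which a higher value indicates higher discrimination sensitivity for that grouping.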


F-ratio in time-varying speaker recognition tasks

There exist two kinds of grouping: by speakers for each session and by sessions for each speaker.
The whole frequency range is divided into K frequency bands uniformly.
Linear-frequency-scale triangular filters are used to process the power spectrum of utterances.

Two F-ratio values are obtained for each frequency band.


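Figure 3. An illustration of two kinds of grouping

For frequency band $k$, speaker $i$ ($1 \le i \le M$), and session $s$ ($1 \le s \le S$), let $x_{i,s}^{k,j}$ be the band output for the $j$-th of $N_{i,s}$ frames, $\mu_{i,s}^{k}$ its mean over those frames, $\mu_{s}^{k}$ the mean over all speakers in session $s$, and $\mu_{i}^{k}$ the mean over all sessions of speaker $i$.

$$\mathrm{F\_ratio\_spk}_{s}^{k} = \frac{\sum_{i=1}^{M} \left(\mu_{i,s}^{k} - \mu_{s}^{k}\right)^{2}}{\sum_{i=1}^{M} \frac{1}{N_{i,s}} \sum_{j=1}^{N_{i,s}} \left(x_{i,s}^{k,j} - \mu_{i,s}^{k}\right)^{2}}$$

$$\mathrm{F\_ratio\_ssn}_{i}^{k} = \frac{\sum_{s=1}^{S} \left(\mu_{i,s}^{k} - \mu_{i}^{k}\right)^{2}}{\sum_{s=1}^{S} \frac{1}{N_{i,s}} \sum_{j=1}^{N_{i,s}} \left(x_{i,s}^{k,j} - \mu_{i,s}^{k}\right)^{2}}$$

$$\mathrm{F\_ratio\_spk}^{k} = \frac{1}{S} \sum_{s=1}^{S} \mathrm{F\_ratio\_spk}_{s}^{k}$$

$$\mathrm{F\_ratio\_ssn}^{k} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{F\_ratio\_ssn}_{i}^{k}$$

For each frequency band k, a discrimination score is defined as:

$$\mathrm{discrim\_score}^{k} = \frac{\mathrm{F\_ratio\_spk}^{k}}{\mathrm{F\_ratio\_ssn}^{k}} \tag{1}$$

Target frequency bands with higher discrimination scores should be assigned a suitable warping factor to increase their frequency resolution: not so small that it fails to emphasize them, and not too large either.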
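The two groupings and the discrimination score can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function name `discrimination_scores`, the array layout (speakers, sessions, frames, bands), and the equal number of frames per recording are all assumed here, and the data are synthetic.

```python
import numpy as np

def discrimination_scores(x):
    """x: band outputs with assumed shape (M speakers, S sessions, N frames, K bands)."""
    mu_is = x.mean(axis=2)     # (M, S, K): mean per speaker and session
    within = x.var(axis=2)     # (M, S, K): variance within each recording
    # Grouping by speakers inside each session, then averaging over sessions.
    mu_s = mu_is.mean(axis=0, keepdims=True)
    f_spk = (((mu_is - mu_s) ** 2).sum(axis=0) / within.sum(axis=0)).mean(axis=0)
    # Grouping by sessions inside each speaker, then averaging over speakers.
    mu_i = mu_is.mean(axis=1, keepdims=True)
    f_ssn = (((mu_is - mu_i) ** 2).sum(axis=1) / within.sum(axis=1)).mean(axis=0)
    return f_spk / f_ssn       # (K,): one discrimination score per band

# Synthetic data: band 0 depends on the speaker, band 1 on the session.
rng = np.random.default_rng(0)
M, S, N, K = 4, 3, 200, 2
x = rng.normal(size=(M, S, N, K))
x[..., 0] += 5.0 * np.arange(M)[:, None, None]
x[..., 1] += 5.0 * np.arange(S)[None, :, None]
scores = discrimination_scores(x)
print(scores)
```

Band 0 receives by far the higher score, since it changes with the speaker but not with the session; band 1, which changes mostly with the session, scores low.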


How to EMPHASIZE? Mel frequency warping (MFW)!

Warping strategies:
Uniform warping of those target frequency bands with discrimination scores above a threshold.
Non-uniform warping of the whole frequency range according to the discrimination scores of its bands.

Figure 4. The relationship between Hz, Mel scale, and MFW scale
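One way to picture the first strategy is a piecewise-linear stretch of the mel axis: target bands are widened by the warping factor, and filter centres are then spaced uniformly on the warped axis, so more filters land inside the target bands. The sketch below is a hypothetical illustration of that idea, not the exact MFW procedure of this work; `warped_center_frequencies`, the band edges, the scores, and the threshold are made up, and the mel formula is the common 2595 log10(1 + f/700) variant.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common mel-scale formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def warped_center_frequencies(band_edges_hz, scores, threshold, factor, n_filters):
    """Filter centres (Hz) spaced uniformly on a mel axis stretched in target bands."""
    edges_mel = hz_to_mel(band_edges_hz)
    widths = np.diff(edges_mel)
    # Hypothetical rule: bands scoring above the threshold are stretched by `factor`.
    stretch = np.where(np.asarray(scores) > threshold, factor, 1.0)
    warped_edges = np.concatenate([[0.0], np.cumsum(widths * stretch)])
    # Uniform spacing on the warped axis, excluding the two end points.
    targets = np.linspace(0.0, warped_edges[-1], n_filters + 2)[1:-1]
    # Invert the piecewise-linear warping back to mel, then to Hz.
    centers_mel = np.interp(targets, warped_edges, edges_mel)
    return mel_to_hz(centers_mel)

edges = np.linspace(0.0, 4000.0, 9)                  # 8 uniform bands up to 4 kHz
scores = [0.5, 0.6, 2.0, 2.2, 0.7, 0.4, 0.3, 0.2]    # made-up discrimination scores
plain = warped_center_frequencies(edges, scores, 1.0, factor=1.0, n_filters=24)
warped = warped_center_frequencies(edges, scores, 1.0, factor=3.0, n_filters=24)

def in_band(centers):
    """Number of filter centres inside the 1-2 kHz target region."""
    return int(np.sum((centers >= 1000.0) & (centers <= 2000.0)))

print(in_band(plain), in_band(warped))
```

With a factor of 1 the layout reduces to an ordinary mel filterbank; a larger factor moves filters into the 1-2 kHz target region at the expense of the other bands, which is why the factor should be neither too small nor too large.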



Figure 5. A comparison of MFCC and WMFCC extraction procedures








The discrimination for different bands …

Figure 6. Discrimination scores of frequency bands

Warping factor   1       2      3      4      5
EER (%)          10.06   8.69   8.14   8.22   8.36

Table 2. Performance comparison of WMFCC with different warping factors in average EER


Comparison

Figure 7. Performance comparison between MFCC and WMFCC in EER (x-axis: session, 1st to 10th and average; y-axis: EER, 0 to 14)

                     2nd-session EER (%)   Average EER (%)   Degradation degree (%)   Standard deviation
MFCC                 6.45                  10.06             55.97                    1.83
WMFCC                5.38                  8.14              51.30                    1.32
Reduction rate (%)   16.6                  19.1              8.9                      27.9

Table 3. Performance comparison between MFCC and WMFCC in degradation degree
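The degradation degrees in Table 3 are consistent with the relative increase of the average EER over the 2nd-session EER. That reading is inferred from the numbers, not stated on the slide, and the function names below are illustrative. The same reduction formula reproduces the 16.6, 19.1, and 27.9 entries; the 8.9 entry is taken as reported.

```python
def degradation_degree(reference_eer, average_eer):
    """Relative increase (%) of the average EER over the reference-session EER."""
    return 100.0 * (average_eer - reference_eer) / reference_eer

def reduction_rate(baseline, improved):
    """Relative reduction (%) achieved by the improved system."""
    return 100.0 * (baseline - improved) / baseline

mfcc_deg = degradation_degree(6.45, 10.06)       # MFCC row
wmfcc_deg = degradation_degree(5.38, 8.14)       # WMFCC row
avg_eer_reduction = reduction_rate(10.06, 8.14)  # average-EER column
print(round(mfcc_deg, 2), round(wmfcc_deg, 2), round(avg_eer_reduction, 1))
```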









A Discrimination-emphasized Mel-frequency-warping method is proposed for time-varying speaker recognition.

Experimental results show that in the time-varying voiceprint database, this method can not only improve speaker recognition performance in average EER with a reduction of 19.1%, but also alleviate performance degradation brought by time varying with a reduction of 8.9%. [WANG 2011, APSIPA ASC 2011 Excellent Student Paper Award]

Future work
Further experiments on other databases are needed to test the data dependency.
More study and experimentation are needed to determine whether the discrimination-emphasized idea can be applied to other speech features and, further, to speaker modeling techniques.


Thanks!

http://cslt.riit.tsinghua.edu.cn
http://www.apsipa.org

[email protected]


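Update ... Telephone banking application

In 2009, d-Ear Technologies (得意音通) and Tsinghua University jointly undertook China Construction Bank's "95533 Telephone Banking Voiceprint Identity Authentication System" project. The project passed acceptance in 2010, and in November 2011 CCB confirmed that it had "been running normally for a full year." China Construction Bank thus became the first bank in China's financial sector to deploy voiceprint identity authentication.

China Merchants Bank: next!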
