
Khushboo Modi
Senior Design Project Report

Text Independent Speaker Verification System

Khushboo Modi
[email protected]

Project Advisor:
Professor Lawrence Saul
[email protected]


Abstract:

User identification and verification are very important aspects of any security system today, as attackers find more and more ways to break into even the most complex of security measures. Biometric recognition systems are in demand today due to their reliance on human features that are unique to a person and cannot be forged easily, such as the face, fingerprints, and voice. Like a fingerprint, a person's voice has particular unique features, and using this voiceprint, their identity can be verified.

The goal of my project is to design and implement a text-independent speaker verification system. This means that regardless of what the user speaks, the system should be able to verify whether he is the person he claims to be. Such a system would be useful in banks, at ATMs, and in telephone-based applications, where there is no way to identify a user based on fingerprint or face.

Related Work:

Speech recognition is not a new subject; however, it is a growing industry, and new methods to tap this human quality are continuously being developed. A lot of research has been done on text-independent speaker verification systems using Gaussian mixture models, and my project is a simple implementation of that. I will be using published papers on this topic to assist me in my goal.


    Technical Approach:

The objective of this project is to implement a single-speaker verification system. Statistically speaking, it is a hypothesis test between two hypotheses:

    p(Y|H0) / p(Y|H1) >= θ, accept H0
    p(Y|H0) / p(Y|H1) <  θ, accept H1

where

    H0: Y is from the hypothesized speaker S
    H1: Y is not from the hypothesized speaker S [1]

(Figure taken from A Tutorial on Text-Independent Speaker Verification.)

The output of front-end processing is a sequence of feature vectors X = {x1, x2, ..., xT}, where xt is a feature vector indexed at discrete time t in {1, 2, ..., T}. These features are then used to compute the likelihoods under H0 and H1. The log of the likelihood ratio above would then be:

    Λ(X) = log p(X|H0) - log p(X|H1)

We need to generate two models for this test to work: the speaker model and the background model.
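The decision rule above can be sketched as follows (a minimal illustration with hypothetical function names, not the project's actual scoring code):

```python
def llr_decision(log_p_h0, log_p_h1, theta=0.0):
    """Accept H0 (the claimed speaker) when the log-likelihood ratio
    Lambda(X) = log p(X|H0) - log p(X|H1) meets the threshold theta."""
    return (log_p_h0 - log_p_h1) >= theta

# Utterance-level log-likelihoods are sums of per-frame log-likelihoods,
# so in practice the ratio is often averaged over the T frames.
def avg_llr(frame_ll_h0, frame_ll_h1):
    T = len(frame_ll_h0)
    return (sum(frame_ll_h0) - sum(frame_ll_h1)) / T
```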

I have planned three stages for implementing this system: the Training Phase, the Tuning Phase, and the Testing Phase.

    Training Phase: Generate the background model

    Tuning Phase: Generate the individual speaker models

    Testing Phase: Test the system using new wave files from test speakers

I'm using a Gaussian mixture model for the likelihood function, and so the mixture density for the likelihood function, for a D-dimensional feature vector x, is:

[1] A Tutorial on Text-Independent Speaker Verification


    p(x|λ) = Σ_{i=1..M} w_i p_i(x)

where each p_i(x) is a D-variate Gaussian density with mean vector μ_i and covariance matrix Σ_i, and the mixture weights satisfy Σ_i w_i = 1.

The GMM parameters (mean, variance, etc.) are calculated using the Expectation-Maximization (EM) algorithm. It is an iterative process that monotonically increases the likelihood of the estimated model for the observed feature vectors, such that for iterations k and k+1,

    p(X|λ^(k+1)) >= p(X|λ^(k))

The weight, mean, and variance parameters are re-estimated on each iteration from the posterior probabilities Pr(i|x_t):

    w_i  = (1/T) Σ_t Pr(i|x_t)
    μ_i  = Σ_t Pr(i|x_t) x_t / Σ_t Pr(i|x_t)
    σ_i² = Σ_t Pr(i|x_t) x_t² / Σ_t Pr(i|x_t) - μ_i²
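These EM updates can be sketched for a diagonal-covariance GMM as follows. This is a minimal NumPy illustration of the algorithm, not the project's actual training code (which relied on existing tools); the initialization and the small variance floor are my own assumptions.

```python
import numpy as np

def gmm_em(X, M=3, iters=20, seed=0):
    """EM for a diagonal-covariance GMM on data X of shape (T, D).
    Returns weights, means, variances, and per-iteration log-likelihoods."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(M, 1.0 / M)                        # mixture weights
    mu = X[rng.choice(T, M, replace=False)]        # means drawn from the data
    var = np.var(X, axis=0) * np.ones((M, D)) + 1e-6
    ll_history = []
    for _ in range(iters):
        # E-step: log density of each component for each frame.
        log_comp = (np.log(w)[:, None]
                    - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)[:, None]
                    - 0.5 * np.sum((X[None] - mu[:, None]) ** 2 / var[:, None],
                                   axis=2))
        log_norm = np.logaddexp.reduce(log_comp, axis=0)
        post = np.exp(log_comp - log_norm)         # Pr(i | x_t), shape (M, T)
        ll_history.append(log_norm.sum())
        # M-step: re-estimate weights, means, and variances.
        n = post.sum(axis=1)                       # soft counts per component
        w = n / T
        mu = (post @ X) / n[:, None]
        var = (post @ (X ** 2)) / n[:, None] - mu ** 2 + 1e-6
    return w, mu, var, ll_history
```

Each pass through the loop performs one E-step and one M-step, and the recorded log-likelihoods should be non-decreasing, matching the inequality above.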


    Data Collection:

To implement this system, test data is required. I have recorded clips from 25 speakers. Each speaker's data set consists of 15 speech clips of varying lengths. I split this data set into three categories: Training, Tuning, and Testing. These are the three phases of the project, and data is required in each stage. So I have used 9 of the 15 clips for training, 3 more for tuning, and the rest for testing the application.

To record the clips, I used a microphone and recording software called GoldWave. One factor that affected the results was the distance between the microphone and the speaker's mouth. Too close or too far, and the results were skewed. I realized this at a later stage, and so had to ask a few speakers to record more test clips.

    Training Phase:

In the Training phase, the background model is created. The background model is essentially one large Gaussian mixture model trained on a pool of all the sample data. I have converted the wave files into a different format so that they can be used for this analysis. The wave file is a continuous signal, which must be broken down into discrete parameter vectors. Each vector is about 10 ms long, because we assume that over this duration the signal is stationary. This is not strictly true, but it is a reasonable approximation to make. The format I've used is MFCC, which stands for Mel Frequency Cepstral Coefficients.

    The conversion can be done as follows:

    1. Divide signal into frames.

    2. For each frame, obtain the amplitude spectrum.

    3. Take the logarithm.

    4. Convert to Mel (a perceptually-based) spectrum.

5. Take the discrete cosine transform (DCT). [2]

However, instead of doing this manually, I used the HTK Toolkit to automate the process. Once the files are in the correct format, it is important to discard the silence frames and keep only the speech frames. I then generate mfcc.speech files.

[2] Logan, Beth. Mel Frequency Cepstral Coefficients for Music Modeling. http://ciir.cs.umass.edu/music2000/papers/logan_abs.pdf
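The five conversion steps can be sketched in NumPy roughly as follows. This is a simplified illustration only: HTK's actual MFCC implementation differs in details (pre-emphasis, liftering, exact filter shapes), and all parameter values here (16 kHz rate, 25 ms frames, 26 filters, 13 coefficients) are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, nfft=512,
         n_filters=26, n_ceps=13):
    # 1. Divide the signal into overlapping frames (~25 ms, 10 ms hop).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # 2. Amplitude spectrum of each Hamming-windowed frame.
    spec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), n=nfft))
    # 3. Take the logarithm (floored to avoid log of zero).
    log_spec = np.log(spec + 1e-10)
    # 4. Convert to a Mel (perceptually based) spectrum using a
    #    triangular filterbank spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:
            fbank[i, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        if hi > mid:
            fbank[i, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)
    mel_spec = log_spec @ fbank.T
    # 5. DCT-II to decorrelate the filter outputs; keep n_ceps coefficients.
    k = np.arange(n_ceps)
    basis = np.cos(np.pi * np.outer(k, np.arange(n_filters) + 0.5) / n_filters)
    return mel_spec @ basis.T
```

The result is one row of cepstral coefficients per 10 ms frame, which is the feature-vector sequence X described in the Technical Approach.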


One of the feature vectors extracted is energy, which corresponds to the loudness or softness of the speaker's voice. To avoid skewed results due to this, I removed the energy vector from the speech files.

Now, the speech files can be combined to generate the background model file. This model must then be trained. We must decide on the number of Gaussians to work with. To make that decision, I look at the log-likelihood values at the end of the training process and compare them.

For example:

    Number of samples    Log-likelihood in loop 4    Number of Gaussians
    25973                -607367.312500              250
    25973                -603886.625000              300
    25973                -600375.312500              350
    25973                -597373.687500              400

The optimal number of Gaussians is the one where the magnitude of the (negative) log-likelihood value first drops, because this means that the likelihood is actually increasing. During my earlier training phase, the optimal number of Gaussians was 300, which gave the best log-likelihood value at the time. However, as the number of samples increased, I decided to continue testing with higher numbers of Gaussians and finally achieved the best results at 600 Gaussians. As the system is scaled for use by a large number of speakers, this number will increase substantially. I keep the number of Gaussians fixed across the background model and the speaker models.

    Tuning Phase:

In the tuning phase, the individual speaker models are generated. The process for generating these models is very similar to that for generating the background model, with a few minor changes.

I take the mfcc.speech files (the MFCC files with the energy feature removed) and use them to generate a model for that speaker. I keep the number of Gaussians the same as for the background model: in this case, 600 Gaussians.

The purpose of this system is to test whether a given voiceprint belongs to the person the speaker claims to be. To achieve this, I needed to devise a method to calculate a threshold value, which would make it easy to distinguish the speaker from an imposter.

An imposter is a user who claims to be somebody else in order to cheat the system. To do this, I used three test files from each user. I compared each test file of a speaker to

    the speaker model, and based on the matching of the features, calculated the likelihood

value of a test recording belonging to that speaker. For each speaker, I compared not only the speaker's own test files, but also the files from the other speakers in the background model. This provided a range of values that would be useful in calculating a threshold.

    Below is a sample of the data I got from running the above test.

    Speech file   dip        divye      jiten      khush      madhu
    Dip13         1.080225   -0.376633  -0.28686   -0.437426  -0.294811
    Dip14         0.764772   -0.397726  -0.269392  -0.447361  -0.301577
    Dip15         0.673584   -0.44739   -0.34235   -0.469791  -0.422424
    Divye13       -0.576429  0.964666   -0.398272  -0.618855  -0.204668
    Divye14       -0.478092  0.99568    -0.295307  -0.503201  -0.371454
    Divye15       -0.508654  1.180914   -0.33753   -0.587896  -0.246685
    jiten13       -0.507383  -0.323767  1.3738     -0.48052   -0.425753
    jiten14       -0.276649  -0.396345  0.844433   -0.47724   -0.399639
    jiten15       -0.407593  -0.397972  1.095037   -0.497236  -0.3974
    khush13       -0.326227  -0.366068  -0.286724  0.927004   -0.230051
    khush14       -0.265522  -0.360636  -0.359411  1.067671   -0.389227
    khush15       -0.475126  -0.412575  -0.435125  1.201353   -0.447935
    madhu13       -0.323267  -0.377714  -0.310669  -0.461508  1.254961
    madhu14       -0.241459  -0.454681  -0.335117  -0.370653  1.299156
    madhu15       -0.275469  -0.405884  -0.288155  -0.447096  0.842831

Each speech file belongs to some speaker, and the diagonal likelihood values are the results of comparing a speaker's test file to the same speaker's model. The most important point I noticed in the tuning phase results was that the likelihood value of a test file is positive when the file actually belongs to the speaker and negative when the file belongs to an imposter.

I decided that the threshold had to be some function based on the average of the likelihood values of the speaker's files, while also including the imposter values.

The threshold function I used is:

    threshold = μ + x·σ

where μ is the mean and σ is the standard deviation of all the likelihood values, and x is an integer whose value can be varied. I varied x, starting with x = 2. Using this threshold function, I computed the thresholds of all the speakers in my background model.
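The threshold computation can be sketched as follows (the function name is hypothetical; the project computes this in its Perl scripts, and whether the population or sample standard deviation is used there is an implementation detail — the sketch uses the population form):

```python
import statistics

def speaker_threshold(likelihoods, x=4):
    """threshold = mean + x * stdev over all tuning likelihood values
    (the speaker's own files plus the imposter files)."""
    mu = statistics.mean(likelihoods)
    sigma = statistics.pstdev(likelihoods)  # population standard deviation
    return mu + x * sigma
```

With the summary values reported in the Testing Phase (mean -0.424741, stdev 0.148498) and x = 4, this formula reproduces the reported threshold 0.169252 up to rounding.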


    Testing Phase:

Once we have all the threshold values and speaker models, it is time to test the remaining files. This will help us determine whether the analysis done above is accurate enough. Using the threshold values calculated in the tuning phase, I tested the remaining speaker files. To ensure that the system is accurate while verifying users, we need to test the threshold values in two ways: for false alarms and for false rejections. If the likelihood

    value of an imposter file is higher than the threshold for the speaker being tested, then

    the system will validate the imposter as the speaker. This is a false alarm. On the other

hand, sometimes a speaker's own file may not have a likelihood value higher than the threshold, and so the speaker is falsely identified as an imposter. This is a false rejection.

An optimal threshold value would minimize both of these counts, keeping the error rate low. I maintain a summary file for each user, generated when the testing scripts are run, recording the likelihood values and the mean, variance, and standard deviation of the results. The summary file also tracks the number of false alarms and rejections.

    mean = -0.424741
    var = 0.022052
    stdev = 0.148498
    threshold for khush is 0.169252
    number of false alarms with threshold 0.169252 are 1
    number of false rejections with threshold 0.169252 are 0

As mentioned above, I started by keeping the value of x = 2. This threshold gave a very high rate of error, allowing many imposters to be validated as other speakers. However, there were very few false rejections. So I experimented by varying the value of x to 3 and then finally to x = 4. Currently, I have fixed the value of x at 4. However, with an increase in the number of speakers, this would vary.
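The false-alarm and false-rejection counting described above can be sketched as follows (a hypothetical helper; the project records these counts in the per-user summary files):

```python
def error_counts(genuine_scores, imposter_scores, threshold):
    """Count the two error types for one speaker's threshold:
    a false alarm is an imposter score above the threshold,
    a false rejection is a genuine score not above it."""
    false_alarms = sum(1 for s in imposter_scores if s > threshold)
    false_rejections = sum(1 for s in genuine_scores if s <= threshold)
    return false_alarms, false_rejections
```

Sweeping the threshold (here, the multiplier x) trades one error type against the other, which is exactly the tuning described above.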


    The User Interface:

In order to make this system user friendly, I have developed a GUI application, which is simple and hides the layer of complexity from the user. There are two parts: one for training a new speaker and one for testing a returning user. It is important that these features run in a very short time while demonstrating the application. I have incorporated a recorder in the GUI, so that no separate recording software is required.

In theory, a new speaker would be added to the system offline. The background model need not contain all the users that are added to the system, but if there were a huge discrepancy between the actual number of users and the user data in the background sample pool, the results would be skewed. For the purpose of demonstration, however, the background model is not modified while adding a new user. The entire procedure is automated using Perl scripts. Once the user records a voice clip and selects either to be added to the system or to be identified as a particular speaker, all the processing runs automatically and the result is shown on the screen.

A new speaker is added by the following procedure:

1. The speaker records a voice clip.
2. The voice clip is converted into a speech file of the correct format.
3. Using the data, the speaker model is generated.
4. Using the same speech file and the tuning files of the existing users, the threshold value of the speaker is generated.

A speaker's identity is verified by the following procedure:

1. The user records a voice clip.
2. The voice clip is converted into a speech file.
3. The user selects his username from a drop-down menu.
4. Based on the user's selection, the likelihood value of the speech file is compared with the threshold value of the selected identity.
5. If the likelihood value is higher than the threshold value, the user is identified as the speaker; if it is below the threshold value, the user is identified as an imposter.
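The final comparison step can be sketched as follows (the function name and the thresholds dictionary are illustrative assumptions; the scores and threshold in the example come from the tuning and testing tables above):

```python
def verify(claimed_id, likelihood, thresholds):
    """Return 'speaker' if the test file's likelihood against the claimed
    identity's model is higher than that identity's threshold."""
    return "speaker" if likelihood > thresholds[claimed_id] else "imposter"
```

For example, with khush's threshold of 0.169252, a genuine score such as 0.927004 (khush13) is accepted, while an imposter score such as -0.437426 (Dip13 against the khush model) is rejected.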


    Conclusion:

The aim of this project was to implement an application that would verify a speaker's identity by using the speaker's voiceprint: the characteristics that distinguish the speaker from other speakers. I wanted to implement a simple application using the algorithms already in existence.

Data collection was a very important aspect of this project. It was a challenge to figure out how many speakers I should use. I initially had about 10, but then increased that number to 25. It was also important to figure out what kind of data I should work with. Should I have multiple files or just one with a lot of speech? How many files for the testing phase and the tuning phase? I decided the details of data collection after a lot of trial and error.

One of the challenges was understanding how the Hidden Markov Model Toolkit worked. It was important to extract the features that I needed for my experiments and to manipulate the data the right way. One of the features extracted from a voice recording is energy. This energy corresponds to the loudness of the speaker's voice and would skew results if taken into account. So I had to figure out how to remove the energy vector from the feature vectors that HTK generated.

While the application gives fairly accurate results, it works well only under certain environmental circumstances. I recorded most of the data in a room with very little background disturbance. This is meant to be a single-speaker verification system, so no other speakers should be heard in the background. Also, the microphone used for all the test speakers is the same. The mike is placed at a fixed distance from the speaker's mouth while recording the clip. Using a different microphone, or adjusting the distance between the mike and the speaker's mouth, causes the results to be skewed. So, the application works under this scenario, but not necessarily under other circumstances. I would have liked to make the system robust to these variations, but I was not successful.

Overall, I enjoyed working on this project, since it was a topic that interested me. A blessing in disguise was my lack of information and awareness in this field, as it forced me to read and learn a lot on my own. I also learnt how to work on a large project with very little structure. It was important to set deadlines for myself and keep working towards the end goal. There were times when everything went wrong, and it was important not to give up. I am glad that I was able to achieve the goals I set for myself.


References:

Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." http://ciir.cs.umass.edu/music2000/papers/logan_abs.pdf

Schmidt, Regina. "Identity Confirmed, Access Permitted: The Basics On Voice Authentication, Security And Consumer Use Of An Emerging Biometric." BiometriTech, 3 Sep. 2003.

"A Tutorial on Text-Independent Speaker Verification." EURASIP Journal on Applied Signal Processing, 2004.

Reynolds, Douglas A., Quatieri, Thomas F., and Dunn, Robert B. "Speaker Verification Using Adapted Gaussian Mixture Models." Digital Signal Processing, 2000. http://www.ll.mit.edu/IST/pubs/000101_Reynolds.pdf

The HTK Book. http://anacardier.eecs.tulane.edu/documentation/htkbook/