
Khushboo Modi
Senior Design Project Report

Text Independent Speaker Verification System

Khushboo Modi
[email protected]

Project Advisor:
Professor Lawrence Saul
[email protected]


Abstract:

User identification and verification are very important aspects of any security system today, as attackers find more and more ways to break into even the most complex of security measures. Biometric recognition systems are in demand today due to their reliance on human features that are unique to a person and cannot be forged easily, such as the face, fingerprints, and voice. Like a fingerprint, a person's voice has particular unique features, and using this voiceprint, their identity can be verified.

The goal of my project is to design and implement a text-independent speaker verification system. This means that regardless of what the user speaks, the system should be able to verify whether he is the person he claims to be. Such a system would be useful in banks, at ATMs, and in telephone-based applications, where there is no way to identify a user based on fingerprint or face.

Related Work:

Speech recognition is not a new subject; however, it is a growing industry, and new methods to tap this human quality are continuously being developed. A lot of research has been done on text-independent speaker verification systems using Gaussian mixture models, and my project is a simple implementation of that. I will be using published papers on this topic to assist me in my goal.


    Technical Approach:

The objective of this project is to implement a single-speaker verification system. Statistically speaking, it is a hypothesis test between two hypotheses:

    p(Y|H0) / p(Y|H1) >= θ, accept H0
    p(Y|H0) / p(Y|H1) <  θ, accept H1

where

    H0: Y is from the hypothesized speaker S
    H1: Y is not from the hypothesized speaker S [1]

(Figure taken from A Tutorial on Text-Independent Speaker Verification.)

The output of front-end processing is a sequence of feature vectors X = {x1, x2, ..., xT}, where xt is a feature vector indexed at discrete time t in {1, 2, ..., T}. These features are then used to compute the likelihoods under H0 and H1. The log of the likelihood ratio above would then be:

    Λ(X) = log p(X|H0) - log p(X|H1)

We need to generate two models for this test to work: the speaker model and the background model.
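The decision rule above can be sketched as follows (a minimal illustration with hypothetical function names, not the project's actual scoring code):

```python
def llr_decision(log_p_h0, log_p_h1, theta=0.0):
    """Accept H0 (the claimed speaker) when the log-likelihood ratio
    Lambda(X) = log p(X|H0) - log p(X|H1) meets the threshold theta."""
    return (log_p_h0 - log_p_h1) >= theta

# Utterance-level log-likelihoods are sums of per-frame log-likelihoods,
# so in practice the ratio is often averaged over the T frames.
def avg_llr(frame_ll_h0, frame_ll_h1):
    T = len(frame_ll_h0)
    return (sum(frame_ll_h0) - sum(frame_ll_h1)) / T
```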

I have planned three stages for implementing this system: the Training Phase, the Tuning Phase, and the Testing Phase.

    Training Phase: Generate the background model

    Tuning Phase: Generate the individual speaker models

    Testing Phase: Test the system using new wave files from test speakers

I'm using a Gaussian mixture model for the likelihood function, and so the mixture density for the likelihood function, for a D-dimensional feature vector x, is:

[1] A Tutorial on Text-Independent Speaker Verification


    p(x|λ) = Σ_{i=1..M} w_i p_i(x)

where each p_i(x) is a D-variate Gaussian density with mean vector μ_i and covariance matrix Σ_i, and the mixture weights satisfy Σ_i w_i = 1.

The GMM parameters (mean, variance, etc.) are calculated using the Expectation-Maximization (EM) algorithm. It is an iterative process that monotonically increases the likelihood of the estimated model for the observed feature vectors, such that for iterations k and k+1,

    p(X|λ^(k+1)) >= p(X|λ^(k))

The weight, mean, and variance parameters are re-estimated on each iteration from the posterior probabilities Pr(i|x_t):

    w_i  = (1/T) Σ_t Pr(i|x_t)
    μ_i  = Σ_t Pr(i|x_t) x_t / Σ_t Pr(i|x_t)
    σ_i² = Σ_t Pr(i|x_t) x_t² / Σ_t Pr(i|x_t) - μ_i²
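These EM updates can be sketched for a diagonal-covariance GMM as follows. This is a minimal NumPy illustration of the algorithm, not the project's actual training code (which relied on existing tools); the initialization and the small variance floor are my own assumptions.

```python
import numpy as np

def gmm_em(X, M=3, iters=20, seed=0):
    """EM for a diagonal-covariance GMM on data X of shape (T, D).
    Returns weights, means, variances, and per-iteration log-likelihoods."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(M, 1.0 / M)                        # mixture weights
    mu = X[rng.choice(T, M, replace=False)]        # means drawn from the data
    var = np.var(X, axis=0) * np.ones((M, D)) + 1e-6
    ll_history = []
    for _ in range(iters):
        # E-step: log density of each component for each frame.
        log_comp = (np.log(w)[:, None]
                    - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)[:, None]
                    - 0.5 * np.sum((X[None] - mu[:, None]) ** 2 / var[:, None],
                                   axis=2))
        log_norm = np.logaddexp.reduce(log_comp, axis=0)
        post = np.exp(log_comp - log_norm)         # Pr(i | x_t), shape (M, T)
        ll_history.append(log_norm.sum())
        # M-step: re-estimate weights, means, and variances.
        n = post.sum(axis=1)                       # soft counts per component
        w = n / T
        mu = (post @ X) / n[:, None]
        var = (post @ (X ** 2)) / n[:, None] - mu ** 2 + 1e-6
    return w, mu, var, ll_history
```

Each pass through the loop performs one E-step and one M-step, and the recorded log-likelihoods should be non-decreasing, matching the inequality above.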


    Data Collection:

To implement this system, test data is required. I have recorded clips from 25 speakers. Each speaker's data set consists of 15 speech clips of varying lengths. I split this data set into three categories: Training, Tuning, and Testing. These are the three phases of the project, and data is required in each stage. So I have used 9 of the 15 clips for training, 3 more for tuning, and the rest for testing the application.

To record the clips, I used a microphone and recording software called GoldWave. One factor that affected the results was the distance between the microphone and the speaker's mouth. Too close or too far, and the results were skewed. I realized this at a later stage, and so had to ask a few speakers to record more test clips.

    Training Phase:

In the Training phase, the background model is created. The background model is essentially one large Gaussian mixture model trained on a pool of all the sample data. I have converted the wave files into a different format so that they can be used for this analysis. The wave file is a continuous signal, which must be broken down into discrete parameter vectors. Each vector is about 10 ms long, because we assume that over this duration the signal is stationary. This is not strictly true, but it is a reasonable approximation to make. The format I've used is MFCC, which stands for Mel Frequency Cepstral Coefficients.

    The conversion can be done as follows:

    1. Divide signal into frames.

    2. For each frame, obtain the amplitude spectrum.

    3. Take the logarithm.

    4. Convert to Mel (a perceptually-based) spectrum.

5. Take the discrete cosine transform (DCT). [2]

However, instead of doing this manually, I used the HTK Toolkit to automate the process. Once the files are in the correct format, it is important to discard the silence frames and keep only the speech frames. I then generate mfcc.speech files.

[2] Logan, Beth. Mel Frequency Cepstral Coefficients for Music Modeling. http://ciir.cs.umass.edu/music2000/papers/logan_abs.pdf
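The five conversion steps can be sketched in NumPy roughly as follows. This is a simplified illustration only: HTK's actual MFCC implementation differs in details (pre-emphasis, liftering, exact filter shapes), and all parameter values here (16 kHz rate, 25 ms frames, 26 filters, 13 coefficients) are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, nfft=512,
         n_filters=26, n_ceps=13):
    # 1. Divide the signal into overlapping frames (~25 ms, 10 ms hop).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # 2. Amplitude spectrum of each Hamming-windowed frame.
    spec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), n=nfft))
    # 3. Take the logarithm (floored to avoid log of zero).
    log_spec = np.log(spec + 1e-10)
    # 4. Convert to a Mel (perceptually based) spectrum using a
    #    triangular filterbank spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:
            fbank[i, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        if hi > mid:
            fbank[i, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)
    mel_spec = log_spec @ fbank.T
    # 5. DCT-II to decorrelate the filter outputs; keep n_ceps coefficients.
    k = np.arange(n_ceps)
    basis = np.cos(np.pi * np.outer(k, np.arange(n_filters) + 0.5) / n_filters)
    return mel_spec @ basis.T
```

The result is one row of cepstral coefficients per 10 ms frame, which is the feature-vector sequence X described in the Technical Approach.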


One of the feature vectors extracted is energy, which corresponds to the loudness or softness of the speaker's voice. To avoid skewed results due to this, I removed the energy vector from the speech files.

Now, the speech files can be combined to generate the background model file. This model must then be trained. We must decide on the number of Gaussians to work with. To make that decision, I look at the log-likelihood values at the end of the training process and compare them.

For example:

    Number of samples    Log-likelihood in loop 4    Number of Gaussians
    25973                -607367.312500              250
    25973                -603886.625000              300
    25973                -600375.312500              350
    25973                -597373.687500              400

The optimal number of Gaussians is the one where the magnitude of the (negative) log-likelihood value first drops, because this means that the likelihood is actually increasing. During my earlier training phase, the optimal number of Gaussians was 300, which gave the best log-likelihood value at the time. However, as the number of samples increased, I decided to continue testing with higher numbers of Gaussians and finally achieved the best results at 600 Gaussians. As the system is scaled for use by a large number of speakers, this number will increase substantially. I keep the number of Gaussians fixed across the background model and the speaker models.

    Tuning Phase:

In the tuning phase, the individual speaker models are generated. The process for generating these models is very similar to that for generating the background model, with a few minor changes.

I take the mfcc.speech files (the MFCC files with the energy feature removed) and use them to generate a model for that speaker. I keep the number of Gaussians the same as for the background model: in this case, 600 Gaussians.

The purpose of this system is to test whether a given voiceprint belongs to the person the speaker claims to be. To achieve this, I needed to devise a method to calculate a threshold value, which would make it easy to distinguish the speaker from an imposter.

An imposter is a user who claims to be somebody else in order to cheat the system. To do this, I used three test files from each user. I compared each test file of a speaker to

    the speaker model, and based on the matching of the features, calculated the likelihood

value of a test recording belonging to that speaker. For each speaker, I compared not only the speaker's own test files, but also the files from the other speakers in the background model. This provided a range of values that would be useful in calculating a threshold.

    Below is a sample of the data I got from running the above test.

    Speech file   dip        divye      jiten      khush      madhu
    Dip13         1.080225   -0.376633  -0.28686   -0.437426  -0.294811
    Dip14         0.764772   -0.397726  -0.269392  -0.447361  -0.301577
    Dip15         0.673584   -0.44739   -0.34235   -0.469791  -0.422424
    Divye13       -0.576429  0.964666   -0.398272  -0.618855  -0.204668
    Divye14       -0.478092  0.99568    -0.295307  -0.503201  -0.371454
    Divye15       -0.508654  1.180914   -0.33753   -0.587896  -0.246685
    jiten13       -0.507383  -0.323767  1.3738     -0.48052   -0.425753
    jiten14       -0.276649  -0.396345  0.844433   -0.47724   -0.399639
    jiten15       -0.407593  -0.397972  1.095037   -0.497236  -0.3974
    khush13       -0.326227  -0.366068  -0.286724  0.927004   -0.230051
    khush14       -0.265522  -0.360636  -0.359411  1.067671   -0.389227
    khush15       -0.475126  -0.412575  -0.435125  1.201353   -0.447935
    madhu13       -0.323267  -0.377714  -0.310669  -0.461508  1.254961
    madhu14       -0.241459  -0.454681  -0.335117  -0.370653  1.299156
    madhu15       -0.275469  -0.405884  -0.288155  -0.447096  0.842831

Each speech file belongs to some speaker, and the diagonal likelihood values are the results of comparing a speaker's test file to the same speaker's model. The most important point I noticed in the tuning phase results was that the likelihood value of a test file is positive when the file actually belongs to the speaker and negative when the file belongs to an imposter.

I decided that the threshold had to be some function based on the average of the likelihood values of the speaker's files, while also including the imposter values.

The threshold function I used is:

    threshold = μ + x·σ

where μ is the mean and σ is the standard deviation of all the likelihood values, and x is an integer whose value can be varied. I varied x, starting with x = 2. Using this threshold function, I computed the thresholds of all the speakers in my background model.
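The threshold computation can be sketched as follows (the function name is hypothetical; the project computes this in its Perl scripts, and whether the population or sample standard deviation is used there is an implementation detail — the sketch uses the population form):

```python
import statistics

def speaker_threshold(likelihoods, x=4):
    """threshold = mean + x * stdev over all tuning likelihood values
    (the speaker's own files plus the imposter files)."""
    mu = statistics.mean(likelihoods)
    sigma = statistics.pstdev(likelihoods)  # population standard deviation
    return mu + x * sigma
```

With the summary values reported in the Testing Phase (mean -0.424741, stdev 0.148498) and x = 4, this formula reproduces the reported threshold 0.169252 up to rounding.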


    Testing Phase:

Once we have all the threshold values and speaker models, it is time to test the remaining files. This will help us determine whether the analysis done above is accurate enough. Using the threshold values calculated in the tuning phase, I tested the remaining speaker files. To ensure that the system is accurate while verifying users, we need to test the threshold values in two ways: for false alarms and for false rejections. If the likelihood

    value of an imposter file is higher than the threshold for the speaker being tested, then

    the system will validate the imposter as the speaker. This is a false alarm. On the other

hand, sometimes a speaker's own file may not have a likelihood value higher than the threshold, and so the speaker is falsely identified as an imposter. This is a false rejection.

An optimal threshold value would minimize both of these counts, keeping the error rate low. I maintain a summary file for each user, generated when the testing scripts are run, recording the likelihood values and the mean, variance, and standard deviation of the results. The summary file also tracks the number of false alarms and rejections.

    mean = -0.424741
    var = 0.022052
    stdev = 0.148498
    threshold for khush is 0.169252
    number of false alarms with threshold 0.169252 are 1
    number of false rejections with threshold 0.169252 are 0

As mentioned above, I started by keeping the value of x = 2. This threshold gave a very high rate of error, allowing many imposters to be validated as other speakers. However, there were very few false rejections. So I experimented by varying the value of x to 3 and then finally to x = 4. Currently, I have fixed the value of x at 4. However, with an increase in the number of speakers, this would vary.
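The false-alarm and false-rejection counting described above can be sketched as follows (a hypothetical helper; the project records these counts in the per-user summary files):

```python
def error_counts(genuine_scores, imposter_scores, threshold):
    """Count the two error types for one speaker's threshold:
    a false alarm is an imposter score above the threshold,
    a false rejection is a genuine score not above it."""
    false_alarms = sum(1 for s in imposter_scores if s > threshold)
    false_rejections = sum(1 for s in genuine_scores if s <= threshold)
    return false_alarms, false_rejections
```

Sweeping the threshold (here, the multiplier x) trades one error type against the other, which is exactly the tuning described above.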


    The User Interface:

In order to make this system user friendly, I have developed a GUI application, which is simple and hides the layer of complexity from the user. There are two parts: one for training a new speaker and one for testing a returning user. It is important that these features run in a very short time while demonstrating the application. I have incorporated a recorder in the GUI, so that no separate recording software is required.

In theory, a new speaker would be added to the system offline. The background model need not contain all the users that are added to the system, but if there were a huge discrepancy between the actual number of users and the user data in the background sample pool, the results would be skewed. For the purpose of demonstration, however, the background model is not modified while adding a new user. The entire procedure is automated using Perl scripts. Once the user records a voice clip and selects either to be added to the system or to be identified as a particular speaker, all the processing runs automatically and the result is shown on the screen.

A new speaker is added by the following procedure:

1. The speaker records a voice clip.
2. The voice clip is converted into a speech file of the correct format.
3. Using the data, the speaker model is generated.
4. Using the same speech file and the tuning files of the existing users, the threshold value of the speaker is generated.

A speaker's identity is verified by the following procedure:

1. The user records a voice clip.
2. The voice clip is converted into a speech file.
3. The user selects his username from a drop-down menu.
4. Based on the user's selection, the likelihood value of the speech file is compared with the threshold value of the selected identity.
5. If the likelihood value is higher than the threshold value, the user is identified as the speaker; if it is below the threshold value, the user is identified as an imposter.
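The final comparison step can be sketched as follows (the function name and the thresholds dictionary are illustrative assumptions; the scores and threshold in the example come from the tuning and testing tables above):

```python
def verify(claimed_id, likelihood, thresholds):
    """Return 'speaker' if the test file's likelihood against the claimed
    identity's model is higher than that identity's threshold."""
    return "speaker" if likelihood > thresholds[claimed_id] else "imposter"
```

For example, with khush's threshold of 0.169252, a genuine score such as 0.927004 (khush13) is accepted, while an imposter score such as -0.437426 (Dip13 against the khush model) is rejected.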


    Conclusion:

The aim of this project was to implement an application that would verify a speaker's identity by using the speaker's voiceprint: the characteristics that distinguish the speaker from other speakers. I wanted to implement a simple application using the algorithms already in existence.

Data collection was a very important aspect of this project. It was a challenge to figure out how many speakers I should use. I initially had about 10, but then increased that number to 25. It was also important to figure out what kind of data I should work with. Should I have multiple files or just one with a lot of speech? How many files for the testing phase and the tuning phase? I decided the details of data collection after a lot of trial and error.

One of the challenges was understanding how the Hidden Markov Model Toolkit worked. It was important to extract the features that I needed for my experiments and to manipulate the data the right way. One of the features extracted from a voice recording is energy. This energy corresponds to the loudness of the speaker's voice and would skew results if taken into account. So I had to figure out how to remove the energy vector from the feature vectors that HTK generated.

While the application gives fairly accurate results, it works well only under certain environmental circumstances. I recorded most of the data in a room with very little background disturbance. This is meant to be a single-speaker verification system, so no other speakers should be heard in the background. Also, the microphone used for all the test speakers is the same. The mike is placed at a fixed distance from the speaker's mouth while recording the clip. Using a different microphone, or adjusting the distance between the mike and the speaker's mouth, causes the results to be skewed. So, the application works under this scenario, but not necessarily under other circumstances. I would have liked to make the system robust to these variations, but I was not successful.

Overall, I enjoyed working on this project, since it was a topic that interested me. A blessing in disguise was my lack of information and awareness in this field, as it forced me to read and learn a lot on my own. I also learnt how to work on a large project with very little structure. It was important to set deadlines for myself and keep working towards the end goal. There were times when everything went wrong, and it was important not to give up. I am glad that I was able to achieve the goals I set for myself.


References:

Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." http://ciir.cs.umass.edu/music2000/papers/logan_abs.pdf

Schmidt, Regina. "Identity Confirmed, Access Permitted: The Basics On Voice Authentication, Security And Consumer Use Of An Emerging Biometric." BiometriTech, 3 Sep. 2003.

"A Tutorial on Text-Independent Speaker Verification." EURASIP Journal on Applied Signal Processing, 2004.

Reynolds, Douglas A., Quatieri, Thomas F., and Dunn, Robert B. "Speaker Verification Using Adapted Gaussian Mixture Models." Digital Signal Processing, 2000. http://www.ll.mit.edu/IST/pubs/000101_Reynolds.pdf

The HTK Book. http://anacardier.eecs.tulane.edu/documentation/htkbook/