
An Overview of Statistical Pattern Recognition Techniques for Speaker Verification

Amin Fazel and Shantanu Chakrabartty

Amin Fazel, Student Member of the IEEE, and Shantanu Chakrabartty, Senior Member of the IEEE, are with Michigan State University.
Digital Object Identifier 10.1109/MCAS.2011.941080
Date of publication: 27 May 2011

Abstract

Even though the subject of speaker verification has been investigated for several decades, numerous challenges and new opportunities in robust recognition techniques are still being explored. In this overview paper we first provide a brief introduction to statistical pattern recognition techniques that are commonly used for speaker verification. The second part of the paper presents traditional and modern techniques that make real-world speaker verification systems robust to degradation due to the presence of ambient noise, channel variations, aging effects, and the availability of only limited training samples. The paper concludes with a discussion of future trends and research opportunities in this area.

1. Introduction

Speaker verification is a popular biometric identification technique [1] used for authenticating and monitoring human subjects using their speech signal. The method is attractive for two main reasons: (a) it does not require direct contact with the individual, thus avoiding the hurdle of “perceived invasiveness” inherent in many biometric systems like iris and fingerprint recognition systems; and (b) it does not require deployment of specialized signal transducers, as microphones are now ubiquitous on most portable devices (cellular phones, PDAs and laptops). According to “The Biometric Industry Report, Forecasts and Analysis, 2006”, the market for speaker verification systems is expected to grow to approximately $100 million [2], and this is also evident from the large number of business ventures that are actively supporting product development in speaker verification/recognition [2]. The key applications driving this demand in speaker verification/identification technology are tele-commerce and forensics [3], where the objective is to automatically authenticate speakers of interest using their conversation over a voice channel (telephone or wireless phone). Also, with the ever increasing popularity of multimedia web-portals (like Facebook and Youtube), large repositories of archived spoken documents such as TV broadcasts, teleconference meetings, and personal video clips can be accessed through the internet. Searching for metadata like topic of discussion or participant names and genders from these multimedia documents would require automated technology like speaker verification and recognition.

Traditionally, speaker verification systems have been classified into two different categories based on the constraints imposed on the authentication process: (a) text-dependent speaker verification systems, where the users are assumed to be “cooperative” and use identical pass-phrases during the training and testing phases; and (b) text-independent speaker verification systems, where no vocabulary constraints are imposed on the training and testing phases. Text-independent speaker verification systems can be further categorized into either vocabulary-constrained or unconstrained text-independent systems. An unconstrained text-independent speaker verification system does not assume any prior knowledge about the spoken text, unlike a vocabulary-constrained system. Also, while in the text-dependent case the imposed constraints greatly improve the accuracy in the presence of channel/background noise, the lack of constraints in the text-independent case makes it more challenging. For instance, since there are no constraints on the words which the speakers are allowed to use, the reference (what is spoken in training) and the test (what is uttered in actual use) utterances may have completely different content, and hence the verification system has to take phonetic mismatch issues into account. Other sources of challenges applicable to both text-independent and text-dependent speaker verification include compensating for changes in the acoustic environment, such as transducer or channel variations, and compensating for “within-speaker” variations like changes in state of health, mood or aging. An example scenario is shown in Figure 1, where the enrollment data used in developing recognition models is acquired over the internet, whereas during verification or testing a mobile interface is used for acquiring the speech data. Compensating for such mismatch between training and test conditions has been and still remains the most challenging problem [4, 5] in the design of speaker verification systems. Over the last four decades, several speaker verification techniques have been reported to address this challenge [6, 7, 8, 9, 10, 11], which makes a thorough review of the field challenging. Therefore, in this paper we will occasionally refer the reader to several excellent

survey papers [12, 13, 14], books and monographs [15, 16, 17] for details of some of the discussed techniques. The focus of this paper is to survey some of the state-of-the-art statistical pattern recognition techniques which have been employed for designing speaker verification systems. Our emphasis will be to cover topics in noise-robustness and speaker variability and, at the end, to introduce the reader to some of the open challenges and research opportunities in the field of speaker verification.

We have organized this paper as follows: First, in section 2, we briefly explain the biometric properties of the speech signal that make it unique to the speaker. We then outline the architecture of a typical speaker verification system in section 3 and provide an overview of some of the underlying statistical models and techniques. In section 4 we describe the key challenges in the area of speaker verification and discuss some of the statistical approaches that have been proposed to address these challenges. In section 5 we describe some of the evaluation metrics used to compare different speaker verification systems. In section 6 we conclude the paper with a brief discussion on future trends, applications and research opportunities.

2. Fundamentals of Speech Based Biometrics

Speech is produced when air from the lungs passes through the throat, the vocal cords, the mouth and the nasal tract (see Figure 2(a)). Different positions of the lips, tongue and the palate (also known as the articulators) then create different sound patterns and give rise to the physiological and spectral properties of the speech signal like pitch, tone and volume. The combination of these properties is typically considered unique to the speaker because they are modulated by the size and shape of the mouth, vocal and nasal tract along with the size, shape and tension of the vocal cords. It has been shown that even for twins, the chances of all of these properties being similar are very low [18, 19], thus making speech a viable biometric signal. Other than the physiological properties, behavioral or dynamical properties of the speech signal like dialect, vocabulary and stress patterns could also be used to further distinguish the person of interest.

One of the most commonly used methods for visualizing the spectral and dynamical content of a speech signal is the spectrogram, which displays the frequency of vibration of the vocal cords (pitch) and amplitude (volume) with respect to time. Examples of the spectrograms for a male and a female speaker are shown in Figure 2(b), where the horizontal axis represents time and the vertical axis represents frequency. The pitch of the


utterance manifests itself as horizontal striations in the spectrogram as shown in Figure 2(b). For instance, it can be seen from Figure 2(b) that the pitch of the female speaker is greater than the pitch of the male speaker. Other important spectral parameters of the speech signal are formants, which are defined as the resonant frequencies (denoted by F1, F2, F3, ...) of the vocal tract, in particular when vowels are pronounced. They are produced by restricting air flow through the mouth, tongue, and the jaw. The relative frequency location of the formants can vary widely from person to person (due to the shape of the vocal tract) and hence can be used as a biometric feature. Even though multiple resonant frequencies exist in the speech signal, only three of the formants (typically labeled as F1, F2, F3, as shown in Figure 2(b)) are used for speech and speaker recognition applications. However, reliable estimation of the spectral parameters requires segments of speech signal that are stationary, and hence most verification systems use 20–30 millisecond segments. Another biometric signature embedded in the speech signal is the stress pattern, also known as prosody, which manifests as the spectral trajectory and distribution of energy in the spectrogram. This signature is typically considered one of the “high-level” features and can be estimated by observing the dynamics across multiple segments of the speech signal. In the next section, we will discuss some of the popular approaches to extract some of these biometric features and discuss some of the statistical models which are used to recognize the speaker specific features.
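Before moving on, a minimal sketch of the short-time analysis underlying such a spectrogram is shown below using SciPy with 25 ms frames and a 10 ms shift; the synthetic two-tone waveform is a purely hypothetical stand-in for a recorded utterance.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 8000                                        # 8 kHz sampling rate, as in telephone speech
t = np.arange(0, 1.0, 1.0 / fs)
# Hypothetical stand-in for a speech waveform (a 200 Hz "pitch" plus one harmonic)
speech = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 600 * t)

# 25 ms analysis windows (200 samples) with a 10 ms shift (noverlap = 200 - 80)
freqs, frames, Sxx = spectrogram(speech, fs=fs, nperseg=200, noverlap=120)
# Sxx[i, j] is the spectral power at frequency freqs[i] and frame time frames[j];
# in real speech, pitch appears as horizontal striations and formants as broad energy bands.
print(Sxx.shape)
```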

Figure 1. An example illustrating one source of mismatch in a speaker verification system. Reference data could be acquired from the internet to train speaker verification models which are then stored on an authentication server. However, during testing, the data is acquired through a mobile interface which could be prone to background noise and wireless channel distortion. The central authentication system has to handle the mismatch between the training and testing conditions to be effective in real-life deployment.


Figure 2. Fundamentals of speech biometrics: (a) Magnetic resonance image¹ showing the anatomy of the speech production apparatus. The biometric property of the speech signal is determined by the shape of the vocal tract and the orientation of the mouth, teeth and nasal passages. (b) Spectrograms corresponding to a sample utterance “fifty-six thirty-five seventy-two” for a male and a female speaker. The horizontal axis represents time, the vertical axis corresponds to frequency and the color map represents the magnitude of the spectrogram. The parameters of the speech signal that make it unique for each person are pitch, formants (F1–F3), and prosody, labeled in (b). The spectrograms clearly show the differences between some of these parameters for a male and a female speaker; for instance, the pitch for the female speaker is higher than for the male speaker.

¹P. Martins, I. Carbone, A. Pinto, A. Silva and A. Teixeira, “European Portuguese MRI based speech production studies,” Speech Commun., vol. 50, no. 11–12, pp. 925–952, 2008.


3. Overview of a Basic Speaker Verification System

Operation of a speaker verification system typically consists of two distinct phases: (a) an enrollment phase where the parameters of a speaker specific statistical model are determined using annotated (pre-labeled) speech data; and (b) a verification phase where an unknown speech sample is authenticated using the trained speaker specific model. Both phases are shown in Figure 3, where the speech signal is first sampled, digitized and filtered before a feature extraction algorithm computes salient acoustic features from the speech signal. The next step in the enrollment phase uses the extracted features to train a speaker specific statistical model. Most state-of-the-art speaker verification systems also use features specific to a set of background speakers or cohort speakers to enhance the robustness and discriminatory ability of the system. For instance, background speakers could be used as negative examples for training of a discriminative model [20], or for training a universal background model for adapting target speaker models [21]. During the verification phase (as shown in Figure 3), an unknown utterance is authenticated against the trained speaker and background statistical models. The scores generated by both models are normalized [21] and integrated for the entire utterance before an acceptance or a rejection decision is made. In the following subsections, we will review some of the standard speaker verification techniques that are used during the enrollment and verification phases.

3.1 Speech Acquisition and Feature Extraction Module

The speech acquisition module typically consists of a transducer that is coupled to an amplifier and filtering circuitry. Depending on the specifications (size, power and recognition performance) imposed on the speaker verification system, the transducer could be a standard microphone (omni-directional or directed) or a noise-canceling microphone array in which the speech signal is enhanced by suppressing background noise using a spatial filter [22]. The amplifier and the filtering circuitry are used to maintain a reasonable signal-to-noise ratio (SNR) at the input of an analog-to-digital converter (ADC), which is used to digitize the speech signal. Depending on the topology of the ADC, the speech signal could be sampled at the Nyquist rate (8 kHz) or oversampled using a sigma-delta modulator. Typically, a high-order sigma-delta modulator is the ADC of choice because of its ability to achieve resolution greater than 16 bits for audio frequency signals. Once the speech

Figure 3. Functional architecture of a speaker verification system, which consists of two main phases: (a) an enrollment phase where the parameters of a speaker specific statistical model are determined, and (b) a verification phase where an unknown speaker is authenticated using the models trained during the enrollment phase.


signal is digitized, a feature extraction module (typically implemented on a floating point processor) extracts speaker specific information from the raw waveform. The speaker specific information can be categorized into two main components, namely “high-level” and “low-level” information [23, 24]. The “low-level” information conveys the physical structure of the vocal tract, whereas the “high-level” characteristics convey behavioral information such as prosody, phonetic and conversational patterns, etc. The difference between these two kinds of features is the relative time-scale required for extracting and processing them. While low-level features can be effectively computed using short frames of speech (<30 ms), the high-level features could require time-scales greater than a few seconds [24]. However, most speaker verification systems use only “low-level” features due to their inherent low complexity. In the following description we present a short overview of two popular classes of “low-level” features used for authentication: linear predictive cepstral coefficients and Mel frequency cepstral coefficients.

Linear Predictive Cepstral Coefficients (LPCC): At the core of the LPCC feature extraction algorithm is the Linear Prediction Coding (LPC) technique [25, 26], which assumes that any speech signal can be modeled by a linear source-filter model. This model assumes two sources of human vocal sounds: the glottal pulse generator and the random noise generator, as shown in Figure 4(a). The glottal pulse generator creates voiced sounds. This source generates one of the measurable attributes used in voice analysis: the pitch period. The random noise generator produces the unvoiced sounds, and the vocal tract serves as the filter of the model that produces intensification at specific formants. In LPCC feature extraction, the filter is typically chosen to be an all-pole filter as shown in Figure 4(a). The parameters of the all-pole filter are estimated using an auto-regressive procedure where the signal at each time instant can be determined using a certain number of preceding samples. Mathematically this can be expressed as

s(t) = \sum_{i=1}^{p} a_i\, s(t-i) + e(t),   (1)

where the speech signal s(t) at time instant t is determined by the p past samples s(t-i), with i representing the discrete time delay; e(t) is known as the excitation term (random noise or glottal pulse generator), which also signifies the estimation error for the linear prediction process, and a_i denotes the LPC coefficients.

During LPCC feature extraction, a quasi-stationary window of speech (about 20–30 ms) is used to determine the parameters a_i, and the process is repeated for the entire duration of the utterance. In most implementations, an overlapping window or a spectral shaping window [26] is chosen to compensate for spectral degradation due to the finite window size. The estimation of the prediction coefficients a_i is done by minimizing the prediction error e(t), and several efficient algorithms like the Yule-Walker or Levinson-Durbin algorithms exist to compute the features in real-time. The prediction coefficients are then further transformed into Linear Predictive Cepstral Coefficients (LPCC) using a recursive method [4], which has been omitted here for the sake of brevity. A variant of the LPC analysis is the Perceptual Linear Prediction (PLP) [27] method. The main idea of this technique is to take advantage of some characteristics derived from the psycho-acoustic properties of the human ear; these characteristics are modeled by a filter bank.
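As an illustration of the procedure just described, the following minimal sketch (not the authors' implementation) computes LPC coefficients for one frame via the autocorrelation method and the Levinson-Durbin recursion, and then converts them to cepstral coefficients using a standard LPC-to-cepstrum recursion; the frame length, model order, and function names are illustrative assumptions.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for predictor coefficients a_1..a_p."""
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                        # reflection coefficient
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)                 # residual prediction error
    return a[1:], err

def lpc_to_cepstrum(a, n_ceps):
    """Recursive LPC-to-cepstrum conversion (standard recursion for this sign convention)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        c[m] = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:
                c[m] += (k / m) * c[k] * a[m - k - 1]
    return c[1:]

def lpcc_frame(frame, order=12, n_ceps=12):
    frame = frame * np.hamming(len(frame))               # spectral shaping window
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a, _ = levinson_durbin(r, order)                     # prediction coefficients a_i of eq. (1)
    return lpc_to_cepstrum(a, n_ceps)

# Example: a 25 ms frame of a synthetic signal at 8 kHz
frame = np.sin(2 * np.pi * 300 * np.arange(200) / 8000.0)
print(lpcc_frame(frame))
```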

Mel Frequency Cepstral Coefficients (MFCC): These features have been extensively used in speaker verification systems [28]. MFCCs were introduced in the early 1980s for speech recognition applications and since then have also been adopted for speaker identification applications. The key steps involved in computing MFCC features are shown in Figure 4(b). A sample of the speech signal is first extracted using a window. Typically two parameters are important for the windowing procedure: the duration of the window (20–30 ms) and the shift between two consecutive windows (10–15 ms). These values correspond to the average duration for which the speech signal can be assumed to be stationary, i.e., for which its statistical and spectral information does not change significantly. The speech samples are then weighted by a suitable windowing function; for example, Hamming or Hanning windows are extensively used in speaker verification. The weighting reduces the artifacts (side lobes and signal leakage) of choosing a finite duration window for analysis. The magnitude spectrum of the speech sample is then computed (shown in Figure 4(b)) using a fast Fourier transform (FFT) and is then processed by a bank of band-pass filters. The filters that are generally used in MFCC computation are triangular filters, as shown in Figure 4(b), and their center frequencies are chosen according to a logarithmic


frequency scale, also known as the Mel-frequency scale. The filter bank is then used to transform the frequency bins to Mel-scale bins by the following equation:

m_y[b] = \sum_{f} w_b[f]\, |Y[f]|^2,   (2)

where w_b[f] is the b-th Mel-scale filter's weight for frequency f and Y[f] is the FFT of the windowed speech signal. The rationale for choosing a logarithmic frequency scale is that it conforms to the response observed in the human auditory system, which has been validated through several biophysical experiments [26]. The Mel-frequency weighted magnitude spectrum is processed by a compressive non-linearity (typically a logarithmic function), which also models the observed response of the human auditory system. The last step in MFCC computation is a discrete cosine transform (DCT), which is used to de-correlate the Mel-scale filter outputs. A subset of the DCT coefficients is chosen (typically the first and the last few coefficients are ignored); these represent the MFCC features used in the enrollment and the verification phases.
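The sketch below strings the steps just described into a compact MFCC extractor (windowing, FFT magnitude, Mel-scale triangular filter bank as in Eq. (2), log compression, DCT). It is a minimal illustration rather than a reference implementation; the FFT size, filter count and sampling rate are assumed values.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters whose center frequencies are equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for b in range(1, n_filters + 1):
        lo, ctr, hi = bins[b - 1], bins[b], bins[b + 1]
        fb[b - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[b - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    return fb

def mfcc(signal, fs=8000, frame_ms=25, shift_ms=10, n_fft=512, n_filters=26, n_ceps=13):
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, n_fft, fs)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window            # windowing
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2               # |Y[f]|^2
        mel_bins = fb @ power                                        # eq. (2)
        log_mel = np.log(mel_bins + 1e-10)                           # compressive non-linearity
        feats.append(dct(log_mel, type=2, norm='ortho')[:n_ceps])    # de-correlating DCT
    return np.array(feats)

# Example on a synthetic 1-second signal
x = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000.0)
print(mfcc(x).shape)   # (number of frames, n_ceps)
```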

Dynamic and Energy Features: Even though each feature set (LPCC or MFCC) is computed for a short frame of speech (about 20–30 ms), it is well known that information embedded in the temporal dynamics of the features is also useful for recognition [29]. Typically two kinds of dynamics have been found useful in speech processing: (a) the velocity of the features (known as Δ features), which is determined by their average first-order temporal derivative; and (b) the acceleration of the features (known as ΔΔ features), which is determined by their average second-order temporal derivative. Other transforms of the features which have also been found useful in recognition include the logarithm of the total energy of the feature (L2 norm) and its first-order temporal derivative [26].
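A common way to compute the Δ (velocity) features is the regression formula d_t = Σ_{n=1}^{N} n (c_{t+n} − c_{t−n}) / (2 Σ_{n=1}^{N} n²); the sketch below applies it to an MFCC matrix, with the window half-width N = 2 chosen as an illustrative assumption.

```python
import numpy as np

def delta(features, N=2):
    """Velocity (Δ) features from a (T, n_ceps) feature matrix; apply twice for ΔΔ."""
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')   # replicate edge frames
    d = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        for n in range(1, N + 1):
            d[t] += n * (padded[t + N + n] - padded[t + N - n])
    return d / denom

# Example: append Δ and ΔΔ to a static MFCC matrix
mfcc_feats = np.random.randn(100, 13)              # stand-in for real MFCCs
full = np.hstack([mfcc_feats, delta(mfcc_feats), delta(delta(mfcc_feats))])
print(full.shape)                                   # (100, 39)
```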

Auxiliary Features: Even though cepstral features have been widely used in speaker recognition systems, it has been suggested that these features might contain

Figure 4. “Low-level” speech features used in speaker verification systems. (a) Simplified LPCC-based speech production model, which consists of an all-pole filter that models the response of the voice articulators, together with the cepstrum recursion. (b) Signal flow of the MFCC feature extraction algorithm. (c) The bank of Mel-scale filters used in MFCC feature extraction.

[Figure 4 graphics: panel (a) shows the voiced/unvoiced source-filter model with the all-pole transfer function H(z) = G / (1 − Σ_{i=1}^{M} a_i z^{-i}) and the cepstrum recursion; panel (b) shows the MFCC signal flow (windowing, FFT magnitude, Mel-scale filter bank, log, DCT); panel (c) plots the Mel scale against frequency, m = 2595 log10(1 + f/700).]


phonemic information that may be unrelated to the speaker recognition task. Recently, new techniques have been reported that can extract speaker-related information from LPCCs and MFCCs and in the process improve the system's recognition performance. One such group of features is sometimes referred to as voice source features. For example, in [30], an inverse filtering technique has been used to separate the spectra of the glottal source and the vocal tract. In another approach, the residual signal obtained from LP analysis has been used to estimate the glottal flow waveform [31, 32, 33, 34]. An alternative approach to estimating the glottal flow (derivative) waveform was presented in [35, 36, 37], where a closed-phase covariance analysis technique was applied during the intervals when the vocal folds are closed. Another group of speaker specific features includes prosodic features. Prosody, which involves variation in syllable length, intonation, formant frequencies, pitch, rate and rhythm of speech, can vary from speaker to speaker and relies on long-term information of the speech signal. One of the predominant prosodic features is the fundamental frequency (or F0). Other features include pitch, energy distribution over a longer frame, speaking rate and phone duration [38, 39, 40]. The auxiliary features are usually combined with other low-level features using fusion techniques, which will be discussed in section 3.2.3.

Voice Activity Detector (VAD): Before the features can be used in the enrollment and verification phases, it is important to determine whether they correspond to the “speech” portion of the signal or to the silence or background part of the signal. Most speaker verification systems use a voice activity detector (VAD) whose function is to locate the speech segments in an audio signal. For example, a simple VAD could compute the instantaneous signal-to-noise ratio (SNR) and pick segments only when the SNR exceeds a predetermined threshold. However, we would like to point out that the design of a robust VAD could prove challenging, since the module is expected to work consistently across different environments and noise conditions.
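The sketch below implements the simple energy-based selection rule mentioned above: frames are kept when their log energy lies within a fixed margin of the loudest frame, a crude SNR-style criterion. The margin value and function name are illustrative assumptions, not a recommended design.

```python
import numpy as np

def energy_vad(frames, margin_db=30.0):
    """frames: (T, frame_len) array of windowed samples.
    Returns a boolean mask marking frames treated as speech."""
    log_energy = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    # Keep frames whose energy is within `margin_db` of the maximum frame energy
    return log_energy > (log_energy.max() - margin_db)

# Example: drop silence/background frames before modeling
frames = np.random.randn(200, 200) * np.r_[np.ones(50), 0.01 * np.ones(150)][:, None]
mask = energy_vad(frames)
speech_frames = frames[mask]
print(mask.sum(), "of", len(frames), "frames retained")
```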

3.2 Speaker Modeling

Once the feature vectors corresponding to the “speech” frames have been extracted, the associated speech data, also known as training/enrollment data, is used to build a speaker specific model. During the verification phase, the trained model is used to authenticate a sequence of feature vectors extracted from utterances of unknown speakers. The focus of this section and paper is on the statistical approaches for constructing the relevant models. The methods can be divided into two distinct categories: generative and discriminative. Training of generative models typically involves data specific to the target speakers, where the objective is that the models faithfully capture the statistical properties of the speaker specific speech signal. Training of discriminative models involves data corresponding to the target and imposter speakers, and the objective is to faithfully estimate the parameters of the manifold which distinguishes the features of the target speakers from the features of the imposter speakers. An example of a popular generative model used in speaker verification is the Gaussian Mixture Model (GMM), and an example of a popular discriminative model is the Support Vector Machine (SVM). In the following sections we describe these classical techniques briefly; readers are referred to appropriate references [41] for details.

3.2.1 Generative Models

Generative models mainly include Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), and they capture the empirical probability density function corresponding to the acoustic feature vectors. GMMs represent a particular case of HMMs and can be viewed as a single-state HMM where the probability density is defined by a mixture of Gaussians.

GMM-Based Modeling. GMMs have unique advantages compared to other modeling approaches because their training is relatively fast and the models can be scaled and updated to add new speakers with relative ease. A GMM model λ is composed of a finite mixture of multivariate Gaussian components and estimates a general probability density function p_λ(x) according to

p_\lambda(x) = \sum_{i=1}^{M} w_i\, p_i(x),   (3)

where M is the number of Gaussian components and w_i is the prior probability (mixing weight) of the i-th D-variate Gaussian density function p_i(x), given by

p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right).   (4)

The parameters \mu_i and \Sigma_i represent the mean vector and covariance matrix of the multi-dimensional Gaussian


distribution, and the mixing weights w_i are constrained such that \sum_{i=1}^{M} w_i = 1. Figure 5(a) shows an example of a GMM training procedure.

In a GMM based speaker verification system, a speaker-independent world model, also known as a universal background model (UBM), is first trained using speech data gathered from a large number of “imposter” speakers [21]. The training procedure typically uses an iterative expectation-maximization (EM) algorithm [41] which estimates the parameters \mu_i and \Sigma_i using a maximum likelihood criterion [42]. In Figure 5, we present a rough sketch of the EM training procedure, whose details can be found in numerous references [41, 42, 43]. The background model obtained after training thus represents a speaker-independent distribution of the feature vectors. When enrolling a new speaker into the system, the parameters of the background model are adapted to the feature vector distribution of the new speaker using maximum a posteriori (MAP) update rules. In this way, the model parameters are not required to be estimated from scratch; instead the previously estimated priors are used for re-training. There are alternative adaptation methods to MAP, and the selection of method usually depends on the amount of available training data [44]. For very short enrollment utterances (a few seconds), other methods such as maximum likelihood linear regression (MLLR) [45] have been shown to be more effective.
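For concreteness, the following is a minimal numpy sketch of EM training for a diagonal-covariance GMM, following the E-step (responsibilities) and M-step (weight, mean, variance re-estimation) summarized in Figure 5(a). The mixture size, iteration count and initialization are arbitrary assumptions, and production systems would typically MAP-adapt a UBM rather than train each speaker model from scratch.

```python
import numpy as np

def train_gmm_em(X, M=8, n_iter=50, seed=0):
    """X: (N, D) feature matrix. Returns mixture weights, means and diagonal variances."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(M, 1.0 / M)
    mu = X[rng.choice(N, M, replace=False)]          # initialize means on random frames
    var = np.tile(X.var(axis=0), (M, 1)) + 1e-3
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, i] = w_i p_i(x_n) / sum_k w_k p_k(x_n)
        log_p = -0.5 * (((X[:, None, :] - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=2)
        log_wp = np.log(w) + log_p
        gamma = np.exp(log_wp - np.logaddexp.reduce(log_wp, axis=1)[:, None])
        # M-step: re-estimate weights, means and variances from the responsibilities
        Nk = gamma.sum(axis=0) + 1e-10
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ X ** 2) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

# Example: fit a small GMM to synthetic two-cluster "features"
X = np.vstack([np.random.randn(500, 13) - 2.0, np.random.randn(500, 13) + 2.0])
w, mu, var = train_gmm_em(X, M=4)
print(w)
```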

Hidden Markov Models (HMMs). By construction, GMMs are static models that do not take into account the dynamics inherent in the speech vectors. In this regard, HMMs [26] are statistical models that capture the temporal dynamics of speech production as an equivalent first-order Markov process. Figure 5 shows an example of a simple HMM which comprises a sequence of states with a GMM associated with each state. In this example, each state represents a stationary unit of the speech signal, also known as a “tri-phone”. The training procedure for HMMs involves an EM algorithm, where the feature vectors are first temporally aligned to the states using a dynamic programming

Figure 5. Examples of generative models that have been used for speaker verification: (a) GMMs, which consist of a mixture of weighted Gaussians whose parameters (mean, variance and weight) are determined using an expectation-maximization (EM) algorithm. The E and M steps of the EM-based GMM training algorithm are shown. Before training, the means of the Gaussians are uniformly spaced and the variances and weights are chosen to be the same. After training, the means and variances of the Gaussians align themselves to the data cluster centers and the weights capture the prior probability of the data. (b) HMMs, where each state has a GMM which captures the statistics of a stationary segment of speech. (c) HMMs are trained by aligning the states to the utterance using a trellis diagram. Each path through the trellis (from start to end) specifies a possible sequence of HMM states that generated the utterance. An EM algorithm similar to that for GMMs is used, except that probabilities are associated with paths instead of mixtures.

[Figure 5 graphics: panel (a) shows the GMM parameters (w_i, μ_i, σ_i) before and after training, together with the EM updates — E-step: γ_i(x_n) = w_i p_i(x_n) / Σ_k w_k p_k(x_n); M-step: w_i = (1/N) Σ_n γ_i(x_n), μ_i = Σ_n γ_i(x_n) x_n / Σ_n γ_i(x_n), σ_i² = Σ_n γ_i(x_n) x_n² / Σ_n γ_i(x_n) − μ_i². Panels (b) and (c) show the per-phone HMM states (e.g., /al/ left, middle, right, with start and end states) and the trellis of the most probable state sequence over frames t_1 … t_n.]


procedure, and the aligned feature vectors are used to update the parameters of each state's GMM. During the verification procedure, the most probable sequence of states/phones is estimated (again using a dynamic programming procedure) for a given utterance. The scores generated by each state in the most probable sequence are accumulated to obtain the utterance- and speaker-specific likelihood. Because HMMs rely on the phonetic content of the speech signal, they have been used predominantly in text-dependent speaker verification systems [46].
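The dynamic programming alignment referred to above is typically a Viterbi search; the sketch below recovers the most probable state sequence from per-frame state log-likelihoods. The matrix shapes, toy transition matrix and variable names are assumptions for illustration; the per-frame log-likelihoods would come from each state's GMM.

```python
import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """log_emis: (T, S) per-frame log-likelihoods from each state's GMM,
    log_trans: (S, S) log transition probabilities, log_init: (S,) log initial probabilities."""
    T, S = log_emis.shape
    delta = np.full((T, S), -np.inf)        # best path score ending in each state
    psi = np.zeros((T, S), dtype=int)       # back-pointers
    delta[0] = log_init + log_emis[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # previous state -> current state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emis[t]
    # Backtrack to recover the alignment used to accumulate state scores
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# Example with a toy 3-state left-to-right model
T, S = 50, 3
trans = np.log(np.array([[0.8, 0.2, 0.0],
                         [0.0, 0.8, 0.2],
                         [0.0, 0.0, 1.0]]) + 1e-12)
init = np.log(np.array([1.0, 1e-12, 1e-12]))
path, score = viterbi(np.random.randn(T, S), trans, init)
print(path[:10], score)
```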

3.2.2 Discriminative Models

Discriminative models are optimized to minimize the error on a set of genuine and impostor training samples. They include, among many other approaches, Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs).

Support Vector Machines: SVMs are an attractive choice for implementing discriminative models because they provide good verification performance even with relatively few data points in the training set, and a bound on the performance error can be estimated directly from the training data [47]. This is important because in many instances only a limited amount of data is available for the “target” speaker. The learning ability of the classifier is controlled by a regularizer in the SVM training, which determines the trade-off between its complexity and its generalization performance. In addition, the SVM training algorithm finds, under general conditions, a unique classifier topology that provides the best out-of-sample performance [47]. The key concept behind an SVM based approach is the use of kernel functions, which map the feature vectors to a higher dimensional feature space using a non-linear transformation Φ(·).

Figure 6(c) illustrates an example of the mapping operation from a two dimensional feature space to a three dimensional space. In the original feature space, the data points corresponding to the binary classes (denoted by “circles” and “squares”) are not linearly separable. In the

Figure 6. Discriminative models: (a) General structure of an SVM with radial basis functions as the kernel. (b) Structure of a multi-layer ANN consisting of two hidden layers. (c) An example of a kernel function K(x, y) = (x · y)², which maps a non-linearly separable classification problem (left) into a linearly separable problem (right) using a non-linear mapping Φ(·).


higher dimensional space the data points are linearly separable and can be classified correctly by a linear hyper-plane. A binary (two-class) SVM comprises a linear hyper-plane constructed in the higher dimensional space and is given by

f(z) = \langle w, \Phi(z) \rangle + b,   (5)

where \langle \cdot, \cdot \rangle defines an inner-product in the higher dimensional space and w, b are the parameters of the hyper-plane. The hyper-plane parameter w is obtained as a linear expansion over the training features \Phi(x_n), n = 1, \ldots, N, as w = \sum_{n=1}^{N} a_n \Phi(x_n), where the a_n are the expansion coefficients. Accordingly, the inner-products in the expression for f(z) convert into kernel expansions over the training data x_n, n = 1, \ldots, N, by transforming the data to the feature space according to

f(z) = \langle w, \Phi(z) \rangle + b = \sum_{n=1}^{N} a_n \langle \Phi(x_n), \Phi(z) \rangle + b = \sum_{n=1}^{N} a_n K(x_n, z) + b,   (6)

where K(\cdot, \cdot) denotes any symmetric positive-definite kernel that satisfies the Mercer condition and is given by K(x, z) = \langle \Phi(x), \Phi(z) \rangle, an inner-product in the higher dimensional feature space. For example, in Figure 6(c) the kernel function corresponding to \Phi(\cdot) is given by K(x, z) = (\langle x, z \rangle)^2. The use of a kernel function avoids the curse of dimensionality by avoiding direct inner-product computation in the higher-dimensional feature space. Some other examples of valid kernel functions are radial basis functions K(x_i, x_j) = \exp(-\sigma \|x_i - x_j\|^2) and polynomial functions K(x_i, x_j) = [1 + (x_i \cdot x_j)]^p. Training of the SVM involves estimating the parameters a_i, b that optimize a quadratic objective function. The exact form of the objective function depends on the topology of the SVM (soft-margin SVM [48], logistic SVM [49] or GiniSVM [50]), and there exist open-source software packages implementing these different algorithms. SVM based speaker verification systems typically consist of the following key steps (see Fig. 7):

■ Feature reduction and normalization: Due to variability in the duration of utterances, the objective of this step is to reduce/equalize the size of the feature representation to a fixed-length vector. One possible approach is to use clustering or random selection to determine a pre-determined number of representative vectors. Another approach uses the scores obtained from a generative model (GMM or HMM) as the fixed-dimensional input vector. The features are then scaled and normalized before being processed by an SVM.

■ Kernel modeling: The reduced and normalized feature vectors are used to model each speaker using different types of kernel functions, such as linear, quadratic, or exponential kernels.

For each frame of the feature vector corresponding to the “non-silence” segment of the speech signal, the SVM generates a score, and the scores are integrated over the entire utterance to obtain the final decision score. It is

Figure 7. Functional architecture of an SVM-based speaker verification system: (a) The extracted features are first aligned, reduced and normalized. The speaker specific and speaker non-specific features are combined to create a data set used for SVM training. (b) The soft-margin SVM determines the parameters of a hyperplane that separates the target and non-target data sets with the maximum margin.


important to note that since the scores are required to be integrated, the SVM outputs must be properly calibrated. In this regard, logistic SVMs and GiniSVMs are useful and have been shown to deliver more robust verification performance compared to traditional soft-margin SVMs. In more recent methods, SVM based discrimination has been applied to supervectors, which are single vectors that represent an entire utterance. A common approach to creating supervectors is to use the mean vectors of the trained GMM; these supervectors are then used as inputs to the SVM for training and verification [83].
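The sketch below illustrates the GMM-supervector idea with scikit-learn: per-utterance GMM mean matrices are stacked into supervectors and a linear SVM is trained to separate target from imposter utterances. The random matrices stand in for MAP-adapted means, and the dimensions and class sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
M, D = 64, 13                                   # GMM components and feature dimension (assumed)

def supervector(means):
    # Stack the (M, D) matrix of adapted GMM means into one M*D-dimensional vector
    return means.reshape(-1)

# Synthetic stand-ins for per-utterance adapted GMM means
target_utts = [supervector(rng.normal(0.5, 1.0, (M, D))) for _ in range(20)]
imposter_utts = [supervector(rng.normal(0.0, 1.0, (M, D))) for _ in range(200)]

X = np.vstack(target_utts + imposter_utts)
y = np.array([1] * len(target_utts) + [-1] * len(imposter_utts))

svm = SVC(kernel='linear').fit(X, y)             # discriminative target-vs-imposter model
test = supervector(rng.normal(0.5, 1.0, (M, D))).reshape(1, -1)
print(svm.decision_function(test))               # positive scores support the target hypothesis
```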

Artificial Neural Networks (ANNs). Artificial neural networks [51] have also been used in speaker verification systems and are based on discriminant learning. One such example of an ANN is the Multilayer Perceptron (MLP), which is a feed-forward neural network comprising multiple layers, with each layer comprising multiple nodes (as shown in Figure 6(b)). Each node computes a linear weighted sum over its input connections, where the weights of the summation are the adjustable parameters. A non-linear transfer function is applied to the result to compute the output of that node. The weights of the network are estimated by gradient descent based on the back-propagation algorithm. An MLP for speaker verification classifies speaker and impostor accesses by scoring each frame of the test utterance. The final utterance score is the mean of the MLP's output over all the frames in the utterance. Despite their discriminative power, MLPs present some disadvantages. The main disadvantage is that their optimal configuration is not easy to select, and a lot of data is needed for the training and the cross-validation steps.

3.2.3 System Fusion

Fusion refers to the process of combining information from multiple sources of evidence to improve the performance of the system. The technique has also been applied to speaker verification, where a number of different feature sets are extracted from the speech signal and a different classifier is trained on each feature set. The scores produced by each classifier are then combined to arrive at a decision. Ideally, the information contained in the different features should be independent of each other, so that each classifier focuses on different regions of the discrimination boundary. Fig. 8 shows an example of a fusion technique that combines “low-level” features like cepstrum or pitch with “high-level” features like prosody or other conversational patterns. However, performance gains can also be obtained by fusion of different low-level spectral features (e.g., MFCCs and LPCCs), as they contain some independent spectral information.
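A minimal sketch of score-level fusion is shown below: calibrated scores from two classifiers (say, a cepstral system and a prosodic system) are combined with a weighted sum before thresholding. The weights and threshold are illustrative assumptions; in practice they would be tuned on development data.

```python
import numpy as np

def fuse_scores(scores, weights):
    """Weighted linear fusion of calibrated per-classifier scores."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, scores) / weights.sum())

# Example: cepstral-system score and prosodic-system score for one test utterance
fused = fuse_scores([1.8, 0.4], weights=[0.7, 0.3])
decision = "accept" if fused > 1.0 else "reject"     # threshold chosen on development data
print(fused, decision)
```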

3.3 Authentication

The authentication module uses the integrated likelihood scores to determine whether the utterance belongs to the target speaker or to an imposter. Mathematically, the task is equivalent to hypothesis testing: given a speech segment X and a claimed identity S, the speaker verification system chooses one of the following hypotheses:

H_s: X is pronounced by S,
\bar{H}_s: X is not pronounced by S.

The decision between the two hypotheses is usually based on a likelihood ratio given by

L(X) = \frac{p(X \mid H_s)}{p(X \mid \bar{H}_s)} \;\; \begin{cases} \geq \Theta & \text{accept } H_s \\ < \Theta & \text{accept } \bar{H}_s \end{cases}   (7)

where p(X|H_s) and p(X|\bar{H}_s) are the integrated likelihood scores (probability density functions) generated by the classifiers and \Theta is the threshold used to accept or reject H_s. Setting the threshold \Theta appropriately for a specific speaker verification application is a challenging task, since it depends on environmental conditions like SNR. The threshold is usually chosen during the development phase and is speaker-independent. However, to be more accurate, the threshold parameter should be chosen to reflect the speaker's peculiarities and the inter-speaker variability. Furthermore, if there is a mismatch between development and test data, the optimal operating point could be different from the pre-determined threshold.
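In GMM/UBM systems the two likelihoods in Eq. (7) are typically accumulated frame by frame in the log domain; the sketch below scores an utterance as the average log-likelihood ratio between a target GMM and a UBM and compares it with a threshold. The diagonal-covariance likelihood function and the threshold value are illustrative assumptions.

```python
import numpy as np

def gmm_frame_loglik(X, w, mu, var):
    """Per-frame log p(x_t) under a diagonal-covariance GMM (w, mu, var as in Section 3.2.1)."""
    log_p = -0.5 * (((X[:, None, :] - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=2)
    return np.logaddexp.reduce(np.log(w) + log_p, axis=1)

def verify_utterance(X, speaker_model, ubm, log_theta=0.0):
    """Eq. (7) in the log domain: accept when the average log-likelihood ratio exceeds log(theta)."""
    llr = float(np.mean(gmm_frame_loglik(X, *speaker_model) - gmm_frame_loglik(X, *ubm)))
    return llr >= log_theta, llr

# Example with toy single-component "models" (weights, means, variances)
D = 13
speaker = (np.array([1.0]), np.zeros((1, D)) + 0.5, np.ones((1, D)))
ubm = (np.array([1.0]), np.zeros((1, D)), np.ones((1, D)))
X = np.random.randn(100, D) + 0.5
print(verify_utterance(X, speaker, ubm))
```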

Figure 8. Fusion of low-level and high-level features to improve the performance of the speaker verification system.


4. Robust Speaker Verification

As mentioned in the introduction, even though speaker verification has been the subject of research for the last four decades, there still exist many challenges and research opportunities. One of the challenges is dealing with a limited amount of enrollment or training data for the speaker of interest. This situation is especially true in forensic applications, where access to speech samples of the target speaker is limited. Also, the available training samples could be noisy, as they could have been acquired over unreliable channels. Unfortunately, the methods used for long duration samples, like MAP techniques, cannot be readily generalized to short duration tasks, as has been shown in [52]. For very short enrollment data, alternative adaptation methods like maximum likelihood linear regression (MLLR) for training the GMM have been shown to be more effective [53, 54]. The joint factor analysis (JFA) technique has also been reported to be useful for short training samples [52]. As a biometric method, speaker verification is also affected by artifacts that arise due to intra-speaker variability. For example, the speaker's voice could change due to aging, illness, emotions, tiredness and potentially other cosmetic factors, and the model trained during the enrollment phase might not represent all possible states of the speaker. As a result, the accuracy of a speaker recognition system has been shown to degrade slowly over time [55, 56]. One of the proposed solutions to this problem is an incremental enrollment technique which captures both the short- and long-term evolution of a speaker's voice [57]. Generally, intra-speaker variability affects the speaker model and scoring, and therefore the common way to address this issue is to use speaker model adaptation (presented in section 4.2). Other challenges in speaker verification include mismatch between the recording conditions during the enrollment and the verification phases. This mismatch can be introduced by differences in the telephone handset, transmission channel and recording devices, which decrease the accuracy of the verification system.

In this section we present an overview of different statistical techniques that are currently being used to design speaker verification systems that are robust to environmental and aging artifacts. The methods span both the “low-level” and “high-level” features, which are used in conjunction with robust speaker modeling techniques. A generic framework that models artifacts in a speaker verification system is shown in Figure 9, where the sources of interference could arise either from additive channel noise or from convolutive channel effects. Figure 9 also shows an example where the MFCC spectrum degrades significantly with the addition of “white” noise. To make verification systems more robust to channel variations, state-of-the-art systems either use a noise-robust feature extraction algorithm or suitably adapt the speaker specific and background models. Figure 9(b)

Figure 9. (a) Equivalent model of additive and channel noise in a speaker verification system. (b) Different techniques used for designing robust speaker verification systems. (c) Degradation of MFCC features with decreasing SNR: (1) clean, (2) 30 dB, (3) 10 dB in the presence of white noise. The vertical axes denote the dimension of the feature vector, and the horizontal axes denote time-frames.

[Figure 9(a) graphic: clean speech x(t) passes through the channel h(t) and is corrupted by additive noise n(t), giving noisy speech y(t) = x(t) * h(t) + n(t). Figure 9(b) graphic: feature-space compensation maps noisy speech toward clean acoustic models, while model-space compensation adapts clean models toward the noisy verification conditions.]


summarizes the approaches that have been used and in the following section we present an overview of some of these techniques.

4.1 Robust Feature Extraction

Different feature-based approaches have been proposed to compensate for cross-channel effects, including well-known and widely used techniques such as cepstral mean subtraction (CMS) [58], RASTA filtering [59], and variance normalization [26], as well as more recently developed techniques for speaker recognition systems like feature warping, stochastic matching [60], and feature mapping [61]. Here we present a brief overview of these techniques.

Cepstral mean subtraction: In the Cepstral Mean Subtraction (CMS) method, the mean of the cepstral coefficients (MFCC), computed over the utterance (or a longer window of frames), is subtracted from the coefficients of each frame. The rationale behind CMS is based on “homomorphic” filtering principles, where it can be shown that slow variations in channel conditions are reflected as offsets in the MFCC coefficients. However, CMS is not suitable for an additive white noise channel. Also, in addition to mean subtraction, sometimes the variance of the coefficients is also normalized to improve the noise robustness of the cepstral features.
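A minimal sketch of CMS with optional variance normalization is shown below; it operates on one utterance's MFCC matrix, and the function name and epsilon constant are illustrative choices.

```python
import numpy as np

def cepstral_mean_subtraction(feats, normalize_variance=False):
    """feats: (T, n_ceps) cepstral features for one utterance."""
    out = feats - feats.mean(axis=0, keepdims=True)          # remove the channel offset
    if normalize_variance:
        out = out / (feats.std(axis=0, keepdims=True) + 1e-10)
    return out

# Example: apply CMS (+ variance normalization) before enrollment/verification
mfcc_feats = np.random.randn(300, 13) + 5.0                  # stand-in utterance with an offset
normalized = cepstral_mean_subtraction(mfcc_feats, normalize_variance=True)
print(normalized.mean(axis=0).round(6))                      # ~0 per coefficient
```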

RASTA filtering: RASTA (RelAtive SpecTrA) is a generalization of the CMS method to compensate for cross-channel mismatch. The method was first introduced to enhance the robustness of speech recognition systems and has since also been used in speaker recognition systems. In RASTA filtering, the low and high frequency components in the cepstral coefficients are removed using cepstral band-pass filters.

Kernel filtering: While the spectral features (MFCC and LPC) accurately extract linear information from speech signals, by construction they do not capture information about the nonlinear or higher-order statistical characteristics of the signals, which have been shown to be not insignificant [62, 63]. One of the hypotheses that we have been investigating in our group is that many of the non-linear features in speech remain intact even when the speech signal is corrupted by channel noise. Previous studies in this area have approximated auditory time-series by a low-dimensional non-linear dynamical model. In [63], it was demonstrated that sustained vowels from different speakers exhibit a nonlinear, non-chaotic behavior that can be embedded in a low-dimensional manifold of order less than four. Other non-linear speech feature extraction approaches include non-linear transformation/mapping [64, 65], non-linear maximum likelihood feature transformation [66], kernel based time-series features [67, 68, 69], non-linear discriminant techniques [70], neural predictive coding [71] and other auxiliary methods [72, 73]. In [74], we presented a novel feature extraction technique that extracts robust non-linear manifolds embedded in the speech signal. The method is called kernel predictive coefficients (KPC), and it uses the non-linear filtering properties of a functional regression procedure in a reproducing kernel Hilbert space (RKHS). The procedure is semi-parametric and does not make any assumptions about the channel statistics. Figure 10 shows the key steps involved in KPC feature extraction. First, a non-linear function is regressed to a windowed segment of the speech signal, and the function is then projected onto a low-dimensional manifold. A DCT is then applied to the parameters of the manifold to obtain the KPC features. The figure also shows examples of the KPC spectra, where the features remain intact even when the noise power increases. Other extensions of the KPC features have been reported in which different projection techniques have been employed.

Figure 10. (a) Kernel filtering approach for robust speaker verification: the approach uses non-linear smoothing techniques to filter out noise and yet retain higher-order speaker specific discriminatory information. The approach consists of two key steps: 1) non-linear regression, which uses an SVM based approach to estimate a function which fits the input speech data; and 2) a projection step which maps the high-dimensional regression function onto a low-dimensional manifold. (b) Sample results showing that the kernel filtering features are robust to additive noise. Vertical axes denote the dimension of the vector, and horizontal axes denote time-frames.


Feature warping: The feature warping [75] approach aims to construct a more robust cepstral feature distribution by whitening, i.e., by generating an equivalent normal distribution over each frame of speech. This method delivers more robust performance than the mean and variance normalization technique; however, the approach is more computationally intensive.

Feature mapping: Feature mapping [76] is a supervised normalization technique which transforms channel specific features to a channel independent feature space such that the channel variability is reduced. This is achieved with a set of channel dependent GMMs which are adapted from a channel-independent root model. During the verification phase, the most likely channel (highest GMM likelihood) is detected, and the relationship between the root model and the channel-dependent model is used for mapping the vectors into the channel-independent space.

4.2. Robust Speaker Modeling

Several session compensation techniques have recently been developed for both GMM and SVM-based speaker models. Factor analysis (FA) techniques [77] were designed for the GMM-based recognizer and make explicit use of the stochastic properties of the GMM, whereas the methods developed for SVM-based models are often based on linear transformations. One such linear-transform based approach uses Maximum Likelihood Linear Regression (MLLR) to transform the input parameters of the SVM. MLLR transforms the mean vectors of a speaker-independent model as m'_k = A m_k + b, where m'_k is the adapted mean vector, m_k is the world-model mean vector, and A and b are the parameters of the linear transform. A and b are estimated by maximizing the likelihood of the training data with a modified EM algorithm. Other normalization techniques for SVMs have also been reported, including nuisance attribute projection (NAP) [78, 79], which uses the concept of eigenchannels, and within-class covariance normalization (WCCN) [80, 81], which reweights each dimension based on techniques like principal component analysis (PCA). Nuisance attribute projection uses a projection matrix P in the feature space to remove subspaces that contain unwanted channel or session variability from the GMM supervectors. The projection matrix filters out the nuisance attributes (e.g., session/channel variability) in the feature space via P = I - U U^T, where U is the eigenchannel matrix. NAP requires a corpus labeled with speaker and/or session information.
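The following sketch illustrates NAP on GMM supervectors: the nuisance subspace U is estimated here with a plain PCA over within-speaker supervector residuals (one common choice, assumed for illustration rather than taken from [78, 79]), and P = I - U U^T is applied without forming P explicitly.

```python
# A minimal sketch of nuisance attribute projection (NAP) on supervectors.
import numpy as np

def estimate_eigenchannels(supervectors, speaker_ids, n_channels=2):
    """Estimate nuisance directions from supervectors of the same speaker
    recorded over different sessions/channels."""
    X = []
    for spk in set(speaker_ids):
        rows = supervectors[np.array(speaker_ids) == spk]
        X.append(rows - rows.mean(axis=0))        # remove the speaker part
    X = np.vstack(X)
    # Leading principal directions of the residual = nuisance subspace.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_channels].T                      # columns of U

def nap_project(supervector, U):
    """Apply P = I - U U^T without forming P explicitly."""
    return supervector - U @ (U.T @ supervector)

# Toy example: 4 speakers x 3 sessions of 10-dimensional supervectors.
rng = np.random.default_rng(1)
sv = rng.standard_normal((12, 10))
ids = [i // 3 for i in range(12)]
U = estimate_eigenchannels(sv, ids)
print(nap_project(sv[0], U).shape)
```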

The underlying principle behind factor analysis (FA) when applied to GMMs is the following: when speech samples are recorded from different handsets, the supervectors, i.e., the stacked means of the GMMs, can vary and hence require some form of channel compensation and calibration before they can be compared. For channel compensation to be possible, the channel variability has to be modeled explicitly, and the technique that has been used for this is called joint factor analysis (JFA) [77, 82]. The JFA model considers the variability of a Gaussian supervector as a linear combination of speaker and channel components. Given a training sample, the speaker-dependent and channel (session) dependent supervector M is decomposed into two statistically independent components as M = s + c, where s and c are referred to as the speaker and channel (session) supervectors, respectively. The channel variability is explicitly modeled by a channel model of the form c = Ux, where x contains the channel factors estimated from a given speech utterance and the columns of the matrix U are the eigenchannels estimated for a given data set. During enrollment, the channel factors x are estimated jointly with the speaker factors y of the speaker model of the form s = m + Vy + Dz, where m is the UBM supervector, V is a rectangular matrix whose columns are referred to as the eigenvoices, D is a parameter matrix of JFA, and z is a latent variable vector. In this formulation, JFA can be viewed as a two-step generative model of different speakers under different sessions: the core JFA model comprises the first level, and the second (output) level is the GMM generated from the first level. Considering all the parameters that affect the mean of each component in the output GMM, the mean of the session-dependent GMM can be expressed as

M_ki = m_k + U_k x_i + V_k y_s(i) + D_k z_k,s(i)

where the index k corresponds to the GMM component, i to the session, and s(i) to the speaker in session i. The system parameters are m_k, U_k, V_k, and D_k, while x_i, y_s(i), and z_k,s(i) are hidden session and speaker variables. For specifics about the estimation of the model parameters, the reader is referred to [82].
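The toy sketch below simply composes the session-dependent means according to the JFA equation above; the matrix dimensions and random factor values are placeholders, and the estimation of the hidden variables (see [77, 82]) is not shown.

```python
# A toy sketch of composing session-dependent GMM means in the JFA model
# M_ki = m_k + U_k x_i + V_k y_s(i) + D_k z_k,s(i).
import numpy as np

def session_dependent_means(m, U, V, D, x, y, z):
    """Compose per-component means for one session of one speaker.

    m : (K, F)     UBM component means
    U : (K, F, Ru) eigenchannel blocks per component
    V : (K, F, Rv) eigenvoice blocks per component
    D : (K, F)     per-component residual scaling
    x : (Ru,)      channel factors for this session
    y : (Rv,)      speaker factors
    z : (K, F)     speaker-specific residual factors
    """
    K, F = m.shape
    M = np.empty((K, F))
    for k in range(K):
        M[k] = m[k] + U[k] @ x + V[k] @ y + D[k] * z[k]
    return M

# Toy sizes: 4 components, 6 feature dims, 2 channel and 3 speaker factors.
rng = np.random.default_rng(0)
K, F, Ru, Rv = 4, 6, 2, 3
M = session_dependent_means(rng.standard_normal((K, F)),
                            rng.standard_normal((K, F, Ru)),
                            rng.standard_normal((K, F, Rv)),
                            rng.standard_normal((K, F)),
                            rng.standard_normal(Ru),
                            rng.standard_normal(Rv),
                            rng.standard_normal((K, F)))
print(M.shape)   # (4, 6)
```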

In another approach [83], the GMM and SVM principles are combined to achieve robustness: the generative GMM-UBM model is used to create "feature vectors" for discriminative SVM speaker modeling. For example, the means and variances of the GMM-UBM components can be used as feature vectors for SVM training. When the means of the GMMs are stacked and normalized by their variances, the resulting feature vectors are known as supervectors, which have been widely used in SVM training. The SVM kernel function can also be chosen so that it reflects the distance between the pdfs generated by the GMMs. One such measure is the Kullback-Leibler (KL) divergence between GMMs. Another extension is the GMM-UBM mean interval (GUMI) kernel, which uses a bounded Bhattacharyya distance [84]. The GUMI kernel exploits the speaker information conveyed by the means of the GMM, as well as by the covariance matrices, in an effective manner. Another alternative kernel, known as the probabilistic sequence kernel (PSK) [85], uses the output values of the Gaussian functions rather than the Gaussian means to create supervectors. Other SVM approaches based on Fisher kernels [86] and probabilistic distance kernels [87] have also been introduced; these use generative sequence models for SVM speaker modeling. Similar hybrid methods have been used for HMMs and SVMs, but for applications in speech recognition.
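As an illustration of the supervector idea, the sketch below stacks variance-normalized, weight-scaled component means into a supervector and evaluates the resulting linear (KL-motivated) kernel between two utterances; diagonal UBM covariances and the toy dimensions are assumptions of this sketch.

```python
# A minimal sketch of GMM supervectors and the linear kernel that
# approximates the KL divergence between two adapted GMMs.
import numpy as np

def gmm_supervector(means, ubm_weights, ubm_vars):
    """Stack adapted component means, scaled by sqrt(w_k) and the inverse
    UBM standard deviation, into a single supervector."""
    scaled = np.sqrt(ubm_weights)[:, None] * means / np.sqrt(ubm_vars)
    return scaled.ravel()

def kl_linear_kernel(sv_a, sv_b):
    """With the scaling above, the KL-motivated kernel reduces to a
    plain inner product between supervectors."""
    return float(sv_a @ sv_b)

# Toy UBM with 3 diagonal-covariance components in 4 dimensions.
rng = np.random.default_rng(2)
w = np.array([0.5, 0.3, 0.2])
var = np.abs(rng.standard_normal((3, 4))) + 0.5
mu_a = rng.standard_normal((3, 4))     # adapted means, utterance A
mu_b = rng.standard_normal((3, 4))     # adapted means, utterance B
print(kl_linear_kernel(gmm_supervector(mu_a, w, var),
                       gmm_supervector(mu_b, w, var)))
```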

4.3. Score Normalization

As the name suggests, score normalization techniques aim to reduce score variability across different channel conditions. The process is equivalent to adapting the speaker-dependent threshold that was briefly discussed in Section 2.3. Most of the normalization techniques used in speaker verification are based on the assumption that the impostors' scores follow a Gaussian distribution whose mean and standard deviation depend on the speaker model and/or the test utterance. Different score-based normalization techniques have been proposed, including Znorm [88], Hnorm [89], Tnorm [90], and Dnorm [91]. We describe some of these techniques in this section.

Znorm: In the zero normalization (Znorm) technique, a speaker model is first tested against a set of speech signals produced by impostors, resulting in an impostor similarity score distribution. Speaker-dependent mean and variance normalization parameters are then estimated from this distribution. One advantage of Znorm is that the estimation of the normalization parameters can be performed offline during the enrollment phase.

Tnorm: The test normalization (Tnorm) is another score normalization technique, in which the mean and standard deviation parameters are estimated from the test utterance scored against a set of impostor models. Tnorm is known to improve performance, particularly in the low false alarm region. However, Tnorm has to be performed online while the system is being evaluated.
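The two normalizations can be summarized in a few lines; in this sketch the impostor scores are synthetic placeholders standing in for the outputs of an actual verification system.

```python
# A compact sketch of Znorm and Tnorm score normalization.
import numpy as np

def znorm(raw_score, impostor_scores_for_model):
    """Znorm: normalize with statistics of impostor *utterances* scored
    against the claimed speaker model (computable offline at enrollment)."""
    mu, sigma = np.mean(impostor_scores_for_model), np.std(impostor_scores_for_model)
    return (raw_score - mu) / sigma

def tnorm(raw_score, impostor_scores_for_utterance):
    """Tnorm: normalize with statistics of the *test utterance* scored
    against a cohort of impostor models (must be computed at test time)."""
    mu, sigma = np.mean(impostor_scores_for_utterance), np.std(impostor_scores_for_utterance)
    return (raw_score - mu) / sigma

rng = np.random.default_rng(3)
print(znorm(2.1, rng.normal(0.0, 1.0, 200)))
print(tnorm(2.1, rng.normal(0.3, 0.8, 50)))
```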

There are several variants of Znorm and Tnorm that aim to reduce the effects of the microphone and the transmission channel. Among the variants of Znorm are handset normalization (Hnorm) and channel normalization (Cnorm). In these approaches, handset- or channel-dependent normalization parameters are estimated by testing each speaker model against a handset- or channel-dependent set of impostor utterances. During testing, the type of handset or channel associated with the test utterance is first detected, and the corresponding set of parameters is used for score normalization. HTnorm, a variant of Tnorm, uses basically the same idea as Hnorm: handset-dependent normalization parameters are estimated by testing each test utterance against handset-dependent impostor models.

Dnorm: Both Tnorm and Znorm rely on the availability of impostor data. When impostor data is not available, an alternative normalization called Dnorm can be applied [91], in which pseudo-impostor data are generated from the trained background model using Monte-Carlo techniques.
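A simplified sketch of the Dnorm idea follows: pseudo-impostor feature vectors are sampled from a (toy) background GMM and scored to obtain normalization statistics. The tiny diagonal-covariance models and the scoring function below are placeholders, not the exact procedure of [91].

```python
# A sketch of generating pseudo-impostor data from a background GMM by
# Monte-Carlo sampling and scoring it, in the spirit of Dnorm [91].
import numpy as np

def sample_from_gmm(weights, means, variances, n, rng):
    """Draw n feature vectors from a diagonal-covariance GMM."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return means[comps] + np.sqrt(variances[comps]) * rng.standard_normal((n, means.shape[1]))

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal GMM."""
    # log N(x | mu_k, var_k) for every frame and component.
    ll = (-0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                  + (((X[:, None, :] - means) ** 2) / variances).sum(axis=2)))
    return np.mean(np.log(np.exp(ll) @ weights))

rng = np.random.default_rng(4)
w = np.array([0.6, 0.4]); mu = rng.standard_normal((2, 3)); var = np.ones((2, 3))
pseudo = sample_from_gmm(w, mu, var, 500, rng)   # Monte-Carlo pseudo-impostors
print(gmm_loglik(pseudo, w, mu, var))            # score used for normalization statistics
```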

5. Performance of Speaker Verification Systems

In this section, we present some of the metrics that are used to evaluate the performance of a speaker verification system.

5.1. Evaluation Metrics

Typically, the performance of a speaker verification system is determined by the errors generated during recognition. There are two types of errors that can occur during a verification task: (a) false acceptance, when the system accepts an impostor speaker; and (b) false rejection, when the system rejects a valid speaker. Both types of errors are a function of the decision threshold described in Section 2.3. Choosing a high acceptance threshold results in a secure system that accepts only a few trusted speakers, at the expense of a high false rejection rate (FRR). Similarly, choosing a low threshold makes the system more user friendly by reducing the false rejection rate, but at the expense of a high false acceptance rate (FAR). This trade-off is typically depicted using a detection error trade-off (DET) curve [74], an example of which is shown in Figure 11. The FAR and FRR of a verification system define different operating points on the DET curve. These operating points (shown in Figure 11) vary according to their definition and are considered different performance metrics



of the speaker verification system. We describe the commonly used ones below:

Detection Cost Function (DCF): The DCF is a weighted sum of the two error rates and is computed as follows:

DCF = (C_FRR × FRR × P_Targ) + (C_FAR × FAR × (1 − P_Targ)),   (8)

where C_FAR and C_FRR denote the cost of false acceptance and the cost of false rejection, respectively, and P_Targ denotes the prior probability that the utterance belongs to the target speaker. For instance, in the evaluations conducted by the National Institute of Standards and Technology (NIST), C_FRR, C_FAR and P_Targ are set to 10, 1 and 0.01, and the DCF is normalized as DCF_Norm = DCF/C_Default, where C_Default = 0.1. The minimum DCF (min DCF), a commonly reported performance metric of the verification system, is the smallest value of (8) obtained when the decision threshold (see Section 2.3) is swept over its entire range. A related metric is the actual DCF, which is the value of (8) obtained on the test set at the decision threshold actually chosen by the system (for example, on a cross-validation set). Examples of the min DCF and actual DCF metrics are shown on the DET curve in Figure 11.

Equal Error Rate: An alternative performance measure for speaker verification is the equal error rate (EER), which is defined as the error rate at the operating point where the FAR is equal to the FRR (see Figure 11). Thus, the smaller the EER, the better the verification system.
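The sketch below computes FAR and FRR over a sweep of decision thresholds from synthetic target and impostor scores and reports the EER and the normalized minimum DCF of (8), using the NIST cost setting quoted above.

```python
# A small sketch that sweeps decision thresholds, computes FAR/FRR, and
# reports the EER and the normalized minimum DCF of Eq. (8); the scores
# below are synthetic stand-ins for real system outputs.
import numpy as np

def det_points(target_scores, impostor_scores):
    """FAR and FRR for every candidate threshold (all observed scores)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    return far, frr

def eer_and_min_dcf(target_scores, impostor_scores,
                    c_frr=10.0, c_far=1.0, p_targ=0.01, c_default=0.1):
    far, frr = det_points(target_scores, impostor_scores)
    eer = float(far[np.argmin(np.abs(far - frr))])              # point where FAR ~ FRR
    dcf = c_frr * frr * p_targ + c_far * far * (1.0 - p_targ)   # Eq. (8)
    return eer, float(dcf.min() / c_default)                    # normalized min DCF

rng = np.random.default_rng(5)
tgt = rng.normal(1.5, 1.0, 200)      # scores of genuine trials
imp = rng.normal(0.0, 1.0, 2000)     # scores of impostor trials
print(eer_and_min_dcf(tgt, imp))
```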

5.2. Speaker Verification Databases and Research Groups

Even though several speech recognition databases like TIMIT, TIDIGITS and AURORA have also been used for speaker verification, there are several databases, like YOHO, which have traditionally been dedicated to speaker verification. In Table 1 we summarize some of the corpora that have been used for speaker recognition/verification tasks.

Since 1996, the speech group of the National Institute of Standards and Technology (NIST) has been organizing evaluations of text-independent speaker recognition/verification technologies. During each evaluation, a common data set and evaluation protocol are provided to all of the participating research groups. The objective is to provide a fair comparison between different speaker verification systems, even though the identity of the systems is not publicly revealed.

Figure 11. An example of a DET curve, which plots the FRR against the FAR, with the EER, min DCF and actual DCF operating points marked.


Table 1. Corpus description for the speaker recognition/verification task.

Corpus name | Number of speakers | Sessions/speaker | Speech type | Channels | Acoustic environment
TIMIT | 630 | 1 | Read sentences | Clean | Sound booth
SIVA | 671 | 1–26 | Prompted words and digits, short questions and read text | PSTN | Home/office
King | 51 | 10 | Descriptions of photographs | Clean / PSTN | Sound booth
YOHO | 138 | 4 enrollment, 10 verification | Prompted digit phrases | Clean | Office
Switchboard I–II | 543–657 | 1–25 | Conversational | PSTN | Home/office
SPIDRE | 45 target, 160 impostor | 1–25 | Conversational | PSTN | Home/office
POLYCOST | 133 | more than 5 | Fixed and prompted digits, read sentences, free monologue | ISDN | Home/office
PolyVar | 143 | 1–229 | Prompted digits, words and sentences, and spontaneous speech | PSTN | Home/office
AHUMADA | 104 | 6 | Prompted digits, phrases and text, and spontaneous speech | PSTN | Home/office


The most recent evaluation from NIST for speaker verification/recognition is the NIST SRE 2008 [75], which evaluates different systems under mismatched conditions. For instance, different microphones were used for training and evaluation of the systems. For each of the mismatched conditions, the DCFs of the different verification systems were compared. Depending on the training-test conditions, the min DCFs for the best verification systems evaluated on the NIST SRE 2008 database varied from 0.1151–0.1257 (for matched conditions) to 0.278–0.4887 (for mismatched conditions). The evaluation reiterates the fact that existing state-of-the-art speaker verification systems still cannot cope with the challenges posed by mismatch between training and test conditions. This topic is currently being actively researched by several groups in academia as well as in industry. We summarize some of these groups in Table 2, with the understanding that the list might not be fully exhaustive.

6. Conclusions, Applications and Future Trends

In this paper, we have presented an overview of the classical and emerging statistical techniques that are popular for automatic text-independent speaker verification systems. The main application for the technology is in the area of access control, where speakers are required to be authenticated before they can be allowed access to certain facilities, for example in call centers for applications such as account access, password reset, restricted services, etc. The future trend in access control is to integrate speaker verification technology into a multi-level, hybrid authentication approach, where results from different biometric technologies like fingerprint, face, iris and speaker recognition can be fused together to achieve better reliability in authentication. However, the biggest advantage of speech-based biometrics is the ability to perform authentication when direct physical or visual contact with the subject is not feasible. Thus the technology has a clear advantage for authenticating transactions that occur over the voice channel, like telebanking. Other applications of speaker recognition and verification systems include their use in surveillance systems, where an embedded sensor could be used to automatically detect the presence of human targets.

Voice-based indexing is another emerging application, where speaker recognition and verification techniques could be used to search audio and media files. The need for such applications comes from the movie industry and the media-related industries. Other applications include voice-mail browsing or intelligent answering machines, which use speaker recognition to label incoming voice mail with the speaker's name for browsing and/or action (e.g., a personal reply).

Table 2. Research groups who are actively working in the area of speaker verification.

Site | System Description
Brno Univ. of Technology (BUT) [92] | Eigenchannel GMM, MLLR-SVM, GMM-SVM
Queensland Univ. of Technology (QUT) [93] | GMM supervectors, SVM-NAP classifier
University of Stellenbosch (SUN) [92] | GMM-SVM, MLLR-SVM
Centre de Recherche Informatique de Montreal (CRIM) [94] | GMM-UBM, FA
MIT Lincoln Laboratory (MITLL) [95, 96] | GMM-UBM, SVM GMM supervectors, high-level features
SRI International (SRI) [97] | Fusion: ASR dependent
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek (TNO) [92] | GMM-SVM
Int. Computer Science Inst. (ICSI) [98] | Fusion: SVM-based lattice N-gram, cepstral GMM, non-parametric, conditional HMM
Michigan State University (AIM) [74] | SVM-based
Swansea University [99] | GMM-UBM, FA, NAP
University of Avignon (LIA) [100] | GMM-UBM
Univ. of Texas at Dallas (CRSS) [101] | GMM-UBM
IBM T. J. Watson Research Center (IBM) [102] | GMM supervectors, NAP, discriminative modeling
LIMSI, CNRS (LIM) [103] | MLLR-SVM



For speech skimming or audio mining applications, speaker recognition can be used to annotate recorded meetings or videos with speaker labels for quick indexing and filing. A more controversial application of speaker verification technology is in the area of forensics, where the results of the technique could be offered as evidence in judicial trials. Compared to fingerprinting and DNA-based authentication technology, existing speaker verification techniques have drawbacks and limitations due to their sensitivity to corruption by noise and the possibility of masquerading the signal using voice recording devices.

We believe that there is enormous potential for speaker verification and recognition technology in multimedia and biometric applications. However, key challenges remain to be solved and are currently limiting the wide-scale deployment of the technology. These challenges motivate further research and investment in the following important directions:

Exploitation of high-level information: In addition to the low-level spectral features used by current systems, there are many other sources of speaker information in the speech signal that can be used. These include idiolect (word usage), prosodic measures and other long-term signal measures. This work will be aided by the increasing use of reliable speech recognition systems in speaker verification R&D. High-level features not only offer the potential to improve accuracy, they may also help improve robustness, since they should be less susceptible to channel effects; recent research in this regard shows very promising results.

Focus on real-world robustness: Speaker recognition continues to be a data-driven field, setting the lead among biometrics in conducting benchmark evaluations and research on realistic data. The continued ease of collecting and making available speech from real applications means that researchers can focus on the real-world robustness issues that appear. Obtaining speech from a wide variety of handsets, channels and acoustic environments will allow examination of problem cases and the development and application of new or improved compensation techniques.

Emphasis on remote applications: With on-site and text-dependent systems making commercial headway, R&D effort will shift to the more difficult issues in unconstrained and remote situations. These include variable channel and noise conditions, text-independent speech, and the task of performing speaker verification in the background.

Acknowledgments

This work was supported in part by a grant from the National Science Foundation: IIS:0836278.

Amin Fazel (S’07) received the B.Sc. degree in computer science and engineering from Shiraz University, Shiraz, Iran, in 2002 and the M.Sc. degree in computer engineering from Sharif University of Technology, Tehran, Iran, in 2005. Currently, he is pursuing the Ph.D. degree in electrical and computer engineering at Michigan State University. His research interests include speech processing, robust speech/speaker recognition, acoustic source separation, and analog-to-information converters.

Shantanu Chakrabartty received his B.Tech. degree from the Indian Institute of Technology, Delhi, in 1996, and M.S. and Ph.D. degrees in Electrical Engineering from Johns Hopkins University, Baltimore, MD, in 2002 and 2005, respectively. He is currently an associate professor in the department of electrical and computer engineering at Michigan State University. From 1996–1999 he was with Qualcomm Incorporated, San Diego, and during 2002 he was a visiting researcher at The University of Tokyo. His current research interests include energy-scavenging sensors and integrated circuits, hybrid circuits and systems, and ultra-low-power analog signal processing circuits. Dr. Chakrabartty is a recipient of the National Science Foundation CAREER award, Michigan State University’s teacher-scholar award, and a Catalyst Foundation fellowship from 1999–2003. He has published more than 80 refereed articles and is a senior member of the Institute of Electrical and Electronics Engineers (IEEE). He has served or is currently serving as an associate editor of the IEEE Transactions on Biomedical Circuits and Systems, an associate editor for the Advances in Artificial Neural Systems journal, and a review editor for the Frontiers of Neuromorphic Engineering journal.

References

[1] A. K. Jain, A. Ross, and S. Prabhakar, “An introduction to biometric recognition,” IEEE Trans. Circuits Syst. Video Technol. (Special Issue on Image- and Video-Based Biometrics), vol. 14, no. 1, Jan. 2004.
[2] “Speaker verification,” Biomet. Technol. Today, vol. 9, no. 7, pp. 9–11, July 2001.
[3] J. P. Campbell, W. Shen, W. M. Campbell, R. Schwartz, J.-F. Bonastre, and D. Matrouf, “Forensic speaker recognition,” IEEE Signal Processing Mag., vol. 26, no. 2, pp. 95–103, 2009.
[4] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and session variability in GMM-based speaker verification,” IEEE Trans. Audio, Speech Lang. Processing, vol. 15, no. 4, pp. 1448–1460, 2007.
[5] R. Vogt and S. Sridharan, “Explicit modeling of session variability for speaker verification,” Comput. Speech Lang., vol. 22, no. 1, pp. 17–38, 2008.
[6] B. S. Atal, “Automatic recognition of speakers from their voices,” Proc. IEEE, vol. 64, pp. 460–475, 1976.
[7] A. Rosenberg, “Automatic speaker verification: A review,” Proc. IEEE, vol. 64, no. 4, pp. 475–487, 1976.
[8] G. R. Doddington, “Speaker recognition—Identifying people by their voices,” Proc. IEEE, vol. 73, no. 11, pp. 1651–1664, 1985.
[9] D. O’Shaughnessy, Speech Communication, Human and Machine Digital Signal Processing. Reading, MA: Addison-Wesley, 1987.


[10] S. Furui, “Speaker-dependent-feature extraction, recognition and processing techniques,” Speech Commun., vol. 10, pp. 505–520, 1991. [11] E. Rosenberg and F. K. Soong, “Recent research in automatic speak-er recognition,” in Advances in Speech Signal Processing, S. Furui, and M. M. Sondhi, Ed. New York: Marcel Dekker, 1992, pp. 701–738. [12] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chag-nolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska, and D. A. Reynolds, “A tutorial on text-independent speaker verifi cation,” EURA-SIP J. Appl. Signal Processing (Special Issue on Biometric Signal Process-ing), vol. 4, pp. 430–451, 2004. [13] J. P. Campbell, “Speaker recognition: A tutorial,” Proc. IEEE, vol. 85, no. 5, pp. 1437–1462, 1997.[14] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Commun., vol. 52, pp. 12–40, 2010. [15] C. Müller, Ed., Speaker Classification. I: Fundamentals, Features, and Methods, (Lecture Notes in Computer Science, vol. 4343). Berlin: Springer-Verlag, 2007.[16] C. Müller, Ed., Speaker Classification II: Selected Projects (Lec-ture Notes in Computer Science, vol. 4441). Berlin: Springer-Verlag, 2007.[17] A. El Hannani, D. Petrovska-Delacrétaz, B. Fauve, A. Mayoue, J. Ma-son, J.-F. Bonastre, and G. Chollet, “Text-independent speaker verifi ca-tion,” in D. Petrovska-Delacrtaz, G. Chollet, and B. Dorizzi, Eds. Guide to Biometric Reference Systems and Performance Evaluation, 1st ed., Berlin: Springer-Verlag, 2009.[18] M. M. Homayounpour and G. Chollet, “Discrimination of voices of twins and siblings for speaker verifi cation,” in Proc. 4th European Conf. Speech Communication and Technology (EUROSPEECH ’95), Madrid, Spain, 1995, pp. 345–348.[19] A. Ariyaeeinia, C. Morrison, A. Malegaonkar, and S. Black, “A test of the effectiveness of speaker verifi cation for differentiating between identical twins,” Sci. Justice, vol. 48, no. 4, pp. 182–186, 2008.[20] W. Campbell, J. Campbell, D. Reynolds, D. E. Singer, and P. Torres-Carrasquillo, “Support vector machines for speaker and lan-guage recognition,” Comput. Speech Lang., vol. 20, no. 2–3, pp. 210–229, 2006.[21] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verifi cation using adapted Gaussian mixture models,” Digital Signal Process., vol. 10, no. 1, pp. 19–41, 2000.[22] M. S. Brandstein and D. B. Ward, Microphone Arrays: Signal Process-ing Techniques and Applications. New York: Springer-Verlag, 2001.[23] D. Reynolds, et al., “The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition,” in Proc. ICASSP, 2003, vol. 4, pp. 784–787. [24] J. P. Campbell, D. Reynolds, and R. Dunn, “Fusi ng high- and low-level features for speaker recognition,” in Proc. European Conf. Speech Communication and Technology (Eurospeech), Sept. 2003. [25] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, no. 4, pp. 561–580, 1975.[26] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.[27] H. Hermansky, “Perceptual linear prediction (plp) analysis of speech,” J. Acoust. Soc. Amer., vol. 87, no. 4, 1990. [28] S. Davis and P. Mermelstein, “Comparison of parametric repre-sentations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.[29] S. Furui, “Comparison of speaker recognition methods using static features and dynamic features,” IEEE Trans. 
Acoust., Speech, Signal Pro-cessing, vol. 29, no. 3, pp. 342–350, 1981.[30] T. Kinnunen and P. Alku, “On separating glottal source and vocal tract information in telephony speaker verifi cation,” in Proc. ICASSP, 2009, pp. 4545–4548.[31] M. Chetouani, M. Faundez-Zanuy, B. Gas, and J. L. Zarader, “Investi-gation on LP-residual presentations for speaker identifi cation,” Pattern Recognit., vol. 42, pp. 487–494, 2009. [32] N. Zheng, T. Lee, and P. C. Ching, “Integration of complementary acoustic features for speaker recognition,” IEEE Signal Processing Lett., vol. 14, no. 3, Mar. 2007, pp. 181–184.[33] K. S. R. Murty and B. Yegnanarayana, “Combining evidence from residual phase and MFCC features for speaker recognition,” IEEE Signal Processing Lett, vol. 13, no. 1, pp. 52–55, Jan. 2006.

[34] S. R. M. Prasanna, C. S. Gupta, and B. Yegnanarayana, “Extraction of speaker-specifi c excitation information from linear prediction resid-ual of speech,” Speech Commun., vol. 48, pp. 1243–1261, 2006. [35] J. Gudnason and M. Brookes, “Voice source cepstrum coeffi cients for speaker identifi cation,” in Proc. ICASSP, Las Vegas, 2008, pp. 4821–4824. [36] M. D. Plumpe, T. F. Quatieri, and D. A. Reynolds, “Modeling of the glottal fl ow derivative waveform with application to speaker identifi ca-tion,” IEEE Trans. Speech Audio Processing, vol. 7, no. 5, pp. 569–586, Sept. 1999.[37] R. E. Slyh, E. G. Hansen, and T. R. Anderson, “Glottal modeling and closed-phase analysis for speaker recognition,” in Proc. Speaker Odys-sey 2004, Toledo, May 2004, pp. 315–322.[38] A. Adami, R. Mihaescu, D. Reynolds, and J. Godfrey, “Modeling Pro-sodic Dynamics for Speaker Recognition,” in Proc. ICASSP 2003. [39] A. Peskin, J. Navratil, J. Abramson, D. Jones, D. Klusacek, D. Reyn-olds, and B. Xiang, “Using prosodic and conversational features for high-performance speaker recognition: Report from JHU WS’ 02,” in Proc. ICASSP 2003. [40] A. Reynolds, W. Andrews, J. Campbell, J. Navrátil, B. Peskin, A. Ad-ami, Q. Jin, D. Klusácek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, and B. Xiang, “The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition,” in Proc. ICASSP 2003. [41] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Recognition, 2nd ed. New York: Wiley-Interscience, 2000.[42] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Stat. Soc., vol. 39, no. 1, pp. 1–38, 1977.[43] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identifi cation using Gaussian mixture speaker models,” IEEE Trans. Speech, Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.[44] J. Mariéthoz and S. Bengio, “A comparative study of adaptation methods for speaker verifi cation,” in Proc. Int. Conf. Spoken Language Processing (ICSLP), Denver, CO, 2002, pp. 581–584. [45] C. Leggetter and P. Woodland, “Maximum, likelihood linear regres-sion for speaker adaptation of continuous density HMMs,” Comput. Speech Lang., vol. 9, pp. 171–185, 1995. [46] M. Hébert, “Text-dependent speaker recognition,” in Springer Handbook of Speech Processing, J. Benesty, M. Sondhi, and Y. Huang, Eds. Heidelberg: Springer Verlag, 2008, pp. 743–762.[47] V. Vapnik, The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995.[48] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2001.[49] T. Jaakkola and D. Haussler, “Probabilistic kernel regression mod-els,” in Proc. 7th Int. Workshop Artificial Intelligence and Statistics, 1999. [50] S. Chakrabartty and G. Cauwenberghs, “Gini-support vector ma-chine: Quadratic entropy based multi-class probability regression,” J. Mach. Learn. Res., vol. 8, pp. 813–839, Apr. 2007.[51] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994. [52] L. Burget, N. Brümmer, D. Reynolds, P. Kenny, J. Pelecanos, R. Vogt, F. Castaldo, N. Dehak, R. Dehak, O. Glembek, Z. Karam, J. Noecker, E. Na, C. Costin, V. Hubeika, S. Kajarekar, N. Scheffer, and J. Cernocký, “Robust speaker recognition over varying channels,” Tech. Rep. JHU workshop 2008, Mar. 2009. [53] B. Fauve, N. Evans, and J. Mason, “Improving the performance of text-independent short duration SVM- and GMM-based speaker verifi cation,” in Proc. 
The Speaker and Language Recognition Workshop (Odyssey), 2008. [54] M.-W. Mak, R. Hsiao, and B. Mak, “A comparison of various adapta-tion methods for speaker verifi cation with limited enrollment data,” in Proc. ICASSP, 2006, pp. 929–932.[55] K. Wadhwa, “Voice verifi cation: Technology overview and accu-racy testing results,” in Proc. Biometrics Conf., 2004, vol. 2004. [56] T. Kato and T. Shimizu “Improved speaker verifi cation over the cel-lular phone network using phonemebalanced and digit-sequence-pre-serving connected digit patterns,” in Proc. IEEE ICASSP, 2003, pp. 57–60.[57] C. Fredouille, J. Mariethoz, C. Jaboulet, J. Hennebert, J.-F. Mok-bet, and F. Bimbot, “Behavior of a Bayesian adaptation method for incremental enrollment in speaker verifi cation,” in Proc. ICASSP, 2000, vol. 2, pp. 1197–1200.[58] S. Furui, “Cepstral analysis technique for automatic speaker verifi -cation,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, no. 2, pp. 254–272, Apr. 1981.


[59] H. Hermansky, “RASTA processing of speech,” IEEE Trans. Speech Audio Processing, vol. 2, no. 4, 1994. [60] Qi Li, S. Parthasarathy, and A. E. Rosenberg, “A fast algorithm for stochastic matching with applications to robust speaker verifi cation,” in Proc. ICASSP, 1997, pp. 1543–1546.[61] D. A. Reynolds, “Channel robust speaker verifi cation via feature map-ping,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal (ICASSP’03), pp. 53–56.[62] M. Banbrook, S. McLaughlin, and I. Mann, “Speech characteriza-tion and synthesis by nonlinear methods,” IEEE Trans. Speech Audio Processing, 1999, vol. 7, pp. 1–17.[63] H. M. Teager and S. M. Teager, “Evidence for nonlinear sound pro-duction mechanisms in the vocal tract,” presented at NATO ASI on Speech Production and Speech Modelling, 1990. [64] H. Hermansky, S. Sharma, and P. Jain, “Data-derived nonlinear mapping for feature extraction,” in Proc. Workshop Automatic Speech Recognition and Understanding, Dec. 1999. [65] S. Sharma, D. Ellis, S. Kajarekar, P. Jain, and H. Hermansky, “Feature extraction using non-linear transformation for robust speech recogni-tion on the Aurora database,” in Proc. ICASSP 2000. [66] M. K. Omar and M. Hasegawa-Johnson, “Non-linear maximum like-lihood feature transformation for speech recognition,” in Proc. Inter-speech, Sept. 2003. [67] A. Kocsor and L. Tóth, “Kernel-based feature extraction with a speech technology application,” IEEE Trans. Signal Processing, vol. 52, no. 8, pp. 2250–2263, 2004. [68] H. Huang and J. Zhu “Kernel based non-linear feature extraction methods for speech recognition,” Proc. 6th Int. Conf. Intelligent Systems Design and Applications (ISDA’06), 2006, vol. 2, pp. 749–754.[69] A. Lima, H. Zen, Y. Nankaku, C. Miyajima, K. Tokuda, and T. Kita-mura, “On the use of kernel PCA for feature extraction in speech recog-nition,” in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 2625–2628.[70] Y. Konig, L. Heck, M. Weintraub, K. Sonmez, and R. Esum E, “Nonlin-ear discriminant feature extraction for robust text-independent speak-er recognition,” in Proc. RLA2C, ESCA Workshop Speaker Recognition and Its Commercial and Forensic Applications, 1998, pp. 72–75. [71] M. Chetouani, B. Gas, J. L. Zarader, and C. Chavy “Neural predictive coding for speech discriminant feature extraction: The DFE-NPC,” in Proc. European Symp. Artificial Neural Networks, Bruges, Belgium, Apr. 24–26, 2002. [72] Q. Zhu and A. Alwan “Non-linear feature extraction for robust speech recognition in stationary and non-stationary noise,” Comput. Speech Lang., vol. 17, pp. 381–402, 2003. [73] M. Chetouani, M. Faundez, B. Gas and J. L. Zarader, “Non-linear speech feature extraction for phoneme classifi cation and speaker rec-ognition,” in Nonlinear Speech Processing: Algorithms and Analysis, G. Chollet, A. Esposito, M. Faundez, and M. Marinaro, Eds. Berlin: Spring-er-Verlag, 2005. [74] A. Fazel and S. Chakrabartty, “Non-linear fi ltering in reproducing kernel Hilbert spaces for noise-robust speaker verifi cation,” in Proc. Int. Symp. Circuits and Systems (ISCAS), Taipei, Taiwan, 2009. [75] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verifi cation,” in Proc. IEEE Workshop Speaker and Language Recognition (Odyssey), June 2001. [76] D. Reynolds, W. Andrews, J. P. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, J. Jones, and B. Xiang, “The supersid project: Exploiting high-level infor-mation for high-accuracy speaker recognition,” in Proc. IEEE Int. Conf. 
Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2003. [77] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Factor analy-sis simplifi ed,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing (ICASSP’05), Philadelphia, PA, Mar. 2005, vol. 1, pp. 637–640.[78] A. Solomonoff, C. Quillen, and I. Boardman, “Channel compensa-tion for SVM speaker recognition,” in Proc. Odyssey-04 Speaker Lan-guage Recognition Workshop, Toledo, Spain, May 2004, pp. 57–62.[79] A. Solomonoff, W. M. Campbell, and I. Boardman, “Advances in channel compensation for SVM speaker recognition,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, Philadelphia, PA, Mar. 2005, vol. 1, pp. 629–632.[80] A. O. Hatch and A. Stolcke, “Generalized linear kernels for one ver-sus-all classifi cation: Application to speaker recognition,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), France, 2006, pp. 585–588.[81] A. O. Hatch, S. Kajarekar, and A. Stolcke, “Within-class covariance normalization for SVM-based speaker recognition,” in Proc. Int. Conf. Spoken Language Processing, Pittsburgh, PA, Sept. 2006, pp. 1471–1474.

[82] P. Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” Tech. Rep. CRIM-06/08-14, 2006.[83] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines using GMM supervectors for speaker verifi cation,” IEEE Sig-nal Process. Lett., vol. 13, no. 5, pp. 308–311, May 2006.[84] A. You, K. Lee, and H. Li, “An SVM kernel with GMM supervector based on the Bhattacharyya distance for speaker recognition,” IEEE Signal Process. Lett., vol. 16, no. 1, pp. 49–52, 2009.[85] K.-A Lee, C. You, H. Li, and T. Kinnunen, “A GMM-based probabilis-tic sequence kernel for speaker verifi cation,” in Proc. Interspeech, Bel-gium, 2007, pp. 294–297.[86] T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifi ers,” in Proc. 1998 Conf. Adv. Neural Inf. Process. Syst. II, 1999, pp. 487–493.[87] P. Moreno and P. Ho, “A new SVM approach to speaker identifi ca-tion and verifi cation using probabilistic distance kernels,” in Proc. 8th Eur. Conf. Speech Commun. Technol., Geneva, Switzerland, Sept. 2003, pp. 2965–2968.[88] K. P. Li and J. E. Porter, “Normalizations and selection of speech segments for speaker recognition scoring,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP’88), New York, Apr. 1988, vol. 1, pp. 595–598.[89] A. A. Reynolds, “The effect of handset variability on speaker recog-nition performance: Experiments on the switchboard corpus,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’96), Atlanta, GA, May 1996, vol. 1, pp. 113–116.[90] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, “Score normaliza-tion for text-independent speaker verifi cation systems,” Digital Signal Process., vol. 10, pp. 42–54, Jan. 2000.[91] M. Ben, R. Blouet, and F. Bimbot, “A Monte-Carlomethod for score normalization in automatic speaker verification using Kull-back-Leibler distances,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’02), Orlando, FL, May 2002, vol. 1, pp. 689–692.[92] R. Matejka, L. Burget, P. Schwarz, O. Glembek, M. Karafi at, F. Grezl, J. Cernocky, D. A. van Leeuwen, N. Brummer, and A. Strasheim, “STBU system for the NIST 2006 speaker recognition evaluation,” in Proc. ICASSP, 2007, pp. 221–224.[93] M. McLaren, B. Baker, R. Vogt, and S. Sridharan, “Improved SVM speaker verifi cation through data-driven background dataset collec-tion,” in Proc. ICASSP, 2009, pp. 4041–4044.[94] P. Kenny, N. Dehak, P. Ouellet, V. Gupta, and P. Dumouchel, “Devel-opment of the primary CRIM system for the NIST 2008 speaker recog-nition evaluation,” in Proc. Interspeech 2008, Brisbane, Australia, Sept. 2008. [95] W. Campbell, J. Campbell, R. Gleason, D. Reynolds, and W. Shen, “Speaker verifi cation using support vector machines and high-level features,” IEEE Trans. Audio, Speech, Lang. Processing, vol. 15, no. 7, pp. 2085–2094, Sept. 2007.[96]A. E. Sturim, W. M. Campbell, D. A. Reynolds, R. B. Dunn, and T. F. Quatieri, “Robust speaker recognition with cross-channel data: MIT/LL results on the 2006 NIST SRE auxiliary microphone task,” in Proc. ICASSP, 2007, vol. 4, pp. 49–52.[97] S. S. Kajarekar, N. Scheffer, M. Graciarena, E. Shriberg, A. Stolcke, L. Ferrer, and T. Bocklet, “THE SRI NIST 2008 speaker recognition evalua-tion system,” in Proc. ICASSP, 2009, pp. 4205–4208.[98] N. Mirghafori, A. O. Hatch, S. Stafford, K. Boakye, D. Gillick, and B. Peskin, “ICSI’S 2005 speaker recognition system,” in Proc. IEEE Workshop Automatic Speech Recognition and Understanding, 2005, pp. 
23–28.[99] A. Fauve, D. Matrouf, N. Scheffer, J.-F. Bonastre, and J. Mason, “State-of-the-art performance in text-independent speaker verifi cation through open-source software,” IEEE Trans. Audio, Speech Lang. Pro-cessing, vol. 15, no. 7, pp. 1960–1968, 2007.[100] J.-F Bonastre, F. Wils, and S. Meignier, “ALIZE, a free toolkit for speaker recognition,” in Proc. ICASSP, 2005, pp. 737–740.[101] M. R. Leonard and J. H. L. Hansen, “In-set/out-of-set speaker rec-ognition: Leverging the speaker and noise balance,” in Proc. ICASSP, 2008, pp. 1585–1588. [102] M. K. Omar, J. Pelecanos, and G. N. Ramaswamy, “Maximum mar-gin linear kernel optimization for speaker verifi cation,” in Proc. ICASSP, 2009, pp. 4037–4040.[103] M. Ferràs, C. Barras, and J.-L. Gauvain, “Lattice-based MLLR for speaker recognition,” in Proc. ICASSP, pp. 4537–4550, 2009.