2. Review of Literature

Speech is a natural means of communication for humans, so it is not surprising that humans can recognize the identity of a person by hearing his or her voice. About 2-3 seconds of speech is sufficient for a human to identify a voice. One review on human speech recognition [33] states that many studies with 8-10 speakers yield accuracies of more than 97% if a sentence or more of speech is heard. Performance falls as the speech becomes shorter and as the number of speakers grows. Speaker Recognition is one area of artificial intelligence where machines can outperform humans: with short test utterances and a large number of speakers, machine accuracy often exceeds that of human listeners. Research on Speaker Identification systems dates back more than fifty years [3]. A brief survey of this work is given in the subsequent sections.

2.1 Early Systems (1960-1980)

The first reported work on Speaker Recognition can be dedicated

to Pruzansky at Bell Labs [34], as early as 1963, who initiated

research by using filter banks and correlating two digital

spectrograms for a similarity measure. The system used several

utterances of commonly spoken words by ten talkers and converted

it to time-frequency-energy patterns. Some of each talker's

utterances were used to form reference patterns and the remaining

utterances served as test patterns. The recognition procedure

consisted of cross-correlating the test patterns with the reference

patterns and selecting the talker corresponding to the reference

pattern with the highest correlation as the talker of the test

utterance. The recognition score for three-dimensional patterns was

Speaker Identification using Orthogonal Transforms and Vector Quantization 2008-2012

15

89%. Reducing the original patterns to time-energy patterns

resulted in a lower recognition score; however, when only spectral

information was retained, recognition results were the same as

those for three-dimensional patterns. The work was further

improved in [35], by using a small subset of features. Features

were formed as the average of the speech energy over certain

rectangular areas on the spectrograms. Results were computed as a

function of the number of features used and as a function of the

size of the areas used to form the features. The filter bank approach

used in the earlier two cases was replaced by formant analysis by

Doddington [36]. Doddington proposed a speaker-verification using

eight known speakers and 32 impostors. Formant frequencies,

voicing pitch period, and speech energy—all as functions of time—

were used in verification. Proper time normalization was shown to

be an important factor in improving verification error performance.

Intra- Speaker variation in speech was investigated by Endres et al.

[37] and Furui [38]. In [37], Spectrograms of utterances produced

by seven speakers and recorded over periods of up to 29 years

showed that the frequency position of formants and pitch of voiced

sounds shift to lower frequencies with increasing age of test

persons. Speech spectrograms of texts spoken in a normal and a

disguised voice revealed strong variations in formant structure.

Speech spectrograms of utterances of well-known people were

compared with those of imitators. The imitators succeeded in

varying the formant structure and fundamental frequency of their

voices, but they were not able to adapt these parameters to match

or even be similar to those of imitated persons.

In [39], B. S. Atal evaluated several different parametric representations of speech derived from the linear prediction model for their effectiveness in automatic recognition of speakers from their voices. Twelve predictor coefficients were determined approximately once every 50 msec from speech sampled at 10 kHz. The predictor coefficients and other speech parameters derived from them, such as the impulse response function, the autocorrelation function, the area function, and the cepstrum function, were used as input to an automatic speaker recognition system. S. Furui [40] and A. E. Rosenberg and M. R. Sambur [41] used cepstrum coefficients extracted by means of LPC analysis successively throughout an utterance to form time functions. In time-domain methods, with adequate time alignment, one can make precise and reliable comparisons between two utterances of the same text in similar phonetic environments. Hence, text-dependent methods have a much higher level of performance than text-independent methods. The Texas Instruments system based on filter banks and the Bell Labs system based on cepstral analysis were the first commercially experimented Speaker Recognition systems.
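As a concrete illustration of the LPC analysis underlying these early representations, the following is a minimal sketch of the autocorrelation method for computing predictor coefficients for one frame. The order (12) and sampling rate (10 kHz) follow the figures quoted above; the Hamming window and the stand-in signal are assumptions for illustration only.

```python
# Minimal sketch of the autocorrelation method of linear prediction,
# in the spirit of the LPC-based representations discussed above.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Return the predictor coefficients a_1..a_p of one speech frame."""
    frame = frame * np.hamming(len(frame))          # taper the frame (assumed window)
    r = np.correlate(frame, frame, mode="full")     # full autocorrelation sequence
    r = r[len(frame) - 1:len(frame) + order]        # keep lags 0..order
    # Solve the Toeplitz normal equations R a = r[1:order+1]
    return solve_toeplitz(r[:order], r[1:order + 1])

# Example: one 50 ms frame of speech sampled at 10 kHz (500 samples)
fs = 10_000
frame = np.random.randn(500)                        # stand-in for a real speech frame
print(lpc_coefficients(frame))
```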

2.2 Intermediate Systems (1980-2000)

In this period there was considerable development in Speaker Identification technology, with advances both in feature extraction and in feature matching.

2.2.1 Feature Extraction

Voice pitch (F0) and formant frequencies (F1, F2, F3) extracted from time-aligned, un-coded and coded speech samples were compared to establish the statistical distribution of error attributed to the coding system [42]. The mel-warped cepstrum is a very popular feature domain. The mel warping transforms the frequency scale to place less emphasis on high frequencies; it is based on the nonlinear human perception of the frequency of sounds [43]. The cepstrum can be considered as the spectrum of the log spectrum. Removing its mean reduces the effects of linear time-invariant filtering (e.g., channel distortion). Often, the time derivatives of the mel cepstra (also known as delta cepstra) are used as additional features to model trajectory information.
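As a sketch of how delta cepstra are typically obtained, the snippet below applies the usual regression over a few neighbouring frames. The window width (N = 2) and the shape of the cepstral matrix are illustrative assumptions rather than values from the cited work.

```python
# Delta ("time derivative") cepstra computed by regression over +/- N frames.
import numpy as np

def delta(cepstra, N=2):
    """cepstra: (num_frames, num_coeffs) array; returns deltas of the same shape."""
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")       # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(cepstra.shape[0])
    ])

mfcc = np.random.randn(100, 13)        # stand-in for 100 frames of mel cepstra
features = np.hstack([mfcc, delta(mfcc)])   # static + delta features
```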

Studies on automatically extracting the speech periods of each person separately from a dialogue, conversation or meeting involving more than two people have appeared as an extension of speaker recognition technology [46 – 48]. Increasingly, speaker segmentation and clustering techniques have been used to aid in the adaptation of speech recognizers and for supplying metadata for audio indexing and searching.

2.2.2 Feature Matching

Hidden Markov Model

As an alternative to the template-matching approach for text-dependent speaker recognition, the Hidden Markov Model (HMM) technique was introduced. HMMs have the same advantages for speaker recognition as they do for speech recognition: remarkably robust models of speech events can be obtained with only small amounts of specification or information accompanying training utterances. Speaker recognition systems based on an HMM architecture [44] used speaker models derived from a multi-word sentence, a single word, or a phoneme. Typically, multi-word phrases (a string of seven to ten digits, for example) were used, and models for each individual word and for “silence” were combined at a sentence level according to a predefined sentence-level grammar.

Robustness

Research on increasing robustness became a central theme in the 1990s. Matsui et al. [24] compared the VQ-based method with the discrete/continuous ergodic HMM-based method, particularly from the viewpoint of robustness against utterance variations. They found that the continuous ergodic HMM method is far superior to the discrete ergodic HMM method and that the continuous ergodic HMM method is as robust as the VQ-based method when enough training data is available. They investigated speaker identification rates using the continuous HMM as a function of the number of states and mixtures. It was shown that speaker recognition rates were strongly correlated with the total number of mixtures, irrespective of the number of states. This means that using information about transitions between different states is ineffective for text-independent speaker recognition and, therefore, the GMM achieves almost the same performance as the multiple-state ergodic HMM.
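To make the GMM connection concrete, the following is a minimal sketch of text-independent speaker identification with Gaussian mixture models. The feature data is stubbed out with random arrays, and the mixture count and diagonal covariances are common defaults assumed for illustration, not settings from the studies cited above.

```python
# Minimal GMM-based speaker identification sketch.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(training_features, n_mixtures=16):
    """training_features: dict speaker_id -> (num_frames, dim) feature array."""
    models = {}
    for speaker, feats in training_features.items():
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag")
        models[speaker] = gmm.fit(feats)
    return models

def identify(models, test_features):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    scores = {spk: gmm.score(test_features) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy usage with random 13-dimensional "cepstral" frames
rng = np.random.default_rng(0)
train = {f"spk{i}": rng.normal(i, 1.0, size=(500, 13)) for i in range(3)}
models = train_speaker_models(train)
print(identify(models, rng.normal(1, 1.0, size=(200, 13))))   # most likely "spk1"
```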

Text Prompted Speaker Recognition

Matsui et al. proposed a text-prompted speaker recognition method, in which key sentences are completely changed every time the system is used [45]. The system accepts the input utterance only when it determines that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence they will be prompted to say. This method not only accurately recognizes speakers, but can also reject an utterance whose text differs from the prompted text, even if it is uttered by a registered speaker. Thus, a recorded and played-back voice can be correctly rejected.

Normalization

How to normalize intra-speaker variation of likelihood (similarity) values is one of the most difficult problems in speaker verification. Variations arise from the speaker himself or herself, from differences in recording and transmission conditions, and from noise. Speakers cannot repeat an utterance precisely the same way from trial to trial. Likelihood ratio-based and a posteriori probability-based techniques were investigated [49 - 51]. In order to reduce the computational cost of calculating the normalization term, methods using “cohort speakers” or a “world model” were proposed.
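A minimal sketch of this kind of score normalization is given below: the claimed speaker's log-likelihood is compared against the average over a set of cohort models (a world-model variant would substitute a single background model). The models are assumed to expose a score() method, as in the GMM sketch earlier; the threshold is an illustrative placeholder, not a value from the cited work.

```python
# Cohort-based likelihood normalization for speaker verification (sketch).
import numpy as np

def normalized_score(claimed_model, cohort_models, test_features):
    """Log-likelihood ratio of the claimed speaker against a cohort average."""
    target = claimed_model.score(test_features)                     # avg log-likelihood
    cohort = np.mean([m.score(test_features) for m in cohort_models])
    return target - cohort

def verify(claimed_model, cohort_models, test_features, threshold=0.0):
    # Accept the identity claim only if the normalized score clears the threshold.
    return normalized_score(claimed_model, cohort_models, test_features) > threshold
```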

2.3 Recent Trends in Speaker Identification (2000 onwards)

We can divide the recent advances in Speaker Identification into two categories: feature extraction and feature matching.

2.3.1 Feature Extraction

Recently, feature extraction techniques such as MFCC, wavelet decomposition and transform-domain methods have been explored.

Mel-Frequency Cepstral Coefficients (MFCC):

There has been a shift from LPC parameters to Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction. MFCCs are based on the known variation of the human ear’s critical bandwidths with frequency. The MFCC technique makes use of two types of filters, namely linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed on the mel frequency scale. This scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.

Fig. 2.1 shows a block diagram of the process used to convert the speech signal into MFCCs. The speech signal is first divided into frames and then windowed (e.g. with a Hamming window) to minimize the signal discontinuities at the beginning and end of each frame. The next step is to convert the signal into the frequency domain by applying the DFT to the windowed frames. This is followed by mel-frequency warping, where the mel scale is used; Eq. 2.1 shows the conversion of frequency (f) to mel frequency, and in practice it is implemented with a filter bank. In the final step, the log mel spectrum is converted back to time using the DCT, and the resulting coefficients are the MFCCs.

mel(f) = 2595 × log10(1 + f/700)          (2.1)

Fig. 2.1: Block diagram of the MFCC processor

MFCC techniques have become a common approach among many researchers [13-16, 52-55].
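A compact sketch of the Fig. 2.1 pipeline (framing, Hamming windowing, DFT, mel filter bank based on Eq. 2.1, log, and DCT) is given below. The frame size, hop, filter count and number of coefficients are common choices assumed for illustration, not values taken from the text.

```python
# MFCC pipeline sketch: frame -> window -> DFT -> mel filter bank -> log -> DCT.
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # Eq. (2.1)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs, frame_len=400, hop=160, n_filters=26, n_coeffs=13, n_fft=512):
    # Build a triangular mel filter bank between 0 Hz and fs/2
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)
    window = np.hamming(frame_len)
    coeffs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window          # framing + windowing
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2            # DFT power spectrum
        energies = np.maximum(fbank @ power, 1e-10)               # mel filter bank, avoid log(0)
        coeffs.append(dct(np.log(energies), norm="ortho")[:n_coeffs])   # log + DCT
    return np.array(coeffs)

# Example: one second of a stand-in signal at 16 kHz
sig = np.random.randn(16000)
print(mfcc(sig, 16000).shape)      # (num_frames, 13)
```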

Wavelets:

Another technique being explored for feature extraction is wavelet decomposition [17-19, 55-58]. Speech signals have a wide variety of characteristics in both the time and frequency domains. To analyze non-stationary signals like speech, both time and frequency resolution are important; therefore, while extracting features, it is useful to analyze the signal from a multi-resolution perspective. Wavelets provide both time and frequency resolution. The wavelet analysis procedure is to adopt a wavelet prototype function, called an analyzing wavelet or mother wavelet. Temporal analysis is performed with a contracted, high-frequency version of the prototype wavelet, while frequency analysis is performed with a dilated, low-frequency version of the same wavelet. In [57], Speaker Identification using different levels of decomposition of the speech signal with the discrete wavelet transform (DWT) and Daubechies mother wavelets has been demonstrated. Fig. 2.2 shows how the speech signal is decomposed into approximate (a1,…, a7) and detail coefficients (d1,…, d7) by using low-pass and high-pass filters at each stage. The speech signal has been decomposed up to seven levels using the DWT with different Daubechies mother wavelets, and the mean of the approximate and detail coefficients at every level has been taken as the feature vector; a minimal sketch of this feature extraction is given after Fig. 2.2.

Fig. 2.2: 7th-level wavelet decomposition of the speech signal into approximate and detail coefficients
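The sketch below illustrates the DWT-based features just described: a seven-level decomposition with a Daubechies mother wavelet, keeping the mean of the coefficients in each band as the feature vector. The particular wavelet ('db4') and the stand-in signal are illustrative assumptions; [57] experiments with several Daubechies wavelets.

```python
# Seven-level DWT feature extraction sketch using PyWavelets.
import numpy as np
import pywt

def dwt_features(signal, wavelet="db4", level=7):
    # wavedec returns [a7, d7, d6, ..., d1]: the level-7 approximation band
    # followed by the detail coefficients of every level
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([c.mean() for c in coeffs])

sig = np.random.randn(16000)          # stand-in for a speech signal
print(dwt_features(sig))              # 8 values: mean of a7, d7, ..., d1
```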

K. Daqrouq et al. [58] have used the DWT for text-dependent Speaker Identification in a two-step procedure: first gender discrimination, and then feature extraction for identification. The wavelet packet transform is an extension of the wavelet transform that provides a more precise way to analyze signals. Unlike the wavelet transform, the wavelet packet transform decomposes the signal not only in the low-frequency region but also in the high-frequency region, according to the signal’s entropy. By using the wavelet packet transform, the dynamic features can be preserved very well. In [59], the features extracted through the wavelet packet transform are the input to an artificial neural network, and the classifier then determines the recognition result.

High-level features:

High-level features such as word idiolect, pronunciation, phone usage, prosody, etc. have been successfully used in text-independent speaker verification. Typically, high-level-feature recognition systems produce a sequence of symbols from the acoustic signal and then perform recognition using the frequency and co-occurrence of symbols. In Doddington’s idiolect work [60], word unigrams and bigrams from manually transcribed conversations were used to characterize a particular speaker in a traditional target/background likelihood ratio framework.
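The idiolect idea can be sketched very simply: count word unigrams (and bigrams) from a speaker's transcripts and score a test transcript against the target counts relative to a background model. The smoothing constant, the unigram-only scoring, and the toy transcripts below are illustrative assumptions, not details taken from [60].

```python
# Idiolect n-gram sketch: target vs. background likelihood ratio over word counts.
from collections import Counter
import math

def ngram_counts(words):
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))     # bigram scoring would follow the same pattern
    return unigrams, bigrams

def log_likelihood(words, counts, smoothing=1.0):
    unigrams, _ = counts
    total = sum(unigrams.values())
    vocab = len(unigrams) + 1
    return sum(math.log((unigrams[w] + smoothing) / (total + smoothing * vocab))
               for w in words)

target = ngram_counts("you know i mean you know".split())
background = ngram_counts("the weather is nice today you know".split())
test = "you know i mean".split()
score = log_likelihood(test, target) - log_likelihood(test, background)
print(score > 0)     # a positive ratio favours the target speaker
```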

Feature Extraction in Transform Domain:

The work reported on Speaker Identification shows that Discrete

Fourier Transform [DFT] is the most explored Transformation

technique. DFT has been used to compute LPC parameters [12];

MFCC features [13 - 16, 52 – 55]. In [61, 62], a novel technique of

utilizing the magnitude and phase of the speech signal has been

proposed. The complex DFT plane plotted by taking real part of DFT

as the X-axis and the imaginary part as the Y-axis has been

sectorized as shown in Fig. 2.3. The mean and density of the

sample points in each sector are used as features for Speaker

Identification. The concept of Sectorization is further extended to

Speaker Identification using Orthogonal Transforms and Vector Quantization 2008-2012

23

Walsh Hadamard Transform (WHT) in [63], by plotting cal

coefficients verses sal coefficients.
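A minimal sketch of this sectorization idea is given below: each DFT sample is treated as a point (Re, Im) in the complex plane, the plane is divided into equal angular sectors, and the mean magnitude and the density (fraction of points) per sector form the feature vector. The number of sectors and this particular reading of "mean" and "density" are assumptions for illustration; [61, 62] define the sectors and statistics in their own way.

```python
# Sectorization of the complex DFT plane (sketch).
import numpy as np

def sector_features(signal, n_sectors=8):
    spectrum = np.fft.fft(signal)
    angles = np.angle(spectrum) % (2 * np.pi)               # map angles to [0, 2*pi)
    sector = (angles / (2 * np.pi / n_sectors)).astype(int) % n_sectors
    magnitudes = np.abs(spectrum)
    means = np.array([magnitudes[sector == s].mean() if np.any(sector == s) else 0.0
                      for s in range(n_sectors)])           # mean magnitude per sector
    density = np.bincount(sector, minlength=n_sectors) / len(signal)   # point density
    return np.concatenate([means, density])

sig = np.random.randn(4096)
print(sector_features(sig))    # 16 values: 8 sector means + 8 sector densities
```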

Fig. 2.3: Speech signal (amplitude versus number of samples) and its circular sectors

The concept of using the amplitude distribution in the transform domain for feature extraction has also been explored. The DFT and the Discrete Cosine Transform (DCT) have been utilized for obtaining MFCC coefficients [14-15, 53, 54]. Fig. 2.4 shows the sums of the magnitudes of the DFT samples obtained by dividing the transform domain into 32 groups; by making use of the symmetry of the DFT, only the first 16 sums are considered as feature vectors here. It has been shown that the amplitude distribution of the DFT, DCT, Discrete Sine Transform (DST), WHT, Discrete Hartley Transform (DHT), Kekre Transform (KT) and Haar Transform can be used for feature extraction [66 - 68] for Speaker Identification.

Fig. 2.4: Feature vectors for FFT, dividing the samples into 32 divisions
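The grouping shown in Fig. 2.4 can be sketched as follows: the DFT magnitude spectrum is split into 32 equal groups, the magnitudes in each group are summed, and only the first 16 sums are kept because of the conjugate symmetry of the DFT of a real signal. The group count follows the text; the stand-in signal is illustrative.

```python
# Amplitude-distribution feature sketch: 32 magnitude-sum groups, keep the first 16.
import numpy as np

def dft_group_features(signal, n_groups=32, n_keep=16):
    mags = np.abs(np.fft.fft(signal))
    groups = np.array_split(mags, n_groups)           # 32 roughly equal groups
    sums = np.array([g.sum() for g in groups])
    return sums[:n_keep]                               # first half, by DFT symmetry

sig = np.random.randn(8192)
print(dft_group_features(sig))
```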

Another approach, utilizing the row mean of the column transform of various orthogonal transforms, has been used for feature extraction and has given very good results for Speaker Identification [69 – 71].

Vector Quantization:

Vector Quantization has been extensively used for text-independent Speaker Identification [20 – 22, 24, 52]. VQ has also been used for text-dependent recognition in [121]: each speaker is represented by a sequence of vector quantization codebooks, input utterances of known text are classified using these codebook sequences, and the resulting classification distortion is compared to a rejection threshold. In [122], MFCC features are quantized to a number of centroids using Vector Quantization. In [23], a text-dependent speaker verification system based on VQ source coding has been explored. Vector Quantization has been utilized for feature extraction in the spatial domain by using clustering algorithms like Linde-Buzo-Gray (LBG), Kekre’s Fast Codebook Generation (KFCG) and Kekre’s Median Codebook Generation (KMCG) [72 – 74]. These techniques are further extended to the transform domain by using the DFT, DCT and DST [75].
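A compact sketch of VQ-based speaker modelling is given below: an LBG-style split-and-refine procedure builds a codebook from a speaker's training vectors, and an unknown utterance is matched by its average quantization distortion against each speaker's codebook. The codebook size, perturbation factor and random data are illustrative; KFCG and KMCG use different splitting rules that are not shown here.

```python
# LBG codebook generation and VQ-distortion speaker matching (sketch).
import numpy as np

def lbg_codebook(vectors, size=16, eps=0.01, iters=10):
    codebook = vectors.mean(axis=0, keepdims=True)                 # start with one centroid
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])  # split step
        for _ in range(iters):                                     # k-means refinement
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(nearest == k):
                    codebook[k] = vectors[nearest == k].mean(axis=0)
    return codebook

def avg_distortion(vectors, codebook):
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()                                    # mean nearest-centroid distance

# Identification: pick the speaker codebook with the lowest distortion
rng = np.random.default_rng(0)
books = {f"spk{i}": lbg_codebook(rng.normal(i, 1, (300, 12))) for i in range(3)}
test = rng.normal(2, 1, (100, 12))
print(min(books, key=lambda s: avg_distortion(test, books[s])))    # most likely "spk2"
```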

2.3.2 Feature Matching

Artificial Neural Network:

As seen from sections 2.1 and 2.2, the techniques for feature matching have shifted from template matching to statistical modeling (e.g. HMM), and from distance-based to likelihood-based methods. The non-parametric approach of VQ is still being used. A recent trend is the use of Artificial Neural Networks (ANN): being widely used in pattern recognition tasks, neural networks have also been applied to speaker recognition [77, 78].

Dynamic Time Warping (DTW):

The most popular method to compensate for speaking-rate variability in template-based systems is known as DTW [5]. This method accounts for the variation over time (trajectories) of parameters corresponding to the dynamic configuration of the articulators and vocal tract. In [79], Pandit M. proposes a technique for optimisation of the feature sets in a dynamic time warping (DTW) based text-dependent speaker verification system. An investigation of the Gaussian mixture model (GMM), comparing it with preliminary experiments on a multilayered perceptron network (MLP) trained with the back-propagation algorithm (BKP) and with dynamic time warping (DTW) techniques, on a Thai text-dependent speaker identification system, is given in [80].
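A minimal dynamic time warping sketch is given below: it aligns two feature sequences of possibly different lengths and returns the accumulated frame-to-frame distance along the best warping path. The local Euclidean distance and the simple step pattern are common choices assumed for illustration.

```python
# Dynamic time warping between two feature sequences (sketch).
import numpy as np

def dtw_distance(x, y):
    """x: (n, d) and y: (m, d) feature sequences; returns the DTW cost."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])     # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],        # insertion
                                 cost[i, j - 1],        # deletion
                                 cost[i - 1, j - 1])    # match
    return cost[n, m]

a = np.random.randn(80, 13)    # template utterance features
b = np.random.randn(95, 13)    # test utterance features (different length)
print(dtw_distance(a, b))
```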

2.3.3 Similarity Measures

There are various distance measures which can be used as similarity measures in the decision logic stage of Speaker Identification. The distance measures used in the literature are:

• Manhattan Distance

• Euclidean Distance

• Mahalanobis distance

• Bhattacharya distance

• Earth mover’s distance

Out of these distance measures, the Euclidean distance, which is the straight-line distance between two points in an n-dimensional space, is the most popularly used similarity measure [62 – 75]. Evgeny Karpov et al. [130] have compared the performance of the Euclidean distance and the Manhattan distance for Speaker Identification. In [127], Shingo Kuroiwa et al. have used the Earth Mover’s distance for the CCC Speaker Recognition Evaluation 2006. The Bhattacharya distance has been used as a similarity measure in [128, 129].
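For concreteness, the snippet below computes the two simplest measures from the list above, the Euclidean and Manhattan distances, between a reference and a test feature vector; the vectors themselves are illustrative.

```python
# Euclidean (straight-line) and Manhattan (city-block) distances between feature vectors.
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

reference = np.array([1.0, 2.0, 3.0])
test = np.array([2.0, 0.0, 4.0])
print(euclidean(reference, test))   # sqrt(6) ~= 2.449
print(manhattan(reference, test))   # 4.0
```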

2.4 Summary of the progress in Speaker Identification

The technological progress in Speaker Identification can be summarized as follows:

• Features of speech
Filter bank/spectral resonance – LPC – MFCC – magnitude and row mean in the transform domain – VQ codebook

• Feature matching
Template matching – corpus-based statistical modeling (e.g. HMM and n-grams) – DTW – Artificial Neural Networks

• Type of speech signal
Clean speech – noisy speech – telephone speech

• System
Hardware recognizer – Software recognizer

Although these advances have taken place, there are still many practical limitations which hinder the widespread commercial deployment of applications and services. Overcoming them requires a sounder understanding of the complex speech signal and its parameters.