

S. Dua, S. Sahni, and D.P. Goyal (Eds.): ICISTM 2011, CCIS 141, pp. 170–179, 2011.
© Springer-Verlag Berlin Heidelberg 2011

Multi-feature Fusion for Closed Set Text Independent Speaker Identification

    Gyanendra K. Verma

    Indian Institute of Information Technology, Allahabad

    Jhalwa, Allahabad, India

    [email protected]

Abstract. An intra-modal fusion, i.e., a fusion of different features of the same modality, is proposed for a speaker identification system. Two fusion methods for multiple features, one at feature level and one at decision level, are proposed in this study. We use multiple features derived from the MFCC and the wavelet transform of the speech signal. Wavelet-transform-based features capture frequency variation across time, while MFCC features mainly approximate the base frequency information; both are important. A final score is calculated using a weighted sum rule over the matching results of the different features. We evaluate the proposed fusion strategies on the VoxForge speech dataset using a K-Nearest Neighbor classifier. We obtained promising results with multiple features compared to each feature used separately. Further, multiple features also performed well at different SNRs on NOIZEUS, a noisy speech corpus.

    Keywords: Multi-feature fusion, intra-modal fusion, speaker identification,

    MFCC, wavelet transform, K-Nearest Neighbor (KNN).

    1 Introduction

Fusing multiple sources of information is a current challenge. Information fusion is defined as combining information from multiple sources in order to achieve higher performance than can be achieved from a single source [1]. There are basically two fusion categories: intra-modal fusion [2, 3], the fusion of different features of the same modality, and multimodal fusion [4, 5], the fusion of different modalities, e.g., combining face, speech, fingerprint, etc. This paper is based on intra-modal fusion. Further, the information can be fused at signal level, feature level, and decision level. At signal level, the data acquired from different sources are fused directly after preprocessing. At feature level, multiple features are fused. At decision level, the outputs of multiple classifiers, based on a set of match scores, are fused. Complementary information is useful in the fusion process, as it enhances confidence in the decision [6].

Various methods have been developed for speaker identification, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Gabor features; however, there are still open problems that arise in real applications.


Longbiao Wang et al. [7] proposed a combined approach using MFCC and phase information for feature extraction from speech signals. Most algorithms consider only single features or directly combined features. In this study a multi-feature fusion method is proposed for closed-set text-independent speaker identification in order to improve the performance of the system. Closed-set identification only considers the best match among the enrolled speakers. We have used two feature extraction approaches, namely MFCC and the wavelet transform. The cepstral representation is a better way to represent the local spectral properties of a speech signal [8], whereas the wavelet transform captures frequency variation across time. The features obtained from these two approaches were fused at feature level and at decision level. At feature-level fusion, a combined feature vector is generated by fusing different features of the same speech. At decision level, a final score is calculated using a weighted sum rule over the matching results of the different features. The VoxForge and NOIZEUS speech corpora were used to evaluate the fusion schemes. This study contributes to the development of new data fusion methods in signal processing and information extraction. The multi-feature fusion approach can benefit many pattern recognition applications, enhancing performance once the pros and cons of the system under consideration are weighed. A general architecture of feature- and decision-level information fusion is illustrated in Figs. 1a and 1b.

Fig. 1. A general architecture of information fusion: (a) feature level; (b) decision level



The rest of the paper is organized as follows: a review of feature extraction techniques is given in Section 2; the proposed fusion approach is described in Section 3; experimental results and a discussion of the proposed work are presented in Section 4; concluding remarks are given in Section 5.

    2 Feature Extraction

Feature extraction is an important phase in any pattern recognition problem. In our study the features are obtained by applying two approaches, MFCC and the wavelet transform. The wavelet transform is able to perform local analysis that captures the local information of a signal at multiple resolutions. The feature vector extraction processes using MFCC and the wavelet transform are described below.

    2.1 Feature Extraction Using MFCC

The process of calculating the MFCC consists of the following steps.

Framing: the speech signal is segmented into overlapping frames of N samples each.

Windowing: each frame is windowed before spectral analysis in order to minimize spectral distortion. Generally the Hamming window is used, as given below:

    W(n) = 0.54 - 0.46 cos(2πn / (N - 1)),   0 ≤ n ≤ N - 1    (1)

The frame spectrum is then mapped onto the Mel scale, which relates the physical frequency f (in Hz) to the perceived pitch:

    Mel(f) = 2595 log10(1 + f / 700)    (2)

The Mel-scaled spectrum is finally converted to cepstral coefficients to form the MFCC feature vector.
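As an illustration, the framing, windowing, and Mel-scale steps above can be sketched in Python (a minimal sketch, not the paper's MATLAB implementation; the frame length and hop size below are assumed values):

import numpy as np

def hamming_window(N):
    # Equation (1): W(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def hz_to_mel(f):
    # Equation (2): Mel(f) = 2595 log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def frame_and_window(x, frame_len=256, hop=128):
    # Segment the signal into overlapping frames and apply the Hamming window.
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * hamming_window(frame_len)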



2.2 Feature Extraction Using Wavelet Transform

Fig. 2. Feature extraction process

In the wavelet framework, the scaling function can be expressed as a weighted sum of scaled and shifted versions of itself. The information captured by the wavelet transform depends on the properties of the wavelet function family (Daubechies, Symlet, Biorthogonal, Coiflet, etc.) and on the properties (waveform) of the target signal. The information extracted from a signal by wavelet transforms using different wavelet families need not be the same; it is therefore necessary to choose or evaluate a wavelet function that provides the most useful information for a particular application. Analyzing the signal at various scales and translations provides a multi-resolution time-frequency representation, as shown in Fig. 3.

In the discrete wavelet decomposition of a signal, the outputs of the high-pass and low-pass filters can be represented mathematically by Equations (3) and (4):

    Y_high[k] = Σ_n X[n] g[2k - n]    (3)

    Y_low[k] = Σ_n X[n] h[2k - n]    (4)

where Y_high and Y_low are the outputs of the high-pass and low-pass filters, respectively.
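One decomposition level of Equations (3) and (4) amounts to filtering followed by downsampling by two. A minimal NumPy sketch follows, using the Haar pair as a stand-in for g and h (the paper itself uses a Daubechies family):

import numpy as np

def dwt_level(x, g, h):
    # Y_high[k] = sum_n X[n] g[2k - n]; Y_low[k] = sum_n X[n] h[2k - n]:
    # convolve with each filter, then keep every second sample.
    return np.convolve(x, g)[1::2], np.convolve(x, h)[1::2]

s = 1.0 / np.sqrt(2.0)
h = np.array([s, s])     # low-pass (scaling) filter
g = np.array([s, -s])    # high-pass (wavelet) filter
detail, approx = dwt_level(np.random.randn(1024), g, h)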


    Fig. 3. Schematic of Discrete Wavelet decomposition of a speech signal

In order to extract the wavelet coefficients, the speech signal is passed through successive high-pass and low-pass filters. The selection of a suitable wavelet and of the number of decomposition levels is important. For one-dimensional speech signals the Daubechies wavelet family provides good results for non-stationary signal analysis [11], so we have used it in our study. The feature vectors obtained from six levels of wavelet coefficients provide a compact representation of the signal. The coefficients span the whole bandwidth from low frequency to high. The original signal can be represented by the sum of the coefficients in every sub-band, i.e., cD6, cD5, cD4, cD3, cD2, cD1. Feature vectors are obtained from the detail coefficients by applying common statistics and entropy. The discriminatory property of entropy features makes them suitable for extracting frequency distribution information [12].
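With PyWavelets, the six-level decomposition and a statistics-plus-entropy feature vector might look as follows. This is a sketch only: the paper does not enumerate the exact statistics used, so mean, standard deviation, and Shannon entropy of each detail sub-band are assumed here.

import numpy as np
import pywt

def subband_entropy(c, bins=32):
    # Shannon entropy of the coefficient distribution in one sub-band.
    counts, _ = np.histogram(c, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def wavelet_features(x, wavelet="db4", level=6):
    # wavedec returns [cA6, cD6, cD5, cD4, cD3, cD2, cD1].
    coeffs = pywt.wavedec(x, wavelet, level=level)
    feats = []
    for c in coeffs[1:]:  # detail coefficients cD6 .. cD1
        feats.extend([np.mean(c), np.std(c), subband_entropy(c)])
    return np.array(feats)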

    3 Proposed Fusion Approach

We propose two fusion approaches for multiple features. The first approach fuses information at the feature level (Fig. 4) and the other fuses information at the decision level (Fig. 5). Low-level features of the speech signal are extracted independently using the MFCC and wavelet transform analyses described in Sections 2.1 and 2.2, respectively. The fusion strategies are discussed below.

    3.1 Feature-Level Fusion

In feature-level information fusion the features obtained from both approaches are arranged so that the MFCC features occupy the first half of the feature vector and the wavelet features the second half. Let the features obtained from the MFCC coefficients be F_mfcc = (f_m1, f_m2, ..., f_mn) and those from the wavelet coefficients be F_wav = (f_w1, f_w2, ..., f_wn); the fused feature vector is then given by

    F_fusion = [F_mfcc, F_wav] = {f_m1, f_m2, ..., f_mn, f_w1, f_w2, ..., f_wn}
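As a minimal illustration, this fusion is plain concatenation (sketch only; the vector dimensions follow Section 4):

import numpy as np

def fuse_features(f_mfcc, f_wav):
    # F_fusion = [F_mfcc, F_wav]: MFCC features fill the first half of the
    # fused vector, wavelet features the second half.
    return np.concatenate([f_mfcc, f_wav])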

    3.2 Decision-Level Fusion

In decision-level fusion, we start the procedure by normalizing the scores obtained from the different feature extraction approaches.


    Fig. 4. Feature level fusion architecture

Normalization is performed to map the values of the different classifiers into a common range; min-max normalization is used here. The threshold value differs from classifier to classifier, so we further rescale the matching scores in order to obtain the same threshold value for every classifier. A speaker is accepted only within the threshold range and rejected otherwise. Finally, the scores are combined using the sum rule, which takes the weighted average of the individual score values. Let x_1, x_2, ..., x_n be the weighted scores corresponding to classifiers 1, 2, ..., n. The fusion is then given by Equation (5):

    S_comb = (1/n) Σ_{i=1}^{n} x_i    (5)
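A minimal sketch of this decision-level pipeline, assuming each classifier produces one matching score per enrolled speaker and that equal weights are used (the paper does not specify its weight values):

import numpy as np

def min_max_normalize(scores):
    # Map the scores of one classifier into the common range [0, 1].
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

def weighted_sum_rule(score_vectors, weights=None):
    # Equation (5): S_comb = (1/n) * sum_i x_i, where each x_i is a
    # normalized, weighted score vector from classifier i.
    n = len(score_vectors)
    weights = np.ones(n) if weights is None else np.asarray(weights)
    x = [w * min_max_normalize(s) for w, s in zip(weights, score_vectors)]
    return sum(x) / n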

    4 Experimental Results and Discussion

VoxForge corpus [13]: this corpus contains more than 200 speaker profiles of males and females, each with 10 speech samples. The sampling frequency is 8 kHz at a bit depth of 16. The durations of the speech samples range between 2 and 10 seconds. All speech files are in WAV format.


Fig. 5. Decision-level fusion architecture

NOIZEUS [14]: this noisy database contains 30 IEEE sentences (produced by three male and three female speakers). The noise was recorded in different environments: a crowd of people, a car, an exhibition hall, a restaurant, a street, an airport, a train station, and a train. The noise was added to the speech signals at SNRs of 15 dB, 10 dB, and 5 dB. All files are in WAV format (16-bit PCM, mono).

The experiments comprised two modules, training and testing, performed on the standard VoxForge speech corpus and on NOIZEUS. Five speech samples per speaker were used for training and another five for testing. In total, 33- and 30-dimensional feature vectors were obtained from the MFCC and the wavelet decomposition, respectively, as described in Sections 2.1 and 2.2. The min-max algorithm was used for feature set normalization before classification in order to improve the identification accuracy on the large dataset. All experiments were performed in MATLAB 7.6 (R2008b).

For classification purposes, speech samples from the same speaker are assigned the same class: the five samples of speaker A = {A1, A2, A3, A4, A5} are assigned class 1, the five samples of speaker B = {B1, B2, B3, B4, B5} are assigned class 2, and so on, until the whole training set is grouped into classes. Euclidean distance is used to calculate the distances among vectors in the KNN algorithm. The performance of the discrete wavelet and MFCC features on the standard VoxForge speech corpus is shown in Table 1. The proposed speaker identification system uses 33-dimensional wavelet features and 30-dimensional MFCC features with 10 samples from each of 200 speakers.


Table 1. Classification results with multi-feature

    No. of Speakers   Classification Rate (%)
                      Wavelet   MFCC    Fusion
    10                100       96      100
    20                99.0      91      100
    30                94.0      84      95
    40                89.5      81      92
    60                87.0      74.6    90.6
    80                88.0      73      92.5
    100               89.4      73.6    93.2
    120               88.1      72.8    91.8
    140               85.8      68.4    90.6
    160               85.5      68      90.4
    180               83.7      67.33   90.4
    200               83.9      66.8    90.2

Fig. 6. Performance graph: classification accuracy (%) versus number of speakers for Wavelet, MFCC, and fused features


The parameters of the proposed design are the result of evaluating different speaker identification designs on the VoxForge and alternative corpora. The classification accuracy of the fused features is 90.2% with 200 speakers.

A threshold θ is used to assign the class of a speaker; here θ = 0.60. If the query sample matches and all matched samples belong to the same class, and if the score x > θ, then the query sample is assigned to that class. The classification results are shown in Table 1 and the corresponding performance graph is illustrated in Fig. 6. The performance of the system on the noisy dataset at different SNRs is illustrated in Fig. 7.
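A sketch of this decision rule with KNN, assuming one feature vector per sample and an assumed mapping from Euclidean distance to a similarity score in [0, 1] (the paper does not state how its scores are derived from distances):

import numpy as np

def knn_identify(query, train_feats, train_labels, k=5, theta=0.60):
    # Euclidean distances from the query to every training vector.
    d = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(d)[:k]
    classes = train_labels[nearest]
    # Assumed distance-to-similarity mapping.
    score = 1.0 / (1.0 + d[nearest].mean())
    if np.all(classes == classes[0]) and score > theta:
        return int(classes[0])  # all k neighbors agree and the score clears theta
    return None                 # otherwise the query is rejected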

Fig. 7. Performance with noisy speech at different SNRs: classification accuracy (%) for Wavelet, MFCC, and fused features at 15 dB, 10 dB, and 5 dB

    5 Conclusions

New fusion methods at feature level and at decision level for multiple features were discussed and evaluated in this study. Features extracted from speech signals using MFCC and the wavelet transform were either combined to form a hybrid feature space before classification or classified separately and then combined using a rule-based approach. Examination of these feature fusion strategies shows that they improve speaker identification. MFCC and the discrete wavelet transform were used to extract multiple features from the speech signal, and the features were fused at both feature and decision level. A KNN classifier was used to measure the similarity between the extracted features and a set of reference features. All experiments were performed on standard speech corpora, namely VoxForge and NOIZEUS. The results obtained with the fusion schemes show a significant increase in the performance of the system.


    References

1. Multimodal Data Fusion, http://www.multitel.be/?page=data
2. Marcel, S., Bengio, S.: Improving face verification using skin color information. In: 16th International Conference on Pattern Recognition, pp. 378–381 (2002)
3. Czyz, J., Kittler, J., Vandendorpe, L.: Multiple classifier combination for face-based identity verification. Pattern Recognition 37(7), 1459–1469 (2004)
4. Wang, Y., Tan, T., Jain, A.K.: Combining face and iris biometrics for identity verification. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688. Springer, Heidelberg (2003)
5. Hong, L., Jain, A.K., Pankanti, S.: Can multi-biometrics improve performance? Technical Report MSU-CSE-99-39, Department of Computer Science, Michigan State University, East Lansing, Michigan (1999)
6. An Introduction to Data Fusion, Royal Military Academy, http://www.sic.rma.ac.be/Research/Fusion/Intro/content.html
7. Wang, L., Minami, K., Yamamoto, K., Nakagawa, S.: Speaker identification by combining MFCC and phase information in noisy environments. In: 35th International Conference on Acoustics, Speech, and Signal Processing, Dallas, Texas, U.S.A. (2010)
8. Patel, I., Srinivas Rao, Y.: A Frequency Spectral Feature Modeling for Hidden Markov Model Based Automated Speech Recognition. In: Meghanathan, N., Boumerdassi, S., Chaki, N., Nagamalai, D. (eds.) NeCoM 2010. CCIS, vol. 90, pp. 134–143. Springer, Heidelberg (2010)
9. Dutta, T.: Dynamic time warping based approach to text-dependent speaker identification using spectrograms. Congress on Image and Signal Processing 2, 354–360 (2008)
10. Tzanetakis, G., Essl, G., Cook, P.: Audio analysis using the discrete wavelet transform. In: Proceedings of the Conference in Acoustics and Music Theory Applications, Skiathos, Greece (2001)
11. Toh, A.M., Togneri, R., Nordholm, S.: Spectral entropy as speech features for speech recognition. In: Proceedings of PEECS, Perth, pp. 22–25 (2005)
12. VoxForge Speech Corpus, http://www.voxforge.org
13. NOIZEUS: A Noisy Speech Corpus for Evaluation of Speech Enhancement Algorithms, http://www.utdallas.edu/~loizou/speech/noizeus/