S. Dua, S. Sahni, and D.P. Goyal (Eds.): ICISTM 2011, CCIS 141, pp. 170179, 2011.
Springer-Verlag Berlin Heidelberg 2011
Multi-feature Fusion for Closed Set Text
Independent Speaker Identification
Gyanendra K. Verma
Indian Institute of Information Technology, Allahabad
Jhalwa, Allahabad, India
Abstract. An intra-modal fusion, i.e., a fusion of different features of the same modality, is proposed for a speaker identification system. Two fusion methods for multiple features, at the feature level and at the decision level, are proposed in this study. We use multiple features derived from the MFCC and the wavelet transform of the speech signal. Wavelet-transform-based features capture frequency variation across time, while MFCC features mainly approximate the base frequency information; both are important. A final score is calculated with a weighted sum rule applied to the matching results of the different features. We evaluate the proposed fusion strategies on the VoxForge speech dataset using a K-Nearest Neighbor classifier. We obtained promising results with multiple features compared to each feature alone. Further, the multi-feature approach also performed well at different SNRs on NOIZEUS, a noisy speech corpus.
Keywords: Multi-feature fusion, intra-modal fusion, speaker identification,
MFCC, wavelet transform, K-Nearest Neighbor (KNN).
1 Introduction
Fusing multiple sources of information is an active challenge nowadays. Information fusion is defined as combining information from multiple sources in order to achieve higher performance than can be achieved by means of a single source [1]. There are basically two fusion categories: intra-modal fusion [2, 3], the fusion of different features of the same modality, and multimodal fusion [4, 5], the fusion of different modalities, e.g. combining face, speech, fingerprint, etc. This paper is based on intra-modal fusion. Further, information can be fused at the signal level, the feature level, or the decision level. At the signal level, data acquired from different sources are fused directly after preprocessing. At the feature level, multiple features are fused. At the decision level, the outputs of multiple classifiers, based on a set of match scores, are fused. Complementary information is useful in the fusion process as it enhances the confidence in the decision [6].
Various methods have been developed for speaker identification, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), Gabor features, etc. However, there are still open
problems that arise in real applications. Longbiao Wang et al. [7] proposed a combined approach using MFCC and phase information for feature extraction from the speech signal. Most algorithms consider only single features or directly combined features. A multi-feature fusion method is proposed in this study for closed set text independent speaker identification in order to improve the performance of the system. Closed-set identification only considers the best match from the enrolled speakers. We have used two feature extraction approaches, namely MFCC and the wavelet transform. The cepstral representation is a good way to represent the local spectral properties of a speech signal [8], whereas the wavelet transform captures frequency variation across time. The features obtained from these two approaches were fused at the feature and decision levels. In feature-level fusion, a combined feature vector is generated by fusing different features of the same speech. At the decision level, a final score is calculated with a weighted sum rule applied to the matching results of the different features. The VoxForge and NOIZEUS speech corpora have been used to evaluate the fusion schemes. This study contributes to the development of new data fusion methods in signal processing and information extraction. The multi-feature fusion approach can benefit many pattern recognition applications, enhancing system performance once the pros and cons of the system are considered. A general architecture of feature- and decision-level information fusion is illustrated in Figs. 1a and b.
Fig. 1. A general architecture of information fusion: (a) feature level, (b) decision level
The rest of the paper is organized as follows: a review of feature extraction techniques is given in Section 2. The proposed fusion approach is described in Section 3. Experimental results and a discussion of the proposed work are presented in Section 4. Concluding remarks are given in Section 5.
2 Feature Extraction
Feature extraction is an important phase in any pattern recognition problem. In our study the features are obtained by applying two approaches, namely MFCC and the wavelet transform. The wavelet transform is able to perform local analysis, capturing the local information of a signal at multiple resolutions. The feature vector extraction processes using MFCC and the wavelet transform are described below.
2.1 Feature Extraction Using MFCC
The process of calculating the MFCC consists of the following steps.
Framing: The speech signal is segmented into overlapping frames of N samples each.
Windowing: Each frame is windowed before spectral analysis in order to minimize spectral distortion. Generally the Hamming window is used, given by

W(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1   (1)

The frequencies are then mapped onto the Mel scale using

Mel(f) = 2595 log10(1 + f / 700)   (2)
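The framing, windowing, and Mel-mapping steps above can be sketched in a few lines of numpy. This is a minimal illustration rather than the paper's implementation; the function names and frame parameters are our own assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Framing step: split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def hamming_window(N):
    """Hamming window of Eq. (1): W(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def hz_to_mel(f):
    """Mel-scale mapping of Eq. (2): Mel(f) = 2595 log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)
```

Each frame would then be multiplied by `hamming_window(frame_len)` before its spectrum is computed and warped onto the Mel scale.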
Fig. 2. Feature extraction process
2.2 Feature Extraction Using Wavelet Transform
In wavelet analysis, a signal is represented as a weighted sum of scaled and shifted versions of a wavelet function and of the scaling function itself. The information captured by the wavelet transform depends on the properties of the wavelet function family, such as Daubechies, Symlet, Biorthogonal, and Coiflet, and on the properties (waveform) of the target signal. The information extracted from a signal by wavelet transforms using different families of wavelet functions need not be the same, so it is necessary to choose or evaluate the wavelet function that provides the most useful information for a particular application. Analyzing the signal at various scales and translations provides a multi-resolution time-frequency representation, as shown in Fig. 3.
In the discrete wavelet decomposition of a signal, the outputs of the high-pass and low-pass filters can be represented mathematically by Equations 3 and 4:

Y_high[k] = Σ_n X[n] · g[2k − n]   (3)

Y_low[k] = Σ_n X[n] · h[2k − n]   (4)

where Y_high and Y_low are the outputs of the high-pass and low-pass filters, respectively.
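As a concrete sketch of Equations 3 and 4, one decomposition level can be implemented as convolution with a filter pair followed by downsampling by two. The Haar (db1) filters below are used purely for brevity; the study itself uses the Daubechies family at six levels, and the function name is our own.

```python
import numpy as np

def dwt_level(x, h, g):
    """One level of discrete wavelet decomposition:
    Y_low[k] = sum_n X[n] h[2k - n], Y_high[k] = sum_n X[n] g[2k - n],
    i.e. full convolution followed by keeping every second sample."""
    low = np.convolve(x, h)[0::2]    # approximation (low-pass) coefficients
    high = np.convolve(x, g)[0::2]   # detail (high-pass) coefficients
    return low, high

# Haar (db1) filter pair as a minimal example
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass
```

Because the filter pair is orthonormal, the energy of the input is preserved across the two output bands, which is one way to sanity-check such a decomposition.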
Fig. 3. Schematic of Discrete Wavelet decomposition of a speech signal
In order to extract wavelet coefficients, the speech signal is passed through successive high-pass and low-pass filters. The selection of a suitable wavelet and of the number of decomposition levels is important. For one-dimensional speech signals the Daubechies wavelet family provides good results for non-stationary signal analysis [11], so we have used it in our study. The feature vectors obtained from six-level wavelet coefficients provide a compact representation of the signal. The coefficients cover the whole bandwidth from low frequency to high, and the original signal can be represented by the sum of the coefficients in every sub-band, namely cD6, cD5, cD4, cD3, cD2, cD1. Feature vectors are obtained from the detail coefficients by applying common statistics and entropy. The discriminatory property of entropy features makes them suitable for extracting frequency distribution information [12].
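A sketch of how such per-band statistics could be computed is given below. The particular statistics chosen (mean, standard deviation, energy, Shannon entropy of the normalized energy distribution, peak magnitude) are illustrative assumptions, since the paper does not list its exact statistics.

```python
import numpy as np

def subband_features(coeffs, eps=1e-12):
    """Common statistics plus Shannon entropy for one band of
    detail coefficients (cD1 ... cD6)."""
    c = np.asarray(coeffs, dtype=float)
    energy = c ** 2
    p = energy / (energy.sum() + eps)         # normalized energy distribution
    entropy = -np.sum(p * np.log2(p + eps))   # Shannon entropy of the band
    return np.array([c.mean(), c.std(), energy.sum(), entropy, np.abs(c).max()])

def wavelet_feature_vector(detail_bands):
    """Concatenate per-band statistics over all decomposition levels."""
    return np.concatenate([subband_features(cd) for cd in detail_bands])
```

With six detail bands and five statistics per band, this sketch yields a 30-dimensional vector.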
3 Proposed Fusion Approach
We propose two fusion approaches for multiple features. The first approach fuses information at the feature level (Fig. 4) and the second fuses information at the decision level (Fig. 5). Low-level features of the speech signal are extracted independently using the MFCC and wavelet transform analyses described in Sections 2.1 and 2.2, respectively. The fusion strategies are discussed below.
3.1 Feature-Level Fusion
In feature-level information fusion, the features obtained from both approaches are organized such that the MFCC features occupy the first half of the feature vector and the wavelet features the second half. Let the features obtained from the MFCC coefficients be F_mfcc = (f_m1, f_m2, …, f_mn) and those from the wavelet coefficients F_wav = (f_w1, f_w2, …, f_wn); then the fused feature vector is given by

F_fusion = [F_mfcc, F_wav] = {f_m1, f_m2, …, f_mn, f_w1, f_w2, …, f_wn}
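The concatenation above is a one-liner in practice; a minimal sketch (the function name is ours) follows:

```python
import numpy as np

def fuse_features(f_mfcc, f_wav):
    """Feature-level fusion: MFCC features fill the first half of the
    fused vector, wavelet features the second half."""
    return np.concatenate([np.asarray(f_mfcc), np.asarray(f_wav)])
```

With 33 MFCC values and 30 wavelet values, the fused vector would have 63 dimensions; any per-feature normalization would be applied before concatenation.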
3.2 Decision-Level Fusion
At the decision level, we start the procedure with normalization of the scores obtained from the different feature extraction approaches. Normalization is performed to map the
Fig. 4. Feature level fusion architecture
values of the different classifiers into a common range; min-max normalization is used here. The threshold value differs for each classifier, so we further rescale the matching scores in order to obtain the same threshold value for every classifier. A speaker is accepted only within the threshold range and otherwise rejected. Finally, the scores are combined using the sum rule, which takes the weighted average of the individual score values. Let x_1, x_2, …, x_n be the weighted scores corresponding to classifiers 1, 2, …, n. Then the fusion is given by Equation 5:

S_comb = (1/n) Σ_{i=1}^{n} x_i   (5)
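The normalization and sum-rule combination can be sketched as follows. Equal weights are assumed when none are supplied, which is one possible choice rather than the paper's exact weighting; the function names are ours.

```python
import numpy as np

def min_max_normalize(scores):
    """Map raw matching scores into the common range [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def sum_rule(score_lists, weights=None):
    """Weighted sum rule of Eq. (5): the combined score is the weighted
    average of the normalized scores from each classifier."""
    s = np.stack([min_max_normalize(x) for x in score_lists])
    if weights is None:
        weights = np.full(len(s), 1.0 / len(s))   # equal weights by default
    return np.asarray(weights) @ s
```

Each row of `score_lists` holds one classifier's matching scores over the candidate speakers; the output is one combined score per candidate.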
4 Experimental Results and Discussion
VoxForge corpus [13]: It contains more than 200 speaker profiles of males and females, and each profile contains 10 speech samples. The sampling frequency is 8 kHz with a bit depth of 16. The duration of the speech samples ranges from 2 to 10 seconds. All speech files are in wav format.
Fig. 5. Decision-Level Fusion Architecture
NOIZEUS [14]: This noisy database contains 30 IEEE sentences (produced by three male and three female speakers), with noise recorded in different environments: a crowd of people, car, exhibition hall, restaurant, street, airport, train station, and train. The noises were added to the speech signals at SNRs of 15 dB, 10 dB, and 5 dB. All files are in wav format (16-bit PCM, mono).
The experiments comprised two modules, training and testing, performed on the standard VoxForge speech corpus and on NOIZEUS. Five speech samples per speaker were used for training and another five for testing. In total, 33- and 30-dimensional feature vectors were obtained from MFCC and wavelet decomposition, respectively, as described in Sections 2.1 and 2.2. The min-max algorithm was used for feature set normalization before classification in order to improve identification accuracy on the large dataset. All experiments were performed in MATLAB 7.6 (R2008b).
For classification purposes, speech samples of the same speaker are assigned the same class. In this way five speech samples are assigned to one class, and so on: speaker A = {A1, A2, A3, A4, A5} is assigned class 1, speaker B = {B1, B2, B3, B4, B5} is assigned class 2, and the whole training set is thus grouped into classes. Euclidean distance is used to calculate the distances among vectors in the KNN algorithm. The performance of the discrete wavelet and MFCC features on the standard VoxForge speech corpus is shown in Table 1. The proposed speaker identification system uses 33-dimensional wavelet features and 30-dimensional MFCC features with 10 samples from each of 200 speakers. The parameters of the proposed design are the result of
Table 1. Classification results with multi-feature

                         Classification Rate (%)
No. of Speakers    Wavelet    MFCC     Fusion
10                 100        96       100
20                 99.0       91       100
30                 94.0       84       95
40                 89.5       81       92
60                 87.0       74.6     90.6
80                 88.0       73       92.5
100                89.4       73.6     93.2
120                88.1       72.8     91.8
140                85.8       68.4     90.6
160                85.5       68       90.4
180                83.7       67.33    90.4
200                83.9       66.8     90.2
Fig. 6. Performance graph: classification accuracy (%) versus number of speakers for wavelet, MFCC, and fused features
evaluation of different speaker identification designs, evaluated using the VoxForge and alternative corpora. The classification accuracy of the fused features is 90.2% with 200 speakers.
A threshold is used to assign the class of a query speaker; here the threshold is 0.60. If the query sample matches and all matched samples belong to the same class, and the score x exceeds the threshold, then the query sample is assigned to that class. The classification results are shown in Table 1 and the corresponding performance graph is illustrated in Fig. 6. The performance of the system on the noisy dataset at different SNRs is illustrated in Fig. 7.
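The classification and thresholding scheme described above can be sketched as follows. This is an illustrative reimplementation with our own names and a majority-fraction score, not the original MATLAB code.

```python
import numpy as np

def knn_identify(query, train_feats, train_labels, k=5, threshold=0.60):
    """k-NN speaker identification with Euclidean distance. The query is
    assigned to the majority class of its k nearest neighbours only when
    the majority fraction exceeds the threshold; otherwise it is rejected."""
    d = np.linalg.norm(train_feats - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                     # k closest training vectors
    classes, counts = np.unique(train_labels[nearest], return_counts=True)
    best = int(np.argmax(counts))
    score = counts[best] / k                        # fraction of neighbours agreeing
    return (classes[best], score) if score > threshold else (None, score)
```

With five training samples per speaker, a query whose neighbours all come from one enrolled speaker yields a score of 1.0 and is accepted; mixed neighbourhoods below the threshold are rejected.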
Fig. 7. Performance with noisy speech at different SNRs (15 dB, 10 dB, 5 dB): classification accuracy (%) for wavelet, MFCC, and fused features
5 Conclusions
New fusion methods at the feature level and at the decision level for multiple features were discussed and evaluated in this study. Features extracted from speech signals using MFCC and the wavelet transform were either combined to form a hybrid feature space before classification or classified separately and then combined using a rule-based approach. Examination of these feature fusion strategies for improving speaker identification yields good results. MFCC and the discrete wavelet transform were used to extract multiple features from the speech signal, and the features were fused at both the feature and decision levels. A KNN classifier was used as the similarity measure between the extracted features and a set of reference features. All experiments were performed on standard speech corpora, i.e. VoxForge and NOIZEUS. The results obtained with the fusion schemes show a significant improvement in system performance.
References
1. Multimodal Data Fusion, http://www.multitel.be/?page=data
2. Marcel, S., Bengio, S.: Improving face verification using skin color information. In: 16th International Conference on Pattern Recognition, pp. 378–381 (2002)
3. Czyz, J., Kittler, J., Vandendorpe, L.: Multiple classifier combination for face-based identity verification. Pattern Recognition 37(7), 1459–1469 (2004)
4. Wang, Y., Tan, T., Jain, A.K.: Combining face and iris biometrics for identity verification. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688. Springer, Heidelberg (2003)
5. Hong, L., Jain, A.K., Pankanti, S.: Can multi-biometrics improve performance? Technical Report MSU-CSE-99-39, Department of Computer Science, Michigan State University, East Lansing, Michigan (1999)
6. An Introduction to Data Fusion, Royal Military Academy, http://www.sic.rma.ac.be/Research/Fusion/Intro/content.html
7. Wang, L., Minami, K., Yamamoto, K., Nakagawa, S.: Speaker identification by combining MFCC and phase information in noisy environments. In: 35th International Conference on Acoustics, Speech, and Signal Processing, Dallas, Texas, U.S.A. (2010)
8. Patel, I., Srinivas Rao, Y.: A frequency spectral feature modeling for Hidden Markov Model based automated speech recognition. In: Meghanathan, N., Boumerdassi, S., Chaki, N., Nagamalai, D. (eds.) NeCoM 2010. CCIS, vol. 90, pp. 134–143. Springer, Heidelberg (2010)
9. Dutta, T.: Dynamic time warping based approach to text dependent speaker identification using spectrograms. Congress on Image and Signal Processing 2, 354–360 (2008)
10. Tzanetakis, G., Essl, G., Cook, P.: Audio analysis using the discrete wavelet transform. In: Proceedings of the Conference in Acoustics and Music Theory Applications, Skiathos, Greece (2001)
11. Toh, A.M., Togneri, R., Nordholm, S.: Spectral entropy as speech features for speech recognition. In: Proceedings of PEECS, Perth, pp. 22–25 (2005)
12. VoxForge Speech Corpus, http://www.voxforge.org
13. NOIZEUS: A Noisy Speech Corpus for Evaluation of Speech Enhancement Algorithms, http://www.utdallas.edu/~loizou/speech/noizeus/