

S. Dua, S. Sahni, and D.P. Goyal (Eds.): ICISTM 2011, CCIS 141, pp. 170–179, 2011.
© Springer-Verlag Berlin Heidelberg 2011

Multi-feature Fusion for Closed Set Text Independent Speaker Identification

    Gyanendra K. Verma

    Indian Institute of Information Technology, Allahabad

    Jhalwa, Allahabad, India

    [email protected]

Abstract. An intra-modal fusion, i.e., a fusion of different features of the same modality, is proposed for a speaker identification system. Two fusion methods for multiple features, one at feature level and one at decision level, are proposed in this study. We use multiple features derived from the MFCC and the wavelet transform of the speech signal. Wavelet-transform-based features capture frequency variation across time, while MFCC features mainly approximate the base frequency information; both are important. A final score is calculated using a weighted sum rule over the matching results of the different features. We evaluate the proposed fusion strategies on the VoxForge speech dataset using a K-Nearest Neighbor classifier. We obtained promising results with multiple features compared to each feature used separately. Further, multiple features also performed well at different SNRs on NOIZEUS, a noisy speech corpus.

    Keywords: Multi-feature fusion, intra-modal fusion, speaker identification,

    MFCC, wavelet transform, K-Nearest Neighbor (KNN).

    1 Introduction

Fusing multiple sources of information is a current challenge. Information fusion is defined as combining information from multiple sources in order to achieve higher performance than can be achieved from a single source [1]. There are basically two fusion categories: intra-modal fusion [2, 3], the fusion of different features of the same modality, and multimodal fusion [4, 5], the fusion of different modalities, e.g., combining face, speech, fingerprint, etc. This paper is based on intra-modal fusion. Further, the information can be fused at signal level, feature level, and decision level. At signal level, the data acquired from different sources are fused directly after preprocessing. At feature level, multiple features are fused. At decision level, the outputs of multiple classifiers, based on a set of match scores, are fused. Complementary information is useful in the fusion process, as it enhances confidence in the decision [6].

Various methods have been developed for speaker identification, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Gabor features; however, there are still open problems that arise in real applications.


Longbiao Wang et al. [7] proposed a combined approach using MFCC and phase information for feature extraction from speech signals. Most algorithms consider only single features or directly combined features. In this study a multi-feature fusion method is proposed for closed-set text-independent speaker identification in order to improve the performance of the system. Closed-set identification only considers the best match among the enrolled speakers. We have used two feature extraction approaches, namely MFCC and the wavelet transform. The cepstral representation is a better way to represent the local spectral properties of a speech signal [8], whereas the wavelet transform captures frequency variation across time. The features obtained from these two approaches were fused at feature level and at decision level. At feature-level fusion, a combined feature vector is generated by fusing different features of the same speech. At decision level, a final score is calculated using a weighted sum rule over the matching results of the different features. The VoxForge and NOIZEUS speech corpora were used to evaluate the fusion schemes. This study contributes to the development of new data fusion methods in signal processing and information extraction. The multi-feature fusion approach can benefit many pattern recognition applications, enhancing performance once the pros and cons of the system under consideration are weighed. A general architecture of feature- and decision-level information fusion is illustrated in Figs. 1a and 1b.

Fig. 1. A general architecture of information fusion: (a) feature level; (b) decision level



The rest of the paper is organized as follows: a review of feature extraction techniques is given in Section 2; the proposed fusion approach is described in Section 3; experimental results and a discussion of the proposed work are presented in Section 4; concluding remarks are given in Section 5.

    2 Feature Extraction

Feature extraction is an important phase in any pattern recognition problem. In our study the features are obtained by applying two approaches, MFCC and the wavelet transform. The wavelet transform is able to perform local analysis that captures the local information of a signal at multiple resolutions. The feature vector extraction processes using MFCC and the wavelet transform are described below.

    2.1 Feature Extraction Using MFCC

The process of calculating the MFCC consists of the following steps.

Framing: the speech signal is segmented into overlapping frames of N samples each.

Windowing: each frame is windowed before spectral analysis in order to minimize spectral distortion. Generally the Hamming window is used, as given below:

    W(n) = 0.54 - 0.46 cos(2πn / (N - 1)),   0 ≤ n ≤ N - 1    (1)

The frame spectrum is then mapped onto the Mel scale, which relates the physical frequency f (in Hz) to the perceived pitch:

    Mel(f) = 2595 log10(1 + f / 700)    (2)

The Mel-scaled spectrum is finally converted to cepstral coefficients to form the MFCC feature vector.
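As an illustration, the framing, windowing, and Mel-scale steps above can be sketched in Python (a minimal sketch, not the paper's MATLAB implementation; the frame length and hop size below are assumed values):

import numpy as np

def hamming_window(N):
    # Equation (1): W(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def hz_to_mel(f):
    # Equation (2): Mel(f) = 2595 log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def frame_and_window(x, frame_len=256, hop=128):
    # Segment the signal into overlapping frames and apply the Hamming window.
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * hamming_window(frame_len)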



2.2 Feature Extraction Using Wavelet Transform

Fig. 2. Feature extraction process

In the wavelet framework, the scaling function can be expressed as a weighted sum of scaled and shifted versions of itself. The information captured by the wavelet transform depends on the properties of the wavelet function family (Daubechies, Symlet, Biorthogonal, Coiflet, etc.) and on the properties (waveform) of the target signal. The information extracted from a signal by wavelet transforms using different wavelet families need not be the same; it is therefore necessary to choose or evaluate a wavelet function that provides the most useful information for a particular application. Analyzing the signal at various scales and translations provides a multi-resolution time-frequency representation, as shown in Fig. 3.

In the discrete wavelet decomposition of a signal, the outputs of the high-pass and low-pass filters can be represented mathematically by Equations (3) and (4):

    Y_high[k] = Σ_n X[n] g[2k - n]    (3)

    Y_low[k] = Σ_n X[n] h[2k - n]    (4)

where Y_high and Y_low are the outputs of the high-pass and low-pass filters, respectively.
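One decomposition level of Equations (3) and (4) amounts to filtering followed by downsampling by two. A minimal NumPy sketch follows, using the Haar pair as a stand-in for g and h (the paper itself uses a Daubechies family):

import numpy as np

def dwt_level(x, g, h):
    # Y_high[k] = sum_n X[n] g[2k - n]; Y_low[k] = sum_n X[n] h[2k - n]:
    # convolve with each filter, then keep every second sample.
    return np.convolve(x, g)[1::2], np.convolve(x, h)[1::2]

s = 1.0 / np.sqrt(2.0)
h = np.array([s, s])     # low-pass (scaling) filter
g = np.array([s, -s])    # high-pass (wavelet) filter
detail, approx = dwt_level(np.random.randn(1024), g, h)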


    Fig. 3. Schematic of Discrete Wavelet decomposition of a speech signal

In order to extract the wavelet coefficients, the speech signal is passed through successive high-pass and low-pass filters. The selection of a suitable wavelet and of the number of decomposition levels is important. For one-dimensional speech signals the Daubechies wavelet family provides good results for non-stationary signal analysis [11], so we have used it in our study. The feature vectors obtained from six levels of wavelet coefficients provide a compact representation of the signal. The coefficients span the whole bandwidth from low frequency to high. The original signal can be represented by the sum of the coefficients in every sub-band, i.e., cD6, cD5, cD4, cD3, cD2, cD1. Feature vectors are obtained from the detail coefficients by applying common statistics and entropy. The discriminatory property of entropy features makes them suitable for extracting frequency distribution information [12].
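With PyWavelets, the six-level decomposition and a statistics-plus-entropy feature vector might look as follows. This is a sketch only: the paper does not enumerate the exact statistics used, so mean, standard deviation, and Shannon entropy of each detail sub-band are assumed here.

import numpy as np
import pywt

def subband_entropy(c, bins=32):
    # Shannon entropy of the coefficient distribution in one sub-band.
    counts, _ = np.histogram(c, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def wavelet_features(x, wavelet="db4", level=6):
    # wavedec returns [cA6, cD6, cD5, cD4, cD3, cD2, cD1].
    coeffs = pywt.wavedec(x, wavelet, level=level)
    feats = []
    for c in coeffs[1:]:  # detail coefficients cD6 .. cD1
        feats.extend([np.mean(c), np.std(c), subband_entropy(c)])
    return np.array(feats)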

    3 Proposed Fusion Approach

We propose two fusion approaches for multiple features. The first approach fuses information at the feature level (Fig. 4) and the other fuses information at the decision level (Fig. 5). Low-level features of the speech signal are extracted independently using the MFCC and wavelet transform analyses described in Sections 2.1 and 2.2, respectively. The fusion strategies are discussed below.

    3.1 Feature-Level Fusion

In feature-level information fusion the features obtained from both approaches are arranged so that the MFCC features occupy the first half of the feature vector and the wavelet features the second half. Let the features obtained from the MFCC coefficients be F_mfcc = (f_m1, f_m2, ..., f_mn) and those from the wavelet coefficients be F_wav = (f_w1, f_w2, ..., f_wn); the fused feature vector is then given by

    F_fusion = [F_mfcc, F_wav] = {f_m1, f_m2, ..., f_mn, f_w1, f_w2, ..., f_wn}
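As a minimal illustration, this fusion is plain concatenation (sketch only; the vector dimensions follow Section 4):

import numpy as np

def fuse_features(f_mfcc, f_wav):
    # F_fusion = [F_mfcc, F_wav]: MFCC features fill the first half of the
    # fused vector, wavelet features the second half.
    return np.concatenate([f_mfcc, f_wav])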

    3.2 Decision-Level Fusion

In decision-level fusion, we start the procedure by normalizing the scores obtained from the different feature extraction approaches.


    Fig. 4. Feature level fusion architecture

Normalization is performed to map the values of the different classifiers into a common range; min-max normalization is used here. The threshold value differs from classifier to classifier, so we further rescale the matching scores in order to obtain the same threshold value for every classifier. A speaker is accepted only within the threshold range and rejected otherwise. Finally, the scores are combined using the sum rule, which takes the weighted average of the individual score values. Let x_1, x_2, ..., x_n be the weighted scores corresponding to classifiers 1, 2, ..., n. The fusion is then given by Equation (5):

    S_comb = (1/n) Σ_{i=1}^{n} x_i    (5)
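A minimal sketch of this decision-level pipeline, assuming each classifier produces one matching score per enrolled speaker and that equal weights are used (the paper does not specify its weight values):

import numpy as np

def min_max_normalize(scores):
    # Map the scores of one classifier into the common range [0, 1].
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

def weighted_sum_rule(score_vectors, weights=None):
    # Equation (5): S_comb = (1/n) * sum_i x_i, where each x_i is a
    # normalized, weighted score vector from classifier i.
    n = len(score_vectors)
    weights = np.ones(n) if weights is None else np.asarray(weights)
    x = [w * min_max_normalize(s) for w, s in zip(weights, score_vectors)]
    return sum(x) / n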

    4 Experimental Results and Discussion

VoxForge corpus [13]: this corpus contains more than 200 speaker profiles of males and females, each with 10 speech samples. The sampling frequency is 8 kHz at a bit depth of 16. The durations of the speech samples range between 2 and 10 seconds. All speech files are in WAV format.


Fig. 5. Decision-level fusion architecture

NOIZEUS [14]: this noisy database contains 30 IEEE sentences (produced by three male and three female speakers). The noise was recorded in different environments: a crowd of people, a car, an exhibition hall, a restaurant, a street, an airport, a train station, and a train. The noise was added to the speech signals at SNRs of 15 dB, 10 dB, and 5 dB. All files are in WAV format (16-bit PCM, mono).

The experiments comprised two modules, training and testing, performed on the standard VoxForge speech corpus and on NOIZEUS. Five speech samples per speaker were used for training and another five for testing. In total, 33- and 30-dimensional feature vectors were obtained from the MFCC and the wavelet decomposition, respectively, as described in Sections 2.1 and 2.2. The min-max algorithm was used for feature set normalization before classification in order to improve the identification accuracy on the large dataset. All experiments were performed in MATLAB 7.6 (R2008b).

For classification purposes, speech samples from the same speaker are assigned the same class: the five samples of speaker A = {A1, A2, A3, A4, A5} are assigned class 1, the five samples of speaker B = {B1, B2, B3, B4, B5} are assigned class 2, and so on, until the whole training set is grouped into classes. Euclidean distance is used to calculate the distances among vectors in the KNN algorithm. The performance of the discrete wavelet and MFCC features on the standard VoxForge speech corpus is shown in Table 1. The proposed speaker identification system uses 33-dimensional wavelet features and 30-dimensional MFCC features with 10 samples from each of 200 speakers.


Table 1. Classification results with multi-feature

    No. of Speakers   Classification Rate (%)
                      Wavelet   MFCC    Fusion
    10                100       96      100
    20                99.0      91      100
    30                94.0      84      95
    40                89.5      81      92
    60                87.0      74.6    90.6
    80                88.0      73      92.5
    100               89.4      73.6    93.2
    120               88.1      72.8    91.8
    140               85.8      68.4    90.6
    160               85.5      68      90.4
    180               83.7      67.33   90.4
    200               83.9      66.8    90.2

Fig. 6. Performance graph: classification accuracy (%) versus number of speakers for Wavelet, MFCC, and fused features


The parameters of the proposed design are the result of evaluating different speaker identification designs on the VoxForge and alternative corpora. The classification accuracy of the fused features is 90.2% with 200 speakers.

A threshold θ is used to assign the class of a speaker; here θ = 0.60. If the query sample matches and all matched samples belong to the same class, and if the score x > θ, then the query sample is assigned to that class. The classification results are shown in Table 1 and the corresponding performance graph is illustrated in Fig. 6. The performance of the system on the noisy dataset at different SNRs is illustrated in Fig. 7.
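A sketch of this decision rule with KNN, assuming one feature vector per sample and an assumed mapping from Euclidean distance to a similarity score in [0, 1] (the paper does not state how its scores are derived from distances):

import numpy as np

def knn_identify(query, train_feats, train_labels, k=5, theta=0.60):
    # Euclidean distances from the query to every training vector.
    d = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(d)[:k]
    classes = train_labels[nearest]
    # Assumed distance-to-similarity mapping.
    score = 1.0 / (1.0 + d[nearest].mean())
    if np.all(classes == classes[0]) and score > theta:
        return int(classes[0])  # all k neighbors agree and the score clears theta
    return None                 # otherwise the query is rejected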

Fig. 7. Performance with noisy speech at different SNRs: classification accuracy (%) for Wavelet, MFCC, and fused features at 15 dB, 10 dB, and 5 dB

    5 Conclusions

New fusion methods at feature level and at decision level for multiple features were discussed and evaluated in this study. Features extracted from speech signals using MFCC and the wavelet transform were either combined to form a hybrid feature space before classification or classified separately and then combined using a rule-based approach. Examination of these feature fusion strategies shows that they improve speaker identification. MFCC and the discrete wavelet transform were used to extract multiple features from the speech signal, and the features were fused at both feature and decision level. A KNN classifier was used to measure the similarity between the extracted features and a set of reference features. All experiments were performed on standard speech corpora, namely VoxForge and NOIZEUS. The results obtained with the fusion schemes show a significant increase in the performance of the system.


    References

1. Multimodal Data Fusion, http://www.multitel.be/?page=data
2. Marcel, S., Bengio, S.: Improving face verification using skin color information. In: 16th International Conference on Pattern Recognition, pp. 378–381 (2002)
3. Czyz, J., Kittler, J., Vandendorpe, L.: Multiple classifier combination for face-based identity verification. Pattern Recognition 37(7), 1459–1469 (2004)
4. Wang, Y., Tan, T., Jain, A.K.: Combining face and iris biometrics for identity verification. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688. Springer, Heidelberg (2003)
5. Hong, L., Jain, A.K., Pankanti, S.: Can multi-biometrics improve performance? Technical Report MSU-CSE-99-39, Department of Computer Science, Michigan State University, East Lansing, Michigan (1999)
6. An Introduction to Data Fusion, Royal Military Academy, http://www.sic.rma.ac.be/Research/Fusion/Intro/content.html
7. Wang, L., Minami, K., Yamamoto, K., Nakagawa, S.: Speaker identification by combining MFCC and phase information in noisy environments. In: 35th International Conference on Acoustics, Speech, and Signal Processing, Dallas, Texas, U.S.A. (2010)
8. Patel, I., Srinivas Rao, Y.: A Frequency Spectral Feature Modeling for Hidden Markov Model Based Automated Speech Recognition. In: Meghanathan, N., Boumerdassi, S., Chaki, N., Nagamalai, D. (eds.) NeCoM 2010. CCIS, vol. 90, pp. 134–143. Springer, Heidelberg (2010)
9. Dutta, T.: Dynamic time warping based approach to text-dependent speaker identification using spectrograms. Congress on Image and Signal Processing 2, 354–360 (2008)
10. Tzanetakis, G., Essl, G., Cook, P.: Audio analysis using the discrete wavelet transform. In: Proceedings of the Conference in Acoustics and Music Theory Applications, Skiathos, Greece (2001)
11. Toh, A.M., Togneri, R., Nordholm, S.: Spectral entropy as speech features for speech recognition. In: Proceedings of PEECS, Perth, pp. 22–25 (2005)
12. VoxForge Speech Corpus, http://www.voxforge.org
13. NOIZEUS: A Noisy Speech Corpus for Evaluation of Speech Enhancement Algorithms, http://www.utdallas.edu/~loizou/speech/noizeus/