Audio-Visual Biometric Based Speaker Identification

Biswajit Kar, Sandeep Bhatia & P. K. Dutta Department of Electrical Engineering, Indian Institute of Technology, Kharagpur, India.

[email protected], [email protected], [email protected]

Abstract

In this paper, we present a multimodal audio-visual speaker identification system. The proposed system decomposes the information in a video stream into two components: speech and lip motion. It has been shown that lip information carries not only speech information but also characteristic information about a person's identity. Fusing this information with the speech information produces robust person identification under adverse conditions. Gaussian mixture models (GMMs) and hidden Markov models (HMMs) are used throughout this work for the tasks of text-dependent speaker recognition and mouth tracking. The performance is evaluated on a dataset of 22 Indian speakers of different ethnicities, each uttering a sentence. The results show that the performance of the biometric system is significantly better when both audio and video features are used.

Keywords: Biometrics, Speaker recognition, Speaker model, Audio-visual speech recognition

1. Introduction

The term biometrics refers to methods of identifying a person based on physiological or behavioral characteristics or traits that are unique to that person [1]. By using biometrics it is possible to confirm or establish an individual's identity based on "who he is" rather than "what he possesses" (e.g., an ID card) or "what he remembers" (e.g., a password). Commonly used biometric techniques are speaker recognition, face recognition, handwriting analysis, fingerprint matching, gait recognition, iris scanning, etc. The recognition system operates in two phases (Fig. 1): (a) a training phase and (b) a testing phase, which can run in verification or identification mode [2]. In the identification mode, the system recognizes an individual by searching the templates of all the users in the database for a match. The system therefore conducts a one-to-many comparison to establish an individual's identity (or fails if the subject is not enrolled in the system database) without the subject having to claim an identity.

The most commonly used biometric for person recognition is voice. Voice reflects both physiological and behavioral characteristics of an individual. The physiological characteristics, such as the shape and size of the articulators (e.g., vocal tract, mouth, nasal cavities, and lips), are invariant for an individual. The behavioral part of a person's voice changes over time due to age, medical conditions (such as the common cold), emotional state, etc. The performance of systems using only speech as a biometric is greatly affected by factors such as background noise, spoof attacks, etc.

Depending upon the application, the voice used for the recognition task can be either text-dependent or text-independent. In this work we evaluate the database for a text-dependent, closed-set speaker identification system. The face is also a common biometric used by humans for personal recognition. The most popular approaches to face recognition rely on either the location or the shape of facial features such as the eyes, nose, lips, and ears, and their spatial relationships. The performance of systems using only facial features as a biometric is affected by factors such as an impostor wearing a mask of the person to be identified or poor quality of the video images. These limitations of systems using a single modality (voice or face) motivate us to use multi-biometrics for person authentication, to make the system more robust and reliable. The information present in the speech signal and the facial movements associated with an utterance may provide robustness for the authentication of a person.


Speech is bimodal in both production and perception. Speech is produced by excitation of the vocal tract system, the predominant mode of excitation being vibration of the vocal cords. The vocal tract terminates at the lips, and thus visual speech is represented by the movement of the lips, chin, and facial regions. The McGurk effect shows that visual cues in the form of lip movements contribute significantly to human speech perception. In this paper, the visual information conveyed during speech production is used along with the acoustic speech for speaker identification.

Audio-visual data is acquired in the form of a speech signal and a sequence of image frames corresponding to the text spoken by the user. Acoustic and visual information is parametrically represented as features extracted from the audio-visual data. The acoustic information in speech is represented by acoustic features. The extraction of visual features from the face requires determining the location of the face region. The facial movements that occur during the utterance of a text are represented by visual features. A pattern matching technique is required to compare the reference models with the test pattern during verification. A decision-making strategy is required to identify the speaker using the result of the pattern matching stage. The audio and visual information have to be integrated at an appropriate stage of processing so that the uniqueness of the audio-visual interaction for a person is captured.

A review of audio-visual biometric speaker recognition was conducted. Research on speaker recognition based only on acoustic information began in the 1960s. In the 1970s, the first work in the field of audio-visual speech processing was reported by McGurk and MacDonald, better known as the McGurk effect. The McGurk effect highlights the role of both acoustic and visual cues in the perception of speech.

Audio-visual biometric speaker authentication requires proper segmentation and feature extraction for both the acoustic and the visual modality. Reference [3] gives a detailed study of speech-based bimodal recognition. Acoustic features are often represented by short-term spectra, and cepstral features are widely used. Visual speech segmentation involves tasks such as automatic face detection and facial region extraction (lips, eyes, nose, chin, etc.). Reference [4] covers a detailed survey of face detection techniques. Face segmentation mainly relies on image attributes related to facial texture, colour [5], shape, motion, and brightness. Face detection methods can be broadly classified into four categories: knowledge-based methods, feature-invariant methods, template-matching methods, and appearance-based methods. Models such as HMMs and ANNs are then used for detection. Among these techniques, feature-invariant methods give better performance even when there is variation in pose, viewpoint, or lighting conditions of the facial image. The mouth is commonly modeled by deformable templates, dynamic contour models, or statistical models of shape and brightness.
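As an illustration of colour-based face segmentation, the sketch below applies a skin-colour map in the YCbCr space in the spirit of [5]. The OpenCV pipeline, the threshold values, and the function name are assumptions made for illustration, not the authors' implementation.

```python
import cv2
import numpy as np

def skin_color_face_region(frame_bgr):
    """Return a bounding box (x, y, w, h) around the largest skin-coloured blob.

    The Cr/Cb thresholds below are commonly quoted skin-colour-map values
    (an assumption, not values taken from the paper). Assumes OpenCV 4.x.
    """
    # OpenCV stores the converted image in Y, Cr, Cb channel order.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # Keep pixels with Cr in [133, 173] and Cb in [77, 127].
    mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    # Clean up the mask with simple morphology.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)
```

The mouth region could then be searched for in the lower part of the detected face box before tracking it across frames.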

Classifier models can be broadly grouped into parametric and non-parametric models. In parametric methods, a functional form of the density model with certain parameters is assumed, and the parameters are optimized by fitting the model to the data set. The drawback of this approach is that the particular model chosen might be incapable of providing a good representation of the true density. A non-parametric method, in contrast, assumes only a general form of the density, and the density itself is entirely determined by the data; consequently, the number of parameters grows with the size of the data set. The most prevalent parametric and non-parametric methods are listed below.

Parametric Models
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)

Non-Parametric Models
• Template Matching (Dynamic Time Warping)
• Artificial Neural Networks (ANN)
• Vector Quantization (VQ)

Fig 1. a) Enrollment and b) Testing phase of the audio-visual biometric based speaker identification system.
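As a minimal illustration of the parametric approach, the sketch below trains one Gaussian mixture model per speaker on frame-level feature vectors and identifies a test utterance by the highest average log-likelihood. The use of scikit-learn and the exact function names are assumptions for illustration; the mixture size of 16 mirrors the GMM configuration reported later in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(features_per_speaker, n_mixtures=16):
    """Fit one GMM per enrolled speaker.

    features_per_speaker: dict mapping speaker id -> (n_frames, n_dims) array
    of training feature vectors (e.g. MFCCs).
    """
    models = {}
    for spk, feats in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                              max_iter=200, random_state=0)
        models[spk] = gmm.fit(feats)
    return models

def identify(models, test_features):
    """Closed-set identification: return the speaker with the best score."""
    # GaussianMixture.score returns the average per-frame log-likelihood.
    scores = {spk: m.score(test_features) for spk, m in models.items()}
    return max(scores, key=scores.get)
```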

2. Acoustic Feature Extraction

In the acoustic module, speaker-dependent features are extracted from the speech of the person uttering a sentence. The spectral characteristics of the speech wave are time-varying, since the physical system changes rapidly over time. As a result, speech can be divided into sound segments that possess similar acoustic properties over short periods of time. A special category of segments into which speech is usually divided are the phonemes. More specifically, phonemes are the basic theoretical units for describing how speech conveys linguistic meaning. Each phoneme can be considered a code that consists of a unique set of articulatory gestures. From an acoustical point of view, we can say that a phoneme represents a class of sounds that convey the same meaning.

2.1. Preprocessing of Acoustic Speech

The speech signal captured by the microphone is preprocessed before extracting features. This involves detection and removal of the silence portions, pre-emphasis, frame blocking, and frame windowing.

2.1.1. Silence Removal
This step removes the silence portions from the speech, which carry no information but consume unnecessary bit rate for storage and transmission.

2.1.2. Pre-emphasis
The digitized speech signal S(n) is put through a low-order digital system (typically a first-order FIR filter) to spectrally flatten the signal.

2.1.3. Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames.

2.1.4. Frame Windowing
The idea here is to minimize spectral distortion by using a window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N-1, where N is the number of samples in each frame, then the result of windowing is the signal y(n) = x(n)w(n), 0 ≤ n ≤ N-1. Typically the Hamming window is used.
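A minimal sketch of the pre-emphasis, frame blocking, and Hamming windowing steps described above is given below; the NumPy implementation and the pre-emphasis coefficient are assumptions for illustration, not values from the paper.

```python
import numpy as np

def preprocess(signal, frame_len=256, overlap=0.5, pre_emph=0.97):
    """Pre-emphasize, block into overlapping frames, and apply a Hamming window.

    pre_emph=0.97 is a typical first-order FIR coefficient (an assumption).
    Returns an array of shape (n_frames, frame_len).
    """
    # First-order FIR pre-emphasis: s'(n) = s(n) - a * s(n-1).
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # Frame blocking: frames of N samples, adjacent frames separated by M samples.
    N = frame_len
    M = int(N * (1.0 - overlap))          # M < N, so frames overlap by N - M samples
    n_frames = 1 + max(0, (len(emphasized) - N) // M)
    frames = np.stack([emphasized[i * M: i * M + N] for i in range(n_frames)])

    # Windowing: y(n) = x(n) * w(n), with a Hamming window w(n).
    return frames * np.hamming(N)
```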

2.2. Feature Extraction
The popular speech features currently in use for speech and speaker identification are Mel-frequency cepstral coefficients (MFCC) and linear predictive coding (LPC). In this paper we use the MFCC coefficients as the feature vector. The speech signal is typically recorded at a sampling rate above 10000 Hz; this rate is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of the sounds generated by humans.

3. Results and Discussion
For each speech frame of around 30 ms, with overlap between adjacent frames, a set of MFCC coefficients is computed. For feature extraction we used a sampling rate Fs = 22050 Hz, a frame length of 256 samples, a frame overlap of 50%, and 18 MFCC coefficients.
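A sketch of MFCC extraction with the settings quoted above (Fs = 22050 Hz, 256-sample frames, 50% overlap, 18 coefficients) is shown below. It assumes the librosa library's standard MFCC pipeline, which will differ in detail from the authors' implementation.

```python
import librosa

def extract_mfcc(wav_path, sr=22050, n_mfcc=18, frame_len=256, overlap=0.5):
    """Return an (n_frames, n_mfcc) matrix of MFCC feature vectors."""
    y, sr = librosa.load(wav_path, sr=sr)          # resample to 22050 Hz
    y = librosa.effects.preemphasis(y)             # first-order pre-emphasis
    hop = int(frame_len * (1.0 - overlap))         # 50% overlap -> hop of 128 samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)
    return mfcc.T                                  # one row per frame
```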

3.1. Strategies for Multi-modal Integration
The audio-visual information during a speech utterance is acquired by two different sensors, namely the microphone and the camera. Both sensors capture different effects of the same phenomenon. For the audio-visual person authentication task, the features extracted from the audio and visual data can be processed and integrated in a number of different ways. The possibilities that exist for audio-visual integration are: (a) feature-level integration and (b) decision-level integration. In this paper, feature- and decision-level integration techniques are explored using the proposed visual feature for the person identification task.
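As an illustration of decision-level integration, the sketch below combines per-speaker scores from the audio and visual classifiers with a weighted sum after min-max normalization. The weighting scheme and weight value are assumptions for illustration (see, e.g., [10] for weighting schemes), not the fusion rule used in the paper.

```python
import numpy as np

def fuse_decision_level(audio_scores, visual_scores, w_audio=0.7):
    """Weighted-sum fusion of normalized audio and visual match scores.

    audio_scores, visual_scores: 1-D arrays of scores, one per enrolled
    speaker, in the same speaker order. w_audio=0.7 is an illustrative
    weight, not a value from the paper.
    """
    def minmax(s):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    fused = w_audio * minmax(audio_scores) + (1.0 - w_audio) * minmax(visual_scores)
    return int(np.argmax(fused))   # index of the identified speaker
```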

3.2. Database Collection
A database of 22 Indian subjects of different ethnicities was collected, with each speaker uttering the sentence "Hello how are you". The recording was done in two sessions, one for training (10 utterances) and another for testing (10 utterances). The visual stream consists of colour video frames of size 240x320 pixels containing the frontal view of the speaker at a rate of 30 frames per second. The audio stream contains the speech signal sampled at 22050 samples per second with a bit resolution of 8 bits. In this work a closed-set, text-dependent identification system is considered.

3.3. Result
Evaluation results on the database using the hidden Markov model (HMM) and the Gaussian mixture model (GMM), with MFCC (18 coefficients) as acoustic features and PCA (20 coefficients) as visual features, are given below. White Gaussian noise is added to the speech signal to observe the identification performance at different signal-to-noise ratios (SNR). Tables 1 and 2 present the identification results using the HMM and the GMM, respectively. The results show the percentage identification using the audio, visual, and audio-visual modules for different noise conditions.

Table 1. Results of identification for audio-visual data using HMM (Q = 3, M = 3).

Noise level    Audio only    Visual only    Audio-Visual
clean          99.6%         79.5%          100%
60 dB          98.3%         79.5%          99.1%
40 dB          96.7%         79.5%          97.1%
20 dB          94.1%         79.5%          94.7%

Table 2. Results of identification for audio-visual data using GMM (M = 16).

Noise level    Audio only    Visual only    Audio-Visual
clean          99.2%         77.1%          100%
60 dB          97.3%         77.1%          97.8%
40 dB          94.7%         77.1%          95.2%
20 dB          92.8%         77.1%          93.1%
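The noise conditions in Tables 1 and 2 were produced by adding white Gaussian noise to the clean speech. A sketch of how such a signal can be generated at a target SNR is given below; the implementation itself is an assumption for illustration, not the authors' code.

```python
import numpy as np

def add_white_noise(signal, snr_db):
    """Return the signal corrupted by white Gaussian noise at the given SNR (dB)."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: evaluate the identification system at the SNRs used in Tables 1 and 2.
# for snr in (60, 40, 20):
#     noisy = add_white_noise(clean_speech, snr)
```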


4. Discussion and Conclusion
The hidden Markov model (HMM) with number of states Q = 3 and number of mixtures M = 3 gave a better identification percentage than higher values of Q and M. This may be due to the limited training data. Similarly, the GMM with M = 16 showed the best identification rate. For the text-dependent application, HMMs are found to be superior to GMMs. This may be because an HMM can model both the spatial and the temporal information present in the speech and image sequences, while a GMM can model the information along only one of these dimensions. The results shown in Tables 1 and 2 illustrate that the combined use of audio and visual information significantly improves the identification performance. It can therefore be inferred from the results that a multimodal biometric system gives better performance than a system with a single modality.
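A minimal sketch of a text-dependent speaker HMM with Q = 3 states and M = 3 mixtures per state is shown below. It uses the hmmlearn library as a stand-in for the authors' implementation, so the training details are assumptions.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_speaker_hmm(utterances, n_states=3, n_mix=3):
    """Train one HMM per speaker from a list of (n_frames, n_dims) MFCC sequences.

    n_states=3 and n_mix=3 follow the configuration reported above; the
    remaining hyperparameters are illustrative assumptions.
    """
    X = np.vstack(utterances)
    lengths = [len(u) for u in utterances]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20, random_state=0)
    return model.fit(X, lengths)

# Identification proceeds as with the GMMs: score a test sequence against
# every speaker's HMM (model.score) and pick the highest log-likelihood.
```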

Extraction of visual features requires localization of the facial features for the mouth region and tracking it in the video. A technique based on colour and motion information was developed and tested on the database of Indian subjects having whitish and brown skin colour. The main contribution of this work is the use of visual information from the facial movements during speech to compensate for the effect of noise on the audio data in the speaker identification task. It can be concluded that a speaker identification system becomes more efficient and robust against background noise, spoof attacks, etc. when visual information is used.

5. References

[1] J. Ortega-Garcia, J. Bigun, D. Reynolds, and J. Gonzalez-Rodriguez, "Authentication Gets Personal with Biometrics," IEEE Signal Processing Magazine, pp. 50-62, March 2004.

[2] A. K. Jain, A. Ross, and S. Prabhakar, "An Introduction to Biometric Recognition," IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Image- and Video-Based Biometrics, vol. 14, no. 1, pp. 1-29, January 2004.

[3] C. C. Chibelushi, F. Deravi, and J. S. D. Mason, "A Review of Speech-Based Bimodal Recognition," IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23-37, March 2002.

[4] M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting Faces in Images: A Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34-58, January 2002.

[5] D. Chai and K. N. Ngan, "Face Segmentation Using Skin-Color Map in Videophone Applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 4, pp. 551-564, June 1999.

[6] S. McKenna, S. Gong, and Y. Raja, "Modeling Facial Color and Identity with Gaussian Mixtures," Pattern Recognition, vol. 31, no. 12, pp. 1883-1892, 1998.

[7] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.

[8] K. Yu, J. Mason, and J. Oglesby, "Speaker Recognition Using Hidden Markov Models, Dynamic Time Warping, and Vector Quantization," IEE Proceedings - Vision, Image and Signal Processing, vol. 142, no. 5, pp. 313-318, 1995.

[9] A. Ross, A. Jain, and J.-Z. Qian, "Information Fusion in Biometrics," in Proc. 3rd International Conference on Audio- and Video-Based Person Authentication (AVBPA), Sweden, June 2001, pp. 354-359.

[10] H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin, "Weighting Schemes for Audio-Visual Fusion in Speech Recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001.

[11] S. Lucey, T. Chen, S. Sridharan, and V. Chandran, "Integration Strategies for Audio-Visual Speech Processing: Applied to Text-Dependent Speaker Recognition," IEEE Transactions on Multimedia, vol. 7, no. 3, pp. 495-506, 2005.
