

Malaysian Vowel Recognition based on Spectral Envelope Using Bandwidth Approach

1 Fadzilah Siraj, [email protected]    2 Shahrul Azmi M.Y., [email protected]
3 Paulraj M.P., [email protected]    4 Sazali Yaacob, [email protected]

1,2 College of Arts and Sciences, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah, Malaysia
3,4 School of Mechatronics Engineering, Universiti Malaysia Perlis, Kampus Kubang Gajah, 02600 Arau, Perlis

Abstract

Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software, especially for English as the language of choice. In this paper, a new feature extraction method is presented to identify vowels recorded from 80 Malaysian speakers. The features are obtained from a vocal tract model based on the Bandwidth (BW) approach. The bandwidth is determined by finding the frequencies where the spectral energy is 3 dB below the peak. Average gain was calculated from these bandwidths. Classification results from the Bandwidth approach were then compared with results from 14 MFCC coefficients using BPNN (Backpropagation Neural Network), MLR (Multinomial Logistic Regression) and LDA (Linear Discriminant Analysis). The classification accuracy obtained shows that the Bandwidth approach performs better than MFCC with all of these classifiers.

Keywords: Vowel Recognition, Spectral Envelope, Bandwidth Approach, Neural Network, Logistic Regression

1. Introduction

Automatic speech recognition (ASR) belongs to the family of digital speech processing technologies that also includes speech synthesis (text-to-speech, language generation) and voice biometrics (speaker identification, speaker verification). In general, its aim is to allow a machine to replicate the human ability to hear, identify, and utter natural spoken language, as shown in Fig. 1. The earliest attempts to devise ASR systems were made in the 1950s and 1960s, when various researchers tried to exploit fundamental ideas of acoustic phonetics. Since signal processing and computer technologies were still very primitive, most of the speech recognition systems investigated used spectral resonances during the vowel region of each utterance, extracted from the output signals of an analogue filter bank and logic circuits. Over the past 35 years, ASR systems have made great progress. Several factors have contributed to this rapid progress, such as the development of advanced signal processing techniques and continuously increasing computing power.

Figure 1. ASR System

The goal of an automatic speech recognition (ASR) system is to transcribe speech to text. As illustrated in Fig. 1, the speaker's mind decides the source word sequence W that is delivered through his/her text generator. The source is passed through a noisy communication channel that consists of the speaker's vocal apparatus, which produces the speech waveform, and the speech signal processing component of the speech recognizer. Finally, the speech decoder decodes the acoustic signal X into a word sequence W*, which should be as similar as possible to the original word sequence W. The complexity of a speech processing application is largely determined by the requirements of the system. For an automatic speech recognition system, the important parameters are type of speech, speaking style, vocabulary size, operating environment and speaker dependency. For type of speech, an isolated-word speech recognition system requires that the speaker pauses


briefly between words, whereas a continuous speech recognition system does not. In terms of speaking style, a speaker's dialect and accent can make it difficult for an ASR system to recognize the spoken words. Some applications require the use of only a limited number of predictable words, while other ASR systems can be confronted with any possible word. In general, ASR is also more difficult when many similar-sounding words have to be distinguished. Some applications are only used in a quiet office, while others have to recognize speech that is distorted by unpredictable background noises, as in a train station or a market place. In human language, a phoneme is the smallest structural unit that distinguishes meaning. A language like English commonly combines phonemes to form a word. In Bahasa Malaysia, children are taught to spell words using a combination of consonants and vowels. For English words, audio signals are broken up into acoustic components and translated into phonemes. The arrangement and sequence of these phonemes are then compared with actual words from an English database that can be made up of thousands of words. English word pronunciation depends on a sequential combination of phonemes. Proper Bahasa Malaysia involves only six vowel phonemes, which are /a/, /e/, /i/, /o/, /u/ and /e'/ [1], whereas typical American English has 20 vowel phonemes [2]. Hence, it is possible for a Malay word to be spelled out by a computer in a manner similar to a human being. Among the Malaysian universities active in speech recognition research are Universiti Teknologi Malaysia (UTM), Universiti Kebangsaan Malaysia (UKM), Universiti Putra Malaysia (UPM), Universiti Sains Malaysia (USM) and Multimedia University (MMU). For example, UTM did research into Malay plosive sounds [3,4] and Malay numbers [5,6]. UTM also did a study on Malay vowels based on cepstral coefficients [7] and a fusion of Dynamic Time Warping (DTW) and Hidden Markov Model (HMM) [8]. USM experimented with 200 vowel signals using a wavelet de-noising approach and a Probabilistic Neural Network model [9]. MMU studied speech emotion recognition based on LPC analysis, classified using Neural Network and Fuzzy models [10]. UPM is investigating the use of Neural Networks to recognize Malay digits [11]. Bandwidth is the difference between the upper and lower cutoff frequencies of a signal spectrum and is measured in hertz. In signal processing, the bandwidth is determined by the frequencies at which the gain drops 3 dB below the peak [12], as shown in Equation (1). Refer to Fig. 2 for a simple illustration.

$E_{BW} = \frac{1}{\sqrt{2}} E_{peak}$    (1)

Figure 2. Bandwidth example
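As a concrete illustration of Equation (1), the following minimal sketch (not the authors' code) locates the -3 dB band edges around the highest peak of a magnitude spectrum; the function and argument names are illustrative.

```python
import numpy as np

def bandwidth_3db(freqs, mag):
    """Find the -3 dB bandwidth around the highest spectral peak.

    freqs: frequency bins in Hz; mag: linear magnitude spectrum.
    """
    peak = np.argmax(mag)
    threshold = mag[peak] / np.sqrt(2)  # -3 dB in amplitude, per Eq. (1)

    # Walk outward from the peak until the magnitude drops below threshold.
    lo = peak
    while lo > 0 and mag[lo - 1] >= threshold:
        lo -= 1
    hi = peak
    while hi < len(mag) - 1 and mag[hi + 1] >= threshold:
        hi += 1
    return freqs[lo], freqs[hi]  # lower and upper -3 dB edges in Hz
```

Applied to the first peak of vowel /u/ in Section 3, this procedure yields the 336-477 Hz range reported there.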

Human speech has a strict hierarchical structure. It consists of sentences, which can be divided into words, which in turn are built from phonemes, the basic voice construction elements. Vowels can be defined as the phonemes whose persistent frequency characteristics are most pronounced. These frequency characteristics represent a stable basis for the construction of an efficient vowel recognizer. It is known from the literature [13,14,15] that the spectral properties of male, female and child speech differ in a number of ways, especially in terms of average vocal tract length (VTL). The VTL of a female is about 10% shorter than the VTL of a male. The VTL of children is shorter still (by up to 10%) than that of females.

2. Methodology

A summary of the entire vowel recognition process is shown in Fig. 3 below.

Figure 3. Vowel Recognition Process

Data collection was carried out twice, taking recordings from a total of 80 individuals consisting of students and staff from Universiti Malaysia Perlis (UniMAP) and Universiti Utara Malaysia (UUM). The recordings were made using a microphone and a laptop computer with a sampling frequency of 8000 Hz. The words "KA, KE, KI, KO, KU" were used to represent the five vowels /a/, /e/, /i/, /o/ and /u/ because vowels have significantly more energy than consonants. Based on [16,17,18,19], the first three formants of vowels, which carry the vowels' main characteristics, are situated below 4 kHz. For this study, a sampling frequency of 8 kHz was therefore used to sample the vowels. The recordings were done 3 to 4 times per


speaker. The details of the data collection are listed in Table 1 below.

Table 1. Data Collection Details

Information           1st Data Collection    2nd Data Collection
Sources               40 UniMAP students     20 UUM staff and 20 students
Recorded utterances   640                    445
Sampling frequency    8000 Hz                8000 Hz
Words uttered         /ka/, /ke/, /ki/,      /ka/, /ke/, /ki/,
                      /ko/, /ku/             /ko/, /ku/

The segmentation process is summarized in Fig. 4. First, the voice signal was recorded using a laptop and a microphone at a sampling frequency of 8000 Hz. The DC offset introduced by the recording equipment was removed and the resultant signal was then normalized. The start and end point locations were determined using an energy-based method.

Figure 4. Vowel Extraction Process
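The paper does not list its segmentation code; the sketch below illustrates the steps of Fig. 4 under stated assumptions (the 160-sample frame, i.e. 20 ms at 8 kHz, and the 10% energy threshold are illustrative choices, not values from the paper).

```python
import numpy as np

def segment_vowel(signal, frame_len=160, energy_ratio=0.1):
    """DC removal, normalization, then energy-based start/end detection."""
    s = signal - np.mean(signal)      # remove DC offset from the equipment
    s = s / np.max(np.abs(s))         # normalize amplitude to [-1, 1]

    # Short-time energy over non-overlapping frames.
    n_frames = len(s) // frame_len
    frames = s[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)

    # Keep the span of frames whose energy exceeds a fraction of the maximum.
    active = np.where(energy > energy_ratio * energy.max())[0]
    start, end = active[0] * frame_len, (active[-1] + 1) * frame_len
    return s[start:end]
```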

Normally, frame-by-frame analysis is used to analyze speech signals, but in this vowel recognition method only a single-frame analysis was used to extract the features. In order to determine the best frame size and location on the waveform, spectra were analyzed using frame-shifted and frame-expanding waveform methods, as shown in Figs. 5 and 6. The spectra of the frame-shifting analysis show an inconsistent response as the frame moves from left to right. On the other hand, the spectra of the frame-expanding analysis show the same consistent response for different frame sizes when the center of the frame coincides with the center of the waveform. The frame size chosen was 70% of the waveform length, with the frame centered at the center of the waveform. When any part of the frame is chosen for analysis, the segmentation process may cause some discontinuity between the signal at the beginning and end of the voice segment. This can produce spectral leakage. To reduce the discontinuity, a Hamming window function was applied to bring the signal smoothly to zero at the beginning and end points.

Figure 5. Frame-Shifted Method

Figure 6. Frame-Expanding Method

The Hamming window is given by Equation (2).

$$w_H[m] = \begin{cases} 0.54 + 0.46\cos(\pi m / M), & -M \le m \le M \\ 0, & \text{otherwise} \end{cases}\qquad(2)$$
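A minimal sketch of the single-frame windowing step: take the center 70% of the vowel segment (per the text above) and apply a Hamming window. NumPy's `hamming` is the standard shifted form of the symmetric window in Eq. (2).

```python
import numpy as np

def center_frame_windowed(vowel, frac=0.7):
    """Center frame of length frac * len(vowel), Hamming-windowed."""
    n = int(len(vowel) * frac)
    start = (len(vowel) - n) // 2
    frame = vowel[start:start + n]
    # np.hamming(n) = 0.54 - 0.46*cos(2*pi*k/(n-1)): Eq. (2) shifted so the
    # window peak sits at the middle of the frame.
    return frame * np.hamming(n)
```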

Next, the signal was pre-emphasized in order to emphasize the higher-frequency components of the signal. Pre-emphasis compensates for the effect of the glottal source and the energy radiation from the lips [20]. The pre-emphasis filter was implemented by Equation (3) using a pre-emphasis constant of 0.95.

$s'[n] = s[n] - A_c\, s[n-1]$    (3)

where s'[n] is the pre-emphasized signal, s[n] is the original signal, and A_c is the pre-emphasis constant, set to 0.95.
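Equation (3) is a first-order FIR filter, so a one-line implementation suffices; this sketch uses SciPy's `lfilter` with the paper's constant of 0.95.

```python
from scipy.signal import lfilter

def pre_emphasize(s, a_c=0.95):
    """Eq. (3): s'[n] = s[n] - a_c * s[n-1]."""
    return lfilter([1.0, -a_c], [1.0], s)
```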

3. Analyzing the Vocal Tract

The magnitudes of the 512-point complex frequency response are plotted for each of the vowels. In Fig. 7, the averaged spectrum envelope plots over all speakers are shown for each of the vowels in linear-scaled magnitude, modeled using the autoregressive method.

Figure 7. Linear Scaled Spectrum
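The paper does not list its autoregressive modeling code. One common way to obtain such an envelope is via linear prediction, sketched below with librosa; the LPC order of 12 is an assumption, not a value from the paper.

```python
import numpy as np
import librosa
from scipy.signal import freqz

def spectral_envelope(frame, sr=8000, order=12, n_points=512):
    """AR (LPC) spectral envelope of a windowed, pre-emphasized vowel frame."""
    a = librosa.lpc(frame.astype(float), order=order)  # all-pole model 1/A(z)
    w, h = freqz(1.0, a, worN=n_points, fs=sr)         # 512-point response
    return w, np.abs(h)                                # freqs in Hz, |H(f)|
```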

The peaks in the linear-scaled spectrum are more defined than in the log-scaled spectrum, and it is easily visible how closely the responses of different speakers match up for each of the vowels. Based on observation and analysis of the plotted outputs, significant differences were found between the vowel frequency responses within certain frequency bands. These energy differences can therefore be used as features to classify the vowels. Mean magnitude parameter values were calculated from these subbands for each speech sample representing each of the vowels. Altogether, seven subbands were used to extract gain features from the vocal tract model. Five of them were determined by applying the Bandwidth (BW) approach to the first peak response of every averaged vowel plot. The first formant peak for each of these vowels is located below 1000 Hz. Fig. 8 shows the frequency range obtained from the first peak response of vowel /u/: the bandwidth determined for vowel /u/ runs from 336 Hz to 477 Hz, with a magnitude value of 10.46.

Figure 8. Determining BW frequency ranges for vowel /u/

Two more ranges were determined by comparing the average vowel plots. The sixth range is between 1000 Hz and 1500 Hz and the seventh between 1500 Hz and 2000 Hz, as shown in Fig. 9.

Figure 9. Location of FR6 and FR7

Table 2 shows the subbands of frequencies that were used to extract the gain magnitude features. Seven gain features were extracted based on these frequency ranges.

Table 2. Frequency Ranges Used to Extract Features

Subband   f_start (Hz)   f_stop (Hz)
FR1       633            938
FR2       430            578
FR3       266            398
FR4       422            563
FR5       336            477
FR6       1000           1500
FR7       1500           2000
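Given the envelope frequencies and magnitudes, the seven gain features reduce to subband means. A minimal sketch using the ranges from Table 2:

```python
import numpy as np

# Subband edges from Table 2 (Hz): FR1-FR5 come from the BW approach,
# FR6 and FR7 from comparing the averaged vowel plots.
SUBBANDS = [(633, 938), (430, 578), (266, 398), (422, 563),
            (336, 477), (1000, 1500), (1500, 2000)]

def gain_features(freqs, mag):
    """Mean envelope magnitude in each of the seven subbands."""
    return np.array([np.mean(mag[(freqs >= lo) & (freqs <= hi)])
                     for lo, hi in SUBBANDS])  # 7-element feature vector
```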

4. Classification Results

One linear classifier (LDA) and two non-linear classifiers (BPNN and MLR) were used to classify the vowels. Linear Discriminant Analysis (LDA) is used in statistics and machine learning to find the linear combination of features which best separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or for dimensionality reduction. Logistic regression is a regression technique used when the dependent variable is a dichotomy and the independent variables are of any type; multinomial logistic regression (MLR) handles the case of a dependent variable with more than two classes. Backpropagation Neural Network (BPNN) is a supervised learning method which


calculates the gradient of the error of the network with respect to the network's modifiable weights. The network training function used is GDM, which updates weight and bias values according to gradient descent with momentum and an adaptive learning rate. BPNN is good at prediction and classification, but it lacks an explanation of what has been learned and is somewhat slow compared to other learning algorithms. GDM (Gradient Descent Backpropagation with Momentum) algorithms generalize well [21,22] but suffer from slow convergence. There are seven input neurons representing the seven formant bandwidth mean gain magnitude features, two hidden layers of ten neurons each, and three output neurons representing the vowels /a/, /e/, /i/, /o/ and /u/, encoded as 001, 010, 011, 100 and 101 respectively. The network was trained on 70% of the data with a learning rate of 0.3 and a momentum factor of 0.8. The weights and biases of the network were initialized randomly. For comparison purposes, 14 MFCC coefficients were computed and used to benchmark the performance of the bandwidth approach. Fig. 10 below shows the classification rate of BW and MFCC at different testing tolerances.

Figure 10. Classification Rate by Different Classifiers
[Bar chart of overall classification rate (CR, %): BW(NN) 88.41, MFCC(NN) 87.90, BW(MLR) 98.40, MFCC(MLR) 95.40, BW(LDA) 94.36, MFCC(LDA) 93.83]
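For orientation, the three classifiers can be approximated with scikit-learn as sketched below. This is not the authors' setup: the SGD-with-momentum MLP only approximates the GDM training function, and the data here is a random placeholder standing in for the 7 gain features of the 1085 recorded utterances.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1085, 7))     # placeholder for the 7 BW gain features
y = rng.integers(0, 5, size=1085)  # placeholder labels for /a e i o u/

# 70% of the data for training, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "MLR": LogisticRegression(max_iter=1000),  # multinomial for >2 classes
    # Hidden layers (10, 10), lr 0.3 and momentum 0.8 follow the paper;
    # sklearn's SGD solver stands in for the GDM training function.
    "BPNN": MLPClassifier(hidden_layer_sizes=(10, 10), solver="sgd",
                          learning_rate="adaptive", learning_rate_init=0.3,
                          momentum=0.8, max_iter=2000),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```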

A testing tolerance of 0.2 was selected based on its accuracy and its permissible limit of variation. Table 3 below summarizes the averaged classification results at a testing tolerance of 0.2.

Table 3. Classification Results of Different Methods (averaged from 20 runs)

5. Conclusion

BW performs better than MFCC with all three classifiers. The improvements of BW over MFCC using BPNN, MLR and LDA were 0.51%, 3.4% and 0.53% respectively. Table 4 shows the vowel classifications by the different methods and classifiers, and Fig. 11 shows the vowel classification accuracy of the two methods. The result using the BPNN classifier is not plotted because it shows the worst results.

Table 4. Classification Results of Different Methods (averaged from 20 runs)

Figure 11. Classification Rate by Vowel
[Grouped bar chart of classification rate (CR, %) per vowel (/a/, /e/, /i/, /o/, /u/) and overall, for BW(NN), MFCC(NN), BW(MLR), MFCC(MLR), BW(LDA) and MFCC(LDA)]

Based on the results in Table 4 and Fig. 11, MLR performs best in classifying the vowels, followed by LDA and BPNN. The overall best method for classifying vowels is the BW approach, which performs best in classifying vowels /e/, /i/, /o/ and /u/. MFCC using LDA performs best in classifying vowel /a/, closely followed by BW using MLR. MFCC performs worst in classifying vowels /o/ and /u/. Training time for BPNN was about the same for both BW and MFCC.

6. References

[1] Maris M. Y., The Malay Sound System, Fajar Bakti, Malaysia, 1979.
[2] http://www.btinternet.com/~ted.power/phon00.htm, accessed on 26 March 2008.

[3] H.N. Ting, J. Yunus, S.H. Salleh, "Speaker-independent phonation recognition for Malay plosives using neural networks", Proceedings of the International Joint Conference on Neural Networks (IJCNN '02), 2002, Honolulu, HI, USA.
[4] H.N. Ting, J. Yunus, C.W. Lee, "Speaker-independent Malay isolated sounds recognition", Proceedings of the 9th International Conference on Neural Information Processing (ICONIP 2002).
[5] M.S.H. Salam, D. Mohamad, S.H.S. Salleh, "Neural network speaker dependent isolated Malay speech recognition system: handcrafted vs genetic algorithm", Sixth International Symposium on Signal Processing and its Applications, 2001, Kuala Lumpur, Malaysia.
[6] R. Sudirman, S.H. Salleh, Shaharudin Salleh, "The Effectiveness of DTW-FF Coefficients and Pitch Feature in NN Speech Recognition", Proceedings of the Third International Conference on Artificial Intelligence in Engineering & Technology, November 22-24, 2006, Kota Kinabalu, Sabah, Malaysia.
[7] S.A.R. Al-Haddad, S.A. Samad, A. Hussain, K.A. Ishak and A.O.A. Noor, "Robust Speech Recognition Using Fusion Techniques and Adaptive Filtering", American Journal of Applied Sciences 6 (2): 290-295, 2009, ISSN 1546-9239.
[8] H.N. Ting, J. Yunus, "Speaker-independent Malay vowel recognition of children using multi-layer perceptron", 2004 IEEE Region 10 Conference (TENCON 2004).
[9] C.P. Lim, S.C. Woo, A.S. Loh, R. Osman, "Speech recognition using artificial neural networks", Proceedings of the First International Conference on Web Information Systems Engineering, 2000, Hong Kong.
[10] A.R. Aishah, K. Ryoichi, Z.A. Mohamad Izani, "Comparison Between Fuzzy and NN Method for Speech Emotion Recognition", Proceedings of the Third International Conference on Information Technology and Applications (ICITA '05).
[11] C.L. Tan and A. Jantan, "Digit Recognition Using Neural Networks", Malaysian Journal of Computer Science, Vol. 17, No. 2, December 2004, pp. 40-54.
[12] Carlson G.E., Signal and Linear System Analysis, Houghton Mifflin Company, 1992.
[13] H. Wakita, "Normalization of vowels by vocal tract length and its application to vowel identification", IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-25, p. 183, 1977.
[14] Fant G., Acoustic Theory of Speech Production, The Hague, The Netherlands: Mouton, 1960.
[15] G.E. Peterson and H.L. Barney, "Control methods used in a study of the vowels", Journal of the Acoustical Society of America, vol. 24, pp. 175-184, 1952.
[16] Rabiner, L. and Juang, B.H., Fundamentals of Speech Recognition, Prentice Hall, 1993.
[17] J. Hillenbrand, L.A. Getty, M.J. Clark, K. Wheeler, "Acoustic Characteristics of American English Vowels", Journal of the Acoustical Society of America, 1995, vol. 97, pp. 3099-3111.
[18] V. Vuckovic, M. Stankovic, "Formant analysis and vowel classification methods", 5th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Service (TELSIKS), 2001.
[19] Huang X., Acero A., Hon H.W., Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall, 2001.
[20] T. Kinnunen, "Spectral features for automatic text-independent speaker recognition", Licentiate's thesis, University of Joensuu, 2003.
[21] A. Vongkunghae, A. Chumthong, "The Performance Comparisons of Backpropagation Algorithm's Family on a Set of Logical Functions", ECTI Transactions on Electrical Engineering, Electronics and Communications, Vol. 5, No. 2, August 2007.
[22] B. Widrow, M.A. Lehr, "30 years of adaptive neural networks: perceptron, Madaline and backpropagation", Proceedings of the IEEE, Volume 78, Issue 9, Sept. 1990, pp. 1415-1442.
