
IITKGP-SEHSC: Hindi speech corpus for emotion analysis

Shashidhar G. Koolagudi, Ramu Reddy, Jainath Yadav, K. Sreenivasa Rao
School of Information Technology, Indian Institute of Technology, Kharagpur, India
Email: [email protected], [email protected], [email protected], [email protected]

Abstract— In this paper, a simulated emotion Hindi speech corpus is introduced for analyzing the emotions present in speech signals. The proposed database is recorded using professional artists from the Gyanavani FM radio station, Varanasi, India. The speech corpus is collected by simulating eight different emotions using neutral (emotion free) text prompts. The emotions present in the database are anger, disgust, fear, happy, neutral, sad, sarcastic and surprise. This speech corpus is named the Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC). Emotion classification is performed on the proposed IITKGP-SEHSC using prosodic and spectral features. Mel frequency cepstral coefficients (MFCCs) are used to represent spectral information. Energy, pitch and duration are used to represent prosodic information. The average emotion recognition performance using prosodic and spectral features is found to be around 77% and 81%, respectively, for female speech utterances. This paper describes the design, acquisition, post processing and evaluation of the proposed speech corpus (IITKGP-SEHSC). The quality of the emotions expressed in the database is evaluated using subjective listening tests. The emotion recognition performance using subjective listening tests is observed to be around 74%. The results of the subjective listening tests are broadly on par with the results obtained using prosodic analysis of the database.

Keywords— IITKGP-SEHSC, Duration, Emotion, Emotion recognition, Energy, Prosody, Spectral features, Pitch, Standard deviation of pitch.

I. Introduction

HUMAN beings use emotions extensively for expressing their intentions through speech. At the receiving end, the intended listener interprets the message according to the emotions present in the speech. Therefore, in developing speech systems (i.e., speech recognition, speaker recognition, speech synthesis and language identification), one should appropriately exploit the knowledge of emotions. However, most of the existing speech systems do not use the knowledge of emotions while performing their tasks. This is due to the difficulty in modeling and characterizing the emotions present in speech. Speech systems developed under different constraints have applications only in limited domains. Broad applications such as real-time speech-to-speech translation and sophisticated human-machine interfaces demand robust speech systems which can work in unconstrained environments.

In this direction, to develop robust speech systems, there is a need to analyze and characterize the emotions present in speech. The basic issues that need to be addressed in this area are: 1. exploring the features for discriminating the emotions in speech, 2. exploring the models for capturing the emotion-specific knowledge from speech, 3. characterization of emotions from speech using different features, and 4. incorporation of emotions in speech synthesis.

In practice, it is very difficult to collect real-life emotions, as they are not expressed in a controlled way. Therefore, as a first step, analysis of simulated emotions, recorded from professional artists, may be carried out. The emotions present in practical situations have a lot of variability, and it is difficult to model them.

The proposed speech database is the first one developed in Hindi for analyzing the emotions present in speech. This database is sufficiently large to analyze the emotions in view of speaker, gender, text and session variability. Before this, we had carried out a study on emotions using the Speech Under Simulated Emotions (SUSE) database, collected at IIT Guwahati [1]. SUSE is very small, containing 600 utterances, used for simulating 4 emotions (anger, compassion, happy and neutral). These utterances were emotionally simulated by inexperienced graduate students of IIT Guwahati, using a single text prompt. Therefore, emotion analysis using the above database may not be appropriate for developing robust speech systems. Hence, to fill this gap, we developed IITKGP-SESC [2], which is essential for the analysis of emotions in speech in the context of Indian languages. This speech database is collected in the Telugu language. IITKGP-SESC contains 12000 utterances, simulated in 8 emotions by 10 All India Radio (AIR) artists. The total duration of the corpus is around 7 hours. IITKGP-SESC may be sufficient for speaker-, text- and gender-specific analysis for emotion recognition in Indian languages. But emotions are basically independent of language. Therefore, there is a need for an equally competent speech database in another Indian language, to study language-independent emotion recognition. This is the motivation for collecting IITKGP-SEHSC, in the Hindi language. Information about several emotional speech corpora, in different languages, is available in the literature [3].

The remaining part of the paper is arranged as follows: the details of IITKGP-SEHSC are discussed in section II. The philosophy of the classifiers used to develop the emotion recognition models is explained in section III. The statistics of the prosodic parameters for various emotions of IITKGP-SEHSC are discussed in section IV. The discrimination of emotions using prosodic and spectral features is carried out in section V. Subjective evaluation of the proposed database (IITKGP-SEHSC) is explained in section VI. The summary of the paper and the future work that can be carried out on this database are given in the final section of the paper.

II. IITKGP-SEHSC (Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus)

The proposed database is recorded using 10 (5 male and 5 female) professional artists from the Gyanavani FM radio station, Varanasi, India. The artists have sufficient experience in expressing the desired emotions from neutral sentences. The male artists are in the age group of 28-48 years, with varied experience of 5-20 years. Similarly, the female artists are selected from the age group of 20-30 years, with 3-10 years of experience. For recording the emotions, 15 Hindi text prompts are considered. All the sentences are emotionally neutral in meaning. Each of the artists has to speak the 15 sentences in 8 basic emotions in one session. The number of sessions considered for preparing the database is 10. The total number of utterances in the database is 12000 (15 text prompts × 8 emotions × 10 speakers × 10 sessions). Each emotion has 1500 utterances. The number of words and syllables in the sentences vary from 4-7 and 9-17, respectively. The total duration of the database is around 9 hours. The eight emotions considered for collecting the proposed speech corpus are: anger, disgust, fear, happy, neutral, sad, sarcastic and surprise. The speech samples are recorded using a SHURE dynamic cardioid microphone C660N. The distance between the speaker and the microphone is maintained at 1 ft. The speech signal was sampled at 16 kHz, and each sample is represented as a 16-bit number. The sessions are recorded on alternate days to capture the variability in the human speech production mechanism. In each session, all the artists have given recordings of the 15 sentences in 8 emotions. The recording was done in such a way that each artist has to speak all the sentences at a stretch in a particular emotion. This provides coherence among the sentences for each emotion category. Since all the artists are from the same organization, coherence in the quality of the collected speech data is ensured. The entire speech database is recorded using a single microphone and at the same location. The recording was done in a quiet room, without any obstacles in the recording path.
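The counts quoted above follow directly from the recording protocol; the short Python sketch below simply reproduces that arithmetic and does not depend on the corpus files themselves.

# Corpus design arithmetic for IITKGP-SEHSC (all numbers taken from the text above).
text_prompts = 15   # emotionally neutral Hindi sentences
emotions = 8        # anger, disgust, fear, happy, neutral, sad, sarcastic, surprise
speakers = 10       # 5 male + 5 female professional artists
sessions = 10       # recorded on alternate days

total_utterances = text_prompts * emotions * speakers * sessions   # 12000
per_emotion = total_utterances // emotions                         # 1500
per_speaker = text_prompts * emotions * sessions                   # 1200, used later for the train/test split

print(total_utterances, per_emotion, per_speaker)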

III. Classifiers used for emotion classification

In this work, SVMs are used to classify the emotions based on prosodic features, and GMMs are used to develop emotion recognition systems based on spectral features. SVMs are known to distinguish feature vectors based on the discriminative characteristics present in them. The number of feature vectors is not crucial in the case of SVMs, as the discriminating support vectors play the important role during classification. In this study, an utterance-level decision is to be taken regarding the emotion category, so the number of feature vectors is limited. This is the motivation for choosing SVMs to develop the emotion recognition models using prosodic features. GMMs are known to capture the general distribution of data points in the feature space [4]. Therefore, GMMs are suitable for developing emotion recognition models using spectral features, as the decision regarding the emotion category may be taken based on the distribution of frame-wise feature vectors in a feature space.

A. Support Vector Machines (SVM)

SVMs are designed for two-class pattern classification. Multiclass (n-class) pattern classification problems can be solved using a combination of binary (2-class) support vector machines. The one-against-the-rest approach is used for decomposition of the n-class pattern classification problem into n two-class classification problems.

For developing the SVM model for a specific emotion, feature vectors derived from the speech utterances of the desired emotion are used as positive examples, and the feature vectors derived from the speech of all other emotions (other than the desired emotion) are treated as negative examples. The block diagram of the emotion recognition (ER) system using SVM models is shown in Fig. 1. For evaluating the performance of the ER systems, the feature vectors derived from the test utterances are given as input to all SVM models. The output of each model is given to a decision logic, which determines the emotion based on the highest score among the evidence provided by the emotion models. Gaussian kernels are used to develop the SVM-based emotion recognition systems, and parameters such as the kernel's standard deviation are determined empirically. The emotion recognition systems using prosodic features are developed using SVMs.

Fig. 1. Emotion recognition system using support vector machines: the test feature vector is scored by one SVM per emotion (anger, disgust, fear, happy, neutral, sad, sarcastic and surprise), and a decision device outputs the hypothesized emotion with the highest score.
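The one-against-the-rest scheme and the decision logic of Fig. 1 can be sketched as follows. This is a minimal illustration, assuming scikit-learn's SVC with an RBF (Gaussian) kernel; the training arrays, label names and the gamma value are hypothetical placeholders, since the paper does not specify its implementation or kernel parameters.

# Sketch: one-against-the-rest SVMs with Gaussian (RBF) kernels for the 8 emotions.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "fear", "happy",
            "neutral", "sad", "sarcastic", "surprise"]

def train_ovr_svms(X_train, y_train, gamma=0.1):
    """Train one binary SVM per emotion (desired emotion = positive class)."""
    models = {}
    for emo in EMOTIONS:
        labels = np.where(np.asarray(y_train) == emo, 1, 0)
        clf = SVC(kernel="rbf", gamma=gamma)  # kernel width set empirically, as in the text
        clf.fit(X_train, labels)
        models[emo] = clf
    return models

def classify(models, x):
    """Decision logic: hypothesize the emotion whose model gives the highest score."""
    scores = {emo: float(clf.decision_function(np.asarray(x).reshape(1, -1))[0])
              for emo, clf in models.items()}
    return max(scores, key=scores.get)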

B. Gaussian mixture models (GMM)

Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering and density estimation. They model the probability density function of the observed data points using a multivariate Gaussian mixture density. Given a set of inputs, a GMM refines the weights of each component distribution through the expectation-maximization (EM) algorithm. Once a model is generated, conditional probabilities can be computed for test patterns (unknown data points).


The number of Gaussians in the mixture model is known as the number of components, and it indicates the number of clusters into which the data points are to be grouped. In this work, one GMM is developed to capture the information about one emotion. The components within each GMM capture finer-level details among the feature vectors of that emotion. Depending on the number of data points, the number of components may be varied in each GMM. Using few components in a GMM trained on a large number of data points may lead to overly generalized clusters, failing to capture specific details related to each class. On the other hand, overfitting of the data points may happen if too many components represent few data points. Obviously, the complexity of the models increases if they contain a higher number of components. In this work, the GMMs are designed with 64 components and iterated 30 times to attain convergence of the weights. The emotion recognition systems using spectral features are developed using GMMs.
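A minimal sketch of the GMM training step described above, assuming scikit-learn's GaussianMixture; the diagonal covariance choice and the input dictionary are illustrative assumptions, while the 64 components and 30 EM iterations follow the text.

# Sketch: one 64-component GMM per emotion, trained with EM on pooled frame-level
# feature vectors (e.g. MFCCs). `frames_by_emotion` is a hypothetical dict mapping
# each emotion name to an (n_frames x feature_dim) array from the training utterances.
from sklearn.mixture import GaussianMixture

def train_emotion_gmms(frames_by_emotion, n_components=64, max_iter=30):
    gmms = {}
    for emotion, frames in frames_by_emotion.items():
        gmm = GaussianMixture(n_components=n_components,   # 64 components, as in the text
                              covariance_type="diag",      # assumption: not stated in the paper
                              max_iter=max_iter)           # 30 EM iterations, as in the text
        gmm.fit(frames)
        gmms[emotion] = gmm
    return gmms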

IV. Prosodic analysis

From the state-of-the-art literature, it is observed that the characteristics of the emotions can be seen at the source level (characteristics of the excitation signal and shape of the glottal pulse), at the system level (shape of the vocal tract and nature of the movements of the different articulators) and at the prosodic level [5]. Among the features obtained from the different levels, prosodic features have been widely used in the literature for emotion recognition [6]. Therefore, in this paper, prosodic analysis of the utterances of IITKGP-SEHSC is performed. The prosodic parameters considered in this study are (1) average duration of all utterances for a specific emotion, (2) average pitch, (3) standard deviation of pitch (SD), and (4) average energy.

The duration of each of the speech files is determined in seconds, and the mean of the durations is computed for each emotion category. The pitch values of each utterance are obtained from the autocorrelation of the Hilbert envelope of the LP residual [7]. As each utterance has a sequence of pitch values according to its intonation pattern, the analysis is carried out using the mean and standard deviation (SD) of the pitch values. The average frame-wise energy of the speech signal is determined for all utterances of a specific emotion. In this paper, we have not used the entire database for deriving the statistics of the prosodic parameters; one male and one female artist's speech files are considered for illustrating these statistics. The average values of the prosodic parameters are determined at the sentence level. Table I shows the mean values of the prosodic parameters for different emotions. From the table it may be observed that the time taken to express fast emotions like surprise, anger and disgust is less compared to that of slow emotions like sad and fear. Pitch values are observed to be high for extreme emotions like sad, surprise and fear. Energy is obviously higher for high-arousal emotions like anger and fear. The energy in the case of the sad emotion is also somewhat high, but this is a rare observation compared to other emotional speech databases.
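The per-utterance measurements behind Table I can be outlined as in the sketch below. The paper extracts pitch from the autocorrelation of the Hilbert envelope of the LP residual [7]; here the voiced-frame pitch track is simply taken as an input, so the function and its arguments are illustrative assumptions rather than the authors' code.

# Sketch: duration, mean/SD of pitch and mean frame energy for one utterance.
# `x` is a 16 kHz mono signal (numpy array) and `f0` a precomputed voiced-frame
# pitch track in Hz; both are hypothetical inputs.
import numpy as np

def prosodic_stats(x, f0, sr=16000, frame_len=320, hop=160):  # 20 ms frames, 10 ms shift at 16 kHz
    duration = len(x) / sr                                     # duration in seconds
    frames = [x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, hop)]
    mean_energy = float(np.mean([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames]))
    return {"duration_s": duration,
            "pitch_mean_hz": float(np.mean(f0)),
            "pitch_sd_hz": float(np.std(f0)),
            "mean_frame_energy": mean_energy}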

V. Emotion classification using prosodic and spectral features

In this work, prosodic and spectral features are separately extracted from the utterances. Frame-wise pitch and energy values are extracted to represent prosodic information, and the average duration of the syllables present in an utterance is computed to represent the duration parameter. Similarly, mel frequency cepstral coefficients (MFCCs) are used as the correlates of spectral information. Speech frames of 20 ms, with a shift of 10 ms, are used to extract both prosodic and spectral features. Each artist has uttered 1200 sentences (15 sentences × 8 emotions × 10 sessions), out of which 960 sentences are used for deriving the statistical models and the remaining 240 are used for verifying the models. In this study, we use first-order statistics (the mean of the distribution) for the analysis of the basic emotions.
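As an illustration of the framing stated above (13 MFCCs per 20 ms frame with a 10 ms shift at 16 kHz), the sketch below uses librosa as an assumed dependency; the paper does not name the toolkit it used.

# Sketch: MFCC extraction with the frame settings stated in the text.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    x, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.020 * sr),       # 20 ms analysis frame
                                hop_length=int(0.010 * sr))  # 10 ms shift
    return mfcc.T  # one 13-dimensional vector per frame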

A. Emotion recognition using prosodic features

Prosodic features are important correlates of the emotion-specific information present in speech. Specific phoneme-level modulation is observed when the same text is uttered in different emotions, which leads to variation in the prosodic parameters. Therefore, in this study, we have developed different emotion recognition systems using the energy, pitch and duration contours extracted from the speech utterances. The emotion recognition performance using energy, pitch and duration values separately is found to be 56%, 68% and 62%, respectively. The durations of the syllables are derived using ergodic hidden Markov models, and the sequence of syllable durations forms the feature vector representing the duration contour. For representing the intonation pattern of an utterance, the sequence of pitch values of all the voiced frames of the utterance is used, and resampling is performed to derive a fixed-dimensional feature vector. Similarly, for representing the energy contour, the sequence of voiced-frame energies followed by resampling is used. In this way, 100-dimensional feature vectors are derived for the pitch and energy contours, and 17-dimensional feature vectors are derived to represent the duration pattern. Table II shows the confusion matrix of the emotion recognition performance of the ER system developed using the combination of all prosodic (energy, pitch and duration) features. In this work, score-level fusion is performed by summing the weighted confidence scores (evidence) derived from the ER systems developed using the individual prosodic features. The weighting rule for combining the scores of the individual modalities is c_m = (1/m) ∑_{i=1}^{m} w_i c_i, where c_m is the multimodal confidence score, w_i and c_i are the weighting factor and confidence score of the i-th modality, and m indicates the number of modalities used for combining the scores. The weights used for the combination are 0.3, 0.35 and 0.35 for energy, pitch and duration, respectively. From Table II, it is observed that there is a classification overlap between anger and fear, as they mostly share similar arousal characteristics, which are represented by the prosodic features.
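A worked sketch of this weighted score-level fusion follows. The per-emotion scores shown are toy placeholders, not values from the paper; only the weighting rule and the weights (0.3, 0.35, 0.35) come from the text.

# Sketch: weighted score-level fusion of the energy, pitch and duration systems.
def fuse_scores(score_dicts, weights):
    """c_m = (1/m) * sum_i w_i * c_i, computed per emotion."""
    m = len(score_dicts)
    return {emo: sum(w * s[emo] for w, s in zip(weights, score_dicts)) / m
            for emo in score_dicts[0]}

# Toy placeholder scores for one test utterance (not values from the paper):
energy_scores   = {"anger": 0.7, "fear": 0.6, "sad": 0.2}
pitch_scores    = {"anger": 0.5, "fear": 0.8, "sad": 0.3}
duration_scores = {"anger": 0.4, "fear": 0.7, "sad": 0.5}

fused = fuse_scores([energy_scores, pitch_scores, duration_scores],
                    weights=[0.3, 0.35, 0.35])   # energy, pitch, duration weights from the text
print(max(fused, key=fused.get))                 # "fear" wins in this toy example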


TABLE I

Mean values of the prosodic parameters for each emotion

             Male Artist                                  Female Artist
Emotion      Duration  Pitch    SD of pitch  Energy       Duration  Pitch    SD of pitch  Energy
             (Seconds) (Hertz)  (Hertz)      (Joules)     (Seconds) (Hertz)  (Hertz)      (Joules)
Anger        2.55      164.18   45.76        343.65       2.80      301.57   50.56        183.96
Disgust      2.67      154.06   46.70        202.58       2.73      264.43   57.83        122.06
Fear         3.65      174.59   20.11        426.45       3.68      285.66   27.74        201.39
Happy        3.18      161.53   35.98        165.38       3.39      302.10   46.67        97.47
Neutral      4.86      122.47   30.19        270.00       4.91      210.75   40.54        128.75
Sad          3.35      178.47   35.72        390.34       3.54      253.63   37.76        225.42
Sarcastic    2.79      159.75   60.88        144.53       2.96      263.12   73.76        76.28
Surprise     2.16      210.85   70.76        232.09       2.30      293.20   85.63        106.72

B. Emotion recognition using spectral features

Emotion recognition from speech depends equally upon the speaker's ability to express emotions and the listener's ability to perceive them. The human auditory system processes the speech frequency components in a nonlinear fashion. This nonlinearity of the human perception mechanism is modeled using the mel scale of frequency bands. The mapping of linear frequency onto mel frequency uses a logarithmic scale, as shown below:

m = 2595 log10(1 + f/700)

where f represents the linear frequency in Hz and m is the corresponding mel frequency. In speech processing, the mel frequency cepstrum is a representation of the short-term power spectrum of a speech frame, obtained using a linear cosine transform of the log power spectrum on the nonlinear mel frequency scale.
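A small worked example of the mel mapping, useful as a sanity check of the formula above (the specific frequencies chosen are illustrative):

import math

def hz_to_mel(f_hz):
    """Mel mapping m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000), 1))   # ~1000.0 mel: 1 kHz maps to roughly 1000 mel
print(round(hz_to_mel(8000), 1))   # ~2840.0 mel: Nyquist frequency for 16 kHz speech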

In this work, we have developed emotion recognition models using 13 MFCC features extracted per frame of 20 ms. During testing, the 13-dimensional MFCC feature vectors are computed and given as input to each of the developed emotion models. Each model gives the probability that a feature vector comes from that model (emotion). The mean of the probabilities of all feature vectors generated from an utterance, with respect to that model, is computed, and the model that gives the highest mean probability is hypothesized as the emotion of the utterance. The emotion classification performance for male and female speakers using spectral features is given in Table III. The table shows the confusion matrices of the emotion classification task, where the diagonal entries correspond to correct classification. The emotion recognition performance of the spectral features is better than that of the prosodic features.
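The utterance-level decision rule described above can be sketched as follows, reusing the per-emotion GMMs from the earlier sketch. The mean per-frame log-likelihood is used here as a numerically stable stand-in for the mean probability mentioned in the text, and the input array is a hypothetical placeholder.

# Sketch: hypothesize the emotion whose GMM best explains the utterance's MFCC frames.
# `utterance_mfccs` is a hypothetical (n_frames x 13) array for one test utterance.
import numpy as np

def recognize_emotion(gmms, utterance_mfccs):
    scores = {emotion: float(np.mean(gmm.score_samples(utterance_mfccs)))
              for emotion, gmm in gmms.items()}
    return max(scores, key=scores.get)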

VI. Subjective evaluation

The quality of the emotions expressed in the database is evaluated using subjective listening tests. Here, the quality represents how well the artists simulated the emotions from the neutral sentences. Human subjects are used to assess the naturalness of the emotions embedded in the speech utterances. This evaluation is carried out by 25 postgraduate and research students of IIT Kharagpur. The study is useful for comparing the emotion recognition performance of human beings and the machine, and it also helps to determine the clearly discriminable and the confusable emotions among the 8 classes.

In this study, 40 sentences (5 sentences from each emotion) from each artist are considered for evaluation. Before taking the test, the subjects are given pilot training by playing 24 sentences (3 sentences from each emotion) from each artist's speech data, to familiarize them with the characteristics of the emotions. The forty sentences used in the evaluation are randomly ordered and played to the listeners. For each sentence, the listener has to mark the emotion category from the set of 8 basic emotions. The overall emotion classification performance for the chosen male and female artists' speech data is given in Table IV. The observations show that the average recognition rates for male and female speech utterances are about 71% and 74%, respectively. Anger, neutral and sad emotions are recognised well compared to the other emotions, whereas disgust and surprise are recognised comparatively less accurately. The expected overlap in the classification is observed between happy, fear and surprise. The emotion recognition performance of the subjective listening tests and of the prosodic features is almost the same, which indicates that human beings mostly use prosodic cues to identify emotions.

VII. Summary and Conclusions

In this paper, we have proposed an emotional speech corpus (IITKGP-SEHSC) recorded in Hindi. The emotions considered for developing IITKGP-SEHSC are anger, disgust, fear, happy, neutral, sad, sarcastic and surprise. The emotion analysis of IITKGP-SEHSC is performed using prosodic parameters. The importance of the prosodic and spectral parameters for discriminating the emotions is shown by performing the emotion classification using prosodic and spectral features. The quality of the emotions present in the developed emotional speech corpus is evaluated using subjective listening tests.

The proposed emotional speech database can be further exploited for characterizing the emotions using emotion-specific features extracted from the vocal tract and the excitation source. The emotion classification performance can be improved further by combining the evidence from different models developed using features extracted from speech at various levels. The use of nonlinear models such as neural networks and support vector machines may further enhance the emotion recognition performance [8, 9]. The proposed database has a wide variety of characteristics in terms of emotions, speakers and text. One can perform a systematic study on emotional analysis from various points of view.


TABLE II

Emotion classification performance of male and female speech, using prosodic features; Abbreviations: A-Anger, D-Disgust, F-Fear, H-Happy, N-Neutral, Sa-Sad, S-Sarcastic, Sur-Surprise.

             Male Artist (Average: 73.33)             Female Artist (Average: 76.75)
             A   D   F   H   N   Sa  S   Sur          A   D   F   H   N   Sa  S   Sur
Anger        47  0   30  10  0   0   0   13           57  0   10  33  0   0   0   0
Disgust      7   73  3   0   4   0   10  3            7   70  0   7   0   0   13  3
Fear         0   0   84  3   0   3   3   7            0   0   77  0   0   0   3   20
Happy        7   7   10  53  0   3   13  7            7   3   13  60  0   0   17  0
Neutral      0   0   0   7   93  0   0   0            0   0   0   3   97  0   0   0
Sad          10  6   17  0   0   53  7   7            7   0   10  0   0   73  7   3
Sarcastic    0   7   0   10  0   0   83  0            0   0   0   13  0   0   87  0
Surprise     0   0   0   0   0   0   0   100          0   0   7   0   0   0   0   93

TABLE III

Emotion classification performance of male and female speech, using spectral features; Abbreviations: A-Anger, D-Disgust, F-Fear, H-Happy, N-Neutral, Sa-Sad, S-Sarcastic, Sur-Surprise.

             Male Artist (Average: 77.375)            Female Artist (Average: 80.75)
             A   D   F   H   N   Sa  S   Sur          A   D   F   H   N   Sa  S   Sur
Anger        43  0   0   17  3   20  17  0            53  0   0   7   0   27  13  0
Disgust      0   67  0   0   0   17  16  0            0   73  0   0   0   17  10  0
Fear         0   0   93  0   0   7   0   0            0   0   90  3   0   7   0   0
Happy        0   0   3   70  10  0   17  0            0   0   3   77  7   13  0   0
Neutral      0   0   0   7   90  3   0   0            0   0   0   7   93  0   0   0
Sad          0   0   4   3   0   93  0   0            0   3   7   0   0   90  0   0
Sarcastic    0   0   0   0   0   7   93  0            0   0   0   0   0   4   93  3
Surprise     0   0   3   0   0   13  13  71           0   3   0   0   0   10  10  77

TABLE IV

Emotion classification performance of male and female speech, based on subjective evaluation; Abbreviations: A-Anger, D-Disgust, F-Fear, H-Happy, N-Neutral, Sa-Sad, S-Sarcastic, Sur-Surprise.

             Male Artist (Average: 70.75)             Female Artist (Average: 74.25)
             A   D   F   H   N   Sa  S   Sur          A   D   F   H   N   Sa  S   Sur
Anger        91  0   4   3   2   0   0   0            88  0   8   4   0   0   0   0
Disgust      0   51  0   2   10  19  10  8            0   48  0   0   9   30  7   6
Fear         0   0   65  7   0   0   0   28           0   0   71  4   0   0   0   25
Happy        0   0   5   69  0   0   0   26           0   0   8   82  0   0   0   10
Neutral      0   10  0   0   85  5   0   0            0   15  0   0   83  2   0   0
Sad          0   10  0   0   12  72  6   0            0   4   0   0   10  86  0   0
Sarcastic    0   11  0   0   18  0   71  0            0   7   0   0   16  2   75  0
Surprise     0   0   21  16  1   0   0   62           0   0   24  15  0   0   0   61


Acknowledgements

The authors would like to acknowledge IIT Kharagpur for providing financial support to collect this speech corpus. We also acknowledge the contribution of the artists from the Gyanavani FM radio station, Varanasi, India, for helping to record this database. Our sincere thanks to the students of IIT Kharagpur, who spent their time on the subjective evaluation of this database.

References

[1] S. Ramamohan and S. Dandapat, “Sinusoidal model-based analysis and classification of stressed speech,” IEEE Trans. Speech and Audio Processing, vol. 14, pp. 737–746, May 2006.

[2] S. G. Koolagudi, S. Maity, V. A. Kumar, S. Chakrabarti, and K. S. Rao, “IITKGP-SESC: Speech Database for Emotion Analysis,” in Communications in Computer and Information Science (ISSN 1865-0929), JIIT University, Noida, India: Springer, August 17-19, 2009.

[3] D. Ververidis and C. Kotropoulos, “A state of the art review on emotional speech databases,” in Eleventh Australasian International Conference on Speech Science and Technology, (Auckland, New Zealand), Dec. 2006.

[4] B. Yegnanarayana and S. P. Kishore, “AANN: an alternative to GMM for pattern recognition,” Neural Networks, vol. 15, pp. 459–469, Apr. 2002.

[5] L. Yang, “The expression and recognition of emotions through prosody,” in Proc. Int. Conf. Spoken Language Processing, pp. 74–77, 2000.

[6] R. Cowie and R. R. Cornelius, “Describing the emotional states that are expressed in speech,” Speech Communication, vol. 40, pp. 5–32, Apr. 2003.

[7] S. R. M. Prasanna and B. Yegnanarayana, “Extraction of pitch in adverse conditions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Montreal, Canada), May 2004.

[8] B. Yegnanarayana, Artificial Neural Networks. New Delhi, India: Prentice-Hall, 1999.

[9] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.