
Discriminative Feature Learning for Speech Emotion Recognition

Yuying Zhang1, Yuexian Zou1(✉), Junyi Peng1, Danqing Luo1, and Dongyan Huang2

1 ADSPLAB, School of ECE, Peking University, Shenzhen, China
[email protected]

2 UBTECH Robotics Corporate, Shenzhen, China

Abstract. It is encouraging to see that deep neural network based speech emotion recognition (DNN-SER) models have achieved state-of-the-art results on public datasets. However, the performance of DNN-SER models is limited for the following reasons: insufficient training data, emotion ambiguity and class imbalance. Studies show that, without large-scale training data, it is hard for a DNN-SER model with cross-entropy loss to learn discriminative features by mapping speech segments to their category labels. In this study, we propose a deep metric learning based DNN-SER model to facilitate discriminative feature learning by constraining the feature embeddings in the feature space. As a proof of concept, we take a four-hidden-layer DNN as our backbone for implementation simplicity. Specifically, an emotion identity matrix is formed from one-hot label vectors as supervision information, while the emotion embedding matrix is formed from the embedding vectors generated by the DNN. An affinity loss is designed on these two matrices to simultaneously maximize the inter-class separability and intra-class compactness of the embeddings. Moreover, to restrain the class imbalance problem, the focal loss is introduced to reduce the adverse effect of the majority well-classified samples and put more focus on the minority misclassified ones. Our proposed DNN-SER model is jointly trained using the affinity loss and the focal loss. Extensive experiments have been conducted on two well-known emotional speech datasets, EMO-DB and IEMOCAP. Compared to the DNN-SER baseline, the unweighted accuracy (UA) on EMO-DB and IEMOCAP increased relatively by 10.19% and 10%, respectively. Besides, from the confusion matrix of the test results on EMO-DB, the accuracy of the most confusing emotion category, ‘Happiness’, increased relatively by 33.17%, and the accuracy of the emotion category with the fewest samples, ‘Disgust’, increased relatively by 13.62%. These results validate the effectiveness of our proposed DNN-SER model and give evidence that the affinity loss and focal loss help to learn better discriminative features.

Keywords: Deep neural network · Speech emotion recognition · Affinity loss · Focal loss · Deep metric learning

© Springer Nature Switzerland AG 2019
I. V. Tetko et al. (Eds.): ICANN 2019, LNCS 11730, pp. 1–13, 2019.
https://doi.org/10.1007/978-3-030-30490-4_17


1 Introduction

As one of the basic means of human communication, speech contains abundant emotional information. Recently, speech emotion recognition (SER) has gained increasing attention because of its various applications, especially in human-computer interaction, mental health analysis, remote education and so on. SER is the task of identifying the human emotional state from the speech signal. When the emotional state is described by discrete labels, such as anger, happiness, neutral, etc., SER is essentially a multi-class classification problem.

Generally, SER systems derive emotion recognition decisions at the frame level (short segments) or at the utterance level. In the frame-level approach, low-level descriptors (LLDs) are extracted from speech frames as input to a sequential classifier, which mainly uses a Gaussian mixture model (GMM) [1] or hidden Markov model (HMM) [2] to model the distribution of the emotional state of speakers. In the utterance-level approach, on the other hand, statistical functions are applied to the LLDs over all frames of an utterance to obtain global features, which are then used as input to discriminative classifiers such as support vector machines (SVM) [3]. In recent years, deep neural networks (DNNs) have been employed for the SER task to obtain high-level features (HFs) from low-level raw features. In the work of [4], a DNN is trained to learn acoustic utterance-level features for SER. In [5], a DNN operates on short-term LLDs and an extreme learning machine (ELM) is used for the final utterance-level classification, which achieves the state of the art for the SER task.

Studies show that DNN-based SER (DNN-SER) methods have several limitations. Firstly, there is a lack of large-scale training datasets for SER tasks, since it is extremely difficult to collect speech emotion data [6]. Without large-scale training data, it is hard for DNN-SER models to learn discriminative features. Secondly, human emotions are naturally ambiguous [7]. Different types of emotions are easily confused with each other, which increases the difficulty of SER [8]. For example, previous studies have shown that ‘Anger’ and ‘Happiness’ have similar acoustic expression [9]. [10] shows the activations of different nodes for emotions in a recurrent neural network

Fig. 1. Illustration of the emotion ambiguity and class imbalance problems. The picture (left) indicates the specific active nodes for Anger and Happiness [10]. As can be observed, the active nodes are similar for Anger and Happiness, reflecting the emotion ambiguity problem. The pie chart (right) shows the class imbalance problem on EMO-DB, i.e., different emotions account for different proportions.


(RNN) and observes that the active nodes are similar for ‘Anger’ and ‘Happiness’. Thirdly, there is a class imbalance problem in SER corpora, as illustrated in Fig. 1. Research outcomes in visual object detection have shown that class imbalance causes performance degradation, since the large number of well-classified samples affects the learning behavior of the deep model by dominating the gradient [11]. As discussed above, with insufficient training data, DNN-SER models call for a very good discriminative feature learning mechanism. However, most existing DNN-SER methods adopt cross-entropy (CE) together with softmax as the supervision component, which cannot explicitly encourage the discriminative learning of features and has no advantage in handling the class imbalance problem, because it assigns an equal loss weight to the majority and minority examples.

In order to resolve the critical issue of discriminative feature learning, several approaches have been developed under the metric learning framework. Huang et al. [12] applied the triplet loss to a Long Short-Term Memory (LSTM) SER model, which separates the positive pair from the negative one by a distance margin. Very recently, Lian et al. [13] applied the contrastive loss to a Siamese neural network (SNN) SER model, which enables the model to learn more discriminative features. However, carefully examining the triplet loss and contrastive loss, it is noted that they require a carefully designed pair selection procedure, so the performance of SER depends heavily on the manual selection of training pairs. In addition, compared to the number of training samples, the number of training pairs or triplets grows dramatically even for a small dataset, which inevitably results in slow convergence and instability [14]. Moreover, we note that both of them neglect the class imbalance problem.

In this work, we propose a deep metric learning based DNN-SER method to promote discriminative feature learning. As a proof of concept, we take a four-hidden-layer DNN as our backbone for implementation simplicity. Since the label information is given, the emotion identity matrix can be formed using the one-hot label vectors, which serve as the supervision information. Correspondingly, the emotion embedding matrix is formed using the embedding vectors generated at the last hidden layer of the DNN. The affinity loss is designed based on the above two matrices to simultaneously maximize the inter-class separability and the intra-class compactness of the feature embeddings. It is noted that, compared to the CE loss, which measures the similarity of emotion posterior probability distributions, the affinity loss directly optimizes the similarities between emotion embeddings. In addition, compared to the triplet loss and contrastive loss in [12, 13], the affinity loss eliminates the selection of sample pairs by exploiting the correlation information of all embedding pairs, and thus avoids the dramatic data expansion. Besides, to restrain the class imbalance problem, the focal loss [11] is introduced to down-weight the contribution of the majority well-classified emotion classes and put more focus on the minority misclassified ones. Specifically, the focal loss alleviates the adverse effect of class imbalance by preventing the vast number of well-classified examples from dominating the gradient during training. Our proposed DNN-SER model is jointly trained using the affinity loss and the focal loss. Extensive experiments have been conducted on two well-known emotional speech datasets, EMO-DB and IEMOCAP. Compared to the DNN-SER baseline, the unweighted accuracy (UA) on EMO-DB and IEMOCAP increased relatively by


10.19% and 10%, respectively. Besides, from the confusion matrix of the test results on EMO-DB, it is encouraging to see that the accuracy of the most confusing emotion category, ‘Happiness’, increased relatively by 33.2% and the accuracy of the emotion category with the fewest samples, ‘Disgust’, increased relatively by 13.6%. These results validate the effectiveness of our proposed DNN-SER model and give evidence that our joint loss is able to learn better discriminative features.

The rest of the paper is organized as follows. In Sect. 2, the proposed method is introduced. Section 3 presents the datasets and the feature set. Section 4 explains the experimental settings and results. Conclusions and future work are given in Sect. 5.

2 Proposed Method

2.1 System Architecture

In this work, a four-hidden-layer DNN-SER model is jointly trained using the affinity loss and the focal loss to address the emotion ambiguity problem and the class imbalance problem under insufficient training data, respectively. The architecture of the proposed SER system is shown in Fig. 2.

Our SER system essentially consists of three key ingredients: input data preparation (pre-processing), the learning model, and the learning criterion (loss calculation). In the pre-processing part, an utterance-level feature vector is extracted from each input training utterance. The training is carried out in batches. In the learning model part, the data flow through all the fully connected (FC) layers. The output of the last hidden layer is termed the feature embedding, which is specifically used for computing the affinity loss. The network is optimized under the joint supervision of the affinity loss and the focal loss using stochastic gradient descent (SGD). Details of the loss design are given in Sect. 2.2.

Fig. 2. An overview of the DNN-SER system with the proposed joint loss.


2.2 Loss Function

In this subsection, we introduce our proposed joint loss for training the DNN model. The proposed joint loss consists of two parts, the affinity loss and the focal loss, which are addressed separately in the following paragraphs.

Affinity Loss. In this study, the affinity loss is adopted since it is able to mitigate emotion ambiguity by reinforcing the discriminability of features, and to alleviate the data insufficiency problem by making full use of the correlation information and identity information.

The schematic diagram of the affinity loss is shown in Fig. 3. In the following, we introduce how the affinity loss is formulated for the SER task.

Let us assume that the DNN-SER system designed above (Fig. 2) is parameterized by $\theta$ and the training is conducted in batches of size $B$ (utterance-level). The DNN embedding extractor with four fully connected layers maps an utterance-level speech feature $x$ to a $D$-dimensional unit-norm emotion embedding $s = f_\theta(x) \in \mathbb{R}^{1 \times D}$, i.e., $\|s\|_2 = 1$. While training in a batch, $B$ utterance-level speech emotion features are randomly selected to form a set $X = \{x_i\}$, $i \in \{1, \ldots, B\}$. Correspondingly, the output of the last fully connected layer is denoted as $S = \{s_i\}$, which is named the emotion embedding matrix. Intuitively, the matrix $SS^T \in \mathbb{R}^{B \times B}$ is termed the emotion embedding affinity matrix. In this study, a one-hot label vector representation $y_i$ is adopted, which indicates the emotion identity of $x_i$. Therefore, the label matrix associated with $X$ can be denoted as $Y = \{y_i\}$, which is termed the emotion identity matrix, and $YY^T \in \mathbb{R}^{B \times B}$ is defined as the emotion identity affinity matrix.

Obviously, the matrix $YY^T$ is available and can be taken as the supervision information. Our goal is to design a loss that makes full use of the limited training data by exploiting the correlation information of all emotion embeddings. Hence, the affinity loss (AL) is defined as follows:

Fig. 3. The schematic diagram of the affinity loss. B denotes the batch size of the input data, D is the dimension of the output emotion embedding, S is the emotion embedding matrix, and Y is the emotion identity matrix.


$$
\mathcal{L}_{AL} = \left\| SS^{T} - 2YY^{T} + \mathbf{1} \right\|_{F}^{2}
= \sum_{i,j:\,y_i = y_j} \bigl(1 - \cos(s_i, s_j)\bigr)^{2}
+ \sum_{i,j:\,y_i \neq y_j} \bigl(-1 - \cos(s_i, s_j)\bigr)^{2}
\qquad (1)
$$

where $\|\cdot\|_{F}^{2}$ denotes the squared Frobenius norm. It is noted that $(SS^{T})_{ij} = s_i s_j^{T}$ indicates the cosine similarity between $s_i$ and $s_j$. When segments $i$ and $j$ belong to the same emotion class, the cosine similarity between $s_i$ and $s_j$ should be close to 1. By comparison, when $i$ and $j$ belong to different emotion classes, the cosine similarity should be close to $-1$. Besides, it is clear that $YY^{T}$ is a binary matrix: if segments $i$ and $j$ belong to the same emotion class (i.e., have the same one-hot label vector), then $(YY^{T})_{ij} = 1$; otherwise, $(YY^{T})_{ij} = 0$. Under a supervised learning framework, $YY^{T}$ is known and can be computed from the training data.

From the definition in (1), we can see that the affinity loss is composed of two parts. The first term promotes the similarity between embeddings of the same emotion class, while the second term promotes the discrimination between embeddings from different emotion classes. The two parts are optimized simultaneously during training. As analyzed above, the discriminability of features can be reinforced by increasing the inter-class separability and intra-class compactness of the feature embeddings.
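To make the computation concrete, below is a minimal PyTorch sketch of the affinity loss in Eq. (1). It assumes `embeddings` is the B x D output of the last hidden layer and `labels` holds integer emotion ids; the function and variable names are illustrative, not taken from the authors' implementation.

```python
# A hedged sketch of the affinity loss: L_AL = || S S^T - (2 Y Y^T - 1) ||_F^2
import torch
import torch.nn.functional as F

def affinity_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    s = F.normalize(embeddings, p=2, dim=1)   # unit-norm rows, so S S^T holds cosine similarities
    cos_sim = s @ s.t()                        # (B x B) emotion embedding affinity matrix
    y = F.one_hot(labels).float()              # (B x N) one-hot emotion identity matrix
    target = 2.0 * (y @ y.t()) - 1.0           # +1 for same-class pairs, -1 for different-class pairs
    return ((cos_sim - target) ** 2).sum()     # squared Frobenius norm

# Example: a batch of 4 embeddings drawn from 2 emotion classes.
loss = affinity_loss(torch.randn(4, 64), torch.tensor([0, 1, 0, 1]))
```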

Focal Loss. In this study, we introduce the focal loss (FL) into our DNN-SER model to mitigate the class imbalance problem and further improve discriminative feature learning.

The focal loss was first proposed by Lin et al. at Facebook AI Research (FAIR) in [11] to address the class imbalance problem in dense visual object detection. The estimated probability of the $y_i$-th emotion class for sample $i$ is defined in (2):

$$
p(y_i) = \frac{e^{W_{y_i}^{T} s_i + b_{y_i}}}{\sum_{j=1}^{N} e^{W_{j}^{T} s_i + b_j}}
\qquad (2)
$$

where $y_i \in \{1, \ldots, N\}$, $N$ is the number of emotion classes, $W_j \in \mathbb{R}^{D}$ is the $j$-th column of the weights $W \in \mathbb{R}^{D \times N}$ in the last fully connected layer, and $b \in \mathbb{R}^{N}$ is the bias term.

Essentially, the focal loss is a simple extension of cross-entropy (CE) that adds a weighting factor $\alpha$ and a focusing parameter $\gamma$ to the conventional CE loss, as formulated in (3):

$$
\mathcal{L}_{FL}\bigl(p(y_i)\bigr) = -\alpha_i \bigl(1 - p(y_i)\bigr)^{\gamma} \log\bigl(p(y_i)\bigr)
\qquad (3)
$$

where $\alpha_i > 0$ and $\gamma > 0$ are two hyper-parameters controlling the decaying extent of the loss. These two parameters can be fixed throughout training, or changed according to the training condition (e.g., by training epoch) [15]. It is noted that $\gamma$ is a tunable focusing parameter, which reduces the relative loss of the many well-classified samples and gains more focus on the minority misclassified samples.
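A matching PyTorch sketch of the focal loss in Eqs. (2)-(3) is given below, assuming `logits` are the $W^{T} s_i + b$ outputs of the last fully connected layer. The fixed values alpha = 0.5 and gamma = 4 follow the setting reported later in Sect. 4; everything else is illustrative.

```python
# A hedged sketch of the focal loss: L_FL = -alpha * (1 - p(y_i))^gamma * log p(y_i)
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, labels: torch.Tensor,
               alpha: float = 0.5, gamma: float = 4.0) -> torch.Tensor:
    log_p = F.log_softmax(logits, dim=1)                        # Eq. (2): softmax posterior, log domain
    log_p_y = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)   # log p(y_i) of the true emotion class
    p_y = log_p_y.exp()
    return (-alpha * (1.0 - p_y) ** gamma * log_p_y).mean()     # Eq. (3), averaged over the batch

# Example: a batch of 4 samples over 7 emotion classes.
loss = focal_loss(torch.randn(4, 7), torch.tensor([0, 3, 6, 0]))
```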


Joint Loss. Following the discussions above, to address the emotion ambiguity and class imbalance problems with insufficient training data simultaneously, the linear combination of the affinity loss and the focal loss is taken as the joint loss for training:

$$
\mathcal{L} = \mathcal{L}_{AL} + \mathcal{L}_{FL}
\qquad (4)
$$

In this study, we weight the affinity loss and the focal loss equally. A hyper-parameter could be introduced to assign different weights to the two losses, which will be explored in future work.

As shown in Fig. 2, the DNN model is trained in a supervised manner using the joint loss defined in (4) with error back-propagation.
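As a rough illustration of the joint training described here, the sketch below combines the affinity-loss and focal-loss helpers sketched earlier into Eq. (4) and performs one back-propagation step. The `model` is assumed to return both the emotion embedding and the class logits; the optimizer and function names are placeholders, not the authors' code.

```python
# One hedged training step with the joint loss L = L_AL + L_FL (Eq. 4),
# reusing the affinity_loss and focal_loss sketches from Sect. 2.2.
def train_step(model, optimizer, features, labels):
    embeddings, logits = model(features)    # forward pass through the FC layers
    loss = affinity_loss(embeddings, labels) + focal_loss(logits, labels)
    optimizer.zero_grad()
    loss.backward()                         # error back-propagation
    optimizer.step()
    return loss.item()
```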

3 Dataset and Feature Set

3.1 Dataset

To evaluate the performance of our proposed SER system, experiments are conducted on the Berlin Emotional Database (EMO-DB) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. EMO-DB consists of 535 utterances displayed by ten professional actors, covering seven emotions (Anger, Boredom, Neutral, Disgust, Fear, Happiness and Sadness). IEMOCAP consists of 5531 utterances in five sessions from 10 speakers. Each session is displayed by a pair of actors in scripted and improvised scenarios. At least three evaluators annotated each utterance in the database with categorical emotion labels chosen from the set. In our experiments, we consider only the utterances with majority agreement (two out of three evaluators gave the same emotion label) over the emotion classes of Anger, Happiness, Neutral and Sadness, with ‘Excitement’ merged into ‘Happiness’, similar to prior studies [9, 12]. The distributions of the EMO-DB and IEMOCAP databases are shown in Tables 1 and 2.

Experiments are conducted in a speaker-independent manner. A 10-fold Leave-One-Speaker-Out (LOSO) cross-validation scheme is employed in all experiments. For each fold, the utterances from one speaker are used as the testing set, and the utterances from the other speakers are used as the training set. This configuration is the same as that used in [9, 16–18], which makes the experimental results comparable between our work and others.
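The speaker-independent protocol can be reproduced with scikit-learn's LeaveOneGroupOut splitter, as in the hedged sketch below; the feature, label and speaker arrays are random stand-ins for the real EMO-DB data.

```python
# Leave-One-Speaker-Out cross-validation: each of the 10 speakers is held out once.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.random.randn(535, 384)                   # utterance-level feature vectors (stand-in)
y = np.random.randint(0, 7, size=535)           # emotion labels (stand-in)
speakers = np.random.randint(0, 10, size=535)   # speaker ids; 10 speakers -> 10 folds

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    X_train, y_train = X[train_idx], y[train_idx]   # 9 training speakers
    X_test, y_test = X[test_idx], y[test_idx]       # 1 held-out test speaker
```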

Table 1. The distribution of EMO-DB database.

Emotion Anger Boredom Neutral Disgust Fear Happiness Sadness

Number 127 81 79 46 69 71 62


3.2 Feature Set

Following [9, 17], the feature set designed for the INTERSPEECH 2009 Emotion Challenge is adopted in our experiments. Specifically, the low-level descriptors (LLDs) include the zero-crossing rate (ZCR), root mean square (RMS) energy, pitch frequency (normalized to 500 Hz), the harmonics-to-noise ratio (HNR), and Mel-frequency cepstral coefficients (MFCC) 1–12 [19]. Moreover, 12 functionals are applied to all the frame-based features at the utterance level, such as mean, standard deviation, kurtosis, skewness and so on. The details are given in Table 3. With the 16 LLDs and their deltas (32 contours), each summarized by the 12 functionals, the final feature vector for an utterance contains 32 × 12 = 384 attributes.
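For illustration only, the numpy sketch below shows the bookkeeping behind the 384-dimensional vector: 16 LLD contours plus their deltas give 32 contours, and each is summarized by 12 functionals (Table 3). In practice this feature set is normally extracted with a toolkit such as openSMILE; the simple functional implementations here are assumptions, not the challenge-official ones.

```python
# Hedged sketch: 32 frame-level contours x 12 functionals = 384 utterance attributes.
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(contour: np.ndarray) -> np.ndarray:
    """Apply 12 IS09-style functionals to one frame-level contour."""
    t = np.arange(len(contour))
    slope, offset = np.polyfit(t, contour, 1)            # linear regression: slope, offset
    mse = np.mean((offset + slope * t - contour) ** 2)   # linear regression MSE
    return np.array([
        contour.mean(), contour.std(), kurtosis(contour), skew(contour),
        contour.min(), contour.max(),
        contour.argmin() / len(contour), contour.argmax() / len(contour),  # relative positions
        contour.max() - contour.min(),                                      # range
        offset, slope, mse,
    ])

def utterance_vector(lld_frames: np.ndarray) -> np.ndarray:
    """lld_frames: (num_frames, 16) LLDs; deltas appended -> 32 contours."""
    contours = np.concatenate([lld_frames, np.gradient(lld_frames, axis=0)], axis=1)
    return np.concatenate([functionals(contours[:, k]) for k in range(contours.shape[1])])

vec = utterance_vector(np.random.randn(300, 16))   # vec.shape == (384,)
```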

4 Experiment Setup and Results

The raw waveform is segmented into frames with a sliding window of 25 ms at a step of 10 ms. The LLDs are then extracted from the frames and the statistical functionals are computed over them. In this way, a 384-dimensional utterance-level feature vector is extracted for each utterance. In all experiments, the input data are pre-processed with z-score normalization (zero mean and unit standard deviation) to reduce the impact of data range variation and make the network easier to converge.

The proposed DNN-SER model contains four hidden layers with 512, 256, 128 and 64 neurons, respectively. We adopt batch normalization after each fully connected layer, and the activation function used is the Max Feature Map (MFM) [20]. In the training process, the batch size is set to 32 and regular RMSprop is adopted as the optimizer, with an initial learning rate of 0.001. The learning rate is decayed if the validation loss has not decreased. The values of α and γ in Eq. (3) are set to 0.5 and 4, respectively.
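A hedged PyTorch sketch of this backbone is shown below: four FC layers of 512/256/128/64 units, each followed by batch normalization and an MFM activation [20], which splits the units into two halves and keeps the element-wise maximum (so each layer's output width is half its unit count). The exact wiring and the class names are assumptions; the paper does not spell them out.

```python
# Hedged sketch of the four-hidden-layer DNN-SER backbone with MFM activations.
import torch
import torch.nn as nn

class MFM(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)   # split the features into two halves
        return torch.max(a, b)     # element-wise max halves the width

class DNNSER(nn.Module):
    def __init__(self, in_dim: int = 384, num_classes: int = 7):
        super().__init__()
        layers, widths, prev = [], [512, 256, 128, 64], in_dim
        for w in widths:
            layers += [nn.Linear(prev, w), nn.BatchNorm1d(w), MFM()]
            prev = w // 2          # MFM output width
        self.backbone = nn.Sequential(*layers)
        self.classifier = nn.Linear(prev, num_classes)   # W and b of Eq. (2)

    def forward(self, x):
        emb = self.backbone(x)     # emotion embedding used by the affinity loss
        return emb, self.classifier(emb)

model = DNNSER()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # decay LR when validation loss stalls
```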

As is standard practice in SER research, results are reported using the weighted accuracy (WA) and the unweighted accuracy (UA) [19]. In this work, we focus on studying the effectiveness of the proposed joint loss when training on class-imbalanced data, so UA is taken as the primary evaluation metric.
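For reference, the two metrics can be computed as below: WA is the overall accuracy (so frequent classes dominate), while UA is the unweighted mean of the per-class recalls, which is why it is the primary metric under class imbalance. The helper names are illustrative.

```python
# Hedged sketch of the two reported metrics for integer-coded predictions.
import numpy as np

def weighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(y_true == y_pred))              # overall accuracy

def unweighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))                        # mean of per-class recalls
```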

Table 2. The distribution of IEMOCAP database.

Emotion Anger Happiness Neutral Sadness

Number 1103 1636 1708 1084

Table 3. Feature set used in our experiment: low-level descriptors (LLD) and functionals [19].

LLD (16 × 2)       Functionals (12)
(Δ) ZCR            Mean
(Δ) RMS Energy     Standard deviation
(Δ) F0             Kurtosis, skewness
(Δ) HNR            Extremes: value, rel. position, range
(Δ) MFCC 1–12      Linear regression: offset, slope, MSE


4.1 Evaluation Results

Experiments are conducted with our proposed DNN-SER system on the EMO-DB and IEMOCAP databases, where the joint loss in (4) is used to supervise the network training. To prove the effectiveness of our proposed method, a DNN of the same structure trained with the CE loss is used as the baseline model. Tables 4 and 5 compare the classification accuracy obtained using the proposed method to that obtained using other methods tested on the same emotional speech databases.

Firstly, from Tables 4 and 5, we can see that on both databases the proposed DNN-SER system with the joint loss outperforms the other comparison methods.

It is noted that the works in [4, 5, 17, 21] all use a DNN with the CE loss, which does not strongly encourage the discriminative learning of features, leading to unsatisfactory performance. In addition, [12] uses the triplet loss to reduce the intra-class distance and enlarge the inter-class distance, which is also inferior to our model.

For EMO-DB, our proposed DNN (AL+FL) SER system achieves a WA of 89.34% and a UA of 90.09%, which increased relatively by 7.21% and 10.19%, respectively, compared to the baseline. For the IEMOCAP database, our proposed DNN (AL+FL) SER system achieves a WA of 62.08% and a UA of 64.35%, relative improvements of 6.63% and 10% over the baseline.

4.2 Ablation Study

In order to understand the impact of the individual loss functions, we conducted an ablation study on EMO-DB. A comparison of the classification accuracy among different loss functions on EMO-DB is given in Table 6.

Table 4. Performance comparison on EMO-DB (%).

Model                            WA      UA
GerDA [4]                        81.9    79.1
Artificial Neural Network [16]   80.60   80.76
DNN-ELM [17]                     –       84.09
SVM [18]                         87.30   87.32
Baseline                         83.33   81.82
DNN (AL+FL)                      89.34   90.09

Table 5. Performance comparison on the IEMOCAP database (%).

Model                            WA      UA
DNN-ELM [5]                      54.3    48.2
DNN-ELM [5] [21]                 57.91   52.13
LSTM+SVM [12]                    –       60.4
Naïve Bayes classifier [9]       57.85   62.54
Baseline                         58.22   58.50
DNN (AL+FL)                      62.08   64.35

Table 6. Performance comparison among different loss functions on EMO-DB (%).

Methods        WA      UA
Baseline       83.33   81.82
DNN (FL)       84.90   86.55
DNN (AL)       88.16   88.93
DNN (AL+FL)    89.34   90.09


The experimental results show that, with the same DNN architecture, the WA and UA of the joint loss are the highest among all the methods, and the WA and UA of using the focal loss and the affinity loss separately are both better than the baseline.

To further investigate the behavior of the CE loss, focal loss, affinity loss and joint loss, Figs. 4, 5, 6 and 7 give the confusion matrices of the test results for the four loss settings. AN, BO, NE, DI, FE, HA and SA are short for Anger, Boredom, Neutral, Disgust, Fear, Happiness and Sadness, respectively.

From Fig. 1 and Table 1, we can see that the emotion categories with the most and fewest samples are ‘Anger’ and ‘Disgust’, respectively. It is noted that ‘Anger’ accounts for a significantly larger proportion (23.74%) than the other categories, and ‘Disgust’ accounts for a much smaller proportion (8.6%).

Fig. 4. Confusion matrix of test results on EMO-DB for the baseline (%).

Fig. 5. Confusion matrix of test results on EMO-DB for DNN (FL) (%).

Fig. 6. Confusion matrix of test results on EMO-DB for DNN (AL) (%).

Fig. 7. Confusion matrix of test results on EMO-DB for DNN (AL+FL) (%).


As can be seen from Fig. 4, with the CE loss the accuracy of ‘Anger’ is quite high while that of ‘Disgust’ is rather low. This indicates that the majority category dominates the loss during training and the category with few examples cannot be trained adequately, which degrades SER performance. Besides, 28.29% of ‘Happiness’ is misclassified as ‘Anger’, reflecting that ‘Happiness’ is easily confused with ‘Anger’.

In this study, we focus on improving the classification accuracy of the emotion category with the fewest samples, ‘Disgust’, and of the most confusing emotion category, ‘Happiness’, to alleviate the class imbalance and emotion ambiguity problems, respectively.

As can be seen from Fig. 5, after introducing the focal loss, the accuracy of the category ‘Disgust’ reaches 92.35%, a relative increase of 19.84% over the baseline, indicating that the focal loss helps to restrain the class imbalance problem. As can be seen from Fig. 6, after introducing the affinity loss, the accuracy of ‘Happiness’ reaches 81.96%, a relative increase of 36.55% over the baseline. Besides, the ratio of ‘Happiness’ misclassified as ‘Anger’ decreased relatively by 45.74%. These results demonstrate the capability of the affinity loss to deal with the ambiguity of emotions.

From Fig. 7, we observe that, compared to the baseline, the accuracy of the category ‘Disgust’ increased relatively by 13.62% with the joint loss of affinity loss and focal loss. The accuracy of ‘Happiness’ increased relatively by 33.17%, and the ratio of ‘Happiness’ misclassified as ‘Anger’ decreased relatively by 44.88%. These results demonstrate the effectiveness of the joint loss for both the class imbalance and emotion ambiguity problems. Besides, the classification accuracies of ‘Boredom’, ‘Neutral’ and ‘Sadness’ increased relatively by 6.46%, 11.63% and 15.08%, respectively. The classification accuracy of ‘Anger’ decreased relatively by 0.08%, mainly because the network reduces its focus on the categories with many samples after applying the focal loss. As for the classification accuracy of ‘Fear’, which shows a relative decrease of 0.03%, we speculate that there may be some underlying factors undermining the system performance, which we will investigate in future work.

As discussed above, the experimental results validate the effectiveness of our proposed DNN-SER model and give evidence that the proposed joint loss helps to learn discriminative features.

5 Conclusion

In this paper, we propose an effective SER system that learns discriminative emotion features with a DNN-SER model trained in a supervised manner using our proposed joint loss, consisting of the affinity loss and the focal loss. The affinity loss aims at simultaneously maximizing the inter-class separability and intra-class compactness of the embeddings, while the focal loss suppresses the contribution of the majority well-classified samples and puts more focus on the minority misclassified samples. Performance evaluation has been conducted on the EMO-DB and IEMOCAP databases. Our proposed SER system outperforms the comparison methods, which


illustrates the capability of the joint loss to deal with the emotion ambiguity problem as well as to alleviate both data insufficiency and class imbalance for the SER task. More specifically, our proposed DNN-SER system achieves a WA of 89.34% and a UA of 90.09% on EMO-DB, which increased relatively by 7.21% and 10.19%, respectively, compared to the DNN-SER baseline model. For IEMOCAP, our proposed DNN-SER system achieves a WA of 62.08% and a UA of 64.35%, which increased relatively by 6.63% and 10%, respectively, compared to the baseline model. Besides, from the confusion matrix of the test results on EMO-DB, it is encouraging to see that the accuracy of the most confusing emotion category, ‘Happiness’, increased relatively by 33.17% and the accuracy of the emotion category with the fewest samples, ‘Disgust’, increased relatively by 13.62%. In future work, we will evaluate our proposed joint loss with different deep models, such as LSTM and CRNN, to further improve SER performance.

Acknowledgements. This work was partially supported by Shenzhen Science & Technology Fundamental Research Programs (No: JCYJ20170817160058246, JCYJ20170306165153653 & JCYJ20180507182908274). Special acknowledgements are given to the AOTO-PKUSZ Joint Research Center for Artificial Intelligence on Scene Cognition & Technology Innovation for its support.

References

1. Neiberg, D., Elenius, K., Laskowski, K.: Emotion recognition in spontaneous speech using GMMs. In: Ninth International Conference on Spoken Language Processing (2006)

2. Nwe, T.L., Foo, S.W., De Silva, L.C.: Speech emotion recognition using hidden Markov models. Speech Commun. 41, 603–623 (2003)

3. Mower, E., Mataric, M.J., Narayanan, S.: A framework for automatic human emotion classification using emotion profiles. IEEE Trans. Audio Speech Lang. Process. 19, 1057–1070 (2010)

4. Stuhlsatz, A., Meyer, C., Eyben, F., et al.: Deep neural networks for acoustic emotion recognition: raising the benchmarks. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5688–5691 (2011)

5. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)

6. Huang, Y., Hu, M., Yu, X., Wang, T., Yang, C.: Transfer learning of deep neural network for speech emotion recognition. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds.) CCPR 2016. CCIS, vol. 663, pp. 721–729. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-3005-5_59

7. Mower, E., Metallinou, A., Lee, C.C., et al.: Interpreting ambiguous emotional expressions. In: 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1–8. IEEE (2009)

8. Chao, L., Tao, J., Yang, M., Li, Y., et al.: Long short term memory recurrent neural network based encoding method for emotion recognition in video. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2752–2756 (2016)


9. Ma, X., Wu, Z., Jia, J., et al.: Speech emotion recognition with emotion-pair based framework considering emotion distribution information in dimensional emotion space. In: INTERSPEECH 2017, pp. 1238–1242 (2017)

10. Ma, X., Wu, Z., Jia, J., et al.: Emotion recognition from variable-length speech segments using deep learning on spectrograms. In: Proceedings of Interspeech 2018, pp. 3683–3687 (2018)

11. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision (2017)

12. Huang, J., Li, Y., Tao, J., Lian, Z.: Speech emotion recognition from variable-length inputs with triplet loss function. In: Proceedings of Interspeech 2018, pp. 3673–3677 (2018)

13. Lian, Z., Li, Y., Tao, J., et al.: Speech emotion recognition via contrastive loss under siamese networks. In: Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data, pp. 21–26. ACM (2018)

14. Li, N., Tuo, D., Su, D., et al.: Deep discriminative embeddings for duration robust speaker verification. In: Proceedings of Interspeech 2018, pp. 2262–2266 (2018)

15. Wang, S., Qian, Y., Yu, K.: Focal KL-divergence based dilated convolutional neural networks for co-channel speaker identification. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5339–5343. IEEE (2018)

16. Bhargava, M., Polzehl, T.: Improving automatic emotion recognition from speech using rhythm and temporal feature. arXiv preprint arXiv:1303.1761 (2013)

17. Guo, L., Wang, L., Dang, J., Zhang, L., Guan, H.: A feature fusion method based on extreme learning machine for speech emotion recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2666–2670 (2018)

18. Gao, Y., Li, B., Wang, N., Zhu, T.: Speech emotion recognition using local and global features. In: Zeng, Y., He, Y., Kotaleski, J.H., Martone, M., Xu, B., Peng, H., Luo, Q. (eds.) BI 2017. LNCS (LNAI), vol. 10654, pp. 3–13. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70772-3_1

19. Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Tenth Annual Conference of the International Speech Communication Association (2009)

20. Wu, X., He, R., Sun, Z., Tan, T.: A light CNN for deep face representation with noisy labels. IEEE Trans. Inf. Forensics Secur. 13, 2884–2896 (2018)

21. Lee, J., Tashev, I.: High-level feature representation using recurrent neural network for speech emotion recognition. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)
