mediaeval 2015 - emotion in music: task overview
TRANSCRIPT
Emotion in Music: Task Overview
Anna Aljanaki (Utrecht University, Netherlands)
Mohammad Soleymani (University of Geneva, Switzerland)
Yi-Hsuan Yang (Academia Sinica, Taiwan)
14-15 September, MediaEval 2015
Find me a song...
...like this
...or like this
Emotion in Music Task
• Focuses on audio analysis (optionally, metadata)
• Recognizes that the mood can change over the duration of a song
• Uses the valence/arousal model
Valence/Arousal model
We look at emotion over time (over the duration of a piece)
Our history

From 2013 to now:
• Emotion in Music 2013: Brave new task
  • Dynamic (over time) emotion prediction
  • Static (per whole clip) emotion prediction
• Emotion in Music 2014
  • Dynamic task
  • Feature design (evaluated on static data)
• Emotion in Music 2015
  • Dynamic task
Ground truth. Development set music
• In 2013 and 2014 we annotated 1744 excerpts of 45 seconds on Mechanical Turk
• Music from the Free Music Archive (freemusicarchive.org)
• Licensed under Creative Commons
• 10 genres: Rock, Pop, Electronic, Hip-Hop, Classical, Soul and RnB, Country, Folk, International, Jazz
Ground truth
Cleaning of development set data
• The data was cleaned based on inter-annotator agreement
• This reduced the set from 1744 to 431 songs
• The average Cronbach's alpha is 0.73 ± 0.12 for valence and 0.76 ± 0.12 for arousal
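As a concrete illustration of the agreement measure used above, here is a minimal pure-Python sketch of Cronbach's alpha, treating each annotator as an "item"; the ratings below are invented toy values, not the task data.

```python
# Minimal sketch: Cronbach's alpha for inter-annotator agreement.
# Rows = annotators ("items" in alpha terms), columns = time points of one song.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(ratings):
    """ratings: list of per-annotator rating sequences of equal length."""
    k = len(ratings)                              # number of annotators
    item_vars = sum(variance(r) for r in ratings)
    totals = [sum(col) for col in zip(*ratings)]  # sum across annotators
    return k / (k - 1) * (1 - item_vars / variance(totals))

# Three annotators rating the same excerpt over five time points:
ratings = [
    [0.1, 0.3, 0.5, 0.6, 0.8],
    [0.2, 0.3, 0.4, 0.7, 0.9],
    [0.0, 0.2, 0.5, 0.5, 0.7],
]
print(round(cronbach_alpha(ratings), 2))  # → 0.98 (high agreement)
```

Annotators who track each other closely push alpha toward 1; disagreement pushes it toward 0, which is how the cleaning step above filtered out songs.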
Ground truth. Test set music
• 26 complete songs from the Multitrack MedleyDB dataset: http://marl.smusic.nyu.edu/medleydb
• 26 complete songs from the Jamendo music website
• Automatically and manually checked for emotional variety
• The same genres as in the development set
• Cronbach's alpha is 0.29 ± 0.94 for valence and 0.65 ± 0.28 for arousal
Ground truth
Such a small test set?
• Duration of the train set: 323 minutes
• Duration of the test set: 227 minutes
Ground truth - evaluation set
Collecting annotations in 2015
• 5 annotators per song: 2 people from the lab and 3 Mechanical Turk workers
• Preliminary listening round
• Mechanical Turk workers were supervised and only received full payment after quality was confirmed
Ground truth. Annotations.
Annotation Interface
Baseline features
• Baseline features from the openSMILE framework (260 low-level features)
• 65 low-level acoustic descriptors (LLDs), their first-order derivatives, and the mean and standard deviation functionals of each LLD over 1 s time windows with 50% overlap
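The windowing scheme described above can be sketched as follows; the frame rate, the toy signal, and the function name are assumptions for illustration (only the mean and standard deviation functionals are shown, and the real features come from openSMILE, not this code):

```python
# Sketch of the baseline feature construction: for each low-level
# descriptor (LLD) sampled per frame, take mean and standard deviation
# over 1-second windows with 50% overlap.
import math

def windowed_functionals(lld, frames_per_sec=100):
    win = frames_per_sec            # 1 s window
    hop = win // 2                  # 50% overlap
    feats = []
    for start in range(0, len(lld) - win + 1, hop):
        chunk = lld[start:start + win]
        mean = sum(chunk) / len(chunk)
        std = math.sqrt(sum((x - mean) ** 2 for x in chunk) / len(chunk))
        feats.append((mean, std))
    return feats

# A 3-second toy LLD track yields 5 windows (0-1s, 0.5-1.5s, ..., 2-3s):
track = [math.sin(i / 10) for i in range(300)]
print(len(windowed_functionals(track)))  # → 5
```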
Emotion in Music 2015 obligatory runs
Every submission has to include:
• Predictions using baseline features
• Custom feature set (if applicable)
• Free-style run (if desired)
Evaluation
Dynamic subtask evaluation
We use RMSE and Pearson's correlation coefficient as metrics, in the following steps:
1. Calculate RMSE between predictions and ground truth for each song separately.
2. Average across songs, separately for valence and for arousal.
3. Rank all submissions for each dimension based on the averaged RMSE.
4. If the difference based on the one-sided Wilcoxon test is not significant (p > 0.05), we use rho to break the tie.
5. If the ranking changed, we run the significance test between neighbouring pairs again.
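The per-song metric steps above can be sketched in a few lines; the two toy "songs" are invented for illustration, and the Wilcoxon tie-breaking step is omitted:

```python
# Sketch of the dynamic-subtask metrics: per-song RMSE and Pearson's r,
# averaged across songs (done separately for valence and arousal).
import math

def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def pearson(pred, truth):
    n = len(truth)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    return cov / (sp * st)

def evaluate(songs):
    """songs: list of (predictions, ground_truth) pairs, one per song."""
    rmses = [rmse(p, t) for p, t in songs]
    rhos = [pearson(p, t) for p, t in songs]
    return sum(rmses) / len(rmses), sum(rhos) / len(rhos)

songs = [
    ([0.1, 0.2, 0.4], [0.0, 0.3, 0.5]),
    ([0.6, 0.5, 0.3], [0.7, 0.4, 0.2]),
]
avg_rmse, avg_rho = evaluate(songs)
```

Submissions are then ranked by the averaged RMSE, with rho used only when the Wilcoxon test finds no significant difference.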
Baseline
There is a baseline for participants to compete with:
• Baseline features
• Linear regression
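As a hedged sketch of what such a baseline could look like: the real baseline regresses on all 260 openSMILE features, but the closed-form univariate fit below (with invented data) shows the idea in its simplest form.

```python
# Ordinary least squares on a single feature: the simplest possible
# stand-in for the linear-regression baseline.

def fit_ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx   # (w, b) for y ≈ w*x + b

# Toy data roughly following y = x:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]
w, b = fit_ols(xs, ys)
```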
Results - Arousal
12 teams crossed the finish line and submitted their papers.
Rank  Team              Arousal RMSE   ρ
 1    THUHCSIL          0.23 ± 0.11    0.66 ± 0.25
 2    ICL               0.23 ± 0.11    0.63 ± 0.27
 3    SAILUSC           0.24 ± 0.11    0.65 ± 0.22
 4    HKPOLYU           0.24 ± 0.11    0.56 ± 0.27
 5    PKU-AIPL          0.24 ± 0.10    0.54 ± 0.27
 6    IRIT-SAMOVA       0.24 ± 0.11    0.63 ± 0.22
 7    JKU-Tinnitus      0.25 ± 0.11    0.53 ± 0.23
 8    UNIZA             0.25 ± 0.10    0.49 ± 0.23
 9    NCUTom            0.25 ± 0.12    0.34 ± 0.25
10    MIRUtrecht        0.26 ± 0.13    0.40 ± 0.34
11    Baseline          0.27 ± 0.11    0.36 ± 0.26
12    Average baseline  0.28 ± 0.13    0
13    UoA               0.39 ± 0.17    0.58 ± 0.24
Results - Valence
Rank  Team              Valence RMSE   ρ
 1    JUNLP             0.29 ± 0.14    −0.03 ± 0.02
 2    Average Baseline  0.29 ± 0.15    0
 3    THU-HCSIL         0.31 ± 0.17    0.15 ± 0.47
 4    MIRUtrecht        0.29 ± 0.15    0.08 ± 0.39
 5    PKU-AIPL          0.33 ± 0.18    0.01 ± 0.43
 6    NCUTom            0.34 ± 0.16    0.01 ± 0.34
 7    SAILUSC           0.35 ± 0.18    0.00 ± 0.5
 8    Baseline          0.36 ± 0.18    0.01 ± 0.38
 9    IRIT-SAMOVA       0.36 ± 0.19    0.04 ± 0.49
10    UNIZA             0.36 ± 0.17    0.01 ± 0.4
11    ICL               0.37 ± 0.19    0.02 ± 0.49
12    JKU-Tinnitus      0.39 ± 0.19    0.01 ± 0.41
13    UoA               0.49 ± 0.24    0.02 ± 0.46
Evaluation.
So, what happened to valence?
Between arousal and valence in the train set: rho = 0.51 ± 0.65 and RMSE = 0.24 ± 0.17
Evaluation
So, what happened to valence?
Between arousal and valence in the test set: rho = 0.00 ± 0.59 and RMSE = 0.40 ± 0.25
Evaluation
And what about submissions?
In the submissions, the correlation between valence and arousal was even stronger than in the train set:
• THUHCSIL: rho = 0.79 ± 0.32 and RMSE = 0.11 ± 0.07
• ICL: rho = 0.99 ± 0.00 and RMSE = 0.05 ± 0.01
• SAILUSC: rho = 0.88 ± 0.18 and RMSE = 0.09 ± 0.04
Feature sets evaluated on arousal
Rank  Team                Feature set RMSE   ρ
 1    ICL                 0.25               0.49
 2    MIRUtrecht          0.25               0.48
 3    HKPOLYU             0.26               0.50
 4    JUNLP-run1          0.26               0.49
 5    UNIZA-run1          0.26               0.51
 6    UNIZA-run2          0.26               0.51
 7    IRIT-SAMOVA         0.26               0.50
 8    JUNLP-run2-arousal  0.27               0.35
 9    THU-HCSIL           0.27               0.41
Evaluation. Baseline features
Teams’ results using baseline features
Acknowledgments
Presentations for the missing teams:
• PKU-AIPL
• HKPOLYU
• NCUTom
• SAILUSC
PKU-AIPL
Kang Cai, Wanyi Yang, Yao Cheng, Deshun Yang, Xiaoou Chen
Institute of Computer Science and Technology, Peking University, Beijing, China
• Features: MFCC, edge orientation histograms on spectrograms, low-level spectral features
• Continuous conditional random fields with SVR as base classifier
HKPOLYU
Yang Liu, Yan Liu, Zhonglei Gu
Hong Kong Baptist University, Hong Kong Polytechnic University, Hong Kong SAR
• Features: 260 baseline features
• The main contribution is a supervised feature reduction technique that takes into account similarity between items
• SVR as a classifier
MIRUtrecht
Anna Aljanaki, Frans Wiering, Remco C. Veltkamp
Utrecht University
• Features: Essentia, extracted using bigger frames (several seconds)
• Gaussian Processes
• Based on segmenting audio by emotion
• There is a poster!
Predicting affect in music using regression methods on low level features
Rahul Gupta, Shrikanth NarayananSignal Analysis and Interpretation LabUniversity of Southern California
Approach: Regression methods

Pipeline: baseline features → regression → smoothing → valence/arousal prediction

Regression methods:
1. Linear regression
2. Least squares boosting
Smoothing: moving average filter
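The smoothing step can be sketched as a centered moving-average filter over the raw per-second predictions; the window length below is an assumption, since the slides do not give the actual filter parameters:

```python
# Centered moving-average smoothing of a per-second prediction sequence.
# Edge windows shrink so the output has the same length as the input.

def moving_average(seq, window=3):
    half = window // 2
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

noisy = [0.1, 0.9, 0.2, 0.8, 0.3]
print(moving_average(noisy))  # smoother trajectory, same length
```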
Approach: Regression methods

BESiF (Boosted Ensemble of Single-feature Filters): a gradient boosting based combination of regression + smoothing
Approach: Regression methods

Method                              Valence         Arousal
                                    RMSE    r       RMSE    r
Baseline                            .37     .01     .27     .36
Linear regression + smoothing       .35     .01     .24     .65
Least squares boosting + smoothing  .35     .05     .24     .59
BESiF                               .37     −.04    .28     .50
Future investigations
• Annotation biases due to longer songs
• Differences in features for valence and arousal prediction
• Generalization of models trained on smaller segments to longer segments
MediaEval 2015: Recurrent Neural Network Approach to Emotion in Music Task
Yu-Hao Chin and Jia-Ching Wang
Department of Computer Science and Information Engineering
National Central University, Taiwan, R.O.C
• This paper adopts a deep recurrent neural network to predict the valence and arousal for each moment of a song; the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is used to update the weights during back-propagation. A 10-fold cross-validation is used to evaluate the performance.
• Approach 1: The MIR feature set (see Table 1) is adopted, with an RNN model.
• Approach 2: The baseline feature set is adopted, with an RNN model.
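For intuition only, here is a toy recurrent step unrolled over feature frames, emitting a (valence, arousal) pair per time step; the scalar weights are arbitrary values, not the trained model, and this is not the authors' implementation (which trains full weight matrices with L-BFGS).

```python
# Toy single-unit RNN: each hidden state depends on the current input
# frame and the previous hidden state, giving per-frame predictions.
import math

def rnn_step(x, h, w_in=0.5, w_rec=0.3):
    return math.tanh(w_in * x + w_rec * h)     # new hidden state

def predict(frames, w_val=0.8, w_ar=-0.6):
    h, out = 0.0, []
    for x in frames:
        h = rnn_step(x, h)
        out.append((w_val * h, w_ar * h))      # per-frame (valence, arousal)
    return out

preds = predict([0.2, 0.4, 0.1, 0.9])          # one prediction per frame
```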
[Diagram: Music Database → Feature Extraction → Recurrent Neural Network → Valence and Arousal]
Technical retreat
Today, between 14:15 and 15:15. Everyone is welcome!