mediaeval 2015 - emotion in music: task overview
TRANSCRIPT
Emotion in Music: Task Overview
Anna Aljanaki (Utrecht University, Netherlands)
Mohammad Soleymani (University of Geneva, Switzerland)
Yi-Hsuan Yang (Academia Sinica, Taiwan)
14-15 September, MediaEval 2015
Find me a song...
...like this
...or like this
Emotion in Music Task
• Focuses on audio analysis (optionally, metadata)
• Recognizes that the mood can change over the duration of a song
• Uses the valence/arousal model
Valence/Arousal model
We look at emotion over time (over the duration of a piece)
Our history

From 2013 to now:
• Emotion in Music 2013: Brave new task
  • Dynamic (over time) emotion prediction
  • Static (per whole clip) emotion prediction
• Emotion in Music 2014
  • Dynamic task
  • Feature design (evaluated on static data)
• Emotion in Music 2015
  • Dynamic task
Ground truth. Development set music
• In 2013 and 2014 we annotated 1744 excerpts of 45 seconds on Mechanical Turk
• Music from the Free Music Archive (freemusicarchive.org)
• Licensed under Creative Commons
• 10 genres: Rock, Pop, Electronic, Hip-Hop, Classical, Soul and RnB, Country, Folk, International, Jazz
Ground truth
Cleaning of development set data
• The data was cleaned based on inter-annotator agreement
• This reduced the set from 1744 to 431 songs
• The average Cronbach's alpha is 0.73 ± 0.12 for valence and 0.76 ± 0.12 for arousal
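As a concrete illustration of the agreement measure used above, here is a minimal pure-Python sketch of Cronbach's alpha, treating each annotator as an "item"; the ratings below are invented toy values, not the task data.

```python
# Minimal sketch: Cronbach's alpha for inter-annotator agreement.
# Rows = annotators ("items" in alpha terms), columns = time points of one song.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(ratings):
    """ratings: list of per-annotator rating sequences of equal length."""
    k = len(ratings)                              # number of annotators
    item_vars = sum(variance(r) for r in ratings)
    totals = [sum(col) for col in zip(*ratings)]  # sum across annotators
    return k / (k - 1) * (1 - item_vars / variance(totals))

# Three annotators rating the same excerpt over five time points:
ratings = [
    [0.1, 0.3, 0.5, 0.6, 0.8],
    [0.2, 0.3, 0.4, 0.7, 0.9],
    [0.0, 0.2, 0.5, 0.5, 0.7],
]
print(round(cronbach_alpha(ratings), 2))  # → 0.98 (high agreement)
```

Annotators who track each other closely push alpha toward 1; disagreement pushes it toward 0, which is how the cleaning step above filtered out songs.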
Ground truth. Test set music
• 26 complete songs from the Multitrack MedleyDB dataset: http://marl.smusic.nyu.edu/medleydb
• 26 complete songs from the Jamendo music website
• Automatically and manually checked for emotional variety
• The same genres as in the development set
• Cronbach's alpha is 0.29 ± 0.94 for valence and 0.65 ± 0.28 for arousal
Ground truth
Such a small test set?
• Duration of the train set: 323 minutes
• Duration of the test set: 227 minutes
Ground truth - evaluation set
Collecting annotations in 2015
• 5 annotators per song: 2 people from the lab and 3 Mechanical Turk workers
• Preliminary listening round
• Mechanical Turk workers were supervised and only received full payment after quality was confirmed
Ground truth. Annotations.
Annotation Interface
Baseline features
• Baseline features from the openSMILE framework (260 low-level features)
• 65 low-level acoustic descriptors (LLDs), their first-order derivatives, and the mean and standard deviation functionals of each LLD over 1 s time windows with 50% overlap
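The windowing scheme described above can be sketched as follows; the frame rate, the toy signal, and the function name are assumptions for illustration (only the mean and standard deviation functionals are shown, and the real features come from openSMILE, not this code):

```python
# Sketch of the baseline feature construction: for each low-level
# descriptor (LLD) sampled per frame, take mean and standard deviation
# over 1-second windows with 50% overlap.
import math

def windowed_functionals(lld, frames_per_sec=100):
    win = frames_per_sec            # 1 s window
    hop = win // 2                  # 50% overlap
    feats = []
    for start in range(0, len(lld) - win + 1, hop):
        chunk = lld[start:start + win]
        mean = sum(chunk) / len(chunk)
        std = math.sqrt(sum((x - mean) ** 2 for x in chunk) / len(chunk))
        feats.append((mean, std))
    return feats

# A 3-second toy LLD track yields 5 windows (0-1s, 0.5-1.5s, ..., 2-3s):
track = [math.sin(i / 10) for i in range(300)]
print(len(windowed_functionals(track)))  # → 5
```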
Emotion in Music 2015 obligatory runs
Every submission has to include:
• Predictions using baseline features
• Custom feature set (if applicable)
• Free-style run (if desired)
Evaluation
Dynamic subtask evaluation
We use RMSE and Pearson's correlation coefficient as metrics, in the following steps:
1. Calculate RMSE between predictions and ground truth for each song separately.
2. Average across songs, separately for valence and for arousal.
3. Rank all submissions for each dimension based on the averaged RMSE.
4. If the difference based on the one-sided Wilcoxon test is not significant (p > 0.05), we use rho to break the tie.
5. If the ranking changed, we run the significance test between neighbouring pairs again.
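The per-song metric steps above can be sketched in a few lines; the two toy "songs" are invented for illustration, and the Wilcoxon tie-breaking step is omitted:

```python
# Sketch of the dynamic-subtask metrics: per-song RMSE and Pearson's r,
# averaged across songs (done separately for valence and arousal).
import math

def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def pearson(pred, truth):
    n = len(truth)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    return cov / (sp * st)

def evaluate(songs):
    """songs: list of (predictions, ground_truth) pairs, one per song."""
    rmses = [rmse(p, t) for p, t in songs]
    rhos = [pearson(p, t) for p, t in songs]
    return sum(rmses) / len(rmses), sum(rhos) / len(rhos)

songs = [
    ([0.1, 0.2, 0.4], [0.0, 0.3, 0.5]),
    ([0.6, 0.5, 0.3], [0.7, 0.4, 0.2]),
]
avg_rmse, avg_rho = evaluate(songs)
```

Submissions are then ranked by the averaged RMSE, with rho used only when the Wilcoxon test finds no significant difference.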
Baseline
There is a baseline for participants to compete with:
• Baseline features
• Linear regression
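As a hedged sketch of what such a baseline could look like: the real baseline regresses on all 260 openSMILE features, but the closed-form univariate fit below (with invented data) shows the idea in its simplest form.

```python
# Ordinary least squares on a single feature: the simplest possible
# stand-in for the linear-regression baseline.

def fit_ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx   # (w, b) for y ≈ w*x + b

# Toy data roughly following y = x:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]
w, b = fit_ols(xs, ys)
```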
Results - Arousal
12 teams crossed the finish line and submitted their papers.
Rank  Team              Arousal RMSE   ρ
 1    THUHCSIL          0.23 ± 0.11    0.66 ± 0.25
 2    ICL               0.23 ± 0.11    0.63 ± 0.27
 3    SAILUSC           0.24 ± 0.11    0.65 ± 0.22
 4    HKPOLYU           0.24 ± 0.11    0.56 ± 0.27
 5    PKU-AIPL          0.24 ± 0.10    0.54 ± 0.27
 6    IRIT-SAMOVA       0.24 ± 0.11    0.63 ± 0.22
 7    JKU-Tinnitus      0.25 ± 0.11    0.53 ± 0.23
 8    UNIZA             0.25 ± 0.10    0.49 ± 0.23
 9    NCUTom            0.25 ± 0.12    0.34 ± 0.25
10    MIRUtrecht        0.26 ± 0.13    0.40 ± 0.34
11    Baseline          0.27 ± 0.11    0.36 ± 0.26
12    Average baseline  0.28 ± 0.13    0
13    UoA               0.39 ± 0.17    0.58 ± 0.24
Results - Valence
Rank  Team              Valence RMSE   ρ
 1    JUNLP             0.29 ± 0.14    −0.03 ± 0.02
 2    Average Baseline  0.29 ± 0.15    0
 3    THU-HCSIL         0.31 ± 0.17    0.15 ± 0.47
 4    MIRUtrecht        0.29 ± 0.15    0.08 ± 0.39
 5    PKU-AIPL          0.33 ± 0.18    0.01 ± 0.43
 6    NCUTom            0.34 ± 0.16    0.01 ± 0.34
 7    SAILUSC           0.35 ± 0.18    0.00 ± 0.5
 8    Baseline          0.36 ± 0.18    0.01 ± 0.38
 9    IRIT-SAMOVA       0.36 ± 0.19    0.04 ± 0.49
10    UNIZA             0.36 ± 0.17    0.01 ± 0.4
11    ICL               0.37 ± 0.19    0.02 ± 0.49
12    JKU-Tinnitus      0.39 ± 0.19    0.01 ± 0.41
13    UoA               0.49 ± 0.24    0.02 ± 0.46
Evaluation.
So, what happened to valence?
Between arousal and valence in the train set: rho = 0.51 ± 0.65 and RMSE = 0.24 ± 0.17
Evaluation
So, what happened to valence?
Between arousal and valence in the test set: rho = 0.00 ± 0.59 and RMSE = 0.40 ± 0.25
Evaluation
And what about submissions?
In the submissions, the correlation between valence and arousal was even stronger than in the train set:
• THUHCSIL: rho = 0.79 ± 0.32 and RMSE = 0.11 ± 0.07
• ICL: rho = 0.99 ± 0.00 and RMSE = 0.05 ± 0.01
• SAILUSC: rho = 0.88 ± 0.18 and RMSE = 0.09 ± 0.04
Feature sets evaluated on arousal
Rank  Team                Feature set RMSE   ρ
 1    ICL                 0.25               0.49
 2    MIRUtrecht          0.25               0.48
 3    HKPOLYU             0.26               0.50
 4    JUNLP-run1          0.26               0.49
 5    UNIZA-run1          0.26               0.51
 6    UNIZA-run2          0.26               0.51
 7    IRIT-SAMOVA         0.26               0.50
 8    JUNLP-run2-arousal  0.27               0.35
 9    THU-HCSIL           0.27               0.41
Evaluation. Baseline features
Teams’ results using baseline features
Acknowledgments
Presentations for the missing teams:
• PKU-AIPL
• HKPOLYU
• NCUTom
• SAILUSC
PKU-AIPL
Kang Cai, Wanyi Yang, Yao Cheng, Deshun Yang, Xiaoou Chen
Institute of Computer Science and Technology, Peking University, Beijing, China
• Features: MFCC, edge orientation histograms on spectrograms, low-level spectral features
• Continuous conditional random fields with SVR as base classifier
HKPOLYU
Yang Liu, Yan Liu, Zhonglei Gu
Hong Kong Baptist University, Hong Kong Polytechnic University, Hong Kong SAR
• Features: 260 baseline features
• The main contribution is a supervised feature reduction technique that takes into account similarity between items
• SVR as a classifier
MIRUtrecht
Anna Aljanaki, Frans Wiering, Remco C. Veltkamp
Utrecht University
• Features: Essentia, extracted using bigger frames (several seconds)
• Gaussian Processes
• Based on segmenting audio by emotion
• There is a poster!
Predicting affect in music using regression methods on low level features
Rahul Gupta, Shrikanth NarayananSignal Analysis and Interpretation LabUniversity of Southern California
Approach: Regression methods

Pipeline: baseline features → regression → smoothing → valence/arousal prediction

Regression methods:
1. Linear regression
2. Least squares boosting
Smoothing: moving average filter
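The smoothing step can be sketched as a centered moving-average filter over the raw per-second predictions; the window length below is an assumption, since the slides do not give the actual filter parameters:

```python
# Centered moving-average smoothing of a per-second prediction sequence.
# Edge windows shrink so the output has the same length as the input.

def moving_average(seq, window=3):
    half = window // 2
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

noisy = [0.1, 0.9, 0.2, 0.8, 0.3]
print(moving_average(noisy))  # smoother trajectory, same length
```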
Approach: Regression methods

BESiF (Boosted Ensemble of Single-feature Filters): a gradient boosting based combination of regression + smoothing
Approach: Regression methods

Method                              Valence         Arousal
                                    RMSE    r       RMSE    r
Baseline                            .37     .01     .27     .36
Linear regression + smoothing       .35     .01     .24     .65
Least squares boosting + smoothing  .35     .05     .24     .59
BESiF                               .37     −.04    .28     .50
Future investigations
• Annotation biases due to longer songs
• Differences in features for valence and arousal prediction
• Generalization of models trained on smaller segments to longer segments
MediaEval 2015: Recurrent Neural Network Approach to Emotion in Music Task
Yu-Hao Chin and Jia-Ching Wang
Department of Computer Science and Information Engineering
National Central University, Taiwan, R.O.C
• This paper adopts a deep recurrent neural network to predict the valence and arousal for each moment of a song; the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is used to update the weights during back-propagation. A 10-fold cross-validation is used to evaluate the performance.
• Approach 1: The MIR feature set (see Table 1) is adopted, with an RNN model.
• Approach 2: The baseline feature set is adopted, with an RNN model.
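For intuition only, here is a toy recurrent step unrolled over feature frames, emitting a (valence, arousal) pair per time step; the scalar weights are arbitrary values, not the trained model, and this is not the authors' implementation (which trains full weight matrices with L-BFGS).

```python
# Toy single-unit RNN: each hidden state depends on the current input
# frame and the previous hidden state, giving per-frame predictions.
import math

def rnn_step(x, h, w_in=0.5, w_rec=0.3):
    return math.tanh(w_in * x + w_rec * h)     # new hidden state

def predict(frames, w_val=0.8, w_ar=-0.6):
    h, out = 0.0, []
    for x in frames:
        h = rnn_step(x, h)
        out.append((w_val * h, w_ar * h))      # per-frame (valence, arousal)
    return out

preds = predict([0.2, 0.4, 0.1, 0.9])          # one prediction per frame
```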
[Diagram: Music Database → Feature Extraction → Recurrent Neural Network → Valence and Arousal]
Technical retreat
Today, between 14:15 and 15:15. Everyone is welcome!