
Page 1: Ismir2012 tutorial2

Music Affect Recognition: The State-of-the-art and Lessons Learned

Xiao Hu, Ph.D., The University of Hong Kong

Yi-Hsuan Eric Yang, Ph.D., Academia Sinica, Taiwan

ISMIR 2012 Tutorial 2

10/5/2012 1

Speaker

10/5/2012 2

Speaker

10/5/2012 3

The Audience

� Do you believe that music is powerful?

� Why do you think so?

� Have you searched for music by affect?

� Have you searched for other things (photos, video) by affect?

� Have you questioned the difference between emotion and mood?

� Is your research related to affect?

10/5/2012 4

Music Affect:

10/5/2012 5

Music Affect:

10/5/2012 6

Page 2: Ismir2012 tutorial2

Music Affect:

10/5/2012 7

Music Affect:

10/5/2012 8

Agenda

� Grand challenges on music affect

� Music affect taxonomy and annotation

� Automatic music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 9

Agenda

� Grand challenges on music affect

� Music affect taxonomy and annotation

� Automatic music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 10

Emotion or Mood ?

10/5/2012 11

Emotion or Mood ?

� Mood: “relatively permanent and stable”

� Emotion: “temporary and evanescent”

� "most of the supposed [psychological] studies of emotion in music are actually concerned with mood and association."

Meyer, Leonard B. (1956). Emotion and Meaning in Music.

Chicago: University of Chicago Press

Leonard Meyer

10/5/2012 12

Page 3: Ismir2012 tutorial2

Expressed or Induced

- Designated / indicated / expressed by a music piece
- Induced / evoked / felt by a listener
- Both are studied in MIR
- They mainly differ in the way labels are collected:
  - “indicate how you feel when listening to the music”
  - “indicate the mood conveyed by the music”

10/5/2012 13

Which Moods? 1/2

- Different websites / studies use different terms
  - Thayer’s stress-energy model gives 4 clusters
  - Farnsworth’s 10 adjective groups
  - Tellegen-Watson-Clark model

10/5/2012 14

Which Moods? 2/2

- Lack of a general theory of emotions
  - Ekman’s 6 basic emotions: anger, joy, surprise, disgust, sadness, fear
- Verbalization of emotional states is often a “distortion” (Meyer, 1956)
  - “unspeakable feelings”
  - “a restful feeling throughout ... like one of going downstream while swimming”

10/5/2012 15

Sources of Music Emotion

- Intrinsic (structural characteristics of the music)
  - e.g., modality -> happy vs. sad
  - What about melody?
- Extrinsic emotion (semantic context related to, but outside, the music)
- Lee et al. (2012) identified a range of factors in people’s assessment of music mood
  - Lyrics, tempo, instrumentation, genre, delivery, and even cultural context
- Little is known about the mapping of these factors to music mood

Lee, J. H., Hill, T., & Work, L. (2012). What does music mood mean for real users? In Proceedings of the iConference.

10/5/2012 16

Let’s ask the users… (Lee et al., 2012)

10/5/2012 17

Data, data, data!

- Annotated data are an extremely scarce resource
  - Annotations are time consuming
  - Consistency is low across annotators
- Existing public datasets on mood:
  - MoodSwings Turk dataset: 240 30-sec clips; arousal-valence scores
  - MIREX mood classification task: 600 30-sec clips; in 5 mood clusters
  - MIREX tag classification task (mood sub-task): 3,469 30-sec clips; in 18 mood-related tag groups
  - Yang’s emotion regression dataset: 193 25-sec clips; on an 11-level arousal-valence scale

10/5/2012 18

Page 4: Ismir2012 tutorial2

Suboptimal Performance

- MIREX Mood Classification (2012)
  - Accuracy: 46% - 68%
- MIREX Tag Classification, mood subtask (2011)

10/5/2012 19

Newer Challenges

- Cross-cultural applicability
  - Existing efforts focus on Western music
  - OS1 @ ISMIR 2012 (tomorrow): Yang & Hu: Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs
- Personalization
  - The ultimate solution to the subjectivity problem
- Contextualization
  - Even the same person’s emotional responses change at different times, locations, and occasions
  - PS1 @ ISMIR 2012 (tomorrow): Watson & Mandryk: Modeling Musical Mood From Audio Features and Listening Context on an In-Situ Data Set

10/5/2012 20

Summary of Challenges

� Terminology

� Models and categories

� No consensus

� Sources and factors

� No clear mapping between sources and affects

� Data scarcity

� Suboptimal performances

� Newer issues

� Cross-cultural, personalization, contextualization,...

10/5/2012 21

Agenda

� Grand challenges on music affect

� Music affect taxonomy and annotation

� Automatic music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 22

Music affect taxonomy and annotation

- Background
  - What are taxonomies?
  - Taxonomy vs. Folksonomy
- Developing music mood taxonomies
  - Taxonomy from editorial labels
  - Taxonomies from social tags
- Annotations
  - Experts
  - Crowdsourcing (e.g., MTurk, games)
  - Subjects
  - Derived from online services

10/5/2012 23

Taxonomy

- Domain-oriented controlled vocabulary
- Contains labels (metadata)
- Commonly used on websites
  - Pick lists, browsable directories, etc.

10/5/2012 24

Page 5: Ismir2012 tutorial2

Taxonomy vs. Folksonomy

- Taxonomy
  - Controlled, structured vocabulary
  - Often requires expert knowledge
  - Top-down and bottom-up approaches
- Folksonomy
  - Uncontrolled, unstructured vocabulary
  - Social tags freely applied by users
  - Commonality emerges across large numbers of tags

10/5/2012 25

Models in Music Psychology 1/2

- Categorical
  - Hevner’s adjective circle (1936)

Hevner, K. (1936). Experimental studies of the elements of expression in music. American Journal of Psychology, 48.

10/5/2012 26

Models in Music Psychology 2/2

- Dimensional
  - Russell’s circumplex model (1980)

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161-1178.

10/5/2012 27

Borrowing from Psychology to MIR

- Thayer’s stress-energy model gives 4 clusters
- Farnsworth’s 10 adjective groups
- Tellegen-Watson-Clark model
- Grounded in music perception research, but lacking the social context of music listening (Juslin & Laukka, 2004)

Juslin, P. N. and Laukka, P. (2004). Expression, perception, and induction of musical emotions: a review and a questionnaire study of everyday listening. JNMR.

10/5/2012 28

Taxonomy Built from Editorial Labels

- allmusic.com: “the most comprehensive music reference source on the planet”
- 288 mood labels created and assigned to music works
- Editorial labels:
  - Given by professional editors of online repositories
  - Have a certain level of control
  - Rooted in realistic social contexts

10/5/2012 29-30

Page 6: Ismir2012 tutorial2

Mood Label Clustering

- Mood labels for albums and mood labels for songs were each clustered into five clusters (C1-C5)

Hu, X., & Downie, J. S. (2007). Exploring Mood Metadata: Relationships with Genre, Artist and Usage Metadata. In Proceedings of ISMIR.

10/5/2012 31

A Taxonomy of 5 Mood Clusters

10/5/2012

Cluster_1:

passionate, rousing, confident, boisterous, rowdy

Cluster_2:

rollicking, cheerful, fun, sweet, amiable/good natured

Cluster_3:

literate, poignant, wistful, bittersweet, autumnal, brooding

Cluster_4:

humorous, silly, campy, quirky, whimsical, witty, wry

Cluster_5:

aggressive, fiery, tense/anxious, intense, volatile, visceral

32

Taxonomy from Social Tags

- Social tags
  - Pros: users’ perspectives; large quantity
  - Cons: non-standardized; ambiguous
- Combined with linguistic resources and human expertise
- last.fm: “The largest music tagging site for Western music”

Hu, X. (2010). Music and Mood: Where Theory and Reality Meet. In Proceedings of the 5th iConference (Best Student Paper).

10/5/2012 33

The Method

- 1,586 terms in WordNet-Affect (a lexicon of affective words)
- minus 202 evaluation terms in General Inquirer (“good”, “great”, “poor”, etc.)
- minus 135 non-affect / ambiguous terms identified by experts (“cold”, “chill”, “beat”, etc.)
- = 1,249 terms
- 476 of these terms appear as last.fm tags
- Group the tags by WordNet-Affect and experts => 36 categories

10/5/2012 34

2-D Mood Taxonomy: 2-Dimensional Representation

10/5/2012 35

Comparison to Russell’s 2-D Model

10/5/2012 36

Page 7: Ismir2012 tutorial2

Our Taxonomy (mapped onto the valence-arousal plane)

10/5/2012 37

Laurier et al. (2009) Taxonomy from Social Tags 1/2

- Manually compiled 120 mood words from the literature
- Crawled 6.8M social tags from last.fm
- 107 unique tags matched mood words
- 80 tags with more than 100 occurrences

Most used: sad, fun, melancholy, happy
Least used: rollicking, solemn, rowdy, tense

Laurier et al. (2009). Music mood representations from social tags, ISMIR.

10/5/2012 38

Laurier et al. (2009) Taxonomy from Social Tags 2/2

- Used LSA to project the tag-track matrix to a space of 100 dimensions
- Clustering trials with varied numbers of clusters

cluster 1 (+A -V): angry, aggressive, visceral, rousing, intense, confident, anger
cluster 2 (-A -V): sad, bittersweet, sentimental, tragic, depressing, sadness, spooky
cluster 3 (-A +V): tender, soothing, sleepy, tranquil, quiet, calm, serene
cluster 4 (+A +V): happy, joyous, bright, cheerful, humorous, gay, amiable

Laurier et al. (2009). Music mood representations from social tags, ISMIR.

10/5/2012 39
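
The tag-space construction described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Laurier et al.'s code: it assumes scikit-learn, a toy random tag-track matrix in place of the real last.fm data, and K-means as the clustering algorithm (the slide does not say which one was used).

# Illustrative sketch: project a tag-track matrix with LSA (truncated SVD)
# and cluster the tags. All data below are placeholders.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_tags, n_tracks = 80, 5000                              # hypothetical sizes
tag_track = (rng.random((n_tags, n_tracks)) < 0.01).astype(float)

# LSA: reduce each tag to a low-dimensional vector (100 dims in the paper,
# fewer here for the toy data)
lsa = TruncatedSVD(n_components=20, random_state=0)
tag_vectors = lsa.fit_transform(tag_track)

# Try a clustering with 4 clusters, as in the reported taxonomy
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(tag_vectors)
for c in range(4):
    print("cluster", c, ":", np.where(labels == c)[0][:5])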

- Based on Laurier’s 100-dimensional space
- Agreement between Laurier’s clusters and the 5-cluster taxonomy

Inter-cluster dissimilarity:

     C1   C2   C3   C4   C5
C1   0    .74  .13  .20  .11
C2        0    .86  .82  .88
C3             0    .32  .27
C4                  0    .53
C5                       0

(The slide also shows intra-cluster similarity.)

Laurier et al. (2009). Music mood representations from social tags, ISMIR.

10/5/2012 40

Summary on Taxonomy

� What are taxonomies?

� Taxonomy vs. Folksonomy

� Developing music mood taxonomies

� from Editorial Labels

� from Social Tags

10/5/2012 41

Mood Annotations

� All annotation needs three things

� taxonomy, music, people

� People

� Experts

� Subjects

� Crowdsourcing (e.g., MTurk, games)

� Derive annotations from online services

10/5/2012 42

Page 8: Ismir2012 tutorial2

Expert Annotation

� The MIREX Audio Mood Classification (AMC) task

� 5 cluster taxonomy

� 1,250 tracks selected from the APM libraries

� A Web-based annotation system called E6K

Hu, X., Downie, J. S., Laurier, C., Bay, M., & Ehmann, A. (2008). The 2007 MIREX Audio Mood Classification Task: Lessons Learned. In ISMIR.

10/5/2012 43

Expert Annotation: MIREX AMC

- 2,468 judgments collected (3,750 planned)
  - Each clip had 2 or 3 judgments
  - Avg. Cohen’s kappa: 0.5
- Each expert was assigned 250 clips; 8 of 21 experts finished all assignments

Agreements       C1   C2   C3   C4   C5   Total
3 of 3 judges    21   24   56   21   31   153
2 of 3 judges    41   35   18   26   14   134
2 of 2 judges    58   61   46   73   75   313
Total           120  120  120  120  120   600

Accuracy: 0.59, 0.38, 0.54

Lessons:
1. Missed judgments -> low accuracy
2. Need more motivated annotators

Dataset built from agreements among experts

10/5/2012 44

Crowdsourcing: Amazon Mechanical Turk

- Lee & Hu (2012): compare expert and MTurk annotations
- The same 1,250 music clips as in MIREX AMC
- The same 5 clusters
- Annotators: “Turkers” who work on human intelligence tasks for very low payment
- Advantages of MTurk: plenty of labor
- Disadvantages of MTurk: quality control

Lee, J. H. & Hu, X. (2012). Generating Ground Truth for Music Mood Classification Using Mechanical Turk. In Proceedings of the Joint Conference on Digital Libraries.

10/5/2012 45

Annotation: Amazon Mechanical Turk

- Human Intelligence Task (HIT)
  - Each HIT had 27 clips
  - 2 duplicates for consistency check
  - Each clip had 2 judges
  - Paid 0.55 USD per HIT
- Qualification test before proceeding to the task
- 186 HITs collected; 100 HITs accepted
- Avg. Cohen’s kappa: 0.48

10/5/2012 46

Comparison: Stats on Collecting Data (Evalutron 6000 vs. MTurk)

Number of judgments collected:          2,468 (incomplete)  vs.  2,500 (complete)
Total time for collecting judgments:    38 days (+ additional in-house assessment)  vs.  19 days
Cost for collecting all judgments:      $0  vs.  $60.50
Average time spent per music clip:      21.54 seconds  vs.  17.46 seconds

10/5/2012 47

Comparison: Agreement Rates (Evalutron 6000 vs. MTurk)

% of clips with agreement:
C1: 40.2%  vs.  39.6%
C2: 60.2%  vs.  48.9%
C3: 70.5%  vs.  69.5%
C4: 39.6%  vs.  46.3%
C5: 70.8%  vs.  60.0%
Other: 16.9%  vs.  21.3%

10/5/2012 48

Page 9: Ismir2012 tutorial2

Comparison: Confusions among Clusters

Clusters                 Disagreed in E6K   Disagreed in MTurk
Cluster 1 & Cluster 2          20                 95
Cluster 2 & Cluster 4          31                 86
Cluster 1 & Cluster 5          13                 74
...                            ...                ...
Cluster 3 & Cluster 4           6                 27
Cluster 2 & Cluster 5           1                 22
Cluster 3 & Cluster 5           1                 20
Total                         253                595

10/5/2012 49

Confusions Shown in Russell’s Model

(Figure: Clusters 1-5 placed in the valence-arousal plane of Russell’s model)

10/5/2012 50

Comparison: System Performances (MIREX 2007)

10/5/2012 51

Crowdsourcing: Games

- MoodSwings (Kim et al., 2008)
  - 2-player Web-based game to collect annotations of music pieces in the arousal-valence space
  - Time-varying annotations are collected at a rate of 1 sample per second
  - Players “score” points for agreement with their competitor

Kim, Y. E., Schmidt, E., and Emelle, L. (2008). MoodSwings: a collaborative game for music mood label collection, ISMIR.

10/5/2012 52

MoodSwings: Challenges

- Needs a pair of players
  - Simulated AI player
    - Randomly following the real player -> less challenging
    - Based on a prediction model -> needs training data
- Attracting players (true for all games)
  - Must be challenging and fun
  - Music: more recent and entertaining
  - Game interface: sleek, aesthetic
- Research value
  - Variety of music and mood

Morton, B. G., Speck, J. A., Schmidt, E. M., and Kim, Y. E. (2010). Improving music emotion labeling using human computation. In HCOMP.

10/5/2012 53

MoodSwings: MTurk version

10/5/2012

Speck, J. A., Schmidt, E. M., Morton, B. G., and Kim, Y. E. (2011). A comparative study of collaborative vs. traditional music mood annotation, ISMIR

• Single person game

• No competition, no scores

• Monetary reward

(0.25 USD/11 pieces)

• Consistency check:

-- 2 identical pieces whose labels must be within experts’ decision boundary

-- must not label all clips the same way

54

Page 10: Ismir2012 tutorial2

MoodSwings: Comparison of the 2 Versions

- Label correlation between the two versions: valence 0.71, arousal 0.85

Speck, J. A., Schmidt, E. M., Morton, B. G., and Kim, Y. E. (2011). A comparative study of collaborative vs. traditional music mood annotation, ISMIR.

10/5/2012 55

Subject Annotation

� Do not require music expertise

� Easier to recruit than experts

� Arguably more authentic to MIR situations

� Can be trained for annotation task

� Higher data quality than MTurk

� Still needs verification/evaluation

� Often with payments

� Rates much higher than MTurk

10/5/2012 56

Deriving Annotations from Online Services

- Harness the power of Music 2.0
- Based on editorial labels and noisy user tags
  - e.g., the MSD (Million Song Dataset)
  - e.g., MIREX Audio Tag Classification mood dataset

10/5/2012 57

MIREX Mood Tag Classification

10/5/2012 58

MIREX Mood Tag Classification Dataset:

Positive Examples in Each Category

� Based on the top 100 tags provided by last.fm API

10/5/2012

Select songs tagged heavily with terms in a category

59

MIREX Mood Tag Classification Dataset:

An Example

10/5/2012 60

Page 11: Ismir2012 tutorial2

Annotation Derived from Music 2.0

PROS
- Grounded in real-life usage
- Larger datasets, supporting multi-label annotation
- No manual annotation required

CONS
- Needs mood-related social tags
- Needs clever ways to filter out noise
- May be culturally dependent

10/5/2012 61

Cross-Cultural Issue in Annotation

- A survey of 30 clips with American and Chinese listeners
- C1: passionate; C2: cheerful; C3: bittersweet; C4: humorous; C5: aggressive
- Example clip: “Got to Get You into My Life” by The Beatles

Hu, X. & Lee, J. H. (2012). A Cross-cultural Study of Music Mood Perception between American and Chinese Listeners, ISMIR (PS3 – Thursday!)

10/5/2012 62

Summary on Annotation

� Expert annotation for small datasets

� Crowdsourcing with careful designs

� Music 2.0 for super size datasets

� ??

10/5/2012 63

Agenda

� Grand challenges on music affect

� Music affect taxonomy and annotation

� Automatic Music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 64

Automatic Approaches

- Categorical vs. Dimensional
  - Categorical
    - Pros: intuitive; natural language
    - Cons: terms are ambiguous; difficult to offer fine-grained differentiation
  - Dimensional
    - Pros: continuous affective scales; good user interface
    - Cons: less intuitive; difficult to annotate

10/5/2012 65

Categorical and Multimodal

Approaches

� Classification problem and framework

� Audio features and classification models

� Existing experiments

� Multimodal classification

� Cross-cultural classification

10/5/2012 66

Page 12: Ismir2012 tutorial2

Automatic Classification (supervised learning)

Training examples:
  “Here Comes the Sun” -> Happy
  “I Will Be Back” -> Sad
  “Down with the Sickness” -> Angry
  Song X -> Happy
  Song Y -> Sad
  ...

Training -> Classifier

New examples -> Classifier -> Prediction (Happy / Angry / Sad)

10/5/2012 67
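
A minimal sketch of this train/test loop, assuming scikit-learn and made-up two-dimensional feature vectors and labels (the song names on the slide are only used for illustration):

# Sketch of supervised mood classification; features and labels are placeholders.
import numpy as np
from sklearn.svm import SVC

# Training examples: one feature vector per song, with a mood label
X_train = np.array([[0.8, 120.0], [0.2, 70.0], [0.9, 160.0], [0.7, 110.0]])
y_train = ["happy", "sad", "angry", "happy"]

clf = SVC(kernel="rbf")          # an SVM classifier
clf.fit(X_train, y_train)        # training

X_new = np.array([[0.25, 75.0]]) # a new, unlabeled song
print(clf.predict(X_new))        # prediction for the new song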

A Framework for Multimodal Mood Classification

Dataset construction: MP3s, lyrics, social tags, ...

Feature generation and selection:
- Audio feature extraction: timbral, tempo, ...
- Textual feature extraction: linguistic, stylistic, ...
- Feature selection: F-score, language modeling, PCA

Classification and multimodal combination:
- Classification: SVM, kNN, ...
- Hybrid methods: feature concatenation, late fusion, ...

Evaluation and analysis: performance comparison, learning curves, feature comparison

10/5/2012 68

Audio Features

- Energy: the mean and standard deviation of root mean square energy (Marsyas, MIR Toolbox)
- Rhythm: fluctuation pattern and tempo (MIR Toolbox, PsySound)
- Pitch: pitch class profile, the intensity of the 12 semitones of the musical octave in the Western twelve-tone scale (MIR Toolbox, PsySound)
- Tonal: key clarity, musical mode (major/minor), and harmonic change (e.g., chord change) (MIR Toolbox)
- Timbre: the mean and standard deviation of the first 13 MFCCs, delta MFCCs, and delta-delta MFCCs (Marsyas, MIR Toolbox)
- Psychoacoustic: perceptual loudness, volume, sharpness (dull/sharp), timbre width (flat/rough), spectral and tonal dissonance (dissonant/consonant) (PsySound)

10/5/2012 69

Classification Models

- Generic supervised learning algorithms
  - k-nearest neighbor (k-NN), maximum likelihood, decision tree, support vector machine (SVM), Gaussian mixture models (GMM), neural networks, etc.
- Tools: generic machine learning packages
  - Weka, RapidMiner, LibSVM, SVMLight
- SVM seems superior (MIREX AMC 2007 results)

10/5/2012 70

Audio signal’s “glass ceiling”

- Aucouturier & Pachet (2004): a “semantic gap” between low-level music features and high-level human perception
- MIREX AMC performance (5 classes):

Year   Top 3 accuracies
2007   61.50%, 60.50%, 59.67%
2008   63.67%, 58.20%, 56.00%
2009   65.67%, 65.50%, 63.67%
2010   63.83%, 63.50%, 63.17%
2011   69.50%, 67.17%, 66.67%
2012   67.83%, 67.67%, 67.17%

Aucouturier, J.-J., & Pachet, F. (2004). Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1).

10/5/2012 71

Multimodal Classification

- Improve classification performance by combining multiple independent sources of music information: audio, lyrics, social tags, metadata
  - Bischoff et al., 2009
  - Yang & Lee, 2004
  - Laurier et al., 2009
  - Hu & Downie, 2010
  - Schuller et al., 2011

10/5/2012 72

Page 13: Ismir2012 tutorial2

Lyric Features

- Basic features: content words, part-of-speech, function words
- Lexicon features: words in WordNet-Affect
- Psycholinguistic features:
  - Psychological categories in GI (General Inquirer)
  - Scores in ANEW (Affective Norms for English Words)
- Stylistic features:
  - Punctuation marks; interjection words
  - Statistics: e.g., how many words per minute

ANEW examples:

Word     Valence  Arousal  Dominance
Happy     8.21     6.49     6.63
Sad       1.61     4.13     3.45
Thrill    8.05     8.02     6.54
Kiss      8.26     7.32     6.93
Dead      1.94     5.73     2.84
Dream     6.73     4.53     5.53
Angry     2.85     7.17     5.55
Fear      2.76     6.96     3.22

Hu, X. & Downie, J. S. (2010). Improving Mood Classification in Music Digital Libraries by Combining Lyrics and Audio, JCDL.

10/5/2012 73
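
A small illustrative sketch of an ANEW-style lyric feature: average the valence/arousal norms of the lyric words found in the lexicon. The mini-dictionary below just reuses the example values from this slide; the real ANEW lexicon is much larger.

# Placeholder ANEW subset: word -> (valence, arousal)
ANEW = {
    "happy": (8.21, 6.49), "sad": (1.61, 4.13), "kiss": (8.26, 7.32),
    "dead": (1.94, 5.73), "dream": (6.73, 4.53), "fear": (2.76, 6.96),
}

def anew_features(lyrics: str):
    hits = [ANEW[w] for w in lyrics.lower().split() if w in ANEW]
    if not hits:
        return (None, None)                      # no affective words found
    mean_valence = sum(v for v, a in hits) / len(hits)
    mean_arousal = sum(a for v, a in hits) / len(hits)
    return (mean_valence, mean_arousal)

print(anew_features("I kiss you and I dream a happy dream"))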

Lyric Feature Example

Top General Inquirer (GI) features in the category “Aggressive”:

- WlbPhys: words connoting the physical aspects of well-being, including its absence (blood, dead, drunk, pain)
- Perceiv: words referring to the perceptual process of recognizing or identifying something by means of the senses (dazzle, fantasy, hear, look, make, tell, view)
- Exert: action words (hit, kick, drag, upset)
- TIME: words indicating time (noon, night, midnight)
- COLL: words referring to all human collectivities (people, gang, party)
- WlbLoss: words related to a loss in a state of well-being, including being upset (burn, die, hurt, mad)

10/5/2012 74

Lyric Classification Results

- No significant difference between the top feature combinations

10/5/2012 75

Distribution of feature “!”

10/5/2012 76

Distribution of feature “hey”

10/5/2012 77

“number of words per minute”

10/5/2012 78

Page 14: Ismir2012 tutorial2

Combine with Audio-based Classifier

� A leading system in MIREX AMC 2007 and 2008: Marsyas

� Music Analysis, Retrieval and Synthesis for Audio Signals

� led by Prof. Tzanetakis at University of Victoria

� Uses audio spectral features

� marsyas.info

� Finalist in the Sourceforge Community Choice Awards 2009

10/5/2012 79

Hybrid Methods

- Feature concatenation (early fusion): concatenate lyric and audio features, then train a single classifier -> prediction
- Late fusion: train a lyric classifier and an audio classifier separately, then combine their predictions into a final prediction
  - Dominant in practice due to its clarity and the avoidance of the “curse of dimensionality”

10/5/2012 80
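
A hedged sketch of the two fusion strategies, assuming scikit-learn and random placeholder matrices X_audio and X_lyric with shared labels y (not the tutorial's data); the 50/50 weighting in the late-fusion step is likewise an arbitrary choice.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_audio, X_lyric = rng.random((40, 8)), rng.random((40, 12))  # placeholder features
y = np.array([0, 1] * 20)                                     # placeholder labels

# Early fusion (feature concatenation): one classifier on the joined vector
early = SVC(probability=True).fit(np.hstack([X_audio, X_lyric]), y)

# Late fusion: separate classifiers, combine their probability estimates
clf_a = SVC(probability=True).fit(X_audio, y)
clf_l = SVC(probability=True).fit(X_lyric, y)

def late_fusion_predict(xa, xl, alpha=0.5):
    p = alpha * clf_a.predict_proba(xa) + (1 - alpha) * clf_l.predict_proba(xl)
    return p.argmax(axis=1)

print(late_fusion_predict(X_audio[:3], X_lyric[:3]))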

10/5/2012 81

Effectiveness

- Compared systems: Hybrid (early fusion), Lyrics, Audio, Hybrid (late fusion)

10/5/2012 82

Learning Curves

10/5/2012 83

Audio vs. Lyrics

Hu & Downie (2010). When Lyrics Outperform Audio for Music Mood Classification: A Feature Analysis, ISMIR.

10/5/2012 84

Page 15: Ismir2012 tutorial2

Top Lyric Features

10/5/2012 85

Top Lyric Features in “Calm”

10/5/2012 86

vs. Top Affective Words

10/5/2012 87

Other Textual Features used in

Music Mood Classification

� Based on SentiWordNet

� assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity

� Simple Syntactic Structures

� Negation, modifier

� Lyric rhyme patterns (inspired by poems)

� Contextual features (Beyond lyrics)

� Social tags, blogs, playlists, etc.

10/5/2012 88

Cross-cultural Mood Classification

- Tomorrow, Oral Session 1

Yang & Hu (2012). Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs, ISMIR.

Cross-cultural model applicability:
- 23 mood categories based on AllMusic.com
- Train on songs from one culture and classify songs from the other

10/5/2012 89

Summary of Categorical and Multimodal Approaches

- Natural language labels are intuitive to end users
- Based on supervised learning techniques
- Studies mostly focus on feature engineering
- Multimodal approaches improve performance
  - Effectiveness and efficiency
- Cross-cultural mood classification: just started
- Challenges
  - Ambiguity inherent in terms (Meyer’s “distortion”)
  - Hierarchy of mood categories
  - Connections between features and mood categories

10/5/2012 90

Page 16: Ismir2012 tutorial2

Agenda

� Grand challenges on music affect

� Music affect taxonomy and annotation

� Automatic Music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 91

Dimensional Approach

� What is and why dimensional model

� Computational model for dimensional music emotion recognition

� Issues

� Difficulty of emotion rating

� Subjectivity of emotion perception

� Context of music listening

� Usability of UI

10/5/2012 92

Categorical Approach

Hevner’s model (1936)

Audio spectrum

10/5/2012 93

Circumplex model

(Russell 1980)

Audio spectrum

Dimensional Approach

10/5/2012 94

What is the Dimensional Model?

- An alternative conceptualization of emotions based on their placement along broad affective dimensions
- It is obtained by analyzing “similarity ratings” of emotion words or facial expressions using factor analysis or multi-dimensional scaling
  - For example, Russell (1980) asked 343 subjects to describe their emotional states using 28 emotion words, and used four different methods to analyze the correlations between the emotion ratings
- Many studies identify similar dimensions

10/5/2012 95

The Valence-Arousal (VA) Emotion Model

- Activation‒Arousal: energy or neurophysiological stimulation level
- Evaluation‒Valence: pleasantness; positive and negative affective states

[psp80]

10/5/2012 96

Page 17: Ismir2012 tutorial2

More Dimensions

� The world of emotions is not 2D (Fontaine et al., 2007)

� 3rd dimension: potency‒control

� Feeling of power/weakness; dominance/submission

� Anger ↔ fear

� Pride ↔ shame

� Interest ↔ disappointment

� 4th dimension: predictability

� Surprise

� Stress↔ fear

� Contempt ↔ disgust

� However, 2D model seems to work fine for music emotion

10/5/2012 97

Why the Dimensional Model 1/3

- Free of emotion words
  - Emotion words are not always precise and consistent
    - We often cannot find proper words to express our feelings
    - Different people have different understandings of the words
    - Emotion words are difficult to translate and might not exist with the exact same meaning in different languages (Russell, 1991)
  - Semantic overlap between emotion categories
    - Cheerful, happy, joyous, party/celebratory
    - Melancholy, gloomy, sad, sorrowful
  - Difficult to determine how many and which categories to use in a mood classification system

10/5/2012 98

No Consensus on Mood Taxonomy in MIR

Work / number of categories / emotion descriptions:
- Katayose et al. [icpr98]: 4 - gloomy, urbane, pathetic, serious
- Feng et al. [sigir03]: 4 - happy, angry, fear, sad
- Li et al. [ismir03], Wieczorkowska et al. [imtci04]: 13 - happy, light, graceful, dreamy, longing, dark, sacred, dramatic, agitated, frustrated, mysterious, passionate, bluesy
- Wang et al. [icsp04]: 6 - joyous, robust, restless, lyrical, sober, gloomy
- Tolos et al. [ccnc05]: 3 - happy, aggressive, melancholic+calm
- Lu et al. [taslp06]: 4 - exuberant, anxious/frantic, depressed, content
- Yang et al. [mm06]: 4 - happy, angry, sad, relaxed
- Skowronek et al. [ismir07]: 12 - arousing, angry, calming, carefree, cheerful, emotional, loving, peaceful, powerful, sad, restless, tender
- Wu et al. [mmm08]: 8 - happy, light, easy, touching, sad, sublime, grand, exciting
- Hu et al. [ismir08]: 5 - passionate, cheerful, bittersweet, witty, aggressive
- Trohidis et al. [ismir08]: 6 - surprised, happy, relaxed, quiet, sad, angry

10/5/2012 99

Why the Dimensional Model 2/3

- Reliable and economical model
  - Only two variables (valence, arousal), instead of tens or hundreds of mood tags
  - Easy to compare the performance of different systems
- Suitable for continuous measurements
  - Emotions may change over time
  - Emotion intensity
  - More precise and intuitive than emotion words

(Figure: emotion changing as time unfolds, traced as a trajectory in the valence-arousal plane from “very angry” to “angry” to “neutral”)

10/5/2012 100

Why the Dimensional Model 3/3

- Ready canvas for user interaction
  - Emotion-based retrieval
  - Song collection navigation
- (The interface shown uses three dimensions: valence, arousal, synthetic/acoustic)

10/5/2012 101

Mapping Songs to the VA Space

- Assumption
  - View the VA space as a continuous, Euclidean space
  - View each point as an emotional state
- Goal
  - Given a short music clip (e.g., 10 to 30 seconds)
  - Automatically compute a pair of valence and arousal (VA) values that best quantify (summarize) the expressed emotion of the overall clip
- Research on time-dependent, second-by-second emotion recognition (emotion tracking) will be introduced in the next session

10/5/2012 102

Page 18: Ismir2012 tutorial2

How to Predict Emotion Values 1/3

- Sol (A): divide the emotion space into several mood classes
  - For example, into 16 classes
- Pros
  - Standard classification problem: y = f(x), where x is a feature vector and y is a discrete label (1‒16)
- Cons
  - Poor granularity of the emotion space (not really VA values)

(Moody by Crayonroom)

10/5/2012 103

How to Predict Emotion Values 2/3

- Sol (B): further exploit the “geographic information” (Yang et al., 2006)
  - For example, perform binary classification for each quadrant
  - Apply arithmetic operations to the probability estimates (u denotes the likelihood of each quadrant):
    - Valence = u1 + u4 − u2 − u3
    - Arousal = u1 + u2 − u3 − u4
- Pros
  - Easy to compute
- Cons
  - Lacks theoretical foundation

10/5/2012 104
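
The arithmetic of Sol (B) in a tiny Python sketch; the quadrant likelihoods are made-up numbers, e.g. as they might come from a 4-class classifier's probability estimates.

def quadrants_to_va(u1, u2, u3, u4):
    # u1..u4: likelihoods of quadrants 1..4 of the VA plane
    valence = u1 + u4 - u2 - u3
    arousal = u1 + u2 - u3 - u4
    return valence, arousal

print(quadrants_to_va(0.5, 0.3, 0.1, 0.1))   # -> (0.2, 0.6): mildly positive, high arousal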

How to Predict Emotion Values 3/3

- Sol (C): regression (Yang et al., 2007, 2008; MacDorman et al., 2007; Eerola et al., 2009)
  - Given features, predict a numerical value: one for valence, one for arousal
    - yv = fv(x), ya = fa(x), where x is a feature vector and yv, ya are both numerical values
- Pros
  - Regression analysis is theoretically sound and well developed
  - Many good off-the-shelf regression algorithms
- Cons
  - Requires ground-truth “emotion values”
  - Need to ask human subjects to “rate” the emotion values of songs

10/5/2012 105

Linear Regression: Example

- Linear regression: f(x) = w^T x + b
- Possible (hypothesized) w for valence and arousal
  - Positive valence = consonant harmony & major mode
  - High arousal = loud loudness & fast tempo & high pitch
- Nonlinear regression functions can also be used

          loudness      tempo        pitch level   harmony                 mode
          (loud/soft)   (fast/slow)  (high/low)    (consonant/dissonant)   (major/minor)
valence   0             0            0             1                       1
arousal   1             1            1             0                       0

10/5/2012 106
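
The hypothesized weights above can be written out directly; this sketch assumes features normalized to [0, 1], a bias of 0, and an invented example feature vector.

import numpy as np

#                     loudness, tempo, pitch, harmony(consonant), mode(major)
w_valence = np.array([0.0,      0.0,   0.0,   1.0,                1.0])
w_arousal = np.array([1.0,      1.0,   1.0,   0.0,                0.0])

x = np.array([0.9, 0.8, 0.7, 0.2, 0.0])   # loud, fast, high, fairly dissonant, minor
print("valence:", w_valence @ x)           # 0.2  (low valence)
print("arousal:", w_arousal @ x)           # 2.4  (high arousal)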

Computational Framework

- Emotion annotation: obtain y for the training data
- Feature extraction: obtain x
- Regression model training: obtain w
- Automatic prediction: obtain y for the test data

Training data -> emotion annotation (emotion value y) + feature extraction (feature x) -> regressor training (w)
Test data -> feature extraction (feature x) -> automatic prediction with the regressor -> emotion value y

10/5/2012 107

Feature Extraction: Get x

Extractor / language / features:
- Marsyas-0.2 (C): MFCC, LPCC, spectral properties (centroid, moment, flatness, crest factor)
- MIR Toolbox (Matlab): spectral features, rhythm features, pitch, key clarity, harmonic change, mode
- MA Toolbox (Matlab): MFCC, spectral histogram, periodic histogram, fluctuation pattern
- PsySound (Matlab): psychoacoustic-model-based features (loudness, sharpness, roughness, virtual pitch, volume, timbre width, dissonance)
- Rhythm pattern extractor (Matlab): rhythm pattern, beat histogram, tempo
- EchoNest API (Python): timbre, pitch, loudness, key, mode, tempo
- MPEG-7 audio encoder (Java): spectral properties, harmonic ratio, noise level, fundamental frequency type

10/5/2012 108

Page 19: Ismir2012 tutorial2

Relevant Features

- Sound intensity, tempo, rhythm, pitch range, mode (e.g., major), consonance

[Gomez and Danuser, 2007]

10/5/2012 109

Example Matlab Code for Extracting MFCC Using the MA Toolbox

- (Code shown on the slide: take 20 MFCC coefficients, discard the DC value, then take the mean & STD along time)

10/5/2012 110
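
For readers without Matlab, here is a rough Python/librosa analogue of the steps described on the slide (not the original MA Toolbox code): compute 20 MFCCs, drop the DC-like 0th coefficient, and take the mean and STD along time. The file name is a placeholder.

import numpy as np
import librosa

y, sr = librosa.load("some_clip.wav", sr=22050)        # hypothetical input clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)      # shape: (20, n_frames)
mfcc = mfcc[1:, :]                                      # discard the 0th (DC-like) coefficient
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(features.shape)                                   # (38,) -> one clip-level feature vector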

Emotion Annotation: Get y

- Rate the VA values of each song
  - Ordinal rating scale
  - Scroll bar
- Only the y for the training data needs to be annotated; the y for the test data is automatically predicted by the regression model

10/5/2012 111

Example System

- Data set (Yang et al., 2008)
  - 195 pop songs (Chinese, Japanese, and English)
  - Each song is rated by 10+ subjects
  - Ground truth is set by averaging
  - Marsyas and PsySound used to extract features
- Model learning (get w)
  - Linear regression
  - AdaBoost.RT (nonlinear)
  - Support vector regression (SVR) (nonlinear)

Yang, Y.-H., Lin, Y.-C., Su, Y.-F., and Chen, H.-H. (2008). A regression approach to music emotion recognition, IEEE TASLP 16(2).

10/5/2012 112

Performance Evaluation

- Evaluation metric: R² statistics
  - Squared correlation between the estimate and the ground truth
  - The higher the better
  - R² = 1 -> perfect fit
  - R² = 0 -> random guess
- 10-fold cross validation
  - 9/10 of the data for training and 1/10 for testing
  - Repeated 20 times to get the average result

10/5/2012 113
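
A minimal sketch of this protocol with scikit-learn: SVR regressors for valence and arousal scored by R² under 10-fold cross validation. The feature matrix and ratings below are random placeholders standing in for the 195-song dataset, so the printed scores are meaningless.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.random((195, 30))                 # placeholder audio features
y_valence = rng.uniform(-1, 1, 195)       # placeholder averaged valence ratings
y_arousal = rng.uniform(-1, 1, 195)       # placeholder averaged arousal ratings

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, target in [("valence", y_valence), ("arousal", y_arousal)]:
    scores = cross_val_score(SVR(kernel="rbf"), X, target, cv=cv, scoring="r2")
    print(name, "R2:", scores.mean())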

Quantitative Result

- SVR (nonlinear) performs the best
- Feature selection with the RReliefF algorithm offers a gain
  - Valence: 0.254
  - Arousal: 0.609
- Valence is more difficult to model (it is more subjective)
  - Typical ranges: valence 0.25 - 0.35; arousal 0.60 - 0.85

Method                                      R² of valence   R² of arousal
Linear regression                           0.109           0.568
AdaBoost.RT [ijcnn04]                       0.117           0.553
SVR (support vector regression) [sc04]      0.222           0.570
SVR + RReliefF (feature selection) [ml03]   0.254           0.609

10/5/2012 114

Page 20: Ismir2012 tutorial2

Qualitative Result

Example songs placed in the VA space:
- No No No Part 2 - Beyonce
- All Of Me - 50 Cent
- New York Giants - Big Pun
- Why Do I Have To Choose - Willie Nelson
- The Last Resort - The Eagles
- Mammas Don't Let Your Babies Grow Up To Be Cowboys - Willie Nelson
- Live For The One I Love - Celine Dion
- If Only In The Heaven's Eyes - NSYNC
- I've Got To See You Again - Norah Jones
- Bodies - Sex Pistols
- You're Crazy - Guns N' Roses
- Out Ta Get Me - Guns N' Roses

10/5/2012 115

Music Retrieval in the VA Space

- Provides a simple means for a 2D user interface
  - Pick a point
  - Draw a trajectory
- Useful for mobile devices with small display space
- (Demo)

Yang, Y.-H., Lin, Y.-C., Cheng, H.-T., and Chen, H.-H. (2008). Mr. Emo: Music retrieval in the emotion plane, Proc. ACM Multimedia.

10/5/2012 116

How to Further Improve the Accuracy

- Larger datasets
- Use higher-level features
  - Articulation, pitch contour, melody direction, tonality (Gabrielsson and Lindström, 2010)
  - High-level music concepts (tags)
  - Lyrics
- Better understanding of human perception of valence
  - Consider the correlation between arousal and valence
    - Output-associative relevance vector machine (Nicolaou et al., 2012)
- Exploit the temporal information
  - Long short-term memory neural networks (Weninger et al., 2011), or HMMs (hidden Markov models)

10/5/2012 117

Issue 1: Difficulty of Emotion Annotation

- Rating emotion is difficult
  - User fatigue
  - Uniform ratings
  - Low quality of the ground truth
  - Difficult to create large-scale datasets

(Figure: the regression framework with manual annotation, showing ratings from user A and user B)

10/5/2012 118

AnnoEmo: GUI for Emotion Rating

- Encourages differentiation
  - Click to listen again
  - Drag & drop to modify an annotation
- (Demo)

Yang, Y.-H., Su, Y.-F., Lin, Y.-C., and Chen, H.-H. (2007). Music emotion recognition: The role of individuality, Proc. Int. Workshop on Human-centered Multimedia.

10/5/2012 119

Sol: Ranking Instead of Rating

� Determines the position of a song

� By the relative ranking with respect to other songs

� Strength

� Ranking is easier than rating

� Encourages differentiation

� Avoids inconsistency

� Enhances the quality of the ground truth

Oh Happy Day

I Want to Hold Your Hand by Beatles

I Feel Good by James Brown

What a Wonderful World by Louis Armstrong

Into the Woods by My Morning Jacket

The Christmas Song

C'est La Vie

Labita by Lisa Ono

Just the Way You Are by Billy Joel

Perfect Day by Lou Reed

When a Man Loves a Woman by Michael Bolton

Smells Like Teen Spirit by Nirvana

relative ranking

10/5/2012 120

Page 21: Ismir2012 tutorial2

Ranking-Based Emotion Annotation

- Emotion tournament
  - Requires only n−1 pairwise comparisons (“Which song is more positive?”)
  - The global ordering can later be approximated by a greedy algorithm [jair99]
- Use machine learning (“learning-to-rank”) to train a model that ranks songs according to emotion

(Figure: tournament bracket over songs a-h)

Yang, Y.-H. and Chen, H.-H. (2011). Ranking-based emotion recognition for music organization and retrieval, IEEE TASLP 19(4).

10/5/2012 121
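
One common way to learn from such pairwise judgments is the RankSVM-style reduction to binary classification on feature differences. The sketch below uses that reduction as a stand-in (it is not necessarily the exact algorithm of Yang & Chen, 2011), with random placeholder features and invented comparisons.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((8, 10))                       # features of songs a..h (placeholders)
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (0, 2), (4, 6), (0, 4)]  # (winner, loser)

# Each comparison becomes two signed difference vectors
diffs = np.vstack([X[i] - X[j] for i, j in pairs] + [X[j] - X[i] for i, j in pairs])
labels = np.array([1] * len(pairs) + [-1] * len(pairs))

ranker = LinearSVC(C=1.0).fit(diffs, labels)
scores = X @ ranker.coef_.ravel()             # higher score = predicted more positive
print(np.argsort(-scores))                    # songs ordered from most to least positive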

Issue 2: Emotion Perception is Subjective

� A song can be perceived differently by two people, especially for emotions in the 3rd and 4th quadrants

� This also explains why valence prediction is more challenging

(a) Smells Like Teen Spirit

(b) A Whole New World

(c) The Rose (d) Tell Laura I Love Her

10/5/2012 122

Subjectivity of Emotion Perception

Each circle represents the emotion annotation for a music piece by a subject

10/5/2012 123

Sol: Personalized MER

- Needs some emotion annotations from the target user to train a personalized model
- Not well studied so far

(Framework: the general regressor is adapted with the target user’s interaction / user feedback for personalization)

10/5/2012 124

User Feedback: Personal Annotation

- Green cross: “universal” annotation
  - Obtained from a group of annotators
  - The blue ellipse shows the STD
- Red cross: “personal” annotation

10/5/2012 125

Evaluation of Personalized Prediction

Method / |Ф|=5 / |Ф|=10 / |Ф|=20 / |Ф|=30:
- General regressor: 0.1630, 0.1635, 0.1645, 0.1639
- Personalized regressor: 0.1632, 0.1671, 0.1768*, 0.1839*

* Significant improvement over the general method (p < 0.01)

Yang, Y.-H. et al. (2009). Personalized music emotion recognition, ACM SIGIR.

10/5/2012 126

Page 22: Ismir2012 tutorial2

Issue 3: Context of Music Listening

� Listening mood/context

� Familiarity/associated memory

� Preference of the singer/performer/song

� Social context (alone, with friends, with strangers)

10/5/2012 127

Issue 4: Usability of UI 1/2

- A new user may not be familiar with the meaning of valence and arousal => display mood tags

Wang, J.-C. et al. (2012). Exploring the relationship between categorical and dimensional emotion semantics of music, Proc. MIRUM.

10/5/2012 128-129

Issue 4: Usability of UI 2/2

- Issues
  - How to automatically map emotion words to the VA space
  - How to “personalize” the above mapping
    - Different people may have different interpretations of words
  - Let users determine which emotion words are displayed
    - An emotion word can be used as a shortcut for organizing the user’s music collection over the VA space
- Some preliminary work has been reported (Wang et al., 2012)

Wang, J.-C., Yang, Y.-H., Chang, K.-C., Wang, H.-M., and Jeng, S.-K. (2012). Exploring the relationship between categorical and dimensional emotion semantics of music, Proc. MIRUM.

10/5/2012 130

Summary of Dimensional Approach

� The dimensional approach as an alternative to the categorical approach

� Free of ambiguity, granularity issues

� A ready canvas for visualization, retrieval, and browsing

� Easy to track emotion variation within a music piece

� Regression-based computational method

� Issues

� Replacing the difficult task of emotion rating by ranking

� Address individual difference by personalization

� Need to take user context into account

� Enhance usability by displaying mood labels

10/5/2012 131

Agenda

� Grand challenges on music affect

� Music affect taxonomy and annotation

� Automatic Music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 132

Page 23: Ismir2012 tutorial2

Temporal Emotion Variation

- Emotion may change as a musical piece unfolds
  - The chorus is usually of greater emotional intensity than the verse
- Failing to account for the dynamic (time-varying) nature of music emotion may limit the performance of MER
- The combination of ‘anticipation’ and ‘surprise’ is important for our enjoyment of music (Huron, 2006)

10/5/2012 133

The Temporal Aspect is Usually Neglected

- A 10-30 second segment is often used to represent the whole song
  - To reduce the emotion variation within the segment
  - To lessen the burden of emotion annotation on the subjects
- The segment is selected
  - Manually, by picking the most representative part
  - By identifying the chorus section automatically
  - By selecting the middle 30 seconds
  - By selecting the [30, 60] second segment
- Short-time features are pooled (e.g., by taking mean and STD) over the whole segment, leading to a segment-level feature vector

10/5/2012 134

Comparison

Music emotion recognition:
- Emotion annotation: segment-level (10-30 sec)
- Feature extraction: segment-level vector or bag-of-frames
- Model training: segment-level
- Prediction: a segment-level, static estimate that summarizes the emotion of the whole song

Music emotion tracking:
- Emotion annotation: second-level (1-3 sec)
- Feature extraction: second-level vector
- Model training: second-level; makes use of the temporal relationship
- Prediction: moment-to-moment “continuous” emotion variation within a song

10/5/2012 135

Continuous Response

� Continuous-response measurement captures moment-to-moment responses during listening

� In contrast, post-performance response assumes that music emotion can be understood by collecting affective responses after a musical stimulus has been sounded

� Post-performance ratings were generally higher (close to the “peak” or “end” experience, rather than the averaged one)

� Post-performance experience is information poor; listeners are forced to “compress” their response by giving an overall impression

10/5/2012 136

Issues

� The emotion of popular music may not change very much or often (Schmidt and Kim, 2008)

� Collecting moment-to-moment responses is interruptive and labor-intensive

� Gathering responses along 2 or more scales in the same pass may incur excessive cognitive load on the participant

� Lag structure

� There is a variable “reaction lag” between events and subject responses (Schubert 2001)

� Arousal responses follow the path of loudness with a delay of 2 to 4 seconds (Schubert 2004)

10/5/2012 137

Annotation Examples

- MoodSwings Turk Dataset (Schmidt and Kim, 2011)
  - 240 15-second clips annotated with Mechanical Turk
  - Examples: Live – “Waitress”; Billy Joel – “Captain Jack”

10/5/2012 138

Page 24: Ismir2012 tutorial2

Computational Approaches 1/2

- The VA approach: predict the emotion values for every one or two seconds
- Correspondingly, participants are asked to rate the VA values every one or two seconds
  - It is easier to track the continuous changes of emotional expression by rating VA values than by labeling mood classes (Gabrielsson, 2002)
- Algorithms
  - Regression (neglects temporal info)
  - Time-series analysis (makes use of temporal info) (Schubert, 1999; Korhonen et al., 2006; Schmidt and Kim, 2010)

10/5/2012 139

Time-Series Analysis

- Autoregression with extra inputs (ARX): learn A_k and B_k

  y(t) = Σ_k A_k y(t−k) + Σ_k B_k u(t−k) + e(t)

  where y(t) is the valence (arousal) at time t, y(t−k) the valence (arousal) at previous times, u(t−k) the music features at the current and lagged times, and e(t) the noise
- The A_k terms capture the correlation between the output values (V & A) y; the B_k terms capture the relationship between V, A and the input values u (music features)

Korhonen, M. D., Clausi, D. A., and Jernigan, M. E. (2006). Modeling emotional content of music using system identification, IEEE TSMC 36(3).

10/5/2012 140
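
A minimal least-squares sketch of fitting such an ARX model, assuming a single output (say arousal), a single input feature (say loudness), and only one lag of each; real studies use system-identification tools and more lags. All data below are synthetic.

import numpy as np

rng = np.random.default_rng(0)
u = rng.random(200)                                   # input feature over time (placeholder)
y = np.zeros(200)
for t in range(1, 200):                               # synthetic data following an ARX-like law
    y[t] = 0.8 * y[t - 1] + 0.3 * u[t] + 0.05 * rng.standard_normal()

# Build the regression  y(t) = a1*y(t-1) + b0*u(t) + b1*u(t-1) + e(t)
T = np.arange(1, 200)
Phi = np.column_stack([y[T - 1], u[T], u[T - 1]])
theta, *_ = np.linalg.lstsq(Phi, y[T], rcond=None)
print("estimated [a1, b0, b1]:", theta)               # should be close to [0.8, 0.3, ~0]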

Block-wise Prediction

- Make a prediction for each short-time segment
  - Fixed-length sliding window (Yang et al., 2006)
  - Adaptive window (Lu et al., 2006)
    - Segmentation: detect emotion-change boundaries, then make a prediction for each segment
    - Each segment contains a constant mood
    - The minimum length of a segment is set to 16 seconds for classical music (Lu et al., 2006)

10/5/2012 141
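
A sketch of the fixed-length sliding-window variant, assuming frame-level features and a pre-trained regressor with a scikit-learn-style predict() method (a dummy regressor is used here so the snippet runs on its own):

import numpy as np

def sliding_window_track(frame_features, regressor, win=30, hop=10):
    """frame_features: (n_frames, n_dims) short-time features; returns one
    (valence, arousal) estimate per window of `win` frames, advanced by `hop`."""
    preds = []
    for start in range(0, len(frame_features) - win + 1, hop):
        seg = frame_features[start:start + win]
        x = np.concatenate([seg.mean(axis=0), seg.std(axis=0)])  # segment-level vector
        preds.append(regressor.predict(x[None, :])[0])
    return np.array(preds)

class DummyRegressor:                      # stand-in for trained valence/arousal regressors
    def predict(self, X):
        return np.tile([0.1, 0.5], (len(X), 1))

print(sliding_window_track(np.random.rand(120, 8), DummyRegressor()).shape)  # (10, 2)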

Subjective Issue

- Listener ratings collected continuously in response to music are notoriously diverse (Upham and McAdams, 2009)

(Figure: thirty audience members’ ratings, in colour, of Experienced Emotional Intensity during a live performance of Mozart’s Overture to Le nozze di Figaro, and the average rating for this population in black; Upham and McAdams, 2009)

10/5/2012 142

Computational Approaches 2/2

- The VA-Gaussian approach: model the emotion expression at each time instance as a Gaussian distribution to take the subjectivity issue into account
  - Predict the time-dependent distribution parameters N(μ, Σ)
- Algorithms
  - Regression on the five parameters (Yang and Chen, 2011)
  - Kalman filter (Schmidt and Kim, 2010)
  - Acoustic Emotion Gaussians (Wang et al., 2012)
- Example: “American Pie” (Don McLean)

Wang et al. (2012). “The Acoustic Emotion Gaussians model for emotion-based music annotation and retrieval,” Proc. ACM MM.

10/5/2012 143

Application of Emotion Tracking

- Specify a trajectory to indicate the desired emotion variation within a musical piece

Hanjalic, A. (2006). Extracting moods from pictures and sounds: Towards truly personalized TV, IEEE Signal Processing Magazine.

10/5/2012 144

Page 25: Ismir2012 tutorial2

Summary of Temporal Approach

- The emotion of music changes as time unfolds
  - Post-performance rating ≠ averaged continuous response
- Computational methods for
  - Tracking the second-by-second emotion variation as a trajectory
  - Modeling emotion as a stochastic Gaussian distribution
- Issues
  - Difficulty of gathering reliable annotations (ranking is not useful here)
  - Lag structure in emotion perception
  - The effect of expectation and surprise
  - Subjective differences

(Book: Sweet Anticipation by David Huron)

10/5/2012 145

Agenda

� Grand challenges on music affect

� Music affect taxonomy and annotation

� Automatic Music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 146

Emotion in Images

� The International Affective Picture System (IAPS)

10/5/2012 147

Emotions in Videos

� Need to depict the emotion variation along time (Zhang et al., 2009)

10/5/2012 148

Emotion Recognition in Image/Video

� Image emotion recognition

� Categorical approach

� Dimensional approach

� Video emotion recognition

� Temporal approach

� Makes use of dynamic attributes such as motion intensity and shot change rate

10/5/2012 149

Visual Features

10/5/2012 150

Page 26: Ismir2012 tutorial2

Categorical Approach

� The “categorical” approach

� Horror event detection in videos (Moncrieff et al., 2001)

� Affect detection in movies (Kang, 2002)

� Colors 'yellow', 'orange' and 'red' correspond to the 'fear' & 'anger'

� Color 'blue', 'violet' and 'green' can be found when the spectator feels 'high valence' and 'low arousal'

Class startle apprehension surprise climax

Precision 63% 89% 93% 80%

Class fear sadness joy

Precision 81% 77% 78%

10/5/2012 151

Dimensional/Temporal Approach

� Valence

� Lighting key

� Color (saturation, color energy)

� Rhythm regularity

� Pitch

� Arousal

� Shot change rate

� Motion Intensity

� Sound energy

� Rhythm-based features

� Tempo and beat strength

10/5/2012 152

Application: Affective Video Recommendation

� iMTV (Zhang et al., 2010)

10/5/2012 153

Application: Video Content Representation

(Hanjalic and Xu, 2005)

10/5/2012 154

Explicit (Self-Report) vs. Implicit Tagging

� Focus on felt emotion

10/5/2012 155

Bio-Sensors and Features

� Facial expression

� Blood volume pulse

� Respiration pattern

� Skin temperature

� Skin conductance

� EEG (ElectroEncephaloGram)

� EMG (ElectroMyoGram)

10/5/2012 156

Page 27: Ismir2012 tutorial2

The DEAP Dataset 1/2

- A music video recommendation scenario: the user’s bodily responses (EEG / peripheral signals) are used to infer emotion and, together with the user’s taste, to recommend music videos
- Both explicit and implicit tagging

Koelstra et al. (2012). “DEAP: A Database for Emotion Analysis using Physiological Signals,” IEEE Trans. Affective Computing.

10/5/2012 157

The DEAP Dataset 2/2

- 40 one-minute music videos, 32 participants
- Recorded signals: EEG, peripheral physiological signals, face video
- Ratings: valence, arousal, dominance, liking, familiarity
- Features: multimedia content features, EEG features, physiological features
- Analyses: correlation and classification

10/5/2012 158

Correlation of EEG and Rating

- Correlations between EEG band power (Theta, Alpha, Beta, Gamma) and the ratings

Koelstra et al., “DEAP: A Database for Emotion Analysis using Physiological Signals”

10/5/2012 159

Summary of Visual Emotion Recognition

- Commonalities
  - Similar emotion models and computational methods (categorical, dimensional, temporal)
  - Similar audio features
  - Similar results (valence is more difficult than arousal)
- Differences
  - Prefer the temporal approach
  - More focus on highlight extraction
  - More focus on the “implicit” tagging of emotion, especially the study of physiological signals

10/5/2012 160

Agenda

� Grand challenges on music affect

� Affect categories and labels

� Music affect analysis

� Categorical approach

� Multimodal approach

� Dimensional approach

� Temporal approach

� Beyond music

� Conclusion

10/5/2012 161

Related Books

Sweet Anticipation: Music and the Psychology of Expectation

The Oxford Handbook of Music Psychology

Music Emotion Recognition

Handbook of Music and Emotion: Theory, Research, Applications

10/5/2012 162

Page 28: Ismir2012 tutorial2

Recent Survey

� Y. E. Kim et al. (2010), “Music emotion recognition: A state of the art review,” in Proc. ISMIR.

� Y.-H. Yang and H.-H. Chen (2012) “Machine recognition of music emotion: A review,” ACM Trans. Intel. Systems & Technology, 3(4)

� M. Barthet et al. (2012) “Multidisciplinary perspectives on music emotion recognition: Implications for content and context-based models,” in Proc. CMMR

� Z. Zeng et al. (2009) “A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions,” IEEE Trans. Pattern Anal. & Machine Intel., 31(1)

10/5/2012 163

Conclusion

10/5/2012 164