1 wp3 speech and emotion (analysis & recognition) human language technologies

1

WP3 speech and emotion (analysis & recognition)

Human Language Technologies

2

Databases and Annotations

3

UERLN: SYMPAFLY

Fully automatic speech dialogue telephone system for flight reservation and booking, different system stages; 270 Dialogues.

• Annotations: word-based emotional user states, prosodic and conversational peculiarities; dialogue (step) success; emotional user states distribution follows nested Pareto (80/20) principle

4

UERLN: AIBO

Children's interaction (age 10-12, 51 children, 9.2 hours of speech) with SONY’s AIBO robot, Wizard-of-Oz-scenario; cf. WP5 (plus English and read speech)

• Annotations: word-based emotional user states (holistic, 5 labellers) and prosodic peculiarities; alignment of children's utterances with AIBO's actions; manual correction of F0, labelling of voice quality. Emotional user states for the English data.

5

AIBO disobedient: from motherese to angry

g'radeaus Aibolein ja M fein M gut M machst M du M *da M | *tz l"aufst du mal bitte nach links | stopp E Aibo stopp | nach links E umdrehen | nein M <*ne> nein M <*ne> nein M <*ne> so M weit M *simma M noch M nicht M aufstehen M Schlafm"utze M komm M hoch M | ja M so M ist M es M <*is> guter M Hund M lauf mal jetzt nach links | nach links Aibo | Aibolein M aufstehen M *son M sonst M werd' M ich M b"ose M hoch E | nach A links A | Aibo A nach A links A | Aibolein A ganz A b"oser A Hund A jetzt A stehst A du A auf A | hoch A | dreh dich ein bisschen | ja M so ist es <*is> gut stopp Aibo stopp | *tz lauf g'radeaus |

6

UERLN: Different Conceptualizations

Remote control tool:

Aibo straight on stop Aibo stop turn round to the left Aibo get up turn round to the left Aibo get up turn round, to the left Aibo get up get up Aibo now go left now straight on Aibo st´ straight on

Pet dog:

Straight on little Aibo ok great You're doing fine now please to the left stop Aibo stop turn to the left no no no we aren't that far yet get up sleepyhead get up yes that's a good dog now go left left Aibo little Aibo get up else I'm getting angry get up Aibo left little Aibo bad boy now get up turn a little ok that's fine stop Aibo stop straight on

7

ITC: Targhe

Fully automatic speech dialogue telephone system
• 15.6 hours of Italian natural speech
• 9444 files (turns) -> 450 emotionally rich

Word-level
• Orthographic transcription and word segmentation
• Prosodic peculiarities annotated

Turn-level
• Holistic emotion labels

Sympafly (cf. UERLN) for comparison and benchmarking

8

UKA: LDC2002S28

Elicited emotional speech database; native American English

• labels: 1 of 15 holistic speaker states per utterance; used in algorithm and feature set development

9

UKA: ISL Meeting Corpus

18 recordings of multi-party (mean 5.1 participants) meetings; mean 35 minute duration; American English

• Annotations: orthographic transcription; Verbmobil II, and discourse-level annotations.

10

Assessment of Data Collection:

• focus on
  • spontaneous, realistic data
  • important/new types of dialogues/interaction
  • evaluation of annotations

• considerable percentage of realistic (processed and available) databases world-wide

11

Features & Classification

12

UERLN: Features

• large feature vector for a context of 2 words:
  • 95 prosodic (duration, energy, F0, pauses)
  • 80 spectral (HNR, formant based frequencies and energy)
  • 24 MFCC
  • 30 POS

• Language Models & dialogue based features

13

ITC: Features

Baseline feature set
• 96 features
• Based on energy, duration, and pitch

Final feature set
• 273 features (many redundant)
• Based on energy, duration, pitch, and pauses
• Different pitch extractors tried
  • Normalized Cross Correlation
  • Weighted Auto Correlation
  • UERLN PDA
• Different subsets compared
• Different tests to reduce the feature space
  • Principal component analysis
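Principal component analysis on a redundant feature set can be sketched as follows. This is a minimal illustration only, not the ITC implementation: the function name `pca_reduce`, the synthetic data, and the choice of 10 components are assumptions made for the example. Only the figure of 273 features comes from the slide.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature vectors onto their top principal components."""
    # Center each feature column so the components describe variance, not offset.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]

# Synthetic stand-in data (assumption): 200 turns, 273 features that are
# noisy linear mixtures of only 10 underlying factors, i.e. highly redundant.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))
mixing = rng.normal(size=(10, 273))
features = latent @ mixing + 0.01 * rng.normal(size=(200, 273))

reduced, explained = pca_reduce(features, n_components=10)
print(reduced.shape, round(float(explained.sum()), 4))
```

When many features are redundant, a small number of components carries almost all of the variance, which is the motivation for trying this kind of reduction before classification.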

14

UKA: 133 Acoustic Features

• pitch, voiced/unvoiced energy, quartiles (15)
• voice quality, Praat metrics (11)
• harmonicity, quartiles (5) and Praat metrics (3)
• zero-crossing rate vs energy, histogram (20)
• correlation/regression, coefficients (36)
• vocal tract volume, quartiles (25)
• duration/timing, Verbmobil features (18)

15

Classifiers

UERLN: Linear Discriminant Analysis (LDA), Decision Trees (CARTs), Neural Networks (NN), Support Vector Machines (SVM), Gaussian Mixtures (GM), Language Models (LM)

ITC: Decision Trees (CARTs), Neural Networks (NN)

UKA: Linear, Neural Networks (NN), Support Vector Machines (SVM)

16

UERLN classification I: SympaFly

GM/NN, 2 classes, neutral vs. problem, l≠t

dialogue step success, 2 classes, SVM: CL 82.5
dialogue success, 2 classes, CART: CL 85.4

combination   CL     RR
Pros.+MFCC    74.4   74.2
HNR+Pros      74.8   76.0
HNR+MFCC      70.4   69.8

RR: overall rec. rate
CL: class-wise averaged rec. rate

LDA, 4 classes

SVM/CART, 2 classes, loo
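The two measures reported on these slides, RR (overall recognition rate) and CL (class-wise averaged recognition rate), can be computed from a confusion matrix as in the sketch below. The function name `recognition_rates` and the toy counts are assumptions for illustration; they are not results from the project.

```python
import numpy as np

def recognition_rates(confusion):
    """Return (RR, CL) in percent.

    confusion[i, j] counts items of reference class i classified as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    # RR: all correctly classified items divided by all items.
    rr = 100.0 * correct.sum() / confusion.sum()
    # CL: recall computed per reference class, then averaged with equal weight.
    per_class_recall = correct / confusion.sum(axis=1)
    cl = 100.0 * per_class_recall.mean()
    return rr, cl

# Made-up two-class example: 900 neutral items and 100 problem items.
toy = [[810, 90],
       [40, 60]]
rr, cl = recognition_rates(toy)
print(rr, cl)
```

With skewed class distributions, RR is dominated by the frequent class, whereas CL weights every class equally, so reporting both shows whether a classifier simply favours the majority class.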

17

UERLN classification II: AIBO

features           CL
pros/POS           59.7
pros/POS, opt.     63.2
MFCC, frames       45.4
MFCC, words        58.3
pros/POS + MFCC    65.3

4 classes "AMEN", NN

Labels: joyful; surprised; motherese; neutral (default); rest (non-neutral); bored; helpless, hesitant; emphatic; touchy (=irritated); angry; reprimanding

18

ITC Classification II:

Final feature set
• 273 (acoustic/temporal) features
• 2 class problem (neutral and non neutral)

Classifier   CART                 Neural Networks
Database     Targhe    Sympafly   Targhe    Sympafly
RR           73.2%     73.9%      74.2%     73.5%
CL           70.7%     72.1%      69.4%     74.1%

RR = overall rec. rate; CL = class-wise averaged rec. rate
N = neutral turns; NN = Non neutral turns

19

UKA Classification II:

133 utterance-level prosodic features, 15 classes, acted speech, 8 speakers:

Task        Classifier   Feat Selection   CL
spk-indep   linear       none             19.0%
spk-indep   linear       spk-indep        21.3%
spk-indep   linear       spk-dep          31.3%
spk-dep     linear       none             38.7%
spk-dep     SVM          none             53.0%
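The difference between the speaker-independent and speaker-dependent tasks lies in how training and test data are separated. The slides do not describe the exact protocol, so the following is only a sketch of one common setup: leave-one-speaker-out folds for the speaker-independent case, and a within-speaker split for the speaker-dependent case. The function names and the 20 utterances per speaker are assumptions.

```python
import numpy as np

def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_idx, test_idx); the test speaker is never seen in training."""
    speaker_ids = np.asarray(speaker_ids)
    for speaker in np.unique(speaker_ids):
        test_mask = speaker_ids == speaker
        yield speaker, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def within_speaker_split(speaker_ids, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with every speaker on both sides.

    Assumes each speaker contributes at least two utterances.
    """
    speaker_ids = np.asarray(speaker_ids)
    rng = np.random.default_rng(seed)
    train, test = [], []
    for speaker in np.unique(speaker_ids):
        idx = rng.permutation(np.flatnonzero(speaker_ids == speaker))
        n_test = max(1, int(round(test_fraction * len(idx))))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.sort(np.array(train)), np.sort(np.array(test))

# Hypothetical corpus layout: 8 speakers with 20 utterances each.
speakers = np.repeat(np.arange(8), 20)
folds = list(leave_one_speaker_out(speakers))
dep_train, dep_test = within_speaker_split(speakers)
print(len(folds), len(dep_train), len(dep_test))
```

In the speaker-independent folds the classifier has to generalise to an unseen voice, which is consistent with the lower CL values in the table above.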

20

Assessment of Features

• a pool of many different features/feature groups implemented/compared
• prosodic features better (more consistent) than "spectral" features in realistic speech
• combination of knowledge sources improves performance
• relevance of single features (feature classes)?

21

Assessment of Classifications

• not much difference between different classifiers in classification performance (linear classifiers highly competitive in speaker-independent classification)
• large differences between speaker-dependent and speaker-independent classification

22

Categories & Dimensions

cf. also tomorrow

23

UKA: Meeting Annotation

Meeting audio appears to be rich in non-neutral speech.

[Figure: bar chart over meeting types (project, work, game, discuss, chat), one series per labeler (Labeler 1, Labeler 2, Labeler 3); value axis from 0 to 70]

Open-set holistic labeling of 5 meetings by 3 labellers

24

UKA: towards new Dimensions for Social Interaction in Meetings denoting conflict, building community, or skepticism etc.

[Figure: two-dimensional label space. Horizontal axis (IMAGE PROMOTION; support, self to group): self at expense of group; self more than group; no bias; group more than self; group at expense of self. Vertical axis (power): weak to strong. Labels placed in the space: resolve/strength, grateful, doubt/weakness, insecure, ego-building, conflict-diffusing, giving up, skeptical, demanding, encouraging/comforting, advocating, directing/leading, ignoring/interrupting, collegial-conflict, hostile-conflict, acceding, community-building]

25

Assessment of Categories & Dimensions

New categories, new dimensions, new consistency measure

• prototypical "full-blown" emotions are rare
• labels depend on type of data (call center, human-robot, different types of multi-party meeting)
• new dimensions that do not model emotions but interaction between participants in communication
• new entropy based consistency measure
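The slides do not define the entropy based consistency measure. As a generic illustration of the idea only, and not the project's measure, the sketch below scores one item by the Shannon entropy of the labels that several annotators assigned to it, normalised by the maximum entropy for the number of categories. The function names and the normalisation are assumptions.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy in bits of the labels several annotators gave one item."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def consistency(labels, n_categories):
    """1.0 when all annotators agree; values near 0.0 when labels are spread almost evenly."""
    if n_categories < 2:
        raise ValueError("need at least two categories")
    return 1.0 - label_entropy(labels) / math.log2(n_categories)

# Hypothetical labels from five annotators for one word, using the four
# AMEN cover classes (A, M, E, N) as the category inventory.
print(consistency(["A", "A", "A", "A", "A"], n_categories=4))
print(consistency(["A", "A", "A", "E", "N"], n_categories=4))
```

Unlike a simple majority vote, an entropy-style score distinguishes a 3-2 split between two labels from a 3-1-1 split over three labels, which is one reason such measures are attractive for holistic labelling with several annotators.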

26

Thank you for your attention
