WP3 speech and emotion (analysis & recognition), human language technologies
1
WP3 speech and emotion (analysis & recognition)
human language technologies
2
Databases and Annotations
3
UERLN: SYMPAFLY
Fully automatic speech dialogue telephone system for flight reservation and booking, different system stages; 270 Dialogues.
• Annotations: word-based emotional user states, prosodic and conversational peculiarities; dialogue (step) success; emotional user states distribution follows nested Pareto (80/20) principle
4
UERLN: AIBO
Children's interaction (age 10-12, 51 children, 9.2 hours of speech) with SONY’s AIBO robot, Wizard-of-Oz-scenario; cf. WP5 (plus English and read speech)
• Annotations: word-based emotional user states (holistic, 5 labellers) and prosodic peculiarities; alignment of children's utterances with AIBO's actions; manual correction of F0, labelling of voice quality. Emotional user states for the English data.
5
AIBO disobedient: from motherese to angry
g'radeaus Aibolein ja M fein M gut M machst M du M *da M | *tz l"aufst du mal bitte nach links | stopp E Aibo stopp | nach links E umdrehen | nein M <*ne> nein M <*ne> nein M <*ne> so M weit M *simma M noch M nicht M aufstehen M Schlafm"utze M komm M hoch M | ja M so M ist M es M <*is> guter M Hund M lauf mal jetzt nach links | nach links Aibo | Aibolein M aufstehen M *son M sonst M werd' M ich M b"ose M hoch E | nach A links A | Aibo A nach A links A | Aibolein A ganz A b"oser A Hund A jetzt A stehst A du A auf A | hoch A | dreh dich ein bisschen | ja M so ist es <*is> gut stopp Aibo stopp | *tz lauf g'radeaus |
6
UERLN: Different Conceptualizations
Remote control tool:
Aibo straight on stop Aibo stop turn round to the left Aibo get up turn round to the left Aibo get up turn round, to the left Aibo get up get up Aibo now go left now straight on Aibo st' straight on
Pet dog:
Straight on little Aibo ok great You're doing fine now please to the left stop Aibo stop turn to the left no no no we aren't that far yet get up sleepyhead get up yes that's a good dog now go left left Aibo little Aibo get up else I'm getting angry get up Aibo left little Aibo bad boy now get up turn a little ok that's fine stop Aibo stop straight on
7
ITC: Targhe
Fully automatic speech dialogue telephone system
• 15.6 hours of Italian natural speech
• 9444 files (turns) -> 450 emotionally rich
Word-level
• Orthographic transcription and word segmentation
• Prosodic peculiarities annotated
Turn-level
• Holistic emotion labels
Sympafly (cf. UERLN) for comparison and benchmarking
8
UKA: LDC2002S28
Elicited emotional speech database; native American English
• labels: 1 of 15 holistic speaker states per utterance; used in algorithm and feature set development
9
UKA: ISL Meeting Corpus
18 recordings of multi-party (mean 5.1 participants) meetings; mean 35 minute duration; American English
• Annotations: orthographic transcription; Verbmobil II, and discourse-level annotations.
10
Assessment of Data Collection:
• focus on
  • spontaneous, realistic data
  • important/new types of dialogues/interaction
  • evaluation of annotations
• considerable percentage of realistic (processed and available) databases world-wide
11
Features & Classification
12
UERLN: Features
• large feature vector for a context of 2 words:
  • 95 prosodic (duration, energy, F0, pauses)
  • 80 spectral (HNR, formant based frequencies and energy)
  • 24 MFCC
  • 30 POS
• Language Models & dialogue based features
13
ITC: Features
Baseline feature set
• 96 features
• Based on energy, duration, and pitch
Final feature set
• 273 features (many redundant)
• Based on energy, duration, pitch, and pauses
• Different pitch extractors tried
  • Normalized Cross Correlation
  • Weighted Auto Correlation
  • UERLN PDA
• Different subsets compared
• Different tests to reduce the feature space
  • Principal component analysis
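The slide only names the pitch extractors that were tried. As a rough illustration of the normalized cross-correlation idea, a minimal sketch follows. It is not the ITC implementation: the function name `nccf_f0`, the F0 search range, and the synthetic test frame are assumptions made for this example.

```python
import math

def nccf_f0(samples, sample_rate, f0_min=125.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame via normalized cross-correlation.

    Returns (f0, score); f0 is None for a silent frame.
    The F0 search range is an assumption for this sketch.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    window = len(samples) - max_lag  # number of samples compared at every lag
    if window <= 0:
        raise ValueError("frame too short for the requested F0 range")
    head = samples[:window]
    head_energy = sum(x * x for x in head)
    best_lag, best_score = None, -1.0
    for lag in range(min_lag, max_lag + 1):
        shifted = samples[lag:lag + window]
        energy = head_energy * sum(x * x for x in shifted)
        if energy == 0.0:
            continue  # silent frame: correlation is undefined
        # Correlation of the frame with its shifted copy, normalized by energy.
        score = sum(a * b for a, b in zip(head, shifted)) / math.sqrt(energy)
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        return None, 0.0
    return sample_rate / best_lag, best_score

# Synthetic 200 Hz sine sampled at 8 kHz (invented test signal).
fs = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / fs) for n in range(304)]
f0, score = nccf_f0(frame, fs)
print(f"F0 = {f0:.1f} Hz, score = {score:.3f}")
```

Real speech would additionally need voicing decisions and protection against octave errors, which this sketch leaves out.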
14
UKA: 133 Acoustic Features
• pitch, voiced/unvoiced energy, quartiles (15)
• voice quality, Praat metrics (11)
• harmonicity, quartiles (5) and Praat metrics (3)
• zero-crossing rate vs energy, histogram (20)
• correlation/regression, coefficients (36)
• vocal tract volume, quartiles (25)
• duration/timing, Verbmobil features (18)
15
Classifiers
UERLN: Linear Discriminant Analysis (LDA), Decision Trees (CARTs), Neural Networks (NN), Support Vector Machines (SVM), Gaussian Mixtures (GM), Language Models (LM)
ITC: Decision Trees (CARTs), Neural Networks (NN)
UKA: Linear, Neural Networks (NN), Support Vector Machines (SVM)
16
UERLN classification I: SympaFly
GM/NN, 2 classes, neutral vs. problem, l≠t
dialogue step success, 2 classes, SVM: CL 82.5
dialogue success, 2 classes, CART: CL 85.4
combination CL RR
Pros.+MFCC: 74.4 74.2
HNR+Pros: 74.8 76.0
HNR+MFCC: 70.4 69.8
RR: overall rec. rate
CL: class-wise averaged rec. rate
LDA, 4 classes
SVM/CART, 2 classes, loo
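The two figures of merit on this slide, RR (overall recognition rate) and CL (class-wise averaged recognition rate), diverge when the classes are unbalanced. A minimal sketch of how both can be computed, assuming plain lists of reference and predicted labels; the function name `recognition_rates` and the toy data are invented for illustration and do not come from the slides:

```python
from collections import defaultdict

def recognition_rates(reference, predicted):
    """Return (RR, CL) in percent for two equal-length label sequences."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    total = defaultdict(int)    # number of items per reference class
    correct = defaultdict(int)  # correctly recognized items per reference class
    for ref, hyp in zip(reference, predicted):
        total[ref] += 1
        if ref == hyp:
            correct[ref] += 1
    # RR: share of all items that were recognized correctly.
    rr = 100.0 * sum(correct.values()) / len(reference)
    # CL: unweighted mean of the per-class recognition rates.
    cl = 100.0 * sum(correct[c] / total[c] for c in total) / len(total)
    return rr, cl

# Toy data (invented): 8 neutral turns, 2 problem turns.
reference = ["neutral"] * 8 + ["problem"] * 2
predicted = ["neutral"] * 8 + ["neutral", "problem"]
rr, cl = recognition_rates(reference, predicted)
print(f"RR = {rr:.1f}, CL = {cl:.1f}")  # RR = 90.0, CL = 75.0
```

In the toy data, the frequent class dominates RR, while CL gives the rare "problem" class the same weight as "neutral".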
17
UERLN classification II: AIBO
features CL
pros/POS 59.7
pros. /POS, opt. 63.2
MFCC, frames 45.4
MFCC, words 58.3
pros/POS + MFCC 65.3
4 classes "AMEN", NN
Labels: joyful; surprised; motherese; neutral (default); rest (non-neutral); bored; helpless, hesitant; emphatic; touchy (=irritated); angry; reprimanding
18
ITC Classification II:
Final feature set
• 273 (acoustic/temporal) features
• 2 class problem (neutral and non neutral)

| Classifier | Database | RR | CL |
| --- | --- | --- | --- |
| CART | Targhe | 73.2% | 70.7% |
| CART | Sympafly | 73.9% | 72.1% |
| Neural Networks | Targhe | 74.2% | 69.4% |
| Neural Networks | Sympafly | 73.5% | 74.1% |

RR = overall rec. rate; CL = class-wise averaged rec. rate
N = neutral turns; NN = non-neutral turns
19
UKA Classification II:
133 utterance-level prosodic features, 15 classes, acted speech, 8 speakers:

| Task | Classifier | Feat. selection | CL |
| --- | --- | --- | --- |
| spk-indep | linear | none | 19.0% |
| spk-indep | linear | spk-indep | 21.3% |
| spk-indep | linear | spk-dep | 31.3% |
| spk-dep | linear | none | 38.7% |
| spk-dep | SVM | none | 53.0% |
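The slides do not say how the speaker-independent (spk-indep) runs were split. One common way to guarantee that no test speaker is seen in training is leave-one-speaker-out evaluation. A minimal sketch under that assumption; the function name `leave_one_speaker_out` and the speaker IDs are hypothetical:

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speaker_ids)):
        # Train on every utterance that is not from the held-out speaker.
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        # Test only on the held-out speaker's utterances.
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical speaker label for each utterance in a corpus.
speakers = ["spk1", "spk1", "spk2", "spk3", "spk3", "spk3"]
for held_out, train, test in leave_one_speaker_out(speakers):
    print(held_out, "train:", train, "test:", test)
```

A speaker-dependent setup would instead split each speaker's own utterances into training and test parts, which is typically the easier task.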
20
Assessment of Features
• a pool of many different features/feature groups implemented/compared
• prosodic features better (more consistent) than "spectral" features in realistic speech
• combination of knowledge sources improves performance
• relevance of single features (feature classes)?
21
Assessment of Classifications
• not much difference between different classifiers in classification performance (linear classifiers highly competitive in speaker-independent classification)
• large differences between speaker-dependent and speaker-independent classification
22
Categories & Dimensions
cf. also tomorrow
23
UKA: Meeting Annotation
Meeting audio appears to be rich in non-neutral speech.
[Figure: bar chart per meeting type (project, work, game, discuss, chat) for Labeler 1, Labeler 2, and Labeler 3; vertical axis from 0 to 70]
Open-set holistic labeling of 5 meetings by 3 labellers
24
UKA: towards new Dimensions for Social Interaction in Meetings, denoting conflict, building community, skepticism, etc.
[Figure: two-dimensional space of interaction labels]
Horizontal axis (IMAGE PROMOTION / support, self to group): self at expense of group; self more than group; no bias; group more than self; group at expense of self
Vertical axis (power): weak to strong
Labels placed in the space: resolve/strength, grateful, doubt/weakness, insecure, ego-building, conflict-diffusing, giving up, skeptical, demanding, encouraging/comforting, advocating, directing/leading, ignoring/interrupting, collegial-conflict, hostile-conflict, acceding, community-building
25
Assessment of Categories & Dimensions
New categories, new dimensions, new consistency measure
• prototypical "full-blown" emotions are rare
• labels depend on type of data (call center, human-robot, different types of multi-party meeting)
• new dimensions that do not model emotions but interaction between participants in communication
• new entropy based consistency measure
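The slides mention an entropy based consistency measure without defining it. Purely as an illustration of the general idea, and not the measure actually used in the project, the Shannon entropy of the labels that several labellers assign to the same item can serve as a disagreement score: 0 bits means all labellers agree, and higher values mean less consistency. The function names and the toy labels are assumptions made for this sketch:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy in bits of the labels assigned to one item."""
    if not labels:
        raise ValueError("need at least one label")
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

def mean_label_entropy(items):
    """Average per-item entropy over a list of label lists."""
    return sum(label_entropy(labels) for labels in items) / len(items)

# Toy annotations (invented): 5 labellers per word, labels A/M/E/N.
words = [
    ["N", "N", "N", "N", "N"],  # full agreement
    ["A", "A", "A", "E", "E"],  # partial agreement
    ["M", "M", "N", "N", "E"],  # weaker agreement
]
for labels in words:
    print(labels, f"{label_entropy(labels):.3f} bits")
print(f"mean: {mean_label_entropy(words):.3f} bits")
```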
26
Thank you for your attention