Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Kate Forbes-Riley and Diane Litman
Learning Research and Development Center and Computer Science Department, University of Pittsburgh


TRANSCRIPT

Page 1: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Kate Forbes-Riley and Diane Litman

Learning Research and Development Center and

Computer Science Department, University of Pittsburgh

Page 2: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Overview

Motivation

spoken dialogue tutoring systems

Emotion Annotation

positive, negative and neutral student states

Machine Learning Experiments

extract linguistic features from student speech

use different feature sets to predict emotions

best-performing feature set (speech & text, turn & context): 84.75% accuracy, 44% error reduction

Page 3: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Motivation

Bridge Learning Gap between Human Tutors and Computer Tutors

(Aist et al., 2002): Adding human-provided emotional scaffolding to a reading tutor increases student persistence

Our Approach: Add emotion prediction and adaptation to ITSPOKE, our Intelligent Tutoring SPOKEn dialogue system (demo paper)

Page 4: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources
Page 5: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Experimental Data

Human Tutoring Spoken Dialogue Corpus

128 dialogues (physics problems), 14 subjects

45 average student and tutor turns per dialogue

Same physics problems, subject pool, web interface, and experimental procedure as ITSPOKE

Page 6: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Emotion Annotation Scheme (Sigdial’04)

Perceived “Emotions”

Task- and Context-Relative

3 Main Emotion Classes:

negative neutral positive

3 Minor Emotion Classes:

weak negative, weak positive, mixed

Page 7: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Example Annotated Excerpt (weak, mixed -> neutral)

Tutor: Uh let us talk of one car first.

Student: ok. (EMOTION = NEUTRAL)

Tutor: If there is a car, what is it that exerts force on the car such that it accelerates forward?

Student: The engine. (EMOTION = POSITIVE)

Tutor: Uh well engine is part of the car, so how can it exert force on itself?

Student: um… (EMOTION = NEGATIVE)

Page 8: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Emotion Annotated Data

453 student turns, 10 dialogues, 9 subjects
2 annotators, 3 main emotion classes
385/453 agreed (84.99%, Kappa 0.68)

            Negative   Neutral   Positive
Negative         90         6          4
Neutral          23       280         30
Positive          0         5         15
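These agreement figures can be recomputed directly from the confusion matrix. A minimal sketch in Python, assuming the rows and columns are the two annotators' labels (which is what the 385 agreed turns on the diagonal imply):

```python
import numpy as np

# Inter-annotator confusion matrix from the slide
# (one annotator per axis; class order: negative, neutral, positive).
confusion = np.array([
    [90,   6,  4],
    [23, 280, 30],
    [ 0,   5, 15],
])

total = confusion.sum()                      # 453 student turns
observed = np.trace(confusion) / total       # 385/453, raw agreement

# Chance agreement from the row and column marginals.
row_marginals = confusion.sum(axis=1) / total
col_marginals = confusion.sum(axis=0) / total
expected = (row_marginals * col_marginals).sum()

kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2%}, kappa = {kappa:.2f}")   # ~84.99%, ~0.68
```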

Page 9: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Feature Extraction per Student Turn

Five feature types:

– acoustic-prosodic (1)
– non acoustic-prosodic
  • lexical (2)
  • other automatic (3)
  • manual (4)
– identifiers (5)

Research questions

– utility of different features
– speaker and task dependence

Page 10: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Feature Types (1)

Acoustic-Prosodic Features (normalized)

4 pitch (f0): max, min, mean, standard dev.

4 energy (RMS): max, min, mean, standard dev.

4 temporal:
  turn duration (seconds)
  pause length preceding turn (seconds)
  tempo (syllables/second)
  internal silence in turn (zero f0 frames)

available to ITSPOKE in real time
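As a rough illustration of how such per-turn statistics could be computed, here is a minimal sketch using librosa (not the toolchain described in the work); the pitch range, the audio path, and the syllable count argument are illustrative assumptions, and the normalization step mentioned on the slide is left out.

```python
import librosa
import numpy as np

def acoustic_prosodic_features(wav_path, pause_before, syllable_count):
    """Rough per-turn pitch, energy, and temporal statistics (un-normalized)."""
    y, sr = librosa.load(wav_path, sr=None)

    # Pitch (f0) track; unvoiced frames are returned as NaN.
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)
    # Frame-level energy (RMS).
    rms = librosa.feature.rms(y=y)[0]

    duration = len(y) / sr
    return {
        "f0_max": np.nanmax(f0), "f0_min": np.nanmin(f0),
        "f0_mean": np.nanmean(f0), "f0_std": np.nanstd(f0),
        "rms_max": rms.max(), "rms_min": rms.min(),
        "rms_mean": rms.mean(), "rms_std": rms.std(),
        "duration": duration,                        # turn duration (seconds)
        "pause_before": pause_before,                # pause preceding turn (seconds)
        "tempo": syllable_count / duration,          # syllables per second
        "internal_silence": float(np.mean(np.isnan(f0))),  # fraction of zero-f0 frames
    }
```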

Page 11: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Feature Types (2)

Lexical Items

word occurrence vector
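A word occurrence vector is a binary bag-of-words over the turn. A minimal sketch using scikit-learn's CountVectorizer purely as an illustration, applied to the student turns from the annotated excerpt:

```python
from sklearn.feature_extraction.text import CountVectorizer

turns = ["ok", "the engine", "um"]   # student turns from the annotated excerpt

# binary=True records presence/absence of each word rather than counts.
vectorizer = CountVectorizer(binary=True, token_pattern=r"[a-z']+")
X = vectorizer.fit_transform(turns)

print(vectorizer.get_feature_names_out())   # vocabulary
print(X.toarray())                          # one 0/1 row per turn
```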

Page 12: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Feature Types (3)

Other Automatic Features: available from ITSPOKE logs

Turn Begin Time (seconds from dialog start)
Turn End Time (seconds from dialog start)
Is Temporal Barge-in (student turn begins before tutor turn ends)
Is Temporal Overlap (student turn begins and ends in tutor turn)
Number of Words in Turn
Number of Syllables in Turn
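The two temporal flags follow directly from the turn timestamps in the logs. A minimal sketch, with the exact boundary conditions treated as assumptions:

```python
def temporal_flags(student_begin, student_end, tutor_begin, tutor_end):
    """Barge-in/overlap flags per the definitions above (seconds from dialog start)."""
    # Student turn begins before the tutor turn ends.
    is_barge_in = student_begin < tutor_end
    # Student turn begins and ends within the tutor turn.
    is_overlap = tutor_begin <= student_begin and student_end <= tutor_end
    return is_barge_in, is_overlap

# Student starts 0.5 s before the tutor finishes and keeps talking afterwards:
print(temporal_flags(12.0, 14.0, 8.0, 12.5))   # (True, False)
```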

Page 13: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Feature Types (4)

Manual Features: (currently) available only from human transcription

Is Prior Tutor Question (tutor turn contains “?”)
Is Student Question (student turn contains “?”)
Is Semantic Barge-in (student turn begins at tutor word/pause boundary)
Number of Hedging/Grounding Phrases (e.g. “mm-hm”, “um”)
Is Grounding (canonical phrase turns not preceded by a tutor question)
Number of False Starts in Turn (e.g. acc- acceleration)
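Several of these can be approximated automatically from a transcript. A minimal sketch, in which the hedging/grounding phrase list and the dash convention for false starts are illustrative assumptions rather than the annotators' actual criteria:

```python
import re

# Illustrative phrase list; the annotators' actual hedging/grounding inventory is not given here.
HEDGE_GROUND_PHRASES = {"mm-hm", "um", "uh", "ok", "okay"}

def manual_feature_sketch(student_turn, prior_tutor_turn):
    tokens = re.findall(r"[\w'-]+", student_turn.lower())
    return {
        "is_prior_tutor_question": "?" in prior_tutor_turn,
        "is_student_question": "?" in student_turn,
        "num_hedge_ground": sum(t in HEDGE_GROUND_PHRASES for t in tokens),
        # False starts assumed to be transcribed as a truncated fragment ending in "-".
        "num_false_starts": len(re.findall(r"\b\w+-\s", student_turn)),
    }

print(manual_feature_sketch("um... acc- acceleration?", "What causes the car to accelerate?"))
```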

Page 14: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Feature Types (5)

Identifier Features

subject ID
problem ID
subject gender

Page 15: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Machine Learning (ML) Experiments

Weka software: boosted decision trees give best results (Litman & Forbes, ASRU 2003)

Baseline: Predicts Majority Class (neutral); Accuracy = 72.74%

Methodology: 10 runs of 10-fold cross validation

Evaluation Metrics

Mean Accuracy: %Correct

Relative Improvement Over Baseline (RI): (error(baseline) - error(x)) / error(baseline)
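Plugging in the majority-class baseline and the best-performing feature set reproduces the reported error reduction; a quick sketch:

```python
def relative_improvement(baseline_acc, acc):
    """RI = (error(baseline) - error(x)) / error(baseline), with error = 100 - accuracy."""
    baseline_err = 100.0 - baseline_acc
    return (baseline_err - (100.0 - acc)) / baseline_err

# Majority-class baseline vs. best feature set (speech & text, turn & global context).
print(f"{relative_improvement(72.74, 84.75):.0%}")   # ~44%
```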

Page 16: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Acoustic-Prosodic vs. Other Features

Baseline = 72.74%; RI range = 12.69% - 43.87%

Feature Set                              -ident
speech                                   76.20%
lexical                                  78.31%
lexical + automatic                      80.38%
lexical + automatic + manual             83.19%

Acoustic-prosodic features (“speech”) outperform the majority baseline, but the other feature types yield even higher accuracy, and the more of them the better

Page 17: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Acoustic-Prosodic plus Other Features

Feature Set                              -ident
speech + lexical                         79.26%
speech + lexical + automatic             79.64%
speech + lexical + automatic + manual    83.69%

Baseline = 72.74%; RI range = 23.29% - 42.26%

Adding acoustic-prosodic features to the other feature sets doesn’t significantly improve performance

Page 18: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Adding Contextual Features

(Litman et al. 2001, Batliner et al. 2003): adding contextual features improves prediction accuracy

Local Features: the values of all features for the two student turns preceding the student turn to be predicted

Global Features: running averages and totals for all features, over all student turns preceding the student turn to be predicted
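A minimal sketch of deriving these contextual features from a table of per-turn features (pandas; the column names are purely illustrative): local context takes the raw values of the two preceding student turns, global context takes expanding averages and totals over all preceding student turns.

```python
import pandas as pd

# One row per student turn, in dialogue order; illustrative turn-level features.
turns = pd.DataFrame({
    "duration":  [2.1, 4.0, 1.5, 3.2],
    "num_words": [3,   9,   2,   7],
})

# Local context: feature values of the two preceding student turns.
local = pd.concat([turns.shift(1).add_suffix("_prev1"),
                   turns.shift(2).add_suffix("_prev2")], axis=1)

# Global context: running average and total over all preceding student turns
# (shift by one so the current turn is excluded; the first turn has no context).
global_avg   = turns.shift(1).expanding().mean().add_suffix("_run_avg")
global_total = turns.shift(1).expanding().sum().add_suffix("_run_total")

context_features = pd.concat([turns, local, global_avg, global_total], axis=1)
print(context_features)
```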

Page 19: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Previous Feature Sets plus Context

Same feature set with no context: 83.69%

Feature Set                              +context       -ident
speech + lexical + auto + manual         local          82.44
speech + lexical + auto + manual         global         84.75
speech + lexical + auto + manual         local+global   81.43

Adding global contextual features marginally improves performance (83.69% without context vs. 84.75% with global context)

Page 20: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Feature Usage

Feature Type             Turn + Global
Acoustic-Prosodic               16.26%
  Temporal                      13.80%
  Energy                         2.46%
  Pitch                          0.00%
Other                           83.74%
  Lexical                       41.87%
  Automatic                      9.36%
  Manual                        32.51%
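Feature usage here is the share of decision nodes in the learned trees that test each feature, grouped by type. The reported numbers come from Weka's boosted decision trees; the sketch below shows the same kind of count with scikit-learn's AdaBoost over decision trees, on made-up data, purely as an illustration.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def feature_usage(model, feature_names):
    """Fraction of internal (non-leaf) decision nodes that test each feature."""
    counts = Counter()
    for tree in model.estimators_:
        node_features = tree.tree_.feature            # -2 marks leaf nodes
        for f in node_features[node_features >= 0]:
            counts[feature_names[f]] += 1
    total = sum(counts.values())
    return {name: counts[name] / total for name in feature_names}

# Made-up data standing in for the turn + global-context feature vectors.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(453, 4)), rng.integers(0, 3, size=453)
names = ["f0_mean", "rms_mean", "num_words", "is_student_question"]

model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3), n_estimators=50)
model.fit(X, y)
print(feature_usage(model, names))
```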

Page 21: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Accuracies over ML Experiments

Page 22: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Related Research in Emotional Speech

Actor/Native Read Speech Corpora (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003)

more emotions; multiple dimensions
acoustic-prosodic predictors

Naturally-Occurring Speech Corpora (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003)

fewer emotions (e.g. E / -E); Kappas < 0.6

additional (non acoustic-prosodic) predictors

Few address the tutoring domain

Page 23: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Summary

Methodology: annotation of student emotions in spoken human tutoring dialogues, extraction of linguistic features, and use of different feature sets to predict emotions

Our best-performing feature set contains acoustic-prosodic, lexical, automatic and hand-labeled features from turn and context (Accuracy = 85%, RI = 44%)

This research is a first step towards implementing emotion prediction and adaptation in ITSPOKE

Page 24: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Current Directions

Address same questions in ITSPOKE computer tutoring corpus (ACL’04)

Label human tutor reactions to student emotions to:

develop adaptive strategies for ITSPOKE
examine the utility of different annotation granularities

determine if greater tutor response to student emotions correlates with student learning and other performance measures

Page 25: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Thank You!

Questions?

Page 26: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Prior Research: Affective Computer Tutoring

(Kort, Reilly, and Picard, 2001): propose a cyclical model of emotion change during learning; developing a non-dialog computer tutor that will use eye-tracking/facial features to predict emotion and support movement into positive emotions.

(Aist, Kort, Reilly, Mostow, and Picard, 2002): adding human-provided emotional scaffolding to an automated reading tutor increases student persistence

(Evens et al., 2002): for CIRCSIM, a computer dialog tutor for physiology problems, hypothesize adaptive strategies for recognized student emotional states; e.g., if frustration is detected, the system should respond to hedges and self-deprecation by supplying praise and restructuring the problem.

(de Vicente and Pain, 2002): use human observations of student motivational states in videoed interactions with a non-dialog computer tutor to develop rules for detection

(Ward and Tsukahara, 2003): a spoken dialog computer “tutor-support” system uses prosodic and contextual features of the user turn (e.g. “on a roll”, “lively”, “in trouble”) to infer an appropriate response as users remember train stations; preferred over randomly chosen acknowledgments (e.g. “yes”, “right”, “that’s it”, “that’s it <echo>”, …)

(Conati and Zhou, 2004): use Dynamic Bayesian Networks to reason under uncertainty about abstracted student knowledge and emotional states through time, based on student moves in a non-dialog computer game, and to guide selection of “tutor” responses.

Most will be relevant to developing ITSPOKE adaptation techniques

Page 27: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

ML Experiment 3: Other Evaluation Metrics

alltext + speech + ident: leave-one-out cross-validation (accuracy = 82.08%)

Best for neutral, better for negatives than positives
Baseline (majority class): neutral P/R/F = .73, 1, .84; negatives and positives = 0, 0, 0

Class        Precision   Recall   F-Measure
Negative          0.71     0.60        0.65
Neutral           0.86     0.92        0.89
Positive          0.50     0.27        0.35
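The baseline row follows from the class distribution: a majority-class predictor labels every turn neutral. A quick sketch, assuming the experiments use the 385 agreed turns (90 negative, 280 neutral, 15 positive), which matches the ~72.7% baseline accuracy and the .73/1/.84 neutral scores:

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

counts = {"negative": 90, "neutral": 280, "positive": 15}   # agreed turns per class
total = sum(counts.values())

for label, n in counts.items():
    if label == "neutral":                        # everything is predicted "neutral"
        p, r, f = prf(tp=n, fp=total - n, fn=0)   # -> .73, 1.0, .84
    else:
        p, r, f = prf(tp=0, fp=0, fn=n)           # -> 0, 0, 0
    print(f"{label:8s} P={p:.2f} R={r:.2f} F={f:.2f}")
```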

Page 28: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Machine Learning (ML) Experiments

Weka machine-learning software: boosted decision trees give best results (Litman & Forbes, 2003)

Baseline: Predicts Majority Class (neutral); Accuracy = 72.74%

Methodology: 10 x 10 cross validation

Evaluation Metrics

Mean Accuracy: %Correct

Standard Error: SE = std(x)/sqrt(n), n = 10 runs; +/- 2*SE gives a 95% confidence interval

Relative Improvement Over Baseline (RI): (error(baseline) - error(x)) / error(baseline)

error(y) = 100 - %Correct
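A small sketch applying these metrics to per-run accuracies from the 10 x 10 cross-validation; the run accuracies below are made-up placeholders, not reported values:

```python
import numpy as np

# Placeholder mean accuracies from 10 runs of 10-fold cross-validation.
run_accuracies = np.array([84.1, 85.2, 84.8, 84.6, 85.0, 84.3, 85.4, 84.9, 84.7, 84.5])

mean_acc = run_accuracies.mean()
se = run_accuracies.std(ddof=1) / np.sqrt(len(run_accuracies))   # SE = std(x)/sqrt(n)
ci_low, ci_high = mean_acc - 2 * se, mean_acc + 2 * se           # +/- 2*SE ~ 95% CI

print(f"mean accuracy {mean_acc:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```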

Page 29: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Outline

Introduction

ITSPOKE Project

Emotion Annotation

Machine-Learning Experiments

Conclusions and Current Directions

Page 30: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

ITSPOKE: Intelligent Tutoring SPOKEn Dialogue System

Back-end is the text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)

Sphinx2 speech recognizer

Cepstral text-to-speech synthesizer

Try ITSPOKE during demo session !

Page 31: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

Experimental Procedure

Students take a physics pretest

Students read background material

Students use the web and voice interface to work through up to 10 problems with either ITSPOKE or a human tutor

Students take a post-test

Page 32: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

ML Experiment 3: 8 Feature Sets + Context

Global context marginally better than local or combined
No significant difference between +/- ident sets
e.g., speech without context: 76.20% (-ident), 77.41% (+ident)

Feature Set   +context        -ident   +ident
speech        local            76.90    76.95
speech        global           77.77    78.02
speech        local+global     77.00    76.88

Adding context marginally improves some performances

Page 33: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources
Page 34: Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources

8 Feature Sets

speech: normalized acoustic-prosodic features

lexical: lexical items in the turn

autotext: lexical + automatic features

alltext: lexical + automatic + manual features

+ident: each of above + identifier features