SaGA – The Bielefeld Speech and Gesture Alignment Corpus
TRANSCRIPT
Kirsten Bergmann & Stefan Kopp
Research Group "Sociable Agents", CITEC, Bielefeld University
SFB 673 "Alignment in Communication", Bielefeld University
SaGA – The Bielefeld Speech and Gesture Alignment Corpus
"The church has a dome-shaped roof."
"The problem of generating an overt gesture from an abstract [...] representation is one of the great puzzles of human gesture, and has received little attention in the literature." (de Ruiter, 2007, p. 30)
Our Research Goal
„[...] why different gestures take the particular physical form they do is one of the most important yet largely unaddressed questions in gesture research“ (Bavelas et al. 2008, p. 499)
Human-like Expressiveness for Virtual Agents
Methodology
Evaluate
Design evaluation
Realize gaps in model and data
Model
Build theories and formal models
AI/CogSci
Study humans
Gather data, build corpora, analyze
Build humanoids
Implement simulation on the basis of model
"The church has a dome-shaped roof."
Multiple Influences on a Gesture's Form
Discourse Context (Bergmann & Kopp, 2009)
• Communicative goal
• Information structure
Linguistic Context
• Linguistic information packaging (Kita & Özyürek, 2003)
• Syntactic constructions (Buschmeier, Bergmann & Kopp, submitted)
Previous Gesture Features
• Catchments (McNeill, 2005)
Interlocutor's Communicative Behavior
• Gestural mimicry (Kimbara, 2006)
Requirements for a corpus
1. Gesture annotation
• Gesture segmentation
• Gesture classification
• Gesture morphology
2. Referent features
3. Gesture context
• Speech
  - Words
  - POS
  - Syntactic constructions
• Discourse context
  - Information structure
  - Communicative goals
  - Dialogue acts
• Audio- and video recordings (3 camera views)
• Motion tracking of hand and head movements
• 25 dyads
• ~280 min
• 39,435 words
• 4,961 iconic/deictic gestures
Experimental Setting
• Direction giving and sight descriptions for a VR town
• Simplified objects to allow determination and control of message content (shape features, level of detail)
General Gesture Annotation
Gesture Phases: preparation, pre-stroke hold, stroke, post-stroke hold, retraction
Gesture Phrases: iconic, deictic, beat, discourse; combinations
Representation Techniques: indexing, placing, shaping, drawing, posturing; combinations
Perspective: survey, speaker
Gesture Morphology
Examples of increasing complexity: no movement; movement in wrist position; movement in palm orientation and wrist position
Coding in 2 parts
1. Gesture Form
• Wrist Location
• Handshape
• Palm Orientation
• Extended Finger Orientation
• Hand Combination
2. Motion in at least one form dimension
• Path of movement (line, arc, ...)
• Direction of movement (up, down, left, ...)
• Repetition
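The two-part coding just outlined can be captured in a small record type. A sketch in Python — the class and field names are illustrative, not the corpus's actual annotation tier names:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GestureForm:
    """Part 1: static form features of a gesture stroke."""
    wrist_location: str           # e.g. "C-UP" (center-upper)
    handshape: str                # e.g. "ASL-C"
    palm_orientation: str         # e.g. "PDN" (palm down)
    finger_orientation: str       # e.g. "BAB" (away from body)
    hand_combination: Optional[str] = None  # e.g. "FTT"

@dataclass
class GestureMotion:
    """Part 2: motion in at least one form dimension."""
    path: Optional[str] = None                           # "line", "arc", ...
    direction: List[str] = field(default_factory=list)   # "up", "down", ...
    repeated: bool = False

@dataclass
class GestureAnnotation:
    form: GestureForm
    motion: Optional[GestureMotion] = None   # None if no movement

# A gesture held center-upper with a cupped hand, arcing to the right:
g = GestureAnnotation(
    form=GestureForm("C-UP", "ASL-C", "PDN", "BAB"),
    motion=GestureMotion(path="arc", direction=["right"]),
)
```

Keeping form and motion in separate records mirrors the scheme's split: every gesture gets a form annotation, while the motion part is present only when some form dimension changes.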
Gesture Form Coding: Wrist Location
1. Wrist Position (McNeill, 1992)
2. Wrist Distance
D-C: hand in contact with body
D-CE: between body and elbow's length away
D-EK: between elbow's and knuckles' length away
D-KO: between knuckles' length and an outstretched arm's length away
D-O: an outstretched arm's length away in front
CC center-center
C-UP center-upper
C-UR center-upper-right
C-UL center-upper-left
C-RT center-right
C-LT center-left
C-LW center-lower
...
3. Extent
0: no wrist movement
SMALL: within 1 region
MEDIUM: across 2 regions
LARGE: more than 2 regions
Gesture Form Coding: Handshape
• ASL handshapes, e.g. ASL-B, ASL-O, ASL-C, ASL-G, ASL-5
• Modifiers
  - loose
  - small/medium/large (for ASL-C)
Gesture Form Coding: Palm and Finger Orientation
Basic Values (P = palm orientation, B = extended finger orientation)
PUP / BUP: up
PDN / BDN: down
PTL / BTL: left
PTR / BTR: right
PTB / BTB: towards body
PAB / BAB: away from body
Combinations of up to three basic values, e.g. down/left
Gesture Form Coding: Hand Combination
1. Configuration
FTT: tips of fingers and thumbs touching
FT: tips of fingers touching
TT: thumbs touching
FTF: tips of fingers facing
...
2. Movement Relation
MIRROR-SAGITTAL / MIRROR-FRONTAL / MIRROR-TRANSVERSAL: movement is mirrored along one of the three axes
RHH: right hand stable, left hand active
LHH: left hand stable, right hand active
NOSYNC: both hands active; no synchrony
none: both hands static, or only one hand active
Referent Feature Representation
[Figure: Imagistic Description Trees — tree nodes annotated with extents along dimensions d1, d2, d3]
Referent Features
Subparts: number of child nodes
Symmetry: number of symmetrical axes
Main Axis: x-axis, y-axis, z-axis, none
Position: 3D vector
Imagistic Description Trees (Sowa 2006)
- Hierarchical structure of shape decomposition
- Extents in different dimensions as approximation of shape
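An Imagistic Description Tree in the sense of Sowa (2006) can be sketched as a recursive node type carrying the referent features listed above. The class and field names here are illustrative assumptions, not Sowa's actual data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class IDTNode:
    """One node of an Imagistic Description Tree: a shape part with
    its extents along three dimensions and its subparts as children."""
    extents: Tuple[float, float, float]        # extents along d1, d2, d3
    main_axis: Optional[str] = None            # "x", "y", "z", or None
    symmetry_axes: int = 0                     # number of symmetrical axes
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    subparts: List["IDTNode"] = field(default_factory=list)

    def n_subparts(self):
        """The 'Subparts' referent feature: number of child nodes."""
        return len(self.subparts)

# A church with a dome-shaped roof as a two-level decomposition
# (all numbers are made up for illustration):
roof = IDTNode(extents=(2.0, 1.0, 2.0), symmetry_axes=2)
body = IDTNode(extents=(2.0, 3.0, 2.0), main_axis="y")
church = IDTNode(extents=(2.0, 4.0, 2.0), main_axis="y",
                 subparts=[body, roof])
```

The tree mirrors the shape decomposition: the whole church is the root, and its dominant vertical extent (main axis "y") plus the rounded roof part are exactly the features a gesture generator can map to wrist trajectory and handshape.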
Speech Annotation
1. Transcription of spoken words
2. Syntactic Constructions
• Statistical parsing to extract noun phrases: Stanford Parser (Klein & Manning, 2003)
3. Disfluencies
• Annotation scheme for disfluencies in multiparty interaction (Besser & Alexandersson, 2007)
• Hesitations (uhm, uh, ...)
• Discourse markers (I mean, so, well, ...)
• Pauses within a speaker's turn
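The corpus uses the Stanford Parser for the noun-phrase step. As a self-contained stand-in, a toy chunker over already POS-tagged tokens shows the idea of extracting NPs for the syntactic-construction tier (the greedy DT/JJ/NN* pattern is a deliberate simplification, not the parser's grammar):

```python
def extract_noun_phrases(tagged):
    """Greedy NP chunker: a run of determiners/adjectives/nouns that
    contains at least one noun forms one NP. `tagged` is a list of
    (word, POS) pairs with Penn-Treebank-style tags."""
    nps, chunk = [], []
    for word, pos in tagged + [("", ".")]:   # sentinel flushes last chunk
        if pos in ("DT", "JJ") or pos.startswith("NN"):
            chunk.append((word, pos))
        else:
            if any(p.startswith("NN") for _, p in chunk):
                nps.append(" ".join(w for w, _ in chunk))
            chunk = []
    return nps

sent = [("the", "DT"), ("church", "NN"), ("has", "VBZ"),
        ("a", "DT"), ("dome-shaped", "JJ"), ("roof", "NN"), (".", ".")]
extract_noun_phrases(sent)  # -> ['the church', 'a dome-shaped roof']
```

The NN-containment check keeps a stray determiner with no following noun from being emitted as an NP.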
Discourse Context
1. Communicative Goal
Elemental actions of direction giving (Denis, 1997):
- Locomotion: "go ahead"
- Reorientation: "turn left"
- Action + Landmark: "follow the street"
- Landmark: "there is a church"
- Landmark Property: "the church is a large building"
- Landmark Construction: "the church has a round window"
- Landmark Position: "there's a tree on the left"
2. Dialogue Acts
DAMSL annotation (Core & Allen, 1997)
3. Thematization (information structure), following Stone et al. (2003):
- theme: what the utterance is about
- rheme: what is said about the theme
4. Information State, following Ritz et al. (2008):
- private: no antecedent in the previous discourse
- shared: already mentioned in the previous discourse
Problems with Manual Annotation of Gesture Form
• Inter-coder reliability hard to achieve
• Extremely time-consuming
- Refinements of annotation scheme necessitate re-annotations
- Training of several coders
• Occlusion of hands/arms in video data
• Difficult estimation of position and orientation from perspectively distorted images
Problems with Automated Annotation
• Marker-based motion tracking may impair natural gesturing behavior
• Adequate parameters have to be found for each subject, depending on
- Body height
- Marker position
- Inclination of upper body
• Gaps in data stream (caused, e.g., by hidden markers)
Methodology
Model
Evaluate
Build theories and formal models
Implement simulation on the basis of model
Design evaluation
Realize gaps in model and data
Gather data, build corpora, analyze
AI/CogSci
Build humanoids
Data corpus
Results: Systematicity vs. Idiosyncrasy
Inter-subjective patterns
- Correlation of representation technique and referent features
- Correlation of communicative intention and representation technique
- Correlations between visuo-spatial referent features and morphological gesture features

Individual patterns
- Number of gestures
- Use of representation techniques
- Morphological gesture features
How to account for both in a computational model?
Bergmann & Kopp (2009), Kopp et al. (2007)
Employing Bayesian Decision Networks
• Representation of sequential decision problems combining probabilistic and rule-based decision-making
• Learned from annotated corpus
1. Structure learning
2. Parameter estimation
• Evidence is propagated through the network to determine values for variables of interest
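Evidence propagation can be illustrated with a toy discrete network. This sketch covers only the probabilistic part (inference by enumeration) of a Bayesian decision network, not the rule-based decision nodes, and the variables, states, and probabilities are invented for illustration — they are not the actual network learned from SaGA:

```python
from itertools import product

# P(Technique | RoundReferent): a round referent favors shaping.
P_tech = {
    (True, "shaping"): 0.7, (True, "drawing"): 0.3,
    (False, "shaping"): 0.2, (False, "drawing"): 0.8,
}
# P(Handshape | Technique): shaping favors the cupped ASL-C hand.
P_shape = {
    ("shaping", "ASL-C"): 0.8, ("shaping", "ASL-G"): 0.2,
    ("drawing", "ASL-C"): 0.1, ("drawing", "ASL-G"): 0.9,
}

def posterior_handshape(round_referent):
    """Propagate evidence about the referent through the chain
    Referent -> Technique -> Handshape by summing over the hidden
    technique variable: P(Handshape | RoundReferent)."""
    scores = {}
    for tech, shape in product(("shaping", "drawing"), ("ASL-C", "ASL-G")):
        p = P_tech[(round_referent, tech)] * P_shape[(tech, shape)]
        scores[shape] = scores.get(shape, 0.0) + p
    return scores

posterior_handshape(True)   # ASL-C comes out most probable for a round referent
```

In the corpus-based setting, the structure (which arcs exist) and these conditional tables (their entries) are exactly what the two learning steps above — structure learning and parameter estimation — recover from the annotated data.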
Speech and Gesture Generation
Multimodal content planning and formulation of speech and gesture run in parallel and interact via a multimodal working memory
Inspired by psycholinguistic production models (de Ruiter, 2000; Kita & Özyürek, 2003)
Speech and Gesture Generation
Speech
Gesture
The church has a dome-shaped roof.
Handshape: ASL-bent-5
Palm Orientation: Down
Finger Orientation: AwayFromBody
Wrist Position: (Center UpperChest Norm)
Handedness: Right
<utterance>
  <specification>
    The church has <time id="t1"/>a dome-shaped roof<time id="t2"/>
  </specification>
  <gesture>
    <affiliate onset="t1" end="t2"/>
    <constraints>
      <symmetrical dominant="right_arm">
        <constraints>
          <parallel>
            <static slot="HandShape" value="BSflat (FBround all o)(ThCpart o)"/>
            <static slot="HandLocation" value="LocCenter LocUpperChest LocNorm"/>
            <static slot="PalmOrientation" value="DirD"/>
            <static slot="ExtFingerOrientation" value="DirA"/>
          </parallel>
        </constraints>
      </symmetrical>
    </constraints>
  </gesture>
</utterance>
"The church has a dome-shaped roof."
Towards standardization...
• International effort to unify multimodal behavior for Embodied Conversational Agents (ECAs)
• SAIBA framework
- Function Markup Language (FML)
  • Intent description without referring to physical behavior
- Behavior Markup Language (BML)
  • Description of physical realization
• XML based representation language
Kopp et al. (2006); Vilhjálmsson et al. (2007)
SAIBA pipeline: Intent Planning → (FML) → Behavior Planning → (BML) → Behavior Realization, with feedback flowing back between stages
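The three SAIBA stages can be sketched as function composition. The function names and the simplified FML/BML snippets below are illustrative assumptions — real FML/BML documents are richer and the stage interfaces are not standardized as plain strings:

```python
def intent_planning(goal):
    """Intent Planning: produce an FML-like description of WHAT to
    communicate, with no reference to physical behavior."""
    return {"intent": "inform", "content": goal}

def behavior_planning(fml):
    """Behavior Planning: map communicative intent to a BML-like
    description of HOW to realize it (speech + gesture)."""
    return (f'<bml><speech>{fml["content"]}</speech>'
            f'<gesture lexeme="SHAPE_DOME"/></bml>')

def behavior_realization(bml):
    """Behavior Realization: hand the BML to an animation engine
    (e.g. MAX/ACE); here we only report what would be played."""
    return f"realizing: {bml}"

bml = behavior_planning(intent_planning("The church has a dome-shaped roof."))
```

The point of the split is the clean handoff: an intent planner never mentions hands or eyebrows, and a realizer never reasons about communicative goals — each stage only consumes the markup of the stage before it.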
BML Behavior Elements
<HEAD> Nodding, shaking, tossing, orientation
<TORSO> Orientation, shape of spine and shoulder
<FACE> Movement of facial muscles (eyebrow, eyelid, mouth)
<GAZE> Coordinated movements of eyes, neck and head direction
<BODY> Overall orientation, position, posture
<LEGS> Pelvis, hip, legs, knee, toes, ankle
<GESTURE> Coordinated movement with arms and hands, e.g. MURML
<SPEECH> Verbal and paraverbal behavior (words, pauses, prosody)
<LIPS> Lip shapes
MURML Specification
<gesture>
  <affiliate onset="t1" end="t2"/>
  <constraints>
    <symmetrical dominant="right_arm">
      <constraints>
        <parallel>
          <static slot="HandShape" value="BSflat (FBround all o)(ThCpart o)"/>
          <static slot="PalmOrientation" value="DirD"/>
          <static slot="ExtFingerOrientation" value="DirA"/>
          <dynamic slot="HandLocation">
            <dynamicElement type="linear">
              <value type="start" name="LocShoulder LocCenterRight LocNorm"/>
              <value type="direction" name="DirR"/>
              <value type="distance" name="125.0"/>
            </dynamicElement>
          </dynamic>
        </parallel>
      </constraints>
    </symmetrical>
  </constraints>
</gesture>
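A MURML fragment like the one above can be assembled programmatically rather than written by hand. A sketch with Python's `xml.etree.ElementTree`, reproducing the static-constraint part of the fragment (the element and attribute names follow the fragment shown; the helper function names are our own):

```python
import xml.etree.ElementTree as ET

def murml_static(slot, value):
    """One <static> form constraint."""
    return ET.Element("static", slot=slot, value=value)

def build_gesture(onset, end):
    """Assemble a MURML <gesture> with three parallel static
    constraints under a right-arm-dominant symmetrical block."""
    gesture = ET.Element("gesture")
    ET.SubElement(gesture, "affiliate", onset=onset, end=end)
    constraints = ET.SubElement(gesture, "constraints")
    sym = ET.SubElement(constraints, "symmetrical", dominant="right_arm")
    inner = ET.SubElement(sym, "constraints")
    par = ET.SubElement(inner, "parallel")
    par.append(murml_static("HandShape", "BSflat (FBround all o)(ThCpart o)"))
    par.append(murml_static("PalmOrientation", "DirD"))
    par.append(murml_static("ExtFingerOrientation", "DirA"))
    return gesture

xml_str = ET.tostring(build_gesture("t1", "t2"), encoding="unicode")
```

Generating the markup from code is also what makes the "simulation to check annotations" loop cheap: an annotated corpus entry can be converted to MURML and replayed on the agent without manual XML editing.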
Simulation to check annotations
<utterance>
  <specification>
    The church has <time id="t1"/>a dome-shaped roof<time id="t2"/>
  </specification>
  <gesture>
    <affiliate onset="t1" end="t2"/>
    <constraints>
      <symmetrical dominant="right_arm">
        <constraints>
          <parallel>
            <static slot="HandShape" value="BSflat (FBround all o)(ThCpart o)"/>
            <static slot="HandLocation" value="LocCenter LocUpperChest LocNorm"/>
            <static slot="PalmOrientation" value="DirD"/>
            <static slot="ExtFingerOrientation" value="DirA"/>
          </parallel>
        </constraints>
      </symmetrical>
    </constraints>
  </gesture>
</utterance>
MAX/ACE (Kopp & Wachsmuth, 2004)
- On-the-fly speech synthesis and movement planning
- Scheduling and co-articulation of speech and gestures, incremental chunks (intonation phrase + gesture phrase)
Summary
• Bielefeld SaGA corpus
- Large collection of naturalistic, yet controlled, speech-gesture behavior (~5,000 gestures)
- Comprehensive annotation
  • Gesture types and morphology
  • Gesture referents
  • Verbal context
  • Discourse context
• Development of data analysis tools for motion tracking data
• Generation of multimodal behavior for virtual agents
• Representation languages (BML, MURML): Candidates for standardization