SaGA – The Bielefeld Speech and Gesture Alignment Corpus
TRANSCRIPT
Kirsten Bergmann & Stefan Kopp
Research Group "Sociable Agents", CITEC, Bielefeld University
SFB 673 "Alignment in Communication", Bielefeld University
SaGA – The Bielefeld Speech and Gesture Alignment Corpus
"The church has a dome-shaped roof."
"The problem of generating an overt gesture from an abstract [...] representation is one of the great puzzles of human gesture, and has received little attention in the literature." (de Ruiter, 2007, p. 30)
Our Research Goal
„[...] why different gestures take the particular physical form they do is one of the most important yet largely unaddressed questions in gesture research“ (Bavelas et al. 2008, p. 499)
Human-like Expressiveness for Virtual Agents
Methodology
Evaluate
Design evaluation
Realize gaps in model and data
Model
Build theories and formal models
AI/CogSci
Study humans
Gather data, build corpora, analyze
Build humanoids
Implement simulation on the basis of model
"The church has a dome-shaped roof."
Multiple Influences on a Gesture's Form
Discourse Context (Bergmann & Kopp, 2009)
• Communicative goal
• Information structure
Linguistic Context
• Linguistic information packaging (Kita & Özyürek, 2003)
• Syntactic constructions (Buschmeier, Bergmann & Kopp, submitted)
Previous Gesture Features
• Catchments (McNeill, 2005)
Interlocutor's Communicative Behavior
• Gestural mimicry (Kimbara, 2006)
Requirements for a corpus
1. Gesture annotation
• Gesture segmentation
• Gesture classification
• Gesture morphology
2. Referent features
3. Gesture context
• Speech
  - Words
  - POS
  - Syntactic constructions
• Discourse context
  - Information structure
  - Communicative goals
  - Dialogue acts
• Audio- and video recordings (3 camera views)
• Motion tracking of hand and head movements
• 25 dyads
• ~280 min
• 39,435 words
• 4,961 iconic/deictic gestures
Experimental Setting
• Direction giving and sight descriptions for a VR town
• Simplified objects to allow determination and control of message content (shape features, level of detail)
General Gesture Annotation
Gesture Phases: preparation, pre-stroke hold, stroke, post-stroke hold, retraction
Gesture Phrases: iconic, deictic, beat, discourse; combinations
Representation Techniques: indexing, placing, shaping, drawing, posturing; combinations
Perspective: survey, speaker
Gesture Morphology
Examples of increasing complexity: no movement; movement in wrist position; movement in palm orientation and wrist position
Coding in 2 parts
1. Gesture Form
• Wrist Location
• Handshape
• Palm Orientation
• Extended Finger Orientation
• Hand Combination
2. Motion in at least one form dimension
• Path of movement (line, arc, ...)
• Direction of movement (up, down, left, ...)
• Repetition
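The two-part coding just outlined can be captured in a small record type. A sketch in Python — the class and field names are illustrative, not the corpus's actual annotation tier names:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GestureForm:
    """Part 1: static form features of a gesture stroke."""
    wrist_location: str           # e.g. "C-UP" (center-upper)
    handshape: str                # e.g. "ASL-C"
    palm_orientation: str         # e.g. "PDN" (palm down)
    finger_orientation: str       # e.g. "BAB" (away from body)
    hand_combination: Optional[str] = None  # e.g. "FTT"

@dataclass
class GestureMotion:
    """Part 2: motion in at least one form dimension."""
    path: Optional[str] = None                           # "line", "arc", ...
    direction: List[str] = field(default_factory=list)   # "up", "down", ...
    repeated: bool = False

@dataclass
class GestureAnnotation:
    form: GestureForm
    motion: Optional[GestureMotion] = None   # None if no movement

# A gesture held center-upper with a cupped hand, arcing to the right:
g = GestureAnnotation(
    form=GestureForm("C-UP", "ASL-C", "PDN", "BAB"),
    motion=GestureMotion(path="arc", direction=["right"]),
)
```

Keeping form and motion in separate records mirrors the scheme's split: every gesture gets a form annotation, while the motion part is present only when some form dimension changes.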
Gesture Form Coding: Wrist Location
1. Wrist Position (McNeill, 1992)
2. Wrist Distance
D-C: hand in contact with body
D-CE: between body and elbow's length away
D-EK: between elbow's and knuckles' length away
D-KO: between knuckles' length and an outstretched arm's length away
D-O: an outstretched arm's length away in front
CC center-center
C-UP center-upper
C-UR center-upper-right
C-UL center-upper-left
C-RT center-right
C-LT center-left
C-LW center-lower
...
3. Extent
0: no wrist movement
SMALL: within 1 region
MEDIUM: across 2 regions
LARGE: more than 2 regions
Gesture Form Coding: Handshape
• ASL handshapes, e.g. ASL-B, ASL-O, ASL-C, ASL-G, ASL-5
• Modifiers
  - loose
  - small/medium/large (for ASL-C)
Gesture Form Coding: Palm and Finger Orientation
Basic Values (P = palm orientation, B = extended finger orientation)
PUP / BUP: up
PDN / BDN: down
PTL / BTL: left
PTR / BTR: right
PTB / BTB: towards body
PAB / BAB: away from body
Combinations of up to three basic values, e.g. down/left
Gesture Form Coding: Hand Combination
1. Configuration
FTT: tips of fingers and thumbs touching
FT: tips of fingers touching
TT: thumbs touching
FTF: tips of fingers facing
...
2. Movement Relation
MIRROR-SAGITTAL / MIRROR-FRONTAL / MIRROR-TRANSVERSAL: movement is mirrored along one of the three axes
RHH: right hand stable, left hand active
LHH: left hand stable, right hand active
NOSYNC: both hands active; no synchrony
none: both hands static, or only one hand active
Referent Feature Representation
[Figure: Imagistic Description Trees — tree nodes annotated with extents along dimensions d1, d2, d3]
Referent Features
Subparts: number of child nodes
Symmetry: number of symmetrical axes
Main Axis: x-axis, y-axis, z-axis, none
Position: 3D vector
Imagistic Description Trees (Sowa 2006)
- Hierarchical structure of shape decomposition
- Extents in different dimensions as approximation of shape
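An Imagistic Description Tree in the sense of Sowa (2006) can be sketched as a recursive node type carrying the referent features listed above. The class and field names here are illustrative assumptions, not Sowa's actual data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class IDTNode:
    """One node of an Imagistic Description Tree: a shape part with
    its extents along three dimensions and its subparts as children."""
    extents: Tuple[float, float, float]        # extents along d1, d2, d3
    main_axis: Optional[str] = None            # "x", "y", "z", or None
    symmetry_axes: int = 0                     # number of symmetrical axes
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    subparts: List["IDTNode"] = field(default_factory=list)

    def n_subparts(self):
        """The 'Subparts' referent feature: number of child nodes."""
        return len(self.subparts)

# A church with a dome-shaped roof as a two-level decomposition
# (all numbers are made up for illustration):
roof = IDTNode(extents=(2.0, 1.0, 2.0), symmetry_axes=2)
body = IDTNode(extents=(2.0, 3.0, 2.0), main_axis="y")
church = IDTNode(extents=(2.0, 4.0, 2.0), main_axis="y",
                 subparts=[body, roof])
```

The tree mirrors the shape decomposition: the whole church is the root, and its dominant vertical extent (main axis "y") plus the rounded roof part are exactly the features a gesture generator can map to wrist trajectory and handshape.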
Speech Annotation
1. Transcription of spoken words
2. Syntactic Constructions
• Statistical parsing to extract noun phrases: Stanford Parser (Klein & Manning, 2003)
3. Disfluencies
• Annotation scheme for disfluencies in multiparty interaction (Besser & Alexandersson, 2007)
• Hesitations (uhm, uh, ...)
• Discourse markers (I mean, so, well, ...)
• Pauses within a speaker's turn
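The corpus uses the Stanford Parser for the noun-phrase step. As a self-contained stand-in, a toy chunker over already POS-tagged tokens shows the idea of extracting NPs for the syntactic-construction tier (the greedy DT/JJ/NN* pattern is a deliberate simplification, not the parser's grammar):

```python
def extract_noun_phrases(tagged):
    """Greedy NP chunker: a run of determiners/adjectives/nouns that
    contains at least one noun forms one NP. `tagged` is a list of
    (word, POS) pairs with Penn-Treebank-style tags."""
    nps, chunk = [], []
    for word, pos in tagged + [("", ".")]:   # sentinel flushes last chunk
        if pos in ("DT", "JJ") or pos.startswith("NN"):
            chunk.append((word, pos))
        else:
            if any(p.startswith("NN") for _, p in chunk):
                nps.append(" ".join(w for w, _ in chunk))
            chunk = []
    return nps

sent = [("the", "DT"), ("church", "NN"), ("has", "VBZ"),
        ("a", "DT"), ("dome-shaped", "JJ"), ("roof", "NN"), (".", ".")]
extract_noun_phrases(sent)  # -> ['the church', 'a dome-shaped roof']
```

The NN-containment check keeps a stray determiner with no following noun from being emitted as an NP.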
Discourse Context
1. Communicative Goal
Elemental actions of direction giving (Denis, 1997):
- Locomotion: "go ahead"
- Reorientation: "turn left"
- Action + Landmark: "follow the street"
- Landmark: "there is a church"
- Landmark Property: "the church is a large building"
- Landmark Construction: "the church has a round window"
- Landmark Position: "there's a tree on the left"
2. Dialogue Acts
DAMSL annotation (Core & Allen, 1997)
3. Thematization (information structure), following Stone et al. (2003):
- theme: what the utterance is about
- rheme: what is said about the theme
4. Information State, following Ritz et al. (2008):
- private: no antecedent in the previous discourse
- shared: already mentioned in the previous discourse
Problems with Manual Annotation of Gesture Form
• Inter-coder reliability hard to achieve
• Extremely time-consuming
- Refinements of annotation scheme necessitate re-annotations
- Training of several coders
• Occlusion of hands/arms in video data
• Difficult estimation of position and orientation from perspectively distorted images
Problems with Automated Annotation
• Marker-based motion tracking may impair natural gesturing behavior
• Adequate parameters have to be found for each subject, depending on
- Body height
- Marker position
- Inclination of upper body
• Gaps in data stream (caused, e.g., by hidden markers)
Methodology
Model
Evaluate
Build theories and formal models
Implement simulation on the basis of model
Design evaluation
Realize gaps in model and data
Gather data, build corpora, analyze
AI/CogSci
Build humanoids
Data corpus
Results: Systematicity vs. Idiosyncrasy
Inter-subjective patterns
- Correlation of representation technique and referent features
- Correlation of communicative intention and representation technique
- Correlations between visuo-spatial referent features and morphological gesture features

Individual patterns
- Number of gestures
- Use of representation techniques
- Morphological gesture features
How to account for both in a computational model?
Bergmann & Kopp (2009), Kopp et al. (2007)
Employing Bayesian Decision Networks
• Representation of sequential decision problems combining probabilistic and rule-based decision-making
• Learned from annotated corpus
1. Structure learning
2. Parameter estimation
• Evidence is propagated through the network to determine values for variables of interest
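Evidence propagation can be illustrated with a toy discrete network. This sketch covers only the probabilistic part (inference by enumeration) of a Bayesian decision network, not the rule-based decision nodes, and the variables, states, and probabilities are invented for illustration — they are not the actual network learned from SaGA:

```python
from itertools import product

# P(Technique | RoundReferent): a round referent favors shaping.
P_tech = {
    (True, "shaping"): 0.7, (True, "drawing"): 0.3,
    (False, "shaping"): 0.2, (False, "drawing"): 0.8,
}
# P(Handshape | Technique): shaping favors the cupped ASL-C hand.
P_shape = {
    ("shaping", "ASL-C"): 0.8, ("shaping", "ASL-G"): 0.2,
    ("drawing", "ASL-C"): 0.1, ("drawing", "ASL-G"): 0.9,
}

def posterior_handshape(round_referent):
    """Propagate evidence about the referent through the chain
    Referent -> Technique -> Handshape by summing over the hidden
    technique variable: P(Handshape | RoundReferent)."""
    scores = {}
    for tech, shape in product(("shaping", "drawing"), ("ASL-C", "ASL-G")):
        p = P_tech[(round_referent, tech)] * P_shape[(tech, shape)]
        scores[shape] = scores.get(shape, 0.0) + p
    return scores

posterior_handshape(True)   # ASL-C comes out most probable for a round referent
```

In the corpus-based setting, the structure (which arcs exist) and these conditional tables (their entries) are exactly what the two learning steps above — structure learning and parameter estimation — recover from the annotated data.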
Speech and Gesture Generation
Multimodal content planning and formulation of speech and gesture run in parallel and interact via a multimodal working memory
Inspired by psycholinguistic production models (de Ruiter, 2000; Kita & Özyürek, 2003)
Speech and Gesture Generation
Speech
Gesture
The church has a dome-shaped roof.
Handshape: ASL-bent-5
Palm Orientation: Down
Finger Orientation: AwayFromBody
Wrist Position: (Center UpperChest Norm)
Handedness: Right
<utterance>
  <specification>
    The church has <time id="t1"/>a dome-shaped roof<time id="t2"/>
  </specification>
  <gesture>
    <affiliate onset="t1" end="t2"/>
    <constraints>
      <symmetrical dominant="right_arm">
        <constraints>
          <parallel>
            <static slot="HandShape" value="BSflat (FBround all o)(ThCpart o)"/>
            <static slot="HandLocation" value="LocCenter LocUpperChest LocNorm"/>
            <static slot="PalmOrientation" value="DirD"/>
            <static slot="ExtFingerOrientation" value="DirA"/>
          </parallel>
        </constraints>
      </symmetrical>
    </constraints>
  </gesture>
</utterance>
"The church has a dome-shaped roof."
Towards standardization...
• International effort to unify multimodal behavior for Embodied Conversational Agents (ECAs)
• SAIBA framework
- Function Markup Language (FML)
  • Intent description without referring to physical behavior
- Behavior Markup Language (BML)
  • Description of physical realization
• XML based representation language
Kopp et al. (2006); Vilhjálmsson et al. (2007)
SAIBA pipeline: Intent Planning → (FML) → Behavior Planning → (BML) → Behavior Realization, with feedback flowing back between stages
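The three SAIBA stages can be sketched as function composition. The function names and the simplified FML/BML snippets below are illustrative assumptions — real FML/BML documents are richer and the stage interfaces are not standardized as plain strings:

```python
def intent_planning(goal):
    """Intent Planning: produce an FML-like description of WHAT to
    communicate, with no reference to physical behavior."""
    return {"intent": "inform", "content": goal}

def behavior_planning(fml):
    """Behavior Planning: map communicative intent to a BML-like
    description of HOW to realize it (speech + gesture)."""
    return (f'<bml><speech>{fml["content"]}</speech>'
            f'<gesture lexeme="SHAPE_DOME"/></bml>')

def behavior_realization(bml):
    """Behavior Realization: hand the BML to an animation engine
    (e.g. MAX/ACE); here we only report what would be played."""
    return f"realizing: {bml}"

bml = behavior_planning(intent_planning("The church has a dome-shaped roof."))
```

The point of the split is the clean handoff: an intent planner never mentions hands or eyebrows, and a realizer never reasons about communicative goals — each stage only consumes the markup of the stage before it.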
BML Behavior Elements
<HEAD> Nodding, shaking, tossing, orientation
<TORSO> Orientation, shape of spine and shoulder
<FACE> Movement of facial muscles (eyebrow, eyelid, mouth)
<GAZE> Coordinated movements of eyes, neck and head direction
<BODY> Overall orientation, position, posture
<LEGS> Pelvis, hip, legs, knee, toes, ankle
<GESTURE> Coordinated movement with arms and hands, e.g. MURML
<SPEECH> Verbal and paraverbal behavior (words, pauses, prosody)
<LIPS> Lip shapes
MURML Specification
<gesture>
  <affiliate onset="t1" end="t2"/>
  <constraints>
    <symmetrical dominant="right_arm">
      <constraints>
        <parallel>
          <static slot="HandShape" value="BSflat (FBround all o)(ThCpart o)"/>
          <static slot="PalmOrientation" value="DirD"/>
          <static slot="ExtFingerOrientation" value="DirA"/>
          <dynamic slot="HandLocation">
            <dynamicElement type="linear">
              <value type="start" name="LocShoulder LocCenterRight LocNorm"/>
              <value type="direction" name="DirR"/>
              <value type="distance" name="125.0"/>
            </dynamicElement>
          </dynamic>
        </parallel>
      </constraints>
    </symmetrical>
  </constraints>
</gesture>
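A MURML fragment like the one above can be assembled programmatically rather than written by hand. A sketch with Python's `xml.etree.ElementTree`, reproducing the static-constraint part of the fragment (the element and attribute names follow the fragment shown; the helper function names are our own):

```python
import xml.etree.ElementTree as ET

def murml_static(slot, value):
    """One <static> form constraint."""
    return ET.Element("static", slot=slot, value=value)

def build_gesture(onset, end):
    """Assemble a MURML <gesture> with three parallel static
    constraints under a right-arm-dominant symmetrical block."""
    gesture = ET.Element("gesture")
    ET.SubElement(gesture, "affiliate", onset=onset, end=end)
    constraints = ET.SubElement(gesture, "constraints")
    sym = ET.SubElement(constraints, "symmetrical", dominant="right_arm")
    inner = ET.SubElement(sym, "constraints")
    par = ET.SubElement(inner, "parallel")
    par.append(murml_static("HandShape", "BSflat (FBround all o)(ThCpart o)"))
    par.append(murml_static("PalmOrientation", "DirD"))
    par.append(murml_static("ExtFingerOrientation", "DirA"))
    return gesture

xml_str = ET.tostring(build_gesture("t1", "t2"), encoding="unicode")
```

Generating the markup from code is also what makes the "simulation to check annotations" loop cheap: an annotated corpus entry can be converted to MURML and replayed on the agent without manual XML editing.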
Simulation to check annotations
<utterance>
  <specification>
    The church has <time id="t1"/>a dome-shaped roof<time id="t2"/>
  </specification>
  <gesture>
    <affiliate onset="t1" end="t2"/>
    <constraints>
      <symmetrical dominant="right_arm">
        <constraints>
          <parallel>
            <static slot="HandShape" value="BSflat (FBround all o)(ThCpart o)"/>
            <static slot="HandLocation" value="LocCenter LocUpperChest LocNorm"/>
            <static slot="PalmOrientation" value="DirD"/>
            <static slot="ExtFingerOrientation" value="DirA"/>
          </parallel>
        </constraints>
      </symmetrical>
    </constraints>
  </gesture>
</utterance>
MAX/ACE (Kopp & Wachsmuth, 2004)
- On-the-fly speech synthesis and movement planning
- Scheduling and co-articulation of speech and gestures, incremental chunks (intonation phrase + gesture phrase)
Summary
• Bielefeld SaGA corpus
- Large collection of naturalistic, yet controlled, speech-gesture behavior (~5,000 gestures)
- Comprehensive annotation
  • Gesture types and morphology
  • Gesture referents
  • Verbal context
  • Discourse context
• Development of data analysis tools for motion tracking data
• Generation of multimodal behavior for virtual agents
• Representation languages (BML, MURML): Candidates for standardization