
Entertainment Robot: Learning from Observation Paradigm for Humanoid Robot Dancing

Shunsuke Kudoh, The University of Tokyo, Japan

[email protected]

Takaaki Shiratori, The University of Tokyo, Japan

[email protected]

Shin’ichiro Nakaoka, National Institute of Advanced Industrial Science and Technology (AIST), Japan

[email protected]

Atsushi Nakazawa, Osaka University, Japan

[email protected]

Fumio Kanehiro, National Institute of Advanced Industrial Science and Technology (AIST), Japan

[email protected]

Katsushi Ikeuchi, The University of Tokyo, Japan

[email protected]

Abstract— This paper describes how to enable a humanoid robot to imitate human dance. The robotics-oriented approach for generating dance motions is a challenging topic due to dynamic and kinematic differences between humans and robots. In order to overcome these differences, we have been developing a paradigm, Learning From Observation (LFO), in which a robot observes human actions, recognizes what the human is doing, and maps the recognized actions to robot actions for the purpose of mimicking them. We extend this paradigm to enable a robot to perform classic Japanese folk dances. Our system recognizes human actions through abstract task models, and then maps the recognized results to robot motions. Through this indirect mapping, we can overcome the physical differences between human and robot. Since the legs and the upper body have different purposes when performing a dance, support by the legs and dance representation by the upper body, our design strategies for the task models reflect these differences. We use a top-down analytic approach for leg task models and a bottom-up generative approach for keypose task models. Human motions are recognized through these task models separately, and robot motions are also generated separately through these models. Then, we concatenate those separately generated motions and adjust them according to the dynamic body balance of the robot. Finally, we succeed in having a humanoid robot dance with an expert dancer at the original music tempo.

I. INTRODUCTION

Recently, research on humanoid robots has been progressing dramatically. Several application areas for humanoid robots have emerged. Among them, one of the most promising and unique is to utilize humanoid robots for entertainment, in particular for dancing, exploiting the shape similarity of the robot to that of a human dancer. Dance motions also contain various challenging actions, which require long training and practice periods even for human dancers, and they motivate the advancement of robotics technologies.

To make a robot dance is a challenging and emerging research area. Various CG theories have been proposed to make avatars and characters dance. However, difficulties are encountered in applying those graphics-oriented theories to dancing robots due to factors such as the physical mass of humanoid robots, friction between the floor and the robot's feet, and motor power limitations. The robotics area has also accumulated research results on locomotion using the robot's legs. However, few researchers have taken on the challenge of producing actions more complicated than simple locomotion. Thus, developing systems that make humanoid robots dance should open up a new area that contributes to both the robotics and graphics communities.

One of the issues to be solved for achieving dancing robots is how to program such complicated actions on humanoid robots in a robot programming language. One simple solution is to program robots manually. In fact, some companies do program their humanoid robots manually. However, this requires long and tedious programming time and results in inflexibility in generating desired dances.

We have been developing a paradigm, Learning From Observation (LFO), in which a robot observes human actions, recognizes what the human is doing, and maps the recognized action to robot actions for mimicking them. As shown in the attached video, direct mapping from human joint angles to robot joint angles does not work well because of the differences in weight balance and in the lengths of arms and legs. The LFO paradigm prepares task models with which the robot recognizes what the human is doing. The recognized task model generates appropriate robot motions to mimic those human actions. This indirect mapping can overcome balance and dimensional differences. In fact, LFO has been applied to various hand-eye operations such as parts assembly and knot tying [1], [2]. In this paper, we extend this paradigm to handle dancing robots.

We have chosen Japanese folk dance as the goal for our dancing robots to perform. Japanese folk dance is highly structured, so there is a possibility that we can define clear task models for Japanese folk dances. It is also true that some Japanese folk dances disappear over time due to the lack of inheritors who will continue to perform these dances. But once a robot learns how to dance such folk dances, we can preserve them forever in the performances of humanoid robots.

[Figure 1 diagram: 1. Observation → 2. Recognition → 3. Performance, mediated by task models that pair each task (what to do) with a skill (how to do).]

Fig. 1: A humanoid robot reproducing dance performance based on the LFO paradigm.

We employed different strategies to apply LFO to leg and upper-body motions. Leg and upper-body motions of dancing robots have two different purposes. The leg motions stably support the robot body, while the upper-body motions express dancing patterns. Thus, we needed different strategies for designing task models of the legs and the upper body, and for generating leg and upper-body actions separately. We then concatenated and adjusted those separately generated leg and upper-body motions in the final stage.

This paper is organized as follows: Section II explains related work. In Section III, we briefly explain our LFO paradigm and give an overview of the system structure. Section IV and Section V explain how to generate leg and upper-body motions, respectively. Section VI describes the adjustment of the whole-body motion using the separately generated upper-body and leg motions. These techniques successfully generate the dance demonstration by a humanoid robot described in Section VII.

II. RELATED WORK

In the robotics field, many researchers have developed methods to adapt human motion to a humanoid robot. Riley et al. [3] produced dancing motion in a humanoid robot by converting human motion data obtained by a motion-capture system into joint trajectories of the robot. For the same purpose, Pollard et al. [4] proposed a method for constraining given joint trajectories within the mechanical limitations of the joints. However, these methods are insufficient for satisfying dynamic body balance because they basically focus on the constraints of each individual joint. In fact, the pelvis of their robot was fixed in space.

For biped humanoid robots, Tamiya et al. [5] proposed a method that enables a robot to follow given motion trajectories while keeping body balance. This method can only deal with motions in which the robot is standing on one leg. Kagami et al. [6] extended the method so that it allows changes in the motion of supporting legs. Yamane and Nakamura [7] proposed a dynamics filter, which converts a physically inconsistent motion into a consistent motion for a given body. Their dynamics filter lets a given motion follow the equations of motion, but the output motion may not keep the features of the original motion. In these conventional methods, the adaptation basically consists of a single process that modifies given motion trajectories.

In contrast to these approaches, our framework consists of two processes: recognition is performed first, and then the motion for a humanoid robot is reconstructed. Here, the problem of adapting motions is replaced with the problem of generating motions.

With regard to dance performance, Kuroki et al. [8] enabled an actual biped humanoid to stably perform dance motions that include dynamic-style steps. However, this achievement is not the same as our goal, because the motions of their robot were manually created by using their motion editing tool [9].

The adaptation of human motions has also been studied actively in the computer animation community. Many researchers have developed methods to edit motion capture data [10], [11], [12], [13], to seamlessly blend [14], [15] or connect data [16], [17], or to modify data according to kinematic constraints [4], [18]. Basically, it is not necessary to consider dynamic constraints, such as balance, in computer animation. However, some researchers consider such constraints in order to create more realistic animation [19], [20], [21], [22]. In these studies, the motion sequence is synthesized based only on features of motion.

In contrast to these approaches, we focus not only on motion aspects, but also on several environmental and perceptual aspects, such as the dance music and the contact state between the foot and the floor, and we extract the meaning of motion as task models based on these aspects to generate the robot's motion. Other researchers have also considered these aspects for synthesized character animation. Peters et al. [23] proposed a method of human animation synthesis based on visual attention. Sakuma et al. [24] considered a psychological model for human crowd simulation in which neighboring computer graphics characters imposed mental stress on each other. Stone et al. [25] proposed a method to synthesize speech performances in which input sound signals are considered. Kim et al. [26] proposed a rhythmic motion synthesis method using the results of motion rhythm analysis. Alankus et al. [27] and Lee et al. [28] also proposed methods to synthesize dance motion by considering the rhythm of input music. However, these are essentially similar to methods that consider only features of motion, because they use the perception factor only for assisting their motion-based method. They do not use the features of motion for extracting its meaning, which is important in synthesizing expressive motion, as we do in our framework.

III. OVERVIEW

A. LFO Paradigm

The LFO paradigm enables a robot to acquire the knowledge of what to do and how to do it from observation of a human demonstration. As shown in Figure 1, LFO generates robot actions through the following three steps: 1) A human demonstrates the actions to be imitated in front of the robot. 2) The robot recognizes the demonstrated actions based on pre-defined abstract task models and constructs a series of instantiated task models. 3) The robot converts those instantiated task models into robot physical actions. Here, the abstract models are pre-designed using knowledge of the action domain, such as assembly actions or folk dance actions, following a top-down approach.

In general, performing the same action does not require mimicking the entire action as performed. It is difficult, if not impossible, to repeat the exact trajectories to be mimicked, because each person has different dimensions in the parts of his or her body. Instead, the characteristics, or important features, of the actions are extracted, and then only such characteristics or important features are reproduced. That is, each action consists of essential and nonessential parts to be mimicked. LFO introduces abstract task models to represent those essential parts.

The merit of utilizing those abstract task models is that they enable the performance of the same actions by robots of different dimensions. Since the abstract task models are common among different robots, we only need to prepare mapping routines from each abstract task model to each individual robot.

Each abstract task model consists of what to do and skill parameters that explain how to do the task. From an input image, one abstract task model is chosen as the one representing the current action. After this task recognition, the skill parameters that characterize the action are obtained from the input data.

B. Flowchart of Motion Generation

Figure 2 shows an overview of the system. First, the performance of a dancer is recorded using a motion-capture system. Next, the captured data is converted to abstract task models based on the LFO paradigm. Finally, the robot's motion, that is, the joint angle trajectories, is reconstructed from the task models.

Our method handles upper-body and leg motions separately using different kinds of task models. Leg motion has to stably support the whole body. We designed leg task models by considering foot-to-floor contact conditions. Each leg task model also has a template trajectory, to be modified from observation, for stable locomotion, along with the skill parameters required to complete the action.

For leg motion, a continuous foot motion is segmented and recognized using pre-defined abstract task models, which use the skill parameters obtained from the motion-capture system. Using the obtained skill parameters, the pre-defined trajectory of a foot is modified, and inverse kinematics provides the entire joint angles of the robot's leg.

The upper body can move freely, without considering any constraints, to represent the characteristic features of the folk dance. We have defined such characteristics, which we refer to as keyposes, from music beats and brief pauses of the dancer in a motion sequence. The upper-body motions of the robot follow these keyposes exactly and achieve a smooth transition from one keypose to the next. The transitions are generated smoothly by considering motor and joint limits using our filter. By adjusting the leg and upper-body motions, we can produce the robot's entire motion.

IV. GENERATING LEG MOTION

In this section, we describe a method to generate leg motion. Generating leg motion for a humanoid robot involves major challenges because of the physical differences between a robot and a human dancer. The center of mass (CM) of a humanoid robot is located at a higher position in the body than that of a human dancer. The area of foot support is smaller than that of a human dancer. And, finally, the foot-ground contact is less stable than a human's because the feet of a humanoid robot are much harder and have fewer degrees of freedom than those of a human dancer. For these reasons, the physical constraints become very strict for keeping the robot's balance, and so generating leg motion directly from captured human motion is not practical.

However, we found that we can overcome these physical challenges of a humanoid robot using the LFO paradigm. First, we design leg task models with skill parameters and template trajectories that guarantee the stability of the robot. Then, from observed motion sequences of a human dancer, the system recognizes a leg task model, obtains skill parameters, and modifies the predetermined trajectory using the skill parameters.

A. Leg Task Model

We define the four tasks shown in Figure 3 by considering the contact state between the feet of a humanoid robot and the floor, through the top-down analytic approach. Two-foot contact can be represented by both a STAND task and a SQUAT task. The difference between STAND and SQUAT is in the waist position. Here, the SQUAT task model involves a vertical movement of the waist, but does not involve horizontal waist movement. This horizontal movement is reserved for determining the dynamic balance of the body in Section VI-A.

[Figure 2 flowchart: the dancer's marker position sequences are converted into task models (leg motion: tasks with skill parameters; upper-body motion: keyposes with skill transitions detected from motion pauses), from which joint angles are generated, passed through skill refinement and a dynamic filter, and combined into the whole-body motion executed by the robot.]

Fig. 2: Overview of motion generation

[Figure 3 table: tasks (what to do) and skill parameters (how to do). Two-foot contact: STAND, SQUAT; one-foot contact: R-STEP, L-STEP. Common parameters: beginning time t0 and finishing time tf. SQUAT: time t1 and waist-height descent d1 of the mid-point. STEP: horizontal position rf and orientation Rf of the swing foot at tf in the support-foot frame Σs, yaw orientation ψf of the waist, and optionally the time t1, position r1, and orientation R1 of the swing foot at the mid-point.]

Fig. 3: Leg task model: each task model consists of a task and skill parameters. The former explains what to do, and the latter explains how to do it.


One-foot contact is represented by a STEP task; R-STEP represents one stepping motion by the right foot, and L-STEP represents that by the left foot. This task consists of a motion in which one foot, referred to as the swing foot, is lifted from the floor and lands again while the other foot, referred to as the support foot, maintains contact with the floor. Using STEP tasks, various leg motions including footfalls, side- or back-stepping, and kicks can be expressed.

Skill parameters describe task-specific timings and spatial characteristics of each task motion, as shown in the bottom row of Figure 3.

All the tasks have a beginning time $t_0$ and a finishing time $t_f$. These values enable tasks to be arranged in a time sequence, and are necessary for composing a choreography and making rhythm in a dance performance.

The geometric skill parameters are represented in a relative coordinate system with respect to a basis coordinate system fixed at one foot. This allows us to locally modify tasks in a task sequence for the skill refinement described in Section VI-C. No geometric skill parameters represent positions at the beginning point, because those values are inherited from the ending of the previous task.

A STAND task model has only the skill parameters describing its beginning, $t_0$, and ending, $t_f$. A SQUAT task model also has the same timing parameters. In addition, the SQUAT task model has skill parameters characterizing the mid-point: the lowest waist position with respect to the beginning position, $d_1$, and its timing, $t_1$.

All the position parameters in a STEP task are expressed with respect to $\Sigma_s$, the coordinate system of the support foot. Here, the support foot is assumed to remain still on the floor during a STEP task. The z-axis of $\Sigma_s$ is aligned with the upward direction of the global coordinate system. The template trajectory of the swing foot is represented with respect to this coordinate system, $\Sigma_s$.

A STEP task model has a template trajectory and its characterizing parameters, $r_f$ and $R_f$, the final configuration of the swing foot. The parameter $r_f$ does not include a z-axis element because we assume that the floor is flat in this study. A swing foot sometimes takes a characteristic pose during a step. This particular pose is represented by the mid-point of the trajectory of the swing foot. In this case, the time of the mid-point, $t_1$, and the position parameters $r_1$ at $t_1$ are also used. In contrast to $r_f$, the parameter $r_1$ includes a z-axis element.

The STEP task model also has a skill parameter to describe the waist rotation of the robot. The parameter $\psi_f$ is the yaw angle (the orientation around the vertical axis) of the waist at $t_f$. This parameter allows motions in which the waist orientation changes as the result of a step.
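As an illustrative sketch (the class and field names below are ours, not from the authors' implementation), the leg task models of Figure 3 and their skill parameters could be encoded as plain data structures, with all STEP geometry expressed in the support-foot frame Σs:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class LegTask:
    t0: float                 # beginning time [s]
    tf: float                 # finishing time [s]

@dataclass
class StandTask(LegTask):
    pass                      # timing parameters only

@dataclass
class SquatTask(LegTask):
    t1: float                 # time of the mid-point (lowest waist position) [s]
    d1: float                 # waist descent at t1 relative to the posture at t0 [m]

@dataclass
class StepTask(LegTask):
    side: str                            # "R" or "L": which foot swings
    r_f: np.ndarray                      # final horizontal position of the swing foot in Σs (x, y)
    R_f: np.ndarray                      # final orientation of the swing foot in Σs (3x3)
    psi_f: float                         # yaw angle of the waist at tf [rad]
    t1: Optional[float] = None           # optional mid-point time
    r_1: Optional[np.ndarray] = None     # optional mid-point position (x, y, z) in Σs
    R_1: Optional[np.ndarray] = None     # optional mid-point orientation in Σs
```

A recognized performance then becomes a time-ordered list of such instances, which a robot-specific routine converts into joint trajectories.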

B. Recognizing Leg Task

A task sequence is recognized from the marker trajectories obtained by a motion-capture system. Temporal segments are extracted and then recognized as tasks. From each segment, the values of the skill parameters corresponding to the task are obtained.


Fig. 4: Detecting STEP tasks: A STEP task is detected from the speed of the swing foot. The center represents the graph of the swing foot speed, and the bottom represents the extracted task sequence. The figures at the top are the corresponding captured postures.


First, STEP tasks are recognized from a motion sequence by considering foot-floor contact relations, and then the remaining segments are further classified into STAND or SQUAT, depending on the waist positions.

A STEP task is recognized by analyzing the trajectory of a swing foot. Let $p(t)$ be the position of a foot marker at time $t$. The speed of the foot marker is represented as $v_p(t) = |\dot{p}(t)|$. In Figure 4, the middle row shows an example of $v_p(t)$, a foot marker speed. If the foot marker speed $v_p(t)$ has continuously positive values within an interval, that segment is a candidate interval for a STEP task, as shown in the bottom row of Figure 4. Further, in order to avoid erroneous small segments, we also add a constraint on the moving distance. A segment that satisfies the following conditions is recognized as one STEP task:

$$v_p(t) \geq v_{\mathrm{step}} \quad (t_0 \leq t \leq t_f), \qquad \int_{t_0}^{t_f} v_p(t)\,dt \geq l_{\mathrm{step}}, \tag{1}$$

where $v_{\mathrm{step}}$ and $l_{\mathrm{step}}$ are threshold values in terms of velocity and moving distance, respectively. The first condition stipulates that the speed of the swing foot is higher than a certain value, while the second condition guarantees that the swing foot moves more than a certain distance. These two conditions eliminate noisy slipping motion while the foot is a support foot. By applying this analysis to the right and left feet, R-STEPs and L-STEPs are recognized.
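A minimal sketch of this STEP detection follows; the function name, threshold values, and the finite-difference speed estimate are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def detect_step_segments(p, dt, v_step=0.05, l_step=0.03):
    """Find candidate STEP intervals from a foot-marker trajectory (Eq. 1).

    p      : (N, 3) array of foot-marker positions
    dt     : sampling period [s]
    v_step : speed threshold [m/s]
    l_step : minimum travelled distance [m]
    """
    v = np.linalg.norm(np.diff(p, axis=0), axis=1) / dt   # speed v_p(t)
    moving = v >= v_step
    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i
        elif not m and start is not None:
            # second condition of Eq. (1): integral of the speed >= l_step
            if np.sum(v[start:i]) * dt >= l_step:
                segments.append((start * dt, i * dt))       # (t0, tf)
            start = None
    if start is not None and np.sum(v[start:]) * dt >= l_step:
        segments.append((start * dt, len(v) * dt))
    return segments
```

Applying this to the right- and left-foot markers yields the R-STEP and L-STEP candidates, respectively.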

Fig. 5: Skill parameters for STEP and generated foot trajectory

After R-STEPs and L-STEPs are recognized and their corresponding intervals are removed from the input sequence, STAND or SQUAT tasks are recognized from the remaining sequences. SQUAT tasks are recognized by analyzing the vertical trajectory of the waist to extract a motion of lowering and rising again. This kind of waist motion is detected as a segment from $t_0$ to $t_f$ that satisfies

$$v_h(t) < 0 \ \ (t_0 \leq t < t_1), \qquad v_h(t) > 0 \ \ (t_1 < t \leq t_f), \qquad \int_{t_0}^{t_f} |v_h(t)|\,dt \geq l_{\mathrm{squat}}, \tag{2}$$

where $t_1$ corresponds to the timing of the lowest waist position and $l_{\mathrm{squat}}$ is a threshold for the vertical moving distance, which eliminates slight vertical motions that are not regarded as a SQUAT.
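An analogous sketch of the SQUAT detection of Eq. (2), again with illustrative names and thresholds of our own choosing:

```python
import numpy as np

def detect_squat_segments(z_waist, dt, l_squat=0.03):
    """Detect lowering-then-rising waist motions (Eq. 2) in the intervals
    left after removing the STEP segments.

    z_waist : (N,) vertical waist positions [m]
    l_squat : minimum total vertical travel [m]
    """
    v_h = np.diff(z_waist) / dt
    segments, i = [], 0
    while i < len(v_h):
        if v_h[i] < 0:                                     # waist starts descending
            t0 = i
            while i < len(v_h) and v_h[i] < 0:             # descend until v_h changes sign
                i += 1
            t1 = i                                          # lowest waist position
            while i < len(v_h) and v_h[i] > 0:             # then rise again
                i += 1
            tf = i
            if np.sum(np.abs(v_h[t0:tf])) * dt >= l_squat:
                segments.append((t0 * dt, t1 * dt, tf * dt))   # (t0, t1, tf)
        else:
            i += 1
    return segments
```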

C. Determining Skill Parameters for Generating Leg Motion

Finally, we extract the skill parameters for each task model, generate the foot trajectory from the obtained skill parameters, and calculate the joint angles of the entire leg from the foot motion by solving inverse kinematics equations. In each task, the timing parameters $t_0$ and $t_f$ are obtained from the beginning and ending times of the recognized segment. The values of the positional parameters, such as $r_f$ and $R_f$, are calculated by using the positions of a couple of related markers at these timings.

In a STAND task, we set all the joint angles of the leg according to the posture at $t_0$. A STAND task has only two skill parameters, $t_0$ and $t_f$, which are the beginning and the ending times obtained from the motion sequence.

In a SQUAT task, the mid-point time $t_1$ is obtained in the task recognition stage. The vertical waist positions at $t_0$ and $t_1$ are extracted from the waist markers, and the difference of the waist height between these points in time is set to the skill parameter $d_1$. Sequences of joint angles of the leg along the time frame are calculated by solving inverse kinematics equations for the waist positions, so that the trajectory of the waist position satisfies these skill parameters.

For a STEP task, the end point and the mid-point of the swing foot are extracted as the skill parameters. The foot trajectory can be calculated from these parameters as shown in Figure 5. In practice, the mid-point parameters are extracted only if necessary.

The position and orientation of the feet are obtained from several markers attached to the legs of the dancer. The markers to be utilized depend on the body-marker model of the motion-capture system. In our system, the position of a foot is obtained as the center of the two markers attached to the toe and heel, while the orientation of the foot is obtained from the markers attached to the toe, heel, and knee. The base coordinate system $\Sigma_s$ is determined from the position and orientation of the support foot, with respect to the world coordinate system, at the timing $t_1$; the z-axis of $\Sigma_s$ is aligned to the z-axis of the world coordinate system.
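As an illustration (the marker names and the axis convention are our own assumptions, not the paper's), the foot position and orientation could be estimated from the toe, heel, and knee markers as follows:

```python
import numpy as np

def foot_frame(toe, heel, knee):
    """Estimate the foot position (midpoint of toe and heel) and a right-handed
    orientation frame from three markers given in world coordinates."""
    position = 0.5 * (toe + heel)
    x = toe - heel                             # forward axis along the foot
    x /= np.linalg.norm(x)
    up = knee - heel                           # rough upward reference direction
    y = np.cross(up, x)                        # lateral axis
    y /= np.linalg.norm(y)
    z = np.cross(x, y)                         # upward axis completes the frame
    return position, np.column_stack([x, y, z])
```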

The positional skill values $r_1$, $r_f$, $R_1$, and $R_f$ are obtained by converting the positions and orientations at $t_1$ and $t_f$ into those based on $\Sigma_s$. The orientation of the waist $\psi_f$ is also extracted from several markers attached to the waist in the same way.

In a STEP task, whether the mid-point is valid or not must be determined. First, a model trajectory of the swing foot is generated by interpolation from the beginning point to the finishing point. If the distance between the model trajectory and the actual trajectory is larger than a certain threshold, the mid-point is appended to express that trajectory. Here we define the interpolating function $f_n$, which passes through $n\ (\geq 2)$ points whose time, value, and velocity are $t_i$, $y_i$, and $\dot{y}_i$, respectively. Each segment between two adjacent points is expressed by a third-order polynomial. This function is expressed as

$$f_n\langle (t_1, y_1, \dot{y}_1), \ldots, (t_n, y_n, \dot{y}_n)\rangle(t). \tag{3}$$

When $\dot{y}_i$ is omitted, it is assumed to be $0$. Using this function, the interpolated trajectory is generated as $p'(t) = f_2\langle (t_0, p(t_0)), (t_f, p(t_f))\rangle(t)$. The difference between the two trajectories is defined as $d(t) = |p'(t) - p(t)|$. The mid-point is determined to be valid at time $t_1$ if

$$d(t_1) = \max_{t_0 < t < t_f} d(t), \qquad d(t_1) > d_{\mathrm{step}}, \tag{4}$$

where $d_{\mathrm{step}}$ is a distance threshold.

When the mid-point is not determined, the trajectory of the swing foot is $p'$. When the mid-point is determined, the trajectory is

$$f_3\langle (t_0, p(t_0)),\ (t_1, p(t_1), \dot{p}(t_1)),\ (t_f, p(t_f))\rangle(t). \tag{5}$$

From this trajectory, we can obtain a series of joint angles for both legs by solving inverse kinematics equations.
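The following sketch illustrates the mid-point test (Eq. 4) and the trajectory generation (Eqs. 3 and 5) using SciPy's cubic Hermite interpolant, with zero velocity assumed at points where it is omitted; the function name and the threshold value are hypothetical:

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def swing_foot_trajectory(t, p, t0, tf, d_step=0.02):
    """Return an interpolated swing-foot trajectory, inserting a mid-point
    only when the straight interpolation deviates more than d_step from the data.

    t : (N,) sample times, p : (N, 3) observed swing-foot positions.
    """
    mask = (t >= t0) & (t <= tf)
    ts, ps = t[mask], p[mask]
    zero = np.zeros(3)

    # f_2: interpolation from the beginning to the end with zero end velocities
    f2 = CubicHermiteSpline([t0, tf], [ps[0], ps[-1]], [zero, zero], axis=0)
    d = np.linalg.norm(f2(ts) - ps, axis=1)                # d(t) = |p'(t) - p(t)|
    i1 = int(np.argmax(d))
    if d[i1] <= d_step:
        return f2                                          # no mid-point needed

    # f_3: add the mid-point with a finite-difference velocity estimate (Eq. 5)
    lo, hi = max(i1 - 1, 0), min(i1 + 1, len(ts) - 1)
    v1 = (ps[hi] - ps[lo]) / (ts[hi] - ts[lo])
    return CubicHermiteSpline([t0, ts[i1], tf],
                              [ps[0], ps[i1], ps[-1]],
                              [zero, v1, zero], axis=0)
```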

V. GENERATING UPPER BODY MOTION

Upper-body motion contains various keyposes that characterize each Japanese dance. A keypose, referred to as Kata in Kabuki and Kyogen and Tome in Nichibu, can be defined as a fixed posture of a dancer for the purpose of impressing the viewers with the dancer's body line. It is important in a dance performance to properly represent those keyposes with appropriate timings.

We design a bottom-up generative method, based on brief pauses and musical information, to extract these keyposes from motion sequences and define them as task models. Theoretically, keyposes can be obtained by detecting brief pauses in a motion sequence. However, it turns out that too many poses are extracted when we use a simple extraction method based only on motion. Thus, we utilize musical information to narrow the keypose candidates. After analyzing motion and musical information separately, both results are combined, and keyposes are extracted.


Fig. 6: Extracting keypose candidates from hands and CM


Transitional motion between keyposes, a skill trajectory, is represented using a hierarchical B-spline model. The order of the B-spline is determined depending on the necessary details as well as the physical limitations of the robot. In this way, our method recognizes human motion as keypose tasks, and then reconstructs robot motion based on them.

A. Recognizing Keyposes Using Brief Pause Assumption

1) Analyzing Motion Information: The speeds of the hands and the center of mass (CM) are calculated for extracting brief pauses in a motion sequence. In the case of hand motion, the relative speed is calculated using the body-center coordinate system. Its origin is the waist position, the Z axis is the upward direction of the global coordinate system, the Y axis is the frontal direction of the body, and the X axis is perpendicular to these axes. The speed of the CM is calculated in the global coordinate system.

Brief pauses of each body part are independently obtained as the local minimum points that satisfy the following two criteria: 1) Each candidate is a local minimum in the speed sequence, and the local minimum is less than the minimum-speed threshold. 2) The local maximum between two successive candidates is larger than the maximum-speed threshold. Figure 6 illustrates the procedure of pause extraction from the motion sequence of a hand. Namely, the motion at a candidate period is sufficiently slow, and the segment between two consecutive candidates contains sufficiently fast motion.
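A sketch of this pause-candidate test on a single speed sequence (the threshold values below are illustrative, not the authors'):

```python
import numpy as np

def pause_candidates(speed, v_min=0.05, v_max=0.5):
    """Return indices of brief-pause candidates in a speed sequence:
    local minima below v_min, separated by a local maximum above v_max."""
    candidates = []
    for i in range(1, len(speed) - 1):
        is_local_min = speed[i] <= speed[i - 1] and speed[i] <= speed[i + 1]
        if is_local_min and speed[i] < v_min:
            if candidates and np.max(speed[candidates[-1]:i + 1]) <= v_max:
                # no sufficiently fast motion since the previous candidate:
                # keep only the slower of the two nearby minima
                if speed[i] < speed[candidates[-1]]:
                    candidates[-1] = i
                continue
            candidates.append(i)
    return candidates
```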

2) Analyzing Musical Information: Our rhythm extraction method is based on the onsets of the sound power. Extracting the musical rhythm uses the following two assumptions: 1) A sound is likely to be produced at the timing of the rhythm. 2) The interval of the onset components is likely to be equal to that of the rhythm. Figure 7 illustrates the onset component calculation. Using the first assumption, we calculate an onset component per frequency [29], where the power increase from the previous time frame $t-1$ is defined as $d(t, f)$:

Fig. 7: An illustration of onset component calculation.

$$d(t,f) = \begin{cases} \max\bigl(p(t,f),\, p(t+1,f)\bigr) - \mathrm{PrevPow} & \bigl(\min(p(t,f),\, p(t+1,f)) \geq \mathrm{PrevPow}\bigr) \\ 0 & (\text{otherwise}) \end{cases} \tag{6}$$

where

$$\mathrm{PrevPow} = \max\bigl(p(t-1,f),\, p(t-1,f\pm 1)\bigr), \tag{7}$$

and $p(t, f)$ is the spectral power at time $t$ and frequency $f$. By calculating the total onset component $D(t) = \sum_f d(t, f)$, the intensity of the produced sound at time frame $t$ is obtained.

According to the second assumption, the auto-correlation function of $D(t)$ is calculated to estimate the average rhythm interval. We then calculate the cross-correlation function between $D(t)$ and a pulse sequence whose interval is the estimated rhythm interval in order to estimate the timing of the rhythm. In practice, the rhythm interval sometimes changes slightly due to the performers' sensibilities and other factors. To absorb this slight rhythm change, the local maximum around the estimated rhythm time is searched.
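The sketch below illustrates the onset computation of Eqs. (6) and (7) and an autocorrelation-based estimate of the average rhythm interval; the spectrogram input and the lag bounds are assumed to be supplied by the caller:

```python
import numpy as np

def onset_strength(P):
    """P : (T, F) spectrogram power p(t, f). Returns D(t) = sum_f d(t, f)."""
    T, F = P.shape
    D = np.zeros(T)
    for t in range(1, T - 1):
        for f in range(1, F - 1):
            prev_pow = max(P[t - 1, f], P[t - 1, f - 1], P[t - 1, f + 1])   # Eq. (7)
            lo, hi = min(P[t, f], P[t + 1, f]), max(P[t, f], P[t + 1, f])
            if lo >= prev_pow:                                              # Eq. (6)
                D[t] += hi - prev_pow
    return D

def rhythm_interval(D, min_lag, max_lag):
    """Estimate the average rhythm interval (in frames) from the
    autocorrelation of the onset strength D(t)."""
    D = D - D.mean()
    ac = np.correlate(D, D, mode="full")[len(D) - 1:]
    return min_lag + int(np.argmax(ac[min_lag:max_lag]))
```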

3) Combining Pauses and Musical Information: The obtained pause candidates are further refined using the estimated musical rhythm, and are combined to obtain stable keyposes. For each speed sequence, our method tests whether there are any candidates around a musical rhythm time ($t_{\mathrm{beat}}$). If any candidates are found around a musical rhythm time, a keypose might exist there. Such rhythm times are newly selected as the representative candidates, and they replace the old candidates. The other candidates, which are not found around a musical rhythm time, are deleted. Through this procedure, all candidates are placed on musical rhythm times. Figure 8 illustrates this process. In this figure, there are no candidates around the first and third musical rhythm points, but there are candidates around the second and fourth musical rhythm points. So the second and fourth musical rhythm points are extracted as new keypose candidates.

Candidates are combined through a logical operation to confirm whether $t_{\mathrm{beat}}$ is a stop point of the entire body. The operation examines all candidates from the hands, feet, and the CM. Rhythm points that satisfy the following two conditions are extracted as keypose points: 1) There are at least two candidates among those given by the hands and feet. 2) There is a candidate from the CM. Meeting these criteria means that the entire body of the dancer stops at this point of the musical rhythm.
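A sketch of this voting step (the candidate containers and the timing tolerance are our own assumptions):

```python
def select_keyposes(rhythm_times, limb_candidates, cm_candidates, tol=0.1):
    """Keep a rhythm time as a keypose if at least two hand/foot speed
    sequences and the CM all have pause candidates within +/- tol seconds.

    limb_candidates : list of candidate-time lists, one per hand/foot
    cm_candidates   : candidate-time list for the center of mass
    """
    def near(times, t):
        return any(abs(c - t) <= tol for c in times)

    keyposes = []
    for t_beat in rhythm_times:
        limb_votes = sum(near(c, t_beat) for c in limb_candidates)
        if limb_votes >= 2 and near(cm_candidates, t_beat):
            keyposes.append(t_beat)
    return keyposes
```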


Fig. 8: Refinement of the keypose candidates with musical rhythm: Rhythm points around which there are keypose candidates extracted by motion analysis are selected as new candidates.

Fig. 9: Result of keypose extraction – Aizu-bandaisan dance. Top row: keyposes, Tome, drawn in a dance textbook, and bottom row: keyposes extracted by our method. The motion sequence is segmented by the extracted keyposes.

We extracted keyposes from the motion sequences of the Aizu-bandaisan dance, a traditional Japanese folk dance. The bottom row of Figure 9 shows the result of our keypose extraction with the estimated rhythm. The top row of Figure 9 shows the keyposes, Tome, drawn in a dance textbook written by an expert dancer. We can easily confirm that our method produces results very close to the expert's understanding.

B. Generating Skill Trajectories

The final phase for upper-body motion is generating skill trajectories that smoothly connect two consecutive keyposes. This transitional motion is required to satisfy the following two conditions: 1) It satisfies the mechanical constraints of a humanoid robot, such as limitations in joint angles and velocities. 2) It represents the characteristics of the original motion as much as possible. With regard to the former condition, there are several related previous studies. For example, Hodgins et al. developed a method to modify joint angles and angular velocities to satisfy mechanical limitations using a PD filter [4]. However, these studies have not considered conserving the characteristics of motion. By paying attention to keyposes, we have developed a method of generating transitional motion satisfying both conditions.


Fig. 10: Our sampling method for motion decomposition: For hierarchical B-spline construction, we sample data around the keyposes densely, while sampling data in other parts sparsely. In this example, the data on the vertical straight lines are considered, and the data on the dots are ignored.


Our method consists of two steps: a hierarchical motion decomposition step considering keypose information, and a motion reconstruction step that satisfies the mechanical constraints. For the hierarchical motion decomposition, a hierarchical B-spline is used. It consists of a series of B-spline curves with different knot spacings. Higher layers are based on finer knot spacing and can preserve the higher-frequency components of the original sequence. Compared to a normal B-spline, a hierarchical B-spline is better suited to approximate the high-frequency components of the original motion, and its calculation cost is small.

A hierarchical B-spline is used for representing a motion sequence because of its compactness. Lee et al. proposed a method to solve the space-time constraints problem efficiently by hierarchical motion decomposition using a hierarchical B-spline [30]. Our hierarchical decomposition method is similar to this, but is extended to conserve the postural information of keyposes as much as possible. We sample the input motion sequence densely around the keyposes, and we add a velocity constraint to make the joint angular velocity zero at the keyposes when decomposing the motion sequence using the hierarchical B-spline. Figure 10 provides an illustration of the data-sampling method for motion decomposition. Vertical lines in this illustration represent the sampled data, and our method uses only the straight lines shown among them.

The motion of the robot is then reconstructed so that it satisfies the mechanical constraints. In this step, we first segment the motion sequence at the music rhythm frames, and then optimize each motion segment so that the resulting motion sequence satisfies the given kinematic constraints. According to the insights obtained through observation of human dance performance, when the musical rhythm gets faster, the higher-frequency components of the original joint angle trajectory are reduced to catch up with the musical rhythm. Likewise, in the optimization process of our method, if a joint angle or joint angular velocity of a body part is beyond its physical limitation, the hierarchical B-spline layers used for motion composition are gradually reduced from the finer to the lower layers. When a joint angle trajectory is decomposed into $N_{\mathrm{layer}}$ layers, the motion segment $q(t)$, which is a unit quaternion sequence, is represented as the product of $q_l(t)$ $(1 \leq l \leq N_{\mathrm{layer}})$. The optimization finds the maximum values of $n$ and $w_n$, the number of layers and the weighting factor, in the following formula under the condition that $q'(t)$ satisfies the mechanical constraints:

$$q'(t) = \prod_{l=1}^{n} \bigl(q_l(t)\bigr)^{w_l}, \tag{8}$$

where $1 \leq n \leq N_{\mathrm{layer}}$, $0 < w_n \leq 1$, and $w_l = 1$ $(1 \leq l \leq n-1)$.
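As an illustration of Eq. (8), a quaternion power can be realized by scaling the rotation angle, and the layers can then be composed by quaternion multiplication; the layer decomposition itself is assumed to be given, and the sketch is ours rather than the authors' implementation:

```python
import numpy as np

def quat_pow(q, w):
    """Raise a unit quaternion (x, y, z, w scalar-last) to the power w
    by scaling its rotation angle."""
    v, s = q[:3], np.clip(q[3], -1.0, 1.0)
    if np.linalg.norm(v) < 1e-9:
        return np.array([0.0, 0.0, 0.0, 1.0])       # (near-)identity rotation
    axis = v / np.linalg.norm(v)
    half = 0.5 * w * (2.0 * np.arccos(s))
    return np.concatenate([axis * np.sin(half), [np.cos(half)]])

def quat_mul(a, b):
    """Hamilton product, scalar-last convention."""
    av, aw = a[:3], a[3]
    bv, bw = b[:3], b[3]
    v = aw * bv + bw * av + np.cross(av, bv)
    return np.concatenate([v, [aw * bw - np.dot(av, bv)]])

def compose_layers(q_layers, n, w_n):
    """Eq. (8): q'(t) = prod_{l=1..n} q_l(t)^{w_l}, with w_l = 1 for l < n.
    q_layers is the list of per-layer quaternions at one time frame."""
    q = np.array([0.0, 0.0, 0.0, 1.0])
    for l in range(n):
        w = w_n if l == n - 1 else 1.0
        q = quat_mul(q, quat_pow(q_layers[l], w))
    return q
```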

Motion blending is performed around the discontinuities that may arise due to the difference in the numbers of layers used in neighboring segments. Let A and B be neighboring motion segments. We interpolate the joint rotation sequences of the neighboring motion segments, $q_A$ and $q_B$, using SLERP interpolation:

$$q(t) = \mathrm{SLERP}\bigl(q_A(t),\, q_B(t);\ \alpha((t - t_{\mathrm{st}})/L)\bigr), \tag{9}$$

where $t_{\mathrm{st}}$ represents the starting frame of the interpolation, $L$ represents the duration of the interpolation, and $\alpha(t)$ is a $C^2$-continuous quintic polynomial function given as

$$\alpha(t) = -6t^5 + 15t^4 - 10t^3 + 1. \tag{10}$$
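A sketch of the blending of Eqs. (9) and (10), assuming scalar-last unit quaternions for the two segments' joint rotations:

```python
import numpy as np

def alpha(u):
    """C2-continuous blending weight of Eq. (10); alpha(0) = 1, alpha(1) = 0."""
    return -6 * u**5 + 15 * u**4 - 10 * u**3 + 1

def slerp(qa, qb, s):
    """Spherical linear interpolation between unit quaternions qa and qb."""
    dot = np.dot(qa, qb)
    if dot < 0.0:                       # take the shorter great-circle arc
        qb, dot = -qb, -dot
    theta = np.arccos(min(dot, 1.0))
    if theta < 1e-6:
        return qa
    return (np.sin((1 - s) * theta) * qa + np.sin(s * theta) * qb) / np.sin(theta)

def blend(qa, qb, t, t_st, L):
    """Eq. (9): SLERP blend of the two segments' rotations over [t_st, t_st + L],
    weighted by alpha, which decays from 1 to 0 across the window."""
    return slerp(qa, qb, alpha((t - t_st) / L))
```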

Figure 11 shows the result of applying our algorithm to the Aizu-bandaisan dance. In this figure, green lines represent the posture reconstructed using only the first layer, the lowest layer. Yellow lines represent the posture using the first through the third layers, red lines represent the posture using the first through the fifth layers, and white lines represent the original posture. The reconstructed posture and the original posture become more similar as higher layers are used, and both are almost equal when the fifth layer is used. Comparing the keyposes with the transitional poses, the variance in the results from different levels of layers is smaller at the keypose postures. Our algorithm enables us to reproduce keyposes more precisely than transitional motion, and this is a significant advantage.

VI. GENERATING WHOLE BODY MOTION

This section describes a method to generate whole-body motion that can be executed by a humanoid robot. As described above, leg motion is generated based on leg task models with skill parameters, and upper-body motion is generated by the keypose-based model with skill trajectories. The simplest way to generate whole-body motion is direct concatenation of the leg and upper-body motion. However, because dynamic conditions, such as balance, are not considered in direct concatenation, this motion is often not executable by a humanoid robot. In order to generate executable motion, we use a dynamic filter and conduct skill refinement. The dynamic filter compensates the zero moment point (ZMP) and the yaw-axis moment. The skill refinement step also resolves other kinematic problems such as self-collision.

Fig. 11: Result of generating transitional motion (panels: keypose 1, transition, keypose 2, transition, keypose 3): Green represents the posture reconstructed using only the first (lowest) layer, yellow represents the posture using the first to the third layers, red represents the posture using the first to the fifth layers, and white represents the original posture.


A. Compensating ZMP

The dynamic balance of a whole-body motion that prevents the robot from falling down can be obtained by considering the ZMP position. The ZMP [31] is defined as the point where the horizontal element of the moment induced by the ground reaction becomes zero. It is a convenient tool for expressing the dynamic condition that prevents the robot from falling down. The imaginary ZMP (IZMP) is obtained by calculating the ZMP under the assumption that the sole of the support foot has stable contact with the floor. The stable contact condition in dynamics is achieved when the IZMP is inside the support area, the convex hull of the actual sole planes on the floor. We call this condition the ZMP condition.

We employ Nishiwaki's method [32] to achieve the ZMP condition by adjusting the horizontal waist position. A desired ZMP trajectory satisfying the ZMP condition, obtained from the IZMP trajectory, is input to the method. Then, the method adjusts the waist position horizontally in order to realize the desired ZMP trajectory. To reduce the calculation cost, the support foot is allowed to skate on the floor in this calculation. Next, the leg motion is reconstructed by an inverse kinematics calculation, but in this phase the calculation is performed under the condition that the support foot is fixed on the floor. Because of this condition, the IZMP of the new leg motion is not equal to the desired ZMP trajectory. However, the IZMP becomes approximately close to the desired ZMP through this procedure; the results converge sufficiently by applying this procedure repeatedly.

B. Compensating Yaw-Axis Moment

The yaw-axis moment that the robot exerts on the floor is canceled by the friction between the soles and the floor when the moment is small. However, if the moment becomes so large as to exceed the maximum reactive moment created by the friction, the robot slips on the floor. Therefore, even when the original dancer does not spin during a dance performance, we must consider the problem of preventing the robot from spinning when generating robot motion.

Tamiya et al. [5] proposed a method for constraining the whole-body moment to remain under a constant value by a compensation technique. We use the yaw-moment compensation part of the method as a filter in the task generation system in order to generate motion that does not cause a spin in the actual robot. Tamiya's method can use any set of joints for compensation with arbitrary weights. Our filter uses only the yaw-axis joint between the chest and the waist for compensating the yaw moment of the whole body. Tamiya's method can deal only with single-leg support; this is not a serious problem because the moment due to friction is sufficient for preventing spins in most cases when the robot is supported by both legs. Thus the compensation is applied only during single-leg support periods.

C. Skill Refinement

The dynamic filter described in the previous section is only concerned with the contact condition between the feet and the floor. It is also necessary to consider the issues arising from the physical shape of the robot. Such possible faults include exceeding the reachable step distance, self-collisions, overruns of the joint-angle range, and over-speed beyond the angular velocity limit. These faults are due to the fact that the skill parameters obtained from human motions cannot necessarily be executed on the robot because of its mechanical constraints and the difference in body shape. These faults must be eliminated by adjusting the skill parameters to the robot body.

The occurrence of these faults is difficult to find without executing the motion, because a humanoid robot has many degrees of freedom and a complex body shape, and also because the motion is generated through many processes with many factors. The task generation system therefore simulates the result of executing the motion by kinematics calculation. If faults are detected in the simulation, the skill parameters of the related task are modified.

For most of the possible faults, a value of the skill parameters that does not cause the fault lies close to the original value, because the original value was actually obtained from a dance performance by a human. In addition, since the number of skill parameters is small, the number of candidates for modification is limited. These features make it possible to resolve most faults by simple rules for modifying skill parameters. Typical faults and the corresponding rules for resolving them are shown in Figure 12.

In order to make it easy to apply this modification, we developed an integrated software platform. Figure 13 shows a screenshot of the software. Human motions, task sequences, and motions of robots are visualized in both graphical and numerical forms synchronized with each other. A user can easily check faults and interactively modify them through the graphical user interface (GUI).


Fig. 12: Typical faults and the rules for resolving them: (a) is an overrun of the angle limit of a coxa yaw joint. (b) is an overrun of the possible step distance. (c) is a self-collision between the knee joints. (d) is a self-collision during stepping.

Fig. 13: Software platform for skill refinement


VII. EXPERIMENT

In an experiment, we chose a traditional Japanese folk dance called Aizu-bandaisan. The dance includes many dynamic-style steps with various characteristics. We used an optical motion-capture system (Vicon) for obtaining human motions. The system we used consisted of eight infra-red video cameras and 34 body markers. Three-dimensional positions of the markers were captured at a rate of 120 frames/sec. We captured motions of the dance performed by two dancers: a female dancer and a male dancer. They performed in time to the same music. From this data, we extracted the initial 35 seconds as experimental data. The experiment involved four repetitions of the same choreography pattern.

The humanoid robot HRP-2 [33] was used for the experiment. HRP-2 is a biped humanoid robot whose whole body has 30 joint degrees of freedom. Its height (1.54 m) and weight (58 kg) are similar to those of a human. We used the control system of OpenHRP [34], [35] to control HRP-2 according to the generated motion data. The system basically consists of a sequence controller and a stabilizer. For each joint, the sequence controller follows a given joint angle trajectory with PD control. In addition, the stabilizer slightly modifies the waist position in order to correct errors between the given reference ZMP and the actual ZMP obtained by force sensors. Although the generated motions are consistent in terms of dynamics, this kind of feedback stabilization control is necessary due to disturbances and physical model errors.

The robot HRP-2 successfully performed the generated motions. The performances were sufficiently stable because the sole of the support foot was always kept flat on the floor. Figure 14 shows the imitative performance by HRP-2 and the original performance by the female dancer.

The robot performance imitating a male dancer of the same folk dance is shown in the attached video, and is compared to the performance imitating the female dancer. The two dancers performed the same dance, but there were individual styles in their performances. These individual styles are preserved in the corresponding robot dances, and this is one of the significant contributions of our LFO-based approach.

In the attached video, the performance of another dance by another humanoid robot is also shown. We made the humanoid robot HRP-1S, an older version of HRP-2, perform the Tsugaru-jongarabushi dance. Because HRP-1S is a lower-performance robot, it cannot move at the original tempo due to its lesser motor capability; instead, it danced at half the tempo. Here, we emphasize that our method can be easily applied to a new robot with different dimensions. This shows the versatility of our LFO-based approach.

VIII. CONCLUSION

This paper describes a dancing robot that imitates human dance performance based on the LFO paradigm. By using the LFO paradigm, the essential factors of motion are extracted as task models, and so we can reproduce human motion efficiently. In this method, upper-body motion and leg motion are first generated separately. For leg motion, we design leg task models and obtain a task sequence. For upper-body motion, we propose a keypose-based method and compose motion for a humanoid robot. Next, the upper-body motion and the leg task sequence are integrated, dynamic filters are applied to them, and skill refinement is conducted. Finally, a humanoid robot performs the reconstructed dance motion. In addition, we developed an integrated software platform to easily debug the generated motions through a GUI.

Fig. 14: Dance performance of Aizu-bandaisan by HRP-2 and an expert dancer at the original music tempo

Although we have restricted our discussion to a dancing robot in this paper, we believe that the LFO paradigm is a useful and powerful framework not only for the robotics field, but also for all fields where human motion is involved, including CG animation.

REFERENCES

[1] Katsushi Ikeuchi and Takeshi Suehiro. Toward an assembly plan from observation, part I: Task recognition with polyhedral objects. IEEE Transactions on Robotics and Automation, 10(3):368–385, 1994.

[2] Jun Takamatsu, Taku Morita, Koichi Ogawara, Hiroshi Kimura, and Katsushi Ikeuchi. Representation for knot-tying tasks. IEEE Transactions on Robotics, 22(1):65–78, 2006.

[3] Marcia Riley, Ales Ude, and Christopher G. Atkeson. Methods for motion generation and interaction with a humanoid robot: Case studies of dancing and catching. In Proceedings of AAAI and CMU Workshop on Interactive Robotics and Entertainment 2000, pages 35–42, Pittsburgh, Pennsylvania, 2000.

[4] Nancy S. Pollard, Jessica K. Hodgins, Marcia J. Riley, and Christopher G. Atkeson. Adapting human motion for the control of a humanoid robot. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation (ICRA), pages 1390–1397, 2002.

[5] Yukiharu Tamiya, Masayuki Inaba, and Hirochika Inoue. Realtime balance compensation for dynamic motion of full-body humanoid standing on one leg. Journal of the Robotics Society of Japan, 17(2):268–274, 1999. (in Japanese).

[6] Satoshi Kagami, Fumio Kanehiro, Yukiharu Tamiya, Masayuki Inaba, and Hirochika Inoue. Autobalancer: An online dynamic balance compensation scheme for humanoid robots. In Proceedings of the Fourth International Workshop on Algorithmic Foundations of Robotics (WAFR 2000), 2000.

[7] Katsu Yamane and Yoshihiko Nakamura. Dynamics filter - concept and implementation of on-line motion generator for human figures. IEEE Transactions on Robotics and Automation, 19(3):421–432, 2003.

[8] Yoshihiro Kuroki, Masahiro Fujita, Tatsuo Ishida, Ken’ichiro Nagasaka, and Jun’ichi Yamaguchi. A small biped entertainment robot exploring attractive applications. In Proceedings of the 2003 IEEE International Conference on Robotics and Automation (ICRA), pages 471–476, 2003.

[9] Yoshihiro Kuroki, Bill Blank, Tatsuo Mikami, Patrick Mayeux, Atsushi Miyamoto, Robert Playter, Ken’ichiro Nagasaka, Marc Raibert, Masakuni Nagano, and Jin’ichi Yamaguchi. Motion creating system for a small biped entertainment robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1394–1399, 2003.

[10] Armin Bruderlin and Lance Williams. Motion signal processing. In Computer Graphics (SIGGRAPH 95 Proceedings), pages 97–104, 1995.

[11] Andrew Witkin and Zoran Popović. Motion warping. In Computer Graphics (SIGGRAPH 95 Proceedings), pages 105–108, 1995.

[12] Michael Gleicher. Retargetting motion to new characters. In Computer Graphics (SIGGRAPH 98 Proceedings), pages 33–42, 1998.

[13] Jehee Lee and Sung Yong Shin. A coordinate-invariant approach to multiresolution motion analysis. Graphical Models, 63(2):87–105, 2001.

[14] Douglas J. Wiley and James K. Hahn. Interpolation synthesis of articulated figure motion. IEEE Transactions on Computer Graphics and Applications, 17(6):39–45, 1997.

[15] Lucas Kovar and Michael Gleicher. Flexible automatic motion blending with registration curves. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 214–224, 2003.

[16] Lucas Kovar, Michael Gleicher, and Frederic Pighin. Motion graphs. ACM Transactions on Graphics (SIGGRAPH 2002), 21(3):473–482, 2002.

[17] Okan Arikan and D. A. Forsyth. Interactive motion generation from examples. ACM Transactions on Graphics (SIGGRAPH 2002), 21(3):483–490, 2002.

[18] Miti Ruchanurucks, Shin’ichiro Nakaoka, Shunsuke Kudoh, and Katsushi Ikeuchi. Humanoid robot motion generation with sequential physical constraints. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA), pages 2649–2654, 2006.

[19] Seyoon Tak, Oh-Young Song, and Hyeong-Seok Ko. Motion balance filtering. Computer Graphics Forum, 19(3):435–446, 2000.

[20] C. Karen Liu and Zoran Popović. Synthesis of complex dynamic character motion from simple animations. ACM Transactions on Graphics (SIGGRAPH 2002), 21(3):408–416, 2002.

[21] Anthony C. Fang and Nancy S. Pollard. Efficient synthesis of valid human motion. ACM Transactions on Graphics (SIGGRAPH 2003), 22(3):417–426, 2003.

[22] Victor B. Zordan, Anna Majkowska, Bill Chiu, and Matthew Fast. Dynamic response for motion capture animation. ACM Transactions on Graphics (SIGGRAPH 2005), 24(3):697–701, 2005.

[23] Christopher Peters and Carol O’Sullivan. Bottom-up visual attention for virtual human animation. In Proceedings of the International Conference on Computer Animation and Social Agents, pages 111–117, 2003.

[24] Takeshi Sakuma, Tomohiko Mukai, and Shigeru Kuriyama. Psychological model for animating crowded pedestrians. Computer Animation and Virtual Worlds, 16(3-4):343–351, 2005.

[25] M. Stone, D. DeCarlo, I. Oh, C. Rodriguez, A. Stere, A. Lees, and C. Bregler. Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Transactions on Graphics (SIGGRAPH 2004), 23(3):506–513, 2004.

[26] Tae-Hoon Kim, Sang Il Park, and Sung Yong Shin. Rhythmic-motion synthesis based on motion-beat analysis. ACM Transactions on Graphics (SIGGRAPH 2003), 22(3):392–401, 2003.

[27] Gazihan Alankus, A. Alphan Bayazit, and O. Burchan Bayazit. Automated motion synthesis for dancing characters. Computer Animation and Virtual Worlds, 16(3-4):259–271, 2005.

[28] Hyun-Chul Lee and In-Kwon Lee. Automatic synchronization of background music and motion in computer animation. Computer Graphics Forum (Eurographics 2005), 24(3):353–361, 2005.

[29] Masataka Goto. An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research, 30(2):159–171, 2001.

[30] Jehee Lee and Sung Yong Shin. A hierarchical approach to interactive motion editing for human-like figures. In Computer Graphics (SIGGRAPH 99 Proceedings), pages 39–48, 1999.

[31] M. Vukobratovic, B. Borovac, D. Surla, and D. Stokic. Biped Locomotion: Dynamics, Stability, Control and Application, volume 7 of Scientific Fundamentals of Robotics. Springer-Verlag, 1990.

[32] Koichi Nishiwaki, Satoshi Kagami, Yasuo Kuniyoshi, Masayuki Inaba, and Hirochika Inoue. Online generation of humanoid walking motion based on a fast generation method of motion pattern that follows desired ZMP. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2684–2689, 2002.

[33] Kenji Kaneko, Fumio Kanehiro, Shuuji Kajita, Hirohisa Hirukawa, Toshikazu Kawasaki, Masaru Hirata, Kazuhiko Akachi, and Takakatsu Isozumi. Humanoid robot HRP-2. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA), pages 1083–1090, 2004.

[34] Kazuhito Yokoi, Fumio Kanehiro, Kenji Kaneko, Kiyoshi Fujiwara, Shuji Kajita, and Hirohisa Hirukawa. A Honda humanoid robot controlled by AIST software. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pages 259–264, 2001.

[35] Fumio Kanehiro, Kiyoshi Fujiwara, Shuuji Kajita, Kazuhito Yokoi, Kenji Kaneko, Hirohisa Hirukawa, Yoshihiko Nakamura, and Katsu Yamane. Open architecture humanoid robotics platform. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation (ICRA), pages 24–30, 2002.

[35] Fumio Kanehiro, Kiyoshi Fujiwara, Shuuji Kajita, Kazuhito Yokoi,Kenji Kaneko, Hirohisa Hirukawa, Yoshihiko Nakamura, and KatsuYamane. Open architecture humanoid robotics platform. In Proceedingsof the 2002 IEEE International Conference on Robotics and Automation(ICRA), pages 24–30, 2002.