


COLLABORATIVE CONTROL IN A HUMANOID DYNAMIC TASK

Diego Pardo and Cecilio Angulo
GREC - Knowledge Engineering Research Group

UPC - Technical University of Catalonia
Avda. Victor Balaguer s/n, Vilanova i la Geltru, Spain

[email protected], [email protected]

Keywords: Robot control architecture, Sensorimotor learning, Coordination policy, Reinforcement learning.

Abstract: This paper describes a collaborative control scheme that governs the dynamic behavior of an articulated mobile robot with several degrees of freedom (DOF) and redundancies. These types of robots need a high level of coordination among their motors to complete their motions. In the employed scheme, the actuators involved in a specific task share information, computing integrated control actions. The control functions are found using a stochastic reinforcement learning technique, allowing the robot to generate them automatically based on experience. This type of control is based on a modularization principle: complex overall behavior is the result of the interaction of individual simple components. Unlike standard procedures, this approach is not meant to follow a trajectory generated by a planner; instead, the trajectory emerges as a consequence of the collaboration among joint movements while seeking the achievement of a goal. The learning of sensorimotor coordination in a simulated humanoid is presented as a demonstration.

1 INTRODUCTION

Robots with several Degrees of Freedom (DOF) and redundant configurations are constructed more and more frequently; humanoids like Qrio (Kuroki et al., 2003), Asimo (Hirai et al., 1998) or HRP-2 (Kaneko et al., 2004), and entertainment robots like Aibo (Fujita and Kitano, 1998) are examples of this. Complex movements in complex robots are not easy to calculate; the participation of multiple joints and their synchronization requirements demand novel approaches that endow the robot with the ability of coordination. An attempt to drive the dynamics of its body optimally, measuring its performance in every possible configuration, would lead to a combinatorial explosion of the solution space.

The habitual use of simplified mathematical models to represent complex robotic systems, i.e., approximating non-linearities and uncertainties, would result in policies that execute only approximately optimal control. Thus, two major assumptions guide the developed work. First, there is no pre-established mathematical model of the physics of the robot's body from which a control law could be computed; and second,

the control design philosophy is focused on the action performance of the robot and not on trajectory tracking by its joints. As an alternative, stochastic reinforcement learning techniques applied to numerical simulation models are studied. An optimal control problem, where a cost function is minimized to compute the policies, is stated.

The goal of this work is to solve a multi-joint robot motion problem where coordination is needed. Planners and trajectory generators are usually in charge of complex motions in which many joints are involved and the mechanical stabilization of the robot is at risk; the relationship between sensory signals and motor commands at the level of dynamics is viewed as a low-level brick in the construction of the control hierarchy. Here we use a dynamic control scheme that also solves the trajectory problem.

It is important to mention that previous work used coordination at the level of dynamics to extend the capabilities of a manipulator robot (Rosenstein and Barto, 2001); their objective was to use a biologically inspired approach to exploit synergistic movements between joints, like the relationships among human muscles. They used hand-made, pre-established PID controllers in


every joint of the robot, and the final torque applied to each motor is a linear combination of the PIDs' outputs, from which collaboration emerges. The combination parameters are obtained by direct search methods (Rosenstein, 2003). By using linear controllers to solve a nonlinear problem, they implement a hierarchical motor program that runs various feedback controllers, i.e., they switch between PIDs (parameters and goal) in the middle of the motion.

Here we extend that result by proposing a more general scheme, where sensor information is processed in independent and specific layers to produce coordinated control actions. Furthermore, in the architecture definition, the control functions are not restricted to PIDs and their linear combinations. Additionally, we use reinforcement learning to compute all the controllers of the architecture.

This paper is organized as follows. Section 2 is devoted to the presentation of the architecture, and a linear implementation of practical use is outlined in Section 3; Section 4 describes a simulation experiment of a robot equilibrium dynamic task, Section 5 presents the results, and Section 6 gathers the conclusions and points to future work.

2 LAYERED COOPERATIVE CONTROL

The Layered Cooperative Control (LCC) scheme is meant to generate motions while solving a Dynamical Task (DT). It is assumed that the final configuration of the DT is known, i.e. the set point of the joint angles is pre-established, but how to coordinate motions to reach this configuration, subject to the body dynamics, is unknown. It is also assumed that the states of the system are observable and that each joint position is commanded by an independent servo.

Figure 1 represents schematically the main idea of the LCC. The control actions that drive the motors are processed by two layers of controller functions. Each component of the first layer (D) manipulates the dynamics of a single joint; the output of this layer intends to position its corresponding joint at a previously specified angle, but it is filtered by the second layer of controllers (f). Each motor is driven by a different function, which calculates the control action based on all the signals originated in the previous layer.

The controllers of the first layer are related to the dynamics of each link; they must be established only once. The second layer of functions (f) must be designed for the DT intended to be completed.

Figure 1: LCC basic idea: two layers of control.

Functionally and computationally speaking, the architecture is layered; at every time step two mappings take place, one compensating position error and the other seeking coordination between joints.

Every controller of the second layer (f) uses the information originated in all the first layer's controllers. This collaboration allows each joint controller to know the dynamical state of the others, and then to act accordingly in order to complete the DT.

2.1 Architecture Formulation

Let the articulated mobile robot have n joints. Each joint has an error-based controller D_i (i = 1, ..., n) in the first layer. This is a SISO function whose input is the position error of its respective joint (e_i) and whose output is a control action (u_i). The error is calculated using the sensor information providing the actual position of the joint (θ_i) and the already known set point or goal configuration (θ*_i):

$$e_i = \theta_i - \theta_i^*$$

$$u_i = D_i(e_i), \quad i = 1, \dots, n$$

Let the dynamical task have associated n controllers f_i in the second layer (i = 1, ..., n); these are MISO functions whose inputs are the n outputs u_i of the first layer described above. The output of the second-layer function, v_i, is the velocity applied to the corresponding motor (M_i):

$$v_i = f_i(u_1, \dots, u_n), \quad i = 1, \dots, n$$

The robot state is represented by a continuous-time dynamical system controlled by a parameterized policy,

$$\dot{x} = g(x) + h(x) \cdot v \tag{1}$$

$$v = \pi_w(x) \tag{2}$$

where vector x represents the states of the robot, g and h its dynamics, v the control action, and π the control policy parameterized by the vector w.
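As an illustration, the following sketch shows how the two mappings of the LCC could be composed at each time step. It is an interpretation, not the authors' code: the class and function names and the simple proportional first-layer law are assumptions (the controllers actually learned in Section 3 are PIDs).

```python
import numpy as np

class FirstLayerController:
    """D_i: SISO error-based controller for one joint (a plain
    proportional law stands in for the learned controller)."""
    def __init__(self, gain):
        self.gain = gain

    def __call__(self, error):
        return self.gain * error

class CoordinationLayer:
    """f: MISO coordination functions; each motor velocity depends on
    the outputs of ALL first-layer controllers (here a linear map v = W u)."""
    def __init__(self, W):
        self.W = np.asarray(W)

    def __call__(self, u):
        return self.W @ np.asarray(u)

def lcc_step(theta, theta_goal, D_layer, f_layer):
    """One LCC time step: position errors -> first layer -> second layer."""
    e = np.asarray(theta) - np.asarray(theta_goal)         # e_i = theta_i - theta_i*
    u = np.array([D(e_i) for D, e_i in zip(D_layer, e)])   # u_i = D_i(e_i)
    return f_layer(u)                                      # v_i = f_i(u_1, ..., u_n)

# Example: n = 3 joints, identity coordination (no cross-joint coupling yet)
D_layer = [FirstLayerController(1.0) for _ in range(3)]
f_layer = CoordinationLayer(np.eye(3))
v = lcc_step([0.0, 0.0, 0.0], [0.4, 1.5, 1.5], D_layer, f_layer)
```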


2.2 LCC Synthesis

Once the control scheme has been stated, the next step is the definition of a methodology to compute the functions (f_i, D_i). Here a policy gradient reinforcement learning (PGRL) algorithm is employed. PGRL methods (see (Williams, 1992), (Sutton et al., 2000)) are based on measuring the performance of a parameterized policy π_w applied as a control function during a delimited amount of time. In order to measure the performance of the controller, the following function is defined,

$$V(w) := J(x, \pi_w(x)) \tag{3}$$

where the performance V of the parameters w is measured by means of the cost function J.

By restricting the scope of the policy to a certain class of parameterized functions u = π_w(x), the performance measure (3) becomes a surface whose maximum corresponds to the optimal set of parameters w ∈ R^d. The search for the maximum can be performed by standard gradient ascent techniques,

$$w_{k+1} = w_k + \eta \nabla_w V(w) \tag{4}$$

where η is the step size and ∇_w V(w) is the gradient of V(w) with respect to w. The analytical formulation of this gradient is not possible without a mathematical model of the robot, and its numerical computation is not straightforward either; therefore, a stochastic approximation algorithm is employed: 'weight perturbation' (Jabri and Flower, 1992), which estimates the unknown gradient using a Gaussian random vector to orient the change in the parameter vector. This algorithm is selected due to its good performance, easy derivation and fast implementation; note that the focus of this research is not the choice or development of a specific algorithm, but rather the cognitive architecture that provides robots with the ability to learn coordination.

This algorithm uses the fact that, by adding to w a small Gaussian random term z with E{z_i} = 0 and E{z_i z_j} = σ² δ_ij, the following expression is a sample of the desired gradient:

$$\alpha(w) = [J(w+z) - J(w)] \cdot z \tag{5}$$

Then, both layers’ controllers can be found using thisPGRL algorithm.

$$u_i = D_{k_i}(e_i), \qquad v_i = f_{w_i}(u_1, \dots, u_n) \tag{6}$$

In order to be consistent with the proposed definition of each layer, the training of the parameter vectors must be carried out independently: first the dynamical layer and then the coordination one. It is assumed that the movement has to take place within a limited amount of time T, during which signals are collected to compute the cost function (3); then the gradient is estimated using (5) and a new set of parameters is obtained by applying the update strategy of (4). Notice that in the case of the first layer, each function D_ki is trained and updated separately from the others in the same layer, but when learning the coordination layer, all the functions f_wi must be updated at the same time, because in this case the performance being measured is that of the whole layer.
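A minimal sketch of this two-stage schedule follows; the routine names, the episode runners and the zero/identity initial guesses are assumptions, and the weight-perturbation search itself is the one detailed in Table 1 below.

```python
import numpy as np

def train_lcc(n_joints, optimize, run_single_joint_episode, run_full_episode):
    """Hypothetical two-stage training schedule for the LCC layers.

    optimize(evaluate, w0) is assumed to perform the weight-perturbation
    search of Eqs. (4)-(5) and return the best parameters found; the two
    episode runners are assumed to return the measured cost of Eq. (3).
    """
    # Stage 1: each first-layer controller D_i is trained on its own,
    # using the cost of an episode in which only joint i moves.
    first_layer = [
        optimize(lambda k, i=i: run_single_joint_episode(i, k),
                 np.zeros(3))                 # e.g. PID gains, initial guess assumed
        for i in range(n_joints)
    ]

    # Stage 2: all coordination functions f_i are updated at once,
    # because the measured cost is that of the whole layer.
    W = optimize(lambda W: run_full_episode(first_layer, W),
                 np.eye(n_joints))            # start from the identity matrix
    return first_layer, W
```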

3 LINEAR IMPLEMENTATION

The previous description of the LCC is too general to be of practical use; it is therefore necessary to make some assumptions about the type of functions to be used as controllers. A well-known structure to control the dynamic position of one link is the Proportional-Integral-Derivative (PID) error compensator. It has the following form,

$$u_i = K_{P_i} \cdot e_i + K_{D_i} \cdot \frac{de_i}{dt} + K_{I_i} \cdot \int e_i \, dt \tag{7}$$

The functionality of its three terms (K_P, K_D, K_I) offers management of both transient and steady-state responses; therefore, it is a generic and efficient solution to real-world control problems. Through this structure, a link between optimal control and PID compensation is revealed for robotic applications. Other examples of optimization-based techniques for tuning a PID are (Daley and Liu, 1999; Koszalka et al., 2006).

The purpose of the second layer is to compute the actual velocity to be applied to each motor by gathering information about the state of the robot while executing a DT. The PID output signals are collected and filtered by this function in order to coordinate them. Perhaps the simplest structure to manage this coordination is a gain row vector W_i, letting the functions in the second layer be the following,

$$f_{w_i}(u) = W_i \cdot u$$

Then, a linear combination of u commands the coordination. The matrix W_{DT_m} encapsulates all the information about this layer for the m-th DT.

$$W_{DT_m} = \begin{pmatrix} w_{11} & \dots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{n1} & \dots & w_{nn} \end{pmatrix} \tag{8}$$

where the term w_ij is the specific weight of the j-th PID in the velocity computation of joint i.
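As a brief numeric illustration of Eq. (8), the velocity of each joint is a weighted mixture of all PID outputs. The matrix and signal values below are invented; only the structure and the meaning of w_ij come from the text.

```python
import numpy as np

# Coordination matrix for one dynamic task: row i weighs the outputs of
# ALL PIDs when computing the velocity of joint i (illustrative values).
W_DT = np.array([
    [1.0, -0.2,  0.0],
    [0.0,  1.0,  0.1],
    [0.1,  0.0,  1.0],
])

u = np.array([0.5, -0.3, 0.2])   # PID outputs u_1..u_3 (example values)
v = W_DT @ u                     # v_i = W_i . u, velocity sent to motor i
print(v)
```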


4 HUMANOID COORDINATION

Classical humanoid robot motions are typically very slow in order to maintain stability; e.g., biped walking based on the Zero Moment Point (ZMP) condition (Vukobratovic and Stepanenko, 1972) avoids those states far from the pre-determined trajectory in which stability is guaranteed. By doing this, thousands of trajectories and capabilities of the robot may be restricted. Highly dynamic capabilities need those avoided states in order to perform quick and realistic motions.

4.1 The Simulation Robot Environment

A simulated humanoid is employed as a test-bed to evolve the controllers. It was modeled using Webots (Michel, 2004), a 3D robot simulation platform. Details can be obtained in (Webots).

This particular model of a full-body humanoid robot has a total of 25 DOF, and each link torque is limited to [−10, 10] Nm; additionally, it has a camera, a distance sensor, a GPS, two touch sensors in the feet, and a LED in the forehead. The model includes physical characteristics; the software performs kinematic and dynamic simulation of the interaction between the robot and its environment.

4.2 The One-Leg Equilibrium Task

The goal of the simulation experiment is to reach a final configuration called the 'one-leg equilibrium point' starting from the passive stand-up position. Three joints are involved in this motion (see Figure 2(a)): the longitudinal axis of the back, the transversal axis of the left hip, and the left knee; their goal states, i.e. angles in radians, are θ*_1 = 0.4, θ*_2 = 1.5 and θ*_3 = 1.5 respectively, with final velocity θ̇* = 0 for all the joints. Both configurations, initial and final, are shown in Figure 2(b). The goal configuration has been designed to generate a slight inclination of the back to the right side, changing the center of mass of the body of the robot, thus letting more weight be supported by the right leg and allowing the left leg to lift by means of the bending of the hip and knee articulations.

Figure 3 shows the simulation sequence of the motion of the robot using standard low-level controllers. The robot falls down when it attempts to reach the goal configuration, due to the forces generated by the friction between the floor and its foot soles. The whole body of the robot is destabilized at the start of the movement.


Figure 2: Simulated humanoid. (a) Joints involved in the motion. (b) Initial and goal configurations.

Figure 3: Direct Solution: Failure Demonstration.

4.3 The Learning Procedure

Table 1 shows the PGRL algorithm. A supervisor program is in charge of the training; its EVALUATE subroutine delivers the set of values to be evaluated in the robot controllers. It starts a controlled episode from the initial configuration and attempts to achieve the goal states; every evaluation takes T = 12 s, within which the cost function is measured and returned. The supervisor repeats this operation until a convergence criterion is met or the number of iterations grows too large. In this experiment the PID values are learned, not calculated.

Table 1: Weight perturbation PGRL Algorithm.

input:
    step size η ∈ [0, 1]
    search size σ ≥ 0
    max step ∆max ∈ [0, 1]
initialize:
    W0 ← I ∈ R^{3×3} (identity matrix)
    α ← 0
    J ← EVALUATE(W)
repeat:
    1. z ← N(0, σ)
    2. Jz ← EVALUATE(W + z)
    3. α ← [Jz − J] · z
    4. ∆W ← −η α
    5. ∆W ← min(∆W, ∆max)
    6. W ← W + ∆W
    7. J ← EVALUATE(W)
until convergence or the number of iterations is too large
return W, J
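The following is a runnable sketch of Table 1, assuming a user-supplied evaluate routine that runs one episode and returns the measured cost; the step limit is interpreted here as element-wise clipping and the convergence test as a simple cost-change threshold, both of which are assumptions.

```python
import numpy as np

def weight_perturbation(evaluate, eta=0.05, sigma=0.032, delta_max=0.1,
                        max_iters=500, tol=1e-4):
    """Weight-perturbation PGRL search following Table 1.

    evaluate(W) is assumed to run one controlled episode of length T with
    the candidate parameters W and return the measured cost J.
    """
    W = np.eye(3)                                          # W0 <- I (identity matrix)
    J = evaluate(W)
    for _ in range(max_iters):
        z = np.random.normal(0.0, sigma, size=W.shape)     # z ~ N(0, sigma)
        J_z = evaluate(W + z)                              # Jz <- EVALUATE(W + z)
        alpha = (J_z - J) * z                              # gradient sample, Eq. (5)
        dW = np.clip(-eta * alpha, -delta_max, delta_max)  # step, limited by delta_max
        W = W + dW
        J_new = evaluate(W)                                # J <- EVALUATE(W)
        if abs(J_new - J) < tol:                           # naive convergence test
            return W, J_new
        J = J_new
    return W, J
```

In the experiment described here it would be called with an evaluate function that wraps a 12 s Webots episode and returns the measured cost.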


The following is the discrete implementation of the PID employed:

$$u_i^t = K_P \cdot e_i^t + K_D \cdot (e_i^t - e_i^{t-1}) + K_I \cdot \sum_{j=0}^{t} e_i^j \tag{9}$$

where u_i^t represents the control signal associated with joint i at time step t; it depends on the corresponding position error e_i^t = θ*_i − θ_i^t. This controller must provide zero position error in a single-joint motion; therefore its parameters are learned using the following performance criterion,

$$J_i(V) = -\sum_{t=0}^{T} (\theta_i^* - \theta_i^t)^2 + \sum_{t=0}^{T} (\dot{\theta}_i^t)^2 \tag{10}$$

The PID is tested using a time step of ∆T = 32 ms during T = 12 s. The algorithm constants used are σ = 0.032 and η = 0.05. For the learning of the particular PID in charge of controlling the back, special restrictions are added to the reward function, penalizing those episodes in which the robot falls down. The penalization is proportional to the time the robot stays on the ground.
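A compact sketch of the discrete PID of Eq. (9) and a per-joint cost in the spirit of Eq. (10) is given below; the function names and the finite-difference velocity estimate are assumptions, while the 32 ms time step comes from the text.

```python
import numpy as np

def pid_discrete(errors, KP, KD, KI):
    """Discrete PID of Eq. (9): u_t = KP*e_t + KD*(e_t - e_{t-1}) + KI*sum_{j<=t} e_j."""
    u = np.zeros(len(errors))
    integral, prev_e = 0.0, 0.0
    for t, e in enumerate(errors):
        integral += e
        u[t] = KP * e + KD * (e - prev_e) + KI * integral
        prev_e = e
    return u

def joint_cost(theta, theta_goal, dt=0.032):
    """Per-joint criterion in the spirit of Eq. (10): squared position error
    plus a velocity term (velocity estimated here by finite differences)."""
    theta = np.asarray(theta, dtype=float)
    err = np.sum((theta_goal - theta) ** 2)
    vel = np.sum((np.diff(theta) / dt) ** 2)
    return -err + vel   # sign convention kept as printed in Eq. (10)
```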

The linear parameterized function selected for the coordination layer is:

$$v^t = W \cdot u^t \tag{11}$$

where v^t ∈ R³ are the actual final velocities, u^t ∈ R³ the outputs of the PIDs, and W ∈ R^{3×3} is the transformation that contains the parameters to be found (W_ij).

The algorithm starts with a diagonal matrix as coordination controller, i.e. W_0 = I, meaning that the first collaborative control function attempted is the one in which just the original PID controllers act. The non-diagonal values of W deliver coordination. The same algorithm constants and time step are used in this stage of the LCC implementation.

5 RESULTS

5.1 PID Controller

Several local minima are found. Figure 4(a) shows the output of the back joint using one of the PID solutions found. It presents a large overshoot but no stationary error for a set point of θ* = 0.3, and, mainly, the robot learns how 'not to fall'. The behavior of the back joint is equivalent to that of an inverted pendulum with a big mass being manipulated (the robot torso). For this joint, and for a better controlled output, a more complex parameterized policy would be needed. On the other hand, Figure 4(b) presents the output of the hip joint. It is evident that


Figure 4: (a) Back-joint position output (radians) using a learned PID (set point = 0.3 rad); (b) Hip-joint position output (radians) using a learned PID (set point = 1.5 rad).

a better solution is achieved; this is because the training of this PID is done in a lying-down position, with no risk of falling, and the mass of the controlled link is insignificant compared with that of the back-joint case.

Final values for the independent joint controllers (back, hip and knee) are:

PID1 = [0.746946, 0.109505, 0.0761541]
PID2 = [0.76541, 0.0300045, 0]
PID3 = [0.182792, 0.0114312, 0.00324704]

5.2 Coordination Layer

After several iterations, the best solution is found; a sequence of the learned humanoid motion is depicted in Figure 5. The associated coordination controller is the following:

$$W = \begin{pmatrix} 0.6612900 & -0.3643360 & -0.1936260 \\ -0.0497653 & 1.0266600 & -0.0684569 \\ 0.0285586 & 0.1469820 & 0.9642390 \end{pmatrix}$$

Note how the non-diagonal parameters have grown, allowing the coordination.


Figure 5: Simulated Humanoid One-Leg Equilibrium task.


Figure 6: (a) Reward evolution J(W) of the best learned solution (last 20 trials); (b) Back-joint position before (dotted line) and after the coordination learning.

This characteristic has been previously pointed out in (Rosenstein and Barto, 2001), but here it appears in an unstable 3D system task.

For this particular solution, the evolution of the reward is presented in Figure 6(a). Once the robot learns how 'not to fall', the convergence rate increases and the final results reach lower values.

It is important to mention that the robot exhibits some absolute displacement, tending to move backwards. A better-designed reward function is needed to avoid absolute displacements, but a good solution is harder to find in that scenario. Finally, it is important to emphasize the change in the behavior of the back joint after the second-layer phase: it seems that the system learns that the overshoot must be shorter in time in order to be able to lift the leg without falling down. Figure 6(b) shows the output of the back joint before and after the second layer of learning.

6 CONCLUSIONS

The LCC scheme proposes to use the dynamic interaction of the robot's articulations as a trajectory generator, without the use of a trajectory planner. Exploiting dynamics gives the robot the ability to explore motion interactions that result in optimized behaviors: learning at the level of dynamics in order to succeed in coordination.

The presented solution overcomes the model dependency pointed out above; through the presentation of a systematic control scheme, the ideas of (Rosenstein and Barto, 2001) are extended. Here, the low-level controllers are found using learning techniques, and the formulation of a control architecture allows the implementation of different parameterized policies.

The interaction between Machine Learning, Control Systems and Robotics creates an alternative for the generation of artificial systems that consistently demonstrate some level of cognitive performance.

Currently, highly dynamic motions in humanoids are widely unexplored. The capabilities of these robots will be extended if motions that compromise the whole-body dynamics are explored. The LCC has been tested within a simulated humanoid and succeeds in the performance of a very fast motion, one in which the whole-body equilibrium is at risk.

Coordination is possible thanks to the sharing ofinformation.

ACKNOWLEDGEMENTS

This work has been partly supported by the Spanish MEC project ADA (DPI2006-15630-C02-01). The authors would like to thank Olivier Michel and Ricardo Tellez for their support and great help.

REFERENCES

Daley, S. and Liu, G. (1999). Optimal PID tuning using direct search algorithms. Computing and Control Engineering Journal, 10(2):51–56.

Fujita, M. and Kitano, H. (1998). Development of an autonomous quadruped robot for robot entertainment. Autonomous Robots, 5(1):7–18.

Hirai, K., Hirose, M., Haikawa, Y., and Takenaka, T. (1998). The development of Honda humanoid robot. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA.

Jabri, M. and Flower, B. (1992). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Transactions on Neural Networks, 3(1):154–157.

Kaneko, K., Kanehiro, F., Kajita, S., Hirukawa, H., Kawasaki, T., Hirata, M., Akachi, K., and Isozumi, T. (2004). Humanoid robot HRP-2. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA.

Koszalka, L., Rudek, R., and Pozniak-Koszalka, I. (2006). An idea for using reinforcement learning in adaptive control systems. In Proceedings of the IEEE International Conference on Networking, Systems and Mobile Communications and Learning Technologies, Kerkrade, Netherlands.

Kuroki, Y., Blank, B., Mikami, T., Mayeux, P., Miyamoto, A., Playter, R., Nagasaya, K., Raibert, M., Nagano, M., and Yamaguchi, J. (2003). A motion creating system for a small biped entertainment robot. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, IROS.

Michel, O. (2004). Webots: Professional mobile robot simulation. Journal of Advanced Robotics Systems, 1(1):39–42.

Rosenstein, M. T. (2003). Learning to exploit dynamics for robot motor coordination. PhD thesis, University of Massachusetts, Amherst.

Rosenstein, M. T. and Barto, A. G. (2001). Robot weightlifting by direct policy search. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, pages 839–846.

Sutton, R., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063.

Vukobratovic, M. and Stepanenko, J. (1972). On the stability of anthropomorphic systems. Mathematical Biosciences, 15:1–37.

Webots. http://www.cyberbotics.com. Commercial mobile robot simulation software.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
