
Deep Learning of Biomimetic Sensorimotor Control

for Biomechanical Human Animation

MASAKI NAKADA, TAO ZHOU, HONGLIN CHEN, TOMER WEISS, and DEMETRI TERZOPOULOS,

University of California, Los Angeles, USA


Fig. 1. Our anatomically accurate biomechanical human model features a biomimetic sensorimotor system that incorporates 20 trained deep neural networks, 10 devoted to visual perception and 10 to motor control. Its motor subsystem comprises 5 neuromuscular controllers that actuate the cervicocephalic and four limb musculoskeletal complexes. Via ray tracing (blue rays), its virtual eyes (b) visually sample the environment (a) to compute the irradiance at RGB photoreceptors nonuniformly distributed on their foveated retinas. Its sensory subsystem performs visual analysis that drives the motor subsystem to actuate appropriate eye, head, and limb movements. Our virtual human can visually foveate and track a moving ball, and intercept it with arms (a) and legs (c), as well as write and draw with its finger (d). The rendered appearance of the numerous Hill-type skeletal muscles is unimportant in our work; they are depicted as lines, color coded to distinguish the different muscle groups, with brightness proportional to each contractile muscle's efferent neural activation.

We introduce a biomimetic framework for human sensorimotor control, which features a biomechanically simulated human musculoskeletal model actuated by numerous muscles, with eyes whose retinas have nonuniformly distributed photoreceptors. The virtual human's sensorimotor control system comprises 20 trained deep neural networks (DNNs), half constituting the neuromuscular motor subsystem, while the other half compose the visual sensory subsystem. Directly from the photoreceptor responses, 2 vision DNNs drive eye and head movements, while 8 vision DNNs extract visual information required to direct arm and leg actions. Ten DNNs achieve neuromuscular control—2 DNNs control the 216 neck muscles that actuate the cervicocephalic musculoskeletal complex to produce natural head movements, and 2 DNNs control each limb; i.e., the 29 muscles of each arm and 39 muscles of each leg. By synthesizing its own training data, our virtual human automatically learns efficient, online, active visuomotor control of its eyes, head, and limbs in order to perform nontrivial tasks involving the foveation and visual pursuit of target objects coupled with visually-guided limb-reaching actions to intercept the moving targets, as well as to carry out drawing and writing tasks.

Authors' address: Masaki Nakada, [email protected]; Tao Zhou, [email protected]; Honglin Chen, [email protected]; Tomer Weiss, [email protected]; Demetri Terzopoulos, [email protected], Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095, USA.


CCS Concepts: • Computing methodologies → Computer graphics; Animation; Bio-inspired approaches; Neural networks; Artificial life; Physical simulation;

Additional Key Words and Phrases: Biomechanical human animation; Sensorimotor control; Biomimetic perception; Deep neural network learning.

ACM Reference Format: Masaki Nakada, Tao Zhou, Honglin Chen, Tomer Weiss, and Demetri Terzopoulos. 2018. Deep Learning of Biomimetic Sensorimotor Control for Biomechanical Human Animation. ACM Trans. Graph. 37, 4, Article 56 (August 2018), 15 pages. https://doi.org/10.1145/3197517.3201305

1 INTRODUCTION

The modeling of graphical characters based on human anatomy is attracting much attention in computer animation. Increasingly accurate biomechanical modeling of the human body should, in principle, yield more realistic human animation. However, it remains a daunting challenge to control complex biomechanical human musculoskeletal models with over two hundred bones and on the order of a thousand contractile muscle actuators. This paper reports on a major stride toward achieving this goal. We take a neuromuscular control approach based on machine learning using Deep Neural Networks (DNNs). Beyond pure motor control, we confront the difficult, less explored problem of sensorimotor control, in which processed sensory information acquired through active perception drives appropriate motor responses that support natural behavior. More specifically, the main research problems that we tackle in this paper are as follows:


• Biomimetic motor control: The human nervous system generates the required muscle activations seemingly instantaneously, and biological motor control mechanisms appear incompatible with any online control approach that would rely on solving inverse kinematics, inverse dynamics, or minimal effort optimization problems in order to compute the muscle activations needed to produce desired movements. Rather, the brain is capable of learning muscle controls from experience. Therefore, we explore a neuromuscular control approach based on machine learning, particularly DNNs and deep learning.

• Biomimetic sensory control: Since our objective is realistic virtual humans, we revisit the fundamentals of human vision, taking a strongly biomimetic approach to computer vision for virtual humans. We create an anatomically consistent 3D eye model, which is capable of eye movements, with a retina populated by photoreceptors in a naturally nonuniform, foveated pattern. Furthermore, we pursue a visual DNN approach to producing natural eye movements and to inferring basic visual information from the retinal photoreceptor responses.

• Biomimetic sensorimotor control: We combine the above motor and sensory control models to achieve an integrated, biomimetic sensorimotor control system capable of performing several nontrivial tasks. These include not only well-coordinated eye-head movements characteristic of natural active vision, but also natural limb motions to reach static and kinetic visual targets, as well as to write and draw by hand.

Our overarching contribution is a novel biomimetic framework for human sensorimotor control, which is illustrated in Fig. 1. The human sensorimotor control model that we develop is unprecedented, both in its use of a sophisticated biomechanical human simulation incorporating numerous contractile muscle actuators and in its use of modern machine learning methodologies to achieve online motor control of realistic musculoskeletal systems and perform online visual processing for active, foveated perception. We accomplish this using a modular set of DNNs, all of which are automatically trained offline with training data synthesized by the biomechanical model itself, a strategy that we regard as a contribution of interest to the machine learning research community as well.

The remainder of the paper is organized as follows: Section 2 reviews relevant prior work. Section 3 reviews our biomechanical human musculoskeletal model. Section 4 overviews our novel sensorimotor control framework and system for this model. Section 5 develops the motor subsystem and Section 6 develops the sensory (currently purely visual) subsystem, while Section 7 describes the sensorimotor simulation cycle. Section 8 presents our experiments and results with the virtual human model. Section 9 concludes the paper with a summary of our work and our current plans for future work.

2 RELATED WORK

The following two sections review prior work relevant to the primary topics of this paper: neuromuscular and sensorimotor control of an anatomically accurate biomechanical human musculoskeletal model.

2.1 Neuromuscular and Visuomotor Control

With their pioneering "NeuroAnimator", Grzeszczuk et al. [1998] were the first to apply artificial neural networks and backpropagation learning in computer graphics, demonstrating the learning of neural network emulators of physics-based models, as well as how these emulators can support fast reinforcement learning of motion controllers based on backpropagation through time, a policy gradient method utilizing recursive neural networks. Lee and Terzopoulos [2006] were the first to employ neural networks as neuromuscular controllers, specifically for their detailed biomechanical model of the human cervicocephalic system. Their work demonstrated that neuromuscular learning approaches are the most biomimetic and most promising in tackling the high-dimensional problems of muscle-actuated biomechanical motor control. Unfortunately, the shallow neural networks that could practically be trained at that time proved inadequate at coping with the higher dimensionality and greater quantities of training data required to learn to control full-body musculoskeletal models. To tackle a more challenging generalization of the cervicocephalic control problem, Nakada and Terzopoulos [2015] introduced the use of deep learning techniques [Goodfellow et al. 2016].

The work reported in the present paper was inspired not just by the aforementioned efforts, but also by the impressive visuomotor "EyeCatch" model of Yeo et al. [2012]. EyeCatch is a non-learning-based visuomotor system in a simple, non-physics-based humanoid character. By contrast, we demonstrate dramatically more complex sensorimotor control realized using a set of trained DNNs in a comprehensive, anatomically accurate, muscle-actuated biomechanical human model.

Furthermore, the EyeCatch model and the more primitive visuomotor model of Lee and Terzopoulos [2006] made direct use of the 3D spatial positions of virtual visual targets, without any significant biologically-inspired visual processing. By contrast, we build upon the pioneering "Animat Vision" work on foveated, active computer vision for animated characters [Terzopoulos and Rabie 1995], which demonstrated vision-guided bipedal locomotion and navigation, albeit in purely kinematic human characters [Rabie and Terzopoulos 2000]. We introduce a substantially more biomimetic active vision model based on the foveated pattern of cone photoreceptor placement in biological retinas [Schwartz 1977].

2.2 Biomechanical Human Modeling and Animation

Human modeling and animation has been a topic of interest since the early days of computer graphics. As the literature is now vast, we focus here on the most relevant work that precedes our accomplishment of neurosensorimotor control of a detailed biomechanical human musculoskeletal model.

Fig. 2. The biomechanical human musculoskeletal model, showing the skeletal system with its 193 bones and 823 Hill-type muscle actuators.

Aside from the domain of biomechanical human facial animation, in which muscle actuators have been employed for decades with increasingly realistic results, e.g., [Terzopoulos and Waters 1990], [Lee et al. 1995], [Kähler et al. 2002], [Sifakis et al. 2005], [Ichim et al. 2017], the simulation and control of articulated anthropomorphic models has been heavily researched, although mostly via "stick figures" actuated by joint torque motors, which was the state of the art in physics-based character animation two decades ago; e.g., [Hodgins et al. 1995], [Faloutsos et al. 2001]. Unfortunately, these are poor models of the human musculoskeletal system that inadequately address the mystery of its highly robust neuromuscular control.

Anatomically and biomechanically accurate musculoskeletal simulation—early examples of which are the cervicocephalic model of Lee and Terzopoulos [2006] and the hand model of Sueda et al. [2008] (see also [Sachdeva et al. 2015])—is now rapidly superseding the earlier modeling mindset, having finally overcome the reluctance of the animation research community to embrace the daunting complexities of strongly biomimetic approaches. Researchers have been pursuing the construction of detailed musculoskeletal models from noninvasive 3D medical images [Fan et al. 2014], body surface scans [Kadleček et al. 2016], and other modalities. Controlling musculoskeletal models remains a challenging problem, but recent papers have reported progress on the important problem of locomotion synthesis [Geijtenbeek et al. 2013; Lee et al. 2014; Wang et al. 2012], most recently by adopting machine-learning-based control methods [Holden et al. 2017; Liu and Hodgins 2017; Peng et al. 2017]. Cruz Ruiz et al. [2017] present a comprehensive survey on muscle-based control for character animation.

Most pertinent to our present work, Lee et al. [2009] created a comprehensive biomechanical human musculoskeletal model, and the full-body version of this model was employed by Si et al. [2014] to animate human swimming in simulated water with a neural CPG control system that synthesizes sustained rhythmic muscle activations suitable for locomotion. We utilize the same biomechanical model, but instead tackle the important open problem of transient, task-based sensorimotor control, particularly for observing and reaching out to stationary or moving visual targets (through an entirely different approach than in earlier work [Huang et al. 2010]).

3 BIOMECHANICAL MUSCULOSKELETAL MODEL

Fig. 2 shows the musculoskeletal system of our anatomically accurate biomechanical human model. It includes all of the relevant articular bones and muscles—193 bones (hand and foot bones included) plus a total of 823 muscle actuators. (To mitigate computational cost, we exclude the finite element musculotendinous soft-tissue simulation that produces realistic flesh deformations.) Each skeletal muscle is modeled as a Hill-type uniaxial contractile actuator that applies forces to the bones at its points of insertion and attachment. The human model is numerically simulated as a force-driven articulated multi-body system; refer to [Lee et al. 2009] for the details.

Fig. 3. The biomechanical cervicocephalic musculoskeletal model in (a) anterior and (b) posterior views, showing the 216 neck muscles, for some of which the displayed vertebrae below C1, plus the ribs, clavicles, and scapulae, serve as root attachments.

Given our human model, the challenge in neuromuscular motor control is to generate the efferent activation signals for its muscles that are necessary to carry out various motor tasks. For the purposes of our research to date, we mitigate complexity by placing our virtual human in a seated position, immobilizing the pelvis as well as the lumbar and thoracic vertebrae of the spinal column and the other bones of the torso, leaving the cervical column, arms, and legs free to articulate.

Cervicocephalic Musculoskeletal Complex: Fig. 3 shows in greater detail the neck-head musculoskeletal complex, which is rooted at the first thoracic vertebra (T1), with its seven cervical vertebrae, C7 through C1 (the atlas), progressing up the cervical spine to the skull, which is an "end-effector" of substantial mass. A total of 216 short, intermediate, and long Hill-type uniaxial muscle actuators, arranged in deep, intermediate, and superficial layers, respectively, actuate the seven 3-degree-of-freedom joints of the cervicocephalic musculoskeletal complex. Each cervical joint incorporates damped rotational springs to approximate the passive elasticity of the intervertebral discs and ligaments.

Fig. 4. The musculoskeletal anatomy of the (a) right and (b) left arms, showing the 29 arm muscles, for some of which the clavicle and scapula serve as root attachments.

Fig. 5. The musculoskeletal anatomy of the (a) right and (b) left legs, showing the 39 leg muscles, for some of which the pelvis serves as a root attachment.

Arm Musculoskeletal Complexes: Fig. 4 shows the musculoskeletal model of the bilaterally symmetric arms, which are rooted at the clavicle and scapula, and include the humerus, ulna, and radius bones, plus the bones of the wrist and hand. In each arm, a total of 29 Hill-type uniaxial muscles actuate the shoulder, elbow, and wrist joints. To mitigate complexity, the bones of the hand are actuated kinematically.

Leg Musculoskeletal Complexes: Fig. 5 shows the musculoskeletal model of the bilaterally symmetric legs, which are rooted at the pelvis and include the femur, patella, tibia, and fibula bones, plus the bones of the ankle and foot. In each leg, a total of 39 Hill-type uniaxial muscles actuate the hip, knee, and ankle joints. To mitigate complexity, the bones of the foot are actuated kinematically.

4 SENSORIMOTOR CONTROL FRAMEWORK

Thus, of the 823 muscle actuators in the complete musculoskeletal model, 216 + (2 × 29) + (2 × 39) = 352 muscle actuators must be controlled by specifying for each of them a time-varying, efferent activation signal a(t). In our sensorimotor control framework, all of these activation signals are generated by trained DNNs.

Fig. 6 presents an overview of the sensorimotor system, showing its sensory and motor subsystems. The figure caption describes the information flow and the functions of its 20 DNNs.

The two subsequent sections develop first the motor subsystem, which includes its 5 neuromuscular motor controllers (each comprising 2 DNNs) (Fig. 6f,gL,gR,hL,hR), and then the sensory subsystem, which includes the eyes (Fig. 6eL,eR) and their retinas (Fig. 6bL,bR), along with the 10 vision DNNs (Fig. 6cL,cR,dL,dR).

5 MOTOR SUBSYSTEM

The cervicocephalic and limb musculoskeletal complexes of our biomechanical human model are actuated by groups of skeletal muscles driven by neuromuscular motor controllers that generate efferent activation signals for the muscles. The motor subsystem includes a total of 5 neuromuscular motor controllers: one for the cervicocephalic complex, two for the arm complexes, and two for the leg complexes. Each generates the muscle activations a needed to actuate its associated musculoskeletal complex in a controlled manner.

Fig. 7 shows the neuromuscular motor controller architecture, which comprises a voluntary controller and a reflex controller (cf. [Lee and Terzopoulos 2006]), both of which are trained DNNs that produce muscle activation adjustment (∆) signals. The voluntary controller produces signals ∆a_v, which induce the desired actuation of the associated musculoskeletal complex, while the reflex controller produces muscle-stabilizing signals ∆a_r. Thus, the output of the neuromuscular motor controller is given by

$a(t + \Delta t) = a(t) + \Delta a_v(t) + \Delta a_r(t).$  (1)

In Fig. 7, note that the feedback loop of the muscle activation signal a makes the neuromuscular controllers recurrent neural networks.

Through systematic experiments, reported in Appendix A, we identified a DNN architecture that works well for all 10 motor DNNs; specifically, rectangularly-shaped, fully-connected networks with six 300-unit-wide hidden layers.¹
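To make Equation 1 concrete, here is a minimal sketch of a single controller timestep; `voluntary_dnn`, `reflex_dnn`, and the argument layout are hypothetical stand-ins for the trained networks of Fig. 7, not the authors' implementation.

```python
# A minimal sketch of one neuromuscular controller timestep (Equation 1).
# The two DNN callables and the input layout are illustrative assumptions.
import numpy as np

def controller_step(a, delta, d_strain, d_strain_rate, voluntary_dnn, reflex_dnn):
    """Advance the activations a(t) of the n muscles to a(t + dt).

    delta         -- target discrepancy (angular for the neck, linear for a limb)
    d_strain      -- changes in the n muscle strains (Delta e)
    d_strain_rate -- changes in the n muscle strain rates (Delta e-dot)
    """
    da_v = voluntary_dnn(np.concatenate([delta, a]))              # voluntary adjustment
    da_r = reflex_dnn(np.concatenate([d_strain, d_strain_rate]))  # reflex adjustment
    return a + da_v + da_r                                        # Equation 1
```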

¹ The DNNs employ rectified linear units (i.e., the ReLU activation function). The total number of weights in the DNNs ranges from 11.5–24.2 million. The initial weights are sampled from the zero-mean normal distribution with standard deviation √(2/fan_in), where fan_in is the number of input units in the weight tensor [He et al. 2015]. To train the DNNs, we apply the mean-squared-error loss function and the Adaptive Moment Estimation (Adam) stochastic optimizer [Kingma and Ba 2014] with learning rate η = 10⁻⁶, step size α = 10⁻³, forgetting factors β₁ = 0.9 for the gradients and β₂ = 0.999 for their second moments, and we avoid overfitting using the early stopping condition of negligible improvement for 10 successive epochs. Each epoch requires less than 10 seconds of computation time.
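The following PyTorch sketch mirrors the Footnote 1 specifications (six 300-unit ReLU hidden layers, He fan-in initialization, MSE loss, Adam); it is an illustration under our own assumptions rather than the authors' Theano code, and we map the footnote's step size α = 10⁻³ to the optimizer's learning rate.

```python
# Illustrative PyTorch construction of a motor DNN per Footnote 1. The I/O
# sizes shown are those of the cervicocephalic voluntary DNN (218 in, 216 out).
import torch.nn as nn
import torch.optim as optim

def make_motor_dnn(n_in, n_out, width=300, depth=6):
    layers, prev = [], n_in
    for _ in range(depth):                       # six 300-unit hidden layers
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, n_out))        # muscle activation adjustments
    net = nn.Sequential(*layers)
    for m in net:                                # He (fan-in) initialization
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
            nn.init.zeros_(m.bias)
    return net

net = make_motor_dnn(n_in=218, n_out=216)
loss_fn = nn.MSELoss()                           # mean-squared-error loss
opt = optim.Adam(net.parameters(), lr=1e-3,      # step size alpha = 1e-3
                 betas=(0.9, 0.999))             # forgetting factors beta1, beta2
```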



Fig. 6. Sensorimotor system architecture, showing the modular neural network controllers in the sensory subsystem (top) and motor subsystem (bottom), including a total of 20 Deep Neural Networks (DNNs), numbered 1–20, of four types (colored yellow, green, orange, and blue).
• Sensory Subsystem: (a) Each retinal photoreceptor casts a ray into the virtual world to compute the irradiance captured by the photoreceptor. (b) The arrangement of the 3,600 photoreceptors (black dots) on the left (b)L and right (b)R foveated retinas. Each retina outputs a 10,800-dimensional Optic Nerve Vector (ONV). There are 10 vision DNNs. The two (yellow) foveation DNNs (c) (1,6) input the ONV and output left-eye (c)L and right-eye (c)R angular discrepancies that drive the movements of the eyes (e) to foveate visual targets. (d) The other eight (green) vision DNNs—(d)L (7,8,9,10) for the left eye and (d)R (2,3,4,5) for the right eye—also input the ONV and output the limb-to-target visual discrepancy estimates.
• Motor Subsystem: There are 10 motor DNNs. The (orange) cervicocephalic neuromuscular motor controller (f) (DNNs 11,12) inputs the average of the left (c)L and right (c)R foveation DNN responses and outputs activations to the neck muscle group. The four (blue) limb neuromuscular motor controllers (g),(h) (DNNs 13–20) of the limb musculoskeletal complexes input the average of the left (d)L and right (d)R limb vision DNN responses and output activations to the respective arm and leg muscle groups.


Fig. 7. Neuromuscular motor controller architecture. The voluntary controller inputs a target discrepancy δ and, recurrently, the muscle activations a. The reflex controller inputs the changes in muscle strains e and strain rates ė. Both controllers output muscle activation adjustments.

In the next two sections, we present the details of the 5 voluntary motor DNNs and the 5 reflex motor DNNs, and we describe the training data synthesis regimens for each type of network.

5.1 Voluntary Motor DNNs (11,13,15,17,19)²

The function of the cervicocephalic voluntary motor DNN (Fig. 8) is to generate efferent activation signals to the neck muscles in order to balance the mass of the head in gravity atop the flexible cervical column while actuating realistic head movements to achieve target head poses. The function of the 4 limb voluntary motor DNNs (Fig. 9) is to generate efferent activation signals to the muscles of the four limbs in order to execute controlled arm and leg movements, such as extending a limb to reach a target.

The architecture of all five voluntary motor DNNs is identical, except for the sizes of the input and output layers. Their input layers include units that represent the (angular or linear) components of the discrepancy (δ in Fig. 7) between the value of some relevant feature of the virtual human's state, such as head orientation or hand/foot position, and the target value of that feature, as well as units that represent the current activations, a_i(t), for 1 ≤ i ≤ n, of each of the n muscles in the associated musculoskeletal complex. The 6 hidden layers have 300 units each. The output layer consists of units that encode the adjustments ∆a_i(t), for 1 ≤ i ≤ n, to the muscle activations, which then contribute additively as ∆a_v to updating the muscle activations according to Equation 1.

The following sections provide additional details about each of the DNNs and how they are trained. Appendix B explains in more detail how we use the biomechanical musculoskeletal model itself to synthesize training data.

5.1.1 Cervicocephalic Voluntary Motor DNN (11). As shown in Fig. 8, the input layer of the cervicocephalic voluntary motor DNN has 218 units, comprising 2 units for the target head angular discrepancies, ∆θ and ∆ϕ, and 216 units for the neck muscle activations a_i, for 1 ≤ i ≤ 216. The output layer has 216 units that provide the muscle activation adjustments ∆a_i.

To train the DNN, we use our biomechanical human musculoskeletal model to synthesize training data, as follows: Specifying a target orientation for the head yields angular discrepancies, ∆θ and ∆ϕ, between the current head orientation and the target head orientation. With these angular discrepancies serving as the target control input, we compute inverse kinematics followed by inverse dynamics with minimal muscle effort optimization applied to the biomechanical cervicocephalic model, as described in Appendix B. This determines muscle activation adjustments, which serve as the desired output of the cervicocephalic voluntary motor DNN. Hence, the input/output training pair consists of an input comprising the concatenation of the desired angular discrepancies, ∆θ and ∆ϕ, and the current muscle activations a_i, for 1 ≤ i ≤ 216, along with an associated output comprising the desired muscle activation adjustments ∆a_i, for 1 ≤ i ≤ 216.

This offline training data synthesis process takes 0.8 seconds of computational time to solve for 0.02 simulation seconds on a 2.4 GHz Intel Core i5 CPU with 8 GB of RAM. By repeatedly specifying random target orientations and actuating the cervicocephalic musculoskeletal subsystem to achieve them, we generated a training dataset of 1M (million) input-output pairs.

² The numbers in parentheses here and elsewhere refer to the numbered DNNs in Fig. 6.

Fig. 8. Voluntary motor DNN for the cervicocephalic complex, with six 300-unit hidden layers.

Fig. 9. Voluntary motor DNN for a limb (arm or leg) complex, with six 300-unit hidden layers.
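A hedged sketch of this synthesis loop appears below; `solve_ik_id_effort` abstracts the Appendix B inverse kinematics / inverse dynamics / minimal-muscle-effort computation, and the simulator interface is a hypothetical placeholder, not the authors' system.

```python
# Hypothetical training-data synthesis loop for the cervicocephalic voluntary
# motor DNN; all interfaces are illustrative stand-ins for the authors' system.
import numpy as np

def synthesize_neck_dataset(sim, n_pairs=1_000_000, dt=0.02):
    pairs = []
    while len(pairs) < n_pairs:
        target = sim.random_head_orientation()
        d_theta, d_phi = sim.head_angular_discrepancy(target)
        a = sim.neck_activations()                    # current 216 activations
        da = solve_ik_id_effort(sim, d_theta, d_phi)  # desired adjustments (Appendix B)
        x = np.concatenate([[d_theta, d_phi], a])     # 218-D input
        pairs.append((x, da))                         # 216-D output
        sim.apply_neck_activations(a + da)
        sim.step(dt)                                  # 0.02 simulation seconds
    return pairs
```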

The backpropagation DNN training process converged to a small error in 2,000 epochs. Fig. 10 graphs the progress of the training process. After the DNN is trained, it serves as the online cervicocephalic voluntary motor controller.

5.1.2 Limb Voluntary Motor DNNs (13,15,17,19). As shown in Fig. 9, the input layer of each limb voluntary motor DNN has 3 units that represent ∆x, ∆y, and ∆z, the estimated linear discrepancies between the position of the end effector and a specified target position, as well as units that represent the current activations of the 26 arm muscles, a_i, for 1 ≤ i ≤ 26, or of the 36 leg muscles, a_i, for 1 ≤ i ≤ 36. The output layer consists of units that encode the muscle activation adjustments ∆a_i.

Fig. 10. Progress of the backpropagation training of the cervicocephalic voluntary motor DNN on the training (green) and validation (red) datasets (mean squared error versus epoch).

To train the DNN, we employ our biomechanical human musculoskeletal model to synthesize training data as described in the previous section: We specify a target position, which determines the linear discrepancies, ∆x, ∆y, and ∆z, between the position of the end effector (hand or foot) and the target. Given these discrepancies and the current muscle activations as the input, the desired output of the network is the muscle activation adjustments, which are similarly computed by solving appropriate inverse kinematics followed by inverse dynamics and minimal muscle effort optimization problems for the limb musculoskeletal complex (see Appendix B). Repeatedly specifying a series of random target positions and actuating the limb to reach them, we generated a large training dataset of 1M input-output pairs. Four such training datasets were generated, one for each limb. After the DNNs are trained, they serve online as limb voluntary motor controllers.

Fig. 11a graphs the progress of the backpropagation training process for the left arm voluntary motor DNN (15). For its training dataset, it converged to a small error after 1,865 epochs before triggering the early stopping condition to avoid overfitting. Fig. 11b graphs the progress of the backpropagation training process for the right arm voluntary motor DNN (13). It converged to a small error after 1,777 epochs before triggering the early stopping condition.

Fig. 11c and Fig. 11d graph the progress of the backpropagation training process for the left leg and right leg voluntary motor DNNs, respectively. Convergence to a small error was achieved after 2,000 epochs for the left leg DNN (19) and after 1,850 epochs for the right leg DNN (17) before triggering the early stopping condition.

5.2 Reflex Motor DNNs (12,14,16,18,20)

The reflex motor controller of Lee and Terzopoulos [2006] computed the relevant low-level dynamics equations online, which is not biologically plausible. By contrast, our reflex motor controllers are motor DNNs that have learned the desired control functionality from training data synthesized offline.

Fig. 11. Progress of the backpropagation training of the limb voluntary motor DNNs on the training (green) and validation (red) datasets (mean squared error versus epoch): (a) left arm; (b) right arm; (c) left leg; (d) right leg.

Fig. 12. Architecture of the reflex motor DNNs. The networks for the cervicocephalic and limb musculoskeletal complexes are identical except for the size of the input and output layers, which are determined by the number of muscles.

Fig. 12 shows the architecture of the reflex motor DNNs. It is identical for the 5 reflex motor controllers, except for the sizes of the input and output layers, which are determined by the number of muscles n in the cervicocephalic, arm, and leg musculoskeletal complexes. The input layer consists of 2n units, comprising units that represent the change in muscle strain ∆e_i and in strain rate ∆ė_i, for 1 ≤ i ≤ n. Like the voluntary motor DNNs, the networks have 6 hidden layers with 300 units each. The output layer consists of n units providing muscle activation adjustments ∆a_i, for 1 ≤ i ≤ n, which then contribute additively as ∆a_r to updating the muscle activations according to Equation 1.

To train the reflex motor DNNs, we employ the same training methods as for the voluntary motor DNNs. We again use our biomechanical human musculoskeletal model to synthesize training data. The process is similar to the voluntary motor controller training process, and the training dataset for the reflex controller is generated as described in Appendix B.

After the DNNs are trained, they serve as online reflex motor controllers. In general, the backpropagation DNN training processes converged to a small error in the range of 1,200 to 1,600 epochs. Fig. 13 graphs the progress of the training process for the right leg reflex motor DNN. The graphs for the other 4 reflex DNNs are similar.
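Reusing the hypothetical `make_motor_dnn` sketch from Section 5.1, the reflex networks differ only in their I/O sizes, which scale with the muscle count n of each complex:

```python
# Illustrative reflex motor DNNs (Fig. 12): inputs are the 2n strain and
# strain-rate changes, outputs the n activation adjustments. Muscle counts
# follow Section 3; this reuses the hypothetical make_motor_dnn from above.
reflex_neck = make_motor_dnn(n_in=2 * 216, n_out=216)  # cervicocephalic
reflex_arms = [make_motor_dnn(n_in=2 * 29, n_out=29) for _ in range(2)]
reflex_legs = [make_motor_dnn(n_in=2 * 39, n_out=39) for _ in range(2)]
```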

6 SENSORY SUBSYSTEM

We now turn to the sensory subsystem (Fig. 6), which includes the eye and retina models, as well as the 10 trained vision DNNs that perform visual processing.

Fig. 13. Progress of the backpropagation training of the right leg reflex motor DNN on the training (green) and validation (red) datasets (mean squared error versus epoch).

6.1 Eye and Retina Models

6.1.1 Eye Model. We modeled the eyes in accordance with human physiological data.³ As shown in Fig. 6e, we model the virtual eye as a sphere of 12 mm radius that can be rotated with respect to its center around its vertical y axis by a horizontal angle θ and around its horizontal x axis by a vertical angle ϕ. The eyes are in their neutral positions, looking straight ahead, when θ = ϕ = 0°. We currently model the eye as an ideal pinhole camera with aperture (optical center) at the center of the pupil and with horizontal and vertical fields of view of 167.5°.

³ The transverse size of an average eye is 24.2 mm and its sagittal size is 23.7 mm. The approximate field of view of an individual eye is 100 degrees temporal, 45 degrees nasal, 30 degrees superior, and 70 degrees inferior. The combined field of view of the two eyes is approximately 200 degrees horizontally and 135 degrees vertically.

We compute the irradiance at any point on the spherical retinal surface at the back of the eye using conventional ray tracing. Sample rays from the positions of photoreceptors on the retinal surface are cast through the pinhole and out into the 3D virtual world, where they recursively intersect with the visible surfaces of virtual objects and, in accordance with the Phong local illumination model, combine with shadow rays to light sources. The RGB values returned by these rays determine the irradiance impinging upon the retinal photoreceptors. Fig. 14 illustrates the retinal "imaging" process.
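As an illustration of this pinhole geometry (a sketch under stated assumptions, not the authors' renderer), the sample ray for a photoreceptor can be formed in eye coordinates and rotated into the world by the eye's current orientation:

```python
# Hypothetical geometry for a photoreceptor's sample ray: from its retinal
# position, through the pinhole aperture at the pupil center, into the scene.
# The 12 mm eye radius and axis conventions follow the text; the eye-frame
# layout (pupil on +z) is our assumption.
import numpy as np

def eye_rotation(theta, phi):
    """Eye rotation: theta about the vertical y axis, phi about the horizontal x axis."""
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(phi), np.sin(phi)
    Ry = np.array([[ct, 0.0, st], [0.0, 1.0, 0.0], [-st, 0.0, ct]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    return Ry @ Rx

def photoreceptor_ray(p_retina, eye_center, theta, phi):
    pupil = np.array([0.0, 0.0, 0.012])  # aperture, 12 mm from the eye center
    R = eye_rotation(theta, phi)
    origin = eye_center + R @ pupil      # aperture position in world space
    direction = R @ (pupil - p_retina)   # retina -> pinhole -> scene
    return origin, direction / np.linalg.norm(direction)
```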

Fig. 14. (a) Rays cast from the positions of photoreceptors on the retina through the pinhole aperture and out into the scene by the ray tracing procedure that computes the irradiance responses of the photoreceptors. (b) Rays cast from both eyes as the seated virtual human looks forward.

Fig. 15. Positions of the photoreceptors (black dots) on the (a) left and (b) right retinas according to the noisy log-polar model.

6.1.2 Placement of the Photoreceptors. To emulate biomimetic foveated vision, we procedurally position the photoreceptors on the hemispherical retina according to a noisy log-polar distribution, which has greater biological fidelity compared to earlier foveated vision models [Rabie and Terzopoulos 2000]. On each retina, we include 3,600 photoreceptors situated at

$\mathbf{d}_k = e^{\rho_j} \begin{bmatrix} \cos\alpha_i \\ \sin\alpha_i \end{bmatrix} + \begin{bmatrix} \mathcal{N}(\mu,\sigma^2) \\ \mathcal{N}(\mu,\sigma^2) \end{bmatrix}, \quad \text{for } 1 \le k \le 3{,}600,$  (2)

where 0 < ρ_j ≤ 40, incremented in steps of 1, and 0 ≤ α_i < 360°, incremented in 4° steps, and where N denotes additive IID Gaussian noise. We set mean µ = 0 and variance σ² = 0.0025, which places the photoreceptors in slightly different positions on the two retinas. Fig. 15 illustrates the placement of the photoreceptors on the left and right retinas. Other placement patterns are readily implementable, including more elaborate procedural models [Deering 2005] or photoreceptor distributions empirically measured from biological eyes, all of which differ dramatically from the uniformly-sampled rectangular images common in computer graphics and vision.
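The placement of Eq. 2 can be reproduced directly; the sketch below enumerates the 40 eccentricities × 90 directions = 3,600 sites and adds the IID Gaussian noise. Since the paper leaves the scaling onto the retinal surface implicit, the final rescaling is our own assumption.

```python
# Literal implementation of Eq. 2: noisy log-polar photoreceptor placement
# (40 values of rho_j x 90 values of alpha_i = 3,600 sites). The mapping and
# scaling onto the hemispherical retinal surface are abstracted away; the
# final rescaling is our assumption, added only to keep the values plottable.
import numpy as np

def place_photoreceptors(mu=0.0, sigma2=0.0025, seed=0):
    rng = np.random.default_rng(seed)
    rho = np.arange(1.0, 41.0)                      # 0 < rho_j <= 40, steps of 1
    alpha = np.deg2rad(np.arange(0.0, 360.0, 4.0))  # 0 <= alpha_i < 360, 4-degree steps
    R, A = np.meshgrid(rho, alpha, indexing='ij')
    d = np.exp(R)[..., None] * np.stack([np.cos(A), np.sin(A)], axis=-1)
    d += rng.normal(mu, np.sqrt(sigma2), size=d.shape)  # additive IID Gaussian noise
    d = d.reshape(3600, 2)
    return d / np.abs(d).max()                      # assumed normalization

left_retina = place_photoreceptors(seed=1)   # different noise on each retina
right_retina = place_photoreceptors(seed=2)
```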

6.1.3 Optic Nerve Vectors. The foveated retinal RGB "image" captured by each eye is output for further processing down the visual pathway, not as a 2D array of pixels, but as a 1D vector of length 3,600 × 3 = 10,800, which we call the Optic Nerve Vector (ONV). The raw sensory information encoded in the ONV feeds the vision DNNs that directly control eye movements and feed the neuromuscular motor controller networks that orchestrate neck-actuated head motions and the actions of the limbs.

Fig. 16. The feedforward vision DNN architecture, a tapered network whose input is the 10,800-D ONV and whose six hidden layers have 1000, 500, 250, 100, 50, and 25 units. The DNN shown produces saccadic eye movements for visual target foveation.

6.2 Vision DNNs (1–10)

The sensory subsystem includes two types of fully-connected feedforward DNNs that interpret the sensory information provided by the 10,800-dimensional ONV. The first type controls the eye and head movements, the latter via the cervicocephalic neuromuscular motor controller. The second type estimates limb-to-target 3D discrepancies, ∆x, ∆y, and ∆z, that drive limb actions via the limb neuromuscular motor controllers. Again, through a systematic set of experiments (see our companion paper [Nakada et al. 2018]), we identified a common DNN architecture that works well for all 10 vision DNNs—tapered networks with six hidden layers (Fig. 16).

6.2.1 Foveation DNNs (1,6). Along with the eyes, the left and right foveation DNNs constitute the oculomotor subsystem. The first role of these DNNs is to produce voluntary changes in gaze direction by driving saccadic eye movements to foveate visible objects of interest, thereby observing them with maximum visual acuity. This is illustrated in Fig. 17 for a white sphere in motion that enters the field of view from the lower right, stimulating some peripheral photoreceptors at the upper left of the retina. The maximum speed of saccadic eye movements is 900 degrees/sec, and the eye almost instantly rotates to foveate the visual target.

To aid foveation, fixation, and visual tracking, eye movements induce compensatory head movements, albeit much more sluggish ones due to the considerable mass of the head. Hence, the second role of the foveation DNNs is to control head movements, by driving the cervicocephalic neuromuscular voluntary motor DNN (11) (Fig. 6f) with the average of their two outputs.

As shown in Fig. 16, the input layer to this tapered, fully-connected DNN has 10,800 units in accordance with the dimensionality of the ONV, the output layer has 2 units representing the eye rotation adjustments, ∆θ and ∆ϕ, and there are 6 hidden layers with unit counts as indicated in the figure. The other details of the DNN are per the specifications in Footnote 1.
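A sketch of this tapered network in PyTorch (an illustration following Footnote 1 and Fig. 16, not the authors' Theano code):

```python
# Illustrative tapered vision DNN (Fig. 16): 10,800-D ONV input, hidden layers
# of 1000, 500, 250, 100, 50, and 25 ReLU units, and a small task-specific output.
import torch.nn as nn

def make_vision_dnn(n_out, widths=(1000, 500, 250, 100, 50, 25)):
    layers, prev = [], 10800                 # 3,600 photoreceptors x RGB
    for w in widths:
        layers += [nn.Linear(prev, w), nn.ReLU()]
        prev = w
    layers.append(nn.Linear(prev, n_out))
    return nn.Sequential(*layers)

foveation_dnn = make_vision_dnn(n_out=2)     # outputs (delta_theta, delta_phi)
limb_vision_dnn = make_vision_dnn(n_out=3)   # outputs (delta_x, delta_y, delta_z)
```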

present a white sphere within the visual field. The responses ofthe photoreceptors in the retinas of each eye are computed by raytracing the 3D scene, and they are output as the RGB components of

ACM Transactions on Graphics, Vol. 37, No. 4, Article 56. Publication date: August 2018.

Page 9: Deep Learning of Biomimetic Sensorimotor Control for Biomechanical Human Animationweb.cs.ucla.edu/~nakada/pdfs/siggraph18_nakada.pdf · 2020-01-22 · Deep Learning of Biomimetic

Deep Learning of Biomimetic Sensorimotor Control for Biomechanical Human Animation • 56:9

(a) t0

I

(b) t1

I

(c) t2

I

(d) t3

Fig. 17. Time sequence (a)–(d) of photoreceptor responses in the left retina during a saccadic eye movement that foveates and tracks a moving white sphere.

At time t0 the sphere becomes visible in the periphery, at t1 the eye movement is bringing the sphere toward the fovea, and the moving sphere is being fixated

in the fovea at times t2 and t3.

(a) t0

I

(b) t1

I

(c) t2

Fig. 18. Photoreceptor responses during an arm-reaching motion toward a

moving target sphere. The photoreceptors are simultaneously stimulated

by the fixated red sphere and by the green arm entering the eye’s field of

view from the lower right (upper left on the retina).

the respective ONV. Given its ONV input, the desired output of thenetwork is the angular discrepancies, ∆θ and ∆ϕ, between the actualgaze directions of the eyes and the known gaze directions that wouldfoveate the sphere. Repeatedly positioning the sphere at randomlocations in the visual field, we generated a training dataset of1M input-output pairs. The backpropagation DNN training processconverged to a small error after 80 epochs before triggering theearly stopping condition to avoid overfitting.

6.2.2 Limb Vision DNNs (2,3,4,5 & 7,8,9,10). The role of the left and right limb (arm and leg) vision DNNs is to estimate the separation in 3D space between the position of the end effector (hand or foot) and the position of a visual target, thus driving the associated limb neuromuscular motor controller to extend the limb to touch the target. This is illustrated in Fig. 18 for a fixated red sphere and a green arm that enters the eye's field of view from the lower right, stimulating peripheral photoreceptors at the upper left of the retina.

The architecture of the limb vision DNN is identical to that of the foveation DNN in Fig. 16, except for the size of the output layer, which has 3 units, ∆x, ∆y, and ∆z, the estimated discrepancies between the 3D positions of the end effector and the visual target.

Again, we use our biomechanical human musculoskeletal model to train the four limb vision DNNs, as follows: A red sphere is presented in the visual field and the trained foveation DNNs are allowed to foveate the sphere. Then, a limb (arm or leg) is extended toward the sphere. Again, the responses of the photoreceptors in the retinas of the eyes are computed by ray tracing the 3D scene, and they are output as the RGB components of the respective ONV. Given the ONV input, the desired output of the network is the discrepancies, ∆x, ∆y, and ∆z, between the known 3D positions of the end effector and visual target. Repeatedly placing the sphere at random positions in the visual field and randomly articulating the limb to reach for it in space, we generated a training dataset of 1M input-output pairs. The backpropagation DNN training process converged to a small error after 388 epochs, which triggered the early stopping condition to avoid overfitting. As expected, due to the greater complexity of this task, the training is significantly slower than for the foveation DNN.

Fig. 19. The sensorimotor simulation cycle: perception and oculomotor control feed the neuromuscular motor controllers, which drive the biomechanical musculoskeletal simulator, subject to external forces.

7 THE SENSORIMOTOR SIMULATION CYCLE

Given the components of Fig. 6, the diagram in Fig. 19 illustrates the embodied sensorimotor simulation process. Each iteration proceeds as follows: The ONVs output by the eyes are processed by the vision DNNs, and incremental eye rotations that effect saccadic foveation and visual tracking of moving targets are driven by the foveation DNNs. The vision DNNs simultaneously feed the neuromuscular motor controllers, which produce muscle activation signals for the cervicocephalic and limb musculoskeletal complexes. The biomechanical simulator advances the state of the musculoskeletal system through time, thus incrementally actuating the five complexes subject to external forces, such as gravity. The three main simulation components can run asynchronously.

Fig. 20. Sequence of frames rendered from a simulation in which the cervicocephalic neuromuscular controller is deactivated (b) and reactivated (d).

Fig. 21. Sequence of frames rendered from a simulation in which the left arm neuromuscular motor controller is activated (a) to reach toward the sphere and deactivated (c) to release the arm, which swings down in gravity (d–e) before coming to rest (f).

Fig. 22. Sequence of frames rendered from a simulation of the virtual human sitting on a stool and actively executing arm-reaching movements. The blue lines show the rays from the photoreceptors that sample the scene. The sphere is perceived by the eyes and vision DNNs, then foveated and tracked through eye movements in conjunction with muscle-actuated head movements driven by the cervicocephalic neuromuscular motor controller and reactive reaching movements driven by the arm neuromuscular motor controllers.
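Hedged pseudocode for one iteration of the simulation cycle described above follows; every interface is a hypothetical stand-in for the components of Fig. 6 and Fig. 19, and in the actual system the three stages may run asynchronously.

```python
# Hypothetical sketch of the sensorimotor simulation cycle (Fig. 19); all
# object interfaces are illustrative stand-ins for the system's components.
import numpy as np

def sensorimotor_cycle(eyes, foveation_dnns, limb_vision_dnns, controllers,
                       simulator, dt=0.02):
    """eyes/foveation_dnns are (left, right) pairs; limb_vision_dnns maps each
    limb to its (left-eye, right-eye) vision DNN pair."""
    while simulator.running():
        # 1. Perception: ray-trace the retinas and read out the two ONVs.
        onvs = [eye.render_onv() for eye in eyes]

        # 2. Oculomotor control: the foveation DNNs rotate their eyes, and the
        #    average of their outputs drives the cervicocephalic controller.
        gaze_out = [dnn(onv) for dnn, onv in zip(foveation_dnns, onvs)]
        for eye, out in zip(eyes, gaze_out):
            eye.rotate(*out)
        gaze_err = np.mean(gaze_out, axis=0)

        # 3. Neuromuscular control: each limb controller inputs the average of
        #    its left- and right-eye vision DNN estimates (Fig. 6).
        limb_err = {limb: np.mean([dnn(onv) for dnn, onv in zip(pair, onvs)], axis=0)
                    for limb, pair in limb_vision_dnns.items()}
        activations = controllers.update(gaze_err, limb_err)

        # 4. Biomechanics: advance the muscle-actuated musculoskeletal system
        #    through time, subject to external forces such as gravity.
        simulator.step(activations, dt)
```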

8 EXPERIMENTS AND RESULTS

All the DNNs in our sensorimotor control system were implemented and trained using the Theano library [Bergstra et al. 2010] running on an NVIDIA Titan X GPU installed in an Ubuntu 16.04 system with a 3.2 GHz Intel Core i7 CPU.

8.1 Neuromuscular Controller Activation/Deactivation

Fig. 20 shows a sequence of frames rendered from the simulation of the cervicocephalic musculoskeletal complex with its trained neuromuscular motor controller, which is deactivated (all output muscle activations become 0.0) and then reactivated. Upon controller deactivation, the head naturally droops backward due to gravity. After controller reactivation, the neck muscles lift the head back up, balance it atop the cervical column, and make controlled head movements.

Similarly, Fig. 21 shows a sequence of frames rendered from a simulation of the left arm musculoskeletal complex in which its trained neuromuscular motor controller is deactivated and reactivated, demonstrating fully dynamic control of the arm.

8.2 Perception Demonstration

Fig. 22 shows a sequence of frames rendered from a simulation of our virtual human sitting on a stool and actively executing arm-reaching movements toward a target sphere. As shown from a side view in Fig. 1a,b, the blue lines indicate the rays cast from the retinal photoreceptor positions to sample the 3D scene and compute the irradiance at the photoreceptors. First, the sphere is perceived by the eyes, then processed by the vision DNNs, and foveated and tracked through eye movements in conjunction with muscle-actuated head movements controlled by the cervicocephalic neuromuscular motor controller. The muscle-actuated reaching movements are actively controlled by the arm neuromuscular motor controllers.

8.3 Intercepting an Incoming Ball

Fig. 23 shows a sequence of frames rendered from a simulation that demonstrates the active sensorimotor system in action. A cannon shoots balls at the virtual human, which actively perceives the balls on its foveated retinas. The ONVs from the eyes are processed by the vision DNNs to enable foveation and visual pursuit of the incoming balls, coupled with cooperative head motions, driven by the cervicocephalic motor controller, that pursue the gaze direction. Simultaneously fed by the vision DNNs, the limb neuromuscular motor controllers effectively control the arms and legs to intercept the incoming balls. Thus, our biomechanical virtual human successfully controls itself online to carry out this nontrivial dynamic sensorimotor task.

Fig. 23. Sequence of frames rendered from a simulation of the biomechanical virtual human sitting on a stool and actively executing arm and leg reaching movements to intercept balls shot by a cannon. The balls are perceived by the eyes, processed by the vision DNNs, and foveated and smoothly pursued through eye movements in conjunction with muscle-actuated head movements controlled by the cervicocephalic neuromuscular motor controller, while the muscle-actuated reaching movements are actively controlled by the arm and leg neuromuscular motor controllers. The red lines indicate the eye viewing directions.

Fig. 24. Sequence of frames from a ray-tracing-rendered animation of the biomechanical virtual human in the scenario shown in Fig. 23, with a skin covering the musculoskeletal body model.

Fig. 25. Sequence of frames rendered from a simulation of the biomechanical virtual human sitting on a stool and actively executing arm-reaching movements to make contact with balls bouncing on multiple sloped planes. The balls are perceived by the eyes, processed by the vision DNNs, and foveated and smoothly pursued through eye movements in conjunction with muscle-actuated head movements controlled by the cervicocephalic neuromuscular motor controller. In this sequence, the muscle-actuated active reaching motion is controlled by the right arm neuromuscular motor controller.

Fig. 24 shows a frame sequence from a higher-quality, ray-tracing-rendered animation of our biomechanical virtual human with a skin surface covering its musculoskeletal body model. In the same scenario as in Fig. 23, the virtual human uses its limbs to intercept and deflect balls shot at it by the cannon.

8.4 Intercepting a Bouncing Ball

Fig. 25 shows a sequence of frames from an animation of our biomechanical virtual human actively performing reaching actions to intercept an approaching ball bouncing on multiple sloped planes. The eyes, in conjunction with the head, visually pursue the target ball, and the right arm also naturally tracks the ball until it comes within reach. The head, eye, and arm movements are automatically induced by the instantaneous position of the visual target, which is visually estimated by the vision DNNs from the ONV outputs of the eyes. The eyes make fast saccadic movements to quickly fixate the moving visual target and then pursue it, while the head actively orients itself in a natural, cooperative manner, albeit more sluggishly due to its substantial mass. Although the ball is initially out of reach, our virtual human automatically generates lifelike preparatory arm movements toward the approaching ball.

8.5 Drawing a Shape

Fig. 26 shows a sequence of frames from an animation of our biomechanical virtual human drawing the shape of a butterfly with its finger. The butterfly shape was provided as a predefined trajectory in space. Entirely through the neuromuscular control of its arm muscles, our human musculoskeletal model is able to follow the specified trajectory with its finger. Our demonstration video also includes a simulation of our virtual human using its finger to write “SIGGRAPH 2018”.

8.6 Robustness of the Sensorimotor System

To investigate the robustness of the sensorimotor system of our biomechanical human musculoskeletal model, we performed experiments to determine the extent to which its neuromuscular motor controllers are capable of generalizing to situations for which they were not specifically trained.


Fig. 26. Sequence of frames rendered from a simulation of our virtual human actively drawing a butterfly with its finger. Controlled head movements are driven by the cervicocephalic neuromuscular motor controller while the arm-reaching and drawing movements are controlled by the left arm neuromuscular motor controller.

Fig. 27. A frame rendered from a simulation of our virtual human reclining on an office chair and writing with a finger.


First, by applying a continuous sinusoidal displacement function, we kinematically perturbed the 3D position of the stool vertically and horizontally. Our virtual human attempted to intercept visual targets in a scenario similar to that of Fig. 23. For perturbations of amplitude 10 cm and frequency 3.75 Hz, we discovered that the sensorimotor system was still able to function robustly, whereas for an amplitude of 70 cm and a frequency of 6.0 Hz, it often failed to intercept the moving balls with its arms and legs.
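For reference, the perturbation is a simple sinusoid; the following minimal sketch is our own illustration (the function and parameter names are hypothetical, not from the paper):

```python
import numpy as np

def stool_offset(t, amplitude, frequency):
    """Sinusoidal displacement (meters) applied to the stool position at time t (seconds)."""
    return amplitude * np.sin(2.0 * np.pi * frequency * t)

# Regime in which the sensorimotor system remained robust:
mild = stool_offset(t=0.1, amplitude=0.10, frequency=3.75)   # 10 cm, 3.75 Hz
# Regime in which interception often failed:
severe = stool_offset(t=0.1, amplitude=0.70, frequency=6.0)  # 70 cm, 6.0 Hz
```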

Second, we laid the biomechanical virtual human in a supine position and discovered that, without any additional training on the dramatically different relative orientation of gravity, its neuromuscular controllers were sufficiently robust to enable it to reach up and touch with its hand a target sphere placed in various positions over its upper body.

Finally, Fig. 27 shows a frame from an animation of our virtual human writing “SIGGRAPH 2018” while seated in an office chair and continuously reclining down and up, again without any retraining.

9 CONCLUSION

Our sensorimotor control framework is unprecedented in several significant ways, as we will now discuss.

First, it features an anatomically accurate, biomechanically simulated human musculoskeletal model that is actuated by numerous contractile muscles.4 Second, it employs a biomimetic vision system, including foveated retinas in human-like eyes capable of realistic eye movements. Ray tracing serves in computing the irradiance captured by a biologically-inspired nonuniform distribution of photoreceptors on the retina. Third, through deep learning, its 5 recurrent neuromuscular motor controllers, each of which employs 2 fully-connected deep neural networks and muscle proprioceptive feedback, learn to control the cervicocephalic, arm, and leg musculoskeletal complexes. Fourth, also through deep learning, our virtual human learns to perceive using 10 deep feedforward neural networks fed by the optic nerve outputs of its eyes.

Another novel feature of our sensorimotor approach is the fact that the 20 DNNs are trained with large quantities of training data synthesized by the biomechanical human musculoskeletal model itself. In particular, it is implausible that biological neuromuscular motor control mechanisms can perform IK, ID, and minimal muscle effort optimization (MO) computations per se. In our approach, such computations are performed (incrementally) only to synthesize training data. Subsequently, backpropagation DNN learning from the synthesized data amounts to nonlinear regression in a high-dimensional space. Finally, the properly-trained DNNs yield biomimetic neuromuscular motor controllers with the required input-output behaviors. It is important to distinguish between these three aspects of our approach, as well as to understand that once the DNNs are trained offline, the training data can be discarded; i.e., training data need not be stored or cached for the trained DNNs to operate online. Moreover, the trained neuromuscular motor controllers are very efficient; e.g., our voluntary motor DNNs generate muscle activations in less than a millisecond, whereas the conventional IK-ID-MO approach consumes 120–230 msec per timestep.

We demonstrated the robust performance of our innovative approach in performing nontrivial sensorimotor tasks, such as generating natural limb-reaching actions so as to make contact with moving visual targets, or drawing and writing with a finger. These tasks require simultaneous eye movement control for saccadic foveation and smooth pursuit of visual targets in conjunction with lifelike dynamic head motion control as well as visually-guided dynamic limb control, all hampered by the downward pull of gravity.

Our work should inspire others to pursue increasingly realistic, embodied computational models of human perception and motion, which should eventually impact not only the visual effects and game industries, but also robotics and possibly even the brain sciences.

4 In our experience [Lee et al. 2009; Lee and Terzopoulos 2006; Nakada and Terzopoulos 2015; Si et al. 2014], biomimetic muscle-actuated control is inherently more stable and robust than robotic torque-actuated control.


9.1 Future Work

From the genesis of this project, we have believed that a deep-learning-based, strongly biomimetic sensorimotor approach is a promising direction for advanced human animation research. Yet it was by no means clear from the outset that our approach would prove to be so remarkably successful for a foveated vision model in conjunction with an anatomically accurate, biomechanically simulated musculoskeletal model. Prior sensorimotor systems work very differently from ours and, in our opinion, pale in comparison. However, our work to date inevitably has some obvious limitations that we plan to address in our future work.

After the DNNs in our prototype sensorimotor system are trained offline, in advance, they remain unchanged as they perform their online sensorimotor control tasks. To address this unnatural shortcoming, we plan to incorporate an online, continual, deep reinforcement learning scheme into our model.

Our current eye model is an ideal pinhole camera. We plan to create a more realistic eye model that not only has a finite-aperture pupil that dilates and constricts to accommodate the incoming light intensity, but also includes cornea and lens components to refract light rays and is capable of adjusting the optical depth of field through active, ciliary muscle control of the lens deformation. Furthermore, our current eye model is a purely kinematic rotating sphere. We plan to implement a biomechanical eye model (see, e.g., [Wei et al. 2010]) with the typical 7.5-gram mass of a human eyeball, actuated by extraocular muscles, not just the 4 rectus muscles to induce most of the θ, ϕ movement of our current kinematic eyeball, but also the 2 oblique muscles to induce torsion movements around the gaze direction. In this context, the nonlinear pulse-step controller of Lesmana and Pai [2011] for their robotic camera platform system may become relevant, as would their VOR- and OKR-based gaze stabilization work [Lesmana et al. 2014].

Although our work has demonstrated that, in order to perform certain sensorimotor tasks that are crucial first steps in sensorimotor control, it suffices to map rather directly from photoreceptor responses to control outputs, it remains to introduce additional stages of visual processing to enable the execution of more complex tasks. To this end, we will want to increase the number of retinal photoreceptors, experiment with different nonuniform photoreceptor distributions, and (perhaps automatically) construct 2D retino-cortical maps from the 1D retinal ONV outputs.

Our current human model lacks a vestibular system emulating that of the human inner ear and related neural circuits. We plan to develop one that utilizes trained neural controllers that appropriately couple head orientation to eye movement through a Vestibulo-Ocular Reflex (VOR). Also, our current sensory subsystem lacks an Opto-Kinetic Reflex (OKR), which we plan to develop as well, again with trained neural networks.

Ultimately, our goal is to free up our entire biomechanical musculoskeletal model, include additional neuromuscular motor controllers to control its entire spine and torso, and enable it to stand upright, balance in gravity, locomote, and perform various active vision tasks, including attending to and recognizing objects. This is a challenging proposition, but we are excited by the seemingly endless possibilities in this research direction.

Fig. 28. Training progress of the left arm voluntary motor DNN in experiments with various network shapes (a),(b) and depths (c),(d). Each graph plots mean squared error versus training epoch (0–2000); graphs (b) and (d) magnify graphs (a) and (c), respectively.

ACKNOWLEDGMENTS

We are indebted to Sung-Hee Lee for creating the biomechanical human musculoskeletal model that we have employed in our work. We thank Gestina Yassa for her voice over of our demonstration video, Zhili Chen for synthesizing handwriting trajectories, and the anonymous reviewers for their constructive comments. We gratefully acknowledge financial support from VoxelCloud, Inc., and a 5-year Scholarship for Long-Term Study Abroad to MN from the Japan Student Services Organization (JASSO).

A EXPERIMENTS WITH THE ARM MOTOR DNN

A.1 Network shape:

We conducted training experiments with the left arm voluntary motor controller using 9 different network architectures of various shapes. The network shapes are as follows, where the leftmost number, rightmost number, and intermediate numbers indicate the number of units in the input layer, output layer, and hidden layers, respectively:

(1) 32 | 100 | 100 | 100 | 100 | 100 | 29
(2) 32 | 300 | 300 | 300 | 300 | 300 | 29
(3) 32 | 500 | 500 | 500 | 500 | 500 | 29
(4) 32 | 300 | 250 | 200 | 150 | 100 | 29
(5) 32 | 500 | 400 | 300 | 200 | 100 | 29
(6) 32 | 100 | 200 | 300 | 400 | 500 | 29
(7) 32 | 100 | 200 | 300 | 200 | 100 | 29
(8) 32 | 300 | 200 | 100 | 200 | 300 | 29
(9) 32 | 3000 | 3000 | 29

This list encompasses wide-shallow and deep network architectures with rectangular, triangular, and diamond shapes. Fig. 28a,b graph the mean squared error as a function of the number of epochs during the training process on the validation datasets for each of the above-listed network architectures, by number.5

5 90% of our 1M-item synthesized datasets were used for training the DNNs, while the remaining 10% were reserved for validation of the trained DNNs.


Fig. 29. Average time per epoch (in seconds) for the left arm voluntary motor DNN experiments with various (a) network shapes and (b) network depths, plotted by experiment number.

All the training processes converged; however, the convergence speed and stability varied by architecture.

The rectangular networks (1,2,3) required fewer epochs to converge as the number of units in each hidden layer increased. The network with 100 units in each hidden layer did not converge well enough to make it useful as a motor controller. Those with 300 and 500 units in each hidden layer converged very well, especially network (3) with 500 units in each layer, which showed one of the best performances. The triangular and diamond-shaped architectures (4,5,6,7,8) converged at average speed; the learning progress was stable but slow. The wide-shallow network (9) was efficient in the sense that it required fewer epochs.

Fig. 29a graphs the average computational cost of each epoch. As expected, the required time increases with an increasing number of weights, the wide-shallow network requiring 6 to 8 times more computation per epoch than the deep-narrow ones.
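For illustration, such shape specifications map straightforwardly onto code. The following PyTorch-style sketch is our own illustration, not the paper's implementation (the paper does not disclose its framework, and the builder function and choice of ReLU nonlinearity are assumptions):

```python
import torch.nn as nn

def build_mlp(shape):
    """Build a fully-connected network from a layer-size list, e.g. [32, 500, ..., 29]."""
    layers = []
    for i in range(len(shape) - 1):
        layers.append(nn.Linear(shape[i], shape[i + 1]))
        if i < len(shape) - 2:        # hidden layers get a nonlinearity; the output layer is linear
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Architecture (5): a triangular network with 32 inputs and 29 muscle-activation outputs.
net = build_mlp([32, 500, 400, 300, 200, 100, 29])
```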

A.2 Network depth:

The next experiment evaluated rectangular networks with approximately the same total number of hidden units, but different depths, as follows:

(1) 32 | 1800 | 29
(2) 32 | 900 | 900 | 29
(3) 32 | 600 | 600 | 600 | 29
(4) 32 | 450 | 450 | 450 | 450 | 29
(5) 32 | 360 | 360 | 360 | 360 | 360 | 29
(6) 32 | 300 | 300 | 300 | 300 | 300 | 300 | 29
(7) 32 | 257 | 257 | 257 | 257 | 257 | 257 | 257 | 29
(8) 32 | 225 | 225 | 225 | 225 | 225 | 225 | 225 | 225 | 29

To evaluate the effect of network depth independently of the total number of hidden units in the network, each of the above networks includes approximately 1,800 hidden units. Fig. 28c,d graph the mean squared error as a function of the number of epochs of the training process run on the validation dataset. Fewer epochs were required as the network depth increased, until the number of hidden layers reached 6; the required number of epochs then increased for 7 hidden layers and decreased again for 8 hidden layers.

Fig. 29b graphs the average computational cost of each epoch. Interestingly, the required time per epoch increases until the number of hidden layers reaches 6, but the required times with more than 5 hidden layers are nearly the same. The network with 7 hidden layers actually consumed slightly less time than the one with 6 hidden layers, and the time increased again for the network with 8 hidden layers.

Thus, this experiment shows that the number of hidden layers affects the regression abilities of the voluntary motor DNN as well as the computational time consumed for each training epoch, albeit much less substantially so beyond 6 hidden layers.
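The equal-budget shapes above can be generated mechanically; a small sketch (again our own illustration, reusing the hypothetical build_mlp from the previous sketch) distributes a fixed budget of hidden units evenly across a chosen number of hidden layers:

```python
def depth_variant(n_hidden_layers, budget=1800, n_in=32, n_out=29):
    """Shape with ~`budget` hidden units split evenly across `n_hidden_layers` layers."""
    width = budget // n_hidden_layers
    return [n_in] + [width] * n_hidden_layers + [n_out]

# depth_variant(7) -> [32, 257, 257, 257, 257, 257, 257, 257, 29], matching architecture (7).
```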

B MOTOR CONTROLLER TRAINING DATA SYNTHESIS

The DNNs that implement the voluntary and reflex neuromuscular motor controllers in our sensorimotor system are trained offline in a supervised manner. We synthesize training data offline using our biomechanical human musculoskeletal model simulator. Once they are trained, the DNNs can quickly produce the required muscle activation signals online.

Given the current state of the musculoskeletal system, the current muscle activations, and a desired target state, we compute an appropriate adjustment to each muscle activation so as to drive the musculoskeletal system closer to the target state.

For example, if a limb end effector is in some given position and orientation and we want it to achieve a new pose, first we compute the desired changes of the skeletal joint angles $q$ by solving an inverse kinematics (IK) problem. Second, we compute the desired accelerations for the joints to achieve the desired motion in a given time step. Third, we compute the required joint torques by solving the inverse dynamics (ID) problem for each joint, subject to external forces such as gravity. Finally, we apply a muscle optimization technique to compute the minimal muscle activations that can generate the desired torque for each joint; i.e., a minimal-effort objective.
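In outline, one step of this data-synthesis procedure might be organized as follows (our paraphrase in Python; all four solver callables are hypothetical placeholders for the numerical procedures just described, injected as arguments so the sketch is self-contained):

```python
def synthesize_training_sample(solve_ik, desired_accelerations, inverse_dynamics,
                               muscle_optimization, model, target_pose, h):
    """One pass of the IK -> ID -> muscle-optimization pipeline used to label training data."""
    dq = solve_ik(model, target_pose)           # 1. desired joint-angle changes toward the target
    qdd = desired_accelerations(model, dq, h)   # 2. desired joint accelerations for time step h
    tau = inverse_dynamics(model, qdd)          # 3. joint torques, subject to gravity, etc.
    return muscle_optimization(model, tau)      # 4. minimal-effort muscle activations
```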

Voluntary Motor Controller. For the purposes of training the voluntary motor DNN, the input is the concatenation of the desired movement vector $\Delta p = p_d - p$ between the current pose $p$ and the desired pose $p_d$, as well as the current muscle activations $a$, while the desired output of the network in response to this input is the adjustment to the voluntary activations $\Delta a_v = a_d - a$, where $a_d$ is the desired voluntary muscle activations, which is to be added to the neuromuscular controller output per Equation 1. Hence, a single training pair for the voluntary network is as follows:

$$\text{Input: } [\Delta p,\, a]; \quad \text{Output: } \Delta a_v. \qquad (3)$$
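Assembling a single voluntary training pair per Equation 3 is a simple concatenation; the following sketch is our illustration (the dimensions follow the 32-input/29-output left arm network of Appendix A, with a 3-D movement vector):

```python
import numpy as np

def voluntary_training_pair(p, p_d, a, a_d):
    """Input: [Δp, a]; Output: Δa_v = a_d - a (Equation 3)."""
    x = np.concatenate([p_d - p, a])   # desired movement vector plus current activations
    y = a_d - a                        # voluntary activation adjustment
    return x, y

# E.g., for the left arm: 3-D p, p_d and 29-D a, a_d yield a 32-D input and 29-D output.
```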

Reflex Motor Controller. The reflex muscle activation signals are

$$a_r = k_p (e_d - e) + k_d\, \mathrm{sat}_m(\dot{e}_d - \dot{e}), \qquad (4)$$

where $e$ and $\dot{e}$ are the current muscle strains and strain rates, $e_d$ and $\dot{e}_d$ are the desired strains and strain rates, respectively, and where the saturation function

$$\mathrm{sat}_m(x) = \begin{cases} x & |x| < m; \\ m\,\operatorname{sgn}(x) & \text{otherwise}, \end{cases} \qquad (5)$$

avoids instability from overly large derivative feedback. We set the proportional gain $k_p = 8.0$, the derivative gain $k_d = 0.05$, and $m = 2.0$, and compute the desired strain and strain rates by the setpoint method [Lee et al. 2009].
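Equations 4 and 5 translate directly into code; a minimal sketch with the stated gains (our illustration; the strains are assumed to be NumPy arrays, and clipping to [-m, m] coincides with the saturation function):

```python
import numpy as np

def sat(x, m=2.0):
    """Saturation function (Equation 5): clamps derivative feedback to [-m, m]."""
    return np.clip(x, -m, m)

def reflex_activations(e, e_dot, e_d, e_d_dot, kp=8.0, kd=0.05):
    """Reflex muscle activation signals a_r (Equation 4)."""
    return kp * (e_d - e) + kd * sat(e_d_dot - e_dot)
```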


Therefore, for the purposes of training the reflex DNN controllers, the input is the concatenation of the desired adjustment of the strain $\Delta e = e_d - e$ and strain rate $\Delta \dot{e} = \dot{e}_d - \dot{e}$, while the desired output of the network in response to the input is the reflex activations adjustment $\Delta a_r$, which is to be added to the neuromuscular controller output per Equation 1. Hence, a single training pair for the reflex network is as follows:

$$\text{Input: } [\Delta e,\, \Delta \dot{e}]; \quad \text{Output: } \Delta a_r. \qquad (6)$$

We iteratively update the joint angles in a gradient-descent manner such that the difference between the current pose $q$ and target pose $q^*$ is minimized. The controller determines the required accelerations to reach the target at each time step $h$ using the following PD approach:

$$\ddot{q}^* = k_p (q^* - q) + k_d (\dot{q}^* - \dot{q}), \qquad (7)$$

with proportional gain $k_p = 2(1 - \gamma)/h^2$, where $\gamma$ is the error reduction rate, and derivative gain $k_d = 2/h$. We set $h = \gamma = 0.1$.
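A direct transcription of Equation 7 and its gain formulas (our sketch; $q$ and its derivatives are assumed to be NumPy arrays or scalars):

```python
def pd_desired_acceleration(q, q_dot, q_star, q_star_dot, h=0.1, gamma=0.1):
    """Desired joint accelerations per Equation 7, with kp = 2(1 - gamma)/h^2 and kd = 2/h."""
    kp = 2.0 * (1.0 - gamma) / h**2
    kd = 2.0 / h
    return kp * (q_star - q) + kd * (q_star_dot - q_dot)
```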

We use the hybrid recursive dynamics algorithm due to Featherstone [2014], which makes it possible to compute the desired accelerations for acceleration-specified joints and the desired torques for torque-specified joints, as follows: First, we set the muscle-driven joints as acceleration-specified joints, while the passive joints remain torque-specified joints. We compute the desired accelerations for the muscle-driven joints and run the hybrid dynamics algorithm to compute the resulting accelerations for the passive joints. Second, we advance the system to the next time step using a first-order implicit Euler method, and use the hybrid dynamics algorithm to compute torques for the muscle-driven joints, thus obtaining the desired torques for the muscle-driven joints and accelerations for the passive joints.

After the desired torques are obtained, an optimization problem for agonist and antagonist muscles is solved to compute the desired minimal-effort muscle activation levels.
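The paper does not specify its optimizer; as one plausible sketch, the minimal-effort computation can be approximated by a bounded, effort-regularized least-squares problem over a linearized activation-to-torque map M (both the linearization and the SciPy-based solver are our assumptions, not the authors' method):

```python
import numpy as np
from scipy.optimize import lsq_linear

def minimal_effort_activations(M, tau_desired, effort_weight=1e-3):
    """Bounded least squares: match desired joint torques while penalizing activation effort.

    M: (n_joints x n_muscles) gain matrix mapping muscle activations to joint torques
       (a hypothetical linearization; the paper's Hill-type muscle model is nonlinear).
    """
    n_muscles = M.shape[1]
    # Stack an effort-penalty block so the solver minimizes ||M a - tau||^2 + w ||a||^2
    # subject to 0 <= a <= 1.
    A = np.vstack([M, np.sqrt(effort_weight) * np.eye(n_muscles)])
    b = np.concatenate([tau_desired, np.zeros(n_muscles)])
    return lsq_linear(A, b, bounds=(0.0, 1.0)).x
```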

Additional details about the above numerical techniques are found in [Lee et al. 2009; Lee and Terzopoulos 2006].

REFERENCES

J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. 2010. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conference. Austin, TX, 1–7.

A.L. Cruz Ruiz, C. Pontonnier, N. Pronost, and G. Dumont. 2017. Muscle-based control for character animation. Computer Graphics Forum 36, 6 (2017), 122–147.

M.F. Deering. 2005. A photon accurate model of the human eye. ACM Transactions on Graphics 24, 3 (2005), 649–658.

P. Faloutsos, M. van de Panne, and D. Terzopoulos. 2001. Composable controllers for physics-based character animation. In Proc. 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '01). Los Angeles, CA, 251–260.

Y. Fan, J. Litven, and D.K. Pai. 2014. Active volumetric musculoskeletal systems. ACM Transactions on Graphics 33, 4 (2014), 152.

R. Featherstone. 2014. Rigid Body Dynamics Algorithms. Springer, New York, NY.

T. Geijtenbeek, M. Van De Panne, and A.F. Van Der Stappen. 2013. Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on Graphics 32, 6 (2013), 206.

I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press, Cambridge, MA.

R. Grzeszczuk, D. Terzopoulos, and G. Hinton. 1998. NeuroAnimator: Fast neural network emulation and control of physics-based models. In Computer Graphics Proceedings, Annual Conference Series. Orlando, FL, 9–20. Proc. ACM SIGGRAPH 98.

K. He, X. Zhang, S. Ren, and J. Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. IEEE International Conference on Computer Vision. Santiago, Chile, 1026–1034.

J.K. Hodgins, W.L. Wooten, D.C. Brogan, and J.F. O'Brien. 1995. Animating human athletics. In Proc. ACM SIGGRAPH '95 Conference. Los Angeles, CA, 71–78.

D. Holden, T. Komura, and J. Saito. 2017. Phase-functioned neural networks for character control. ACM Transactions on Graphics 36, 4 (2017), 42.

W. Huang, M. Kapadia, and D. Terzopoulos. 2010. Full-body hybrid motor control for reaching. In Motion in Games (Lecture Notes in Computer Science, Vol. 6459). Springer-Verlag, Berlin, 36–47.

A.-E. Ichim, P. Kadleček, L. Kavan, and M. Pauly. 2017. Phace: Physics-based face modeling and animation. ACM Transactions on Graphics 36, 4 (2017), 153.

P. Kadleček, A.-E. Ichim, T. Liu, J. Křivánek, and L. Kavan. 2016. Reconstructing personalized anatomical models for physics-based body animation. ACM Transactions on Graphics 35, 6 (2016), 213.

K. Kähler, J. Haber, H. Yamauchi, and H.-P. Seidel. 2002. Head shop: Generating animated head models with anatomical structure. In Proc. 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. San Antonio, TX, 55–63.

D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. Technical Report. arXiv preprint arXiv:1412.6980.

S.-H. Lee, E. Sifakis, and D. Terzopoulos. 2009. Comprehensive biomechanical modeling and simulation of the upper body. ACM Transactions on Graphics 28, 4 (2009), 99:1–17.

S.-H. Lee and D. Terzopoulos. 2006. Heads Up! Biomechanical modeling and neuromuscular control of the neck. ACM Transactions on Graphics 25, 3 (2006), 1188–1198. Proc. ACM SIGGRAPH 2006.

Y. Lee, M.S. Park, T. Kwon, and J. Lee. 2014. Locomotion control for many-muscle humanoids. ACM Transactions on Graphics 33, 6 (2014), 218.

Y. Lee, D. Terzopoulos, and K. Waters. 1995. Realistic modeling for facial animation. In Computer Graphics Proceedings, Annual Conference Series (Proc. ACM SIGGRAPH 95). Los Angeles, CA, 55–62.

M. Lesmana, A. Landgren, P.-E. Forssén, and D.K. Pai. 2014. Active gaze stabilization. In Proc. Indian Conference on Computer Vision, Graphics, and Image Processing. Bangalore, India, Article 81, 8 pages.

M. Lesmana and D.K. Pai. 2011. A biologically inspired controller for fast eye movements. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, Shanghai, China, 3670–3675.

L. Liu and J. Hodgins. 2017. Learning to schedule control fragments for physics-based characters using deep Q-learning. ACM Transactions on Graphics 36, 3 (2017), 29.

M. Nakada, H. Chen, and D. Terzopoulos. 2018. Deep learning of biomimetic visual perception for virtual humans. In Proc. ACM Symposium on Applied Perception (SAP '18). Vancouver, BC, 1–8.

M. Nakada and D. Terzopoulos. 2015. Deep learning of neuromuscular control for biomechanical human animation. In Advances in Visual Computing (Lecture Notes in Computer Science, Vol. 9474). Springer, Berlin, 339–348. Proc. International Symposium on Visual Computing, Las Vegas, NV, December 2015.

X.B. Peng, G. Berseth, K. Yin, and M. Van De Panne. 2017. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics 36, 4 (2017), 41.

T.F. Rabie and D. Terzopoulos. 2000. Active perception in virtual humans. In Proc. Vision Interface 2000. Montreal, Canada, 16–22.

P. Sachdeva, S. Sueda, S. Bradley, M. Fain, and D.K. Pai. 2015. Biomechanical simulation and control of hands and tendinous systems. ACM Transactions on Graphics 34, 4 (2015), 42.

E.L. Schwartz. 1977. Spatial mapping in the primate sensory projection: Analytic structure and relevance to perception. Biological Cybernetics 25, 4 (1977), 181–194.

W. Si, S.-H. Lee, E. Sifakis, and D. Terzopoulos. 2014. Realistic biomechanical simulation and control of human swimming. ACM Transactions on Graphics 34, 1, Article 10 (Nov. 2014), 15 pages.

E. Sifakis, I. Neverov, and R. Fedkiw. 2005. Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Transactions on Graphics 24, 3 (2005), 417–425.

S. Sueda, A. Kaufman, and D.K. Pai. 2008. Musculotendon simulation for hand animation. ACM Transactions on Graphics 27, 3 (Aug. 2008), 83.

D. Terzopoulos and T.F. Rabie. 1995. Animat vision: Active vision with artificial animals. In Proc. Fifth International Conference on Computer Vision (ICCV '95). Cambridge, MA, 840–845.

D. Terzopoulos and K. Waters. 1990. Physically-based facial modelling, analysis, and animation. Computer Animation and Virtual Worlds 1, 2 (1990), 73–80.

J.M. Wang, S.R. Hamner, S.L. Delp, and V. Koltun. 2012. Optimizing locomotion controllers using biologically-based actuators and objectives. ACM Transactions on Graphics 31, 4, Article 25 (2012), 11 pages.

Q. Wei, S. Sueda, and D.K. Pai. 2010. Biomechanical simulation of human eye movement. In Biomedical Simulation (Lecture Notes in Computer Science, Vol. 5958). Springer-Verlag, Berlin, 108–118.

S.H. Yeo, M. Lesmana, D.R. Neog, and D.K. Pai. 2012. Eyecatch: Simulating visuomotor coordination for object interception. ACM Transactions on Graphics 31, 4 (2012), 1–10.
