From Canonical Poses to 3D Motion Capture Using a Single Camera

Andrea Fossati, Miodrag Dimitrijevic, Vincent Lepetit, and Pascal Fua, Senior Member, IEEE

Abstract—We combine detection and tracking techniques to achieve robust 3D motion recovery of people seen from arbitrary viewpoints by a single and potentially moving camera. We rely on detecting key postures, which can be done reliably, using a motion model to infer 3D poses between consecutive detections, and finally refining them over the whole sequence using a generative model. We demonstrate our approach in the cases of golf motions filmed using a static camera and walking motions acquired using a potentially moving one. We show that our approach, although monocular, is both metrically accurate, because it integrates information over many frames, and robust, because it can recover from a few misdetections.

Index Terms—Computer vision, motion, video analysis, 3D scene analysis, modeling and recovery of physical attributes, tracking.


1 INTRODUCTION

Recent approaches to modeling people's 3D motion from video sequences can be roughly classified into those that detect specific postures in individual frames and those that track the motion from frame to frame given an initial pose. The first category usually involves matching against a large image database and is becoming increasingly popular, but requires very large training data sets to be effective. The second category involves predicting the pose in a frame given the pose computed in previous ones, which can easily fail if errors start accumulating in the prediction, causing the estimation process to diverge.

Neither technique is clearly superior to the other, and both are actively investigated. In this paper, we show that they can be combined to accurately reconstruct the 3D motion of people seen from arbitrary viewpoints using a single, and potentially moving, camera. At the heart of our approach is the fact that human motions often contain characteristic postures that are relatively easy to detect. Given two consecutive such postures, modeling intermediate poses becomes an interpolation problem, which is much easier to solve reliably than open-ended tracking.

More specifically, we show that we can reconstruct 3D golfing motions filmed using a static camera and walking motions acquired using a potentially moving one. In the golf case, the easy-to-detect postures are the starting position, the one at which the golfer transitions from upswing to downswing, and the final one. For walking, they are the ones that occur at the end of each step, when people have their legs furthest apart. We therefore use a chamfer-based method [11] that was designed to detect key postures from any viewpoint, even when the background is cluttered and background subtraction is impractical because the camera moves, as is the case in Fig. 1a. Because the detected postures are projections of 3D models, we can map them back to full 3D poses and use them to select and warp motions from a training database that closely match them. This yields initial pose estimates such as those of Fig. 1b. It lets us create the synthetic images we would see if the person truly were in those positions. These images are depicted by Fig. 1c, and we refine the pose until they match the real ones. This yields the results depicted by Figs. 1d and 1e.

The importance of combining detection and tracking to achieve robustness has long been known [9], [23], and manually introducing a few 3D keyframes in a tracking algorithm has been shown to be effective [10]. More recently, a fully automated approach to combining tracking and detection has been shown to be robust at following multiple people in 2D over very long sequences [28]. This is achieved by detecting people in canonical poses and tracking them from there, which still has the potential to diverge. By contrast, interpolating between detected silhouettes prevents this and yields 3D reconstructions.

We chose walking and golfing to demonstrate our approach because we had access to both the relevant motion databases and silhouette detection techniques. The framework, however, is general because most human motions include very characteristic postures that are easier to detect than completely arbitrary ones. Athletic motions are a good example of this: Canonical postures can be detected when a tennis player hits the ball with a forehand, a backhand, or a serve [38]. In a work environment, there also are very characteristic poses between which people alternate, such as sitting at their desk and walking through doors.

The fact that canonical postures are common is important because one of the limitations of state-of-the-art detection-based approaches to 3D motion reconstruction is that huge training databases would be required to detect all possible postures. By contrast, if one only needs to detect a few easily recognizable postures, much smaller databases should suffice, thus making the approach easier to deploy.


. The authors are with the Ecole Polytechnique Federale de Lausanne (EPFL)/IC/ISIM/CVLab, Computer Vision Laboratory, I&C Faculty, Station 14, CH-1015 Lausanne, Switzerland. E-mail: {andrea.fossati, miodrag.dimitrijevic, vincent.lepetit, pascal.fua}@epfl.ch.

Manuscript received 17 June 2008; revised 17 Feb. 2009; accepted 28 Apr. 2009; published online 5 May 2009. Recommended for acceptance by T. Darrell. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2008-06-0364. Digital Object Identifier no. 10.1109/TPAMI.2009.108.



2 RELATED WORK

Existing approaches to video-based 3D motion capture remain fairly brittle for many reasons: Humans have a complex articulated geometry overlaid with deformable tissues, skin, and loose clothing. Their motion is often rapid, complex, and self-occluding, and presents joint reflection ambiguities. Furthermore, the 3D body pose is only partially recoverable from its projection in one single image, where usually the background is cluttered and the resolution is poor. Reliable and robust 3D motion analysis, therefore, requires good tracking across frames, which is difficult because of the poor quality of image data and frequent occlusions. Recent approaches to handling these problems can roughly be classified into those that

. Detect: This implies recognizing postures from a single image by matching it against a database and has become increasingly popular recently [40], [1], [13], [24], [20], [12], [19], [30], [25], [4], [42], but requires very large sets of examples to be effective. Moreover, this often relies on background subtraction and on clean silhouettes, such as those that can be extracted from the HumanEva data set [35], which require static cameras or controlled environments. Finally, these methods are usually only able to obtain a good reconstruction of the body pose but cannot correctly locate it in a 3D environment.

. Track: This involves predicting the pose in a frame given observation of the previous one. It thus requires an initial pose and can easily fail if errors start accumulating in the prediction, causing the estimation process to diverge. The possibility of drifting is usually mitigated by introducing sophisticated statistical techniques for a more effective search [9], [7], [8], [45], [46], [37], by using strong dynamic motion models as priors [33], [27], [34], [2], [43], [39], [29], or even by introducing physics-based models [5].

Neither technique has been proven to be superior, and both are actively studied and sometimes combined: Manually introducing a few 3D keyframes is known to be a powerful way to constrain 3D tracking algorithms [10], [22]. In the 2D case, it has recently been shown that this can be done in a fully automated fashion to track multiple people in extremely long sequences [28]. This involves tracking forwards and backwards from individual and automatically detected canonical poses. While effective, this approach to tracking still has the potential to diverge. In this paper, we avoid this problem and go to full 3D by observing that automated canonical pose detections can be linked into complete trajectories, which lets us first recover rough 3D poses by interpolating between these detections and then refine them by using a generative model over full sequences. A similar approach has been proposed for 3D hand tracking [41] but makes much stronger assumptions than we do by requiring high-quality images so that the hand outlines can accurately and reliably be extracted from the background.

Fig. 1. Our approach. (a) Input sequence acquired using a moving camera with silhouettes detected at the beginning and the end of the walking cycle. The projection of the ground plane is overlaid as a blue grid. (b) Projections of the 3D poses inferred from the two detections. (c) Synthesized images that are most similar to the input: Different colors represent different appearance models. (d) Projections of the refined 3D poses. (e) 3D poses seen from a different viewpoint.

The work presented in this paper builds on some of our own earlier results. We rely on spatiotemporal templates to detect the people in canonical poses [11] and on PCA-based motion models [44] to perform the interpolation. However, unlike in this latter paper, the system does not require manual initialization. This means that we had to develop a strategy to link detections, infer initial 3D poses from them, and perform the pose refinement even when the camera moves or the background is cluttered. As a result, we can now operate fully automatically under far more challenging conditions than before.

3 APPROACH

We first use a template-based approach [11] to detect people in poses that are most characteristic of the target activity, as shown in Fig. 1a. The templates consist of consecutive 2D silhouettes obtained from 3D motion capture data seen from six different camera views and at different scales. This way, the motion information is incorporated into the templates and helps to distinguish actual people, who move in a predictable way, from static objects whose outlines roughly resemble those of humans. For each detection, the system returns a corresponding 3D pose estimate.

In theory, a person should be detected every time a key pose is attained, which the template-based algorithm does very reliably. The few false positives tend to correspond to actual people detected at somewhat inaccurate scales or orientations, and false negatives occur when the relative position of the person with respect to the camera generates an ambiguous projection and the key pose becomes hard to distinguish from others. In our experiments, this almost never happened in the golfing case but sometimes did in the walking case, when the camera moved and saw the subject against a cluttered background and from a difficult angle. To handle such cases, we have implemented a Viterbi-style algorithm that links detections into consistent trajectories, even though a few may have been missed. Since the camera may move, we perform this computation in the ground plane, which we relate to the image plane via a homography that is recomputed from frame to frame.

Finally, we use consecutive detections to select and time-warp motions from a training database obtained via optical motion capture. As shown in Fig. 1b, this gives us a rough estimate of the body's position and configuration in each frame between detections. To refine this initial estimate, and since the camera may move from frame to frame, we first compute homographies between consecutive frames and use them to synthesize a background image from which the moving person has been almost completely removed. When we know the camera to be static, we synthesize the background image by simple median filtering of the images between detections. We then learn an appearance model from the detections and use it in conjunction with the synthesized background to produce new images, which lets us refine the body position by minimizing an objective function that represents the dissimilarity between the original and synthetic images. For increased robustness, we perform this minimization over all frames simultaneously. This yields the refined poses depicted by Figs. 1c, 1d, and 1e.

In the remainder of this section, we first introduce the models we use to represent human bodies and their motion. We then briefly describe our approach, first to detecting people in canonical poses, second to using these detections to estimate the motion between frames, and, finally, to refining this estimate.

3.1 Body and Motion Models

As in [44], we represent the human body as cylinders attached to an articulated 3D skeleton and use a linear subspace method to represent the motion between canonical poses. This body model has standard dimensions and proportions; it therefore allows us to obtain reasonable results on different subjects without needing to be specifically adjusted. Adapting the skeleton proportions would have required a priori knowledge of their most likely variations, as was done, for example, in [3] using the SCAPE model. In practice, we have not found this necessary because of the scale ambiguity inherent to monocular reconstruction: Using a model that is slightly too small or too big simply results in variations in the recovered camera position with respect to the subject. A pose, whether canonical or not, is given by the position and orientation of its root node, defined at the sacroiliac, and a set of joint angles. More formally, let $D$ denote the number of joint angles in the skeletal model. A pose at time $t$ is then given by a vector of joint angles $\psi_t = [\theta_1, \ldots, \theta_D]^T$, along with the global position and orientation of the root

$$\mathbf{g}_t \in \mathbb{R}^6. \qquad (1)$$

A motion between two canonical poses can be viewed as a time-varying pose. While pose varies continuously with time, we assume a discrete representation in which pose is sampled at $N$ distinct time instants. In this way, a motion becomes a sequence of $N$ discrete poses

$$\Theta = \left[\psi_1^T, \ldots, \psi_N^T\right]^T \in \mathbb{R}^{D \cdot N}, \qquad \mathbf{G} = \left[\mathbf{g}_1^T, \ldots, \mathbf{g}_N^T\right]^T \in \mathbb{R}^{6 \cdot N}. \qquad (2)$$

Since motions can occur at different speeds, we encode them at a canonical speed and time-warp them to represent other speeds. We let the pose vary as a function of a phase parameter $\mu$ that is defined to be 0 at the beginning of the motion and 1 at the end. For periodic motions such as walking, the phase is periodic. For nonperiodic ones such as swinging a golf club, it is not. The canonical motion is then represented with a sequence of $N$ poses, indexed by the phase of the motion. For frame $n \in [1, N]$, the discrete phase $\mu_n \in [0, 1]$ is simply

$$\mu_n = \frac{n - 1}{N - 1}. \qquad (3)$$



In practice, we learn motion models from optical motion capture data comprising several people performing the same activity several times. For walking, we used a Vicon system to capture the motions of four men and four women on a treadmill at speeds ranging from 3 to 7 km/h by increments of 0.5 km/h. The body model had $D = 84$ degrees of freedom. While one might also wish to include global translational or orientational velocities in the training data, these were not available with the treadmill data. We therefore only learn motion models for the joint angles. Four cycles of walking and running at each speed were used to capture the natural variability of motion from one gait cycle to the next for each person. Similarly, to learn the golf swing model, we asked two golfers to perform 24 swings each.

Because our subjects move at different speeds, we first dynamically time-warp and resample each training sample. This produces training motions with the same number of samples and with similar poses aligned. To this end, we first manually identify a small number of key postures specific to each motion type. We then linearly time-warp the motions so that the key postures are temporally aligned. The resulting motions are then resampled at regular time intervals using quaternion spherical interpolation [31] to produce the training poses $\{\psi_j\}_{j=1}^N$.
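To make the resampling step concrete, here is a minimal Python sketch, not the authors' code: it assumes each joint's orientation track is stored as time-stamped unit quaternions and substitutes SciPy's `Slerp` for the quaternion spherical interpolation of [31]; the function name and array layout are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def resample_joint_track(times, quats, n_samples):
    """Resample one joint's rotation track at regular time instants using
    spherical linear interpolation between the captured samples."""
    rotations = Rotation.from_quat(quats)        # (T, 4) quaternions, xyzw
    interpolate = Slerp(times, rotations)
    new_times = np.linspace(times[0], times[-1], n_samples)
    return interpolate(new_times).as_quat()      # (n_samples, 4)
```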

Given a training set of $M$ such motions, denoted $\{\Theta_j\}_{j=1}^M$, we use Principal Component Analysis to find a low-dimensional basis with which we can effectively model the motion. In particular, the model approximates motions in the training set with a linear combination of the mean motion $\Theta_0$ and a set of eigenmotions $\{\Theta_i\}_{i=1}^m$:

$$\Theta \approx \Theta_0 + \sum_{i=1}^{m} \alpha_i \Theta_i. \qquad (4)$$

The scalar coefficients $\{\alpha_i\}$ characterize the motion, and $m \leq M$ controls the fraction of the total variance of the training data that is captured by the subspace. In all of the experiments shown in this paper, we used $m = 5$, which has proven sufficient to achieve good reconstruction accuracy.

A pose is then defined as a function of the scalar coefficients $\{\alpha_i\}$ and a phase value $\mu$. We therefore write

$$\psi(\mu, \alpha_1, \ldots, \alpha_m) \approx \psi_0(\mu) + \sum_{i=1}^{m} \alpha_i \psi_i(\mu). \qquad (5)$$

Note that now the $\psi_i(\mu)$ are eigenposes and $\psi_0(\mu)$ is the mean pose for that particular phase.
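The subspace model of (4) and (5) is straightforward to implement. The sketch below is ours, not the authors' code: it assumes the time-warped training motions are stacked as plain joint-angle vectors (glossing over the fact that angles live on a circle) and recovers the mean motion, the eigenmotions, and the eigenvalues $\lambda_k$ that the prior of (7) will later need.

```python
import numpy as np

def learn_motion_model(motions, m=5):
    """Learn the linear motion model of Eq. (4) from an (M, N, D) array of
    time-warped training motions: M examples, N phase samples, D angles."""
    M, N, D = motions.shape
    X = motions.reshape(M, N * D)           # one long vector per motion
    theta0 = X.mean(axis=0)                 # mean motion Theta_0
    _, S, Vt = np.linalg.svd(X - theta0, full_matrices=False)
    eigen_motions = Vt[:m]                  # eigenmotions Theta_1..Theta_m
    eigenvalues = S[:m] ** 2 / (M - 1)      # lambda_k, used in Eq. (7)
    return theta0, eigen_motions, eigenvalues

def pose_at_phase(theta0, eigen_motions, alphas, mu, N, D):
    """Evaluate Eq. (5): the pose at phase mu for coefficients alpha_i,
    linearly blending the two nearest discrete phase samples."""
    poses = (theta0 + alphas @ eigen_motions).reshape(N, D)
    x = mu * (N - 1)                        # continuous index along the cycle
    n0 = int(np.floor(x))
    n1 = min(n0 + 1, N - 1)
    w = x - n0
    return (1 - w) * poses[n0] + w * poses[n1]
```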

3.2 Detection and Initialization

As in our earlier publication [11], people in canonical poses are detected using spatiotemporal templates, which are sequences of three silhouettes of a person, such as the one of Fig. 2c. The first corresponds to the moment just before they reach the target pose, the second to the moment when they have precisely the right attitude, and the third to the moment just after. Matching these templates against three-image sequences lets us differentiate between actual people, who move in a predictable way, and static objects whose outlines roughly resemble those of humans, which are surprisingly numerous. As a result, the method turns out to be much more robust than earlier template-based approaches to people detection [26], [15], [16].

As shown in Fig. 2a, to build these templates, we introduced a virtual character that can perform the same captured motions we used to build the motion model discussed above and rendered images at a rate of 25 frames per second as seen from virtual cameras in six different orientations. The rendered images are then used to create templates such as those depicted by Fig. 2b. The rendered images are rescaled at seven different scales ranging from $52 \times 64$ to $92 \times 113$ pixels, so that an image at one scale is 10 percent larger than the image one scale below. From each one of the rendered images, we extract the silhouette of the model. Each template is made of the silhouette corresponding to the canonical pose, the one before, and the one after. The silhouettes are represented as sets of oriented pixels that can be efficiently matched against image sequences. We refer the interested reader to our earlier publication for further details [11].

An added bonus of this approach to detecting people is that to each detection we can associate the set of $\{\alpha_i\}$ PCA coefficients, as defined in (4). Averaging the coefficients corresponding to two consecutive detections and sampling the $\mu_n$ phase parameter of (3) at regular intervals gives us a pose estimate in each intermediate frame. In the golfing case, where the body's center of gravity moves little, this is enough to characterize the whole motion, since we can assume that the $\mathbf{g}_t$ vector of (1) that encodes the position and orientation of the body root remains constant except for the component that encodes the rotation around the $z$ axis. In the walking case, this is of course not true, and we use the position of the detected silhouettes on the ground plane to estimate the person's 3D location and orientation. We then use spline interpolation to derive initial $\mathbf{g}_t$ values in between, as will be discussed in more detail in Section 4.3.
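In code, this initialization reduces to a few lines. The sketch below reuses `pose_at_phase` from the model sketch of Section 3.1; the detection objects and their `alphas` attribute are hypothetical stand-ins for whatever the detector actually returns.

```python
import numpy as np

def init_poses_between(det_a, det_b, n_frames, model):
    """Rough joint-angle initialization between two consecutive detections:
    average their PCA coefficients and sample the phase mu_n of Eq. (3)
    at regular intervals over the intermediate frames."""
    alphas = 0.5 * (det_a.alphas + det_b.alphas)
    return [pose_at_phase(model.theta0, model.eigen_motions, alphas, mu,
                          model.N, model.D)
            for mu in np.linspace(0.0, 1.0, n_frames)]
```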

3.3 Refinement

The poses obtained using the method discussed above are only approximate. To refine them, we generate for each one the synthetic images we would see if the person truly were in that pose and compare them to the original ones. Minimizing the dissimilarity between real and synthetic images then lets us refine the poses in each individual frame.

In our implementation, we depart from typical generative approaches in two important ways. First, to increase robustness, we refine the poses over all the frames simultaneously by optimizing with respect to the $\alpha$ PCA coefficients of (4) and the $\mathbf{g}_t$ root node positions and orientations of (1). Second, as shown in Fig. 1c, we create an appearance model not only for the person but also for the background, so that the synthetic images we produce include both. As illustrated by Fig. 3, this is important because it effectively constrains the projections of the reconstructed model to be at the right place and allows recovery of the correct pose even when the initial guess is far from it.

To perform the refinement, we define the objective function $L(\Phi)$ as $-\log(p(\Phi \mid I_1, \ldots, I_N))$ of a pose sequence $\Phi = \Phi(\mu_1, \ldots, \mu_N, \mathbf{g}_1, \ldots, \mathbf{g}_N, \alpha_1, \ldots, \alpha_m)$ in an image sequence $I_1, \ldots, I_N$. To compute it, we consider the standard Bayesian formula

$$p(\Phi \mid I_1, \ldots, I_N) = \frac{p(I_1, \ldots, I_N \mid \Phi) \cdot p(\Phi)}{p(I_1, \ldots, I_N)}. \qquad (6)$$



The $p(I_1, \ldots, I_N)$ term is constant and can be ignored. Because we have a dependable way to initialize $\Phi$, we express the prior as a distance from its initial value and write its negative log as

$$-\log p(\Phi) = \sum_{k=1}^{m} \left( \frac{\alpha_k - \alpha_k^0}{\sqrt{\lambda_k}} \right)^2, \qquad (7)$$

where $\alpha_k^0$ represents the initialization value for the $k$th PCA parameter, given by the detections, and $\lambda_k$ is the eigenvalue associated with the $k$th eigenvector.

Assuming conditional independence of the appearance in consecutive frames given the motion model, we can decompose $p(I_1, \ldots, I_N \mid \Phi)$ as

$$p(I_1, \ldots, I_N \mid \Phi) = \prod_{i=1}^{N} p(I_i \mid \psi_i, \mathbf{g}_i), \qquad (8)$$

where $\psi_i = \psi(\mu_i, \alpha_1, \ldots, \alpha_m)$ is the pose in image $I_i$, as defined by (5).

Assuming for simplicity that, given an estimated pose $\psi_i$, a background, and a foreground model, all of the pixels $(u, v)$ in image $I_i$ are conditionally independent, we write

$$p(I_i \mid \psi_i, \mathbf{g}_i) = \prod_{(u,v) \in I_i} p(I_i(u,v) \mid \psi_i, \mathbf{g}_i). \qquad (9)$$

Fig. 2. Creating spatiotemporal templates. (a) Six virtual cameras are placed around the model. Viewpoints 3 and 7 are not considered because they are not discriminant enough. (b) A template corresponding to a particular view consists of several silhouettes computed at three consecutive instants. The small blue arrows in image Camera 1/Frame 1 represent edge orientations used for matching silhouettes for some of the contour pixels. (c) The three silhouettes of a walking template are superposed to highlight the differences between outlines. (d) Superposed silhouettes of a golf swing template.

Fig. 3. Refinement process. The images are, from left to right, the input image, the initialization given by the interpolation process, the result obtained without using a background image, and finally the result obtained as proposed. Whole parts of the body can be missed when the background is not exploited.

We estimate $p(I_i(u,v) \mid \psi_i)$ for each pixel of frame $i$ as follows: Given the generated background images, we project our human body model according to pose $\psi_i$. As discussed above, individual limbs are modeled as cylinders to which we associate a color histogram obtained from the projected area of the limb in the frames where the silhouettes were detected. We project the body model onto the generated background image to obtain a synthetic image, such as those depicted by Fig. 1c. If $(u, v)$ is located within the projection of a body part, we take $p(I_i(u,v) \mid \psi_i)$ to be proportional to the value of its corresponding bin in the color histogram of the body part. If, instead, $(u, v)$ is located on the background, we take $p(I_i(u,v) \mid \psi_i)$ to be a Gaussian distribution centered on the corresponding pixel value in the synthetic background image $B_i$, with fixed covariance $\Sigma$. We therefore write

$$p(I_i(u,v) \mid \psi, \mathbf{g}) = \begin{cases} h_{\mathrm{part}(u,v)}(I_i(u,v)), & \text{if } (u,v) \in F, \\ \mathcal{N}(B_i(u,v), \Sigma; I_i(u,v)), & \text{if } (u,v) \notin F, \end{cases}$$

where $F$ represents the projection of the body model into the image. Modeling both the foreground and background appearance helps achieve greater accuracy and robustness, as already noted in the literature [18], [32] for static-camera cases. Given (7)-(9), we can write

$$L(\Phi) = -\log(p(\Phi)) + \sum_{i=1}^{N} \sum_{(u,v) \in I_i} -\log(p(I_i(u,v) \mid \psi_i)), \qquad (10)$$

and refine all of the poses between detections by minimizing $L(\Phi)$ with respect to $(\mu_1, \ldots, \mu_N, \mathbf{g}_1, \ldots, \mathbf{g}_N, \alpha_1, \ldots, \alpha_m)$, which define the motion in the whole sequence. This minimization is performed stochastically by sampling particles in the parameter space around the initialization.
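The structure of this cost is easy to express in code. The following is a minimal sketch of (7) and (10), not the authors' implementation: it assumes a caller-supplied `pixel_nll(image, pose)` that renders the synthetic image for a pose and returns the summed per-pixel negative log-likelihood of (9).

```python
import numpy as np

def neg_log_prior(alphas, alphas0, eigenvalues):
    """Eq. (7): deviation of the PCA coefficients from their
    detection-based initialization, scaled by the eigenvalues."""
    return np.sum(((alphas - alphas0) / np.sqrt(eigenvalues)) ** 2)

def objective(frames, poses, alphas, alphas0, eigenvalues, pixel_nll):
    """Eq. (10): L(Phi) is the prior term plus the image term summed over
    all frames, so that refinement couples the whole sequence."""
    data = sum(pixel_nll(image, pose) for image, pose in zip(frames, poses))
    return neg_log_prior(alphas, alphas0, eigenvalues) + data
```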

4 FROM DETECTIONS TO TRAJECTORIES

To reconstruct golf swings, we treat as canonical poses the transition between the upswing and the downswing and the end of the upswing, as shown in Fig. 4. Since there is little motion of the golfer's center of gravity during the swing, we take the $\mathbf{g}_t$ vectors of (1) to be initially all equal, and during the optimization we only allow a rotation around the $z$ axis. Furthermore, since the camera is static in the examples we use, we simply median-filter frames over the whole sequence to synthesize the background image we need to perform the pose refinement of Section 4.3.

To track walking people, we use the beginning of the walking cycle, when the legs are furthest apart, as our canonical pose. Our spatiotemporal templates detect this pose reliably, but with the occasional false positive and false negative. Such errors must be eliminated and the valid detections linked into consistent trajectories, which is more involved than in the golf case, since people move over time and the camera must also be allowed to move to keep them in view.

In this section, we first describe the linking procedure, then discuss how we initialize the $\mathbf{g}_t$ vectors that encode the person's global motion, and, finally, account for the fact that the direction in which people face and the direction in which they move are strongly correlated.


Fig. 4. Key pose detections at the beginning and at the end of a golf swing.

Fig. 5. Ground plane tracking. Given the corresponding interest points in each pair of consecutive frames in the sequence $I_i$, $i = 0..N$, it is possible to compute the homographies $H^i_{i+1}$, $i = 0..N-1$, between these frames. Further, knowing the initial homography $H^w_0$ between the world ground plane and the reference image, we compute the required homographies $H^w_i$, $i = 1..N$, between each of the frames and the world ground plane.


4.1 Linking Detections

In our scheme, people should be detected at the beginning of every walking cycle but are occasionally missed. To link these sparse detections into a complete trajectory, we have implemented a Viterbi-style algorithm. It is important to note that the system is trained on a very specific pose, the left leg in front of the right one, which helps our algorithm resolve ambiguities by giving higher scores to the correct detections. We could have used two key poses instead of one for each walking cycle, but we empirically found that using one was a good trade-off between tracking robustness and computational load. As shown in Section 5, even missing one detection out of two can still lead to reliable results. Finally, these detections include not only an image location but also the direction the person faces, which is an important clue for linking purposes.

4.1.1 Ground Plane Registration

Since the camera may move, we work in the ground plane, which we relate to each frame by a homography that is computed using a standard technique [36], illustrated in Fig. 5. In practice, we manually indicate the ground plane in one frame and compute an initial homography $H^w_0$ between it and the world ground plane $G_w$. Then, we detect interest points in both the reference frame and the next one and match them. From the set of correspondences, we compute the homography $H^0_1$ between the subsequent frame's ground plane and the reference frame's ground plane, and from it the homography $H^w_1$ between the subsequent frame's ground plane and the world ground plane. We repeat this process for all the frames $I_i$, $i = 1..N$, obtaining the homographies $H^w_i$, $i = 1..N$, between them and the world ground plane, $N$ being the number of frames in the sequence. This makes it easy to compute the world ground-plane coordinates of the detections, knowing their 2D coordinates in the frame ground plane. Since there are specific orientations associated with each detection, we also recalculate these orientations with respect to the world ground plane.

4.1.2 Formalizing the Problem

The homographies let us compute ground plane locations and one of the possible orientations for all detections, which we then need to link while ignoring potential misdetections. To this end, we define a hidden state at time $t$ as the oriented position of a person on the ground plane, $L_t = (X, Y, O)$, where $t$ is a frame index, $(X, Y)$ are discretized ground plane coordinates, and $O$ is one of the possible orientations.

We introduce the maximum likelihood estimate of a person's trajectory ending up at state $i$ at time $t$,

$$\delta_t(i) = \max_{l_1, \ldots, l_n} P(I_1, L_1 = l_1, \ldots, I_n, L_n = l_n), \qquad (11)$$

where $I_j$ represents the $j$th frame of the video sequence. Casting the computation of $\delta$ in a dynamic programming framework requires introducing probabilities of observing a particular image given a state and of transitioning from one state to the next.

We therefore take $b^i_t$, the probability of observing frame $I_t$ given hidden state $i$, to be

$$b^i_t = P(I_t \mid L_t = i) \propto \frac{1}{d_{\text{bayes-chamfer}}}, \qquad (12)$$

where $d_{\text{bayes-chamfer}}$ is a weighted average of the chamfer distances between projected template contours and actual image edges. This makes sense because the coefficients used to weight the contributions are designed to account for the relevance of different silhouette portions in a Bayesian framework [11].

We also introduce the probability of transition from state $j$ at time $t'$ to state $i$ at time $t$,

$$a^{\Delta t}_{ji} = P(L_t = i \mid L_{t'} = j), \quad \text{with } \Delta t = t - t'. \qquad (13)$$

Since we only detect people when their legs are spread furthest apart, we can only expect a detection approximately every $N_c = 30$ frames for an average $v = 5$ km/h walking speed in a 25 Hz video. This implies an average distance $d_c = \frac{v N_c}{25}$ between detections. We therefore assume that $a^{\Delta t}_{ji}$ for state $i = (X, Y, O)$ follows a Gaussian distribution centered at $(X_\mu, Y_\mu)$ such that

$$\sqrt{(X - X_\mu)^2 + (Y - Y_\mu)^2} = d_c, \qquad (14)$$


Fig. 6. Transitional probabilities for hidden state $(X, Y, O)$. They are represented by three Gaussian distributions corresponding to three possible previous orientations. Each Gaussian covers a 2D area bounded by two circles of radii $d_c - \delta d_c$ and $d_c + \delta d_c$, where $\delta d_c$ represents an allowable deviation from the mean, and by two lines defined by tolerance angle $\delta\varphi$.


and positioned in the direction 180 degrees opposite to the orientation $O$, as depicted by point A in Fig. 6. This Gaussian covers only the hidden states with orientation equal to $O$. The other previous states from which a transition may occur are those with orientations $O + \pi/4$ and $O - \pi/4$, which are covered by two neighboring Gaussians, as depicted by points B and C in Fig. 6.

4.1.3 Linking Sparse Detections

Given the probabilities of (12) and (13), if we could expect a detection in every frame, linking them into complete trajectories could be done using the Viterbi algorithm to recursively maximize the $\delta_t$ maximum likelihood of (11). However, since we can only expect a detection approximately every $N_c = 30$ frames, we allow the model to change state directly from $L_{t'}$ at time $t'$ to $L_t$ at time $t$ ($t' < t$), with $N_c - \delta\tau < t - t' < N_c + \delta\tau$, and to skip all frames in between. $\delta\tau$ is a frame distance tolerance that we set to 10 in our implementation.

This lets us reformulate the maximization problem of (11) as one of maximizing

$$\delta_t(i) = \max_{l_{t_1}, \ldots, l_{t_n}} P(I_{t_1}, L_{t_1} = l_{t_1}, \ldots, I_{t_n}, L_{t_n} = l_{t_n}) = b^i_t \max_{j, \Delta t} \left( a^{\Delta t}_{ji} \, \delta_{t - \Delta t}(j) \right), \qquad (15)$$

where $t_1 < t_2 < \cdots < t_n$ are the indices of the frames $I$ in which at least one detection occurred, and $N_c - \delta\tau < \Delta t < N_c + \delta\tau$.

This formulation lets us initially retain for each detection several hypotheses with different orientations and allow the dynamic programming algorithm to select those that provide the most likely trajectories according to the probabilities of (12) and (13). If a detection is missing, the algorithm simply bridges the gap using the transition probabilities only. For the sequences of Figs. 7 and 9, this yields the results depicted by Figs. 8 and 10.
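The recursion of (15) can be sketched as follows. This is a simplified Python illustration rather than the authors' code: detections are assumed to arrive as per-frame lists of hypotheses carrying a `log_b` observation score from (12), `trans_logprob` stands for the Gaussian transition model of (13) and (14), and gaps of roughly two or three cycles are also accepted so that missed detections are bridged.

```python
def link_detections(detections, trans_logprob, Nc=30, dtau=10):
    """Viterbi-style linking of sparse detections. `detections` maps a frame
    index t to a list of hypotheses; transitions are allowed whenever the
    frame gap is within dtau of one, two, or three walking cycles."""
    frames = sorted(detections)
    best = {}  # (frame, hypothesis index) -> (log score, backpointer)
    for t in frames:
        for i, h in enumerate(detections[t]):
            score, back = h.log_b, None     # a trajectory may start anywhere
            for t0 in (f for f in frames if f < t):
                if not any(abs((t - t0) - k * Nc) < dtau for k in (1, 2, 3)):
                    continue
                for j, h0 in enumerate(detections[t0]):
                    s = best[(t0, j)][0] + trans_logprob(h0, h, t - t0) + h.log_b
                    if s > score:
                        score, back = s, (t0, j)
            best[(t, i)] = (score, back)
    # backtrack from the best-scoring final state
    state = max(best, key=lambda k: best[k][0])
    path = [state]
    while best[path[-1]][1] is not None:
        path.append(best[path[-1]][1])
    return [(t, detections[t][i]) for t, i in reversed(path)]
```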

4.2 Predicting 3D Poses between Detections

A complete trajectory computed as discussed above includes rough estimates of the body's position, orientation, and 3D pose, parameterized by a set of joint angles, for the frames in which the key posture was detected. We use straightforward spline interpolation to predict positions and orientations of the body between two detections, which allows us to initialize the $\mathbf{g}$ vectors of (1).

The whole procedure is very simple and naturally extends to the case where a key posture has been missed, which can easily be detected by comparing the number of frames between consecutive detections to the median value for the whole sequence. In this case, a longer motion must be created by concatenating several motion cycles (usually two, and never more than three in our experiments), depending on the number of frames between detections. This new motion is then resampled as before. Obviously, the initial predictions then lose accuracy, but they usually remain precise enough to retrieve the correct poses, thanks to the refinement process described in the following section.
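A sketch of the spline initialization with SciPy, treating each of the six components of $\mathbf{g}$ independently (a simplification of our own: orientation angles would need unwrapping before being splined):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def init_root_vectors(det_frames, det_g, all_frames):
    """Spline-interpolate the 6D root vectors g_t of Eq. (1) between the
    frames where the key posture was detected."""
    spline = CubicSpline(det_frames, np.asarray(det_g), axis=0)
    return spline(all_frames)        # one interpolated 6-vector per frame
```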


Fig. 7. Filtering silhouettes with temporal consistency on an outdoor sequence acquired by a moving camera. (a) Detection hypotheses. (b) Detections after filtering out the detection hypotheses that do not lie on the recovered most probable trajectory. Note that the extremely similar poses, in which it is very hard to distinguish which leg is in front, are successfully disambiguated by our algorithm.


4.3 Refining the Predicted Poses

To track a person walking about, the camera usually has to move to keep him in view. Therefore, we cannot use a simple background subtraction technique to create the background image we require for refinement purposes and instead adopt the more sophisticated approach depicted by Fig. 11. We treat each image of the sequence in turn as a reference and consider the few images immediately before and after it. We compute homographies between the reference and all other images [17], [36], which is a reasonable approximation of the frame-to-frame deformation because the time elapsed between successive frames is short, and lets us warp all the images into the reference frame. Then, by computing the median of the values for each pixel in HSV color space, we obtain background images with few artifacts.
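A compact sketch of this synthesis with OpenCV follows; the BGR image layout and the form in which the homographies are passed are our assumptions, and taking a per-channel median of HSV values glosses over the circular nature of hue.

```python
import cv2
import numpy as np

def synthesize_background(ref_image, neighbors):
    """Median-based background synthesis for a moving camera. `neighbors`
    is a list of (image, H) pairs, H being the homography that warps that
    image into the reference view."""
    h, w = ref_image.shape[:2]
    stack = [cv2.cvtColor(ref_image, cv2.COLOR_BGR2HSV)]
    for image, H in neighbors:
        warped = cv2.warpPerspective(image, H, (w, h))
        stack.append(cv2.cvtColor(warped, cv2.COLOR_BGR2HSV))
    median = np.median(np.stack(stack), axis=0).astype(np.uint8)
    return cv2.cvtColor(median, cv2.COLOR_HSV2BGR)
```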

To account for the fact that walking trajectories are smooth both spatially and temporally, we do not treat the $\mathbf{g}_i$ and $\mu_i$ as independent from each other. Instead, as we did for initialization purposes, we represent trajectories as 2D splines lying on the ground plane, whose shape is completely defined by the position and orientation of the body root node at the end points of a sequence, which we denote as $\mathbf{g}_{\text{start}}$ and $\mathbf{g}_{\text{end}}$. In other words, we write all the $\mathbf{g}_i$ as functions of $\mathbf{g}_{\text{start}}$ and $\mathbf{g}_{\text{end}}$. Similarly, we introduce a parameter $0 < \mu_c < 1$ that defines what percentage of the walking cycle has been accomplished during the first half of the sequence and derive all the other $\mu_i$ by simple interpolation. If the speed remains constant during a walking cycle, the value of $\mu_c$ is 0.5. In practice, it can go from 0.3 to 0.7 if the person speeds up or slows down between the first and the second half-cycle.

Fig. 8. Recovered trajectory for the sequence depicted by Fig. 7. Dark blue squares represent detection hypotheses and bright short lines inside them represent the detection orientations. Smaller light green squares and lines represent the retained detections and their orientations, respectively. These detections form the most probable trajectory, depicted by dark blue lines.

Fig. 9. Filtering silhouettes with temporal consistency on an outdoor sequence acquired by a moving camera. (a) Detection hypotheses. (b) Detections after filtering out the detection hypotheses that do not lie on the recovered most probable trajectory.

We can now refine the pose between two detections by optimizing the objective function $L(\Phi)$ of (10) with respect to $(\mu_c, \alpha_1, \ldots, \alpha_m, \mathbf{g}_{\text{start}}, \mathbf{g}_{\text{end}})$. Our experiments have shown that using this reduced set of variables regularizes the motion and yields much better convergence properties than using the full parameterization. However, this formulation does not exploit the fact that people usually walk in the direction they are facing, which means that the body's global position, which is controlled by the first three variables of the 6D $\mathbf{g}_{\text{start}}$ and $\mathbf{g}_{\text{end}}$ vectors, is not independent from the other three, which control orientation. We can therefore further improve our results by adding an additional term to our objective function to enforce this constraint. We define

$$L_{\text{walk}}(\Phi) = L(\Phi) + \eta \sum_{i=2}^{N} \varphi^2_{(i-1) \to i}, \qquad (16)$$

where $\varphi_{(i-1) \to i}$ is the angle between the direction the person faces and the direction of motion, and $\eta$ is a weighting term that is kept constant for all our experiments and whose purpose is to make the two terms of the same order of magnitude. As demonstrated in [14], and as will be shown in Section 5, minimizing $L_{\text{walk}}(\Phi)$ instead of $L(\Phi)$ has little influence on the recovered poses but yields more realistic global body orientations.
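The added term is simple to compute from the interpolated trajectory. In the sketch below, `positions` holds the per-frame 2D root locations on the ground plane and `facing` the corresponding body orientations in radians; the angular difference is wrapped into $[-\pi, \pi]$ before being squared:

```python
import numpy as np

def walk_objective(L_phi, positions, facing, eta):
    """Eq. (16): penalize the squared angle between the direction the person
    faces and the direction of motion, on top of the image-based L(Phi)."""
    penalty = 0.0
    for i in range(1, len(positions)):
        dx, dy = positions[i] - positions[i - 1]
        diff = facing[i] - np.arctan2(dy, dx)
        phi = np.arctan2(np.sin(diff), np.cos(diff))  # wrap to [-pi, pi]
        penalty += phi ** 2
    return L_phi + eta * penalty
```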

5 RESULTS

In this section, we present our results on golfing and walking sequences that feature subjects other than those we used to create our motion databases, seen from many different perspectives. A computationally expensive part of the algorithm is the refinement step of Section 4.3 since, for each particle, we must render the whole sequence, be it a walking cycle or a golf swing, and compute the image likelihood for each frame. Therefore, finding the best solution for an activity can require around 10 minutes on a standard computer in our current MATLAB implementation.

5.1 Golfing

Fig. 12 depicts a golf swing by a professional golfer. By contrast, Fig. 13 depicts one performed by one of the authors of this paper, who does not play golf and whose motion is therefore far from correct. In both cases, our system correctly detects the key postures and recovers a 3D trajectory without any human intervention. This demonstrates that it is robust not only to the relatively low quality of the imagery but also to potentially large variations in the exact motion being recovered. Fig. 14 shows the background model that was recovered and used to generate the results of Fig. 12. Note that the feet are mistakenly made part of the background reconstruction, which results in unwarranted motion of the feet. This is easily fixed by constraining them to remain on the ground, as shown in the supplemental material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.108.

Fig. 10. Recovered hypothetical trajectory for the sequence depicted by Fig. 9. Dark blue squares represent detection hypotheses and bright short lines inside them represent the detection orientations. Smaller light green squares and lines represent the retained detections and their orientations, respectively. These detections form the most probable trajectory, depicted by dark blue lines.

Fig. 11. Synthesizing a background image. (a) The rightmost image is the reference image whose background we want to synthesize. The other four are those before and after it in the sequence. (b) The same four images warped to match the reference image. Computing the median image of these and the reference image yields the rightmost image, which is the desired background image.


Fig. 12. Reconstructing a golf swing performed by a professional player. (a) Frames from the input video with reprojected 3D skeletons. (b) 3D skeleton seen from a different viewpoint. The corresponding videos are given as supplemental material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.108.

Fig. 13. Reconstructing a golf swing performed by a novice player. (a) Frames from the input video with reprojected 3D skeletons. (b) 3D skeleton seen from a different viewpoint.

Fig. 14. Background image used to generate the results of Fig. 12. Notice that there are some artifacts, for instance in the feet area, which are nevertheless overcome by our algorithm.


5.2 Walking

We now demonstrate the performance of our algorithm on walking sequences acquired under common but challenging conditions. In all cases except when we use the HumanEva data set [35] to quantify our results, the subject is seen against a cluttered background and the camera moves to follow him, which precludes the use of simple background subtraction techniques.

In the sequences of Figs. 15 and 16, the camera translates. Furthermore, in Fig. 16, the subject is seen first from the side and progressively from the back as he becomes smaller and smaller. In the sequence of Fig. 17, the subject walks along a circular trajectory and the camera follows him from its center. At some point, the subject undergoes a total occlusion, but the global model allows the algorithm to nevertheless recover both pose and position for the whole sequence. We can also recover the instantaneous speeds and the ground plane trajectory, as shown in Fig. 18.

All of these results were obtained by minimizing the objective function of (16), which explicitly enforces consistency between the direction the person faces and the direction of motion. We also computed results by minimizing the objective function of (10), which does not take this consistency into account. When shown as projections in the original images, these two sets of results are almost indistinguishable. However, the improvement becomes clear when one compares the two trajectories of Fig. 18, one obtained without enforcing the constraint and the other with it. To validate these results, we manually marked the subject's feet every 10 frames in the sequence of Fig. 17 and used their positions with respect to the tiles on the ground plane to estimate their 3D coordinates. We then treated the vector joining the feet as an estimate of the body orientation and the midpoint as an estimate of its location. As can be seen in Table 1, linking orientation to motion produces a small improvement in the position estimate and a much more substantial one in the orientation estimate, which is consistent with what can be observed in Fig. 18. Obviously, these numbers should only be considered in a relative way; to get an idea of the quantitative performance of our algorithm, we refer the reader to the results on the HumanEvaII sequence.


Fig. 15. Recovered 3D skeletons reprojected into individual images of the sequence of Fig. 7, which was acquired by a camera translating to follow the subject.

Fig. 16. Final result for the subject of Fig. 9, who moves away from the camera and is eventually seen from behind. (a) Frames from the input video with reprojected 3D skeletons. (b) 3D skeletons seen from a different viewpoint. The 3D pose is correctly estimated over the sequence, even when the person goes far away and eventually turns his back to the camera. Note that the extremely similar poses, in which it is very hard to distinguish which leg is in front, are successfully disambiguated by our algorithm.


In the sequence of Fig. 19, the subject walks along a curvilinear path and the camera follows him so that the viewpoint undergoes large variations. We are nevertheless able to recover pose and motion in a consistent way, as shown in Fig. 20, which depicts the recovered trajectory. Again, linking orientation to motion yields improved results.

Fig. 21 demonstrates the robustness of our approach to missed detections. We ran our algorithm on the same sequence as in Fig. 1 but ignored one out of every two detections. Note that, even though the subject is now only detected every other step, the algorithm's performance barely degrades.

Fig. 17. Subject walking in a circle. (a) Frames from the input video with reprojected 3D skeletons. (b) 3D skeletons seen from a different viewpoint. The numbers in the bottom right corner are the instantaneous speeds derived from the recovered motion parameters. The corresponding videos are submitted as supplemental material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.108.

Fig. 18. Recovered 2D trajectory of the subject of Fig. 17. The underlying grid is made of $1 \times 1$ meter squares and the arrows represent the direction he is facing. (a) When orientation and motion are not linked, he appears to walk sideways. (b) When they are, he walks naturally.

Fig. 19. Pedestrian tracking and reprojected 3D model in a second sequence. (a) Frames from the input video with reprojected 3D skeletons. (b) 3D skeletons seen from a different viewpoint. The numbers in the bottom right corner are the instantaneous speeds derived from the recovered motion parameters.

Fig. 20. Recovered 2D trajectory of the subject of Fig. 19. As in Fig. 18, when orientation and motion are not linked, he appears to walk sideways (a), but not when they are (b).

Fig. 21. Robustness to misdetection. (a) Initial and refined poses for a sequence in which three consecutive key poses are detected. (b) Initial and refined poses for the same sequence when ignoring the central detection and using the other two. The initial poses are less accurate but the refined ones are indistinguishable.

Fig. 22. Absolute mean 3D error in joint location obtained on frames 21-248 of the HumanEvaII data set for subject S4 and using only camera C1 as input. It is expressed in millimeters.

Fig. 23. Tracking subject S4 from the HumanEvaII data set using only camera C1. Obtained results projected onto the input frames.

TABLE 1. Comparing the Recovered Position and Orientation Values for the Body Root Node against Ground Truth Data for the Sequence of Fig. 17. We provide the mean and standard deviation of the absolute positional error in the X and Y coordinates in centimeters and the mean and standard deviation of the recovered orientation error in degrees.

To further quantify our results, we tracked subject S4 of the HumanEvaII data set [35] over 230 frames acquired by camera C1, as depicted in Fig. 23. Since this camera is static, we used the same simple approach as in the golf case to synthesize the background image we use to compute our image likelihoods. In Fig. 22, we plot the mean 3D distance between the real positions of some reference points and those recovered by our algorithm, which is commensurate with the numerical results of Table 1 that we obtained using our own sequences. Given that our approach is strictly monocular (we simply ignored the input of the other cameras), the 158 mm average error our algorithm produces is within the range of methods that make similar assumptions. By comparison, errors around 200 mm are reported in [21] and between 100 and 200 mm in [6]. This is encouraging given that we only use relatively coarse models and motions described by a reduced number of parameters. In other words, our algorithm is designed more for robustness, moving cameras, and recovery from situations where other algorithms might lose track, such as total occlusions, than for accuracy.

Of course, even though it is designed for robustness, the algorithm can fail. In the walking case, this can happen if the subject performs very sharp turns, thus preventing the Viterbi algorithm from inferring the correct trajectory. Similarly, facing the camera for too long can result in loss of track, since our detector is designed for people not seen completely frontally. This could be overcome by adding an appropriate detector, which would be fairly easy to do since it could take advantage of the very reliable frontal head detection algorithms that now exist. In the golfing case, a misdetection of either the initial or the final pose would also cause a failure, but such misdetections are infrequent because the pose is so characteristic.

6 CONCLUSION

The walking and golfing motions contain characteristic postures that are relatively easy to detect. We have exploited this fact to formulate 3D motion recovery from a single video sequence as an interpolation problem. This is much easier to achieve than open-ended tracking, and we have shown that it can be solved using straightforward minimization.

This approach is generic because most human motions also feature canonical poses that can be easily detected. This is significant because it means that we can focus our future efforts on developing methods to reliably detect these canonical poses instead of all poses, which is much harder.

A limitation of our approach is that we do not handle transitions from one activity to another, as Markovian motion models could. However, since transitions typically also involve key poses, the approach could potentially be extended to this much more demanding context given a sufficiently rich training database. This would involve choosing which motion model to use to connect these key poses and modeling the transition probabilities between activities, and is a topic for future research.

REFERENCES

[1] A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.

[2] A. Agarwal and B. Triggs, "Tracking Articulated Motion with Piecewise Learned Dynamical Models," Proc. European Conf. Computer Vision, May 2004.

[3] A.O. Balan and M.J. Black, "The Naked Truth: Estimating Body Shape under Clothing," Proc. European Conf. Computer Vision, Part II, pp. 15-29, 2008.

[4] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas, "Fast Algorithms for Large Scale Conditional 3D Prediction," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[5] M. Brubaker, D. Fleet, and A. Hertzmann, "Physics-Based Person Tracking Using Simplified Lower-Body Dynamics," Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2007.

[6] M. Brubaker, A. Hertzmann, and D. Fleet, "Physics-Based Human Pose Tracking," Proc. NIPS Workshop Evaluation of Articulated Human Motion and Pose Estimation, 2006.

[7] K. Choo and D.J. Fleet, "People Tracking Using Hybrid Monte Carlo Filtering," Proc. Int'l Conf. Computer Vision, July 2001.

[8] A.J. Davison, J. Deutscher, and I.D. Reid, "Markerless Motion Capture of Complex Full-Body Movement for Character Animation," Proc. Eurographics Workshop Computer Animation and Simulation, 2001.

[9] J. Deutscher, A. Blake, and I. Reid, "Articulated Body Motion Capture by Annealed Particle Filtering," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2126-2133, 2000.

[10] D.E. DiFranco, T.J. Cham, and J.M. Rehg, "Reconstruction of 3D Figure Motion from 2D Correspondences," Proc. IEEE Conf. Computer Vision and Pattern Recognition, Dec. 2001.

[11] M. Dimitrijevic, V. Lepetit, and P. Fua, "Human Body Pose Detection Using Bayesian Spatio-Temporal Templates," Computer Vision and Image Understanding, vol. 104, nos. 2/3, pp. 127-139, 2006.

[12] E.-J. Ong, A.S. Micilotta, R. Bowden, and A. Hilton, "Viewpoint Invariant Exemplar-Based 3D Human Tracking," Computer Vision and Image Understanding, vol. 104, nos. 2/3, pp. 178-189, 2006.

[13] A. Elgammal and C.S. Lee, "Inferring 3D Body Pose from Silhouettes Using Activity Manifold Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2004.

[14] A. Fossati and P. Fua, "Linking Pose and Motion," Proc. European Conf. Computer Vision, Oct. 2008.

[15] D. Gavrila and V. Philomin, "Real-Time Object Detection for 'Smart' Vehicles," Proc. Int'l Conf. Computer Vision, pp. 87-93, 1999.

[16] J. Giebel, D.M. Gavrila, and C. Schnorr, "A Bayesian Framework for Multi-Cue 3D Object Tracking," Proc. European Conf. Computer Vision, 2004.

[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge Univ. Press, 2000.

[18] M. Isard and J. MacCormick, "Bramble: A Bayesian Multiple-Blob Tracker," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 34-41, July 2001.

[19] C.S. Lee and A. Elgammal, "Body Pose Tracking from Uncalibrated Camera Using Supervised Manifold Learning," Proc. NIPS Workshop Evaluation of Articulated Human Motion and Pose Estimation, 2006.

[20] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian Detection in Crowded Scenes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, June 2005.

[21] R. Li, M. Yang, S. Sclaroff, and T. Tian, "Evaluation of 3D Human Motion Tracking with a Coordinated Mixture of Factor Analyzers," Proc. NIPS Workshop Evaluation of Articulated Human Motion and Pose Estimation, 2006.

[22] G. Loy, M. Eriksson, J. Sullivan, and S. Carlsson, "Monocular 3D Reconstruction of Human Motion in Long Action Sequences," Proc. European Conf. Computer Vision, 2004.

[23] K. Mikolajczyk, R. Choudhury, and C. Schmid, "Face Detection in a Video Sequence - A Temporal Approach," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.

[24] G. Mori, X. Ren, A.A. Efros, and J. Malik, "Recovering Human Body Configurations: Combining Segmentation and Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.

[25] R. Navaratnam, A. Fitzgibbon, and R. Cipolla, "The Joint Manifold Model for Semi-Supervised Multi-Valued Regression," Proc. Int'l Conf. Computer Vision, Oct. 2007.

[26] C.F. Olson and D.P. Huttenlocher, "Automatic Target Recognition by Matching Oriented Edge Pixels," IEEE Trans. Image Processing, vol. 6, no. 1, pp. 103-113, Jan. 1997.

[27] D. Ormoneit, H. Sidenbladh, M.J. Black, and T. Hastie, "Learning and Tracking Cyclic Human Motion," Proc. Neural Information Processing Systems, pp. 894-900, 2001.

[28] D. Ramanan, A. Forsyth, and A. Zisserman, "Tracking People by Learning Their Appearance," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 65-81, Jan. 2007.

[29] B. Rosenhahn, T. Brox, and H.P. Seidel, "Scaled Motion Dynamics for Markerless Motion Capture," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.

[30] G. Shakhnarovich, P. Viola, and T. Darrell, "Fast Pose Estimation with Parameter-Sensitive Hashing," Proc. Int'l Conf. Computer Vision, 2003.

[31] K. Shoemake, "Animating Rotation with Quaternion Curves," Proc. ACM SIGGRAPH, vol. 19, pp. 245-254, 1985.

[32] H. Sidenbladh and M.J. Black, "Learning the Statistics of People in Images and Video," Int'l J. Computer Vision, vol. 54, pp. 181-207, 2003.

[33] H. Sidenbladh, M.J. Black, and D.J. Fleet, "Stochastic Tracking of 3D Human Figures Using 2D Image Motion," Proc. European Conf. Computer Vision, June 2000.

[34] H. Sidenbladh, M.J. Black, and L. Sigal, "Implicit Probabilistic Models of Human Motion for Synthesis and Tracking," Proc. European Conf. Computer Vision, May 2002.

[35] L. Sigal and M.J. Black, "HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion," technical report, Dept. of Computer Science, Brown Univ., 2006.

[36] G. Simon, A. Fitzgibbon, and A. Zisserman, "Markerless Tracking Using Planar Structures in the Scene," Proc. Int'l Symp. Mixed and Augmented Reality, pp. 120-128, Oct. 2000.

[37] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, "Discriminative Density Propagation for 3D Human Motion Estimation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2005.

[38] J. Sullivan and S. Carlsson, "Recognizing and Tracking Human Action," Proc. European Conf. Computer Vision, 2002.

[39] L. Taycher, G. Shakhnarovich, D. Demirdjian, and T. Darrell, "Conditional Random People: Tracking Humans with CRFs and Grid Filters," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.

[40] A. Thayananthan, B. Stenger, P.H.S. Torr, and R. Cipolla, "Tracking Articulated Hand Motion Using a Kinematic Prior," Proc. British Machine Vision Conf., pp. 589-598, 2003.

[41] C. Tomasi, S. Petrov, and A. Sastry, "3D Tracking = Classification + Interpolation," Proc. Int'l Conf. Computer Vision, pp. 1441-1448, 2003.

[42] R. Urtasun and T. Darrell, "Sparse Probabilistic Regression for Activity Independent Human Pose Inference," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.

[43] R. Urtasun, D. Fleet, and P. Fua, "3D People Tracking with Gaussian Process Dynamical Models," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.

[44] R. Urtasun, D. Fleet, and P. Fua, "Temporal Motion Models for Monocular and Multiview 3D Human Body Tracking," Computer Vision and Image Understanding, vol. 104, nos. 2/3, pp. 157-177, 2006.

[45] Q. Wang, G. Xu, and H. Ai, "Learning Object Intrinsic Structure for Robust Visual Tracking," Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2003.

[46] Y. Wu, G. Hua, and T. Yu, "Tracking Articulated Body by Dynamic Markov Network," Proc. Int'l Conf. Computer Vision, 2003.

Andrea Fossati received the BSc degree in computer engineering in 2003 from the Politecnico di Milano, Italy. He received the MSc degree in computer engineering from the Politecnico di Milano in 2005, and the MSc degree in computer science from the University of Illinois at Chicago in 2006. He is currently working toward the PhD degree in the Computer Vision Laboratory (CVLab) at the Swiss Federal Institute of Technology of Lausanne (EPFL). His research focuses on human body tracking in monocular sequences.

Miodrag Dimitrijevic received the electronic engineering degree with specialization in computer science from the Faculty of Electronic Engineering, University of Nis, Serbia, in 1997. After spending two years at Nitex (Textile Industry of Nis, Serbia) as a database designer, he came to EPFL (Swiss Federal Institute of Technology) in 2001. In 2002, after finishing his studies at the Graduate School in Computer Sciences, he joined the Computer Vision Laboratory, where he received the PhD degree in computer vision in 2007. In 2008, he worked as a postdoctoral researcher in the Audiovisual Communications Laboratory at EPFL. During his research, he worked on 3D face reconstruction from uncalibrated images, human 3D pose detection from monocular videos, and 2D and 3D object detection with applications on mobile devices.

Vincent Lepetit received the engineering and master's degrees in computer science from ESIAL in 1996. He received the PhD degree in computer vision in 2001 from the University of Nancy, France, after working on the ISA INRIA team. He then joined the Virtual Reality Lab at EPFL (Swiss Federal Institute of Technology) as a postdoctoral fellow and became a founding member of the Computer Vision Laboratory. He has received several awards in computer vision, including the best paper award at CVPR 2005. His research interests include vision-based augmented reality, 3D camera tracking, and object recognition.

Pascal Fua received the engineering degree from the Ecole Polytechnique, Paris, in 1984 and the PhD degree in computer science from the University of Orsay in 1989. He joined EPFL (Swiss Federal Institute of Technology) in 1996, where he is now a professor in the School of Computer and Communication Science. Before that, he worked at SRI International and at INRIA Sophia-Antipolis as a computer scientist. His research interests include human body modeling from images, optimization-based techniques for image analysis and synthesis, and the use of information theory in the area of model-based vision. He has (co)authored more than 100 publications in refereed journals and conferences. He is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and has been a program committee member of several major vision conferences. He is a senior member of the IEEE.

