

Rapid Skin: Estimating the 3D Human Pose and Shape in Real-Time

Matthias Straka, Stefan Hauswiesner, Matthias Rüther, and Horst Bischof
Institute for Computer Graphics and Vision

Graz University of Technology, Austria
{straka,hauswiesner,ruether,bischof}@icg.tugraz.at

Abstract—We present a novel approach to adapt a watertight polygonal model of the human body to multiple synchronized camera views. While previous approaches yield excellent quality for this task, they require processing times of several seconds, especially for high resolution meshes. Our approach delivers high quality results at interactive rates when a roughly initialized pose and a generic articulated body model are available. The key novelty of our approach is to use a Gauss-Seidel type solver to iteratively solve nonlinear constraints that deform the surface of the model according to silhouette images. We evaluate both the visual quality and accuracy of the adapted body shape on multiple test persons. While maintaining a similar reconstruction quality as previous approaches, our algorithm reduces processing times by a factor of 20. Thus it is possible to use a simple human model for representing the body shape of moving people in interactive applications.

Keywords—human body; multi-view geometry; silhouette; Laplacian mesh adaption; real-time

I. INTRODUCTION

Marker-less human pose and body shape estimation from images has numerous applications in video games, virtual try-ons, augmented reality and motion capture for the entertainment industry. Recent advances in real-time human pose estimation make it possible to create interactive environments where the only controller is the body of the user [1]. However, an estimate of the human pose alone is sometimes not sufficient. For example, displaying a realistic user-controlled avatar that not only mimics the pose of the user but also his appearance requires a full representation of the body surface. Such an avatar is an important component in augmented reality applications such as a virtual mirror [2].

The main challenges of capturing the body shape lie in the articulation of the human body and the variation of size, age and visual appearance between different persons. For static objects it is fairly easy to generate a realistic and accurate model in real-time, even with a single, moving camera [3]. However, people will change their pose continuously in an interactive scenario. This requires pose estimation and shape adaption for every single frame.

Several authors have tackled the task of body shape adaption by recording images from a multi-view camera setup and deforming a human body mesh such that it is consistent with the background-subtracted silhouette in each view [4]–[9]. While most approaches yield convincing results, two common limitations remain: a previously scanned model of the actor is required, and processing times in the order of several seconds per frame have to be expected.

Figure 1. Our approach estimates the shape of the human body in real-time. We take a generic template mesh (a), correct its pose and size (b) and deform it according to multi-view silhouettes to obtain an accurate model (c). Projective texturing is used for realistic rendering (d).

In this paper, we present a novel approach that allows adapting a generic model of the human body to multi-view images (see Fig. 1). We improve over existing methods by introducing a constraint based mesh deformation and propose a real-time capable solver based on Gauss-Seidel iterations. We start with a polygonal mesh of the human body and create nonlinear constraints that align vertices with image features but keep the overall mesh smooth. The key for real-time operation is to process each constraint individually, which allows for fast and stable estimation of the three-dimensional shape of the human body such that interactive applications become feasible. The main contributions in this paper are as follows:

• We derive constraints to deform a mesh such that it becomes consistent with multi-view silhouette contours and propose an automatic constraint weighting scheme.

• Our approach enables performing this deformation in real-time even for large meshes, which has not been possible before.

• We adapt the size of the mesh at runtime by changing the length of skeleton bones. This allows us to represent a wide range of people with different age, gender and size using only a single template mesh.

• We demonstrate a method to transform multi-view silhouette data to depth-maps which allows using real-time pose estimation methods such as [1] directly.

Section II reviews existing work in the field of human body shape adaption. In Section III, we present our shape adaption algorithm consisting of constraints and a real-time capable solver. In Section IV, we present how to use our algorithm for full human bodies in an interactive scenario. We evaluate our algorithm on several recorded sequences and provide quantitative measurements of both accuracy and speed in Section V. Finally, Section VI concludes the paper and gives an outlook for future work.

II. RELATED WORK

The idea of deforming a polygonal mesh such that it is consistent with images of the body silhouette is not new. In the current literature, several approaches can be found that make use of the Laplacian Mesh Editing (LME) framework [10]. The basic idea is to represent each vertex using delta coordinates, defined as the difference between the vertex position and the weighted sum of positions of neighboring vertices. Deformation of a mesh is then expressed as a sparse system of linear equations which allows modifying the position of selected vertices while using delta coordinates to enforce smooth deformations of the mesh.

The LME framework is used by Gall et al. [8] and Vlasic et al. [9], who transform a model of the human body to align its pose to the recorded person, and then align vertices of the model with silhouette contours in multi-view camera images. Aguiar et al. [4] propose a similar method for mesh deformation, but omit the explicit pose estimation step. Instead, they track the mesh over multiple frames based on silhouette and texture correspondences. While the previously mentioned approaches omit the skeletal structure during surface adaption, the authors of [5] present a method to jointly optimize for bones and surface. Most LME-based approaches use global least-squares optimization. This prohibits real-time operation since solving the linear system can be slow for reasonably sized meshes.

In Hofmann and Gavrila [11], an automatic pose and shape estimation method is presented that not only adapts a mesh to a single frame, but optimizes over a series of frames in order to obtain a stable body shape. A large database of human body scans makes it possible to build a statistical body model which guides shape deformation based on silhouette data [12], laser scans [13] or even single depth images [14]. Bottom-up methods create a new mesh from merged depth maps [15] or point clouds [16] and therefore do not require any previously known body scan. Straka et al. [2] present a method to capture a moving 3D human body without the use of an explicit model. They use image based rendering to create an interactive virtual mirror image of the user using multiple real cameras. However, it is not possible to obtain an explicit body shape using this method.

None of the previously mentioned approaches is able to estimate pose and shape of the human body at interactive frame rates. Recently, it was shown how to perform pose estimation in real-time [1], but mesh deformation still requires several seconds. Our method is closely related to [8] and [9] as we follow their two-stage approach with separate pose estimation and shape deformation. The major difference compared to previous methods is the solver used for optimizing the deformed shape. Our method is inspired by position based physics simulations [17] which are able to compute realistic interactions between soft bodies in real-time. The key to real-time operation is to apply decoupled constraints on individual vertices of a deformable mesh and optimize for a stable shape using an iterative method. We show that this decoupled optimization is suitable for mesh deformation guided by image space correspondences such that the final mesh resembles the content in the input images.

III. REAL-TIME 3D SHAPE ESTIMATION

In this section, we present our novel approach for real-time estimation of the shape of an object, which is represented by its silhouette in multiple images. The main idea is to iteratively deform a template mesh consisting of vertices and faces such that the projection of the mesh into the source images is identical to the silhouette of the object. For now, we assume that an initial mesh with the same topology is available in roughly the same pose as the object inside a calibrated multi-camera system. In Section IV, we show how to quickly initialize an articulated human body mesh such that it fulfills these requirements.

A. Constraint-based Mesh Deformation

We consider the problem of deforming a polygonal mesh M = {V, N, F} consisting of vertices V = {v_i ∈ R^3 | i = 1...V}, vertex normals N = {n_i ∈ R^3 | i = 1...V} and triangular faces F such that all vertices satisfy a set of constraints

C_j(V|Φ_j) = 0,   1 ≤ j ≤ M.   (1)

Each constraint is a function C_j : R^{3×V} → R with a set of parameters Φ_j that encodes a relationship between selected vertices and other vertices of M or the scene. For example, a constraint can be responsible for aligning the mesh with image data. We use the parameters Φ_j for storing constraint properties such as initial curvature or correspondences. Usually, these parameters are initialized before optimization. The vertex positions of the deformed mesh can be obtained by minimizing over all constraints:

Ṽ = argmin_V Σ_{j=1}^{M} k_j ‖C_j(V|Φ_j)‖   (2)

where k_j ∈ [0, 1] is a weighting term and ‖·‖ denotes the length of a vector. Note that such constraints need not be linear but only differentiable.

Inspired by the Gauss-Seidel algorithm for linear systems of equations [18], we do not minimize (2) as a whole.



Figure 2. Silhouette constraints pull rim-vertices towards the silhouette contour in every camera image.

Instead, we break it down into individual constraints and project each C_j onto the vertices independently. We use a first-order Taylor series expansion to find a position-correction term ΔV such that

C_j(V + ΔV) ≈ C_j(V) + ∇_V C_j(V) · ΔV = 0   (3)

where ∇_V C_j denotes the gradient of constraint j. Solving for ΔV yields the step for the iterative minimization

ΔV = − C_j(V) / ‖∇_V C_j(V)‖² · ∇_V C_j(V)   (4)

which is similar to the standard Newton-Raphson method. We use (4) to perform a weighted correction of the current vertex positions V ← V + k_j ΔV for every constraint C_j. Analogous to the Gauss-Seidel algorithm, we use updated values of V for subsequent calculations as soon as they are available. This requires less memory and allows the solution to converge faster while keeping time complexity linear in the number of constraints. By iterating the constraint projection multiple times, we allow the effect of constraints to propagate along the surface of the mesh until all vertices of the deformed mesh reach a stable position. A similar strategy can be found in real-time physics simulation, where internal and external forces of simulated objects are integrated using iterative constraint projection [17].
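To make the projection step concrete, the correction of Eq. (4) for a single constraint can be sketched in a few lines of NumPy; the callable interface for C_j and its gradient is our own illustration, not part of the paper:

```python
import numpy as np

def project_constraint(V, C, grad_C, k=1.0):
    """One projection step for a single constraint (Eqs. 3-4).

    V      : (n, 3) array of vertex positions, modified in place.
    C      : callable V -> float, the constraint value C_j(V).
    grad_C : callable V -> (n, 3) array, the gradient of C_j w.r.t. V.
    k      : weight k_j in [0, 1].
    """
    g = grad_C(V)
    denom = np.sum(g * g)          # squared gradient norm
    if denom < 1e-12:              # gradient vanished; nothing to project
        return V
    delta = -(C(V) / denom) * g    # Eq. (4)
    V += k * delta                 # weighted correction V <- V + k_j * dV
    return V
```

For a constraint whose gradient has unit norm (such as a point-distance constraint), a single step with k = 1 satisfies the constraint exactly, which is why the solver converges quickly when the initialization is close.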

B. Constraints

The presented algorithm is capable of handling nonlinear constraints of any type. We propose two specific types of constraints for the task of template based shape estimation. First, silhouette constraints C^sil align rim vertices of a template mesh with silhouette contours in the images. The second type of constraint, C^sm, is a smoothness constraint which acts as a regularization term. This allows (2) to be rewritten as

Ṽ = argmin_V Σ_{j=1}^{M_sil} k_j^sil ‖C_j^sil(V)‖ + k^sm Σ_{i=1}^{|V|} ‖C_i^sm(V)‖   (5)

with two distinct sets of constraints. We now describe these constraints in detail and show how to choose the weights k_j^sil automatically.


Figure 3. Calculation of delta coordinates using the 1-ring of neighboring vertices.

Silhouette Consistency: In order to achieve silhouette consistency, we apply a method related to [4], [8], [9] to align rim vertices of the mesh with the silhouette contour. Rim vertices lie on the contour of the mesh when projected onto a camera image I_c. In order to find rim vertices, we project vertices v_i of mesh M into all camera views using the corresponding 3×4 projection matrices P_c = K_c[R_c|t_c] and rotate the corresponding vertex normals n_i onto the image plane using the rotation matrix R_c ∈ R^{3×3}:

v_i^c = ( [P_c(1); P_c(2)] · [v_i; 1] ) / ( P_c(3) · [v_i; 1] ),   n_i^c = R_c · n_i   (6)

where P_c(r) denotes the r-th row of the projection matrix. We calculate vertex normals n_i as the normalized mean of the face normals adjacent to the vertex v_i. A rim vertex in image I_c is a vertex with a normal almost parallel to the image plane of camera c. For such vertices, we sample pixels from I_c along a 2D line l(t) = v_i^c + t · n_i^c(1:2) for intersections with the silhouette contour, where −τ ≤ t ≤ τ defines the search region in pixels. Note that it is important that only intersections with a contour gradient similar to the normal direction n_i^c are considered a match p_i^c ∈ R². Simply matching the closest contour pixel, as in [4], can lead to false matches, especially if the initialization of mesh M is inaccurate.
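The projection in Eq. (6) is a standard perspective division; a minimal NumPy sketch (the function name and array shapes are our own convention):

```python
import numpy as np

def project_vertex(P, R, v, n):
    """Project a vertex into camera c and rotate its normal (Eq. 6).

    P : (3, 4) projection matrix P_c = K_c[R_c|t_c]
    R : (3, 3) camera rotation R_c
    v : (3,) vertex position, n : (3,) unit vertex normal
    Returns the 2D image point v_i^c and the rotated normal n_i^c.
    """
    vh = np.append(v, 1.0)            # homogeneous coordinates [v_i; 1]
    v_c = (P[:2] @ vh) / (P[2] @ vh)  # rows 1-2 divided by row 3
    n_c = R @ n
    return v_c, n_c
```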

Each successfully matched rim-vertex/contour pair (v_i^c, p_i^c) yields a 2D correspondence in image space. We translate this correspondence into a constraint C_j^sil which enforces that vertex v_i is pulled towards the viewing ray R_j, which is a 3D line from the projection center of camera c through the contour pixel p_i^c:

C_j^sil(V|R_j, i) = d_pl(R_j, v_i) = 0   (7)

where d_pl denotes the shortest Euclidean distance between a point and a line in 3D. In Fig. 2, we visualize the effect of silhouette constraints.
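The distance d_pl in Eq. (7) is the usual point-to-line distance; a small illustrative helper, assuming the viewing ray is given by an origin and a direction (our own parameterization):

```python
import numpy as np

def point_line_distance(o, d, v):
    """Shortest Euclidean distance between point v and the 3D line o + t*d.

    o : (3,) a point on the line (e.g. the camera projection center).
    d : (3,) line direction (normalized internally).
    v : (3,) query point (a mesh vertex).
    """
    d = d / np.linalg.norm(d)
    w = v - o
    # subtract the component of w along the line; the rest is the offset
    return np.linalg.norm(w - (w @ d) * d)
```

Projecting the constraint of Eq. (7) with the step of Eq. (4) moves v_i perpendicularly onto the ray, which is exactly the behavior shown in Fig. 2.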

Mesh Smoothing: Smoothness constraints are based on delta coordinates δ_i ∈ R³, which are calculated as

δ_i = Σ_{j∈N(i)} w_ij (v_i − v_j)   (8)

where N(i) denotes the 1-ring of neighboring vertices of v_i (see Fig. 3). Each weight w_ij is calculated using the cotangent weighting scheme [10] with Σ_j w_ij = 1 ∀i. For each vertex v_i, we define a smoothness constraint C_i^sm that ensures that the delta coordinate δ_i of vertex v_i stays close to its initial value, which is computed from the undeformed mesh M using (8):

C_i^sm(V|δ_i) = ‖ Σ_{j∈N(i)} w_ij (v_i − v_j) − δ_i ‖_2 = 0.   (9)
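Eq. (8) can be sketched as follows; for brevity this illustration uses uniform 1-ring weights rather than the cotangent scheme of [10], and the adjacency-list representation is our own assumption:

```python
import numpy as np

def delta_coordinates(V, neighbors, weights=None):
    """Delta coordinate of every vertex (Eq. 8).

    V         : (n, 3) vertex positions.
    neighbors : neighbors[i] lists the 1-ring of vertex i.
    weights   : weights[i] are the corresponding w_ij, summing to 1 per
                vertex; defaults to uniform weights (the paper uses the
                cotangent scheme).
    """
    delta = np.zeros_like(V)
    for i, ring in enumerate(neighbors):
        w = weights[i] if weights is not None else [1.0 / len(ring)] * len(ring)
        for wij, j in zip(w, ring):
            delta[i] += wij * (V[i] - V[j])
    return delta
```

On a flat, evenly spaced region the delta coordinates vanish; they are large where the surface bends, which is why keeping them near their initial values preserves local shape during deformation.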

Automatic Constraint Weighting: Each silhouette constraint C_j^sil is weighted using a scalar k_j^sil. We propose a weighting scheme that takes into consideration the quality of silhouette contour matches and adapts the influence of constraints automatically. When a vertex is far away from a silhouette contour, there is a large uncertainty about which contour pixel it should correspond to. In this case, we put more trust in the smoothness term. In contrast, when the distance between a projected vertex and the silhouette contour is small, we consider this a good match and keep the vertex close to the corresponding viewing ray R_j.

We encode this uncertainty into the silhouette constraint weights by applying an unnormalized Gaussian kernel to the initial Euclidean pixel distance between the projected vertex v_i^c and the matched contour pixel p_i^c of C_j^sil:

k_j^sil = exp( −‖v_i^c − p_i^c‖² / (2 · α²) ).   (10)

Therefore, good matches give the corresponding constraint C_j^sil(V) a weight close to 1, while an increasing distance leads to smaller weights (α > 0 controls the width of the Gaussian lobe). All smoothness constraints C_i^sm(V) are equally weighted with k^sm = 1 throughout this paper.
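The weighting of Eq. (10) is a one-liner; the sketch below assumes 2D pixel coordinates and uses the paper's α = 40 px as a default:

```python
import numpy as np

def silhouette_weight(v_c, p_c, alpha=40.0):
    """Weight k_j^sil for one silhouette match (Eq. 10).

    v_c, p_c : projected vertex and matched contour pixel (2D, in pixels).
    alpha    : width of the Gaussian lobe in pixels (40 px in the paper).
    """
    d2 = np.sum((np.asarray(v_c, float) - np.asarray(p_c, float)) ** 2)
    return np.exp(-d2 / (2.0 * alpha ** 2))
```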

We perform multiple iterations of finding correspondences and deforming the mesh according to the resulting constraints. By using the proposed weighting scheme, rim vertices that are already close to the silhouette contour are kept close to their optimal position. Distant matches are initially affected more by the smoothness constraints, but eventually gain higher weights as they get aligned with the silhouette contour during optimization.

C. The Iterative Solver

In Laplacian Mesh Editing (LME), (8) is used as a regularization term and a few selected control vertices guide the shape deformation. Even for a large number of vertices, the deformed mesh can be computed efficiently when the set of control vertices does not change. In this case, the optimal solution to a linear system of equations can be precomputed via Cholesky decomposition once, and a deformed mesh can be obtained through simple back substitution multiple times when the positions of control vertices change [10].

However, the set of control vertices changes continuously when deforming a mesh using iteratively updated image correspondences. Thus, no pre-computations are possible and the optimization has to be performed from scratch every time. In contrast to LME, there are hundreds of control vertices in shape deformation, which are often applied to neighboring vertices. In addition, it is usually possible to obtain initial vertex positions close to the optimal deformation when adapting a mesh to images. Therefore, we argue that an iterative solver is suitable to optimize (5). By using nonlinear constraints and an update step weighting that is similar to [17], we achieve high quality deformation results. In Section V, we show that our solver requires fewer iterations than the iterative Conjugate Gradient method [18] with linear constraints only.

Algorithm 1 Constraint projection algorithm.
Require: V = {v_1 ... v_|V|}
1: {Φ_1 ... Φ_M} ← initialize(V)
2: for number of outer iterations N_o do
3:   {Φ_1 ... Φ_M, k_1 ... k_M} ← update(V, Φ_1 ... Φ_M)
4:   for number of inner iterations N_i do
5:     for j = 1 ... M do
6:       V ← V − k_j′ · C_j(V|Φ_j) / ‖∇_V C_j(V|Φ_j)‖² · ∇_V C_j(V|Φ_j)
7:     end for
8:   end for
9: end for

Our iterative solver for initializing and updating the constraint parameters Φ_1 ... Φ_M and projecting the constraints C_1 ... C_M is outlined in Algorithm 1. In Line 1, we set up all constraints using the initial vertex position estimates (i.e. we calculate δ_i). The solver contains two loops: the outer loop (Line 2) is entered N_o times and controls how often constraint parameters are updated (i.e. the matching of rim-vertices with the silhouette contour), while the inner loop in Line 4 projects the constraints. Since constraints are projected independently of each other, the number of inner iterations N_i influences how far the effect of each constraint can propagate along the surface of the mesh. We do not multiply correction steps by k_j directly, but use a modified weight k_j′ = 1 − (1 − k_j)^{1/N_i} which allows projecting constraints with linear dependence on N_i [17].
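Algorithm 1 can be sketched in Python as follows. The constraint interface (update/value/grad methods and a weight attribute k) is a hypothetical illustration of ours; the loop structure and the modified weight k_j′ follow the algorithm:

```python
import numpy as np

def solve(V, constraints, n_outer=8, n_inner=8):
    """Gauss-Seidel-style constraint projection (Algorithm 1, sketched).

    V           : (n, 3) vertex positions, updated in place.
    constraints : objects with (our own illustrative interface)
                    update(V)  -- refresh parameters Phi_j and weight k,
                    value(V)   -- C_j(V),
                    grad(V)    -- gradient of C_j w.r.t. V, shape (n, 3).
    """
    for _ in range(n_outer):                 # re-match correspondences (Line 3)
        for c in constraints:
            c.update(V)
        for _ in range(n_inner):             # propagate along the surface
            for c in constraints:            # sequential: Gauss-Seidel style
                k_mod = 1.0 - (1.0 - c.k) ** (1.0 / n_inner)  # modified k_j'
                g = c.grad(V)
                denom = np.sum(g * g)
                if denom > 1e-12:
                    V -= k_mod * (c.value(V) / denom) * g     # Line 6, Eq. (4)
    return V
```

Because each projection reads the vertex positions already updated by the previous constraint, the loop cannot be parallelized directly; a Jacobi-style variant (all updates computed from the same V) can, at the cost of more inner iterations, as discussed below.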

The constraint projection in Line 6 prohibits parallelization because each calculation depends on the updated values V of the previous projection. When a parallel processing architecture such as a GPU is available, it is possible to calculate the update step ΔV from the same vertex positions V for all constraints in parallel. However, the number of inner iterations N_i needs to be increased since the convergence rate is slower compared to the Gauss-Seidel type solver. Vertex positions can be updated in parallel as well, but it has to be ensured that a vertex is not updated by multiple constraints at the same time.

IV. ESTIMATING THE HUMAN BODY SHAPE

One application of shape estimation is to deform a template mesh such that it fits the shape of a human body recorded by a synchronized multi-camera system. In this section, we show how to initialize a generic model such that we can apply our constraints and solver. We first estimate the 3D pose of the human body from multi-view camera images. Then, we transform the mesh such that it has roughly the same body dimensions and posture. Finally, we deform the mesh until it best fits the image data.

Figure 4. (a) Aligning limbs using local rotation and length transformations. (b) The SCAPE mesh [19] in its default pose and its skeleton.

A. Pose Estimation

The availability of an affordable depth sensor (Kinect) has led to major improvements in real-time pose estimation. Shotton et al. [1] show how to translate pose estimation into a depth-map labeling problem which can efficiently be solved using randomized decision forests in real-time. The output of such an algorithm is a set of joint positions g_k ∈ R³ which belong to a skeleton with K joints. Our algorithm can be initialized from such joint positions. For each joint k, we determine a homogeneous transformation matrix T_k ∈ R^{4×4} that allows us to transform our template mesh such that it has a pose similar to the user. We calculate T_k directly from g_k as a global transformation T^G and local limb transformations T_k^L:

T_k = ( ∏_{j=1}^{|c_k|} T_{c_k(j)}^L ) · T^G   (11)

where c_k is the mapping that represents the order of joints along the kinematic chain from the root node to joint k. The global transformation aligns the upper body of the skeleton by means of rotation, scale and translation. Each local limb transformation T_k^L rotates and scales the bone between joint k and its parent joint such that it is aligned with g_k. In Fig. 4a we demonstrate this alignment process, which automatically adapts the template skeleton to the actual size of the body.
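A sketch of the composition in Eq. (11); the dictionary-based representation of the local transforms and the exact stacking order of the matrix product are our own assumptions (the global transform is applied first, then local limb transforms from the root outward):

```python
import numpy as np

def joint_transform(chain, T_local, T_global):
    """Compose the transform for one joint along its kinematic chain (Eq. 11).

    chain    : joint indices from the root node to joint k (c_k in the paper).
    T_local  : dict joint index -> (4, 4) local limb transform T^L.
    T_global : (4, 4) global alignment T^G.
    """
    T = T_global.copy()
    # left-multiply local transforms so that T^G acts first on a vertex
    for j in reversed(chain):
        T = T_local[j] @ T
    return T
```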

B. The Articulated Body Model

Template based shape estimation requires a mesh M_0 of the human body. To handle arbitrary poses, the model must support deformation by an underlying articulated skeleton.

In this work, we use the static SCAPE mesh model [19] in its default pose as shown in Fig. 4b. Any other watertight mesh is suitable for this purpose as well. The skeleton with K joints is embedded into the mesh and linear skinning weights ρ_{i,k} are calculated using a rigging algorithm [20], which links each vertex to one or multiple joints. Linear blend skinning is used to transform the mesh M_0 into the mesh M with the current pose of the user:

v_i = Σ_{k=1}^{K} ρ_{i,k} · T_k · [v_i^0; 1]   (12)

where the vertex positions v_i are obtained as a linear combination of the template vertex positions v_i^0 transformed by the weighted joint transformations T_k.
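Linear blend skinning as in Eq. (12) can be sketched with NumPy; the array shapes are our own convention:

```python
import numpy as np

def blend_skin(V0, T, rho):
    """Linear blend skinning (Eq. 12).

    V0  : (n, 3) template vertex positions v_i^0.
    T   : (K, 4, 4) homogeneous joint transformation matrices T_k.
    rho : (n, K) skinning weights rho_{i,k}, each row summing to 1.
    Returns the posed vertex positions v_i.
    """
    n = V0.shape[0]
    Vh = np.hstack([V0, np.ones((n, 1))])        # homogeneous [v_i^0; 1]
    # blended transform per vertex: sum_k rho_{i,k} * T_k
    T_blend = np.einsum('ik,kab->iab', rho, T)
    V = np.einsum('iab,ib->ia', T_blend, Vh)[:, :3]
    return V
```

A vertex weighted half-and-half between a static joint and a translated joint ends up halfway along the translation, which is the characteristic (and occasionally artifact-prone) behavior of linear blend skinning.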

C. Shape Estimation

We use the transformed mesh M for the initialization of both vertex positions and constraints in our shape estimation method. We no longer consider the underlying bone structure when deforming the mesh, since we have often observed a non-negligible offset between the real joint position and the estimate given by the skeleton tracker. Usually, our shape estimation algorithm corrects such offsets without visible artifacts.

V. EXPERIMENTS

We evaluate our approach on multiple video sequences of moving persons, either recorded with our own multi-camera setup or simulated through rendering of artificial data. Besides visual quality, we evaluate our algorithm in terms of reconstruction quality and run-time, and compare it to related approaches.

Specifically, we compare the mesh adapted with our method to the output of related methods based on linear Laplacian Mesh Editing (LME) such as [8], [9]. We set up the linear systems of equations for LME mesh deformation using the same template mesh and rim-vertex/contour correspondences as used with our approach. As suggested in [10], we solve for optimal vertex positions in a least squares sense using a sparse Cholesky decomposition. In addition, we compare our method to the iterative conjugate gradient algorithm [18], which is an alternative for solving least squares linear equations.

A. Experimental Setup

Our recording hardware consists of a studio environment with ten synchronized cameras connected to a single computer [2]. Each camera delivers a color image with 640×480 pixels at 15 frames per second. Silhouettes of the user are obtained through color-based background segmentation. Based on the image resolution, we set the search region for silhouette contour matches to τ = 30 pixels and use α = 40 pixels for the calculation of the weights k_j^sil.



Figure 5. Convergence quality after 2 and 8 solver iterations of constraint based deformation (a) and least squares conjugate gradient (b). The mesh obtained by solving the linear system using a Cholesky decomposition is shown in (c).

For estimating the human body pose, any algorithm that computes skeleton joint positions in real-time is suitable. For example, Straka et al. [21] compute the skeleton pose directly from silhouette images and Shotton et al. [1] use depth maps as input. We use the OpenNI framework [22] which includes a real-time pose estimation module similar to [1]. Instead of using a Kinect camera, which would require additional calibration and synchronization with our multi-view system, we generate a volumetric 3D model [2] and render a depth map from a virtual viewpoint. Note that [22] only supports typical Kinect poses; therefore our implementation would benefit from more advanced real-time pose estimation systems such as [1], [23], which are unfortunately not publicly available.

B. Visual Quality

Our solver and the conjugate gradient method require multiple iterations until a satisfying mesh deformation is obtained. In Fig. 5, we compare the quality of the resulting mesh (2,500 vertices) after two and eight solver iterations N_i while keeping the rim-vertex/contour matches constant. The constraint based approach produces smooth results after only two iterations, while the conjugate gradient solver yields a noisy mesh. After eight iterations both approaches yield similar results, which are comparable to the mesh obtained by solving the LME system via Cholesky decomposition. The reason for the fast convergence of our algorithm is that we use nonlinear constraints and that the step size is automatically tuned according to the number of iterations. For high quality results, we iterate between contour matching and mesh deformation in an iterative closest point fashion. In Fig. 6, we analyze how many iterations are needed until the contour correspondences stabilize (at 100%, all vertices have converged to a stable position). We use N_o = 8 iterations as a good trade-off between quality and speed in the following experiments.

In Fig. 7a, we analyze the distribution of the remaining error by rendering silhouettes of an artificial human body. To this end, we render a known human mesh from virtual cameras that mimic our real camera setup. After applying

[Figure 6 plot: contour-match convergence (%) over the number of iterations, for meshes with 6,000 and 12,000 vertices.]

Figure 6. Evaluation of the number of contour matching iterations. The dotted line represents the value N_o = 8 used in this paper.

Figure 7. (a) Deformation error measured via the Hausdorff distance. (b) Mesh overlaid on captured images.

our mesh deformation algorithm, we can determine the offset between deformed vertex positions and ground-truth data using the Hausdorff distance. The error stays below 10 mm for the majority of the body surface. In concave areas such as the crotch region, the error is higher since these regions are not visible in silhouette images. Fig. 7b shows a wire-frame representation of the deformed mesh, overlaid on recorded camera images.
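As a minimal sketch of this kind of evaluation, the symmetric Hausdorff distance between two vertex sets can be computed with a brute-force pairwise comparison. The function name and array layout are assumptions for illustration; for large meshes a spatial index such as a k-d tree would be preferable to the O(N*M) distance matrix used here.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A (N x d) and B (M x d)."""
    # Full pairwise distance matrix: D[i, j] = ||A[i] - B[j]||
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    d_ab = D.min(axis=1).max()  # farthest A-point from its nearest B-point
    d_ba = D.min(axis=0).max()  # farthest B-point from its nearest A-point
    return max(d_ab, d_ba)
```

Note that the symmetric variant also penalizes ground-truth surface regions that the deformed mesh fails to cover, not only stray deformed vertices.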

Related methods often present the deformation of a subject-specific laser scan, which includes details such as the face and wrinkles of clothes [4], [8]. In contrast, we deform the same template mesh to multi-view silhouette images of a variety of people (see Fig. 8). This means that the mesh will only adapt to details that are visible in silhouette contours. However, we can recover additional details in a rendering stage through projective texturing (see Fig. 1d). The advantage of using a generic mesh is that we can estimate the body shape of previously unknown people without additional 3D scanning. Note that the quality of the feet in our results is comparatively low, as the majority of our cameras are pointed towards the upper body.

C. Runtime Performance

We analyze the runtime performance of constraint-based mesh deformation on a single-threaded 3 GHz processor. In addition, we show that our approach can take advantage of current GPU architectures such as the NVIDIA GTX 480 by processing all constraints in parallel. For runtime measurements, we perform N_o = 8 iterations of contour


Figure 8. A single template mesh can be deformed to people of different size and gender. The color images are background-segmented camera images and the mesh is rendered from a similar viewing angle.

[Figure 9 plot: time per frame (s, log scale) over the number of vertices (0 to 12,000) for GPU Constraint, CPU Constraint, Cholesky, and Conjugate Gradient solvers.]

Figure 9. Comparison of the runtime of different optimization methods with increasing number of vertices.

Table I
COMPARISON OF THE TIME REQUIRED TO DEFORM A HUMAN MESH TO MULTI-CAMERA DATA IN SECONDS PER FRAME.

Authors                Model              Vertices   Time
Aguiar et al. [4]      Scan               2 K        27 s
Cagniart et al. [7]    Scan/Visual Hull   10 K       25 s
Hofmann & Gavrila [11] Parametric         N/A        15 s
Vlasic et al. [9]      Scan               10 K       4.8 s
Gall et al. [8]        Scan               N/A        1.7 s
This work (CPU)        SCAPE              12 K       0.15 s
This work (GPU)        SCAPE              12 K       0.02 s

matching and use N_i = 8 solver iterations for our method and for conjugate gradients.

In Fig. 9, we analyze the time required for the deformation of a mesh at different resolutions and compare the runtime to standard linear solvers. Note that we exclude the time for matching rim-vertices with silhouette contours, which is the same for all methods. Our method (GPU/CPU Constraint) clearly outperforms both linear solvers by a factor of about 20 in the sequential implementation and is more than 100 times faster when executed on a GPU. The bottleneck of linear solvers lies in time-consuming matrix decompositions or matrix-vector products.

The complete pipeline for mesh deformation includes human pose estimation and mesh initialization. The implementation of our approach is able to adapt a mesh with 12,000 vertices to multi-view silhouette data within 150 ms on a single CPU (or only 20 ms on a GPU). This allows for mesh deformation at the frame rate of our camera setup. Obviously, we can decrease this processing time even further when the number of vertices is reduced. Especially when texture is applied to the mesh, a few thousand vertices are sufficient for a realistic display. In Table I, we compare the runtime of our approach with existing methods. It is neither possible to compare these methods directly nor fair to compare run-times on different platforms. However, this paper presents the first method that eliminates the performance bottleneck of the solver. So far, only our system is able to achieve interactive frame rates when adapting the shape of a human body model to image data.

D. Limitations

The current implementation relies on a fairly accurate initialization of the skeleton joints. Small displacements of joints can be handled without loss of quality since the mesh automatically gets pulled towards the silhouette contour. However, if the displacement is too large or completely wrong, the search for silhouette contours will fail and no silhouette constraints can be generated for the affected vertices. Our approach cannot adapt the body shape if the user wears substantially different clothing than the template mesh (e.g., a skirt). In this case, a specialized template with similar clothing is needed.

VI. CONCLUSIONS

We have presented a novel method which allows us to automatically estimate the shape of the human body from multi-view images in real-time. This is achieved by deforming a generic template mesh such that rim-vertices are aligned with silhouette contours in all input images. In contrast to existing approaches, we optimize the mesh using an iterative solver which allows integrating nonlinear constraints. We have shown that the execution time of our solver outperforms previous work by a factor of 20 or more while we maintain a comparable visual quality of the deformed mesh. Thus, we are able to estimate the pose and shape of a human body in an interactive environment. This opens up the possibility for a variety of applications including live 3D video transmission and augmented reality


applications where the user can control his own personal avatar.

Related work preferably shows the adapted body surface using subject-specific laser scans [4], [8]. We have demonstrated that our constraints are sufficient to deform a generic mesh [19] to fit a variety of persons as long as they wear tight-fitting clothing. This makes our method particularly suited for multi-user environments where no person-specific template mesh is available or where building such a model is the desired task.

In this paper, we have focused on mesh deformation based on silhouettes. However, our method is capable of adapting a mesh to different input data as well. For example, it is possible to create constraints that deform a mesh to fit oriented point clouds [16] or depth maps [14]. Recently, it has been shown how to jointly optimize the mesh surface and the underlying skeleton in a linear way [5], which is compatible with our constraint definitions. Therefore, future work will focus on including such skeleton constraints in our algorithm to make the deformation process even more robust.

ACKNOWLEDGMENT

This work was supported by the Austrian Research Promotion Agency (FFG) under the BRIDGE program, project #822702 (NARKISSOS). Furthermore, we would like to thank the reviewers for their valuable comments and suggestions. We also want to thank everyone who spent their time participating in the evaluation of this work.

REFERENCES

[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Proc. of CVPR, 2011.

[2] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof, "A free-viewpoint virtual mirror with marker-less user interaction," in Proc. of SCIA 2011, LNCS 6688, A. Heyden and F. Kahl, Eds., 2011, pp. 635–645.

[3] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Proc. of IEEE ISMAR, 2011.

[4] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun, "Performance capture from sparse multi-view video," ACM Transactions on Graphics, vol. 27, no. 3, 2008.

[5] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof, "Simultaneous shape and pose adaption of articulated models using linear optimization," in Proc. of ECCV 2012, Part I, LNCS 7572, 2012, pp. 724–737.

[6] L. Ballan and G. M. Cortelazzo, "Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes," in Proc. of 3DPVT, 2008.

[7] C. Cagniart, E. Boyer, and S. Ilic, "Probabilistic deformable surface tracking from multiple videos," in Proc. of ECCV 2010, Part IV, LNCS 6314, 2010, pp. 326–339.

[8] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel, "Motion capture using joint skeleton tracking and surface estimation," in Proc. of CVPR, 2009.

[9] D. Vlasic, I. Baran, W. Matusik, and J. Popović, "Articulated mesh animation from multi-view silhouettes," ACM Transactions on Graphics, vol. 27, no. 3, 2008.

[10] M. Botsch and O. Sorkine, "On linear variational surface deformation methods," IEEE Trans. on Visualization and Computer Graphics, vol. 14, no. 1, pp. 213–230, 2008.

[11] M. Hofmann and D. M. Gavrila, "3D human model adaptation by frame selection and shape-texture optimization," Computer Vision and Image Understanding, vol. 115, no. 11, pp. 1559–1570, 2011.

[12] A. Kanaujia, N. Haering, G. Taylor, and C. Bregler, "3D human pose and shape estimation from multi-view imagery," in Proc. of CVPR Workshops, 2011.

[13] N. Hasler, C. Stoll, B. Rosenhahn, T. Thormählen, and H.-P. Seidel, "Estimating body shape of dressed humans," Computers & Graphics, vol. 33, no. 3, pp. 211–216, 2009.

[14] A. Weiss, D. Hirshberg, and M. J. Black, "Home 3D body scans from noisy image and range data," in Proc. of ICCV, 2011, pp. 1951–1958.

[15] K. Li, Q. Dai, and W. Xu, "Markerless shape and motion capture from multiview video sequences," IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 3, pp. 320–334, 2011.

[16] Y. Furukawa and J. Ponce, "Dense 3D motion capture from synchronized video streams," in Proc. of CVPR, 2008.

[17] M. Müller, B. Heidelberger, M. Hennix, and J. Ratcliff, "Position based dynamics," Journal of Visual Communication and Image Representation, vol. 18, no. 2, pp. 109–118, 2007.

[18] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. SIAM, 1994.

[19] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, "SCAPE: shape completion and animation of people," in Proc. of ACM SIGGRAPH, 2005.

[20] I. Baran and J. Popović, "Automatic rigging and animation of 3D characters," in Proc. of ACM SIGGRAPH, 2007.

[21] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof, "Skeletal graph based human pose estimation in real-time," in Proc. of BMVC, J. Hoey, S. McKenna, and E. Trucco, Eds., 2011.

[22] (2012) OpenNI. [Online]. Available: http://www.openni.org/

[23] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt, "Fast articulated motion tracking using a sums of Gaussians body model," in Proc. of ICCV, 2011, pp. 951–958.