110922_ real-time human pose recognition in parts from single depth images.pptx

Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton Andrew Fitzgibbon Mat Cook

Toby Sharp Mark Finocchi Richard Moore

Alex Kipman Andrew Blake

Microsoft Research Cambridge & Xbox Incubation

CVPR 2011 Best Paper

OUTLINE

• Introduction
• Data
• Body Part Inference and Joint Proposals
• Experiments
• Discussion

Introduction

• Robust interactive human body tracking
– gaming, human-computer interaction, security, telepresence, health care
• Real-time depth cameras
– existing systems track from frame to frame but struggle to re-initialize quickly, and so are not robust
– our focus: per-frame initialization, combined with a tracking algorithm
• Focus on pose recognition in parts
– 3D position candidates for each skeletal joint

Introduction

• Appropriate tracking algorithms
– Tracking people with twists and exponential maps (CVPR 1998)
– Tracking loose limbed people (CVPR 2004)
– Nonlinear body pose estimation from depth images (DAGM 2005)
– Real-time hand-tracking with a color glove (ACM 2009)
– Real time motion capture using a single time-of-flight camera (CVPR 2010)

Introduction

• Inspired by recent object recognition work that divides objects into parts
– Object class recognition by unsupervised scale-invariant learning [CVPR 2003]
– The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006]
• Two key design goals
– computational efficiency
– robustness

Introduction

Pipeline figure: depth image → (segment) → dense probabilistic body part labeling, spatially localized near skeletal joints → (generate) → 3D joint proposals

Introduction

• We treat the segmentation into body parts as a per-pixel classification task
– each pixel is evaluated separately
• Training data
– generate realistic synthetic depth images
– train a deep randomized decision forest classifier on them, avoiding overfitting by using a very large training set

Introduction

• Overfitting
• Simple, discriminative depth comparison image features
– maintain high computational efficiency

Introduction

• For further speed, the classifier can be run in parallel on each pixel on a GPU
• Mean shift finds the spatial modes of the inferred per-pixel distributions, resulting in the 3D joint proposals

What is Mean Shift?

Non-parametric Density Estimation

Non-parametric Density GRADIENT Estimation

(Mean Shift)

Data

Discrete PDF Representation

PDF Analysis

PDF in feature space
• Color space
• Scale space
• Actually any feature space you can conceive
• …

A tool for: finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in $R^N$

Intuitive Description

Distribution of identical billiard balls

Region of interest

Center of mass

Mean Shift vector

Objective: Find the densest region
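The billiard-ball picture above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a flat (uniform) window of fixed `radius`, and the hypothetical names `mean_shift_mode`, `points` and `start`, none of which come from the slides.

```python
import math

def mean_shift_mode(points, start, radius, tol=1e-6, max_iter=100):
    """Slide a window of the given radius until it settles on a dense region."""
    center = tuple(start)
    for _ in range(max_iter):
        # Region of interest: all samples within `radius` of the current center.
        inside = [p for p in points if math.dist(p, center) <= radius]
        if not inside:
            break  # empty window: there is nothing to move towards
        # Center of mass of the samples inside the window.
        mass = tuple(sum(coords) / len(inside) for coords in zip(*inside))
        # Mean shift vector: from the current center to the center of mass.
        shift = math.dist(mass, center)
        center = mass
        if shift < tol:
            break
    return center

# Toy data (assumed): four samples packed together plus one far-away sample.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
mode = mean_shift_mode(points, start=(0.5, 0.5), radius=1.0)
```

With this toy data the window moves from (0.5, 0.5) to roughly (0.05, 0.05), the center of the dense cluster, and ignores the isolated sample.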

Main contribution

• Treat pose estimation as object recognition
– using a novel intermediate body parts representation
– designed to spatially localize joints
– at low computational cost and high accuracy

Experiments

• (i) synthetic depth training data is an excellent proxy for real data
• (ii) scaling up the learning problem with varied synthetic data is important for high accuracy
• (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor

Data

• Depth imaging and motion capture data
• Pose estimation research
– has often focused on techniques to overcome
– a lack of training data
• Two problems in generating training images
– color (and texture) variability
– pose variability
• Use real mocap data
– retargeted to a variety of base character models
– to synthesize a large, varied dataset
– 640x480 image at 30 frames per second
• Advantages of depth cameras over traditional intensity sensors
– work in low light levels
– give a calibrated scale estimate
– resolve silhouette ambiguities in pose

Depth image

• Capture a large database of motion capture (mocap) of human actions
– approximately 500k frames
– driving, dancing, kicking, running, navigating menus
• Need not record mocap with variation in
– rotation about the vertical axis, mirroring left-right, scene position, body shape and size, camera pose
– all of which can be added in (semi-)automatically
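Two of the variations listed above, rotation about the vertical axis and left-right mirroring, can be sketched as follows. This is an illustration only, not the authors' pipeline: the pose format (a dict of joint name to `(x, y, z)` in meters), the choice of y as the vertical axis, and the `left_` / `right_` naming are all assumptions.

```python
import math

def rotate_about_vertical(pose, angle_rad):
    """Rotate every joint about the vertical (y) axis by `angle_rad`."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return {name: (c * x + s * z, y, -s * x + c * z)
            for name, (x, y, z) in pose.items()}

def mirror_left_right(pose):
    """Reflect across the x = 0 plane and swap the left_/right_ joint names."""
    def swapped(name):
        if name.startswith("left_"):
            return "right_" + name[len("left_"):]
        if name.startswith("right_"):
            return "left_" + name[len("right_"):]
        return name
    return {swapped(name): (-x, y, z) for name, (x, y, z) in pose.items()}

# Hypothetical three-joint pose, for illustration only.
pose = {"head": (0.0, 1.7, 0.0),
        "left_hand": (0.5, 1.0, 0.0),
        "right_hand": (-0.4, 1.2, 0.1)}
mirrored = mirror_left_right(pose)
turned = rotate_about_vertical(pose, math.pi / 2)
```

Mirroring has to swap the joint names as well as negate x; otherwise a mirrored left hand would still be labeled as a left hand.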

Motion capture data

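• The classifier uses no temporal information
– static poses
– not motion
• Changes in pose from one frame to the next are often so small as to be insignificant
– discard similar poses using a ‘furthest neighbor’ clustering algorithm
– where the distance between poses $p_1$ and $p_2$ is $\max_j \lVert p_1^j - p_2^j \rVert_2$
– $j$ indexes the body joints, $p_i$ is the $i$-th pose
– keep a subset in which no two poses are closer than 5 cm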
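The pose distance and the 5 cm rule can be sketched as below. The greedy filter is a simplified stand-in for the ‘furthest neighbor’ clustering named on the slide, not the authors' procedure; the pose format (a list of 3D joint tuples in meters) and the function names are assumptions.

```python
import math

def pose_distance(p1, p2):
    """Maximum Euclidean distance over corresponding joints of two poses."""
    return max(math.dist(a, b) for a, b in zip(p1, p2))

def subsample_poses(poses, min_dist=0.05):
    """Greedily keep poses that are at least `min_dist` meters from every kept pose."""
    kept = []
    for pose in poses:
        if all(pose_distance(pose, k) >= min_dist for k in kept):
            kept.append(pose)
    return kept

# Hypothetical two-joint poses, for illustration only.
poses = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.01, 0.0, 0.0), (0.0, 1.0, 0.0)],  # 1 cm from the first pose: dropped
    [(0.0, 0.0, 0.0), (0.2, 1.0, 0.0)],   # 20 cm from the first pose: kept
]
subset = subsample_poses(poses)
```

Taking the maximum over joints means that a pose counts as different as soon as any single joint has moved far enough.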

Motion capture data

• It was necessary to iterate the process of
– motion capture
– sampling from our model
– training the classifier
– testing joint prediction accuracy
• CMU mocap database

Motion capture data

• Build a randomized rendering pipeline
– sample fully labeled training images from it
• Goals
– realism and variety

Generating synthetic data


• First: randomly sample a set of parameters
• Then: use standard computer graphics techniques
– render depth and body part images
– from texture-mapped 3D meshes
• Use Autodesk MotionBuilder
– slight random variation in height and weight gives extra coverage of body shapes
– other parameters

Body Part Inference and Joint Proposals

• Body part labeling
• Depth image features
• Randomized decision forests
• Joint position proposals

Body part labeling

• Intermediate body part representation
– shown color-coded
– some parts directly localize particular skeletal joints
– others fill the gaps
• Transforms the problem into one that can readily be solved by efficient classification algorithms

Body part labeling

• The parts are specified in a texture map

Body part labeling

• 31 body parts:
– LU/RU/LW/RW head, neck
– L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand
– LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot
– (L = Left, R = Right, U = Upper, W = loWer)

Depth image features

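• Feature: $f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$
• $d_I(x)$ is the depth at pixel $x$ in image $I$
• $\theta = (u, v)$ describes the offsets $u$ and $v$
• the normalization by $1/d_I(x)$ ensures the features are depth invariant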


• Individually these features provide only a weak signal
• Combined in a decision forest, they are sufficient to accurately disambiguate all trained parts


• The design of these features was strongly motivated by their computational efficiency
– no preprocessing is needed
– each feature reads at most 3 image pixels
– and performs at most 5 arithmetic operations
– straightforward to implement on the GPU
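A minimal CPU sketch of one depth comparison feature, f_theta(I, x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x)), may make the pixel and operation counts concrete. The depth image as a nested list of meters, the rounding of probe positions to the nearest pixel, and the value of `BACKGROUND_DEPTH` are assumptions for illustration.

```python
BACKGROUND_DEPTH = 1e6  # assumed large constant for background / off-image probes

def depth_at(depth, x, y):
    """Depth in meters at integer pixel (x, y); a large constant when off-image."""
    if 0 <= y < len(depth) and 0 <= x < len(depth[0]):
        return depth[y][x]
    return BACKGROUND_DEPTH

def depth_feature(depth, x, y, u, v):
    """Difference of two depth probes whose offsets are scaled by 1 / depth."""
    d = depth_at(depth, x, y)  # pixel read 1
    # Dividing the offsets by d makes them shrink for far-away bodies.
    ux, uy = x + round(u[0] / d), y + round(u[1] / d)
    vx, vy = x + round(v[0] / d), y + round(v[1] / d)
    return depth_at(depth, ux, uy) - depth_at(depth, vx, vy)  # pixel reads 2 and 3

# Toy 3x3 depth image (assumed): a flat surface at 2 m with one pixel at 4 m.
depth = [[2.0, 2.0, 2.0],
         [2.0, 2.0, 2.0],
         [2.0, 2.0, 4.0]]
response = depth_feature(depth, 1, 1, u=(2.0, 2.0), v=(0.0, 0.0))
```

Here the first probe lands on the 4 m pixel and the second on the reference pixel itself, so `response` is 2.0; a probe that falls outside the image returns the large background constant instead.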

Randomized decision forests

• Randomized decision forests
– fast and effective multi-class classifiers
– implemented efficiently on the GPU

Randomized decision forests
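As a rough sketch of how a forest classifies one pixel: each tree is descended by comparing a feature response against a per-node threshold, and the leaf distributions reached in all trees are averaged. The nested-dict tree layout, the keys `theta`, `tau`, `left`, `right` and `dist`, the "less than goes left" convention, and the toy `feature` callable are all assumptions for illustration.

```python
def classify_pixel(forest, feature, pixel):
    """Average the leaf distributions that `pixel` reaches in every tree."""
    total = None
    for tree in forest:
        node = tree
        while "dist" not in node:
            # Split node: compare one feature response with the node threshold.
            goes_left = feature(pixel, node["theta"]) < node["tau"]
            node = node["left"] if goes_left else node["right"]
        leaf = node["dist"]  # learned distribution over body part labels
        total = leaf[:] if total is None else [a + b for a, b in zip(total, leaf)]
    return [p / len(forest) for p in total]

# Toy two-tree forest over two classes, with a stand-in feature (assumed).
forest = [
    {"theta": 1.0, "tau": 0.5,
     "left": {"dist": [1.0, 0.0]}, "right": {"dist": [0.0, 1.0]}},
    {"theta": 1.0, "tau": 2.0,
     "left": {"dist": [0.5, 0.5]}, "right": {"dist": [0.0, 1.0]}},
]
toy_feature = lambda pixel, theta: pixel * theta
posterior = classify_pixel(forest, toy_feature, 1.0)
```

For the toy input the first tree ends in the leaf [0.0, 1.0] and the second in [0.5, 0.5], so the averaged distribution is [0.25, 0.75]. Each pixel is classified independently, which is what makes per-pixel parallelism on a GPU possible.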

Joint position proposals

• Generate reliable proposals for the positions of 3D skeletal joints
– the final output of our algorithm
– used by a tracking algorithm to self-initialize
– and recover from failure


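• A local mode-finding approach based on mean shift with a weighted Gaussian kernel
– density estimator per body part $c$: $f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\lVert \frac{\hat{x} - \hat{x}_i}{b_c} \right\rVert^2\right)$
– $\hat{x}_i$ is the reprojection of image pixel $x_i$ into world space given depth $d_I(x_i)$
– $b_c$ is a learned per-part bandwidth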

Non-Parametric Density Estimation

Assumption: The data points are sampled from an underlying PDF

Assumed Underlying PDF / Real Data Samples

Data point density implies PDF value!
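The idea that data point density implies the PDF value can be illustrated with a one-dimensional kernel density estimate. The Gaussian kernel, the bandwidth value and the sample data are assumptions chosen for the illustration.

```python
import math

def gaussian_kde(samples, x, bandwidth):
    """Estimate the PDF at x by averaging Gaussian bumps centered on each sample."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

# Toy samples (assumed): four values packed near 1.0 and one outlier at 5.0.
samples = [0.9, 1.0, 1.1, 1.0, 5.0]
dense = gaussian_kde(samples, 1.0, bandwidth=0.2)   # inside the cluster
sparse = gaussian_kde(samples, 3.0, bandwidth=0.2)  # in the empty gap
```

No parametric form is fitted here: the estimate is high wherever samples are packed closely and close to zero in the gaps between them.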


Parametric Density Estimation

Assumption: The data points are sampled from an underlying PDF

Assumed Underlying PDF, with parameters estimated from the Real Data Samples:

$$\mathrm{PDF}(x) = \sum_i c_i \, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}$$


• The weight $w_{ic}$ considers both the inferred body part probability at the pixel and the world surface area of the pixel
– $w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$


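• The detected modes
– lie on the surface of the body
– each is pushed back into the scene by a learned z offset to produce a final joint position proposal
• Parameters are optimized per part by grid search on a hold-out validation set of 5000 images; as an indication, the resulting means were
– bandwidth $b_c$ = 0.065 m
– probability threshold $\lambda_c$ = 0.14
– z offset = 0.039 m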
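The steps above (pixel weights, Gaussian-kernel mean shift, z offset) can be sketched for a single body part as follows. This is a sketch under assumptions, not the authors' implementation: the function name, the camera looking along +z, the single supplied starting point, and the toy inputs are all hypothetical; the default bandwidth and offset are the indicative mean values quoted above.

```python
import math

def joint_proposal(points, probs, depths, start,
                   bandwidth=0.065, z_offset=0.039, tol=1e-6, max_iter=100):
    """Weighted Gaussian mean shift over reprojected pixels of one body part."""
    # Pixel weight: part probability times depth squared (a proxy for surface area).
    weights = [p * d * d for p, d in zip(probs, depths)]
    x = tuple(start)
    for _ in range(max_iter):
        # Kernel response of every reprojected pixel at the current estimate.
        k = [w * math.exp(-(math.dist(x, p) / bandwidth) ** 2)
             for w, p in zip(weights, points)]
        total = sum(k)
        if total == 0.0:
            break  # no pixel supports this location
        # Mean shift update: kernel-weighted average of the pixel positions.
        new_x = tuple(sum(ki * p[a] for ki, p in zip(k, points)) / total
                      for a in range(3))
        moved = math.dist(new_x, x)
        x = new_x
        if moved < tol:
            break
    # The mode lies on the body surface; push it back into the scene along z (assumed axis).
    return (x[0], x[1], x[2] + z_offset)

# Toy input (assumed): three reprojected pixels, symmetric about x = 0, at 2 m depth.
points = [(0.0, 0.0, 2.0), (0.02, 0.0, 2.0), (-0.02, 0.0, 2.0)]
proposal = joint_proposal(points, probs=[0.9, 0.9, 0.9], depths=[2.0, 2.0, 2.0],
                          start=(0.01, 0.0, 2.0))
```

For the symmetric toy input the mode sits at about x = 0 on the surface, and the proposal is placed 0.039 m behind it. The probability threshold is not used in this sketch.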

Experiments

• Further results are provided in the supplementary material
• Training parameters
– 3 trees, 20 deep, 300k training images per tree
– 2000 training example pixels per image
– 2000 candidate features $\theta$
– 50 candidate thresholds $\tau$ per feature

Experiments

• Test data
– challenging synthetic and real depth images to evaluate our approach
– synthesize 5000 depth images
• Real test set
– 8808 frames of real depth images
– 15 different subjects
– 7 upper body joint positions

Experiments

• Error metric: quantify both classification and joint prediction accuracy
– Classification accuracy
• average of the diagonal of the confusion matrix
• between the ground truth part label and the most likely inferred part label
– Joint prediction accuracy
• generate recall-precision curves as a function of confidence threshold
• quantify accuracy as average precision per joint
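The classification metric, the average of the diagonal of the row-normalized confusion matrix, can be sketched as below. Integer class labels and the function name are assumptions; classes with no ground truth pixels are skipped here, which is a choice made for this sketch.

```python
def average_per_class_accuracy(true_labels, predicted_labels, num_classes):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[t][p] += 1  # row: ground truth part, column: most likely inferred part
    # Per-class recall, skipping classes that never occur in the ground truth.
    per_class = [row[c] / sum(row) for c, row in enumerate(confusion) if sum(row) > 0]
    return sum(per_class) / len(per_class)

# Toy labels (assumed): class 0 has four pixels, class 1 has two.
truth = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 0]
accuracy = average_per_class_accuracy(truth, predicted, num_classes=2)
```

Class 0 scores 3/4 and class 1 scores 1/2, so the result is 0.625, whereas plain per-pixel accuracy would be 4/6. Averaging per class keeps small parts such as hands from being swamped by large parts such as the torso.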

Experiments

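• Error metric:
– a joint proposal within D meters of the ground truth counts as a true positive; any further proposals within D meters count as false positives
– this penalizes multiple spurious detections near the correct position, which might slow a downstream tracking algorithm
• D = 0.1 m below, approximately the accuracy of the hand-labeled real test data ground truth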
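The true positive rule and the resulting average precision for one joint can be sketched as follows. The data layout (proposals as `(confidence, frame, xyz)` tuples, ground truth as a dict per frame) is hypothetical, and the average precision formula used, the mean of the precision at each true positive divided by the number of ground truth joints, is one common definition that may differ in detail from the one used for the reported numbers.

```python
import math

def average_precision(proposals, ground_truth, max_dist=0.1):
    """Average precision for one joint, with at most one match per ground truth."""
    matched = set()
    true_pos = 0
    precisions = []
    ranked = sorted(proposals, key=lambda p: p[0], reverse=True)
    for rank, (_, frame, xyz) in enumerate(ranked, start=1):
        gt = ground_truth.get(frame)
        if gt is not None and frame not in matched and math.dist(xyz, gt) <= max_dist:
            matched.add(frame)
            true_pos += 1
            precisions.append(true_pos / rank)
        # Otherwise a false positive: too far away, or a duplicate near a matched joint.
    return sum(precisions) / len(ground_truth) if ground_truth else 0.0

# Toy data (assumed): two frames, with a duplicate detection in frame 0.
ground_truth = {0: (0.0, 0.0, 2.0), 1: (1.0, 0.0, 2.0)}
proposals = [
    (0.9, 0, (0.02, 0.0, 2.0)),  # true positive
    (0.8, 0, (0.03, 0.0, 2.0)),  # second proposal near the same joint: false positive
    (0.7, 1, (1.05, 0.0, 2.0)),  # true positive
]
ap = average_precision(proposals, ground_truth)
```

The duplicate in frame 0 lowers the precision at the third rank to 2/3, giving an average precision of (1 + 2/3) / 2 = 5/6 instead of 1.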

Experiments

Experiments

• Real time motion capture using a single time-of-flight camera. [CVPR 2010]

Discussion

• Accurate proposals
– for the 3D locations of body joints
– in super real-time from single depth images
• Body part recognition
– as an intermediate representation
• A highly varied synthetic training set
– lets us train very deep decision forests
– using depth-invariant features without overfitting

Future work

• Study of the variability in the source mocap data
• Generative model underlying the synthesis pipeline
• A similarly efficient approach
– directly regress joint positions
– remove ambiguities in local pose

Thank you
