Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton Andrew Fitzgibbon Mat Cook
Toby Sharp Mark Finocchi Richard Moore
Alex Kipman Andrew Blake
Microsoft Research Cambridge & Xbox Incubation
CVPR 2011 Best Paper
OUTLINE
• Introduction
• Data
• Body Part Inference and Joint Proposals
• Experiments
• Discussion
Introduction
• Robust interactive human body tracking
– applications: gaming, human-computer interaction, security, telepresence, health-care
• Real-time depth cameras
– existing systems track from frame to frame but struggle to re-initialize quickly, and so are not robust
– our focus: per-frame initialization, to complement a tracking algorithm
• Focus on pose recognition in parts
– 3D position candidates for each skeletal joint
Introduction
• Appropriate tracking algorithms
– Tracking people with twists and exponential maps (CVPR 1998)
– Tracking loose limbed people (CVPR 2004)
– Nonlinear body pose estimation from depth images (DAGM 2005)
– Real-time hand-tracking with a color glove (ACM 2009)
– Real time motion capture using a single time-of-flight camera (CVPR 2010)
Introduction
• Inspired by recent object recognition work that divides objects into parts
– Object class recognition by unsupervised scale-invariant learning [CVPR 2003]
– The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006]
• Two key design goals
– computational efficiency
– robustness
Introduction
[Figure: pipeline overview. Depth image → (segment) dense probabilistic body part labeling, spatially localized near skeletal joints → (generate) 3D joint proposals]
Introduction
• We treat the segmentation into body parts as a per-pixel classification task
– evaluating each pixel separately
• Training data
– generate realistic synthetic depth images
– train a deep randomized decision forest classifier on a large training set, which avoids overfitting
Introduction
• Overfitting
• Simple, discriminative depth comparison image features
• Maintaining high computational efficiency
Introduction
• For further speed, the classifier can be run in parallel on each pixel on a GPU
• Mean shift on the per-pixel results produces the 3D joint proposals
What is Mean Shift?
Non-parametric density estimation
Non-parametric density GRADIENT estimation (Mean Shift)
Data
Discrete PDF representation
PDF analysis
PDF in feature space
• Color space
• Scale space
• Actually any feature space you can conceive
• …
A tool for: finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in $R^N$
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Mean Shift vector
Objective: find the densest region
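The billiard-ball picture can be sketched in a few lines of Python. This is a minimal illustration with a flat (hard-edged) window, not the weighted Gaussian kernel used later for joint proposals; the function name `mean_shift_mode`, the window `radius`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def mean_shift_mode(points, start, radius, tol=1e-6, max_iter=100):
    """Move a window of the given radius uphill until it stops moving."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Region of interest: all samples within `radius` of the current center.
        in_window = np.linalg.norm(points - center, axis=1) <= radius
        if not in_window.any():
            break  # empty window: nothing to move towards
        # Center of mass of the samples inside the window.
        new_center = points[in_window].mean(axis=0)
        # Mean shift vector: the step from the old center to the center of mass.
        shift = new_center - center
        center = new_center
        if np.linalg.norm(shift) < tol:
            break
    return center

# Hypothetical data: a dense cluster near (2, 2) on top of sparse background samples.
rng = np.random.default_rng(0)
dense = rng.normal(loc=[2.0, 2.0], scale=0.2, size=(200, 2))
sparse = rng.uniform(low=-1.0, high=5.0, size=(50, 2))
points = np.vstack([dense, sparse])

mode = mean_shift_mode(points, start=[1.2, 1.2], radius=1.0)
print(mode)  # ends up close to the dense cluster
```

Each iteration repeats the three steps named on the slide: pick the region of interest, compute its center of mass, and translate the window by the mean shift vector.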
Main contribution
• Treat pose estimation as object recognition
– using a novel intermediate body parts representation
– designed to spatially localize joints
– at low computational cost and high accuracy
Experiments
• (i) synthetic depth training data is an excellent proxy for real data
• (ii) scaling up the learning problem with varied synthetic data is important for high accuracy
• (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor
Data
• Depth imaging and motion capture data
• Pose estimation research
– has often focused on techniques
– to overcome the lack of training data
• Two problems for training data
– color
– pose
• Use real mocap data
– retargeted to a variety of base character models
– to synthesize a large, varied dataset
Depth image
• 640x480 image at 30 frames per second
• Depth cameras offer advantages over traditional intensity sensors
– working in low light levels
– giving a calibrated scale estimate
– resolving silhouette ambiguities in pose
Motion capture data
• Capture a large database of motion capture (mocap) of human actions
– approximately 500k frames
– driving, dancing, kicking, running, navigating menus
• Need not record mocap with variation in
– rotation about the vertical axis, mirroring left-right, scene position, body shape and size, camera pose
– all of which can be added in (semi-)automatically
Motion capture data
• The classifier uses no temporal information
– static poses
– not motion
• Changes in pose from one mocap frame to the next are so small as to be insignificant
– redundant poses are discarded using a 'furthest neighbor' clustering algorithm
– where the distance between poses $p_1$ and $p_2$ is $\max_j \|p_1^j - p_2^j\|_2$
– $j$ indexes the body joints, $p_i$ is the $i$-th pose
– retained poses are more than 5 cm apart
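One way to read the pose-pruning step is as a greedy furthest-point selection, sketched below. The slides do not spell out the exact procedure, so this is an assumed interpretation; `pose_distance`, `subsample_poses`, and the toy poses are hypothetical names and data, with poses stored as (joints, 3) arrays in metres.

```python
import numpy as np

def pose_distance(p1, p2):
    """Maximum Euclidean distance over corresponding joints (poses are (J, 3) arrays)."""
    return np.linalg.norm(p1 - p2, axis=1).max()

def subsample_poses(poses, min_dist=0.05):
    """Greedily keep the pose furthest from the kept set until every
    remaining pose lies within `min_dist` of some kept pose."""
    poses = np.asarray(poses, dtype=float)
    kept = [0]  # arbitrary seed: the first pose
    # Distance from every pose to its nearest kept pose.
    nearest = np.array([pose_distance(p, poses[0]) for p in poses])
    while True:
        candidate = int(nearest.argmax())
        if nearest[candidate] <= min_dist:
            break
        kept.append(candidate)
        to_new = np.array([pose_distance(p, poses[candidate]) for p in poses])
        nearest = np.minimum(nearest, to_new)
    return kept

# Three toy poses with three joints each: pose 1 is a 1 cm jitter of pose 0,
# pose 2 moves one joint by 20 cm.
base = np.zeros((3, 3))
jitter = base.copy()
jitter[0, 0] = 0.01
moved = base.copy()
moved[1, 1] = 0.20
kept = subsample_poses([base, jitter, moved], min_dist=0.05)
print(kept)  # [0, 2]
```

Because a pose is only added when it is more than `min_dist` from everything already kept, every pair of retained poses ends up more than 5 cm apart.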
Motion capture data
• Found it necessary to iterate the process of
– motion capture
– sampling from our model
– training the classifier
– testing joint prediction accuracy
• CMU mocap database
Generating synthetic data
• Build a randomized rendering pipeline
– from which to sample fully labeled training images
• Goals
– realism and variety
Generating synthetic data
• First: randomly sample a set of parameters
• Then use standard computer graphics techniques
– render depth and body part images
– from texture mapped 3D meshes
• Use Autodesk MotionBuilder
– slight random variation in height and weight gives extra coverage of body shapes
– other parameters
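The "sample parameters, then render" idea can be illustrated with a small parameter sampler. Everything here is hypothetical: the `RenderParams` fields are drawn from the variations the slides mention (rotation about the vertical axis, left-right mirroring, height and weight), but the names and numeric ranges are assumptions for illustration only, and the rendering step itself is not shown.

```python
import random
from dataclasses import dataclass

@dataclass
class RenderParams:
    pose_index: int      # which mocap pose to render
    yaw_degrees: float   # rotation about the vertical axis
    mirrored: bool       # left-right mirroring
    height_scale: float  # slight random variation in height
    weight_scale: float  # slight random variation in weight

def sample_render_params(num_poses, rng):
    """Draw one random parameter set; the ranges are illustrative guesses."""
    return RenderParams(
        pose_index=rng.randrange(num_poses),
        yaw_degrees=rng.uniform(0.0, 360.0),
        mirrored=rng.random() < 0.5,
        height_scale=rng.uniform(0.9, 1.1),
        weight_scale=rng.uniform(0.9, 1.1),
    )

rng = random.Random(0)
params = [sample_render_params(num_poses=100, rng=rng) for _ in range(5)]
for p in params:
    print(p)
```

Each sampled parameter set would then be handed to a renderer that produces a depth image and its matching body part label image.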
Body Part Inference and Joint Proposals
• Body part labeling
• Depth image features
• Randomized decision forests
• Joint position proposals
Body part labeling
• Intermediate body part representation
– shown color-coded
– some parts directly localize particular skeletal joints
– others fill the gaps
• Transforms the problem into one that can readily be solved by efficient classification algorithms
Body part labeling
• The parts are specified in a texture map
Body part labeling
• 31 body parts: LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer)
Depth image features
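• Feature: $f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$
• $d_I(x)$ is the depth at pixel $x$ in image $I$
• $\theta = (u, v)$ describes offsets $u$ and $v$
• The normalization by $1/d_I(x)$ ensures the features are depth invariant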
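A depth comparison feature of this kind can be sketched as below. This is an illustrative implementation under stated assumptions: pixels are addressed as (row, column), offsets are rounded to the nearest pixel, and probes that fall off the image return a large constant (`BACKGROUND_DEPTH`, a hypothetical name and value). The helper names are not from the slides.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6  # assumed large constant for probes that leave the image

def probe_depth(depth, pixel):
    """Depth at the nearest integer pixel (row, col), or a large constant off the image."""
    r, c = int(round(pixel[0])), int(round(pixel[1]))
    if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
        return float(depth[r, c])
    return BACKGROUND_DEPTH

def depth_feature(depth, x, u, v):
    """Difference of two depth probes whose offsets are scaled by 1 / d_I(x)."""
    x = np.asarray(x, dtype=float)
    d = float(depth[int(x[0]), int(x[1])])  # first pixel read: depth at x itself
    # Dividing the offsets by the depth at x shrinks them for far-away pixels,
    # so the probes cover a similar region of the body at any distance.
    first = probe_depth(depth, x + np.asarray(u, dtype=float) / d)
    second = probe_depth(depth, x + np.asarray(v, dtype=float) / d)
    return first - second

# Toy 5x5 depth image: everything at 2 m except one pixel at 4 m.
depth = np.full((5, 5), 2.0)
depth[2, 4] = 4.0
value = depth_feature(depth, x=(2, 2), u=(0, 4), v=(0, 0))
print(value)  # 2.0: the u probe lands on the 4 m pixel, the v probe on x itself
```

The function reads three pixels in total (the depth at x and the two probes).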
Depth image features
• Individually these features provide only a weak signal
• In combination in a decision forest they are sufficient to accurately disambiguate all trained parts
Depth image features
• The design of these features was strongly motivated by their computational efficiency
– no preprocessing is needed
– each feature reads at most 3 image pixels
– and performs at most 5 arithmetic operations
– straightforwardly implemented on the GPU
Randomized decision forests
• Randomized decision forests
– fast and effective multi-class classifiers
– implemented efficiently on the GPU
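As a rough sketch of how a forest classifies one pixel: each tree is walked from the root by comparing a feature value to a threshold, and the class distributions stored at the reached leaves are averaged. The nested-dict tree layout, the "go left when below the threshold" convention, and the function names are assumptions made for this example.

```python
import numpy as np

def classify_with_tree(node, feature_value_fn):
    """Descend until a leaf: go left if the feature is below the threshold, else right."""
    while "distribution" not in node:
        value = feature_value_fn(node["theta"])
        node = node["left"] if value < node["threshold"] else node["right"]
    return np.asarray(node["distribution"], dtype=float)

def classify_with_forest(trees, feature_value_fn):
    """Average the leaf distributions reached in each tree."""
    return np.mean([classify_with_tree(t, feature_value_fn) for t in trees], axis=0)

# Two hypothetical depth-1 trees over two body part classes.
tree_a = {"theta": "a", "threshold": 0.0,
          "left": {"distribution": [0.9, 0.1]},
          "right": {"distribution": [0.2, 0.8]}}
tree_b = {"theta": "b", "threshold": 1.0,
          "left": {"distribution": [0.6, 0.4]},
          "right": {"distribution": [0.1, 0.9]}}

# Stand-in for evaluating the depth feature with parameters theta at one pixel.
feature_values = {"a": -0.5, "b": 2.0}
posterior = classify_with_forest([tree_a, tree_b], feature_values.__getitem__)
print(posterior)  # [0.5 0.5]
```

Because every pixel is classified independently, the same walk can run in parallel across pixels, which is what makes a GPU implementation natural.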
Joint position proposals
• Generate reliable proposals for the positions of 3D skeletal joints
– the final output of our algorithm
– used by a tracking algorithm to self-initialize
– and recover from failure
Joint position proposals
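• A local mode-finding approach based on mean shift with a weighted Gaussian kernel
– density estimator per body part $c$: $f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\|\frac{\hat{x} - \hat{x}_i}{b_c}\right\|^2\right)$
– $\hat{x}_i$ is the reprojection of image pixel $x_i$ into world space given depth $d_I(x_i)$
– $b_c$ is a learned per-part bandwidth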
Non-Parametric Density Estimation
Assumption: the data points are sampled from an underlying PDF
[Figure: assumed underlying PDF alongside the real data samples]
Data point density implies PDF value!
Parametric Density Estimation
Assumption: the data points are sampled from an underlying PDF
$\mathrm{PDF}(x) = \sum_i c_i \, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}$
Estimate the assumed underlying PDF from the real data samples
Joint position proposals
• The pixel weight $w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$ considers both the inferred body part probability at the pixel and the world surface area of the pixel
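The weighting and the mode search can be combined in a short sketch. It assumes the pixels have already been reprojected to 3D world coordinates and that the per-pixel part probabilities are available; the function names, the stopping rule, and the synthetic points are hypothetical, and only the bandwidth value of 0.065 m is taken from the slides.

```python
import numpy as np

def pixel_weights(part_prob, depth):
    """Part probability times depth squared (a proxy for the pixel's world surface area)."""
    return np.asarray(part_prob, dtype=float) * np.asarray(depth, dtype=float) ** 2

def weighted_mean_shift(points, weights, start, bandwidth, tol=1e-5, max_iter=200):
    """Climb the weighted Gaussian density estimate from `start` to a nearby mode."""
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        sq_dist = np.sum(((points - x) / bandwidth) ** 2, axis=1)
        k = weights * np.exp(-sq_dist)  # weighted Gaussian kernel response per point
        total = k.sum()
        if total == 0.0:
            return x  # no point is close enough to influence the estimate
        new_x = (k[:, None] * points).sum(axis=0) / total
        if np.linalg.norm(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical reprojected pixels: a confident cluster near (0, 0, 2) metres
# and a small, low-probability cluster half a metre away.
rng = np.random.default_rng(0)
joint_pixels = rng.normal(loc=[0.0, 0.0, 2.0], scale=0.02, size=(100, 3))
other_pixels = rng.normal(loc=[0.5, 0.0, 2.0], scale=0.02, size=(20, 3))
points = np.vstack([joint_pixels, other_pixels])
part_prob = np.concatenate([np.full(100, 0.9), np.full(20, 0.1)])

weights = pixel_weights(part_prob, depth=points[:, 2])
mode = weighted_mean_shift(points, weights, start=points[0], bandwidth=0.065)
print(mode)  # close to (0, 0, 2)
```

The depth-squared term keeps the density estimate roughly invariant to distance, since a pixel further from the camera covers a larger patch of surface.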
Joint position proposals
• The detected modes
– lie on the surface of the body
– are pushed back into the scene by a learned z offset to produce a final joint position proposal
• Parameters set by grid search on a validation set of 5000 images
– bandwidth $b_c$ = 0.065 m
– probability threshold $\lambda_c$ = 0.14
– z offset = 0.039 m
Experiments
• Further results are provided in the supplementary material
• Training parameters
– 3 trees, 20 deep, 300k training images per tree
– 2000 training example pixels per image
– 2000 candidate features $\theta$
– 50 candidate thresholds $\tau$ per feature
Experiments
• Test data
– challenging synthetic and real depth images to evaluate our approach
– synthesize 5000 depth images
• Real test set
– 8808 frames of real depth images
– 15 different subjects
– 7 upper body joint positions
Experiments
• Error metric: quantify both classification and joint prediction accuracy
– Classification accuracy
• average of the diagonal of the confusion matrix
• between the ground truth part label and the most likely inferred part label
– Joint prediction accuracy
• generate recall-precision curves as a function of confidence threshold
• quantify accuracy as average precision per joint
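The classification metric, the average of the diagonal of the (row-normalized) confusion matrix, can be computed as in the sketch below. The function name and the toy labels are made up for illustration, and skipping classes that never occur in the ground truth is an assumption of this example.

```python
import numpy as np

def mean_per_class_accuracy(true_labels, predicted_labels, num_classes):
    """Average of the diagonal of the row-normalized confusion matrix."""
    confusion = np.zeros((num_classes, num_classes))
    for t, p in zip(true_labels, predicted_labels):
        confusion[t, p] += 1
    row_sums = confusion.sum(axis=1)
    present = row_sums > 0  # ignore classes with no ground-truth pixels
    per_class = np.diag(confusion)[present] / row_sums[present]
    return float(per_class.mean())

# Toy example with two part labels: class 0 is right 3 times out of 4,
# class 1 is right 1 time out of 2.
true_labels = [0, 0, 0, 0, 1, 1]
predicted_labels = [0, 0, 0, 1, 1, 0]
accuracy = mean_per_class_accuracy(true_labels, predicted_labels, num_classes=2)
print(accuracy)  # 0.625
```

Unlike plain pixel accuracy (4 of 6 correct here), this measure gives small parts such as hands the same influence as large parts such as the torso.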
Experiments
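• Error metric:
– at most one proposal within distance D of the ground truth counts as a true positive
– this penalizes multiple spurious detections near the correct position, which might slow a downstream tracking algorithm
• D = 0.1 m, approximately the accuracy of the hand-labeled real test data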
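The matching rule behind the joint metric can be sketched as follows. It assumes proposals are visited in order of decreasing confidence and that only the first one within D of the ground truth is a true positive; that ordering, the function name, and the sample coordinates are assumptions of this example. Labels like these would feed a recall-precision curve.

```python
import numpy as np

def label_proposals(proposals, confidences, ground_truth, max_dist=0.1):
    """Return (confidence, is_true_positive) pairs, most confident proposal first."""
    ground_truth = np.asarray(ground_truth, dtype=float)
    order = np.argsort(confidences)[::-1]
    matched = False
    labeled = []
    for i in order:
        close = np.linalg.norm(np.asarray(proposals[i], dtype=float) - ground_truth) <= max_dist
        # Only the first close proposal counts; later close ones are spurious.
        is_tp = bool(close and not matched)
        matched = matched or is_tp
        labeled.append((float(confidences[i]), is_tp))
    return labeled

# Two proposals near the true joint and one far away (coordinates in metres).
proposals = [[0.00, 0.0, 2.0], [0.05, 0.0, 2.0], [1.00, 0.0, 2.0]]
confidences = [0.9, 0.8, 0.7]
labeled = label_proposals(proposals, confidences, ground_truth=[0.02, 0.0, 2.0])
print(labeled)  # [(0.9, True), (0.8, False), (0.7, False)]
```

The second proposal is within 0.1 m of the joint but still counts as a false positive, which is how the metric penalizes duplicate detections.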
Experiments
• Comparison with: Real time motion capture using a single time-of-flight camera [CVPR 2010]
Discussion
• Accurate proposals
– for the 3D locations of body joints
– computed in super real-time from single depth images
• Body part recognition
– as an intermediate representation
• A highly varied synthetic training set
– allows training very deep decision forests
– using depth invariant features without overfitting
Future work
• Study of the variability in the source mocap data
• Generative model underlying the synthesis pipeline
• A similarly efficient approach
– to directly regress joint positions
– and remove ambiguities in local pose
Thank you