Human Pose Estimation by Deep Learning
Wei Yang
Supervisor: Prof. WANG Xiaogang, Prof. OUYANG Wanli
IVP Lab, CUHK
September 11, 2015
2
Outline
• Introduction
• Traditional Approaches
• Deep Learning Methods
– Global view (holistic view)
– Local appearance
– Combination of local appearance and global view
– Others
2015/9/11
3
Introduction
• What is articulated body pose estimation?
– It “recovers the pose of an articulated body, which consists of joints and rigid parts using image-based observations.”
4
Applications
• Action recognition
• Clothing parsing
• Gaming
• Human tracking
5
Challenges
6
Traditional Approaches
Pictorial Structure (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)
• Unary templates
• Pairwise springs
Mixtures of “mini-parts” (Yang & Ramanan 2011)
• Each part has a mixture of types
• Unary template for each part and mixture type
• Pairwise springs between connected parts, depending on the mixture type of each part
[Figure: part model with head, torso, and leg]
Example of mini-parts: near-vertical and near-horizontal limbs
7
Deep Learning for Pose Estimation
• Holistic view
– e.g., joint position regression
• Local view
– e.g., body part detection
• Combining global and local information
– e.g., body part detection + joint position regression
• Others
– e.g., motion features, pose estimation in videos
8
Holistic View
DeepPose: Human Pose Estimation via Deep Neural Networks
9
Holistic Reasoning
• Why holistic reasoning?
– Besides extreme variability in articulations, many of the joints are barely visible
10
DeepPose: A CNN Regressor
• Network architecture: AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012 (ImageNet)
– The first time a deep model was shown to be effective at large scale
[Toshev & Szegedy, CVPR 2014]
11
Results on LSP (Leeds Sports Pose) dataset
12
Cascade of Pose Regressors
• The pose estimation results are very coarse:
– Due to its fixed input size of 220 × 220, the network has limited capacity to look at detail
– Train a cascade of pose regressors for more precise joint localization
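The cascade can be sketched as follows. This is a minimal illustration, not the DeepPose implementation: `run_cascade` and `make_toy_stage` are hypothetical names, and each stage is assumed to return a displacement that is added to the current joint estimate. In the real system each stage would be a trained network looking at a higher-resolution crop around the current estimate; here a toy function stands in for it.

```python
def run_cascade(initial_joints, stage_fns):
    """Refine joint estimates stage by stage.

    initial_joints: list of (x, y) from the first, coarse regressor.
    stage_fns: list of callables mapping (x, y) -> (dx, dy); each one is a
    stand-in (assumption) for a trained refinement regressor.
    """
    joints = list(initial_joints)
    for stage_fn in stage_fns:
        refined = []
        for (x, y) in joints:
            dx, dy = stage_fn(x, y)          # predicted displacement
            refined.append((x + dx, y + dy))  # move the current estimate
        joints = refined
    return joints


def make_toy_stage(target):
    """Toy stage: moves a joint halfway toward a made-up true location."""
    def stage(x, y):
        return (0.5 * (target[0] - x), 0.5 * (target[1] - y))
    return stage


# One joint, coarse estimate (80, 40), made-up true location (100, 60).
stage = make_toy_stage((100.0, 60.0))
refined = run_cascade([(80.0, 40.0)], [stage, stage])
```

Each additional stage starts from the previous estimate, so the error shrinks step by step instead of being corrected in one shot.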
13
Cascade of Pose Regressors
14
Refined pose estimation
15
Percentage of Correct Parts (PCP) on LSP dataset
16
Local Appearance Method
Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
17
Motivation
• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships
Number of mixture types for each pair: 6
[Figure: lower-arm and upper-arm patches, with the number of relationships for 1 neighbor and for 2 neighbors]
[Chen & Yuille NIPS 2014]
18
Tree-structured Relational Graph
– $V$: body parts
– $E$: pairwise relationships between parts
– $p_i$: pixel location of part $i$
– $t_{ij}$: pairwise relationship type
– Defined by relative position
– In experiments: 13 types for each pair
19
Formulation
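$$F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)$$
• Part presence: the unary term $A_i$, weighted by $\omega_i$
• Pairwise deformation: a term of $R$, weighted by $\boldsymbol{\omega}_{ij}^{t_{ij}}$
• Pairwise relationship: a term of $R$, weighted by $\omega_{ij}$
Inference:
• Tree structure
• Can be solved efficiently by dynamic programming
The weights $\boldsymbol{\omega}$ are learned by a latent structural SVM.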
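To make the dynamic-programming claim concrete, here is a minimal max-sum sketch on a chain of parts, which is the simplest tree. It is an illustration only: `chain_max_sum` is a hypothetical name, each part is assumed to have a small set of candidate locations, and the unary and pairwise score tables are made-up numbers rather than DCNN outputs.

```python
def chain_max_sum(unary, pairwise):
    """Exact maximization of sum(unary) + sum(pairwise) on a chain.

    unary[i][a]: score of part i at candidate a.
    pairwise[i][a][b]: score of part i at a together with part i+1 at b.
    Returns (best_total_score, best_assignment).
    """
    best = [list(unary[0])]   # best[i][b]: best score of parts 0..i with part i at b
    back = []                 # back[i-1][b]: best candidate of part i-1 given part i at b
    for i in range(1, len(unary)):
        scores, pointers = [], []
        for b in range(len(unary[i])):
            cands = [best[i - 1][a] + pairwise[i - 1][a][b]
                     for a in range(len(unary[i - 1]))]
            a_star = max(range(len(cands)), key=cands.__getitem__)
            scores.append(cands[a_star] + unary[i][b])
            pointers.append(a_star)
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best candidate of the last part.
    last = max(range(len(best[-1])), key=best[-1].__getitem__)
    assignment = [last]
    for pointers in reversed(back):
        assignment.append(pointers[assignment[-1]])
    assignment.reverse()
    return best[-1][last], assignment


# Three parts, two candidate locations each (made-up scores).
unary = [[1, 0], [0, 2], [1, 1]]
pairwise = [[[0, -3], [0, 0]], [[0, 0], [-5, 0]]]
score, assignment = chain_max_sum(unary, pairwise)
```

The cost is linear in the number of parts and quadratic in the number of candidates per part, instead of exponential for exhaustive search. On a general tree the same recursion runs from the leaves to the root.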
20
Learning DCNN parameters
Derive the type label for each patch
• Use relative position to represent the pairwise relations
• Cluster the relative positions over the whole training set
• Type label: cluster index
• Mean relative position: cluster center
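The labeling step above can be sketched with plain k-means. This is a minimal sketch under assumptions: the slides do not name the clustering algorithm, so k-means (Lloyd's algorithm) is used here as a stand-in, `kmeans_2d` is a hypothetical name, and the relative positions are made-up offsets of a neighboring part with respect to a part.

```python
def kmeans_2d(points, k, iters=20):
    """Plain Lloyd's algorithm on 2D points.

    The first k points seed the centers so the result is deterministic
    (a simplification; real code would use a better initialization).
    Returns (labels, centers): labels are the type labels (cluster indices),
    centers are the mean relative positions.
    """
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every relative position.
        for n, (x, y) in enumerate(points):
            labels[n] = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [points[n] for n in range(len(points)) if labels[n] == c]
            if members:
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centers


# Made-up relative positions (dx, dy) of a neighboring part.
relative_positions = [(10, 0), (-10, 0), (11, 1), (-11, -1), (9, -1), (-9, 1)]
type_labels, mean_positions = kmeans_2d(relative_positions, k=2)
```

The cluster index then serves as the classification target for a patch, and the cluster center supplies the mean relative position for that type.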
21
Casting Full Connections into Convolutions
Elbow
Part presence map
Pairwise relationship map
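The idea of casting a fully connected layer into a convolution can be illustrated in a few lines. This is a minimal pure-Python sketch under simplifying assumptions (one output unit, one input channel, stride 1); `fc_on_patch` and `fc_as_conv` are hypothetical names, not taken from the paper. The same weights applied as a sliding kernel give, at every position, the value the fully connected unit would produce on the patch at that position, which is how dense maps can be computed over a whole image in one pass.

```python
def fc_on_patch(patch, weights, bias):
    """Fully connected unit: dot product of the flattened patch with the weights."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(image, weights, bias, kh, kw):
    """Reshape the same weights into a kh x kw kernel and slide it over the image."""
    kernel = [weights[r * kw:(r + 1) * kw] for r in range(kh)]
    out = []
    for top in range(len(image) - kh + 1):
        row = []
        for left in range(len(image[0]) - kw + 1):
            s = bias
            for r in range(kh):
                for c in range(kw):
                    s += kernel[r][c] * image[top + r][left + c]
            row.append(s)
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
weights = [1, 0, -1, 2]   # weights of a unit trained on 2 x 2 patches (made up)
response_map = fc_as_conv(image, weights, bias=1, kh=2, kw=2)
```

Each entry of `response_map` equals `fc_on_patch` applied to the 2 × 2 crop at the same position, so no patch has to be cropped and evaluated separately.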
22
PCP and PDJ on LSP dataset and FLIC dataset
Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
[Figure: PDJ curves on LSP and FLIC]
23
Combining Local Appearance and Holistic View
Dual-Source Deep Neural Networks for Human Pose Estimation
24
Dual-Source CNN
• Integrate both the local part appearance and the holistic view of each local part for more accurate human pose estimation
• Each input is an image pair
– Part patches
– Body patches
25
Part patches: incorporate local appearance
• Generated by region proposals with some restrictions
– Not too small (must contain at least one body part)
– Not too big (may contain too many body parts and lack sufficient resolution)
• All classes of joints are covered by a similar number of part patches
• During testing, part patches are selected from multi-scale sliding windows
26
Body patches: holistic view
• Also from region proposals
– Must cover all body parts
– In the testing stage, the body patch can be generated by human detection
• For DS-CNN, each training sample is made up of 3 components
– A part patch
– A body patch
– A binary mask specifying the location of the part patch in the body patch
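The third component can be sketched as follows. This is a minimal illustration under assumptions: boxes are `(x0, y0, x1, y1)` in image pixels with exclusive upper bounds, the mask resolution is a free choice, and `part_location_mask` is a hypothetical name that does not come from the paper.

```python
def part_location_mask(body_box, part_box, out_h, out_w):
    """Binary mask over the body patch: 1 where the part patch lies, else 0.

    The body patch is divided into out_h x out_w cells; a cell is set to 1
    when its center falls inside the part box.
    """
    bx0, by0, bx1, by1 = body_box
    px0, py0, px1, py1 = part_box
    cell_w = (bx1 - bx0) / out_w
    cell_h = (by1 - by0) / out_h
    mask = []
    for r in range(out_h):
        cy = by0 + (r + 0.5) * cell_h        # cell center, image coordinates
        row = []
        for c in range(out_w):
            cx = bx0 + (c + 0.5) * cell_w
            inside = px0 <= cx < px1 and py0 <= cy < py1
            row.append(1 if inside else 0)
        mask.append(row)
    return mask


# Body patch covering an 8 x 8 region, part patch in its lower middle.
mask = part_location_mask((0, 0, 8, 8), (2, 4, 6, 8), out_h=4, out_w=4)
```

The mask tells the network where the local patch sits inside the holistic view, which the two image patches alone do not encode.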
27
Training of the DS-CNN
[Figure: DS-CNN with shared weights, a classification output (softmax), and a regression output (L2 distance)]
28
Testing
• Part heat map
– Same size as the input image
– Uniformly distributed probability for each sliding window
– Sum and average over all pixels
[Figure: part heat maps]
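One possible reading of the part heat map construction is sketched below. It assumes that each sliding window spreads its classification probability for a part uniformly over the pixels it covers, and that every pixel then averages the values of all windows covering it. `part_heat_map` is a hypothetical name and the window list is made up.

```python
def part_heat_map(height, width, windows):
    """Accumulate window probabilities into a per-pixel heat map.

    windows: list of (x0, y0, x1, y1, prob) with exclusive upper bounds,
    where prob is the classifier's probability that the window shows the part.
    Pixels not covered by any window get 0.0.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1, prob in windows:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                total[y][x] += prob   # every pixel of the window gets the same value
                count[y][x] += 1
    return [
        [total[y][x] / count[y][x] if count[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]


# Two overlapping windows on a 2 x 4 image (made-up probabilities).
heat = part_heat_map(2, 4, [(0, 0, 2, 2, 0.75), (1, 0, 3, 2, 0.25)])
```

The resulting map has the same size as the input image, and one such map would be built per part.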
29
Testing
• Final pose estimation
– Weighted average of predicted joint locations within part patches with high responses.
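The final step can be sketched as a response-weighted mean. This is an illustration under assumptions: `estimate_joint` and the threshold `min_response` are hypothetical, since the slides do not say how "high responses" are selected, and the predictions are made-up numbers.

```python
def estimate_joint(predictions, min_response=0.5):
    """Weighted average of joint locations predicted by high-response patches.

    predictions: list of (x, y, response), one per part patch, where (x, y)
    is the joint location regressed from that patch and response is its
    classification score. Returns None if no patch passes the threshold.
    """
    kept = [(x, y, r) for (x, y, r) in predictions if r >= min_response]
    if not kept:
        return None
    total = sum(r for (_, _, r) in kept)
    mean_x = sum(x * r for (x, _, r) in kept) / total
    mean_y = sum(y * r for (_, y, r) in kept) / total
    return (mean_x, mean_y)


# Two confident patches and one weak outlier that is ignored.
joint = estimate_joint([(10, 20, 0.75), (14, 24, 0.75), (100, 100, 0.1)])
```

Averaging over several confident patches makes the estimate less sensitive to a single bad regression.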
30
Results: PCP on LSP
31
Other Methods & Applications
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
• Flowing ConvNets for Human Pose Estimation in Videos
32
Using Motion Features for Human Pose Estimation
• Motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.
Image credit: T. Brox and J. Malik, “Large displacement optical flow: descriptor matching in variational motion estimation,” IEEE TPAMI, 33(3): 500-513, 2011
33
MoDeep: Using Motion Features for Human Pose Estimation
• Extended the Frames Labeled In Cinema (FLIC) dataset with additional motion features
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. A. Jain et al., ACCV 2014
[Figure: motion features: average of frame pair; optical flow]
34
Multi-resolution efficient sliding window model
35
Simple Spatial Model
• FLIC: multiple people with only one annotated person
• Testing: incorporate the annotated torso position with a simple spatial model
[Figure: predicted left shoulder; spatial mask of left shoulder; result]
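A minimal sketch of how such a spatial mask could be applied, assuming the mask is combined with the predicted heat map by element-wise multiplication before taking the maximum. `apply_spatial_mask` is a hypothetical name and the numbers are made up; the mask would be derived from the annotated torso position.

```python
def apply_spatial_mask(heat_map, mask):
    """Suppress detections that are implausible given the annotated torso.

    heat_map and mask are equally sized 2D lists. Returns the masked map and
    the (x, y) location of its maximum.
    """
    masked = [[h * m for h, m in zip(h_row, m_row)]
              for h_row, m_row in zip(heat_map, mask)]
    best_value, best_xy = float("-inf"), None
    for y, row in enumerate(masked):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return masked, best_xy


# The strongest raw response (top-left) belongs to another person and is masked out.
heat_map = [[0.9, 0.1],
            [0.2, 0.6]]
mask = [[0.0, 1.0],
        [1.0, 1.0]]
masked, location = apply_spatial_mask(heat_map, mask)
```

Without the mask the maximum would land on the unannotated person; with it, the detection is restricted to the region consistent with the annotated torso.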
36
Experiment results
[Figure: results without and with motion features, for occlusion, cluttered background, and motion blur]
37
Flowing ConvNets for Human Pose Estimation in Videos
• A CNN can benefit from temporal context by combining information across multiple frames using optical flow.
38
Spatial ConvNet
Why regress a heatmap instead of joint coordinates?
• The network output can be multi-modal
• Regressing coordinates directly is a highly non-linear mapping that is more difficult to learn
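The difference between the two targets can be made concrete with a small sketch. It assumes a Gaussian centered on the joint as the heatmap target and an argmax as the decoding step; `gaussian_heatmap`, `decode_heatmap`, and the value of `sigma` are illustrative choices, not taken from the slides.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.5):
    """Regression target: a Gaussian bump centered on the joint location."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def decode_heatmap(heatmap):
    """Recover the joint location as the position of the maximum."""
    best_value, best_xy = float("-inf"), None
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


target = gaussian_heatmap(8, 10, cx=6, cy=3)
joint = decode_heatmap(target)
```

A heatmap can hold several peaks when the network is uncertain between candidate locations, while a single regressed coordinate pair can only report one point, which may fall between the candidates.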
39
Warping neighbouring heatmaps for improving pose estimates
• Heatmaps from frames (t − n) and (t + n), warped to frame t using tracks from optical flow (green & blue lines), can help correct wrongly estimated part locations
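A simplified sketch of the warping idea follows. It assumes integer, per-pixel flow vectors and combines the aligned heatmaps by a plain average, which is a simplification chosen for this illustration and not necessarily how the paper pools them. `warp_heatmap` and `pool_heatmaps` are hypothetical names.

```python
def warp_heatmap(heatmap, flow):
    """Align a neighboring frame's heatmap with frame t.

    flow[y][x] = (dx, dy) means pixel (x, y) of frame t corresponds to pixel
    (x + dx, y + dy) of the neighboring frame (backward warping, integer flow).
    Pixels whose source falls outside the image are set to 0.0.
    """
    h, w = len(heatmap), len(heatmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = heatmap[sy][sx]
    return out


def pool_heatmaps(heatmaps):
    """Average several aligned heatmaps pixel by pixel."""
    n = len(heatmaps)
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(hm[y][x] for hm in heatmaps) / n for x in range(w)]
            for y in range(h)]


# The joint sits at x = 2 in frame t and at x = 3 in the neighboring frame.
current = [[0.0, 0.0, 1.0, 0.0]]
neighbor = [[0.0, 0.0, 0.0, 1.0]]
flow = [[(1, 0)] * 4]
aligned = warp_heatmap(neighbor, flow)
pooled = pool_heatmaps([current, aligned])
```

After warping, the neighboring frame votes for the same pixel as the current frame, so a frame with a wrong peak can be outvoted by its aligned neighbors.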
40
Results
41
Ongoing Project
• End-to-end pose estimation
– Joint learning of pose features and pose configurations
– Allow local appearance to be fine-tuned by pose configuration
[Figure: unary response and pairwise relationships]
42
Ongoing Project
• Add constraints between body parts in a network
[Figure: network unrolled over $x_{t-1}$, $x_t$, $x_{t+1}$ with outputs $z_{t-1}$, $z_t$, $z_{t+1}$ and weights $w_{dt}$ and $w_m$; labels: unary response, distance transform, pairwise relationships]
43
Preliminary Results (PCP on LSP)
Head  Torso  U.arms  L.arms  U.legs  L.legs  Mean
84.7  91     68.7    53.6    80.7    73.3    72.82
• Future work
– Pose relational graph learning
– Multi-task learning
  • Human detection
  • Human segmentation
– Combining global information
44
Recent developments
• DeepPose: Human pose estimation via deep neural networks
– A Toshev, C Szegedy, CVPR, 2014
• Joint training of a convolutional network and a graphical model for human pose estimation
– JJ Tompson, A Jain, Y LeCun, C Bregler, NIPS, 2014
• Human Pose Estimation with Iterative Error Feedback
– J Carreira et al., arXiv preprint arXiv:1507.06550, 2015
• Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
– S Li, W Zhang, AB Chan, arXiv preprint arXiv:1508.06708, 2015
• Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network
– S Li, ZQ Liu, AB Chan, CVPR Workshop, 2014
• Flowing ConvNets for Human Pose Estimation in Videos
– T Pfister, J Charles, A Zisserman, ICCV, 2015
• R-CNNs for Pose Estimation and Action Detection
– G Gkioxari, B Hariharan, R Girshick, J Malik, arXiv preprint arXiv:1406.5212, 2014
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
– A Jain, J Tompson, Y LeCun, C Bregler, ACCV, 2014
• Efficient object localization using convolutional networks
– J Tompson, R Goroshin, A Jain, Y LeCun, C Bregler, CVPR, 2015
• Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation
– X Fan, K Zheng, Y Lin, S Wang, CVPR, 2015
• Parsing Occluded People by Flexible Compositions
– X Chen, AL Yuille, CVPR, 2015
• Articulated pose estimation by a graphical model with image dependent pairwise relations
– X Chen, AL Yuille, NIPS, 2014
• …
Thank you
46
Evaluation Metrics
• Percentage of Correct Parts (PCP)
– Measures the percentage of correctly localized body parts.
– A predicted body part is counted as correct if both of its segment endpoints lie within 50% of the ground-truth segment length from the corresponding annotated endpoints.
• Percentage of Detected Joints (PDJ)
– Measures performance as a curve: the percentage of correctly localized joints as the localization precision threshold varies.
– The threshold is normalized by the torso scale, defined as the distance between the left shoulder and the right hip.
– Invariant to scale
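Both metrics can be written down in a few lines. This is a minimal sketch: the function names and the data layout (parts as endpoint pairs, joints as `(x, y)` tuples) are assumptions made for the example, and the coordinates are made up.

```python
import math


def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def pcp(pred_parts, gt_parts, alpha=0.5):
    """Fraction of parts whose two endpoints are both within
    alpha * (ground-truth part length) of the annotated endpoints."""
    correct = 0
    for (p0, p1), (g0, g1) in zip(pred_parts, gt_parts):
        limit = alpha * dist(g0, g1)
        if dist(p0, g0) <= limit and dist(p1, g1) <= limit:
            correct += 1
    return correct / len(gt_parts)


def pdj(pred_joints, gt_joints, left_shoulder, right_hip, fraction):
    """Fraction of joints within fraction * torso size of the ground truth,
    where torso size is the left-shoulder to right-hip distance."""
    torso = dist(left_shoulder, right_hip)
    hits = sum(1 for p, g in zip(pred_joints, gt_joints)
               if dist(p, g) <= fraction * torso)
    return hits / len(gt_joints)


# PCP: the first part is correct, the second has one endpoint too far off.
gt_parts = [((0, 0), (0, 10)), ((0, 0), (10, 0))]
pred_parts = [((1, 0), (0, 12)), ((0, 6), (10, 0))]
pcp_score = pcp(pred_parts, gt_parts)

# PDJ at a threshold of 0.2 of a torso size of 50 pixels.
pdj_score = pdj([(3, 4), (10, 25)], [(0, 0), (10, 10)],
                left_shoulder=(0, 0), right_hip=(30, 40), fraction=0.2)
```

Evaluating `pdj` over a range of `fraction` values gives the PDJ curve. Because the threshold scales with the torso size, the same pose error counts equally for people who appear large or small in the image.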