Human Pose Estimation by Deep Learning

Wei Yang
Supervisor: Prof. WANG Xiaogang, Prof. OUYANG Wanli

IVP Lab, CUHK
September 11, 2015


Outline

• Introduction
• Traditional Approaches
• Deep Learning Methods

– Global view (holistic view)

– Local appearance

– Combination of local appearance and global view

– Others


Introduction

• What is articulated body pose estimation?
– It “recovers the pose of an articulated body, which consists of joints and rigid parts, using image-based observations.”

Source: http://en.wikipedia.org/wiki/Articulated_body_pose_estimation

Applications

• Action recognition
• Clothing parsing
• Gaming
• Human tracking

Challenges


Traditional Approaches

Pictorial Structure (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)
• Unary templates
• Pairwise springs

Mixtures of “mini-parts” (Yang & Ramanan 2011)
• A mixture type for each part
• A unary template for each part, given its mixture type
• Pairwise springs between two connected parts, given the mixture type of each

[Figure: pictorial structure with head, torso, and leg parts]
Example of mini-parts: near-vertical and near-horizontal limbs

Deep Learning for Pose Estimation

• Holistic view
– e.g., joint position regression
• Local view
– e.g., body part detection
• Combining global and local information
– e.g., body part detection + joint position regression
• Others
– e.g., motion features, pose estimation in videos

Holistic View

DeepPose: Human Pose Estimation via Deep Neural Networks


Holistic Reasoning

• Why holistic reasoning?
– Besides the extreme variability in articulation, many of the joints are barely visible, so they have to be inferred from the context of the whole body

DeepPose: A CNN Regressor

• Network architecture: AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012 (ImageNet)
– The first time a deep model was shown to be effective at large scale

[Toshev & Szegedy, CVPR 2014]

Results on LSP (Leeds Sports Pose) dataset


Cascade of Pose Regressors

• The initial pose estimates are very coarse:
– Because of its fixed input size of 220 × 220, the network has limited capacity to look at detail
– Train a cascade of pose regressors for more precise joint localization
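The cascade idea can be sketched as a loop in which each stage looks at a crop centred on the current estimate of a joint and predicts a correction. This is only a minimal illustration, not the DeepPose implementation: `refine_pose`, `stage_regressors`, and `box_size` are hypothetical names, and the convention that each regressor returns an offset as a fraction of the crop size is an assumption.

```python
def refine_pose(image, joints, stage_regressors, box_size):
    """Refine joint estimates with a cascade of regressors (illustrative sketch).

    joints: list of (x, y) estimates from the initial holistic regressor.
    stage_regressors: hypothetical callables, one per refinement stage. Each
        takes (image, box, joint_index) and returns an offset (dx, dy)
        expressed as a fraction of the box size (an assumed convention).
    box_size: side length of the square crop around each joint (assumed fixed).
    """
    joints = [(float(x), float(y)) for x, y in joints]
    half = box_size / 2.0
    for regress in stage_regressors:
        refined = []
        for j, (x, y) in enumerate(joints):
            # Crop centred on the current estimate, so the stage sees more detail.
            box = (x - half, y - half, x + half, y + half)
            dx, dy = regress(image, box, j)
            # Map the normalized offset back to image coordinates.
            refined.append((x + dx * box_size, y + dy * box_size))
        joints = refined
    return joints
```

In practice each stage would be a trained network; any callable with the same signature can stand in for it when testing the control flow.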


Refined pose estimation


Percentage of Correct Parts (PCP) on LSP dataset


Local Appearance Method

Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations

Motivation

• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships

[Figure: example lower-arm and upper-arm patches with 6 mixture types for each pair, showing the number of relationships for parts with one and with two neighbors]

[Chen & Yuille, NIPS 2014]

Tree-structured Relational Graph

– V: body parts
– E: pairwise relationships between parts
– p_i: pixel location of part i
– t_ij: pairwise relationship type between parts i and j
– Defined by the relative position of the two parts
– In the experiments: 13 types for each pair

Formulation

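$$F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)$$

• Part presence: the unary term A_i, weighted by ω_i
• Pairwise deformation: the part of R weighted by ω_ij^{t_ij}
• Pairwise relationship: the part of R weighted by ω_ij

Inference:
• Tree structure
• Can be solved efficiently by dynamic programming

Learning:
• The weights ω_i, ω_ij, and ω_ij^{t_ij} are learned by a latent structured SVM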
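Because the graph is a tree, the best-scoring configuration can be found exactly by passing messages from the leaves to the root. The sketch below is a generic max-sum dynamic program over a small discrete set of candidate states per part, not the authors' code. The names `tree_max_sum`, `unary`, `pairwise`, and `parent` are hypothetical, and the pairwise tables are assumed to already combine the deformation and relationship terms.

```python
def tree_max_sum(unary, pairwise, parent):
    """Exact max-sum inference on a tree (illustrative sketch).

    unary[i][s]: score of part i in candidate state s.
    pairwise[i][s_child][s_parent]: score between part i and its parent
        (pairwise[0] is unused because part 0 is the root).
    parent[i]: index of the parent of part i, with parent[i] < i and
        parent[0] == -1, so children always have larger indices.
    """
    n = len(unary)
    score = [[float(v) for v in u] for u in unary]  # best subtree score per state
    best_child = [None] * n
    # Leaves-to-root pass: fold each part into its parent's score table.
    for i in range(n - 1, 0, -1):
        j = parent[i]
        best_child[i] = []
        for sj in range(len(unary[j])):
            candidates = [score[i][si] + pairwise[i][si][sj]
                          for si in range(len(unary[i]))]
            best_si = max(range(len(candidates)), key=candidates.__getitem__)
            best_child[i].append(best_si)
            score[j][sj] += candidates[best_si]
    # Root-to-leaves pass: read back the maximizing states.
    root_state = max(range(len(score[0])), key=score[0].__getitem__)
    states = [root_state] + [0] * (n - 1)
    for i in range(1, n):
        states[i] = best_child[i][states[parent[i]]]
    return score[0][root_state], states
```

With K candidate states per part, the cost is on the order of K² per edge rather than exponential in the number of parts.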

Learning DCNN parameters

Derive the type label for each patch:
• Use the relative position to represent the pairwise relations
• Cluster the relative positions over the whole training set
• Type label: cluster index
• Mean relative position: cluster center
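The label derivation above can be illustrated with a small clustering routine. The slide only says that relative positions are clustered. Using k-means with a deterministic farthest-point initialisation is an assumption of this sketch, and `cluster_relative_positions` is a hypothetical name.

```python
import math


def cluster_relative_positions(rel_pos, num_types, num_iters=20):
    """Cluster relative positions into relationship types (illustrative sketch).

    rel_pos: list of (dx, dy) offsets of a neighboring part relative to a part,
        collected over the training set.
    Returns (labels, centers): the type label of each sample (cluster index)
    and the mean relative position of each type (cluster center).
    """
    points = [(float(x), float(y)) for x, y in rel_pos]
    # Farthest-point initialisation: deterministic, chosen here for simplicity.
    centers = [points[0]]
    while len(centers) < num_types:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(num_iters):
        # Assignment step: nearest center gives the type label.
        labels = [min(range(num_types), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for k in range(num_types):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers
```

Each training patch would then be labeled with the cluster index of its relative position to each neighbor.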

Casting Full Connections into Convolutions

[Figure: for the elbow, the fully convolutional network outputs a part presence map and a pairwise relationship map]

PCP and PDJ on LSP dataset and FLIC dataset

Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6

[Figure: PDJ curves on the LSP and FLIC datasets]

Combining Local Appearance and Holistic View

Dual-Source Deep Neural Networks for Human Pose Estimation


Dual-Source CNN

• Integrate both the local part appearance and the holistic view of each local part for more accurate human pose estimation

• Each input is an image pair:
– Part patches
– Body patches

Part patches: incorporate local appearance

• Generated by region proposals with some restrictions:
– Not too small (must contain at least one body part)
– Not too big (may contain too many body parts and lack sufficient resolution)
• All classes of joints are covered by a similar number of part patches
• During testing, part patches are selected from multi-scale sliding windows

Body patches: holistic view

• Also from region proposals:
– Must cover all body parts
– In the testing stage, the body patch can be generated by human detection
• For DS-CNN, each training sample is made up of 3 components:
– A part patch
– A body patch
– A binary mask specifying the location of the part patch in the body patch

Training of the DS-CNN

[Figure: DS-CNN with shared weights and two outputs]
• Classification (softmax)
• Regression (L2 distance)
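The two outputs suggest a training objective with a classification term and a regression term. The sketch below shows one plausible way to combine them for a single sample. The weighting factor `reg_weight`, the use of the squared Euclidean distance, and the function names are assumptions of this illustration, not details taken from the DS-CNN paper.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def l2_loss(pred_xy, target_xy):
    """Squared Euclidean distance between predicted and true joint location."""
    return sum((p - t) ** 2 for p, t in zip(pred_xy, target_xy))


def dual_source_loss(logits, label, pred_xy, target_xy, reg_weight=1.0):
    """Classification + regression loss for one sample.

    logits: scores over part classes for a part patch.
    pred_xy / target_xy: predicted and ground-truth joint location.
    reg_weight: hypothetical factor balancing the two terms.
    """
    return (softmax_cross_entropy(logits, label)
            + reg_weight * l2_loss(pred_xy, target_xy))
```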

Testing

• Part heat map:
– Same size as the input image
– The probability predicted for each sliding window is distributed uniformly over the window
– Sum and average the contributions at every pixel

[Figure: class probabilities of a part patch over classes 0 to K]
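One possible reading of the heat map construction is sketched below: each window spreads its probability evenly over its pixels, and every pixel averages the shares it receives. The exact normalisation used by the authors is not given on the slide, so dividing by the window area and by the number of covering windows are both assumptions. `part_heat_map` is a hypothetical name, and plain lists are used to keep the sketch dependency-free.

```python
def part_heat_map(height, width, windows, probs):
    """Accumulate sliding-window probabilities into a per-pixel heat map.

    windows: list of (x0, y0, x1, y1) boxes with exclusive upper bounds.
    probs: probability of one part class for each window.
    """
    heat = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for (x0, y0, x1, y1), p in zip(windows, probs):
        # Spread the window probability uniformly over its pixels (assumption).
        share = p / ((x1 - x0) * (y1 - y0))
        for y in range(y0, y1):
            for x in range(x0, x1):
                heat[y][x] += share
                count[y][x] += 1
    # Average over the windows that cover each pixel; uncovered pixels stay 0.
    return [[heat[y][x] / count[y][x] if count[y][x] else 0.0
             for x in range(width)]
            for y in range(height)]
```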

Testing

• Final pose estimation:
– Weighted average of the predicted joint locations within part patches with high responses
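The final step can be sketched as a response-weighted mean over the patches that respond strongly for a joint. The threshold value, the fallback to the single best patch, and the name `estimate_joint` are assumptions of this illustration.

```python
def estimate_joint(pred_xy, responses, threshold=0.5):
    """Response-weighted average of joint predictions from part patches.

    pred_xy: list of (x, y) joint locations predicted by individual patches,
        already mapped to image coordinates.
    responses: classification response of each patch for this joint.
    threshold: hypothetical cut-off defining a "high" response.
    """
    kept = [(xy, r) for xy, r in zip(pred_xy, responses) if r >= threshold]
    if not kept:
        # No patch is confident enough: fall back to the single best patch.
        best = max(range(len(responses)), key=responses.__getitem__)
        return (float(pred_xy[best][0]), float(pred_xy[best][1]))
    total = sum(r for _, r in kept)
    x = sum(xy[0] * r for xy, r in kept) / total
    y = sum(xy[1] * r for xy, r in kept) / total
    return (x, y)
```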

Results: PCP on LSP


Other Methods & Applications

• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation

• Flowing ConvNets for Human Pose Estimation in Videos


Using Motion Features for Human Pose Estimation

• Motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.

Image credit: T. Brox and J. Malik, “Large displacement optical flow: descriptor matching in variational motion estimation,” IEEE TPAMI, 33(3): 500-513, 2011

MoDeep: Using Motion Features for Human Pose Estimation

• Extended the Frames Labeled In Cinema (FLIC) dataset with additional motion features

[Figure: average of a frame pair and the corresponding optical flow]

MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jain et al., ACCV 2014

Multi-resolution efficient sliding window model


Simple Spatial Model

• FLIC: images contain multiple people, but only one person is annotated
• Testing: incorporate the annotated torso position with a simple spatial model

[Figure: predicted left shoulder, spatial mask of the left shoulder, and the result]
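The figure suggests that the predicted heat map is combined with a spatial mask so that responses from other people are suppressed. A minimal sketch of that idea follows. The element-wise product, the way the mask is built around the annotated torso, and the name `apply_spatial_mask` are assumptions, not details from the MoDeep paper.

```python
def apply_spatial_mask(heat_map, mask):
    """Suppress detections outside the plausible region for a joint.

    heat_map: 2D list of joint responses, indexed [y][x].
    mask: 2D list of the same size with weights in [0, 1]; assumed to be
        positioned relative to the annotated torso of the target person.
    Returns the masked map and the (x, y) location of its maximum.
    """
    masked = [[h * m for h, m in zip(h_row, m_row)]
              for h_row, m_row in zip(heat_map, mask)]
    best = max(((x, y) for y in range(len(masked))
                for x in range(len(masked[0]))),
               key=lambda xy: masked[xy[1]][xy[0]])
    return masked, best
```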

Experiment results

[Figure: results without and with motion features under occlusion, cluttered background, and motion blur]

Flowing ConvNets for Human Pose Estimation in Videos

• A CNN can benefit from temporal context by combining information across multiple frames using optical flow.

Spatial ConvNet

Why regress a heatmap instead of joint coordinates?
• The network output can be multi-modal, with confidence at several locations
• Regressing coordinates directly is a highly non-linear mapping that is more difficult to learn
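To make the heatmap formulation concrete, the sketch below builds a regression target as a Gaussian centred on the annotated joint and decodes a prediction by taking the maximum. The value of `sigma` and both function names are assumptions, and plain lists are used to keep the example dependency-free.

```python
import math


def gaussian_heatmap(height, width, joint_xy, sigma=1.5):
    """Target heatmap: an unnormalized Gaussian centred on the joint."""
    jx, jy = joint_xy
    return [[math.exp(-((x - jx) ** 2 + (y - jy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def decode_heatmap(heatmap):
    """Return the (x, y) position of the strongest response."""
    return max(((x, y) for y in range(len(heatmap))
                for x in range(len(heatmap[0]))),
               key=lambda xy: heatmap[xy[1]][xy[0]])
```

A network trained against such targets can place confidence at several locations at once, which a single coordinate pair cannot express.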

Warping neighbouring heatmaps for improving pose estimates

• Heatmaps from frames (t − n) and (t + n), warped to frame t using tracks from optical flow (green and blue lines in the figure), can help correct a wrongly estimated part location
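The warping step can be sketched as a backward lookup: every pixel of frame t reads the neighboring heatmap at the position the flow points to. The flow convention (displacement from frame t to the neighbor), nearest-neighbor sampling, and the plain mean used for pooling are all assumptions of this sketch, and the function names are hypothetical.

```python
def warp_heatmap(neighbor, flow):
    """Warp a neighboring frame's heatmap into frame t.

    neighbor: 2D list, heatmap of frame t - n or t + n, indexed [y][x].
    flow: 2D list of (dx, dy); pixel (x, y) in frame t is assumed to
        correspond to (x + dx, y + dy) in the neighboring frame.
    """
    h, w = len(neighbor), len(neighbor[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbor sampling, clamped to the image border.
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            warped[y][x] = neighbor[sy][sx]
    return warped


def pool_heatmaps(heatmaps):
    """Combine aligned heatmaps with a plain mean (a simplifying assumption)."""
    n = len(heatmaps)
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(hm[y][x] for hm in heatmaps) / n for x in range(w)]
            for y in range(h)]
```

A mean is the simplest pooling choice; learned weights per frame would be a natural alternative.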

Results


Ongoing Project

• End-to-end pose estimation:
– Joint learning of pose features and pose configurations
– Allows local appearance to be fine-tuned by the pose configuration

[Figure: network producing unary responses and pairwise relationships]

Ongoing Project

• Add constraints between body parts in a network

[Figure: unary responses (x), a distance transform with weights w_dt and w_m, and outputs (z) linked through pairwise relationships]

Preliminary Results (PCP on LSP)

Head  Torso  U.arms  L.arms  U.legs  L.legs  Mean
84.7  91     68.7    53.6    80.7    73.3    72.82

• Future work:
– Pose relational graph learning
– Multi-task learning
  • Human detection
  • Human segmentation
– Combining global information

Recent developments

• DeepPose: Human Pose Estimation via Deep Neural Networks
– A. Toshev, C. Szegedy, CVPR 2014
• Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation
– J. Tompson, A. Jain, Y. LeCun, C. Bregler, NIPS 2014
• Human Pose Estimation with Iterative Error Feedback
– J. Carreira et al., arXiv preprint arXiv:1507.06550, 2015
• Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
– S. Li, W. Zhang, A. B. Chan, arXiv preprint arXiv:1508.06708, 2015
• Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network
– S. Li, Z. Q. Liu, A. B. Chan, CVPR Workshop 2014
• Flowing ConvNets for Human Pose Estimation in Videos
– T. Pfister, J. Charles, A. Zisserman, ICCV 2015
• R-CNNs for Pose Estimation and Action Detection
– G. Gkioxari, B. Hariharan, R. Girshick, J. Malik, arXiv preprint arXiv:1406.5212, 2014
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
– A. Jain, J. Tompson, Y. LeCun, C. Bregler, ACCV 2014
• Efficient Object Localization Using Convolutional Networks
– J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler, CVPR 2015
• Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation
– X. Fan, K. Zheng, Y. Lin, S. Wang, CVPR 2015
• Parsing Occluded People by Flexible Compositions
– X. Chen, A. L. Yuille, CVPR 2015
• Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
– X. Chen, A. L. Yuille, NIPS 2014
• …

Thank you


Evaluation Metrics

• Percentage of Correct Parts (PCP)
– Measures the percentage of correctly localized body parts
– A candidate body part is treated as correct if its segment endpoints lie within 50% of the ground-truth segment length from their annotated locations
• Percentage of Detected Joints (PDJ)
– Measures performance as a curve: the percentage of correctly localized joints as the localization precision threshold varies
– The threshold is normalized by a scale defined as the distance between the left shoulder and the right hip
– Invariant to scale
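Both metrics can be written in a few lines for a single image. The sketch below follows the definitions above, with two assumptions: PCP requires both endpoints to pass the test (the strict variant), and a joint counts as detected when its normalized error is at most the threshold. The function names and the limb and joint indexing are hypothetical.

```python
import math


def pcp(pred, gt, limbs, alpha=0.5):
    """Fraction of limbs whose two endpoints are both within alpha * length.

    pred, gt: lists of (x, y) joints; limbs: list of (a, b) joint index pairs.
    """
    correct = 0
    for a, b in limbs:
        limit = alpha * math.dist(gt[a], gt[b])  # ground-truth limb length
        if (math.dist(pred[a], gt[a]) <= limit
                and math.dist(pred[b], gt[b]) <= limit):
            correct += 1
    return correct / len(limbs)


def pdj(pred, gt, left_shoulder, right_hip, thresholds):
    """Fraction of detected joints for each normalized distance threshold."""
    scale = math.dist(gt[left_shoulder], gt[right_hip])  # torso diameter
    errors = [math.dist(p, g) / scale for p, g in zip(pred, gt)]
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds]
```

Over a dataset, the per-image counts would be accumulated before dividing, and the PDJ values plotted against the thresholds give the curve.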
