CS 1699: Intro to Computer Vision
Human Pose and Actions
Prof. Adriana Kovashka, University of Pittsburgh
December 8, 2015
Today
• Human pose and actions: Introduction
• Estimating human pose
• Recognizing human actions
– Using specialized features
– Using pose
– Using objects
– From ego-centric video
Next time (Last class)
• Review for the final exam + OMETs
• By Wednesday night, post on Piazza questions or anything you want me to review, for participation credit
• Extra office hours on Friday, 2-3pm
Final Exam
• Monday, Dec. 14, 12pm
• Same room (5502 Sennott Square)
• Similar to midterm exam (mostly short questions and a few problems), but longer (100 points)
• Will only cover topics discussed after midterm (but some of these use topics from first half)
Homework 4
Mean = 77.16, median = 99, max = 123
Homework 5
• Due Thursday
• See Piazza for correction about how to get probabilities for Part III
Participation
• Tentative grades entered on CourseWeb
• Median is 80%
What is an action/activity?
Action: a transition from one state to another
• Who is the actor?
• How is the state of the actor changing?
• What (if anything) is being acted on?
• How is that thing changing?
• What is the purpose of the action (if any)?
• Could be more or less complex
Adapted from Derek Hoiem
Terminology: Human activity in video
No universal terminology, but approximately:
• “Actions”: atomic motion patterns – often gesture-like, single clear-cut trajectory, single nameable behavior (e.g., sit, wave arms)
• “Activity”: series or composition of actions (e.g., interactions between people)
• “Event”: combination of activities or actions (e.g., a football game, a traffic accident)
Adapted from Venu Govindaraju
How do we represent actions?
Categories: walking, hammering, dancing, skiing, sitting down, standing up, jumping
Poses
Nouns and Predicates: <man, swings, hammer>, <man, hits, nail, w/ hammer>
Derek Hoiem
How can we identify actions?
Motion
Pose
Held Objects
Nearby Objects
Derek Hoiem
Today
• Human pose and actions: Introduction
• Estimating human pose
• Recognizing human actions
– Using specialized features
– Using pose
– Using objects
– From ego-centric video
"Real-Time Human Pose Recognition in Parts from Single Depth Images"
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake
Best paper award at CVPR 2011
Adapted from Jamie Shotton
• Recognize large variety of human poses, all shapes & sizes
• Limited compute budget: super-real-time on Xbox 360 to allow games to run concurrently
Adapted from Jamie Shotton
[Figure: depth image with labeled body parts: right elbow, right hand, left shoulder, neck]
Jamie Shotton
• No temporal information: frame-by-frame
• Local pose estimate of parts: each pixel & each body joint treated independently (reduced training data and computation time)
• Very fast: simple depth image features, parallel decision forest classifier
Jamie Shotton
Pipeline: capture depth image & remove bg → infer body parts per pixel → cluster pixels to hypothesize body joint positions → fit model & track skeleton
Jamie Shotton
Compute P(ci | wi)
• pixels i = (x, y)
• body part ci
• image window wi
Discriminative approach: learn classifier P(ci | wi) from training data
Jamie Shotton
Train invariance to:
• Record mocap: 500k frames, distilled to 100k poses
• Retarget to several models
• Render (depth, body parts) pairs
Jamie Shotton
Depth comparisons
Very fast to compute
[Figure: input depth image with example offset probes Δ]

f(I, x) = d_I(x) − d_I(x + Δ)

where d_I is the image depth, x the image coordinate, Δ the offset, and f the feature response.
Adapted from Jamie Shotton
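The depth-comparison feature above can be sketched in a few lines. This is an illustrative reading of the slide's formula, not the paper's exact variant (the published version also normalizes the offset by the depth at x); the large background value for out-of-image offsets is an assumption.

```python
import numpy as np

def depth_feature(depth, x, y, dx, dy):
    """Sketch of f(I, x) = d_I(x) - d_I(x + delta).

    `depth` is a 2D array of per-pixel depth values; (dx, dy) is the
    offset delta. Out-of-image offsets read as a large "background"
    depth (an assumption here, not stated on the slide).
    """
    h, w = depth.shape
    d_here = depth[y, x]
    ox, oy = x + dx, y + dy
    if 0 <= ox < w and 0 <= oy < h:
        d_offset = depth[oy, ox]
    else:
        d_offset = 1e6  # background: far away
    return d_here - d_offset

# Toy depth image: flat background at 2.0m with one nearer pixel
depth = np.full((4, 4), 2.0)
depth[1, 2] = 1.5
f = depth_feature(depth, x=2, y=1, dx=1, dy=0)  # 1.5 - 2.0 = -0.5
```

A large negative response like this signals that the probe landed on farther (background) geometry, which is exactly the contrast the feature is designed to pick up cheaply.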
Toy example: distinguish left (L) and right (R) sides of the body.
To classify pixel x, start at the root node and test fΘ(I, x; Δ1) > t1:
• no → reach a leaf storing a distribution P(c) over {L, R}
• yes → test fΘ(I, x; Δ2) > t2, which again leads to leaves, each storing its own P(c)
Adapted from Jamie Shotton
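The toy L/R tree can be sketched as nested threshold tests on the depth feature. The offsets, thresholds, and leaf distributions below are made up for illustration, and the coordinate clamping at image borders is a simplifying assumption.

```python
import numpy as np

def feature(depth, x, y, dx, dy):
    # Depth-comparison feature with border clamping (an assumption)
    h, w = depth.shape
    ox = min(max(x + dx, 0), w - 1)
    oy = min(max(y + dy, 0), h - 1)
    return depth[y, x] - depth[oy, ox]

# Each internal node tests f(I, x; delta) > t; each leaf stores P(c).
# All offsets, thresholds, and probabilities here are illustrative.
tree = {
    "test": (4, 0, 0.3),          # (dx, dy, threshold t1)
    "yes": {
        "test": (0, 4, 0.3),      # (dx, dy, threshold t2)
        "yes": {"leaf": {"L": 0.9, "R": 0.1}},
        "no":  {"leaf": {"L": 0.5, "R": 0.5}},
    },
    "no": {"leaf": {"L": 0.1, "R": 0.9}},
}

def classify(node, depth, x, y):
    """Walk the tree until a leaf; return its class distribution."""
    while "leaf" not in node:
        dx, dy, t = node["test"]
        node = node["yes"] if feature(depth, x, y, dx, dy) > t else node["no"]
    return node["leaf"]

# Toy depth image where depth increases to the right: the first test
# compares against a farther pixel, fails, and we take the "no" branch.
depth = np.tile(np.arange(8, dtype=float), (8, 1))
p = classify(tree, depth, 2, 2)
```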
[Figure: input depth / ground truth parts / inferred parts (soft)]
Jamie Shotton
[Plots: average per-class accuracy (roughly 30–65%) vs. depth of trees, on synthetic and real test data]
Jamie Shotton
• Each tree trained on a different random subset of images ("bagging" helps avoid over-fitting)
• Average tree posteriors
[Amit & Geman 97][Breiman 01]
[Geurts et al. 06]
[Figure: trees 1 … T each map (I, x) to a posterior Pt(c)]

P(c | I, x) = (1/T) · Σ_{t=1..T} Pt(c | I, x)
Jamie Shotton
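Averaging the per-tree posteriors is a one-liner over a stacked array. The per-tree distributions below are made-up numbers for illustration, not results from the paper.

```python
import numpy as np

# Forest posterior: P(c | I, x) = (1/T) * sum_t P_t(c | I, x).
# Rows are per-tree posteriors over 3 body-part classes (toy values).
tree_posteriors = np.array([
    [0.7, 0.2, 0.1],   # tree 1: P_1(c)
    [0.5, 0.4, 0.1],   # tree 2: P_2(c)
    [0.6, 0.3, 0.1],   # tree 3: P_3(c)
])

# Average over the tree axis to get the forest posterior
forest_posterior = tree_posteriors.mean(axis=0)  # [0.6, 0.3, 0.1]
```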
[Figure: ground truth vs. most-likely inferred body parts using 1, 3, and 6 trees]
[Plot: average per-class accuracy (roughly 40–55%) vs. number of trees (1–6)]
Jamie Shotton
[Figure: input depth, inferred body parts, and inferred joint positions (modes found using mean shift), shown in front, side, and top views]
no tracking or smoothing
Jamie Shotton
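Mode-finding with mean shift can be sketched as follows: starting from a seed, repeatedly move to the mean of nearby points until convergence. This assumes a flat kernel and a hand-picked bandwidth; the published system uses a weighted, depth-aware density estimate.

```python
import numpy as np

def mean_shift(points, seed, bandwidth=1.0, iters=50):
    """Flat-kernel mean shift: converge to a local density mode."""
    x = np.asarray(seed, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(points - x, axis=1)
        inside = points[d < bandwidth]
        if len(inside) == 0:
            break
        new_x = inside.mean(axis=0)      # shift to local mean
        if np.linalg.norm(new_x - x) < 1e-6:
            break                        # converged to a mode
        x = new_x
    return x

# Two toy 2D clusters of "joint votes"; the seed pulls toward cluster A
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # cluster A
                [5.0, 5.0], [5.1, 5.0]])               # cluster B
mode = mean_shift(pts, seed=[0.5, 0.5])
```

Seeding one run per body part and keeping the highest-density modes gives the joint hypotheses the pipeline clusters pixels into.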
Today
• Human pose and actions: Introduction
• Estimating human pose
• Recognizing human actions
– Using specialized features
– Using pose
– Using objects
– From ego-centric video
Representing Actions
Tracked Points
Matikainen et al. 2009
Adapted from Derek Hoiem
Representing Actions
Space-Time Interest Points
Laptev 2005
• Corner detectors in space+time
Adapted from Derek Hoiem
“Talk on phone”
“Get out of car”
Derek Hoiem
Learning realistic human actions from movies, Laptev et al. 2008
Approach
• Space-time interest point detectors
• Descriptors
– HOG, HOF
• Pyramid histograms (3x3x2)
• SVMs with Chi-Squared Kernel
Interest Points
Spatio-Temporal Binning
Derek Hoiem, figures from Ivan Laptev
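The chi-squared SVM kernel mentioned above can be sketched directly from its standard form. The scale A is an assumption here (in practice it is often set to the mean chi-squared distance over training pairs), and the epsilon guard against empty bins is a numerical convenience.

```python
import numpy as np

def chi2_kernel(h1, h2, A=1.0, eps=1e-10):
    """Exponential chi-squared kernel between two histograms:
    K(h1, h2) = exp(-(1/(2A)) * sum_i (h1_i - h2_i)^2 / (h1_i + h2_i))
    """
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    chi2 = np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return np.exp(-chi2 / (2.0 * A))

# Identical HOG/HOF-style histograms give kernel value 1; dissimilar
# histograms decay toward 0.
h = np.array([0.5, 0.3, 0.2])
k_same = chi2_kernel(h, h)
k_diff = chi2_kernel(h, np.array([0.2, 0.3, 0.5]))
```

Plugging this kernel into a standard SVM (e.g. via a precomputed Gram matrix) reproduces the classifier setup described above.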
Results
Derek Hoiem, figures from Ivan Laptev
Today
• Human pose and actions: Introduction
• Estimating human pose
• Recognizing human actions
– Using specialized features
– Using pose
– Using objects
– From ego-centric video
Human-Object Interaction
[Figure labels: Torso, Head]
• Human pose estimation
Holistic image based classification
Integrated reasoning
Yao/Fei-Fei
Human-Object Interaction
[Figure label: Tennis racket]
• Human pose estimation
Holistic image based classification
Integrated reasoning
• Object detection
Yao/Fei-Fei
Human-Object Interaction
• Human pose estimation
Holistic image based classification
Integrated reasoning
• Object detection
[Figure labels: Torso, Head, Tennis racket]
Activity: Tennis Forehand
• Action categorization
Yao/Fei-Fei
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009
• Difficult part appearance
• Self-occlusion
• Image region looks like a body part
Human pose estimation & Object detection
Human pose estimation is challenging.
Yao/Fei-Fei
Human pose estimation & Object detection
Object detection can facilitate pose estimation, given the object is detected.
Yao/Fei-Fei
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009
• Small, low-resolution, partially occluded
• Image region similar to detection target
Human pose estimation & Object detection
Object detection is challenging.
Yao/Fei-Fei
Human pose estimation & Object detection
Pose estimation can facilitate object detection, given the pose is estimated.
Yao/Fei-Fei
Human pose estimation & Object detection
Mutual Context
Yao/Fei-Fei
Learning Results
Tennis serve, Volleyball smash, Tennis forehand
Yao/Fei-Fei
Activity Classification Results
Classification accuracy:
• Bag-of-words (SIFT+SVM): 52.5%
• Gupta et al, 2009: 78.9%
• Our model: 83.3%
[Bar chart also shows per-class results, e.g. cricket shot, tennis forehand]
Yao/Fei-Fei
Today
• Human pose and actions: Introduction
• Estimating human pose
• Recognizing human actions
– Using specialized features
– Using pose
– Using objects
– From ego-centric video
Detecting Activities of Daily Living
in First-person Camera Views
Hamed Pirsiavash, Deva Ramanan
CVPR 2012
Hamed Pirsiavash
Motivation
A sample video of Activities of Daily Living
Applications
Tele-rehabilitation
• Kopp et al., Arch. of Physical Medicine and Rehabilitation, 1997.
• Catz et al., Spinal Cord, 1997.
Long-term at-home monitoring
Hamed Pirsiavash
Applications
Life-logging
• Gemmell et al., "MyLifeBits: a personal database for everything." Communications of the ACM, 2006.
• Hodges et al., "SenseCam: A retrospective memory aid", UbiComp, 2006.
So far, mostly “write-only” memory!
This is the right time for computer vision community to get involved.
Hamed Pirsiavash
Wearable ADL detection
• ADL actions derived from medical literature on patient rehabilitation
• It is easy to collect natural data
Hamed Pirsiavash
Challenges
What features to use?
A spectrum from low-level features (weak semantics) to high-level features (strong semantics):
• Space-time interest points (Laptev, IJCV'05)
• Human pose
Difficulties of pose:
• Detectors are not accurate enough
• Not useful in first person camera views
Hamed Pirsiavash
Challenges
What features to use?
• Adding object-centric features at the high-semantics end of the spectrum (space-time interest points → human pose → object-centric features)
Hamed Pirsiavash
Challenges
Long-scale temporal structure
Wearable data (making tea): start boiling water → do other things (while waiting) → pour in cup → drink tea
"Classic" data: boxing
Adapted from Hamed Pirsiavash
Appearance feature: bag of objects
Video clip → bag of detected objects (fridge, stove, TV, …) → SVM classifier
Hamed Pirsiavash
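The bag-of-detected-objects descriptor can be sketched as a normalized count histogram over an object vocabulary. The vocabulary below is a made-up toy subset; in the real system the counts would come from per-frame object detectors.

```python
import numpy as np

# Toy object vocabulary (illustrative, not the paper's actual list)
VOCAB = ["fridge", "stove", "tv", "kettle", "mug"]

def bag_of_objects(detections):
    """Histogram of detected object names over a video clip.

    `detections` is a flat list of object names detected in the clip;
    names outside the vocabulary are ignored.
    """
    hist = np.zeros(len(VOCAB))
    for name in detections:
        if name in VOCAB:
            hist[VOCAB.index(name)] += 1
    # L1-normalize so clips of different lengths are comparable
    total = hist.sum()
    return hist / total if total > 0 else hist

desc = bag_of_objects(["fridge", "mug", "mug", "stove"])
```

The resulting fixed-length vector is what gets fed to the SVM classifier in the pipeline above.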
Temporal pyramid
Coarse-to-fine correspondence matching with a multi-layer pyramid, inspired by "Spatial Pyramid" CVPR'06 and "Pyramid Match Kernels" ICCV'05
Video clip → temporal pyramid descriptor → SVM classifier
Hamed Pirsiavash
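The temporal pyramid descriptor can be sketched by splitting the clip into 1, 2, 4, … segments, pooling a histogram per segment, and concatenating. The per-frame object-count matrix here is an illustrative stand-in for real detector output.

```python
import numpy as np

def temporal_pyramid(frame_feats, levels=2):
    """Concatenate per-segment histograms at successively finer splits.

    `frame_feats` is a (num_frames, num_objects) array of per-frame
    object counts (a toy stand-in for detector output). Level 0 pools
    the whole clip, level 1 pools halves, level 2 quarters, etc.
    """
    T = len(frame_feats)
    desc = []
    for level in range(levels):
        n_segments = 2 ** level
        for s in range(n_segments):
            lo = s * T // n_segments
            hi = (s + 1) * T // n_segments
            desc.append(frame_feats[lo:hi].sum(axis=0))
    return np.concatenate(desc)

# 4 frames x 2 object classes; pyramid has 1 + 2 = 3 segment histograms
feats = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
desc = temporal_pyramid(feats, levels=2)
```

The coarse level captures which objects appear at all, while finer levels capture roughly when they appear, which is what makes long actions like making tea separable.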
Accuracy on 18 action categories:
• Our model: 40.6%
• STIP baseline: 22.8%
Hamed Pirsiavash
Summary: Human actions
• Action recognition still an open problem
– How to represent actions?
• Types of data: atomic and more complex actions, ego-centric video
• Common representations
– Space-time interest points
– Pose
– Objects (and temporal pyramids of objects)
• Pose
– Can be approached as a classification problem using depth data