
CS 1699: Intro to Computer Vision

Human Pose and Actions

Prof. Adriana Kovashka, University of Pittsburgh

December 8, 2015

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Next time (Last class)

• Review for the final exam + OMETs

• By Wednesday night, post on Piazza any questions or topics you want me to review, for participation credit

• Extra office hours on Friday, 2-3pm

Final Exam

• Monday, Dec. 14, 12pm

• Same room (5502 Sennott Square)

• Similar to midterm exam (mostly short questions and a few problems), but longer (100 points)

• Will only cover topics discussed after midterm (but some of these use topics from first half)

Homework 4

Mean = 77.16, median = 99, max = 123

Homework 5

• Due Thursday

• See Piazza for a correction about how to get probabilities for Part III

Participation

• Tentative grades entered on CourseWeb

• Median is 80%

What is an action/activity?

Action: a transition from one state to another
• Who is the actor?
• How is the state of the actor changing?
• What (if anything) is being acted on?
• How is that thing changing?
• What is the purpose of the action (if any)?
• Could be more or less complex

Adapted from Derek Hoiem

Terminology: Human activity in video

No universal terminology, but approximately:

• “Actions”: atomic motion patterns – often gesture-like, single clear-cut trajectory, single nameable behavior (e.g., sit, wave arms)

• “Activity”: series or composition of actions (e.g., interactions between people)

• “Event”: combination of activities or actions (e.g., a football game, a traffic accident)

Adapted from Venu Govindaraju

How do we represent actions?

Categories: walking, hammering, dancing, skiing, sitting down, standing up, jumping

Poses

Nouns and Predicates: <man, swings, hammer>, <man, hits, nail, w/ hammer>

Derek Hoiem

How can we identify actions?

• Motion
• Pose
• Held objects
• Nearby objects

Derek Hoiem

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake

Best paper award at CVPR 2011

Adapted from Jamie Shotton

• Recognize a large variety of human poses, all shapes & sizes

• Limited compute budget: super-real-time on Xbox 360, to allow games to run concurrently

Adapted from Jamie Shotton

[Figure: depth image with labeled body parts: right elbow, right hand, left shoulder, neck]

Jamie Shotton

• No temporal information: frame-by-frame

• Local pose estimate of parts: each pixel & each body joint treated independently
  – reduced training data and computation time

• Very fast: simple depth image features, parallel decision forest classifier

Jamie Shotton

Pipeline: capture depth image & remove background → infer body parts per pixel → cluster pixels to hypothesize body joint positions → fit model & track skeleton

Jamie Shotton

Compute P(ci | wi) for each pixel i = (x, y), where ci is the body part label and wi is an image window around pixel i.

Discriminative approach: learn the classifier P(ci | wi) from training data.

Jamie Shotton

Synthetic training data, to train invariance to pose and body shape:
• Record mocap: 500k frames, distilled to 100k poses
• Retarget to several body models
• Render (depth, body parts) pairs

Jamie Shotton

Depth comparison features: very fast to compute.

f_Θ(I, x) = d_I(x) − d_I(x + Δ)

where d_I is the image depth, x is the image coordinate, and Δ is the pixel offset parameterizing the feature Θ; the difference is the feature response.

[Figure: input depth image with example offset probes Δ]

Adapted from Jamie Shotton
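The feature itself is just two depth lookups and a subtraction. A minimal sketch in NumPy (the helper name and background handling are assumptions; the published version additionally scales the offset by 1/d_I(x) to make the feature depth-invariant):

```python
import numpy as np

def depth_feature(depth, x, y, dx, dy, large=1e6):
    """Depth-comparison feature f_theta(I, x) = d_I(x) - d_I(x + delta).

    depth : 2D array of depth values (background pixels set to a large value)
    (x, y): pixel coordinate being classified
    (dx, dy): the offset delta that parameterizes this feature
    """
    h, w = depth.shape
    ox, oy = x + dx, y + dy
    # Offsets falling outside the image behave like background.
    d_off = depth[oy, ox] if (0 <= ox < w and 0 <= oy < h) else large
    return depth[y, x] - d_off
```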

To classify pixel x, start at the root of a decision tree: at each split node, evaluate a depth feature against a threshold (f_Θ(I, x; Δ1) > t1, then f_Θ(I, x; Δ2) > t2, ...) and follow the yes/no branch; each leaf stores a distribution P(c) over body parts.

Toy example: distinguish the left (L) and right (R) sides of the body, so each leaf holds P(c) over {L, R}.

Adapted from Jamie Shotton

[Figure: input depth, ground truth parts, and inferred parts (soft) as tree depth grows from 1 to 18]

Jamie Shotton

[Plots: average per-class accuracy (30%–65%) vs. depth of trees (up to 20), on synthetic and real test data]

Jamie Shotton

Randomized decision forest [Amit & Geman 97] [Breiman 01] [Geurts et al. 06]:

• Train each tree (tree 1 ... tree T) on a different random subset of images; "bagging" helps avoid over-fitting
• At test time, classify (I, x) with every tree and average the tree posteriors:

P(c | I, x) = (1/T) Σ_{t=1}^{T} P_t(c | I, x)

Jamie Shotton
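A minimal sketch of the forest at test time, reusing the depth_feature helper above (the node layout and class names are assumptions, not the shipped implementation):

```python
import numpy as np

class SplitNode:
    """Internal node: branch on a depth-comparison feature vs. a threshold."""
    def __init__(self, delta, t, left, right):
        self.delta, self.t = delta, t        # feature offset (dx, dy), threshold
        self.left, self.right = left, right  # "no" / "yes" children

class Leaf:
    """Leaf node: stores a learned distribution P(c) over body parts."""
    def __init__(self, posterior):
        self.posterior = np.asarray(posterior)

def classify_pixel(tree, depth, x, y):
    """Walk one tree from root to leaf and return that leaf's P(c)."""
    node = tree
    while isinstance(node, SplitNode):
        f = depth_feature(depth, x, y, *node.delta)  # as in the sketch above
        node = node.right if f > node.t else node.left
    return node.posterior

def forest_posterior(trees, depth, x, y):
    """Average tree posteriors: P(c|I,x) = (1/T) sum_t P_t(c|I,x)."""
    return np.mean([classify_pixel(t, depth, x, y) for t in trees], axis=0)
```

The arg-max of forest_posterior gives the most likely body part for the pixel.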

[Figure: ground truth vs. inferred body parts (most likely) using 1, 3, and 6 trees; plot of average per-class accuracy (40%–55%) vs. number of trees (1–6)]

Jamie Shotton

[Figure: input depth, inferred body parts, and inferred joint positions (modes found using mean shift), shown in front, side, and top views; no tracking or smoothing]

Jamie Shotton
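Mean shift is a gradient ascent on a kernel density estimate. A rough sketch of the mode-finding idea (NumPy; the bandwidth, weighting, and back-projection of part pixels to 3D points are assumptions, not the paper's exact procedure):

```python
import numpy as np

def mean_shift_modes(points, weights, bandwidth=0.05, n_iter=20):
    """Find density modes of weighted 3D points via mean shift.

    points : (N, 3) array of 3D points back-projected from pixels of one
             body part; weights : (N,) classifier confidences for those pixels.
    Each estimate is repeatedly shifted to the weighted mean of its Gaussian
    neighborhood until it settles on a local density maximum.
    """
    modes = points.copy()
    for _ in range(n_iter):
        # Squared distances from current mode estimates to all data points.
        d2 = ((modes[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))     # kernel weights
        modes = (k @ points) / k.sum(axis=1, keepdims=True)  # weighted mean
    return modes  # near-duplicate rows converged to the same mode; merge them
```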


Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Representing Actions

Tracked Points

Matikainen et al. 2009
Adapted from Derek Hoiem

Representing Actions

Space-Time Interest Points

Laptev 2005

• Corner detectors in space+time

Adapted from Derek Hoiem
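A hedged sketch of a Harris-style corner detector extended to space+time (SciPy; the smoothing scales and the constant k here are illustrative assumptions, though the det(μ) − k·trace³(μ) score is the form Laptev uses):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def stip_response(video, sigma=2.0, tau=1.5, k=0.005):
    """Space-time corner response for a grayscale video of shape (T, H, W).

    Builds the 3x3 spatio-temporal second-moment matrix mu from smoothed
    gradients and scores each voxel with det(mu) - k * trace(mu)^3.
    Local maxima of the response are the space-time interest points.
    """
    v = gaussian_filter(video.astype(float), (tau, sigma, sigma))
    It, Iy, Ix = np.gradient(v)  # gradients along time, y, x

    # Integrate gradient products over a space-time neighborhood.
    s = lambda a: gaussian_filter(a, (2 * tau, 2 * sigma, 2 * sigma))
    xx, yy, tt = s(Ix * Ix), s(Iy * Iy), s(It * It)
    xy, xt, yt = s(Ix * Iy), s(Ix * It), s(Iy * It)

    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace ** 3
```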

Representing Actions

Laptev 2005

“Talk on phone”

“Get out of car”

Derek Hoiem

Learning realistic human actions from movies, Laptev et al. 2008

Approach

• Space-time interest point detectors

• Descriptors

– HOG, HOF

• Pyramid histograms (3x3x2)

• SVMs with a chi-squared kernel (see the sketch below)

[Figures: interest points; spatio-temporal binning]

Derek Hoiem, figures from Ivan Laptev
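A sketch of the classification stage with scikit-learn (the data here are random placeholders standing in for per-video bag-of-features histograms of quantized HOG/HOF descriptors):

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

# Placeholder data: one histogram row per video clip.
rng = np.random.default_rng(0)
X_train = rng.random((40, 200))
y_train = np.arange(40) % 2          # two action classes, for illustration
X_test = rng.random((10, 200))

# Exponential chi-squared kernel:
#   k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
K_train = chi2_kernel(X_train, gamma=0.5)          # (40, 40)
K_test = chi2_kernel(X_test, X_train, gamma=0.5)   # (10, 40)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
pred = clf.predict(K_test)
```

The precomputed-kernel route keeps the histogram-friendly chi-squared similarity while reusing a standard SVM solver.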

Results

Derek Hoiem, figures from Ivan Laptev

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Human-Object Interaction

[Figure: detected torso and head]

• Human pose estimation
  – Holistic image-based classification
  – Integrated reasoning

Yao/Fei-Fei

Human-Object Interaction

[Figure: detected tennis racket]

• Human pose estimation
• Object detection
  – Holistic image-based classification
  – Integrated reasoning

Yao/Fei-Fei

Human-Object Interaction

[Figure: torso, head, and tennis racket detected; activity: tennis forehand]

• Human pose estimation
• Object detection
• Action categorization
  – Holistic image-based classification
  – Integrated reasoning

Yao/Fei-Fei

Human pose estimation & Object detection

Human pose estimation is challenging:
• Difficult part appearance
• Self-occlusion
• Image regions that look like a body part

[Felzenszwalb & Huttenlocher, 2005; Ren et al., 2005; Ramanan, 2006; Ferrari et al., 2008; Yang & Mori, 2008; Andriluka et al., 2009; Eichner & Ferrari, 2009]

Yao/Fei-Fei


Human pose estimation & Object detection

Facilitation: given that the object is detected, pose estimation becomes easier.

Yao/Fei-Fei

Human pose estimation & Object detection

Object detection is challenging:
• Small, low-resolution, partially occluded objects
• Image regions similar to the detection target

[Viola & Jones, 2001; Lampert et al., 2008; Divvala et al., 2009; Vedaldi et al., 2009]

Yao/Fei-Fei


Human pose estimation & Object detection

Facilitation: given that the pose is estimated, object detection becomes easier.

Yao/Fei-Fei

Human pose estimation & Object detection

Mutual Context

Yao/Fei-Fei

Learning Results

[Figures: learned pose-object models for tennis serve, volleyball smash, and tennis forehand]

Yao/Fei-Fei

Activity Classification Results

[Bar chart: classification accuracy (0.5–0.9); example activities shown: cricket shot, tennis forehand]

• Bag-of-words (SIFT+SVM): 52.5%
• Gupta et al., 2009: 78.9%
• Our model: 83.3%

Yao/Fei-Fei

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Detecting Activities of Daily Living in First-person Camera Views

Hamed Pirsiavash, Deva Ramanan

CVPR 2012

Hamed Pirsiavash

Motivation: a sample video of Activities of Daily Living (ADL)

Applications: tele-rehabilitation

• Kopp et al., Arch. of Physical Medicine and Rehabilitation, 1997
• Catz et al., Spinal Cord, 1997

Long-term at-home monitoring

Hamed Pirsiavash

Applications: life-logging

• Gemmell et al., "MyLifeBits: a personal database for everything", Communications of the ACM, 2006
• Hodges et al., "SenseCam: A retrospective memory aid", UbiComp, 2006

So far, mostly “write-only” memory!

This is the right time for the computer vision community to get involved.

Hamed Pirsiavash


Wearable ADL detection

• ADL actions derived from the medical literature on patient rehabilitation
• It is easy to collect natural data

Hamed Pirsiavash

Challenges: what features to use?

• Low-level features (weak semantics): space-time interest points [Laptev, IJCV'05]
• High-level features (strong semantics): human pose, object-centric features

Difficulties of pose:
• Detectors are not accurate enough
• Not useful in first-person camera views

Hamed Pirsiavash

Challenges: long-scale temporal structure

Wearable data (making tea): start boiling water → do other things (while waiting) → pour in cup → drink tea

"Classic" data: boxing

Adapted from Hamed Pirsiavash

Appearance feature: bag of objects

Video clip → bag of detected objects (e.g., fridge, stove, TV) → SVM classifier

Hamed Pirsiavash
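A minimal sketch of the clip descriptor (the detection tuple format and class list are assumptions for illustration, not the paper's code):

```python
import numpy as np

OBJECT_CLASSES = ["fridge", "stove", "tv", "kettle", "mug"]  # illustrative subset

def bag_of_objects(detections, classes=OBJECT_CLASSES):
    """Histogram of detected objects over a video clip.

    detections: list of (frame_idx, class_name, score) tuples from any
    per-frame object detector (format assumed for this sketch).
    """
    hist = np.zeros(len(classes))
    for _, name, score in detections:
        if name in classes:
            hist[classes.index(name)] += score  # soft counts by confidence
    return hist / max(hist.sum(), 1e-8)  # L1-normalize
```

The resulting histogram is the fixed-length vector fed to the SVM.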

Temporal pyramid

Coarse-to-fine correspondence matching with a multi-layer pyramid over time, inspired by "Spatial Pyramid" (CVPR'06) and "Pyramid Match Kernels" (ICCV'05).

Video clip → temporal pyramid descriptor → SVM classifier

Hamed Pirsiavash
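A sketch of the pyramid binning built on the bag_of_objects helper above (the number of levels and uniform segment boundaries are assumptions):

```python
import numpy as np

def temporal_pyramid(detections, n_frames, levels=3):
    """Temporal pyramid of bag-of-objects histograms.

    Level l splits the clip into 2**l equal temporal segments; the descriptor
    concatenates one bag_of_objects() histogram per segment, so coarse levels
    capture the whole clip and fine levels capture when objects appear.
    """
    parts = []
    for level in range(levels):
        n_seg = 2 ** level
        bounds = np.linspace(0, n_frames, n_seg + 1)
        for s in range(n_seg):
            seg = [d for d in detections if bounds[s] <= d[0] < bounds[s + 1]]
            parts.append(bag_of_objects(seg))
    return np.concatenate(parts)
```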

Accuracy on 18 action categories:

• Our model: 40.6%
• STIP baseline: 22.8%

Hamed Pirsiavash

Summary: Human actions

• Action recognition still an open problem

– How to represent actions?

• Types of data: atomic and more complex actions, ego-centric video

• Common representations

– Space-time interest points

– Pose

– Objects (and temporal pyramids of objects)

• Pose

– Can be approached as a classification problem using depth data
