
CS 1699: Intro to Computer Vision

Human Pose and Actions

Prof. Adriana Kovashka, University of Pittsburgh

December 8, 2015

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Next time (Last class)

• Review for the final exam + OMETs

• By Wednesday night, post on Piazza any questions or topics you want me to review, for participation credit

• Extra office hours on Friday, 2-3pm

Final Exam

• Monday, Dec. 14, 12pm

• Same room (5502 Sennott Square)

• Similar to midterm exam (mostly short questions and a few problems), but longer (100 points)

• Will only cover topics discussed after midterm (but some of these use topics from first half)

Homework 4

Mean = 77.16, median = 99, max = 123

Homework 5

• Due Thursday

• See Piazza for a correction about how to get probabilities for Part III

Participation

• Tentative grades entered on CourseWeb

• Median is 80%

What is an action/activity?

Action: a transition from one state to another
• Who is the actor?
• How is the state of the actor changing?
• What (if anything) is being acted on?
• How is that thing changing?
• What is the purpose of the action (if any)?
• Could be more or less complex

Adapted from Derek Hoiem

Terminology: Human activity in video

No universal terminology, but approximately:

• “Actions”: atomic motion patterns – often gesture-like, single clear-cut trajectory, single nameable behavior (e.g., sit, wave arms)

• “Activity”: series or composition of actions (e.g., interactions between people)

• “Event”: combination of activities or actions (e.g., a football game, a traffic accident)

Adapted from Venu Govindaraju

How do we represent actions?

Categories: walking, hammering, dancing, skiing, sitting down, standing up, jumping

Poses

Nouns and Predicates: <man, swings, hammer>, <man, hits, nail, w/ hammer>

Derek Hoiem

How can we identify actions?

• Motion
• Pose
• Held objects
• Nearby objects

Derek Hoiem

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake

Best paper award at CVPR 2011

Adapted from Jamie Shotton

• Recognize a large variety of human poses, all shapes & sizes

• Limited compute budget: super-real-time on Xbox 360, to allow games to run concurrently

Adapted from Jamie Shotton

[Figure: depth image with labeled body parts: right elbow, right hand, left shoulder, neck]

Jamie Shotton

• No temporal information: frame-by-frame

• Local pose estimate of parts: each pixel & each body joint treated independently
  – reduced training data and computation time

• Very fast: simple depth image features, parallel decision forest classifier

Jamie Shotton

Pipeline: capture depth image & remove background → infer body parts per pixel → cluster pixels to hypothesize body joint positions → fit model & track skeleton

Jamie Shotton

Compute P(ci | wi) for each pixel i = (x, y), where ci is the body part label and wi is an image window around pixel i.

Discriminative approach: learn the classifier P(ci | wi) from training data.

Jamie Shotton

Synthetic training data, to train invariance to pose and body shape:
• Record mocap: 500k frames, distilled to 100k poses
• Retarget to several body models
• Render (depth, body parts) pairs

Jamie Shotton

Depth comparison features: very fast to compute.

f_Θ(I, x) = d_I(x) − d_I(x + Δ)

where d_I is the image depth, x is the image coordinate, and Δ is the pixel offset parameterizing the feature Θ; the difference is the feature response.

[Figure: input depth image with example offset probes Δ]

Adapted from Jamie Shotton
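The feature itself is just two depth lookups and a subtraction. A minimal sketch in NumPy (the helper name and background handling are assumptions; the published version additionally scales the offset by 1/d_I(x) to make the feature depth-invariant):

```python
import numpy as np

def depth_feature(depth, x, y, dx, dy, large=1e6):
    """Depth-comparison feature f_theta(I, x) = d_I(x) - d_I(x + delta).

    depth : 2D array of depth values (background pixels set to a large value)
    (x, y): pixel coordinate being classified
    (dx, dy): the offset delta that parameterizes this feature
    """
    h, w = depth.shape
    ox, oy = x + dx, y + dy
    # Offsets falling outside the image behave like background.
    d_off = depth[oy, ox] if (0 <= ox < w and 0 <= oy < h) else large
    return depth[y, x] - d_off
```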

To classify pixel x, start at the root of a decision tree: at each split node, evaluate a depth feature against a threshold (f_Θ(I, x; Δ1) > t1, then f_Θ(I, x; Δ2) > t2, ...) and follow the yes/no branch; each leaf stores a distribution P(c) over body parts.

Toy example: distinguish the left (L) and right (R) sides of the body, so each leaf holds P(c) over {L, R}.

Adapted from Jamie Shotton

[Figure: input depth, ground truth parts, and inferred parts (soft) as tree depth grows from 1 to 18]

Jamie Shotton

[Plots: average per-class accuracy (30%–65%) vs. depth of trees (up to 20), on synthetic and real test data]

Jamie Shotton

Randomized decision forest [Amit & Geman 97] [Breiman 01] [Geurts et al. 06]:

• Train each tree (tree 1 ... tree T) on a different random subset of images; "bagging" helps avoid over-fitting
• At test time, classify (I, x) with every tree and average the tree posteriors:

P(c | I, x) = (1/T) Σ_{t=1}^{T} P_t(c | I, x)

Jamie Shotton
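A minimal sketch of the forest at test time, reusing the depth_feature helper above (the node layout and class names are assumptions, not the shipped implementation):

```python
import numpy as np

class SplitNode:
    """Internal node: branch on a depth-comparison feature vs. a threshold."""
    def __init__(self, delta, t, left, right):
        self.delta, self.t = delta, t        # feature offset (dx, dy), threshold
        self.left, self.right = left, right  # "no" / "yes" children

class Leaf:
    """Leaf node: stores a learned distribution P(c) over body parts."""
    def __init__(self, posterior):
        self.posterior = np.asarray(posterior)

def classify_pixel(tree, depth, x, y):
    """Walk one tree from root to leaf and return that leaf's P(c)."""
    node = tree
    while isinstance(node, SplitNode):
        f = depth_feature(depth, x, y, *node.delta)  # as in the sketch above
        node = node.right if f > node.t else node.left
    return node.posterior

def forest_posterior(trees, depth, x, y):
    """Average tree posteriors: P(c|I,x) = (1/T) sum_t P_t(c|I,x)."""
    return np.mean([classify_pixel(t, depth, x, y) for t in trees], axis=0)
```

The arg-max of forest_posterior gives the most likely body part for the pixel.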

[Figure: ground truth vs. inferred body parts (most likely) using 1, 3, and 6 trees; plot of average per-class accuracy (40%–55%) vs. number of trees (1–6)]

Jamie Shotton

[Figure: input depth, inferred body parts, and inferred joint positions (modes found using mean shift), shown in front, side, and top views; no tracking or smoothing]

Jamie Shotton
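Mean shift is a gradient ascent on a kernel density estimate. A rough sketch of the mode-finding idea (NumPy; the bandwidth, weighting, and back-projection of part pixels to 3D points are assumptions, not the paper's exact procedure):

```python
import numpy as np

def mean_shift_modes(points, weights, bandwidth=0.05, n_iter=20):
    """Find density modes of weighted 3D points via mean shift.

    points : (N, 3) array of 3D points back-projected from pixels of one
             body part; weights : (N,) classifier confidences for those pixels.
    Each estimate is repeatedly shifted to the weighted mean of its Gaussian
    neighborhood until it settles on a local density maximum.
    """
    modes = points.copy()
    for _ in range(n_iter):
        # Squared distances from current mode estimates to all data points.
        d2 = ((modes[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))     # kernel weights
        modes = (k @ points) / k.sum(axis=1, keepdims=True)  # weighted mean
    return modes  # near-duplicate rows converged to the same mode; merge them
```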


Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Representing Actions

Tracked Points

Matikainen et al. 2009
Adapted from Derek Hoiem

Representing Actions

Space-Time Interest Points

Laptev 2005

• Corner detectors in space+time

Adapted from Derek Hoiem
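A hedged sketch of a Harris-style corner detector extended to space+time (SciPy; the smoothing scales and the constant k here are illustrative assumptions, though the det(μ) − k·trace³(μ) score is the form Laptev uses):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def stip_response(video, sigma=2.0, tau=1.5, k=0.005):
    """Space-time corner response for a grayscale video of shape (T, H, W).

    Builds the 3x3 spatio-temporal second-moment matrix mu from smoothed
    gradients and scores each voxel with det(mu) - k * trace(mu)^3.
    Local maxima of the response are the space-time interest points.
    """
    v = gaussian_filter(video.astype(float), (tau, sigma, sigma))
    It, Iy, Ix = np.gradient(v)  # gradients along time, y, x

    # Integrate gradient products over a space-time neighborhood.
    s = lambda a: gaussian_filter(a, (2 * tau, 2 * sigma, 2 * sigma))
    xx, yy, tt = s(Ix * Ix), s(Iy * Iy), s(It * It)
    xy, xt, yt = s(Ix * Iy), s(Ix * It), s(Iy * It)

    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace ** 3
```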

Representing Actions

Laptev 2005

“Talk on phone”

“Get out of car”

Derek Hoiem

Learning realistic human actions from movies, Laptev et al. 2008

Approach

• Space-time interest point detectors

• Descriptors

– HOG, HOF

• Pyramid histograms (3x3x2)

• SVMs with a chi-squared kernel (see the sketch below)

[Figures: interest points; spatio-temporal binning]

Derek Hoiem, figures from Ivan Laptev
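A sketch of the classification stage with scikit-learn (the data here are random placeholders standing in for per-video bag-of-features histograms of quantized HOG/HOF descriptors):

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

# Placeholder data: one histogram row per video clip.
rng = np.random.default_rng(0)
X_train = rng.random((40, 200))
y_train = np.arange(40) % 2          # two action classes, for illustration
X_test = rng.random((10, 200))

# Exponential chi-squared kernel:
#   k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
K_train = chi2_kernel(X_train, gamma=0.5)          # (40, 40)
K_test = chi2_kernel(X_test, X_train, gamma=0.5)   # (10, 40)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
pred = clf.predict(K_test)
```

The precomputed-kernel route keeps the histogram-friendly chi-squared similarity while reusing a standard SVM solver.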

Results

Derek Hoiem, figures from Ivan Laptev

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Human-Object Interaction

[Figure: detected torso and head]

• Human pose estimation
  – Holistic image-based classification
  – Integrated reasoning

Yao/Fei-Fei

Human-Object Interaction

[Figure: detected tennis racket]

• Human pose estimation
• Object detection
  – Holistic image-based classification
  – Integrated reasoning

Yao/Fei-Fei

Human-Object Interaction

[Figure: torso, head, and tennis racket detected; activity: tennis forehand]

• Human pose estimation
• Object detection
• Action categorization
  – Holistic image-based classification
  – Integrated reasoning

Yao/Fei-Fei

Human pose estimation & Object detection

Human pose estimation is challenging:
• Difficult part appearance
• Self-occlusion
• Image regions that look like a body part

[Felzenszwalb & Huttenlocher, 2005; Ren et al., 2005; Ramanan, 2006; Ferrari et al., 2008; Yang & Mori, 2008; Andriluka et al., 2009; Eichner & Ferrari, 2009]

Yao/Fei-Fei


Human pose estimation & Object detection

Facilitation: given that the object is detected, pose estimation becomes easier.

Yao/Fei-Fei

Human pose estimation & Object detection

Object detection is challenging:
• Small, low-resolution, partially occluded objects
• Image regions similar to the detection target

[Viola & Jones, 2001; Lampert et al., 2008; Divvala et al., 2009; Vedaldi et al., 2009]

Yao/Fei-Fei


Human pose estimation & Object detection

Facilitation: given that the pose is estimated, object detection becomes easier.

Yao/Fei-Fei

Human pose estimation & Object detection

Mutual Context

Yao/Fei-Fei

Learning Results

[Figures: learned pose-object models for tennis serve, volleyball smash, and tennis forehand]

Yao/Fei-Fei

Activity Classification Results

[Bar chart: classification accuracy (0.5–0.9); example activities shown: cricket shot, tennis forehand]

• Bag-of-words (SIFT+SVM): 52.5%
• Gupta et al., 2009: 78.9%
• Our model: 83.3%

Yao/Fei-Fei

Today

• Human pose and actions: Introduction

• Estimating human pose

• Recognizing human actions

– Using specialized features

– Using pose

– Using objects

– From ego-centric video

Detecting Activities of Daily Living in First-person Camera Views

Hamed Pirsiavash, Deva Ramanan

CVPR 2012

Hamed Pirsiavash

Motivation: a sample video of Activities of Daily Living (ADL)

Applications: tele-rehabilitation

• Kopp et al., Arch. of Physical Medicine and Rehabilitation, 1997
• Catz et al., Spinal Cord, 1997

Long-term at-home monitoring

Hamed Pirsiavash

Applications: life-logging

• Gemmell et al., "MyLifeBits: a personal database for everything", Communications of the ACM, 2006
• Hodges et al., "SenseCam: A retrospective memory aid", UbiComp, 2006

So far, mostly “write-only” memory!

This is the right time for the computer vision community to get involved.

Hamed Pirsiavash


Wearable ADL detection

• ADL actions derived from the medical literature on patient rehabilitation
• It is easy to collect natural data

Hamed Pirsiavash

Challenges: what features to use?

• Low-level features (weak semantics): space-time interest points [Laptev, IJCV'05]
• High-level features (strong semantics): human pose, object-centric features

Difficulties of pose:
• Detectors are not accurate enough
• Not useful in first-person camera views

Hamed Pirsiavash

Challenges: long-scale temporal structure

Wearable data (making tea): start boiling water → do other things (while waiting) → pour in cup → drink tea

"Classic" data: boxing

Adapted from Hamed Pirsiavash

Appearance feature: bag of objects

Video clip → bag of detected objects (e.g., fridge, stove, TV) → SVM classifier

Hamed Pirsiavash
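A minimal sketch of the clip descriptor (the detection tuple format and class list are assumptions for illustration, not the paper's code):

```python
import numpy as np

OBJECT_CLASSES = ["fridge", "stove", "tv", "kettle", "mug"]  # illustrative subset

def bag_of_objects(detections, classes=OBJECT_CLASSES):
    """Histogram of detected objects over a video clip.

    detections: list of (frame_idx, class_name, score) tuples from any
    per-frame object detector (format assumed for this sketch).
    """
    hist = np.zeros(len(classes))
    for _, name, score in detections:
        if name in classes:
            hist[classes.index(name)] += score  # soft counts by confidence
    return hist / max(hist.sum(), 1e-8)  # L1-normalize
```

The resulting histogram is the fixed-length vector fed to the SVM.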

Temporal pyramid

Coarse-to-fine correspondence matching with a multi-layer pyramid over time, inspired by "Spatial Pyramid" (CVPR'06) and "Pyramid Match Kernels" (ICCV'05).

Video clip → temporal pyramid descriptor → SVM classifier

Hamed Pirsiavash
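A sketch of the pyramid binning built on the bag_of_objects helper above (the number of levels and uniform segment boundaries are assumptions):

```python
import numpy as np

def temporal_pyramid(detections, n_frames, levels=3):
    """Temporal pyramid of bag-of-objects histograms.

    Level l splits the clip into 2**l equal temporal segments; the descriptor
    concatenates one bag_of_objects() histogram per segment, so coarse levels
    capture the whole clip and fine levels capture when objects appear.
    """
    parts = []
    for level in range(levels):
        n_seg = 2 ** level
        bounds = np.linspace(0, n_frames, n_seg + 1)
        for s in range(n_seg):
            seg = [d for d in detections if bounds[s] <= d[0] < bounds[s + 1]]
            parts.append(bag_of_objects(seg))
    return np.concatenate(parts)
```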

Accuracy on 18 action categories:

• Our model: 40.6%
• STIP baseline: 22.8%

Hamed Pirsiavash

Summary: Human actions

• Action recognition still an open problem

– How to represent actions?

• Types of data: atomic and more complex actions, ego-centric video

• Common representations

– Space-time interest points

– Pose

– Objects (and temporal pyramids of objects)

• Pose

– Can be approached as a classification problem using depth data
