lecture on action recognition

7/24/2019 Lecture on Action Recognition

1/72

Action recognition

ECS734 Techniques in Computer Vision

Ioannis Patras


Slides thanks to Hays, Hoiem, Grauman, Oikonomopoulos


2/72

Past lectures

Recognition in static images

Object recognition

Image categorisation


3/72

Today's lecture

Recognition in image sequences

Action recognition (body gestures / facial expressions)


4/72

To come


Action recognition

(body gestures / facial expressions)

Tracking

Structure from motion

Surveillance


5/72

Today's lecture

Introduction / applications

Features

Template classification methods

Recognition using pose estimation and objects (in brief)

Part-based action localisation


6/72



7/72

What is an action?

Action: a transition from one state to another

Who is the actor? How is the state of the actor changing?

What (if anything) is being acted on?

How is that thing changing?

What is the purpose of the action (if any)?


8/72

Human activity in video

No universal terminology, but approximately:

Actions: atomic motion patterns, often gesture-like, with a single clear-cut trajectory and a single nameable behavior (e.g., break eggs, lift spoon, kick, wave arms)

Activity: series or composition of actions (e.g., people having a conversation, interacting)

Event: combination of activities or actions (e.g., a football game, a traffic accident, cooking a meal)

Adapted from Venu Govindaraju


9/72

How do we represent actions?

Categories: walking, hammering, dancing, skiing, sitting down, standing up, jumping

Poses

Nouns and Predicates


10/72

Applications

Human-Computer

interfaces

Augmented Reality

Sports Analysis

[ C. Sminchisescu, 2007 ]


11/72

Surveillance

http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf


12/72

2011

Interfaces


13/72


14/72

How can we identify actions?

Motion

Pose

Held objects

Nearby objects


15/72







16/72

Representing Motion

Bobick and Davis 2001

Optical Flow with Motion History

http://www.cse.ohio-state.edu/~jwdavis/Publications/pami01.pdf
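A minimal sketch of a motion history image in the spirit of Bobick and Davis: pixels where motion is detected are set to a maximum value, and all other pixels decay by one per frame, so recent motion is brighter than older motion. Detecting motion by thresholded frame differencing, and the names `update_mhi`, `motion_history_image`, `tau` and `diff_threshold`, are illustrative assumptions rather than details taken from the slides.

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=10, diff_threshold=25):
    """One motion-history update (motion mask from frame differencing is an assumption)."""
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    motion = diff > diff_threshold
    # Moving pixels are reset to tau; all others decay towards zero.
    decayed = np.maximum(mhi - 1, 0)
    return np.where(motion, tau, decayed)

def motion_history_image(frames, tau=10, diff_threshold=25):
    """Run the update over a list of equally sized grey-level frames."""
    mhi = np.zeros(frames[0].shape, dtype=np.int32)
    for prev_frame, frame in zip(frames[:-1], frames[1:]):
        mhi = update_mhi(mhi, prev_frame, frame, tau, diff_threshold)
    return mhi
```

The resulting image can be used as a static template of the motion, for example by comparing it against stored templates of known actions.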


17/72

Representing Motion

Efros et al. 2003

Optical Flow with Split Channels
http://graphics.cs.cmu.edu/people/efros/research/action/efros-iccv03.pdf
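A minimal sketch of the split-channel idea: the horizontal and vertical flow components are half-wave rectified into four non-negative channels. The function name `split_flow_channels` is hypothetical, and the blurring that Efros et al. apply to each channel afterwards is omitted here.

```python
import numpy as np

def split_flow_channels(fx, fy):
    """Half-wave rectify optical flow into four channels: Fx+, Fx-, Fy+, Fy-."""
    fx = np.asarray(fx, dtype=float)
    fy = np.asarray(fy, dtype=float)
    return np.stack([
        np.maximum(fx, 0.0),   # rightward motion
        np.maximum(-fx, 0.0),  # leftward motion
        np.maximum(fy, 0.0),   # positive vertical motion
        np.maximum(-fy, 0.0),  # negative vertical motion
    ])
```

Keeping positive and negative motion in separate channels means that a later blur does not cancel opposite motions against each other.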


18/72

Representing Motion

Tracked Points

Matikainen et al. 2009

http://www.cs.cmu.edu/~rahuls/pub/voec2009-rahuls.pdf


19/72

Representing Motion

Space-Time Interest Points

Corner detectors in space-time

Laptev 2005

http://www.irisa.fr/vista/Papers/2005_ijcv_laptev.pdf
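As a sketch of what "corner detector in space-time" means, the criterion below follows the usual presentation of Laptev's detector, which extends the Harris corner measure with a time dimension. The symbols are not on the slide: $L_x, L_y, L_t$ are assumed to be Gaussian derivatives of the video, $g$ a spatio-temporal Gaussian window, and $k$ a small constant.

```latex
\mu = g(\cdot;\, \sigma_i^2, \tau_i^2) *
\begin{pmatrix}
L_x^2   & L_x L_y & L_x L_t \\
L_x L_y & L_y^2   & L_y L_t \\
L_x L_t & L_y L_t & L_t^2
\end{pmatrix},
\qquad
H = \det(\mu) - k \,\operatorname{trace}^3(\mu)
```

Interest points are taken at positive local maxima of $H$, that is, where the image varies strongly in both space and time.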


20/72

Representing Motion

Space-Time Interest Points

Laptev 2005

http://www.irisa.fr/vista/Papers/2005_ijcv_laptev.pdf


21/72

Representing Motion

Space-Time Volumes

Blank et al. 2005

http://www.wisdom.weizmann.ac.il/~vision/VideoAnalysis/Demos/SpaceTimeActions/SpaceTimeActions_iccv05.pdf


22/72

Examples of Action Recognition

Systems

Feature-based classification

Recognition using pose and objects

Part-based recognition and localisation


23/72

Today's lecture

Features

Feature classification methods




24/72

Action recognition as classification

Retrieving actions in movies, Laptev and Perez, 2007
http://www.irisa.fr/vista/Papers/2007_iccv_laptev.pdf


25/72

Remember image categorization

Training: training images -> image features -> classifier training (with training labels) -> trained classifier


26/72

Remember image categorization

Training: training images -> image features -> classifier training (with training labels) -> trained classifier

Testing: test image -> image features -> trained classifier -> prediction (e.g., Outdoor)


27/72

Spatial pyramids.

Compute histogram in each spatial bin
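A minimal sketch of a spatial pyramid histogram, assuming the image has already been quantised into a 2D map of visual-word indices. The function name `spatial_pyramid_histogram` and the choice of a 2^l x 2^l grid at level l are illustrative assumptions; per-level weighting is omitted.

```python
import numpy as np

def spatial_pyramid_histogram(codes, n_words, levels=2):
    """Concatenate visual-word histograms over grids of 1x1, 2x2, 4x4, ... cells.

    codes: 2D array of non-negative integer visual-word indices (< n_words).
    """
    h, w = codes.shape
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        row_edges = np.linspace(0, h, cells + 1).astype(int)
        col_edges = np.linspace(0, w, cells + 1).astype(int)
        for r in range(cells):
            for c in range(cells):
                block = codes[row_edges[r]:row_edges[r + 1],
                              col_edges[c]:col_edges[c + 1]]
                # One histogram of word counts per spatial bin.
                parts.append(np.bincount(block.ravel(), minlength=n_words))
    return np.concatenate(parts)
```

The same idea extends to video by adding a temporal axis to the grid, which gives spatio-temporal pyramids.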


28/72

Features for Classifying Actions

1. Spatio-temporal pyramids (14x14x8 bins)

Image Gradients

Optical Flow


29/72

Features for Classifying Actions

2. Spatio-temporal interest points

Corner detectors in

space-time

Descriptors based on Gaussian derivative filters over x, y, time


30/72

Classification

Boosted stumps for pyramids of optical flow and gradient

Nearest neighbor for STIP


31/72

Searching the video for an action

1. Detect keyframes using a trained HOG detector in each frame

2. Classify detected keyframes as positive (e.g., drinking) or negative (other)


32/72

Accuracy in searching video

Without keyframe detection

With keyframe detection


33/72

Learning realistic human actions from movies, Laptev et al. 2008

Talk on phone

Get out of car
http://www.irisa.fr/vista/Papers/2008_cvpr_laptev.pdf


34/72

Approach

Space-time interest point detectors

Descriptors

HOG, HOF

Pyramid histograms (3x3x2)

SVMs with Chi-Squared Kernel

Interest Points

Spatio-Temporal Binning
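A minimal sketch of a chi-squared kernel between histograms, as could be fed to an SVM that accepts a precomputed kernel matrix. The single-channel form, the `scale` normalisation by the mean pairwise distance, and the function names are assumptions for illustration, not the exact formulation of the paper.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two non-negative histograms."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def chi2_kernel(histograms, scale=None):
    """Gram matrix K[i, j] = exp(-D(h_i, h_j) / scale)."""
    X = np.asarray(histograms, dtype=float)
    n = len(X)
    D = np.array([[chi2_distance(X[i], X[j]) for j in range(n)]
                  for i in range(n)])
    if scale is None:
        # Assumption: normalise by the mean distance between distinct samples.
        off_diag = D[~np.eye(n, dtype=bool)]
        mean = off_diag.mean() if off_diag.size else 0.0
        scale = mean if mean > 0 else 1.0
    return np.exp(-D / scale)
```

The chi-squared distance weights each bin difference by the bin mass, which tends to suit histogram features better than a plain Euclidean distance.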


35/72

Results


36/72







37/72


38/72

Human-Object Interaction

Torso

Head

Human pose estimation

Holistic image based classification

Integrated reasoning

Slide Credit: Yao/Fei-Fei


39/72


Tennis

racket




Object detection



40/72





Object detection

Torso

Head

Tennis

racket

HOI activity: Tennis Forehand


Action categorization


41/72

Felzenszwalb & Huttenlocher, 2005

Ren et al, 2005

Ramanan, 2006

Ferrari et al, 2008

Wang & Mori, 2008

Andriluka et al, 2009

Eichner & Ferrari, 2009

Difficult part

appearance

Self-occlusion

Image region looks

like a body part

Human pose estimation & Object detection

Human pose

estimation is

challenging.



42/72


Human pose

estimation is

challenging.

Felzenszwalb & Huttenlocher, 2005

Ren et al, 2005

Ramanan, 2006

Ferrari et al, 2008

Wang & Mori, 2008

Andriluka et al, 2009

Eichner & Ferrari, 2009

Slide Credit: Yao/Fei-Fei


43/72


Facilitate: given the object is detected.



44/72

Viola & Jones, 2001

Lampert et al, 2008

Divvala et al, 2009

Vedaldi et al, 2009

Small, low-resolution,

partially occluded

Image region similar

to detection target


Object

detection is

challenging



45/72


Object

detection is

challenging

Viola & Jones, 2001

Lampert et al, 2008

Divvala et al, 2009

Vedaldi et al, 2009



46/72


Facilitate: given the pose is estimated.



47/72


Mutual Context



48/72

Mutual Context Model Representation

A: Activity (e.g., croquet shot, volleyball smash, tennis forehand)

H: Human pose. More than one H for each A (intra-class variations); unobserved during training.

O: Object (e.g., croquet mallet, volleyball, tennis racket)

P: Body parts P1, P2, ..., PN. $l_P$: location; $\theta_P$: orientation; $s_P$: scale.

f: Image evidence (fO for the object; f1, f2, ..., fN for the body parts). Shape context. [Belongie et al, 2002]



49/72

Activity Classification Results

[Bar chart: classification accuracy. Our model: 83.3%; Gupta et al, 2009: 78.9%; Bag-of-Words: 52.5%]

[Bar chart: per-class accuracy (axis 0.5 to 0.9) for classes including cricket shot and tennis forehand, comparing Bag-of-words (SIFT+SVM), Gupta et al, 2009, and our model]



50/72

Today's lecture

Features

Template classification methods

Part-based Recognition & Localisation


51/72

Part-based Recognition & Localisation

Implicit shape model

Goal:

Recognize categories of actions

Localize them in terms of their bounding box (space + time)

Challenges: occlusions, clutter, variations

Hypothesis: analysis can be restricted to a set of spatiotemporally interesting/salient events


52/72

Information theoretical spatial saliency

Proposal: Use signal unpredictability as an indicator of saliency

[Figure: two example image regions with entropies HD=3.866 and HD=7.201]

Spatial Saliency: Unpredictability in a single frame
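A minimal sketch of entropy as a measure of unpredictability: the Shannon entropy of the grey-level histogram inside a circle around a point. Using raw grey levels as the descriptor, and the name `local_entropy`, are assumptions for illustration; with 256 grey levels the value lies between 0 and 8 bits.

```python
import numpy as np

def local_entropy(image, center, radius, n_bins=256):
    """Shannon entropy (in bits) of grey levels inside a circular region.

    image: 2D integer array with values in [0, n_bins).
    center: (row, col) of the circle; radius in pixels.
    """
    rows, cols = np.ogrid[:image.shape[0], :image.shape[1]]
    mask = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2
    counts = np.bincount(image[mask].ravel(), minlength=n_bins).astype(float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))
```

Evaluating this for several radii at the same point gives an entropy-versus-scale curve whose maxima can be used to pick the scale of a salient region.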


53/72

Towards scale invariance

[Plot: entropy against scale (circle radius, 0 to 80), with the scales 29 and 59 marked]

The entropy maxima reveal the spatial scale(s) of a salient region

Detected salient points in a single frame


54/72

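Spatial and spatiotemporal saliency

Spatiotemporal Saliency:

Driven by signal unpredictability in a spatiotemporal volume (cylinder / sphere)

Examine entropy:

$Y_D(s, d, \mathbf{u}) = H_D(s, d, \mathbf{u}) \, W_D(s, d, \mathbf{u})$

Entropy's height: $H_D(s, d, \mathbf{u}) = -\int_{q \in D} p_D(q, s, d, \mathbf{u}) \log_2 p_D(q, s, d, \mathbf{u}) \, dq$

Entropy's peakedness: $W_D(s, d, \mathbf{u})$, which measures how strongly the density $p_D(q, s, d, \mathbf{u})$ changes across the spatial scale $s$ and the temporal scale $d$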


55/72

Descriptor extraction and codebook creation

Input sequence -> optical flow -> optical flow after median subtraction -> spatiotemporal salient point detection

Feature ensembles (O. Boiman & M. Irani [ICCV05]) -> feature selection -> ensemble codewords c1, c2, ..., cN -> codebook (class-specific)

Descriptors: optical flow + spatial gradient, binned in histograms and concatenated
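A minimal sketch of the descriptor step: the median flow is subtracted to suppress global (camera) motion, and orientation histograms of the residual flow and of the spatial gradient are concatenated. The bin count, the magnitude weighting, and the function names are assumptions for illustration, not the exact descriptor of the method.

```python
import numpy as np

def median_subtract(flow):
    """flow: (H, W, 2) array. Subtract the per-component median flow."""
    return flow - np.median(flow.reshape(-1, 2), axis=0)

def orientation_histogram(vx, vy, n_bins=8):
    """Magnitude-weighted histogram of vector orientations, L1-normalised."""
    angles = np.arctan2(vy, vx)
    magnitudes = np.hypot(vx, vy)
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi),
                           weights=magnitudes)
    total = hist.sum()
    return hist / total if total > 0 else hist

def describe_patch(flow_patch, gray_patch, n_bins=8):
    """Concatenate flow and spatial-gradient orientation histograms.

    flow_patch: (H, W, 2) flow, assumed already median-subtracted.
    gray_patch: (H, W) grey-level patch around the salient point.
    """
    gy, gx = np.gradient(gray_patch.astype(float))
    flow_hist = orientation_histogram(flow_patch[..., 0], flow_patch[..., 1], n_bins)
    grad_hist = orientation_histogram(gx, gy, n_bins)
    return np.concatenate([flow_hist, grad_hist])
```

Descriptors of this kind can then be clustered per class to obtain the codewords of a class-specific codebook.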



56/72

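Class-dependent spatio-temporal probabilistic voting

[Figure: timeline of an action of length T, with the current frame at offset t from the start and T-t from the end]

Parameters stored for each ensemble in the training set:

average spatial position of the ensemble with respect to the subject center and lower bound

distance in frames of the activated ensemble from the start/end of the action

average spatiotemporal scale of the ensemble

Localisation model learned for each codeword/cluster $c_i$:

[Equations: the model combines the codeword weight $w_i$, the match probability $p(c_i \mid e_d)$ between a detected ensemble $e_d$ and codeword $c_i$, and pdfs over spatial position $X$, temporal position $T$ and scale $S$, e.g. $p_X(x \mid c_i, e_d)$]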
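A simplified sketch of spatio-temporal voting: each detected ensemble that matches a codeword casts a weighted vote for the action centre and start frame, using the offset stored for that codeword. Hard votes on a grid and an argmax replace the pdfs and mean-shift mode seeking of the actual method, and all names (`accumulate_votes`, `best_hypothesis`, `codebook_offsets`) are hypothetical.

```python
import numpy as np

def accumulate_votes(detections, codebook_offsets, weights, grid_shape):
    """Accumulate weighted votes in an (x, y, t) grid.

    detections: iterable of (codeword_index, x, y, t).
    codebook_offsets: per-codeword (dx, dy, dt) from ensemble to action reference point.
    weights: per-codeword vote weight.
    """
    votes = np.zeros(grid_shape, dtype=float)
    for idx, x, y, t in detections:
        dx, dy, dt = codebook_offsets[idx]
        cx, cy, ct = int(round(x + dx)), int(round(y + dy)), int(round(t + dt))
        inside = (0 <= cx < grid_shape[0] and 0 <= cy < grid_shape[1]
                  and 0 <= ct < grid_shape[2])
        if inside:
            votes[cx, cy, ct] += weights[idx]
    return votes

def best_hypothesis(votes):
    """Return the (x, y, t) cell with the largest accumulated vote."""
    return tuple(int(v) for v in np.unravel_index(np.argmax(votes), votes.shape))
```

Because every part votes independently, a hypothesis can still gather enough support when some parts are occluded.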



57/72

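Discriminative learning

Higher weights for pdfs with higher localisation accuracy

Class dictionary comprises discriminative codewords

Adaboost on the codeword similarities

[Equation: the weight $w_i$ of codeword $c_i$ is an exponential of a $p(\cdot \mid c_i) \log p(\cdot \mid c_i)$ term, so that peaked localisation pdfs receive higher weights]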


58/72

Spatio-temporal probabilistic voting


59/72

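Hypothesis verification with RVM-based classification

Mean-shift responses used as features in RVM-based classification: $F = \{f_1, \ldots, f_i, \ldots\}$

Kernel: $K(F, F') = e^{-D_C^2(F, F') / (2\sigma^2)}$, where $D_C$ is a distance between feature sets and $\sigma$ is the kernel width

Two-class classification problem for each class $l$: $c_l(F; \mathbf{w}) = w_{l0} + \sum_{j=1}^{N} w_{lj} K(F, F_j)$

Select the class $l$ that maximizes the posterior probability $p(l \mid F) = \dfrac{1}{1 + e^{-c_l(F; \mathbf{w})}}$

The Relevance Vector Machine (RVM) is a probabilistic (sparse Bayesian) variant of the Support Vector Machine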
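A minimal sketch of the decision step only: a kernel expansion over training examples followed by a logistic function, with the class of highest posterior selected. The weights are assumed to come from an already trained RVM (training is not shown), and the Gaussian kernel on a precomputed distance, the width `sigma`, and the function names are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(distance, sigma=1.0):
    """Kernel value for a precomputed distance between two feature sets."""
    return np.exp(-np.asarray(distance, dtype=float) ** 2 / (2.0 * sigma ** 2))

def rvm_posterior(distances, weights, bias, sigma=1.0):
    """Posterior of one class given distances to its N training examples."""
    score = bias + np.dot(weights, gaussian_kernel(distances, sigma))
    return float(1.0 / (1.0 + np.exp(-score)))

def classify(distances_per_class, weights_per_class, biases, sigma=1.0):
    """Evaluate one two-class model per class and pick the highest posterior."""
    posteriors = [rvm_posterior(d, w, b, sigma)
                  for d, w, b in zip(distances_per_class, weights_per_class, biases)]
    return int(np.argmax(posteriors)), posteriors
```

In a trained RVM most weights are zero, so only the few remaining "relevance vectors" contribute to the sum.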



60/72


Localisation of single actions


61/72

Localisation accuracy (KTH)


62/72

Localisation accuracy (KTH)



63/72

Action recognition

KTH dataset average: 88%

HoHA dataset average: 37%


64/72

Localisation under artificial occlusions (KTH)


65/72

Localisation under clutter (KTH)



66/72

Summary

Advantages

Highly flexible structure model

Each part casts votes independently

Only few training examples are needed

Fast recognition

Robustness to occlusions

Disadvantages

Loose spatial model that does not model co-occurrence of parts.

False positives in background (clutter)


67/72

Take-home messages

Action recognition is an open problem.

How to define actions? How to infer them?

What are good visual cues?

How do we incorporate higher level reasoning?



68/72

Take-home messages

Some work has been done, but it is just the beginning of exploring the problem. So far:

Actions are mainly categorical

Most approaches are classification using simple features (spatial-temporal histograms of gradients or flow, s-t interest points, SIFT in images)

Just a couple of works on how to incorporate pose and objects

Not much idea of how to reason about long-term activities or to describe video sequences


69/72

To come


Action recognition (body gestures / facial expressions)

Tracking

Structure from motion

Surveillance


70/72

References

C. Sminchisescu. Learning and Inference Algorithms for Monocular Perception - Applications to Visual Object Detection, Localization and Time Series Models for 3D Human Motion Understanding. Habilitation Thesis, University of Bonn, Faculty of Mathematics and Natural Sciences, 2007.

A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, vol. 23, pp. 257-267, Mar. 2001.

A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In Proceedings of IEEE ICCV '03, Volume 2, 2003.

P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV 2009.

I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3), 2005.

M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, Beijing, China, Oct. 15-21, 2005.


71/72

References

I. Laptev and P. Pérez. Retrieving actions in movies. In Proceedings of ICCV '07, Rio de Janeiro, Brazil, October 2007, pp. 1-8.

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proceedings of CVPR '08, Anchorage, AK, June 2008, pp. 1-8.

B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. IEEE CVPR '10, San Francisco, CA, USA, June 13-18, 2010.

O. Boiman and M. Irani. Detecting irregularities in images and in video. In ICCV, Beijing, China, Oct. 15-21, 2005.

P. Felzenszwalb and D. Huttenlocher. Pictorial Structures for Object Recognition. IJCV, vol. 61, no. 1, January 2005.

D. Ramanan. Learning to parse images of articulated bodies. In NIPS 19, Vancouver, Canada, December 2006.

V. Ferrari, M. Marin, and A. Zisserman. Progressive Search Space Reduction for Human Pose Estimation. IEEE CVPR, Alaska, June 2008.


72/72

References

Y. Wang and G. Mori. Multiple Tree Models for Occlusion and Spatial Constraints in Human Pose Estimation. ECCV, 2008.

M. Andriluka, S. Roth, and B. Schiele. Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In IEEE CVPR, 2009.

M. Eichner and V. Ferrari. Better Appearance Models for Pictorial Structures. BMVC, London, September 2009.

P. A. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), 2001.

M. B. Blaschko and C. H. Lampert. Learning to Localize Objects with Structured Output Regression. ECCV, Marseille, France, 2008.

A. Oikonomopoulos, I. Patras, and M. Pantic. Spatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences. IEEE Trans. Image Processing, vol. 20, no. 4, pp. 1126-1140, Mar. 2011.

A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple Kernels for Object Detection. In Proceedings of the ICCV, 2009.
