
Learning Realistic Human Actions from Movies

Ivan Laptev*, Marcin Marszałek**, Cordelia Schmid**, Benjamin Rozenfeld***
* INRIA Rennes, France
** INRIA Grenoble, France
*** Bar-Ilan University, Israel

Presented by: Nils Murrugarra
University of Pittsburgh

Motivation

2

Action recognition useful for:

• Content-based browsing

e.g. fast-forward to the next goal scoring scene

• Scientists studying human behaviour

e.g. the influence of smoking in movies on adolescent smoking

The Internet holds a huge and still growing amount of video (around 150,000 uploads every day).

Human actions are very common in movies, TV news, personal video, …

Motivation

3

• Actions in current datasets: e.g. the KTH action dataset

• Actions "in the wild": realistic actions in movies

[Figure: sample frames from the KTH dataset vs. frames from movies]

[3] Slide version of "Learning realistic human actions from movies." Source: http://www.di.ens.fr/~laptev/actions/

Context

4

Web video search (Google Video, YouTube, MyspaceTV, …)

• Useful for some action classes: kissing, hand shaking

• Noisy results, not useful for most actions

Web image search

• Useful for learning action context: static scenes and objects

• See also [Li-Jia & Fei-Fei ICCV07]

How to find real actions?

Context

5

Movies contain many classes and many examples of realistic actions

Problems:

• Only a few samples per class in each movie

• Manual annotation is very time consuming

How to annotate automatically?

Method – Annotation [1]

6

Subtitles (with timestamps):

1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?

1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.

1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.

Movie script (no timestamps):

RICK
Why weren't you honest with me? Why
did you keep your marriage a secret?

Rick sits down with Ilsa.

ILSA
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even
our closest friends knew about our
marriage.

[Figure: alignment transfers the interval 01:20:17 - 01:20:23 from the subtitles to the matching script passage.]

• Scripts are available but carry no time synchronization

• Subtitles carry time information

How to combine the two sources?

• Identify an action in the script and transfer the time from the subtitles by text alignment (see the sketch below)

[1]. Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! My name is... Buffy -- automatic naming of characters in TV video.
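The alignment step can be pictured with a minimal word-matching sketch. This is not the paper's implementation; it only illustrates transferring subtitle timestamps to matching script words, and all names here (align_script_to_subtitles, the toy word lists and times) are hypothetical.

```python
# Minimal sketch: transfer subtitle timestamps to script words via word-level
# alignment. Illustrative only; the paper's alignment pipeline is more involved.
from difflib import SequenceMatcher

def align_script_to_subtitles(script_words, subtitle_words, subtitle_times):
    """subtitle_times: one (start, end) tuple per subtitle word.
    Returns a dict mapping script word index -> (start, end) time."""
    matcher = SequenceMatcher(a=script_words, b=subtitle_words, autojunk=False)
    times = {}
    for a_start, b_start, size in matcher.get_matching_blocks():
        for k in range(size):
            times[a_start + k] = subtitle_times[b_start + k]
    return times

script = "why did you keep your marriage a secret".split()
subs = "why d you keep your marriage a secret".split()
# fake per-word times spanning roughly 01:20:17 - 01:20:20 (in seconds)
sub_times = [(4817.2 + 0.4 * i, 4817.6 + 0.4 * i) for i in range(len(subs))]

mapping = align_script_to_subtitles(script, subs, sub_times)
print(len(mapping), "of", len(script), "script words received a timestamp")
```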

Method – Annotation

7

On the good side:

• Realistic variation of actions: subjects, views, etc.

• Many classes and many examples per action

• No additional work for new classes

• Character names may be used to resolve who is performing the action

Problems:

• No spatial localization (no bounding box)

• Temporal localization may be poor

• Missing actions: e.g. scripts do not always follow the movie (not aligned)

• Annotation is incomplete, so it cannot serve as ground truth at test time

• Large within-class variability in how the same action is described in text

Method – Annotation - Evaluation

8

1. Annotate action samples in text

2. Perform automatic script-video alignment

3. Check the correspondence based on manual annotation

Example of a "visual false positive": "A black car pulls up, two army officers get out."

a: quality of the subtitle-script matching

a = (# matched words) / (# all words)
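As a rough illustration of the a score, a minimal sketch is below; what counts as "all words" is not spelled out on the slide, so using the script words as the denominator is an assumption, and the tokenization is deliberately simplified.

```python
# Sketch of the subtitle-script matching quality a = (# matched words) / (# all words).
from difflib import SequenceMatcher

def matching_quality(script_words, subtitle_words):
    m = SequenceMatcher(a=script_words, b=subtitle_words, autojunk=False)
    matched = sum(block.size for block in m.get_matching_blocks())
    return matched / max(len(script_words), 1)   # assumption: normalize by script length

script = "it was not my secret richard victor wanted it that way".split()
subs = "it was not my secret richard victor wanted it that way".split()
print(matching_quality(script, subs))            # 1.0 when every script word is matched
```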

How to improve?

Method – Annotation – Text Approach

9

Problem: text can express the same action in different ways.

Action: GetOutCar
Examples: "… Will gets out of the Chevrolet. …", "… Erin exits her new truck …"
Potential false positives: "… About to sit down, he freezes …"

Solution: a supervised text classification approach

• Given a scene description, predict whether a target action is present or not

• Based on a bag-of-words representation

Method – Annotation – Text Approach

10

Features (see the sketch below):

• Single words

• Adjacent pairs of words

• Non-adjacent pairs of words within a small window
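A minimal sketch of such a feature extractor follows; the window size of 5 and the way features are encoded are assumptions, not values taken from the paper.

```python
# Sketch of the bag-of-words feature extraction for the text classifier:
# single words, adjacent word pairs, and non-adjacent pairs within a small window.
from collections import Counter

def text_features(sentence, window=5):          # window size is an assumption
    words = sentence.lower().split()
    feats = Counter(words)                      # single words
    feats.update(zip(words, words[1:]))         # adjacent pairs
    for i, w in enumerate(words):               # non-adjacent pairs within the window
        for j in range(i + 2, min(i + window, len(words))):
            feats[(w, words[j], "gap")] += 1
    return feats

print(text_features("Will gets out of the Chevrolet"))
```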

Method – Annotation – Data

11

Data

• Built from 12 movies plus 20 different movies

• Automatically retrieved samples are kept when the subtitle-script matching quality a > 0.5 and the video length is at most 1000 frames

• Roughly 60% of the automatically retrieved samples end up correctly labelled

Goal

• Compare the performance obtained with manually annotated training data against the automatically annotated version

Method – Action Classifier [Overview]

12

Bag of space-time features + multi-channel SVM [4], [5], [6]

Pipeline: collection of space-time patches -> HOG & HOF patch descriptors -> histogram of visual words -> multi-channel SVM classifier

[3] Slide version of "Learning realistic human actions from movies." Source: http://www.di.ens.fr/~laptev/actions/

Method – Action Classifier - Features

13

Space-time corner detector

[7]

Dense scale sampling (no explicit scale selection)

Multi-scale detection

Method - Action Classifier - Descriptor

14

Multi-scale space-time patches from the corner detector are described with:

• Histogram of oriented spatial gradients (HOG): 3x3x2x4-bin descriptor

• Histogram of optical flow (HOF): 3x3x2x5-bin descriptor

(The resulting descriptor sizes are worked out below.)

Public code available at www.irisa.fr/vista/actions
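Assuming the 3x3x2 factor denotes the spatio-temporal cell grid around each interest point (with 4 or 5 orientation bins per cell), the implied descriptor sizes work out as follows:

```latex
% Descriptor dimensionality implied by the bin layout
% (assumption: 3x3 spatial cells x 2 temporal cells, 4 (HOG) or 5 (HOF) bins per cell)
\dim(\mathrm{HOG}) = 3 \times 3 \times 2 \times 4 = 72, \qquad
\dim(\mathrm{HOF}) = 3 \times 3 \times 2 \times 5 = 90, \qquad
\dim(\mathrm{HOG{+}HOF}) = 72 + 90 = 162.
```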

Method - Action Classifier - Descriptor

15

Visual vocabulary construction (bag-of-features; see the sketch below)

• Used a subset of 100,000 features sampled from the training videos

• Identified 4,000 clusters with k-means

• Centroids = visual vocabulary words
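A minimal sketch of this step with scikit-learn's k-means is below; the library choice, the random stand-in descriptors, and the reduced sizes (1,000 samples, 40 clusters instead of 100,000 and 4,000) are assumptions made so the snippet runs quickly.

```python
# Sketch of visual-vocabulary construction: cluster a random subset of
# space-time descriptors with k-means and keep the centroids as visual words.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((2000, 162))   # stand-in for HOG+HOF descriptors (72+90 dims)
subset = descriptors[rng.choice(len(descriptors), 1000, replace=False)]

kmeans = KMeans(n_clusters=40, n_init=10, random_state=0).fit(subset)
vocabulary = kmeans.cluster_centers_             # visual words
print(vocabulary.shape)                          # (40, 162)
```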

Method - Action Classifier - Descriptor

16

Bag-of-features (BoF) vector generation

• Compute all features of a video

• Assign each feature to the closest vocabulary word

• Compute the vector of visual word occurrences, e.g.:

vw1  vw2  ...  vw(n-1)  vwn
 17    8  ...        2   39
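A small sketch of the assignment-and-count step; the brute-force nearest-word search and the normalization choice are assumptions.

```python
# Sketch of BoF vector generation: assign every descriptor in a video (or grid
# cell) to its nearest visual word and count occurrences.
import numpy as np

def bof_histogram(descriptors, vocabulary, normalize=True):
    # squared Euclidean distance from each descriptor to each visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum() if normalize and hist.sum() > 0 else hist

rng = np.random.default_rng(1)
vocab = rng.standard_normal((40, 162))            # stand-in vocabulary
clip_descriptors = rng.standard_normal((300, 162))
h = bof_histogram(clip_descriptors, vocab)
print(h.shape, h.sum())                           # (40,) 1.0
```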

Method - Action Classifier - Descriptor

17

Global spatio-temporal grids

• In the spatial domain: 1x1 (standard BoF), 2x2, o2x2 (50% overlap), h3x1 (horizontal), v1x3 (vertical), 3x3

• In the temporal domain: t1 (standard BoF), t2, t3, and centre-focused ot2

[Figure: examples of spatio-temporal grids]

Method - Action Classifier - Descriptor

18

Global spatio-temporal grids: example of splitting a sequence into two over time, as in the t2 grid (see the sketch below)

• Entire action sequence -> one histogram over visual words vw1 … vwn, e.g. [17, 8, …, 2, 39]

• Action sequence split in two over time -> one histogram per half, e.g.
  1st half: [17, 8, …, 2, 39]
  2nd half: [10, 8, …, 35, 1]

• Histograms are normalized
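The per-cell histograms of a grid can be sketched as below; the (x, y, t) feature coordinates normalized to [0, 1) and the helper name grid_channel are assumptions, and only a t2-style 1x1x2 grid is shown.

```python
# Sketch of a spatio-temporal grid channel: split a clip into grid cells
# (here 1x1 in space x 2 in time), build one BoF histogram per cell, concatenate.
import numpy as np

def grid_channel(positions, assignments, vocab_size, nx=1, ny=1, nt=2):
    """positions: (N, 3) array of (x, y, t) in [0, 1); assignments: visual-word
    index per feature. Returns concatenated per-cell histograms."""
    cells = (np.floor(positions[:, 0] * nx).astype(int) * ny * nt
             + np.floor(positions[:, 1] * ny).astype(int) * nt
             + np.floor(positions[:, 2] * nt).astype(int))
    hists = []
    for c in range(nx * ny * nt):
        idx = assignments[cells == c]
        h = np.bincount(idx, minlength=vocab_size).astype(float)
        hists.append(h / h.sum() if h.sum() > 0 else h)
    return np.concatenate(hists)

rng = np.random.default_rng(2)
pos = rng.random((300, 3))                    # normalized (x, y, t) per feature
words = rng.integers(0, 40, 300)              # visual-word assignment per feature
print(grid_channel(pos, words, vocab_size=40).shape)   # (80,) = 2 cells x 40 words
```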

Method - Action Classifier - Descriptor

19

Global spatio-temporal grids

Method - Action Classifier - Learning

20

Non-linear SVM:

• Map the original feature space to a higher-dimensional space where the data becomes separable

Method - Action Classifier - Learning

21

Multi-channel chi-square kernel

• A channel c is a combination of a descriptor (HOG or HOF) and a spatio-temporal grid

• Dc(Hi, Hj) is the chi-square distance between the channel-c histograms Hi and Hj

• Ac is the mean value of the distances between all training samples for channel c

• The best set of channels C for a given training set is found with a greedy approach
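The kernel combining the channels, as given in the paper [2] and in [6], is K(Hi, Hj) = exp(-Σ_{c in C} (1/Ac) · Dc(Hi, Hj)). A minimal sketch of evaluating it is below; the greedy channel selection is omitted and the channel names are placeholders.

```python
# Sketch of the multi-channel chi-square kernel
#   K(H_i, H_j) = exp( - sum_c (1 / A_c) * D_c(H_i, H_j) )
# D_c: chi-square distance between the two histograms of channel c;
# A_c: mean of D_c over all training pairs (fixed values here for illustration).
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def multichannel_kernel(x, y, mean_dist):
    """x, y: dicts mapping channel name -> histogram; mean_dist: channel -> A_c."""
    s = sum(chi2_distance(x[c], y[c]) / mean_dist[c] for c in mean_dist)
    return np.exp(-s)

rng = np.random.default_rng(3)
hog_i, hog_j = rng.random(40), rng.random(40)
hof_i, hof_j = rng.random(40), rng.random(40)
A = {"hog_1x1_t1": 5.0, "hof_1x1_t1": 5.0}     # would be estimated on the training set
print(multichannel_kernel({"hog_1x1_t1": hog_i, "hof_1x1_t1": hof_i},
                          {"hog_1x1_t1": hog_j, "hof_1x1_t1": hof_j}, A))
```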

Evaluation - Action Classifier

22

Findings

• Combining different grids and channels improves performance

• HOG performs better for realistic actions (it captures context and image content)

Evaluation - Action Classifier

23

Number of occurrences for each channel component within the optimized channel combinations for the KTH action dataset and our manually labelled

movie dataset

Evaluation - Action Classifier

24

Sample frames from the KTH action sequences; all classes (columns) and scenarios (rows) are shown

Evaluation - Action Classifier

25

Average class accuracy on the KTH actions dataset

Confusion matrix for the KTH actions

Evaluation - Action Classifier

26

Noise robustness

• For a fraction p <= 0.2 of wrong labels, performance decreases insignificantly

• For p = 0.4, performance decreases by around 10%

Why does this matter? Automatic annotation avoids the cost of human annotation.

Evaluation - Action Classifier

27

Evaluation in real-world videos

[Figure: example results grouped into correct predictions, false positives (class not present, prediction says YES), and false negatives (class present, prediction says NO)]

Evaluation - Action Classifier

28

Evaluation in real-world videos: action classification example results based on automatically annotated data

Evaluation - Action Classifier

29

Evaluation based on average precision (AP) per action.

Clean = manually annotated training data

Chance = random classifier

Demo - Action Classifier

30

Test episodes from movies “The Graduate”, “It’s a wonderful life”,

“Indiana Jones and the Last Crusade”

Conclusion

31

Summary

• Automatic generation of realistic action samples

• New action dataset available at www.irisa.fr/vista/actions

• Bag-of-features extended to the video domain

• Best performance on the KTH benchmark

• Promising results for actions in the "wild"

Disadvantages

• Automatic annotation still needs improvement: only about 60% of the automatic labels are correct.

• The parameters of the cuboid grid are not well justified: how were they determined? The same holds for the number of visual words used in k-means.

• K-means is susceptible to outliers.

• The greedy approach for selecting the best set of channels can yield sub-optimal results.

Future directions

• Automatic action class discovery

• Internet-scale video search

Questions

32

References

33

[1]. Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! My name is... Buffy -- automatic naming of characters in TV video.

[2]. Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1-8.

[3]. Slide version of "Learning realistic human actions from movies." Source: http://www.di.ens.fr/~laptev/actions/

[4]. Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: a local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Vol. 3, pp. 32-36.

[5]. Niebles, J. C., Wang, H., & Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3), 299-318.

[6]. Zhang, J., Marszałek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: a comprehensive study. International Journal of Computer Vision, 73(2), 213-238.

[7]. Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2-3), 107-123.

Action Recognition Using a Distributed Representation of Pose and Appearance

Subhransu Maji (1), Lubomir Bourdev (1, 2), and Jitendra Malik (1)
(1) University of California at Berkeley
(2) Adobe Systems, Inc.

Presented by: Nils Murrugarra
University of Pittsburgh

Goal

35

[Figure from the poster version [3]: http://people.cs.umass.edu/~smaji/presentations/action-cvpr11-poster.pdf]

Motivation

36

• Humans can easily recognize pose and actions from limited views in a single image

• Action and pose are conveyed by body parts, possibly occluded, appearing at different locations and scales

Poselets

37

Poselets:

• Body-part detectors trained from annotated joint locations of people in images.

• They are used to find image patches corresponding to a given configuration of joints.

[3]-poster http://people.cs.umass.edu/~smaji/presentations/action-cvpr11-poster.pdf

Poselets - People

38

L. Bourdev, S. Maji, T. Brox and J. Malik, Detecting People using Mutually Consistent Poselet Activations, ECCV 2010

Robust Representation of Pose and Appearance

Poselet Activation Vector

39

• Poselet annotations are reused from a previous article [2].

• Each example is represented by the poselets that are active (the poselet activation vector).

Estimate 3D Orientation of Head and Torso

Data Collection

40

Annotations of the 3D pose of the head and torso, collected via Amazon Mechanical Turk

Manual verification

• Discard images with high annotator disagreement

• Discard low-resolution and highly occluded images

• Only rotation about the Y axis is used

Human error

• Small error for canonical views (front, back, left and right)

• Measured as the average standard deviation

3D Estimation - Goal

41

Goal

• Given a bounding box around a person, estimate the 3D orientation of the head and torso.

3D Estimation - Descriptor

42

Procedure

• Input descriptor: the poselet activation vector, where each entry corresponds to a poselet type, e.g.

  pt1  pt2  pt3  ...  ptn
  0.7  0.8  ...  0.2  0.9

• Discretize the 3D orientation range [-180, 180] into 8 bins and treat the task as classification

• Estimate the angle by interpolation between the highest-predicted bin and its two adjacent neighbors (see the sketch below)
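The bin-plus-neighbors interpolation can be sketched as below; the slides do not spell out the exact interpolation, so the weighted circular mean over the three bin centres is an assumption.

```python
# Sketch of the yaw-angle estimate: the classifier scores 8 orientation bins
# covering [-180, 180) degrees, and the final angle interpolates between the
# highest-scoring bin and its two neighbours, weighted by their scores.
import numpy as np

def interpolate_angle(bin_scores):
    n = len(bin_scores)                          # 8 bins of 45 degrees each
    centers = -180.0 + (np.arange(n) + 0.5) * (360.0 / n)
    best = int(np.argmax(bin_scores))
    idx = [(best - 1) % n, best, (best + 1) % n]
    w = np.asarray([bin_scores[i] for i in idx], dtype=float)
    ang = np.deg2rad(centers[idx])
    # circular weighted mean of the three bin centres (assumption)
    return float(np.degrees(np.arctan2((w * np.sin(ang)).sum(), (w * np.cos(ang)).sum())))

scores = np.array([0.0, 0.1, 0.7, 0.9, 0.3, 0.0, 0.0, 0.0])   # e.g. per-bin classifier scores
print(interpolate_angle(scores))   # ~ -32.5: pulled from the best bin centre toward the stronger neighbour
```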

3D Estimation – Example Results

43

3D Estimation – Evaluation

44

Head orientation: 62.1 %; torso orientation: 61.71 %

2D Action Classifier - Goal

45

Goal

• Given a bounding box, estimate an action category

2D Action Classifier - Method

46

Joint locations are annotated, but pose alone is not enough to identify actions

2D Action Classifier - Method

47

Appearance information would help

Solution

• Learn an appearance model per poselet and per action category

• Based on HOG and an SVM

2D Action Classifier - Method

48

Approach

• Find the poselet's k nearest neighbors

• Select the most discriminative windows

• Learn an appearance model based on HOG and an SVM (see the sketch below)

[Figure: example windows]
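A minimal sketch of training one such appearance model with HOG features and a linear SVM is below; the scikit-image/scikit-learn calls, the patch size, and the random stand-in patches are assumptions (in practice the patches come from the poselet's nearest neighbours).

```python
# Sketch of a per-poselet appearance model: compute HOG for each candidate patch
# and train a linear SVM separating patches of the target action from the rest.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
patches = rng.random((60, 64, 64))                 # stand-in grayscale 64x64 patches
labels = np.array([1] * 30 + [0] * 30)             # 1 = target action, 0 = other

X = np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2)) for p in patches])
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, labels)
print(clf.decision_function(X[:3]))                # appearance scores for three patches
```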

2D Action Classifier - Method

49

Can object interaction help?

• Interactions with a horse, motorbike, bicycle and TV were considered.

• A spatial model of person-object location was learnt (the object activation vector).

2D Action Classifier - Method

50

Can context still help?

• Add the action classifier outputs for the other people in the image as context.

2D Action Classifier - Evaluation

51

Conclusion

52

Summary

• A method for action recognition in static images was presented.

• It is based mainly on:
  - Poselet features
  - An appearance model
  - Object interaction
  - Context information

Disadvantages

• The use of given bounding boxes is not realistic; a better scenario is that, given an image, the algorithm detects all people and their actions automatically.

• For the poselet activation vector, an intersection threshold of 0.15 is defined; how was this threshold determined? A similar concern applies to the spatial model of object interaction.

Questions

53

References

54

[1]. Maji, S., Bourdev, L., & Malik, J. (2011). Action recognition from a distributed representation of pose and appearance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pp. 3177-3184.

[2]. Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In Computer Vision - ECCV 2010, pp. 168-181. Springer.

[3]. Poster version of "Action recognition from a distributed representation of pose and appearance." Source: http://people.cs.umass.edu/~smaji/presentations/action-cvpr11-poster.pdf